Note, this answer is highly speculative.
Unlike some of my other answers to questions of the type “Why is X slower than Y”, I’ve been unable to provide solid evidence to back up this answer.
After tinkering with this for about an hour now, I think it’s due to the address alignment of three things:
- The address of `obj`
- The address of the Virtual Method Table of `A`
- The address of the function `f()`
(owagh’s answer also hints at the possibility of instruction alignment.)
The reason why multiple inheritance is faster than single inheritance here is not because it is “magically” fast, but because the single inheritance case is running into either a compiler or a hardware “hiccup”.
If you dump out the assembly for the single and multiple inheritance cases, they are identical (register names and everything) within the nested loop.
Here’s the code I compiled:
```cpp
#include <iostream>
#include <stdlib.h>
#include <time.h>
using namespace std;

unsigned long a = 0;

#ifdef SINGLE
class A {
public:
    virtual int f() { return a; }
};
class B : public A {
public:
    virtual int f() { return a; }
    void g() { return; }
};
#endif

#ifdef MULTIPLE
class A {
public:
    virtual int f() { return a; }
};
class dummy {
public:
    virtual void g() { return; }
};
class B : public A, public dummy {
public:
    virtual int f() { return a; }
    virtual void g() { return; }
};
#endif

int main() {
    cin >> a;

    A* obj;
    if (a > 3)
        obj = new B();
    else
        obj = new A();

    unsigned long result = 0;

    clock_t time0 = clock();
    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }
    clock_t time1 = clock();

    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
    cout << result << "\n";

    system("pause"); // This is useless in Linux, but I left it here for a reason.
}
```
The assembly for the nested loop is identical in both single and multiple inheritance cases:
```asm
.L5:
        call    clock
        movl    $65535, %r13d
        movq    %rax, %r14
        xorl    %r12d, %r12d
        .p2align 4,,10
        .p2align 3
.L6:
        movl    $65535, %ebx
        .p2align 4,,10
        .p2align 3
.L7:
        movq    0(%rbp), %rax
        movq    %rbp, %rdi
        call    *(%rax)
        cltq
        addq    %rax, %r12
        subl    $1, %ebx
        jne     .L7
        subl    $1, %r13d
        jne     .L6
        call    clock
```
Yet the performance difference I see is:
- Single: 9.4 seconds
- Multiple: 8.06 seconds
Xeon X5482, Ubuntu, GCC 4.6.1 x64.
This leads me to the conclusion that the difference must be data dependent.
If you look at that assembly, you’ll notice that the only instructions that could have variable latency are the loads:
```asm
; %rbp = obj
movq    0(%rbp), %rax   ; Load the vtable pointer from obj
movq    %rbp, %rdi
call    *(%rax)         ; Call function pointer - f()
```
followed by a few more memory accesses inside the call to f().
It just happens that in the single inheritance example, the offsets of the aforementioned values are not favorable to the processor. I have no idea why. But if I had to suspect something, I’d say it’s cache-bank conflicts, in a similar manner to region 2 in the diagram of this question.
By rearranging the code and adding dummy functions, I can change these offsets; in a lot of cases this eliminates the slowdown and makes the single inheritance case as fast as the multiple inheritance case.
For example, removing the system("pause") inverts the times:
```cpp
#ifdef SINGLE
class A {
public:
    virtual int f() { return a; }
};
class B : public A {
public:
    virtual int f() { return a; }
    void g() { return; }
};
#endif

#ifdef MULTIPLE
class A {
public:
    virtual int f() { return a; }
};
class dummy {
public:
    virtual void g() { return; }
};
class B : public A, public dummy {
public:
    virtual int f() { return a; }
    virtual void g() { return; }
};
#endif

int main() {
    cin >> a;

    A* obj;
    if (a > 3)
        obj = new B();
    else
        obj = new A();

    unsigned long result = 0;

    clock_t time0 = clock();
    for (int i = 0; i < 65535; i++) {
        for (int j = 0; j < 65535; j++) {
            result += obj->f();
        }
    }
    clock_t time1 = clock();

    cout << (double)(time1 - time0) / CLOCKS_PER_SEC << endl;
    cout << result << "\n";

    // system("pause");
}
```
- Single: 8.06 seconds
- Multiple: 9.4 seconds