EDIT: Problem solved. I owe the community a huge apology and a big thank-you for your hints. Sorry to user anonymous, who seems to be involved in kernel development. What happened? We spent another 2 days debugging and fiddling around with the program code. No implementation problems were found. BUT: the main code calls another helper program, which calculates weights for the ART algorithm on demand. After debugging and testing, it turned out that this helper program misbehaved whenever at least 4 processes were running. So this was NOT a kernel / hardware problem, but a software (memory access) problem.
Lessons learned:
- Debug every tool that is involved in the calculation process.
- Microcode was outdated. SuperMicro is informed about this.
- Ubuntu 15.04 possibly needs additional tools so that all cores of the CPU run at full speed. I got there by installing Ubuntu 14.04 instead: all cores now run at 2.5 GHz.
- I owe you a beer if we ever meet up at a conference.
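To illustrate the first lesson, here is a minimal sketch of how a helper tool could be checked on its own under concurrency: start several copies at once and compare a hash of each copy's output. This is not the code we actually used; the command in the example is a harmless stand-in, and `run_once`, `run_concurrently` and `outputs_agree` are hypothetical names. It also assumes the helper writes its result to stdout.

```python
import hashlib
import subprocess
import sys
from concurrent.futures import ThreadPoolExecutor


def run_once(cmd):
    """Run cmd once and return the SHA-256 hex digest of its stdout."""
    result = subprocess.run(cmd, capture_output=True, check=True)
    return hashlib.sha256(result.stdout).hexdigest()


def run_concurrently(cmd, n_procs):
    """Start n_procs copies of cmd at the same time; return one digest per copy."""
    # Each worker thread blocks in subprocess.run, so the child
    # processes themselves run in parallel.
    with ThreadPoolExecutor(max_workers=n_procs) as pool:
        return list(pool.map(run_once, [cmd] * n_procs))


def outputs_agree(digests):
    """True if every copy produced byte-identical output."""
    return len(set(digests)) == 1


if __name__ == "__main__":
    # Stand-in command; replace it with the real helper invocation.
    cmd = [sys.executable, "-c", "print(sum(range(1000)))"]
    for n in (1, 4, 9):
        print(n, "processes, outputs agree:", outputs_agree(run_concurrently(cmd, n)))
```

If the outputs agree for 1 process but not for 4 or more, the helper itself is the first suspect, before the kernel or the hardware.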
So after three days of thinking, testing and fiddling around with the machine, I made the following observations today:
- Ubuntu 15.04 runs the CPU at 420 – 650 MHz per core. I assumed this was an energy-saving option, so I followed various guides to set the speed to the maximum (2.50 GHz). It didn’t work. Checked with `cpufreq-utils`.
- Results still remained wrong after several tests on this machine. Other (i5, i7, Xeon) machines produced correct results.
- I read that other users experienced issues with Ubuntu 15.04 and the CPU frequency, so I decided to plug in an SSD and install Ubuntu 14.04. I checked the CPU frequency again, and it showed 2.50 GHz as I expected.
- I started the reconstruction algorithm again (it was now about 4-5 times faster than on Ubuntu 15.04) and waited for the results. The results are correct now! I double-checked, started 9 processes and compared results. Still correct.
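For anyone who wants to repeat the frequency check without extra tools, below is a rough sketch that reads the per-core clock from `/proc/cpuinfo`. It assumes an x86 Linux system, where that file contains one `cpu MHz` line per core; on other architectures the field may be missing. The function names are hypothetical, and the 1200 MHz threshold is simply the minimum speed I expected for this CPU.

```python
import os
import re


def parse_cpu_mhz(cpuinfo_text):
    """Return the 'cpu MHz' value of every core listed in /proc/cpuinfo text."""
    matches = re.findall(r"^cpu MHz\s*:\s*([0-9.]+)", cpuinfo_text, flags=re.MULTILINE)
    return [float(m) for m in matches]


def cores_below(mhz_per_core, min_mhz):
    """Return (core index, MHz) pairs for cores clocked below min_mhz."""
    return [(i, mhz) for i, mhz in enumerate(mhz_per_core) if mhz < min_mhz]


if __name__ == "__main__":
    path = "/proc/cpuinfo"
    if os.path.exists(path):
        with open(path) as f:
            speeds = parse_cpu_mhz(f.read())
        print("MHz per core:", speeds)
        # 1200 MHz is an assumption taken from the expected minimum of this CPU.
        print("below 1200 MHz:", cores_below(speeds, 1200.0))
```

Note that an idle core may legitimately report a low value, so it makes sense to run this while the reconstruction is under load.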
So I can only assume that there might be a problem with Ubuntu 15.04 / the kernel using SpeedStep on this CPU. Under 15.04 the CPU ran between 420 – 650 MHz the whole time, while the expected minimum CPU speed is 1.20 GHz and the maximum is 3.30 GHz. If somebody wants to check, I can offer the source code and example data leading to this problem.
Sorry for suspecting this to be a CPU bug.
EDIT: After some more testing, the problem is only solved for some scenarios but not yet for all. I’ll do more testing.