Since we updated 3D Slicer to use ITK5, we got multiple error reports on Windows and Linux that the application crashes at startup in GDCM due to illegal instruction.
Windows:
Displayed to user in popup The application was unable to start correctly (0xc0000142). Click OK to close the application.
With debugger: exception 0xc000001d (Illegal instruction) in ITKIOGDCM-5.1.dll
Linux:
Displayed to the user: error: [/home/user/programs/Slicer-4.11.0-2019-08-22-linuxamd64/bin/SlicerApp-real] exit abnormally - Report the problem.
With debugger: [21639.829126] traps: SlicerApp-real[15453] trap invalid opcode ip:7f8aa07fa1c8 sp:7ffce616fca8 error:0 in libitkgdcmcharls-5.0.so.1[7f8aa07e4000+84000]
It seems that operations in GDCM (probably related to bitfield handling) generate shlx instructions, which are part of BMI2 instruction set. These instructions are only supported in 4+ generation Intel CPUs. These are typically 6-8-year-old computers, so there are a good number of them still out there. See more details and discussion here.
Probably there is not much to do about this at ITK level, as build options are usually passed down from the application. I just posted this information here so that ITK users become aware of this issue and maybe save some debugging time.
I had a look at ITK history and maybe there is something to do about this problem in ITK. There might have been some incorrect assumptions when ITKSetStandardCompilerFlags.cmake was modified recently.
See for example this discussion. Maybe capabilities of the build computer are used to decide what CPU instruction set is allowed? That would not be ideal, because build computers are often newer/more powerful than computers of regular users.
Thank you, it is interesting and a little bit worrying.
Could not reproduce with Toshiba Tecra (Intel® Core™ i7-3520M CPU @ 2.90GHz), this is Ivy Bridge and quite usable notebook (8GB RAM, Radeon HD 7xxx).
Both, actual Slicer nightly and Slicer-4.11.0-2019-08-22-linux-amd64 are working.
The only binary (in lib/Slicer-4.11 folder) with shlx instruction is libcrypto.so.1.0.0
If you set CMake variables ITK_C_OPTIMIZATION_FLAGS and ITK_CXX_OPTIMIZATION_FLAGS to empty stings you can globally control the architecture requirements for you distribution by just adding the flags to CXXFLAGS, and CFLAGS like normal.
Thanks a lot for testing this. shlx is added by MSVC compiler. The discussion linked above has more information about linux environment and error.
This sounds promising, I’ll try this tomorrow.
Has anyone done profiling to see what is the performance impact of enabling/disabling advanced instruction sets? For 10-20% speed improvement I would not sacrifice compatibility, but if we get more than 50-100% improvement then it may worth the trouble.
Thanks a lot for testing this. shlx is added by MSVC compiler.
Yes! I could reproduce on the same Ivy Bridge notebook and Windows 7. Latest nightly failed to start
Instructions in libcrypto.so on Linux are probably not a problem. I checked, the system’s libcrypto.so (Debian 10.1) also has them, probably there is some way to switch or whatever.
I am glad i still build my public releases for Windows with ITK-4.13 and VS 2013. Not yet sure which flag is responsible for BMI2 instructions, but AVX2 is enough to make big problems, ITK5 build system on Windows forces to use AVX2 and runs test program on build host, but the test is useless for public releases. AVX2 was available on AMD hardware from Q2 2015. I definitely don’t want this in public release, too early, sorry. I don’t have very many users, but even if the number is small i don’t want upset some of them, who bought their hardware 3 years ago. Sorry.
Edit: were interesting to see where in ITK the appropriate intrinsic functions are used. Just enabling SSE2/3/AVX or whatever SIMD may do nothing, performance gain may be zero, i would not wonder. I shall check…
Edit:
Current hardware data from Steam (August 2019)
I’ve run ITK tests with and without allowing new instruction sets and the results were not convincing. Out of 126 test groups, there is no significant difference in 124 groups (few percent execution time difference - sometimes faster, sometimes slower). The two exceptions are: ITKRegistrationMethodsv4 (15.9% faster) and RUNS_LONG (11.8% faster).
Detailed test results
Here are the execution times of results for tests longer than 10sec (baseline -> with advanced instruction set):
For 3D Slicer, we will probably revert to use the compiler’s defaults optimization settings, as the modest registration speed improvement does not justify excluding users that have computers that are 5-10 years old (and implementing dynamic dispatching or creating multiple releases would be too much extra effort).
In general, ITK should not silently override the compiler’s instruction set defaults. Those defaults are carefully chosen and used universally in most libraries. Even performance-focused libraries provide full compatibility by default (e.g., by using fat binaries and dynamic dispatching). ITK should just make it easy to switch to more recent instruction sets and maybe display a CMake message at configuration time that would tell what CMake option to change to get better performance.
That is great you did the performance comparison. I am not aware of other performance analysis done on these flags. There is the ITKPerformanceBenchmark module whose goal is to provide real world testing cases for performance comparisons. The regular ITK tests mostly run on small images, that would not be going for performance comparison cases, as they should be designed to test corner cases and code coverage of options. The 15% performance for real registration problems looks quite good to me.
But I do agree with your expectations about the default compiler flags.
does not justify excluding users that have computers that are 5-10 years old
BTW, according to Wikipedia AVX may be not available on computers bought yesterday too, just FYI
Not all CPUs from the listed families support AVX. Generally, CPUs with the commercial denomination "Core i3/i5/i7" support them, whereas "Pentium" and "Celeron" CPUs don't.
I guess it is reason only < 90% computers from Steam statistics have them.
@dzenanz, may be it were possible to make cmake(-gui) options and set AVX/AVX2 and something what enabled BMI2 to false by default, at least until everything is clarified?
Thanks for checking. This confirms that ITK is the only outlier out of all Slicer’s third-party libraries.
Qt WebEngine uses Chromium, which is compiled as a fat binary (containing multiple versions of performance-critical functions, optimized for different CPUs). Probably the same technique is used in Qt Multimedia component.
May be Cmake options were not bad? BTW, Eigen heavily uses advanced instructions, there is also support for AVX512. That performance gain in some test may be related to Eigen. If there were options for AVX/AVX2/AVX512, etc. it were also better for performance enthusiasts - were possible to try AVX512, for example. And not only for Windows, please.
Having the option of using optimized instruction set is great. Silently overriding low-level compiler defaults that reduces hardware compatibility is not nice.
Now that we know how to force compiler-default instruction sets, we’ll just do that in Slicer (by default ITK would use /arch:AVX /arch:AVX2, which is clearly not a good choice). Whenever compiler developers feel that it is time to update their defaults, we’ll automatically get upgraded to new instruction sets.
Proper solution would be to use dynamic dispatch for performance-critical parts in ITK. Libraries, such as OpenCV and Intel MKL, already do this transparently (users of these libraries don’t need to be aware of this at all).