ITK5 crashes on older CPUs

Since we updated 3D Slicer to use ITK5, we got multiple error reports on Windows and Linux that the application crashes at startup in GDCM due to illegal instruction.

Windows:

  • Displayed to user in popup The application was unable to start correctly (0xc0000142). Click OK to close the application.
  • With debugger: exception 0xc000001d (Illegal instruction) in ITKIOGDCM-5.1.dll

Linux:

  • Displayed to the user: error: [/home/user/programs/Slicer-4.11.0-2019-08-22-linuxamd64/bin/SlicerApp-real] exit abnormally - Report the problem.
  • With debugger: [21639.829126] traps: SlicerApp-real[15453] trap invalid opcode ip:7f8aa07fa1c8 sp:7ffce616fca8 error:0 in libitkgdcmcharls-5.0.so.1[7f8aa07e4000+84000]

It seems that operations in GDCM (probably related to bitfield handling) generate shlx instructions, which are part of BMI2 instruction set. These instructions are only supported in 4+ generation Intel CPUs. These are typically 6-8-year-old computers, so there are a good number of them still out there. See more details and discussion here.

Probably there is not much to do about this at ITK level, as build options are usually passed down from the application. I just posted this information here so that ITK users become aware of this issue and maybe save some debugging time.

2 Likes

I had a look at ITK history and maybe there is something to do about this problem in ITK. There might have been some incorrect assumptions when ITKSetStandardCompilerFlags.cmake was modified recently.

See for example this discussion. Maybe capabilities of the build computer are used to decide what CPU instruction set is allowed? That would not be ideal, because build computers are often newer/more powerful than computers of regular users.

2 Likes

Thank you, it is interesting and a little bit worrying.

Could not reproduce with Toshiba Tecra (Intel® Core™ i7-3520M CPU @ 2.90GHz), this is Ivy Bridge and quite usable notebook (8GB RAM, Radeon HD 7xxx).
Both, actual Slicer nightly and Slicer-4.11.0-2019-08-22-linux-amd64 are working.

The only binary (in lib/Slicer-4.11 folder) with shlx instruction is libcrypto.so.1.0.0

$ cd Slicer-4.11.0-2019-08-22-linux-amd64/lib/Slicer-4.11/
$ objdump -d libitkgdcmcharls-5.0.so.1 | grep -i shlx
$ objdump -d libcrypto.so.1.0.0 | grep -i shlx
   f692e:	c4 c2 89 f7 e8       	shlx   %r14,%r8,%rbp
   f69ac:	c4 c2 89 f7 c9       	shlx   %r14,%r9,%rcx
   f6a39:	c4 c2 89 f7 ca       	shlx   %r14,%r10,%rcx
   f6ac6:	c4 c2 89 f7 cb       	shlx   %r14,%r11,%rcx
   f6c4d:	c4 c2 c9 f7 c8       	shlx   %rsi,%r8,%rcx
   f6c6e:	c4 c2 c9 f7 c9       	shlx   %rsi,%r9,%rcx
   f6c8a:	c4 c2 c9 f7 ca       	shlx   %rsi,%r10,%rcx
   f6ca6:	c4 c2 c9 f7 cb       	shlx   %rsi,%r11,%rcx

Edit: may be objdump -D is better, but same results

1 Like

If you set CMake variables ITK_C_OPTIMIZATION_FLAGS and ITK_CXX_OPTIMIZATION_FLAGS to empty stings you can globally control the architecture requirements for you distribution by just adding the flags to CXXFLAGS, and CFLAGS like normal.

3 Likes

Thanks a lot for testing this. shlx is added by MSVC compiler. The discussion linked above has more information about linux environment and error.

This sounds promising, I’ll try this tomorrow.

Has anyone done profiling to see what is the performance impact of enabling/disabling advanced instruction sets? For 10-20% speed improvement I would not sacrifice compatibility, but if we get more than 50-100% improvement then it may worth the trouble.

2 Likes

Thanks a lot for testing this. shlx is added by MSVC compiler.

Yes! I could reproduce on the same Ivy Bridge notebook and Windows 7. Latest nightly failed to start

error1

error2

Instructions in libcrypto.so on Linux are probably not a problem. I checked, the system’s libcrypto.so (Debian 10.1) also has them, probably there is some way to switch or whatever.

Edit:

$ objdump -D ITKIOGDCM-5.1.dll | grep -i shlx
   180016e36:	c4 e2 f9 f7 c1       	shlx   %rax,%rcx,%rax
   180016e53:	c4 e2 f9 f7 c1       	shlx   %rax,%rcx,%rax
   180016fa6:	c4 e2 f9 f7 c1       	shlx   %rax,%rcx,%rax
   18005460e:	c4 e2 69 f7 cf       	shlx   %edx,%edi,%ecx
   18005461d:	c4 e2 69 f7 cf       	shlx   %edx,%edi,%ecx
   18005b5a9:	c4 e2 49 f7 c0       	shlx   %esi,%eax,%eax
   18006cb37:	c4 62 f1 f7 c8       	shlx   %rcx,%rax,%r9
...

I am glad i still build my public releases for Windows with ITK-4.13 and VS 2013. Not yet sure which flag is responsible for BMI2 instructions, but AVX2 is enough to make big problems, ITK5 build system on Windows forces to use AVX2 and runs test program on build host, but the test is useless for public releases. AVX2 was available on AMD hardware from Q2 2015. I definitely don’t want this in public release, too early, sorry. I don’t have very many users, but even if the number is small i don’t want upset some of them, who bought their hardware 3 years ago. Sorry.

users

Edit: were interesting to see where in ITK the appropriate intrinsic functions are used. Just enabling SSE2/3/AVX or whatever SIMD may do nothing, performance gain may be zero, i would not wonder. I shall check…

Edit:
Current hardware data from Steam (August 2019)

SSE2          100.00%
SSE3          100.00%
SSSE3         97.70%
SSE4.1        96.54%
SSE4.2        95.49%
AVX           89.29%
1 Like

I’ve run ITK tests with and without allowing new instruction sets and the results were not convincing. Out of 126 test groups, there is no significant difference in 124 groups (few percent execution time difference - sometimes faster, sometimes slower). The two exceptions are: ITKRegistrationMethodsv4 (15.9% faster) and RUNS_LONG (11.8% faster).

Detailed test results

Here are the execution times of results for tests longer than 10sec (baseline -> with advanced instruction set):

  • ITKCommon: 35.39->35.53sec
  • ITKFastMarching: 26.87->27.43sec
  • ITKIOImageBase: 14.19->12.42sec
  • ITKIOMINC: 10.65->10.26sec
  • ITKImageIntensity: 11.77->11.04sec
  • ITKLabelMap: 16.07->16.14sec
  • ITKLevelSetsv4: 14.14->13.67sec
  • ITKMathematicalMorphology: 14.10->13.63sec
  • ITKMetricsv4: 20.72->20.22sec
  • ITKQuadEdgeMeshFiltering: 21.10->21.10sec
  • ITKRegistrationCommon: 19.80sec->19.50sec
  • ITKRegistrationMethodsv4: 67.45->56.76sec (-15.9%)
  • ITKReview: 56.73->57.72sec
  • ITKStatistics: 15.75->15.18sec
  • ITKTransform: 18.45->18.06sec
  • RUNS_LONG: 83.23->73.43sec (-11.8%)

For 3D Slicer, we will probably revert to use the compiler’s defaults optimization settings, as the modest registration speed improvement does not justify excluding users that have computers that are 5-10 years old (and implementing dynamic dispatching or creating multiple releases would be too much extra effort).

In general, ITK should not silently override the compiler’s instruction set defaults. Those defaults are carefully chosen and used universally in most libraries. Even performance-focused libraries provide full compatibility by default (e.g., by using fat binaries and dynamic dispatching). ITK should just make it easy to switch to more recent instruction sets and maybe display a CMake message at configuration time that would tell what CMake option to change to get better performance.

3 Likes

That is great you did the performance comparison. I am not aware of other performance analysis done on these flags. There is the ITKPerformanceBenchmark module whose goal is to provide real world testing cases for performance comparisons. The regular ITK tests mostly run on small images, that would not be going for performance comparison cases, as they should be designed to test corner cases and code coverage of options. The 15% performance for real registration problems looks quite good to me.

But I do agree with your expectations about the default compiler flags.

1 Like

@hjmjohnson Maybe we can have AVX flags only if ITK_LEGACY_REMOVE is ON?

I think added CMake variable interdependencies will complicate things. IMHO it is a stretch for ITK_LEGACY_REMOVE to effect the architecture flags.

Packagers expect the configured C,CXX flags are respected with regards to the architectures specified.

1 Like

I agree with @blowekamp here.

does not justify excluding users that have computers that are 5-10 years old

BTW, according to Wikipedia AVX may be not available on computers bought yesterday too, just FYI

Not all CPUs from the listed families support AVX. Generally, CPUs with the commercial denomination "Core i3/i5/i7" support them, whereas "Pentium" and "Celeron" CPUs don't.

I guess it is reason only < 90% computers from Steam statistics have them.

@dzenanz, may be it were possible to make cmake(-gui) options and set AVX/AVX2 and something what enabled BMI2 to false by default, at least until everything is clarified?

Edit: FYI, very good tool analyze-x86

Edit: analyze_x86 clone with BMI/BMI2/TBM support (experimental)

Edit:

FYI @lassoan

AVX2 in Slicer’s bin (DLLs)

ITKCommon-5.1.dll avx2 3
ITKIOGDCM-5.1.dll avx2 42
ITKIOHDF5-5.1.dll avx2 180
ITKIOJPEG2000-5.1.dll avx2 6
ITKIOJPEG-5.1.dll avx2 7
ITKIOLSM-5.1.dll avx2 7
ITKIOMINC-5.1.dll avx2 182
ITKIONIFTI-5.1.dll avx2 4
ITKIONRRD-5.1.dll avx2 7
ITKIOTIFF-5.1.dll avx2 7
ITKIOTransformBase-5.1.dll avx2 4
ITKIOTransformHDF5-5.1.dll avx2 180
ITKOptimizers-5.1.dll avx2 11
ITKSpatialObjects-5.1.dll avx2 2
Qt5Gui.dll avx2 49
Qt5Multimedia.dll avx2 2
Qt5WebEngineCore.dll avx2 593
SimpleITKCommon-1.3.dll avx2 2
SimpleITK_ITKLabelMap-1.3.dll avx2 5

(Qt highly likely not a problem, uses when available, i guess)

BMI/BMI2 in Slicer’s bin (DLLs)

ITKCommon-5.1.dll bmi 1
ITKCommon-5.1.dll bmi2 37
ITKIOBMP-5.1.dll bmi2 2
ITKIOGDCM-5.1.dll bmi 49
ITKIOGDCM-5.1.dll bmi2 1004
ITKIOHDF5-5.1.dll bmi2 369
ITKIOJPEG2000-5.1.dll bmi2 198
ITKIOJPEG-5.1.dll bmi2 72
ITKIOLSM-5.1.dll bmi2 194
ITKIOMeshGifti-5.1.dll bmi2 9
ITKIOMINC-5.1.dll bmi2 370
ITKIOPNG-5.1.dll bmi 3
ITKIOPNG-5.1.dll bmi2 95
ITKIOTIFF-5.1.dll bmi2 198
ITKIOTransformBase-5.1.dll bmi2 4
ITKIOTransformHDF5-5.1.dll bmi2 369
ITKIOXML-5.1.dll bmi2 9
ITKStatistics-5.1.dll bmi2 3
SimpleITKCommon-1.3.dll bmi2 4
SimpleITK_ITKCommon-1.3.dll bmi 28
SimpleITK_ITKCommon-1.3.dll bmi2 1

Not much, may be it were worth to try AVX, but not AVX2, as trade-off. Not sure what to do with Pentiums/Celerons, can not test.

Whole output (2151 lines).

I guess only tests can provide real results, everything is too complicated

Edit:
Qt5 is in fact OK, system’s Qt5Gui on Debian 10.1 also has avx2 without problems on Ivy Bridge

1 Like

Thanks for checking. This confirms that ITK is the only outlier out of all Slicer’s third-party libraries.

Qt WebEngine uses Chromium, which is compiled as a fat binary (containing multiple versions of performance-critical functions, optimized for different CPUs). Probably the same technique is used in Qt Multimedia component.

2 Likes

May be Cmake options were not bad? BTW, Eigen heavily uses advanced instructions, there is also support for AVX512. That performance gain in some test may be related to Eigen. If there were options for AVX/AVX2/AVX512, etc. it were also better for performance enthusiasts - were possible to try AVX512, for example. And not only for Windows, please.

Having the option of using optimized instruction set is great. Silently overriding low-level compiler defaults that reduces hardware compatibility is not nice.

Now that we know how to force compiler-default instruction sets, we’ll just do that in Slicer (by default ITK would use /arch:AVX /arch:AVX2, which is clearly not a good choice). Whenever compiler developers feel that it is time to update their defaults, we’ll automatically get upgraded to new instruction sets.

Proper solution would be to use dynamic dispatch for performance-critical parts in ITK. Libraries, such as OpenCV and Intel MKL, already do this transparently (users of these libraries don’t need to be aware of this at all).

1 Like