FFTW, MKL, CUFFT and USE_GPU

I notice there are quite a few “accelerator”-type options for ITK builds, but the documentation on what they do and what they affect is sparse to non-existent.

Can anyone point me at some docs, or enlighten me as to how much benefit I can actually expect if I go about flipping these switches on?

FFTW, MKL, and cuFFT are all helpful if your processing is FFT-based. Here are some benchmarks @dzenanz did with ITKMontage:

Great. I’m a fairly high-level user of the ITK functions. How do I dig down to determine whether what I’m using is FFT-based? :slight_smile:

The module that a class is located in will list ITKFFT as a dependency, and this is displayed in the Doxygen documentation. For example:

https://itk.org/Doxygen/html/group__Montage.html

Thanks @matt.mccormick.

I see there are also the GPU filters, but it looks like those aren’t drop-in, in the sense that you need to change your code to use them. Are you aware of any benchmarks there?

The GPU infrastructure still needs more work on modernization, benchmarking, and support for drop-in replacements.

Hi Matt,

I have been trying to get MKL with TBB working in ITK (v5.2rc03) with no luck so far.

  • FFTW gives me good performance (with the ITK_USE_FFTW* options turned on): 45+% occupancy on all 32 cores.
  • I was unable to get any performance boost using cuFFT (turned on using ITK_USE_CUFFTW).
  • MKL seems to run sequentially no matter what I do, i.e. whether I turn on Module_ITKTBB or change ITK_DEFAULT_THREADER to TBB. I am using the oneAPI TBB binary release, oneTBB 2021.1.1, from oneapi-src/oneTBB on GitHub.

Can you please guide me on what I am missing, in terms of configuration or otherwise?

The hardware is a 32-core Intel CPU with a Tesla K20c GPU.

Update 1: I found that the ITK_USE_MKL_WITH_TBB section was commented out, which is why it always ran sequentially. The numbers make sense now. However, when I uncommented that section and used MKL support without Module_ITKTBB turned ON, I got the following runtime error:

terminate called after throwing an instance of 'itk::ExceptionObject'
  what():  ../Modules/Core/Common/src/itkMultiThreaderBase.cxx:408:
itk::ERROR: ITK has been built without TBB support!
Aborted (core dumped)

Update 2: With Module_ITKTBB turned on, the above error is gone, but there isn’t much difference in performance. The cores reach at most 11% occupancy, and only occasionally. The ldd output of the executable is given below. I am guessing that the two different TBB libraries (libtbb.so.2 and libtbb.so.12) are the cause of this low performance. Please confirm whether that is the case, and suggest a workaround if so :slight_smile:

	linux-vdso.so.1 (0x00007ffc425e0000)
	libdl.so.2 => /lib64/libdl.so.2 (0x00007f993c78f000)
	libtbb.so.2 => /opt/intel/compilers_and_libraries_2019.3.199/linux/tbb/lib/intel64_lin/gcc4.7/libtbb.so.2 (0x00007f993c531000)
	libstdc++.so.6 => /lib64/libstdc++.so.6 (0x00007f993c399000)
	librt.so.1 => /lib64/librt.so.1 (0x00007f993c38f000)
	libtbb.so.12 => /home/pradeep/oneapi-tbb-2021.1.1/lib/intel64/gcc4.8/libtbb.so.12 (0x00007f993c132000)
	libpthread.so.0 => /lib64/libpthread.so.0 (0x00007f993c10f000)
	libm.so.6 => /lib64/libm.so.6 (0x00007f993bf8b000)
	libgcc_s.so.1 => /lib64/libgcc_s.so.1 (0x00007f993bf70000)
	libc.so.6 => /lib64/libc.so.6 (0x00007f993bdaa000)
	/lib64/ld-linux-x86-64.so.2 (0x00007f993c7c4000)

Thank you,
Pradeep.

I found out there is only a minor performance difference between MKL+TBB and FFTW. On Linux, it is more convenient to use FFTW, while on Windows MKL+TBB is easier.

Yes, the same TBB is desired – there may be a relevant option in the CMake cache entries to specify the TBB location.
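To spot this kind of mismatch quickly, the ldd output can be scanned for more than one TBB runtime. The following is only a minimal sketch: the function names `find_tbb_libraries` and `has_conflicting_tbb` are made up for illustration and are not part of ITK or TBB, and it assumes the usual `soname => path (address)` layout of ldd output on Linux.

```python
import re

# Matches ldd lines of the form "libtbb.so.2 => /path/to/libtbb.so.2 (0x...)".
# Lines without "=>" (e.g. linux-vdso) are ignored.
_LDD_LINE = re.compile(r"^\s*(?P<soname>\S+)\s+=>\s+(?P<path>\S+)")


def find_tbb_libraries(ldd_output):
    """Return {soname: resolved path} for every libtbb.so.* entry in ldd output."""
    found = {}
    for line in ldd_output.splitlines():
        match = _LDD_LINE.match(line)
        if match and match.group("soname").startswith("libtbb.so"):
            found[match.group("soname")] = match.group("path")
    return found


def has_conflicting_tbb(ldd_output):
    """True when the executable links against more than one TBB runtime."""
    return len(find_tbb_libraries(ldd_output)) > 1


if __name__ == "__main__":
    # The two TBB lines are copied from the ldd output posted above.
    sample = (
        "\tlibtbb.so.2 => /opt/intel/compilers_and_libraries_2019.3.199/linux/tbb/"
        "lib/intel64_lin/gcc4.7/libtbb.so.2 (0x00007f993c531000)\n"
        "\tlibtbb.so.12 => /home/pradeep/oneapi-tbb-2021.1.1/lib/intel64/gcc4.8/"
        "libtbb.so.12 (0x00007f993c132000)\n"
        "\tlibm.so.6 => /lib64/libm.so.6 (0x00007f993bf8b000)\n"
    )
    for soname, path in find_tbb_libraries(sample).items():
        print(soname, "->", path)
    print("conflict:", has_conflicting_tbb(sample))
```

For a real executable, the string passed in would be whatever `ldd` prints for it; a result with two different sonames, as in the output above, suggests that MKL and ITK were each linked against their own TBB.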

Okay, there is some progress with respect to finding TBB and MKL automatically.

Using oneMKL from the oneAPI Base Toolkit, both MKL and TBB are found automatically without any trouble. I am using the same TBB for both Module_ITKTBB and MKL.

However, I am still facing a similar performance issue, where the FFTW-based ITK build gives better performance than the MKL-based one. The only difference between these two ITK builds is whether ITK_USE_MKL is ON or OFF. All other CMake options are identical.

Also, you haven’t mentioned why, even when I set the cuFFT values manually, the FFTs don’t show any improvement in performance. I haven’t looked at the cuFFT CMake code in ITK yet. Any hints from you might improve my chances of figuring out the problem.

Thank you :slight_smile:

In terms of the build configuration, ITK uses cuFFT through its FFTW-compatible interface (cuFFTW), so make sure to enable the FFTW CMake options.

The relative performance will depend on the data size, the processing pipeline, and hardware.

FFTW may provide better performance in your use case.
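Since the outcome depends on data size, pipeline, and hardware, one way to settle it for a given use case is to time the same program built against each backend. Below is a rough sketch using only the Python standard library; the helper names `best_wall_time` and `compare_builds` and the executable paths in the comments are hypothetical, and wall-clock timing of a whole process is only a coarse measure.

```python
import subprocess
import sys
import time


def best_wall_time(command, repeats=3):
    """Run `command` (an argv list) `repeats` times; return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        # check=True raises if the program fails, so a crash is not timed as a success.
        subprocess.run(command, check=True, stdout=subprocess.DEVNULL)
        timings.append(time.perf_counter() - start)
    return min(timings)


def compare_builds(commands, repeats=3):
    """Return {label: best time in seconds} for a dict mapping label -> argv list."""
    return {label: best_wall_time(cmd, repeats) for label, cmd in commands.items()}


if __name__ == "__main__":
    # Placeholder commands so the sketch runs anywhere. In practice these would be
    # the same pipeline built against each backend, e.g. (hypothetical paths):
    #   {"fftw": ["./build-fftw/my_pipeline", "input.nrrd"],
    #    "mkl":  ["./build-mkl/my_pipeline", "input.nrrd"]}
    commands = {
        "baseline": [sys.executable, "-c", "pass"],
        "busy": [sys.executable, "-c", "sum(range(10**6))"],
    }
    for label, seconds in compare_builds(commands).items():
        print(f"{label}: {seconds:.3f} s")
```

Taking the minimum over a few repeats reduces noise from other processes; running the comparison on several input sizes shows whether the ranking of the backends changes with data size.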