In short, Platform is worse than Pool. Within Pool, using 12 threads per job instead of 1 gives a ~2× speedup, but there’s a ~4× penalty I can’t eliminate when many jobs are run in parallel.
There are a lot of different things going on in this scenario, with threads, processes, and the Python GIL, and also with the thread pool being shared within a single process.
ITK Python 6.0 will enable the GIL to be released, but that is just in beta now. I have a locally compiled version of SimpleITK with Elastix, which I added to your benchmark. SimpleITK will also release the GIL, similar to ITK Python, in its next release.
This is on a Mac M1 Pro with 10-ish cores. For high throughput of jobs (getting through many jobs as quickly as possible), I find 2 threads per ITK job, and a number of jobs around 0.5–0.75 times the core count, to be the most efficient.
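For concreteness, here is a minimal sketch of that configuration, assuming a multiprocessing pool of worker processes and the itk-elastix Python API; the file names, the job list, and the register_pair helper are placeholders, and the 0.6 factor is just a value in the 0.5–0.75 range mentioned above:

import multiprocessing as mp
import os

import itk


def _init_worker():
    # Each worker process has its own ITK, so limit within-job parallelism
    # here: 2 ITK threads per registration job.
    itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads(2)


def register_pair(paths):
    fixed_path, moving_path = paths
    fixed = itk.imread(fixed_path, itk.F)
    moving = itk.imread(moving_path, itk.F)
    result_image, result_parameters = itk.elastix_registration_method(fixed, moving)
    return result_parameters


if __name__ == "__main__":
    # Number of concurrent jobs at roughly 0.5-0.75 x the core count.
    n_jobs = max(1, int(0.6 * os.cpu_count()))
    jobs = [("fixed_0.nii.gz", "moving_0.nii.gz")]  # placeholder job list
    with mp.Pool(processes=n_jobs, initializer=_init_worker) as pool:
        all_parameters = pool.map(register_pair, jobs)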
This looks awesome! Do you think I can use sitk as a drop-in replacement for elastix, given the example registration in this benchmark? Maybe, if I understand correctly, the fancy GIL-free versions have not been released yet? Neither for itk-elastix nor for sitk? Do you know if there’s an approximate timeline for this release?
Thanks!
With many jobs run in parallel, you should strongly consider using the TBB multi-threader. It does some kind of machine-wide balancing. I think it is available with the already released ITK, and therefore with itk-elastix.
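If it helps, here is a minimal sketch of switching to TBB from Python, assuming the installed ITK/itk-elastix build includes TBB support; it uses the ITK_GLOBAL_DEFAULT_THREADER environment variable, which must be set before itk is imported:

import os

# Select the TBB multi-threader globally; must happen before importing itk.
os.environ["ITK_GLOBAL_DEFAULT_THREADER"] = "TBB"

import itk

# MultiThreaderBase.New() instantiates the global default threader type,
# so this should print an itk::TBBMultiThreader instance.
print(itk.MultiThreaderBase.New())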
I agree with Brad: when you have many jobs running in parallel, within-job parallelism should be low.
I’m still trying to understand the problem observed with the original benchmark. It looks like the specified number of threads doesn’t really matter in this case, as long as it is greater than zero!
The original slowness of the benchmark function run_itk disappears when I just add the following line, directly after import itk:
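(The line itself isn’t reproduced in this quote; based on the surrounding discussion it presumably resembled the following, where the value 4 is only an example.)

import itk

# The exact count is a guess; per the observation above, any value greater
# than zero seems to avoid the slowdown.
itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads(4)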
I’m trying to understand why! I do know that this GlobalDefaultNumberOfThreads is used by the constructor of ITK’s ThreadPool.
So then, what happens when GlobalDefaultNumberOfThreads is not explicitly specified? Is ITK’s ThreadPool then still properly constructed? Does the Python wrapper tweak the ThreadPool somehow? Do you have a clue?
The entire function is here, which (in the absence of other specifications) delegates to the platform-specific function. There is a global maximum defined here.
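For reference, both values can also be inspected from Python (the same getters appear in the output further below); when nothing is specified, the default is resolved from the platform-specific function and capped by the global maximum:

import itk

# Resolved lazily from the platform's hardware concurrency, capped by the
# global maximum, unless explicitly set beforehand.
print(itk.MultiThreaderBase.GetGlobalDefaultNumberOfThreads())
print(itk.MultiThreaderBase.GetGlobalMaximumNumberOfThreads())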
I did try a similar benchmark in C++ as well: an executable linked to “elastix-5.3.1.lib” and its dependencies, built with MSVC 2026 (Release). It suggested the following relation between the selected number of threads (erm->SetNumberOfThreads(n_threads)) and registration duration:
Update: I realize now that calling SetNumberOfThreads multiple times in C++ within one and the same process run isn’t so effective, because the size of the ITK ThreadPool is no longer adjusted after its initialization. To be continued…
Prints 2 when I’m using itk-elastix 0.25.2 on Windows.
print(erm.GetMultiThreader())
prints:
TBBMultiThreader (000001B8F5D0DD60)
RTTI typeinfo: class itk::TBBMultiThreader
Reference Count: 2
Modified Time: 68
Debug: Off
Object Name:
Observers:
none
Number of Work Units: 1024
Number of Threads: 64
Global Maximum Number Of Threads: 128
Global Default Number Of Threads: 64
Global Default Threader Type: itk::MultiThreaderBaseEnums::Threader::TBB
SingleMethod: 0000000000000000
SingleData: 0000000000000000
GetNumberOfThreads 0
GetGlobalDefaultNumberOfThreads 64
GetGlobalMaximumNumberOfThreads 128
Does the TBB MultiThreader require calling SetGlobalDefaultNumberOfThreads or SetGlobalMaximumNumberOfThreads before it can be used efficiently?
It does some load balancing while splitting the region into chunks. I don’t think that requires changing the number of threads. But if you do know something about the appropriate number of threads, it is advisable to exploit it.
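For example, if you know several registrations will run concurrently, it seems advisable to cap each job explicitly. A rough sketch using the itk-elastix Python API, assuming the Python wrapping exposes the same SetNumberOfThreads seen in the C++ snippet above (file names are placeholders):

import itk

fixed = itk.imread("fixed.nii.gz", itk.F)
moving = itk.imread("moving.nii.gz", itk.F)

parameter_object = itk.ParameterObject.New()
parameter_object.AddParameterMap(parameter_object.GetDefaultParameterMap("rigid"))

erm = itk.ElastixRegistrationMethod.New(fixed_image=fixed, moving_image=moving)
erm.SetParameterObject(parameter_object)
# With e.g. 5 concurrent jobs on a 10-core machine, ~2 threads per job
# avoids oversubscribing the machine.
erm.SetNumberOfThreads(2)
erm.UpdateLargestPossibleRegion()
result_image = erm.GetOutput()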