~8x slower registration with itk-elastix Python API vs elastix CLI — minimal reproducible example

Should this be fixed by adding a line of code, erm.SetNumberOfThreads(n_threads), in order to do a fair comparison between the CLI and itk-elastix?

:+1: Yes, that’s right. I modified the script below to take this into account.
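
For context, a minimal sketch of what the itk-elastix side of the comparison looks like with that line added (file names and the n_threads value are taken from elsewhere in this thread; this is an assumed reconstruction, not the benchmark script itself):

import itk

n_threads = 2  # match the CLI's -threads option

fixed = itk.imread("blobs.tif", itk.F)
moving = itk.imread("blobs-rot15deg.tif", itk.F)

params = itk.ParameterObject.New()
params.ReadParameterFile("params_bspline.txt")

erm = itk.ElastixRegistrationMethod.New(fixed, moving)
erm.SetParameterObject(params)
erm.SetNumberOfThreads(n_threads)  # the line under discussion
erm.UpdateLargestPossibleRegion()
result = erm.GetOutput()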

I’ve modified the benchmark to mimic the scenarios I have in my application, which is the execution of many registration jobs in parallel:

bench-parallel.py (12.4 KB)

I tested:

  • ITK_GLOBAL_DEFAULT_THREADER = Platform vs Pool (selection shown in the snippet after this list)
  • Number of threads per job = 1 or 12
  • Number of parallel jobs = 1 or 6 or 32
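
For reference, the threader can be selected either through the ITK_GLOBAL_DEFAULT_THREADER environment variable or programmatically (the integer form is the one used later in this thread; 0 = Platform, 1 = Pool, 2 = TBB):

import os

# Via the environment; read when ITK first initializes its global threader:
os.environ["ITK_GLOBAL_DEFAULT_THREADER"] = "Pool"

# Or programmatically, before any filter is constructed:
import itk
itk.MultiThreaderBase.SetGlobalDefaultThreader(1)  # 0 = Platform, 1 = Pool, 2 = TBB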

If a single job is processed, then itk-elastix is on par with the CLI. But when jobs are run in parallel, itk-elastix stays at least 4× slower than the CLI.

In short, Platform is worse than Pool. And within Pool, using 12 threads per job instead of 1 gives a ~2× speedup. But there’s a ~4× penalty I can’t eliminate when many jobs are run in parallel.

(Claude discussion with full results: https://claude.ai/share/de059ec6-e4b8-4c55-9cc4-cac89f18af4d)

(I used itk-elastix 0.25.2 vs elastix-5.2.0)

There are a lot of different things going on in this scenario: threads, processes, the Python GIL, and the thread pool being shared within a single process.
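
Since the GIL serializes Python-level calls within one interpreter, one way to sidestep both the GIL and the shared thread pool is to give each job its own process. A hypothetical sketch (run_one is an assumed helper that performs a single registration):

from concurrent.futures import ProcessPoolExecutor

def run_one(job_id):
    # Imported per worker process, so each job gets its own ITK thread pool.
    import itk
    itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads(2)  # threads per job
    # ... perform one itk-elastix registration here ...
    return job_id

if __name__ == "__main__":
    with ProcessPoolExecutor(max_workers=4) as pool:  # number of parallel jobs
        results = list(pool.map(run_one, range(8)))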

ITK Python 6.0 will enable the GIL to be unlocked, but that is just in beta now. I have a locally compiled version of SimpleITK with Elastix, which I added to your benchmark. SimpleITK unlocks the GIL similarly to ITK Python in its next release.
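
For reference, a hedged sketch of what the SITK column below runs (it assumes a SimpleITK build with Elastix enabled and the SimpleElastix-style API; file names follow the rest of this thread):

import SimpleITK as sitk

elx = sitk.ElastixImageFilter()
elx.SetFixedImage(sitk.ReadImage("blobs.tif", sitk.sitkFloat32))
elx.SetMovingImage(sitk.ReadImage("blobs-rot15deg.tif", sitk.sitkFloat32))
elx.SetParameterMap(sitk.ReadParameterFile("params_bspline.txt"))
elx.LogToConsoleOff()
result = elx.Execute()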

$ python bench_parallel.py --fixed blobs.tif --moving blobs-rot15deg.tif --runs 4 --threads 2 --elastix bin/elastix --parallel 4

Run   CLI wall(s)   CLI avg(s)   ITK wall(s)   ITK avg(s)   SITK wall(s)   SITK avg(s)   itk/cli wall   sitk/cli wall
*1    1.619         1.577        18.198        1.527        1.910          1.831         11.24x         1.18x
2     1.592         1.555        6.194         1.536        1.922          1.804         3.89x          1.21x
3     1.570         1.542        6.103         1.508        1.884          1.801         3.89x          1.20x
4     1.599         1.541        6.132         1.507        1.861          1.793         3.84x          1.16x

$ python bench_parallel.py --fixed blobs.tif --moving blobs-rot15deg.tif --runs 6 --threads 2 --elastix bin/elastix --parallel 4

Run   CLI wall(s)   CLI avg(s)   ITK wall(s)   ITK avg(s)   SITK wall(s)   SITK avg(s)   itk/cli wall   sitk/cli wall
*1    2.219         2.112        20.769        1.517        2.293          2.170         9.36x          1.03x
2     2.144         2.114        9.198         1.506        2.215          2.136         4.29x          1.03x
3     2.091         2.015        9.128         1.507        2.289          2.194         4.37x          1.09x
4     2.068         2.030        9.139         1.493        2.313          2.199         4.42x          1.12x

This is on a Mac M1 Pro with 10-ish cores. For high throughput of jobs (getting through many jobs as quickly as possible), I find 2 threads per ITK job, with the number of parallel jobs at about 0.5-0.75 times the core count, to be the most efficient.
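
Spelling that heuristic out as a sketch (my reading of the numbers above; the 0.6 factor is an assumed midpoint of the 0.5-0.75 range, so tune it for your machine):

import os

cores = os.cpu_count() or 1                  # ~10 on an M1 Pro
threads_per_job = 2
parallel_jobs = max(1, round(0.6 * cores))   # 0.5-0.75 x core count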

There is a LOT going on here for you to tune!

1 Like

This looks awesome! Do you think I can use sitk as a drop-in replacement for elastix, given the example registration in this benchmark? If I understand correctly, the fancy GIL-free versions have not been released yet, neither for itk-elastix nor for sitk? Do you know if there’s an approximate timeline for this release?
Thanks!

With many jobs run in parallel, you should strongly consider using the TBB multi-threader. It does some kind of machine-wide balancing. I think it is available in already-released ITK, and therefore in itk-elastix.

I agree with Brad: when you have many jobs running in parallel, within-job parallelism should be low.

2 Likes

I’m still trying to understand the problem observed with the original benchmark. It looks like the specified number of threads doesn’t really matter in this case, as long as it is greater than zero!

The original slowness of the benchmark function run_itk disappears when I merely add the following line, directly after import itk:

itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads(1)

I’m trying to understand why! I do know that this GlobalDefaultNumberOfThreads is used by the constructor of ITK’s ThreadPool.

So then, what happens when GlobalDefaultNumberOfThreads is not explicitly specified? Is ITK’s ThreadPool then still properly constructed? Does the Python wrapper tweak the ThreadPool somehow? Do you have a clue?
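
While digging, a quick way to inspect the relevant globals from Python (these static accessors all exist on itk.MultiThreaderBase):

import itk

print(itk.MultiThreaderBase.GetGlobalDefaultThreader())          # 0 = Platform, 1 = Pool, 2 = TBB
print(itk.MultiThreaderBase.GetGlobalDefaultNumberOfThreads())
print(itk.MultiThreaderBase.GetGlobalMaximumNumberOfThreads())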

The entire function is here, which (in the absence of other specifications) delegates to the platform-specific function. There is a global maximum defined here.

I did try a similar benchmark in C++ as well, with an executable linked to “elastix-5.3.1.lib” and its dependencies, built with MSVC 2026 (Release). It suggested the following relation between the selected number of threads (erm->SetNumberOfThreads(n_threads)) and registration duration:

n_threads = 0  ==> 2.8 sec
n_threads = 1  ==> 2.9 sec
n_threads = 2  ==> 1.6 sec
n_threads = 4  ==> 1.0 sec
n_threads = 8  ==> 0.7 sec
n_threads = 16 ==> 0.8 sec
n_threads = 32 ==> 0.9 sec
n_threads = 64 ==> 1.4 sec

So in C++, I do not see a particular slowness when n_threads = 0.

The code is below:

#include <itkElastixRegistrationMethod.h>
#include <elxParameterObject.h>

#include <itkImage.h>
#include <itkImageFileReader.h>
#include <itkSize.h>

#include <cassert>
#include <chrono>
#include <ctime>
#include <filesystem>
#include <iomanip>   // std::put_time
#include <iostream>  // std::cout
#include <sstream>


namespace
{
// Runs a single elastix registration with the specified number of threads,
// and prints the elapsed time of UpdateLargestPossibleRegion().
void
run_cpp(const std::string & fixed,
        const std::string & moving,
        const std::string & param_file,
        const unsigned int  n_threads)
{
  using ImageType = itk::Image<float>;

  const auto fixed_img = itk::ReadImage<ImageType>(fixed);
  const auto moving_img = itk::ReadImage<ImageType>(moving);

  const auto param_obj = elx::ParameterObject::New();

  param_obj->ReadParameterFile(param_file);

  const std::filesystem::path out_dir = [] {
    const std::time_t nowTime = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
    std::tm           nowTm{};
    localtime_s(&nowTm, &nowTime);

    std::ostringstream outputStringStream;
    outputStringStream << std::put_time(&nowTm, "elastix_cpp_%Y-%m-%dT%H.%M.%S");
    return outputStringStream.str();
  }();

  std::filesystem::create_directory(out_dir);
  const auto erm = itk::ElastixRegistrationMethod<ImageType, ImageType>::New();
  erm->SetFixedImage(fixed_img);
  erm->SetMovingImage(moving_img);
  erm->SetParameterObject(param_obj);
  erm->SetOutputDirectory(out_dir.string());
  erm->SetLogToConsole(false);
  erm->SetLogToFile(true);
  erm->SetNumberOfThreads(n_threads);

  const auto t0 = std::chrono::high_resolution_clock::now();
  erm->UpdateLargestPossibleRegion();
  const auto elapsed =
    std::chrono::duration_cast<std::chrono::duration<double>>(std::chrono::high_resolution_clock::now() - t0).count();

  const auto transform = out_dir / "TransformParameters.0.txt";
  const bool ok{ std::filesystem::exists(transform) };
  std::cout << "  [elx-c++] " << elapsed << " s transform exists:  ok " << std::boolalpha << ok << std::endl;
}
} // namespace


int
main()
{
  try
  {
    for (const unsigned int n_threads : { 0, 1, 2, 4, 8, 16, 32, 64 })
    {
      std::cout << "n_threads = " << n_threads << std::endl;
      for (int i{}; i < 3; ++i)
      {
        run_cpp("blobs.tif", "blobs-rot15deg.tif", "params_bspline.txt", n_threads);
      }
    }
  }
  catch (const std::exception & stdException)
  {
    std::cout << stdException.what();
  }
} // end main()

Update: I realize now that calling SetNumberOfThreads multiple times in C++ during the run of one and the same process isn’t so effective, because the size of the ITK ThreadPool is no longer adjusted after its initialization. To be continued…

1 Like

It looks like TBB is already the default multi threader in itk-elastix!

print(itk.MultiThreaderBase.GetGlobalDefaultThreader())

Prints 2 (the enum value for TBB) when I’m using itk-elastix 0.25.2 on Windows.

print(erm.GetMultiThreader())

prints:

TBBMultiThreader (000001B8F5D0DD60)
  RTTI typeinfo:   class itk::TBBMultiThreader
  Reference Count: 2
  Modified Time: 68
  Debug: Off
  Object Name:
  Observers:
    none
  Number of Work Units: 1024
  Number of Threads: 64
  Global Maximum Number Of Threads: 128
  Global Default Number Of Threads: 64
  Global Default Threader Type: itk::MultiThreaderBaseEnums::Threader::TBB
  SingleMethod: 0000000000000000
  SingleData: 0000000000000000

 GetNumberOfThreads 0
 GetGlobalDefaultNumberOfThreads 64
 GetGlobalMaximumNumberOfThreads 128

Does TBB MultiThreader require calling SetGlobalDefaultNumberOfThreads or SetGlobalMaximumNumberOfThreads, before it can be used efficiently?

It does some load balancing while splitting the region into chunks. I don’t think that requires changing the number of threads. But if you do know something about the number of threads, it is advisable to exploit that knowledge.

It looks like itk::TBBMultiThreader has many more Work Units than itk::PoolMultiThreader, by default. As I mentioned before:

TBBMultiThreader (000001B8F5D0DD60)
  RTTI typeinfo:   class itk::TBBMultiThreader
  Reference Count: 2
  Modified Time: 68
  Debug: Off
  Object Name:
  Observers:
    none
  Number of Work Units: 1024

Whereas when ITK’s PoolThreader is selected (by itk.MultiThreaderBase.SetGlobalDefaultThreader(1)), print(erm.GetMultiThreader()) prints:


PoolMultiThreader (0000016A759A2260)
  RTTI typeinfo:   class itk::PoolMultiThreader
  Reference Count: 2
  Modified Time: 68
  Debug: Off
  Object Name:
  Observers:
    none
  Number of Work Units: 128

Does that explain why TBBMultiThreader is so much slower, by default? Because it uses 1024 work units, whereas PoolThreader just uses 128 work units?
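
One way to test that hypothesis would be to shrink the TBB threader’s work-unit count to match the Pool default (a hedged experiment; erm is the ElastixRegistrationMethod instance from above, and MultiThreaderBase does expose these accessors):

threader = erm.GetMultiThreader()
print(threader.GetNumberOfWorkUnits())  # 1024 for TBBMultiThreader by default here
threader.SetNumberOfWorkUnits(128)      # match PoolMultiThreader's default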

It looks like itk-elastix uses TBBMultiThreader by default, whereas the CLI (elastix.exe) uses the PoolThreader.


For the record, the default-constructors of TBBMultiThreader and PoolMultiThreader assign different values to m_NumberOfWorkUnits.

3 Likes

Excellent discussion, everyone - thank you for the detailed profiling and repros.

I’ve reproduced the results others have reported and can confirm that, for many paths, the ITK‑elastix Python wrapping runs the same or even faster than the elastix binary when using the Pool threading backend. It is interesting that the avg part of the call is sometimes faster, while the wall time is significantly slower; this gap is not just noise but points to real overhead outside the actual registration kernel.

One key factor is the TBB threading backend. As Niels noted, TBB has sub‑optimal tuning for modern multi‑core systems and may not play well with the particular classes and workloads being exercised here. For ITK 6, it may make sense to either remove TBB support from the Python packages or, at minimum, tune the work‑unit model far more aggressively, both in general and for the classes involved in elastix‑driven registration.

The difference between wall and avg metrics largely reflects module‑import and initialization overhead. The recently released itk‑elastix 0.23.3 includes additional GIL‑release protections, which helps in some scenarios, but there are other factors at play. In Nico’s bench‑parallel.py benchmark, one notable element is an extra thread lock that hints at part of this overhead; the bottleneck is not only the registration loop itself but also the machinery around it.
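
A quick, hedged way to see how much of the wall-vs-avg gap is import and initialization overhead rather than registration time (itk.force_load() eagerly loads the lazily imported submodules):

import time

t0 = time.perf_counter()
import itk
itk.force_load()  # load all lazy submodules up front
print(f"import + force_load: {time.perf_counter() - t0:.2f} s")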

ITK has long used a custom lazy‑loading system for its wrapping, which dates from the Python 2.6 / early‑WrapITK era, when developers were still using technology like CVS. This system does improve the developer experience by limiting import time and memory footprint, but it comes at the cost of numerous locks, bookkeeping, and extra infrastructure for pickling and cross‑module references, which both degrades performance and introduces subtle bugs in corner cases.

Since Python 3.7, PEP 562 has provided the core language support needed to replace this custom lazy‑loading machinery with a more standard, CPython mechanism. The scikit‑image community has already embraced this pattern, and the broader scientific‑Python ecosystem is moving toward it via the Scientific‑Python SPEC‑0001.
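
For illustration, the PEP 562 pattern is just a module-level __getattr__; a minimal sketch with made-up submodule names, not ITK’s actual implementation:

# mypackage/__init__.py
import importlib

_lazy_submodules = {"filtering", "registration"}

def __getattr__(name):
    # Imported on first attribute access instead of at package import time.
    if name in _lazy_submodules:
        return importlib.import_module(f"{__name__}.{name}")
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")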

A corresponding patch for ITK 6 is under review here, which proposes using PEP 562‑style __getattr__‑driven lazy loading instead of the current home‑grown scheme:

ENH: Replace custom LazyITKModule with native PEP 562 lazy loading by thewtex · Pull Request #6183 · InsightSoftwareConsortium/ITK · GitHub.

This change will help, but what is even more critical is that we (re‑)organize ITK’s classes into modules and clean up the module‑dependency graph so that importing from one module does not implicitly pull in nearly every other ITK module. CC @hjmjohnson @simon.rit

4 Likes