~8x slower registration with itk-elastix Python API vs elastix CLI — minimal reproducible example

Shouldn't this be fixed by adding a line of code, erm.SetNumberOfThreads(n_threads), in order to make a fair comparison between the CLI and itk-elastix?

:+1: Yes that’s right, I modified the script below to take this into account.

I’ve modified the benchmark to mimic the scenarios I have in my application, which is the execution of many registration jobs in parallel:

bench-parallel.py (12.4 KB)

I tested:

  • ITK_GLOBAL_DEFAULT_THREADER = Platform vs Pool
  • Number of threads per job = 1 or 12
  • Number of parallel jobs = 1 or 6 or 32

If a single job is processed, then itk-elastix is on par with CLI. But when jobs are run in parallel, itk-elastix stays at least 4× slower than CLI:

In short, Platform is worse than Pool. And within Pool, using 12 threads per job instead of 1 gives a ~2× speedup. But there’s a ~4× penalty I can’t eliminate when many jobs are run in parallel.
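For reference, the shape of the parallel harness can be sketched with the standard library alone. Here run_one_registration is a hypothetical stand-in for the actual itk-elastix registration call, and the busy loop only exists to make the sketch self-contained:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def run_one_registration(job_id):
    """Placeholder for a single registration job; a real harness would
    call itk-elastix (or the elastix CLI) here instead of burning CPU."""
    total = sum(i * i for i in range(100_000))
    return job_id, total

n_parallel_jobs = 4  # illustrative; the benchmark above sweeps 1 / 6 / 32

t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=n_parallel_jobs) as pool:
    results = list(pool.map(run_one_registration, range(n_parallel_jobs)))
wall = time.perf_counter() - t0

print(f"{len(results)} jobs in {wall:.3f} s wall time")
```

With an in-process library like itk-elastix, all jobs in such a harness share one ITK ThreadPool (and, before GIL-free builds, the Python GIL), which is exactly where the contention discussed here comes from.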

(claude discussion with full results: https://claude.ai/share/de059ec6-e4b8-4c55-9cc4-cac89f18af4d)

(I used itk-elastix 0.25.2 vs elastix-5.2.0)

There are a lot of different things going on in this scenario, with threads, processes, and the Python GIL, and also with the thread pool being shared within a single process.

ITK Python 6.0 will release the GIL during execution, but that is only in beta now. I have a locally compiled version of SimpleITK with Elastix, which I added to your benchmark. SimpleITK will also release the GIL, similarly to ITK Python, in its next release.

$ python bench_parallel.py --fixed blobs.tif --moving blobs-rot15deg.tif --runs 4 --threads 2 --elastix bin/elastix --parallel 4

Run  CLI wall(s)  CLI avg(s)  ITK wall(s)  ITK avg(s)  SITK wall(s)  SITK avg(s)  itk/cli wall  sitk/cli wall
 *1       1.619       1.577       18.198       1.527         1.910        1.831        11.24x          1.18x
  2       1.592       1.555        6.194       1.536         1.922        1.804         3.89x          1.21x
  3       1.570       1.542        6.103       1.508         1.884        1.801         3.89x          1.20x
  4       1.599       1.541        6.132       1.507         1.861        1.793         3.84x          1.16x

$ python bench_parallel.py --fixed blobs.tif --moving blobs-rot15deg.tif --runs 6 --threads 2 --elastix bin/elastix --parallel 4

Run  CLI wall(s)  CLI avg(s)  ITK wall(s)  ITK avg(s)  SITK wall(s)  SITK avg(s)  itk/cli wall  sitk/cli wall
 *1       2.219       2.112       20.769       1.517         2.293        2.170         9.36x          1.03x
  2       2.144       2.114        9.198       1.506         2.215        2.136         4.29x          1.03x
  3       2.091       2.015        9.128       1.507         2.289        2.194         4.37x          1.09x
  4       2.068       2.030        9.139       1.493         2.313        2.199         4.42x          1.12x

This is on a Mac M1 Pro with 10-ish cores. For high throughput of jobs (getting through many jobs as quickly as possible), I find 2 threads per ITK job, and a number of parallel jobs around 0.5-0.75 times the core count, to be the most efficient.
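That rule of thumb can be written down as a small sketch. The 0.75 factor and the 2 threads per job come from the observation above, not from any universal constant:

```python
import os

cores = os.cpu_count() or 1                   # e.g. ~10 on an M1 Pro
threads_per_job = 2                           # keep within-job parallelism low
n_parallel_jobs = max(1, int(0.75 * cores))   # ~0.5-0.75x the core count

print(f"{cores} cores -> {n_parallel_jobs} jobs x {threads_per_job} threads each")
```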

There is a lot going on here for you to tune!


This looks awesome! Do you think I can use SimpleITK as a drop-in replacement for elastix, given the example registration in this benchmark? If I understand correctly, the fancy GIL-free versions have not been released yet, neither for itk-elastix nor for SimpleITK? Do you know if there is an approximate timeline for this release?
Thanks!

With many jobs run in parallel, you should strongly consider using the TBB multi-threader. It does some kind of machine-wide load balancing. I think it is available in already-released ITK, and therefore in itk-elastix.
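One way to try this without touching the script is via environment variables. ITK_GLOBAL_DEFAULT_THREADER is the same variable already swept in the benchmark above (Platform vs Pool); I believe it also accepts TBB, and ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS sizes the pool. Treat the exact spellings as something to verify against your ITK version:

```shell
# Select the TBB threader and a modest per-job thread count before
# launching the benchmark (values are illustrative, not recommendations).
export ITK_GLOBAL_DEFAULT_THREADER=TBB
export ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=2
python bench_parallel.py --fixed blobs.tif --moving blobs-rot15deg.tif --parallel 4
```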

I agree with Brad, when you have many jobs running in parallel, within-job parallelism should be low.


I’m still trying to understand the problem observed with the original benchmark. It looks like the specified number of threads doesn’t really matter in this case, as long as it is greater than zero!

The original slowness of the benchmark function run_itk disappears when I only add the following line, directly after the import itk:

itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads(1)

I’m trying to understand why! I do know that this GlobalDefaultNumberOfThreads is used by the constructor of ITK’s ThreadPool:

So then, what happens when GlobalDefaultNumberOfThreads is not explicitly specified? Is ITK’s ThreadPool then still properly constructed? Does the Python wrapper tweak the ThreadPool somehow? Do you have a clue?

The entire function is here, which (in the absence of other specifications) delegates to the platform-specific function. There is a global maximum defined here.

I did try a similar benchmark in C++ as well: an executable written in C++, linked to “elastix-5.3.1.lib” and its dependencies, built with MSVC 2026 (Release). It suggested the following relation between the selected number of threads (erm->SetNumberOfThreads(n_threads)) and the registration duration:

n_threads = 0  ==> 2.8 sec
n_threads = 1  ==> 2.9 sec
n_threads = 2  ==> 1.6 sec
n_threads = 4  ==> 1.0 sec
n_threads = 8  ==> 0.7 sec
n_threads = 16 ==> 0.8 sec
n_threads = 32 ==> 0.9 sec
n_threads = 64 ==> 1.4 sec

So in C++, I do not see a particular slowness when n_threads = 0.

The code is below:

#include <itkElastixRegistrationMethod.h>
#include <elxParameterObject.h>

#include <itkImage.h>
#include <itkImageFileReader.h>
#include <itkSize.h>

#include <cassert>
#include <chrono>
#include <ctime>
#include <filesystem>
#include <iomanip>
#include <iostream>
#include <sstream>
#include <string>


namespace
{
void
run_cpp(const std::string & fixed,
        const std::string & moving,
        const std::string & param_file,
        const unsigned int  n_threads)
{
  using ImageType = itk::Image<float>;

  const auto fixed_img = itk::ReadImage<ImageType>(fixed);
  const auto moving_img = itk::ReadImage<ImageType>(moving);

  const auto param_obj = elx::ParameterObject::New();

  param_obj->ReadParameterFile(param_file);

  const std::filesystem::path out_dir = [] {
    const std::time_t nowTime = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
    std::tm           nowTm{};
    localtime_s(&nowTm, &nowTime);

    std::ostringstream outputStringStream;
    outputStringStream << std::put_time(&nowTm, "elastix_cpp_%Y-%m-%dT%H.%M.%S");
    return outputStringStream.str();
  }();

  std::filesystem::create_directory(out_dir);
  const auto erm = itk::ElastixRegistrationMethod<ImageType, ImageType>::New();
  erm->SetFixedImage(fixed_img);
  erm->SetMovingImage(moving_img);
  erm->SetParameterObject(param_obj);
  erm->SetOutputDirectory(out_dir.string());
  erm->SetLogToConsole(false);
  erm->SetLogToFile(true);
  erm->SetNumberOfThreads(n_threads);

  const auto t0 = std::chrono::high_resolution_clock::now();
  erm->UpdateLargestPossibleRegion();
  const auto elapsed =
    std::chrono::duration_cast<std::chrono::duration<double>>(std::chrono::high_resolution_clock::now() - t0).count();

  const auto transform = out_dir / "TransformParameters.0.txt";
  const bool ok{ std::filesystem::exists(transform) };
  std::cout << "  [elx-c++] " << elapsed << " s, transform exists: " << std::boolalpha << ok << std::endl;
}
} // namespace


int
main()
{
  try
  {
    for (const unsigned int n_threads : { 0, 1, 2, 4, 8, 16, 32, 64 })
    {
      std::cout << "n_threads = " << n_threads << std::endl;
      for (int i{}; i < 3; ++i)
      {
        run_cpp("blobs.tif", "blobs-rot15deg.tif", "params_bspline.txt", n_threads);
      }
    }
  }
  catch (const std::exception & stdException)
  {
    std::cout << stdException.what();
  }
} // end main()

Update: I realize now that calling SetNumberOfThreads multiple times in C++, within one and the same process, isn’t so effective, because the size of the ITK ThreadPool is no longer adjusted after its initialization. To be continued…
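Since the ThreadPool size is fixed after initialization, one workaround is to give each thread-count configuration its own fresh process. A hypothetical Python driver could look like this; the child command here is just a stand-in that echoes the setting, and a real driver would launch the benchmark executable instead:

```python
import os
import subprocess
import sys

def run_in_fresh_process(cmd, n_threads):
    """Run cmd in a child process, so the ITK ThreadPool in that process
    is constructed from scratch with the requested default thread count
    (ITK reads ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS at first use)."""
    env = dict(os.environ, ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=str(n_threads))
    return subprocess.run(cmd, env=env, capture_output=True, text=True)

# Stand-in child command; replace with the C++ benchmark executable.
child = [sys.executable, "-c",
         "import os; print(os.environ['ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS'])"]

outputs = [run_in_fresh_process(child, n).stdout.strip() for n in (1, 2, 4)]
print(outputs)  # each configuration saw its own setting
```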


It looks like TBB is already the default multi threader in itk-elastix!

print(itk.MultiThreaderBase.GetGlobalDefaultThreader())

Prints 2 (the enum value of the TBB threader), when I’m using itk-elastix 0.25.2 on Windows.

print(erm.GetMultiThreader())

prints:

TBBMultiThreader (000001B8F5D0DD60)
  RTTI typeinfo:   class itk::TBBMultiThreader
  Reference Count: 2
  Modified Time: 68
  Debug: Off
  Object Name:
  Observers:
    none
  Number of Work Units: 1024
  Number of Threads: 64
  Global Maximum Number Of Threads: 128
  Global Default Number Of Threads: 64
  Global Default Threader Type: itk::MultiThreaderBaseEnums::Threader::TBB
  SingleMethod: 0000000000000000
  SingleData: 0000000000000000

 GetNumberOfThreads 0
 GetGlobalDefaultNumberOfThreads 64
 GetGlobalMaximumNumberOfThreads 128

Does TBB MultiThreader require calling SetGlobalDefaultNumberOfThreads or SetGlobalMaximumNumberOfThreads, before it can be used efficiently?

It does some load balancing while splitting the region into chunks. I don’t think that requires changing the number of threads. But if you know something about the appropriate number of threads, it is advisable to exploit it.