~8x slower registration with itk-elastix Python API vs elastix CLI — minimal reproducible example

Disclaimer: this is a crosspost from the Image.sc Forum (Usage & Issues category), posted there under the same title.

Hi all,

I’m integrating itk-elastix as a drop-in replacement for the elastix CLI in a Java/Fiji plugin, and I’m hitting a consistent ~8.5x wall-clock slowdown when using the Python API compared to spawning the elastix executable directly. This is not a warm-up effect — the ratio holds across all runs.

Environment

  • Windows 11, Python 3.11 (conda-forge), itk-elastix==0.21.0 (pinned to match elastix 5.2.0)
  • 14 physical cores, same thread count passed to both paths
  • Test images: blobs.tif / blobs-rot15deg.tif (256×256, standard ImageJ sample images)
  • Transform: BSpline, 6 resolutions, 100 iterations/level, 4096 random samples, FinalGridSpacingInVoxels=20
  • Exactly the same parameter file is used for both paths
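For reference, the relevant entries of params_bspline.txt would look roughly like this (a sketch reconstructed from the settings listed above, not the exact attached file):

```
(Transform "BSplineTransform")
(NumberOfResolutions 6)
(MaximumNumberOfIterations 100)
(ImageSampler "RandomCoordinate")
(NumberOfSpatialSamples 4096)
(FinalGridSpacingInVoxels 20)
(WriteResultImage "false")
```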

The two execution paths

Path 1 — CLI (fast)

  subprocess.run([
      "elastix.exe",
      "-f", fixed, "-m", moving,
      "-p", param_file,      # path to params_bspline.txt
      "-out", out_dir,
      "-threads", "14"
  ])

Path 2 — itk-elastix Python API (slow)

  erm = itk.ElastixRegistrationMethod[ImageType, ImageType].New()
  erm.SetFixedImage(fixed_img)
  erm.SetMovingImage(moving_img)
  erm.SetParameterObject(param_obj)   # same param file, NumberOfThreads patched to 14
  erm.SetOutputDirectory(out_dir)
  erm.SetLogToConsole(False)
  erm.SetLogToFile(True)
  erm.UpdateLargestPossibleRegion()

Results (3 runs, no warm-up effect observed)

  Run          CLI (s)     itk-API (s)   ratio
  ---------------------------------------------
  1             2.50          22.08       8.82x
  2             2.73          22.24       8.16x
  3             2.41          21.70       8.99x

Steady-state avg CLI: 2.57 s
Steady-state avg itk-API: 21.97 s
Steady-state ratio: 8.55x
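For clarity, the steady-state numbers above are plain means over runs 2+, with run 1 treated as warm-up. A minimal sketch of that bookkeeping:

```python
# Steady-state summary: average all runs after the first (warm-up) one.
cli_times = [2.50, 2.73, 2.41]    # seconds, from the table above
itk_times = [22.08, 22.24, 21.70]

def steady_avg(times):
    """Mean over runs 2+, i.e. excluding the warm-up run."""
    return sum(times[1:]) / (len(times) - 1)

avg_cli = steady_avg(cli_times)   # 2.57
avg_itk = steady_avg(itk_times)   # 21.97
ratio = avg_itk / avg_cli         # ~8.55
```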

What I’ve already checked / ruled out

  • Thread count is identical (NumberOfThreads is explicitly patched in the ParameterMap before calling UpdateLargestPossibleRegion, confirmed via the parameter dump)
  • The parameter file is bit-for-bit identical between the two paths
  • The slowdown is not a warm-up / import cost — Run 1 and Run 3 are essentially the same
  • The ITK pipeline overhead (filter setup, New(), etc.) is negligible compared to 20 s
  • WriteResultImage is false in both paths, so resampling is not the culprit
  • Both paths produce a valid TransformParameters.0.txt

Questions

  1. Is there a known performance difference between the CLI and the Python API for BSpline registration specifically? The CLI likely uses elastix’s own thread pool management — does the Python API use a different ITK thread scheduler?
  2. Could there be a memory allocation or cache-thrashing issue when ITK is loaded inside a Python process vs. running standalone?
  3. Is there a way to configure the ITK thread pool (e.g. itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads()) that would close the gap?
  4. Any known difference in how the BSpline Jacobian accumulation is handled between the two paths?
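Regarding question 3, the kind of configuration I have in mind looks like the sketch below (hedged: it assumes itk-elastix is importable, and that the global ITK default is picked up by filters constructed afterwards; the `pick_threads` helper is mine, mirroring bench.py's "0 = auto" convention):

```python
# Hedged sketch: configure ITK's global thread pool before constructing
# the registration filter. The parameter map's NumberOfThreads entry is
# managed separately (bench.py patches it explicitly), so both may need
# to be set.
import os

def pick_threads(requested=0):
    # 0 means auto: fall back to the logical core count
    return requested if requested > 0 else (os.cpu_count() or 1)

n = pick_threads(14)

try:
    import itk
    # Global ITK default, applied to filters created after this call
    itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads(n)
except ImportError:
    pass  # itk not installed in this environment; nothing to configure
```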

Reproducible example

I’ve attached a zip containing bench.py, params_bspline.txt, and the two test images. You only need itk-elastix==0.21.0 (plus psutil for physical-core detection) and an elastix
5.2.0 binary in your PATH or passed via --elastix.

  pip install itk-elastix==0.21.0 psutil

  python bench.py \
      --elastix /path/to/elastix \
      --fixed  blobs.tif \
      --moving blobs-rot15deg.tif \
      --runs 3

Any pointers to what’s going on under the hood would be very much appreciated!


PS:

  • This post was drafted with Claude’s help. I’ve been trying to debug and find the reason for the discrepancy with Claude’s help for a while now, but I’ve hit a wall, so I’m asking the community for help.
  • I’ve been trying to replace my CLI integration with Appose and itk-elastix, but an 8x performance difference won’t cut it… The problem is not due to Appose, since I can reproduce it in a pure Python pixi env.
  • I looked through the ITK Discourse forum for an answer but did not find anything relevant.

Pinging @dzenanz, @Niels_Dekker

Thanks!

Here’s the zip file of the minimal self-contained benchmark:

benchmark-itk.zip (72.2 KB)

Thank you for your question, @NicoKiaru

elastix offers two implementations of the BSpline transform. The “RecursiveBSplineTransform” should be faster than the regular “BSplineTransform”:

(Transform "RecursiveBSplineTransform")

See the elastix Doxygen reference for elastix::RecursiveBSplineTransform< TElastix >.
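From the Python API, this can be tried by patching the in-memory parameter map, mirroring how bench.py already patches NumberOfThreads (a hedged sketch; it falls back to a plain dict when itk-elastix is not importable, just to illustrate the map structure):

```python
# Parameter-map values are lists of strings, so the transform can be
# swapped the same way bench.py swaps NumberOfThreads.
try:
    import itk
    param_obj = itk.ParameterObject.New()
    pm = param_obj.GetDefaultParameterMap("bspline")  # dict-like map
except ImportError:
    pm = {"Transform": ["BSplineTransform"]}  # stand-in for the real map

pm["Transform"] = ["RecursiveBSplineTransform"]
# Then register with it, e.g.: param_obj.AddParameterMap(pm)
#                              erm.SetParameterObject(param_obj)
```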

Would that possibly be of help to you?

Tomorrow I’ll try to have a closer look at your case!


Thanks a lot for the fast response! I quickly tested with RecursiveBSplineTransform and I get a small performance increase, but the ratio between the compiled version and itk-elastix is still ~10x:

(itk-elastix) C:\Users\chiarutt\Downloads\benchmark-itk\benchmark-itk>python bench.py --elastix "C:/Users/chiarutt/AppData/Local/abba-python-0.11.0/win/elastix-5.2.0-win64/elastix.exe" --fixed blobs-rot15deg.tif --moving blobs.tif
Python 3.11.15 | packaged by conda-forge | (main, Mar  5 2026, 16:36:00) [MSC v.1944 64 bit (AMD64)]
Threads: 28  (requested: 0)
Param file: C:\Users\chiarutt\Downloads\benchmark-itk\benchmark-itk\params_bspline.txt
Fixed:      blobs-rot15deg.tif
Moving:     blobs.tif

--- Run 1/3 (warm-up) ---
  [CLI]     2.14s  transform exists: True
  [itk-API] 21.73s  transform exists: True

--- Run 2/3 ---
  [CLI]     2.06s  transform exists: True
  [itk-API] 21.90s  transform exists: True

--- Run 3/3 ---
  [CLI]     2.12s  transform exists: True
  [itk-API] 21.99s  transform exists: True

Run            CLI (s)     itk-API (s)  ratio (itk/cli)
-------------------------------------------------------
 1               2.137          21.726          10.17x
 2               2.056          21.901          10.65x
 3               2.122          21.989          10.36x

Steady-state avg CLI     (runs 2+): 2.089s
Steady-state avg itk-API (runs 2+): 21.945s
Steady-state ratio (itk/cli): 10.50x

On another note, I did perform some affine registrations, and the times were comparable in that case.

Cheers,

Nicolas

I don’t remember ever using the command-line version of elastix (aside from under the hood in Slicer). But import itk followed by instantiation of a few classes takes multiple seconds in Python. That somehow fails to show up in your testing, which makes me think that you somehow pay this penalty on each invocation.

Thanks a lot @dzenanz for looking at it! You’re right, I’ll check by taking import itk out of the loop. I’ll test and retry.

But I don’t think that will solve the full picture: I also had a different way of invoking several runs with a much larger 2-step registration and the factor remains the same (you can even see a small warmup effect here, and the penalty for the first affine registration, probably linked to the import). If the import was the limiting factor, I would expect the difference to reduce with a longer registration, but I do not observe that:

╔═══════════════════╦═════════════╦═════════════╦═════════════╗                                                                                                                     
║                   ║   Affine    ║   BSpline   ║    Total    ║                                                                                                                     
╠═══════════════════╬═════════════╬═════════════╬═════════════╣                                                                                                                     
║ Appose run 1      ║   36970 ms  ║  211337 ms  ║  248307 ms  ║                                                                                                                   
║ Appose run 2      ║    6495 ms  ║  203000 ms  ║  209495 ms  ║                                                                                                                     
║ Appose run 3      ║    5759 ms  ║  201558 ms  ║  207317 ms  ║                                                                                                                     
╠═══════════════════╬═════════════╬═════════════╬═════════════╣                                                                                                                     
║ CLI    run 1      ║    6714 ms  ║   39521 ms  ║   46235 ms  ║                                                                                                                   
║ CLI    run 2      ║    6798 ms  ║   39982 ms  ║   46780 ms  ║                                                                                                                     
║ CLI    run 3      ║    6691 ms  ║   40067 ms  ║   46758 ms  ║                                                                                                                     
╚═══════════════════╩═════════════╩═════════════╩═════════════╝ 

But I will test with the bench I sent you and report back.

(Note: Appose runs itk-elastix)

EDIT:

I tried to move import, image reading and parameter settings upfront, I still get this:

Run            CLI (s)     itk-API (s)  ratio (itk/cli)
-------------------------------------------------------
*1               2.051          22.421          10.93x
 2               2.173          22.279          10.25x
 3               2.139          22.068          10.32x

The slightly modified bench.py is here:

"""
Minimal benchmark: elastix CLI vs itk-elastix Python API.

Context
-------
In a Java/Appose-based setup we observed a ~5x slowdown when running
itk-elastix inside a persistent Python subprocess (via Appose) compared
to calling the elastix CLI executable directly.  This script reproduces
the two execution paths in pure Python to isolate where the time goes.

Two methods are compared:

  CLI     -- subprocess.run(["elastix", "-f", ..., "-m", ..., "-p", ..., "-out", ...])
             Each call spawns a fresh elastix process, exactly like DefaultElastixTask.

  itk-API -- itk.ElastixRegistrationMethod[...].UpdateLargestPossibleRegion()
             Registration runs inside the current Python process, exactly like the
             script that Appose dispatches to its persistent worker process.

Usage
-----
    python bench.py \\
        --elastix /path/to/elastix \\
        --fixed   ../src/test/resources/blobs-rot15deg.tif \\
        --moving  ../src/test/resources/blobs.tif

Optional flags:
    --threads N   number of ITK/elastix threads (0 = auto-detect physical cores)
    --runs    N   total number of timed repetitions (default 3; run 1 is warm-up)
    --no-cli      skip the CLI measurements
    --no-itk      skip the itk-elastix measurements
"""

import argparse
import os
import shutil
import subprocess
import sys
import tempfile
import time
import itk

PARAM_FILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), "params_bspline.txt")


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def physical_cores():
    try:
        import psutil
        return psutil.cpu_count(logical=False) or os.cpu_count()
    except ImportError:
        return os.cpu_count()


def resolve_threads(n):
    return physical_cores() if n == 0 else n


# ---------------------------------------------------------------------------
# CLI backend
# ---------------------------------------------------------------------------

def run_cli(elastix_exe, fixed, moving, param_file, n_threads):
    out_dir = tempfile.mkdtemp(prefix="elastix_cli_")
    try:
        cmd = [
            elastix_exe,
            "-f",       fixed,
            "-m",       moving,
            "-p",       param_file,
            "-out",     out_dir,
            "-threads", str(n_threads),
        ]
        t0 = time.perf_counter()
        result = subprocess.run(cmd, capture_output=True, text=True)
        elapsed = time.perf_counter() - t0
        if result.returncode != 0:
            print("  [CLI] STDERR (last 600 chars):", result.stderr[-600:], file=sys.stderr)
            raise RuntimeError(f"elastix CLI failed (rc={result.returncode})")
        transform = os.path.join(out_dir, "TransformParameters.0.txt")
        ok = os.path.exists(transform)
        print(f"  [CLI]     {elapsed:.2f}s  transform exists: {ok}")
        return elapsed
    finally:
        shutil.rmtree(out_dir, ignore_errors=True)


# ---------------------------------------------------------------------------
# itk-elastix backend (in-process)
# ---------------------------------------------------------------------------

def run_itk(fixed_img, moving_img, param_obj, n_threads):
    out_dir = tempfile.mkdtemp(prefix="elastix_itk_")
    try:
        ImageType = type(fixed_img)
        erm = itk.ElastixRegistrationMethod[ImageType, ImageType].New()
        erm.SetFixedImage(fixed_img)
        erm.SetMovingImage(moving_img)
        erm.SetParameterObject(param_obj)
        erm.SetOutputDirectory(out_dir)
        erm.SetLogToConsole(False)
        erm.SetLogToFile(True)

        t0 = time.perf_counter()
        erm.UpdateLargestPossibleRegion()
        elapsed = time.perf_counter() - t0

        transform = os.path.join(out_dir, "TransformParameters.0.txt")
        ok = os.path.exists(transform)
        print(f"  [itk-API] {elapsed:.2f}s  transform exists: {ok}")
        return elapsed
    finally:
        shutil.rmtree(out_dir, ignore_errors=True)


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

def main():
    parser = argparse.ArgumentParser(
        description="Benchmark elastix CLI vs itk-elastix Python API"
    )
    parser.add_argument("--elastix", default=None,
                        help="Path to elastix CLI executable (required unless --no-cli)")
    parser.add_argument("--fixed",   required=True, help="Fixed image (TIFF/MHD/...)")
    parser.add_argument("--moving",  required=True, help="Moving image")
    parser.add_argument("--threads", type=int, default=0,
                        help="Number of threads (0 = physical cores, default)")
    parser.add_argument("--runs",    type=int, default=3,
                        help="Number of timed runs (first is warm-up)")
    parser.add_argument("--no-cli",  action="store_true", help="Skip CLI measurements")
    parser.add_argument("--no-itk",  action="store_true", help="Skip itk-elastix measurements")
    args = parser.parse_args()

    if not args.no_cli and args.elastix is None:
        parser.error("--elastix is required unless --no-cli is set")

    n_threads = resolve_threads(args.threads)

    print(f"Python {sys.version}")
    print(f"Threads: {n_threads}  (requested: {args.threads})")
    try:
        import itk_elastix
        print(f"itk-elastix version: {itk_elastix.__version__}")
    except Exception:
        pass
    print(f"Param file: {PARAM_FILE}")
    print(f"Fixed:      {args.fixed}")
    print(f"Moving:     {args.moving}")
    print()

    cli_times = []
    itk_times = []

    # Load images and parameters once, outside the timed loop
    fixed_img  = itk.imread(args.fixed,  itk.F)
    moving_img = itk.imread(args.moving, itk.F)

    param_obj = itk.ParameterObject.New()
    param_obj.ReadParameterFile(PARAM_FILE)

    pm = param_obj.GetParameterMap(0)
    pm["NumberOfThreads"] = [str(n_threads)]
    param_obj.SetParameterMap(0, pm)

    for run in range(args.runs):
        label = f"Run {run + 1}/{args.runs}" + (" (warm-up)" if run == 0 else "")
        print(f"--- {label} ---")

        if not args.no_cli:
            t = run_cli(args.elastix, args.fixed, args.moving, PARAM_FILE, n_threads)
            cli_times.append(t)

        if not args.no_itk:
            t = run_itk(fixed_img, moving_img, param_obj, n_threads)
            itk_times.append(t)

        print()

    # Summary table
    col = 14
    header = f"{'Run':<6}  {'CLI (s)':>{col}}  {'itk-API (s)':>{col}}  {'ratio (itk/cli)':>{col}}"
    print(header)
    print("-" * len(header))
    for i in range(args.runs):
        warm = "*" if i == 0 else " "
        c = f"{cli_times[i]:.3f}" if cli_times else "n/a"
        t = f"{itk_times[i]:.3f}" if itk_times else "n/a"
        if cli_times and itk_times:
            ratio = f"{itk_times[i] / cli_times[i]:.2f}x"
        else:
            ratio = "n/a"
        print(f"{warm}{i + 1:<5}  {c:>{col}}  {t:>{col}}  {ratio:>{col}}")

    if args.runs > 1:
        print()
        avg_c = avg_t = None
        if len(cli_times) > 1:
            avg_c = sum(cli_times[1:]) / (len(cli_times) - 1)
            print(f"Steady-state avg CLI     (runs 2+): {avg_c:.3f}s")
        if len(itk_times) > 1:
            avg_t = sum(itk_times[1:]) / (len(itk_times) - 1)
            print(f"Steady-state avg itk-API (runs 2+): {avg_t:.3f}s")
        # Only compute the ratio when both averages actually exist
        if avg_c is not None and avg_t is not None:
            print(f"Steady-state ratio (itk/cli): {avg_t / avg_c:.2f}x")


if __name__ == "__main__":
    main()

I may be doing something stupid, I just don’t know what. Maybe some parameters are ignored in the cli and not in itk-elastix or vice versa.
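One way to test that hypothesis is to compare the parameter settings each path actually reports in its output directory or log. A small helper that diffs two dumps in the usual "(Key value ...)" parameter-file syntax (a sketch; the file contents below are illustrative, not taken from the actual runs):

```python
# Diff two elastix parameter dumps to spot settings that one path
# silently overrides or ignores.
import re

def parse_params(text):
    """Return {key: value-string} for every '(Key value ...)' line."""
    params = {}
    for m in re.finditer(r'^\((\S+)\s+(.*)\)\s*$', text, re.MULTILINE):
        params[m.group(1)] = m.group(2)
    return params

def diff_params(a_text, b_text):
    """Return {key: (value_in_a, value_in_b)} for every differing key."""
    a, b = parse_params(a_text), parse_params(b_text)
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

cli_dump = '(NumberOfThreads 14)\n(Transform "BSplineTransform")'
api_dump = '(NumberOfThreads 28)\n(Transform "BSplineTransform")'
print(diff_params(cli_dump, api_dump))  # {'NumberOfThreads': ('14', '28')}
```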


Thank you for providing more detail. I believe that Niels is best positioned to look into this performance discrepancy. A self-contained reproducible example is of great help for any such analysis.


Small update: I’m able to run your benchmark, and I see large performance differences between CLI and the “itk-API” run, indeed. Do those performance differences also occur with itk-elastix==0.25.0 vs elastix 5.3.0? (Of course, I can also try that out myself, just wondering if you already did so!)


Sure, no worries.

So initially I tested the latest itk-elastix and saw slow computation vs the CLI 5.2.0 (already installed on my system). I thought maybe it was due to the version difference, so I used an older itk-elastix to make a fair comparison. But I did not test the latest itk-elastix against elastix 5.3.0 via the CLI. I can do that, though.

EDIT: I compared elastix 5.3.1 vs itk-elastix 0.25.1 (python 3.14) and the difference still holds with this benchmark:

(itk-elastix-latest) D:\code\benchmark-itk\benchmark-itk>python bench.py --elastix "C:/elastix-5.3.1-windows/elastix.exe" --fixed blobs-rot15deg.tif --moving  blobs.tif
Python 3.14.3 | packaged by conda-forge | (main, Feb  9 2026, 21:56:48) [MSC v.1944 64 bit (AMD64)]
Threads: 28  (requested: 0)
Param file: D:\code\benchmark-itk\benchmark-itk\params_bspline.txt
Fixed:      blobs-rot15deg.tif
Moving:     blobs.tif

--- Run 1/3 (warm-up) ---
  [CLI]     4.99s  transform exists: True
  [itk-API] 23.61s  transform exists: True

--- Run 2/3 ---
  [CLI]     2.05s  transform exists: True
  [itk-API] 22.31s  transform exists: True

--- Run 3/3 ---
  [CLI]     2.10s  transform exists: True
  [itk-API] 23.21s  transform exists: True

Run            CLI (s)     itk-API (s)  ratio (itk/cli)
-------------------------------------------------------
*1               4.986          23.614           4.74x
 2               2.048          22.310          10.89x
 3               2.098          23.213          11.06x

Steady-state avg CLI     (runs 2+): 2.073s
Steady-state avg itk-API (runs 2+): 22.761s
Steady-state ratio (itk/cli): 10.98x