~8x slower registration with itk-elastix Python API vs elastix CLI — minimal reproducible example

Disclaimer: this is a crosspost from the Image.sc Forum (Usage & Issues category), posted there under the same title.

Hi all,

I’m integrating itk-elastix as a drop-in replacement for the elastix CLI in a Java/Fiji plugin, and I’m hitting a consistent ~8.5x wall-clock slowdown when using the Python API compared to spawning the elastix executable directly. This is not a warm-up effect — the ratio holds across all runs.

Environment

  • Windows 11, Python 3.11 (conda-forge), itk-elastix==0.21.0 (pinned to match elastix 5.2.0)
  • 14 physical cores, same thread count passed to both paths
  • Test images: blobs.tif / blobs-rot15deg.tif (256×256, standard ImageJ sample images)
  • Transform: BSpline, 6 resolutions, 100 iterations/level, 4096 random samples, FinalGridSpacingInVoxels=20
  • Exactly the same parameter file is used for both paths
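For reference, the relevant entries of params_bspline.txt would look roughly like this (a sketch reconstructed from the settings listed above, not the exact attached file):

```
(Transform "BSplineTransform")
(NumberOfResolutions 6)
(MaximumNumberOfIterations 100)
(ImageSampler "RandomCoordinate")
(NumberOfSpatialSamples 4096)
(FinalGridSpacingInVoxels 20)
(WriteResultImage "false")
```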

The two execution paths

Path 1 — CLI (fast)

  subprocess.run([
      "elastix.exe",
      "-f", fixed, "-m", moving,
      "-p", param_file,      # path to params_bspline.txt
      "-out", out_dir,
      "-threads", "14"
  ])

Path 2 — itk-elastix Python API (slow)

  erm = itk.ElastixRegistrationMethod[ImageType, ImageType].New()
  erm.SetFixedImage(fixed_img)
  erm.SetMovingImage(moving_img)
  erm.SetParameterObject(param_obj)   # same param file, NumberOfThreads patched to 14
  erm.SetOutputDirectory(out_dir)
  erm.SetLogToConsole(False)
  erm.SetLogToFile(True)
  erm.UpdateLargestPossibleRegion()

Results (3 runs, no warm-up effect observed)

  Run          CLI (s)     itk-API (s)   ratio
  ---------------------------------------------
  1             2.50          22.08       8.82x
  2             2.73          22.24       8.16x
  3             2.41          21.70       8.99x

Steady-state avg CLI: 2.57 s
Steady-state avg itk-API: 21.97 s
Steady-state ratio: 8.55x
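For clarity, the steady-state numbers above are plain means over runs 2+, with run 1 treated as warm-up. A minimal sketch of that bookkeeping:

```python
# Steady-state summary: average all runs after the first (warm-up) one.
cli_times = [2.50, 2.73, 2.41]    # seconds, from the table above
itk_times = [22.08, 22.24, 21.70]

def steady_avg(times):
    """Mean over runs 2+, i.e. excluding the warm-up run."""
    return sum(times[1:]) / (len(times) - 1)

avg_cli = steady_avg(cli_times)   # 2.57
avg_itk = steady_avg(itk_times)   # 21.97
ratio = avg_itk / avg_cli         # ~8.55
```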

What I’ve already checked / ruled out

  • Thread count is identical (NumberOfThreads is explicitly patched in the ParameterMap before calling UpdateLargestPossibleRegion, confirmed via the parameter dump)
  • The parameter file is bit-for-bit identical between the two paths
  • The slowdown is not a warm-up / import cost — Run 1 and Run 3 are essentially the same
  • The ITK pipeline overhead (filter setup, New(), etc.) is negligible compared to 20 s
  • WriteResultImage is false in both paths, so resampling is not the culprit
  • Both paths produce a valid TransformParameters.0.txt

Questions

  1. Is there a known performance difference between the CLI and the Python API for BSpline registration specifically? The CLI likely uses elastix’s own thread pool management — does the Python API use a different ITK thread scheduler?
  2. Could there be a memory allocation or cache-thrashing issue when ITK is loaded inside a Python process vs. running standalone?
  3. Is there a way to configure the ITK thread pool (e.g. itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads()) that would close the gap?
  4. Any known difference in how the BSpline Jacobian accumulation is handled between the two paths?
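Regarding question 3, the kind of configuration I have in mind looks like the sketch below (hedged: it assumes itk-elastix is importable, and that the global ITK default is picked up by filters constructed afterwards; the `pick_threads` helper is mine, mirroring bench.py's "0 = auto" convention):

```python
# Hedged sketch: configure ITK's global thread pool before constructing
# the registration filter. The parameter map's NumberOfThreads entry is
# managed separately (bench.py patches it explicitly), so both may need
# to be set.
import os

def pick_threads(requested=0):
    # 0 means auto: fall back to the logical core count
    return requested if requested > 0 else (os.cpu_count() or 1)

n = pick_threads(14)

try:
    import itk
    # Global ITK default, applied to filters created after this call
    itk.MultiThreaderBase.SetGlobalDefaultNumberOfThreads(n)
except ImportError:
    pass  # itk not installed in this environment; nothing to configure
```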

Reproducible example

I’ve attached a zip containing bench.py, params_bspline.txt, and the two test images. You only need itk-elastix==0.21.0 (plus psutil for physical-core detection) and an elastix
5.2.0 binary in your PATH or passed via --elastix.

  pip install itk-elastix==0.21.0 psutil

  python bench.py \
      --elastix /path/to/elastix \
      --fixed  blobs.tif \
      --moving blobs-rot15deg.tif \
      --runs 3

Any pointers to what’s going on under the hood would be very much appreciated!


PS:

  • This post was drafted with Claude’s help. I’ve been trying to debug and find the reason for the discrepancy with Claude’s help for a while now, but I’ve hit a wall, so I’m asking the community for help.
  • I’ve been trying to replace my CLI integration with Appose and itk-elastix, but an 8x performance difference won’t cut it… The problem is not due to Appose, since I can reproduce it in a pure Python pixi env.
  • I looked through the ITK Discourse forum for an answer but did not find anything relevant.

Pinging @dzenanz, @Niels_Dekker

Thanks!

Here’s the zip file of the minimal self-contained benchmark:

benchmark-itk.zip (72.2 KB)

Thank you for your question, @NicoKiaru

elastix offers two implementations of the BSpline transform. The “RecursiveBSplineTransform” should be faster than the regular “BSplineTransform”:

(Transform "RecursiveBSplineTransform")

See the elastix Doxygen reference for elastix::RecursiveBSplineTransform< TElastix >.
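From the Python API, this can be tried by patching the in-memory parameter map, mirroring how bench.py already patches NumberOfThreads (a hedged sketch; it falls back to a plain dict when itk-elastix is not importable, just to illustrate the map structure):

```python
# Parameter-map values are lists of strings, so the transform can be
# swapped the same way bench.py swaps NumberOfThreads.
try:
    import itk
    param_obj = itk.ParameterObject.New()
    pm = param_obj.GetDefaultParameterMap("bspline")  # dict-like map
except ImportError:
    pm = {"Transform": ["BSplineTransform"]}  # stand-in for the real map

pm["Transform"] = ["RecursiveBSplineTransform"]
# Then register with it, e.g.: param_obj.AddParameterMap(pm)
#                              erm.SetParameterObject(param_obj)
```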

Would that possibly be of help to you?

Tomorrow I’ll try to have a closer look at your case!


Thanks a lot for the fast response! I quickly tested with RecursiveBSplineTransform and I get a small performance increase, but the ratio between the compiled version and itk-elastix is still ~10x:

(itk-elastix) C:\Users\chiarutt\Downloads\benchmark-itk\benchmark-itk>python bench.py --elastix "C:/Users/chiarutt/AppData/Local/abba-python-0.11.0/win/elastix-5.2.0-win64/elastix.exe" --fixed blobs-rot15deg.tif --moving blobs.tif
Python 3.11.15 | packaged by conda-forge | (main, Mar  5 2026, 16:36:00) [MSC v.1944 64 bit (AMD64)]
Threads: 28  (requested: 0)
Param file: C:\Users\chiarutt\Downloads\benchmark-itk\benchmark-itk\params_bspline.txt
Fixed:      blobs-rot15deg.tif
Moving:     blobs.tif

--- Run 1/3 (warm-up) ---
  [CLI]     2.14s  transform exists: True
  [itk-API] 21.73s  transform exists: True

--- Run 2/3 ---
  [CLI]     2.06s  transform exists: True
  [itk-API] 21.90s  transform exists: True

--- Run 3/3 ---
  [CLI]     2.12s  transform exists: True
  [itk-API] 21.99s  transform exists: True

Run            CLI (s)     itk-API (s)  ratio (itk/cli)
-------------------------------------------------------
 1               2.137          21.726          10.17x
 2               2.056          21.901          10.65x
 3               2.122          21.989          10.36x

Steady-state avg CLI     (runs 2+): 2.089s
Steady-state avg itk-API (runs 2+): 21.945s
Steady-state ratio (itk/cli): 10.50x

On another note, I did perform some affine registrations, and the times were comparable in that case.

Cheers,

Nicolas

I don’t remember ever using the command-line version of elastix (aside from under the hood in Slicer). But import itk followed by instantiation of a few classes takes multiple seconds in Python. That somehow fails to show up in your testing, which makes me think that you somehow pay this penalty on each invocation.

Thanks a lot @dzenanz for looking at it! You’re right, I’ll check by taking import itk out of the loop. I’ll test and retry.

But I don’t think that will solve the full picture: I also had a different way of invoking several runs with a much larger 2-step registration and the factor remains the same (you can even see a small warmup effect here, and the penalty for the first affine registration, probably linked to the import). If the import was the limiting factor, I would expect the difference to reduce with a longer registration, but I do not observe that:

╔═══════════════════╦═════════════╦═════════════╦═════════════╗                                                                                                                     
║                   ║   Affine    ║   BSpline   ║    Total    ║                                                                                                                     
╠═══════════════════╬═════════════╬═════════════╬═════════════╣                                                                                                                     
║ Appose run 1      ║   36970 ms  ║  211337 ms  ║  248307 ms  ║                                                                                                                   
║ Appose run 2      ║    6495 ms  ║  203000 ms  ║  209495 ms  ║                                                                                                                     
║ Appose run 3      ║    5759 ms  ║  201558 ms  ║  207317 ms  ║                                                                                                                     
╠═══════════════════╬═════════════╬═════════════╬═════════════╣                                                                                                                     
║ CLI    run 1      ║    6714 ms  ║   39521 ms  ║   46235 ms  ║                                                                                                                   
║ CLI    run 2      ║    6798 ms  ║   39982 ms  ║   46780 ms  ║                                                                                                                     
║ CLI    run 3      ║    6691 ms  ║   40067 ms  ║   46758 ms  ║                                                                                                                     
╚═══════════════════╩═════════════╩═════════════╩═════════════╝ 

But I will test with the bench I sent you and report back.

(Note: Appose runs itk-elastix)

EDIT:

I tried to move import, image reading and parameter settings upfront, I still get this:

Run            CLI (s)     itk-API (s)  ratio (itk/cli)
-------------------------------------------------------
*1               2.051          22.421          10.93x
 2               2.173          22.279          10.25x
 3               2.139          22.068          10.32x

The slightly modified bench.py is here:

"""
Minimal benchmark: elastix CLI vs itk-elastix Python API.

Context
-------
In a Java/Appose-based setup we observed a ~5x slowdown when running
itk-elastix inside a persistent Python subprocess (via Appose) compared
to calling the elastix CLI executable directly.  This script reproduces
the two execution paths in pure Python to isolate where the time goes.

Two methods are compared:

  CLI     -- subprocess.run(["elastix", "-f", ..., "-m", ..., "-p", ..., "-out", ...])
             Each call spawns a fresh elastix process, exactly like DefaultElastixTask.

  itk-API -- itk.ElastixRegistrationMethod[...].UpdateLargestPossibleRegion()
             Registration runs inside the current Python process, exactly like the
             script that Appose dispatches to its persistent worker process.

Usage
-----
    python bench.py \\
        --elastix /path/to/elastix \\
        --fixed   ../src/test/resources/blobs-rot15deg.tif \\
        --moving  ../src/test/resources/blobs.tif

Optional flags:
    --threads N   number of ITK/elastix threads (0 = auto-detect physical cores)
    --runs    N   total number of timed repetitions (default 3; run 1 is warm-up)
    --no-cli      skip the CLI measurements
    --no-itk      skip the itk-elastix measurements
"""

import argparse
import os
import shutil
import subprocess
import sys
import tempfile
import time
import itk

PARAM_FILE = os.path.join(os.path.dirname(os.path.abspath(__file__)), "params_bspline.txt")


# ---------------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------------

def physical_cores():
    try:
        import psutil
        return psutil.cpu_count(logical=False) or os.cpu_count()
    except ImportError:
        return os.cpu_count()


def resolve_threads(n):
    return physical_cores() if n == 0 else n


# ---------------------------------------------------------------------------
# CLI backend
# ---------------------------------------------------------------------------

def run_cli(elastix_exe, fixed, moving, param_file, n_threads):
    out_dir = tempfile.mkdtemp(prefix="elastix_cli_")
    try:
        cmd = [
            elastix_exe,
            "-f",       fixed,
            "-m",       moving,
            "-p",       param_file,
            "-out",     out_dir,
            "-threads", str(n_threads),
        ]
        t0 = time.perf_counter()
        result = subprocess.run(cmd, capture_output=True, text=True)
        elapsed = time.perf_counter() - t0
        if result.returncode != 0:
            print("  [CLI] STDERR (last 600 chars):", result.stderr[-600:], file=sys.stderr)
            raise RuntimeError(f"elastix CLI failed (rc={result.returncode})")
        transform = os.path.join(out_dir, "TransformParameters.0.txt")
        ok = os.path.exists(transform)
        print(f"  [CLI]     {elapsed:.2f}s  transform exists: {ok}")
        return elapsed
    finally:
        shutil.rmtree(out_dir, ignore_errors=True)


# ---------------------------------------------------------------------------
# itk-elastix backend (in-process)
# ---------------------------------------------------------------------------

def run_itk(fixed_img, moving_img, param_obj, n_threads):
    out_dir = tempfile.mkdtemp(prefix="elastix_itk_")
    try:
        ImageType = type(fixed_img)
        erm = itk.ElastixRegistrationMethod[ImageType, ImageType].New()
        erm.SetFixedImage(fixed_img)
        erm.SetMovingImage(moving_img)
        erm.SetParameterObject(param_obj)
        erm.SetOutputDirectory(out_dir)
        erm.SetLogToConsole(False)
        erm.SetLogToFile(True)

        t0 = time.perf_counter()
        erm.UpdateLargestPossibleRegion()
        elapsed = time.perf_counter() - t0

        transform = os.path.join(out_dir, "TransformParameters.0.txt")
        ok = os.path.exists(transform)
        print(f"  [itk-API] {elapsed:.2f}s  transform exists: {ok}")
        return elapsed
    finally:
        shutil.rmtree(out_dir, ignore_errors=True)


# ---------------------------------------------------------------------------
# Main
# ---------------------------------------------------------------------------

def main():
    parser = argparse.ArgumentParser(
        description="Benchmark elastix CLI vs itk-elastix Python API"
    )
    parser.add_argument("--elastix", default=None,
                        help="Path to elastix CLI executable (required unless --no-cli)")
    parser.add_argument("--fixed",   required=True, help="Fixed image (TIFF/MHD/...)")
    parser.add_argument("--moving",  required=True, help="Moving image")
    parser.add_argument("--threads", type=int, default=0,
                        help="Number of threads (0 = physical cores, default)")
    parser.add_argument("--runs",    type=int, default=3,
                        help="Number of timed runs (first is warm-up)")
    parser.add_argument("--no-cli",  action="store_true", help="Skip CLI measurements")
    parser.add_argument("--no-itk",  action="store_true", help="Skip itk-elastix measurements")
    args = parser.parse_args()

    if not args.no_cli and args.elastix is None:
        parser.error("--elastix is required unless --no-cli is set")

    n_threads = resolve_threads(args.threads)

    print(f"Python {sys.version}")
    print(f"Threads: {n_threads}  (requested: {args.threads})")
    try:
        import itk_elastix
        print(f"itk-elastix version: {itk_elastix.__version__}")
    except Exception:
        pass
    print(f"Param file: {PARAM_FILE}")
    print(f"Fixed:      {args.fixed}")
    print(f"Moving:     {args.moving}")
    print()

    cli_times = []
    itk_times = []

    # Load images and parameters once, outside the timed loop
    fixed_img  = itk.imread(args.fixed,  itk.F)
    moving_img = itk.imread(args.moving, itk.F)

    param_obj = itk.ParameterObject.New()
    param_obj.ReadParameterFile(PARAM_FILE)

    pm = param_obj.GetParameterMap(0)
    pm["NumberOfThreads"] = [str(n_threads)]
    param_obj.SetParameterMap(0, pm)

    for run in range(args.runs):
        label = f"Run {run + 1}/{args.runs}" + (" (warm-up)" if run == 0 else "")
        print(f"--- {label} ---")

        if not args.no_cli:
            t = run_cli(args.elastix, args.fixed, args.moving, PARAM_FILE, n_threads)
            cli_times.append(t)

        if not args.no_itk:
            t = run_itk(fixed_img, moving_img, param_obj, n_threads)
            itk_times.append(t)

        print()

    # Summary table
    col = 14
    header = f"{'Run':<6}  {'CLI (s)':>{col}}  {'itk-API (s)':>{col}}  {'ratio (itk/cli)':>{col}}"
    print(header)
    print("-" * len(header))
    for i in range(args.runs):
        warm = "*" if i == 0 else " "
        c = f"{cli_times[i]:.3f}" if cli_times else "n/a"
        t = f"{itk_times[i]:.3f}" if itk_times else "n/a"
        if cli_times and itk_times:
            ratio = f"{itk_times[i] / cli_times[i]:.2f}x"
        else:
            ratio = "n/a"
        print(f"{warm}{i + 1:<5}  {c:>{col}}  {t:>{col}}  {ratio:>{col}}")

    if args.runs > 1:
        print()
        avg_c = avg_t = None
        if len(cli_times) > 1:
            avg_c = sum(cli_times[1:]) / (len(cli_times) - 1)
            print(f"Steady-state avg CLI     (runs 2+): {avg_c:.3f}s")
        if len(itk_times) > 1:
            avg_t = sum(itk_times[1:]) / (len(itk_times) - 1)
            print(f"Steady-state avg itk-API (runs 2+): {avg_t:.3f}s")
        # Only compute the ratio when both averages actually exist
        if avg_c is not None and avg_t is not None:
            print(f"Steady-state ratio (itk/cli): {avg_t / avg_c:.2f}x")


if __name__ == "__main__":
    main()

I may be doing something stupid, I just don’t know what. Maybe some parameters are ignored in the cli and not in itk-elastix or vice versa.
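One way to test that hypothesis is to compare the parameter settings each path actually reports in its output directory or log. A small helper that diffs two dumps in the usual "(Key value ...)" parameter-file syntax (a sketch; the file contents below are illustrative, not taken from the actual runs):

```python
# Diff two elastix parameter dumps to spot settings that one path
# silently overrides or ignores.
import re

def parse_params(text):
    """Return {key: value-string} for every '(Key value ...)' line."""
    params = {}
    for m in re.finditer(r'^\((\S+)\s+(.*)\)\s*$', text, re.MULTILINE):
        params[m.group(1)] = m.group(2)
    return params

def diff_params(a_text, b_text):
    """Return {key: (value_in_a, value_in_b)} for every differing key."""
    a, b = parse_params(a_text), parse_params(b_text)
    keys = sorted(set(a) | set(b))
    return {k: (a.get(k), b.get(k)) for k in keys if a.get(k) != b.get(k)}

cli_dump = '(NumberOfThreads 14)\n(Transform "BSplineTransform")'
api_dump = '(NumberOfThreads 28)\n(Transform "BSplineTransform")'
print(diff_params(cli_dump, api_dump))  # {'NumberOfThreads': ('14', '28')}
```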


Thank you for providing more detail. I believe that Niels is best positioned to look into this performance discrepancy. A self-contained reproducible example is of great help for any such analysis.


Small update: I’m able to run your benchmark, and I see large performance differences between CLI and the “itk-API” run, indeed. Do those performance differences also occur with itk-elastix==0.25.0 vs elastix 5.3.0? (Of course, I can also try that out myself, just wondering if you already did so!)


Sure, no worries.

So initially I tested the latest itk-elastix and saw slow computation vs the CLI 5.2.0 (already installed on my system). I thought maybe it was due to the version difference, so I used an older itk-elastix to make a fair comparison. But I did not test the latest itk-elastix against elastix 5.3.0 via the CLI. I can do that, though.

EDIT: I compared elastix 5.3.1 vs itk-elastix 0.25.1 (python 3.14) and the difference still holds with this benchmark:

(itk-elastix-latest) D:\code\benchmark-itk\benchmark-itk>python bench.py --elastix "C:/elastix-5.3.1-windows/elastix.exe" --fixed blobs-rot15deg.tif --moving  blobs.tif
Python 3.14.3 | packaged by conda-forge | (main, Feb  9 2026, 21:56:48) [MSC v.1944 64 bit (AMD64)]
Threads: 28  (requested: 0)
Param file: D:\code\benchmark-itk\benchmark-itk\params_bspline.txt
Fixed:      blobs-rot15deg.tif
Moving:     blobs.tif

--- Run 1/3 (warm-up) ---
  [CLI]     4.99s  transform exists: True
  [itk-API] 23.61s  transform exists: True

--- Run 2/3 ---
  [CLI]     2.05s  transform exists: True
  [itk-API] 22.31s  transform exists: True

--- Run 3/3 ---
  [CLI]     2.10s  transform exists: True
  [itk-API] 23.21s  transform exists: True

Run            CLI (s)     itk-API (s)  ratio (itk/cli)
-------------------------------------------------------
*1               4.986          23.614           4.74x
 2               2.048          22.310          10.89x
 3               2.098          23.213          11.06x

Steady-state avg CLI     (runs 2+): 2.073s
Steady-state avg itk-API (runs 2+): 22.761s
Steady-state ratio (itk/cli): 10.98x