Image Registration with Pseudo-Random Sampling throws exception randomly (maybe bug)

Hi all,

It will take a lot of words to describe the issue, but I think it might be a bug.
Thank you for taking the time to read this.

Let me first specify the functions I used.
I’m doing image registration: translation first, then B-Spline.
Before I used sampling, everything worked fine.
To speed up the process, I use stochastic gradient descent
by adding the following lines

    registration_->SetMetricSamplingPercentage(0.1);
    registration_->SetMetricSamplingStrategy(RegistrationType::RANDOM);
    registration_->MetricSamplingReinitializeSeed(1); // with no argument, the seed is random (taken from the clock)
    registration_->InPlaceOn();

between SetInitialTransform(transform_); and Update();

If there’s any problem with the code, please let me know.
If not, here’s the long story.

What I’m working on is pairwise registration of all image pairs in one video (~30000 frames).
I would like the results to be reproducible, so I use “MetricSamplingReinitializeSeed(1)”.
After testing, the result is reproducible, so I believe the code is correct.
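
To illustrate why a fixed seed gives repeatable sampling, here is a minimal standalone sketch in plain C++. It does not use ITK, and `samplePixelIndices` is a hypothetical helper, not the way ITK actually selects its samples. It only shows that a generator seeded with the same value picks the same subset on every call.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not an ITK function): keep each of `pixelCount`
// pixel indices with probability `fraction`, using a generator that is
// freshly seeded with `seed` on every call.
std::vector<std::size_t> samplePixelIndices(std::size_t pixelCount,
                                            double fraction,
                                            unsigned int seed)
{
  std::mt19937 generator(seed);               // same seed, same random stream
  std::bernoulli_distribution keep(fraction); // true with probability `fraction`
  std::vector<std::size_t> chosen;
  for (std::size_t index = 0; index < pixelCount; ++index)
  {
    if (keep(generator))
    {
      chosen.push_back(index);
    }
  }
  return chosen;
}
```

Calling `samplePixelIndices(1000, 0.1, 1)` twice returns identical index lists, which is the kind of repeatability the fixed seed is meant to provide.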

However, the strange thing is that when I process the same video 8 times in a single execution of the program,
it always throws exceptions in the 4th and 7th trials, but at different frame indices. (This “semi-reproducible” behavior really confused me.)
By tracing the dump file and reading the exception messages, I found 2 lines of code where the exceptions occurred.

  1. itkObjectToObjectMetric.hxx, line 425, which means ObjectToObjectMetric::m_VirtualImage is null.
    If I understand correctly, this line (or this function) is for dense sampling, so in principle it should not be called.
  2. itkImageToImageMetricv4.hxx, line 532, which means ImageToImageMetricv4::m_FixedSampledPointSet is null.
    This function is part of the normal call path, so m_FixedSampledPointSet should not be null.

Since the result is reproducible most of the time, the only cause I know of for these kinds of exceptions (wrong function call, pointer becoming null) is that some part of the code writes where it should not (such as past the end of an array).

Things are more complicated because I use 44 threads. Each thread runs its own ITK registration (for example, thread 0 processes frames 1~800, thread 1 processes frames 801~1600, and so on).
Inside each thread, I have set

    itk::MultiThreader::SetGlobalMaximumNumberOfThreads(1);

so there should not be any multi-threading inside the ITK registration object.
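
The frame split described above could look roughly like the following sketch. It is plain C++ with no ITK calls; `splitFrames`, `runWorkers`, and the `processPair` callback are hypothetical names standing in for the real per-pair registration, and the actual program may divide the work differently.

```cpp
#include <cstddef>
#include <functional>
#include <thread>
#include <utility>
#include <vector>

// Split frames 1..frameCount into contiguous, nearly equal, inclusive
// ranges, one per worker. Workers that would get no frames are omitted.
std::vector<std::pair<std::size_t, std::size_t>>
splitFrames(std::size_t frameCount, std::size_t workerCount)
{
  std::vector<std::pair<std::size_t, std::size_t>> ranges;
  if (workerCount == 0)
  {
    return ranges;
  }
  const std::size_t base = frameCount / workerCount;
  const std::size_t extra = frameCount % workerCount;
  std::size_t first = 1;
  for (std::size_t worker = 0; worker < workerCount; ++worker)
  {
    // The first `extra` workers take one additional frame each.
    const std::size_t length = base + (worker < extra ? 1 : 0);
    if (length == 0)
    {
      break;
    }
    ranges.emplace_back(first, first + length - 1);
    first += length;
  }
  return ranges;
}

// Start one thread per range and call `processPair(frame)` for each frame.
// `processPair` is a placeholder for "build fresh registration objects and
// register this frame pair".
void runWorkers(std::size_t frameCount, std::size_t workerCount,
                const std::function<void(std::size_t)>& processPair)
{
  std::vector<std::thread> workers;
  for (const auto& range : splitFrames(frameCount, workerCount))
  {
    workers.emplace_back([range, &processPair] {
      for (std::size_t frame = range.first; frame <= range.second; ++frame)
      {
        processPair(frame);
      }
    });
  }
  for (auto& worker : workers)
  {
    worker.join();
  }
}
```

With 2400 frames and 3 workers, `splitFrames` yields the ranges 1~800, 801~1600, and 1601~2400, matching the numbering used in the example above.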

Since everything was fine before I used the sampling method,
either I am using the functions incorrectly or there is a bug in them
(or my other code has a bug that had no effect before I used sampling).

I’m also checking the source code and will report if I find any clues.

Which version of ITK are you using, 4.13.1 or some recent master? @blowekamp made some changes to the virtual image part of the registration framework, so he might know more. See the discussion.

Since you are dividing your overall video by frames, it is possible that a crash can occur in each segment of frames, and which one is encountered first depends on the operating system’s scheduling of threads. For example, if the crash is on index 5, it means thread 0 reached its crashing point first. If the crash is on index 1605, it means thread 2 reached its crashing point first.
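
The index-to-thread arithmetic in this example can be written out as a small sketch. It assumes 1-based frame indices and a fixed block of 800 consecutive frames per thread, as in the original post; `workerForFrame` is a hypothetical helper, not part of the poster's program.

```cpp
#include <cstddef>

// Hypothetical helper: which worker owns a 1-based frame index when every
// worker is given `framesPerWorker` consecutive frames (worker 0 gets
// frames 1..framesPerWorker, worker 1 the next block, and so on).
// Preconditions: frame >= 1 and framesPerWorker > 0.
std::size_t workerForFrame(std::size_t frame, std::size_t framesPerWorker)
{
  return (frame - 1) / framesPerWorker;
}
```

With 800 frames per thread, frame 5 maps to thread 0 and frame 1605 maps to thread 2.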

Thanks @dzenanz!
I’m using 4.12.2. I have tried to update to 4.13.1 but it didn’t solve the problem.
I’ll look into the discussion you mentioned.

About your comment on threads, I don’t think that is the case.
Here are some more details.
I handle the exception so the code won’t crash and will continue with the registration of the next pair,
which means I can (and do) record the frames in which the exception occurred.
They differed on each try (roughly 5~10 exceptions in one video, but again, only in the 4th and 7th trials).
The exceptions may or may not be triggered in the same thread.
Even when they are in the same thread, the frame index differs.
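
A minimal sketch of this kind of bookkeeping, assuming the registration call is wrapped in a callback: catch the exception, record the thread and frame, and continue with the next pair. `FailureLog`, `FailureRecord`, and `registerPairSafely` are hypothetical names and not the poster's actual code. Catching `std::exception` also covers ITK exceptions, since `itk::ExceptionObject` derives from it.

```cpp
#include <cstddef>
#include <exception>
#include <functional>
#include <mutex>
#include <string>
#include <vector>

// One recorded failure: which worker, which frame, and the message.
struct FailureRecord
{
  std::size_t worker;
  std::size_t frame;
  std::string message;
};

// Thread-safe list of failures shared by all worker threads.
class FailureLog
{
public:
  void add(std::size_t worker, std::size_t frame, const std::string& message)
  {
    std::lock_guard<std::mutex> lock(mutex_);
    records_.push_back({worker, frame, message});
  }

  std::vector<FailureRecord> snapshot() const
  {
    std::lock_guard<std::mutex> lock(mutex_);
    return records_;
  }

private:
  mutable std::mutex mutex_;
  std::vector<FailureRecord> records_;
};

// Run one registration. On an exception, record it and return false so the
// caller can move on to the next pair instead of crashing.
bool registerPairSafely(std::size_t worker, std::size_t frame,
                        const std::function<void(std::size_t)>& registerPair,
                        FailureLog& log)
{
  try
  {
    registerPair(frame);
    return true;
  }
  catch (const std::exception& error)
  {
    log.add(worker, frame, error.what());
    return false;
  }
}
```

Comparing the recorded worker and frame values across runs makes it easy to see whether the failures cluster in one thread or move around.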

Some work has been done to improve reproducibility in ITK, so it makes sense to try ITK 5.0 beta 3, which is the most recent pre-release. If the problem is still not deterministic, that would be a bug in either the library or your code. If it is deterministic, it will allow you to set conditional breakpoints based on frame index etc., allowing easier debugging.

I have not looked deeply into this, but it seems like a complicated registration system is needed to trigger this issue. I have a couple of thoughts:

  • Are you doing all the pairwise registrations in one process, or do you create a new process for each registration?
  • Are you reusing any registration objects between registrations? I suggest creating all new ones for each registration.
  • Creating a minimal, shareable, and reproducible example will enable others to track down the bug.

Thanks @dzenanz, @blowekamp!

dzenanz,
I did try updating to ITK 5.0 beta 3. The exception is still present (in trial 6 instead of trials 4 & 7, though).

blowekamp,
Answering your questions

  • I’m doing all the pairwise registrations in one process.
  • I create new registration objects for each image pair.
  • I will try.

Hi,

I’ve finally come up with a minimal, shareable, and reproducible example.
I put it on GitHub.


Please let me know if there’s any problem or you need any other information


Great! Thanks for taking the time to create that!

It says that you are using a system with 44 cores. Any idea if it is reproducible with fewer?

Have you tried or been able to reproduce it on Linux/Mac?

I don’t think I can get a Windows system with that many cores, so I am looking for alternative ways to reproduce it.


I haven’t tried it on Linux/Mac.
I have an 8-core Windows machine.
I will test on that one.
(By the way, I know a single-threaded run finishes without exceptions.)

Is there anything else I can do?

It’s reproducible on my Windows 10 PC with 8 cores (using 8 threads),
but the occurrence rate might be lower.
In one test, the exceptions occurred at 52xxx and 79xxx.


This is great info! I think with the information you provided anyone can try to tackle this problem!

I am not sure when I’ll have time and the proper system, but this issue is very interesting to me.


Thanks! What about the code itself? Am I using the ITK library properly?

Hello, is there anyone that can reproduce the exception? Many thanks!

Hello,

I was unable to reproduce your problem. I used a 36-core Windows system along with a fresh Visual Studio 15 2017 installation. I also used the current ITKv5 master. I even tried oversubscribing the system with 72 threads.

I did make a couple of changes to your test case and created a PR here: https://github.com/qmokid/PairwiseStitching/pull/1

I’ll keep running the test a bit longer to see if anything happens.


Hi,

Thanks for your kind help!
May I ask which version of master you are using?
I just tried creating a new project with your CMake file and a freshly built ITKv5 master (SHA-1: b48ceb1463eb01999a85f9de4ee7607d75786a11), but it keeps emitting exceptions:

Description: itk::ERROR: CorrelationImageToImageMetricv4(000002260D5A2390): FixedSampledPointSet must have 1 or more points.

Using ITK v4.12 (which I used originally) works fine. I’m running it to see whether it throws the exception or not. But I would like to use ITKv5 since it works for you.

Let us get down to the specifics here.

ITK version: 82dfa9ee752cdc240c46ded10242591dba97b21a
MSVC: 19.10.25027.0

I compiled ITK for x64 with:

    cmake -A x64 -G "Visual Studio 15 2017" ~/ITK
    cmake --build . --config Release

Then the pairwise project with:

    cmake -A x64 -G "Visual Studio 15 2017" -DITK_DIR:PATH=/d/ITK-bld/ ~/PairwiseStitching/
    cmake --build . --config Release

Please make very sure you have a clean environment without any installed ITK or ITK shared libraries around. How have you been configuring ITK?


I used the CMake GUI to configure, and there are 4 ITK versions on my computer.
I tried your configuration and I can run with ITK 5.0 master.
However, it still emits an exception at trial 52205.
I’m working on setting everything up in a Win10 VirtualBox so that everything is clean.
I will report the result.


Would these ITK versions be configured with shared libraries? This can sometimes lead to the runtime using ITK libraries from different builds, producing “undefined” results.


They are configured as static libraries.

I have tested in a Win10 VirtualBox, building everything from scratch.
It still emits an exception at trial 55205.
There were some steps that I had to try a couple of times to make them work, so maybe the system is not that clean.

I will try again and write down the steps, making them as simple as possible.

Testing on a Win10 VirtualBox

Test steps:

  1. Set up VirtualBox
  2. Install programs
  • CMake 3.14.0, x64 for Windows (select the option to add CMake to the system PATH for all users)
  • Git for Windows 2.21.0, x64 (default settings)
  • Visual Studio 2017 15.9.8 (built into the Win10 VM)
  • MSVC 19.16.27027.1 (launch “Visual Studio Installer”, under “Visual Studio Community 2017” select “more → modify”, select “Desktop development with C++”, press “Modify”)
  3. Prepare code
  • create C:\ITK and C:\PairwiseStitching
  • cd to C:\ITK in the command line

    git clone https://github.com/InsightSoftwareConsortium/ITK.git .
    git reset --hard 82dfa9ee752cdc240c46ded10242591dba97b21a

  • cd to C:\PairwiseStitching

    git clone https://github.com/qmokid/PairwiseStitching .

  4. Build
  • create a new folder C:\ITK-bld, cd to it in the command line

    cmake -A x64 -G "Visual Studio 15 2017" ..\ITK
    cmake --build . --config Release

  • cd to C:\PairwiseStitching

    cmake -A x64 -G "Visual Studio 15 2017" -DITK_DIR:PATH=C:\ITK-bld .
    cmake --build . --config Release

  5. Run
  • move the 4 png files from C:\PairwiseStitching to C:\PairwiseStitching\Release
  • execute PairwiseStitching.exe
  • press any key to continue
  • make sure the program is using all CPU cores
  • it should print out 1000, 2000, … as trials are tested
  • see if there’s any exception; this may take days (in my previous case, an exception happened at trial 52205 within 1 day; the test on VirtualBox is still running)