Multi-threads registration reproducibility

I already asked this question on the ANTs project and they told me to ask here. Yes, I have found the ANTs wiki page that explains why the registration is not reproducible.

  • I’m already using a seed. The results are nearly identical, but nearly is not enough :slight_smile:
  • I don’t want to use a single thread. I want to understand the problem and possibly fix it.

Some helpful ANTs users told me that MSQ and CC are reproducible, but not MI, because all threads are working on the same joint histogram (PDF?) before they calculate the sum. I tried to study the code but it’s not simple! Still, I understand that the mutex and joint histogram is created here, they are passed to each thread. The rest of the action is in those files

  • itkMattesMutualInformationImageToImageMetricv4.hxx
  • itkMattesMutualInformationImageToImageMetricv4GetValueAndDerivativeThreader.hxx

but I would need to study them for much longer to understand what’s going on.

I have several questions!

  1. I’m using the word “histogram” here (that’s what stnava wrote), but I see PDF in the code. I’m not a statistician, but I know that those things are different. Which word should I use to better understand the algorithm?
  2. Why is it done like that? Using a global variable used by all threads. Can’t each thread works on it’s own local histogram, then we merge them all after, calculate the sum, etc.?
  3. Is there a non-threaded version that I can study? If not, can somebody explain me what’s going on. It’s complicated and hard to follow!

Thank you for your time.

  1. I assume that the joint histogram is an (approximation to) probability density function, hence abbreviation PDF.
  2. Maybe to save memory?
  3. I don’t think there is a non-threaded version.

Hopefully someone will provide more insight than this.

During intensity based image registration you perform tens of millions of floating-point operations. Result of these operations slightly differ depending on what hardware, compiler version and compiler flags you use. Since registration is a highly non-linear process, small differences at any point may result in arbitrary large differences in the final output. If the registration problem is not too hard and optimization and metric parameters are reasonable then error in the final output is negligible.

It is extremely difficult to achieve 100% reproducibility. Reproducibility also has a huge runtime penalty, as high-accuracy, maximum-compatibility implementation of floating-point operations are usually much slower than fast approximations. You probably also need to force running most multi-threaded operations to run on a single thread because typically registration algorithms are optimized for better speed at the cost of allowing negligible differences in the output.

2 Likes

Yes, I understand and accept (!) that IEEE754 is complicated :slight_smile: That’s not the problem.

I’ve been studying the code in my spare time and, altough I don’t understand everything, I see that the task I had in mind is “impossible”. If the histogram is built on a single thread, it’s slow and reproducible. If it’s threaded, it’s much faster but not reproducible. The reason is simple : the order of the operations will never be the same if all threads are contributing (adding) to the same object. And that order is essential to have the same exact results.

So, it’s either “fast” or reproducible. We can’t have both. There’s probably not much else to be said about this subject!

After discussing with my team, we has 2 ideas. I’m interested in your opinions.

1. What if we run MI with several threads, as it’s currently done, but they can only add their results to the global histogram when it’s their turn. Thread 1, then 2, then …, then N, and we start over with 1 with this thread is ready. If I understand correctly, this would

  1. Slow down the process, of course, but not by much
  2. Make the results reproducible because the order of the operations would now be determined. If this was the only problem. I think it was.
  3. Reproducible only when using the same number of threads. 4 threads wouldn’t divide the image like 5 threads would, thus the results would change relative to the number of threads. This is somewhat strange, but it’s an acceptable solution for us (at work).

2. What if we run MI with several threads, as it’s currently done, but the global histogram is an array of vectors, instead of an array. We append to the the right vector instead of adding to the right value. At the end, we sort axis(1) and sum axis(1), then we run the rest of the algorithm as usual. If I understand correctly, this would

  1. Slow down the process, maybe too much. This represents a lot of allocations and sorting.
  2. Same as other 2) in previous section.
  3. Be reproducible on any number of threads.
  4. Be more precise because summing small numbers and progressively adding bigger numbers is more precise. (If I remember my IEEE754 course). Although the goal is to be reproducible, not as precise as we possibly can.

As I wrote above, can I have your opinions on those ideas please?

Reproducing numerically exactly same result for a highly non-linear optimization process is an interesting academic exercise, but I cannot think of any practical applications of this effort for medical image registration.

A complete registration workflow is much more than just computing a transform from two images. What is your motivation for trying to achieve extreme reproducibility for a small part of the workflow (computing a transform from two images), instead of focusing on practical/clinical reproducibility of the entire process?

Hi @lassoan,

I am a colleague of @Nil and process data with the ANTs registration script. In our case it is important in term of 1. clinical aspect and 2. the use of the image registered.

  1. We compute some data in clinical studies and this domain requiert a high reproducibility of the results. Even if the reproducibility of the results are high, it is important for us to be fully reproducible.

  2. In our case, we registered an anatomical image (T1w) and after the registered image is segmented to extract WM, GM and CSF tissue maps. Is the registration is not 100% reproducible the maps will have some voxels changes across two runs of the same registration process. If you want more details about the impact of the reproducibility process in our domain, you can check this paper: TractoFlow: A robust, efficient and reproducible diffusion MRI pipeline leveraging Nextflow & Singularity - ScienceDirect. In this paper, section 3.2 show the impact of the small reproucibility error on different metrics that we use.

Best

Guillaume

It is of course better to reduce any unnecessary variation in the processing workflow. However, since there is uncontrollable variation in the early phases anyway (in patient imaging), it is impossible to get perfectly reproducible results in the end, and so it should be possible to determine a reasonable level of acceptable error.

You may deal with variation of registration result the same way as you deal with variation in imaging: perform it only once and use the result in all subsequent runs.

When I talk about reproducibility, I mean in test-test. Not in test-retest. So the only uncontrollable variation is the software. As I said, in clinical application, it is really important to be fully reproducible in test-test.

With simpler words that everybody can understand: a program should produce the exact same results with the same inputs. Of course the final results will be different if we feed our diffusion pipeline a new DWI, but we’re aiming for per-program reproducibility when we provide the same inputs.

1 Like

Just wondering, in itkMattesMutualInformationImageToImageMetric.h have you tried to change the NumberOfHistogramsBins? Say, from the default 50 to 200 and see if that helps with your reproducibility problem.
Docs say 50 is the optimal value for the algorithm, but histograms on float data are sensible to floating point precision errors.

You can also try to set UseExplicitPDFDerivitaves = false and see if it helps or make it worse.

@phcerdan No, I didn’t test this because I don’t understand how this could fix the problem. We don’t want to make things “a little better” :slight_smile: It’s either reproducible or it is not.

So, can I please have your opinion on the 2 suggestions I shared above? I will try to code one but I prefer if an itk expert confirms that my ideas can work!

I think that your approach one (threads can only add their results to the global histogram when it’s their turn) makes more sense to try first. It is also more likely to be useful to someone else. Approach two will probably have too big of an overhead to be useful practically.

Hi Nil, I am also battling reproducibility issues for a similar reason. Did you by any chance try to code your solution #1 and how did it work for you? Would be interesting to know what works :slight_smile:
Thanks!

Hi @michalm24 . This project was for the enterprise I work for. My boss choose to rework our diffusion pipeline using a single thread for the registration step instead of asking me to continue this project. Thus, I have no proof of concept, code or document to share. The only thing I did was to study the code and my conclusion are in this thread above.

1 Like

Hi,

Something to check is whether UseFloatingPointCorrection is enabled. The links in the CompensatedSummation documentation also provide some insights into why this issue occurs and why that feature helps with the faster and (more) reproducible when multi-threading.

1 Like

OK, thanks! the thread is informative even as it is, I was just wondering if you succeeded before we attempt something like that in my team :slight_smile:

1 Like

@michalm24 if you manage to solve this I would be very interested to test it in the ANTs registration tools.

I think the approach proposed above, of having workers update the global histogram in a pre-defined order, is the way to go. There would still be a dependence on the number of threads - you wouldn’t get the same results running with 3 vs 4 threads - but the same input on the same machine with the same number of threads would be reproducible run-to-run, which would help with testing and other applications.

1 Like