ResampleImageFilter non-determinism with threading

Simon · April 23, 2018, 7:04pm

I’m seeing some non-deterministic results in ResampleImageFilter on with different thread counts.

I’m assuming this is unexpected behaviour, but thought I’d ask here first if there are known issues it might relate to before I dig into it further.

I’ve reproduces with 4.10 and 4.13, and it seems to depend on particular changes in image geometry.

matt.mccormick · April 23, 2018, 7:10pm

Hi Simon,

Correct – ResampleImageFilter should not change its result based on the thread count.

Thanks,
Matt

Simon · April 23, 2018, 7:16pm

Here is a somewhat simple test case that fails for me on visual studio 2015 builds. You can see from the slicer screen grab the patterned differences where interpolator results are different by (typically) 1.

If I come up with a better test case I’ll add it here. The one from the actual code is a bit more complicated and also larger difference, this is the simplest trigger I’ve come up with so far. The geometry numbers come from that example though and seem to be significant.

Any thoughts?

test.cxx (3.5 KB)

Simon · April 23, 2018, 7:18pm

Forgot to mention, to trigger on my machine I run threadcount from 1 to 10 (i7 with 8 virtua cores). I get something like this:

$ for i in {1..10}; do build/Release/Test.exe $i.nrrd $i; CompareImages.exe 1.nrrd  $i.nrrd;done
PSNR = inf
PSNR = inf
PSNR = inf
PSNR = inf
PSNR = 90.0928
PSNR = inf
PSNR = 95.1087
PSNR = 97.5865
PSNR = inf
PSNR = 90.6642

Simon · April 23, 2018, 7:27pm

Also, I hope to have a proper look at why this is happening this week, but can’t for a couple of days. So if anyone has ideas or similar experiences, I’d love to hear them. I’m also not very familiar with that codebase so it may be slow going.

blowekamp · April 23, 2018, 7:34pm

Do you have a measure of the magnitude of the per pixel error?

The algorithm used for “linear transforms” is done in an optimized scan line basis:

github.com

InsightSoftwareConsortium/ITK/blob/master/Modules/Filtering/ImageGrid/include/itkResampleImageFilter.hxx#L338-L470


::LinearThreadedGenerateData(const OutputImageRegionType &
                           outputRegionForThread,
                           ThreadIdType threadId)
{
// Get the output pointers
OutputImageType *outputPtr = this->GetOutput();


// Get this input pointers
const InputImageType *inputPtr = this->GetInput();


// Get the input transform
const TransformType *transformPtr = this->GetTransform();


// Create an iterator that will walk the output region for this thread.
using OutputIterator = ImageScanlineIterator< TOutputImage >;


OutputIterator outIt(outputPtr, outputRegionForThread);


// Define a few indices that will be used to translate from an input pixel
// to an output pixel

This file has been truncated. show original

This methods just compute the “delta” in the scanline direction and accumulates it as it move along the scanline. This would result is subtle numeric differences than if the full transform from the fixed to the moving was done.

Now if a different “ImageRegionsSplitter” was used the results could be slightly different becausethe default way to split the images (currently) should ensure that the fast scan line is never split.

@dzenanz

Simon · April 23, 2018, 7:36pm

In this test case, the magnitude is usually 1.

However in the problem that caused this, i was using NearestNeighbor, not linear. Unfortunately I haven’t found as simple a test case that generates it with NN. I will try to autogen an image that trips that up too if it helps. Or perhaps find one I can share.

In the NN case I would expect direction to be irrelevant, no?

dzenanz · April 23, 2018, 9:08pm

The direction matters for which pixels get interpolated, and interpolation matters for coming up with values. So image direction does affect nearest neighbor interpolation.

dzenanz · April 23, 2018, 9:32pm

I tried the example, and I have consistently been getting this kind of image: stripes with value 254 and 253. I didn’t notice whether this is thread-dependent. I used ITK-4.12.0-vs2017.

Simon · April 24, 2018, 1:12pm

Sorry, I should have been more careful with my choice of words. Clearly the image direction matrix matters in choosing which pixels are interpolated. However, while accumulation order could change the linear result a little bit, I can’s see how the choices of ImageRegionsSplitter could affect a nearest neighbors result. Am I missing something?

blowekamp · April 24, 2018, 1:17pm

So there are two things to consider here:

What is the numeric cause of the difference?
The resampling is does from continuous physical space to continuous physical space, so it is done with double precision. With the limitation of finite bits, the computations for coordinates to sample from is imprecise which will require some interpolation. The pixel type is of “unsigned char” the result of the interpolator is then implicitly rounded via truncation. So a slight difference in location can result in 254 vs 255. Consider using “float” for the output pixel type followed by the “RoundImageFilter”
Why is it different with thread numbers?
This is tougher as I suggested I could see a different Region Splitter could the cause. Could you be using a different one? The other is some funny SSE instructions making the scan line dependent on each other. OR other bad optimization done.

Simon · April 24, 2018, 1:21pm

Bradley, good points.

My intuition is the numerics are not problematic, but will keep in mind.

I’m not convinced my simple test captures enough of it, I’ll try and construct a more representative one. Once I have that I can try with all optimizations off to see if it’s optimizer related.

blowekamp · April 24, 2018, 1:22pm

The Splitter could most definitely have an effect even with NearestNeighbor interpolator. If the result of the accumulation is .499999 vs .500000001 you could land on a different pixel.

Simon · April 24, 2018, 1:39pm

Ah, of course, I was thinking of the value accumulation and should have read your snippit more carefully.

I have to leave this for a bit now, but I have verified behaviour is the same with all optimizations turned off and SIMD disabled.

Simon · April 24, 2018, 2:29pm

@blowekamp : ata first pass I think LinearThreadedGenerateData would be more stable with a step count rather than a rolling accumulation.

I’ll pull a branch and try that.

dzenanz · April 24, 2018, 2:32pm

It might be, but multiplication is more computationally expensive than addition. That might not matter much on leading edge modern processors though

Simon · April 24, 2018, 2:34pm

Yes, we would have try it. Regardless it would be nice to know if that is the root of the issue.

Simon · April 25, 2018, 7:20pm

Update. The striped image in above example made it more difficult to see what was going on, so I reverted to using a mask image with similar geometry.

As Bradley surmised, replacing the line

      inputIndex += delta;

with

outputPtr->TransformIndexToPhysicalPoint(outIt.GetIndex(), outputPoint);
inputPtr->TransformPhysicalPointToContinuousIndex(transformPtr->TransformPoint(outputPoint),inputIndex);

removed all of the thread variability.

Unsurprisingly, the accumulation version doesn’t match recomputing position at every point. Attached snapshot shows difference between reference (position recomputed always) and 1 and 8 thread resamplers using the acumulation menthod (i.e. code in 4.13).

bottom is reference image, top is the deltas between reference and 1-thread (bright) 8-thread (dim). You can see how feeding this mask through additional morphological operations can result in quite different outputs for different threads. I should point out that the screenshot is centered on one of the slices that have differences, of course much of the volume is identical.

The key issue here I think is the delta, as computed it is {0.99999999999999756, 6.9111383282915995e-15, 3.5527136788005009e-15} }

blowekamp · April 25, 2018, 7:35pm

My preferred method for resampling masks has be to create a 0-level set from the mask via a distance field filter. Then resample the distance field. This approach is free from many aliasing issue, and more numerically stable with the usage of a linear or higher order interpolator.

Simon · April 25, 2018, 7:40pm

@blowekamp : true, I use that approach in other places, and this is an oversimplified test. However this isn’t about aliasing in the result - I was surprised by the different behaviour based on thread count. This will affect any resampling of similar geometry, of course, just be more difficult to see. In the particular case I have I can make it more stable, but I’m still looking to avoid regressions between machines with different core counts…

Fundamentally this isn’t a big impact but I’m not sure the speed tradeoff is always worth the non-determinism of the scanline approach. Any correction factors for the accumulation errors, etc. are probably going to get close to cost of recomputing the position each time.

For what it’s worth, I’ve already resolved the original regression a different way.