I’ve notices strange performance drop in ResampleImageFilter.
Here is my logs from ITK 4.12.2, 25 Oct 2017 on the same machine same compiler etc.
ResampleImageFilter ; 3 ; 512x512x256 ; 79.6922 sec
ResampleImageFilter ; 3 ; 512x512x256 ; 131.241 sec
Interpolator : BSplineInterpolateImageFunction, Transform : CompositeTransform [TranslationTransform, TranslationTransform, BSplineTransform, BSplineTransform, BSplineTransform]
Almost twice performance drop for for all other interpolation and transform combination as well.
More detailed log could be found in CPUGPULog-10-25-17-GPU-GeForce-GTX-980-Ti.txt of the
Regards, Denis P. Shamonin
Why is it slow in general?
The OpenCL implementation on Intel OpenCL SDK on CPU (not GPU) is faster by the factor of 3.
Here is my log from OpenCL execution on ITK CPU vs Intel OpenCL CPU:
CPUGPULog-10-25-17-CPU-Intel®-Xeon®-CPU-E5-1620-v3-@-3.50GHz.txt (38.8 KB)
You could get image 512x512x256 to be re-sampled on CPU with BSpline interpolation and number of BSpline transforms plus extra in 18 sec, but now you have to wait 131 sec.
After cooling down. I’ve probably found somebody to blame for this performance drop.
It most likely the Microsoft compiler 2015. I did performance tests for
ITK.4.12.2 with VS-2013 and VS-2015 (all updates installed)
CPUGPULog-08-23-18-GPU-GeForce-GTX-980-Ti-ITK.4.12.2.VS-2013.txt (29.6 KB)
CPUGPULog-08-23-18-GPU-GeForce-GTX-980-Ti-ITK.4.12.2.VS-2015.txt (29.4 KB)
You could see like 40% performance drop in ResampleImageFilter.
I’ve build everything without any extra setting for CMake, compiler or anything else, just ITK out of the box.
I can not test VS-2013 compiler with ITK 5.0.0 because it was dropped. Currently VS-2015 is our production compiler of choice for Windows. I did not pay attention when we made a compiler switch.
If anyone knows what could be adjusted to get it at least to VS-2013 speed, let me know.
Sorry for too many edits on the posts. Was a bit unhappy with such performance drop after years of battles for milliseconds optimization on GPU.
Anyway, let me know if you have some ideas.
I just had a discussion with @matt.mccormick. He thinks it would be good to add a benchmark for resampling. That way we will be able to track performance of resampling as we develop and maintain ITK. If you are willing to create that benchmark, that would be great. Otherwise, he will probably find some time to do it over the next few weeks.
Also, I plan to refactor domain threaders which are heavily used in the registration. Once I get started, general feedback and code reviews will be welcome.
I did a quick profiling of Resampling with the BSplineIterpolator.
It appears the current ITK implementation is optimized for resampling an entire image. During the BSplineIterpolator::SetImage method, which is called during initialization theBSplineDecompositionImageFilter is called on the entire input image to create coefficient images. This filter is still single threaded! That appears to be the most obvious bottleneck.
Aside from that, I recommend you profile and time the results of compilers with different flags, architectures, and optimization flags. Constructively sharing comparisons of different compilers performance and the effect flags have would be quite welcomed, perhaps even with little easy to digest figures
I will try to make it, have a bit of time somewhere around this weeks. Will modify my test forwards ITKPerformanceBenchmarking style. That test already contains most of the checks requred like 1/2/3D, interpolations (Nearest, Linear, BSpline) and transforms (BSpline, Euler2D/3D, Similarity2D/3D, Translation).
I’d love to see the itkResampleImageFilter to be multi threaded (at least for simple resampling with an identity transform), but when I had a look at the source to see if I could get that implemented I had to agree to the comment there that this is far from trivial in case of an arbitrary transform. Perhaps it could be split for unit transforms (i.e. no transform) or given a new filter (like itkResampleWithUnitTransformImageFilter) for the cases where just up/down sampling is needed.
ResampleImageFilter is multi-threaded.
Sorry, mixed up multi-threading and streaming which is (if I’m not mistaken) still not possible?
The streaming should be possible downstream from resample filter. But upstream will be processed whole, as resample requests the entire input. And resample filter itself can process the image in chunks.
It is not easy, and it is probably infeasible to determine the correct input region for arbitrary transform. But for MatrixWithOffset and its descendants it is possible. If you would like to go ahead and handle those special cases, it would be welcome.
And resample filter itself can process the image in chunks.
Can it? If so, it is unclear to me why those chunks cannot be used for streaming?
That’s what I am saying. The streaming can work for this filter, and the ones after. But since it request the entire input even for partial output, the streaming before this filter is broken. Unless you manually insert streaming filter.
We could add streaming to
ResampleImageFilter for linear transforms, i.e. MatrixWithOffset or transforms whose
IsLinear() method returns
true, by transforming the bounding box defined by the output
RequestedRegion and setting the input
RequestedRegion to the bounds area in voxel space in
Gave it a try but I feel unfamiliar with some of the ITK internals here. What do you think of:
Looks reasonable to me. Have you tested it, and does it work? What happens with composite transforms?
Thanks @dzenanz for your feedback. It compiles and works with streaming in the frame of my simple resample_SDI CLI. The output is nearly the same as without streaming (resample CLI) except that the SDI output is larger in extent (in my case one more z-slice at the end). Disregarding this slice leads to a difference image that is const. zero.
Are there any test in the ITK test suite that would be appropriate for testing this?
There are several tests for resample filter. For example, before this line add
writer1->SetNumberOfStreamDivisions(8); //split into 8 pieces for streaming. And you can similarly enable streaming in other tests too, but please not all of them.
Depending on the interpolation, you will need to pad the requested region to account for the input area of interpolator selected. What is the padding for nearest? Linear? Bspline? Previously in this thread I noted that the bspline interpolator operated on the entire input, that will need some special consideration.
Hm, sure the padding should be considered, I’m a bit lost on this and wondering how it was done for the itkWarpImageFilter. Some of my code is based on the ideas of this contribution but it is not clear to me how the padding for the chosen interpolator entered there.