Container iterators

Hello,

i would like to know if there is an container iterator that provide the functionality I need for the C++17 features like std::reduce() with an parallel execution policy. This might improve the performance of some filters.

I have something like this in mind:

ImageRegionConstIterator<MovingGradientImageType> iter(m_SubtractionImageFilter->GetOutput(), this->GetFixedImageRegion());
MovingGradientPixelType initialValue; initialValue.Fill(0);

TPixelType sum = std::reduce(std::execution::par, iter.GoToBegin(), iter.GoToBegin()+numberOfPixel, initialValue, binary_op);

I would like to use the std::end() on the image container. is there a workaround or solution for this problem?

Beside the usual iterators I did not find anything that could help me.

ImageRange introduced by @Niels_Dekker might fit the bill.

2 Likes

@Gordian If you need std::reduce to iterate over the entire image buffer (all pixels), I would recommend ImageBufferRange. If you need to iterate over a subregion of the image, you may use ImageRegionRange instead.

Please let us know if that solves your problem!

1 Like

Thanks to @Niels_Dekker and @dzenanz for pointing to this class. It does a good job for single and double precision. Easy to implement.

Here a small code example for comparison:

    #include "itkImage.h"
    #include "HighPrecisionTimer.h"
    #include <execution>
    #include "itkImageBufferRange.h"
   
    int main(int argc, char* argv[])
    {
    	auto img = itk::Image<float, 3>::New();
    	itk::Image<float, 3>::IndexType idx;
    	idx.Fill(0);

    	itk::Image<float, 3>::SizeType sz; //arbitrary numbers here
    	sz.Fill(512);
    	sz[2] = 1;

    	itk::Image<float, 3>::RegionType reg(idx, sz);
    	img->SetRegions(reg);
    	img->Allocate();
    	img->FillBuffer(0.1);

    	for (auto i = 0; i < 4; ++i)
    	{
    		itk::ImageRegionConstIterator<itk::Image<float, 3>> iter(img, img->GetLargestPossibleRegion());
    		iter.GoToBegin();
    		float sum = 0;

    		{
    			auto Timer{ HighPrecisionTimer<TimeUnits::Microseconds>() };
    			while (!iter.IsAtEnd())
    			{
    				sum = sum + iter.Get();
    				++iter;
    			}
    		}
    		if(i = 3) std::cout << sum << std::endl; ->257 microseconds

    		itk::Experimental::ImageBufferRange<itk::Image<float, 3>> range{ *img };
    		sum = 0;
    		{
    			auto Timer{ HighPrecisionTimer<TimeUnits::Microseconds>() };
    			for each (auto&& var in range)
    			{
    				sum = sum + var;
    			}
    		}

    		if (i = 3)  std::cout << sum << std::endl; -> 251 microseconds

    		sum = 0;
    		{
    			auto Timer{ HighPrecisionTimer<TimeUnits::Microseconds>() };
    			sum = std::reduce(std::execution::par, range.begin(), range.end(), 0.0);
    		}
    		if (i = 3)  std::cout << sum << std::endl; -> 223 microseconds

    		sum = 0;
    		{
    			auto Timer{ HighPrecisionTimer<TimeUnits::Microseconds>() };
    			sum = std::reduce(std::execution::par_unseq, range.begin(), range.end(), 0.0);
    		}
    		if (i = 3)  std::cout << sum << std::endl; -> 86 microseconds
    	}
    	return 0;
    }  

Returns incorrect results for the first two tests. The run times for the four tests are at given numbers. The Timer is basically a class that uses the chrono library. The number are not relable in any way; they are more an idea of the runtime.

Edit: Tried the same with itk::Covariant<float, 3> as pixel type. The times were really inconsistent for the range tests but mostly the std::execution::par_unseq was the fastest.

1 Like

@Gordian Glad to see you got some promising results from using ImageBufferRange with std::reduce. Did you build in Release mode (optimized for speed)?

Three more questions on your code example:

  • When I tried to compile your code example, I got a suspicious warning (four times) from Visual C++ 2017, on if (i = 3):

warning C4706: assignment within conditional expression

That looks like a bug! Did you mean to write if (i == 3) instead?

  • for each (auto&& var in range) appears to be a Microsoft specific (.NET/CLR) “for each” loop. Could you maybe also try the standard C++ range-based for-loop, for (auto&& var : range)? I’d be interested to hear if one is faster than the other!

  • I could not compile HighPrecisionTimer<TimeUnits::Microseconds>, where can I find the “HighPrecisionTimer.h” header file? For previous benchmarks, I used std::chrono, for example:

    #include <chrono>
    ....
    using namespace std::chrono;
    const auto timePointBefore = high_resolution_clock::now();
    const auto result = func();
    const auto timePointAfter = high_resolution_clock::now();
    const auto durationSeconds =
      duration_cast<duration<double>>(timePointAfter - timePointBefore);
    std::cout
      << " Duration: " << durationSeconds.count() << " seconds"
      << " Sum: " << result
      << std::endl;
    

Another example, using std::chrono: https://gist.github.com/N-Dekker/e629feb4f2ebcbc2f085d355d81f0f26

Kind regards, Niels

@Niels_Dekker

  • I build the test application in RelWithDebInfo mode (as far as I have seen this is close to release mode)

  • You are correct about the warnings. It was late and the compiler does not throw any errors (and as we all know a code without errors is fine…:slight_smile: )

  • There is no significant change in duration between the microsoft for each and the standard C++ range-based for-loop (at least on my system).

  • The HighPrecisionTimer<TimeUnits::Microseconds> is basically like your suggestion for time measurements. It’s wrapped in a class and will destroy itself at the end of the scope after the measurement. It’s a custom convenience class for cleaner code.

As small extra I tested the code using itk::CovariantVector<float,3> as pixeltype and added a plus functor for the std::reduce(...):

auto binary_op = [](VectorPixel num1, VectorPixel num2) {
		VectorPixel sum;
		for (auto i = 0; i < VectorPixel::Dimension; ++i)
			sum[i] = num1[i] + num2[i];
		return sum; };

The times using the custom functor are (about):
2200 microseconds for itk::iterator
2200 microseconds for ranged-based for-loop
650 microseconds for std::reduce(std::execution::par) (at least twice if itk::CovariantVector<float,3>::operator+() is used)
600 microseconds for std::reduce(std::execution::par_unseq) (at least twice if itk::CovariantVector<float,3>::operator+() is used)

1 Like

Did you do these timing tests on a 4-core computer?

It seems you could also use MultiThreaderBase::ParallelizeImageRegion with a little custom summing code. Examples:
UseParallelizeImageRegion
FilterAndParallelizeImageRegion

I did run the test on a 4-core machine (Intel i5-2500K @ 3,30GHz, Windows 10 Pro, VS2017).

I will look at the examples for ParallizedImageRegion. Thanks.