Container iterators

Gordian · March 2, 2020, 12:51am

Hello,

i would like to know if there is an container iterator that provide the functionality I need for the C++17 features like std::reduce() with an parallel execution policy. This might improve the performance of some filters.

I have something like this in mind:

ImageRegionConstIterator<MovingGradientImageType> iter(m_SubtractionImageFilter->GetOutput(), this->GetFixedImageRegion());
MovingGradientPixelType initialValue; initialValue.Fill(0);

TPixelType sum = std::reduce(std::execution::par, iter.GoToBegin(), iter.GoToBegin()+numberOfPixel, initialValue, binary_op);

I would like to use the std::end() on the image container. is there a workaround or solution for this problem?

Beside the usual iterators I did not find anything that could help me.

dzenanz · March 2, 2020, 3:37pm

ImageRange introduced by @Niels_Dekker might fit the bill.

Niels_Dekker · March 2, 2020, 8:51pm

@Gordian If you need std::reduce to iterate over the entire image buffer (all pixels), I would recommend ImageBufferRange. If you need to iterate over a subregion of the image, you may use ImageRegionRange instead.

Please let us know if that solves your problem!

Gordian · March 2, 2020, 10:26pm

Thanks to @Niels_Dekker and @dzenanz for pointing to this class. It does a good job for single and double precision. Easy to implement.

Here a small code example for comparison:

    #include "itkImage.h"
    #include "HighPrecisionTimer.h"
    #include <execution>
    #include "itkImageBufferRange.h"
   
    int main(int argc, char* argv[])
    {
    	auto img = itk::Image<float, 3>::New();
    	itk::Image<float, 3>::IndexType idx;
    	idx.Fill(0);

    	itk::Image<float, 3>::SizeType sz; //arbitrary numbers here
    	sz.Fill(512);
    	sz[2] = 1;

    	itk::Image<float, 3>::RegionType reg(idx, sz);
    	img->SetRegions(reg);
    	img->Allocate();
    	img->FillBuffer(0.1);

    	for (auto i = 0; i < 4; ++i)
    	{
    		itk::ImageRegionConstIterator<itk::Image<float, 3>> iter(img, img->GetLargestPossibleRegion());
    		iter.GoToBegin();
    		float sum = 0;

    		{
    			auto Timer{ HighPrecisionTimer<TimeUnits::Microseconds>() };
    			while (!iter.IsAtEnd())
    			{
    				sum = sum + iter.Get();
    				++iter;
    			}
    		}
    		if(i = 3) std::cout << sum << std::endl; ->257 microseconds

    		itk::Experimental::ImageBufferRange<itk::Image<float, 3>> range{ *img };
    		sum = 0;
    		{
    			auto Timer{ HighPrecisionTimer<TimeUnits::Microseconds>() };
    			for each (auto&& var in range)
    			{
    				sum = sum + var;
    			}
    		}

    		if (i = 3)  std::cout << sum << std::endl; -> 251 microseconds

    		sum = 0;
    		{
    			auto Timer{ HighPrecisionTimer<TimeUnits::Microseconds>() };
    			sum = std::reduce(std::execution::par, range.begin(), range.end(), 0.0);
    		}
    		if (i = 3)  std::cout << sum << std::endl; -> 223 microseconds

    		sum = 0;
    		{
    			auto Timer{ HighPrecisionTimer<TimeUnits::Microseconds>() };
    			sum = std::reduce(std::execution::par_unseq, range.begin(), range.end(), 0.0);
    		}
    		if (i = 3)  std::cout << sum << std::endl; -> 86 microseconds
    	}
    	return 0;
    }

Returns incorrect results for the first two tests. The run times for the four tests are at given numbers. The Timer is basically a class that uses the chrono library. The number are not relable in any way; they are more an idea of the runtime.

Edit: Tried the same with itk::Covariant<float, 3> as pixel type. The times were really inconsistent for the range tests but mostly the std::execution::par_unseq was the fastest.

Niels_Dekker · March 3, 2020, 1:41pm

@Gordian Glad to see you got some promising results from using ImageBufferRange with std::reduce. Did you build in Release mode (optimized for speed)?

Three more questions on your code example:

When I tried to compile your code example, I got a suspicious warning (four times) from Visual C++ 2017, on if (i = 3):

warning C4706: assignment within conditional expression

That looks like a bug! Did you mean to write if (i == 3) instead?

for each (auto&& var in range) appears to be a Microsoft specific (.NET/CLR) “for each” loop. Could you maybe also try the standard C++ range-based for-loop, for (auto&& var : range)? I’d be interested to hear if one is faster than the other!

I could not compile HighPrecisionTimer<TimeUnits::Microseconds>, where can I find the “HighPrecisionTimer.h” header file? For previous benchmarks, I used std::chrono, for example:

#include <chrono>
....
using namespace std::chrono;
const auto timePointBefore = high_resolution_clock::now();
const auto result = func();
const auto timePointAfter = high_resolution_clock::now();
const auto durationSeconds =
  duration_cast<duration<double>>(timePointAfter - timePointBefore);
std::cout
  << " Duration: " << durationSeconds.count() << " seconds"
  << " Sum: " << result
  << std::endl;

Another example, using std::chrono: Estimates duration of increasing the capacity of `std::vector<itk::NeighborhoodAllocator<int>>` · GitHub

Kind regards, Niels

Gordian · March 3, 2020, 7:57pm

@Niels_Dekker

I build the test application in RelWithDebInfo mode (as far as I have seen this is close to release mode)
You are correct about the warnings. It was late and the compiler does not throw any errors (and as we all know a code without errors is fine… )
There is no significant change in duration between the microsoft for each and the standard C++ range-based for-loop (at least on my system).
The HighPrecisionTimer<TimeUnits::Microseconds> is basically like your suggestion for time measurements. It’s wrapped in a class and will destroy itself at the end of the scope after the measurement. It’s a custom convenience class for cleaner code.

As small extra I tested the code using itk::CovariantVector<float,3> as pixeltype and added a plus functor for the std::reduce(...):

auto binary_op = [](VectorPixel num1, VectorPixel num2) {
		VectorPixel sum;
		for (auto i = 0; i < VectorPixel::Dimension; ++i)
			sum[i] = num1[i] + num2[i];
		return sum; };

The times using the custom functor are (about):
2200 microseconds for itk::iterator
2200 microseconds for ranged-based for-loop
650 microseconds for std::reduce(std::execution::par) (at least twice if itk::CovariantVector<float,3>::operator+() is used)
600 microseconds for std::reduce(std::execution::par_unseq) (at least twice if itk::CovariantVector<float,3>::operator+() is used)

dzenanz · March 3, 2020, 9:40pm

Did you do these timing tests on a 4-core computer?

It seems you could also use MultiThreaderBase::ParallelizeImageRegion with a little custom summing code. Examples:
UseParallelizeImageRegion
FilterAndParallelizeImageRegion

Gordian · March 4, 2020, 12:35am

I did run the test on a 4-core machine (Intel i5-2500K @ 3,30GHz, Windows 10 Pro, VS2017).

I will look at the examples for ParallizedImageRegion. Thanks.