Threading in ITK

HsuehWen · March 20, 2018, 5:39am

Hi,

I’m using ITK registration on a 28core, 56 processor machine with 2 scenarios.
(1) execute the program 24 times with different command
(2) single program, but use 24 threads do mimic the process of scenario (1)

In the (1) case, the CPU goes to 100%, which is normal since ITK registration also use multi-threading. What I’m curious is that in case (2), the CPU loading is only around 20~25%.

I wonder whether there’s a limit (default or can be set) on ITK thread number in a single process?
I tried
itk::MultiThreader::SetGlobalMaximumNumberOfThreads( n );
itk::MultiThreader::SetGlobalDefaultNumberOfThreads( n );
But that doesn’t help.

Does anyone has some suggestion on this?
Thanks.

BR,
HsuehWen

dzenanz · March 20, 2018, 1:27pm

In current master version, you should execute itk::MultiThreaderBase::SetGlobalDefaultNumberOfThreads( n );. Also, you should execute this before you instantiate other filters.

Alternatively, you can set number of threads directly on a filter: myReg->SetNumberOfThreads(n);.

matt.mccormick · March 20, 2018, 1:40pm

To improve thread utilization in ITK 4.13.0, call

itk::MultiThreader::SetGlobalDefaultUseThreadPool(true);

at the beginning of the program. Or, set the environmental variable:

ITK_USE_THREADPOOL=1

We anticipate release of fine grained threading with Threading Building Blocks (TBB) and native threads in one of the upcoming ITK 5 alpha releases – your testing would be appreciated!

blowekamp · March 20, 2018, 1:52pm

There is also the environmental variable ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS which can be set to an integer of the default number of threads allocated to an algorithm. I prefer setting the environmental variable when tweaking performance of a specific algorithm executed on the command line, vs recompiling the program.

matt.mccormick · March 20, 2018, 2:18pm

→ ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS

HsuehWen · March 21, 2018, 6:08am

Hi all,

Thanks for your kind and fast help!
However, for my case, it turns out that the problem is at class itk::OpenCVImageBridge

I do transformation between ITK image and OpenCV Mat

When I use 56 threads to repeat calling itk::OpenCVImageBridge::CVMatToITKImage, CPU is around 5%.
When I use 56 threads to repeat calling registration, CPU is 100%
When I combine both, CPU is 20~100% (fluctuating roughly like sine wave)

So I think the process is stuck in the CVMatToITKImage function.

Does this sound reasonable to you?
Do you know how to resolve this issue ?
Thanks!

By the way, I’m currently using ITK4.12.2. Not sure whether it differs in this case.

BR,
HsuehWen

matt.mccormick · March 21, 2018, 1:33pm

Hi @HsuehWen,

is a copy operation. You may be able to see some improvements by contributing an improvement that changes the memcpy to a std::copy.

However, if it is possible to refactor to avoid unnecessary calls to this function in your pipeline, this will help even more ;-).

Hope this helps,
Matt

blowekamp · March 21, 2018, 1:37pm

Also please verify that you are compiling in Release mode and not Debug.

HsuehWen · March 22, 2018, 1:29am

Thanks again~

I know it’s a copy operation and it’s suggested not to use it often.
In my case, using single thread with multi-process, this function is not a bottle neck, because the registration process takes much longer time (and CPU is always 100% occupied). However, when executing multi-thread with single process, it is a bottle neck in the sense that CPU is NOT fully used. It cost time but not CPU power.

A reasonable argue (for me) is that there is a mutex in the function that only one thread can call this function at a time, but I didn’t find one in the source code (and I don’ think mutex is needed in this function). That’s what bothers me.

I’ll try your suggestions and keep trying to figure out what happened. Thanks.

BR,
HsuehWen

HsuehWen · March 23, 2018, 3:36am

Hi all,

I think I found the bottleneck, which is the memory allocation. In multi-process architecture, each process owns a chunk of memory block, while in multi-thread case, all the threads share the same memory block. Since all threads share the same memory, when doing memory allocation, there is a mutex to prevent more than one threads allocate memory with overlapping ranges (information here).

In transforming CV mat to ITK mat (and vice versa), there are memory allocations. That’s why CPU is not fully used, threads are waiting for memory allocation. I have done simple tests that continuing cloning cv::Mat (56 thread) cost CPU 13%. If I only do memcpy, CPU reaches 100% immediately

For CV-ITK mat transform, since my images are fix-sized, I can allocate memory at the beginning so there’s no need to new memory at each function call.

But for other simple image processing tasks in ITK (for example BinaryThresholdImageFilter, RGBToLuminanceImageFilter), do I have choices? It seems the ITK architecture is using an object to do transformation and get image as return of function call. I think that means memory are allocated in the object. Is there an alternative? for example, passing allocated memory and ask the functon/object to do transform on that chunk of memory?

Thanks.

BR,
HsuehWen

blowekamp · March 23, 2018, 1:15pm

Some filters have an “InPlace” option. This option enables the filter to use the same buffer for the input as the output.

There are some implementation of “malloc” which are designed for improved thread performance. By default this may very significantly between OS, and compilers. I have read that Intel offers a “scalable” malloc: https://software.intel.com/en-us/node/506096 There may be other options available.

From the ITK architecture side, I have pondered the usefulness of a pluggable “ITK Allocator”, which can be utilized for allocating image buffers. If this is a programmable and modular part interesting algorithms could be exploited. For the Image buffers, it may be a common patter for the same size image to be allocated, and de-allocated over and over during a pipeline. It may be advantages to develop an allocate which does not immediately release the buffer to the OS, but instead holds on to it to be used the next time.

matt.mccormick · March 23, 2018, 2:52pm

@HsuehWen Well done on the investigation, and thanks for sharing your findings!

For ITK filter development, while a different allocator could be helpful, this performance issue emphasizes the need to avoid de-allocating and re-allocating data structure memory buffers in filters when possible.

In terms of pipeline memory management, there is a ReleaseData flag that can be set on itk::ProcessObject's to control the tradeoff between the amount of memory used and avoiding memory re-allocations. ReleaseData is disabled by default.

@HsuehWen To achieve your goal of using pre-allocated memory, combine the itk::ImportImageFilter with the GraftOutput method of the filter at the end of the pipeline. This will tell the filter to write into the provided image’s buffer.

HTH,
Matt

pradeep · September 19, 2018, 10:14am

Hello Everyone,

I have some queries about properly enabling threading support in ITK. Based on the above comments and ITK documentation I have tried both the following settings and neither helps me in getting ITK filters run on all cores.

Environment variables ITK_USE_THREADPOOL and ITK_GLOBAL_DEFAUL_NUMBER_OF_THREADS

itk::MultiThreader::SetGlobalDefaultUseThreadPool(true);
itk::MultiThreaderBase::SetGlobalDefaultNumberOfThreads( n );

Alternatively, I tried setting the number of threads on filter too but that also doesn’t help.

I have built ITK 4.13.1 from source and enabled FFTW usage using the flags ITK_USE_FFTWF and ITK_USE_SYSTEM_FFTW.

Any suggestions are greatly appreciated.

Thank you,
Pradeep.

matt.mccormick · September 19, 2018, 1:07pm

Hello @pradeep,

For optimal threading,

Use the latest release of ITK, currently ITK 5.0 Beta 1
Enable TBB (Intel Thread Building Blocks) support in your build. Even without this, the default thread pool and thread use has been improved dramatically since ITK 4.

HTH,
Matt

blowekamp · September 19, 2018, 1:13pm

What specific algorithms are your try to get to run on all your cores? FFTs? Some algorithms are only implemented single threaded, others don’t scale too well. How may cores are your trying to scale to and how big is your data?

pradeep · September 19, 2018, 1:19pm

@matt.mccormick
I am trying to run DeconvolutionFilters(Landweber, RichardsonLucy) and ConnectedComponentImageFilter.

I was assuming deconvolution filters, both iterative and inverse, use fftw for transformations. Is that correct ?

pradeep · September 19, 2018, 1:20pm

Sure, I will try ITK 5.0 Beta 1 and check if there is any change in results.