I’m using ITK registration on a 28-core machine with 56 logical processors, with two scenarios:
(1) execute the program 24 times with different commands
(2) a single program that uses 24 threads to mimic scenario (1)
In case (1), CPU usage goes to 100%, which is expected since ITK registration also uses multi-threading. What puzzles me is that in case (2), CPU usage is only around 20~25%.
I wonder whether there’s a limit (default or configurable) on the number of ITK threads in a single process?
I tried
itk::MultiThreader::SetGlobalMaximumNumberOfThreads( n );
itk::MultiThreader::SetGlobalDefaultNumberOfThreads( n );
But that doesn’t help.
In the current master version, you should execute itk::MultiThreaderBase::SetGlobalDefaultNumberOfThreads( n ); at the beginning of the program, before you instantiate other filters.
Alternatively, you can set the number of threads directly on a filter: myReg->SetNumberOfThreads(n);.
Or, set the environment variable:
ITK_USE_THREADPOOL=1
We anticipate releasing fine-grained threading with Threading Building Blocks (TBB) and native threads in one of the upcoming ITK 5 alpha releases – your testing would be appreciated!
There is also the environment variable ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS, which can be set to an integer specifying the default number of threads allocated to an algorithm. I prefer setting the environment variable when tweaking the performance of a specific algorithm executed on the command line, rather than recompiling the program.
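For reference, a minimal sketch of these options in a single program (ITK 4.x API; the thread count `n` and the `myReg` registration object are placeholders):

```cpp
#include "itkMultiThreader.h"

int main()
{
  const int n = 24; // desired number of threads (placeholder value)

  // Set the global default before any filter is instantiated (ITK 4.x API).
  // On the current master (ITK 5), use itk::MultiThreaderBase instead.
  itk::MultiThreader::SetGlobalDefaultNumberOfThreads(n);

  // ... build the registration pipeline here ...
  // Per-filter override, e.g. on the registration object:
  // myReg->SetNumberOfThreads(n);

  // Without recompiling, the environment variables
  // ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS and ITK_USE_THREADPOOL can be used instead.
  return 0;
}
```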
Thanks for your kind and fast help!
However, in my case, it turns out that the problem is in the itk::OpenCVImageBridge class.
I convert between ITK images and OpenCV cv::Mat.
When I use 56 threads to repeatedly call itk::OpenCVImageBridge::CVMatToITKImage, CPU usage is around 5%.
When I use 56 threads to repeatedly call registration, CPU usage is 100%.
When I combine both, CPU usage is 20~100% (fluctuating roughly like a sine wave).
So I think the threads get stuck in the CVMatToITKImage function.
Does this sound reasonable to you?
Do you know how to resolve this issue?
Thanks!
By the way, I’m currently using ITK 4.12.2. Not sure whether that makes a difference in this case.
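For context, a minimal sketch of the kind of per-thread conversion loop described above, assuming 8-bit single-channel images (the image size, type, thread count, and iteration count are assumptions):

```cpp
#include <thread>
#include <vector>

#include <opencv2/core/core.hpp>

#include "itkImage.h"
#include "itkOpenCVImageBridge.h"

int main()
{
  using ImageType = itk::Image<unsigned char, 2>;

  const cv::Mat frame(512, 512, CV_8UC1, cv::Scalar(0)); // dummy input frame

  std::vector<std::thread> workers;
  for (int t = 0; t < 56; ++t)
  {
    workers.emplace_back([&frame]() {
      for (int i = 0; i < 1000; ++i)
      {
        // Each call allocates a fresh ITK image buffer and copies the pixels.
        ImageType::Pointer image =
          itk::OpenCVImageBridge::CVMatToITKImage<ImageType>(frame);
      }
    });
  }
  for (auto & w : workers)
  {
    w.join();
  }
  return 0;
}
```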
I know it’s a copy operation and it’s suggested not to use it often.
In my case, with single-threaded, multi-process execution, this function is not a bottleneck, because the registration takes much longer (and the CPU is always 100% occupied). However, with multi-threaded, single-process execution, it is a bottleneck in the sense that the CPU is NOT fully used. It costs time but not CPU power.
A reasonable explanation (to me) is that there is a mutex in the function so that only one thread can call it at a time, but I didn’t find one in the source code (and I don’t think a mutex is needed in this function). That’s what bothers me.
I’ll try your suggestions and keep trying to figure out what happened. Thanks.
I think I found the bottleneck: memory allocation. In the multi-process architecture, each process owns its own memory, while in the multi-threaded case all threads share the same address space. Since all threads share the same heap, memory allocation is protected by a mutex to prevent more than one thread from allocating overlapping ranges (information here).
Converting a cv::Mat to an ITK image (and vice versa) involves memory allocations. That’s why the CPU is not fully used: threads are waiting on memory allocation. I have done simple tests: continuously cloning a cv::Mat with 56 threads keeps CPU usage at 13%, while if I only do memcpy, CPU usage reaches 100% immediately.
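A rough sketch of the kind of comparison described, with arbitrary image sizes, thread counts, and iteration counts assumed:

```cpp
#include <cstring>
#include <functional>
#include <thread>
#include <vector>

#include <opencv2/core/core.hpp>

// Variant 1: clone() allocates a fresh buffer on every iteration, so the
// threads serialize on the allocator's lock.
void cloneLoop(const cv::Mat & src)
{
  for (int i = 0; i < 10000; ++i)
  {
    cv::Mat copy = src.clone();
  }
}

// Variant 2: memcpy into a buffer allocated once per thread; no allocator
// contention, so all cores stay busy.
void memcpyLoop(const cv::Mat & src)
{
  std::vector<unsigned char> buffer(src.total() * src.elemSize());
  for (int i = 0; i < 10000; ++i)
  {
    std::memcpy(buffer.data(), src.data, buffer.size());
  }
}

int main()
{
  const cv::Mat src(1024, 1024, CV_8UC1, cv::Scalar(0));

  std::vector<std::thread> workers;
  for (int t = 0; t < 56; ++t)
  {
    workers.emplace_back(cloneLoop, std::cref(src)); // swap in memcpyLoop to compare
  }
  for (auto & w : workers)
  {
    w.join();
  }
  return 0;
}
```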
For the OpenCV-ITK conversion, since my images are fixed-size, I can allocate memory at the beginning so there’s no need to allocate new memory on each function call.
But for other simple image processing tasks in ITK (for example BinaryThresholdImageFilter or RGBToLuminanceImageFilter), do I have a choice? It seems the ITK architecture uses a filter object to do the transformation and returns the resulting image, which I think means memory is allocated inside the object. Is there an alternative, for example passing pre-allocated memory and asking the function/object to do the transform on that chunk of memory?
Some filters have an “InPlace” option, which enables the filter to use the same buffer for the input and the output.
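For example, a brief sketch with itk::BinaryThresholdImageFilter, which derives from itk::InPlaceImageFilter (the image size and thresholds are arbitrary assumptions):

```cpp
#include "itkBinaryThresholdImageFilter.h"
#include "itkImage.h"

int main()
{
  using ImageType = itk::Image<unsigned char, 2>;
  using FilterType = itk::BinaryThresholdImageFilter<ImageType, ImageType>;

  ImageType::Pointer input = ImageType::New();
  ImageType::SizeType size;
  size.Fill(256);
  input->SetRegions(size);
  input->Allocate(true); // allocate and zero-initialize (stand-in for real data)

  FilterType::Pointer threshold = FilterType::New();
  threshold->SetInput(input);
  threshold->SetLowerThreshold(10);
  threshold->SetUpperThreshold(200);

  // Reuse the input buffer for the output instead of allocating a new one.
  // Note: the input's pixel data is overwritten when the filter runs in place.
  threshold->InPlaceOn();
  threshold->Update();

  ImageType::Pointer output = threshold->GetOutput();
  return 0;
}
```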
There are some implementations of “malloc” which are designed for improved threaded performance. The default may vary significantly between OSes and compilers. I have read that Intel offers a “scalable” malloc: https://software.intel.com/en-us/node/506096 There may be other options available.
From the ITK architecture side, I have pondered the usefulness of a pluggable “ITK Allocator” which could be used for allocating image buffers. If this were a programmable and modular part, interesting algorithms could be exploited. For image buffers, it may be a common pattern for the same-size image to be allocated and de-allocated over and over during a pipeline, so it may be advantageous to develop an allocator which does not immediately release the buffer to the OS, but instead holds on to it to be reused the next time.
@HsuehWen Well done on the investigation, and thanks for sharing your findings!
For ITK filter development, while a different allocator could be helpful, this performance issue emphasizes the need to avoid de-allocating and re-allocating data structure memory buffers in filters when possible.
In terms of pipeline memory management, there is a ReleaseData flag that can be set on itk::ProcessObject instances to control the tradeoff between the amount of memory used and avoiding memory re-allocations. ReleaseData is disabled by default.
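A brief sketch of where that flag sits in a two-filter pipeline (the filters and image size are arbitrary placeholders):

```cpp
#include "itkBinaryThresholdImageFilter.h"
#include "itkImage.h"
#include "itkMedianImageFilter.h"

int main()
{
  using ImageType = itk::Image<unsigned char, 2>;

  ImageType::Pointer input = ImageType::New();
  ImageType::SizeType size;
  size.Fill(256);
  input->SetRegions(size);
  input->Allocate(true);

  auto median = itk::MedianImageFilter<ImageType, ImageType>::New();
  median->SetInput(input);

  auto threshold = itk::BinaryThresholdImageFilter<ImageType, ImageType>::New();
  threshold->SetInput(median->GetOutput());

  // Default (ReleaseData off): the intermediate median output buffer is kept,
  // so a later re-execution does not have to re-allocate it.
  // Turning the flag on releases that buffer after use to save memory:
  // median->ReleaseDataFlagOn();

  threshold->Update();
  return 0;
}
```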
@HsuehWen To achieve your goal of using pre-allocated memory, combine the itk::ImportImageFilter with the GraftOutput method of the filter at the end of the pipeline. This will tell the filter to write into the provided image’s buffer.
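A minimal sketch of that pattern, assuming a fixed-size 8-bit image whose result buffer is allocated once by the caller (the sizes and the threshold filter are placeholders):

```cpp
#include <cstddef>
#include <vector>

#include "itkBinaryThresholdImageFilter.h"
#include "itkImage.h"
#include "itkImportImageFilter.h"

int main()
{
  constexpr unsigned int Dimension = 2;
  using PixelType = unsigned char;
  using ImageType = itk::Image<PixelType, Dimension>;
  using ImportType = itk::ImportImageFilter<PixelType, Dimension>;

  const std::size_t width = 512, height = 512;
  std::vector<PixelType> buffer(width * height); // allocated once, reused for every frame

  // Wrap the caller-owned buffer in an ITK image without copying.
  ImportType::IndexType start;
  start.Fill(0);
  ImportType::SizeType size;
  size[0] = width;
  size[1] = height;
  const ImportType::RegionType region(start, size);

  ImportType::Pointer importer = ImportType::New();
  importer->SetRegion(region);
  importer->SetImportPointer(buffer.data(), buffer.size(),
                             false); // false: ITK must not free the buffer
  importer->Update();

  // Stand-in for the real pipeline input.
  ImageType::Pointer input = ImageType::New();
  input->SetRegions(region);
  input->Allocate(true);

  using FilterType = itk::BinaryThresholdImageFilter<ImageType, ImageType>;
  FilterType::Pointer threshold = FilterType::New();
  threshold->SetInput(input);

  // Graft the imported image onto the filter's output so the filter writes
  // directly into the pre-allocated buffer instead of allocating a new one.
  threshold->GraftOutput(importer->GetOutput());
  threshold->Update();
  return 0;
}
```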
I have some questions about properly enabling threading support in ITK. Based on the above comments and the ITK documentation, I have tried both of the following settings, and neither helps me get ITK filters to run on all cores.
Environment variables ITK_USE_THREADPOOL and ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS
itk::MultiThreader::SetGlobalDefaultUseThreadPool(true);
itk::MultiThreaderBase::SetGlobalDefaultNumberOfThreads( n );
I also tried setting the number of threads directly on the filter, but that doesn’t help either.
I have built ITK 4.13.1 from source and enabled FFTW usage using the flags ITK_USE_FFTWF and ITK_USE_SYSTEM_FFTW.
Enable TBB (Intel Threading Building Blocks) support in your build. Even without this, the default thread pool and thread usage have improved dramatically since ITK 4.
What specific algorithms are you trying to get to run on all your cores? FFTs? Some algorithms are only implemented single-threaded, and others don’t scale too well. How many cores are you trying to scale to, and how big is your data?