The recent merging of the thread pools seems impressively successful on the dashboard.
So I am tyring to figure out the impact of this recent ThreadPool topic. I am seeing the following things:
New ITK_USE_THREAD_POOL environment variable controls the use of classic threading vs thread pool, defaults to on
The default number of “threads” has been changed from the number of virtual cores to the number of physical cores
With the thread pool, when ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS is set it determines the number of chunks not the number of threads
Similarly, with the thread pool the ProcessObject::SetNumberOfThreads sets the number of chunks not the number of threads used.
I find this current implementation is mixing up number of threads and number of chunks for an algorithm. These need to be separately managed IMHO. Hopefully, this is just the start, and there is a plane to ensure the filters and users have control with the number of threads used and how the jobs is chucked, alone with control to determine if all threads run concurrently.
Additionally, when working with large images image filters I found that the virtual cores generally improved the performance of the filtering algorithms and never did it hurt. There could be a variety of reasons of this, including the level of optimization of the code and how the images were chucked. I’d like to run the timing experiment of running #physical_cores threads with #virutal_cores chuncks, vs #virtual_cores threads with #virtual_cores chunks to determine if the improve performance if from chunking or better CPU utilization by utilizing the vcores. How can I do that with the current implementation?
@dzenanz is working with Jared to change how chunking is handled – changes are expected.
Yes, the default number of threads could likely be refined, and your help in testing will be appreciated. It will be more worthwhile to fine tune this after chunking has been updated. We are also planning on improving benchmark result visualization, which will be helpful.
I look forward to seeing a prototype of the improved chunking models. I have gone through quite a number of use cases of the ImageSplitter, and have tweaked many pipelines for performance. I look forward to lending a hand in designing a flexible interface that can meet many people needs and provide accessible options to get desired behavior of individual filters and entire pipelines. Please let me know what I can do before the implementation get too concrete.
I think I just moved that code from MultiThreader to ThreadPool. If I changed the semantic, it is entirely accidental. But why do you think so? The below test returns 24 for both measures:
If you want non-default number of threads, invoke the test executable with ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS environment variable set to the desired number, e.g.:
The commit message was supposed to contrast N (now) vs 1 (before) as number of threads at startup.
The GetNumberOfPhysicalCPU in itksys::SystemInformation seems to be a misnomer. I suppose that was done to give old code (which was written before hyper-threading) a chance to use the new virtual cores.
The comment “Avoid using hyper thread” does not help either
Looking at that quoted function above, I am not sure the result variable is really used. It looks like the “num” is most likely used from the code block above. There is also a itksys::SystemInformation::GetNumberOfLogicalCPU() method which would clarify the code, and should work across platforms.
Are you aware of why this code is like this? Or even if the GetGlobalDefaultNumberOfThreadsByPlatform method could just utilizes calls to itksys::System
I just took a fast look at some of it. I notice on macOS that semaphore_create() is used. Did you consider using GCD’s dispatch_semaphore_create() instead?
During the v4 effort I recall having a hack of the MultiThreader to uses Apples Grand Central Dispatch. As I recall there were some impressive results for some applications. I found this branch on GitHub which may be some early version of it:
If we are now accepting change in the ITK threading model, to the thread pool model where we cover threads to jobs asynchronously executed, And willing to consider system specific back ends the the multi-thread it may be worth looking into GCD again…
@sean for your specific proposal are you suggesting missing pthreads with GCD?
Here are the declarations in SystemInformation.cxx:
unsigned int GetNumberOfLogicalCPU(); // per physical cpu
unsigned int GetNumberOfPhysicalCPU();
It looks like this code is not just old, but also heavily backwards compatible
With ThreadPool patch merged and tested for a few days with nightly submissions, we could turn it off again by default, until we reach consensus on what should be the default. @matt.mccormick@fbudin can you pitch in?
The was the issue with the itkBarrierTest, the test assumed 4 cores and it was being run on a machine with 2 cores, resulting in a failure. Hence there was the follow-up. I will submit a patch to turn it off by default.