New Thread Pool Update

The recent merge of the thread pool changes seems impressively successful on the dashboard.

So I am trying to figure out the impact of this recent ThreadPool change. I am seeing the following things:

  1. New ITK_USE_THREAD_POOL environment variable controls the use of classic threading vs thread pool, defaults to on
  2. The default number of “threads” has been changed from the number of virtual cores to the number of physical cores
  3. With the thread pool, when ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS is set, it determines the number of chunks, not the number of threads
  4. Similarly, with the thread pool, ProcessObject::SetNumberOfThreads sets the number of chunks, not the number of threads used.

I find that the current implementation mixes up the number of threads and the number of chunks for an algorithm. These need to be managed separately, IMHO. Hopefully this is just the start, and there is a plan to ensure that filters and users have control over the number of threads used and how the job is chunked, along with control over whether all threads run concurrently.
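
To make the current semantics concrete, here is a minimal sketch of where the threads-versus-chunks distinction shows up in user code. This is my own illustration under the semantics described above, not code from the patch; the filter choice and image size are arbitrary:

#include "itkImage.h"
#include "itkMedianImageFilter.h"
#include <cstdlib>

int main()
{
  typedef itk::Image<float, 3> ImageType;

  // A dummy input image, just large enough to be split into several chunks.
  ImageType::Pointer image = ImageType::New();
  ImageType::SizeType size;
  size.Fill(128);
  image->SetRegions(size);
  image->Allocate(true);

  typedef itk::MedianImageFilter<ImageType, ImageType> FilterType;
  FilterType::Pointer filter = FilterType::New();
  filter->SetInput(image);

  // With the classic MultiThreader this limits the worker threads; with the
  // new ThreadPool it only controls how many chunks the output region is
  // split into -- the pool keeps its own (physical-core based) thread count.
  filter->SetNumberOfThreads(8);

  filter->Update();
  return EXIT_SUCCESS;
}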

Additionally, when working with image filters on large images, I found that using the virtual cores generally improved the performance of the filtering algorithms and never hurt it. There could be a variety of reasons for this, including the level of optimization of the code and how the images were chunked. I’d like to run a timing experiment comparing #physical_cores threads with #virtual_cores chunks versus #virtual_cores threads with #virtual_cores chunks, to determine whether the improved performance comes from chunking or from better CPU utilization via the virtual cores. How can I do that with the current implementation?

@dzenanz is working with Jared to change how chunking is handled – changes are expected.

Yes, the default number of threads could likely be refined, and your help in testing will be appreciated. It will be more worthwhile to fine tune this after chunking has been updated. We are also planning on improving benchmark result visualization, which will be helpful.

I look forward to seeing a prototype of the improved chunking models. I have gone through quite a number of use cases of the ImageSplitter, and have tweaked many pipelines for performance. I would be glad to lend a hand in designing a flexible interface that can meet many people’s needs and provide accessible options to get the desired behavior from individual filters and entire pipelines. Please let me know what I can do before the implementation gets too concrete.

The first step is to iteratively make improvements that are empirically shown to improve performance. Then, the interface can be refined.

I think I just moved that code from MultiThreader to ThreadPool. If I changed the semantics, it was entirely accidental. But why do you think so? The test below returns 24 for both measures:

#include "itkImage.h"
#include <iostream>

int main(int argc, char *argv[])
{
    std::cout << "GlobalDefaultNumberOfThreads=" <<
        itk::MultiThreader::GetGlobalDefaultNumberOfThreads() << std::endl;

    std::cout << "ThreadPool threads=" <<
        itk::ThreadPool::GetInstance()->GetNumberOfCurrentlyIdleThreads() << std::endl;

    return EXIT_SUCCESS;
}

on a processor with 12 physical cores and 2 threads per core:

[dzenan@kauai-kitware-com threadCountITK]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               1199.960
BogoMIPS:              5205.43
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23

If you want a non-default number of threads, invoke the test executable with the ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS environment variable set to the desired number, e.g.:

ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=12 ./tester 24
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=24 ./tester 24
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=24 ./tester 12
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=12 ./tester 12
ITK_USE_THREADPOOL=OFF ./tester 24
ITK_USE_THREADPOOL=OFF ./tester 12

and tester.cpp might have this:

unsigned int threadCount = atoi(argv[1]);
// sets the number of chunks if the thread pool is in use
filter->SetNumberOfThreads(threadCount);

This should give you results for all the cases you wanted to examine.
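
For completeness, a full tester.cpp could look roughly like the sketch below. This is my own fleshed-out version of the snippet above; the choice of itk::DiscreteGaussianImageFilter, the image size, and the use of itk::TimeProbe are illustrative assumptions, not part of the original snippet:

#include "itkImage.h"
#include "itkDiscreteGaussianImageFilter.h"
#include "itkTimeProbe.h"
#include <cstdlib>
#include <iostream>

int main(int argc, char *argv[])
{
  if (argc < 2)
  {
    std::cerr << "Usage: " << argv[0] << " <threadCount>" << std::endl;
    return EXIT_FAILURE;
  }
  unsigned int threadCount = atoi(argv[1]);

  // A synthetic input image, large enough for the timing to be meaningful.
  typedef itk::Image<float, 3> ImageType;
  ImageType::Pointer image = ImageType::New();
  ImageType::SizeType size;
  size.Fill(256);
  image->SetRegions(size);
  image->Allocate(true);

  typedef itk::DiscreteGaussianImageFilter<ImageType, ImageType> FilterType;
  FilterType::Pointer filter = FilterType::New();
  filter->SetInput(image);
  filter->SetVariance(2.0);
  // sets the number of chunks if the thread pool is in use
  filter->SetNumberOfThreads(threadCount);

  // Time a single Update() so the different thread/chunk combinations
  // set via the environment variables can be compared.
  itk::TimeProbe probe;
  probe.Start();
  filter->Update();
  probe.Stop();
  std::cout << "Elapsed: " << probe.GetTotal() << " s" << std::endl;

  return EXIT_SUCCESS;
}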


I saw this line:

And the commit message says “Now ThreadPool starts with number of threads equal to number of CPU cores”, as opposed to virtual cores.

The commit message was supposed to contrast N (now) vs 1 (before) as number of threads at startup.

The GetNumberOfPhysicalCPU in itksys::SystemInformation seems to be a misnomer. I suppose that was done to give old code (which was written before hyper-threading) a chance to use the new virtual cores.

The comment “Avoid using hyper thread” does not help either :sweat_smile:

Looking at the quoted function above, I am not sure the result variable is really used. It looks like “num” from the code block above is most likely what gets used. There is also an itksys::SystemInformation::GetNumberOfLogicalCPU() method which would clarify the code, and it should work across platforms.

Are you aware of why this code is like this? Or whether the GetGlobalDefaultNumberOfThreadsByPlatform method could just use calls to itksys::SystemInformation?
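
For reference, a minimal sketch of querying itksys::SystemInformation directly. This is my own illustration of the suggestion, not the actual GetGlobalDefaultNumberOfThreadsByPlatform implementation:

#include <itksys/SystemInformation.hxx>
#include <iostream>

int main()
{
  itksys::SystemInformation info;
  info.RunCPUCheck(); // populate the CPU-related fields

  // Naming follows the kwsys declarations; see the discussion about whether
  // "physical" here really means physical cores.
  std::cout << "GetNumberOfPhysicalCPU: " << info.GetNumberOfPhysicalCPU()
            << std::endl;
  std::cout << "GetNumberOfLogicalCPU:  " << info.GetNumberOfLogicalCPU()
            << std::endl;
  return 0;
}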

I just took a quick look at some of it. I noticed that on macOS semaphore_create() is used. Did you consider using GCD’s dispatch_semaphore_create() instead?
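
For readers not familiar with the two APIs, here is a minimal macOS-only sketch showing both calls side by side. It is purely illustrative and not taken from the ITK sources:

#include <mach/mach.h>
#include <dispatch/dispatch.h>

int main()
{
  // Mach kernel semaphore, as created by semaphore_create() in the current code.
  semaphore_t machSem;
  semaphore_create(mach_task_self(), &machSem, SYNC_POLICY_FIFO, 0);
  semaphore_signal(machSem);
  semaphore_wait(machSem);
  semaphore_destroy(mach_task_self(), machSem);

  // GCD counting semaphore, the suggested alternative.
  dispatch_semaphore_t gcdSem = dispatch_semaphore_create(0);
  dispatch_semaphore_signal(gcdSem);
  dispatch_semaphore_wait(gcdSem, DISPATCH_TIME_FOREVER);
  dispatch_release(gcdSem);

  return 0;
}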

Cheers,

During the v4 effort I recall having a hack of the MultiThreader to use Apple’s Grand Central Dispatch. As I recall, there were some impressive results for some applications. I found this branch on GitHub which may be an early version of it:

If we are now accepting a change in the ITK threading model to the thread pool model, where we convert work into jobs that are executed asynchronously, and are willing to consider system-specific back ends for the multi-threading, it may be worth looking into GCD again…

@sean, for your specific proposal, are you suggesting mixing pthreads with GCD?
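
To make the idea concrete, here is a rough sketch of what dispatching chunk jobs through GCD’s plain C API could look like from C++. This is a hypothetical illustration of the back-end idea, not code from the old branch:

#include <dispatch/dispatch.h>
#include <cstdio>

struct JobContext
{
  int chunkIndex;
};

// The work function each job runs; stands in for a filter's per-chunk work.
static void ProcessChunk(void *context)
{
  JobContext *job = static_cast<JobContext *>(context);
  std::printf("processing chunk %d\n", job->chunkIndex);
}

int main()
{
  dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
  dispatch_group_t group = dispatch_group_create();

  const int numberOfChunks = 8; // analogous to the "number of chunks" above
  JobContext jobs[numberOfChunks];
  for (int i = 0; i < numberOfChunks; ++i)
  {
    jobs[i].chunkIndex = i;
    dispatch_group_async_f(group, queue, &jobs[i], ProcessChunk);
  }

  // Wait until all submitted jobs have finished, as the pool does today.
  dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
  dispatch_release(group);
  return 0;
}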

Here are the declarations in SystemInformation.cxx:

unsigned int GetNumberOfLogicalCPU(); // per physical cpu
unsigned int GetNumberOfPhysicalCPU();

It looks like this code is not just old, but also heavily backwards compatible :smiley:

With the ThreadPool patch merged and tested for a few days with nightly submissions, we could turn it off again by default until we reach consensus on what the default should be. @matt.mccormick @fbudin can you pitch in?

It could indeed be interesting to look at using GCD, but I was not in fact suggesting anything that ambitious.

I was only wondering about the use of semaphore_create(), and why the author chose that instead of dispatch_semaphore_create().

Sean

:tophat: Hats off to you, @dzenanz, that there were no dashboard issues after the merge! Yes, I think we could turn it off again by default.

There was an issue with the itkBarrierTest: the test assumed 4 cores and was being run on a machine with 2 cores, resulting in a failure. Hence the follow-up. I will submit a patch to turn it off by default.

Patch submitted:
http://review.source.kitware.com/#/c/22752/
