New Thread Pool Update

The recent merge of the thread pool changes seems impressively successful on the dashboard.

So I am trying to figure out the impact of this recent ThreadPool change. I am seeing the following things:

  1. New ITK_USE_THREAD_POOL environment variable controls the use of classic threading vs thread pool, defaults to on
  2. The default number of “threads” has been changed from the number of virtual cores to the number of physical cores
  3. With the thread pool, when ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS is set, it determines the number of chunks, not the number of threads
  4. Similarly, with the thread pool, ProcessObject::SetNumberOfThreads sets the number of chunks, not the number of threads used.

I find that the current implementation mixes up the number of threads and the number of chunks for an algorithm. These need to be managed separately, IMHO. Hopefully this is just the start, and there is a plan to ensure that filters and users have control over the number of threads used and how the job is chunked, along with control over whether all threads run concurrently.
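
To make the current semantics concrete, here is a minimal sketch of where the threads-versus-chunks distinction shows up in user code. This is my own illustration under the semantics described above, not code from the patch; the filter choice and image size are arbitrary:

#include "itkImage.h"
#include "itkMedianImageFilter.h"
#include <cstdlib>

int main()
{
  typedef itk::Image<float, 3> ImageType;

  // A dummy input image, just large enough to be split into several chunks.
  ImageType::Pointer image = ImageType::New();
  ImageType::SizeType size;
  size.Fill(128);
  image->SetRegions(size);
  image->Allocate(true);

  typedef itk::MedianImageFilter<ImageType, ImageType> FilterType;
  FilterType::Pointer filter = FilterType::New();
  filter->SetInput(image);

  // With the classic MultiThreader this limits the worker threads; with the
  // new ThreadPool it only controls how many chunks the output region is
  // split into -- the pool keeps its own (physical-core based) thread count.
  filter->SetNumberOfThreads(8);

  filter->Update();
  return EXIT_SUCCESS;
}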

Additionally, when working with image filters on large images, I found that using the virtual cores generally improved the performance of the filtering algorithms and never hurt it. There could be a variety of reasons for this, including the level of optimization of the code and how the images were chunked. I’d like to run a timing experiment comparing #physical_cores threads with #virtual_cores chunks versus #virtual_cores threads with #virtual_cores chunks, to determine whether the improved performance comes from chunking or from better CPU utilization via the virtual cores. How can I do that with the current implementation?

@dzenanz is working with Jared to change how chunking is handled – changes are expected.

Yes, the default number of threads could likely be refined, and your help in testing will be appreciated. It will be more worthwhile to fine tune this after chunking has been updated. We are also planning on improving benchmark result visualization, which will be helpful.

I look forward to seeing a prototype of the improved chunking models. I have gone through quite a number of use cases of the ImageSplitter, and have tweaked many pipelines for performance. I would be glad to lend a hand in designing a flexible interface that can meet many people’s needs and provide accessible options to get the desired behavior from individual filters and entire pipelines. Please let me know what I can do before the implementation gets too concrete.

The first step is to iteratively make improvements that are empirically shown to improve performance. Then, the interface can be refined.

I think I just moved that code from MultiThreader to ThreadPool. If I changed the semantics, it was entirely accidental. But why do you think so? The test below returns 24 for both measures:

#include "itkImage.h"
#include <iostream>

int main(int argc, char *argv[])
{
    std::cout << "GlobalDefaultNumberOfThreads=" <<
        itk::MultiThreader::GetGlobalDefaultNumberOfThreads() << std::endl;

    std::cout << "ThreadPool threads=" <<
        itk::ThreadPool::GetInstance()->GetNumberOfCurrentlyIdleThreads() << std::endl;

    return EXIT_SUCCESS;
}

on a processor with 12 physical cores and 2 threads per core:

[dzenan@kauai-kitware-com threadCountITK]$ lscpu
Architecture:          x86_64
CPU op-mode(s):        32-bit, 64-bit
Byte Order:            Little Endian
CPU(s):                24
On-line CPU(s) list:   0-23
Thread(s) per core:    2
Core(s) per socket:    6
Socket(s):             2
NUMA node(s):          2
Vendor ID:             GenuineIntel
CPU family:            6
Model:                 62
Model name:            Intel(R) Xeon(R) CPU E5-2630 v2 @ 2.60GHz
Stepping:              4
CPU MHz:               1199.960
BogoMIPS:              5205.43
Virtualization:        VT-x
L1d cache:             32K
L1i cache:             32K
L2 cache:              256K
L3 cache:              15360K
NUMA node0 CPU(s):     0-5,12-17
NUMA node1 CPU(s):     6-11,18-23

If you want a non-default number of threads, invoke the test executable with the ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS environment variable set to the desired number, e.g.:

ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=12 ./tester 24
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=24 ./tester 24
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=24 ./tester 12
ITK_GLOBAL_DEFAULT_NUMBER_OF_THREADS=12 ./tester 12
ITK_USE_THREADPOOL=OFF ./tester 24
ITK_USE_THREADPOOL=OFF ./tester 12

and tester.cpp might have this:

unsigned int threadCount = atoi(argv[1]);
// sets the number of chunks if the thread pool is in use
filter->SetNumberOfThreads(threadCount);

This should give you results for all the cases you wanted to examine.
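
For completeness, a full tester.cpp could look roughly like the sketch below. This is my own fleshed-out version of the snippet above; the choice of itk::DiscreteGaussianImageFilter, the image size, and the use of itk::TimeProbe are illustrative assumptions, not part of the original snippet:

#include "itkImage.h"
#include "itkDiscreteGaussianImageFilter.h"
#include "itkTimeProbe.h"
#include <cstdlib>
#include <iostream>

int main(int argc, char *argv[])
{
  if (argc < 2)
  {
    std::cerr << "Usage: " << argv[0] << " <threadCount>" << std::endl;
    return EXIT_FAILURE;
  }
  unsigned int threadCount = atoi(argv[1]);

  // A synthetic input image, large enough for the timing to be meaningful.
  typedef itk::Image<float, 3> ImageType;
  ImageType::Pointer image = ImageType::New();
  ImageType::SizeType size;
  size.Fill(256);
  image->SetRegions(size);
  image->Allocate(true);

  typedef itk::DiscreteGaussianImageFilter<ImageType, ImageType> FilterType;
  FilterType::Pointer filter = FilterType::New();
  filter->SetInput(image);
  filter->SetVariance(2.0);
  // sets the number of chunks if the thread pool is in use
  filter->SetNumberOfThreads(threadCount);

  // Time a single Update() so the different thread/chunk combinations
  // set via the environment variables can be compared.
  itk::TimeProbe probe;
  probe.Start();
  filter->Update();
  probe.Stop();
  std::cout << "Elapsed: " << probe.GetTotal() << " s" << std::endl;

  return EXIT_SUCCESS;
}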


I saw this line:

And the commit message says “Now ThreadPool starts with number of threads equal to number of CPU cores”, as opposed to virtual cores.

The commit message was supposed to contrast N (now) vs 1 (before) as number of threads at startup.

The GetNumberOfPhysicalCPU in itksys::SystemInformation seems to be a misnomer. I suppose that was done to give old code (which was written before hyper-threading) a chance to use the new virtual cores.

The comment “Avoid using hyper thread” does not help either :sweat_smile:

Looking at the quoted function above, I am not sure the result variable is really used. It looks like “num” from the code block above is most likely what gets used. There is also an itksys::SystemInformation::GetNumberOfLogicalCPU() method which would clarify the code, and it should work across platforms.

Are you aware of why this code is like this? Or whether the GetGlobalDefaultNumberOfThreadsByPlatform method could just use calls to itksys::SystemInformation?
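
For reference, a minimal sketch of querying itksys::SystemInformation directly. This is my own illustration of the suggestion, not the actual GetGlobalDefaultNumberOfThreadsByPlatform implementation:

#include <itksys/SystemInformation.hxx>
#include <iostream>

int main()
{
  itksys::SystemInformation info;
  info.RunCPUCheck(); // populate the CPU-related fields

  // Naming follows the kwsys declarations; see the discussion about whether
  // "physical" here really means physical cores.
  std::cout << "GetNumberOfPhysicalCPU: " << info.GetNumberOfPhysicalCPU()
            << std::endl;
  std::cout << "GetNumberOfLogicalCPU:  " << info.GetNumberOfLogicalCPU()
            << std::endl;
  return 0;
}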

I just took a quick look at some of it. I noticed that on macOS semaphore_create() is used. Did you consider using GCD’s dispatch_semaphore_create() instead?
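
For readers not familiar with the two APIs, here is a minimal macOS-only sketch showing both calls side by side. It is purely illustrative and not taken from the ITK sources:

#include <mach/mach.h>
#include <dispatch/dispatch.h>

int main()
{
  // Mach kernel semaphore, as created by semaphore_create() in the current code.
  semaphore_t machSem;
  semaphore_create(mach_task_self(), &machSem, SYNC_POLICY_FIFO, 0);
  semaphore_signal(machSem);
  semaphore_wait(machSem);
  semaphore_destroy(mach_task_self(), machSem);

  // GCD counting semaphore, the suggested alternative.
  dispatch_semaphore_t gcdSem = dispatch_semaphore_create(0);
  dispatch_semaphore_signal(gcdSem);
  dispatch_semaphore_wait(gcdSem, DISPATCH_TIME_FOREVER);
  dispatch_release(gcdSem);

  return 0;
}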

Cheers,

During the v4 effort I recall having a hack of the MultiThreader to use Apple’s Grand Central Dispatch. As I recall, there were some impressive results for some applications. I found this branch on GitHub which may be an early version of it:

If we are now accepting a change in the ITK threading model to the thread pool model, where we convert work into jobs that are executed asynchronously, and are willing to consider system-specific back ends for the multi-threading, it may be worth looking into GCD again…

@sean, for your specific proposal, are you suggesting mixing pthreads with GCD?
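
To make the idea concrete, here is a rough sketch of what dispatching chunk jobs through GCD’s plain C API could look like from C++. This is a hypothetical illustration of the back-end idea, not code from the old branch:

#include <dispatch/dispatch.h>
#include <cstdio>

struct JobContext
{
  int chunkIndex;
};

// The work function each job runs; stands in for a filter's per-chunk work.
static void ProcessChunk(void *context)
{
  JobContext *job = static_cast<JobContext *>(context);
  std::printf("processing chunk %d\n", job->chunkIndex);
}

int main()
{
  dispatch_queue_t queue =
    dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
  dispatch_group_t group = dispatch_group_create();

  const int numberOfChunks = 8; // analogous to the "number of chunks" above
  JobContext jobs[numberOfChunks];
  for (int i = 0; i < numberOfChunks; ++i)
  {
    jobs[i].chunkIndex = i;
    dispatch_group_async_f(group, queue, &jobs[i], ProcessChunk);
  }

  // Wait until all submitted jobs have finished, as the pool does today.
  dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
  dispatch_release(group);
  return 0;
}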

Here are the declarations in SystemInformation.cxx:

unsigned int GetNumberOfLogicalCPU(); // per physical cpu
unsigned int GetNumberOfPhysicalCPU();

It looks like this code is not just old, but also heavily backwards compatible :smiley:

With the ThreadPool patch merged and tested for a few days with nightly submissions, we could turn it off again by default until we reach consensus on what the default should be. @matt.mccormick @fbudin can you pitch in?

It could indeed be interesting to look at using GCD, but I was not in fact suggesting anything that ambitious.

I was only wondering about the use of semaphore_create(), and why the author chose that instead of dispatch_semaphore_create().

Sean

:tophat: Hats off to you, @dzenanz, that there were no dashboard issues after the merge! Yes, I think we could turn it off again by default.

There was an issue with the itkBarrierTest: the test assumed 4 cores and was being run on a machine with 2 cores, resulting in a failure. Hence the follow-up. I will submit a patch to turn it off by default.

Patch submitted:
http://review.source.kitware.com/#/c/22752/
