Yes, I think the last paragraph is central. Think at a higher, task-based level and let the load balancer do the scheduling (and the assignment to physical threads and cores) for you.
The overhead of the real-time profiling is likely minimal. Of course, in some particular cases this strategy may reduce performance, because the load balancer needs a short amount of time to sample performance and adapt (although it probably starts with default settings that are not that bad). But those problems were probably not very good candidates for parallelism in the first place.
Doing optimal manual scheduling for all present and future hardware is not that easy. For example, should the granularity be the same when vector registers are 512 bits (e.g. Xeon Phi / Skylake / Cannon Lake) versus 256 bits? Not sure (and it probably depends on whether the code is highly vectorized or not). Can we predict whether our code will play nicely with hyperthreading (i.e. depending on memory accesses / external resources), that is, whether we should put two tasks per core? Not sure either.
I like TBB's idea of introducing the higher-level concept of tasks. If you know that your problem does not scale well (or will have negative scalability), just create one big task. If it is expected to scale, decompose it into many small tasks, and batches of them will be automatically mapped (without any magic) onto physical threads.
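To make that concrete, here is a minimal sketch (plain TBB, not ITK code, and the data size and loop body are just placeholders) of what I mean: the work is expressed as a range, the library cuts it into tasks, and the scheduler maps batches of those tasks onto the physical threads.

```cpp
// Sketch: task-based decomposition with TBB. We describe the work as a
// range of independent iterations; the default auto_partitioner splits it
// into tasks and the scheduler balances them across cores, so we never
// assign work to a particular thread ourselves.
#include <tbb/parallel_for.h>
#include <tbb/blocked_range.h>
#include <vector>
#include <cmath>

int main()
{
  std::vector<double> data(1 << 20, 1.0); // placeholder workload

  tbb::parallel_for(
    tbb::blocked_range<std::size_t>(0, data.size()),
    [&](const tbb::blocked_range<std::size_t> & r) {
      // Each task processes one chunk of the range.
      for (std::size_t i = r.begin(); i != r.end(); ++i)
      {
        data[i] = std::sqrt(data[i]);
      }
    });

  return 0;
}
```

If you knew the problem would not scale, you could pass a grain size covering the whole range (effectively one big task), but the point is that by default you do not have to make that decision yourself.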
I understand that ITK has a long history of thinking in terms of threads. I think it wouldn't be a bad idea to add this vision of tasks, too. But this is maybe for another thread.