Multi-threader refactoring

Yes, I think the last paragraph is central. Think at a higher, task-based level and let the load balancer do the scheduling (and the assignment to physical threads and cores) for you.

The overhead of the real-time profiling is likely minimal. Of course, for some particular cases this strategy may reduce performance, because the load balancer likely needs a short amount of time to sample performance and adapt (although it probably starts with default settings that are not that bad). But those problems were likely not very good candidates for parallelism in the first place.

Doing optimal manual scheduling for all present and future hardware is not that easy. For example, should the granularity be the same when vector registers are 512 bits (e.g., Xeon Phi/Skylake/Cannon Lake) versus 256 bits? Not sure (and it probably depends on whether the code is highly vectorized or not). Can we predict whether our code will play nicely with hyper-threading (i.e., depending on memory accesses / external resources), that is, whether we should put two tasks per core? Not sure either.

I like the idea of TBB to introduce the higher-level concept of tasks. If you know that your problem does not scale well (or will have negative scalability), just create one big task. If it is expected to scale, decompose it into a lot of small tasks, and batches of them will be automatically mapped (without any magic) onto physical threads.
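For illustration, here is a minimal, hypothetical sketch of that idea with tbb::parallel_for: the range is decomposed into many small sub-ranges (tasks), and the scheduler maps batches of them onto worker threads. ProcessBuffer and the per-element work are made up for the example.

```cpp
#include <cstddef>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

// Hypothetical example: express the work as many small tasks (sub-ranges)
// and let the TBB scheduler map batches of them onto physical threads.
void ProcessBuffer(float * data, std::size_t size)
{
  tbb::parallel_for(tbb::blocked_range<std::size_t>(0, size),
                    [data](const tbb::blocked_range<std::size_t> & r) {
                      for (std::size_t i = r.begin(); i != r.end(); ++i)
                      {
                        data[i] *= 2.0f; // placeholder per-element work
                      }
                    });
}
```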

I understand that ITK has a long history of thinking in terms of threads. I think it wouldn't be a bad idea to add this vision of tasks, too. But this is maybe for another thread :slight_smile:


Yes, task-based parallelism is better than low-level thread-based parallelism. Who is suggesting otherwise? :man_shrugging:

@warfield Do you have a proposal for what control of resources we should expose?

We should provide an opportunity to control the initialization of the task scheduler. This allows for determining when the task scheduler is created and destroyed, how many threads the task scheduler uses, and the stack size for worker threads. This should most often be done with sensible default options, but for the special cases where fine control is needed, it should be exposed.

https://software.intel.com/en-us/node/506296
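For example, with the (classic) tbb::task_scheduler_init interface that the link above documents, the lifetime, thread count and worker stack size can be controlled roughly like this; the concrete numbers are arbitrary:

```cpp
#include <tbb/task_scheduler_init.h>

int main()
{
  // Explicit control for special cases: 4 worker threads and an 8 MB stack
  // per worker. The scheduler lives as long as 'init' is in scope; with no
  // arguments, TBB would pick sensible defaults automatically.
  tbb::task_scheduler_init init(4, 8 * 1024 * 1024);

  // ... run TBB-based code here ...

  return 0;
}
```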

This patch also adds control of the thread count for the TBB back-end. Direct link to patch set 2 (might get outdated).

About your implementation @dzenanz: that's great that you are using tbb::proportional_split in TBBImageRegionSplitter so that multiple batches of splits along the highest dimension (e.g., slices in 3D) can be scheduled together when the load balancer figures out this will reduce overhead.

However, it seems that the splitting can only be done along the largest dimension. Is that right? Correct me if I'm wrong, but if this is the case, it will be an important limitation on many-core architectures with a small-ish number of slices.

For example, Xeon Phi KNL has ~60 cores / 240 hardware threads. It also becomes realistic to have SMP architectures with 2 sockets and a lot of cores in each, for example 24 cores per socket (so typically a 48-core machine with 96 hardware threads; the Stampede2 supercomputer at TACC actually has… 1736 of these machines).

Unfortunately, a lot of medical images have fewer than ~60 slices of useful information (e.g., 2 mm slice-thickness diffusion MRI brain images). Moreover, some slices typically contain less information (e.g., the inferior and superior parts of the brain after brain extraction). If each task is a slice, there will be a critical imbalance in computational complexity between slices, which will kill efficiency (whether or not you use a smart scheduler). Some slices may be finished very quickly and the cores will then sit unused.

To avoid that, we introduced in the ITK paper a splitting that can split along more than one dimension (by lines, for example). This produces a much higher number of smaller tasks for the scheduler, giving more freedom to the load balancer and ensuring that the computational resources are maximally used. However, in the ITK paper we did not use proportional_split.

The ideal filter for high scalability would definitely need to combine both approaches, i.e., use proportional_split and allow a finer decomposition (for example, into lines). The problem is that this requires special care when converting a series of lines into an itk::ImageRegion.

If we think of a 3-D volume, there are 3 possible configurations:

  1. All the lines are on one slice. In that case the corresponding itk::ImageRegion is easy to construct.
  2. The first series of lines is on one slice and the remaining lines are on the next slice. In that case we need to build 2 regions and run ThreadedGenerateData twice.
  3. The first series of lines is on one slice, then a bunch of lines cover multiple whole slices, and then the remaining lines follow. In that case we need to create 3 itk::ImageRegion objects and run ThreadedGenerateData on each (3 calls).

This would require writing a function that creates the right number of N-D itk::ImageRegion objects (and makes the right number of calls to ThreadedGenerateData) in the general case.
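As a rough sketch (assuming lines run along the X axis and are numbered slice by slice; LinesToRegions is a hypothetical name, not an existing ITK function), such a helper could look like this for the 3-D case, returning the regions on which ThreadedGenerateData would then be invoked one by one:

```cpp
#include "itkImageRegion.h"
#include <algorithm>
#include <vector>

// Hypothetical helper: convert a half-open range of lines [firstLine, lastLine)
// of a 3-D region (lines run along X and are numbered slice by slice) into at
// most three itk::ImageRegion<3> objects, one per configuration listed above.
std::vector<itk::ImageRegion<3>>
LinesToRegions(const itk::ImageRegion<3> & wholeRegion,
               itk::SizeValueType firstLine, itk::SizeValueType lastLine)
{
  std::vector<itk::ImageRegion<3>> regions;
  const itk::SizeValueType linesPerSlice = wholeRegion.GetSize(1);

  // Build one rectangular region from a line range that either stays inside
  // a single slice or covers whole slices (guaranteed by the caller below).
  auto makeRegion = [&](itk::SizeValueType beginLine, itk::SizeValueType endLine) {
    itk::ImageRegion<3> region = wholeRegion;
    itk::Index<3>       index = wholeRegion.GetIndex();
    itk::Size<3>        size = wholeRegion.GetSize();
    index[1] += beginLine % linesPerSlice;
    index[2] += beginLine / linesPerSlice;
    if (endLine - beginLine < linesPerSlice)
    { // a block of lines within one slice
      size[1] = endLine - beginLine;
      size[2] = 1;
    }
    else
    { // a block of whole slices
      size[2] = (endLine - beginLine) / linesPerSlice;
    }
    region.SetIndex(index);
    region.SetSize(size);
    regions.push_back(region);
  };

  itk::SizeValueType current = firstLine;
  // 1) leading lines that do not start on a slice boundary
  if (current < lastLine && current % linesPerSlice != 0)
  {
    const itk::SizeValueType sliceEnd =
      std::min(lastLine, (current / linesPerSlice + 1) * linesPerSlice);
    makeRegion(current, sliceEnd);
    current = sliceEnd;
  }
  // 2) whole middle slices
  const itk::SizeValueType wholeEnd =
    current + ((lastLine - current) / linesPerSlice) * linesPerSlice;
  if (wholeEnd > current)
  {
    makeRegion(current, wholeEnd);
    current = wholeEnd;
  }
  // 3) trailing lines on the last (partial) slice
  if (lastLine > current)
  {
    makeRegion(current, lastLine);
  }
  return regions;
}
```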

If we need to define priorities, the highest priority may be to decompose into a larger number of smaller tasks rather than to use proportional_split, because even without a smart load balancer the cores are going to be heavily underused on many-core machines if only slice decomposition is available.

What do you think?


The decomposition is done along the highest-index dimension which can be split. For the 3D case, splitting is done along Z until all that is left is one slice. But that slice can be further split into lines. The minimum unit into which an image can be split is a single pixel. Writing and using this multi-threader has forced me to fix some bugs arising from assumptions that a whole line along the X axis will be part of a single outputRegionForThread. Here are some hypothetical requested splits (as TBB might ask for them), and the approximate result my implementation would provide (a sketch of the rule follows the table):

RegionSize, Ratio, OneRegion, AnotherRegion
90x80x100,  1to10, 90x80x10,  90x80x90
90x80x3,    1to10, 90x80x1,   90x80x2
90x80x1,    1to10, 90x8x1,    90x72x1
90x1x1,     1to10, 9x1x1,     81x1x1
9x1x1,      1to10, 1x1x1,     8x1x1
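Not the actual TBBImageRegionSplitter code, but the rule the table illustrates could be sketched roughly like this (the rounding may differ slightly from the approximate numbers above, and a region with more than one pixel along the chosen dimension is assumed):

```cpp
#include "itkImageRegion.h"
#include <algorithm>

// Rough sketch of the splitting rule illustrated above: split along the
// highest-index dimension whose size is greater than 1, in approximately the
// requested left:right proportion, keeping at least one unit on each side.
void
ProportionalSplit(const itk::ImageRegion<3> & region,
                  unsigned int leftPart, unsigned int rightPart,
                  itk::ImageRegion<3> & leftRegion, itk::ImageRegion<3> & rightRegion)
{
  leftRegion = region;
  rightRegion = region;

  unsigned int dim = 2;
  while (dim > 0 && region.GetSize(dim) <= 1) // find highest splittable dimension
  {
    --dim;
  }

  const itk::SizeValueType size = region.GetSize(dim);
  itk::SizeValueType leftSize = (size * leftPart) / (leftPart + rightPart);
  leftSize = std::max<itk::SizeValueType>(1, std::min(leftSize, size - 1));

  leftRegion.SetSize(dim, leftSize);
  rightRegion.SetSize(dim, size - leftSize);
  rightRegion.SetIndex(dim, region.GetIndex(dim) +
                              static_cast<itk::IndexValueType>(leftSize));
}
```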

Oh wow, that looks like good work.


Hello,

Is setting the number of threads/work units working in 5.0b01? I've tried calling itk::MultiThreaderBase::SetGlobalMaximumNumberOfThreads() at the start of my program, and alternatively ->SetNumberOfWorkUnits() on my filter. In this case I was trying to set the number of threads to 1 for debugging, but in both cases the filter still ran with multiple threads.

Both of those should be working, but they control different things, ever since threads and work units were split. Controlling NumberOfWorkUnits might not be implemented consistently in all filters. Which filter was problematic?
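For reference, a minimal sketch of both knobs (the filter type here is just an arbitrary example):

```cpp
#include "itkImage.h"
#include "itkMedianImageFilter.h"
#include "itkMultiThreaderBase.h"

int main()
{
  // Cap the number of threads for every filter in this process, e.g. to step
  // through DynamicThreadedGenerateData in a debugger.
  itk::MultiThreaderBase::SetGlobalMaximumNumberOfThreads(1);

  using ImageType = itk::Image<float, 3>;
  auto filter = itk::MedianImageFilter<ImageType, ImageType>::New();

  // Alternatively, limit only how many work units this particular filter's
  // requested region is split into.
  filter->SetNumberOfWorkUnits(1);

  return 0;
}
```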

Hi @dzenanz - it's my own filter. Code is here: https://github.com/spinicist/QUIT/blob/master/Source/Core/ModelFitFilter.h#L262


Here is a common pattern in classes that invoke their own Parallelize* method. This pattern comes from ImageSource. At a glance, your class seems to miss this.
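Roughly (paraphrased from ITK 5's ImageSource<TOutputImage>::GenerateData(); the classic-threading branch and other details are omitted here):

```cpp
// Inside a filter's GenerateData(), paraphrased from ImageSource in ITK 5:
this->AllocateOutputs();
this->BeforeThreadedGenerateData();

// Propagate the work-unit count set on the filter to its threader, then let
// the threader split the requested region and call the per-region method.
this->GetMultiThreader()->SetNumberOfWorkUnits(this->GetNumberOfWorkUnits());
this->GetMultiThreader()->template ParallelizeImageRegion<OutputImageDimension>(
  this->GetOutput()->GetRequestedRegion(),
  [this](const OutputImageRegionType & outputRegionForThread) {
    this->DynamicThreadedGenerateData(outputRegionForThread);
  },
  this);

this->AfterThreadedGenerateData();
```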

But since you have only one parallel section, and you already call it DynamicThreadedGenerateData, why not rely on ImageSource to properly invoke it for you? The check you are currently doing in GenerateData could be delegated to BeforeThreadedGenerateData. So you could basically rename GenerateData to BeforeThreadedGenerateData and remove the ParallelizeImageRegion call.


Thanks @dzenanz. My memory of why I'm doing it this way is that it's what you suggested on another thread, or something like that :wink:

Looking at the code, I think I still need to do it this way because that code isn't just a check - this filter also lets you set a subregion and then only process that subregion (but the output will still be the full image, with anything outside the subregion zeroed). I know this is a corner case; my application is slightly outside what ITK was intended for.

Out of interest, why is the this->GetMultiThreader()->SetNumberOfWorkUnits(this->GetNumberOfWorkUnits()); call necessary? It seems counter-intuitive to set the number of work units to the current number of work units?

That call is necessary in case NumberOfWorkUnits was set on the filter since construction or since the last Update call. Without it, the filter would always run with the default number of work units, unless the filter's threader was modified directly.


Thanks again @dzenanz, I think it's working now
