Multi-threader refactoring

About your implementation @dzenanz: it’s great that you are using tbb::proportional_split in TBBImageRegionSplitter, so that multiple splits along the highest dimension (e.g., slices in 3D) can be batched and scheduled together when the load balancer determines this will reduce overhead.
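
For readers less familiar with the TBB side, here is a minimal, hypothetical sketch of what a proportionally-splittable slice range can look like. The class name SliceRange and every detail below are my own illustration under the documented TBB Range concept, not the actual TBBImageRegionSplitter code:

```cpp
#include <tbb/tbb.h>
#include <algorithm>
#include <cstdio>

// Hypothetical range over slice indices [m_Begin, m_End) that also supports
// tbb::proportional_split, so a partitioner can hand out several slices at once.
class SliceRange
{
public:
  SliceRange(int begin, int end) : m_Begin(begin), m_End(end) {}

  // Basic TBB Range requirements.
  bool empty() const { return m_Begin >= m_End; }
  bool is_divisible() const { return m_End - m_Begin > 1; }

  // Even split.
  SliceRange(SliceRange & other, tbb::split)
    : m_Begin((other.m_Begin + other.m_End) / 2), m_End(other.m_End)
  {
    other.m_End = m_Begin;
  }

  // Proportional split, which partitioners may use to take a batch of slices.
  static constexpr bool is_splittable_in_proportion = true;
  SliceRange(SliceRange & other, tbb::proportional_split & p)
  {
    const int total = other.m_End - other.m_Begin;
    int right = static_cast<int>(static_cast<double>(total) * p.right() /
                                 (p.left() + p.right()));
    right = std::max(1, std::min(right, total - 1));
    m_End = other.m_End;
    m_Begin = other.m_End - right;
    other.m_End = m_Begin;
  }

  int begin() const { return m_Begin; }
  int end() const { return m_End; }

private:
  int m_Begin;
  int m_End;
};

int main()
{
  // Each task would process the slices [r.begin(), r.end()), e.g. by calling
  // ThreadedGenerateData on the corresponding sub-region.
  tbb::parallel_for(SliceRange(0, 60), [](const SliceRange & r) {
    std::printf("processing slices [%d, %d)\n", r.begin(), r.end());
  });
  return 0;
}
```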

However, it seems that the splitting can only be done along the largest dimension. Is that right? Correct me if I’m wrong, but if this is the case, it will be an important limitation on many-core architectures when the number of slices is small-ish.

For example, a Xeon Phi KNL has ~60 cores/240 hardware threads. It also becomes realistic to have SMP architectures with 2 sockets and a lot of cores in each, for example 24 cores per socket (so typically a 48-core machine with 96 hardware threads; the Stampede2 supercomputer at TACC actually has… 1736 of these machines).

Unfortunately, many medical images have fewer than ~60 slices of useful information (e.g., 2 mm slice-thickness diffusion MRI brain images). Moreover, some slices typically contain less information (e.g., the inferior and superior parts of the brain after brain extraction). If each task is a slice, there will be a critical imbalance in computational load between slices, which will kill efficiency (whether or not you use a smart scheduler): some slices may finish very quickly, and the corresponding cores will then sit unused.

To avoid that, we introduced in the ITK paper a splitting that can split along more than one dimension (by lines, for example). This produces a much larger number of smaller tasks for the scheduler, giving more freedom to the load balancer and ensuring that the computational resources are maximally used. However, in the ITK paper we did not use proportional_split.
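
To give a rough idea of the difference in task counts (the image size below is just an example I picked for illustration):

```cpp
#include <cstdio>

int main()
{
  const unsigned size[3] = {256, 256, 60}; // x, y, z (example only)

  const unsigned sliceTasks = size[2];          // slice decomposition: 60 tasks
  const unsigned lineTasks = size[1] * size[2]; // line decomposition: 15360 tasks

  // A linear line index maps back to a (y, z) position within the volume:
  const unsigned line = 5000;
  const unsigned y = line % size[1];
  const unsigned z = line / size[1];
  std::printf("%u slice tasks vs %u line tasks; line %u -> (y=%u, z=%u)\n",
              sliceTasks, lineTasks, line, y, z);
  return 0;
}
```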

The ideal filter for high scalability would definitely need to combine both approaches, i.e., use proportional_split and allow a finer decomposition (for example, into lines). The problem is that converting a series of lines into an itk::ImageRegion requires special care.

If we think of a 3-D volume, there are 3 possible configurations:

  1. All the lines are on one slice. In that case the corresponding itk::ImageRegion is easy to construct.
  2. The first series of lines is on one slice and the remaining lines are on the next slice. In that case we need to build 2 regions and run ThreadedGenerateData twice.
  3. The first series of lines is on one slice, then a bunch of lines covers one or more complete slices, and then the remaining lines are on a final slice. In that case we need to create 3 itk::ImageRegion objects and run ThreadedGenerateData on each (3 calls).

This would require writing a function that creates the right number of N-D itk::ImageRegion objects (and makes the right number of calls to ThreadedGenerateData) in the general case.
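
Here is a rough sketch of the kind of helper I have in mind, for the 3-D case only. The function name, the line ordering (x fastest, line index = y + ySize·z relative to the region start) and the return type are hypothetical choices made for illustration:

```cpp
#include <itkImageRegion.h>
#include <algorithm>
#include <vector>

using RegionType = itk::ImageRegion<3>;

// Convert a contiguous range of line indices into the minimal set of regions.
std::vector<RegionType>
LineRangeToRegions(const RegionType & whole, long firstLine, long numberOfLines)
{
  const long xSize = static_cast<long>(whole.GetSize()[0]);
  const long ySize = static_cast<long>(whole.GetSize()[1]); // lines per slice

  std::vector<RegionType> regions;
  long line = firstLine;
  long remaining = numberOfLines;

  while (remaining > 0)
  {
    const long z = line / ySize; // slice containing the current line
    const long y = line % ySize; // line offset within that slice

    RegionType::IndexType index = whole.GetIndex();
    RegionType::SizeType  size;
    index[1] += y;
    index[2] += z;
    size[0] = xSize;

    if (y == 0 && remaining >= ySize)
    {
      // One or more complete slices: cover them with a single region
      // (the middle block of configuration 3).
      const long fullSlices = remaining / ySize;
      size[1] = ySize;
      size[2] = fullSlices;
      line += fullSlices * ySize;
      remaining -= fullSlices * ySize;
    }
    else
    {
      // Partial slice: the rest of the current slice, or fewer lines if we run out.
      const long linesHere = std::min(remaining, ySize - y);
      size[1] = linesHere;
      size[2] = 1;
      line += linesHere;
      remaining -= linesHere;
    }

    RegionType region;
    region.SetIndex(index);
    region.SetSize(size);
    regions.push_back(region); // one ThreadedGenerateData call per region
  }
  return regions;
}
```

The loop produces at most three regions for a 3-D volume, matching configurations 1–3 above, and the caller would invoke ThreadedGenerateData once per returned region; the same loop structure should generalize to the N-D case.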

If we need to define priorities, the highest priority may be decomposing into a larger number of smaller tasks rather than using proportional_split, because even without a smart load balancer, the cores are going to be largely idle on many-core machines if only slice decomposition is available.

What do you think?
