Multi-threader refactoring

Doxygen is run only for master, not for proposed patches. But even for master, the Doxygen pages for MultiThreader and MultiThreaderBase are missing. And that is a mystery.

2 Likes

The major part of refactoring is now finished. It has been merged into master.

itk::TBBImageToImageFilter is now obsolete. Its functionality has been mostly replaced by itk::TBBMultiThreader.

However, due to time constraints, the refactoring is not fully complete. The most important remaining task is separating NumberOfWorkUnits (a filter property) from NumberOfThreads (a multi-threader property). Also, itk::TBBMultiThreader currently does not respect the number of threads. @warfield @benoit If somebody is willing to work on this, I can get them started.

1 Like

Great job Dzenan!

The trick to force the maximum number of threads used by TBB is to use

    // Set up the TBB task scheduler with an explicit thread limit
    tbb::task_scheduler_init tbb_init(MAX_NUMBER);

before the parallel_for.

See the doc here: Documentation Library

I agree this could be used to force the maximum number of threads used by ITK (defined by the global variable; I don’t remember its exact name).
However, with a modern scheduler such as TBB, you really should refrain from using a GetNumberOfThreads/SetNumberOfThreads mechanism. This is well summarized in the doc (link above):

The reason for not specifying the number of threads in production code is that in a large software project, there is no way for various components to know how many threads would be optimal for other threads. Hardware threads are a shared global resource. It is best to leave the decision of how many threads to use to the task scheduler.

1 Like

I think it is quite common to need to limit the number of threads for a task. It could be that you only want one thread, or that the task does not scale well and you want to limit it to four. This is different from the global number of threads the scheduler manages. The current ITK names, in their prior implementation, really refer to the number of threads for a task and not a global limit.

In the old days, people thought they could increase performance by keeping track of where threads are assigned, knowing the scalability of each implementation of each algorithm on each piece of hardware, knowing what other people are running on the hardware, and then optimizing. Nowadays, when the same code may be running on a laptop or a 48-core Xeon processor, with vastly different scaling considerations, and with jobs of different priority potentially scheduled on the same hardware, the practical solution to optimal performance is to have the task scheduler schedule tasks onto threads, and then onto CPU cores, to minimize the total run time for you. If you attempt to do this yourself, your best outcome is that you are as good as the task scheduler. All the other times, you are worse than the task scheduler.

In TBB, the scheduler knows if you should be running on one CPU only, or should be running on four, and adjusts that for you.

From the Intel TBB guide book:

There are a variety of approaches to parallel programming, ranging from using platform-dependent threading primitives to exotic new languages. The advantage of Intel Threading Building Blocks is that it works at a higher level than raw threads, yet does not require exotic languages or compilers. You can use it with any compiler supporting ISO C++. The library differs from typical threading packages in the following ways:

Intel Threading Building Blocks enables you to specify logical parallelism instead of threads. Most threading packages require you to specify threads. Programming directly in terms of threads can be tedious and lead to inefficient programs, because threads are low-level, heavy constructs that are close to the hardware. Direct programming with threads forces you to efficiently map logical tasks onto threads. In contrast, the Intel Threading Building Blocks run-time library automatically maps logical parallelism onto threads in a way that makes efficient use of processor resources.

Intel Threading Building Blocks targets threading for performance. Most general-purpose threading packages support many different kinds of threading, such as threading for asynchronous events in graphical user interfaces. As a result, general-purpose packages tend to be low-level tools that provide a foundation, not a solution. Instead, Intel Threading Building Blocks focuses on the particular goal of parallelizing computationally intensive work, delivering higher-level, simpler solutions.

1 Like

The ability to limit the number of threads is a requirement.

The scheduler is not magic, and it cannot predict which tasks will have negative scalability. And these do exist in ITK. The refactoring does not yet provide persistent resources for a thread during a task, so they may be re-allocated for each chunk of a task, which is not good for scalability compared to fixed splitting.

I regularly limit the thread count to 1 to assist with debugging and algorithm analysis. How can you analyze the scalability of any algorithm if you have no control over the resources?

There are also cases where the program is executing on a shared resource, and the number of threads must be limited.

I could go on about use cases.

TBB seems to readily support it in a couple of ways:

I still can’t get ITK with TBB to link properly with applications that use ITK w/ TBB.

What is needed, and goes hand in hand with controlling the number of threads, is splitting the notion of work units from threads. This is a very logical split with a thread pool. And what people mostly want to lower is the number of work units, not threads.

1 Like

I see three potential “user tweak-ables”:

  1. Yes, limiting the number of “work units” is desired.

  2. Limit the global number of threads available to ITK tasks - ITKv4 did not have this; executing more tasks just resulted in more threads being utilized. This “tweak-able” maps very well to the thread pool concept, where there is a limited number of threads that get allocated to tasks. Some higher-level thread libraries allow for labeled thread pools for given “tasks”; it may be a desirable feature to give ITK a labeled pool and limit its resources. These features would clearly be at the MultiThreader interface, and not at the filter.

  3. And I still think that limiting the number of threads per “task” or filter is needed.

ITK is not a turn-key system. It is a toolkit for assembling algorithms and applications. As such, we have always catered to full developer control and the ability to experiment.

I’ll also mention that we have/had a “SingleThread” multi-threader in ITK; this may be a good time to revive it and perhaps ensure that it can be utilized either globally, configured per filter, or automatically when threads/tasks/muffins/work units are set to 1.

Controlling the number of threads is an important feature of the scheduler, for example when measuring scalability, or perhaps when debugging. Therefore, the scheduler can be initialized in such a way that it limits the maximum number of threads:
https://software.intel.com/en-us/node/506297

The rest of the API shouldn’t be polluted with information that is not needed, such as the thread id, or the number of threads that exist.

The TBB scheduler doesn’t have to guess what is more efficient. It can monitor utilization and know what is more efficient. That is like magic to a thread programmer.

From the documentation:
The threads you create with a threading package are logical threads, which map onto the physical threads of the hardware. For computations that do not wait on external devices, highest efficiency usually occurs when there is exactly one running logical thread per physical thread. Otherwise, there can be inefficiencies from the mismatch. Undersubscription occurs when there are not enough running logical threads to keep the physical threads working. Oversubscription occurs when there are more running logical threads than physical threads. Oversubscription usually leads to time sliced execution of logical threads, which incurs overheads as discussed in Appendix A, Costs of Time Slicing. The scheduler tries to avoid oversubscription, by having one logical thread per physical thread, and mapping tasks to logical threads, in a way that tolerates interference by other threads from the same or other processes.

The key advantage of tasks versus logical threads is that tasks are much lighter weight than logical threads. On Linux systems, starting and terminating a task is about 18 times faster than starting and terminating a thread. On Windows systems, the ratio is more than 100. This is because a thread has its own copy of a lot of resources, such as register state and a stack. On Linux, a thread even has its own process id. A task in Intel® Threading Building Blocks, in contrast, is typically a small routine, and also, cannot be preempted at the task level (though its logical thread can be preempted).

Tasks in Intel Threading Building Blocks are efficient too because the scheduler is unfair. Thread schedulers typically distribute time slices in a round-robin fashion. This distribution is called “fair”, because each logical thread gets its fair share of time. Thread schedulers are typically fair because it is the safest strategy to undertake without understanding the higher-level organization of a program. In task-based programming, the task scheduler does have some higher-level information, and so can sacrifice fairness for efficiency. Indeed, it often delays starting a task until it can make useful progress.

The scheduler does load balancing. In addition to using the right number of threads, it is important to distribute work evenly across those threads. As long as you break your program into enough small tasks, the scheduler usually does a good job of assigning tasks to threads to balance load. With thread-based programming, you are often stuck dealing with load-balancing yourself, which can be tricky to get right.

TIP
Design your programs to try to create many more tasks than there are threads, and let the task scheduler choose the mapping from tasks to threads.

Finally, the main advantage of using tasks instead of threads is that they let you think at a higher, task-based, level. With thread-based programming, you are forced to think at the low level of physical threads to get good efficiency, because you have one logical thread per physical thread to avoid undersubscription or oversubscription. You also have to deal with the relatively coarse grain of threads. With tasks, you can concentrate on the logical dependences between tasks, and leave the efficient scheduling to the scheduler.

1 Like

Yes, I think the last paragraph is central: think at a higher, task-based level and let the load balancer do the scheduling (and the assignment to physical threads and cores) for you.

The overhead of the real-time profiling is likely minimal. Of course, for some particular cases this strategy may reduce performance, because the load balancer likely needs a short amount of time to sample performance and adapt (although it probably starts with default settings that are not that bad). But those problems were likely not very good candidates for parallelism in the first place.

Doing optimal manual scheduling for all present and future hardware is not that easy. For example, should the granularity be the same when vector registers are 512 bits (e.g., Xeon Phi/Skylake/Cannon Lake) versus 256 bits? Not sure (and it probably depends on whether the code is highly vectorized or not). Can we predict whether our code will play nicely with hyperthreading (i.e., depending on memory accesses / external resources), i.e., whether we should put two tasks per core? Not sure either.

I like the idea of TBB to introduce the higher-level concept of tasks. If you know that your problem does not scale well (or will have negative scalability), just do one big task. If it is expected to scale, decompose it into a lot of small tasks, and then batches of them will be automatically mapped (without any magic) onto physical threads.

I understand that ITK has a long history of thinking in terms of threads. I think it wouldn’t be a bad idea to add this vision of tasks, too. But this is maybe for another thread :slight_smile:

1 Like

Yes, task-based parallelism is better than low-level thread-based parallelism. Who is suggesting otherwise? :man_shrugging:

@warfield Do you have a proposal for what control of resources we should expose?

We should provide an opportunity to control the initialization of the task scheduler. This allows determining when the task scheduler is created and destroyed, how many threads it uses, and the stack size for worker threads. This should most often be done with sensible defaults, but for the special cases where fine control is needed, it should be exposed.

https://software.intel.com/en-us/node/506296

This patch also adds control of the thread count for the TBB back-end. Direct link to patch set 2 (might get outdated).

About your implementation @dzenanz: it’s great that you are using tbb::proportional_split in TBBImageRegionSplitter, so that multiple batches of splits along the highest dimension (e.g., slices in 3D) can be scheduled together when the load balancer figures out this will reduce overhead.

However, it seems that the splitting can only be done along the largest dimension. Is that right? Correct me if I’m wrong, but if this is the case, this will be an important limitation on many-core architectures with a small-ish number of slices.

For example, Xeon Phi KNL has ~60 cores / 240 hardware threads. It also becomes realistic to have SMP architectures with 2 sockets and a lot of cores in each, for example 24 cores each (so typically a 48-core machine with 96 hardware threads; the Stampede2 supercomputer at TACC actually has… 1736 of these machines).

Unfortunately, a lot of medical images have fewer than ~60 slices of useful information (e.g., 2 mm slice-thickness diffusion MRI brain images). Moreover, some slices typically have less information (e.g., the inferior and superior parts of the brain after brain extraction). If each task is a slice, there will be a critical imbalance in computational complexity between slices, which will kill efficiency (whether or not you use a smart scheduler): some slices may be finished very quickly, and then the cores will sit unused.

To avoid that, we introduced in the ITK paper a splitting that can split along more than one dimension (by lines, for example). This produces a much higher number of smaller tasks for the scheduler, giving more freedom to the load balancer and ensuring that the computational resources are maximally used. However, in the ITK paper we did not use proportional_split.

The ideal filter for high scalability would definitely need to combine both approaches, i.e., use proportional_split and allow a finer decomposition (for example, into lines). The problem is that this requires special care when converting a series of lines to an itk::ImageRegion.

If we think of a 3-D volume, there are 3 possible configurations:

  1. All the lines are on one slice. In that case the corresponding itk::ImageRegion is easy to construct.
  2. The first series of lines is on one slice and the remaining lines are on the next slice. In that case we need to build 2 regions and run ThreadedGenerateData twice.
  3. The first series of lines is on one slice, then a bunch of lines cover multiple whole slices, and then the remaining lines follow. In that case we need to create 3 itk::ImageRegion objects and run ThreadedGenerateData on each (3 calls).

This would require writing a function that creates the right number of N-D itk::ImageRegion objects (and makes the right number of calls to ThreadedGenerateData) in the general case.

If we need to define priorities, the highest priority may be to decompose into a larger number of smaller tasks rather than using proportional_split, because even without a smart load balancer the cores are going to be heavily underused on many-core machines if only slice decomposition is available.

What do you think?

2 Likes

The decomposition is done along the highest-index dimension which can be split. For the 3D case, splitting is done along Z until all that is left is one slice; but that slice can be further split into lines. The minimum unit into which an image can be split is a single pixel. Writing and using this multi-threader has forced me to fix some bugs arising from assumptions that a whole line along the X axis will be part of a single outputRegionForThread. Here are some hypothetical requested splits (as TBB might ask for them), and the approximate result my implementation would provide:

RegionSize, Ratio, OneRegion, AnotherRegion
90x80x100,  1to10, 90x80x10,  90x80x90
90x80x3,    1to10, 90x80x1,   90x80x2
90x80x1,    1to10, 90x8x1,    90x72x1
90x1x1,     1to10, 9x1x1,     81x1x1
9x1x1,      1to10, 1x1x1,     8x1x1
2 Likes

Oh wow, that looks like good work.

1 Like

Hello,

Is setting the number of threads/work units working in 5.0b01? I’ve tried calling itk::MultiThreaderBase::SetGlobalMaximumNumberOfThreads() at the start of my program, and alternately ->SetNumberOfWorkUnits() on my filter. In this case I was trying to set the number of threads to 1 for debugging, but in both cases the filter still ran with multiple threads.

Both of those should be working, but they control different things ever since threads and work units were split. Controlling NumberOfWorkUnits might not be implemented consistently in all filters. Which filter was problematic?

Hi @dzenanz - it’s my own filter. Code is here: https://github.com/spinicist/QUIT/blob/master/Source/Core/ModelFitFilter.h#L262

1 Like