Multi-threader refactoring

Threading infrastructure refactoring is entering its final stages. Feedback is welcome on this patch and on the few preceding patches that are part of the same branch.

Some outstanding TODOs:

  • remove the threadId parameter from the ParallelizeImageRegion method
  • separate NumberOfWorkUnits (a filter property) from NumberOfThreads (a multi-threader property) - currently underway
  • implement ParallelizeImageRegion in PoolMultiThreader

Feedback from Kris Campbell:

Please give some sort of overview of the responsibilities of the base class and the derived classes. Also, some indication of, or pointer to, examples of which class filters should use. For example, one question I have is: is it up to the filter writer, or the application writer gluing filters together, to decide which threading strategy to use? It is currently unclear what the recommended approach is.

Users should try to use only the base class. The base class will delegate work to derived classes as needed. If a filter strongly favors some kind of parallelization over others, the filter author might try to use that knowledge somehow. But mostly the choice should be left to the application developer, and the defaults should be good enough that the app developer does not have to fiddle with them until (and unless) they get to the profiling and optimization phase of development.
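As a rough illustration of that division of responsibility, here is a simplified model of programming against a base threader class that hands back a concrete implementation. These are not the actual ITK classes; all names and signatures here are illustrative only:

```cpp
// Simplified model (NOT the real ITK API): applications call only the base
// class; the base class's factory decides which derived threader to return.
#include <cassert>
#include <functional>
#include <memory>
#include <string>

class ThreaderBase
{
public:
  virtual ~ThreaderBase() = default;
  virtual std::string Name() const = 0;

  // Generic entry point a filter would call. The base class provides a
  // serial fallback; a real pool-based threader would override this.
  virtual void ParallelizeRange(int begin, int end,
                                const std::function<void(int)> & body)
  {
    for (int i = begin; i < end; ++i) body(i);
  }

  // Factory: selection of the concrete backend lives here, so callers
  // never need to name a derived class.
  static std::unique_ptr<ThreaderBase> New();
};

class PoolThreader : public ThreaderBase
{
public:
  std::string Name() const override { return "Pool"; }
};

std::unique_ptr<ThreaderBase> ThreaderBase::New()
{
  // Default selection logic would live here (hard-coded in this sketch).
  return std::make_unique<PoolThreader>();
}
```

In this model, only the factory knows about the available backends; filters and applications see nothing but the base interface.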

But I am not sure where we should write the above advice. The Software Guide? The documentation of MultiThreaderBase? Someplace else? Can others pitch in? @jhlegarreta @matt.mccormick @blowekamp @hjmjohnson @fbudin

But I am not sure where we should write the above advice

Documentation specific to the class could go in MultiThreaderBase’s Doxygen documentation.

A high level overview of how the threading works and how to develop for ITK’s infrastructure could go in the ITK Software Guide, Architecture Section and How To Write A Filter section.

This is looking awesome, Dzenan! :zap:

I quickly reviewed the patch. I still suggest implementing the region splitting over a basic array type instead of the C++ template itk::ImageRegion, so that a single function implementation can support all dimensions. The ImageRegionSplitterBase class provides an example of this approach.
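A minimal sketch of what such array-based splitting might look like (illustrative names and signature, not ITK's actual one): the region is described by plain index/size arrays, and one non-template function splits along the outermost splittable dimension, so every image dimension goes through the same code path:

```cpp
// Sketch: split piece `piece` out of `numberOfPieces` from an N-D region
// given as plain index/size arrays. Modifies index/size in place and
// returns how many pieces the region can actually support.
#include <cassert>

unsigned int SplitRegion(unsigned int dim, long index[], unsigned long size[],
                         unsigned int piece, unsigned int numberOfPieces)
{
  // Find the outermost dimension with more than one element.
  int splitAxis = static_cast<int>(dim) - 1;
  while (splitAxis >= 0 && size[splitAxis] <= 1) --splitAxis;
  if (splitAxis < 0) return 1; // nothing to split

  const unsigned long range = size[splitAxis];
  const unsigned int maxPieces =
    numberOfPieces < range ? numberOfPieces : static_cast<unsigned int>(range);
  if (piece >= maxPieces) return maxPieces;

  const unsigned long chunk = range / maxPieces;
  const unsigned long remainder = range % maxPieces;
  // Earlier pieces absorb the remainder, one extra element each.
  const unsigned long begin =
    piece * chunk + (piece < remainder ? piece : remainder);
  const unsigned long len = chunk + (piece < remainder ? 1 : 0);

  index[splitAxis] += static_cast<long>(begin);
  size[splitAxis] = len;
  return maxPieces;
}
```

Because the arrays carry the dimension at run time, one compiled function serves 2-D, 3-D, and higher-dimensional regions alike.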

What are the ways to control the default multi-threader? Perhaps there should be an environment variable that could be specified, e.g. ITK_DEFAULT_MULTITHREADER="CLASSIC" or "POOL" or "TBBPOOL"; there could be others: OpenMP, Grand Central Dispatch, etc.

If ITK is built with TBB support, TBBMultiThreader is the default. Otherwise the existing logic is used to decide between Pool and Classic.

But you are right, we could and should make this control less convoluted and more structurally similar to the code.
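The environment-variable idea above could be sketched like this. The variable name ITK_DEFAULT_MULTITHREADER comes from the suggestion in this thread; the helper function and its fallback behavior are hypothetical, not an existing ITK API:

```cpp
// Hypothetical selection logic: honor ITK_DEFAULT_MULTITHREADER if it is
// set to a recognized value, otherwise fall back to a compiled-in default
// (e.g. TBBPOOL when built with TBB, otherwise Pool/Classic logic).
#include <cassert>
#include <cstdlib>
#include <string>

std::string SelectDefaultThreader(const std::string & compiledDefault)
{
  const char * env = std::getenv("ITK_DEFAULT_MULTITHREADER");
  if (env != nullptr)
  {
    const std::string v(env);
    if (v == "CLASSIC" || v == "POOL" || v == "TBBPOOL")
    {
      return v;
    }
    // Unrecognized value: ignore it rather than fail.
  }
  return compiledDefault;
}
```

This keeps the decision overridable at run time without recompiling, which is handy for benchmarking the different backends against each other.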

+1 for what @matt.mccormick has suggested as suitable places, specifically:

  • II Architecture
    • 3.2 Essential System Concepts
      • 3.2.7 Multi-Threading
  • III Development Guidelines
    • 8 How To Write A Filter
      • 8.4 Threaded Filter Execution

Including some comments and mentions in 8.2 Overview of Filter Creation and 8.3 Streaming Large Data.

Also, there is a mention of multi-threading in 3.2.4 Smart Pointers and Memory Management.

Apologies for a newbie question - is there somewhere I can read the new Doxygen help that would be generated by this patch? It would be easier than trying to read through the patch itself.

(I’m really looking forward to this patch!)

@spinicist The Doxygen is generated nightly from ITK’s Git master branch. Unfortunately, though, it is currently mysteriously missing documentation for itk::MultiThreaderBase.

@matt.mccormick Isn’t that because @dzenanz’s patches haven’t been merged to master yet, and hence the nightly documentation won’t include it?

Sorry, that was kind of the point of my question - is Doxygen run for patches, or only for master? It’s a bit hard to get an overview of such a patch, particularly with the Gerrit interface (fingers crossed for the swap to GitHub soon).

Doxygen is run only for master, not for proposed patches. But even for master, the Doxygen page is missing for MultiThreader and MultiThreaderBase. And that is a mystery.

The major part of refactoring is now finished. It has been merged into master.

itk::TBBImageToImageFilter is now obsolete. Its functionality has mostly been replaced by itk::TBBMultiThreader.

However, due to time constraints, the refactoring has not been fully completed. The most important remaining item is separating NumberOfWorkUnits (a filter property) from NumberOfThreads (a multi-threader property). Also, itk::TBBMultiThreader currently does not respect the number of threads. @warfield @benoit If somebody is willing to work on this, I can get them started.

Great job Dzenan!

The trick to force the maximum number of threads used by TBB is to use
// Set up the TBB task_scheduler
tbb::task_scheduler_init tbb_init(MAX_NUMBER);

before the parallel_for.

see the doc there: Documentation Library

I agree this could be used to force the maximum number of threads used by ITK (defined by the global variable; I don’t remember its exact name).
However, with a modern scheduler such as TBB, you really should refrain from using a GetNumberOfThreads/SetNumberOfThreads mechanism. This is well synthesized in the doc (link above):

The reason for not specifying the number of threads in production code is that in a large software project, there is no way for various components to know how many threads would be optimal for other threads. Hardware threads are a shared global resource. It is best to leave the decision of how many threads to use to the task scheduler.

I think it is quite common to need to limit the number of threads for a task. It could be that you only want 1 thread, or that the task does not scale well and you want to limit it to 4. This is different from the global number of threads the scheduler manages. The current ITK names, in their prior implementation, really refer to the number of threads for a task and not a global limit.

In the old days, people thought they could increase performance by keeping track of where threads are assigned, knowing the scalability of each implementation of each algorithm on each piece of hardware, knowing what other people are running on the hardware, and then optimizing. Nowadays, where the same code may be running on a laptop or a 48-core Xeon processor, with vastly different scaling considerations, and with different jobs of different priority potentially scheduled on the same hardware, the practical solution for optimal performance is to have the task scheduler schedule tasks onto threads, and then onto CPU cores, to minimize the total run time for you. If you attempt to do this yourself, your best outcome is that you are as good as the task scheduler. All the other times, you are worse than the task scheduler.

In TBB, the scheduler knows whether you should be running on one CPU only, or should be running on four, and adjusts that for you.

From the Intel TBB guide book:

There are a variety of approaches to parallel programming, ranging from using platform-dependent threading primitives to exotic new languages. The advantage of Intel Threading Building Blocks is that it works at a higher level than raw threads, yet does not require exotic languages or compilers. You can use it with any compiler supporting ISO C++. The library differs from typical threading packages in the following ways:

Intel Threading Building Blocks enables you to specify logical parallelism instead of threads. Most threading packages require you to specify threads. Programming directly in terms of threads can be tedious and lead to inefficient programs, because threads are low-level, heavy constructs that are close to the hardware. Direct programming with threads forces you to efficiently map logical tasks onto threads. In contrast, the Intel Threading Building Blocks run-time library automatically maps logical parallelism onto threads in a way that makes efficient use of processor resources.

Intel Threading Building Blocks targets threading for performance. Most general-purpose threading packages support many different kinds of threading, such as threading for asynchronous events in graphical user interfaces. As a result, general-purpose packages tend to be low-level tools that provide a foundation, not a solution. Instead, Intel Threading Building Blocks focuses on the particular goal of parallelizing computationally intensive work, delivering higher-level, simpler solutions.

The ability to limit the number of threads is a requirement.

The scheduler is not magic, and cannot predict which tasks will have negative scalability. And these do exist in ITK. The refactoring does not yet provide persistent resources for a thread during a task, so they may be re-allocated for each chunk of a task, which hurts scalability compared to fixed splitting.
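The per-thread persistence point can be illustrated with a thread-local scratch buffer that survives across chunks, so it is allocated once per thread instead of once per chunk. This is a generic C++ sketch, not ITK code, and the function name is made up:

```cpp
// Sketch: a chunk-processing function that needs scratch storage.
// `thread_local` keeps the buffer alive between chunks on the same
// thread, so repeated chunks on that thread do not re-allocate it.
#include <cassert>
#include <cstddef>
#include <vector>

static int g_allocations = 0; // counts buffer growths, for demonstration

int ProcessChunk(std::size_t chunkSize)
{
  thread_local std::vector<double> scratch; // persists across chunks
  if (scratch.capacity() < chunkSize)
  {
    scratch.reserve(chunkSize);
    ++g_allocations;
  }
  scratch.assign(chunkSize, 1.0); // stand-in for per-chunk work

  int sum = 0;
  for (double v : scratch) sum += static_cast<int>(v);
  return sum;
}
```

With fixed splitting the same effect is achieved by sizing a per-thread resource array up front; without some form of persistence, every chunk pays the allocation cost again.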

I regularly limit the number of threads to 1 to assist with debugging and algorithm analysis. How can you analyze the scalability of any algorithm if you have no control over the resources?

There are also cases where the program is executing on a shared resource, and the number of threads must be limited.

I could go on about use cases.

TBB seems to readily support it in a couple of ways.

I still can’t get ITK with TBB to link properly with applications that use ITK w/ TBB.

What is needed, and goes hand in hand with controlling the number of threads, is splitting the notion of work units from threads. This is a very logical split with a thread pool. And what people mostly want to lower is the number of work units, not threads.
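A minimal sketch of that work-unit/thread split: the job is divided into work units independently of how many pool threads consume them. This is illustrative code, not ITK's API; the function and parameter names are invented:

```cpp
// Sketch: `numberOfWorkUnits` controls how the job is chunked;
// `numberOfThreads` controls how many pool threads consume the chunks.
// The two can be tuned independently, and the result is the same.
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

long RunWorkUnits(int numberOfWorkUnits, int numberOfThreads)
{
  std::atomic<int> nextUnit{0};
  std::atomic<long> total{0};

  auto worker = [&]() {
    int unit;
    // Each thread repeatedly grabs the next unprocessed work unit.
    while ((unit = nextUnit.fetch_add(1)) < numberOfWorkUnits)
    {
      total.fetch_add(unit); // stand-in for processing one chunk
    }
  };

  std::vector<std::thread> pool;
  for (int t = 0; t < numberOfThreads; ++t) pool.emplace_back(worker);
  for (auto & th : pool) th.join();
  return total.load();
}
```

Lowering the work-unit count makes chunks coarser (less scheduling overhead, worse load balance) without touching the thread count, which is exactly why the two knobs deserve separate names.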

I see three potential “user tweak-ables”:

  1. Yes, limiting the number of “work units” is desired.

  2. Limit the global number of threads for ITK tasks - ITKv4 did not have this; executing more tasks just resulted in more threads being utilized. This “tweak-able” maps very well to the thread pool concept, where there is a limited number of threads that get allocated to tasks. Some higher-level thread libraries allow labeled thread pools for given “tasks”; it may be a desirable feature to give ITK a labeled pool and limit its resources. These features would clearly be at the MultiThreader interface, and not at the filter.

  3. And I still think that limiting the number of threads per “task” or filter is needed.

ITK is not a turn-key system. It is a toolkit for assembling algorithms and applications. As such, we have always catered to full developer control and the ability to experiment.

I’ll also mention that we have/had a “SingleThread” multi-threader in ITK; this may be a good time to revive it, and perhaps ensure that it can be utilized either globally, per filter, or automatically when threads/tasks/muffins/work units are set to 1.

Controlling the number of threads is an important feature of the scheduler when measuring scalability, and perhaps when debugging. Therefore, the scheduler can be initialized in such a way that it limits the maximum number of threads:
https://software.intel.com/en-us/node/506297

The rest of the API shouldn’t be polluted with information that is not needed, such as the thread id, or the number of threads that exist.

The TBB scheduler doesn’t have to guess what is more efficient. It can monitor utilization and know what is more efficient. That is like magic to a thread programmer.

From the documentation:
The threads you create with a threading package are logical threads, which map onto the physical threads of the hardware. For computations that do not wait on external devices, highest efficiency usually occurs when there is exactly one running logical thread per physical thread. Otherwise, there can be inefficiencies from the mismatch. Undersubscription occurs when there are not enough running logical threads to keep the physical threads working. Oversubscription occurs when there are more running logical threads than physical threads. Oversubscription usually leads to time sliced execution of logical threads, which incurs overheads as discussed in Appendix A, Costs of Time Slicing. The scheduler tries to avoid oversubscription, by having one logical thread per physical thread, and mapping tasks to logical threads, in a way that tolerates interference by other threads from the same or other processes.

The key advantage of tasks versus logical threads is that tasks are much lighter weight than logical threads. On Linux systems, starting and terminating a task is about 18 times faster than starting and terminating a thread. On Windows systems, the ratio is more than 100. This is because a thread has its own copy of a lot of resources, such as register state and a stack. On Linux, a thread even has its own process id. A task in Intel® Threading Building Blocks, in contrast, is typically a small routine, and also, cannot be preempted at the task level (though its logical thread can be preempted).

Tasks in Intel Threading Building Blocks are efficient too because the scheduler is unfair. Thread schedulers typically distribute time slices in a round-robin fashion. This distribution is called “fair”, because each logical thread gets its fair share of time. Thread schedulers are typically fair because it is the safest strategy to undertake without understanding the higher-level organization of a program. In task-based programming, the task scheduler does have some higher-level information, and so can sacrifice fairness for efficiency. Indeed, it often delays starting a task until it can make useful progress.

The scheduler does load balancing. In addition to using the right number of threads, it is important to distribute work evenly across those threads. As long as you break your program into enough small tasks, the scheduler usually does a good job of assigning tasks to threads to balance load. With thread-based programming, you are often stuck dealing with load-balancing yourself, which can be tricky to get right.

TIP
Design your programs to try to create many more tasks than there are threads, and let the task scheduler choose the mapping from tasks to threads.

Finally, the main advantage of using tasks instead of threads is that they let you think at a higher, task-based, level. With thread-based programming, you are forced to think at the low level of physical threads to get good efficiency, because you have one logical thread per physical thread to avoid undersubscription or oversubscription. You also have to deal with the relatively coarse grain of threads. With tasks, you can concentrate on the logical dependences between tasks, and leave the efficient scheduling to the scheduler.
