Parsing a DICOM directory recursively and reading all DICOM series from it is a time-consuming operation. I recently updated my application to parallelize this operation at the thread level as much as possible, but I found that ITK has limited support for such parallelization.
This is the most parallelized algorithm I could implement:

Step 1: The main thread parses the DICOM directory recursively and creates an itk::GDCMSeriesFileNames object.

Step 2: The main thread then calls GDCMSeriesFileNames::GetFileNames() for each series in the object, copies the file names into a vector of strings, and inserts the vector into a queue.

Step 3: Multiple worker threads consume the queue and read the series in parallel, using thread-specific itk::ImageSeriesReader instances.
Note that I did Step 1 in the main thread so that only one copy of the itk::GDCMSeriesFileNames object is created. Step 2 also had to be done in the main thread, because the GetFileNames() routine isn't thread-safe.
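For concreteness, here is a minimal sketch of Steps 1 and 2 with a simple mutex-guarded queue. The SeriesQueue type and the EnqueueAllSeries function are illustrative names of my own, not part of ITK; a matching Step 3 worker appears later in the thread.

```cpp
#include <itkGDCMSeriesFileNames.h>

#include <mutex>
#include <queue>
#include <string>
#include <utility>
#include <vector>

// Illustrative thread-safe queue of per-series file-name lists; this helper
// is my own, not an ITK class.
struct SeriesQueue
{
  std::mutex                           Mutex;
  std::queue<std::vector<std::string>> Series;

  void Push(std::vector<std::string> fileNames)
  {
    std::lock_guard<std::mutex> lock(Mutex);
    Series.push(std::move(fileNames));
  }

  bool Pop(std::vector<std::string> & fileNames)
  {
    std::lock_guard<std::mutex> lock(Mutex);
    if (Series.empty())
    {
      return false;
    }
    fileNames = std::move(Series.front());
    Series.pop();
    return true;
  }
};

// Steps 1 and 2, executed on the main thread only.
void EnqueueAllSeries(const std::string & directory, SeriesQueue & queue)
{
  auto seriesFileNames = itk::GDCMSeriesFileNames::New();
  seriesFileNames->SetRecursive(true);           // Step 1: parse recursively
  seriesFileNames->SetUseSeriesDetails(true);
  seriesFileNames->SetInputDirectory(directory);

  // Step 2 stays serial because GetFileNames() is not thread-safe.
  for (const std::string & uid : seriesFileNames->GetSeriesUIDs())
  {
    // Copy the result: the returned reference points at a private member
    // that the next GetFileNames() call will overwrite.
    queue.Push(seriesFileNames->GetFileNames(uid));
  }
}
```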
Is a more parallel algorithm feasible with the current version of ITK?
If not, it would be great if we could update ITK so that Step 2 above, which is expensive because it sorts the files, is made thread-safe and can also be executed in parallel for different series in worker threads. As far as I can see, it's currently not thread-safe only because the GDCMSeriesFileNames::GetFileNames() routine stores the file names in a vector of strings, one of its private members.
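In the header, the relevant member looks roughly like this (my paraphrase of itkGDCMSeriesFileNames.h, not a verbatim copy; the member name may vary across ITK versions):

```cpp
// Paraphrase of itkGDCMSeriesFileNames.h (base class and most members
// omitted; comments mine):
class GDCMSeriesFileNames
{
  // ...
private:
  /** GetFileNames() writes its result here, so concurrent calls for
   *  different series race on this single shared vector. */
  FileNamesContainerType m_InputFileNames;
};
```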
Perhaps Step 2 can be made thread-safe by overloading the GDCMSeriesFileNames::GetFileNames() function so that it returns the file names in a caller-provided reference to a vector of strings?
My suggestion assumes that GDCM, specifically the gdcm::SerieHelper class and its routines, is thread-safe.
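Something along these lines is what I have in mind; the two-argument overload below is my proposal, not existing ITK API:

```cpp
// Hypothetical addition to itk::GDCMSeriesFileNames (NOT in ITK today).
// Instead of storing the sorted names in a private member, it would fill a
// container owned by the caller, so each worker thread can pass its own
// vector and no shared state is touched.
void
GetFileNames(const std::string &      serie,
             FileNamesContainerType & fileNames);
```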
Your parallelization approach sounds about right. Your plan to make Step 2 thread-safe looks good to me, but @mathieu.malaterre should confirm it is OK. @mihail.isakov and @blowekamp might want to pitch in too.
If your plan is implemented, we should add a new thread-safe signature for GetFileNames and delegate to it from the current method, but keep the current one for backwards compatibility. Add a note in its docstring that it is not thread-safe and that the new overload is preferred. I would not deprecate it.
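A sketch of what that delegation could look like (hypothetical code; m_InputFileNames stands in for whatever the private member is actually called):

```cpp
// Hypothetical sketch, not current ITK code.

/** Thread-safe variant: sorts the files of `serie` into the
 *  caller-provided container. */
void
GDCMSeriesFileNames::GetFileNames(const std::string &      serie,
                                  FileNamesContainerType & fileNames)
{
  // ... existing sorting logic, writing into fileNames ...
}

/** NOT thread-safe: the result lives in a private member that every call
 *  overwrites; prefer the two-argument overload. Kept (not deprecated)
 *  for backwards compatibility. */
const GDCMSeriesFileNames::FileNamesContainerType &
GDCMSeriesFileNames::GetFileNames(const std::string serie)
{
  this->GetFileNames(serie, m_InputFileNames); // delegate to the new variant
  return m_InputFileNames;
}
```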
I have had problems using GDCM concurrently in multiple threads, so I would be cautious about this approach.
I wrote a test in SimpleITK that concurrently tries to read some files, to check whether a specific ImageIO is thread-safe for concurrent reading.
GDCMImageIO failed this test, along with several other ImageIOs, so in general parallel reading of slices will not work. Additionally, many of the ImageIOs that do pass are for simpler file formats: a header followed by bulk data to read. When a reader is I/O-bound reading the bulk data, concurrent reading may not improve performance and could even hurt it.
This approach is also implemented in itk.js: Step 3 is parallel, but Step 2 is serial there as well. To accelerate Step 2, we may also want to parse only the tags required to sort the series.
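For example, GDCM's gdcm::Scanner can extract just a handful of tags per file, which should be much cheaper than fully parsing each file. A rough sketch, where the tag selection and the ScanSortTags name are illustrative:

```cpp
#include <gdcmScanner.h>
#include <gdcmTag.h>

#include <string>
#include <vector>

// Sketch: extract only the tags needed to group and sort slices, rather
// than fully parsing every file.
void
ScanSortTags(const std::vector<std::string> & files)
{
  const gdcm::Tag seriesUID(0x0020, 0x000e);   // Series Instance UID
  const gdcm::Tag position(0x0020, 0x0032);    // Image Position (Patient)
  const gdcm::Tag orientation(0x0020, 0x0037); // Image Orientation (Patient)

  gdcm::Scanner scanner;
  scanner.AddTag(seriesUID);
  scanner.AddTag(position);
  scanner.AddTag(orientation);

  if (!scanner.Scan(files))
  {
    return; // scanning failed
  }

  for (const std::string & file : files)
  {
    const char * uid = scanner.GetValue(file.c_str(), seriesUID);
    const char * ipp = scanner.GetValue(file.c_str(), position);
    // ... group by uid, then sort by projecting ipp onto the slice normal ...
    (void)uid;
    (void)ipp;
  }
}
```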
Following up on the thread-safety of the 3-step approach:
I have executed the 3-step approach 50 times so far, on 2 different data sets, one containing 49 series (6000+ files) and the other 74 series (8000+ files). I see a performance improvement of 2-3x when using 32 threads, compared with a single-threaded implementation. I haven't seen a concurrency issue (I've been watching for segfaults).
I use a thread-specific GDCMImageIO pointer in Step 3.
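To illustrate, the Step 3 worker body is roughly this (a sketch; the pixel type and the ReadOneSeries name are my assumptions, and the file-name vector comes off the queue filled in Step 2):

```cpp
#include <itkGDCMImageIO.h>
#include <itkImage.h>
#include <itkImageSeriesReader.h>

#include <string>
#include <vector>

using ImageType = itk::Image<short, 3>; // pixel type assumed for illustration

// Step 3 worker body: each thread constructs its own reader and ImageIO,
// so no ITK objects are shared between threads.
ImageType::Pointer
ReadOneSeries(const std::vector<std::string> & fileNames)
{
  auto io = itk::GDCMImageIO::New(); // thread-specific GDCMImageIO
  auto reader = itk::ImageSeriesReader<ImageType>::New();
  reader->SetImageIO(io);
  reader->SetFileNames(fileNames);
  reader->Update(); // may throw itk::ExceptionObject on a bad series
  return reader->GetOutput();
}
```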
I'm not familiar with SimpleITK, but given my experiments and the itk.js implementation, I'm wondering if there's something different in the way SimpleITK reads DICOM series that has compromised its thread safety.
If you do batch processing, then DICOM loading is typically a small fraction of the total time, so even a 2-3x improvement may result in a barely perceptible overall time reduction.
If you need fast DICOM loading for an interactive application, then what counts is not the actual loading time but the perception of loading time, such as a short "time to first image". If you start displaying frames as they are loaded, then time to first image is something like a tenth of a second, which means the user practically does not need to wait at all. If you make your user wait for the entire volume to load, you will provide a much worse user experience: even if you reduce the loading time from 30 seconds to 10 seconds, it will still be 100x slower than progressive loading. Also, most often you need to load DICOM data from a remote server, so if you want to speed up loading, you can achieve much better results by querying some high-level series information first to initialize a voxel array, then starting to display frames as they are received.
Instead of focusing on multi-threading to speed up a monolithic image loading operation, I would much rather see some activity in improving support for progressive/streaming/partial/tiled/on-demand image loading infrastructure that would make it easier for applications to reduce time to first image. Much of this could be done at the application level, though, so I am not sure how much of it has to be implemented in ITK.
My application is a platform for batch processing of images. You can find a high-level overview of the image loading part of the application in this blog post:
DICOM image loading is actually a bottleneck for me in many scenarios. This is because, once data are loaded into memory in my application, the processing takes place in parallel across hundreds of threads in a cluster, and that part is quite fast. So any significant improvement in loading performance is quite noticeable in my application's case.
Assuming that the gdcm::SerieHelper objects are thread-safe, the change I propose for parallelizing Step 2 is quite straightforward. The benefit can be significant for applications like mine.
Based on the blog post, the performance bottleneck in your system is most likely the network transfer to the CAS server, so if you want to minimize overall processing time, you need to start processing the data during the network transfer. Here, a multi-threaded ITK image reader will not help, because you can only start reading once the image transfer has completed, so your processing will be unnecessarily delayed by the entire network transfer time, which is most likely much more than the improvement you can get from multi-threaded reading.
If network transfer time is not a concern, then you can add a conversion step after the transfer of a series is completed: convert from the legacy format (one file per slice) into the enhanced image format (one file per series) as the DICOM files arrive on your system. You can load such images at least 10x faster than hundreds of individual files, because everything is in a single file with a single header, and you can read the entire voxel data into memory at once. Alternatively, if you know what kind of DICOM data sets you need to support, then you might just dump the DICOM headers into json files and the reconstructed bulk data into quickly loadable formats (images/segmentations into nrrd files, surfaces into ply/gltf files, etc.). These may not even need any new implementation (just running the incoming data through a converter once) and would result in 10-100x faster loading.
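For instance, a one-time conversion pass with ITK could read each legacy series once and rewrite it as a single NRRD file. A sketch; the function name and pixel type are mine:

```cpp
#include <itkGDCMImageIO.h>
#include <itkImage.h>
#include <itkImageFileWriter.h>
#include <itkImageSeriesReader.h>

#include <string>
#include <vector>

using ImageType = itk::Image<short, 3>; // pixel type assumed for illustration

// One-time conversion: read a one-file-per-slice series once, write it back
// as a single file that later loads in one shot.
void
ConvertSeriesToNrrd(const std::vector<std::string> & sliceFiles,
                    const std::string &              outputPath) // e.g. "series.nrrd"
{
  auto reader = itk::ImageSeriesReader<ImageType>::New();
  reader->SetImageIO(itk::GDCMImageIO::New());
  reader->SetFileNames(sliceFiles);

  auto writer = itk::ImageFileWriter<ImageType>::New();
  writer->SetInput(reader->GetOutput());
  writer->SetFileName(outputPath); // the .nrrd extension selects NrrdImageIO
  writer->Update();
}
```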
I'm not against making some small, low-risk changes in ITK that make DICOM loading faster in some configurations; I just wanted to point out that there are probably better options (and maybe that's why DICOM loading is not already multi-threaded).
It's kind of you, @lassoan, to take the time to read the blog post.
The architecture of the system, as well as the endpoint I'm working with, is constrained such that processing happens after an entire batch is loaded. For example, sometimes we need to join tables containing image data and metadata. Network latency is definitely a bottleneck in some cases, but the platform I'm building needs to be generic, and we do not necessarily assume there is network latency. Also, I have actually considered the slowness of I/O to be a reason to do I/O in parallel, but I'll need to perform additional experiments to get numbers confirming that.
Converting data and storing it locally is a good thought, perhaps especially when the same image is loaded multiple times. But given the generic nature of my application, making assumptions or introducing an additional step is not ideal.