ImageSeriesReader speed

Hi everyone,

I’m currently using SimpleITK (2.0) ImageSeriesReader to process several hundred MR studies. I am using the standard approach of GetGDCMSeriesIDs -> GetGDCMSeriesFileNames -> Execute. It works fine, although painfully slowly. I am looking for your general thoughts/experience with bulk processing data like this.

After doing some basic profiling, it looks like GetGDCMSeriesIDs can take up to 30 seconds per study, and GetGDCMSeriesFileNames similarly takes about 30 seconds per series. Execute is quite fast compared to the previous two functions, usually taking around 10 sec/series. The other functions in my pipeline (resampling, reorientation, etc.) take <1 sec. Performance gets significantly worse with multiprocessing: on a machine with 40 CPUs, running ~40 processes can result in a single study taking as long as 30 minutes to process.
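
For reference, the timing above is just wall-clock measurements around each call, roughly like this (the study path is a placeholder):

import time
import SimpleITK as sitk

study_dir = 'path_to_a_single_study_directory'  # placeholder

start = time.time()
series_ids = sitk.ImageSeriesReader_GetGDCMSeriesIDs(study_dir)
print('GetGDCMSeriesIDs: {0:.1f} sec'.format(time.time() - start))

for sid in series_ids:
    t0 = time.time()
    file_names = sitk.ImageSeriesReader_GetGDCMSeriesFileNames(study_dir, sid)
    t1 = time.time()
    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(file_names)
    img = reader.Execute()
    t2 = time.time()
    print('{0}: GetGDCMSeriesFileNames {1:.1f} sec, Execute {2:.1f} sec'.format(
        sid, t1 - t0, t2 - t1))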

Unfortunately I have no expertise in profiling/optimizing code, but I’m looking for thoughts. Is this expected behavior? Am I throttling I/O? I believe I am using a GPFS filesystem. Is there any way to investigate this further or improve it in any way?

Hello @jwitos,

This is a hard one, it depends on:
a. what your data directory structure is.
b. what operating system you are using.
c. whether the data is on an SSD or a spinning disk (in the latter case parallel data access can slow the read).

There is a general, relevant discussion on parallel traversal of a directory structure in Python on Stack Overflow.
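
As a minimal sketch of that idea, here is a parallel variant of option two below, using a process pool for the per-file header reads (the worker name and chunk size are illustrative, and on a spinning disk this may well be slower than a serial scan):

import os
import concurrent.futures
import SimpleITK as sitk

def study_series_key(fname):
    # Each worker creates its own reader and reads only the header.
    # Returns ('study_uid:series_uid', fname), or None for non-DICOM files.
    reader = sitk.ImageFileReader()
    reader.SetFileName(fname)
    try:
        reader.ReadImageInformation()
        key = '{0}:{1}'.format(reader.GetMetaData('0020|000d'),
                               reader.GetMetaData('0020|000e'))
        return key, fname
    except RuntimeError:
        return None

if __name__ == '__main__':
    root_dir = 'path_to_root_data_directory'
    all_files = [os.path.join(d, f)
                 for d, _, fnames in os.walk(root_dir) for f in fnames]
    series_files = {}
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for result in executor.map(study_series_key, all_files, chunksize=64):
            if result is not None:
                series_files.setdefault(result[0], []).append(result[1])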

Data directory structure:

Scenario 1: All images are stored in the same directory. We have ‘n’ images and ‘n’ series (think chest x-rays). GetGDCMSeriesIDs needs to open all n images and then GetGDCMSeriesFileNames opens the ‘n’ images ‘n’ times, so O(n^2).

Scenario 2: Each series is stored in its own directory. GetGDCMSeriesIDs needs to open all n images and then GetGDCMSeriesFileNames opens the ‘n’ images only once as each is in its own directory, so O(n).

Scenario 3: Series may be stored in multiple directories (happens in real life). With some preprocessing analysis and “moving images around” before using GetGDCMSeriesFileNames we’re back to scenario 2. See code below.

Let us know which of option 1 or option 2 in the code below worked better for you. If there is no improvement, please provide additional details so we can solve this for others in the community dealing with similar needs.

Some code:

import SimpleITK as sitk
import os
import time
import tempfile
import shutil
import platform

root_dir = 'path_to_root_data_directory'
reader = sitk.ImageFileReader()

start_time = time.time()

# option one: using GetGDCMSeriesIDs
all_series_files = {} 
for dir_name, _, _ in os.walk(root_dir):
    sids = sitk.ImageSeriesReader_GetGDCMSeriesIDs(dir_name)
    for sid in sids:
        series_file_names = sitk.ImageSeriesReader_GetGDCMSeriesFileNames(dir_name, sid)
        reader.SetFileName(series_file_names[0])
        reader.ReadImageInformation()
        study = reader.GetMetaData('0020|000d')
        key = '{0}:{1}'.format(study, sid)
        if key in all_series_files:
            all_series_files[key].extend(series_file_names)
        else:
            all_series_files[key] = list(series_file_names)

print('{0:.4f}'.format(time.time()-start_time))


start_time = time.time()
# option two: collecting the per-series files ourselves
all_series_files2 = {}
for dir_name, subdir_names, file_names in os.walk(root_dir):
    for file in file_names:
        try:
            fname = os.path.join(dir_name, file)
            reader.SetFileName(fname)
            reader.ReadImageInformation()
            sid = reader.GetMetaData('0020|000e')
            study = reader.GetMetaData('0020|000d')
            key = '{0}:{1}'.format(study, sid)
            if key in all_series_files2:
                all_series_files2[key].append(fname)
            else:
                all_series_files2[key] = [fname]
        except RuntimeError:  # skip files that are not readable DICOM
            pass
        
print('{0:.4f}'.format(time.time()-start_time))
            

start_time = time.time()
# per series processing:
# read each series by creating a single directory per series and then either
# soft linking or copying the files over, based on the operating system
reader = sitk.ImageSeriesReader()
for key, file_names in all_series_files.items():
    study, sid = key.split(':')
    with tempfile.TemporaryDirectory() as tmpdirname:
        if platform.system() == 'Windows':
            for i, fname in enumerate(file_names):
                shutil.copy(os.path.abspath(fname),
                            os.path.join(tmpdirname, str(i)))
        else:
            for i, fname in enumerate(file_names):
                os.symlink(os.path.abspath(fname),
                           os.path.join(tmpdirname, str(i)))
        reader.SetFileNames(sitk.ImageSeriesReader_GetGDCMSeriesFileNames(tmpdirname, sid))
        img = reader.Execute()

print('{0:.4f}'.format(time.time()-start_time))

Life is complicated :wink:

3 Likes

You cannot expect faster loading than about 50-100 slices/second for the most common legacy DICOM files (single slice per file). If you want to load an image series in a fraction of a second then you can use more efficient file formats, for example DICOM enhanced multiframe or research formats (such as NRRD).

Details:

I’ve done a quick performance test of DICOM import in 3D Slicer. We regularly profile DICOM import and fix any obvious performance bottlenecks, so while the performance could probably still be improved, probably not by much and not easily. In this test, I used the TCIA CPTAC-HNSCC collection of 20256 DICOM files (most of them a single MRI slice, plus some scout images and reports), 10.7GB in total.

In Slicer we first read the entire collection of DICOM files we want to process. This is important, because DICOM is not aware of filenames and folders, so you cannot rely on having a complete series in a folder (and there can be UID-based links between series, so you cannot always interpret a series by itself anyway).

Indexing all the data sets took 332 sec (cold loading, after a computer restart) and 290 sec (warm loading, right after the same 10GB data set had been read from disk), which means processing about 60 files (about 30MB) per second.

After this, cold loading of a 600-slice MRI takes about 12 seconds (of this, about 1-2 seconds are spent in a geometry consistency check: retrieving all slice positions and orientations from the DICOM database and creating a warping transform if the slices are not equally spaced, do not have the same orientation, or the axes are not orthogonal). Warm loading (loading the file again, just after it was loaded) took about 5 seconds. Speed was about the same with DCMTK and GDCM.
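
Not Slicer’s actual implementation, but the gist of such a geometry check is to project each slice position onto the slice normal and inspect the gaps between consecutive slices; a rough pydicom/numpy sketch:

import numpy as np
import pydicom

def slice_gaps(file_names):
    # Project each slice position onto the slice normal; uneven gaps
    # indicate missing slices or non-uniform spacing.
    positions = []
    normal = None
    for fname in file_names:
        ds = pydicom.dcmread(fname, stop_before_pixels=True)
        if normal is None:
            row = np.array(ds.ImageOrientationPatient[:3], dtype=float)
            col = np.array(ds.ImageOrientationPatient[3:], dtype=float)
            normal = np.cross(row, col)
        positions.append(np.dot(np.array(ds.ImagePositionPatient, dtype=float), normal))
    return np.diff(np.sort(positions))

# usage: gaps = slice_gaps(file_names)
#        np.allclose(gaps, gaps[0], atol=1e-3)  # True -> equally spaced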

Overall, it looks like with standard commodity hardware and a non-optimized system configuration you can parse and load DICOM at a rate of about 50-100 files/sec (something like 30-60MB/sec). So, the performance that you see looks about right. You could upgrade your hardware and, with a lot of effort, improve your software to get maybe 2-5x faster loading, but if you need significantly faster speeds then it is better to switch to a more efficient file format. Modern DICOM (enhanced multiframe) stores the entire series in a single file, so you can expect much better performance there. You can also use simpler file formats: for example, the same MR series (320 MB, 600 slices) that takes 12 seconds to load from DICOM can be loaded in 0.3 seconds from uncompressed NRRD.
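
As a minimal illustration of the convert-once approach with SimpleITK (file paths are placeholders):

import SimpleITK as sitk

# one-time conversion: read the DICOM series, write it as a single NRRD file
file_names = sitk.ImageSeriesReader_GetGDCMSeriesFileNames('path_to_series_directory')
reader = sitk.ImageSeriesReader()
reader.SetFileNames(file_names)
img = reader.Execute()
sitk.WriteImage(img, 'series.nrrd')

# every subsequent load is a single fast read
img = sitk.ReadImage('series.nrrd')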

4 Likes

Thank you @zivy and @lassoan for very informative comments.

This is a hard one, it depends on:
a. what is your data directory structure.

I have legacy DICOM files (one file per slice) in separate directories for different studies. Series are not organized into sub-directories.

I verified Andras’ calculations regarding processing speed. My average study takes ~1.5GB of disk space. On average, a single study is processed by a single process in ~1500 seconds: 1500MB/1500sec => ~1MB/sec. This is when running 40 parallel processes, so in total I am processing about 40MB/sec.

I was hoping that there is some overhead in the way SimpleITK reads directories/series, so I implemented a simple pydicom approach inspired by Slicer’s DICOMUtils sorting. After a short experiment, it looks like there is no time difference between the two methods.
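
For the record, that sorting boiled down to something like the following (an illustrative sketch, not my exact code):

import os
import pydicom
from pydicom.errors import InvalidDicomError

root_dir = 'path_to_root_data_directory'

series_files = {}
for dir_name, _, file_names in os.walk(root_dir):
    for file in file_names:
        fname = os.path.join(dir_name, file)
        try:
            # read only the header, not the pixel data
            ds = pydicom.dcmread(fname, stop_before_pixels=True)
        except InvalidDicomError:
            continue  # skip non-DICOM files
        key = '{0}:{1}'.format(ds.StudyInstanceUID, ds.SeriesInstanceUID)
        series_files.setdefault(key, []).append(fname)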

I don’t have an opportunity right now to verify the two options suggested by @zivy, but I think my earlier experiment supports the hypothesis that it would be really difficult to significantly improve processing speed.

Of course I fully agree with @lassoan’s argument about simpler file formats. In fact, my current scenario is primarily converting a large number of DICOM files to a research-friendly format, extracting some metadata on the fly and performing basic preprocessing (resampling/reorienting).
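
For context, the core of that conversion looks roughly like the sketch below; convert_series is a hypothetical helper, and the output path and target spacing are placeholders:

import SimpleITK as sitk

def convert_series(file_names, out_path, new_spacing=(1.0, 1.0, 1.0)):
    # read the DICOM series, reorient to a canonical (LPS) orientation,
    # resample to the requested spacing, and write a research-friendly file
    reader = sitk.ImageSeriesReader()
    reader.SetFileNames(file_names)
    img = reader.Execute()
    img = sitk.DICOMOrient(img, 'LPS')
    new_size = [int(round(sz * sp / nsp)) for sz, sp, nsp
                in zip(img.GetSize(), img.GetSpacing(), new_spacing)]
    img = sitk.Resample(img, new_size, sitk.Transform(), sitk.sitkLinear,
                        img.GetOrigin(), new_spacing, img.GetDirection(),
                        0, img.GetPixelID())
    sitk.WriteImage(img, out_path)

# e.g. convert_series(all_series_files[key], key.replace(':', '_') + '.nrrd')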

2 Likes