Adding parallel compression to MetaIO

(Dženan Zukić) #1

As discussed earlier, I want to add parallel compression to MetaIO and NRRD. As NRRD is bundled with other stuff and ITK’s version is not up to date, I plan to work on MetaIO first.

I checked PIGZ, Mark Adler’s tool for parallel, backwards compatible zlib compression. However, that is a tool and not a library, it only does parallel compression - decompression is still single threaded, and Windows support is still on a TODO list.

In addition to (single-threaded) zlib which is already used in ITK, I also examined a more modern compression library zstd and meta-compressor blosc. @neurolabusc suggested to use blosc instead of pure zstd, and my simple benchmark shows blosc+zstd to be better than pure zstd for 16-bit short data (which is the most common).

Explanation of the benchmark
6 different images were used, representative of type of data we encounter.

WB - “whole body” CT, once upon a time downloaded from hcuge.ch.

wb-seg - watershed segmentation of WB, created as a challenging test data for RLEImage class. Available here.

MRA - chest MR angiography, downloaded from hcuge.ch.

wbPET - “whole body” PET, from private collection.
wbPET

CBCT - dental cone beam CT, available here.
CBCT

US - a large 2D ultrasound sweep of some phantom, from private collection.

Source code of the benchmark is here:
Tester.zip (1.9 KB)
blosc_compress has a typesize parameter. I kept that mostly at 2, as majority of test images are 16-bit shorts. Additionally I tested with 1 for uchar, and 1 and 8 for double. The tests were run on 8-core Ryzen 7 Pro 1700 @ 3.0GHz. 8 threads were used. Note: using 16 hyper-threaded cores does not increase the speed significantly.

Results are here:
CompressionBenchmark.xlsx (58.4 KB)
CompressionBenchmark.ods (14.7 KB)

Conclusions
All of the examined alternatives are faster than zlib on low and medium compression levels, even in single threaded mode. The difference is even more apparent in multi-threaded mode.

I am currently leaning towards using blosc+zstd on compression level 2 as the default. Also, I think we should expose SetCompressionLevel in itk::ImageIOBase, which is currently implemented only by a few IOs (e.g. PNG, MINC). @blowekamp suggested also exposing SetCompressor, which would allow programatically preserving backwards compatibility if so desired. A configure-time option for backwards compatible compression is already planned.

Before I start the heavy lifting part of refactoring, are there any suggestions or comments? Everyone is welcome, but these people might have more interest in commenting: @hjmjohnson @glk @brad.king @Stephen_Aylward @fbudin @seanm @jhlegarreta @Niels_Dekker @phcerdan @zivy @bill.lorensen @jcfr @lassoan

1 Like
(Hans Johnson) #2

I don’t understand the numbers in the table. What does ratio mean? Is there a speedup information relative to the current zlib version? Can future analysis also present the size of the resulting file?

Hans

(Andras Lasso) #3

It’s great that you are working on this because compressed file support in metaimage and nrrd has significant limitations.

In our projects the main problem is lack of random read access. Zlib does not let you decompress a selected part of the image, but requires you to decompress the entire stream until the read starting point. This causes huge slowdowns if we want to read single 2D frame from a 3D volume or a 3D frame of a 4D volume.

Another important requirement is streaming writing, so that we can write frames to file without allocating one big buffer for the entire data set. We need this for continuous recording of ultrasound frames to disk, writing large 4D CT data sets (10-20x 200-500MB 3D frames), and writing overlapping segments into a 4D binary labelmap (50-60x 3D binary labelmaps).

For grayscale images (ultrasound, CT, and MR images) we can usually get about 50% compression ratio, which rarely makes any practical difference. Compression starts to have significant benefits when there is large black padding in the image (e.g., CT images are often empty outside a cylinder, ultrasound images are empty outside a fan-shaped region), but even then, the 60-70% compression ratio that we get is not a game changer (we could just as well go without any compression). On the other hand, compression makes huge difference on binary labelmaps (up to 99.9% compression ratio). For both binary labelmaps and images with large black regions, simple run-length encoding could work just as well as a Huffman coding (zlib deflate, etc.) and run-length encoding would be probably much faster and trivial to make it multithreaded.

For parallelization, I would consider simply splitting up the volume into smaller pieces and let one thread encode a single 2D or 3D frame at a time. This way you could easily run compression on many threads, without any change to the compression/decompression algorithm. This would also be a potential solution for improving speed of random read access.

1 Like
(Stephen Aylward) #4

Hi,

I’d be interested in extending MetaIO to support additional compression schemes as well as “levels” of compression. This would be easy with MetaIO’s tagged file format.

We can begin with blosc+zstd.

One thing to keep in mind is that MetaIO code is shared with VTK, so we will need to talk with the VTK folks to make certain they are open to any additional methods. It shouldn’t be a problem, but it is important so that code and capabilities don’t diverge.

s

(Dženan Zukić) #5

Ratio is size of the resulting file, compared to the size of the original file. 40% ratio means original file of 100MB got reduced to 40MB.

That can be calculated: divide zlib’s duration by other contender’s duration. Current default zlib’s compression level is 6. New contender’s are about 10x faster in single threaded more, and in multi-threaded mode the approximate speedup should also be multiplied by the number of physical cores.

(Dženan Zukić) #6

All of that is great feedback @lassoan.

@Stephen_Aylward who among VTK’s developers should I consult?

(Brad King) #7

If we make zstd a dependency then we may need to vendor it in ITK/VTK. For reference, CMake MR 3092 added zstd to CMake.

(Bradley Lowekamp) #8

Related to this is to extend the itk::ImageIOBase class with methods to set (a request for) the compression level and the compression algorithm. These options are common to many IO classes. Additionally providing a consistent interface should lead to better Python interfaces for IO.

(Curtis Rueden) #9

Wasn’t my suggestion, although I agree that it makes total sense!

(Dženan Zukić) #10

“Chris Rorden” and “Curtis Rueden”, I hope @neurolabusc and @ctrueden will forgive me for the mistake.

Also, I wanted to tag maekclena for commenting, but I don’t know his/her discourse handle. Does anyone know who maekclena is?

1 Like
(Chris Rorden) #11

All sounds reasonable. Related to this, I added DICOM to NRRD export to my dcm2niix. That tool uses pigz if found, otherwise zlib. The benefit for targetting gz is for compatibility. If that is your goal, using the CloudFlare zlib doubles the performance for compression.

On the other hand, zstd offers better compression (in particular, when combined with blosc), and should enable parallel decompression (where gz is always single threaded).

While I think zstd+blosc is a great combination, the issue is whether this is becoming a specialized format. I would suggest you get this format added to the NRRD specification so you are creating true NRRD format images. I also note that the talented team behind zstd are aware of this low hanging fruit. I suspect some future version of a zstd-inspored format will automatically detect the correct data swizzle. This would provide a standard solution.

In sum, I think this is a great idea. I simply urge that the eventual solution becomes an integral part of the NRRD standard.

1 Like
(Chris Rorden) #12

One comment on your study - the swizzle used by Blosc should be set by the Bits Allocated (0028,0100), and only used if this value is greater than 8. You can also do this by seeing whether byte n is is best predicted by n-1 (no blosc), n-2 (16-bit blosc), n-3 (24-bit blosc), n-4 (32-bit blosc) or n-8 ( 64-bit blosc). You can do this in a single pass through the data. For example, your us data is uchar 8-bit, and therefore a Blosc swizzle will be deleterious. I also wonder if the blosc was set to 64-bit swizzle (as pixeltype is double) for your PET data, as the blosc+zstd results for the PET images don’t look optimal to me. I assume this is how a future variant of zstd will work. Indeed, one calculate the n-back prediction in the main thread, so an optimal swizzle would only happen if needed. This would add virtually no penalty when swizzling does not help.

(Stephen Aylward) #13

I spoke with Ken Martin about updating MetaIO in VTK to use an addition 3rd party library. His comments were the following:

=========
Right now third party libs have to be mangled so be aware of that. I would suggest posting to VTK discourse and see if there is any resistance. The usual resistance would be

  • size (huge is bad)
  • portability (e.g. doesn;t compile on * which is a problem)
  • minimal added value (e.g. does the same thing as * which we already have, so why add more, we have zlib, lzma, lz4, zfp already for example)
2 Likes
(Andras Lasso) #14

Adding an option to split the data to tiles (e.g., each slice of a 3D volume could be a tile) and compress each tile independently using current zlib would allow parallel compression/decompression on many threads and would allow extraction parts the data set without requiring decompressing everything else before. It would probably fulfill all requirements of our projects.

MetaIO specification may already support this, because you can specify multiple data files for a single header.

It would be great if we could focus all our efforts on one research image file format (nrrd, nifti, metaio,…?). Implementing the same features in several file formats and maintain converters between them is a lot of unnecessary work.

(Bradley Lowekamp) #15

Here is the PR for this improvement:

The CompressionLevel is mostly straight forward. With the caveat: some compression algos have the range 0-100, while others have 0-9 (deflate/zlib based). In this implementation I choose to have each ImageIO define a max. This preserves compatibility, at the expense of not having a consistent range.

1 Like