Not able to get RTK speedups on GPU

whitaker · April 25, 2025, 4:17pm

Hi.

I am using RTK to do iterative reconstruction. Everything seems to be working, except when trying to run things on the GPU. Everything compiles fine and it runs. However, when the GPU-compiled programs run, they are slow, slower than the CPU versions. nvtop shows that the process is present on the GPU, but that it is using 0% of GPU processing while it is hammering the CPU. This is true of RTK code that I wrote as well as the RTK-ready applications such as rtkfdk, admmtv, conjugategradient, etc. I have compiled (clean) ITK with the flags ITK_USE_GPU and RTK_USE_CUDA. Below is a list of system stuff (for reference) and the display from nvtop. Any pointers/advice is appreciated.

Ross

OS: Ubuntu 22.04
Graphics card: NVIDIA RTX 500
Driver Version: 535.183.01
CUDA Version: 12.2

dzenanz · April 25, 2025, 7:20pm

@simon.rit might answer.

whitaker · April 25, 2025, 8:34pm

Thx

simon.rit · April 25, 2025, 9:12pm

Could you please provide a concrete example of what you’re trying to do? For rtkfdk, you need to pass the option --hardware cuda. For iterative reconstruction, you need to select CUDA forward and back projectors, e.g. --fp CudaRayCast --bp CudaVoxelBased.
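
As a rough sketch of where these options go, the wrappers below pass them to rtkfdk and rtkconjugategradient. The function names and the file names supplied as arguments are placeholders for illustration; the geometry file and projection stack are assumed to exist already, and all other options are left at their defaults.

```shell
# Usage: fdk_cuda GEOMETRY PROJECTIONS OUTPUT
# FDK reconstruction with the CUDA implementation selected via --hardware.
fdk_cuda() {
    rtkfdk -p . -r "$2" -o "$3" -g "$1" --hardware cuda
}

# Usage: cg_cuda GEOMETRY PROJECTIONS OUTPUT
# Conjugate gradient reconstruction with CUDA forward and back projectors.
cg_cuda() {
    rtkconjugategradient -p . -r "$2" -o "$3" -g "$1" \
        --fp CudaRayCast --bp CudaVoxelBased
}
```

For example, `fdk_cuda geometry.xml projections.mha fdk.mha` can be run while watching nvtop to check that the GPU is actually busy.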

whitaker · April 25, 2025, 9:28pm

Aha. I see these now in the command-line options for these executables. Let me try these options and see how it goes.

whitaker · April 25, 2025, 10:01pm

This seems to work. Thanks so much for your help!

whitaker · May 30, 2025, 9:32pm

Hi,

Thanks so much for your help. RTK is really a great resource, and I have been studying how it is put together. Great design and software.

I am still having some trouble getting the GPU to engage consistently. When I run the applications that are built into RTK from the command line, things work as expected: the application uses the GPU consistently and does not use the CPU much. To determine this I use nvtop on Linux. So for the commands:

Create a simulated geometry

rtksimulatedgeometry -n 180 -o geometry.xml

Create projections of the phantom file

rtkprojectshepploganphantom -g geometry.xml -o projections.mha --spacing 2 --dimension 800

Reconstruct

rtkregularizedconjugategradient -p . -r projections.mha -o regularizedrecon.mha -g geometry.xml --spacing 2,2,16 --dimension 800,800,100 --tviter 4 --gammatv 10.0 -n 10 --tikhonov 1.0 --gammalaplacian 1.0 --fp CudaRayCast --bp CudaVoxelBased

I get the following GPU behavior:

Which makes sense. Everything seems to be computed on the GPU. Very little use of the CPU during the iterative reconstruction process. However, if I create my own CG iterative code (.cxx attached), I get a different behavior, for the same geometry and reconstruction volume. There is a burst of GPU activity at each iteration and then the process hammers the CPU for several seconds. Like this:

I figure that some piece of the processing is not being done on the GPU, presumably some filter that I have not GPU-enabled. However, I have not yet been able to figure out what I am missing. I have verified that the projections and the volume for reconstruction are CUDA images, and all of the filters are templated on CUDA images.

Any suggestions would be appreciated.

Regards,

Ross

FirstCudaReconstructionCG.cxx (5.0 KB)

simon.rit · June 2, 2025, 6:04am

I don’t know what could be the issue, I would have expected the same behaviour. Is your code indeed slower than the command line application?
My suggestion would be to run the two codes with the exact same parameters and to turn on the CMake option RTK_PROBE_EACH_FILTER. The --verbose option of the command line tool will then report the time spent in each filter. You can do the same in your code, see code here, and compare the printed results to understand which filter is causing this.

whitaker · July 27, 2025, 9:07pm

Hi,

Sorry for the delayed response. I have done some experiments and the probing as you suggested (very cool stuff). I have figured out the difference between the fast-GPU case and the slow-GPU case. The slow case is compiled outside of the ITK/RTK source/build tree with a separate CMakeLists file (attached). Below are the results of the probe. The slow case takes about 5x as long. There must be some flag or something that I am missing that causes the code compiled outside of the source tree to not run properly. I have confirmed that ITK_USE_GPU and RTK_USE_CUDA are both defined at compile time. Not sure what else to look at.

Thanks again for all your help (and for supporting RTK!).

Compiled within the ITK/RTK source tree:

***************************************************************************************************************************
Probe Tag                                                 Starts    Stops     Time (s)       Memory (kB)    Cuda memory (kB)
***************************************************************************************************************************
AddImageFilter                                            40        40        0.0473486      249966         -125952        
BackwardDifferenceDivergenceImageFilter                   36        36        0.260071       249884         0              
ConjugateGradientConeBeamReconstructionFilter             4         4         17.1498        1.0103e+06     616960         
ConstantImageSource                                       45        45        0.0171709      236589         0              
CudaBackProjectionImageFilter                             24        24        0.645723       492            435200         
CudaConjugateGradientImageFilter                          4         4         16.4275        1.36971e+06    308224         
CudaConstantVolumeSource                                  4         4         0.00055778     108            252416         
CudaDisplacedDetectorImageFilter                          4         4         0.0469735      -168689        225280         
CudaForwardProjectionImageFilter                          20        20        1.59452        21.6           477184         
ForwardDifferenceGradientImageFilter                      36        36        0.336165       750008         -139947        
ImageFileReader                                           1         1         0.0802979      224928         0              
LaplacianImageFilter                                      20        20        0.520535       250143         -251904        
MagnitudeThresholdImageFilter                             16        16        0.0449739      0              0              
ReconstructionConjugateGradientOperator                   20        20        3.08852        1.31128e+06    -240640        
RegularizedConjugateGradientConeBeamReconstructionFilter  1         1         84.4491        715144         452608         
ThresholdImageFilter                                      4         4         0.0674657      249948         -251904        
TotalVariationDenoisingBPDQImageFilter                    4         4         3.85674        -687330        -188928

Compiled separate from the ITK/RTK source tree:

***************************************************************************************************************************
Probe Tag                                                 Starts    Stops     Time (s)       Memory (kB)    Cuda memory (kB)
***************************************************************************************************************************
AddImageFilter                                            40        40        0.399742       249937         -125952        
BackwardDifferenceDivergenceImageFilter                   36        36        3.2012         249952         0              
ConjugateGradientConeBeamReconstructionFilter             4         4         69.6294        1.01022e+06    616960         
ConstantImageSource                                       45        45        0.114886       236656         0              
CudaBackProjectionImageFilter                             24        24        0.646264       492            435200         
CudaConjugateGradientImageFilter                          4         4         68.7377        1.36963e+06    308224         
CudaConstantVolumeSource                                  4         4         0.012482       72             252416         
CudaDisplacedDetectorImageFilter                          4         4         0.0457603      -168697        225280         
CudaForwardProjectionImageFilter                          20        20        1.59471        21.6           477184         
ForwardDifferenceGradientImageFilter                      36        36        4.9089         749904         -139947        
ImageFileReader                                           1         1         0.082335       224784         0              
LaplacianImageFilter                                      20        20        9.23759        249988         -251904        
MagnitudeThresholdImageFilter                             16        16        0.370034       0              0              
ReconstructionConjugateGradientOperator                   20        20        13.4932        1.31122e+06    -240640        
RegularizedConjugateGradientConeBeamReconstructionFilter  1         1         408.758        715280         452608         
ThresholdImageFilter                                      4         4         0.329074       249948         -251904        
TotalVariationDenoisingBPDQImageFilter                    4         4         32.1627        -687392        -188928
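
The two tables can be compared row by row by dividing the out-of-tree time by the in-tree time. The sketch below does this for a subset of rows copied from the tables; the variable and function names are made up for illustration.

```shell
# Time (s) per probe tag, copied from the two tables above (a subset of rows).
in_tree='CudaBackProjectionImageFilter 0.645723
CudaForwardProjectionImageFilter 1.59452
BackwardDifferenceDivergenceImageFilter 0.260071
ForwardDifferenceGradientImageFilter 0.336165
LaplacianImageFilter 0.520535
TotalVariationDenoisingBPDQImageFilter 3.85674'

out_of_tree='CudaBackProjectionImageFilter 0.646264
CudaForwardProjectionImageFilter 1.59471
BackwardDifferenceDivergenceImageFilter 3.2012
ForwardDifferenceGradientImageFilter 4.9089
LaplacianImageFilter 9.23759
TotalVariationDenoisingBPDQImageFilter 32.1627'

# Print the out-of-tree / in-tree time ratio for every tag present in both lists.
slowdown() {
    printf '%s\n---\n%s\n' "$in_tree" "$out_of_tree" | awk '
        $1 == "---" { second = 1; next }     # separator: switch to the second list
        !second     { t[$1] = $2; next }     # first list: remember the in-tree time
        ($1 in t)   { printf "%-42s %5.1fx\n", $1, $2 / t[$1] }'
}

slowdown
```

The two CUDA projectors come out at about 1x, while the filters without a Cuda prefix come out roughly 8x to 18x slower, which suggests the slowdown is in the host-side code rather than in the CUDA kernels.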

CMakeLists.txt (671 Bytes)

whitaker · July 27, 2025, 9:09pm

Those tables didn’t format well. Here is an image of the results:

simon.rit · July 28, 2025, 9:00am

Thanks for sharing. Most likely, you didn’t set the CMAKE_BUILD_TYPE CMake option to Release? This is important because it lets the compiler optimize the generated code.
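
With single-configuration generators such as Unix Makefiles or Ninja, CMake leaves CMAKE_BUILD_TYPE empty unless the project or the user sets it, so an out-of-tree project can end up compiled without optimization flags. A minimal sketch for checking what an existing build directory was configured with (the function name is made up for illustration):

```shell
# Usage: build_type_of BUILD_DIR
# Print the build type recorded in a CMake build directory's cache.
# An empty result means no build type was set at configure time.
build_type_of() {
    sed -n 's/^CMAKE_BUILD_TYPE:[A-Z]*=//p' "$1/CMakeCache.txt"
}
```

If `build_type_of build` prints nothing or `Debug` (assuming the build directory is called `build`), reconfiguring with `cmake -S . -B build -DCMAKE_BUILD_TYPE=Release` and rebuilding with `cmake --build build` should enable optimizations.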

whitaker · July 28, 2025, 9:24pm

Bingo! That’s it. I don’t know how I missed that. Thanks Simon.

Ross