Good question. The answer is it depends.
Relative to what coordinate system is the affine transform defined? If your network is not aware of the image spacing, origin, and direction-cosine-matrix (i.e. these are not part of the inputs), then you are implicitly learning an affine transform which assumes spacing=[1,1,1], origin=[0,0,0], and direction-cosine-matrix=identity for both images.
You will need to compensate/correct for that when mapping points from the fixed image to the moving image.
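One way to do that compensation is to compose the learned index-space affine with each image's index-to-physical mapping (p = origin + direction @ (spacing * index)). A minimal NumPy sketch, where the metadata values and `A_net` are hypothetical placeholders:

```python
import numpy as np

def index_to_physical(spacing, origin, direction):
    """4x4 homogeneous matrix mapping voxel indices to physical points:
    p = origin + direction @ (spacing * index)."""
    M = np.eye(4)
    M[:3, :3] = np.asarray(direction).reshape(3, 3) @ np.diag(spacing)
    M[:3, 3] = origin
    return M

# Hypothetical metadata for the two images.
fixed_i2p = index_to_physical([0.8, 0.8, 2.0], [-100.0, -120.0, 50.0], np.eye(3))
moving_i2p = index_to_physical([1.0, 1.0, 1.5], [-90.0, -110.0, 40.0], np.eye(3))

# A_net: the 4x4 homogeneous affine learned by the network, valid in
# index space (identity here, as a placeholder).
A_net = np.eye(4)

# Corrected transform operating on physical points of the fixed image:
# fixed physical -> fixed index -> moving index -> moving physical.
A_phys = moving_i2p @ A_net @ np.linalg.inv(fixed_i2p)
```

With `A_phys` you can map physical points from the fixed image directly into the moving image's physical space.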
If, on the other hand, this information is taken into account, then just map the fixed points (physical coordinates in the fixed image) to the moving coordinate system using the affine transformation and compute the distances to the corresponding moving points.
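That last step is a standard target-registration-error computation. A short sketch, assuming the affine is given as a 4x4 homogeneous matrix and the point lists are hypothetical:

```python
import numpy as np

def target_registration_errors(affine, fixed_pts, moving_pts):
    """Map fixed physical points through a 4x4 homogeneous affine and
    return Euclidean distances to the corresponding moving points."""
    fixed_h = np.hstack([fixed_pts, np.ones((len(fixed_pts), 1))])
    mapped = (affine @ fixed_h.T).T[:, :3]
    return np.linalg.norm(mapped - moving_pts, axis=1)

# Hypothetical example: a pure translation by (1, 0, 0) that matches
# the corresponding points exactly, so all errors are zero.
A = np.eye(4)
A[0, 3] = 1.0
fixed = np.array([[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]])
moving = np.array([[1.0, 0.0, 0.0], [2.0, 2.0, 3.0]])
errors = target_registration_errors(A, fixed, moving)
```

Summary statistics of these distances (mean, max) are the usual way to report registration accuracy.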