I am using SimpleITK (Python) and am trying to understand how the registration framework operates.
Once the alignment is done (in my case using the L-BFGS-B optimizer, correlation as the similarity metric, a BSplineTransform, and a linear interpolator), I get the estimated transform.

Edit:
With the TransformPoint function, the transform maps points from the resampled_image to the original_image, while, using the inverse transform, the Resample function will move the moving image (original_image) into the same physical space as the fixed image (resampled_image).
Point transformations need the (direct) modeling transform, so why is the modeling transform from the fixed image to the image that is deformed?

It is counter-intuitive, and it seems like I just don't understand something really basic happening here.

It is counter-intuitive because we think of an image being "transported" from the moving space to the fixed space, but that's not actually what happens.

Internally, everything is based on points. You can think of the fixed or moving images as a regular grid of points at the center of each voxel, with a value attached (i.e., the intensity in that voxel). The "transform" or "forward transform" transforms points from fixed to moving space. If we have a point in fixed space, we apply the transform to get its location in the moving space. And if we have a point in moving space, we apply the inverse transform to get its location in the fixed space.

If we want to resample the moving image into the fixed space, we start with a regular grid of points in the fixed space. For each point, we apply the transform and get the corresponding location in the moving space. There, we compute an interpolated value from the surrounding points in the moving image; this becomes the voxel's value in the resampled image.
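That loop can be written out from scratch in NumPy. This is only a sketch of the mechanics (the function name and the translation "transform" are hypothetical, and nearest-neighbour interpolation stands in for the usual linear interpolation):

```python
import numpy as np

def resample_backward(moving, transform, out_shape, fill=0.0):
    """Resample `moving` onto a regular grid of fixed-space points:
    each output voxel PULLS its value from the moving image."""
    out = np.full(out_shape, fill)
    for i in range(out_shape[0]):
        for j in range(out_shape[1]):
            # 1. regular grid point in fixed space -> location in moving space
            y, x = transform((i, j))
            # 2. interpolate in the moving image at that location
            #    (nearest neighbour here, for brevity)
            yi, xi = int(round(y)), int(round(x))
            if 0 <= yi < moving.shape[0] and 0 <= xi < moving.shape[1]:
                out[i, j] = moving[yi, xi]
    return out

moving = np.zeros((4, 4))
moving[3, 2] = 1.0                       # one bright "feature" in moving space
shift = lambda p: (p[0] + 2, p[1])       # fixed -> moving: +2 rows
resampled = resample_backward(moving, shift, (4, 4))
print(np.argwhere(resampled == 1.0))     # [[1 2]]: feature appears 2 rows earlier
```

Note that the fixed-to-moving transform adds 2 rows, yet the feature in the output lands 2 rows *earlier*: the visible motion of the image is the inverse of the transform that was applied to the points.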

It's backwards in our minds because transforming an image from one space to another involves transporting a point set from the destination space to the source space. It's done this way so that we resample a regular grid of points in the destination space.

If it were the other way, we'd end up with an irregular grid to create the resampled image. Imagine we just warped all the moving points to the fixed space. Now we have an irregular grid of points in the fixed space, and we have to solve a more complicated interpolation problem to make the resampled image.

I might be talking nonsense, but the way I understand it now (and please correct me if I am wrong) is:
For the example of a BSplineTransform, we would have a set of control points whose coordinates are transformed in a certain way from the fixed to the moving image. This way we can transform a regular grid of points (at the center of each voxel) of the fixed image into the moving image using the B-spline interpolation kernel. From this deformed grid of points we interpolate values from the surrounding moving-image points, which gives us the resampled moving image used to compute the similarity metric.

So the real transform is from fixed to moving, but in the end, in the visualization of the resampled moving image, the deformations that we see are on the moving image and come from the inverse transform.

I think the domain of the parameters depends on the algorithm. As of ITKv4, the registration can take place in a "virtual domain" that is neither the fixed nor the moving space. For example, SyN registration defines a half-way space such that the deformation from the virtual space to the fixed image is symmetric with that from the virtual space to the moving image.

The regular voxelwise grid of points in the fixed space is used for applying the warps, i.e., resampling the moving image. During the registration itself, the algorithm doesn't have to be in the fixed space or use a regular grid.

The problem was my understanding of the resampling process. The resulting resampled images, and so basically every visualization that we can get, will illustrate the inverse transformation; but in reality this is just due to the resampling process, which imports the intensity value at the transformed coordinate (in the moving image) to the original coordinate (in the fixed image), replacing the original pixel intensity values (on the grid of voxels) with the corresponding values from the moving image.