OtsuMultipleThresholdsImageFilter creates asymmetric classes for a symmetric distribution


(This thread was started already in the old list at: http://itk-users.7.n7.nabble.com/ITK-users-OtsuMultipleThresholdsImageFilter-creates-asymmetric-classes-for-symmetric-a-distribution-td38761.html)

I have investigated the problem described there and found the following bug:

When OtsuMultipleThresholdsImageFilter computes the classFrequency vector (in IncrementThresholds), it accumulates the frequencies over the histogram bins in the half-open range (previous index, this index], so that the bin at the previous index is excluded but the bin at the current index is included. This means the threshold is effectively at the UPPER boundary of the histogram bin.
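To make the accumulation concrete, here is a minimal standalone sketch (hypothetical names, not the actual ITK source) of a class-frequency computation over that half-open bin range; it shows why the effective cut sits at the upper edge of the bin at the threshold index:

```cpp
#include <vector>

// Hypothetical sketch of the accumulation in IncrementThresholds:
// sum the frequencies over the half-open range (previousIndex, thresholdIndex].
// Because the bin at thresholdIndex is INCLUDED in the lower class,
// the effective class boundary is that bin's UPPER edge.
long classFrequency(const std::vector<long>& histogram,
                    int previousIndex, int thresholdIndex)
{
  long freq = 0;
  for (int i = previousIndex + 1; i <= thresholdIndex; ++i)
  {
    freq += histogram[i];  // bin at thresholdIndex counted in this class
  }
  return freq;
}
```

For example, with a histogram {2, 0, 2} and threshold index 0, the lower class contains exactly the first bin, so everything up to that bin's upper edge belongs to the lower class.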

When the actual threshold to be returned is computed (line 329), the code uses the function GetMeasurement, which returns the MIDPOINT of the bin:

m_Output[j] = histogram->GetMeasurement(maxVarThresholdIndexes[j],0);

This discrepancy leads to the error reported earlier.

I suggest the following fix. Replace line 329 with:

m_Output[j] = histogram->GetMaxs()[0][maxVarThresholdIndexes[j]];

which, as I understand it, gives the corresponding maximum value of the bin, i.e. the value that was effectively used to discriminate between the classes during the variance-maximization algorithm.

It is great that you found the problem Frederik! Do you want to submit the patch for this? That way you will get full attribution. If not, I can do it and note in the log that you found the bug and offered a solution.

You are proposing updating this section of code:

It certainly appears that your update matches the intention of the code’s comment there.

Digging into git blame, I see this was changed from your proposal to the current behavior in the following change:

So, there appears to be a difference in opinion on what the correct behavior should be… I think further investigation is needed…

Ok, I will put down some arguments here.

If we use a histogram with a low number of bins (coarse discretization), we cannot expect the accuracy of the threshold value to be good. In the above quoted text, when using 4 or 8 bins we have to expect the threshold to have a large relative discretization error of 1/4 = 25% or 1/8 = 12.5%. The discretization error can be reduced by increasing the number of bins.

I believe it is more important that the returned threshold value be the one that defined the boundary between the classes for which the between-class variance was maximized.

I will elaborate on the example presented in the mailing list, where a picture with values [1000, 2000, 3000, 4000] was segmented. I would expect that if the number of bins is sufficient to put bin boundaries between 2000 and 3000, any of those boundaries can be returned as the threshold. Using the ITK Otsu method with 3 bins defines the bins

1: [1000, 2003.33…)
2: [2003.33…, 3006.66…)
3: [3006.66…, 4010).

It then computes that the first class consists of bin 1 (containing the values 1000 and 2000) and the second class consists of bins 2 and 3 (containing the values 3000 and 4000). It is clear that the threshold is 2003.33… here. However, the value 1501.66… (the midpoint of the first bin) is returned.
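For concreteness, here is a small sketch (assuming the padded range [1000, 4010) from the bin list quoted above; the helper names are mine) comparing the two candidate return values for threshold index 0:

```cpp
#include <cmath>

// Upper edge of the bin at threshold index t: the value that was
// effectively used as the class boundary during the optimization.
double binMax(double lo, double hi, int bins, int t)
{
  const double width = (hi - lo) / bins;
  return lo + (t + 1) * width;
}

// Midpoint of the bin at threshold index t: what GetMeasurement returns
// in the current implementation.
double binMid(double lo, double hi, int bins, int t)
{
  const double width = (hi - lo) / bins;
  return lo + (t + 0.5) * width;
}
```

With 3 bins over [1000, 4010) and threshold index 0, binMax gives 2003.33… (above the pixel value 2000, the intended side) while binMid gives 1501.66… (below 2000, the wrong side).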

Increasing the number of bins does not guarantee that the threshold lands on the right side: as long as the bin that includes the value 2000 has a midpoint less than 2000, we get the wrong class boundary. This happens for 50% of all possible bin partitions, independent of bin size.

One can argue that the midpoint converges to the bin maximum when the number of bins is large. On the other hand, the frequencies in class 1 and class 2 do not converge if the midpoint is returned as the threshold: they alternate between {1, 3} and {2, 2} indefinitely for general bin partitions with decreasing bin size. Nevertheless, {2, 2} is the expected frequency pair for all bin sizes small enough to resolve the gap between 2000 and 3000.
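The alternation is easy to check numerically. A sketch (again assuming the padded range [1000, 4010), and the convention that values less than or equal to the returned threshold go to the lower class):

```cpp
#include <vector>
#include <algorithm>

// For pixels {1000, 2000, 3000, 4000}, take the cut at the upper edge of
// the bin containing 2000 (the edge that resolves the gap), but return
// that bin's MIDPOINT as the threshold, as the current code does.
// Count how many pixels end up in the lower class.
int lowerClassSize(int bins)
{
  const double lo = 1000.0, hi = 4010.0;   // padded range, as in the text
  const double width = (hi - lo) / bins;
  const int idx = static_cast<int>((2000.0 - lo) / width); // bin of 2000
  const double midpoint = lo + (idx + 0.5) * width;        // returned value
  const std::vector<double> pixels = {1000, 2000, 3000, 4000};
  return static_cast<int>(std::count_if(pixels.begin(), pixels.end(),
      [&](double p) { return p <= midpoint; }));
}
```

Sweeping the bin count, the lower-class size keeps flipping between 1 and 2 (e.g. 1 for 3 bins, 2 for 4 bins, 1 for 5 bins, 2 for 7 bins) instead of settling at the expected 2.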

I agree with your arguments, @hellman. It is less important to have a good approximation to the final value with a small number of bins, as Dirk examined. It is more important to have a correct threshold with a small number of bins.

To have a better comparison with Dirk Padfield’s analysis, can you examine the case of an image with 3001 pixels, with values 1000, 1001, 1002, …, 3999, 4000? What are the numbers with 1, 2 and 3 thresholds (2, 3 and 4 classes), and with 2, 3, 4, 5, 127, 128 and 129 bins? What are the numbers for the 4-pixel image [1000, 2000, 3000, 4000]?

You are making a very good point and I agree.

Have you looked at the “Valley Emphasis” option?

Here are some additional considerations:

  • Dirk compared a variety of Otsu implementations, and was surprised about the inconsistencies between them.
  • Keeping the Otsu and MultiOtsu methods consistent is a good thing to maintain.
  • The Otsu methods are VERY popular. The prior change occurred during the ITKv4 effort. I recall quite a number of Slicer modules getting test failures as a result of that change. A subtle change in a mask can have a surprising impact on registration.

These contrived corner cases are great when developing and testing an algorithm, but may not matter for how the method is used on real images. It may be more of a problem to have the results of someone’s algorithm change.

The solution may be that another “flag” needs to be added to control this behavior.

I am sorry that I have been idling on this issue.

Dženan, that’s a lot of numbers (128 times 3 times 2 :slight_smile: ) I think the issue is most clearly visualized with a very small set of pixels, and that’s also the case when it matters most.

Here are the threshold values with 128 bins and 1-3 thresholds for the cases

  1. otsu: four pixels 1000, 2000, 3000, 4000

$ ./otsu 128



  2. otsu2: 3001 pixels

$ ./otsu2 128



It is interesting to note in case 1 that one class is completely empty for 2 and 3 thresholds.

Bradley, I did not try the Valley Emphasis option.

My issue with the current implementation is not that it does not give good results in practice, but that it doesn’t follow the behavior guaranteed by the original method. The reason I stumbled upon this is that I am writing unit tests for code that uses this filter as one of its components. The Otsu thresholding algorithm is very predictable for simple images such as the one with 4 pixel values, so I decided to include it in the unit tests to check that the behavior of my code is not broken, for example, by a future refactoring.

Regarding the fact that several existing regression tests will start failing if the behavior changes, adding a flag sounds like a good option. The question then is what the default should be. I understand that much work might be saved initially by letting the default be the current behavior, but I’d suggest the opposite, i.e. to introduce a flag that turns on the legacy behavior of the current implementation, and let the default be to take the actual bin boundary values as thresholds. The following example shows that the method does not only fail for pathological images.

Below, I have an image with a black (0) background and two gradients, from A -> B and from C -> 255, with 0 < A < B < C < 255. I used Slicer to segment it with 2 thresholds. It should give one class for the black background (0), one class for [A, B] and one for [C, 255].


This is also the case when the number of bins is 128. However, when changing to 127 bins, I get the following three classes:

The value B is included in the third class, while it should be in the second.

Edit: It should be noted that B = 150 and C = 188, so 127 bins should be enough to fit a threshold in between the two values.


Since we are in a transition from version 4 to version 5, changing the default behavior while allowing legacy behavior via a class member flag is acceptable. But let’s wait for the opinions of others.

@hellman Thanks for the close attention to this issue.

Some thoughts:

The Otsu algorithm assumes:

  • There is a distribution of pixel values.
  • There are modes within that distribution.
  • The algorithm distinguishes those modes.

The examples provided do not satisfy the first two assumptions well, so the expected behavior is not well defined. But we could try to make it behave sensibly. Yet, this should not sacrifice the behavior when there is a good distribution of pixels and well-defined modes (I don’t know whether that is the case in what you are proposing).

What is the current status on this issue?

As an answer to Matt’s concern about sacrificing behavior for good distributions: I am quite certain this code change would not do that. The thresholds were computed based on the upper bin boundaries, but returned to the user as the bin midpoints. This error can barely be noticed for small bin sizes on continuous distributions, but for discrete distributions it can cause the returned threshold to fall on the wrong side of the mass of a mode.

I just submitted a patch for this. Please review and comment.


@hellman I could not get such a dramatic failure because the image you uploaded has jpg compression artifacts. I cleaned it up a little bit but it would be great if you could supply the original image in a lossless format.

Thanks a lot for providing the patch! I do not have access to exactly those images, but I made a new image, a png that can be found in the attached tarball.

Included is also source code for a test example that illustrates the behavior. It loads the image and performs Otsu thresholding with 2 thresholds for 128-138 bins. The output images are written to a series of files. Scrolling through the output files (e.g. in vv), one can see that the classification is sensitive to the bin count.

Applying your patch to ITK makes the classification consistent and independent of bin count.

otsutest.tar.gz (6.9 KB)

I updated the patch with this test image. I think it is now ready for merging.

Hi Everyone,
I’m coming back to Discourse as it is a bit simpler to discuss here than on Gerrit.

Thanks for all the work so far. These subtle details are always difficult to sort out. I’ve tested the changes and they do appear to perform as advertised. However, I’m trying to shake the gut feeling that this change is something of a band-aid solution.

I think Matt’s earlier comment on the assumptions within Otsu is valid. I’m not convinced that the ramp example satisfies them well, but it is a sensible test. The problem I’m struggling with is why computing on a shifted version of the histogram (i.e. bin max) and then shifting to the bin middle makes such a difference. I’m assuming here that all parts of the computation are done with the relevant shift; hopefully that is true.

My first thought was that perhaps this is an odd vs even separation thing. However, the ramp test (image_3modes.png) has a sufficiently large gap between classes that this should not matter.

My second thought is that we have a series of threshold choices with the same score. The test case has a gap in the brightness histogram between 143 and 179, and I suspect the threshold selection score (varBetween) is the same for all bins in this range (Dzenan, can you confirm, if your dev environment is set up?). The code implicitly assumes that tied values are unlikely (it only updates the best threshold if the new score is strictly higher than the old one). Thus the first bin in the gap gets selected. It is very likely that returning the max of this bin produces a very different result from returning the middle, as shown by this test. Someone with more time on their hands could attempt to produce another case where returning the middle is “right”.

So, I think the best solution in this case would be to analyse the varBetween values more carefully: presumably store a vector of them, not assume a unique maximum, search for an extended maximum, and take its midpoint.

However this is clearly a pain. Do we want to go this way? Is my guess that the threshold score is constant wrong?

I am not sure this is what you mean, but I think the statement “all parts of the computation are done with the relevant shift” is exactly what is not true:

  1. The optimal maximum inter-variance thresholds are computed based on the edges of the histogram bins.
  2. The returned thresholds are shifted to midpoints after the optimization has been performed.

It is interesting to note that the only information about the pixel value distribution available to the optimization algorithm is the pre-computed histogram. A histogram bin contains the frequency of the pixels that fall into the range of that bin, but we do not know anything about the distribution of the actual pixel values within that bin. This means the algorithm (A) needs to make assumptions about the distribution within the bin and (B) only knows that the bin edges can discriminate “mass” in the underlying distribution of pixel values.

For (A), if I understand the implementation correctly, the code assumes all pixels with values within the bin range have their values at the midpoint of the bin.

Regarding (B), any perturbation of the bin edges in the histogram might have given a different optimal threshold, depending on how far from the assumption in (A) we are. This might happen, for example, if there is significantly more “mass” in the distribution close to a bin edge than in other parts of the bin. This mass then falls over to the adjacent bin if the edge is perturbed.

Yes, this makes point (B) above critical for distributions where there are several consecutive bins with frequency 0. Shifting the optimal bin edge to the midpoint of the bin to its left (even for continuous distributions of this kind) is very likely to put some pixel values in the wrong class. This will happen every time the upper part of the bin to the left of the edge contains pixels.

I am unfortunately not familiar enough with the implementation details to answer the questions about e.g. storing a vector of varBetween values.

Also, I guess the entire notion of splitting classes, where the estimate of the classes is based on histogram bins, makes more sense if the split is at a bin boundary than within a bin.

So the conclusion is that the patch, which makes the class boundary at the bin maximum the default, is the right way to go and should be merged?


Reluctantly :-), but yes.