Remote Module Grading/Ranking System Needed

hjmjohnson · February 20, 2020, 10:29pm

Background

One of the primary goals of creating remote modules was to ease the burden of core developers in maintaining the core ITK code base.
There are currently 44 remote modules distributed with ITK, but there is no mechanism for knowing the quality of the code, or the quality of the testing for those modules.
While attempting to update the 44 remote modules to the latest coding styles for C++11 & clang-format style guidelines it became clear that some of the modules have not compiled successfully for over a year, and that many of the modules have failing or invalid tests.

Goals of this proposal

Provide a grading/ranking system for the remote modules to better convey the compliance level for which the module passes.
Provide cmake filtering to hide modules of lower-class quality
Provide ITK core developers a test-bed in Compliance levels 0,1,2 for ensuring backward compatibility testing, external tool support, and identifying migration guide support
Provide ITK core developers with indications of which remote modules (3,4,5) have not reached a level of maturity which demands high levels of efforts to ensure that they continue to work with the latest ITK developments (Perhaps they have surpassed their useful lifespan, or been replaced with other mechanisms).

Initial Grading Level Criteria

Compliance level 0 (AKA ITK main modules, or remote modules that should become core modules)

Widespread community dependance
Above 90% code coverage
Nightly dashboards and testing monitored rigorously
All requirements below

Compliance Level 1 (Very high-quality code, perhaps small community dependance)

Meets all ITK code style standards
No external requirements beyond those needed by ITK proper
Builds and passes tests on all supported platforms within 1 month of each core tagged release
Active developer community dedicated to maintaining code-base
75% code coverage demonstrated for testing suite
Continuous integration testing performed
All requirements below

Compliance Level 2 (Quality code, perhaps niche community dependance)

Compiles on niche community platforms (may depend on specific external tools like a java environment, or specific external libraries to work )
All requirements below

Compliance Level 3 (Features under development)

Continuous Integration with at primary use case testing passing on all platforms
All requirements below

Compliance Level 4 ( Code of unknown quality )

Some tests exist and pass on at least some platform
All requirements below

Compliance Level 5

Deprecated code, known to be of limited utility, perhaps has known bugs (i.e. the Neural network remote module)

Proposed Action Items

Create subdirectories in the “Modules/Remote” directory for each category.
For each ${mdlname}.remote.cmake file add a cmake comment block indicating how the compliance level was agreed upon (perhaps with links to discourse)

# This module was reviewed during the ITKv5.2 release.
# * Meets all ITK code style standards, as enforced by clang-format
# * Does not have external requirements beyond those needed by ITK proper
# * Builds and passes tests (as of ITKv5.2 release)
# * Active developer community dedicated to maintaining code-base as evidenced by external project build reviews indicate that this module is used by ANTs/Slicer/BRAINSTools packages, and
those developers have a history of maintaining this code-base.
# * 83.5% code coverage demonstrated for testing suite on 2020-02-17
# * Continuous integration testing performed, clang-format linting performed, and kwstyle testing performed

set(${mdlname}_COMPLIANCE_LEVEL 1)

Add cmake variable ${mdlname}_COMPLIANCE_LEVEL=[1,2,3,4,5]
Update itk_fetch_module to hide modules with lower compliance levels than requested default=0)

Please provide comments/suggestions to this proposal

lassoan · February 21, 2020, 5:22am

Thank you for working on this. Rating of remote modules could be certainly useful for many things. The proposed system looks like a good metric for code quality/compliance. However, I see other important factors that we could consider, such as:

development status: planning, alpha, beta, stable, mature, inactive (from PyPI)
number of users/popularity (number of stars on github?)
accuracy: has the software been validated, does it use well-established methods, …?
robustness: does the software tend to crash or hang?
documentation: are methods documented, are there examples, tutorials, sample data?

It might be better to determine a score based on each criteria separately, instead of trying to come up with a single combined score. If we really need a single score then we could always compute a composite score from multiple scores by using some weighted sum.

jhlegarreta · February 21, 2020, 9:04am

Thanks for the monumental work you have done lately to update the remotes Hans. And both of you for discussing these aspects.

In the past I invested some time in trying to improve the remote module infrastructure to comply with the recommended practices in the ITKModuleTemplate, their coverage, keeping up to date with the core ITK coding style changes, creating (together with some folks in the community) some scripts in ITK/Utilities/Maintenance to be able to apply a given change to all remotes, etc.

I have not had that time since a while, though. I am sorry.

Some have still not transitioned to Azure pipelines, and as Hans says, some have not built successfully for a while.

Having read Hans’ introduction, one of the first things that I realize is that we made an effort to transfer the repositories to the InsightSoftwareConsortium organization. This was motivated because very frequently we had a few issues with the upstream repositories (e.g. did not have merge rights, etc.), or the original contributors and maintainers did no longer have the time to maintain their repository. This somehow means that the maintenance burden is put back to the core developers or to the community in a broader sense. I think this is a chicken and egg problem.

Another of the issues had to do with keeping the modules up-to-date with the latest ITK release, but at the same time providing maintenance for prior releases or other ITK build options (e.g. ITKv4 vs. ITKv5, CMake flags such as ITK_LEGACY_REMOVE to On/Off, etc.), or allowing the modules to be used both within the ITK source tree and outside. Both these certainly require work to create the appropriate CI builds, tagging the repositories, etc.

IMHO, scoring the remotes would still demand an automated system based on the criteria we may come up with, including the ones proposed in this thread.

A few thoughts:

In my mind, a proxy of the number of users would be the number of clones or the number of PyPI installs, rather than the number of stars.
The accuracy or the fact that it has been validated I guess it comes from the amount and quality testing (i.e. does it use just synthetic testing data?, can we add real imaging data?, etc.) or from its use/citation in scientific publications (this last may be hard to track unless and even though a citation system like Zenodo is used?) beyond the IJ.
IMHO the robustness is determined by the CI. Additionally, if I am not mistaken the Python ecosystem has tools to automatically report failures from users that use the package.
Can we automatically determine/measure whether all methods are documented?

And Hans’ points are also interesting/necessary.

IMHO it is unfeasible that this is done manually by any core developer.

And finally, when a remote has been around for some time and is at compliance level 0 or 1, when is the time to integrate it into ITK proper? In the ITK Software Guide, section 9.7.2 Procedure for Adding a Remote Module, p. 239, it was agreed to say that

After the Remote Module has experienced sufficient testing, and community members express broad interest in the contribution, the submitter can then move the contribution into the ITK repository via GitHub code review.

But I think we did not succeed in doing this for a few of the remotes that have been around for some time (which would ease some of the issues above).

May be we should try to define what is sufficient and broad interest (e.g. number of pip installs?) as well using the criteria you both proposed.

matt.mccormick · February 21, 2020, 1:42pm

@hjmjohnson great idea. A score will help set expectations of a remote module’s quality.

This is remarkable!

@jhlegarreta Thanks for the efforts on these. In terms of remote module infrastructure, GitHub Actions are now available, which is tremendous for CI, but also continuous deployment and other infrastructure improvements.

Just from a build time perspective, new modules should not be moved into the core unless extremely necessary. Maintainability is also prohibitive.

To keep the grading maintained, objective and automated criteria, as suggested by @lassoan will help. Also, squashing one or two of the proposed levels will reduce complexity – this will make it easier to decide on grading updates and easier to interpret the grades.

hjmjohnson · February 21, 2020, 6:57pm

@lassoan Thanks for the great suggestions. I agree that all the items you are suggesting are important considerations. The reality is that only a small handful of developers (less than 10) are trying to maintain the 44 remote modules. I am concerned with having more criteria than developers or projects.

My primary motivation is to reduce developer burden. I am fearful that properly implementing the great ideas you suggest would add too much burden.

I would like remote module sub-communities to advocate for the modules they want to improve the grades on by providing pull requests that increase their compliance level. Those pull requests will generate discussion, and hopefully resolutions.

hjmjohnson · February 21, 2020, 8:09pm

@jhlegarreta I agree that the ideal situation that you propose is ambitious and far too much work for any one developer to manage with manual intervention.

I am proposing a MUCH simpler solution: Add documentation about the expectations, and single cmake variable:

set(${mdlname}_COMPLIANCE_LEVEL 1)

Allow subsequent Pull request reviewers to determine if the PR meets the intentionally fuzzy expectations. The benefits of this proposal are the simplicity of implementation, and the (hopefully) generated conversations that result in the PR discussions.