AI-generated pull requests: overwhelming, hard to review carefully

The current stream of AI-generated (or AI-aided) pull requests is a bit overwhelming to me. It is hard for me to review them carefully.

In general, I try to avoid reviewing any pull request that proposes changing thousands of lines of code, unless the topic of the PR is of special interest to me. Nor would I normally review this many pull requests per day. AI, however, can produce an enormous number of pull requests in no time, each changing thousands of lines of code.

Quoting The Register:

The burden of AI-generated code contributions – known as pull requests among developers using the Git version control system – has become a major problem for open source maintainers. Evaluating lengthy, high-volume, often low-quality submissions from AI bots takes time that maintainers, often volunteers, would rather spend on other tasks.

How do you feel?

1 Like

Prompting the AI to make the changes is one part of the work; reviewing those changes is the other. If the AI produced a small number of changes, then diagnosing the problem and prompting the AI was the larger part of the task, and reviewing the smaller. If the number of changes is large, reviewing becomes the dominant part of the work. This is common with refactoring-type changes.

I think that a contributor should review their AI’s output before proposing it as a PR. Otherwise it mostly amounts to “I would like this sort of change made; please check whether it is correct”. To take a quote from this article: “AI-generated code requires more careful review than human-written code. Every line is suspect.”

1 Like

I mostly agree. My experimentation with AI code generation this weekend was largely a failure (from a human-effort perspective).

I wanted to see how well AI could assist with converting argument-free ctests to Google tests, and to document the approach. I wanted to test what all the news outlets are stating as fact, and I was hoping to benefit ITK in the process.

My efforts identified 846 ctests that should/could be converted to Google tests. I chose to convert only the common directory (about 36 tests), and it is clear that managing this relatively simple mechanical task is too burdensome.

I was hoping for an initial conversion that was essentially “no worse” than what was there, but with the benefit of being GTests instead of ctests (which are much easier to work with in the IDE). I was impressed by how well the conversions went; they introduced no new failures and added new tests. I think, on the whole, the results were slightly better after conversion than before.

HOWEVER:

  1. Conversion puts eyes on the code, and the community then often wants long-standing shortcomings in the original code fixed at the same time, while also insisting that only one kind of change be made per PR.

  2. The volume of PRs overburdens the already extensive CI testing infrastructure (a computational limitation) and clogs up other people’s efforts.

  3. Balancing how much to bundle into a single PR is really hard to do in a semi-automated way. Keeping each change in a separate PR creates merge conflict hell, leading back to #2 above.

  4. Competing interests: make minimal changes, but include all fixes to all identified shortcomings at the same time.

==============================

My observations as I train the next generation of developers.

This failed experiment has given me a lot to think about. While AI is really good at handling much of the grunt work for housekeeping tasks, it leaves a great deal of work for the experts with comprehensive knowledge of the system, who must review and address the results.

Perhaps for a project that can live with rapidly evolving changes and tolerate small regressions buried in large batches of improvements, a bulk AI conversion is worth considering. For a project with very strict requirements for near-perfect commits, addressing the resulting PRs will carry a heavy burden. A bigger concern is likely the dissatisfaction of the small number of active developers who serve as gatekeepers for each PR, and who become overburdened trying to review every one of them to the highest standard.

3 Likes

I empathize with Niels’ expression of the critical need to avoid maintainer overwhelm.

And while AI can reduce maintainer burden (@blowekamp 's new third-party update skills are a good example of this!), we do have to be careful how it is approached. As Hans noted, the experienced and thoughtful input of developers who have perspective on the historical reasons for things, project goals, and architectural designs, along with the need to train the next generation of developers, are critical.

As we now have arrived in the age of “agentic engineering,” processes that support high quality development are as important as ever, e.g. fast, effective, and thorough CI testing.

Our AGENTS.md is a good start at helping the agents follow our coding style for consistency, ensuring our test coverage, etc. We should continually improve it so the agents get closer to the results that we want on the initial pass.

We can also help reduce review burden with more AI agents :-). The summaries and first-pass analyses from review agents, while not perfect (and we should not expect them to be), are very helpful for providing an overview and a first pass at identifying issues. We could auto-enable GitHub Copilot review on PRs. What are folks’ thoughts on this?

1 Like

It seems like it has been a learning experience for you and the community. In that way it is a success.

There have been a couple of cases where the agent’s review had some good comments that I was able to address, or have AI fetch and address. However, on the initial PR for the CMake Module Interface work, it was not helpful. It left a good number of small, detailed comments, but it was not helpful with the CMake architectural review. Furthermore, I wasn’t able to dismiss its misconceptions, and they resurfaced in the next round of reviews. I didn’t think it was a good use of time to respond to the AI in this situation.

I think AI reviews can likely handle the details and style best practices more easily than higher-level design issues. The latter are the reason I sometimes open a PR early: to see whether there is agreement that it is a good thing to do, and that the approach is reasonable.

1 Like

@hjmjohnson It would be good if you could take a look at AGENTS.md and update it, while all this is still fresh in your head.

I have also found that the comments are sometimes but not always helpful, and I have observed their quality to improve significantly over time.

According to GitHub, they have been improving the review agent so it will only bring up new comments in subsequent reviews.

1 Like

Thanks for bringing up the topic of Copilot reviews!

I see that AI-generated reviews can be helpful, just like compiler warnings can help us prevent mistakes. But I don’t think we should process an AI review the same way we would process a human review. The human review process is not just technical; it’s also a social interaction. Obviously a human review should always be treated in a friendly and polite way, and of course, a human reviewer may be unhappy when their comment is ignored. What about a Copilot review? Do we always need to reply to its suggestions in a friendly way? Do we need to “defend” our proposed change against AI-generated criticism?

I feel that we shouldn’t put too much weight on an AI generated review. We shouldn’t feel bound to address all of its comments.

Of course, an AI generated comment will become more relevant when it is supported by a human reviewer, by a “like” or a follow-up comment.

I’m not sure about auto-enabling GitHub Copilot review on PRs. It takes away the human interaction of requesting a Copilot review on a PR. Is that a good thing or not? When I try to address a Copilot review, it sometimes makes me wonder who I am doing it for. I don’t want to have the feeling that I’m just trying to please a robot :person_shrugging: When someone has actively requested a Copilot review on my PR, it’s clearer to me that addressing the review may also please a human being :smiley:

Of course, there are also environmental costs to the use of AI. AI is known to take lots of energy. I don’t know how that compares to our regular CI, for example. Just something to keep in mind. :innocent:

2 Likes

Absolutely loving this thread. Thanks for kicking off such a thoughtful discussion :heart:

I very much agree that the human side of our collaboration is paramount, and deserves explicit protection and celebration. Code review and issues are a big part of how we build trust, mentorship, and shared ownership in ITK, and I’d really like to keep those interactions social, relational, and human-to-human, with AI as background tooling rather than a “participant” in the conversation. :slightly_smiling_face:

I’m also very guilty myself of anthropomorphizing AI agents. :man_raising_hand: It’s so easy to slip into that because the “API” is natural language, but the reviews these systems give are still rule-driven computational models, reflecting patterns from their training data and our existing codebases, not intentional human judgement. We do need to guard against treating them as people; they don’t need us to be “friendly” in the social sense (though there’s no reason to be rude either!), and we shouldn’t let politeness to a tool dilute the clarity of our technical decisions.

For that reason, I think it’s helpful to mentally file AI review agents alongside linters, compilers, and integration tests. They’re extremely useful, often catch real issues, and can make us more productive and improve code quality. But, just like any other tool, their feedback will never be 100% correct, and occasional false positives or misunderstandings are expected, not a crisis. We should treat AI comments as strong hints or hypotheses to evaluate, not as authoritative verdicts.

We should also stay mindful of the energy cost of all this. Running heavy models on every commit has a real environmental and financial footprint, so designing our workflows to get high value from AI per unit of compute (e.g., scoping when and how we trigger reviews) feels important.

On the positive side, the quality and efficiency of AI code review is moving quickly. Claude Code just today launched a dedicated multi‑agent Code Review system that automatically reviews each PR and leaves inline comments where it finds likely issues, modeled on Anthropic’s internal workflows for nearly every PR. GitHub Copilot’s code review features are also maturing, with Copilot acting as a reviewer that can leave comments and even help implement changes via follow‑up actions. And there’s a broader ecosystem that’s been evolving for years.

That diversity of input is valuable, including among AI tools themselves. I can imagine a future where, for well‑understood parts of the ITK codebase and workflows that the team is comfortable with, it might be reasonable to auto‑enable AI reviews by default because we trust both the tools and our patterns for interpreting them. But even if/when we get there, I’d still want human reviewers at the center. AI provides instrumentation and safety rails. Humans do the actual collaboration and make the final call.

3 Likes