Opus 4.5 on Code Review

Announcing a new version of Macroscope Code Review with substantially better performance, powered exclusively by Claude Opus 4.5.

Macroscope
Engineering Team
December 12, 2025
Macroscope Code Review v2, Powered by Opus 4.5: 40% Higher Recall & Fewer False Positives

Today we’re releasing a new version of Macroscope Code Review with substantially better performance, powered exclusively by Claude Opus 4.5. Based on our internal benchmarks, v2 has 40% higher recall and improved precision, generating 10% fewer false positives compared to our previous production pipeline. This means Macroscope is able to find significantly more bugs while also reducing the false positive rate (incorrectly flagged bugs).

From evaluating Opus 4.5 on our internal code review benchmarks, we learned the following:

  • Overall, Opus 4.5 vastly outperforms any other model we tested when balancing cost, latency, and performance. We assess performance in terms of recall (the fraction of bugs in our internal benchmark that we detect) and precision (the percentage of surfaced issues that we determine are valid bugs).
  • Performance: We use F1 score, the harmonic mean of these two metrics (recall and precision), to give us a single number that encapsulates overall code review performance (see the sketch after this list). In our benchmark, our new Opus 4.5-powered Code Review achieves a 25% higher F1 score than our previous production code review pipeline.
  • Recall: Code Review v2 powered by Opus 4.5 has 40% higher recall than our previous production pipeline, which was powered primarily by GPT-5.1. The gain over our initial testing reflects additional tuning and prompt adjustments that enable the model to catch even more real bugs.
  • Precision: Powering Code Review v2 with Opus 4.5 resulted in the best precision of any model we’ve tested, by far. It significantly outperformed Sonnet 4.5 and GPT-5.1 on raw precision, generating 10% fewer false positives.
  • Latency: The only notable trade-off of this new pipeline is ~40% higher latency, with an average of 262 seconds in our internal benchmarks compared to 183 seconds for our previous production pipeline. Based on our own testing and feedback from customers, we think this trade-off is well worth the benefit of meaningfully higher recall and precision, especially since the average latency remains well below the runtime of most customers’ CI/CD checks.
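
For readers less familiar with these metrics, here is a minimal sketch of how precision, recall, and F1 relate. The counts below are hypothetical placeholders for illustration, not our benchmark data, and the helper function is not part of any Macroscope API.

```python
# Minimal sketch of the metrics discussed above.
# Counts are hypothetical placeholders, not Macroscope benchmark results.

def precision_recall_f1(true_positives: int, false_positives: int, false_negatives: int):
    """Compute precision, recall, and their harmonic mean (F1)."""
    precision = true_positives / (true_positives + false_positives)  # share of surfaced issues that are real bugs
    recall = true_positives / (true_positives + false_negatives)     # share of benchmark bugs that were detected
    f1 = 2 * precision * recall / (precision + recall)               # harmonic mean of precision and recall
    return precision, recall, f1

# Example: 80 real bugs flagged, 20 false alarms, 40 benchmark bugs missed.
p, r, f1 = precision_recall_f1(true_positives=80, false_positives=20, false_negatives=40)
print(f"precision={p:.2f}, recall={r:.2f}, F1={f1:.2f}")  # precision=0.80, recall=0.67, F1=0.73
```

Because F1 is a harmonic mean, it rewards improving whichever of the two metrics is weaker, which is why we track it alongside raw recall and precision.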

As we noted in the benchmark we released in September, we’re relentlessly focused on catching the most bugs while minimizing false positives. Early feedback from customers has already been encouraging, and we look forward to hearing more.

Code Review v2 is now live for all Macroscope customers. If you haven’t used Macroscope, we encourage you to sign up for a 2-week free trial.

We're continuing to invest heavily in code review quality and plan to publish updated benchmarks as we ship improvements. As always, reach out on Slack or email contact@macroscope.com with feedback – we'd love to hear from you.