Meta's JiTTests: How AI Writes 22,000 Tests, Finds 4x More Bugs, Then Deletes Them All

Meta built an AI system that writes tests, catches bugs humans miss, and then throws away every single test it creates. On purpose.
It's called JiTTests (Just-in-Time Tests), and the peer-reviewed results speak for themselves: 4x more bugs caught than traditional testing, with a 70% reduction in human review load.
Here's how it works, why it works, and what it means for the future of software testing.
The Problem with Traditional Testing
Most codebases have thousands of tests. And most of those tests are doing one thing: making sure the code does what it already does. They verify existing behavior. They confirm things work the way they're supposed to.
What they don't do well is find new bugs in new code.
When a developer pushes a code change, the test suite runs. If everything passes, the change ships. But passing all existing tests doesn't mean the new code is correct. It means the new code didn't break anything that was already being tested.
There's a gap between "tests passed" and "code is actually correct." That gap is where bugs live.
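To make that gap concrete, here's a hypothetical example (the displayName function and its test aren't from Meta's paper). The new line carries a subtle intent bug, yet the existing suite stays green:
// A developer extends an existing function to support nicknames
function displayName(user) {
  // `??` only falls back on null/undefined, so an empty-string nickname
  // slips through and renders a blank name: the new, untested bug
  return user.nickname ?? `${user.firstName} ${user.lastName}`;
}
// The pre-existing test still passes, so CI stays green
test('formats first and last name', () => {
  expect(displayName({ firstName: 'Ada', lastName: 'Lovelace' })).toBe('Ada Lovelace');
});
// Nothing in the suite exercises nickname === '', so the bug ships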
The Test Debt Problem
At Meta's scale, this problem compounds. With thousands of engineers shipping code daily across platforms like Facebook, Instagram, WhatsApp, and Messenger, the testing infrastructure becomes massive. Tests need maintenance. Flaky tests need fixing. Test suites slow down CI pipelines.
Adding more tests to catch more edge cases sounds logical, but it creates test debt — a growing maintenance burden that eventually works against you.
Meta's Approach: Just-in-Time Testing
Instead of writing permanent tests that stick around forever, Meta built a system that generates tests specifically to find bugs in new code, and then throws them away once they've done their job.
The system is called JiTTests, and it uses a 6-step pipeline:
Step 1: New Code Lands
A developer submits a code change (a "diff" in Meta's internal workflow). The system picks it up automatically.
Step 2: LLM Infers Intent
A large language model reads the code change and infers what the developer was trying to do. This is the key insight: instead of blindly testing random behaviors, the AI understands the intent behind the change.
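Meta hasn't published its prompts, so treat this as a rough sketch of the idea only: hand the model the diff text and ask for the author's intent plus the behaviors a bug would break (the inferIntent helper, the prompt wording, and the callModel parameter are all assumptions, not Meta's API).
// Hypothetical sketch: turn a diff into an intent summary via an LLM
async function inferIntent(diffText, callModel) {
  const prompt = [
    'You are reviewing a code change.',
    'In one or two sentences, state what the author intended this change to do,',
    'then list the behaviors that could plausibly break if the change has a bug.',
    '',
    diffText,
  ].join('\n');
  // callModel is whatever LLM client the pipeline wires in; it is not specified here
  return callModel(prompt);
}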
Step 3: Create Mutants
The system creates "mutants" — slightly broken versions of the code change. These mutations simulate the kinds of bugs that might exist. For example, swapping a > for a >=, removing a null check, or changing a return value.
Step 4: Generate Catching Tests
The LLM writes tests specifically designed to catch these mutants. Each test targets a specific potential bug in the new code, not the existing codebase.
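Continuing the hypothetical refund example from Step 3, a catching test is one that passes against the real code but fails against at least one mutant:
// Kills mutant A: passes on the real code (30 <= 30), fails once <= becomes <
test('day 30 is still inside the refund window', () => {
  expect(isEligibleForRefund({ daysSincePurchase: 30 })).toBe(true);
});
// Kills mutant B: passes on the real code, blows up when the null check is gone
test('a missing order is not refundable', () => {
  expect(isEligibleForRefund(null)).toBe(false);
});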
Step 5: Assessors Filter Results
An ensemble of rule-based and LLM-based assessors evaluates each test. This step filters out flaky tests, false positives, and tests that would waste an engineer's time. Only high-confidence results make it through.
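The paper describes the ensemble at a high level rather than as reusable code, so here's only a rough sketch of what a rule-based filter could look like (the assessCandidateTest shape, the runTest callback, and the five-run flakiness check are all assumptions):
// Hypothetical rule-based assessor: keep a test only if it is
// (a) stable against the real code and (b) actually kills at least one mutant
async function assessCandidateTest(testCase, realCode, mutants, runTest) {
  // Flakiness check: the test must pass repeatedly against unmutated code
  for (let i = 0; i < 5; i++) {
    const run = await runTest(testCase, realCode);
    if (!run.passed) return { keep: false, reason: 'flaky or asserting the wrong thing' };
  }
  // Usefulness check: it must fail on at least one mutant
  for (const mutant of mutants) {
    const run = await runTest(testCase, mutant);
    if (!run.passed) return { keep: true, reason: 'kills a mutant' };
  }
  return { keep: false, reason: 'catches nothing new' };
}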
Step 6: Engineer Gets Notified (Then the Test is Deleted)
If a test catches a real issue, the engineer gets notified with the specific bug and evidence. Once the engineer acknowledges or fixes the issue, the test is deleted. It served its purpose. No maintenance. No flaky test debt.
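In spirit, the report-then-discard step might boil down to something like this sketch (the notifyEngineer call and the report fields are hypothetical; Meta's actual tooling is internal):
// Hypothetical final step: surface the evidence, then drop the test
async function reportAndDiscard(finding, notifyEngineer) {
  await notifyEngineer({
    diff: finding.diffId,
    suspectedBug: finding.description,  // e.g. 'day-30 orders are rejected'
    failingTest: finding.testSource,    // shown as evidence, never checked in
  });
  // Nothing lands in the permanent suite, so there is no ongoing
  // maintenance or flakiness cost for this test
}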
The Results
These aren't theoretical numbers. JiTTests was presented at FSE 2025 (Foundations of Software Engineering) in Trondheim, Norway, with peer-reviewed results across 7 Meta platforms.
| Metric | Result |
|---|---|
| Total tests analyzed | 22,126 |
| Bugs caught vs. traditional hardening | 4x more |
| Effectiveness vs. coincidental failures | 20x more |
| Human review reduction | 70% |
| Mutant kill rate (JiTTests/ACH) | 15% |
| Mutant kill rate (TestGen-LLM, coverage-only) | 2.4% |
| Tests that also raised code coverage | 51% |
The 15% vs 2.4% Difference
This is worth highlighting. Meta has another system, TestGen-LLM, that generates tests purely to increase code coverage. That system kills 2.4% of mutants. JiTTests, which targets bugs specifically, kills 15%.
The difference is intent. Writing tests to increase a coverage number is fundamentally different from writing tests to find bugs. JiTTests does the latter, and its mutant kill rate is roughly six times higher.
The Counterintuitive Finding
51% of the tests JiTTests generated also happened to increase code coverage, even though that wasn't their goal. By targeting real bugs instead of coverage metrics, you get meaningful coverage as a side effect.
Code Example: Old Way vs. New Way
Here's a simplified comparison to illustrate the difference.
Traditional Approach
// Developer writes a function
function calculateDiscount(price, membership) {
if (membership === 'premium') {
return price * 0.8;
}
return price;
}
// Traditional test (verifies known behavior)
test('premium members get 20% discount', () => {
expect(calculateDiscount(100, 'premium')).toBe(80);
});
test('non-members pay full price', () => {
expect(calculateDiscount(100, 'basic')).toBe(100);
});
These tests verify expected behavior. But what about edge cases like null membership, negative prices, or the boundary between membership tiers?
JiTTests Approach
The system would:
- Read the code change and understand it handles pricing with membership tiers
- Create mutants like removing the if check, changing 0.8 to 0.9, or swapping the equality operator
- Generate targeted tests that catch each mutation:
// AI-generated test targeting a specific mutant
test('null membership should not receive discount', () => {
expect(calculateDiscount(100, null)).toBe(100);
});
// AI-generated test targeting boundary behavior
test('discount should be exactly 20%, not more', () => {
const result = calculateDiscount(100, 'premium');
expect(result).toBe(80);
expect(result).not.toBe(90); // catches 0.9 mutant
});
Then delete the tests after flagging any issues to the engineer.
Five Problems Solved in One Stroke
JiTTests addresses multiple testing challenges simultaneously:
- Test debt — Tests are ephemeral. No maintenance burden.
- Flaky tests — Assessors filter out unreliable tests before engineers see them.
- Coverage theater — Targets bugs, not coverage metrics.
- Review fatigue — 70% less human review needed.
- Bug detection speed — Catches issues at code review time, not in production.
Should You Stop Writing Tests?
No. JiTTests complements traditional testing. The permanent test suite still validates core behavior and prevents regressions. JiTTests adds a targeted bug-hunting layer on top of that.
The key takeaway isn't "tests are useless." It's that the purpose of a test matters more than the number of tests. A test designed to find a specific bug is more valuable than a test designed to increase a coverage percentage.
For most teams, the practical lesson is this: when reviewing a code change, ask "what bugs could exist here?" rather than "do we have enough tests?" That's the mindset shift JiTTests embodies.
Resources
- Research Paper: JiTTests on arXiv
- Meta Engineering Blog: Revolutionizing Software Testing with LLM-Powered Bug Catchers
Related Articles
- Claude Code Tutorial for Beginners: Complete Guide (2026) — If you're interested in AI developer tools, this guide covers everything you need to get started with Claude Code.
- Whisper AI Tutorial: Free Offline Transcription for Developers — Another AI tool that's changing developer workflows, this time for audio transcription.
- YouTube Shorts Generator with Claude Code and Remotion — See how AI coding assistants can automate creative workflows beyond testing.
Published on AyyazTech. Subscribe on YouTube for weekly deep dives on AI, automation, and software engineering.