
Meta's JiTTests: How AI Writes 22,000 Tests, Finds 4x More Bugs, Then Deletes Them All

By Ayyaz Zafar

Meta built an AI system that writes tests, catches bugs humans miss, and then throws away every single test it creates. On purpose.

It's called JiTTests (Just-in-Time Tests), and the peer-reviewed results speak for themselves: 4x more bugs caught than traditional testing, with a 70% reduction in human review load.

Here's how it works, why it works, and what it means for the future of software testing.

The Problem with Traditional Testing

Most codebases have thousands of tests. And most of those tests are doing one thing: making sure the code does what it already does. They verify existing behavior. They confirm things work the way they're supposed to.

What they don't do well is find new bugs in new code.

When a developer pushes a code change, the test suite runs. If everything passes, the change ships. But passing all existing tests doesn't mean the new code is correct. It means the new code didn't break anything that was already being tested.

There's a gap between "tests passed" and "code is actually correct." That gap is where bugs live.
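
Here's a hypothetical illustration (the function and the bug are invented for this post, not taken from Meta's paper): a change that stays green on the existing suite while still introducing a regression.

// Original rule: refunds allowed up to and including day 30.
// The new diff adds a digital-goods exclusion but also changes <= to <.
function isEligibleForRefund(order) {
  if (order.isDigital) {
    return false;                      // new rule added in this diff
  }
  return order.daysSincePurchase < 30; // bug: was <= 30 before the change
}

// Existing test: still passes, because it only checks day 10.
test('orders under 30 days are refundable', () => {
  expect(isEligibleForRefund({ daysSincePurchase: 10 })).toBe(true);
});

// Nothing exercises day 30 or digital orders, so the regression ships unnoticed.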

The Test Debt Problem

At Meta's scale, this problem compounds. With thousands of engineers shipping code daily across platforms like Facebook, Instagram, WhatsApp, and Messenger, the testing infrastructure becomes massive. Tests need maintenance. Flaky tests need fixing. Test suites slow down CI pipelines.

Adding more tests to catch more edge cases sounds logical, but it creates test debt — a growing maintenance burden that eventually works against you.

Meta's Approach: Just-in-Time Testing

Instead of writing permanent tests that stick around forever, Meta built a system that generates tests specifically to find bugs in new code, and then throws them away once they've done their job.

The system is called JiTTests, and it uses a 6-step pipeline:

Step 1: New Code Lands

A developer submits a code change (a "diff" in Meta's internal workflow). The system picks it up automatically.

Step 2: LLM Infers Intent

A large language model reads the code change and infers what the developer was trying to do. This is the key insight: instead of blindly testing random behaviors, the AI understands the intent behind the change.
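
Meta hasn't published its internal prompts, so the following is only a rough sketch of the idea; inferIntent, llmClient, and the prompt wording are all invented for illustration.

// Hypothetical sketch: ask a model to summarize what the diff is supposed to do.
// llmClient.complete() is a stand-in for whatever model API is actually used.
async function inferIntent(diff, llmClient) {
  const prompt = [
    'You are reviewing a code change.',
    'In one or two sentences, state what the author intends this change to do,',
    'and list the behaviors that must hold for the change to be correct.',
    '',
    'Diff:',
    diff,
  ].join('\n');

  const response = await llmClient.complete(prompt);
  return response.text; // e.g. "Adds a premium-tier discount; non-premium prices must not change."
}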

Step 3: Create Mutants

The system creates "mutants" — slightly broken versions of the code change. These mutations simulate the kinds of bugs that might exist. For example, swapping a > for a >=, removing a null check, or changing a return value.
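
Here's a hand-written illustration of what those mutants look like (JiTTests derives them automatically from the diff; this function is invented for the example):

// Original function from a hypothetical diff:
function canWithdraw(balance, amount) {
  if (amount == null) return false;   // guard against a missing amount
  return balance >= amount;           // withdrawing the exact balance is allowed
}

// Mutant A: boundary operator weakened (>= becomes >)
function canWithdrawMutantA(balance, amount) {
  if (amount == null) return false;
  return balance > amount;
}

// Mutant B: null check removed
function canWithdrawMutantB(balance, amount) {
  return balance >= amount;
}

// Mutant C: return value changed on the guard path
function canWithdrawMutantC(balance, amount) {
  if (amount == null) return true;
  return balance >= amount;
}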

Step 4: Generate Catching Tests

The LLM writes tests specifically designed to catch these mutants. Each test targets a specific potential bug in the new code, not the existing codebase.
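
Continuing the hypothetical canWithdraw sketch from Step 3, the catching tests would pass against the real change but fail against at least one mutant:

// Kills Mutant A: the exact-balance boundary distinguishes >= from >.
test('withdrawing the exact balance is allowed', () => {
  expect(canWithdraw(100, 100)).toBe(true);
});

// Kills Mutants B and C: a missing amount must be rejected.
test('a null amount is rejected', () => {
  expect(canWithdraw(100, null)).toBe(false);
});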

Step 5: Assessors Filter Results

An ensemble of rule-based and LLM-based assessors evaluates each test. This step filters out flaky tests, false positives, and tests that would waste an engineer's time. Only high-confidence results make it through.
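
Meta describes the assessors as an ensemble, and the exact rules aren't public; a minimal rule-based check might look like the sketch below, where runTest is a placeholder callback that returns true if the candidate test passes against a given code version.

// Hypothetical rule-based assessor: keep a candidate test only if it is stable
// on the unmutated change and actually kills at least one mutant.
function assessCandidateTest(runTest, mutants, repetitions = 5) {
  // 1. Reject flaky tests and false positives: the test must pass
  //    consistently against the real, unmutated code.
  for (let i = 0; i < repetitions; i++) {
    if (!runTest('original')) {
      return { keep: false, reason: 'flaky or fails on the unchanged code' };
    }
  }

  // 2. Reject tests that catch nothing: at least one mutant must make it fail.
  const killed = mutants.filter((mutant) => !runTest(mutant));
  if (killed.length === 0) {
    return { keep: false, reason: 'does not kill any mutant' };
  }

  return { keep: true, killedMutants: killed };
}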

Step 6: Engineer Gets Notified (Then the Test is Deleted)

If a test catches a real issue, the engineer gets notified with the specific bug and evidence. Once the engineer acknowledges or fixes the issue, the test is deleted. It served its purpose. No maintenance. No flaky test debt.

The Results

These aren't theoretical numbers. JiTTests was presented at FSE 2025 (Foundations of Software Engineering) in Trondheim, Norway, with peer-reviewed results across 7 Meta platforms.

Metric                                        | Result
--------------------------------------------- | --------
Total tests analyzed                          | 22,126
Bugs caught vs. traditional hardening         | 4x more
Effectiveness vs. coincidental failures       | 20x more
Human review reduction                        | 70%
Mutant kill rate (JiTTests/ACH)               | 15%
Mutant kill rate (TestGen-LLM, coverage-only) | 2.4%
Tests that also raised code coverage          | 51%

The 15% vs 2.4% Difference

This is worth highlighting. Meta has another system, TestGen-LLM, that generates tests purely to increase code coverage. That system kills 2.4% of mutants. JiTTests, which targets bugs specifically, kills 15%.

The difference is intent. Writing tests to raise a coverage number is fundamentally different from writing tests to find bugs. JiTTests does the latter, and it kills over 6x as many mutants.

The Counterintuitive Finding

51% of the tests JiTTests generated also increased code coverage, even though that was never their goal. Target real bugs instead of coverage metrics, and meaningful coverage shows up as a side effect.

Code Example: Old Way vs. New Way

Here's a simplified comparison to illustrate the difference.

Traditional Approach

// Developer writes a function
function calculateDiscount(price, membership) {
  if (membership === 'premium') {
    return price * 0.8;
  }
  return price;
}

// Traditional test (verifies known behavior)
test('premium members get 20% discount', () => {
  expect(calculateDiscount(100, 'premium')).toBe(80);
});

test('non-members pay full price', () => {
  expect(calculateDiscount(100, 'basic')).toBe(100);
});

These tests verify expected behavior. But what about edge cases like null membership, negative prices, or the boundary between membership tiers?

JiTTests Approach

The system would:

  1. Read the code change and understand it handles pricing with membership tiers
  2. Create mutants like removing the if check, changing 0.8 to 0.9, or swapping the equality operator
  3. Generate targeted tests that catch each mutation:
// AI-generated test targeting a specific mutant
test('null membership should not receive discount', () => {
  expect(calculateDiscount(100, null)).toBe(100);
});

// AI-generated test targeting boundary behavior
test('discount should be exactly 20%, not more', () => {
  const result = calculateDiscount(100, 'premium');
  expect(result).toBe(80);
  expect(result).not.toBe(90); // catches 0.9 mutant
});

Finally, the system would flag any issues to the engineer and delete the tests once they'd done their job.

Five Problems Solved in One Stroke

JiTTests addresses multiple testing challenges simultaneously:

  1. Test debt — Tests are ephemeral. No maintenance burden.
  2. Flaky tests — Assessors filter out unreliable tests before engineers see them.
  3. Coverage theater — Targets bugs, not coverage metrics.
  4. Review fatigue — 70% less human review needed.
  5. Bug detection speed — Catches issues at code review time, not in production.

Should You Stop Writing Tests?

No. JiTTests complements traditional testing. The permanent test suite still validates core behavior and prevents regressions. JiTTests adds a targeted bug-hunting layer on top of that.

The key takeaway isn't "tests are useless." It's that the purpose of a test matters more than the number of tests. A test designed to find a specific bug is more valuable than a test designed to increase a coverage percentage.

For most teams, the practical lesson is this: when reviewing a code change, ask "what bugs could exist here?" rather than "do we have enough tests?" That's the mindset shift JiTTests embodies.

Published on AyyazTech. Subscribe on YouTube for weekly deep dives on AI, automation, and software engineering.
