
Meta's JiTTests: How AI Writes 22,000 Tests, Finds 4x More Bugs, Then Deletes Them All

By Ayyaz Zafar

Meta built an AI system that writes tests, catches bugs humans miss, and then throws away every single test it creates. On purpose.

It's called JiTTests (Just-in-Time Tests), and the peer-reviewed results speak for themselves: 4x more bugs caught than traditional testing, with a 70% reduction in human review load.

Here's how it works, why it works, and what it means for the future of software testing.

The Problem with Traditional Testing

Most codebases have thousands of tests. And most of those tests are doing one thing: making sure the code does what it already does. They verify existing behavior. They confirm things work the way they're supposed to.

What they don't do well is find new bugs in new code.

When a developer pushes a code change, the test suite runs. If everything passes, the change ships. But passing all existing tests doesn't mean the new code is correct. It means the new code didn't break anything that was already being tested.

There's a gap between "tests passed" and "code is actually correct." That gap is where bugs live.
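
Here's a hypothetical illustration (the function and the bug are invented for this post, not taken from Meta's paper): a change that stays green on the existing suite while still introducing a regression.

// Original rule: refunds allowed up to and including day 30.
// The new diff adds a digital-goods exclusion but also changes <= to <.
function isEligibleForRefund(order) {
  if (order.isDigital) {
    return false;                      // new rule added in this diff
  }
  return order.daysSincePurchase < 30; // bug: was <= 30 before the change
}

// Existing test: still passes, because it only checks day 10.
test('orders under 30 days are refundable', () => {
  expect(isEligibleForRefund({ daysSincePurchase: 10 })).toBe(true);
});

// Nothing exercises day 30 or digital orders, so the regression ships unnoticed.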

The Test Debt Problem

At Meta's scale, this problem compounds. With thousands of engineers shipping code daily across platforms like Facebook, Instagram, WhatsApp, and Messenger, the testing infrastructure becomes massive. Tests need maintenance. Flaky tests need fixing. Test suites slow down CI pipelines.

Adding more tests to catch more edge cases sounds logical, but it creates test debt — a growing maintenance burden that eventually works against you.

Meta's Approach: Just-in-Time Testing

Instead of writing permanent tests that stick around forever, Meta built a system that generates tests specifically to find bugs in new code, and then throws them away once they've done their job.

The system is called JiTTests, and it uses a 6-step pipeline:

Step 1: New Code Lands

A developer submits a code change (a "diff" in Meta's internal workflow). The system picks it up automatically.

Step 2: LLM Infers Intent

A large language model reads the code change and infers what the developer was trying to do. This is the key insight: instead of blindly testing random behaviors, the AI understands the intent behind the change.
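
Meta hasn't published its internal prompts, so the following is only a rough sketch of the idea; inferIntent, llmClient, and the prompt wording are all invented for illustration.

// Hypothetical sketch: ask a model to summarize what the diff is supposed to do.
// llmClient.complete() is a stand-in for whatever model API is actually used.
async function inferIntent(diff, llmClient) {
  const prompt = [
    'You are reviewing a code change.',
    'In one or two sentences, state what the author intends this change to do,',
    'and list the behaviors that must hold for the change to be correct.',
    '',
    'Diff:',
    diff,
  ].join('\n');

  const response = await llmClient.complete(prompt);
  return response.text; // e.g. "Adds a premium-tier discount; non-premium prices must not change."
}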

Step 3: Create Mutants

The system creates "mutants" — slightly broken versions of the code change. These mutations simulate the kinds of bugs that might exist. For example, swapping a > for a >=, removing a null check, or changing a return value.
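
Here's a hand-written illustration of what those mutants look like (JiTTests derives them automatically from the diff; this function is invented for the example):

// Original function from a hypothetical diff:
function canWithdraw(balance, amount) {
  if (amount == null) return false;   // guard against a missing amount
  return balance >= amount;           // withdrawing the exact balance is allowed
}

// Mutant A: boundary operator weakened (>= becomes >)
function canWithdrawMutantA(balance, amount) {
  if (amount == null) return false;
  return balance > amount;
}

// Mutant B: null check removed
function canWithdrawMutantB(balance, amount) {
  return balance >= amount;
}

// Mutant C: return value changed on the guard path
function canWithdrawMutantC(balance, amount) {
  if (amount == null) return true;
  return balance >= amount;
}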

Step 4: Generate Catching Tests

The LLM writes tests specifically designed to catch these mutants. Each test targets a specific potential bug in the new code, not the existing codebase.
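
Continuing the hypothetical canWithdraw sketch from Step 3, the catching tests would pass against the real change but fail against at least one mutant:

// Kills Mutant A: the exact-balance boundary distinguishes >= from >.
test('withdrawing the exact balance is allowed', () => {
  expect(canWithdraw(100, 100)).toBe(true);
});

// Kills Mutants B and C: a missing amount must be rejected.
test('a null amount is rejected', () => {
  expect(canWithdraw(100, null)).toBe(false);
});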

Step 5: Assessors Filter Results

An ensemble of rule-based and LLM-based assessors evaluates each test. This step filters out flaky tests, false positives, and tests that would waste an engineer's time. Only high-confidence results make it through.
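
Meta describes the assessors as an ensemble, and the exact rules aren't public; a minimal rule-based check might look like the sketch below, where runTest is a placeholder callback that returns true if the candidate test passes against a given code version.

// Hypothetical rule-based assessor: keep a candidate test only if it is stable
// on the unmutated change and actually kills at least one mutant.
function assessCandidateTest(runTest, mutants, repetitions = 5) {
  // 1. Reject flaky tests and false positives: the test must pass
  //    consistently against the real, unmutated code.
  for (let i = 0; i < repetitions; i++) {
    if (!runTest('original')) {
      return { keep: false, reason: 'flaky or fails on the unchanged code' };
    }
  }

  // 2. Reject tests that catch nothing: at least one mutant must make it fail.
  const killed = mutants.filter((mutant) => !runTest(mutant));
  if (killed.length === 0) {
    return { keep: false, reason: 'does not kill any mutant' };
  }

  return { keep: true, killedMutants: killed };
}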

Step 6: Engineer Gets Notified (Then the Test is Deleted)

If a test catches a real issue, the engineer gets notified with the specific bug and evidence. Once the engineer acknowledges or fixes the issue, the test is deleted. It served its purpose. No maintenance. No flaky test debt.

The Results

These aren't theoretical numbers. JiTTests was presented at FSE 2025 (Foundations of Software Engineering) in Trondheim, Norway, with peer-reviewed results across 7 Meta platforms.

Metric                                        | Result
--------------------------------------------- | --------
Total tests analyzed                          | 22,126
Bugs caught vs. traditional hardening         | 4x more
Effectiveness vs. coincidental failures       | 20x more
Human review reduction                        | 70%
Mutant kill rate (JiTTests/ACH)               | 15%
Mutant kill rate (TestGen-LLM, coverage-only) | 2.4%
Tests that also raised code coverage          | 51%

The 15% vs 2.4% Difference

This is worth highlighting. Meta has another system, TestGen-LLM, that generates tests purely to increase code coverage. That system kills 2.4% of mutants. JiTTests, which targets bugs specifically, kills 15%.

The difference is intent. Writing tests to raise a coverage number is fundamentally different from writing tests to find bugs. JiTTests does the latter, and it kills over 6x as many mutants.

The Counterintuitive Finding

51% of the tests JiTTests generated also increased code coverage, even though that was never their goal. Target real bugs instead of coverage metrics, and meaningful coverage shows up as a side effect.

Code Example: Old Way vs. New Way

Here's a simplified comparison to illustrate the difference.

Traditional Approach

// Developer writes a function
function calculateDiscount(price, membership) {
  if (membership === 'premium') {
    return price * 0.8;
  }
  return price;
}

// Traditional test (verifies known behavior)
test('premium members get 20% discount', () => {
  expect(calculateDiscount(100, 'premium')).toBe(80);
});

test('non-members pay full price', () => {
  expect(calculateDiscount(100, 'basic')).toBe(100);
});

These tests verify expected behavior. But what about edge cases like null membership, negative prices, or the boundary between membership tiers?

JiTTests Approach

The system would:

  1. Read the code change and understand it handles pricing with membership tiers
  2. Create mutants like removing the if check, changing 0.8 to 0.9, or swapping the equality operator
  3. Generate targeted tests that catch each mutation:
// AI-generated test targeting a specific mutant
test('null membership should not receive discount', () => {
  expect(calculateDiscount(100, null)).toBe(100);
});

// AI-generated test targeting boundary behavior
test('discount should be exactly 20%, not more', () => {
  const result = calculateDiscount(100, 'premium');
  expect(result).toBe(80);
  expect(result).not.toBe(90); // catches 0.9 mutant
});

Finally, the system would flag any issues to the engineer and delete the tests once they'd done their job.

Five Problems Solved in One Stroke

JiTTests addresses multiple testing challenges simultaneously:

  1. Test debt — Tests are ephemeral. No maintenance burden.
  2. Flaky tests — Assessors filter out unreliable tests before engineers see them.
  3. Coverage theater — Targets bugs, not coverage metrics.
  4. Review fatigue — 70% less human review needed.
  5. Bug detection speed — Catches issues at code review time, not in production.

Should You Stop Writing Tests?

No. JiTTests complements traditional testing. The permanent test suite still validates core behavior and prevents regressions. JiTTests adds a targeted bug-hunting layer on top of that.

The key takeaway isn't "tests are useless." It's that the purpose of a test matters more than the number of tests. A test designed to find a specific bug is more valuable than a test designed to increase a coverage percentage.

For most teams, the practical lesson is this: when reviewing a code change, ask "what bugs could exist here?" rather than "do we have enough tests?" That's the mindset shift JiTTests embodies.

Published on AyyazTech. Subscribe on YouTube for weekly deep dives on AI, automation, and software engineering.
