deepEval Quality Gates: How Spectr Prevents Bad Tests from Shipping
A behind-the-scenes look at how Spectr's eval pipeline scores AI-generated tests across five metrics before they ever reach your CI pipeline.
Every test Spectr generates is scored on five metrics before it ships. If the composite score falls below the threshold, the test is flagged, not silently merged into your suite.
The Five Metrics
Faithfulness, Coverage, Correctness, AssertionDensity, and AntiPattern avoidance each capture a different dimension of test quality (a scoring sketch follows the list).
- Faithfulness — does the test actually test what the user story describes?
- Coverage — what proportion of the described behaviour has at least one assertion?
- Correctness — are the assertions logically sound given the code under test?
- AssertionDensity — is there enough verification per line of test code?
- AntiPattern — does the test avoid magic sleeps, hardcoded selectors, and empty catch blocks?
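For intuition, here is a minimal sketch of how per-metric scores could be combined into a composite. The metric keys, the 0–100 scale, and the equal weights are illustrative assumptions, not Spectr's actual implementation.

```python
# Hypothetical composite scoring: the weights, metric keys, and 0-100
# scale are assumptions for illustration, not Spectr's real pipeline.
WEIGHTS = {
    "faithfulness": 0.2,
    "coverage": 0.2,
    "correctness": 0.2,
    "assertion_density": 0.2,
    "anti_pattern": 0.2,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-metric scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Example: a test that is faithful but assertion-light.
print(composite_score({
    "faithfulness": 90,
    "coverage": 70,
    "correctness": 85,
    "assertion_density": 40,
    "anti_pattern": 80,
}))  # 73.0 -> lands in the WARN band described below
```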
Gate Levels
PASS (≥75) ships immediately. WARN (55–74) ships with a developer notification and is flagged in the PR comment. FAIL (<55) blocks the release gate until the test is revised or manually overridden.
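The gate logic itself is simple to express. Here is a sketch using the thresholds above; the function and enum names are illustrative, not taken from Spectr's codebase.

```python
from enum import Enum

class Gate(Enum):
    PASS = "pass"   # >= 75: ships immediately
    WARN = "warn"   # 55-74: ships with a notification, flagged in the PR
    FAIL = "fail"   # < 55: blocks the release gate

def gate_for(score: float) -> Gate:
    """Map a composite score to the gate level defined above."""
    if score >= 75:
        return Gate.PASS
    if score >= 55:
        return Gate.WARN
    return Gate.FAIL
```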
Why This Matters
AI-generated tests can be syntactically correct and still provide zero quality signal. A test that asserts `true === true` will pass every time and protect nothing. The eval pipeline exists specifically to catch this class of silent failure.
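As a concrete illustration of what an AntiPattern-style check might look for, here is a toy tautology detector. The patterns and the function name are assumptions, and a production check would more likely operate on a parsed AST than on regexes.

```python
import re

# Assertions that can never fail, so they verify nothing.
TAUTOLOGY_PATTERNS = [
    re.compile(r"expect\(\s*true\s*\)\.toBe\(\s*true\s*\)"),  # Jest-style
    re.compile(r"\btrue\s*===\s*true\b"),                     # bare comparison
]

def has_tautological_assertion(test_source: str) -> bool:
    """Return True if the test contains an always-passing assertion."""
    return any(p.search(test_source) for p in TAUTOLOGY_PATTERNS)
```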