deepEval Quality Gates: How Spectr Prevents Bad Tests from Shipping
A behind-the-scenes look at how Spectr's eval pipeline scores AI-generated tests across five metrics before they ever reach your CI pipeline.
Every test Spectr generates is scored on five metrics before it ships. If the composite score falls below the threshold, the test is flagged, not silently merged into your suite.
The Five Metrics
Faithfulness, Coverage, Correctness, AssertionDensity, and AntiPattern avoidance each capture a different dimension of test quality (a scoring sketch follows the list).
- Faithfulness — does the test actually test what the user story describes?
- Coverage — what proportion of the described behaviour has at least one assertion?
- Correctness — are the assertions logically sound given the code under test?
- AssertionDensity — is there enough verification per line of test code?
- AntiPattern — does the test avoid magic sleeps, hardcoded selectors, and empty catch blocks?
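For intuition, here is a minimal sketch of how per-metric scores could be combined into a composite. The metric keys, the 0–100 scale, and the equal weights are illustrative assumptions, not Spectr's actual implementation.

```python
# Hypothetical composite scoring: the weights, metric keys, and 0-100
# scale are assumptions for illustration, not Spectr's real pipeline.
WEIGHTS = {
    "faithfulness": 0.2,
    "coverage": 0.2,
    "correctness": 0.2,
    "assertion_density": 0.2,
    "anti_pattern": 0.2,
}

def composite_score(scores: dict[str, float]) -> float:
    """Weighted average of per-metric scores, each on a 0-100 scale."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)

# Example: a test that is faithful but assertion-light.
print(composite_score({
    "faithfulness": 90,
    "coverage": 70,
    "correctness": 85,
    "assertion_density": 40,
    "anti_pattern": 80,
}))  # 73.0 -> lands in the WARN band described below
```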
Gate Levels
PASS (≥75) ships immediately. WARN (55–74) ships with a developer notification and is flagged in the PR comment. FAIL (<55) blocks the release gate until the test is revised or manually overridden.
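The gate logic itself is simple to express. Here is a sketch using the thresholds above; the function and enum names are illustrative, not taken from Spectr's codebase.

```python
from enum import Enum

class Gate(Enum):
    PASS = "pass"   # >= 75: ships immediately
    WARN = "warn"   # 55-74: ships with a notification, flagged in the PR
    FAIL = "fail"   # < 55: blocks the release gate

def gate_for(score: float) -> Gate:
    """Map a composite score to the gate level defined above."""
    if score >= 75:
        return Gate.PASS
    if score >= 55:
        return Gate.WARN
    return Gate.FAIL
```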
Why This Matters
AI-generated tests can be syntactically correct and still provide zero quality signal. A test that asserts `true === true` will pass every time and protect nothing. The eval pipeline exists specifically to catch this class of silent failure.
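As a concrete illustration of what an AntiPattern-style check might look for, here is a toy tautology detector. The patterns and the function name are assumptions, and a production check would more likely operate on a parsed AST than on regexes.

```python
import re

# Assertions that can never fail, so they verify nothing.
TAUTOLOGY_PATTERNS = [
    re.compile(r"expect\(\s*true\s*\)\.toBe\(\s*true\s*\)"),  # Jest-style
    re.compile(r"\btrue\s*===\s*true\b"),                     # bare comparison
]

def has_tautological_assertion(test_source: str) -> bool:
    """Return True if the test contains an always-passing assertion."""
    return any(p.search(test_source) for p in TAUTOLOGY_PATTERNS)
```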