How to Review AI-Generated Test Steps Before They Reach CI

AI-generated test steps can save time, but they also introduce a new failure mode: automation that looks reasonable at a glance and breaks in CI for reasons that were never reviewed. The problem is not that AI writes tests; it is that teams often let generated steps move from prompt to pipeline without a controlled review path. To review AI-generated test steps before they reach CI, you need a workflow that treats them like code, with explicit ownership, validation rules, and merge gates.

This matters for SDETs, QA managers, frontend engineers, and platform teams because the cost of a weak test is not just a flaky build. It is a false sense of coverage, unstable pull requests, and time spent debugging a test that was never sound to begin with. The goal is not to reject AI-generated tests. The goal is to make them reviewable, deterministic where possible, and safe to promote into CI.

What counts as an AI-generated test step

An AI-generated test step is any test action, assertion, or setup instruction produced by a model, agent, or low-code assistant rather than typed directly by a reviewer. That could be:

A Playwright or Cypress step drafted from a prompt
A low-code workflow created by an agentic platform
An assertion suggested from a page description
A locator strategy recommended by AI
A test flow assembled from recorded user journeys

The risk profile changes depending on where AI sits in the toolchain. Code-first teams may have generated TypeScript snippets that still need normal code review. Low-code teams may have platform-native steps that need approval in a visual editor. Either way, the same review principles apply.

If a step cannot be explained in terms a reviewer understands, it should not be allowed to pass into CI without more scrutiny.

The basic principle: separate generation from approval

The simplest mistake is letting generation and approval happen in the same action. If a model produces a step and the pipeline immediately consumes it, you have no chance to validate intent, selector stability, or assertion quality.

A safer model is:

Generate test steps in a draft state.
Review them against explicit quality rules.
Validate them locally or in a staging runner.
Promote only approved steps into the CI branch or test registry.
Fail the build if the approved artifact changes without review.
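
The last rule in that list, failing the build when an approved artifact changes without review, can be sketched as a fingerprint comparison. This is a minimal sketch, assuming approved steps are stored as text and a fingerprint was recorded at approval time. `ApprovalRecord`, `fingerprint`, and `changedSinceApproval` are hypothetical names, and the simple string hash is there only to keep the example dependency-free; a real gate would more likely use a cryptographic hash such as SHA-256.

```typescript
// Hypothetical approval record stored next to each approved test artifact.
interface ApprovalRecord {
  testId: string;
  approvedFingerprint: string; // fingerprint captured when the reviewer approved
}

// Simple non-cryptographic string hash (djb2 variant), used only for illustration.
// Assumption: a production gate would use a cryptographic hash instead.
function fingerprint(content: string): string {
  let hash = 5381;
  for (let i = 0; i < content.length; i++) {
    // Multiply by 33, mix in the character code, keep the result as unsigned 32-bit.
    hash = ((hash * 33) ^ content.charCodeAt(i)) >>> 0;
  }
  return hash.toString(16);
}

// True when the artifact no longer matches what the reviewer approved.
function changedSinceApproval(record: ApprovalRecord, currentContent: string): boolean {
  return fingerprint(currentContent) !== record.approvedFingerprint;
}
```

A CI job could call `changedSinceApproval` for every approved artifact and fail when any result is true, which sends the change back through review instead of letting it run silently.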

This is the same discipline teams use for infrastructure changes, security rules, and production code. The only difference is that test logic often gets treated as disposable, which makes it a soft target for automation mistakes.

A practical review workflow for AI-generated test steps

A workable review process usually has five stages. The exact implementation depends on your stack, but the control points should be recognizable.

1. Generation in a draft workspace

Create AI-generated steps in a draft area that is not connected to CI. This can be a branch, a sandbox project, a separate folder, or a platform workspace with limited permissions. The draft should preserve the original prompt, generated steps, timestamps, and reviewer notes.

You want traceability here. A reviewer should be able to answer:

What prompt produced this test?
Which app version was visible to the model?
Did the model infer behavior from the UI or from product requirements?
Was any human correction already applied?

Without this context, review turns into guesswork.
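
One way to make that context checkable is to store it with the draft and flag whatever is missing before review starts. The shape below is a sketch under assumptions: `DraftRecord`, its field names, and `missingContext` are illustrative and do not come from any particular platform.

```typescript
// Hypothetical draft record; every field name here is an assumption.
interface DraftRecord {
  steps: string[];
  prompt?: string;                // the prompt that produced the test
  appVersion?: string;            // the app version visible to the model
  source?: "ui" | "requirements"; // what the model inferred behavior from
  humanEdited?: boolean;          // whether a human correction was already applied
  createdAt?: string;             // ISO timestamp of generation
  reviewerNotes?: string[];
}

// Returns the names of the context fields a reviewer would have to guess at.
function missingContext(draft: DraftRecord): string[] {
  const missing: string[] = [];
  if (!draft.prompt) missing.push("prompt");
  if (!draft.appVersion) missing.push("appVersion");
  if (!draft.source) missing.push("source");
  if (draft.humanEdited === undefined) missing.push("humanEdited");
  return missing;
}
```

A draft with a non-empty result from `missingContext` could be held back from the review queue until the gaps are filled.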

2. Structural validation before human review

Before a human spends time on the test, run automatic validation rules. These are not pass-or-fail checks for the app under test; they are checks on the test artifact itself.

Good structural rules include:

No hardcoded sleeps unless explicitly allowed
Every locator must use an approved strategy or a documented fallback
Every assertion must be tied to a visible outcome or a business rule
No duplicate steps that repeat the same action without reason
No navigation to production systems from non-production pipelines
No generated step may mutate shared state without cleanup

For code-first tests, these rules can be enforced with linters, static analysis, or custom AST checks. For low-code tests, the equivalent may be schema validation and platform-side constraints.
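
As a rough illustration of such a check, the sketch below scans generated source text for two of the rules above: hardcoded sleeps and navigation to a production host. It is a line-based pattern match, not an AST check, so it will miss anything split across lines. The rule names and the production host `shop.example.com` are assumptions; `waitForTimeout` and numeric `cy.wait` are the Playwright and Cypress fixed-wait calls it looks for.

```typescript
interface Violation {
  line: number; // 1-based line number in the generated source
  rule: string;
  text: string;
}

// Hypothetical rule set; extend it to match your own forbidden patterns.
const FORBIDDEN_PATTERNS: { rule: string; pattern: RegExp }[] = [
  // Fixed waits: page.waitForTimeout(...) or cy.wait(<number>).
  { rule: "no-hard-sleep", pattern: /\bwaitForTimeout\s*\(|\bcy\.wait\s*\(\s*\d/ },
  // Assumed production host; replace with your real one.
  { rule: "no-production-url", pattern: /https?:\/\/shop\.example\.com\b/ },
];

// Returns one violation per matching rule per line.
function checkGeneratedSource(source: string): Violation[] {
  const violations: Violation[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of FORBIDDEN_PATTERNS) {
      if (pattern.test(text)) {
        violations.push({ line: index + 1, rule, text: text.trim() });
      }
    }
  });
  return violations;
}
```

Note that `cy.wait('@createOrder')` is not flagged, because waiting on an aliased request is condition-based rather than time-based.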

3. Human review with a checklist

This is where the actual review happens. A reviewer should inspect the test for correctness, maintainability, and intent. A good checklist is short, but strict.

Ask:

Does the test verify a real user outcome?
Are selectors stable enough for the app’s change rate?
Does the assertion match the business requirement, not just the DOM?
Is the step sequence minimal, or does it contain unnecessary interactions?
Are waits conditional instead of time-based?
Are test data and environment assumptions documented?
Would this still be understandable in six months?

If the answer to any of these is unclear, the test should stay in draft.

4. Controlled execution in a staging or preview environment

After review, run the test in a controlled environment before it is allowed into the main CI path. This is important because some mistakes only appear at runtime, for example:

A locator matches the wrong element
An assertion passes on a loading state instead of the final state
A generated step depends on a tooltip that appears only in one browser
A form field is disabled until client-side hydration completes

Use this phase to capture artifacts, logs, screenshots, and timing information. The goal is not full confidence; it is to detect obvious mismatches before the merge gate.

5. Promotion through CI gating

Only approved and validated artifacts should be merged into the CI path. That means pull request checks, branch protections, or test registry promotion rules should stop unreviewed changes from entering the pipeline.

CI gating can be simple, such as requiring a code owner review for test files. It can also be more specific, such as rejecting any test change that lacks a review marker or approval label.

What to review in AI-generated steps

The key is not to judge whether the test is “good” in the abstract. Judge whether it is safe to run in CI and valuable to keep.

1. Selector quality

Selectors are one of the most common AI failure points. Models may pick brittle CSS paths, text selectors that are likely to change, or overly broad locators.

Prefer selectors that are:

Stable across cosmetic changes
Anchored to semantics, not layout
Unique enough to avoid ambiguity
Visible to the test maintainer

For example, a generated selector like div:nth-child(3) > button is usually suspect. A data attribute such as data-testid="save-profile" is often better if your team maintains that contract.

If your team uses Playwright, this review can include an explicit preference for locators that match your engineering standards. For example:

```typescript
await page.getByTestId('save-profile').click();
await expect(page.getByRole('status')).toHaveText('Saved');
```

The point is not that Playwright is the only valid answer; it is that generated tests should respect the locator strategy the team already trusts.
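
A first-pass selector triage can be automated before a human looks at the test. The heuristic below is a sketch, assuming raw CSS selector strings are available for inspection; `classifySelector`, the verdict names, and the nesting threshold are all illustrative choices, not a standard.

```typescript
type SelectorVerdict = "preferred" | "acceptable" | "suspect";

// Heuristic triage of a raw CSS selector string. Thresholds are assumptions.
function classifySelector(selector: string): SelectorVerdict {
  // Positional selectors tend to break when layout changes.
  if (/:nth-(child|of-type)\(/.test(selector)) return "suspect";
  // More than two child combinators suggests a layout-bound path.
  if (selector.split(">").length > 3) return "suspect";
  // Test IDs and ARIA attributes are anchored to semantics, not layout.
  if (/\[(data-testid|role|aria-label)=/.test(selector)) return "preferred";
  return "acceptable";
}
```

With these rules, `div:nth-child(3) > button` is classified as suspect and `[data-testid="save-profile"]` as preferred. A verdict of "suspect" is a prompt for the reviewer to look closer, not an automatic rejection.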

2. Assertion strength

AI often produces assertions that are too weak or too specific. Both are problems.

Too weak:

“Element exists” without proving the user outcome
“Text contains” when the exact state matters
“Page loaded” when the business flow is unfinished

Too specific:

Exact copy checks for text that changes for localization or A/B experiments
Pixel-based checks for content that varies with rendering differences
Assertions tied to transient loading states

A useful review question is, “What user contract is this assertion proving?” If you cannot answer it, the assertion is probably not ready.

3. Wait strategy

Generated steps often overuse fixed waits because they are easy to write. That makes tests slow and flaky.

Reject steps that use hard sleeps unless the wait is truly about a known external delay, and even then, document why.

Prefer waiting for:

A visible state change
Network completion if your framework supports it
A specific element becoming enabled
An API response that backs the UI transition

For example, in Cypress:

```javascript
cy.intercept('POST', '/api/orders').as('createOrder');
cy.get('[data-testid="submit-order"]').click();
cy.wait('@createOrder');
cy.contains('Order confirmed').should('be.visible');
```

The review should check that the wait aligns with the app’s actual behavior, not with model convenience.

4. Test data assumptions

AI-generated tests often assume data exists, users are logged in, or feature flags are enabled. Reviewers need to confirm those assumptions are encoded in setup steps or fixtures.

Questions to ask:

Does the test create its own data?
Does it rely on shared accounts?
Is the test isolated from previous runs?
Are flags, locale, and permissions explicit?

If not, CI will eventually surface nondeterministic failures that are hard to reproduce.
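
One common way to keep a test isolated from previous runs is to derive its data from a per-run identifier instead of reusing a shared account. The helper below is a minimal sketch: `buildTestUser` and `TestUser` are hypothetical names, and it assumes your CI system exposes some unique run ID. The `example.test` domain is used because the `.test` top-level domain is reserved for testing.

```typescript
interface TestUser {
  email: string;
  displayName: string;
}

// Builds a user that is unique per test and per CI run.
// Assumption: runId comes from the CI system and is unique for each run.
function buildTestUser(runId: string, testName: string): TestUser {
  // Reduce the test name to a lowercase, dash-separated slug.
  const slug = testName
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, "-")
    .replace(/^-+|-+$/g, "");
  return {
    email: slug + "+" + runId + "@example.test",
    displayName: "Test " + slug + " " + runId,
  };
}
```

Because the run ID is part of the address, two runs of the same test never share an account, and leftover data from a failed run is easy to identify and clean up.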

5. Cleanup and side effects

Any generated test that creates data, changes settings, or submits forms should have a cleanup strategy or use disposable environments. Reviewers should look for idempotency and isolation.

If a flow changes a profile setting, can that be reset? If it creates an order, can the test environment absorb that action safely? If not, the test may be acceptable only in a non-destructive environment.

A review checklist you can actually use

A concise checklist works better than a long policy document. Here is a practical one for AI test approval:

The test maps to a real user or business path
The locator strategy matches team standards
All waits are event-driven or condition-based
Assertions prove behavior, not implementation noise
Test data is explicit and isolated
The flow can be understood by a human maintainer
Side effects are safe or cleaned up
The test is tagged for the right pipeline stage
The generated artifact and reviewer approval are traceable

If you want a gate that is easy to automate, require each item to be explicitly marked during review. That creates a structured approval record instead of informal comments buried in chat.
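
A structured approval record can be as simple as a map from checklist item to an explicit answer. In the sketch below, the item keys are shorthand invented for the checklist in this section, and `unresolvedItems` is a hypothetical helper. The important property is that an unmarked item blocks approval just like a failed one.

```typescript
// Shorthand keys for the checklist items; the names are assumptions.
const CHECKLIST_ITEMS: string[] = [
  "maps-to-user-path",
  "locators-match-standards",
  "waits-condition-based",
  "assertions-prove-behavior",
  "test-data-isolated",
  "flow-understandable",
  "side-effects-safe",
  "tagged-for-stage",
  "approval-traceable",
];

// true = confirmed, false = failed, undefined = reviewer never marked it.
type ChecklistAnswers = { [item: string]: boolean | undefined };

// Returns every item that is failed or unmarked; empty means approvable.
function unresolvedItems(answers: ChecklistAnswers): string[] {
  return CHECKLIST_ITEMS.filter((item) => answers[item] !== true);
}
```

An approval gate would then accept the test only when `unresolvedItems` returns an empty array.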

How to enforce CI gating without slowing the team down

The best CI gating is strict enough to protect quality and lightweight enough that engineers do not route around it.

Branch protection and code owners

For code-first repositories, put AI-generated test files behind normal branch protections. If the file path is tests/generated/, require a reviewer from QA or the SDET group before merge. This is basic, but effective.

Metadata labels or approval markers

If the tests are generated in a platform, create a status like draft, reviewed, or approved-for-ci. CI should refuse to consume anything below approved-for-ci.
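
The status ladder can be expressed as an ordered type so that "anything below approved-for-ci" has a precise meaning. This is a sketch under assumptions: the three status names come from the paragraph above, while `ciMayConsume`, `canTransition`, and the rule that any artifact can be sent back to draft are illustrative.

```typescript
type ReviewStatus = "draft" | "reviewed" | "approved-for-ci";

// Numeric rank makes "below approved-for-ci" a simple comparison.
const STATUS_RANK: Record<ReviewStatus, number> = {
  "draft": 0,
  "reviewed": 1,
  "approved-for-ci": 2,
};

// CI refuses anything ranked below approved-for-ci.
function ciMayConsume(status: ReviewStatus): boolean {
  return STATUS_RANK[status] >= STATUS_RANK["approved-for-ci"];
}

// Forward moves go one step at a time; any artifact may return to draft.
function canTransition(from: ReviewStatus, to: ReviewStatus): boolean {
  if (to === "draft") return true;
  return STATUS_RANK[to] === STATUS_RANK[from] + 1;
}
```

The one-step rule means a draft cannot jump straight to approved-for-ci, which keeps generation and approval as separate actions.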

Pre-merge validation jobs

Run a dedicated validation job before merge. It should verify:

Steps conform to schema
No forbidden commands exist
Assertions match allowed patterns
Review metadata is present
Test artifacts are tied to the current branch or change request

Here is a small GitHub Actions example for structural checks on generated test files:

```yaml
name: validate-generated-tests
on:
  pull_request:
    paths:
      - 'tests/generated/**'
jobs:
  lint-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint:generated-tests
      - run: npm run validate:test-metadata
```

Promote only reviewed artifacts

Do not let the CI job discover and execute draft tests by default. Instead, have the pipeline read only approved manifests or tagged suites. That way, a generated draft is not accidentally included because someone named a file in the wrong directory.
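
The manifest approach can be sketched as an allowlist filter: the pipeline intersects whatever files it discovers with the entries marked approved, rather than running everything it finds. `ManifestEntry` and `selectRunnable` are hypothetical names, and the manifest format is an assumption.

```typescript
// Hypothetical manifest entry maintained by the review process.
interface ManifestEntry {
  file: string;
  status: string; // for example "draft", "reviewed", or "approved-for-ci"
}

// Returns only discovered files that the manifest marks as approved.
function selectRunnable(manifest: ManifestEntry[], discoveredFiles: string[]): string[] {
  const approved: { [file: string]: boolean } = {};
  for (const entry of manifest) {
    if (entry.status === "approved-for-ci") approved[entry.file] = true;
  }
  // Files missing from the manifest are skipped, not run by default.
  return discoveredFiles.filter((file) => approved[file] === true);
}
```

The design choice is that absence from the manifest means "do not run", so a misplaced draft file is ignored instead of silently joining the suite.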

Where human review should be strict, and where it can be lenient

Not every step deserves the same level of scrutiny. A good workflow applies stricter review to critical paths.

Be strict for:

Authentication
Payments
Account recovery
Destructive actions
Compliance-sensitive flows
Assertions on legal or contractual copy

Be more flexible for:

Cosmetic checks in non-critical areas
Exploratory test scaffolding
Temporary tests for new features under active development

This is where teams often overcorrect. They either review everything with the same intensity, which slows delivery, or they review almost nothing, which defeats the point. Severity-based review is the better model.
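
Severity-based review can be encoded as a small lookup from the areas a test touches to the number of approvals it needs. The sketch below assumes tests are tagged with area names; the tag strings, `requiredReviewers`, and the choice of two reviewers for strict areas are illustrative.

```typescript
// Area tags that warrant strict review; the tag names are assumptions.
const STRICT_AREAS: string[] = [
  "authentication",
  "payments",
  "account-recovery",
  "destructive-actions",
  "compliance",
  "legal-copy",
];

// Returns how many human approvals a test needs based on the areas it touches.
function requiredReviewers(areas: string[]): number {
  for (const area of areas) {
    // Any single strict area raises the bar for the whole test.
    if (STRICT_AREAS.indexOf(area) !== -1) return 2;
  }
  return 1;
}
```

A test tagged only with a cosmetic area needs one approval here, while one that touches payments needs two, even if most of its steps are low risk.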

Common failure patterns in AI-generated steps

A review workflow is only useful if it catches the actual classes of mistakes AI tends to make.

Fragile DOM assumptions

The model may infer that the visible button text corresponds to the action you want, but the UI could have multiple similar controls. Reviewers should look for context, hierarchy, and semantic roles.

Overfitted copy checks

AI may produce exact text assertions from the current UI state. If copy changes often, that is a maintenance trap. Consider asserting intent instead of full text.

For instance, Endtest’s AI Assertions are designed for natural-language checks, which can be useful when the validation target is the state of the page rather than a rigid selector or exact string. The documentation describes validating complex conditions in plain language, which can help teams express business expectations more directly. Use that kind of capability carefully, however, because natural-language checks still need review for scope and strictness.

Misordered steps

Generated tests may click before waiting for navigation, or assert before the app is ready. Reviewers should validate sequencing, especially in reactive frontends where state changes are asynchronous.

False confidence from happy-path only coverage

AI is often good at producing the simplest success case. That is helpful, but incomplete. Reviewers should decide whether the generated step set also needs negative coverage, boundary cases, or role-based variants.

Uncontrolled scope creep

A generated test that started as “verify login” might quietly become a multi-page workflow with setup, navigation, profile edits, and reporting. More steps are not necessarily better. Reviewers should keep flows small enough to diagnose when they fail.

A good division of responsibility between AI and humans

AI should draft, propose, and refactor. Humans should approve intent, risk, and maintainability.

A sensible division looks like this:

AI drafts candidate steps from product language
The reviewer normalizes selectors and assertions
Automation validates syntax, metadata, and prohibited patterns
CI runs only approved suites
Failures feed back into the draft queue, not the approved path

This division keeps the strengths of AI (speed and breadth) while preserving the control humans need for reliability.

Where Endtest can fit in a controlled review workflow

If your team wants a low-code option, Endtest can fit into this kind of controlled workflow because its agentic AI test creation approach produces standard, editable platform-native steps rather than opaque output. That matters when the review process depends on humans being able to inspect and adjust each step before approval.

Its AI Assertions capability can also be useful when a reviewable test needs plain-English checks over the page, cookies, variables, or logs, with strictness settings that help teams decide how rigid a validation should be. For teams that want generated steps, but still need maintainable test assets, that is a reasonable place to evaluate the tradeoff.

The broader point is not that a single tool solves review. The point is that whatever platform you use should make the generated artifact editable, traceable, and easy to gate before CI.

How to document the review workflow so it sticks

A review process fails when it lives only in one manager’s head. Put the rules where engineers can find them.

Document:

What qualifies as AI-generated
Which teams approve generated tests
Which checks are mandatory before CI
Which steps require heightened scrutiny
How to mark a test as approved
How to revert an unreviewed change that slipped through

Keep the documentation close to the repo or platform, not hidden in a general engineering handbook. The people using it need the policy at the moment they are about to merge.

A sample policy for merge approval

Here is a lightweight policy you can adapt:

All AI-generated test steps must be created in draft state.
Draft tests must pass structural validation.
A human reviewer must approve the step sequence, locators, and assertions.
Approved tests must run successfully in a staging environment.
CI may execute only artifacts marked approved-for-ci.
Any generated test that changes critical workflows requires a second reviewer.
Any flaky or ambiguous step must be rewritten before merge.

That is enough structure to prevent most avoidable problems without turning the process into bureaucracy.

When to reject an AI-generated test outright

Sometimes the right answer is not to fix the test but to reject it.

Reject if:

The test cannot be isolated from shared state
The generated locators are inherently brittle and cannot be improved
The assertion does not correspond to a business requirement
The test depends on unstable external systems without a contract
The step sequence is so long that failures will be untriageable
The model invented behavior that the product does not actually support

That last point is especially important. AI can confidently hallucinate a flow that sounds plausible. If the product does not support it, do not paper over the mismatch. Remove it.

A final implementation pattern that works well

For many teams, the best pattern is a three-lane system:

Draft lane: AI generates candidate steps
Review lane: humans validate and edit
Approved lane: CI runs only the reviewed assets

This model is simple enough for small teams and scalable enough for larger ones. It also creates a clear boundary between experimentation and enforcement. If you adopt only one thing from this guide, make it that boundary.

AI can accelerate test creation, but speed is only helpful if the results are trustworthy. With a controlled review workflow, you can review AI-generated test steps before they reach CI, keep false confidence out of your pipeline, and still benefit from the productivity gains that make AI testing worth adopting in the first place.