AI-generated test steps can save time, but they also introduce a new failure mode: automation that looks reasonable at a glance and breaks in CI for reasons that were never reviewed. The problem is not that AI writes tests; it is that teams often let generated steps move from prompt to pipeline without a controlled review path. If you want to review AI-generated test steps before they reach CI, you need a workflow that treats them like code, with explicit ownership, validation rules, and merge gates.

This matters for SDETs, QA managers, frontend engineers, and platform teams because the cost of a weak test is not just a flaky build. It is a false sense of coverage, unstable pull requests, and time spent debugging a test that was never sound to begin with. The goal is not to reject AI-generated tests. The goal is to make them reviewable, deterministic where possible, and safe to promote into CI.

What counts as an AI-generated test step

An AI-generated test step is any test action, assertion, or setup instruction produced by a model, agent, or low-code assistant rather than typed directly by a reviewer. That could be:

  • A Playwright or Cypress step drafted from a prompt
  • A low-code workflow created by an agentic platform
  • An assertion suggested from a page description
  • A locator strategy recommended by AI
  • A test flow assembled from recorded user journeys

The risk profile changes depending on where AI sits in the toolchain. Code-first teams may have generated TypeScript snippets that still need normal code review. Low-code teams may have platform-native steps that need approval in a visual editor. Either way, the same review principles apply.

If a step cannot be explained in terms a reviewer understands, it should not be allowed to pass into CI without more scrutiny.

The basic principle: separate generation from approval

The simplest mistake is letting generation and approval happen in the same action. If a model produces a step and the pipeline immediately consumes it, you have no chance to validate intent, selector stability, or assertion quality.

A safer model is:

  1. Generate test steps in a draft state.
  2. Review them against explicit quality rules.
  3. Validate them locally or in a staging runner.
  4. Promote only approved steps into the CI branch or test registry.
  5. Fail the build if the approved artifact changes without review.

This is the same discipline teams use for infrastructure changes, security rules, and production code. The only difference is that test logic often gets treated as disposable, which makes it a soft target for automation mistakes.
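
Step 5 of the list above can be sketched as a fingerprint comparison between the content a reviewer approved and the content CI is about to run. This is a minimal illustration, not a prescribed implementation: `ApprovalRecord`, `fingerprint`, and `changedSinceApproval` are hypothetical names, and the FNV-1a hash is used only to keep the sketch dependency-free. A real pipeline would more likely store a cryptographic hash such as SHA-256.

```typescript
// Hypothetical record stored when a reviewer approves a generated test.
interface ApprovalRecord {
  testId: string;
  approvedBy: string;
  approvedFingerprint: string;
}

// 32-bit FNV-1a hash, hex-encoded. Chosen only to avoid dependencies;
// it is not collision-resistant and is an assumption of this sketch.
function fingerprint(content: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < content.length; i++) {
    hash ^= content.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// True when the artifact no longer matches what the reviewer approved.
function changedSinceApproval(record: ApprovalRecord, currentContent: string): boolean {
  return fingerprint(currentContent) !== record.approvedFingerprint;
}
```

A pre-merge job could call `changedSinceApproval` for each approved test and fail the build when it returns `true`, which sends the change back through review.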

A practical review workflow for AI-generated test steps

A workable review process usually has five stages. The exact implementation depends on your stack, but the control points should be recognizable.

1. Generation in a draft workspace

Create AI-generated steps in a draft area that is not connected to CI. This can be a branch, a sandbox project, a separate folder, or a platform workspace with limited permissions. The draft should preserve the original prompt, generated steps, timestamps, and reviewer notes.

You want traceability here. A reviewer should be able to answer:

  • What prompt produced this test?
  • Which app version was visible to the model?
  • Did the model infer behavior from the UI or from product requirements?
  • Was any human correction already applied?

Without this context, review turns into guesswork.
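
One way to keep that context attached to the draft is to store it as a structured record next to the generated steps. The shape below is an assumption for illustration; the field names (`prompt`, `appVersion`, `source`, and so on) and the `missingContext` helper are hypothetical and should be adapted to whatever your repository or platform already records.

```typescript
// Hypothetical shape for a draft record; field names are illustrative.
interface DraftTestRecord {
  prompt: string;                           // the prompt that produced the test
  appVersion: string;                       // app build the model could see
  source: "ui" | "requirements" | "both";   // what the model inferred behavior from
  generatedAt: string;                      // ISO 8601 timestamp
  humanEdits: string[];                     // corrections already applied by a person
  reviewerNotes: string[];
}

// Returns the names of context fields that are blank or unparseable,
// so a reviewer can see what is missing before starting.
function missingContext(draft: DraftTestRecord): string[] {
  const missing: string[] = [];
  if (draft.prompt.trim() === "") missing.push("prompt");
  if (draft.appVersion.trim() === "") missing.push("appVersion");
  if (isNaN(Date.parse(draft.generatedAt))) missing.push("generatedAt");
  return missing;
}
```

A draft for which `missingContext` returns a non-empty list could be held back from the review queue until the gaps are filled.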

2. Structural validation before human review

Before a human spends time on the test, run automatic validation rules. These are not pass-or-fail checks for the app under test; they are checks on the test artifact itself.

Good structural rules include:

  • No hardcoded sleeps unless explicitly allowed
  • Every locator must use an approved strategy or a documented fallback
  • Every assertion must be tied to a visible outcome or a business rule
  • No duplicate steps that repeat the same action without reason
  • No navigation to production systems from non-production pipelines
  • No generated step may mutate shared state without cleanup

For code-first tests, these rules can be enforced with linters, static analysis, or custom AST checks. For low-code tests, the equivalent may be schema validation and platform-side constraints.
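
For a code-first suite, a first pass at two of these rules (hard sleeps and production URLs) could be as small as the sketch below. It uses line-based regular expressions as a simpler stand-in for a real linter rule or AST check, so it will miss multi-line constructs. The `lintGeneratedTest` function, the rule names, and the `example.com` production host are all assumptions for illustration.

```typescript
interface LintFinding {
  line: number; // 1-based line number within the test source
  rule: string;
  text: string;
}

// Illustrative patterns only; tune them to the frameworks and hosts you use.
const FORBIDDEN: { rule: string; pattern: RegExp }[] = [
  // page.waitForTimeout(...) in Playwright, cy.wait(<number>) in Cypress
  { rule: "no-hard-sleep", pattern: /waitForTimeout\s*\(|cy\.wait\s*\(\s*\d/ },
  // Hypothetical production host; replace with your own.
  { rule: "no-production-url", pattern: /https?:\/\/(www\.)?example\.com/ },
];

// Scans the source line by line and reports every forbidden pattern found.
function lintGeneratedTest(source: string): LintFinding[] {
  const findings: LintFinding[] = [];
  source.split("\n").forEach((text, index) => {
    for (const { rule, pattern } of FORBIDDEN) {
      if (pattern.test(text)) {
        findings.push({ line: index + 1, rule, text: text.trim() });
      }
    }
  });
  return findings;
}
```

Note that an alias wait such as `cy.wait('@createOrder')` is not flagged, because the pattern only matches a numeric argument.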

3. Human review with a checklist

This is where the actual review happens. A reviewer should inspect the test for correctness, maintainability, and intent. A good checklist is short, but strict.

Ask:

  • Does the test verify a real user outcome?
  • Are selectors stable enough for the app’s change rate?
  • Does the assertion match the business requirement, not just the DOM?
  • Is the step sequence minimal, or does it contain unnecessary interactions?
  • Are waits conditional instead of time-based?
  • Are test data and environment assumptions documented?
  • Would this still be understandable in six months?

If the answer to any of these is unclear, the test should stay in draft.

4. Controlled execution in a staging or preview environment

After review, run the test in a controlled environment before it is allowed into the main CI path. This is important because some mistakes only appear at runtime, for example:

  • A locator matches the wrong element
  • An assertion passes on a loading state instead of the final state
  • A generated step depends on a tooltip that appears only in one browser
  • A form field is disabled until client-side hydration completes

Use this phase to capture artifacts, logs, screenshots, and timing information. The goal is not full confidence; it is to detect obvious mismatches before the merge gate.

5. Promotion through CI gating

Only approved and validated artifacts should be merged into the CI path. That means pull request checks, branch protections, or test registry promotion rules should stop unreviewed changes from entering the pipeline.

CI gating can be simple, such as requiring a code owner review for test files. It can also be more specific, such as rejecting any test change that lacks a review marker or approval label.

What to review in AI-generated steps

The key is not to judge whether the test is “good” in the abstract. Judge whether it is safe to run in CI and valuable to keep.

1. Selector quality

Selectors are one of the most common AI failure points. Models may pick brittle CSS paths, text selectors that are likely to change, or overly broad locators.

Prefer selectors that are:

  • Stable across cosmetic changes
  • Anchored to semantics, not layout
  • Unique enough to avoid ambiguity
  • Legible to the test maintainer, who should be able to tell which element is targeted

For example, a generated selector like div:nth-child(3) > button is usually suspect. A data attribute such as data-testid="save-profile" is often better if your team maintains that contract.

If your team uses Playwright, this review can include an explicit preference for locators that match your engineering standards. For example:

typescript

await page.getByTestId('save-profile').click();
await expect(page.getByRole('status')).toHaveText('Saved');

The point is not that Playwright is the only valid answer; it is that generated tests should respect the locator strategy the team already trusts.
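
A preference like this can also be pre-screened mechanically, so the reviewer's attention goes to the selectors most likely to break. The heuristic below is a rough sketch under stated assumptions: the three verdicts, the depth threshold, and the `classifySelector` name are hypothetical, and the rules only encode the preferences described in this section rather than any standard.

```typescript
type SelectorVerdict = "preferred" | "acceptable" | "suspect";

// Heuristic classification of a raw CSS selector string.
function classifySelector(selector: string): SelectorVerdict {
  // Positional or deeply chained CSS tends to break on layout changes.
  const positional = /:nth-(child|of-type)\(/.test(selector);
  const depth = selector.split(">").length - 1; // number of child combinators
  if (positional || depth >= 3) return "suspect";
  // A test-id attribute is an explicit contract between the app and its tests.
  if (/\[data-testid=/.test(selector)) return "preferred";
  return "acceptable";
}
```

With these rules, `classifySelector('div:nth-child(3) > button')` returns `"suspect"` and `classifySelector('[data-testid="save-profile"]')` returns `"preferred"`. A "suspect" verdict is a prompt for a closer look, not an automatic rejection.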

2. Assertion strength

AI often produces assertions that are too weak or too specific. Both are problems.

Too weak:

  • “Element exists” without proving the user outcome
  • “Text contains” when the exact state matters
  • “Page loaded” when the business flow is unfinished

Too specific:

  • Exact copy checks for text that changes for localization or A/B experiments
  • Pixel-based checks for content that varies with rendering differences
  • Assertions tied to transient loading states

A useful review question is, “What user contract is this assertion proving?” If you cannot answer it, the assertion is probably not ready.

3. Wait strategy

Generated steps often overuse fixed waits because they are easy to write. That makes tests slow and flaky.

Reject steps that use hard sleeps unless the wait is truly about a known external delay, and even then, document why.

Prefer waiting for:

  • A visible state change
  • Network completion if your framework supports it
  • A specific element becoming enabled
  • An API response that backs the UI transition

For example, in Cypress:

javascript

cy.intercept('POST', '/api/orders').as('createOrder');
cy.get('[data-testid="submit-order"]').click();
cy.wait('@createOrder');
cy.contains('Order confirmed').should('be.visible');

The review should check that the wait aligns with the app’s actual behavior, not with model convenience.

4. Test data assumptions

AI-generated tests often assume data exists, users are logged in, or feature flags are enabled. Reviewers need to confirm those assumptions are encoded in setup steps or fixtures.

Questions to ask:

  • Does the test create its own data?
  • Does it rely on shared accounts?
  • Is the test isolated from previous runs?
  • Are flags, locale, and permissions explicit?

If not, CI will eventually surface nondeterministic failures that are hard to reproduce.

5. Cleanup and side effects

Any generated test that creates data, changes settings, or submits forms should have a cleanup strategy or use disposable environments. Reviewers should look for idempotency and isolation.

If a flow changes a profile setting, can that be reset? If it creates an order, can the test environment absorb that action safely? If not, the test may be acceptable only in a non-destructive environment.

A review checklist you can actually use

A concise checklist works better than a long policy document. Here is a practical one for AI test approval:

  • The test maps to a real user or business path
  • The locator strategy matches team standards
  • All waits are event-driven or condition-based
  • Assertions prove behavior, not implementation noise
  • Test data is explicit and isolated
  • The flow can be understood by a human maintainer
  • Side effects are safe or cleaned up
  • The test is tagged for the right pipeline stage
  • The generated artifact and reviewer approval are traceable

If you want a gate that is easy to automate, require each item to be explicitly marked during review. That creates a structured approval record instead of informal comments buried in chat.
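
One way to make that record machine-checkable is to give each checklist item a stable key and have the reviewer mark it explicitly. The keys below are shorthand invented for this sketch, one per checklist item above; `ChecklistRecord` and `unresolvedItems` are likewise hypothetical names.

```typescript
// One key per checklist item above; the key names are illustrative.
const CHECKLIST_ITEMS = [
  "maps-to-user-path",
  "locators-match-standards",
  "waits-condition-based",
  "assertions-prove-behavior",
  "test-data-isolated",
  "understandable",
  "side-effects-safe",
  "tagged-for-stage",
  "traceable",
] as const;

type ChecklistItem = (typeof CHECKLIST_ITEMS)[number];

// A reviewer marks each item true or false; unmarked items stay undefined.
type ChecklistRecord = Partial<Record<ChecklistItem, boolean>>;

// Items that were left unmarked or explicitly marked false.
function unresolvedItems(record: ChecklistRecord): ChecklistItem[] {
  return CHECKLIST_ITEMS.filter((item) => record[item] !== true);
}
```

A gate built on this would approve a test only when `unresolvedItems` returns an empty list, so skipping an item has the same effect as failing it.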

How to enforce CI gating without slowing the team down

The best CI gating is strict enough to protect quality and lightweight enough that engineers do not route around it.

Branch protection and code owners

For code-first repositories, put AI-generated test files behind normal branch protections. If the file path is tests/generated/, require a reviewer from QA or the SDET group before merge. This is basic, but effective.

Metadata labels or approval markers

If the tests are generated in a platform, create a status like draft, reviewed, or approved-for-ci. CI should refuse to consume anything below approved-for-ci.

Pre-merge validation jobs

Run a dedicated validation job before merge. It should verify:

  • Steps conform to schema
  • No forbidden commands exist
  • Assertions match allowed patterns
  • Review metadata is present
  • Test artifacts are tied to the current branch or change request

Here is a small GitHub Actions example for structural checks on generated test files:

name: validate-generated-tests
on:
  pull_request:
    paths:
      - 'tests/generated/**'
jobs:
  lint-and-validate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint:generated-tests
      - run: npm run validate:test-metadata

Promote only reviewed artifacts

Do not let the CI job discover and execute draft tests by default. Instead, have the pipeline read only approved manifests or tagged suites. That way, a generated draft is not accidentally included because someone named a file in the wrong directory.
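
As a rough sketch of that idea, the run list can be derived from a manifest of review statuses rather than from a directory glob. The `ManifestEntry` shape and the `filesForCi` function are assumptions for illustration; the status values reuse the draft, reviewed, and approved-for-ci labels mentioned earlier.

```typescript
type ReviewStatus = "draft" | "reviewed" | "approved-for-ci";

// Hypothetical manifest entry: one per generated test file.
interface ManifestEntry {
  file: string;
  status: ReviewStatus;
}

// CI takes its run list from the manifest instead of globbing directories,
// so a misplaced draft file is never picked up by accident.
function filesForCi(manifest: ManifestEntry[]): string[] {
  return manifest
    .filter((entry) => entry.status === "approved-for-ci")
    .map((entry) => entry.file);
}
```

Because "reviewed" is deliberately excluded, a test that has passed human review but not yet run in staging still stays out of the pipeline.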

Where human review should be strict, and where it can be lenient

Not every step deserves the same level of scrutiny. A good workflow applies stricter review to critical paths.

Be strict for:

  • Authentication
  • Payments
  • Account recovery
  • Destructive actions
  • Compliance-sensitive flows
  • Assertions on legal or contractual copy

Be more flexible for:

  • Cosmetic checks in non-critical areas
  • Exploratory test scaffolding
  • Temporary tests for new features under active development

This is where teams often overcorrect. They either review everything with the same intensity, which slows delivery, or they review almost nothing, which defeats the point. Severity-based review is the better model.
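
Severity-based review can be expressed as a small rule that maps a test's tags to the number of approvals it needs. In the sketch below the tag names are hypothetical shorthand for the strict categories listed above, and requiring two reviewers for critical flows is an assumption borrowed from the sample policy later in this guide.

```typescript
// Tags treated as critical; names are illustrative.
const CRITICAL_TAGS = [
  "auth",
  "payments",
  "account-recovery",
  "destructive",
  "compliance",
  "legal-copy",
];

// Critical flows need two approvals, everything else needs one.
function requiredReviewers(tags: string[]): number {
  const critical = tags.some((tag) => CRITICAL_TAGS.indexOf(tag) !== -1);
  return critical ? 2 : 1;
}

function canMerge(tags: string[], approvals: string[]): boolean {
  // Count distinct approvers so one person cannot approve twice.
  const distinct = approvals.filter((name, i) => approvals.indexOf(name) === i);
  return distinct.length >= requiredReviewers(tags);
}
```

Keeping the rule this small makes it easy to run as a merge check and easy to change when the list of critical flows grows.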

Common failure patterns in AI-generated steps

A review workflow is only useful if it catches the actual classes of mistakes AI tends to make.

Fragile DOM assumptions

The model may infer that the visible button text corresponds to the action you want, but the UI could have multiple similar controls. Reviewers should look for context, hierarchy, and semantic roles.

Overfitted copy checks

AI may produce exact text assertions from the current UI state. If copy changes often, that is a maintenance trap. Consider asserting intent instead of full text.

For instance, Endtest’s AI Assertions are designed for natural-language checks, which can be useful when the validation target is the state of the page rather than a rigid selector or exact string. The documentation describes validating complex conditions in plain language, which can help teams express business expectations more directly. Use that kind of capability carefully, however, because natural-language checks still need review for scope and strictness.

Misordered steps

Generated tests may click before waiting for navigation, or assert before the app is ready. Reviewers should validate sequencing, especially in reactive frontends where state changes are asynchronous.

False confidence from happy-path only coverage

AI is often good at producing the simplest success case. That is helpful, but incomplete. Reviewers should decide whether the generated step set also needs negative coverage, boundary cases, or role-based variants.

Uncontrolled scope creep

A generated test that started as “verify login” might quietly become a multi-page workflow with setup, navigation, profile edits, and reporting. More steps are not necessarily better. Reviewers should keep flows small enough to diagnose when they fail.

A good division of responsibility between AI and humans

AI should draft, propose, and refactor. Humans should approve intent, risk, and maintainability.

A sensible division looks like this:

  • AI drafts candidate steps from product language
  • The reviewer normalizes selectors and assertions
  • Automation validates syntax, metadata, and prohibited patterns
  • CI runs only approved suites
  • Failures feed back into the draft queue, not the approved path

This division keeps the strengths of AI, speed and breadth, while preserving the control humans need for reliability.

Where Endtest can fit in a controlled review workflow

If your team wants a low-code option, Endtest can fit into this kind of controlled workflow because its agentic AI test creation approach produces standard, editable platform-native steps rather than opaque output. That matters when the review process depends on humans being able to inspect and adjust each step before approval.

Its AI Assertions capability can also be useful when a reviewable test needs plain-English checks over the page, cookies, variables, or logs, with strictness settings that help teams decide how rigid a validation should be. For teams that want generated steps, but still need maintainable test assets, that is a reasonable place to evaluate the tradeoff.

The broader point is not that a single tool solves review. The point is that whatever platform you use should make the generated artifact editable, traceable, and easy to gate before CI.

How to document the review workflow so it sticks

A review process fails when it lives only in one manager’s head. Put the rules where engineers can find them.

Document:

  • What qualifies as AI-generated
  • Which teams approve generated tests
  • Which checks are mandatory before CI
  • Which steps require heightened scrutiny
  • How to mark a test as approved
  • How to revert an unreviewed change that slipped through

Keep the documentation close to the repo or platform, not hidden in a general engineering handbook. The people using it need the policy at the moment they are about to merge.

A sample policy for merge approval

Here is a lightweight policy you can adapt:

  1. All AI-generated test steps must be created in draft state.
  2. Draft tests must pass structural validation.
  3. A human reviewer must approve the step sequence, locators, and assertions.
  4. Approved tests must run successfully in a staging environment.
  5. CI may execute only artifacts marked approved-for-ci.
  6. Any generated test that changes critical workflows requires a second reviewer.
  7. Any flaky or ambiguous step must be rewritten before merge.

That is enough structure to prevent most avoidable problems without turning the process into bureaucracy.

When to reject an AI-generated test outright

Sometimes the right answer is not to fix the test but to reject it.

Reject if:

  • The test cannot be isolated from shared state
  • The generated locators are inherently brittle and cannot be improved
  • The assertion does not correspond to a business requirement
  • The test depends on unstable external systems without a contract
  • The step sequence is so long that failures will be untriageable
  • The model invented behavior that the product does not actually support

That last point is especially important. AI can confidently hallucinate a flow that sounds plausible. If the product does not support it, do not paper over the mismatch. Remove it.

A final implementation pattern that works well

For many teams, the best pattern is a three-lane system:

  • Draft lane: AI generates candidate steps
  • Review lane: humans validate and edit
  • Approved lane: CI runs only the reviewed assets

This model is simple enough for small teams and scalable enough for larger ones. It also creates a clear boundary between experimentation and enforcement. If you adopt only one thing from this guide, make it that boundary.

AI can accelerate test creation, but speed is only helpful if the results are trustworthy. With a controlled review workflow, you can review AI-generated test steps before they reach CI, keep false confidence out of your pipeline, and still benefit from the productivity gains that make AI testing worth adopting in the first place.