Flakiness in Automated Testing: Reasons and How to Reduce It

According to Google research, 84% of failures that look like code regressions in CI systems are actually caused by unstable tests, not real bugs. Atlassian estimates that its engineering organization loses more than 150,000 developer hours per year investigating failures generated by flakiness.
In this guide, you will understand what flaky tests are, what the most common causes are, how to detect them in your suite, and most importantly, how to eliminate them for good: through best practices, stronger test design, and the growing role of artificial intelligence in this process.
What is a flaky test? A precise definition
A flaky test (or unstable test) is an automated test that produces inconsistent results: it passes in one run and fails in another, without any change to the application code or the test environment. It is non-deterministic behavior, where the same inputs do not guarantee the same outputs.
It sounds straightforward, but confusion with other types of failures is common. It is worth distinguishing between them:
| Failure type | Behavior | What it indicates |
| --- | --- | --- |
| Flaky test | Intermittent failure, no pattern | Fragility in the test or the environment |
| Broken test | Fails consistently, every time | Real bug or misconfiguration |
| Real bug | Reproducible failure with defined steps | Problem in the application |
A flaky test is not necessarily a symptom of a bug in the application. It is, in most cases, a symptom of fragility in the test itself or in the environment where it runs.
Tools like Cypress and Selenium, because they rely on rigid selectors and static scripts, are particularly vulnerable to flakiness when the application layout changes. TestBooster.ai addresses this problem at the root: its intent-driven AI interprets what the test is trying to do (not where to click in the HTML), and automatically adapts to UI changes without any rework.
Why are flaky tests so dangerous?
At first glance, a test that “sometimes fails” seems like a minor inconvenience. In practice, the impact is far greater and compounds over time.
The cost of instability
The arithmetic is straightforward. With a flakiness rate of just 1.5% across a suite of 1,000 tests, roughly 15 tests will fail in every release cycle. Each failure demands investigation: is it a real bug or just noise? At 30 to 90 minutes per investigation, that translates to roughly 7 to 22 hours of development time lost per release, chasing ghosts.
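Under the assumptions above (1,000 tests, a 1.5% flakiness rate, 30 to 90 minutes per investigation), the cost works out as:

```python
tests = 1000
flaky_rate = 0.015                             # 1.5% of the suite flakes per cycle
minutes_per_investigation = (30, 90)

failures_per_cycle = tests * flaky_rate        # ~15 flaky failures per release
hours_low = failures_per_cycle * minutes_per_investigation[0] / 60   # ~7.5 hours
hours_high = failures_per_cycle * minutes_per_investigation[1] / 60  # ~22.5 hours

print(f"{failures_per_cycle:.0f} failures, {hours_low:.1f}-{hours_high:.1f} hours lost per release")
```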
At greater scale: Atlassian documented that, in the Jira Frontend repository, unstable tests were responsible for up to 21% of failures in the main build. At Slack, before an automated system for detecting and suppressing flaky tests was implemented, only 20% of builds were passing; of the failing builds, 57% were caused by unstable tests rather than compilation errors or real issues. After the process was automated, the share of builds failing due to flakiness dropped to less than 5%.
Between 2022 and 2025, the proportion of teams experiencing flaky tests grew from 10% to 26%, a 160% increase over three years, according to a Bitrise analysis of more than 10 million mobile builds.
The “Cry Wolf” effect
There is a well-documented phenomenon in medicine called alarm fatigue: when healthcare professionals are exposed to an excessive volume of alerts, many of which are false positives, they begin ignoring them by reflex. The consequences can be fatal.
With tests, the mechanism is the same. When a suite fails frequently without a real cause, developers stop treating failures as reliable signals. Those who deal with unstable tests more often are significantly more likely to dismiss failures that could be genuine bugs.
Flakiness blocks the CI/CD pipeline
Continuous integration pipelines depend on reliable signals to decide whether a deployment goes out. When those signals are noisy, one of two things happens: either teams block releases unnecessarily (investigating failures that are not bugs), or they learn to ignore failures and ship code without real confidence in quality. Neither scenario is acceptable.
The 7 main causes of flaky tests
Understanding the root cause is the first step toward solving the problem. Most flakiness cases fall into one of these seven categories.
1. Synchronization and async timing issues
This is the most common cause. Academic research on 201 flaky test fixes in Apache projects found that 45% of cases are related to async synchronization issues.
What happens in practice: the test tries to interact with an element (a button, a modal, a form field) before it is available in the interface. The most common, but incorrect, fix is adding a fixed sleep(): “wait 2 seconds and move on.” This works on the local machine, where the environment is fast. It fails in CI, where resources are shared and responses can take longer.
The right solution is to use dynamic waits: instead of waiting a fixed amount of time, the test periodically checks whether the element is available before proceeding. No fixed delay, no dependency on momentary environment conditions.
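A minimal polling helper illustrates the idea. This is a sketch, not a library API; the `page.is_visible` call in the usage comment is a hypothetical page-object method invented for illustration:

```python
import time

def wait_until(condition, timeout=10.0, interval=0.25):
    """Poll `condition` until it returns a truthy value, instead of
    sleeping for a fixed amount of time."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError(f"condition not met within {timeout}s")
        time.sleep(interval)

# Usage (hypothetical page object):
# wait_until(lambda: page.is_visible("#checkout"), timeout=5)
```

On a fast machine the condition succeeds on the first poll; in a slow CI environment the helper simply keeps checking until the element is ready, so the same test passes in both.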

2. Race conditions and concurrency
When multiple processes or tests compete for the same shared resource, the outcome depends on which process gets there first. That is a race condition.
A classic example: Test_CreateUser creates a user with ID=123. Test_DeleteUser deletes the user with ID=123. If both run in parallel and the deletion happens before the creation, the test fails, even though neither test has a bug. The execution order determines the outcome, making it non-deterministic.
The solution requires isolation: each test operates on its own data, without depending on state created by other tests.
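A sketch of this isolation pattern, using a plain dict as a stand-in for the database and a random ID instead of a shared, hard-coded one:

```python
import uuid

def create_user(db):
    """Give each test its own user ID so parallel tests never collide
    on the same row (no shared ID=123)."""
    user_id = f"user-{uuid.uuid4().hex}"
    db[user_id] = {"name": "test user"}
    return user_id

def test_delete_user():
    db = {}                    # per-test store, not shared with other tests
    user_id = create_user(db)  # this test's own data
    del db[user_id]
    assert user_id not in db
```

Because each test creates and deletes only its own user, the execution order no longer matters.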
3. Brittle selectors
This is the most relevant cause for teams using traditional tools like Selenium, Cypress, or Playwright. Selectors such as #btn-checkout-v2, .product-list-item:first-child, or dynamic text tied to the layout break with every interface update: a redesign, a component change, an A/B test.
The developer did not change the business logic. They changed the HTML. And the entire test suite goes down with it.
This is where TestBooster.ai differentiates itself fundamentally. While traditional tools rely on selectors that must be manually maintained with every layout change, TestBooster uses intent-driven AI: tests are written in plain language (“click the add to cart button”, “fill in the email field with the test user”) and the AI interprets the intent, not the selector. When the UI changes, the test keeps working, with no rework, no investigation, no broken pipeline.
4. Unstable external dependencies
Tests that depend on third-party APIs, shared databases, or external services are inherently more fragile. The stability of the test becomes tied to factors outside your control: network latency, rate limiting, momentary service unavailability.
The most effective mitigation is mocking: simulating the behavior of the external service within the test environment, ensuring the test validates the application logic, not third-party availability. Containerization (Docker, for example) helps create reproducible, isolated environments.
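A minimal sketch of the mocking approach using Python's `unittest.mock`; the `get_exchange_rate` function and the client's `get` method are hypothetical, invented for illustration:

```python
from unittest.mock import Mock

def get_exchange_rate(client, currency):
    """Application code under test: fetches a rate via a third-party API client."""
    return client.get(f"/rates/{currency}")["rate"]

def test_get_exchange_rate():
    client = Mock()                           # stand-in for the real HTTP client
    client.get.return_value = {"rate": 5.43}  # deterministic canned response
    assert get_exchange_rate(client, "BRL") == 5.43
    client.get.assert_called_once_with("/rates/BRL")
```

The test now validates the application logic and fails only when that logic breaks, never because the third-party service was slow or down.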
5. Shared state between tests (test order dependency)
When a test depends on data created by a previous test, execution order starts to matter. This violates a fundamental principle of test automation: tests must be independent of each other.
The problem manifests subtly. Tests pass when run in the default order. They fail when run in a random order or in parallel. The diagnosis is straightforward (run in randomized order) and so is the fix: each test should perform its own setup and teardown, guaranteeing a clean and predictable state.
6. Environment inconsistencies
“It works on my machine” is one of the most famous, and most frustrating, phrases in software development. When the developer’s test environment differs from the CI environment in browser version, operating system, library version, or network configuration, results diverge.
The structural solution is environment standardization: containerization with Docker ensures the test runs under the same conditions everywhere. For UI tests across multiple browsers, tools with support for standardized execution are essential.
7. Poor test design
Not all flakiness comes from the environment. Some comes from the test itself, written or structured poorly. Tests that validate more than one thing at a time, that use locators based on dynamic text, or that lack clear assertions are naturally more unstable.
The single responsibility principle applies here: each test should validate exactly one thing. Assertions must be explicit and deterministic. Wait logic must be clear. The simpler and more focused the test, the less surface area it exposes to failure.
8 practical strategies to reduce flakiness
Once the problem is diagnosed, it is time to act. These are the highest-impact approaches, in order of applicability:
- Use dynamic waits: never use a fixed sleep(). Implement smart polling that checks for element availability before interacting. This eliminates dependency on momentary environment conditions.
- Fully isolate your tests: each test should be self-contained, with its own setup and teardown. No test should depend on state left by another.
- Mock external dependencies: simulate APIs, databases, and third-party services within the test environment. This takes every factor outside your control out of the equation.
- Containerize your environments: Docker ensures parity between local and CI environments. The test runs under the same conditions every time, regardless of where it is executed.
- Eliminate brittle selectors: favor semantic selectors, data-testid attributes, or better yet, tools that do not rely on selectors at all. Intent-driven AI eliminates this problem at the root.
- Implement test quarantine: as soon as a test is identified as unstable, isolate it, even before investigating the cause. Do not let it pollute the main pipeline.
- Monitor flakiness rates over time: define clear metrics (e.g., “no test should fail more than 5% of the time without a code change”) and monitor continuously. You cannot improve what you do not measure.
- Invest in AI-powered self-healing: tools that automatically adapt to UI changes eliminate at the root the main source of flakiness in interface tests.
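The monitoring threshold from the list above can be expressed as a simple per-test metric. This is a sketch; the result strings and the 5% quarantine policy are taken from the example in the list, not from any specific tool:

```python
def failure_rate(results):
    """Share of runs that failed, for a test with no code change between runs."""
    return results.count("fail") / len(results)

def should_quarantine(results, threshold=0.05):
    # Policy from the list above: flag any test failing more than 5%
    # of the time without a code change.
    return failure_rate(results) > threshold
```

Fed with each test's recent CI history, this is enough to drive an automated quarantine step like the ones Slack and Atlassian built.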
Traditional tools vs. the AI approach
| Tool | Relies on selectors | Resilient to UI changes | Self-healing | Flakiness risk from layout |
| --- | --- | --- | --- | --- |
| Selenium | Yes (XPath/CSS) | No | No | High |
| Cypress | Yes (CSS/data-attr) | Partial | No | High |
| Playwright | Yes (locators) | Partial | No | Medium |
| TestBooster.ai | No (natural language) | Yes | Yes | Very low |
FAQ about flaky tests

What is a flaky test?
A flaky test is an automated test that produces inconsistent results: it passes in one run and fails in another without any change to the code or the environment. It is non-deterministic behavior that undermines the reliability of the automation suite and the CI/CD pipeline.
What is the main cause of flaky tests?
The most common cause is synchronization and async timing issues, responsible for around 45% of documented fixes in Apache open-source projects. In UI and E2E tests, brittle selectors that break when the layout changes are equally prevalent and, often, the most recurring source of instability in day-to-day engineering work.
Are flaky tests worse than having no tests at all?
In many scenarios, yes. A test that passes 70% of the time generates false negatives that drain the team. More critically: with multiple unstable tests, the suite as a whole starts failing in most runs, producing the “cry wolf” effect. Developers learn to ignore failures and real bugs reach production unchallenged. The phenomenon is detailed in the multivocal review published on ScienceDirect, which analyzed 200 articles on flakiness in research and practice.
Should I use retries to fix flaky tests?
Retries are a temporary workaround, acceptable while the root cause is being investigated. Using them as a permanent policy is counterproductive: they mask real bugs, normalize failures, increase CI execution time, and degrade the quality of the automation signal over time.
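If you do use retries as a stopgap, keep them bounded and visible. A sketch of such a decorator (an illustration, not a specific framework's retry feature):

```python
import functools

def retry_flaky(extra_attempts=2):
    """Temporary mitigation only: rerun a failing test a bounded number of
    times, and surface that a retry was needed so the root cause still
    gets investigated."""
    def decorator(test):
        @functools.wraps(test)
        def wrapper(*args, **kwargs):
            for attempt in range(extra_attempts + 1):
                try:
                    return test(*args, **kwargs)
                except AssertionError:
                    if attempt == extra_attempts:
                        raise  # exhausted: report the failure, don't hide it
                    print(f"[flaky] retrying {test.__name__} (attempt {attempt + 2})")
        return wrapper
    return decorator
```

The logging matters: a retry that succeeds silently is exactly how real bugs get normalized away.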
How does AI help reduce flaky tests?
AI-powered tools eliminate the main cause of flakiness in UI tests: the dependency on technical selectors that break with every layout change. By using intent-driven AI, tests execute based on what the user wants to do, not on where the element sits in the HTML. This makes the suite fundamentally more resilient to product evolution.
What is the difference between a flaky test and a broken test?
A broken test fails consistently, every time: it clearly signals a real bug or misconfiguration, and is relatively straightforward to diagnose. A flaky test fails intermittently, with no clear pattern, making it difficult to determine whether the failure is noise or a genuine signal. That ambiguity is precisely what makes flakiness so costly.
Does TestBooster eliminate flakiness entirely?
TestBooster eliminates the main source of flakiness in UI and E2E tests: layout changes. The intent-driven AI automatically adapts to the interface, with no need to update selectors or scripts. For other causes (timing, external dependencies), the tool provides detailed reports with screenshots and actionable insights so the team can efficiently identify and fix the root of the problem.
Visit our website and see TestBooster in action → testbooster.ai


