When System Weaknesses Go Unnoticed, Human Error Becomes Inevitable

There is a pattern that repeats itself across industries, organisations, and incident reports with remarkable consistency. It goes like this:

Something small starts going wrong. People notice. They adapt. They work around it. Life goes on. And then, one day, something larger fails — and everyone looks back at the trail of small signals that were there all along, quietly announcing what was coming.

This is not a story about negligence. It is not a story about bad people or poor training. It is a story about what happens when organisations are better at reacting to failures than they are at listening to the system that is trying to warn them.

The QC laboratory balance is a perfect case in point.


What Happened in the Lab

In a quality control laboratory, analysts began reporting unstable balance readings. The readings were fluctuating. Results were inconsistent. The balances — critical instruments in a setting where precision is not optional — were behaving unreliably.

Initially, the reports were treated as a minor inconvenience. Perhaps a bit of environmental interference. Perhaps user technique. Perhaps nothing to worry about too urgently.

But the analysts had to keep working. QC labs don’t pause because a balance is misbehaving. So they did what skilled, resourceful people do when a tool lets them down: they found another way. Despite having three analytical balances in their solution preparation area — right where they needed them — analysts began making the journey to a distant zone to weigh their materials.

Think about what that sentence contains. Three balances. Available. Nearby. And yet people were walking away from them, every time, to use a different one somewhere else in the facility. Not because they were told to. Not because of a procedure change. Because they had quietly, individually and collectively, lost trust in the equipment that was supposed to support their work.

When the issue was eventually investigated formally, the root cause was found. It was not mysterious. It was not complex. The granite anti-vibration slab on which the balances sat had shifted. It was now in contact with adjacent structures. Vibrations that the slab was designed to isolate were being transmitted directly into the balances through that physical contact. A slab that was supposed to protect the instruments from the building’s noise had, through a gradual, undetected shift, become the very conduit of the interference it was meant to block.

The fix, once identified, was straightforward. The slab was repositioned. The contact was eliminated. The balances returned to stability.

But the story doesn’t end there. Because the more important question is not what the fix was. It is everything that happened before the fix was applied.


The Anatomy of a Drift Towards Failure

What this incident describes so precisely is a phenomenon that James Reason — the cognitive psychologist whose work on human error remains indispensable — called system migration towards failure. Systems do not typically fail catastrophically and without warning. They drift. They accumulate small degradations. They develop gaps between how they were designed to work and how they are actually working. And crucially, people inside the system adapt to those gaps — which makes the gaps invisible to the organisation even as they grow.

Let us trace the drift in the balance story:

Stage 1 — The physical change. The granite slab shifts. This happens silently, probably gradually, possibly during a facility move, maintenance activity, or simply through the minor ground vibrations of daily operations. No alarm sounds. No system flags it. No one sees it.

Stage 2 — The performance degradation. The balances begin to behave unreliably. Readings fluctuate. The instruments are still technically operational — they haven’t broken down, they haven’t stopped producing numbers. They are just producing the wrong numbers, inconsistently.

Stage 3 — The signal. Analysts notice. They report. The signals are real. They are coming from the sharp end of the operation — the people closest to the work, whose professional judgement tells them something is wrong.

Stage 4 — The misclassification. The signals are received and categorised as a “minor inconvenience.” Not investigated. Not escalated. Not treated as the system warning they actually were. This is the critical juncture. The moment the signal was downgraded in importance is the moment the drift accelerated without resistance.

Stage 5 — The workaround. Analysts, unable to work with unreliable equipment and unable to wait for a fix that hadn’t been prioritised, developed their own solution. Walk to the distant zone. Use the trusted balance. Get the work done. This workaround was sensible. It was professional. It kept operations running. And it completely masked the problem from any aggregate view.

Stage 6 — Normalisation. The workaround becomes routine. New analysts learn it as “how things are done here.” The three nearby balances sit unused for weighing. The system has adapted around the failure. The failure is now invisible — not because it has been fixed, but because the human beings in the system have compensated for it.

This is where organisations are most vulnerable. Not when systems fail dramatically, but when systems fail quietly and people compensate effectively.


What Was Actually at Risk

It would be easy to look at this story and conclude that the outcome was manageable — analysts found a workaround, work continued, no major deviation occurred. But that reading misses the severity of what was accumulating beneath the surface.

Unreliable results. During the period between the slab shifting and the investigation, some weighing may have been done on the compromised balances. Analytical results in a QC laboratory flow downstream into product release decisions, stability assessments, and regulatory documentation. An unstable balance is not an inconvenience in that context. It is a potential data integrity issue.

Inconsistent practices. Some analysts walked to the distant zone. Others may not have known to. Some may have continued using the local balances, unaware of the consensus that had developed among their more experienced colleagues. Informal workarounds are, by definition, not standardised. Different people behave differently. Consistency — a cornerstone of any QC operation — is compromised.

Loss of confidence in equipment. Trust, once lost, is not automatically restored when the underlying issue is fixed. Analysts who spent weeks or months compensating for unreliable balances do not immediately return to trusting them because someone said “it’s fixed now.” Equipment confidence has to be rebuilt, which takes time and evidence.

Increased deviation risk. Every workaround is an undocumented procedure deviation waiting to be discovered. If an audit or inspection had examined the weighing records against actual practice — why are analysts consistently using a balance in Zone B for solution preparation when Zones A, C, and D have dedicated instruments? — the gap between documented procedure and actual practice would have been visible and difficult to explain.

Masking of a systemic issue. Perhaps most importantly, the workaround prevented the system signal from propagating upward. If no one reports a problem because everyone has found their own way around it, the system never receives the information it needs to trigger a proper investigation. The root cause remains in place. And it may not be finished causing problems.


Near Misses: The Warnings We Are Trained to Ignore

There is a concept in safety science — one that has been hard-won through disaster investigations in aviation, nuclear power, petrochemicals, and healthcare — called the near miss. A near miss is an event that did not result in harm but had the potential to do so. The conventional response to near misses, in organisations with mature safety cultures, is to treat them with nearly as much seriousness as actual incidents. Not because of what happened, but because of what they reveal about the conditions in which something worse could happen.

The balance reports were near misses. Every time an analyst noticed a fluctuating reading and adapted, that was the system trying to communicate. Every workaround was a near miss data point that, properly collected and analysed, would have pointed directly to the anti-vibration slab.

But near misses are systematically underreported and underinvestigated in most organisations, for a set of entirely understandable reasons:

  • They didn’t cause harm, so the urgency doesn’t feel proportionate
  • Reporting them requires effort and time that people under workload pressure don’t have
  • The culture may — explicitly or implicitly — discourage bringing “small problems” to management attention
  • Individual workarounds feel like solutions, removing the felt need to report a problem that has been “handled”
  • The signal-to-noise ratio is low: in any complex operation, there are many minor anomalies, and distinguishing the meaningful ones from genuine noise requires deliberate attention

The result is that the warnings accumulate silently. The system drifts. And the first time the organisation formally engages with the problem is when it escalates past the point where informal workarounds can contain it.


Fixing the System, Not the Person

When the balance investigation was finally completed and the root cause identified, the right thing was done: the slab was repositioned, the physical cause was eliminated, and the balances were restored to reliable function.

But it is worth pausing on what the wrong response would have looked like — because it is a response that organisations reach for instinctively, especially in regulated environments where compliance and accountability are paramount.

The wrong response would have been to focus on the analysts.

Why didn’t they report this earlier? Why were they using non-designated equipment? Why wasn’t the issue escalated through proper channels? Were the workarounds documented? Were there procedural deviations that need to be raised?

These questions are not entirely without merit. But if they become the primary line of investigation, the organisation has directed its energy at the people who were compensating for the system’s failure rather than at the system that was failing. The analysts were responding rationally to an unreliable environment. They were doing their jobs as best they could with the tools available to them. Making them the subject of corrective action does not fix the slab. It does not prevent the next slab from shifting. It does teach people that when equipment fails and they find a workaround, the safest thing to do is say nothing — because speaking up about the problem has a cost, and staying quiet has none.

Strong quality cultures fix the system. They ask: what physical, procedural, or organisational condition made this error possible, and how do we change that condition? They look at the slab, the maintenance schedule, the inspection checklist, the procedure for equipment qualification, the mechanism by which analyst reports are triaged and acted upon.

Weak quality cultures fix the person. They retrain, they issue memos about reporting obligations, they add a signature line to a form. And then they wait for the next slab to shift.


What Listening Early Actually Looks Like

The phrase “listening early” is easy to write and hard to operationalise. Here is what it actually requires in practice:

Treating workarounds as data. Any time a person routinely does something differently from the documented procedure, that is information. Not necessarily evidence of wrongdoing — often evidence of a gap between how the system was designed and how it actually performs. Workaround mapping — systematically identifying where informal practices have diverged from formal ones — is one of the most valuable and underused tools in quality and safety management.
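To make that concrete, here is a minimal sketch of what workaround mapping could look like as a data exercise. It assumes the laboratory keeps weighing records that capture which balance an analyst actually used and which balance the procedure designates for that task; the field names, the example records, and the 20% threshold are all illustrative assumptions, not details from the incident described above.

    from collections import Counter

    # Hypothetical weighing-log records: (analyst, designated balance, balance actually used).
    # In practice these would come from the laboratory's own usage or LIMS records.
    weighing_log = [
        ("analyst_1", "BAL-A-01", "BAL-B-04"),
        ("analyst_2", "BAL-A-02", "BAL-B-04"),
        ("analyst_1", "BAL-A-01", "BAL-B-04"),
        ("analyst_3", "BAL-A-03", "BAL-A-03"),
    ]

    def map_workarounds(log, threshold=0.2):
        """Flag designated balances that are routinely bypassed.

        The threshold is an illustrative cut-off: a divergence rate above it
        suggests an informal practice worth investigating, rather than an
        occasional one-off substitution.
        """
        totals, divergences = Counter(), Counter()
        for analyst, designated, used in log:
            totals[designated] += 1
            if used != designated:
                divergences[designated] += 1
        return {
            balance: divergences[balance] / total
            for balance, total in totals.items()
            if divergences[balance] / total >= threshold
        }

    print(map_workarounds(weighing_log))
    # {'BAL-A-01': 1.0, 'BAL-A-02': 1.0}  -> these balances are being consistently avoided

The point is not the code. The point is that the divergence between documented and actual practice is usually already sitting in records the organisation holds, waiting for someone to look.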

Investigating weak signals before they become strong ones. A single complaint about balance instability might be noise. Two complaints from different analysts over two weeks is a pattern. Three analysts independently choosing to use a different piece of equipment is a signal so strong it should not require a formal deviation report to trigger an investigation. The systems and the culture need to be sensitive enough to detect patterns at that level.
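As a sketch of how that sensitivity could be made operational, the following groups reported concerns by instrument and flags any instrument that two or more analysts have independently reported within a rolling window. The report structure, the fourteen-day window, and the two-reporter threshold are assumptions chosen for illustration, not a prescribed rule.

    from collections import defaultdict
    from datetime import date, timedelta

    # Hypothetical concern reports: (date raised, instrument, analyst).
    reports = [
        (date(2024, 3, 1), "BAL-A-01", "analyst_1"),
        (date(2024, 3, 9), "BAL-A-01", "analyst_2"),
        (date(2024, 3, 10), "BAL-C-02", "analyst_3"),
    ]

    def weak_signal_alerts(reports, window_days=14, min_reporters=2):
        """Flag instruments reported by at least `min_reporters` distinct
        analysts within any `window_days` period."""
        by_instrument = defaultdict(list)
        for raised, instrument, analyst in reports:
            by_instrument[instrument].append((raised, analyst))
        alerts = set()
        for instrument, entries in by_instrument.items():
            entries.sort()
            for start, _ in entries:
                reporters = {a for d, a in entries
                             if start <= d <= start + timedelta(days=window_days)}
                if len(reporters) >= min_reporters:
                    alerts.add(instrument)
                    break
        return alerts

    print(weak_signal_alerts(reports))  # {'BAL-A-01'}

Whether the trigger lives in a spreadsheet, a quality system, or a standing agenda item matters far less than the fact that a trigger exists at all and is checked routinely.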

Separating signal from urgency. The fact that something has not yet caused a recordable deviation does not mean it is not serious. Urgency and importance are not the same thing. A signal can be important — pointing to a real systemic vulnerability — while producing no immediate urgency because people have found ways to work around it. Quality cultures learn to evaluate signals on their systemic implications, not only on the severity of the immediate outcome.

Making it safe and easy to raise concerns. If the cultural or procedural overhead of reporting a “minor inconvenience” is high, people won’t do it. Reports need to be straightforward to submit, clearly acknowledged when received, and visibly acted upon. The signal that most powerfully encourages people to keep raising concerns is: “the last time I raised something, something happened.”

Closing the loop. When a reported concern is investigated, the person who raised it should know the outcome. This is not just courtesy — it is a critical part of the feedback loop that sustains a reporting culture. If concerns disappear into a system and nothing is heard back, the rational conclusion is that raising concerns was a waste of time.


The Deeper Principle: Systems Determine Performance

At its heart, the balance story is a demonstration of a principle that the human factors and quality science communities have been articulating for decades, but that organisations still struggle to internalise fully:

When systems are strong, human performance improves. When systems are weak, even the best people struggle.

This is not a comfortable idea for organisations built on the assumption that the way to improve performance is to improve people — to train more, to select better, to hold more accountable. And training, selection, and accountability all matter. But they operate within a system. They cannot compensate indefinitely for system deficiencies.

An analyst working with an unreliable balance in a poorly maintained environment, under time pressure, with no clear channel for raising concerns, and no recent evidence that raising concerns changes anything — that analyst is going to make errors. Not because they are incompetent. Because the conditions around them are generating errors faster than any individual can catch and correct them.

Change the conditions, and performance changes with them. Fix the slab, calibrate the balance, build the reporting culture, act on the signals — and the analyst who was making errors in the broken environment becomes the reliable professional they always were in a functional one.

This is what it means to design for humans rather than expecting humans to compensate for design.


Conclusion: Small Signals, Large Consequences

The granite slab in a QC laboratory. The unstable balance readings. The analysts quietly walking to another zone.

None of this looked like a crisis. It looked like a minor inconvenience, a bit of operational friction, a small inefficiency absorbed into daily workarounds. It looked, in other words, exactly the way many serious failures look in their early stages, before they become serious.

The lesson is not that every minor complaint harbours a catastrophic risk. It is that the discipline of investigating weak signals, building cultures where raising concerns is normal and rewarded, and directing corrective energy at systems rather than individuals — that discipline pays dividends that are difficult to see until the moment you realise a drift towards failure has been arrested before it arrived.

Strong quality culture doesn’t mean having fewer problems. It means detecting them earlier, understanding them more deeply, and fixing the right things.

The slab shifts. The question is always: how quickly do you find out, and what do you do when you do?


Acknowledgement: The distinction between system-focused and person-focused error management draws on the foundational work of James Reason and the practical frameworks developed by John Evans in his Error Risk Reduction reference manual — essential reading for anyone serious about understanding and reducing human error in real working environments.