FMECA: THE RISK METHODOLOGY OT SECURITY FORGOT TO STEAL
From Reliability Engineering to Cyber Threat Modelling
Failure Mode, Effects and Criticality Analysis has helped keep aircraft and nuclear plants safe for more than sixty years. Applied to OT cybersecurity, it gives security teams a rigorous, auditable path from asset inventory to prioritised control deployment, without the hand-waving of generic risk matrices.
What FMECA Actually Is — And Why Cybersecurity Should Care
FMECA (Failure Mode, Effects and Criticality Analysis) is a bottom-up, inductive analytical technique that originated in US military reliability engineering in the late 1940s, was codified in MIL-STD-1629A in 1980, and has since become a cornerstone of safety-critical industries including aerospace (SAE ARP4761), nuclear, and automotive (AIAG-VDA FMEA), with IEC 60812 as the general-purpose international standard. The method works by systematically enumerating every way a component or function can fail, tracing the cascade of effects upward through the system, and then scoring each failure mode on severity of consequence and probability of occurrence. Combining those two scores gives the Criticality Number (CN); the FMEA variant multiplies in a third score for detection to give the Risk Priority Number (RPN). Either way, engineers get a ranked list of what to fix first.
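For reference, the quantitative criticality calculation in MIL-STD-1629A is usually written as below. This is a summary of the standard's formula rather than anything specific to the cyber adaptation, and the symbol glosses are paraphrased.

```latex
% Failure mode criticality number and item criticality number
C_m = \beta \, \alpha \, \lambda_p \, t
\qquad
C_r = \sum_{n=1}^{j} \left( \beta \, \alpha \, \lambda_p \, t \right)_n
% \beta     : conditional probability that the failure mode produces the stated effect
% \alpha    : failure mode ratio (share of the part's failures that occur in this mode)
% \lambda_p : part failure rate
% t         : operating time for the mission phase
% j         : number of the item's failure modes in a given severity classification
```

Cyber causes rarely come with a defensible failure rate to play the role of \lambda_p, which is one reason cyber adaptations tend to fall back on ordinal scores and an RPN-style product.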
The cybersecurity community has largely developed its own risk vocabulary — threat modelling with STRIDE, attack graphs, likelihood-consequence matrices, and bow-tie analysis — but these methods share a structural weakness: they tend to be asset-agnostic or scenario-driven rather than exhaustively component-centric. FMECA inverts this. It starts not with a threat actor but with a specific component, asks what it can do wrong, and only then asks what the operational consequence would be. For OT environments where the physical consequence of a cyber event — a valve failing open, a turbine overspeeding, a safety instrumented system being bypassed — is the actual risk, this grounding in physical failure modes is not a nice-to-have. It is the only intellectually honest starting point.
Adapting FMECA for cybersecurity requires one conceptual extension: the cause of a failure mode is no longer limited to mechanical wear, manufacturing defect, or environmental stress. It now includes malicious manipulation, unauthorised command injection, firmware corruption, and denial-of-service against a safety-critical communication path. Once that extension is accepted, the entire FMECA machinery — worksheets, severity scales, criticality matrices, corrective action tracking — applies directly. Security teams gain a methodology that is already understood by process safety engineers, already accepted by regulators, and already embeds the concept of criticality that most cyber risk frameworks conspicuously lack.
The critical insight: in OT cybersecurity, the failure mode is the risk unit — not the threat actor, not the vulnerability, not the CVE score. FMECA forces analysts to stay anchored to operational consequence, which is exactly where cyber risk in industrial environments must be measured.
Building a Cyber FMECA: Column by Column
A cyber FMECA worksheet extends the classical structure with columns specific to the threat environment. The starting point is the asset register — every PLC, RTU, HMI, historian, safety controller, and network device in scope. For each asset, the analyst enumerates functional failure modes: loss of control output, spurious trip, incorrect process variable reading, communication loss, firmware integrity failure. This is where OT-domain expertise is non-negotiable. A cybersecurity analyst who does not understand the process cannot enumerate failure modes correctly; equally, a process engineer who does not understand attack vectors cannot identify which failure modes have a plausible cyber cause. The worksheet makes this knowledge gap visible and forces its resolution.
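As a rough illustration of how the asset register seeds the worksheet, the sketch below expands each asset into one row per candidate failure mode. The column names, the `FAILURE_MODES` catalogue, and the asset identifiers are all hypothetical; a real catalogue would come out of the joint process and security workshop.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class WorksheetRow:
    asset_id: str
    asset_type: str
    failure_mode: str
    cyber_cause: str = "TBD"    # filled in by the security analyst
    system_effect: str = "TBD"  # filled in by the process engineer

# Hypothetical failure-mode catalogue per asset type (illustrative only).
FAILURE_MODES = {
    "PLC": ["loss of control output",
            "incorrect process variable reading",
            "firmware integrity failure"],
    "SIS": ["spurious trip",
            "communication loss",
            "firmware integrity failure"],
}

def seed_rows(asset_register):
    """Create one blank worksheet row per (asset, failure mode) pair."""
    rows = []
    for asset_id, asset_type in asset_register:
        for mode in FAILURE_MODES.get(asset_type, []):
            rows.append(WorksheetRow(asset_id, asset_type, mode))
    return rows

register = [("PLC-101", "PLC"), ("SIS-01", "SIS"), ("HIST-01", "Historian")]
rows = seed_rows(register)

# Asset types with no catalogue entry are a knowledge gap to resolve,
# not something to drop silently.
unmapped = sorted({t for _, t in register if t not in FAILURE_MODES})
```

Leaving `cyber_cause` and `system_effect` as explicit placeholders is deliberate in this sketch: the blanks show which discipline still owes input on each row.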
For each failure mode, the analyst assigns a Severity score (S) on a 1–10 scale anchored to operational consequence: at the low end, a nuisance alarm with no process impact; at the high end, a safety system bypass leading to potential loss of life or catastrophic equipment damage. Severity is independent of how likely the failure is — this is a discipline that generic risk matrices routinely violate by conflating consequence with probability, producing scores that can be manipulated by adjusting likelihood estimates. FMECA keeps them separate until the final criticality calculation.
The Occurrence score (O) represents the probability that a specific cyber cause produces the failure mode during the assessment period. In classical FMECA, occurrence is informed by field failure rate data. Cyber FMECA has no equivalent failure-rate database, so the analyst draws on threat intelligence, the techniques and real-world procedure examples catalogued in MITRE ATT&CK for ICS, the vulnerability density of the asset class, and the asset's network exposure. A device speaking Modbus RTU on an air-gapped serial link has a different occurrence profile than a Modbus TCP gateway reachable from the corporate DMZ, and the worksheet must reflect that difference explicitly rather than burying it in a qualitative band.
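One way to make the exposure difference explicit is a small, pre-agreed lookup. The sketch below is only an assumption about what such a rubric might look like: the exposure categories, the base scores, and the +2 adjustment for a known exploited vulnerability are invented for illustration and would need calibrating against local threat intelligence.

```python
# Hypothetical occurrence rubric: base score by network exposure.
EXPOSURE_BASE = {
    "isolated_serial": 2,      # e.g. Modbus RTU on an air-gapped serial link
    "ot_segment_only": 4,
    "reachable_from_dmz": 7,   # e.g. Modbus TCP gateway visible from the DMZ
    "internet_exposed": 9,
}

def occurrence_score(exposure: str, known_exploited_vuln: bool = False) -> int:
    """Return an Occurrence score on the 1-10 scale (illustrative rubric)."""
    score = EXPOSURE_BASE[exposure]
    if known_exploited_vuln:
        score += 2  # assumed uplift, not taken from any standard
    return max(1, min(10, score))

serial_device = occurrence_score("isolated_serial")
dmz_gateway = occurrence_score("reachable_from_dmz", known_exploited_vuln=True)
```

The value of writing the rubric down is less the numbers themselves than the fact that two analysts scoring the same asset will reach the same O.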
The third scoring dimension, Detectability (D), measures how likely the organisation is to detect the failure mode before or immediately after it causes harm. The scale is inverted: a high score means poor detection. A 9 or 10 applies to failure modes that produce no alarm, no process anomaly visible to the operator, and no network telemetry that a SOC analyst would flag. This column is where the cyber FMECA often delivers its most uncomfortable findings: safety instrumented systems running on legacy fieldbus protocols that produce no cybersecurity telemetry whatsoever, meaning a compromised SIS output card could be inhibiting a trip indefinitely, with nothing for the control room operator to notice until the process actually demands that trip. The Risk Priority Number (RPN = S × O × D) carries this gap into the ranking and forces it into the risk register where it can be acted upon.
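The arithmetic is simple enough to sketch in a few lines. The example below computes RPN = S × O × D for three failure modes and ranks them; the assets and scores are made up for illustration, and the class name `ScoredMode` is hypothetical.

```python
from dataclasses import dataclass

@dataclass
class ScoredMode:
    asset: str
    failure_mode: str
    severity: int       # S, 1-10
    occurrence: int     # O, 1-10
    detectability: int  # D, 1-10 (10 = effectively undetectable)

    def __post_init__(self):
        # Reject scores outside the agreed 1-10 scale.
        for name in ("severity", "occurrence", "detectability"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be 1-10, got {value}")

    @property
    def rpn(self) -> int:
        return self.severity * self.occurrence * self.detectability

modes = [
    ScoredMode("PLC-101", "loss of control output", 7, 5, 3),
    ScoredMode("SIS-01", "output card manipulated, no alarm", 10, 3, 9),
    ScoredMode("HMI-02", "nuisance alarm", 2, 6, 2),
]

# Highest RPN first: this is the initial criticality ranking.
ranked = sorted(modes, key=lambda m: m.rpn, reverse=True)
for m in ranked:
    print(f"{m.rpn:4d}  {m.asset:8s} {m.failure_mode}")
```

Because RPN is a product, a severity-10 mode with low occurrence can rank below a moderate one. Some teams therefore review the highest-severity rows separately regardless of their RPN.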
Why Cyber FMECA Is Hard to Execute Well
FMECA's rigour is also its friction. The method demands cross-discipline collaboration, high-quality asset data, and scoring discipline that most organisations underestimate until they are already mid-workshop. These are not reasons to abandon the approach — they are reasons to understand what investment it actually requires.
Asset Inventory Completeness
FMECA breaks down entirely if the asset register is incomplete. In most brownfield OT environments, a significant fraction of field devices (particularly legacy serial-connected sensors, relay logic controllers, and third-party vendor skids) are either absent from the CMDB or documented at insufficient granularity to enumerate failure modes. Beginning a cyber FMECA without first completing a passive network discovery and physical walkdown produces a false sense of completeness: the worksheets look thorough but have structural gaps corresponding to the assets nobody knew existed. Stuxnet and TRITON are instructive here: in both cases the attack path ran through engineering workstations and controllers that defenders had treated as sitting outside the realistic threat boundary.
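A basic completeness check before the workshop is a set comparison between the register and what discovery actually observed. The sketch below assumes both sources can be reduced to a shared asset identifier, which in practice is often the hard part; the identifiers shown are hypothetical.

```python
def inventory_gaps(register_ids, discovered_ids):
    """Compare the asset register against discovery/walkdown results."""
    register, discovered = set(register_ids), set(discovered_ids)
    return {
        # Seen on the network or on the plant floor, but not in the register.
        "unregistered": sorted(discovered - register),
        # In the register, but never observed: retired, offline, or mislabelled.
        "unobserved": sorted(register - discovered),
    }

cmdb = ["PLC-101", "PLC-102", "HMI-02", "SIS-01"]
walkdown = ["PLC-101", "HMI-02", "SIS-01", "SKID-7-RTU"]
gaps = inventory_gaps(cmdb, walkdown)
```

Both lists matter: unregistered devices have no worksheet rows at all, while unobserved ones may be consuming workshop time on assets that no longer exist.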
Cross-Domain Scoring Calibration
Severity and Occurrence scores only have analytical value if they are calibrated consistently across the team. OT process engineers and cybersecurity analysts bring fundamentally different mental models to severity scoring: a process engineer may rate a spurious ESD trip as low severity because the plant recovers safely, while a security analyst rates it high because it achieves the attacker's disruption objective. Neither is wrong — but without a pre-agreed scoring rubric that explicitly separates safety consequence, operational impact, and attacker goal achievement, FMECA workshops produce inconsistent scores that make the criticality ranking meaningless.
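A rubric that separates the three dimensions also makes disagreement measurable. The sketch below records each rater's score per dimension for one failure mode and flags any dimension where the spread exceeds a threshold. The dimension names, the example scores, and the threshold of 2 are assumptions for illustration.

```python
def calibration_flags(scores_by_rater, max_spread=2):
    """Return {dimension: spread} for dimensions where raters disagree."""
    by_dimension = {}
    for scores in scores_by_rater.values():
        for dimension, value in scores.items():
            by_dimension.setdefault(dimension, []).append(value)
    return {
        dimension: max(values) - min(values)
        for dimension, values in by_dimension.items()
        if max(values) - min(values) > max_spread
    }

# Hypothetical scores for a spurious ESD trip.
spurious_esd_trip = {
    "process_engineer": {"safety": 2, "operational": 6, "attacker_goal": 3},
    "security_analyst": {"safety": 2, "operational": 7, "attacker_goal": 9},
}
flags = calibration_flags(spurious_esd_trip)
```

Here the two raters agree on safety and operational impact and diverge only on attacker goal achievement, which tells the facilitator exactly which part of the rubric needs discussion before the row is finalised.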
Maintenance Burden Over Time
A cyber FMECA is a point-in-time analysis. Process modifications, firmware updates, network topology changes, and new threat intelligence all have the potential to invalidate existing scores without any visible signal that a review is needed. Organisations that treat the FMECA as a compliance deliverable rather than a living document find that within eighteen months the worksheet reflects an OT environment that no longer exists. Embedding FMECA review triggers into the management of change process — so that any modification to a scored asset automatically flags its FMECA entries for review — is structurally important but adds process overhead that operations teams often resist.
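The management-of-change hook can be very small. The sketch below marks every worksheet entry tied to a changed asset as needing review; the `Entry` structure and the identifiers are hypothetical, and a real integration would hang off whatever MoC or ticketing system the site already uses.

```python
from dataclasses import dataclass

@dataclass
class Entry:
    entry_id: str
    asset_id: str
    needs_review: bool = False

def flag_for_review(entries, changed_asset_ids):
    """Mark entries whose asset appears in a change record; return their IDs."""
    changed = set(changed_asset_ids)
    flagged = []
    for entry in entries:
        if entry.asset_id in changed:
            entry.needs_review = True
            flagged.append(entry.entry_id)
    return flagged

worksheet = [
    Entry("FM-001", "PLC-101"),
    Entry("FM-002", "PLC-101"),
    Entry("FM-003", "SIS-01"),
]
# Example change record: firmware update on PLC-101.
flagged = flag_for_review(worksheet, ["PLC-101"])
```

Flagging rather than rescoring automatically keeps the decision with the analyst, at the cost of a review queue that someone has to work through.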
Cyber FMECA vs. Common OT Risk Assessment Approaches
| Method | Starting Point | Consequence Anchor | Cyber-Cause Explicit | Produces Ranked Action List |
|---|---|---|---|---|
| Cyber FMECA | Component failure mode | Physical process consequence | Yes | Yes — by RPN |
| STRIDE Threat Modelling | Data flow / trust boundary | Information security property | Yes | No — requires separate prioritisation |
| Bow-Tie Analysis | Single top event | Safety or operational event | Partial | No — barrier-focused not ranked |
| Generic Risk Matrix | Scenario or asset | Variable — often financial | Partial | Yes — but coarse-grained |
| ATT&CK for ICS Mapping | Adversary technique | Technique coverage gap | Yes | No — coverage not consequence-ranked |
Implementing Cyber FMECA in an OT Environment
Prerequisite: A passive OT network discovery scan and physical asset walkdown should be completed before the FMECA workshop begins. Starting without a verified asset register wastes workshop time and produces an incomplete analysis.
Scope and Asset Baseline
Establish the system boundary, complete asset inventory, and build the functional block diagram that will serve as the FMECA scope document. Agree on scoring rubrics before any worksheet entry is made.
Worksheet Population and Scoring
Conduct structured FMECA workshops for each asset class. Enumerate failure modes, map cyber causes using ATT&CK for ICS, score S, O, and D, and calculate RPN. Produce the initial criticality ranking.
Action Assignment and Process Integration
Translate the criticality-ranked register into a time-bound corrective action plan. Integrate FMECA review triggers into the management of change process to ensure the worksheet remains current.
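Turning the ranking into a time-bound plan usually means agreeing RPN bands and a deadline for each. The sketch below shows the mechanics only: the band thresholds and day counts are invented placeholders, not recommendations, and should be set by the site's own risk acceptance criteria.

```python
from datetime import date, timedelta

# Hypothetical bands: (minimum RPN, days allowed), checked highest first.
ACTION_BANDS = [(200, 30), (100, 90), (1, 180)]

def due_date(rpn: int, start: date) -> date:
    """Map an RPN to a corrective-action deadline using the agreed bands."""
    for minimum_rpn, days_allowed in ACTION_BANDS:
        if rpn >= minimum_rpn:
            return start + timedelta(days=days_allowed)
    raise ValueError(f"RPN must be at least 1, got {rpn}")

start = date(2025, 1, 1)
plan = {rpn: due_date(rpn, start) for rpn in (270, 105, 24)}
```

Keeping the bands in one table makes them easy to audit and easy to change when the organisation's risk appetite does.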
Questions Worth Sitting With
FMECA does not simplify OT cybersecurity risk assessment — it makes the complexity explicit and auditable. Before adopting or dismissing the method, these questions deserve honest answers from your team.
If your current risk assessment methodology was challenged in a post-incident regulatory review, could you demonstrate that every high-consequence failure mode had been identified and scored before the incident occurred?
Does your security team have sufficient OT process knowledge to enumerate failure modes correctly, or is the analysis being done by people who understand cyber threats but not the physical process they are protecting?
How would your risk register change if detectability — the likelihood of catching a compromise before it causes harm — were scored explicitly for every critical asset rather than assumed to be adequate?
Is your asset inventory complete enough to begin a bottom-up analysis, or would starting FMECA today expose inventory gaps that are themselves a critical risk?
At what point does the maintenance overhead of a living FMECA document exceed the analytical value it provides — and what process changes would reduce that overhead rather than abandoning the rigour?