Root cause analysis
Root cause analysis (RCA) is a method of problem solving used for identifying the root causes of faults or problems. A factor is considered a root cause if removal thereof from the problem-fault-sequence prevents the final undesirable outcome from recurring; whereas a causal factor is one that affects an event's outcome, but is not a root cause. Though removing a causal factor can benefit an outcome, it does not prevent its recurrence with certainty.
- Define and describe the event or problem properly ('five whys' technique).
- Establish a timeline from the normal situation until the final crisis or failure.
- Distinguish between root causes and causal factors.
- Once implemented (and with constant execution), RCA is transformed into a method of problem prediction.
RCA is applied to methodically identify and correct the root causes of events, rather than to simply address the symptomatic result. Focusing correction on root causes has the goal of entirely preventing problem recurrence. Conversely, RCFA (Root Cause Failure Analysis) recognizes that complete prevention of recurrence by one corrective action is not always possible.
RCA is typically used as a reactive method of identifying the causes of events, revealing problems and solving them. Analysis is done after an event has occurred. Insights from RCA also make it potentially useful as a preemptive method, in which case it can be used to forecast probable events before they occur. Although one follows the other, RCA is a completely separate process from incident management.
For example, imagine a fictional group of students who received poor test scores. Initial investigation verified that students taking tests in the final period of the school day got lower scores. Further investigation revealed that, late in the day, the students lacked the ability to focus. Still further investigation revealed that the reason for the lack of focus was hunger. So the root cause of the poor test scores was hunger, which was remedied by moving the testing time to soon after lunch.
As another example, imagine an investigation into a machine that stopped because it overloaded and the fuse blew. Investigation shows that the machine overloaded because it had a bearing that wasn't being sufficiently lubricated. The investigation proceeds further and finds that the automatic lubrication mechanism had a pump which was not pumping sufficiently, hence the lack of lubrication. Investigation of the pump shows that it has a worn shaft. Investigation of why the shaft was worn discovers that there isn't an adequate mechanism to prevent metal scrap getting into the pump. This enabled scrap to get into the pump, and damage it. The root cause of the problem is therefore that metal scrap can contaminate the lubrication system. Fixing this problem ought to prevent the whole sequence of events recurring. Compare this with an investigation that does not find the root cause: replacing the fuse, the bearing, or the lubrication pump will probably allow the machine to go back into operation for a while. But there is a risk that the problem will simply recur, until the root cause is dealt with.
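The repeated "why?" questioning in the machine example can be sketched as a walk along a chain of recorded causes. The following is a minimal illustration only: the `CAUSED_BY` mapping and the `trace_to_root` function are hypothetical names introduced here, and the sketch assumes that each effect has exactly one recorded direct cause, which real investigations rarely guarantee.

```python
# Hypothetical encoding of the machine example: each effect maps to its direct cause.
CAUSED_BY = {
    "machine stopped": "fuse blew from overload",
    "fuse blew from overload": "bearing insufficiently lubricated",
    "bearing insufficiently lubricated": "lubrication pump not pumping sufficiently",
    "lubrication pump not pumping sufficiently": "pump shaft worn",
    "pump shaft worn": "metal scrap can enter the pump",
}


def trace_to_root(symptom, caused_by):
    """Follow 'why?' links from a symptom until no further cause is recorded."""
    chain = [symptom]
    seen = {symptom}
    while chain[-1] in caused_by:
        cause = caused_by[chain[-1]]
        if cause in seen:  # guard against a circular cause map
            raise ValueError(f"circular causality at {cause!r}")
        chain.append(cause)
        seen.add(cause)
    return chain


chain = trace_to_root("machine stopped", CAUSED_BY)
root_cause = chain[-1]  # the deepest cause recorded in this mapping
```

The walk stops where the recorded data stops, so the "root" it reports is only as deep as the investigation that produced the mapping.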
Note that "root causes" are likely to have many levels, and the analysis terminates at a level that is "root" only to the eyes of the current investigator. Looking at the second example, a deeper level root cause is that the maintenance procedures at the plant included periodic inspection of the lubrication subsystem every two years, while the current lubrication subsystem vendor's product specified a 6-month period. Switching vendors may have been due to management's desire to save money, and a failure to consult with cognizant engineering staff on the implication of the change, and so on.
Thus, while the "root cause" shown above may have prevented the quoted recurrence, it would not have prevented other – perhaps more severe – future failures.
Quite often, "engineering-type" root cause analysis stops at the engineering level, and fails to go down to deeper, organizational roots.
Root cause analysis is used in ITIL, a set of detailed practices for IT service management, where the analysis is performed to resolve a recurring incident by the problem manager. Within computer security incident management, root cause analysis may be called security investigation and analysis.
Rather than one sharply defined methodology, RCA comprises many different tools, processes, and philosophies. However, several very-broadly defined approaches or "schools" can be identified by their basic approach or field of origin: safety-based, production-based, assembly-based, process-based, failure-based, and systems-based.
- Safety-based RCA arose from the fields of accident analysis and occupational safety and health.
- Production-based RCA has roots in the field of quality control for industrial manufacturing.
- Process-based RCA, a follow-on to production-based RCA, broadens the scope of RCA to include business processes.
- Failure-based RCA originates in the practice of failure analysis as employed in engineering and maintenance.
- Systems-based RCA has emerged as an amalgam of the preceding schools, incorporating elements from other fields such as change management, risk management and systems analysis.
Despite the different approaches among the various schools of root cause analysis, all share some common principles. Several general processes for performing RCA can also be defined.
- The primary aim of root cause analysis is: to identify the factors that resulted in the nature, the magnitude, the location, and the timing of the harmful outcomes (consequences) of one or more past events; to determine what behaviors, actions, inactions, or conditions need to be changed; to prevent recurrence of similar harmful outcomes; and to identify lessons that may promote the achievement of better consequences. ("Success" is defined as the near-certain prevention of recurrence).
- To be effective, root cause analysis must be performed systematically, usually as part of an investigation, with conclusions and root causes that are identified backed up by documented evidence. A team effort is typically required.
- There may be more than one root cause for an event or a problem; the difficult part is sustaining the persistence and effort required to determine them all.
- The purpose of identifying all solutions to a problem is to prevent recurrence at lowest cost in the simplest way. If there are alternatives that are equally effective, then the simplest or lowest cost approach is preferred.
- The root causes identified will depend on the way in which the problem or event is defined. Effective problem statements and event descriptions (as failures, for example) are helpful and usually required to ensure the execution of appropriate analyses.
- One logical way to trace root causes is to use hierarchical clustering data-mining solutions (such as graph-theory-based data mining). A root cause is defined in that context as "the conditions that enable one or more causes". Root causes can then be deduced from the upper groups of those groups that include a specific cause.
- To be effective, the analysis should establish a sequence of events or timeline for understanding the relationships between contributory (causal) factors, root cause(s) and the defined problem or event to be prevented.
- Root cause analysis can help transform a reactive culture (one that reacts to problems) into a forward-looking culture (one that solves problems before they occur or escalate). More importantly, RCA reduces the frequency of problems occurring over time within the environment where the process is used.
- Root cause analysis as a force for change is a threat to many cultures and environments. Threats to cultures are often met with resistance. Other forms of management support may be required to achieve effectiveness and success with root cause analysis. For example, a "non-punitive" policy toward problem identifiers may be required.
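The hierarchical-clustering principle above can be illustrated with a small sketch. It does not perform any clustering itself: it assumes a group hierarchy has already been produced, and only shows the lookup of upper groups shared by the groups that exhibit a specific cause. The `PARENT` mapping, the group names, and both function names are hypothetical.

```python
# Hypothetical cluster hierarchy: each group maps to its parent (upper) group.
PARENT = {
    "pump A failures": "lubrication faults",
    "pump B failures": "lubrication faults",
    "lubrication faults": "maintenance gaps",
    "fuse failures": "electrical faults",
    "electrical faults": "maintenance gaps",
}


def upper_groups(group, parent):
    """Return the ancestors of a group, nearest first."""
    ancestors = []
    while group in parent:
        group = parent[group]
        ancestors.append(group)
    return ancestors


def shared_upper_groups(groups, parent):
    """Upper groups that contain every given group, nearest first."""
    groups = list(groups)
    if not groups:
        return []
    first = upper_groups(groups[0], parent)
    others = [set(upper_groups(g, parent)) for g in groups[1:]]
    return [a for a in first if all(a in s for s in others)]


# Groups that exhibit the specific cause under investigation (assumed input).
candidates = shared_upper_groups(["pump A failures", "pump B failures"], PARENT)
```

The shared upper groups are only candidate "enabling conditions"; they would still need to be checked for consistent characteristics and validated by experts.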
General process for performing and documenting an RCA-based corrective action
RCA (in steps 3, 4 and 5) forms the most critical part of successful corrective action, because it directs the corrective action at the true root cause of the problem. Knowing the root cause is secondary to the goal of prevention, but it is not possible to determine an absolutely effective corrective action for the defined problem without knowing the root cause.
- Define the problem or describe the event to prevent in the future. Include the qualitative and quantitative attributes (properties) of the undesirable outcomes. Usually this includes specifying the natures, the magnitudes, the locations, and the timing of events. In some cases, "lowering the risks of recurrence" may be a reasonable target. For example, "lowering the risks" of future automobile accidents is certainly a more economically attainable goal than "preventing all" future automobile accidents.
- Gather data and evidence, classifying it along a timeline of events to the final failure or crisis. For every behavior, condition, action and inaction, specify in the "timeline" what should have been done when it differs from what was done.
- In data-mining hierarchical clustering models, use the clustering groups instead of classifying: (a) pick the groups that exhibit the specific cause; (b) find their upper groups; (c) find group characteristics that are consistent; (d) check with experts and validate.
- Ask "why" and identify the causes associated with each sequential step towards the defined problem or event. "Why" is taken to mean "What were the factors that directly resulted in the effect?"
- Classify causes into two categories: causal factors, which relate to an event in the sequence; and root causes, which, when eliminated, interrupt that step of the sequence chain.
- Identify all other harmful factors that have equal or better claim to be called "root causes". If there are multiple root causes, which is often the case, reveal those clearly for later optimum selection.
- Identify corrective action(s) that will, with certainty, prevent recurrence of each harmful effect and related outcomes or factors. Check that each corrective action would, if implemented before the event, have reduced or prevented specific harmful effects.
- Identify solutions that, when effective and with consensus agreement of the group: prevent recurrence with reasonable certainty; are within the institution's control; meet its goals and objectives; and do not cause or introduce other new, unforeseen problems.
- Implement the recommended root cause correction(s).
- Ensure effectiveness by observing the implemented solutions in operation.
- Identify other possibly useful methodologies for problem solving and problem avoidance.
- Identify and address the other instances of each harmful outcome and harmful factor.
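The data-gathering step above asks for a timeline that records, for each entry, what should have been done where it differs from what was done. A minimal sketch of such a record follows. The `TimelineEntry` class, its field names, and the `deviations` function are hypothetical, and the sample entries are illustrative, loosely based on the machine example, with assumed ordering.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TimelineEntry:
    order: int                                        # position in the sequence of events
    what_was_done: str
    what_should_have_been_done: Optional[str] = None  # None: no deviation recorded


def deviations(timeline):
    """Return entries, in sequence order, where practice differed from expectation."""
    return [
        e for e in sorted(timeline, key=lambda e: e.order)
        if e.what_should_have_been_done is not None
        and e.what_should_have_been_done != e.what_was_done
    ]


# Illustrative entries (assumed details), deliberately listed out of order.
timeline = [
    TimelineEntry(3, "fuse blew and machine stopped"),
    TimelineEntry(1, "lubrication subsystem inspected every two years",
                  "inspect every 6 months, per the vendor specification"),
    TimelineEntry(2, "pump ran without protection against metal scrap",
                  "fit a mechanism that keeps scrap out of the pump"),
]

for entry in deviations(timeline):
    print(f"{entry.order}. did: {entry.what_was_done} | "
          f"expected: {entry.what_should_have_been_done}")
```

Each flagged deviation is a starting point for asking "why", not a conclusion: whether it is a causal factor or a root cause still has to be established in the later steps.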