Safety engineering is an engineering discipline which assures that engineered systems provide acceptable levels of safety. It is strongly related to systems engineering, industrial engineering and the subset system safety engineering. Safety engineering assures that a life-critical system behaves as needed, even when components fail.
The primary goal of safety engineering is to manage risk, eliminating or reducing it to acceptable levels. Risk is the combination of the probability of a failure event and the severity of its consequences. For instance, a particular failure may result in fatalities, injuries, property damage, or nothing more than annoyance. It may be a frequent, occasional, or rare occurrence. The acceptability of the failure depends on the combination of the two. Probability is often more difficult to predict than severity because of the many factors that could lead to a failure, such as mechanical failure, environmental effects, and operator error.
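The combination of severity and probability described above is often tabulated as a risk matrix. The following is a minimal sketch in Python; the numeric scales, the multiplication used to combine them, and the acceptance threshold are all illustrative assumptions, not values taken from any standard.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Hypothetical ordinal scale based on the consequences named in the text.
    ANNOYANCE = 1
    PROPERTY_DAMAGE = 2
    INJURY = 3
    FATALITY = 4


class Likelihood(IntEnum):
    # Hypothetical ordinal scale based on the frequencies named in the text.
    RARE = 1
    OCCASIONAL = 2
    FREQUENT = 3


def risk_score(severity: Severity, likelihood: Likelihood) -> int:
    """Combine severity and likelihood into one ordinal score (assumed: product)."""
    return int(severity) * int(likelihood)


def is_acceptable(severity: Severity, likelihood: Likelihood, threshold: int = 3) -> bool:
    """Accept the risk if its score does not exceed an assumed threshold."""
    return risk_score(severity, likelihood) <= threshold


# A frequent annoyance is tolerated here, a rare fatality is not.
print(is_acceptable(Severity.ANNOYANCE, Likelihood.FREQUENT))
print(is_acceptable(Severity.FATALITY, Likelihood.RARE))
```

In practice the matrix cells are usually assigned by the governing standard or the certifying authority rather than computed from a product.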
Safety engineering attempts to reduce the frequency of failures, and ensure that when failures do occur, the consequences are not life-threatening. For example, bridges are designed to carry loads well in excess of the heaviest truck likely to use them. This reduces the likelihood of being overloaded. Most bridges are designed with redundant load paths, so that if any one structural member fails, the structure will remain standing. This reduces the severity if the bridge is overloaded.
Ideally, safety engineering starts during the early design of a system. Safety engineers consider what undesirable events can occur under what conditions, and project the related accident risk. They may then propose or require safety mitigation requirements in specifications at the start of development, or changes to existing CAD designs or in-service products, to make a system safer. This may be done by eliminating hazards entirely or by lowering accident risk. Far too often, rather than actually influencing the design, safety engineers are assigned to prove that an existing, completed design is safe. If the engineer discovers significant safety problems late in the development process, correcting them can be very expensive. This type of error can waste large sums of money and, more importantly, cost human lives and cause environmental damage.
The exception to this conventional approach is the way some large government agencies approach safety engineering from a more proactive and proven process perspective, known as "system safety". The system safety philosophy is applied to complex and critical systems, such as commercial airliners, complex weapon systems, spacecraft, rail and transportation systems, air traffic control systems and other complex and safety-critical industrial systems. System safety methods and techniques aim to prevent, eliminate and control hazards and risks through design influence, achieved by collaboration among key engineering disciplines and product teams. Software safety is a fast-growing field, since the functionality of modern systems is increasingly put under the control of software. The whole concept of system safety and software safety, as a subset of systems engineering, is to influence safety-critical system designs by conducting several types of hazard analyses to identify hazards, validate them, verify the design, and, if needed, specify new design safety features and procedures that mitigate risk to acceptable levels before the system is certified.
Additionally, failure mitigation can go beyond design recommendations, particularly in the area of maintenance. There is an entire realm of safety and reliability engineering known as Reliability Centered Maintenance (RCM), a discipline that results directly from analyzing potential failures within a system and determining maintenance actions that can mitigate the risk of failure. This methodology is used extensively on aircraft and involves understanding the failure modes of the serviceable replaceable assemblies as well as the means to detect or predict an impending failure. Every automobile owner is familiar with this concept from taking a car in to have the oil changed or the brakes checked. Even filling up one's car with fuel is a simple example of a failure mode (failure due to fuel exhaustion), a means of detection (fuel gauge), and a maintenance action (filling the car's fuel tank). (Using the car's odometer as a second way to gauge remaining fuel illustrates the concept of "redundant sensors".)
For large-scale complex systems, hundreds if not thousands of maintenance actions can result from the failure analysis. These maintenance actions are based on conditions (e.g., a gauge reading or a leaky valve), hard conditions (e.g., a component is known to fail after 100 hrs of operation with 95% certainty), or require inspection to determine the maintenance action (e.g., metal fatigue). The RCM concept then analyzes each individual maintenance item for its risk contribution to safety, mission, operational readiness, or cost to repair if a failure does occur. All the maintenance actions are then bundled into maintenance intervals so that maintenance does not occur around the clock but at regular intervals. This bundling process introduces further complexity, as it might stretch some maintenance cycles, thereby increasing risk, but shorten others, thereby potentially reducing risk. The end result is a comprehensive maintenance schedule, purpose-built to reduce operational risk and ensure acceptable levels of operational readiness and availability.
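The bundling step can be sketched as follows. This is a simplified illustration, not an RCM procedure from any standard: the task names, the ideal intervals and the package intervals are hypothetical, and the sketch takes the conservative option of only shortening cycles (each task goes into the longest package that does not exceed its ideal interval), whereas real schedules may also stretch some cycles.

```python
def bundle_tasks(ideal_intervals, package_intervals):
    """Assign each task to the longest package interval not exceeding its ideal interval.

    ideal_intervals: dict mapping task name -> ideal interval in operating hours.
    package_intervals: iterable of allowed maintenance package intervals in hours.
    Returns a dict mapping package interval -> list of task names.
    """
    packages = sorted(package_intervals)
    schedule = {p: [] for p in packages}
    for task, ideal in ideal_intervals.items():
        eligible = [p for p in packages if p <= ideal]
        if not eligible:
            # No package is frequent enough; the task needs its own interval.
            raise ValueError(f"no package short enough for task {task!r}")
        schedule[eligible[-1]].append(task)
    return schedule


# Hypothetical tasks and packages, for illustration only.
tasks = {"oil change": 120, "brake inspection": 600, "fatigue inspection": 1000}
schedule = bundle_tasks(tasks, [100, 500, 1000])
for interval, names in schedule.items():
    print(interval, names)
```

Shortening a cycle never increases the risk from that task but raises maintenance cost, which is the trade-off the bundling process has to balance.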
Analysis techniques can be split into two categories: qualitative and quantitative methods. Both approaches share the goal of finding causal dependencies between a hazard at the system level and failures of individual components. Qualitative approaches focus on the question "What must go wrong, such that a system hazard may occur?", while quantitative methods aim at providing estimates of probabilities, rates and/or severity of consequences.
Traditionally, safety analysis techniques rely solely on skill and expertise of the safety engineer. In the last decade model-based approaches have become prominent. In contrast to traditional methods, model-based techniques try to derive relationships between causes and consequences from some sort of model of the system.
Traditional methods for safety analysis
The two most common fault modeling techniques are called failure mode and effects analysis and fault tree analysis. These techniques are just ways of finding problems and of making plans to cope with failures, as in probabilistic risk assessment. One of the earliest complete studies using this technique on a commercial nuclear plant was the WASH-1400 study, also known as the Reactor Safety Study or the Rasmussen Report.
Failure modes and effects analysis
Failure Mode and Effects Analysis (FMEA) is a bottom-up, inductive analytical method which may be performed at either the functional or piece-part level. For functional FMEA, failure modes are identified for each function in a system or equipment item, usually with the help of a functional block diagram. For piece-part FMEA, failure modes are identified for each piece-part component (such as a valve, connector, resistor, or diode). The effects of the failure mode are described, and assigned a probability based on the failure rate and failure mode ratio of the function or component. This quantification is difficult for software: a bug either exists or does not, and the failure models used for hardware components do not apply. Temperature, age and manufacturing variability affect a resistor; they do not affect software.
Failure modes with identical effects can be combined and summarized in a Failure Mode Effects Summary. When combined with criticality analysis, FMEA is known as Failure Mode, Effects, and Criticality Analysis or FMECA, pronounced "fuh-MEE-kuh".
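As a rough illustration of the quantitative side of FMEA, the sketch below computes a rate for each failure mode as the component failure rate multiplied by the failure mode ratio, then sums the rates of modes with identical effects into a simple effects summary. The components, rates and ratios are invented for the example, and the class and function names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    component: str
    mode: str
    effect: str
    failure_rate: float  # failures per hour for the whole component
    mode_ratio: float    # fraction of the component's failures taking this mode

    @property
    def mode_rate(self) -> float:
        """Rate of this specific failure mode, in failures per hour."""
        return self.failure_rate * self.mode_ratio


def effects_summary(modes):
    """Sum the mode rates of all failure modes that share the same effect."""
    totals = defaultdict(float)
    for m in modes:
        totals[m.effect] += m.mode_rate
    return dict(totals)


# Illustrative data only; not taken from any reliability database.
modes = [
    FailureMode("valve V1", "stuck closed", "loss of flow", 2e-6, 0.5),
    FailureMode("valve V1", "external leak", "fluid release", 2e-6, 0.5),
    FailureMode("pump P1", "fails to run", "loss of flow", 4e-6, 0.75),
]
summary = effects_summary(modes)
for effect, rate in summary.items():
    print(f"{effect}: {rate:.2e} per hour")
```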
Fault tree analysis
Fault tree analysis (FTA) is a top-down, deductive analytical method. In FTA, initiating primary events such as component failures, human errors, and external events are traced through Boolean logic gates to an undesired top event such as an aircraft crash or nuclear reactor core melt. The intent is to identify ways to make top events less probable, and verify that safety goals have been achieved.
FTA may be qualitative or quantitative. When failure and event probabilities are unknown, qualitative fault trees may be analyzed for minimal cut sets. For example, if any minimal cut set contains a single base event, then the top event may be caused by a single failure. Quantitative FTA is used to compute top event probability, and usually requires computer software such as CAFTA from the Electric Power Research Institute or SAPHIRE from the Idaho National Laboratory.
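Both kinds of analysis can be illustrated on a toy fault tree. The sketch below is a minimal Python example, not how CAFTA or SAPHIRE work internally: it derives minimal cut sets from AND/OR gates, flags single-event cut sets, and computes the top event probability by brute-force enumeration, assuming independent basic events. The tree, event names and probabilities are hypothetical.

```python
from itertools import product


def minimize(sets):
    """Drop every cut set that strictly contains another cut set."""
    return {s for s in sets if not any(other < s for other in sets)}


def cut_sets(node):
    """Minimal cut sets of a tree made of basic events (str) and ("AND"|"OR", children)."""
    if isinstance(node, str):
        return {frozenset([node])}
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        combined = set().union(*child_sets)
    elif gate == "AND":
        combined = {frozenset().union(*combo) for combo in product(*child_sets)}
    else:
        raise ValueError(f"unknown gate {gate!r}")
    return minimize(combined)


def top_probability(mcs, p):
    """Exact top event probability by enumerating all basic-event states.

    Assumes independent basic events; only practical for small trees.
    """
    events = sorted(set().union(*mcs))
    total = 0.0
    for states in product([False, True], repeat=len(events)):
        failed = {e for e, s in zip(events, states) if s}
        if any(cs <= failed for cs in mcs):
            weight = 1.0
            for e, s in zip(events, states):
                weight *= p[e] if s else 1.0 - p[e]
            total += weight
    return total


# Hypothetical tree: cooling is lost if power is lost OR both pumps fail.
tree = ("OR", ["power_loss", ("AND", ["pump_a_fails", "pump_b_fails"])])
mcs = cut_sets(tree)
single_points = [cs for cs in mcs if len(cs) == 1]
prob = top_probability(mcs, {"power_loss": 1e-3, "pump_a_fails": 1e-2, "pump_b_fails": 1e-2})
print(sorted(sorted(cs) for cs in mcs), single_points, prob)
```

Here the cut set containing only `power_loss` shows that a single failure can cause the top event, which is exactly the qualitative finding described above.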
Some industries use both fault trees and event trees. An event tree starts from an undesired initiator (loss of critical supply, component failure etc.) and follows possible further system events through to a series of final consequences. As each new event is considered, a new node on the tree is added with a split of probabilities of taking either branch. The probabilities of a range of "top events" arising from the initial event can then be seen.
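The branching described above can be sketched in a few lines of Python. This is a simplified illustration: the initiator frequency, the barrier names and their failure probabilities are hypothetical, the barriers are assumed independent, and every barrier is questioned on every path, whereas real event trees often prune branches that no longer matter.

```python
from itertools import product


def event_tree_outcomes(initiator_frequency, barriers):
    """Enumerate every path through a binary event tree.

    barriers: ordered list of (name, failure_probability) pairs.
    Returns a dict mapping a path label to the frequency of that final outcome.
    """
    outcomes = {}
    for states in product([True, False], repeat=len(barriers)):  # True = barrier works
        frequency = initiator_frequency
        label = []
        for (name, p_fail), works in zip(barriers, states):
            # Each node splits the incoming frequency between its two branches.
            frequency *= (1.0 - p_fail) if works else p_fail
            label.append(f"{name}:{'ok' if works else 'fail'}")
        outcomes[", ".join(label)] = frequency
    return outcomes


# Hypothetical initiator (0.1 per year) followed by two protective barriers.
outcomes = event_tree_outcomes(0.1, [("alarm", 0.1), ("sprinkler", 0.05)])
for path, frequency in outcomes.items():
    print(f"{path}: {frequency:.2e} per year")
```

The frequencies of all final outcomes add up to the initiator frequency, which is a useful sanity check on any event tree.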
Usually a failure in safety-certified systems is acceptable if, on average, less than one life per 10⁹ hours of continuous operation is lost to failure. Most Western nuclear reactors, medical equipment, and commercial aircraft are certified to this level. The cost versus loss of lives has been considered appropriate at this level (by FAA for aircraft systems under Federal Aviation Regulations).
Once a failure mode is identified, it can usually be mitigated by adding extra or redundant equipment to the system. For example, nuclear reactors contain dangerous radiation, and nuclear reactions can cause so much heat that no substance might contain them. Therefore reactors have emergency core cooling systems to keep the temperature down, shielding to contain the radiation, and engineered barriers (usually several, nested, surmounted by a containment building) to prevent accidental leakage. Safety-critical systems are commonly required to permit no single event or component failure to result in a catastrophic failure mode.
Most biological organisms have a certain amount of redundancy: multiple organs, multiple limbs, etc.
For any given failure, a fail-over or redundancy can almost always be designed and incorporated into a system.
Safety and reliability
Safety is not reliability. If a medical device fails, it should fail safely; other alternatives will be available to the surgeon. If an aircraft fly-by-wire control system fails, there is no backup. Electrical power grids are designed for both safety and reliability; telephone systems are designed for reliability, which becomes a safety issue when emergency (e.g. US "911") calls are placed.
Probabilistic risk assessment has created a close relationship between safety and reliability. Component reliability, generally defined in terms of component failure rate, and external event probability are both used in quantitative safety assessment methods such as FTA. Related probabilistic methods are used to determine system Mean Time Between Failure (MTBF), system availability, or probability of mission success or failure. Reliability analysis has a broader scope than safety analysis, in that non-critical failures are considered. On the other hand, higher failure rates are considered acceptable for non-critical systems.
Safety generally cannot be achieved through component reliability alone. Catastrophic failure probabilities of 10⁻⁹ per hour correspond to the failure rates of very simple components such as resistors or capacitors. A complex system containing hundreds or thousands of components might be able to achieve an MTBF of 10,000 to 100,000 hours, meaning it would fail at 10⁻⁴ or 10⁻⁵ per hour. If a system failure is catastrophic, usually the only practical way to achieve a 10⁻⁹ per hour failure rate is through redundancy. Two redundant systems with independent failure modes, each having an MTBF of 100,000 hours, could achieve a failure rate on the order of 10⁻¹⁰ per hour because of the multiplication rule for independent events.
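The arithmetic behind this argument can be checked with a short Python sketch. It assumes a constant failure rate (so the rate is the reciprocal of the MTBF), treats the per-hour failure probability as approximately equal to the rate, which holds for small rates, and assumes the redundant channels fail independently. The function names are hypothetical.

```python
def failure_rate_per_hour(mtbf_hours: float) -> float:
    """Constant failure rate implied by an MTBF, in failures per hour."""
    return 1.0 / mtbf_hours


def all_channels_fail(p_single: float, channels: int) -> float:
    """Probability that every independent channel fails in the same hour."""
    return p_single ** channels


single = failure_rate_per_hour(100_000)  # one system with an MTBF of 100,000 hours
dual = all_channels_fail(single, 2)      # two such systems in parallel
print(f"single channel: {single:.1e} per hour")
print(f"dual redundant: {dual:.1e} per hour")
```

The result only holds while the independence assumption does; a failure cause shared by both channels would not be reduced by the multiplication rule.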
When adding equipment is impractical (usually because of expense), then the least expensive form of design is often "inherently fail-safe". That is, change the system design so its failure modes are not catastrophic. Inherent fail-safes are common in medical equipment, traffic and railway signals, communications equipment, and safety equipment.
The typical approach is to arrange the system so that ordinary single failures cause the mechanism to shut down in a safe way (for nuclear power plants, this is termed a passively safe design, although more than ordinary failures are covered). Alternately, if the system contains a hazard source such as a battery or rotor, then it may be possible to remove the hazard from the system so that its failure modes cannot be catastrophic. The U.S. Department of Defense Standard Practice for System Safety (MIL–STD–882) places the highest priority on elimination of hazards through design selection.
One of the most common fail-safe systems is the overflow tube in baths and kitchen sinks. If the valve sticks open, the water drains through the overflow tube rather than spilling over the rim and causing damage. Another common example is the elevator, in which the cable supporting the car holds spring-loaded brakes open. If the cable breaks, the brakes grab the rails, and the elevator cabin does not fall.
Some systems can never be made fail-safe, as continuous availability is needed. For example, loss of engine thrust in flight is dangerous. Redundancy, fault tolerance, or recovery procedures are used for these situations (e.g. multiple engines that are independently controlled and fuel-fed). This also makes the system less sensitive to reliability prediction errors or quality-induced uncertainty in the individual items. On the other hand, failure detection and correction, and the avoidance of common cause failures, become increasingly important here to ensure system-level reliability.
It is common practice to plan for the failure of safety systems through containment and isolation methods. The use of isolating valves, also known as the block and bleed manifold, is very common in isolating pumps, tanks, and control valves that may fail or need routine maintenance. In addition, nearly all tanks containing oil or other hazardous chemicals are required to have containment barriers set up around them to contain 100% of the volume of the tank in the event of a catastrophic tank failure. Similarly, in a long pipeline, there are remote-closing valves at regular intervals so that a leak can be isolated. Fault isolation boundaries are similarly designed into critical electronic systems or computer software. The goal of all containment systems is to provide means of mitigating the consequences of failure. Fault isolation might also refer to the extent to which detected failures might be isolated for successful recovery. The isolation level shows the system indenture level at which the failure cause can be recovered (often by replacement of a line replaceable unit).
- Earthquake engineering
- Effective safety training
- Forensic engineering
- Hazard and operability study
- Industrial engineering
- IEC 61508
- Loss-control consultant
- Occupational medicine
- Nuclear safety
- Process safety management
- Risk assessment
- Risk management
- Safety life cycle
- Occupational safety and health
- Zonal safety analysis
- ANM-110 (1988). System Design and Analysis (pdf). Federal Aviation Administration. Advisory Circular AC 25.1309-1A. Retrieved 2011-02-20.
- S–18 (2010). Guidelines for Development of Civil Aircraft and Systems. Society of Automotive Engineers. ARP4754A.
- S–18 (1996). Guidelines and methods for conducting the safety assessment process on civil airborne systems and equipment. Society of Automotive Engineers. ARP4761.
- Standard Practice for System Safety (pdf). D. U.S. Department of Defense. 1998. MIL–HDBK–882D. Retrieved 2010-03-14.
- Bornschlegl, Susanne (2012). Ready for SIL 4: Modular Computers for Safety-Critical Mobile Applications (pdf). MEN Mikro Elektronik. Retrieved 2012-05-29.
- Lees, Frank (2005). Loss Prevention in the Process Industries (3 ed.). Elsevier. ISBN 978-0-7506-7555-0.
- Kletz, Trevor (1984). Cheaper, safer plants, or wealth and safety at work: notes on inherently safer and simpler plants. I.Chem.E. ISBN 0-85295-167-1.
- Kletz, Trevor (2001). An Engineer’s View of Human Error (3 ed.). I.Chem.E. ISBN 0-85295-430-1.
- Kletz, Trevor (1999). HAZOP and HAZAN (4 ed.). Taylor & Francis. ISBN 0-85295-421-2.
- Lutz, Robyn R. (2000). Software Engineering for Safety: A Roadmap. The Future of Software Engineering. ACM Press. ISBN 1-58113-253-0. Retrieved 31 August 2006.
- Grunske, Lars; Kaiser, Bernhard; Reussner, Ralf H. (2005). Specification and Evaluation of Safety Properties in a Component-based Software Engineering Process. Springer. Retrieved 7 September 2013.
- US DOD (10 February 2000). Standard Practice for System Safety. Washington, DC: US DOD. MIL-STD-882D. Retrieved 7 September 2013.
- US FAA (30 December 2000). System Safety Handbook. Washington, DC: US FAA. Retrieved 7 September 2013.
- NASA (16 December 2008). Agency Risk Management Procedural Requirements. NASA. NPR 8000.4A.
- Leveson, Nancy (2011). Engineering a Safer World - Systems Thinking Applied To Safety. Engineering Systems. The MIT Press. ISBN 978-0-262-01662-9. Retrieved 3 July 2012.
- American Society of Safety Engineers (official website)
- Board of Certified Safety Professionals (official website)
- System Safety Society (official website)
- The Safety and Reliability Society (SaRS) (official website)
- Canadian Society of Safety Engineering (official website)
- U.S. Army Pamphlet 385-16 System Safety Management Guide