Intermittent fault

An intermittent fault, often called simply an "intermittent", is a malfunction that occurs at intervals, usually irregular, in a device or system that functions normally at other times. Intermittent faults are common to all branches of technology, including computer software. An intermittent fault is caused by several contributing factors, some of which may be effectively random, occurring simultaneously. The more complex the system or mechanism involved, the greater the likelihood of an intermittent fault.

A simple example of an effectively random cause in a physical system is a borderline electrical connection in the wiring or in a component of a circuit, where (cause 1, the cause that must be identified and rectified) two conductors may touch in response to (cause 2, which need not be identified) a minor change in temperature, vibration, orientation, voltage, etc. (Sometimes this is described as an "intermittent connection" rather than a "fault".) In computer software, a program may (cause 1) fail to initialise a variable which is required to be initially zero; if the program is run in circumstances such that memory is almost always clear before it starts, it will malfunction only on the rare occasions that (cause 2) the memory where the variable is stored happens to be non-zero beforehand.
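The software case can be illustrated with a small simulation. Python does not expose uninitialised memory, so this sketch models it with a reusable bytearray. The names `run_program`, `run_program_fixed` and `COUNTER_ADDR`, and the stale value 7, are hypothetical and chosen only for illustration.

```python
COUNTER_ADDR = 0  # hypothetical location of the variable in simulated memory


def run_program(memory: bytearray, events: int) -> int:
    """Count events, but (cause 1) never set the counter to zero first."""
    for _ in range(events):
        memory[COUNTER_ADDR] += 1
    return memory[COUNTER_ADDR]


def run_program_fixed(memory: bytearray, events: int) -> int:
    """Same program with the defect removed: the counter is initialised."""
    memory[COUNTER_ADDR] = 0
    for _ in range(events):
        memory[COUNTER_ADDR] += 1
    return memory[COUNTER_ADDR]


clean = bytearray(4)             # usual case: memory happens to be clear
stale = bytearray([7, 0, 0, 0])  # cause 2: leftover data from earlier use

print(run_program(clean, 3))                          # 3, as intended
print(run_program(stale, 3))                          # 10, the rare failure
print(run_program_fixed(bytearray([7, 0, 0, 0]), 3))  # 3, regardless of memory
```

The defective version gives the right answer whenever memory starts clear, which is why the fault can go unnoticed for a long time; only the fixed version is independent of what the memory held beforehand.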

Intermittent faults are notoriously difficult to identify and repair ("troubleshoot") because no individual factor creates the problem alone, so the factors can only be identified while the malfunction is actually occurring. The person capable of identifying and solving the problem is seldom the usual operator. Because the timing of the malfunction is unpredictable, and both downtime of the device or system and engineers' time incur cost, the fault is often simply tolerated if it is not too frequent, unless it causes unacceptable problems or dangers. For example, an intermittent fault in critical equipment such as medical life support equipment could kill a patient, and in aeronautics one could cause a flight to be aborted or, in some cases, the aircraft to crash.

If an intermittent fault occurs for long enough during troubleshooting, it can be identified and resolved in the usual way.

Some techniques to resolve intermittent faults are:

Automatic logging of relevant parameters over a long enough time for the fault to manifest can help; parameter values at the time of the fault may identify the cause so that appropriate remedial action can be taken.
Changing operating circumstances while the fault is present to see whether it temporarily clears or changes, for example by tapping components, cooling them with freezer spray, or heating them. Striking the cabinet may temporarily clear the fault.
Consulting a database of similar faults which have been resolved in identical or similar equipment.^[1]
Making precautionary changes without attempting to pinpoint the fault. For example, electrolytic capacitors subject to high ripple currents can be replaced as a routine measure, without troubleshooting the fault at all. Connectors can be disconnected and reseated. This is sometimes a measure of desperation: things are changed until the fault stops happening, in the hope that it has actually been resolved rather than merely become dormant.
In electrical systems and cable systems, time domain reflectometry techniques can be used: pulses are sent down electric wiring and the reflected pulses are examined for anomalies, for example intermittent leakage during the stresses of aircraft operation. This can only be done for one test channel at a time and is generally limited to intermittent faults lasting longer than 100 milliseconds.^[2]
In complex, multiple-channel systems, where the fault or faults might be in an interconnection, the ideal method of finding an intermittent fault is to monitor, detect and isolate all channels or electrical paths continuously and simultaneously. This gives the system under test continuous and complete test coverage while any environmental stressing of the system is performed. Such testing cannot be performed by scanning test technology; it needs some form of electronic neural network which can perform these tests without any scanning or digital averaging. This testing regime is covered by the DoD's MIL-PRF-32516, published in March 2015, which calls for testing technology to operate in the Class 1 category in order to combat intermittent faults effectively.^[3]
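The automatic logging technique listed above can be sketched as a rolling buffer that keeps the most recent parameter values and saves a snapshot when the fault is detected. This is a minimal illustration, not a reference implementation: the `FaultLogger` class, the parameter names and all the readings are invented for the example.

```python
from collections import deque


class FaultLogger:
    """Keep the most recent samples and snapshot them when a fault is seen."""

    def __init__(self, history: int = 5):
        self.recent = deque(maxlen=history)  # rolling window of samples
        self.snapshots = []                  # one saved window per fault

    def record(self, sample: dict, fault: bool) -> None:
        self.recent.append(sample)
        if fault:
            # Preserve the conditions leading up to and including the fault.
            self.snapshots.append(list(self.recent))


# Hypothetical readings: (temperature in deg C, supply voltage, fault seen?)
readings = [
    (41.0, 5.02, False),
    (47.5, 5.01, False),
    (58.2, 4.97, False),
    (63.9, 4.71, True),
    (52.0, 5.00, False),
]

logger = FaultLogger(history=3)
for temp_c, volts, fault in readings:
    logger.record({"temp_c": temp_c, "volts": volts}, fault)

# Inspect what the parameters were doing just before each fault.
for snapshot in logger.snapshots:
    for sample in snapshot:
        print(sample)
```

A bounded buffer lets the logger run for as long as it takes the fault to manifest without storing every sample; only the window around each fault is kept for later inspection.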

References

[1] Example of an intermittent TV fault in a database: "Z3T CHASSIS - NO START UP - INTERMITTENT. D1124 (5.1V) ZENER LEAKY".
[2] Furse, Cynthia; Smith, Paul. "Spread Spectrum Time Domain Reflectometry for Locating Intermittent Faults". IEEE Sensors Journal, vol. 5, no. 6, December 2005.
[3] Khan, Samir; Phillips, Paul; Hockley, Chris; Jennions, Ian. "No Fault Found, Retest OK, Cannot Duplicate or Fault Not Found? - Towards a standardised taxonomy".
