What Is the Fault Management Architecture (FMA)?
The Oracle Solaris OS includes an architecture for building and deploying systems and services
that are capable of predictive self healing. The service that is the core of the Fault Management
Architecture (FMA) receives data related to hardware and software errors and system changes,
and automatically diagnoses any underlying problem.
For a hardware fault, FMA attempts to take faulty components offline. For other hardware
problems, software problems, and some system changes, FMA provides information that the
administrator can use to fix the problem. Other system changes produce only informational
notifications.
FMA can diagnose and manage faults, defects, and alerts:
■ Faults – A fault is a type of problem where something that used to work no longer does. A
fault typically describes a failed hardware component.
■ Defects – A defect is a type of problem where something never worked. A defect typically
describes a software component.
■ Alerts – An alert is neither a fault nor a defect. An alert can represent a problem or can be
simply informational.
■ The fmadm list and fmadm faulty commands display all active faults,
defects, and alerts.
■ The fmadm list-fault command displays all active faults.
■ The fmadm list-defect command displays all active defects.
■ The fmadm list-alert command displays all active alerts.
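As a rough illustration of how these listing commands relate, the sketch below maps a problem class to the matching subcommand. The function name fma_list_cmd and its class keywords are hypothetical. The function only prints the command line so that you can review it before running it on a system where fmadm is installed.

```shell
# Hypothetical helper: print the fmadm listing command for a problem class.
# It only echoes the command; nothing is executed.
fma_list_cmd() {
    case "${1:-all}" in
        all)    echo "fmadm list" ;;          # faults, defects, and alerts
        fault)  echo "fmadm list-fault" ;;    # faults only
        defect) echo "fmadm list-defect" ;;   # defects only
        alert)  echo "fmadm list-alert" ;;    # alerts only
        *)      echo "unknown class: $1" >&2; return 1 ;;
    esac
}

fma_list_cmd alert    # prints: fmadm list-alert
```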
Use the fmadm replaced command to indicate that the suspect FRU has been replaced. If
multiple faults are currently reported against one FRU, the FRU shows as replaced in all cases.
fmadm replaced FMRI | label
When an FRU is replaced, the serial number of the FRU changes. If fmd automatically detects
that the serial number of an FRU has changed, the Fault Manager behaves in the same way
as if you had entered the fmadm replaced command. If fmd cannot detect whether the serial
number of the FRU has changed, then you must enter the fmadm replaced command if you
have replaced the FRU. If fmd detects that the serial number of the FRU has not changed, then
the fmadm replaced command exits with an error.
If you remove the FRU but do not replace the FRU, the Fault Manager displays the suspect as
Use the fmadm repaired command when you have performed a physical repair other than
replacement of the FRU to resolve the problem. Examples of such repairs include reseating a
card or straightening a bent pin. If multiple faults are currently reported against one FRU, the
FRU shows as repaired in all cases.
fmadm repaired FMRI | label
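The choice between the replaced and repaired subcommands can be sketched as follows. The function fma_fix_cmd and its swap and inplace keywords are hypothetical names introduced for illustration. The function prints the command rather than running it, and it assumes that the FMRI or label is passed exactly as fmadm reports it.

```shell
# Hypothetical helper: pick the fmadm subcommand that matches the work done.
#   swap    -> the FRU was physically replaced
#   inplace -> the FRU was repaired without replacement (for example, reseated)
# $2 is the FMRI or label of the FRU.
fma_fix_cmd() {
    if [ -z "$2" ]; then
        echo "usage: fma_fix_cmd swap|inplace FMRI-or-label" >&2
        return 2
    fi
    case "$1" in
        swap)    printf 'fmadm replaced %s\n' "$2" ;;
        inplace) printf 'fmadm repaired %s\n' "$2" ;;
        *)       echo "unknown action: $1" >&2; return 2 ;;
    esac
}
```

For example, fma_fix_cmd inplace followed by the label of a reseated card prints the fmadm repaired command for that label.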
Use the acquit subcommand if you determine that the indicated resource is not the cause of the
fault. Usually the Fault Manager automatically acquits some suspects in a multi-element suspect
list.
Replacement takes precedence over repair, and both replacement and repair take precedence
over acquittal. Thus, you can acquit a component and then subsequently repair the component,
but you cannot acquit a component that has already been repaired.
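The precedence rule can be modeled as a simple ranking. The sketch below is only a model of the rule as stated here, not a description of how fmd implements it. The function names are hypothetical, and treating an action of equal rank as acceptable is an assumption.

```shell
# Rank of each administrative action; a higher rank takes precedence.
fma_rank() {
    case "$1" in
        acquit)   echo 1 ;;
        repaired) echo 2 ;;
        replaced) echo 3 ;;
        *)        return 1 ;;
    esac
}

# Succeeds if action $2 may follow action $1 on the same component.
# Assumption: an action of equal rank is also accepted.
fma_can_follow() {
    prev=$(fma_rank "$1") || return 2
    next=$(fma_rank "$2") || return 2
    [ "$next" -ge "$prev" ]
}

fma_can_follow acquit repaired && echo "acquit, then repair: allowed"
fma_can_follow repaired acquit || echo "repair, then acquit: rejected"
```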
If you specify only the UUID, with no FMRI or label, then the entire event is marked as one that
can be ignored. A case is considered repaired when the fault event UUID is acquitted.
fmadm acquit UUID
Acquit by FMRI or label with no UUID only if you determine that the resource is not a factor
in any current cases in which that resource is a suspect. If multiple faults are currently reported
against one FRU, the FRU shows as acquitted in all cases.
fmadm acquit FMRI
fmadm acquit label
To acquit a resource in one case and keep that resource as a suspect in other cases, specify both
the fault event UUID and the resource FMRI or both the UUID and the resource label, as shown
in the following examples:
fmadm acquit FMRI UUID
fmadm acquit label UUID
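The three forms of acquit can be summarized in one sketch. The function fma_acquit_cmd and its -u and -r options are hypothetical and are not fmadm options. The function only assembles and prints the fmadm acquit command line in the argument order shown above.

```shell
# Hypothetical helper: build an fmadm acquit command line.
#   -u UUID      fault event UUID
#   -r RESOURCE  FMRI or label of the suspect resource
fma_acquit_cmd() {
    uuid=""
    res=""
    OPTIND=1                      # reset so the function can be called repeatedly
    while getopts "u:r:" opt; do
        case "$opt" in
            u) uuid=$OPTARG ;;
            r) res=$OPTARG ;;
            *) return 2 ;;
        esac
    done
    if [ -n "$res" ] && [ -n "$uuid" ]; then
        # Acquit the resource in this one case only.
        printf 'fmadm acquit %s %s\n' "$res" "$uuid"
    elif [ -n "$uuid" ]; then
        # Acquit the entire event.
        printf 'fmadm acquit %s\n' "$uuid"
    elif [ -n "$res" ]; then
        # Acquit the resource in every case where it is a suspect.
        printf 'fmadm acquit %s\n' "$res"
    else
        echo "need -u UUID and/or -r FMRI-or-label" >&2
        return 2
    fi
}
```

Printing the command first gives you a chance to confirm the scope of the acquittal, because acquitting by FMRI or label alone affects every case that lists the resource.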
Enable the fmd service:
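The command for this step does not appear in the text. On Oracle Solaris, fmd runs under the Service Management Facility (SMF), so a plausible sketch, assuming the default service instance svc:/system/fmd:default, looks like this. The function name fmd_enable and the DRY_RUN variable are hypothetical.

```shell
# Assumption: fmd is the SMF service instance svc:/system/fmd:default.
FMD_FMRI="svc:/system/fmd:default"

# Prints the commands by default. Set DRY_RUN=0 to run them
# (requires Oracle Solaris and sufficient privileges).
fmd_enable() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "svcadm enable $FMD_FMRI"
        echo "svcs $FMD_FMRI"
    else
        svcadm enable "$FMD_FMRI" && svcs "$FMD_FMRI"
    fi
}

fmd_enable
```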
Reset the fmd SERD modules:
fmadm reset cpumem-diagnosis
fmadm reset cpumem-retire
fmadm reset eft
fmadm reset io-retire
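The four resets can also be driven from a loop. This is a minimal sketch. The function name fmd_reset_all and the DRY_RUN variable are hypothetical, and the module list is taken from the commands above.

```shell
# Modules to reset, taken from the commands listed above.
FMD_MODULES="cpumem-diagnosis cpumem-retire eft io-retire"

# Prints each command by default. Set DRY_RUN=0 to run them
# (requires Oracle Solaris and sufficient privileges).
fmd_reset_all() {
    for mod in $FMD_MODULES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "fmadm reset $mod"
        else
            # Stop at the first failure so that it is not overlooked.
            fmadm reset "$mod" || { echo "reset failed: $mod" >&2; return 1; }
        fi
    done
}

fmd_reset_all
```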