fmadm faulty clear

April 16, 2018

PSH Procedural Article for ILOM-Based Diagnosis (Doc ID 1155200.1)

In this Document
Purpose
Details
Section A – Displaying Fault Event Information
Section A.1 Using the Fault Management Shell
Section A.2 Using the Standard ILOM Command Line Interface
Section B – Submitting a Service Request
Auto Service Request (ASR) Activated for the Product
Submitting a Service Request Via the Support Center
Section C – Post-Repair Procedures
Section C.1 Using Fault Management Shell to Clear the Fault
Section C.2 Using the ILOM Command Line Interface to Clear the Fault
References

Applies to:
Sun Microsystems > Servers
Information in this document applies to any platform.
Purpose

This article provides standard procedures for viewing details of a hardware fault diagnosed by the ILOM-based fault managers. Information contained in this article includes the preparation required when opening a service request and actions required to modify the fault status after completion of the repair action.
Details

Note: The examples contained in this document are representative of what will appear on your system. However, there will be slight variations for your specific fault.
Section A – Displaying Fault Event Information
This section describes specific procedures for viewing the details of a diagnosed fault, such as the impacted resources and the replaceable parts that have been identified as faulty. Perform these procedures before manually submitting a service request.

The Fault Management Shell is the preferred method for displaying the details of a diagnosed fault. However, support for this command shell varies depending on the ILOM release level and server product model.

Determine if the Fault Management Shell is supported on your product by logging into the ILOM command line interface as root and executing the command indicated in the procedure below.

Note: The host name may be substituted in place of the IP address of the Service Processor when logging into the ILOM CLI.

% ssh -l root <IP address of Service Processor>

-> show /SP/faultmgmt/shell

/SP/faultmgmt/shell
Targets:

Properties:

Commands:
show
start

The above indicates the Fault Management Shell is supported. Proceed to Section A.1, Using the Fault Management Shell.

-> show /SP/faultmgmt/shell
show: No such target /SP/faultmgmt/shell

The above indicates the Fault Management Shell is not supported on your product. Proceed to Section A.2, Using the Standard ILOM Command Line Interface.
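
For readers who script this check, the following is a minimal Python sketch that classifies captured output from 'show /SP/faultmgmt/shell'. It assumes the output has already been collected as text (for example over SSH); the function name fault_shell_supported is hypothetical and is not part of ILOM.

```python
def fault_shell_supported(show_output: str) -> bool:
    """Classify captured output of 'show /SP/faultmgmt/shell'.

    Hypothetical helper: it only inspects text and does not contact ILOM.
    """
    text = show_output.strip()
    # An unsupported product reports that the target does not exist.
    if "No such target" in text:
        return False
    lines = [line.strip() for line in text.splitlines()]
    # A supported product echoes the target path and lists 'start' as a command.
    return "/SP/faultmgmt/shell" in lines and "start" in lines
```

A result of True corresponds to Section A.1 and False to Section A.2.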

Section A.1 Using the Fault Management Shell

The following procedure assumes you are logged into the ILOM command line interface as root per the instructions above.

Enter the fault management shell to obtain pertinent information about the fault.

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell(y/n)? y

faultmgmtsp>

Use the ‘fmadm faulty’ command to identify the faulty component/FRU.

Example 1

The Example 1 output shown below identifies the suspect FRU as "/SYS/FANBD/FM0", which represents the full physical path to the FRU. The hierarchical path "/SYS" represents the chassis, "FANBD" represents the fan board, and "FM0" represents the fan module.

In this example the fan does not contain a FRUID, so the part number and serial number are displayed as ‘unknown’. When this information is available, these fields contain valid values. See Example 2 below.

faultmgmtsp> fmadm faulty
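------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2010-08-17/20:19:09 c1060771-8f6e-eb1f-a65c-bb47d261a1d4 SPT-8000-3R    Major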

Fault class : fault.chassis.device.fan.fail

FRU : /SYS/FANBD/FM0
(Part Number: unknown)
(Serial Number: unknown)

Description : Fan tachometer speed is below its normal operating range.

Response : The service-required LED may be illuminated on the affected
FRU and chassis. System will be powered down when the High
Temperature threshold is reached.

Impact : System may be powered down if redundant fan modules are not
operational.

Action : The administrator should review the ILOM event log for
additional information pertaining to this diagnosis. Please
refer to the Details section of the Knowledge Article for
additional information.

Example 2
The Example 2 output shown below identifies the suspect FRU as "/SYS/MB". The hierarchical path "/SYS" represents the chassis, and "MB" represents the motherboard. In this case the part number and serial number are available.

faultmgmtsp> fmadm faulty
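------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2010-08-30/14:44:36 2a4e3a37-b243-e071-8b26-f65cb5d015f1 SPT-8000-DH    Critical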

Fault class : fault.chassis.voltage.fail

FRU : /SYS/MB
(Part Number: 541-3857-07)
(Serial Number: 1005LCB-1018B2009T)

Description : A chassis voltage supply is operating outside of the
allowable range.

Response : The system will be powered off. The chassis-wide service
required LED will be illuminated.

Impact : The system is not usable until repaired. ILOM will not allow
the system to be powered on until repaired.

Action : The administrator should review the ILOM event log for
additional information pertaining to this diagnosis. Please
refer to the Details section of the Knowledge Article for
additional information.
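
When the details from Examples 1 and 2 need to be copied into a service request, a small parser can pull out the key fields. The following Python sketch is illustrative only: it assumes the pre-3.2 output layout shown above, and the names EVENT_ROW, SAMPLE and parse_fault are hypothetical.

```python
import re

# Event row: timestamp, UUID, message ID, severity (layout as in Examples 1 and 2).
EVENT_ROW = re.compile(
    r"^(?P<time>\d{4}-\d{2}-\d{2}/\d{2}:\d{2}:\d{2})\s+"
    r"(?P<uuid>[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12})\s+"
    r"(?P<msgid>\S+)\s+(?P<severity>\S+)$"
)
FIELD = re.compile(r"^(Fault class|FRU)\s*:\s*(\S+)$")
FRU_ID = re.compile(r"^\((Part Number|Serial Number):\s*(.+)\)$")

# Abbreviated copy of the Example 2 output, used only for illustration.
SAMPLE = """\
2010-08-30/14:44:36 2a4e3a37-b243-e071-8b26-f65cb5d015f1 SPT-8000-DH Critical

Fault class : fault.chassis.voltage.fail

FRU : /SYS/MB
(Part Number: 541-3857-07)
(Serial Number: 1005LCB-1018B2009T)
"""


def parse_fault(output: str) -> dict:
    """Collect the event row, fault class and FRU identity into one dict."""
    fault = {}
    for raw in output.splitlines():
        line = raw.strip()
        row = EVENT_ROW.match(line)
        if row:
            fault.update(row.groupdict())
            continue
        match = FIELD.match(line) or FRU_ID.match(line)
        if match:
            # Normalize 'Fault class' to 'fault_class', 'Part Number' to 'part_number'.
            key = match.group(1).lower().replace(" ", "_")
            fault[key] = match.group(2)
    return fault
```

Calling parse_fault(SAMPLE) yields the UUID, message ID, severity, fault class, FRU path, part number and serial number as dictionary entries.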

Example 3 (ILOM 3.2+)
The following example depicts the change in ‘fmadm faulty’ output delivered as of ILOM 3.2. In this example a memory fault was diagnosed on a SPARC T5-2 system. The identity properties of the system, the system component and the single suspect FRU are each presented in individual fields.

The System properties identify the top-level product, while the System Component properties identify the constituent system-level component (i.e., the server) of that product that contains the diagnosed problem.

Other elements of additional event information presented include:

Diag Engine – Identity of the diagnosis software that generated this event.
Problem Status – The overall status of this diagnosed problem.
Status (FRU) – Indicates the status of this FRU ("faulty" in this case).

faultmgmtsp> fmadm faulty
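------------------- ------------------------------------ -------------- --------
Time                UUID                                 msgid          Severity
------------------- ------------------------------------ -------------- --------
2000-04-09/23:59:42 ebd48d6a-3c0c-cf8c-b29d-93c7ffc401a3 SPSUN4V-8000-CQ MAJOR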

Problem Status : solved
Diag Engine : fdd 1.0
System
Manufacturer : Oracle Corporation
Name : T5 engineered
Part_Number : 1234
Serial_Number : 4321

System Component
Manufacturer : Oracle Corporation
Name : SPARC T5-2
Part_Number : 12345678+1+1
Serial_Number : 1239BDC0FA

----------------------------------------
Suspect 1 of 1
Fault class : fault.memory.dimm
Certainty : 100%
Affects : /SYS/MB/CM0/CMP/MR1/BOB0/CH0/D0
Status : faulted but still in service

FRU
Status : faulty
Location : /SYS/MB/CM0/CMP/MR1/BOB0/CH0/D0
Manufacturer : Samsung
Name : 8192MB DDR3 SDRAM DIMM
Part_Number : 07042208,M393B1K70DH0-YK0
Revision : 04
Serial_Number : 00CE02121585C74755
Chassis
Manufacturer : Oracle Corporation
Name : T5 chassis
Part_Number : abcd
Serial_Number : dbca

Description : The number of correctable errors associated with this memory
module has exceeded acceptable levels.

Response : An attempt will be made to remove the affected memory from
service.

Impact : The dimm may be deconfigured at system restart which would
reduce total system memory capacity.

Action : Use ‘fmadm faulty’ to provide a more detailed view of this
event. Please refer to the associated reference document at
http://support.oracle.com/msg/SPSUN4V-8000-CQ for the latest
service procedures and policies regarding this diagnosis.
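
Because the ILOM 3.2 format groups identity properties under headings such as System, System Component, FRU and Chassis, the fields of one group can be read by tracking the current heading. The Python sketch below is a hedged illustration based only on the layout of Example 3; the names SECTION_HEADERS, SAMPLE and section_fields are hypothetical, and other releases may lay out the output differently.

```python
# Headings that introduce a group of identity fields in Example 3.
SECTION_HEADERS = {"System", "System Component", "FRU", "Chassis"}

# Abbreviated copy of the Example 3 output, used only for illustration.
SAMPLE = """\
Suspect 1 of 1
Fault class : fault.memory.dimm
Certainty : 100%
Affects : /SYS/MB/CM0/CMP/MR1/BOB0/CH0/D0
Status : faulted but still in service

FRU
Status : faulty
Location : /SYS/MB/CM0/CMP/MR1/BOB0/CH0/D0
Manufacturer : Samsung
Name : 8192MB DDR3 SDRAM DIMM
Part_Number : 07042208,M393B1K70DH0-YK0
Revision : 04
Serial_Number : 00CE02121585C74755
Chassis
Manufacturer : Oracle Corporation
Name : T5 chassis
Part_Number : abcd
Serial_Number : dbca

Description : The number of correctable errors associated with this memory
module has exceeded acceptable levels.
"""


def section_fields(output: str, header: str) -> dict:
    """Return the 'Key : Value' fields listed under one section heading."""
    fields = {}
    inside = False
    for raw in output.splitlines():
        line = raw.strip()
        if not line:
            inside = False          # a blank line ends the current group
        elif line in SECTION_HEADERS:
            inside = line == header  # a new heading starts or ends the group
        elif inside and ":" in line:
            key, _, value = line.partition(":")
            fields[key.strip()] = value.strip()
    return fields
```

For example, section_fields(SAMPLE, "FRU") returns the location, part number and serial number of the suspect DIMM without picking up the suspect-level Status line above it.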

Section A.2 Using the Standard ILOM Command Line Interface

The following procedure assumes you are logged into the ILOM command line interface as root per the instructions above.

Use the commands described below to identify the faulty component/FRU.

The sample output shown in Steps 1-3 below identifies the suspect FRU as "/SYS/MB/P0", which represents the full physical path to the FRU, where "SYS" represents the chassis, "MB" represents the motherboard, and "P0" represents the processor.

Refer to either the service label on the top cover or the silk-screen labeling on the motherboard to locate processor "P0".

Step 1. List all known faults in the system

Example:
-> show /SP/faultmgmt

/SP/faultmgmt
Targets:
0 (/SYS/MB/P0)

Properties:

Commands:
cd
show
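
The Targets list in Step 1 pairs a fault index with the path of the faulted component. As a rough Python sketch, assuming each target line has the form "index (path)" as in the example above, the pairs can be extracted as follows. The names TARGET and faulted_targets are hypothetical.

```python
import re

# One target line, for example: 0 (/SYS/MB/P0)
TARGET = re.compile(r"^(\d+)\s+\((/\S+)\)$")


def faulted_targets(show_output: str) -> list:
    """Return (index, path) pairs from captured 'show /SP/faultmgmt' output."""
    targets = []
    for raw in show_output.splitlines():
        match = TARGET.match(raw.strip())
        if match:
            targets.append((int(match.group(1)), match.group(2)))
    return targets
```

An empty list means no target lines were found in the captured text.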

Step 2. List the state of a faulted processor

Example:
-> show /SYS/MB/P0

/SYS/MB/P0
Targets:
D0
D1
D2
D3
D4
D5
D6
D7
D8
PRSNT
SERVICE

Properties:
type = Host Processor
fru_name = Genuine Intel(R) CPU 000 @ 2.67GHz
fru_manufacturer = Intel
fru_version = 04
fru_part_number = 060A
fault_state = Faulted
clear_fault_action = (none)

Commands:
cd
show
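
The Properties block in Step 2 uses "name = value" lines, so the fault_state property can be checked programmatically. The Python sketch below assumes the output layout shown above; parse_properties and is_faulted are hypothetical helper names.

```python
def parse_properties(show_output: str) -> dict:
    """Read 'name = value' lines between 'Properties:' and 'Commands:'."""
    props = {}
    in_props = False
    for raw in show_output.splitlines():
        line = raw.strip()
        if line == "Properties:":
            in_props = True
        elif line == "Commands:":
            in_props = False
        elif in_props and "=" in line:
            # Split on the first '=' only, so values may contain '='.
            key, _, value = line.partition("=")
            props[key.strip()] = value.strip()
    return props


def is_faulted(show_output: str) -> bool:
    """True when the component reports fault_state = Faulted."""
    return parse_properties(show_output).get("fault_state") == "Faulted"
```

The same check can be repeated after the repair to confirm whether the fault state has changed.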

Step 3. List the contents of the ILOM event log

Example:
-> show /SP/logs/event/list

6313 Sun Dec 28 09:54:57 2008 Fault Fault critical
Fault detected at time = Sun Dec 28 09:54:57 2008.
The suspect component: /SYS/MB/P0 has fault.cpu.intel.l1itlb with probability=100.
Refer to http://www.sun.com/msg/SPX86-8000-TX for details.
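
The event log entry in Step 3 carries the suspect component, the fault class, the probability and a message ID embedded in the URL. The following Python sketch extracts them with regular expressions. It assumes the wording of the sample entry above, which may differ between ILOM releases; SUSPECT, MSG_URL and parse_event are hypothetical names.

```python
import re

SUSPECT = re.compile(
    r"The suspect component: (\S+) has (\S+) with probability=(\d+)"
)
# Message ID as it appears at the end of the knowledge article URL.
MSG_URL = re.compile(r"/msg/([A-Z0-9]+(?:-[A-Z0-9]+)+)")


def parse_event(entry: str):
    """Return suspect details from one fault event entry, or None."""
    suspect = SUSPECT.search(entry)
    if suspect is None:
        return None
    url = MSG_URL.search(entry)
    return {
        "fru": suspect.group(1),
        "fault_class": suspect.group(2),
        "probability": int(suspect.group(3)),
        "msgid": url.group(1) if url else None,
    }
```

Entries that do not describe a suspect component return None, so the function can be applied to every entry in the log.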

Section B – Submitting a Service Request

This section provides guidance on submitting a service request to Oracle Services in response to the diagnosed fault reported.
Auto Service Request (ASR) Activated for the Product

If ASR has been activated for the product on which this problem was diagnosed, you have received, or will receive, an e-mail notification confirming that a service request has been opened automatically, along with instructions for viewing the service request.

All of the fault event telemetry required to open a service request has already been transmitted to Oracle. Unless contacted and instructed otherwise by an Oracle service representative, no further action is required to report this problem and open a service request.

If you are reading this article in response to a fault message or SNMP trap generated on the product, rather than in response to the ASR notification e-mail mentioned above, then you can check on the status of the associated service request by logging into My Oracle Support.

Refer to https://oracle.com/asr for more information on Auto Service Request (ASR) and the currently supported products.

NOTE: ASR implements a set of rules for determining which events should result in a service request being automatically submitted. Message IDs that do not result in a service request being automatically opened by ASR will be so noted in the associated document for that specific Message ID.

Submitting a Service Request Via the Support Center

In cases where ASR has not been activated, open a service request by logging into My Oracle Support and following the indicated procedures, which will include presenting elements of the event content displayed using the procedures provided in Section A.

Section C – Post-Repair Procedures

This section describes specific procedures that may be required to modify the status of faults that have been repaired and return impacted resources to normal operation.

On some products the ILOM fault management function can determine that the associated FRUs have been replaced and automatically clear the associated fault status. In other cases it cannot, and the fault will have to be cleared manually.

To determine if the fault is still present run the same commands applied in section A.1 or A.2 (Fault Management Shell or ILOM Command Line Interface) as appropriate. If the fault is no longer present then no further action is required. If it is still present then follow the procedures described in Section C.1 or C.2 to manually clear the fault.
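
The decision described above can be summarized as a small function. This Python sketch is only a restatement of the paragraph; post_repair_procedure is a hypothetical name, and the two inputs are assumed to come from the checks in Section A.

```python
def post_repair_procedure(fault_still_present: bool, shell_supported: bool) -> str:
    """Map the post-repair checks to the section of this article to follow."""
    if not fault_still_present:
        # ILOM cleared the fault itself, or it was already cleared.
        return "no further action"
    # The fault must be cleared manually with whichever interface is available.
    return "Section C.1" if shell_supported else "Section C.2"
```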

In some cases evidence of this same fault may also be stored by the Solaris fault manager. If Solaris was the operating system running when the fault occurred, follow the procedures in Section C of the following document to determine if additional post-repair action is required:

PSH Procedural Article for Solaris FMA-Based Diagnosis (Doc ID 1173733.1).

Section C.1 Using Fault Management Shell to Clear the Fault

Enter the fault management shell.

-> start /SP/faultmgmt/shell
Are you sure you want to start /SP/faultmgmt/shell (y/n) ? y

faultmgmtsp>

Use ‘fmadm repair’ to clear the fault.

The command accepts either the fault UUID or the FRU path (for example, /SYS/FANBD/FM0).

Example 4

Example 4 shows the ‘fmadm repair’ command required after the suspect FRU has been replaced. Using the UUID from the ‘fmadm faulty’ output in Example 1 above, the command would be:

faultmgmtsp> fmadm repair c1060771-8f6e-eb1f-a65c-bb47d261a1d4

Example 5

Example 5 shows the ‘fmadm repair’ command required after the FRU has been replaced. This example uses the FRU path from Example 2 above. The command would be:

faultmgmtsp> fmadm repair /SYS/MB
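
When the repair command is assembled by a script, it helps to validate the argument before sending it to the Fault Management Shell. The Python sketch below accepts either a UUID or a FRU path under /SYS, as in the two examples above. The command verb is taken from this article and should be confirmed against your ILOM release; repair_command and the two patterns are hypothetical.

```python
import re

UUID = re.compile(r"^[0-9a-f]{8}(?:-[0-9a-f]{4}){3}-[0-9a-f]{12}$")
# Assumption: FRU paths start at /SYS and use alphanumeric segments.
FRU_PATH = re.compile(r"^/SYS(?:/[A-Za-z0-9_]+)+$")


def repair_command(target: str) -> str:
    """Build the repair command for a fault UUID or a FRU path."""
    if not (UUID.match(target) or FRU_PATH.match(target)):
        raise ValueError(f"not a fault UUID or FRU path: {target!r}")
    return f"fmadm repair {target}"
```

Rejecting anything that is neither a UUID nor a FRU path keeps malformed input from reaching the service processor.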

Section C.2 Using the ILOM Command Line Interface to Clear the Fault
Log in to the ILOM command line interface as ‘root’ and use the following command to clear the fault.

Example:
-> set /SYS/MB/P0 clear_fault_action=true
Are you sure you want to clear /SYS/MB/P0 (y/n)? y
Set ‘clear_fault_action’ to ‘true’
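
The same idea applies to the standard ILOM CLI. The Python sketch below builds the 'set' command shown above for a given component path. The path check is an assumption about what a FRU path looks like, and clear_fault_command is a hypothetical name.

```python
import re

# Assumption: component paths start at /SYS and use alphanumeric segments.
FRU_PATH = re.compile(r"^/SYS(?:/[A-Za-z0-9_]+)+$")


def clear_fault_command(fru_path: str) -> str:
    """Build the ILOM CLI command that clears the fault on one component."""
    if not FRU_PATH.match(fru_path):
        raise ValueError(f"not a FRU path under /SYS: {fru_path!r}")
    return f"set {fru_path} clear_fault_action=true"
```

ILOM still prompts for confirmation when the command is entered, as shown in the example above.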