CISCO UCS Blades – Memory Troubleshooting

This is a guest blog post kindly contributed by Eric Daly @daly_eric.

To pinpoint the specific DIMMs within UCS that are throwing errors, follow these simple steps. Be aware that an entire memory channel of up to 3 DIMMs may show as disabled because of a single DIMM with uncorrectable errors.

1) Check the Inventory > Memory tab to see which DIMMs are not registering. Make a note of the DIMM location (F0, F1, F2), as shown below:
TS_MEM1
2) Review the SEL log and search for the specific DIMM throwing an “uncorrectable” “memory error”. In this case, the image below shows that the F2 DIMM was causing the issue. If nothing shows in the SEL log, perform steps 3-5.
TS_MEM2
3) Reset the CIMC controller of the blade (Recover Server > Reset CIMC (Server Controller)). Wait a minute or two.
4) Re-acknowledge the blade. This takes 2-3 minutes.
5) Review the SEL log again, as per step 2, to identify the faulting DIMM.
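
If the SEL log has been saved as plain text, the search in step 2 can be scripted. The sketch below is only an illustration: it assumes one log entry per line and that the DIMM location (for example F2) appears in the entry as a standalone token or between underscores. The function names and the pattern are hypothetical and may need adjusting to your actual log format.

```python
import re
from typing import List

# Assumption: DIMM locations look like A0..H2 and are not embedded in a longer
# alphanumeric word. Underscores are allowed on either side of the location.
DIMM_ID = re.compile(r"(?<![A-Za-z0-9])([A-H][0-2])(?![A-Za-z0-9])")


def uncorrectable_entries(sel_text: str) -> List[str]:
    """Return the log lines that mention 'uncorrectable' (case-insensitive)."""
    return [line for line in sel_text.splitlines()
            if "uncorrectable" in line.lower()]


def suspect_dimms(sel_text: str) -> List[str]:
    """Return the sorted DIMM locations named in uncorrectable-error lines."""
    found = set()
    for line in uncorrectable_entries(sel_text):
        found.update(DIMM_ID.findall(line))
    return sorted(found)
```

Treat the result as a hint only and confirm the DIMM against the SEL entry itself before replacing hardware.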

DEEPER ANALYSIS
1) Download techsupport for the specific chassis where the suspect blade is located.
TS_MEM3
2) Extract the tar and then extract the relevant zip file for the suspect blade. Two files will give you a clear picture of memory DIMM failures: MrcOut.txt and DimmBl.log.
3) Locate the DimmBl.log file and open it with Word (not Notepad).
TS_MEM4
4) The summary on the first page tells you whether the blade has any DIMMs with uncorrectable errors:

====================== SUMMARY OF DIMM ERRORS ======================
NO DIMM ECC ERRORS ON THIS BLADE

====================== DIMM BL RAM DATABASE DUMP ========================

====== RAM DB DUMP =====
--- Control Header :
DataBaseFormatVersion : 2
FaultSensorInitDone : 0x00
SyncTaskInitDone : 0x01
DimmBLEnabledBySAM : FALSE
MostRecentHostBootTime : Sat Jun 28 19:07:28 2014
PreviousHostBootTime : Sat Jun 28 02:47:42 2014
MostRecentHostShutdownTime : Sat Jun 28 02:58:12 2014
ErrorSamplingIntervalLength : 1209600
DBSyncPeriod : 3600
CurrentIntervalIndex : 0

---------------------- PER DIMM ERROR COUNTS -----------
           CORRECTABLE ERRORS    UNCORRECTABLE ERRORS
DIMM ID    Total    This Boot    Total    This Boot
-----------------------------------------------------------
A0         0        0            0        0
A1         0        0            0        0
A2         0        0            0        0
B0         0        0            0        0
B1         0        0            0        0
B2         0        0            0        0
C0         0        0            0        0
C1         0        0            0        0
C2         0        0            0        0
D0         0        0            0        0
D1         0        0            0        0
D2         0        0            0        0
E0         0        0            0        0
E1         0        0            0        0
E2         0        0            0        0
F0         0        0            0        0
F1         0        0            0        0
F2         0        0            0        0
G0         0        0            0        0
G1         0        0            0        0
G2         0        0            0        0
H0         0        0            0        0
H1         0        0            0        0
H2         0        0            0        0
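
When there are many blades to check, the per-DIMM table can be read programmatically rather than by eye. The following is a minimal sketch, assuming the rows keep the layout shown above (a DIMM ID followed by four whitespace-separated integer counts). The names DimmCounts, parse_dimm_counts and dimms_with_uncorrectable are made up for this example and are not part of any Cisco tool.

```python
import re
from typing import Dict, List, NamedTuple


class DimmCounts(NamedTuple):
    correctable_total: int
    correctable_this_boot: int
    uncorrectable_total: int
    uncorrectable_this_boot: int


# Assumption: a data row is a DIMM ID (letter + digit) and four integers.
ROW = re.compile(r"^\s*([A-Z]\d)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\s*$")


def parse_dimm_counts(log_text: str) -> Dict[str, DimmCounts]:
    """Collect the per-DIMM error counts from the text of a DimmBl.log."""
    counts: Dict[str, DimmCounts] = {}
    for line in log_text.splitlines():
        match = ROW.match(line)
        if match:
            dimm_id = match.group(1)
            values = (int(v) for v in match.groups()[1:])
            counts[dimm_id] = DimmCounts(*values)
    return counts


def dimms_with_uncorrectable(counts: Dict[str, DimmCounts]) -> List[str]:
    """Return the sorted IDs of DIMMs whose total uncorrectable count is non-zero."""
    return sorted(dimm for dimm, c in counts.items()
                  if c.uncorrectable_total > 0)
```

To use it, read the extracted DimmBl.log into a string, pass it to parse_dimm_counts, and hand the result to dimms_with_uncorrectable. For the dump shown above, where every count is zero, the list would be empty.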
