CISCO UCS Blades – Memory Troubleshooting
This is a guest blog post kindly contributed by Eric Daly @daly_eric. In an effort to pin point specific DIMMs within UCS that are throwing an error, please follow these […]
Virtualization & Storage
This is a guest blog post kindly contributed by Eric Daly @daly_eric. In an effort to pin point specific DIMMs within UCS that are throwing an error, please follow these […]
This is a guest blog post kindly contributed by Eric Daly @daly_eric.
In an effort to pin point specific DIMMs within UCS that are throwing an error, please follow these simple steps. Be aware that an entire memory channel with up to 3 DIMMs may show as disabled in a channel, all because of one DIMM with uncorrectable errors.
1) Check the Inventory > Memory tab to see which DIMMs are not registering. Make note of the DIMM Location (F0,F1,F2) as per Below:
2) Review the SEL Log and search for the specific DIMM throwing “uncorrectable” “memory error”. In this case you will see from the image below that the F2 DIMM was causing the issue. If nothing shows in SEL log perform steps 3-5.
3) Reset CIMC controller of blade (Recover Server > Reset CIMC (Server controller). Wait a minute or 2.
4) Re-acknowledge blade. Takes 2-3 mins
5) Review SEL Log again as per step 2 in order to identify the faulting DIMM.
DEEPER ANALYSIS
1) Download techsupport for the specific chassis where the suspect blade is located.
2) Extract the tar and then extract the relvant zip file for suspect blade. There are 2 files which will give you a clear picture of memory DIMM failures. MrcOut.txt and DimmBl.log
3) Locate the DimmBl.log file and open this with Word (not notepad).
4) You will get a summary of first page telling you if blade has any DIMMs with uncorrectable errors
====================== SUMMARY OF DIMM ERRORS ======================
NO DIMM ECC ERRORS ON THIS BLADE
====================== DIMM BL RAM DATABASE DUMP ========================
====== RAM DB DUMP =====
--- Control Header :
DataBaseFormatVersion : 2
FaultSensorInitDone : 0x00
SyncTaskInitDone : 0x01
DimmBLEnabledBySAM : FALSE
MostRecentHostBootTime : Sat Jun 28 19:07:28 2014
PreviousHostBootTime : Sat Jun 28 02:47:42 2014
MostRecentHostShutdownTime : Sat Jun 28 02:58:12 2014
ErrorSamplingIntervalLength : 1209600
DBSyncPeriod : 3600
CurrentIntervalIndex : 0
---------------------- PER DIMM ERROR COUNTS -----------
CORRECTABLE ERRORS UNCORRECTABLE ERRORS
DIMM ID Total This Boot Total This Boot
-----------------------------------------------------------
A0 0 0 0 0
A1 0 0 0 0
A2 0 0 0 0
B0 0 0 0 0
B1 0 0 0 0
B2 0 0 0 0
C0 0 0 0 0
C1 0 0 0 0
C2 0 0 0 0
D0 0 0 0 0
D1 0 0 0 0
D2 0 0 0 0
E0 0 0 0 0
E1 0 0 0 0
E2 0 0 0 0
F0 0 0 0 0
F1 0 0 0 0
F2 0 0 0 0
G0 0 0 0 0
G1 0 0 0 0
G2 0 0 0 0
H0 0 0 0 0
H1 0 0 0 0
H2 0 0 0 0
Ramblings by Keith Lee
Discussions about all things VxRail.
Random Technology thoughts from an Irish Virtualization Geek (who enjoys saving the world in his spare time).
Musings of a VMware Cloud Geek
Converged and Hyper Converged Infrastructure
'Scamallach' - Gaelic for 'Cloudy' ...
Storing data and be awesome
Best Practices et alia
Every Cloud Has a Tin Lining.
Yes, got it, Searching it