EMC VNX – List of Useful NAS Commands

Verify ‘NAS’ Services are running:
Log in to the Control Station as ‘nasadmin’ and issue the command /nas/sbin/getreason from the CS console. The reason code output should be as follows (see the detailed list of Reason Codes below):
10 - slot_0 primary control station
11 - slot_1 secondary control station
5 - slot_2 contacted
5 - slot_3 contacted
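A quick scripted sanity check can flag any slot that deviates from the healthy values above. This is only a sketch and assumes the exact "code - slot_N description" output format shown:
# Flag any slot whose reason code differs from the expected healthy values
/nas/sbin/getreason | grep -v -e "^10 - slot_0" -e "^11 - slot_1" -e "^5 - slot_[23]" \
  && echo "WARNING: unexpected reason code(s) listed above" \
  || echo "All reason codes as expected"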

Complete a full ‘Health Check’:
/nas/bin/nas_checkup
Location of the output (the checkup log is stamped with the run date):
cd /nas/log/
ls
cat /nas/log/checkup-<rundate>.log
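To jump straight to the most recent health check report, something like the following can be used (a sketch, assuming the checkup-<rundate>.log naming shown above):
# Display the newest nas_checkup log
ls -t /nas/log/checkup*.log | head -1 | xargs cat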

Confirm the EMC NAS version installed and the model name:
/nasmcd/bin/nas_version
/nas/sbin/model

Stop ‘NAS’ Services:
/sbin/service nas stop
Start ‘NAS’ Services:
/sbin/service nas start

Check the running status of the Data Movers and view which slot is active/standby:
nas_server -info -all

Verify connectivity to the VNX storage processors (SPs) from the Control Stations:
/nas/sbin/navicli -h SPA_IP domain -list
/nas/sbin/navicli -h SPB_IP domain -list
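A small loop avoids repeating the command per SP. The addresses below are hypothetical placeholders; substitute the real SPA/SPB IPs:
# Check domain connectivity to both SPs (10.0.0.1/10.0.0.2 are placeholder IPs)
for sp in 10.0.0.1 10.0.0.2; do
  echo "=== $sp ==="
  /nas/sbin/navicli -h $sp domain -list
done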

Confirm the VMAX/VNX is connected to the NAS:
nas_storage -check -all
nas_storage -list

View VNX NAS Control LUN Storage Group details:
/nas/sbin/navicli -h SP_IP storagegroup -list -gname ~filestorage

List the disk table to ensure all of the Control Volumes have been presented to both Data Movers:
nas_disk -list

Check the File Systems:
df -h

View detailed backend storage information:
/nas/bin/nas_storage -info -all

View trunking devices created (LACP, EtherChannel, FSN):
server_sysconfig server_2 -virtual
Example: view the interface named "LACP_NAS":
server_sysconfig server_2 -virtual -info LACP_NAS
server_2 :
*** Trunk LACP_NAS: Link is Up ***
*** Trunk LACP_NAS: Timeout is Short ***
*** Trunk LACP_NAS: Statistical Load Balancing is IP ***
Device    Local Grp   Remote Grp   Link   LACP   Duplex   Speed
----------------------------------------------------------------
fxg-1-0   10002       51840        Up     Up     Full     10000 Mbs
fxg-1-1   10002       51840        Up     Up     Full     10000 Mbs
fxg-2-1   10002       51840        Up     Up     Full     10000 Mbs
fxg-2-0   10002       51840        Up     Up     Full     10000 Mbs
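To see just the trunk state without the full device table, the output can be filtered, for example (a sketch reusing the LACP_NAS example name):
# Show only the trunk status lines for the LACP_NAS device
server_sysconfig server_2 -virtual -info LACP_NAS | grep "Trunk"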

Check Code Levels:
List the Data Movers: nas_server -list
Check the DART code installed on the Data Movers: server_version ALL
Check the NAS code installed on the Control Station: nas_version
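The three checks combine neatly into a single code level report, for example (a sketch using only the commands above):
# One-shot code level report
echo "--- NAS code on the Control Station ---" ; nas_version
echo "--- Data Movers ---" ; nas_server -list
echo "--- DART code on the Data Movers ---" ; server_version ALL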

View Network Configuration:
To display the parameters of all interfaces:
Control Station: /sbin/ifconfig (eth3 is the mgmt interface)
Data Movers: server_ifconfig server_2 -all
Control Station interface config file: cat ifcfg-eth3

View VNX SP IP Addresses from the CS console:
grep SP /etc/hosts | grep A_
grep SP /etc/hosts | grep B_

Verify Control Station Comms:
/nas/sbin/setup_enclosure -checkSystem

Confirm the unified FLAG is set:
/nas/sbin/nas_hw_upgrade -fc_option -enable

Date & Time:
Control Station: date
Data Movers: server_date ALL

Check IP & DNS info on the CS/DM:
nas_cs -info
server_dns ALL

Log Files:
Log file location: /var/log/messages
Example of NAS services starting successfully:
grep -A10 "Starting NAS services" /var/log/messages*
Output:
Dec 8 19:07:27 emcnas_i0 S95nas: Starting NAS services
Dec 8 19:07:46 emcnas_i0 EMCServer: nas_mcd: MCD will monitor CS IPMI connection.
Dec 8 19:08:46 emcnas_i0 EMCServer: nas_mcd: slot 0 missed 10 heartbeats from slot 1.
Dec 8 19:08:50 emcnas_i0 EMCServer: nas_mcd: Install Manager is running on slot 0, skipping slot 1 reboot
Dec 8 19:08:50 emcnas_i0 EMCServer: nas_mcd: Slot 0 becomes primary due to timeout
Dec 8 19:08:52 emcnas_i0 mcd_helper: All NBS devices are up
Dec 8 19:09:08 emcnas_i0 kernel: kjournald starting. Commit interval 5 seconds
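To confirm the most recent service start without paging through every rotated file, the same grep can be trimmed to the newest entries (a sketch):
# Show the latest "Starting NAS services" entry and the lines that follow it
grep -A10 "Starting NAS services" /var/log/messages* | tail -15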

Check the Data Mover Logs:
server_log server_2
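Where more than one Data Mover is configured, the logs can be sampled in a loop. The server names below are assumptions for a typical two-blade system; confirm them with nas_server -list:
# Tail the most recent log entries for each Data Mover (server names are assumed)
for dm in server_2 server_3; do
  echo "===== $dm ====="
  server_log $dm | tail -20
done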

Failing over a Control Station:
Failover:
/nas/sbin/cs_standby -failover
Takeover:
/nasmcd/sbin/cs_standby -takeover
Or reboot:
nas_cs -reboot

Shutdown control station:
/sbin/shutdown -h now

Power off CS1 from CS0:
/nas/sbin/t2reset pwroff -s 1

List on which VMAX3 directors each CS and DM are located:
nas_inventory -list

List Data Mover parameters:
/nas/bin/server_param server_2 -info
/nas/bin/server_param server_3 -info
/nas/bin/server_param server_2 -facility -all -list
/nas/bin/server_param server_3 -facility -all -list
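The full facility listing is long, so it is usually piped through grep. The keyword below ("cifs") is purely illustrative:
# Search the full parameter listing for a keyword of interest
/nas/bin/server_param server_2 -facility -all -list | grep -i cifs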

Determine the failover status of the Blades (Datamovers):
/nas/bin/nas_server -info -all

Initiate a manual failover of server_2 to the standby Datamover:
server_standby server_2 -activate mover

List the status of the Datamovers:
nas_server -list

Review the information for server_2:
nas_server -info server_2
All DMs: nas_server -info ALL

Shutdown Datamover (Xblade):
/nas/bin/server_cpu server_2 -halt now

Power on the Datamover (Xblade):
/nasmcd/sbin/t2reset pwron -s 2

Restore the original primary Datamover:
server_standby server_2 -restore mover
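Pulled together, a manual failover test of server_2 follows this sequence (a sketch built only from the commands above; check the status at each step before moving on):
# Manual failover test for server_2
/nas/bin/nas_server -info -all            # confirm the current primary/standby state
server_standby server_2 -activate mover   # fail server_2 over to its standby
/nas/bin/nas_server -info -all            # verify the standby has taken over
server_standby server_2 -restore mover    # restore the original primary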

To monitor an immediate cold restart of server_2:
server_cpu server_2 -reboot cold -monitor now
A cold reboot (hardware reset) shuts down the Data Mover completely before restarting and includes a Power-On Self-Test (POST).

To monitor an immediate warm restart of server_2:
server_cpu server_2 -reboot -monitor now
A warm reboot (software reset) performs a partial shutdown of the Data Mover and skips the POST on restart, so it is faster than a hardware reset.

Clean Shutdown:
Shut down the Control Stations and Data Movers:
/nasmcd/sbin/nas_halt -f now
The shutdown is finished when "exited on signal" appears on the console.

Power down the entire VNX including the Storage Processors:
nas_halt -f -sp now

Check if Product Serial Number is Correct:
/nasmcd/sbin/serial -db_check
Remove inconsistency between the db file and the enclosures:
/nasmcd/sbin/serial -repair
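A cautious approach is to repair only when the check actually reports a problem. The sketch below assumes serial -db_check returns a non-zero exit status on an inconsistency; confirm that behaviour before relying on it:
# Repair the serial number DB only if the check fails (exit status assumption)
if ! /nasmcd/sbin/serial -db_check; then
  /nasmcd/sbin/serial -repair
fi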

List All Hardware components by location:
nas_inventory -list -location
nas_inventory -list -location | grep "DME 0 Data Mover 2 IO Module"

Use location address to view specific component details:
nas_inventory -info "system:VNX5600:CKM001510001932007|enclosure:xpe:0|mover:VNX5600:2|iomodule::1"

List of Reason Codes:
0 – Reset (or unknown state)
1 – DOS boot phase, BIOS check, boot sequence
2 – SIB POST failures (that is, hardware failures)
3 – DART is loaded on Data Mover, DOS boot and execution of boot.bat, boot.cfg.
4 – DART is ready on Data Mover, running, and MAC threads started.
5 – DART is in contact with Control Station box monitor.
6 – Control Station is ready, but is not running NAS service.
7 – DART is in panic state.
9 – DART reboot is pending or in halted state.
10 – Primary Control Station reason code
11 – Secondary Control Station reason code
13 – DART panicked and completed memory dump (single Data Mover configurations only, same as code 7, but done with dump)
14 – This reason code can be set for the Blade for any of the following:
• Data Mover enclosure-ID was not found at boot time
• Data Mover’s local network interface MAC address is different from MAC address in configuration file
• Data Mover’s serial number is different from serial number in configuration file
• Data Mover was PXE booted with install configuration
• SLIC IO Module configuration mismatch (Foxglove systems)
15 – Data Mover is flashing firmware. DART is flashing BIOS and/or POST firmware. Data Mover cannot be reset.
17 – Data Mover Hardware fault detected
18 – DM Memory Test Failure. BIOS detected memory error
19 – DM POST Test Failure. General POST error
20 – DM POST NVRAM test failure. Invalid NVRAM content error (checksum, WWN, etc.)
21 – DM POST invalid peer Data Mover type
22 – DM POST invalid Data Mover part number
23 – DM POST Fibre Channel test failure. Error in blade Fibre connection (controller, Fibre discovery, etc.)
24 – DM POST network test failure. Error in Ethernet controller
25 – DM T2NET Error. Unable to get blade reason code due to management switch problems.
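For quick reference on the Control Station, getreason output can be annotated with the common healthy-state codes from the table above. A minimal sketch, assuming the "code - slot_N description" output format shown earlier:
# Translate the most common reason codes inline
/nas/sbin/getreason | while read code dash slot rest; do
  case "$code" in
    5)  desc="Data Mover contacted (DART in contact with CS box monitor)" ;;
    10) desc="primary Control Station" ;;
    11) desc="secondary Control Station" ;;
    *)  desc="see the reason code table above" ;;
  esac
  echo "$slot: code $code ($desc)"
done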

CISCO UCS Blades – Memory Troubleshooting

This is a guest blog post kindly contributed by Eric Daly @daly_eric.

In an effort to pinpoint specific DIMMs within UCS that are throwing an error, follow these simple steps. Be aware that an entire memory channel (up to 3 DIMMs) may show as disabled because of one DIMM with uncorrectable errors.

1) Check the Inventory > Memory tab to see which DIMMs are not registering. Make note of the DIMM location (F0, F1, F2) as per below:
[Image: TS_MEM1]
2) Review the SEL log and search for "uncorrectable memory error" entries to identify the specific DIMM. In this case you can see from the image below that the F2 DIMM was causing the issue. If nothing shows in the SEL log, perform steps 3-5.
[Image: TS_MEM2]
3) Reset the blade's CIMC controller (Recover Server > Reset CIMC (Server Controller)). Wait a minute or two.
4) Re-acknowledge the blade. This takes 2-3 minutes.
5) Review the SEL log again as per step 2 to identify the faulting DIMM.

DEEPER ANALYSIS
1) Download the tech support bundle for the specific chassis where the suspect blade is located.
[Image: TS_MEM3]
2) Extract the tar and then extract the relevant zip file for the suspect blade. Two files give a clear picture of memory DIMM failures: MrcOut.txt and DimmBl.log.
3) Locate the DimmBl.log file and open it with Word (not Notepad).
[Image: TS_MEM4]
4) The first page gives a summary telling you whether the blade has any DIMMs with uncorrectable errors (a command-line shortcut is sketched after the dump below):

====================== SUMMARY OF DIMM ERRORS ======================
NO DIMM ECC ERRORS ON THIS BLADE

====================== DIMM BL RAM DATABASE DUMP ========================

====== RAM DB DUMP =====
--- Control Header :
DataBaseFormatVersion : 2
FaultSensorInitDone : 0x00
SyncTaskInitDone : 0x01
DimmBLEnabledBySAM : FALSE
MostRecentHostBootTime : Sat Jun 28 19:07:28 2014
PreviousHostBootTime : Sat Jun 28 02:47:42 2014
MostRecentHostShutdownTime : Sat Jun 28 02:58:12 2014
ErrorSamplingIntervalLength : 1209600
DBSyncPeriod : 3600
CurrentIntervalIndex : 0

---------------------- PER DIMM ERROR COUNTS -----------
            CORRECTABLE ERRORS      UNCORRECTABLE ERRORS
DIMM ID     Total    This Boot      Total    This Boot
---------------------------------------------------------
A0          0        0              0        0
A1          0        0              0        0
A2          0        0              0        0
B0          0        0              0        0
B1          0        0              0        0
B2          0        0              0        0
C0          0        0              0        0
C1          0        0              0        0
C2          0        0              0        0
D0          0        0              0        0
D1          0        0              0        0
D2          0        0              0        0
E0          0        0              0        0
E1          0        0              0        0
E2          0        0              0        0
F0          0        0              0        0
F1          0        0              0        0
F2          0        0              0        0
G0          0        0              0        0
G1          0        0              0        0
G2          0        0              0        0
H0          0        0              0        0
H1          0        0              0        0
H2          0        0              0        0
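The same information can be pulled straight from the extracted DimmBl.log on the command line (a sketch, assuming the summary heading and per-DIMM column layout shown above):
# Show the DIMM error summary and any DIMM rows with non-zero error counts
grep -A2 "SUMMARY OF DIMM ERRORS" DimmBl.log
awk '/^[A-H][0-2] / { if ($2+$3+$4+$5 > 0) print "Errors on DIMM", $1": "$0 }' DimmBl.log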

EMC VNX – Managing Data Mover (Blade) Relationships

Today I had a VNX Data Mover (Blade) SLIC issue where the failover relationships were not as they should be. The following commands are very useful for checking the relationships between the Blades and restoring those relationships if necessary.

You can determine the Blade failover status through the following command:

/nas/bin/nas_server -info -all

[Image: DM2]

The type field indicates whether the Blade is a primary Blade (nas) or a standby Blade (standby). The standbyfor field lists the primary Blade for which the Blade is the standby.

If the name field indicates that a primary Blade is faulted and has failed over to its standby, you need to perform a restore on that primary Blade. Enter the following command:
/nas/bin/server_standby server_4 -restore mover

If a failover relationship is not as it should be, you can delete the failover relationship as follows. For example, deleting server_2's (primary) relationship with server_4 (standby):
/nas/bin/server_standby server_2 -delete mover=server_4
It is then possible to create a failover relationship between server_2 (primary) and server_5 (standby):
/nas/bin/server_standby server_2 -c mover=server_5 -policy auto
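Put together, repairing a wrong relationship looks like this (a sketch using the same example server names as above):
# Replace server_2's standby (server_4 -> server_5) and verify the result
/nas/bin/server_standby server_2 -delete mover=server_4
/nas/bin/server_standby server_2 -c mover=server_5 -policy auto
/nas/bin/nas_server -info -all    # confirm the new standbyfor relationship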

The following commands can be used to halt and reboot the Blade:
/nasmcd/sbin/t2reset halt -s 5
/nasmcd/sbin/t2reset reboot -s 5

Another useful command lists the I/O modules in the Blade:
/nasmcd/sbin/t2vpd -s 5

[Image: DM1]