VMware ESXi host disconnects from vCenter

Issue:

ESXi hosts disconnect from vCenter and may not accept direct connections from the vSphere Client either. The VMs continue to run, and SSH to the host still works. The esxcfg-scsidevs -m command hangs. A LUN disappears from one or more hosts in the ESXi cluster.

Errors from vCenter:
A general system error occurred: Invalid response code: 503 Service Unavailable.
Unable to communicate with the remote host, since it is disconnected.
Cannot contact the specified host. The host may not be available on the network, a network configuration problem may exist, or the management services on this host may not be responding.

Snippets from the vmkernel logs on the ESXi host:

Check the log for "non-responsive" alerts:

[root@ESXi01:~] cat /var/log/vmkernel.log  | grep -i responsive

cpuxx:yyyyyyy ALERT: hostd detected to be non-responsive

Check vmkernel.log for entries where the device status is 0x18, which corresponds to RESERVATION CONFLICT:

[root@ESXi01:~] cat /var/log/vmkernel.log | grep 0x18 | head

cpuxx:yyyyy)NMP: nmp_ResetDeviceLogThrottling:3343: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x2 0x3a 0x0 from dev "naa.value" occurred 2926 times(of 2926 commands)

H:0x0 D:0x18 P:0x0

Host Status     [0x0]     - OK - This status is returned when there is no error on the host side. In that case, look at the Device or Plugin status for the actual error. It is also the case in which the log reports Valid sense data instead of Possible sense data.

Device Status     [0x18] - RESERVATION CONFLICT - This status is returned when a LUN is reserved and an initiator that did not place the SCSI reservation attempts to issue commands to it.

Plugin Status     [0x0]  - GOOD - No error. (ESXi 5.x / 6.x only)
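The decoding above can be sketched as a pair of small shell functions. This is a minimal sketch, not a VMware tool: the function names scsi_status_fields and device_status_name are made up for illustration, and the lookup covers only a few standard SCSI device status codes, with everything else reported as unknown.

```shell
#!/bin/sh
# Hypothetical helpers for reading the H:/D:/P: triplet in vmkernel.log lines.

# Extract the host, device and plugin status from one log line.
# Prints e.g. "0x0 0x18 0x0", or nothing when the line has no triplet.
scsi_status_fields() {
    printf '%s\n' "$1" |
        sed -n 's/.*H:\(0x[0-9a-fA-F]*\) D:\(0x[0-9a-fA-F]*\) P:\(0x[0-9a-fA-F]*\).*/\1 \2 \3/p'
}

# Map a device status code to its standard SCSI name.
# Only a handful of codes are listed; the rest fall through to UNKNOWN.
device_status_name() {
    case "$1" in
        0x0)  echo "GOOD" ;;
        0x2)  echo "CHECK CONDITION" ;;
        0x8)  echo "BUSY" ;;
        0x18) echo "RESERVATION CONFLICT" ;;
        *)    echo "UNKNOWN ($1)" ;;
    esac
}

# Demo using the sample line from this post.
line='cpuxx:yyyyy)NMP: nmp_ResetDeviceLogThrottling:3343: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x2 0x3a 0x0 from dev "naa.value" occurred 2926 times(of 2926 commands)'
dev_status=$(scsi_status_fields "$line" | cut -d' ' -f2)
echo "Device status $dev_status: $(device_status_name "$dev_status")"
```

The demo only parses a line held in a variable, so it can be tried on any machine without touching the host.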

Troubleshooting:

  1. The logs above show that the issue is a SCSI-2 reservation conflict.
  2. Confirm the affected LUNs using this command:
    1. cat /var/log/vmkernel.log | grep 0x18 | cut -d\" -f2 | sort -nr | uniq -c
  3. The output lists each affected LUN with the number of matching log entries.
  4. Reset each affected LUN using the command below (this will not affect the VMs running on the LUN):
    1. vmkfstools -L lunreset /vmfs/devices/disks/naa.value (replace naa.value with your NAA ID)
  5. Confirm the reset by checking the logs for entries like these:
WARNING: NMP: nmpDeviceTaskMgmt:2288: Attempt to issue lun reset on device naa.value. This will clear any SCSI-2 reservations on the device.
HBX: 2802: 'Datastore': HB at offset 3485696 - Waiting for timed out HB:
[HB state abcdef02 offset 3485696 gen 61 stampUS 13127489805898 uuid 5811a939-aa641342-f39b-848f69156bb5 jrnl <FB 246097> drv 14.61 lockImpl 3]
Resv: 632: Executed out-of-band lun reset on naa.value
HBX: 276: 'Datastore': HB at offset 3485696 - Reclaimed heartbeat
ScsiCore: 1609: Power-on Reset occurred on naa.value
  6. Run esxcfg-scsidevs -m to confirm that it now completes without hanging.
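The steps above can be sketched as a small script that counts reservation conflicts per device and prints the matching reset commands for review. This is only a sketch under assumptions: the function names conflict_counts and print_reset_commands are hypothetical, the log lines are assumed to follow the format shown earlier (D:0x18 followed by from dev "..."), and the match on D:0x18 is stricter than a plain grep for 0x18. The script prints the vmkfstools commands without running them.

```shell
#!/bin/sh
# Hypothetical helpers; nothing here issues a LUN reset by itself.

# Count reservation-conflict (D:0x18) log lines per device.
# $1: log file path (defaults to /var/log/vmkernel.log).
# Output: one "<count> <device>" line per device, highest count first.
conflict_counts() {
    sed -n 's/.*D:0x18 .*from dev "\([^"]*\)".*/\1/p' "${1:-/var/log/vmkernel.log}" |
        sort |
        uniq -c |
        sort -rn |
        while read -r count dev; do
            echo "$count $dev"
        done
}

# Print (do not execute) the reset command for each affected device,
# so the list can be checked before anything is run on the host.
print_reset_commands() {
    conflict_counts "$1" | while read -r count dev; do
        echo "vmkfstools -L lunreset /vmfs/devices/disks/$dev"
    done
}
```

On a host, conflict_counts /var/log/vmkernel.log would list the devices, and print_reset_commands /var/log/vmkernel.log would show the commands to run by hand once the list looks right.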

This issue is most likely caused by a recent change on the storage side. After each storage-side change, make sure the host is rebooted.
