VMware ESXi host disconnects from vCenter
Issue :
ESXi hosts disconnect from vCenter and may not accept direct connections from the vSphere Client either. The VMs continue to run and SSH to the host still works, but the esxcfg-scsidevs -m command hangs. A LUN disappears from one or more hosts in the ESXi cluster.
Errors from vCenter :
A general system error occurred: Invalid response code: 503 Service Unavailable.
Unable to communicate with the remote host, since it is disconnected.
Cannot contact the specified host. The host may not be available on the network, a network configuration problem may exist, or the management services on this host may not be responding.
Snippets from the vmkernel logs on the ESXi host:
Check the log for "non-responsive" alerts:
[root@ESXi01:~] cat /var/log/vmkernel.log | grep -i responsive
cpuxx:yyyyyyy ALERT: hostd detected to be non-responsive
Check vmkernel.log for entries where the device status is 0x18, which corresponds to a reservation conflict:
[root@ESXi01:~] cat /var/log/vmkernel.log | grep 0x18 | head
cpuxx:yyyyy)NMP: nmp_ResetDeviceLogThrottling:3343: Error status H:0x0 D:0x18 P:0x0 Sense Data: 0x2 0x3a 0x0 from dev "naa.value" occurred 2926 times(of 2926 commands)
The status triplet H:0x0 D:0x18 P:0x0 decodes as follows:
Host Status [0x0] - OK - This status is returned when there is no error on the host side. In this case, look for a status from the Device or Plugin. It is also when you will see valid sense data instead of possible sense data.
Device Status [0x18] - RESERVATION CONFLICT - This status is returned when a LUN is reserved and an initiator that did not place that SCSI reservation attempts to issue commands to it.
Plugin Status [0x0] - GOOD - No error. (ESXi 5.x / 6.x only)
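To make the decoding step repeatable, here is a minimal sketch of two shell helpers. The function names are my own, not part of ESXi. The mapping covers only 0x18 from this post plus a few standard SCSI status values, and the extraction assumes the log line contains the "H:... D:... P:..." triplet in the format shown above.

```shell
# Map a SCSI device status code (the D: field) to its name.
# Hypothetical helper: only a few standard SCSI status values are listed;
# anything else is reported as unknown.
decode_device_status() {
    case "$1" in
        0x0)  echo "GOOD" ;;
        0x2)  echo "CHECK CONDITION" ;;
        0x8)  echo "BUSY" ;;
        0x18) echo "RESERVATION CONFLICT" ;;
        0x28) echo "TASK SET FULL" ;;
        *)    echo "UNKNOWN ($1)" ;;
    esac
}

# Extract the D: value from a log line such as
#   "... Error status H:0x0 D:0x18 P:0x0 Sense Data: ..."
# Prints nothing if the line has no D: field in that format.
extract_device_status() {
    printf '%s\n' "$1" | sed -n 's/.* D:\(0x[0-9a-fA-F]*\) .*/\1/p'
}
```

Passing a matching vmkernel.log line to extract_device_status and feeding the result to decode_device_status gives the status name for that entry.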
Troubleshooting :
- The logs above identify the issue as a SCSI-2 reservation conflict.
- Confirm the affected LUN using this command:
- cat /var/log/vmkernel.log | grep 0x18 | cut -d\" -f2 | sort | uniq -c | sort -nr
- This returns the affected LUNs with the number of matching log entries for each, most frequent first.
- Reset the affected LUN using the command below (this does not affect the VMs running on the LUN):
- vmkfstools -L lunreset /vmfs/devices/disks/naa.value (replace naa.value with your NAA ID)
- Confirm the activity by checking the logs:
WARNING: NMP: nmpDeviceTaskMgmt:2288: Attempt to issue lun reset on device naa.value. This will clear any SCSI-2 reservations on the device.
HBX: 2802: 'Datastore': HB at offset 3485696 - Waiting for timed out HB:
  [HB state abcdef02 offset 3485696 gen 61 stampUS 13127489805898 uuid 5811a939-aa641342-f39b-848f69156bb5 jrnl <FB 246097> drv 14.61 lockImpl 3]
Resv: 632: Executed out-of-band lun reset on naa.value
HBX: 276: 'Datastore': HB at offset 3485696 - Reclaimed heartbeat
ScsiCore: 1609: Power-on Reset occurred on naa.value
- Now run esxcfg-scsidevs -m to confirm that it completes without hanging.
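The identify-and-reset steps can be sketched as a small dry-run helper. This is a rough sketch, not a tested tool: the function name is hypothetical, it assumes the device ID is the first double-quoted string on each matching log line (as in the snippet above), and it only prints the vmkfstools commands so you can review them before running anything.

```shell
# Print (do not run) a lunreset command for every device that logged
# device status 0x18, most frequently reported device first.
# $1: path to a vmkernel log; defaults to /var/log/vmkernel.log
# Hypothetical helper: assumes the device ID is the first quoted
# string on each matching line.
print_lunreset_commands() {
    log="${1:-/var/log/vmkernel.log}"
    grep 'D:0x18' "$log" \
        | cut -s -d'"' -f2 \
        | sort | uniq -c | sort -nr \
        | awk '{ print "vmkfstools -L lunreset /vmfs/devices/disks/" $2 }'
}
```

Calling print_lunreset_commands with no argument reads /var/log/vmkernel.log. Because the output is plain text, a reset is issued only when you copy a line and run it yourself.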
This issue was most likely caused by a recent change on the storage side. Ensure that the host is rebooted after each storage-side change.