As most of you know by now, when vSAN is enabled, vSphere HA uses the vSAN network for heartbeating. I recently wrote an article about the isolation address and its relationship with heartbeat datastores. In the comment section, Johann asked what the settings should be for 2-Node Direct Connect with vSAN. A very valid question, as an isolation event is still possible, although not as likely as with a stretched cluster, considering you do not have a network switch for vSAN in this configuration. Anyway, you would still like the VMs that are impacted by the isolation to be powered off, and you would like the remaining host to power them back on.
So the question remains: which IP address do you select? Well, there’s no IP address to select in this particular case. As it is “direct connect”, there are probably only 2 IP addresses on that segment (one for host 1 and another for host 2). You cannot use the default gateway either, as that is the gateway for the management interface, which is the wrong network. So here is what I recommend:
- Disable the Isolation Response >> set it to “leave powered on” or “disabled” (depending on the version used)
- Disable the use of the default gateway by setting the following HA advanced setting (a scripted example follows this list):
- das.usedefaultisolationaddress = false
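For those who prefer to script these two settings, here is a minimal pyVmomi sketch of what that could look like. This is just an illustration, not the official method: the vCenter name, credentials, and cluster name (“vsan-cluster”) are placeholders, and the UI accomplishes exactly the same thing.

    # Minimal pyVmomi sketch: set the isolation response to "Leave powered on"
    # and stop HA from using the default gateway as isolation address.
    # All names and credentials below are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.lab.local', user='administrator@vsphere.local',
                      pwd='password', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()

    # Look up the cluster by name.
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == 'vsan-cluster')
    view.Destroy()

    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    # 'none' is the API enum value for "Leave powered on".
    spec.dasConfig.defaultVmSettings = vim.cluster.DasVmSettings(isolationResponse='none')
    # Tell HA not to use the management default gateway as isolation address.
    spec.dasConfig.option = [
        vim.option.OptionValue(key='das.usedefaultisolationaddress', value='false')]

    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
    Disconnect(si)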
That probably makes you wonder what will happen when a host is isolated from the rest of the cluster (the other host and the witness). Well, when this happens the VMs are still killed, not as a result of the isolation response kicking in, but as a result of vSAN kicking in. Here’s the process:
- Heartbeats are not received
- Host elects itself primary
- Host pings the isolation address
- If the host can’t ping the gateway of the management interface then the host declares itself isolated
- If the host can ping the gateway of the management interface then the host doesn’t declare itself isolated
- Either way, the isolation response is not triggered as it is set to “Leave powered on”
- vSAN will now automatically kill all VMs that have lost access to their components
- The isolated host will lose quorum
- vSAN objects will become isolated
- The advanced setting “VSAN.AutoTerminateGhostVm=1” allows vSAN to kill the “ghosted” VMs (those with all components inaccessible); a per-host scripting sketch follows below
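This host-level advanced setting can be flipped through the UI, through esxcli, or from a script. Purely as an illustration, a hedged pyVmomi sketch, assuming a direct connection to a placeholder ESXi host, could look like this:

    # Sketch: enable VSAN.AutoTerminateGhostVm on a single ESXi host.
    # Host name and credentials are placeholders; this is equivalent to
    # "esxcli system settings advanced set -o /VSAN/AutoTerminateGhostVm -i 1".
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host='esxi-01.lab.local', user='root', pwd='password',
                      sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    host = view.view[0]  # connected directly to the host, so there is only one
    view.Destroy()

    # The value is an integer: 1 enables the behavior, 0 disables it
    # (some pyVmomi versions may want an explicit long here).
    host.configManager.advancedOption.UpdateOptions(
        changedValue=[vim.option.OptionValue(key='VSAN.AutoTerminateGhostVm', value=1)])
    Disconnect(si)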
In other words, don’t worry about the isolation address in a 2-node configuration, vSAN has this situation covered! Note that “VSAN.AutoTerminateGhostVm=1” only works for 2-node and Stretched vSAN configurations at this time.
UPDATE:
I triggered a failure in my lab (which is 2-node, but not direct connect), and for those who are wondering, this is what you should be seeing in your syslog.log:
syslog.log:2017-11-29T13:45:28Z killInaccessibleVms.py [INFO]: Following VMs are powered on and HA protected in this host.
syslog.log:2017-11-29T13:45:28Z killInaccessibleVms.py [INFO]: * ['vm-01', 'vm-03', 'vm-04']
syslog.log:2017-11-29T13:45:32Z killInaccessibleVms.py [INFO]: List inaccessible VMs at round 1
syslog.log:2017-11-29T13:45:32Z killInaccessibleVms.py [INFO]: * ['vim.VirtualMachine:1', 'vim.VirtualMachine:2', 'vim.VirtualMachine:3']
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: List inaccessible VMs at round 2
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: * ['vim.VirtualMachine:1', 'vim.VirtualMachine:2', 'vim.VirtualMachine:3']
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Following VMs are found to have all objects inaccessible, and will be terminated.
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: * ['vim.VirtualMachine:1', 'vim.VirtualMachine:2', 'vim.VirtualMachine:3']
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Start terminating VMs.
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Successfully terminated inaccessible VM: vm-01
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Successfully terminated inaccessible VM: vm-03
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Successfully terminated inaccessible VM: vm-04
syslog.log:2017-11-29T13:46:06Z killInaccessibleVms.py [INFO]: Finished killing the ghost vms
Johann says
Awesome, thank you Duncan for the detailed clarification.
Egbert Fielden says
Hi Johann, did this work for you? It didn’t for us.
Thanks,
Egbert
Johann Stander says
Hi Egbert,
Yes, this worked for me: my VM was automatically powered off and restarted on the other host. (It took around a minute and a half before the VM was powered off, so give it some time.)
The way I tested it was to remove all the physical network adapters from the vSAN VDS on the host where the VM is running.
My HA settings are as follows:
Host monitoring = enabled
Host failure response = restart VMs
Response to Host Isolation = Disabled
PDL and APD = Disabled
Heartbeat datastores = Use datastores only from the specified list. Nothing selected
Advanced options – das.ignoreInsufficientHbDatastore = True
Also on each host set advanced system settings – VSAN.AutoTerminateGhostVm = True
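If you want to double-check that host-level setting from a script, a quick pyVmomi read-back along these lines (vCenter name and credentials are placeholders) should print the current value per host:

    # Sketch: read VSAN.AutoTerminateGhostVm back from every host to verify it.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.lab.local', user='administrator@vsphere.local',
                      pwd='password', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.HostSystem], True)
    for host in view.view:
        for opt in host.configManager.advancedOption.QueryOptions('VSAN.AutoTerminateGhostVm'):
            print(host.name, opt.key, opt.value)
    view.Destroy()
    Disconnect(si)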
Hope this helps.
Steven Rodenburg says
Hello,
The value for “VSAN.AutoTerminateGhostVm” is either “1” or “0”. Not “True” or “False”.
I’ve built a number of 2-Node back-to-back vSAN ROBOs, and I assume the wizard in vSAN 6.6 automatically sets this value to “1” (because I never had to touch it).
Kind regards,
Steven Rodenburg
Johann Stander says
Sorry I just typed that incorrectly, thanks for the correction Steven.
Max Dembo says
In a vSAN 2 node direct connect deployment, there is no isolation address to set and you get:
– a red circle on the host (with all green circle in the vSAN Health)
– a warning “This host has no isolation address defined as required by vSphere HA”
– a cluster configuration issue
Is it acceptable?
How to suppress this unresolvable warning?
Johann Stander says
Hey Max. By default, vSphere HA uses the default gateway of the console network as an isolation address, so please make sure you can ping it from both hosts. Also make sure it is not disabled using advanced configuration parameters like das.usedefaultisolationaddress = false.
Max Dembo says
Hi Johann,
so when using a direct connect 2 node vSAN cluster, is it OK to use the management network for HA?
We know perfectly well that with vSAN clusters configured using switches (4 or more nodes) we have to use the vSAN network for the HA heartbeat.
The vSAN 2 node guide is not very clear:
“Note that a 2-node Direct Connect configuration is a special case. In this situation it is impossible to configure a valid external isolation address within the vSAN network. VMware recommends disabling the isolation response for a 2-node Direct Connect configuration”
Also, Cormac and Duncan have not recommended using the default gateway.
Thanks!
Johann says
Hey Max, sorry for the late reply, and my bad, I just noticed I spaced out in my response. You are right: when vSAN is enabled, vSphere HA uses the vSAN traffic for heartbeating. Do you have an IP address specified for the vSphere HA advanced option ‘das.isolationaddress’, or do you maybe have a gateway address specified for the vSAN VMkernel on the direct connect network? I have not seen that message before when setting up a 2 node direct-connect. Have you opened a case with VMware, maybe to just have them validate your config?
Duncan Epping says
Sorry for the slow response from my side; somehow comments do not end up in my mailbox anymore and I just noticed your comments. I think the article explains why an isolation address is not needed in a 2 node configuration. However, it indeed doesn’t explain the error message you are getting.
This message, and I just validated it on the latest build of vSphere, is new to me as well. I have it in my environment now too, and it can’t be suppressed through the UI, which is annoying to say the least. Let me dig internally, or talk to engineering, to see if there is an advanced option to remove the message completely.
Of course, what you can do is set “das.isolationaddress0” to, for instance, 127.0.0.1; that way you won’t get an error at least.
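Scripted, that is just one more HA advanced option; a minimal pyVmomi sketch (placeholder vCenter and cluster names again, and the UI does exactly the same) could look like this:

    # Sketch: point das.isolationaddress0 at 127.0.0.1 so HA has an address
    # to validate. Names and credentials are placeholders.
    import ssl
    from pyVim.connect import SmartConnect, Disconnect
    from pyVim.task import WaitForTask
    from pyVmomi import vim

    si = SmartConnect(host='vcenter.lab.local', user='administrator@vsphere.local',
                      pwd='password', sslContext=ssl._create_unverified_context())
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.ClusterComputeResource], True)
    cluster = next(c for c in view.view if c.name == 'vsan-cluster')
    view.Destroy()

    spec = vim.cluster.ConfigSpecEx()
    spec.dasConfig = vim.cluster.DasConfigInfo()
    spec.dasConfig.option = [
        vim.option.OptionValue(key='das.isolationaddress0', value='127.0.0.1')]
    WaitForTask(cluster.ReconfigureComputeResource_Task(spec, modify=True))
    Disconnect(si)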
I will also look at the 2-node guide and talk to the authors about what the recommendation should look like, as things are confusing indeed.
Duncan Epping says
Okay, I am reading through the various documents and I understand where the confusion is. There are two things here:
1. The isolation address
2. The isolation response
By default, the isolation address is the default gateway of the management interface. vSphere will validate whether an isolation address is available to use, regardless of how the isolation response is configured.
For 2 node configurations, or vSAN in general, we have recommended that people disable the use of the default gateway by setting das.usedefaultisolationaddress to false. Next, you would set das.isolationaddress to an address on the vSAN network.
For 2 node direct connect, however, you won’t be using the isolation response. vSphere still checks whether there is an isolation address of any kind available that it can use, though. So if you set das.usedefaultisolationaddress to true, then you don’t get the error, provided you have a gateway configured on the management network of course.
Hope that helps.
Max Dembo says
Hi Johann, I will open a support case and post the VMware answer.
Regards.
Max Dembo says
Hello Duncan, thanks for your time.
I will leave the default HA settings. I hope that the HA section of the 2 node vSAN guide will be updated to clarify this point.
In particular, it has to be specified that for a direct connect setup these 2 advanced settings must not be used:
das.usedefaultisolationaddress False
das.isolationaddress0 IP address on vSAN network
Best Regards.
Duncan Epping says
I made the changes here: https://storagehub.vmware.com/t/vmware-vsan/vsan-2-node-guide/cluster-settings-vsphere-ha-4/
vspherevm says
Hi Duncan,
I am in a weird situation. I have a 2 node vSAN direct connect cluster.
The server has:
vmk0: Mgmt/Witness traffic (172.16.x IP; I am able to connect to the GW but not to the vSAN IP, for obvious reasons)
vmk1: vMotion
vmk2: vSAN network (10.x IP)
VSAN.AutoTerminateGhostVm is set to “1”
Cluster settings: Host monitoring = enabled
Host failure response = restart VMs
Response to Host Isolation = Disabled
PDL and APD = Disabled
Heartbeat datastores = Use datastores only from the specified list. Nothing selected
Advanced options – das.ignoreInsufficientHbDatastore = True
I am still getting an error message on the cluster: Insufficient configured resources to satisfy the desired vSphere HA failover level on the cluster
And the host state shows as “Host Failed”.
In the fdm log I could see the error: “[ClusterElection::SendAll] [60 times] sendto vsan_IP failed: Host is down”
Can I do something in this case?
Also, after checking vmware docs I could see this link: https://docs.vmware.com/en/VMware-vSphere/6.7/com.vmware.vsphere.avail.doc/GUID-6E053835-F918-4080-A38C-46500F31BCA9.html
This one says I need to have 3 nodes. Is that true?