Using APCUPSD in a small ESXi 4 environment

Like most small offices, the one where I work depends on mains electricity without any form of battery or generator backup. For a number of years our supply seemed remarkably unreliable compared with the supply at my home 15 miles away. It eventually transpired that one of the three phases into the office was overloaded. After the supply company rebalanced the phases, things improved greatly. By that time, though, pretty well every device we had was run through a desk-side UPS and we had become pretty adept at setting them up.

At the time all our virtual servers were running VMware Server version 1, which was probably the best bit of free software I’ll ever use. We ran the hypervisor on CentOS 5. That meant we had a full Linux OS available to run such things as the UPS monitoring software. It also meant we had a full scripting environment to handle the shut down and start up of the virtual machines either side of a power failure.

All our UPS units are made by APC, so the obvious Linux service to run was APCUPSD. The UPS units were all attached via USB to their relevant hosts. We configured APCUPSD so that, as soon as a power failure was detected, it gave a 3-minute grace period to see if the power would come back. In our experience, if the power failed for more than 30 seconds it might be off for hours, so 3 minutes was more than enough to decide if a full shut down was needed.

When we recently moved from VMware Server 1 to ESXi 4, UPS support was a real headache. With no host OS on which to install APCUPSD, we had to work out how to make a guest monitor the host’s UPS and then manage the shut down of the host and all of its guests. This rather felt like sitting on a branch and sawing through the wrong side, because the final act of the guest has to be to shut down the host it is actually running on and then tell the UPS to shut down. The solution we came up with runs like this:

The UPS monitoring guest consists of:

  • CentOS 5, which gives us a fully featured Linux OS.
  • The vCLI so we can control the host (and thus the other hosted guests).
  • APCUPSD so we can communicate with and control the UPS.
  • A custom APCUPSD doshutdown script in /etc/apcupsd/.
  • A custom script that finds and shuts down every guest on a specified ESXi host via vCLI commands.
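The guest shut down script in the last item could be sketched roughly as follows. This is an illustration rather than the script we actually run: it assumes Python 3.7 or later is installed on the monitoring guest, that the vCLI vmware-cmd tool is on the PATH and accepts -H, -U and -P for the connection plus -l to list guests and "stop soft" to request a clean shut down, and that the monitoring guest is identified by its own .vmx path. The function names are made up for the example.

```python
import subprocess


def vmware_cmd(host, user, password, *args):
    """Build a vmware-cmd argument list for one ESXi host (flags are assumptions)."""
    return ["vmware-cmd", "-H", host, "-U", user, "-P", password] + list(args)


def parse_vm_list(output):
    """Return the .vmx config paths found in the output of `vmware-cmd -l`."""
    return [line.strip() for line in output.splitlines()
            if line.strip().endswith(".vmx")]


def stop_other_guests(host, user, password, own_vmx, run=subprocess.run):
    """Ask every guest except the monitoring guest to shut down cleanly.

    `run` is injectable so the logic can be exercised without a real host.
    Returns the list of .vmx paths that were sent a stop request.
    """
    listing = run(vmware_cmd(host, user, password, "-l"),
                  capture_output=True, text=True, check=True)
    stopped = []
    for vmx in parse_vm_list(listing.stdout):
        if vmx == own_vmx:
            continue  # the monitoring guest has to stay up to finish the job
        # check=False: one stubborn guest should not abort the whole shut down
        run(vmware_cmd(host, user, password, vmx, "stop", "soft"), check=False)
        stopped.append(vmx)
    return stopped
```

Passing the password on the command line makes it visible in the process list, so a real script would probably prefer whatever credential or session file mechanism your vCLI version offers.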

The UPS is connected to the host via USB and the guest communicates with the UPS via a virtual USB connection. That’s all pretty straightforward until we get a power failure, at which point we hope things go like this…

  • APCUPSD on the UPS monitoring guest gets notified of the UPS status change to On Battery.
  • After 3 minutes on battery power, the UPS monitoring guest runs the custom doshutdown script which
    • calls a custom local script that sends a shut down signal via the vCLI to all of the other guests.
    • sleeps for 45 seconds to ensure all the guests are down. 45 seconds is enough for our guests, but you would need to test this in your own environment if you were trying to do the same.
  • That leaves us with just the UPS monitoring guest running on the host and the host itself.
  • We must now stop the APCUPSD service so that the killpower signal can be sent over USB. It took me days to realise that you must stop the APCUPSD service first, otherwise the APCUPSD killpower signal is ignored 😦
  • The UPS monitoring guest then runs the APCUPSD killpower command which tells the UPS to shut down, but (and this is the crucial part) the UPS gives you DSHUTD seconds of power before it does so. We set DSHUTD to 90 seconds.
  • The UPS monitoring guest tells its own host to shut down.
  • The host is configured to shut down any running guests, of which the UPS monitoring guest should now be the only one.
  • The UPS monitoring guest gets the shut down signal, which it obeys.
  • The host shuts down. All this has to happen in DSHUTD seconds, which in our case it does easily.
  • Finally, after DSHUTD seconds the UPS shuts down.
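The ordering of those steps matters more than anything else, so here is a minimal sketch of how the doshutdown logic could be laid out as an ordered plan. It assumes Python 3 on the monitoring guest, that `service apcupsd stop` and `apcupsd --killpower` are how the service is stopped and the kill power request is sent on your system, and that the two commands passed in (the guest shut down script and the vCLI host shut down call) are supplied by you. The names shutdown_plan and run_plan are made up for the example.

```python
import subprocess
import time

GUEST_WAIT = 45  # seconds we allow the other guests to stop (value from the text)


def shutdown_plan(stop_guests_cmd, host_shutdown_cmd):
    """Return the steps in the order they must run.

    A list entry is a command to execute; a number is a pause in seconds.
    """
    return [
        ("stop other guests", stop_guests_cmd),
        ("wait for guests", GUEST_WAIT),
        # apcupsd must be stopped first, or the killpower request is ignored
        ("stop apcupsd", ["service", "apcupsd", "stop"]),
        # starts the UPS countdown of DSHUTD seconds
        ("kill UPS power", ["apcupsd", "--killpower"]),
        # everything from here on has to finish inside DSHUTD seconds
        ("shut down host", host_shutdown_cmd),
    ]


def run_plan(plan, run=subprocess.call, sleep=time.sleep):
    """Execute each step in order; `run` and `sleep` are injectable for dry runs."""
    for label, action in plan:
        if isinstance(action, (int, float)):
            sleep(action)
        else:
            run(action)
```

For a dry run you can pass `run=print` and `sleep=print` to see the sequence without touching anything. The host shut down command would be something like a vCLI vicfg-hostops call with a shutdown operation, but check the exact options against your own vCLI version.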

As each command runs in the custom doshutdown script, it writes what it’s doing to a log file in /tmp on the UPS monitoring guest. You can’t see this at the time the shut down is happening, but after the event and, in particular, for testing, it’s very good to be able to see that all went well (or not) and how much time each stage took.
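A stage log of that kind needs very little code. The sketch below is an assumption about how it could look in Python 3, not a copy of our script: the StageLog class and the log path are hypothetical. Each line carries a timestamp and the number of seconds since the script started, and is flushed to disk straight away because the machine is about to lose power.

```python
import os
import time


class StageLog:
    """Append timestamped stage messages to a log file."""

    def __init__(self, path, clock=time.time):
        self.path = path
        self.clock = clock          # injectable so timings can be tested
        self.start = clock()

    def write(self, message):
        now = self.clock()
        stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
        elapsed = int(now - self.start)
        with open(self.path, "a") as handle:
            handle.write("%s +%ds %s\n" % (stamp, elapsed, message))
            handle.flush()
            os.fsync(handle.fileno())  # push it to disk before the power goes
```

Usage would be along the lines of `log = StageLog("/tmp/ups-shutdown.log")` followed by `log.write("guests told to stop")` before each stage.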

Be very careful if, like us, you have multiple ESXi hosts that use the same scripts. The UPS monitoring guests send the shut down commands via the network so they are more than capable of shutting down the wrong host if you give them the wrong set of credentials. Keep the credentials in separate files from the scripts so you can propagate updated versions of the scripts to all interested guests without the risk of them all trying to shut down the same host!
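One way to keep the credentials out of the scripts is a small key=value file that lives only on each monitoring guest. The sketch below assumes Python 3, a file format of my own invention with host, user and password keys, and a hypothetical path such as /etc/apcupsd/esxi-host.conf. It refuses to continue if any of the three values is missing, so a half-copied file cannot send commands to an unintended host.

```python
REQUIRED_KEYS = ("host", "user", "password")


def load_credentials(path):
    """Read key=value pairs from a per-guest file and check they are complete."""
    creds = {}
    with open(path) as handle:
        for raw in handle:
            line = raw.strip()
            if not line or line.startswith("#"):
                continue  # skip blank lines and comments
            key, sep, value = line.partition("=")
            if not sep:
                raise ValueError("bad line in %s: %r" % (path, line))
            creds[key.strip()] = value.strip()
    missing = [key for key in REQUIRED_KEYS if not creds.get(key)]
    if missing:
        raise ValueError("missing keys in %s: %s" % (path, ", ".join(missing)))
    return creds
```

The shared scripts then call `load_credentials()` at start up and never contain a host name themselves, so copying a new version of a script to every guest cannot change which host each one talks to.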