Archive for the ‘VMWare’ Category

Loss of network interfaces after applying CentOS kernel 2.6.18-308.4.1 on ESXi 4.1 Update 2

One of our hosting companies requires us to keep up to date with RHEL 5 kernels on a server they support. As this is a production machine, when a new kernel is available I start by applying it to development machines and migrate the upgrade through to production over a period of 3-4 weeks.

The first stage of testing is to apply the kernel to various ESXi-hosted VMs and a couple of real servers, most of which run CentOS 5. The most recent kernel, 2.6.18-308.4.1, seems to work OK, but on all of the VMs hosted on ESXi there is an issue with the virtual network interfaces. In all cases I use the VMXNET 3 adapter type. The hosts are running ESXi 4.1 Update 2 with a number of HP-provided bundles.
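
The kernel itself goes on to the CentOS 5 machines via yum in the usual way, something like the following (yum keeps the older kernels installed, so you can always boot back into one if need be):

# yum update kernel
# reboot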

After the application of the updates and the subsequent reboot, only the local loopback network interface starts up. I've seen various suggestions as to the cause but I can't comment on those. My initial fix was to re-install VMware Tools, which seems to do the trick. Having got that far, on a different guest I tried just rerunning vmware-config-tools.pl. That solves the problem immediately (just restart the network service afterwards), and the fix appears to persist across reboots.
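
For reference, the quicker fix on an affected guest amounts to something like this, run as root (the configuration script asks its usual questions):

# vmware-config-tools.pl
# service network restart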

Using APCUPSD in a small ESXi 4 environment

Like most small office environments, where I work depends on mains electricity without any form of battery/generator backup. For a number of years our supply seemed remarkably unreliable compared with my home 15 miles away. It eventually transpired that one of the three phases into the office was overloaded. After the supply company rebalanced the phases things improved greatly. By that time, though, pretty well every device we had was run through a desk-side UPS and we had become pretty adept at setting them up.

At the time all our virtual servers were running VMware Server Version 1, which was probably the best bit of free software I'll ever use. We ran the hypervisor on CentOS 5, which meant we had a full Linux OS available to run such things as the UPS monitoring software. It also meant we had a full scripting environment to handle the shut down and start up of the virtual machines either side of a power failure.

All our UPS units are made by APC, so the obvious Linux service to run was APCUPSD. The UPS units were all attached via USB to their relevant hosts. We configured the UPS units so that as soon as a power failure was detected, APCUPSD gave a 3 minute grace period to see if the power would come back. In our experience, if the power failed for more than 30 seconds it might be off for hours so 3 minutes was more than enough to decide if a full shut down was needed.
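
The grace period comes from apcupsd's TIMEOUT setting. A minimal sketch of the relevant part of apcupsd.conf for a USB-attached APC unit (not our exact file) looks like this:

UPSCABLE usb
UPSTYPE usb
DEVICE
# shut the system down after 180 seconds on battery, regardless of remaining charge
TIMEOUT 180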

When we recently moved from VMware Server 1 to ESXi 4, UPS support was a real headache. With no host OS on which to install APCUPSD, we had to work out how to make a guest monitor the host's UPS and then manage the shut down of the host and all of the associated guests. This rather felt like sitting on a branch and sawing through the wrong side, because the final act of the guest has to be to shut down the host the guest is actually running on and then tell the UPS to shut down. The solution we came up with runs like this:

The UPS monitoring guest consists of:

  • CentOS 5, which gives us a fully featured Linux OS.
  • The vCLI so we can control the host (and thus the other hosted guests).
  • APCUPSD so we can communicate with and control the UPS.
  • A custom APCUPSD doshutdown script in /etc/apcupsd/.
  • A custom script that finds and shuts down every guest on a specified ESXi host via vCLI commands.

The UPS is connected to the host via USB and the guest communicates with the UPS via a virtual USB connection. That's all pretty straightforward until we get a power failure, at which point we hope things go like this…

  • APCUPSD on the UPS monitoring guest gets notified of the UPS status change to On Battery.
  • After 3 minutes on battery power, the UPS monitoring guest runs the custom doshutdown script (sketched after this list), which
    • calls a custom local script that sends a shut down signal via the vCLI to all of the other guests.
    • sleeps for 45 seconds to ensure all the guests are down. 45 seconds is enough for our guests, but you would need to test this in your own environment if you were trying to do the same.
  • That leaves us with just the UPS monitoring guest running on the host and the host itself.
  • We must now stop the APCUPSD service so that the USB killpower signal will actually be sent. It took me days to realise that if you leave the service running, the killpower signal is simply ignored 😦
  • The UPS monitoring guest then runs the APCUPSD killpower command which tells the UPS to shut down, but (and this is the crucial part) the UPS gives you DSHUTD seconds of power before it does so. We set DSHUTD to 90 seconds.
  • The UPS monitoring guest tells its own host to shut down.
  • The host is configured to shut down any running guests, of which the UPS monitoring guest should now be the only one.
  • The UPS monitoring guest gets the shut down signal, which it obeys.
  • The host shuts down. All this has to happen in DSHUTD seconds, which in our case it does easily.
  • Finally, after DSHUTD seconds the UPS shuts down.
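
To make that sequence concrete, here is a much-reduced sketch of the sort of thing our doshutdown script does. The credentials file, the "ups-monitor" name filter and the exact vCLI invocations are illustrative rather than copied from our script, and the timings are the ones described above; check the vmware-cmd and vicfg-hostops options against your own vCLI version.

#!/bin/bash
# Sketch of the order of operations only - not our production doshutdown script.
. /etc/apcupsd/host.credentials        # hypothetical file setting VI_SERVER, VI_USERNAME, VI_PASSWORD
LOG=/tmp/ups-shutdown.log
log() { echo "$(date '+%H:%M:%S') $*" >> "$LOG"; }

log "on battery too long - shutting down the other guests"
vmware-cmd -H "$VI_SERVER" -U "$VI_USERNAME" -P "$VI_PASSWORD" -l | grep -vi ups-monitor |
while read -r vmx; do
    log "stop soft: $vmx"
    vmware-cmd -H "$VI_SERVER" -U "$VI_USERNAME" -P "$VI_PASSWORD" "$vmx" stop soft >> "$LOG" 2>&1
done

sleep 45                               # long enough for our guests; test it in yours
log "stopping apcupsd so the killpower request is honoured"
service apcupsd stop >> "$LOG" 2>&1

log "telling the UPS to cut power after DSHUTD seconds"
apcupsd --killpower >> "$LOG" 2>&1

log "asking the host to shut itself (and this guest) down"
vicfg-hostops --server "$VI_SERVER" --username "$VI_USERNAME" --password "$VI_PASSWORD" \
    --operation shutdown --force >> "$LOG" 2>&1    # --force because the host is not in maintenance mode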

As each command runs in the custom doshutdown script, it writes what it’s doing to a log file in /tmp on the UPS monitoring guest. You can’t see this at the time the shut down is happening, but after the event and, in particular, for testing, it’s very good to be able to see that all went well (or not) and how much time each stage took.

Be very careful if, like us, you have multiple ESXi hosts that use the same scripts. The UPS monitoring guests send the shut down commands via the network so they are more than capable of shutting down the wrong host if you give them the wrong set of credentials. Keep the credentials in separate files from the scripts so you can propagate updated versions of the scripts to all interested guests without the risk of them all trying to shut down the same host!
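
One way to do that, and what the doshutdown sketch above assumes, is a small per-host file that the scripts source. The name and contents here are purely illustrative; keep it readable by root only.

# /etc/apcupsd/host.credentials  (hypothetical name - chmod 600)
VI_SERVER=esxi-host-1.example.com
VI_USERNAME=root
VI_PASSWORD=not-the-real-password

With the host name kept alongside the credentials, the scripts themselves contain nothing host-specific and can be pushed to every UPS monitoring guest unchanged.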

Don’t ignore log files until things break….

I look after a number of services that continuously generate log files. Much of the content is difficult to make sense of, and the size of the files makes them impossible to review in detail every day. So, for the most part, they get ignored until something goes wrong. Then you just hope there's something in there to give you a clue as to what the problem is.

As an example, I am the DBA for several DB2 instances running on RHEL 5. These generate large diagnostics logs (db2diag.log) that occasionally contain records I really need to know about. There's a much shorter DB2 notification log, but I prefer to track the lower-level messages. The problem is seeing the “wood for the trees”, i.e. finding the unusual messages I need to worry about amongst those I can safely ignore.

There is another, more insidious problem. Some messages indicate events that are OK from time to time, but a sudden surge in their frequency indicates something is wrong. Simply identifying such messages as ‘noise’ and then excluding them (we all love “grep -v” for this) might be a very bad thing to do.

DB2 comes with a reasonably useful command (db2diag) that allows filtering of the diagnostics log, but it had two issues for me. Firstly, for a long time the version supplied with DB2 V9.7 on RHEL 5 lost the ability to pipe input to the command; I see this has been fixed as of Fix Pack 5. Secondly, I also wanted some way of filtering and counting all of the standard messages I get, leaving the unusual ones to be shown in full.
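
For example, something like the following pulls out just the records I would want to see first (option names as I remember them from the 9.7 db2diag; check db2diag -h on your fix pack):

$ db2diag -level Severe,Error          # only the Severe and Error records
$ db2diag -gi "level=Warning" -H 1d    # Warnings logged in the last day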

My solution was to write my own log file parser. The process of writing the parser proved to be very worthwhile. I learnt a great deal about the content of the messages and the different components that can generate them. I had studied the basics of the diagnostics log for a DB2 certification exam, but there’s nothing like working with the file for really getting a better understanding of it. My parser is nowhere near as complete as the db2diag command but it handles the messages I commonly get and simply reports in full any messages it doesn’t recognise.

In practice, the parser gets called every day under cron for each DB2 instance. The process is as follows (a much-reduced sketch of the idea appears after the list):

  • For messages the parser recognises, count the number of times each one occurs.
  • Report in full any message the parser doesn’t recognise.
  • Produce a summary report of recognised message counts at the end.
  • The whole report is emailed to me.
  • The parsed log file is then archived. A new log file gets created automatically and old ones are later deleted by a separate clean up task.
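
To give a flavour of the approach, here is a stripped-down sketch of the idea in shell and awk (my real parser is rather larger and matches on far more than the LEVEL field); it counts records by level and prints, up to a cap, any record it doesn't recognise. The db2diag.log path shown is the usual default for an instance called db2inst1 and will differ on your system.

#!/bin/bash
# Sketch only: count db2diag.log records by LEVEL and print unrecognised ones (max 50).
LOG=/home/db2inst1/sqllib/db2dump/db2diag.log

awk '
BEGIN { RS=""; FS="\n"; cap=50 }               # db2diag.log records are separated by blank lines
{
    level = "unknown"
    if      ($0 ~ /LEVEL: Info/)    level = "Info"
    else if ($0 ~ /LEVEL: Event/)   level = "Event"
    else if ($0 ~ /LEVEL: Warning/) level = "Warning"

    if (level == "unknown") {
        unfiltered++
        if (unfiltered <= cap) print $0 "\n"   # show the full record, up to the cap
    } else {
        counts[level]++
        filtered++
    }
}
END {
    printf "Messages found:      %6d\n", NR
    printf "Messages unfiltered: %6d\n", unfiltered + 0
    printf "Messages filtered:   %6d\n", filtered + 0
    for (l in counts) printf "  %-10s %6d\n", l, counts[l]
}' "$LOG"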

Most days there are no unknown messages, so the report is very simple. In case things go badly wrong, the report stops writing unknown messages out after the first 50: I don't want a 100MB email with 10 million occurrences of the same message, and 50 is enough to tell me something has broken!

What isn’t in my parser, and probably should be, is the ability to indicate that a particular message has appeared an unusual number of times. In truth, I’m so familiar with my own reports that I have a good idea of what to expect. However, if a number of people are supporting your system then this would probably be a good addition.

The parser means that, in effect, I read the whole of the db2 diagnostics logs from several DB2 instances every day and I do it in a matter of a few seconds. The emails containing the report get saved and take up very little disk space. When unusual messages get generated they are very obvious and I can decide if I need to do something about them. These can be an early warning of a problem that is going to become a very big problem later.

A typical report looks like this (“Unfiltered” messages are ones shown in full):

Message processing completed
============================
Message timestamps range from 2012-03-27-03.00.02 to 2012-03-28-03.00.01
Messages found:        4327
Messages unfiltered:      0
Messages filtered:     4327

Filtered Message Analysis
-------------------------
Message Type                            Occurred
0 - Info                                     152
1 - Event                                   3432
2 - Health Monitor, Runstats                   3
2 - Load starting                            124
2 - Utilities Note                           120
2 - Utility phase complete                   496

It's not just DB2 that I apply this process to. The ESXi hosts I look after are configured to send their syslogs to remote Linux servers, and a similar script parses these. In the case of ESXi, not only do I look for unusual messages, but I also get to see that regular jobs have run, e.g. auto-backup.sh (every hour) and tmpwatch (every day).
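
For the record, pointing an ESXi 4.1 host at a remote syslog server can be done from the vCLI along these lines (ours were set up a while ago, so treat the options as a sketch and check vicfg-syslog --help):

# vicfg-syslog --server=<ESXiHost> --username=root --setserver <loghost> --setport 514
# vicfg-syslog --server=<ESXiHost> --username=root --show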

Applying ESXi 4.1 Update 2 to HP Proliant servers

I'm in the process of applying VMware ESXi 4.1 Update 2 to some HP Proliant servers. Once I'd found the various steps required, the process was pretty straightforward. However, as is often the case when you're on your own with a job you've not done before, the research/head scratching/gnashing of teeth took a lot longer than I'd expected. As the process seems very poorly documented, I thought I would start this blog by detailing my experiences.

I only have a small number of servers to update. They were already running the HP OEM version of ESXi 4.1 Update 1. Some of the servers are local to me, the others are in a remote data centre. One of the local servers (a DL380 G5) is a sandpit machine that runs no meaningful loads. It is available to take the strain in case one of the main servers fails; the rest of the time it is used for testing. On all the servers, the disks containing guests are internal and are configured as RAID 1 pairs. ESXi is installed on the first disk pair.

The environment I work in does not allow for guests to use vMotion. I can manually move guests and I had done so by moving a range of guests onto the test server (after it had had Update 2 applied) to check that the guests worked as expected and that VMWare tools updated without issue. These guests were then just moved back to their normal hosts after those hosts had been updated.

The HP OEM version of ESXi comes with various HP agents added in for things like hardware monitoring. If you just install the vanilla ESXi 4.1 you miss out on things that should make your life a lot easier when a power supply or fan fails. The original installation of ESXi 4.1 Update 1 (with the HP bundle included) had been a simple matter of downloading the installation image from the VMware web site. Getting the update was more of an issue: VMware only supplied the Update 2 bundle and made no mention of OEM updates, while HP would supply a fresh installation image of ESXi 4.1 with Update 2 included but made no mention of a single update that would apply both the VMware and HP bundles to an existing installation.

So, the solution for me was to get the update bundles separately from VMWare and HP and apply both bundles to each server.

Starting from http://www.vmware.com/patchmgr/findPatch.portal I managed to download update-from-esxi4.1-4.1_update02.zip, which gave me the ESXi 4.1 Update 2 bundle. Finding the HP bundle proved much harder, so much so that at one stage I assumed one didn't exist. I eventually came across it in an HP Alert. The file I needed was called hp-esxi4.1uX-bundle-1.2-25.zip. Once you know what the file is called, it's easy to find!

The bundles are applied using the vihostupdate command, which is part of the vCLI; in my case the vCLI is installed on a separate physical server running RHEL 6.2 64-bit. The host-level operations (shutting down guests, maintenance mode, reboots) were done from a vSphere Client via vCenter Server. I applied the VMware bundle, rebooted, then applied the HP bundle and rebooted again. You can probably do both updates with a single reboot, but having got the process to work with two reboots on my test server I stuck with that for the other servers. So, the process is as follows:

  • Shut down the guests.
  • Put the host into maintenance mode.
  • Stop any guests from starting automatically when the host boots.
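
We drove those first three steps from the vSphere Client, but if you would rather stay on the vCLI box, the maintenance mode step looks roughly like this (a sketch; check vicfg-hostops --help on your vCLI version):

# vicfg-hostops --server=<ESXiHost> --username=root --operation enter
# vicfg-hostops --server=<ESXiHost> --username=root --operation info
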
  • List what’s in the bundle
# vihostupdate --username=root --server=<ESXiHost> -l -b update-from-esxi4.1-4.1_update02.zip

Enter password:

---------Bulletin ID---------   ----------------Summary-----------------
ESXi410-201110201-SG            Updates ESXi 4.1 Firmware
ESXi410-201110202-UG            Updates ESXi 4.1 VMware Tools
ESXi410-Update02                VMware ESXi 4.1 Complete Update 2
  • List what’s already installed
# vihostupdate --username=root --server=<ESXiHost> -q
Enter password:
---------Bulletin ID--------- -----Installed----- ----------------Summary-----------------
ESXi410-201101223-UG          2011-01-13T05:09:39 3w-9xxx: scsi driver for VMware ESXi
ESXi410-201101224-UG          2011-01-13T05:09:39 vxge: net driver for VMware ESXi
hpq-esxi4.1uX-bundle-1.1      2011-03-31T11:26:48 HP ESXi 4.1 Bundle 1.1
hp-nmi-driver-1.2.02          2011-03-31T11:27:30 HP NMI Sourcing Driver for VMware ESX/ESXi 4.1
  • Apply the update “ESXi410-Update02”. Note that I mix my command switch formats a bit here, using -b and --bulletin in preference to -b and -B.
# vihostupdate --username=root --server=<ESXiHost> --install -b update-from-esxi4.1-4.1_update02.zip \
  --bulletin ESXi410-Update02
Enter password:
Please wait patch installation is in progress ...
The update completed successfully, but the system needs to be rebooted for the changes to be effective.

When you run the above command, patience is a virtue. It takes a good couple of minutes during which time you get no progress information.

  • Now you can check that the installation worked by listing what's installed
# vihostupdate --username=root --server=<ESXiHost> -q
Enter password:
---------Bulletin ID--------- -----Installed----- ----------------Summary-----------------
hpq-esxi4.1uX-bundle-1.1      2011-03-31T11:26:48 HP ESXi 4.1 Bundle 1.1
hp-nmi-driver-1.2.02          2011-03-31T11:27:30 HP NMI Sourcing Driver for VMware ESX/ESXi 4.1
ESXi410-Update02              2012-03-02T13:56:58 VMware ESXi 4.1 Complete Update 2

We can see our ESXi410-Update02 bulletin is now reported.

  • As I mentioned above, at this point I rebooted the host but I guess you can just carry on and install the HP offline bundle.
  • Again, check what’s in the bundle
# vihostupdate --username=root --server=<ESXiHost> --list -b hp-esxi4.1uX-bundle-1.2-25.zip
Enter password:

---------Bulletin ID---------   ----------------Summary-----------------
hpq-esxi4.1uX-bundle-1.2-25     HP ESXi 4.1 Bundle 1.2-25
  • And apply it
# vihostupdate --username=root --server=<ESXiHost> --install -b hp-esxi4.1uX-bundle-1.2-25.zip \
  --bulletin hpq-esxi4.1uX-bundle-1.2-25
Enter password:
Please wait patch installation is in progress ...
The update completed successfully, but the system needs to be rebooted for the changes to be effective.

Once again, have a little patience whilst the installation takes place.

  • That's the installation done, so now you have to reboot the host. It should come back up in maintenance mode.
  • Take the host out of maintenance mode and you're ready to start booting the guests (a vCLI sketch of these host operations follows this list).
  • At some point, re-enable automatic guest start-up when the host boots, if that's applicable in your environment.
  • Each guest will need VMware Tools updating and configuring, which requires the guests to be rebooted.
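
As with entering maintenance mode, the reboot and the return to normal operation can be driven from the vCLI if you prefer the command line to the vSphere Client. A sketch, with the host name as a placeholder:

# vicfg-hostops --server=<ESXiHost> --username=root --operation reboot
(wait for the host to come back)
# vicfg-hostops --server=<ESXiHost> --username=root --operation exit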