Archive for March, 2012|Monthly archive page

Don’t ignore log files until things break…

I look after a number of services that continuously generate log files. Much of the content is difficult to make sense of and the size of the files makes them impossible to review in detail every day. Of course, for the most part they get ignored until something goes wrong. Then you just hope there’s something in there to give you a clue as to what the problem is.

As an example, I am the DBA for several DB2 instances running on RHEL 5. These generate large diagnostics logs (db2diag.log) that occasionally contain records that I really need to know about. There’s a much shorter DB2 notification log but I really prefer to track the lower level messages. The problem is seeing the “wood for the trees”, i.e. finding the unusual messages I need to worry about in amongst those I can safely ignore.

There is another, more insidious problem. Some messages indicate events that are OK from time to time, but a sudden surge in their frequency indicates something is wrong. Simply identifying such messages as ‘noise’ and then excluding them (we all love “grep -v” for this) might be a very bad thing to do.

DB2 comes with a reasonably useful command (db2diag) that allows filtering of the diagnostics log, but there were two issues with it for me. Firstly, on RHEL 5 the version supplied with DB2 V9.7 was, for a long time, unable to accept piped input. I see this has been fixed as of Fix Pack 5. Secondly, I wanted some way of filtering and counting all of the standard messages I get, leaving the unusual ones to be shown in full.

My solution was to write my own log file parser. The process of writing the parser proved to be very worthwhile. I learnt a great deal about the content of the messages and the different components that can generate them. I had studied the basics of the diagnostics log for a DB2 certification exam, but there’s nothing like working with the file for really getting a better understanding of it. My parser is nowhere near as complete as the db2diag command but it handles the messages I commonly get and simply reports in full any messages it doesn’t recognise.

In practice, the parser gets called every day under cron for each DB2 instance. The process is as follows:

  • For messages the parser recognises, count the number of times each one occurs.
  • Report in full any message the parser doesn’t recognise.
  • Produce a summary report of recognised message counts at the end.
  • The whole report is emailed to me.
  • The parsed log file is then archived. A new log file gets created automatically and old ones are later deleted by a separate clean up task.

Most days there are no unknown messages so the report is very simple. In case things go badly wrong I have a cap of 50 unknown messages before the report stops writing them out. I don’t want a 100MB email with 10 million occurrences of the same message; 50 is enough to tell me something has broken!
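The count-and-report process above can be sketched roughly as follows. This is not my actual parser, just a minimal Python illustration. It assumes that records in db2diag.log are separated by blank lines and carry a "LEVEL:" field, and the two rules in RULES, along with the names parse_log and format_report, are hypothetical stand-ins for a much longer set of patterns.

```python
import re
from collections import Counter

MAX_UNKNOWN = 50  # cap from the text: stop writing out unknown messages after 50

# Hypothetical rules for illustration only: (report label, pattern).
# A real parser would have many more, keyed on function, message text and so on.
RULES = [
    ("0 - Info", re.compile(r"LEVEL:\s*Info\b")),
    ("1 - Event", re.compile(r"LEVEL:\s*Event\b")),
]

def split_records(text):
    """Split the log into records, assuming blank lines separate them."""
    return [block.strip() for block in re.split(r"\n\s*\n", text) if block.strip()]

def classify(record):
    """Return the label of the first matching rule, or None if unrecognised."""
    for label, pattern in RULES:
        if pattern.search(record):
            return label
    return None

def parse_log(text, max_unknown=MAX_UNKNOWN):
    """Count recognised records; keep unrecognised ones in full, up to the cap."""
    counts = Counter()
    shown = []
    unknown_total = 0
    for record in split_records(text):
        label = classify(record)
        if label is None:
            unknown_total += 1
            if len(shown) < max_unknown:
                shown.append(record)
        else:
            counts[label] += 1
    return counts, shown, unknown_total

def format_report(counts, shown, unknown_total):
    """Unrecognised records first (in full), then the summary counts."""
    filtered = sum(counts.values())
    lines = list(shown)
    if unknown_total > len(shown):
        lines.append(f"... {unknown_total - len(shown)} further unrecognised messages not shown")
    lines += [
        f"Messages found:      {filtered + unknown_total:>6}",
        f"Messages unfiltered: {unknown_total:>6}",
        f"Messages filtered:   {filtered:>6}",
        "",
        f"{'Message Type':<40}{'Occurred':>8}",
    ]
    lines += [f"{label:<40}{n:>8}" for label, n in sorted(counts.items())]
    return "\n".join(lines)
```

Keeping the total of unknown messages separate from the ones actually kept means the report can still say how bad things were without the email growing out of control.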

What isn’t in my parser, and probably should be, is the ability to indicate that a particular message has appeared an unusual number of times. In truth, I’m so familiar with my own reports that I have a good idea of what to expect. However, if a number of people are supporting your system then this would probably be a good addition.
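One way such a check might look, purely as a sketch: compare today’s counts with the median of the counts from earlier reports and flag anything well above it. The function name find_surges, the factor of 3 and the minimum count of 10 are all assumptions made for illustration, not values I have tested on real logs.

```python
from statistics import median

def find_surges(today, history, factor=3.0, min_count=10):
    """Flag message types whose count today is well above their usual level.

    today   -- {label: count} from today's report
    history -- list of {label: count} dicts from earlier reports
    factor and min_count are illustrative thresholds, not tuned values.
    """
    surges = {}
    for label, count in today.items():
        past = [day.get(label, 0) for day in history]
        baseline = median(past) if past else 0
        # Ignore very small counts so a jump from 1 to 4 is not reported.
        if count >= min_count and count > factor * baseline:
            surges[label] = (count, baseline)
    return surges
```

The median is used rather than the mean so that a single earlier bad day does not raise the baseline and hide the next surge.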

The parser means that, in effect, I read the whole of the DB2 diagnostics logs from several DB2 instances every day and I do it in a matter of a few seconds. The emails containing the report get saved and take up very little disk space. When unusual messages get generated they are very obvious and I can decide if I need to do something about them. These can be an early warning of a small problem that is going to become a very big one later.

A typical report looks like this (“Unfiltered” messages are ones shown in full):

Message processing completed
============================
Message timestamps range from 2012-03-27-03.00.02 to 2012-03-28-03.00.01
Messages found:        4327
Messages unfiltered:      0
Messages filtered:     4327

Filtered Message Analysis
-------------------------
Message Type                            Occurred
0 - Info                                     152
1 - Event                                   3432
2 - Health Monitor, Runstats                   3
2 - Load starting                            124
2 - Utilities Note                           120
2 - Utility phase complete                   496

It’s not just DB2 that I apply this process to. The ESXi hosts I look after are configured to send their syslogs to remote Linux servers. A similar script parses these. In the case of ESXi, not only do I look for unusual messages, but I get to see that regular jobs have run, e.g. auto-backup.sh (every hour) and tmpwatch (every day).
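Checking that the regular jobs have run is the opposite of looking for unusual messages: you are looking for expected lines that are missing. A minimal sketch, assuming each run of a job leaves at least one syslog line containing the job’s name (the function name missing_jobs and the minimum counts are hypothetical):

```python
def missing_jobs(lines, expected_minimums):
    """Return {job: (seen, minimum)} for jobs logged fewer times than expected.

    lines             -- one day's syslog lines for a host
    expected_minimums -- {job name: minimum number of lines expected per day}
    """
    shortfalls = {}
    for job, minimum in expected_minimums.items():
        seen = sum(1 for line in lines if job in line)
        if seen < minimum:
            shortfalls[job] = (seen, minimum)
    return shortfalls

# Assumed expectations based on the schedules mentioned above:
# auto-backup.sh every hour, tmpwatch every day.
EXPECTED = {"auto-backup.sh": 24, "tmpwatch": 1}
```

An empty result means every job appeared at least as often as expected. Anything else goes into the report alongside the unfiltered messages.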

Applying ESXi 4.1 Update 2 to HP ProLiant servers

I’m in the process of applying VMware ESXi 4.1 Update 2 to some HP ProLiant servers. Once I’d found the various steps required, the process was pretty straightforward. However, as is often the case when you’re on your own with a job you’ve not done before, the research/head scratching/gnashing of teeth took a lot longer than I’d expected. As the process seems very poorly documented, I thought I would start this blog by detailing my experiences.

I only have a small number of servers to update. They were already running the HP OEM version of ESXi 4.1 Update 1. Some of the servers are local to me, the others are in a remote data centre. One of the local servers (a DL380 G5) is a sandpit machine that runs no meaningful loads. It is available to take the strain in case one of the main servers fails. The rest of the time it is used for testing. On all the servers, the disks containing guests are internal and are configured as RAID 1 pairs. ESXi is installed on the first disk pair.

The environment I work in does not allow guests to use vMotion. I can move guests manually, so I moved a range of guests onto the test server (after Update 2 had been applied to it) to check that they worked as expected and that VMware Tools updated without issue. These guests were then moved back to their normal hosts once those hosts had been updated.

The HP OEM version of ESXi comes with various HP agents added in for things like hardware monitoring. If you just install the vanilla ESXi 4.1 you miss out on things that should make your life a lot easier when a power supply or fan fails. The original installation of ESXi 4.1 with Update 1 (with the HP bundle included) had been a simple matter of downloading the installation image from the VMware web site. However, getting the update was more of an issue. VMware only supplied the Update 2 bundle and made no mention of OEM updates. HP supplied a fresh installation image of ESXi 4.1 with Update 2 included, but made no mention of a single image that would install both the VMware and HP bundles combined.

So, the solution for me was to get the update bundles separately from VMware and HP and apply both bundles to each server.

Starting from http://www.vmware.com/patchmgr/findPatch.portal I managed to download update-from-esxi4.1-4.1_update02.zip, which gave me ESXi 4.1 Update 2. Finding the HP bundle proved much harder, so much so that at one stage I assumed one didn’t exist. I eventually came across it in an HP Alert. The file I needed was called hp-esxi4.1uX-bundle-1.2-25.zip. Once you know what the file is called, it’s easy to find!

The bundles are applied by using the vihostupdate command. The server level commands were run from a vSphere Client via vCenter Server. The vihostupdate command is part of the vCLI, which in my case is installed on a separate, physical server running 64-bit RHEL 6.2. I applied the VMware bundle, rebooted, then applied the HP bundle and rebooted. You can probably do both software updates with a single reboot, but having got the process to work with two reboots on my test server I stuck with that for the other servers. So, the process is as follows:

  • Shut down the guests.
  • Put the host into maintenance mode.
  • Stop any guests from starting automatically when the host boots.
  • List what’s in the bundle
# vihostupdate --username=root --server=<ESXiHost> -l -b update-from-esxi4.1-4.1_update02.zip

Enter password:

---------Bulletin ID---------   ----------------Summary-----------------
ESXi410-201110201-SG            Updates ESXi 4.1 Firmware
ESXi410-201110202-UG            Updates ESXi 4.1 VMware Tools
ESXi410-Update02                VMware ESXi 4.1 Complete Update 2
  • List what’s already installed
# vihostupdate --username=root --server=<ESXiHost> -q
Enter password:
---------Bulletin ID--------- -----Installed----- ----------------Summary-----------------
ESXi410-201101223-UG          2011-01-13T05:09:39 3w-9xxx: scsi driver for VMware ESXi
ESXi410-201101224-UG          2011-01-13T05:09:39 vxge: net driver for VMware ESXi
hpq-esxi4.1uX-bundle-1.1      2011-03-31T11:26:48 HP ESXi 4.1 Bundle 1.1
hp-nmi-driver-1.2.02          2011-03-31T11:27:30 HP NMI Sourcing Driver for VMware ESX/ESXi 4.1
  • Apply the update “ESXi410-Update02”. Note that I mix my command switch formats a bit here, using -b and --bulletin in preference to using -b and -B.
# vihostupdate --username=root --server=<ESXiHost> --install -b update-from-esxi4.1-4.1_update02.zip \
  --bulletin ESXi410-Update02
Enter password:
Please wait patch installation is in progress ...
The update completed successfully, but the system needs to be rebooted for the changes to be effective.

When you run the above command, patience is a virtue. It takes a good couple of minutes during which time you get no progress information.

  • Now you can check to see that the installation worked by seeing what’s installed
# vihostupdate --username=root --server=<ESXiHost> -q
Enter password:
---------Bulletin ID--------- -----Installed----- ----------------Summary-----------------
hpq-esxi4.1uX-bundle-1.1      2011-03-31T11:26:48 HP ESXi 4.1 Bundle 1.1
hp-nmi-driver-1.2.02          2011-03-31T11:27:30 HP NMI Sourcing Driver for VMware ESX/ESXi 4.1
ESXi410-Update02              2012-03-02T13:56:58 VMware ESXi 4.1 Complete Update 2

We can see our ESXi410-Update02 bulletin is now reported.

  • As I mentioned above, at this point I rebooted the host but I guess you can just carry on and install the HP offline bundle.
  • Again, check what’s in the bundle
# vihostupdate --username=root --server=<ESXiHost> --list -b hp-esxi4.1uX-bundle-1.2-25.zip
Enter password:

---------Bulletin ID---------   ----------------Summary-----------------
hpq-esxi4.1uX-bundle-1.2-25     HP ESXi 4.1 Bundle 1.2-25
  • And apply it
# vihostupdate --username=root --server=<ESXiHost> --install -b hp-esxi4.1uX-bundle-1.2-25.zip \
  --bulletin hpq-esxi4.1uX-bundle-1.2-25
Enter password:
Please wait patch installation is in progress ...
The update completed successfully, but the system needs to be rebooted for the changes to be effective.

Once again, have a little patience whilst the installation takes place.

  • That’s the installation done, so now you have to reboot the host. It should come back up in maintenance mode.
  • Take the host out of maintenance mode and you’re ready to start booting the guests.
  • At some point, re-enable automatic guest startup when the host boots, if that’s applicable in your environment.
  • Each guest will need VMware Tools updating and configuring, which requires the guests to be rebooted.
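
With more than a handful of hosts, the vihostupdate steps above lend themselves to a small wrapper. The following is only a sketch of how that might look in Python, not something I ran: it uses exactly the switches shown in the commands above, the function names are hypothetical, and it defaults to a dry run that just prints each command. Each real command will still prompt for the password, and the guest shutdown, maintenance mode and reboot steps are not covered.

```python
import subprocess

def vihostupdate_cmd(host, *args, username="root"):
    """Build a vihostupdate argument list using only switches shown above."""
    return ["vihostupdate", f"--username={username}", f"--server={host}", *args]

def bundle_steps(host, bundle, bulletin):
    """Commands for one bundle: list it, query the host, install, query again."""
    return [
        vihostupdate_cmd(host, "-l", "-b", bundle),
        vihostupdate_cmd(host, "-q"),
        vihostupdate_cmd(host, "--install", "-b", bundle, "--bulletin", bulletin),
        vihostupdate_cmd(host, "-q"),
    ]

def run_steps(steps, dry_run=True):
    """Print each command; only execute it when dry_run is False."""
    for cmd in steps:
        print(" ".join(cmd))
        if not dry_run:
            # check=True stops at the first failing command.
            subprocess.run(cmd, check=True)
```

For example, run_steps(bundle_steps("<ESXiHost>", "update-from-esxi4.1-4.1_update02.zip", "ESXi410-Update02")) prints the four commands for the VMware bundle without touching the host. The HP bundle would be a second call after the reboot.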