Don’t ignore log files until things break….

I look after a number of services that continuously generate log files. Much of the content is difficult to make sense of and the size of the files make them impossible to review in detail every day. Of course, for the most part they get ignored until something goes wrong. Then you just hope there’s something in there to give you a clue as to what the problem is.

As an example, I am the DBA for several DB2 instances running on RHEL 5. These generate large diagnostics logs (db2diag.log) that occasionally contain records that I really need to know about. There’s a much shorter db2 notification log but I really prefer to track the lower level messages. The problem is seeing the “wood from the trees”, i.e. finding the unusual messages I need to worry about in amongst those I can safely ignore.

There is another more insidious problem. Some  messages indicate events that are OK from time to time but a sudden surge in their frequency indicates something is wrong. Simply identifying such messages as ‘noise’ and then excluding them (we all love “grep -v” for this) might be a very bad thing to do.

DB2 comes with a reasonably useful command (db2diag) that allows filtering of the diagnostics log, but there were two issues with it for me. Firstly, on RHEL 5 for a long time the version supplied with DB2 V9.7 lost the ability to pipe input to the command. This I see has been fixed as of Fix Pack 5, but my second issue is that I also wanted some way of filtering and counting all of the standard messages I got, leaving the unusual ones to be shown in full.

My solution was to write my own log file parser. The process of writing the parser proved to be very worthwhile. I learnt a great deal about the content of the messages and the different components that can generate them. I had studied the basics of the diagnostics log for a DB2 certification exam, but there’s nothing like working with the file for really getting a better understanding of it. My parser is nowhere near as complete as the db2diag command but it handles the messages I commonly get and simply reports in full any messages it doesn’t recognise.

In practise, the parser gets called every day under cron for each db2 instance. The process is as follows:

  • For messages the parser recognises, count the number of times each one occurs.
  • Report in full any message the parser doesn’t recognise.
  • Produce a summary report of recognised message counts at the end.
  • The whole report is emailed to me.
  • The parsed log file is then archived. A new log file gets created automatically and old ones are later deleted by a separate clean up task.

Most days there are no unknown messages so the report is very simple. In case things go badly wrong I have a cap of 50 unknown messages before the report stops writing them out – I don’t want a 100MB email with a 10 million occurrences of the same message – 50 is enough to tell me something has broken!

What isn’t in my parser, and probably should be, is the ability to indicate that a particular message has appeared an unusual number of times. In truth, I’m so familiar with my own reports that I have a good idea of what to expect. However, if a number of people are supporting your system then this would probably be a good addition.

The parser means that, in effect, I read the whole of the db2 diagnostics logs from several DB2 instances every day and I do it in a matter of a few seconds. The emails containing the report get saved and take up very little disk space. When unusual messages get generated they are very obvious and I can decide if I need to do something about them. These can be an early warning of a problem that is going to become a very big problem later.

A typical report looks like this (“Unfiltered” messages are ones shown in full):

Message processing completed
============================
Message timestamps range from 2012-03-27-03.00.02 to 2012-03-28-03.00.01
Messages found:        4327
Messages unfiltered:      0
Messages filtered:     4327

Filtered Message Analysis
-------------------------
Message Type                            Occurred
0 - Info                                     152
1 - Event                                   3432
2 - Health Monitor, Runstats                   3
2 - Load starting                            124
2 - Utilities Note                           120
2 - Utility phase complete                   496

It’s not just DB2 that I apply this process to. The ESXi hosts I look after are configured to send their syslogs to remote Linux servers. A similar script parses these. In the case of ESXi, not only do I look for unusual messages, but I get to see that regular jobs have run e.g. auto-backup.sh (every hour) and tmpwatch (every day) .

Advertisements

No comments yet

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: