Web Technologies

Monitoring – from high-level Enterprise to low-level techie

May 30th, 2013 at 05:05

‘… your IT Infrastructure viewed as a service … ensure the availability and performance of business critical services … alert you to potential problems… unify your business services into a single (whatever) … view how service changes impact business performance… report on availability … measure performance against SLAs … carry out capacity review and planning … deliver the right information to the right people …. ‘

The many websites for commercial, open source and open-source-ish Enterprise Monitoring solutions agree that it’s very important business-wise, and that a single console/panel/pane/whatever with fine-grained drill-down capabilities is a good thing.

Here we are, at the top level:

us on a good day

Are we good? The numbers can be read as showing that we are 99.29% okay for hosts and 98.85% okay for services. Good figures, but of course we don’t know what the down hosts or services do, or how important they might be, both in isolation and to other dependent services. As an overview, though, providing a comprehensible view of the Enterprise in technology terms, this is a good start with a lot of potential.
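
To put the host percentage in perspective, a quick back-of-the-envelope calculation helps. This sketch assumes the percentage on the screen is taken over the same estate of 849 hosts mentioned further down; the screen itself may count a slightly different set, so treat the result as approximate.

```python
# Back-of-the-envelope: how many hosts does "99.29% okay" leave unaccounted for?
# Assumption: the percentage is taken over an estate of 849 hosts.
TOTAL_HOSTS = 849
HOSTS_OK_PERCENT = 99.29


def hosts_not_ok(total: int, ok_percent: float) -> int:
    """Return the approximate number of hosts that are not in an OK state."""
    return round(total * (100 - ok_percent) / 100)


print(hosts_not_ok(TOTAL_HOSTS, HOSTS_OK_PERCENT))  # prints 6
```

Roughly six hosts: small as a percentage, but the figure alone says nothing about whether one of them is something half the services depend on.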

green - a colour often associated with safety

Drilling down to a service group, we can see lots of green.

We have seen this exact screen before while performance was firmly in canine territory, and we are not talking greyhound here. When the end-user experience was subsequently improved, nothing changed on screen: it all stayed green.

As an example of a false sense of security, disk usage can show green, yet the volume being monitored may not be the only one in use, or it may no longer be in use at all. It is impossible for the tool to know, and close to impossible for the staff responsible for the monitoring software to keep up with such changes across an estate of 849 hosts.
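
The gap between what is monitored and what is actually in use can at least be made visible. The following is a minimal sketch, not a feature of any particular monitoring product: it assumes you can obtain, per host, the set of volumes the disk check is configured for and the set of volumes actually mounted and in use (how you gather those is left open), and reports the differences. The function name `audit_disk_checks`, the host names and the data shapes are all hypothetical.

```python
from typing import Dict, Set, Tuple


def audit_disk_checks(
    configured: Dict[str, Set[str]],
    in_use: Dict[str, Set[str]],
) -> Dict[str, Tuple[Set[str], Set[str]]]:
    """Compare configured disk checks against volumes actually in use.

    Returns, for each host with a discrepancy, a pair:
      (volumes in use but not monitored, volumes monitored but no longer in use).
    Hosts where the two sets agree are left out of the report.
    """
    report: Dict[str, Tuple[Set[str], Set[str]]] = {}
    for host in configured.keys() | in_use.keys():
        checked = configured.get(host, set())
        used = in_use.get(host, set())
        unmonitored = used - checked
        stale = checked - used
        if unmonitored or stale:
            report[host] = (unmonitored, stale)
    return report


# Hypothetical example data: on db01 a new volume (/data2) has come into use,
# while the volume still being checked (/data) has been retired.
configured = {"db01": {"/", "/data"}, "web01": {"/"}}
in_use = {"db01": {"/", "/data2"}, "web01": {"/"}}
print(audit_disk_checks(configured, in_use))
```

This does not remove the need for someone to act on the report, but it turns a silent green into an explicit discrepancy.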

Another problem is that default thresholds are normally used. It might be perfectly normal for an application to use a lot of CPU and cause longer load queues than average, but unless the monitoring software is configured to expect this, it will continually highlight it as a potential problem. Both this specific warning and the monitoring as a whole then become more likely to be ignored when genuine issues appear.
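
The usual remedy is to override the defaults where a host is known to behave differently. Each monitoring tool has its own configuration syntax for this, so rather than guess at any particular product’s format, here is a product-neutral Python sketch of the idea: a default warning/critical pair for load, with per-host overrides taking precedence. The host names and threshold values are invented for illustration.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class Thresholds:
    warning: float
    critical: float


# Hypothetical defaults applied to every host unless overridden.
DEFAULT_LOAD = Thresholds(warning=5.0, critical=10.0)

# Hypothetical override: "batch01" routinely runs a CPU-heavy application,
# so a long load queue there is normal rather than a problem.
LOAD_OVERRIDES: Dict[str, Thresholds] = {
    "batch01": Thresholds(warning=20.0, critical=40.0),
}


def load_state(host: str, load: float) -> str:
    """Classify a load reading as OK, WARNING or CRITICAL for this host."""
    t = LOAD_OVERRIDES.get(host, DEFAULT_LOAD)
    if load >= t.critical:
        return "CRITICAL"
    if load >= t.warning:
        return "WARNING"
    return "OK"


print(load_state("web01", 12.0))    # default thresholds apply: CRITICAL
print(load_state("batch01", 12.0))  # override applies: OK
```

The catch is that every override is one more piece of configuration that has to be kept in step with the estate.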

red - a colour often associated with danger

Periodically, emboldened by something like a power cut, the reds make a bid for supremacy and everything turns their colour (if the monitoring software is still working). The local experts then fix what has happened by logging into their own tools and systems. Eventually the monitoring software turns mostly green again, and at that point it is useful for identifying slices of the architecture that have not yet been restored to order. It can also flag things you might never have heard of as being of interest: previously unknown servers, network adaptors, switches, ‘wetness monitors’ and so on.

To take the monitoring data to the people, the software is configured to email, and sometimes text, individuals and groups with the specific warnings of interest to them. The lowest number of such emails I ever receive in a day is twelve, and they are rarely about anything I can address. It is easy to miss an important one in a daily batch even of this size. In times of crisis I have received 1600+ such alerts in one day.
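
One common way of making a daily batch easier to scan is to collapse repeats before they reach an inbox. The sketch below illustrates the idea and is not a feature of our monitoring software: it assumes each alert is a simple (host, service, state) tuple, a hypothetical simplification, and groups identical alerts into counts, listing CRITICAL before WARNING so the important ones are less likely to be missed.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Alert = Tuple[str, str, str]  # (host, service, state): simplified, hypothetical shape

# Lower rank sorts first, so the most severe states head the digest.
SEVERITY_RANK = {"CRITICAL": 0, "WARNING": 1, "UNKNOWN": 2, "OK": 3}


def build_digest(alerts: Iterable[Alert]) -> List[str]:
    """Collapse repeated alerts into one line each, most severe first."""
    counts = Counter(alerts)
    ordered = sorted(
        counts.items(),
        # Sort by severity, then host, then service; unknown states go last.
        key=lambda item: (SEVERITY_RANK.get(item[0][2], 99), item[0][0], item[0][1]),
    )
    return [f"{state:8} {host}/{service} x{n}" for (host, service, state), n in ordered]


alerts = [
    ("web01", "disk /", "WARNING"),
    ("web01", "disk /", "WARNING"),
    ("db01", "mysql", "CRITICAL"),
    ("web01", "disk /", "WARNING"),
]
for line in build_digest(alerts):
    print(line)
```

Four emails become two lines here; on a day with 1600+ alerts, most of which are repeats, the reduction would be far larger.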

We have seen how our top-level monitoring extends down to the techies; now what about the other way around… (see next post)
