Notification Escalations on NodePing

Most systems run smoothly most of the time.  Servers keep running.  Web sites serve pages and deliver data from backend databases.  DNS servers respond to queries with hardly any delay at all.  Email flows smoothly.

Emergency_light_with_grillIt’s that tiny percent of the time that it doesn’t work that way that causes the heartburn.  A server that has been running just fine for months suddenly hiccups.  But even when that happens, it’s usually a hiccup.  The person who is the first line of responsibility for that service needs to know right away.  They jump on it, clear the problem, and things go back to humming like normal.  You need fast and reliable monitoring to help keep these interruptions to service short.  A lot of times, the service is back to normal before most people realize there was an issue.  These incidents likely go in a report, but the rest of the team doesn’t need to get involved.  Its dealt with, duly noted, and life goes on.

Then there are the times that something goes really wrong.  The first line is working on it, but the server isn’t going to be back up in a minute or two.  Or the first line person is not available.  Maybe he’s in accounting trying to sort out his paperwork for credit card expenses for last month.  Someone else needs to know that things are down.

Sometimes these situations turn into real disasters.  The website is down.  Upper management is going to be calling, wondering who’s spilling revenue out on the server room floor.  The manager getting that call wants to know about it before the phone rings with that call.

Most monitoring systems use escalating notifications to handle these situations.  If a system is down, the first line person should be notified immediately.  If it’s down for a few minutes, the people who back him up need to be brought in.  If it’s down longer than that, systems management will want to get a heads up.

NodePing uses the notification delay feature to provide notification escalations.  A delay can be set on each notification contact for each check.  The NodePing notification delay feature notes that a check set with a delayed notification has gone down.  After the delay interval has been reached, if the check is still “down” we send the notification to that contact.

Set up the first line systems with no delay, so they’ll get notified when the system goes down right away.  If the service hasn’t recovered in a few minutes, send a notification to the systems group using a contact group.  Then, if the service hasn’t recovered in 10 minutes (or whatever the tolerance for the service being down is for this service in your organization), notify the systems management.

The notification delay feature can be used for other things besides notifications.  Sometimes services have a higher tolerance for transient interruptions.  You can use the delay to mean “if this service is down shorter than 3 minutes, I don’t need to be notified.”  This is useful, for example, for services in remote locations where Internet connectivity can have brief interruptions.  But our most common request for using the delays are for notification escalations.

 

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s

%d bloggers like this: