
NET ALERT: You left grandma at the airport

This is the 170th article in the Spotlight on IT series. If you'd be interested in writing an article on the subject of backup, security, storage, virtualization, mobile, networking, wireless, DNS, or MSPs for the series, PM Eric to get started.

Playing with a product today, I’m reminded of how critical, and often critically broken, alerting actually is in many environments. You can be a department of one with only a handful of systems sending occasional advisory messages, or a large IT team with hundreds of alerts of every variety, and you have the same problems. Most alerts are noise, but a tiny handful are really serious. I’ve been able to improve or at least accept certain idiosyncratic IT macro-conditions over the years, but no matter where I go, broken alerting causes more indigestion than it should.

Once upon a time I worked at an airline. I won’t say which airline, but it was an American airline. In this role, one of my systems gathered and dispatched real-time flight status information to the reservations website. It was used by visitors to check when they needed to be at the airport to pick up grandma, and it also powered the system that sent flight change info to passengers’ mobile devices. Some people might argue that’s the sort of system you might want to ensure is always working, and that it should rapidly send alerts and escalate to, well, everybody. It had too many moving parts, tricky dependencies, well-intentioned crocks, third-party mainframe interfaces, and potentially upset grandmas.

While grandmas alert their rides by calling their tardy children and sounding simultaneously accommodating yet disappointed as only a grandma can, every system in this application chain alerted using a different method. Worse, the systems were split between three different business units that didn’t play well together. The TPF flight ops system wrote errors to a log I wasn’t allowed to access, and it sent IBM MQ Series error messages to a central NOC event hub... that didn’t have security policies to allow redirection to me. I built the process that diffed flight updates against an MS SQL database to figure out what changed. I knew what my system was up to, but had no tool to warn downstream bigairline.com of upstream trouble. Of course, I didn’t manage the firewalls, SQL cluster, network links or other infrastructure, and all of those sent alerts somewhere, but not to me. Yes, grandma, computers are hard. Thanks for the hug.

In some IT organizations you have the additional frustration of data trolls. You know the type. They have amazing, encyclopedic understanding of their systems but tease you with reports or log snippets that they won’t make available. They hide behind policy, departmental boundaries or interpret IT governance regarding access as they see fit. Fortunately, data trolls are rare. Most IT pros will try to help, forwarding alerts for whatever you need. The problem is they don’t have the time or expert knowledge to configure exactly which alerts you need, so they send pretty much everything. All the time. Without end. It forces the problem of determining what’s important downstream to your email inbox.

The flight data system alerting had the same challenge as many complex systems: the easiest way to cross boundaries was email. Any individual system alerts in a way that’s most reasonable for it (syslog, log files, events, IVR calls, voodoo), but the only universal format is email. It’s convenient to configure, reasonably reliable* and has capacity to carry as much detail as the system needs to send. The problem, of course, is it’s dumb. You can try to create Nobel Prize-level Exchange rules to sort and dispatch the messages, but in the end critical ones end up sitting there in a mailbox.

*And yes, the network can go down, so you still need your trusty v.32 Smartmodem on a serial port.

To solve my email issue, I ended up building a monstrosity, a Perl cron edifice, which imported emails from Novell GroupUnwise with POP, parsed contents, spit digests to a file, then read config tables in a shared Excel spreadsheet to figure out who was on-call with the pager. I learned two important lessons from that. One, don’t use shared Excel spreadsheets for anything important, like making sure grandma isn’t freezing on the curb outside O’Hare. Two, Perl is Anger, the fifth circle of hell. (I’m sorry, but someone has to say it.)
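For the curious, the parse-and-dispatch part of that pipeline can be sketched in a few lines of Python. This is only an illustration of the idea, not the original Perl: the `X-Alert-System` header, the keyword list, and the pager names are all hypothetical, the on-call table is a plain dictionary standing in for the spreadsheet, and fetching the mail over POP is left out entirely.

```python
from email import message_from_string
from typing import Tuple

# Hypothetical on-call table; the original kept this in a shared spreadsheet.
ON_CALL = {"flightdata": "alice-pager", "network": "bob-pager"}
DEFAULT_CONTACT = "noc-pager"

# Hypothetical subject keywords that mark a message as worth a page.
CRITICAL_WORDS = ("down", "failed", "critical")


def route(raw_email: str) -> Tuple[str, str]:
    """Return (severity, destination) for one raw alert email."""
    msg = message_from_string(raw_email)
    subject = (msg["Subject"] or "").lower()
    # Assumed custom header naming the system that sent the alert.
    system = (msg["X-Alert-System"] or "").lower()

    if any(word in subject for word in CRITICAL_WORDS):
        # Page whoever is on call for that system, or the NOC if unknown.
        return "critical", ON_CALL.get(system, DEFAULT_CONTACT)
    # Everything else goes to the digest file instead of a pager.
    return "noise", "digest"
```

Calling `route()` on a message with the subject "Flight feed DOWN" and the header `X-Alert-System: flightdata` would page the flight data on-call contact, while a low-toner message would land in the digest. Simple keyword matching like this is crude, which is more or less the point of the story.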

The hack was workable, and we always knew when the flight data died. Even better, bigairline.com knew as soon as we did. I made sure if the failure was upstream that bigairline.com operations knew in the initial alert that it wasn’t us. But after years of trying everything I could think of, the TPAP (Temporary Perl Alert Processor) never went away, and for all I know it’s still running. I was sure that in my next job things would be different and that surely the airline was unique. No other IT environment could be as messed up when it came to ensuring accurate and actionable alert flow between critical operations systems, across business boundaries and out to the right admins for resolution. But I was wrong, and don’t call me Shirley.

Some IT teams are better than others, a few even getting really close to an easy-to-manage operation where critical messages aren’t missed, everyone knows who has the pager and the pager doesn’t go off at 3 a.m. because toner is low. (Though an SMS when the CEO’s paper is low might be handy.) Great alerting happens when smart admins take time to redirect the fire hoses, invite gurus to lend their resolution expertise and capture smart alert handling processes in one place. It is possible; I’ve seen it and basked in its warm rays. I was even able to spend a little more time with grandma on weekends. Though, one time I was late when the airline failed to send an SMS message that her flight was early.

Anyway, I’m installing Alert Central in my house, and that’s what I was thinking about. It’s an off-label use, but I’m configuring it to manage my family and consulting email alerts with a magic Gmail account. Imagine making your spouse the 24-hour coverage group for bank account emails with you as the escalation admin and you get the picture. I already have Hyper-V in my closet and compared to managing the TPAP it’s a snap.

No problem, grandma. Of course, I knew your flight was early.

