This is the 206th article in the Spotlight on IT series. If you'd be interested in writing an article on the subject of backup, security, storage, virtualization, mobile, networking, wireless, cloud and SaaS, or MSPs for the series, PM Ericf to get started.
Over my career, I have spent an inordinate amount of time troubleshooting other people’s networks. I know that makes me sound like I think I am some kind of “packetmaster,” but that’s not it at all. No, I’m no super network consultant genius. I’m the guy that got called in when my product broke someone else’s stuff. While my day job is now focused on advanced cloud networking technologies, I share this old story to keep us humble and focused on the more basic elements that can bring even the most sophisticated network to its knees.
Working in product for networking companies means that you’re on the hook for breaking something almost all the time. And when it is Bank of America calling, there’s almost no such thing as “too last minute for hotel reservations.” The plane tickets were purchased, and it was me in coach, flying out to see what went wrong.
What’s funny is almost every time I was praying it was my product that had broken. “Please, seven gods of the OSI, let it be MY appliance that’s mishandling VLAN tags.”
It was easier if it was my box’s fault. See, if I had — say — hosed up the queuing in the box, I could troubleshoot that quickly. I knew that box in and out. More often than not, if it was my box acting up, I knew what to expect, where to go to grab a packet capture that would give me the detail I needed, and how to recreate issues I found. I had teams of engineers who could find a bug and fix it fast. And I would know it was done.
If it was my box, it meant all I had to do was find the issue and then start the “mea culpas” to the customer. That’s a lot easier than trying to find gentle ways to tell your paying customer that their network is dumb, broken, misconfigured, or otherwise not working the way they intended. My mom taught me never to point fingers… my CEO taught me never to point fingers at people who have a PO open with my company’s name on it.
And so it was that I found myself stepping off a plane in the Deep South. One of our customers had a big chemical plant that was connected back to HQ so they could run their supply chain software. They had just upgraded my box a couple weeks before with our latest software version. It was a LONG process. They really didn’t want to do it. We needed them to do it, because it had some performance fixes that would help them out — and stop the support phone ringing.
They finally found a time to plan an outage and make the upgrade, and they started having problems. Every day, at almost exactly the same time, the network just went away. Poof. Down. Workers just lost connectivity to HQ, to the Internet, to their mail. It was all just gone.
Now, I knew my box was inline. This was very likely all my fault. I also knew their network had a pretty simple router on the outside of me. But man, the L3 switch on the inside was a monster. I still can’t figure out why they needed to build their network with so much complexity, but I do remember there was an “office” network, a “server” network, and a network for the “machines” in the plant. The routing seemed to me to have been done with static routes, one at a time, from machine to desktop, from server to machine, from desktop to HQ. It was a mess.
“Oh, just let my software be broken. Please, I’d give my cat’s remaining front leg for a bug that keeps me from having to look into that network…”
When I arrived, I started the apologizing right away. I was barely out of the car, and I was sorry-ing my way to the door. “I appreciate you guys making the time to upgrade. I know how big a deal that is, I know we can make this work, and I am so grateful that you are working with us, we can get this solved…”
And then I started investigating. The plant was as grey as Father Time’s beard, and smelled oddly like American cheese. Why would a plastic plant smell like cheese? Does American cheese smell like plastic? IS American cheese just plastic?! Likely… It was crazy, and disconcerting.
Oh, the network? It was flawless.
It wasn’t even very much traffic, but it was a pretty steady stream — typical medium-sized office stuff: a fair bit of Internet traffic, with a class to keep it from overwhelming the network, a steady stream of mail, and regular blips of server traffic when the supply chain system checked levels and reported home. My box had a pretty simple configuration, and it was even done right. Clean, simple, basic stuff.
But, it wasn’t yet 2:20 p.m.
2:20 p.m. was when it typically happened. I was sure it was going to be log related. They had chosen to dump the day’s logs every day at 3 a.m. We MUST be filling up with logs at around 2 p.m., and then spending too many resources moving them around, and generally misbehaving and breaking the network. That had to be it.
At 2 p.m., I started capturing packets and watching the logs.
And it happened. The network broke at about 2:20 p.m. Not forever, but for about 2 minutes, there was nothing. Packets were fine on the LAN side, but the WAN was busted.
Logs were just fine! CPU use was… LOW! Memory was appropriate. What!? What happened? How could this NOT be my fault!? It wasn’t me. It wasn’t my box. I needed to come back the next day…
Now maybe it’s just me, but in technology, it’s really easy to get used to the same problems. It’s also really easy to get really good at finding and addressing that problem, and feeling pretty good about it. It’s a cycle of “being smart” that feels good. You design a thing, you implement the thing, the thing mostly works, you find the problems, you fix them, it works better… and you KNOW MORE about the thing! That’s a pretty decent cycle, whether you're a product manager or a network manager.
But it means, at least for me, that I start looking in the same places for problems; I look in the same places for ways to make things better. I lose the proverbial forest and just start seeing all these friggin’ trees everywhere.
I am jealous of all the “new guys” out there — the ones who haven’t had to fix all these old problems. They get to just attack problems or identify solutions without all the baggage and judgment and grumpiness that I have carefully cultivated over 15 years of bashing my head against various technologies.
Back to the story: I did come back the next day. I marveled at the greenness of the countryside and the number of train tracks (something we have next to none of out West) traversing the landscape.
I asked about the trains when I got to the plant. Apparently they were coal trains — always running, moving coal from mines to plants. An IT guy was in the middle of describing the inner workings of the coal industry, when the network guy started laughing. A lot.
“2:20! 2:20!” He just kept saying it.
And other people started to laugh too.
They had spent 15 days, and I had gotten on a plane, flown across the U.S., and spent another couple of days, all of it troubleshooting the changes to the device on their network.
But no one thought about the changes to the train schedule. An old mine was producing again, and thus a new train was now coming by the plant. It passed by a little after 2 p.m., and passed RIGHT THROUGH the line-of-sight wireless link that connected that office to a repeater and continued on to the telco.
Yup. A train.
If I had stopped looking in my “area of expertise” for a few minutes, even though the signs pointed to my box being the problem, I could have quickly looked for other changes, other signs, and maybe found a faster way to a solution.
Correlation is not causation, after all. The fact that the network went weird just after my box was upgraded? Meaningless. It was a red herring — a total distraction in the troubleshooting process. Would my new fancy, schmancy cloud network have saved us here? Not a chance. Which serves as a great reminder that even as we virtualize the network, we can’t lose sight of the basics — the physical network.
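An outage that recurs at nearly the same clock time every day is a clue in itself: it hints at something on a schedule rather than a one-time change like an upgrade. As a rough sketch of how that pattern could be checked (the timestamps, function names, and the 10-minute threshold below are all made up for illustration, not taken from the actual troubleshooting), a few lines of Python can measure how tightly outage start times cluster around one time of day:

```python
from datetime import datetime
from statistics import mean, pstdev


def minutes_since_midnight(ts: datetime) -> int:
    """Convert a timestamp to minutes elapsed since midnight on its own day."""
    return ts.hour * 60 + ts.minute


def time_of_day_spread(outage_starts):
    """Return (mean, population std dev) of start times, in minutes since midnight.

    Simplification: does not handle clusters that wrap around midnight.
    """
    minutes = [minutes_since_midnight(ts) for ts in outage_starts]
    return mean(minutes), pstdev(minutes)


def looks_scheduled(outage_starts, max_spread_minutes=10.0):
    """Heuristic (assumed threshold): 3+ outages within a tight daily window."""
    if len(outage_starts) < 3:
        return False
    _, spread = time_of_day_spread(outage_starts)
    return spread <= max_spread_minutes


# Hypothetical outage start times, invented for this example only.
outages = [
    datetime(2014, 3, 3, 14, 19),
    datetime(2014, 3, 4, 14, 21),
    datetime(2014, 3, 5, 14, 20),
    datetime(2014, 3, 6, 14, 22),
]

avg, spread = time_of_day_spread(outages)
print(f"average start: {int(avg) // 60:02d}:{int(avg) % 60:02d}")
print(f"spread: {spread:.1f} minutes")
print("looks scheduled:", looks_scheduled(outages))
```

A tight spread does not say what the cause is. It only suggests widening the search to anything else that runs on a clock, inside the network or outside it.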
Oh, and the fact that the plant smelled like cheese? Still a mystery…
---
Have you ever found yourself in a similar position while troubleshooting network problems? Share your networking tales (or your thoughts on American cheese) in the comments below!