Hi,
I work in a school with approximately 900 devices. We have about 60 switches, most of which are Netgear GS724T, plus a couple of GSM models (e.g. GSM7224 and GSM7328S) and a few assorted Netgear PoE switches for our Meru WiFi. I'm part way through enabling IGMP snooping, so not all of our switches are running it yet. We don't have STP turned on because our LEA told us they have known it to cause other problems.
It was the first day back for pupils and, apart from the usual "I've forgotten my password" and "My room was painted, can you reconnect my PC", everything was going well. Then, 10 minutes before lunch, the entire network dropped (including our cashless tills... great timing!). It even took out our phone system, which is connected to our LAN for remote management (standard phone system, not VoIP). Every light on every single switch on our network was flashing in unison at a steady rate. I'm not an expert with Wireshark, but I gave it a go, expecting to see a flood of broadcasts and hoping to spot a source IP. Instead I got a flood of ICMPv6 packets from IPv6 addresses, but we only use IPv4... So at this point I was thinking: could this be a DoS attack? (I've never actually seen one before.) With all that in mind, we sent a team of people to all the IT suites to look for a cable which a pupil may have looped between two sockets (this did happen once a few years ago), while the network manager and I tried to isolate the source to a section of the school by disconnecting parts of our fibre backbone. Confusingly, we couldn't see much, if any, difference in the lights at either end of the disconnected link. How can a broadcast storm affect two isolated halves of the network? (Unless there was a looped cable on both sides of the school, which seems a little unlikely.)
Next, taking a stab in the dark, we remembered that we have seen very similar activity on one or two switches after a power cut, and in those cases rebooting the switch has always solved the problem. However, we have never seen it happen on such a large scale, and our UPS logs weren't reporting any power surges or dips. We decided to restart one cabinet (4 or 5 switches), but within seconds of the switches coming back up we had the same activity.
Finally, we turned off every switch on the network and turned them back on one cabinet at a time, starting at our default gateway and working down. After the final cabinet was back on, the switch activity looked normal and the network was working as it should. We also had a constant ping running to the default gateway the whole time and it didn't drop. Once the network was running I opened Wireshark and used this filter: "ip.proto == ICMPv6", and didn't get a single packet... By this time, however, a lot of PCs had probably been turned off. My concern now is that when the PCs are turned on tomorrow morning we could have the same problem all over again.
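A note on that filter, in case it matters: as far as I understand it, ip.proto is a field of the IPv4 header, so it may never match traffic carried over IPv6, and the plain display filter "icmpv6" is probably what I should have used. To check my own understanding I wrote the rough Python sketch below. It is only an illustration of the idea, not anything Wireshark itself runs; classify_frame is a name I made up, and it only looks past a single Hop-by-Hop header rather than walking every IPv6 extension header.

```python
import struct
from collections import Counter

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_IPV6 = 0x86DD
ETHERTYPE_VLAN = 0x8100   # 802.1Q tag
NH_HOP_BY_HOP = 0         # IPv6 Hop-by-Hop Options header
NH_ICMPV6 = 58            # IPv6 Next Header value for ICMPv6


def classify_frame(frame: bytes) -> str:
    """Classify a raw Ethernet II frame as 'icmpv6', 'ipv6', 'ipv4' or 'other'.

    Hypothetical helper for illustration only.
    """
    if len(frame) < 14:
        return "other"
    ethertype = struct.unpack("!H", frame[12:14])[0]
    offset = 14
    # Skip one 802.1Q VLAN tag if present.
    if ethertype == ETHERTYPE_VLAN and len(frame) >= 18:
        ethertype = struct.unpack("!H", frame[16:18])[0]
        offset = 18
    if ethertype == ETHERTYPE_IPV4:
        return "ipv4"
    if ethertype != ETHERTYPE_IPV6:
        return "other"
    if len(frame) < offset + 40:
        return "ipv6"  # truncated IPv6 header
    # Next Header is byte 6 of the fixed 40-byte IPv6 header.
    next_header = frame[offset + 6]
    # Simplification: look past at most one Hop-by-Hop header,
    # whose own first byte is the following Next Header value.
    if next_header == NH_HOP_BY_HOP and len(frame) > offset + 40:
        next_header = frame[offset + 40]
    return "icmpv6" if next_header == NH_ICMPV6 else "ipv6"


def summarise(frames) -> Counter:
    """Count how many frames fall into each class."""
    return Counter(classify_frame(f) for f in frames)
```

The point is that an ICMPv6 packet sits inside a frame with EtherType 0x86DD, so a test written against the IPv4 header never gets a chance to see it.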
Can anybody suggest what the cause could have been, and where I should start if this happens again in the morning?
Many Thanks,
Dan