Recently we had an issue at our office where mail was not routing properly between the regional Exchange servers: nothing was being sent or received, and the queues just kept building. The routing is done over the Internet through Virtual Private Network (VPN) connections.
I was asked to take a look to determine whether it was a network issue. I first asked a few questions to understand the topology and how I could work out what might be happening, but not before asking whether there had been any changes to any of the network or system elements (firewall, routers, switches, OS or application). I was told there were no changes, so I pulled out my trusty network protocol analyser, Network Instruments Observer ®, and hooked it up to a port mirroring the one going to the Exchange server.
After several hours of looking at the traffic, I could find no evidence of network issues. While there were TCP retransmissions, there were not enough to say that the network itself was causing the issue. In fact, what was evident were application-layer errors (as opposed to errors at the transport layer and lower). So we started to look at the Exchange configuration; however, no changes had been made to the configuration, or so they said.
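As a rough illustration of that judgement call, the retransmission rate can be worked out from the segment counters a protocol analyser reports. This is only a sketch: the function names and the 2% threshold are assumptions made for the example, not figures from this incident or from any vendor guidance.

```python
def retransmission_rate(total_segments: int, retransmitted: int) -> float:
    """Return the fraction of TCP segments that were retransmissions."""
    if total_segments <= 0:
        raise ValueError("total_segments must be positive")
    if retransmitted < 0 or retransmitted > total_segments:
        raise ValueError("retransmitted must be between 0 and total_segments")
    return retransmitted / total_segments


def network_is_suspect(total_segments: int, retransmitted: int,
                       threshold: float = 0.02) -> bool:
    """Flag the network only when the rate exceeds the (assumed) threshold."""
    return retransmission_rate(total_segments, retransmitted) > threshold


# Hypothetical counters read off a capture summary.
print(network_is_suspect(100_000, 300))  # a handful of retransmissions
print(network_is_suspect(1_000, 50))     # a much heavier retransmission load
```

A low rate does not prove the network is healthy, but it is a reason to look higher up the stack before changing things like MTU sizes.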
Microsoft was later called in and, after some troubleshooting, said that it was the network, and had our IT guys changing MTU sizes and other settings. For a while I started to doubt myself and my ability to troubleshoot the network.
Try as they might, Microsoft could not resolve the issue. Somewhere during the process, they disabled Smart Defense on the Checkpoint ® firewall and voila, everything started to work. So it wasn’t the network. It turned out that the Smart Defense auto-update feature had downloaded an update that caused the entire issue.
So what did we learn from all this?
- Automatic updates are a bad idea for mission-critical systems.
- Always check properly to determine if there were any changes to the system. A proper change management process is important.
- Always check every component. It is not always the network.
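One way to make the "were there any changes?" question less dependent on memory is to compare a recorded inventory of component versions from before and after the problem appeared. The sketch below assumes such snapshots exist as simple name-to-version mappings; the component names and version strings are hypothetical.

```python
from typing import Dict, Optional, Tuple


def changed_components(
    before: Dict[str, str], after: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return every component whose recorded version differs between snapshots."""
    changes = {}
    for name in sorted(set(before) | set(after)):
        old, new = before.get(name), after.get(name)
        if old != new:
            changes[name] = (old, new)  # None means the component was absent
    return changes


# Hypothetical snapshots; a real inventory would come from each device or server.
last_week = {"firewall_signatures": "sig-100", "exchange_build": "build-7"}
today = {"firewall_signatures": "sig-101", "exchange_build": "build-7"}

print(changed_components(last_week, today))
```

With a record like this, an update that nobody made by hand still shows up as a change.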
You always find that users complain that the network is slow or has problems whenever there is a problem with mail, printing and especially the Internet. The most important skill to have in the IT industry, especially the service industry, is troubleshooting.
I find it easier to troubleshoot using the OSI layers, starting from the physical layer and moving up to the application layer, ruling out each one as I go. But that is, of course, only if you really have no idea what the problem might be; depending on the problem, it could be best to start from the top down.
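The layer-by-layer approach can be sketched as a small routine that runs one check per OSI layer, in either direction, and stops at the first layer that fails. The check functions here are placeholders standing in for real tests (link lights, ping, a TCP connect, an SMTP session and so on); their names and results are assumptions for the example.

```python
from typing import Callable, Dict, Optional

OSI_LAYERS = [
    "physical", "data link", "network", "transport",
    "session", "presentation", "application",
]


def first_failing_layer(checks: Dict[str, Callable[[], bool]],
                        top_down: bool = False) -> Optional[str]:
    """Run the supplied checks in OSI order and return the first layer that fails.

    Layers without a check are skipped. Returns None if every check passes.
    """
    order = list(reversed(OSI_LAYERS)) if top_down else OSI_LAYERS
    for layer in order:
        check = checks.get(layer)
        if check is not None and not check():
            return layer
    return None


# Placeholder checks roughly mirroring the incident described above.
checks = {
    "physical": lambda: True,      # cable and link are up
    "network": lambda: True,       # remote site is reachable
    "transport": lambda: True,     # TCP sessions are established
    "application": lambda: False,  # mail transfer itself is failing
}

print(first_failing_layer(checks))                 # bottom-up walk
print(first_failing_layer(checks, top_down=True))  # top-down walk
```

Both directions end at the same layer here; the difference is only how many layers get ruled out along the way.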