Recently, while doing some wireless training, I’ve had a lot of questions about the importance of keeping historical data and how to use that for troubleshooting. When we get right down to it, troubleshooting any system that is already in production is all about the ‘delta’.
In math, a delta is the difference between two values. When troubleshooting network issues, the delta is the difference between how the network looked when it was working properly and how it looks now that it isn’t. In most organizations, it’s the network’s fault until proven otherwise!
So, what are the items that we need to track to determine what is different when things aren’t going the way we expect? Well, it varies with the hardware and the type of network you have deployed, but some of the basic items are user counts, bandwidth, client signal strength, and 802.11 radio counters. For effective troubleshooting using these values, we need more than just a day or even a week of data. In some cases, we need months or even a year’s worth of data to identify trends.
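To make the idea of a delta concrete, here is a minimal Python sketch that measures how far today's reading sits from its historical baseline, in standard deviations. The function name, the sample RSSI values, and the use of a simple mean and standard deviation are all assumptions for illustration; a real monitoring system would pull these numbers from its own data store and may use a different notion of "normal".

```python
import math
from statistics import mean, stdev

def delta_from_baseline(history, current):
    """Return how far `current` sits from the historical mean,
    measured in standard deviations of the history."""
    baseline = mean(history)
    spread = stdev(history)
    if spread == 0:
        # Perfectly flat history: any change at all is infinitely unusual.
        if current == baseline:
            return 0.0
        return math.copysign(float("inf"), current - baseline)
    return (current - baseline) / spread

# Hypothetical daily average client signal strength (dBm) for one AP.
rssi_history = [-62, -61, -63, -60, -62, -61, -63, -62]
rssi_today = -74

z = delta_from_baseline(rssi_history, rssi_today)
print(f"Today's signal is {z:.1f} standard deviations from the baseline")
```

A large negative value here says the signal is far weaker than it usually is, which is only visible because the history was kept.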
Let me give you some real-world examples where these values are important. First, we have user counts and bandwidth. These two values tell us about the utilization of the network. Most AP vendors will have a recommended maximum number of users per AP. When looking at the trending data, are we passing this number at the times users start complaining? Bandwidth is a little easier to trend, since we know we only have so much bandwidth available, depending on the radio mode we’re using.
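The user count question can be sketched in a few lines of Python: given a history of client counts for an AP, list the samples taken around a complaint that were over the recommended maximum. The limit of 25 clients, the 30-minute window, and the sample data are hypothetical; use your vendor's figure and your own polling data.

```python
from datetime import datetime, timedelta

MAX_CLIENTS_PER_AP = 25  # assumption: substitute your vendor's recommendation

def over_limit_near(samples, complaint_time,
                    window=timedelta(minutes=30),
                    limit=MAX_CLIENTS_PER_AP):
    """Return the (timestamp, client_count) samples that fall within
    `window` of the complaint and exceed the per-AP client limit."""
    return [
        (ts, count)
        for ts, count in samples
        if abs(ts - complaint_time) <= window and count > limit
    ]

# Hypothetical polled client counts for a single AP.
samples = [
    (datetime(2008, 2, 13, 9, 0), 18),
    (datetime(2008, 2, 13, 9, 30), 31),
    (datetime(2008, 2, 13, 10, 0), 29),
    (datetime(2008, 2, 13, 14, 0), 33),
]
complaint = datetime(2008, 2, 13, 9, 45)

hits = over_limit_near(samples, complaint)
for ts, count in hits:
    print(f"{ts:%H:%M}: {count} clients (limit {MAX_CLIENTS_PER_AP})")
```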
Client signal strength and 802.11 counters are a bit more ambiguous. Without historical information, these values have almost no context. Often the ‘slow’ network complaint from users is really a lack of good wireless signal. This can be caused by the user being in an area where there is a known lack of coverage, or it can be because something changed in the environment. I’ve seen things like new construction (unknown to the IT staff, of course) or, my favorite, twenty pallets of canned beans delivered to the warehouse that drastically change the RF coverage.
Looking at the 802.11 radio counters can be an eye-opening experience. These values usually relate to reception errors (i.e., interference issues) and transmission errors. Transmission errors can be caused by stolen antennas (mostly in high schools!) or, if you’re using outside antennas and the errors peak when it’s raining, by water getting down into the coax. Reception errors are more varied, but generally point to some sort of interference. This can be caused by things like microwaves (the graphs will jump up during the lunch hour) or cordless phones. I had one situation where all the clients at a facility dropped off the wireless network every Tuesday at 1pm. I verified the wireless disconnect by looking at the roaming history for the clients. When I looked at the 802.11 counters, I saw a spike in the reception errors every week at the same time over the last couple of months. After further investigation, it turned out to be the backup generator on the roof doing its weekly self-test!
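A pattern like the Tuesday 1pm spike can also be hunted for automatically once the history exists. The Python sketch below is one rough way to do it, not the method used in the story above: it groups hourly reception error counts by day of week and hour, then reports the slots that were well above the typical level in every week on record. The function name, the threshold factor, and the generated sample data are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import median

def recurring_spike_slots(samples, factor=3.0, min_weeks=4):
    """Return (weekday, hour) slots whose error count exceeded
    `factor` times the overall median in every observed week.
    weekday follows datetime.weekday(): Monday is 0, Tuesday is 1."""
    samples = list(samples)
    typical = median(count for _, count in samples)
    threshold = max(typical, 1) * factor  # guard against a zero median
    by_slot = defaultdict(list)
    for ts, count in samples:
        by_slot[(ts.weekday(), ts.hour)].append(count)
    return sorted(
        slot
        for slot, counts in by_slot.items()
        if len(counts) >= min_weeks and min(counts) > threshold
    )

# Hypothetical data: 8 weeks of hourly counts, quiet except Tuesdays at 13:00.
start = datetime(2008, 1, 7)  # a Monday
history = []
for hour_offset in range(8 * 7 * 24):
    ts = start + timedelta(hours=hour_offset)
    spike = ts.weekday() == 1 and ts.hour == 13
    history.append((ts, 400 if spike else 5))

slots = recurring_spike_slots(history)
print(slots)
```

Requiring the spike in every week is deliberately strict; it filters out one-off events like a single bad afternoon, but a looser rule (say, most weeks) may suit noisier environments better.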
So, the rule of thumb when troubleshooting an already deployed system is to find out what the network looked like when it was working properly and what’s different now, when the network is misbehaving.
Written by Jeremy Haltom
February 13th, 2008 at 6:49 pm
[…] the helpdesk is not staffed with RF engineers), trending information (see my earlier blog on ‘Troubleshooting Deltas’), and other troubleshooting dashboards. This way, the helpdesk can accurately diagnose the […]