Killer Packets
Here's a case in point for how filtering applies to troubleshooting a
busy server. I saw a problem in which a UNIX server started to have
trouble sending print jobs to a Novell server. At seemingly random
times, the Novell server would suddenly generate errors on its UNIX
services screen (PLPD) and stop processing. Only a reload of
the PLPD.NLM file would make the server start processing UNIX print
jobs again. Our first question was, "Who changed something on the Novell
server?" The answer was...nobody. Nothing had changed on the Novell
server. No interrogation or torture was spared to verify this; we were
absolutely certain that nobody had changed anything in the time frame
that we were talking about.
This was a really tough problem to troubleshoot: A search on the
Novell support site for the particular PLPD error message revealed
nothing, and the problem was still popping up intermittently. We
needed an answer relatively quickly, because this print gateway was
responsible for processing print jobs for a time-sensitive function.
Because we were relatively certain that nothing had changed on either
the Novell server or the UNIX server (in fact, the UNIX server was
printing fine to other Novell servers), we decided to see what was
happening on the network. Maybe some errant evil packet was causing
the PLPD server some mental illness.
We connected a sniffer to the server's segment (because we suspected
something bad was happening to the server) and considered what we
wanted to filter on:
o Because we knew something was happening to the Novell server, we
would only capture packets destined for the Novell server's MAC
address.
o Because we knew that this was a very busy file and print
server, it wasn't feasible to capture all packets destined for
this server.
o Because we knew that the problem was with PLPD (and knew
that PLPD accepted UNIX print services via TCP/IP), we would
only accept TCP/IP packets. This eliminated most of the packets
destined for this server, which were Novell file and print
IPX/SPX packets.
This left us with a test setup that looked something like what's
shown in Figure 21.4.
Figure 21.4 The test setup for a tough NetWare-to-UNIX
printing problem.
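The filter criteria above can be sketched in a few lines of Python. This is only an illustration, not the analyzer's actual filter syntax: the function names are made up for the example, the frames are assumed to be raw Ethernet II frames, and ARP is let through alongside IPv4 on the assumption that the analyzer's "TCP/IP" filter covered it.

```python
import struct

ETHERTYPE_IPV4 = 0x0800
ETHERTYPE_ARP = 0x0806  # assumption: the "TCP/IP" filter also passes ARP


def mac_to_bytes(mac: str) -> bytes:
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def keep_frame(frame: bytes, server_mac: bytes) -> bool:
    """Return True for frames sent to server_mac that carry IPv4 or ARP."""
    if len(frame) < 14:
        return False  # too short to hold an Ethernet header
    # Ethernet II header: destination MAC, source MAC, EtherType
    dst, _src, ethertype = struct.unpack("!6s6sH", frame[:14])
    return dst == server_mac and ethertype in (ETHERTYPE_IPV4, ETHERTYPE_ARP)
```

IPX traffic fails the EtherType test (0x8137 on Ethernet II, or a length field rather than an EtherType on 802.3 framing), so the bulk of the server's traffic never reaches the capture buffer.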
As soon as the problem occurred again, we looked at the packet
capture. There are two important concepts here: First, we left the
analyzer running and stopped it right after the trouble report. Second, we
synchronized the clock on the network analyzer to the network time
before we started capturing, and we asked the user who reported the
problem to also report the time of the problem. Because this was a
pretty busy print service, we were sure that the reported time was
accurate to within two or three minutes, so we only had to
consider packets around the time of the report, which limited how much
junk we had to wade through.
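Narrowing a capture to the minutes around a reported time is easy to express in code. Here's a minimal sketch, assuming the capture has already been exported as (timestamp, data) pairs with the analyzer clock synchronized to network time; the function name and the three-minute default are illustrative choices, not anything the analyzer provides.

```python
from datetime import datetime, timedelta


def near_report(packets, reported_at, margin=timedelta(minutes=3)):
    """Return the packets whose timestamp is within margin of reported_at.

    packets is an iterable of (timestamp, data) pairs.
    """
    start, end = reported_at - margin, reported_at + margin
    return [(ts, data) for ts, data in packets if start <= ts <= end]
```

The window is only trustworthy if both clocks agree, which is why the synchronization step comes first.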
Skipping to the end of the trace, we first filtered on the LPD TCP
port, number 515. We did see a problem: The server stopped
responding to the LPD requests from the UNIX host at the end. Well, we
knew that without taking a trace. Still, this was useful: It let us
know where in the packet list the problem occurred. Therefore, we got
rid of the LPD filter, jumped to the packet where the problem
occurred, and looked at the packets right before the problem.
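The same two-step approach (filter to find the spot, then drop the filter and look around it) can be sketched as follows. The packet records here are hypothetical dictionaries with 'src_ip' and 'src_port' keys; a real analyzer export would look different, and the function name is made up for the example.

```python
def packets_around_stall(trace, server_ip, lpd_port=515, context=10):
    """Return the unfiltered packets surrounding the server's last LPD reply.

    trace is a list of dicts with 'src_ip' and 'src_port' keys (hypothetical
    shape); non-TCP packets such as ARP can carry src_port=None.
    """
    last_reply = None
    for index, pkt in enumerate(trace):
        # Step 1: the LPD "filter" finds where the server last answered.
        if pkt.get("src_ip") == server_ip and pkt.get("src_port") == lpd_port:
            last_reply = index
    if last_reply is None:
        return []
    # Step 2: drop the filter and take everything near that position.
    return trace[max(0, last_reply - context):last_reply + context + 1]
```

The point of the second step is that the returned slice includes packets of every protocol, not just LPD, which is how an unrelated packet near the failure can show up.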
Apparently, right before the problem occurred, there was an ARP
packet (ARP is TCP/IP's Address Resolution Protocol). Remember, each IP
address must have a corresponding MAC address in order for two network
cards to talk. The ARP packet I saw was advertising the wrong MAC
address. An ARP packet with the wrong MAC address typically means that
someone else has used a TCP/IP address that's the same as yours, thus
interrupting communications, but that was not the case here.
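To make the IP-to-MAC pairing concrete, here's a minimal sketch that pulls the sender's addresses out of an ARP payload (the standard 28-byte Ethernet/IPv4 layout) and checks them against a table of expected pairings. The function names and the expected table are assumptions for illustration; the addresses in any usage would be your own.

```python
import socket
import struct


def parse_arp(payload: bytes):
    """Return (operation, sender_mac, sender_ip) from an Ethernet/IPv4 ARP payload."""
    # Fields: hardware type, protocol type, hardware length, protocol length,
    # operation (1 = request, 2 = reply), sender MAC/IP, target MAC/IP.
    htype, ptype, hlen, plen, oper, sha, spa, _tha, _tpa = struct.unpack(
        "!HHBBH6s4s6s4s", payload[:28]
    )
    if (htype, ptype, hlen, plen) != (1, 0x0800, 6, 4):
        raise ValueError("not an Ethernet/IPv4 ARP packet")
    sender_mac = ":".join(f"{b:02x}" for b in sha)
    return oper, sender_mac, socket.inet_ntoa(spa)


def arp_mismatch(payload: bytes, expected: dict) -> bool:
    """True if the packet maps a known IP address to an unexpected MAC address."""
    _oper, sender_mac, sender_ip = parse_arp(payload)
    return sender_ip in expected and expected[sender_ip] != sender_mac
```

A mismatch flagged this way is only a lead: it tells you which pairing looks wrong, not which device (or switch) put it on the wire.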
We tried to find the device with the MAC address reported in the ARP
packet, but there was no such network card on our network. Not only that, but I
couldn't find the OUI of the MAC address in my OUI table, which was
also suspicious. Furthermore, this was a network where only one or two
well-known vendors' cards were in use.
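The OUI check is just a lookup on the first three octets of the MAC address. A minimal sketch follows; the function names are made up for the example, and the table is passed in by the caller because its contents depend on your own vendor list (the entries you'd use come from the IEEE registry, not from this code).

```python
def oui_of(mac: str) -> str:
    """Return the first three octets of a MAC address, normalised to 'aa:bb:cc'."""
    octets = mac.replace("-", ":").lower().split(":")
    if len(octets) != 6:
        raise ValueError(f"not a MAC address: {mac!r}")
    return ":".join(octets[:3])


def vendor_of(mac: str, oui_table: dict):
    """Return the vendor name for mac, or None if its OUI is not in oui_table."""
    return oui_table.get(oui_of(mac))
```

On a network with only one or two vendors' cards, a None result from a lookup like this is exactly the kind of red flag described above.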
Because there was no such device on the network, we next looked at the
switch configuration (remember from Hour 14, "Router and Switch
Basics," that devices on different sides of a switch do not actually
talk directly to each other). Because there was a MAC-level problem,
we naturally suspected the switch. We asked the person responsible for
switch configuration if anything had changed in the last couple of
days, and in fact, something had. He therefore changed the
configuration back to the way it used to be, and the problem went
away. Tough problem solved!
Two things still bothered me, though. Why could I ping the Novell
server at all if the ARP was incorrect? Well, because ARP cache entries
expire and are looked up again every couple of minutes, by the time I
was on the scene troubleshooting, the ARP was correct again; therefore,
I could ping the server without a problem. The switch was only sometimes
messing up
the ARP; usually, it was just fine. Second, why did a bad ARP mess up
the LPD service? That's a tougher question, and one I wasn't going to
find the answer to, mostly because it didn't matter. The PLPD.NLM file
(and for that matter, the TCPIP.NLM file) on the Novell server in
question was somewhat old, and an interruption in the data stream was
apparently driving it berserk. After the switch configuration was
fixed and the ARP problem went away, everything was okay once more
(and that, after all, is what's really important).
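The reason a bad ARP entry can come and go is easier to see with a toy model of an ARP cache. This is a sketch only: the class and its 120-second lifetime are assumptions for illustration, and real TCP/IP stacks use their own timeouts and refresh rules.

```python
class ArpCache:
    """Toy IP-to-MAC cache whose entries expire after ttl_seconds."""

    def __init__(self, ttl_seconds=120.0):  # assumed lifetime, varies by stack
        self.ttl = ttl_seconds
        self._entries = {}  # ip -> (mac, time the entry was learned)

    def learn(self, ip, mac, now):
        """Record (or overwrite) the MAC address seen for ip at time now."""
        self._entries[ip] = (mac, now)

    def lookup(self, ip, now):
        """Return the cached MAC for ip, or None if it is missing or expired."""
        entry = self._entries.get(ip)
        if entry is None:
            return None
        mac, learned = entry
        if now - learned > self.ttl:
            del self._entries[ip]  # expired: the host must ARP again
            return None
        return mac
```

A wrong entry learned at one moment is used until it expires; the next lookup triggers a fresh ARP exchange, and if that one happens to come back correct, everything looks healthy by the time someone arrives to ping the server.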