Thursday, July 12, 2012

PCs Intermittently Lose Connection to Server

Have you had this problem with your network? Over the past two and a half years at work I've run into it three times. We have funding/accounting/management software installed on Server1, and the users who work in this software reach it through a mapped drive on their PCs. Once in a blue moon a few PCs (not the same ones each time) lose connection to Server1 completely: the PC can't ping, remote desktop to, or access shares on Server1, and likewise Server1 can't reach the PCs. The communication between them is dead, over, gone...you get it. Oddly, the affected PCs can still talk to every other node on the network (which I suppose they're getting by way of Server2), so it isn't a general network failure. How do I solve the problem? By restarting Server1. I don't want to do that each time, though. I want to know why this is happening and whether there is another way to fix it.
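For reference, the kind of quick checks I run from an affected PC look something like this (the share name below is just a placeholder, not our real one):

    rem Can the PC see Server1 at all?
    ping Server1
    rem Can it reach a share? (AppShare is an example name)
    net use X: \\Server1\AppShare
    rem Sanity check: other nodes still answer
    ping Server2

When the problem hits, the first two fail and the last one succeeds.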

Keep in mind that while I was troubleshooting this issue the user was impatient and wanted "it" fixed right away, so I didn't have time to explore the problem deeply. The following is what I tried in order to avoid restarting Server1.

1. Restarted the problem PC. It didn't work.
2. Rejoined the PC to the domain. I took the PC off the domain, joined it to WORKGROUP, then joined it back to the domain. This didn't work either.
3. Ran a GPUpdate from the command line (exact commands below). Don't ask...I was scratching the bottom of the idea barrel. Obviously it didn't work.
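For anyone who wants the actual commands: step 3 was just gpupdate, and step 2 can be scripted with netdom if you'd rather not click through System Properties like I did (MYDOMAIN and the admin account are placeholders):

    rem Step 3 - force a Group Policy refresh:
    gpupdate /force

    rem Step 2, scripted - drop the PC to a workgroup, then rejoin the domain:
    netdom remove %COMPUTERNAME% /domain:MYDOMAIN /userd:MYDOMAIN\admin /passwordd:*
    netdom join %COMPUTERNAME% /domain:MYDOMAIN /userd:MYDOMAIN\admin /passwordd:* /reboot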

That's what I tried. By this point the user was huffing and puffing, so I went ahead and restarted the server, and then all was right with the world; at least in the user's world. I would like to fix this problem without resorting to a server restart, and I would also like to know what causes it in the first place. I jumped over to Server Fault to glean wisdom from the sages there, and boy did I glean!

Since the problem arises at random times and surfaces so rarely (three times over two and a half years), it's going to be difficult to troubleshoot directly, but the folks over at Server Fault told me I could develop an attack plan for when it rears its ugly head again. The plan, so far: to see what is going on during the issue, run Wireshark on one of the affected machines and also on Server1; to try to fix the issue without a restart, disable and then re-enable the network card on Server1, or flush the ARP cache on Server1 with arp -d *. Those were just a few of the suggestions. I thought there might be a network service I could restart under Administrative Tools\Services, but the guys there said this isn't a service issue.
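To make that plan concrete, here's roughly what those fixes look like from an elevated prompt on Server1 (the adapter name is whatever yours is actually called; run netsh interface show interface to check):

    rem Bounce the NIC without restarting the whole server:
    netsh interface set interface "Local Area Connection" admin=disable
    netsh interface set interface "Local Area Connection" admin=enable

    rem Flush the ARP cache:
    arp -d *

For the capture side, Wireshark's command-line tool tshark can grab just the conversation between the server and the affected PC (the IP below is a placeholder):

    "C:\Program Files\Wireshark\tshark.exe" -f "host 192.168.1.50" -w problem.pcap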

Anyway, I plan on updating this periodically as I explore the issue. I just posted the question at Server Fault today, so I might get more answers sometime after this posting.

***UPDATE*** 7/18/2012

The problem occurred again yesterday, in the morning and again at lunch, but this time it hit just one PC, one that wasn't in the affected group last week. While the problem was live I did the following:
  • Restarted the switch in her department - didn't work.
  • Disabled and then re-enabled her network adapter and the server's adapter - didn't work.
  • Updated the network driver on her PC - this did work, for the morning.
The monster reared its ugly head again at lunch.
 
Went to the server and collected Wireshark packets between the affected PC and the server. Then I restarted the server, because I know that works, and that fixed the issue. I was only able to read through the captured data for a few minutes before other issues came up (I'm the only IT pro here - a one-man crew) and occupied the rest of my shift. Thought about it through the night. Came in this morning, collected network traffic just to see if there were any bandwidth hogs, and couldn't find anything bloating the "pipe." Then it hit me: check the Kaspersky logs on the server. In the Network Attack Blocker logs I found that Kaspersky had detected DoS.Generic.SYNFlood "attacks" coming from the three machines affected last week and from the machine affected yesterday. When Kaspersky detects something like that, it cuts off communication with the attacking node for 60 minutes. The log entries gave the exact times of the detections, and those times matched up with when the affected users called me about the issue. I tracked the logs back 30 days and found no attacks before last week.
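By the way, if you want to spot that pattern in a capture yourself, this Wireshark display filter isolates bare SYNs - connection attempts that carry no ACK - which is exactly what a SYN-flood detector keys on:

    tcp.flags.syn == 1 and tcp.flags.ack == 0

A normal client shows a handful of these followed by completed handshakes; a machine tripping the blocker shows a pile of them aimed at the same server in a short window.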

I set the Network Attack Blocker to block an attacking node for only 1 minute. I'm also going to investigate what's behind the SYN-flood detections. At least for now I know why those machines were disconnected from the server; what I still need to figure out is the source of those DoS.Generic.SYNFlood events.
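My plan for the next occurrence is to check which process on the suspect PC is spraying connection attempts. Something along these lines should do it (the PID in the tasklist line is just an example):

    rem List half-open outbound connections along with the owning process ID:
    netstat -ano | findstr SYN_SENT
    rem Map a PID from that output (e.g. 1234) back to a process name:
    tasklist /FI "PID eq 1234"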

