Keep in mind that when I troubleshot this issue, the user was impatient and wanted "it" fixed right away, so I didn't have time to explore the problem deeply. The following is what I tried in order to avoid restarting Server1.
1. Restarted the problem PC. It didn't work.
2. Rejoined the PC to the domain. I took the PC off the domain, joined it to WORKGROUP, then rejoined it to the domain. This didn't work either.
3. Ran a GPUpdate from the command line. Don't ask...I was scraping the bottom of the idea barrel. Obviously it didn't work.
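For the record, steps 2 and 3 can both be scripted from an elevated command prompt. A sketch, assuming netdom is available on the machine, and using CONTOSO, PC01, and the admin account as placeholders for the real names:

```shell
:: Step 2, scripted: drop the machine from the domain, then rejoin it
:: (CONTOSO, PC01, and CONTOSO\admin are placeholders, not the real names)
netdom remove PC01 /domain:CONTOSO /userd:CONTOSO\admin /passwordd:*
:: a reboot between remove and join is usually required in practice
netdom join PC01 /domain:CONTOSO /userd:CONTOSO\admin /passwordd:*

:: Step 3: force an immediate refresh of computer and user Group Policy
gpupdate /force
```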
That's what I tried. By this point the user was huffing and puffing, so I went ahead and restarted the server, and then all was right with the world; at least in the user's world. I would like to fix this problem without resorting to a server restart, and I would also like to know what causes it in the first place. I jumped over to Server Fault to glean wisdom from the sages there, and boy did I glean!
Since the problem arises at random times and surfaces very rarely (3 times over 2 1/2 years), it's going to be difficult to troubleshoot directly, but the guys over at Server Fault told me I could develop an attack plan for when it rears its ugly head again. The plan, so far:
- To see what is going on during the issue, run Wireshark on one of the affected machines and also on Server1.
- To try to fix the issue, disable then re-enable the network card on Server1, or run arp -d * on Server1 to flush its ARP cache.
These were just a few of the suggestions I was given. I thought there would be a network service I could restart under Admin Tools\Services, but the guys there said this isn't a service issue.
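For the "fix it" half of the plan, both server-side actions can be run from an elevated prompt on Server1. A sketch; the adapter name "Ethernet" is a placeholder for whatever name actually shows up in ncpa.cpl:

```shell
:: Bounce the server's NIC ("Ethernet" is a placeholder adapter name)
netsh interface set interface name="Ethernet" admin=disabled
netsh interface set interface name="Ethernet" admin=enabled

:: Or flush the server's ARP cache - the * deletes all entries
arp -d *
```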
Anyway, I plan on updating this post periodically as I explore the issue. I only posted the question on Server Fault today, so more answers may come in after this goes up.
The problem occurred again yesterday, in the morning and at lunch, but this time it hit just one PC, and that one wasn't in the affected group last week. During the problem I did the following:
- Restarted the switch in her department - didn't work.
- Disabled then re-enabled her network adapter and the server's adapter - didn't work
- Updated the network driver on her PC - this did work, for the morning anyway.
Went to the server and collected Wireshark packets between the affected PC and the server. Then I restarted the server, because I know that works, and that fixed the issue. I was only able to read through the collected data for a few minutes before other issues came up (I'm the only IT pro - a one-man crew) that occupied the rest of my shift.
I thought about it through the night. I came in this morning and collected network traffic just to see if there were any network process hogs, and couldn't find anything bloating the "pipe." Then it hit me: check the Kaspersky logs on the server. In the network attack blocker logs I found that Kaspersky had detected dos.generic.synflood "attacks" from the 3 machines affected last week and from the machine affected yesterday. When Kaspersky detects something like that, it cuts off communication with the attacking node for 60 minutes. The logs gave the exact time of each block, and those times matched up with when the affected users called me about the issue. I tracked the logs back 30 days and found they were clean of attacks before last week.
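To dig into what Kaspersky is flagging, I can go back over the capture I already collected and filter for the classic SYN-flood signature: lots of SYN packets with no completed handshake. A sketch using tshark (Wireshark's command-line tool); capture.pcap and 192.168.1.50 are placeholders for my actual capture file and the affected PC's address:

```shell
:: Show only SYN packets (SYN set, ACK clear) sent by the affected PC
tshark -r capture.pcap -Y "ip.src == 192.168.1.50 && tcp.flags.syn == 1 && tcp.flags.ack == 0"

:: Same filter, but emit just the destination port of each matching packet,
:: which makes it easy to eyeball how many SYNs went where
tshark -r capture.pcap -Y "ip.src == 192.168.1.50 && tcp.flags.syn == 1 && tcp.flags.ack == 0" -T fields -e tcp.dstport
```

A burst of SYNs to many ports, or rapid repeated SYNs to one port with no ACKs coming back, is exactly the kind of pattern a network attack blocker will read as a flood.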
I set the network attack blocker to block an attacking node for only 1 minute instead of 60. At least for now I know why those machines were disconnected from the server; the next step is to figure out what on those machines is generating the dos.generic.synflood "attacks" in the first place.