Hello everyone! I've been a long-time reader and am a first-time poster, so hopefully you can help me on this one as much as you've helped others in the past.
Here's the problem:
Windows computers (mainly Win 7 Pro, but some XP) randomly lose the ability to ping internal servers by host name. I have about 100 workstations, and while users are working in an EMR program, the program will freeze because of a communication loss to the application host. When this happens I can open a cmd prompt and ping 126.96.36.199, www.google.com, 192.168.80.10 (our DC) and 192.168.90.232 (the EMR app server), but I can't ping DC01 (our DC) or EMR (our EMR app server) by name. This happens on multiple machines with nothing in common (i.e. not on the same switches, area of the building, etc.), and while it's happening the rest of the computers appear to be unaffected. If you go to the affected machine and run "ipconfig /renew", "ipconfig /flushdns" or "netsh int ip reset resetlog.txt", name resolution comes back and all is well.
Windows Server 2008 R2 Enterprise (fully licensed and updated) on all servers.
All servers are running on top of VMware ESXi 5.0 hosts in a logical datacenter with one NFS based SAN, one iSCSI based SAN and a lot of free space on both.
All servers have static IPs in either the 192.168.80.0/24 or 192.168.90.0/24 range, and all workstations get DHCP addresses in the 192.168.80.76-230 range, with plenty of extra addresses available and no duplicate IPs.
The two main servers involved are DC01, which is a Domain Controller holding all 5 FSMO roles plus DHCP and DNS, and EMR, which is an app server with only SQL Server (2010 I think, but it shouldn't matter) and the EMR application.
What I've Done So Far:
Ran "dcdiag /tests:DNS /DnsAll" on the DC and everything tested correctly.
Ran Wireshark on the DC and saw nothing that really jumped out at me.
Checked for errors on the switches. Out of 384 ports I have two dropped packets since the last counter reset so I don't think that's it.
Verified that several of the computers having the issue have different NICs/NIC Drivers so no common threads.
Checked for common updates on the workstations and recent updates on the servers.
Created a new vSwitch in VMware to give one dedicated NIC to each of the two VMs in question here, with no load balancing or fancy stuff happening.
Beat my head against the wall repeatedly!!! :)
I really appreciate any help the collective guru hive has.
You referred to “ping by DNS name” in the title.
So I wonder whether you can ping by IP address.
Since you captured some packets, did you see the DNS server's response packet to the DNS request?
I would appreciate it if you could provide the ipconfig /all output from both the server and the client, taken while the issue is occurring.
Thanks for the response. Yes, I can ping by IP all the time with no problems to both internal and external targets.
Unfortunately, since the problem strikes workstations (seemingly) at random, I haven't managed to be sitting at one when it happened. However, I did write a small bat file with these commands:
ipconfig /all >> C:\testlog.txt
netstat -ano >> C:\testlog.txt
I started this script running on the domain controller and on the workstation of the user who's been complaining the most, before we opened one morning, and when the problem occurred I went back and examined the results. Every set of ipconfig and netstat output was almost identical, with the exception of some port 80 netstat results on the workstation from web surfing.
I just realized I didn't answer your other question about the response packet. The answer is I think so, but when I ask users when a problem occurred I get answers like "about 10:30", which makes it pretty hard to know whether I'm looking at the right time frame in the packet captures on the DC. Sorry I don't have a better answer.
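One way to narrow down the time frame would be to log name resolution attempts with timestamps on an affected workstation, so a failure can be matched to a specific window in the capture. What follows is only a rough sketch in Python, not something from the original setup: the function names, the log path and the 30 second interval are all assumptions, and it relies on the system resolver, which on Windows may also fall back to NetBIOS or LLMNR rather than DNS alone.

```python
import socket
import time
from datetime import datetime


def check_name(host, resolver=socket.gethostbyname):
    """Try to resolve one host name; return a timestamped log line."""
    stamp = datetime.now().strftime("%Y-%m-%d %H:%M:%S")
    try:
        address = resolver(host)
        return f"{stamp} OK   {host} -> {address}"
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        return f"{stamp} FAIL {host} ({exc})"


def monitor(hosts, log_path, interval=30, rounds=None,
            resolver=socket.gethostbyname):
    """Append one line per host to log_path every `interval` seconds.

    rounds=None runs until interrupted; a number stops after that many passes.
    """
    done = 0
    while rounds is None or done < rounds:
        with open(log_path, "a", encoding="utf-8") as log:
            for host in hosts:
                log.write(check_name(host, resolver) + "\n")
        done += 1
        if rounds is None or done < rounds:
            time.sleep(interval)


# Example call (runs until Ctrl+C); the log path is hypothetical:
# monitor(["DC01", "EMR"], r"C:\resolvelog.txt", interval=30)
```

The first FAIL line for DC01 or EMR would then give a timestamp to look for in the Wireshark capture, instead of relying on "about 10:30".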
I had similar problems with a network a while back, which turned out to be related to the network configuration on VMware.
In this case the network guys had configured my VMware network trunks to use the spanning-tree protocol on the physical switches. VMware would spread the load of traffic over multiple adapters, and the spanning-tree protocol would detect packets that were sent under the same network but from a different connection. It would cause network outages ranging from 3 to 30 seconds.
Don't know if this is the case in your network, but it might be worth checking.
Jaap, thanks for the input. This was actually one of the first things I checked. In my original post I mentioned that I pulled the two VMs that this problem affects out of the vSwitch and gave them dedicated physical NICs, so, as I understand VMware networking to work, what you're describing shouldn't be happening. Also, the network doesn't go down at all. The affected host just loses DNS resolution for a little while. It's very weird.
Good thought though! Thanks and keep 'em coming.
***UPDATE*** I spun up a second DC with only the Directory Services and DNS roles installed and changed the DHCP settings to include the new DC as the secondary DNS server, and the issue appears to have gotten better. Still not gone, but better. Any ideas?
I'll continue working on it and post a solution if I find one.
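With two DNS servers now handed out by DHCP, it may help to query each server directly during an outage to see whether one of them fails to answer, rather than going through the workstation's resolver. The snippet below is a minimal sketch that sends a single A record query over UDP using only the Python standard library. The function names are made up for illustration, and the fully qualified name and the second server's address in the example are hypothetical, since the domain suffix and the new DC's IP aren't given in this thread.

```python
import socket
import struct


def build_query(name, query_id=0x1234):
    """Build a minimal DNS query for an A record (recursion desired)."""
    header = struct.pack(">HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = b"".join(
        bytes([len(part)]) + part.encode("ascii")
        for part in name.rstrip(".").split(".")
    )
    question = labels + b"\x00" + struct.pack(">HH", 1, 1)  # QTYPE=A, QCLASS=IN
    return header + question


def parse_header(response):
    """Return (query_id, rcode, answer_count) from a DNS response."""
    query_id, flags, _qd, an_count, _ns, _ar = struct.unpack(
        ">HHHHHH", response[:12]
    )
    return query_id, flags & 0x000F, an_count


def ask_server(server, name, timeout=2.0):
    """Send one query to one DNS server over UDP; describe what came back."""
    query = build_query(name)
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        try:
            sock.sendto(query, (server, 53))
            response, _ = sock.recvfrom(512)
        except socket.timeout:
            return "no response"
        except OSError as exc:
            return f"error: {exc}"
    _id, rcode, answers = parse_header(response)
    return f"rcode={rcode} answers={answers}"


# Example (hypothetical FQDN and second server address):
# for server in ("192.168.80.10", "192.168.80.11"):
#     print(server, ask_server(server, "dc01.example.local"))
```

An rcode of 0 with at least one answer means that server resolved the name; "no response" from one server but not the other would point at that server or the path to it rather than at the workstation.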