Server 2012 R2 File Server Stops Responding to SMB Connections

    Question

  • Hi There,

    Massive shot in the dark here but I am struggling with a pretty major issue atm.  We have a production file server that is hosted on the following:

    Dell MD 3220i -> iSCSI -> Server 2008 R2 Hyper-v Cluster -> Passthrough Disk -> Server 2012 R2 File Server VM

    Three times now, roughly a month apart, the file server has stopped accepting connections.  During this time, the server is perfectly accessible through RDP or with a simple ping.  I can browse the files on the server directly, but no-one appears to be able to access the shares over SMB.  A reboot of the server fixes the issue.

    As per a KB article, I removed NOD antivirus from the server after the second fault, to rule out a conflicting filter driver.  Sadly, it happened again yesterday.

    The only relevant errors in the servers log files are:

    SMB Server Event ID 551

    SMB Session Authentication Failure

    Client Name: \\192.168.105.79
    Client Address: 192.168.105.79:50774
    User Name: HHS\H6-08$
    Session ID: 0xFFFFFFFFFFFFFFFF
    Status: Insufficient server resources exist to complete the request. (0xC0000205)

    Guidance:

    You should expect this error when attempting to connect to shares using incorrect credentials.

    This error does not always indicate a problem with authorization, but mainly authentication. It is more common with non-Windows clients.

    This error can occur when using incorrect usernames and passwords with NTLM, mismatched LmCompatibility settings between client and server, duplicate Kerberos service principal names, incorrect Kerberos ticket-granting service tickets, or Guest accounts without Guest access enabled

    and

    SMB Server event ID 1020
    File system operation has taken longer than expected.
    
    Client Name: \\192.168.105.97
    Client Address: 192.168.105.97:49571
    User Name: HHS\12J.Champion
    Session ID: 0x2C07B40004A5
    Share Name: \\*\Subjects
    File Name: 
    Command: 5
    Duration (in milliseconds): 176784
    Warning Threshold (in milliseconds): 120000
    
    Guidance:
    
    The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB.
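For anyone comparing their own 1020 events against this one, here is a minimal Python sketch that pulls the fields out of the event text and shows how far past the warning threshold the operation ran. It assumes the event body is pasted in as plain "Key: value" lines, and it assumes the Command field is the SMB2 command code as numbered in the MS-SMB2 specification (5 is CREATE, 16 is QUERY_INFO); the function names are made up for illustration.

```python
# Event text as posted above, pasted verbatim into a raw string.
EVENT_1020 = r"""
Client Name: \\192.168.105.97
Client Address: 192.168.105.97:49571
User Name: HHS\12J.Champion
Session ID: 0x2C07B40004A5
Share Name: \\*\Subjects
File Name:
Command: 5
Duration (in milliseconds): 176784
Warning Threshold (in milliseconds): 120000
"""

# Assumption: the Command field uses the SMB2 command numbering.
SMB2_COMMANDS = {
    0: "NEGOTIATE",
    1: "SESSION_SETUP",
    3: "TREE_CONNECT",
    5: "CREATE",
    6: "CLOSE",
    8: "READ",
    9: "WRITE",
    16: "QUERY_INFO",
}


def parse_event_fields(text):
    """Turn 'Key: value' lines into a dict; lines without a value are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.strip().partition(": ")
        if sep:
            fields[key] = value.strip()
    return fields


def overrun(fields):
    """Return (duration_ms, threshold_ms, duration / threshold)."""
    duration = int(fields["Duration (in milliseconds)"])
    threshold = int(fields["Warning Threshold (in milliseconds)"])
    return duration, threshold, duration / threshold


fields = parse_event_fields(EVENT_1020)
duration, threshold, ratio = overrun(fields)
command = SMB2_COMMANDS.get(int(fields["Command"]), "unknown")
print(f"{command}: {duration / 1000:.1f} s against a "
      f"{threshold / 1000:.0f} s threshold ({ratio:.2f}x)")
# prints: CREATE: 176.8 s against a 120 s threshold (1.47x)
```

If that reading of the Command field is right, the slow operation here was an open/create call that took nearly three minutes, not a read or write.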

    I have checked the underlying disk/iscsi/network hyper-v cluster for any other errors or issues, but as far as I can tell everything is fine. 

    Is it possible that something else is left over from the NOD antivirus installation?  

    Looking for suggestions on how to troubleshoot this further.

    Thanks


    Wednesday, December 11, 2013 2:14 PM

All replies

  • Check cabling, switch, amounts of dropped packets. In general - you have a network failure. Do a stress test, find the faulty component and replace it. Also make sure you deploy NTttcp and iPerf, as your config needs to reach close to wire speed for TCP.

    P.S. Run multiple physical networks and all the components at least duplicated for production. 


    StarWind iSCSI SAN & NAS

    Wednesday, December 11, 2013 4:22 PM
  • Hi There,

    Thanks for the quick lesson on iSCSI best practice.  As stated I have already checked the underlying storage/networking/iscsi/mpio etc... and there are no problems at all. The same iSCSI/cluster has been running production vm's for 4 years now without any issues.  

    I find it weird that when the SMB service manages to get locked up like this, I can still browse the files fine on the server.  That would rule out any underlying physical storage issue surely? 

    One theory I had could be perhaps the use of an iSCSI passthrough disk in the 2008 R2 host to the 2012 R2 guest.  This is the only thing unique to this VM,  all other guest vm's use vhd files on CSV's.  


    Thursday, December 12, 2013 8:44 AM
  • Hello Dan,
     
    Thank you for your question.

    I am trying to involve someone familiar with this topic to further look at this issue. There might be some time delay. Appreciate your patience.


    If you have any feedback on our support, please send to tnfsl@microsoft.com.

    Monday, December 16, 2013 7:51 AM
  • Hi Dan,

    When the issue occurs again, please restart the Server service to test whether that resolves the issue.  We need to verify whether the Server service is corrupted at that moment.

    Regards,

    Mike




    • Edited by Mike Pei-MSFT Monday, December 16, 2013 9:03 AM insert pic
    Monday, December 16, 2013 9:02 AM
  • Thanks for the responses.  

    I will try restarting the Server service next time it occurs.  Sadly, as this only happens once a month or so, it may be a while until the condition occurs again.

    Just for some extra background on the server and its setup:

    • The pass-through disk is configured as a single ntfs volume at around 9TB in size.
    • The volume is presented as a drive letter and then each share (around 8-10 of them) is a subfolder on the disk.
    • The volume does have de-duplication enabled.  It's currently 6TB deduped down to 3TB.
    • The server was upgraded from Server 2012 to Server 2012 R2 before being deployed in production.

    As a long term solution, it may just be a case of building a fresh server to move over to.

    I'll message again next time it happens.


    Monday, December 16, 2013 10:13 AM
  • Hi, can you also check the memory usage of the virtual server?  I have seen a similar issue where the memory was full and I had to use a Sysinternals tool to clear the memory down.  It could be a memory leak issue.

    In our case this was to do with the backup software we were using, not a native Windows issue.

    Monday, December 16, 2013 4:28 PM
  • Hi There,

    I checked the graphs that we log for RAM, CPU etc... nothing was out of the ordinary at all the last time it failed.  The VM has 8 GB of static RAM and it was only using around 2.5 GB.

    At present it has not yet failed again, so until it does I'm just waiting.  I will post up as much info as possible once that happens.

    Friday, December 27, 2013 10:55 AM
  • Hi Dan,

    I'd be very interested if you have found a solution to this as it sounds like we are having the exact same problem with 2 of our file servers.  Both are server 2012 (haven't upgraded them to R2 yet) and are virtual machines accessing storage with a pass through fibre connection.  Like yourself when the problem occurs the servers are completely responsive, pingable and can be connected to on RDP where I can access the storage directly with no problem.  One of the servers has an application providing AFP support for our apple macs and the macs continued to access their home directories when this happened so it's definitely not storage related, it's only clients that are connecting via SMB that are affected.  Also a quick server reboot fixes the problem, I will try restarting the server service when it next happens.

    Annoyingly there is nothing in the logs around the time that this happens so the error messages you've posted may not be related to the problem?!

    regards

    Wednesday, January 15, 2014 5:07 PM
  • Hi Ray,

    We've yet to see this issue again.  But likewise we have not done anything related to fix it.  For us I must stress that it would only happen once a month if that which makes it very hard to diagnose.

    Interesting you didn't see anything in your logs.  It could very well be a different issue or mine are unrelated as you said!

    Wednesday, January 15, 2014 6:30 PM
  • Hi Dan,

    Yeah we upgraded to 2012 in September and have had this happen only once on each of our 2 file servers, the second occasion being yesterday (hence why I was looking on the internet as when something happens twice it ceases to be a fluke!).  If it happens again and I find out any more I will let you know.

    Thursday, January 16, 2014 9:02 AM
  • Hello All,

    I too am experiencing this issue on the latest and "greatest" windows server OS. I have tested this on 2012 and 2012 R2 and experienced the issue on both builds. I am running the servers on hyper-v 2012 r2 and have sr-iov enabled on the server nics to rule out the microsoft hyper-v networking stack (although this did occur with the vmq enabled nics too).  Today I made one change and I will see if it helps... I removed any hidden nic cards from device manager. Please keep me posted if you make any progress on resolving this issue on your servers. 

    Thank you,

    Fred

    Monday, January 20, 2014 2:53 PM
  • New update! It happened again!

    So this is the fourth confirmed case now.  Being a little more clued up I observed the following this time:

    • Random clients were disconnected or could not connect.  Others were still connected fine.
    • No errors were being logged in the event log.
    • No storage or cluster errors were apparent.
    • Tried restarting the Server service.  It failed to restart and just hung at "stopping".  After telling it to stop, a lot of new messages were logged.
    • Being in production I had to restart the server to get our files working again.  Much as I would love to pore over it and troubleshoot for a few hours, my phone wouldn't stop ringing.

    The new error message:

    Event ID 2012 - Source: srv

    While transmitting or receiving data, the server encountered a network error. Occasional errors are expected, but large amounts of these indicate a possible error in your network configuration.  The error status code is contained within the returned data (formatted as Words) and may point you towards the problem.

    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="srv" />
        <EventID Qualifiers="32768">2012</EventID>
        <Level>3</Level>
        <Task>0</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2014-01-22T13:32:18.553037300Z" />
        <EventRecordID>91405</EventRecordID>
        <Channel>System</Channel>
        <Computer>FS-02.HHS.local</Computer>
        <Security />
      </System>
      <EventData>
        <Data>\Device\LanmanServer</Data>
        <Binary>0000040001002C0000000000DC07008000000000840100C0000000000000000000000000000000004F060000</Binary>
      </EventData>
    </Event>
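The event text says the status code is inside the returned data. As a rough sketch of how to pull it out, the Python below unpacks the Binary value as little-endian 32-bit words. The little-endian DWORD layout is an assumption on my part (it is how these srv events are usually read), and the small status table holds only codes I am confident of; treat the interpretation as a starting point, not a diagnosis.

```python
import struct

# Binary payload from the event above, split into 4-byte groups for readability.
EVENT_2012_BINARY = (
    "00000400" "01002C00" "00000000" "DC070080" "00000000" "840100C0"
    "00000000" "00000000" "00000000" "00000000" "4F060000"
)

# Small lookup table; anything else is reported as unknown.
NTSTATUS_NAMES = {
    0xC0000184: "STATUS_INVALID_DEVICE_STATE",
    0xC0000205: "STATUS_INSUFF_SERVER_RESOURCES",
    0xC000006D: "STATUS_LOGON_FAILURE",
}


def decode_dwords(hex_text):
    """Decode a hex string into little-endian 32-bit words (trailing bytes dropped)."""
    raw = bytes.fromhex(hex_text)
    count = len(raw) // 4
    return list(struct.unpack("<%dI" % count, raw[: count * 4]))


words = decode_dwords(EVENT_2012_BINARY)
for offset, word in enumerate(words):
    note = NTSTATUS_NAMES.get(word, "")
    # Error-severity NTSTATUS values have the top two bits set (0xC0000000).
    if not note and word & 0xC0000000 == 0xC0000000:
        note = "error-severity NTSTATUS, not in table"
    print(f"word {offset:2d}: 0x{word:08X} {note}")
```

Under that assumption, word 3 is 0x800007DC, whose low 16 bits are 2012 (the event ID itself), and word 5 is 0xC0000184, which the NTSTATUS tables list as STATUS_INVALID_DEVICE_STATE. That would be worth quoting in any support case.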

    I'm going to have to rebuild a new server now just to rule it out.  This is such a pain for us and is really knocking our confidence in the new 2012 R2 OS's  

    As a side note, we do have de-duplication enabled on the volume, I wonder if other people are in the same boat?

    Wednesday, January 22, 2014 2:09 PM
  • Greetings,

    Finally found this thread after weeks of searching... I am having the same or similar issue.  Granted mine is just an enthusiast home setup, but here's what I'm seeing:

    • Originally was running Hyper-V Server R2
    • One of the guest OSs (also Win2k12 R2) was a file server with a pass-through 15TB array on an Areca 1280ML
    • The host VM disk is formatted NTFS, and the 15TB passthrough volume is ReFS.
    • All Intel NICs, though at this point they are probably 2-5 years old.
    • Supermicro X8STE with Xeon W3520 and 24 GB non-ECC RAM.
    • Netgear GS748Tv3 Switch
    • Many configurations of NIC Teaming, and even straight host serving of the files through single NIC.

    Over the last several months I have ruled out everything I can think of, except for the server.  Since it's really only me using the servers, I'll mostly notice it when streaming content... I'll get a win32 I/O #59 error, suggesting a network failure.  It happens sporadically, but usually once an hour, but on occasion I won't see the issue for many hours.  Then, in Event Viewer, I see the 551 error described above:

    SMB Session Authentication Failure
    
    Client Name: \\192.168.0.11
    Client Address: 192.168.0.11:50758
    User Name:
    Session ID: 0x240048000015
    Status: The attempted logon is invalid. This is either due to a bad username or authentication information. (0xC000006D)
    
    Guidance:
    
    You should expect this error when attempting to connect to shares using incorrect credentials.
    
    This error does not always indicate a problem with authorization, but mainly authentication. It is more common with non-Windows clients.
    
    This error can occur when using incorrect usernames and passwords with NTLM, mismatched LmCompatibility settings between client and server, duplicate Kerberos service principal names, incorrect Kerberos ticket-granting service tickets, or Guest accounts without Guest access enabled
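One thing worth checking before assuming everyone in the thread has the same fault: event 551 uses the same guidance text for every failure, but the Status line differs. The sketch below, in Python, extracts the hex status from the two Status lines posted so far and looks it up in a tiny hand-made table. The table covers only these two codes and the helper names are illustrative.

```python
import re

# Status lines as they appear in the two 551 events posted in this thread.
FIRST_551 = "Status: Insufficient server resources exist to complete the request. (0xC0000205)"
SECOND_551 = ("Status: The attempted logon is invalid. This is either due to a bad "
              "username or authentication information. (0xC000006D)")

# Hand-made table with only the two codes seen here.
NTSTATUS_NAMES = {
    0xC0000205: "STATUS_INSUFF_SERVER_RESOURCES",
    0xC000006D: "STATUS_LOGON_FAILURE",
}

# An NTSTATUS value printed in parentheses, e.g. "(0xC0000205)".
STATUS_RE = re.compile(r"\(0x([0-9A-Fa-f]{8})\)")


def status_codes(event_text):
    """Return every parenthesised 32-bit hex status found in the text."""
    return [int(match, 16) for match in STATUS_RE.findall(event_text)]


def describe(code):
    """Look up a status code, falling back to a plain 'unknown' marker."""
    return NTSTATUS_NAMES.get(code, "unknown (not in this small table)")


for label, text in (("first report", FIRST_551), ("second report", SECOND_551)):
    for code in status_codes(text):
        print(f"{label}: 0x{code:08X} {describe(code)}")
```

The first report is a resource exhaustion status on the server, while this one is a plain logon failure, so the two may well have different causes despite sharing an event ID.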

    Things I've tried which seem to suggest an issue with the OS:

    • Wired up the server to a separate (cheap) switch directly with my client.  Problem was reproduced.
    • Reconfigured NIC teaming in every combination available, including disabling it.  Problem was reproduced.
    • Copied over a large library of streaming content to a Windows Standard R2 guest OS that is being  hosted on ESX 5.5.  Problem was NOT reproduced after 24 hours of testing (suggesting everything works fine when the host OS is not Windows).  The other box is a very similar setup hardware wise, minus the large storage.  The other box is also connected to the same Netgear switch.
    • Let's see... I also tried streaming music content from the guest file server TO a guest Windows 8.1 client both hosted on the same box.  Problem was reproduced (I was very surprised by this since my understanding is that it's the virtual switch that would have been doing the talking between the two).

    I've read articles about how some of my NICs (like the 82574L), while supported in-box, have been found to have issues and can no longer have drivers written for them because of updated WHQL standards... but my test which reproduced the error on the virtual switch seems to disprove any relationship to the physical NICs.

    It's truly to the point where I'm considering moving this machine to ESX as well.  However, I'd really prefer to stick with what I've got, as I'm tired of working on it.  I'll be bookmarking this and will be MORE THAN HAPPY to provide any additional details the community might need.

    Thank you for your time.

    John



    Saturday, January 25, 2014 5:37 AM
  • I should also note that I didn't have this problem with Windows Server 2012.  I did a little more reading around this morning, and it sounds like others are suggesting it's an issue in SMB 3.02, which R2 uses that previous releases didn't.

    I just happened across that tidbit this morning and thought I'd share.

    Saturday, January 25, 2014 4:48 PM
  • To test the theory, I reinstalled 2012... both Hyper-V Server and the File Server guest OS.  I continue to experience the issue, though I'm not seeing the SMB entries in Event Viewer.  So, it's either not R2 specific, or maybe I've got some type of hardware issue... but that just seems so unlikely.  May try 2008 R2 or something else to confirm.
    Sunday, January 26, 2014 5:54 PM
  • Well, I loaded up the host with ESX 5.5 and installed my 2012 R2 file server as a guest on it.  Configured it all the same as when it was a guest on Hyper-V.  Same problem...

    At this point it sure seems like something in Windows 2012 and above, perhaps with SMB.  I haven't tested with 2008 R2 yet... I might try that next, but it was a heck of a lot of work just to get this far and I'm spent.

    SMB Session Authentication Failure
    
    Client Name: \\192.168.0.180
    Client Address: 192.168.0.180:55373
    User Name: 
    Session ID: 0xC0000000065
    Status: The attempted logon is invalid. This is either due to a bad username or authentication information. (0xC000006D)
    
    Guidance:
    
    You should expect this error when attempting to connect to shares using incorrect credentials.
    
    This error does not always indicate a problem with authorization, but mainly authentication. It is more common with non-Windows clients.
    
    This error can occur when using incorrect usernames and passwords with NTLM, mismatched LmCompatibility settings between client and server, duplicate Kerberos service principal names, incorrect Kerberos ticket-granting service tickets, or Guest accounts without Guest access enabled

    Thursday, January 30, 2014 3:30 AM
  • Yesterday I rebuilt the server with all new hardware... well, new to me.  Dual Xeon, LGA771 on a Supermicro board, all Intel NICs, and fully buffered ECC Kingston RAM.  Loaded it with ESX 5.5 and the same Windows file server guest (Windows 2012 R2 Datacenter Eval).

    Same issue.

    Having ruled out every piece of hardware on my network... I guess I'm left with the possibility that there's some sort of authentication problem between the file server and the domain?  The Event Log message seems to suggest that, at least, and this problem doesn't exist while transferring or streaming files from the virtual guest DC on a different host.

    Not sure what my next step is.  I'm debating converting the guest file server to Ubuntu, which would at least prove or disprove the problem is isolated to the Windows guest.

    Tuesday, February 04, 2014 4:08 PM
  • I hope this is being somewhat helpful and that I'm not just having a conversation with myself :-).

    Here's what I found over the last 24 hours.  As I mentioned, I rebuilt the server with all new hardware... which at this point totally eliminates hardware issues at all levels.  I tested again with 2k12 R2, same issue.  I reverted to 2k8 R2 today, and... same issue.

    So now I'm beyond hardware issues and probably beyond "Windows" issues.  I haven't tried converting the server to Ubuntu yet, but I think my Win2k8 R2 test told me that the problem lies somewhere in configurations (and that the problem isn't related to SMB 3.02).  So, since streaming works perfectly fine when test media is located directly on a domain controller, and since Event Viewer entries suggest authentication issues in SMB, I began looking into reasons why any sort of authentication/domain chatter might fail.

    I have all my servers virtualized, including my domain controllers... and currently it's all on ESX 5.5, as a result of troubleshooting this problem.  I looked into the network configuration of the primary domain controller, the one I was able to successfully stream from.  For its network I had it connected to the same virtual switch as the other guests, which has several NICS teamed together.  So then I ran across this...

    http://social.technet.microsoft.com/Forums/windowsserver/en-US/68b5894a-3eeb-4090-abcb-78538f9379fa/teamed-nic-for-domain-controller?forum=winserverDS

    Which suggests that NIC teaming for Domain Controllers is a no-no.  So, I carved off two vmnics to a new virtual switch, set one as active and one as standby, and am beginning to test again now.  I'll let you know how it goes :-).

    Thursday, February 06, 2014 2:06 AM
  • Well, I think I figured it out.  Like most problems that take weeks to figure out, the solution appears to have been pretty simple.

    I noticed a bunch of audit failures in the event viewer... and all were related to SMB sharing.  More research took me here...

    http://social.technet.microsoft.com/Forums/windowsserver/en-US/ae9da10a-b4d2-4eda-ae6d-ad61b7b6ab79/audit-failure-event-id-4625?forum=winserversecurity

    ... the commands the guy recommended didn't really do anything for me, but I took the premise of the server's "channel" with the domain controller being corrupted and ran with it.  It made sense considering the number of times I've joined and unjoined the file server to the domain.

    So I unjoined it, renamed it to something that's never been on the domain before, rejoined it... and for the last 2-3 hours I've been able to stream media without any interruptions or errors.  I'll let things run all night to be sure, but I think it fixed it.

    So, Dan Kingdon, check out that link.  If his commands don't tell you anything, maybe look into removing the server from the domain, renaming it, then rejoining it.  If that's not feasible in your situation, maybe there's a way to fix the "channel" without doing all of that.

    Hope this has helped.

    John

    Friday, February 07, 2014 4:15 AM
  • I was wrong, that didn't fix it.  At this point I may just try rebuilding AD.
    Saturday, February 08, 2014 11:30 PM
  • Before going that route I decided to try a couple more things... as it just seemed so unlikely that Active Directory was the issue.  Also the many tests I did above, some of which included specific AD diagnostics all indicated that all was well with AD.

    So I decided to disperse media throughout physical PCs in the house and run a 15 hour test streaming from them all.  Not a single error or failure from streams coming from 3 machines, two physical and one virtual on my other server. 

    This test again suggests that the problem is something specific to the problematic server.  Furthermore since all versions of Windows I've tested show the same error, I'm led to believe AGAIN that it's hardware related.  Problem is, at this point I've replaced ALL the hardware in that machine... except for the Areca RAID controllers.

    There are two Arecas in the box, a 1280ML and a 1222... so, one from two different generations.  The 1280 normally hosts all my data, and the 1222 my backups.  I moved media to the 1222, streamed from there for a couple hours, and reproduced the error.

    So then, I sat in a quiet corner and thought for about 20 minutes.  These two Areca controllers are ALMOST the only thing unique to this machine vs. all the others.  The other server also uses a 1280ML, but it's running VMware with the VMware driver.  Virtual machines on ESX aren't accessing the controller directly, which may be why I don't see the problem while streaming from guests on that machine.  And of course none of the PCs in the house have Arecas.  Given I've had issues with Areca controllers ever since the release of 2012, I'm thinking now that the culprit is the Areca drivers.  Looking back through all my tests, they all reproduced errors when the media was hosted on machines which used the Areca drivers, including the ESX guest with the Arecas passed through.

    In one final test for this theory, I loaded up a Ubuntu 12.04 guest on the problematic Hyper-V machine, and passed through the NTFS partition that I normally pass through to Windows.  I shared out the media folder, and again within about an hour, I got the error.  The underlying Areca driver on the Hyper-V host is the one thing all the failure tests seem to have in common.

    I'm going to pick up an Adaptec from eBay tonight and give that  a try this week. 

    Monday, February 10, 2014 1:10 AM
  • For what it's worth I too am running into the same exact problem.  I've been battling this since late December 2013.

    • Running Windows 2012
    • Hypervisor is ESXi 5.1
    • Server service hangs if I try to restart it during the issue
    • The problem, when it occurs, only affects random clients.  Meaning it continues to work fine for some, but not others.  We have multiple wiring closets and the problem is intermixed across all of them.
    • I too can RDP / ping to the host from a problem client
    • Rebooting the server is the only solution that fixes it.
    • I did a packet capture from the server side, and from what I can see, an SMB negotiate is sent from the client, but the server never responds with the SMB protocol to use.  It ends up in a repeating loop until the client gives up.
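The capture result in the last bullet (TCP connects, but the negotiate is never answered) can be checked from any client without a full capture. Below is a minimal Python sketch that opens a TCP connection to port 445, sends a bare SMB2 NEGOTIATE request, and reports whether anything comes back. The field layout follows my reading of the MS-SMB2 header and negotiate request; the dialect list, timeout, result strings and function names are my own choices, and it has not been tested against a server in the failed state.

```python
import socket
import struct
import uuid

# Dialects offered: SMB 2.0.2, 2.1, 3.0 and 3.0.2 (3.1.1 needs extra contexts).
SMB2_DIALECTS = (0x0202, 0x0210, 0x0300, 0x0302)


def build_negotiate():
    """Build an SMB2 NEGOTIATE request with its 4-byte direct-TCP length prefix."""
    header = struct.pack(
        "<4sHHIHHIIQIIQ16s",
        b"\xfeSMB",    # ProtocolId
        64,            # StructureSize of the header
        0,             # CreditCharge
        0,             # Status
        0,             # Command 0 = NEGOTIATE
        1,             # CreditRequest
        0,             # Flags
        0,             # NextCommand
        0,             # MessageId
        0,             # Reserved
        0,             # TreeId
        0,             # SessionId
        b"\x00" * 16,  # Signature (unsigned request)
    )
    body = struct.pack(
        "<HHHHI16sQ",
        36,                  # StructureSize of the negotiate request
        len(SMB2_DIALECTS),  # DialectCount
        1,                   # SecurityMode: signing enabled
        0,                   # Reserved
        0,                   # Capabilities
        uuid.uuid4().bytes,  # ClientGuid (random for this probe)
        0,                   # ClientStartTime
    )
    dialects = struct.pack("<%dH" % len(SMB2_DIALECTS), *SMB2_DIALECTS)
    message = header + body + dialects
    # Length prefix: a zero byte followed by a 3-byte big-endian length.
    return struct.pack(">I", len(message)) + message


def probe_smb(host, port=445, timeout=5.0):
    """Classify the server's reaction to one negotiate request."""
    try:
        with socket.create_connection((host, port), timeout=timeout) as sock:
            sock.sendall(build_negotiate())
            try:
                reply = sock.recv(4)
            except socket.timeout:
                return "tcp-ok-no-smb-reply"
    except OSError:
        return "tcp-connect-failed"
    return "smb-replied" if reply else "connection-closed"
```

Calling `probe_smb("fileserver.example")` (a hypothetical host name) from a working and a failing client at the same time would show whether the server is silent at the SMB layer for everyone or only for some connections.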

    It's good to know i'm not the only one having this issue. 

    Thursday, February 20, 2014 5:32 PM
  • This looks like a bug. We have been debugging this since October 2013.

    I have found another Topic which describes the same issue: 

    http://www.edugeek.net/forums/windows-server-2012/126721-server-2012-file-server-suddenly-stops-serving-requests-but-otherwise-looks-fine.html

    Friday, February 21, 2014 9:33 AM
  • I'm seeing the same issue on Server 2012 R2 VM hosting on Server 2012.  We're using Dell EqualLogic iSCSI arrays.

    I can reproduce the problem easily with Windows 8.1 workstations and loading roaming profiles off a 2012 R2 file server.

    The key event log entry seems to be:

    File system operation has taken longer than expected.

    Client Name: \\[2001:630:]

    Client Address: [2001:630:]:59115

    User Name: DOMAIN\9999

    Session ID: 0x34000C00003D

    Share Name: \\*\Roaming Profiles

    File Name: STUDENTS\9999.V4\NTUSER.DAT

    Command: 16

    Duration (in milliseconds): 77942

    Warning Threshold (in milliseconds): 15000

    Guidance:

    The underlying file system has taken too long to respond to an operation. This typically indicates a problem with the storage and not SMB.

    I've also noticed Task Manager showing 100% disk utilisation but 0 read/write/response time.  I can still browse the disk locally but I can't copy files or make directories. It seems to completely lock up the disk.

    Has anyone opened a PSS case? Can you post case numbers - i'm going to open a case and it might be helpful to link cases.  My case number is 114022411211053



    • Edited by DJL Monday, February 24, 2014 11:06 PM
    Monday, February 24, 2014 10:39 PM
  • I think that sounds like a different issue than what we're describing.  For us, SMB stops working, but the disk sub-system is fine.  I can copy files without issue (when logged into the server).
    Tuesday, February 25, 2014 1:46 PM
  • What's everyone's AV and processor?  I've read that there's a known bug with a specific processor.  I have the same processor family as the one mentioned here, but a different model.

    http://social.technet.microsoft.com/Forums/en-US/7a625cfb-0252-407a-96bc-131a1a7f291a/intermittent-loss-of-unc-path-access-on-windows-server-2012?forum=winserver8gen

    For our AV we use ESET.

    Tuesday, February 25, 2014 1:48 PM
  • Well, several more weeks and lots more $$ later, I think (for reals) I've found the issue... though I still can't seem to resolve it.

    It has never been hardware related.  Looking back I should have suspected that from the start.  I attempted streaming/copying continuously from the share directly from the IP address of the server... so, \\192.168.0.4\<share>.  Whenever I do that, there's never a problem.

    What I'm seeing now is that the problem exists only while streaming from a share accessed through either an A record or a CNAME.  Even going directly to the server via the server name fails: \\<servername>\<share>.  I have removed all records of the server, rebooted it and let it re-register itself in DNS, and still the problem persists while accessing via DNS host name or aliases.

    There are numerous articles like this one describing additional steps to be taken to accommodate sharing via DNS names, but they haven't worked for me either:

    http://forums.techarena.in/windows-server-help/1195474.htm

    So anyways, that's where I stand with it.  \\<IP>\<share> = good to go    \\<anythingdns>\<share> = no dice.

    Wednesday, February 26, 2014 1:38 AM
  • I'm not sure what this actually showed me, but last night I let a bunch of stuff stream to a Windows 7 PC in the living room using the FQDN share path, and everything ran beautifully all night.  On the other hand, on my 8.1 PC (which is where I've been doing the bulk of the troubleshooting), I streamed content all night long accessing the same share by IP address and surprisingly... I received errors all night (share becomes unavailable for a period then re-establishes itself).

    Every test I do invalidates the last one.  This 8.1 machine streams fine from all other physical and virtual machines in the house... all either running Windows 7 or 2012 R2, and up until last night it streamed from the server by IP without issue.

    Wednesday, February 26, 2014 3:03 PM
  • Let me amend my last post.  The 8.1 machine was actually accessing the content through a mapped drive (X:), which was MAPPED to \\<IP>\<share>.  Sitting here this morning with some ZZ Top playing while I work, everything disconnects briefly... and if I have any explorer windows open to the X: drive this window pops up:

    [Screenshot: one of the errors I get]

    It then goes away and everything resumes.  Previously when I streamed without issue, I simply opened up \\<IP>\<share> in an explorer window and played content from there.  I'll try that again now instead of accessing through a mapped drive.

    Wednesday, February 26, 2014 3:20 PM
  • No problems so far streaming from directly browsing to \\<IP>\<share>\

    I'm going to try streaming from manually browsing \\<FQDN>\<share>\, and not from a mapped drive, and I'll report back.

    FYI, my maps are handled in Group Policy... and not by login scripts.  I did have problems in the past with Group Policy drive mappings in Windows 8+.  That problem was fixed by de-selecting "Reconnect".  As another test, perhaps I'll manually map a drive to \\<FQDN>\<share> and try that.

    I think we're narrowing it down? 

    Wednesday, February 26, 2014 5:25 PM
  • Alright, I have streamed all day successfully when accessing the share directly and NOT through a mapped drive.  At least in my scenario, I can pretty confidently say the problem exists only when interacting with a share over mapped network drives.  When interacting with the share directly, like through explorer by entering the UNC, the problem is not reproducible.

    But, since I WANT to be able to use mapped drives, I'm going to test more scenarios... like manually mapped drives and drives that are mapped by login group policy scripts instead of drive-map policies, and see what that shows me.
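For comparing access paths without sitting next to a media player all night, a small probe loop does the job. This is a minimal Python sketch, assuming that listing the directory is enough to notice the share dropping out; `probe_path` is a made-up helper, and the paths in the usage comment are placeholders to be filled in.

```python
import os
import time
from datetime import datetime


def probe_path(path, attempts, interval_seconds=0.0):
    """List `path` repeatedly; return a (timestamp, error) pair for each failure."""
    failures = []
    for attempt in range(attempts):
        try:
            os.listdir(path)
        except OSError as exc:
            # Record when the path was unavailable and what the OS reported.
            failures.append((datetime.now(), exc))
        if interval_seconds and attempt < attempts - 1:
            time.sleep(interval_seconds)
    return failures


# Example usage (placeholders, run one process per path for the same period):
#   probe_path(r"\\<IP>\<share>", attempts=2880, interval_seconds=30)
#   probe_path("X:\\", attempts=2880, interval_seconds=30)
```

Running it against the UNC path and the mapped drive letter over the same period gives timestamps that can be lined up against the server's event log.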

    Thursday, February 27, 2014 2:10 AM
  • Here's the link to that other drive mapping issue from nearly 1.5 years ago now with the release of Windows 8...

    http://social.technet.microsoft.com/Forums/en-US/7b033812-4ead-426d-a25b-aa5082859a25/cant-map-network-drive-with-login-script?forum=W8ITProPreRel

    Not sure if there's a relation, but meant to put that link in my earlier post as I referred to it.

    Thursday, February 27, 2014 4:04 AM
  • I honestly don't think it matters how you access the share, as Windows doesn't care whether you map by DNS name or IP, or whether you manually map a drive, do it via GPP, or connect to a UNC path.  The issue, whatever it is, is related to the Server service and has nothing to do with the client.  Whatever it is, it is new to 2012 / 2012 R2.  So SMB 3.0 could be a culprit, as could a whole slew of other things.  Would be nice if MS would chime in.
    Thursday, February 27, 2014 1:32 PM
  • Sure seems like that would be the case... but my tests, as relatively non-technical as they are, do seem to suggest it's at least related to mappings.  Last night I streamed continuously throughout the night from the manually mapped drive and got failures and unavailability messages several times.  Tests from directly accessed shares never so far have the issue (and I can reproduce it reliably).

    Is there some other mechanism involved in talking to shares through maps?  Extra authentication, DNS queries, anything?  Either from the client or host?  Intentional and periodic disconnects and reconnects?

    I reproduced this problem as well with 2012, and even 2008 R2 (see above).  In all cases, I was using a 2012 R2 Domain Controller/ DNS server... and in all cases I was using a Windows 8.1 client.

    John

    Thursday, February 27, 2014 2:27 PM
  • So, that test I did all night with the manually mapped X: drive (and all other drives disconnected) failed... but then I noticed something this morning while continuing to stream media from it.  The GPOs got run and therefore re-mapped all the drives I had disconnected, presumably including the X: drive.  I realized this only because suddenly all the mapped drives I had manually DISconnected reappeared... without needing a reboot or re-login.

    When this GPO was re-applied, at that very moment all the streaming became unavailable and the errors I've been seeing appeared.  So, as of right now, it seems like the drive mapping GPO perhaps gets reapplied periodically... refreshes... or something.  When this happens any network activity using the mapped drive is interrupted until the new connection is established.

    To test this, I deleted all my mapped drive GPOs this morning and rebooted my client.  I then manually mapped the X:\ drive to \\<FQDN>\<share>, and so far for about 4.5 hours there hasn't been one failure.  My guess is because the GPO for mapping drives won't ever run.

    Thursday, February 27, 2014 7:00 PM
  • For us specifically, this is a server issue, not a client issue.  The clients can all access other shares just fine; the server (specifically the Server service) freezes and will not recover unless we reboot.
    Thursday, February 27, 2014 7:55 PM
  • I've been working through this with Microsoft over the last few days - it's proving to be a tricky one to pin down.

    For those who are experiencing the problem, I'd be interested to know:

    • Is everyone seeing SMBServer Event ID 1020: File system operation has taken longer than expected?
    • What OS are your clients running?
    • What are your disk counters saying while it's happening? (Diskperf -Y -> Taskmgr -> Active Time, Read Speed, Write Speed? or Performance Monitor)
    • Does restarting the Server service solve the problem? (It does for us, but it takes ages to stop the service, but then again the machine also takes ages to shutdown for the same reason)
    • Who is seeing SMB negotiation problems? (Either using Network Monitor 3.4 or powershell Get-SmbConnection.  We are seeing Windows 8.1 occasionally incorrectly negotiating SMB3 instead of SMB3.02)
    Thursday, February 27, 2014 8:11 PM
  • To answer your questions, since I think we're seeing the same issue:

    • No 1020 event in system or the application log
    • Clients are a mix of Windows XP through Windows 8.  TMK, Windows 7 and Windows 8 are the only ones I've seen with the problem.  However, we have a very small XP population, so that may not be 100% accurate.
    • I didn't think to look at the disk counters, but will post the next time I see it (it's been a full week with no issues)
    • We try to restart the Server service, but it times out.  I've never waited to see if it would restart and simply rebooted the server.  Every time we've had the problem though, the Server service is hung.  The reboot itself is actually quick.
    • We do see SMB negotiate problems and I have a capture of it.  Basically the negotiate packet comes in from the client to detect the dialect, and then the server never responds with the dialect.  There is TCP communication that's sent to the client.

    Other info:

    Server OS = Windows 2012 R1

    AV = ESET (Nod32)

    Thursday, February 27, 2014 8:51 PM
  • Alright, I've streamed successfully all day with a manually mapped drive and no drive mapping GPO applied.  I think as far as my issue is concerned I'm good to go.  If any of you who are experiencing this problem use drive mapping GPOs to map drives, try instead to map drives with login scripts.

    I will say that never did I have a server freeze, requiring a reboot.  So, I'm not entirely sure my issue ended up being the same as the OP's.  Nonetheless, this is what I've found.  Hope it helps somebody out there.

    John

    Friday, February 28, 2014 1:43 AM
    • We are seeing EventID 1020 but for us the problem is very rare.
    • 99% Windows 7 with a couple of Win 8/8.1
    • Haven't been able to check counters during the problem yet.
    • Restarting the Server service takes ages (I've never actually waited long enough for it to finish restarting).  Rebooting the whole server is much quicker in our scenario.
    • I have not tried this during a failure yet.
    Friday, February 28, 2014 8:44 AM
  • Dan & Eric thanks for your replies.

    Eric - the 1020 event log error is logged in the SMBServer log which can be found under: Event Viewer | Application and Service Logs | Microsoft | Windows | SMBServer | Operational

    We aren't running any AV on our file servers at the moment - took it off for troubleshooting.  We normally run System Center Endpoint Protection.

    Friday, February 28, 2014 9:28 AM
  • Forgot to mention with AV.  Ours did have ESET Nod32 but it has been uninstalled since the first occurrence of the problem.
    Friday, February 28, 2014 9:31 AM
  • Did MS have you do any kind of dump while the issue is occurring?  I would presume they'd be able to see the hang up if so. 

    Also, can you provide your case number?  I'm hoping to open a case soon and I'd like to link to yours as well.

    Finally, for whatever reason, my SMB server log is empty.

    Friday, February 28, 2014 1:49 PM
  • Hi guys,

    Out of interest, have you migrated the volumes on your file servers from other versions of Windows Server, i.e. did they use to be attached to Server 2008, or were they created new on 2012?

    Can you check the status of 8.3 name creation on your volumes? Run fsutil 8dot3name query D:

    Thanks

    Saturday, March 01, 2014 11:50 AM
  • Hi DJL,

    In our case the volume was originally created on Server 2012 and the whole server was upgraded to 2012 R2.

    8dot3 naming is disabled on the volume giving difficulties.

    Thanks

    Monday, March 03, 2014 9:12 AM
  • In our case the data was robocopied (copyall) from a 2003 volume to a new 2012 volume.  8.3 is disabled for us too.

    FWIW we also have a few Macs accessing our volume, although I suspect that's not related.

    Monday, March 03, 2014 2:50 PM
  • Hi,
    I work for a Network Solutions provider and we have now seen this problem on at least 3 completely separate customer sites, all using different hardware, but all on 2012. We built a server to 2012 R2 last week in the hope it had been resolved but the customer has phoned today to say the server stopped serving files to clients and had to be rebooted. The only 100% fix we have found so far is to rebuild the server back to 2008 R2, after which we have never seen the problem again.

    We have logged the case with Microsoft and I will update if we get anywhere with it but at the moment most of the blame seems to be on AV (Sophos) although it is fine under 2008R2 and other people with the same problem have tried without AV and the issue still exists.

    I would be willing to work with anyone/share ideas to try and get this resolved for all of us.

    Thanks

    Monday, March 03, 2014 4:04 PM
  • Keep us posted and let us know if you need any specifics from our environment.
    Monday, March 03, 2014 5:19 PM
  • Thanks all for your replies.

    We've just managed to capture the information that Microsoft have been requesting.  Essentially they just wanted a network capture on the server and client at the same time while the problem was occurring, with the client trying to access a UNC on the affected server, and a few other bits thrown in.

    Eric I think we are seeing exactly what you see:

    • The client sends an SMB Negotiate request to the server: SMB: C; Negotiate, Dialect = PC NETWORK PROGRAM 1.0, LANMAN1.0, Windows for Workgroups 3.1a, LM1.2X002, LANMAN2.1, NT LM 0.12, SMB 2.002, SMB 2.???
    • We can see this being received by the server but it sends no SMB response back. We do see a TCP response on 445 but it's not SMB
    • The client resends the SMB Negotiate approx. every 20 seconds due to a lack of response from the server
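    The pattern in the capture above (negotiate requests resent roughly every 20 seconds with no SMB reply) can be picked out of an exported packet list with a few lines of Python.  This is only a rough sketch: the Packet record, its field names and the sample data are made up for illustration and are not the export format of Network Monitor or any other capture tool.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    time: float  # seconds since the start of the capture
    src: str     # sender (hypothetical label, e.g. "client" or "server")
    kind: str    # "negotiate_request" or "negotiate_response" (hypothetical labels)

def unanswered_negotiates(packets, client, server, timeout=20.0):
    """Return the times of client negotiate requests that got no
    negotiate response from the server within `timeout` seconds."""
    responses = [p.time for p in packets
                 if p.src == server and p.kind == "negotiate_response"]
    stalled = []
    for p in packets:
        if p.src == client and p.kind == "negotiate_request":
            answered = any(p.time <= r <= p.time + timeout for r in responses)
            if not answered:
                stalled.append(p.time)
    return stalled

# Made-up capture: two requests go unanswered, the third gets a reply.
capture = [
    Packet(0.0, "client", "negotiate_request"),
    Packet(20.0, "client", "negotiate_request"),
    Packet(40.0, "client", "negotiate_request"),
    Packet(41.5, "server", "negotiate_response"),
]
print(unanswered_negotiates(capture, "client", "server"))
```

    The 20 second window is taken from the retry interval described above; adjust it if your clients retry on a different schedule.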

    We killed IOMETER (we were using it to stress the file server) and waited another 5 mins and the server recovered and eventually the client got a response to its SMB negotiate request and they negotiated SMB 3.02 correctly.


    • Edited by DJL Monday, March 03, 2014 5:28 PM
    Monday, March 03, 2014 5:28 PM
  • Hi Everyone,
    I have spoken to another IT guy and he has a number of 2012 servers and has not seen this problem yet (or isn't aware of it) and the only difference we can think of is that he is running Datacenter not Standard. We have seen the problem on 2012 Standard and 2012R2 Standard, can you all confirm the versions you are using?

    Also, DJL, are you saying you can reproduce the problem on demand? If so can you google the following article (I can't post a link at the mo) and try changing the timeout value to something lower to see if it stops the problem from occurring?
    Microsoft network server: Amount of idle time required before suspending session

    Thanks

    Tom

    Tuesday, March 04, 2014 8:33 AM
  • We are running Server 2012 R2 Standard
    Tuesday, March 04, 2014 8:39 AM
  • Hi Tom

    We're running 2012 R2 Datacenter.

    Yes we can reproduce the problem on demand (pretty much) on two cleanly installed Windows Server 2012 R2 Core Datacenter virtual machines, no software other than Windows. 

    I'll give that value a try at some point - Microsoft are having us run through various tests at the moment and they are quite specific on what we can/can't change, so I'll need to wait until we've finished those tests.


    • Edited by DJL Tuesday, March 04, 2014 12:35 PM
    Tuesday, March 04, 2014 12:32 PM
  • Hi,
    Are you able to explain how you can reproduce the problem (if it isn't too complicated/time consuming for you) so I can do some investigations on our networks?

    Thanks

    Tom

    Tuesday, March 04, 2014 12:40 PM
  • Sure, I'll go into detail about our setup as well.

    Our setup:

    • 4x Dell PowerEdge R610's (Intel Xeon X5560, 144GB RAM, Broadcom 1Gbps LOM and Intel 10Gbps X540-T2) running Windows Server 2012 Datacenter Core / Failover Cluster / Hyper-V
    • 3x iSCSI SAN's - 2x Dell EqualLogic PS6000 and 1x PS6110
    • The file servers we see the problem on are Windows Server 2012 R2 Datacenter Core.  Their system/boot disks are VHDXs on Cluster Shared Volumes and the file data is stored on SCSI pass-through disks.
    • The file servers have the IPv4 stack uninstalled - we run IPv6 only.
    • All hardware is running the latest firmware/drivers etc
    • Client workstations are running Windows 7 SP1 and Windows 8.1.  All latest updates from Windows Update/WSUS are installed

    To reproduce the problem we:

    • Map a share on the server to a workstation.  Run IOMETER on the share to stress the server.  IOMETER settings are: 2,000,000 sectors | 400 outstanding IO | 512B 100% read access specification | 4 workers.  This takes the disk activity up to 100%
    • We then log on to a number of Windows 8.1 workstations simultaneously - the users' roaming profiles are stored on the same server/volume.
    • We normally log in to about 40 machines at the same time to make sure the problem happens, but it can happen with as few as 1 or 2 machines.
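    For anyone sizing a similar test, here is a quick back-of-the-envelope calculation of what those IOMETER settings amount to.  It is a sketch under two assumptions that are not stated above: that a sector is 512 bytes, and that the outstanding IO setting applies per worker rather than in total.

```python
SECTOR_BYTES = 512            # assumption: 512-byte sectors
sectors = 2_000_000           # test file size in sectors, from the settings above
outstanding_per_worker = 400  # assumption: the "400 outstanding IO" is per worker
workers = 4

# Size of the region IOMETER reads from, in bytes and MiB.
test_file_bytes = sectors * SECTOR_BYTES
test_file_mib = test_file_bytes / 2**20

# Upper bound on the number of 512B reads in flight at once.
total_outstanding = outstanding_per_worker * workers

print(f"test file: {test_file_bytes} bytes ({test_file_mib:.1f} MiB)")
print(f"max reads in flight: {total_outstanding}")
```

    Under those assumptions the test file is a little under 1 GiB, so the load comes from the depth of the request queue rather than from the amount of data.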

    I'd be interested to know what processors you guys are using?  

    Tuesday, March 04, 2014 1:09 PM
  • Xeon E5520 and Xeon 5620 

    We also have a 4th cluster node with a brand new Xeon E5-2643 v2 but the file server has never really been hosted on that node.

    Thanks for the info on how you have reproduced the error.  I may try the same with our file server out of hours and see if I can trigger the same result.  

    Tuesday, March 04, 2014 1:36 PM
  • The latest server we have seen this on is using Dual Xeon E5-2620 running Citrix XenServer 6.2

    Thanks for the details, very helpful, I will see if I can replicate the problem and post back when I can.

    Tom

    Tuesday, March 04, 2014 4:16 PM
  • For what it's worth, I'm encountering the same problems at my workplace. Setup:

    • Intel Xeon E5-2420
    • VMWare ESXi 5.5, build 1474528. 
    • Windows 2012 R2 Essentials
    • 2012 R2 as sole Domain Controller, running DNS, AD, DHCP and file/print sharing
    • No WSUS set up yet.

    Problem seems to manifest most often during file saves in Office 2007, but 90% of our document shuffling is spreadsheets, so it would make sense that's where we see it most frequently. Once a single user starts having the issue, it starts to show up on others. All of our workstations are running XP or Windows 7.

    Seems to happen most frequently on Windows 7 clients.

    Disabling SMB2/3 seems to have allowed me to pull an individual workstation out of the stall just by waiting for the network share to display properly again, but it's not a solution... it usually takes a few minutes for it to resolve the share contents properly. It's not a good approach, just a stopgap that allows (eventually) saving open files.

    .tmp files with the file name as a random hex value show up anywhere we've had a workstation stall out during a save. The file itself is basically inaccessible in most cases until the next reboot of the server. Those temp files often can't be addressed, opened, deleted or used in any way without causing explorer to freeze.

    Once the server is rebooted, I can collect and delete all of those .tmp files, or open them in their respective programs. (word/excel/etc) Once in a while, the original file is also corrupted and can't be opened/used/saved over/replaced/renamed until a server reboot. 

    Not a lot to add to the discussion, just another instance of it happening. I've been following this thread and http://www.edugeek.net/forums/windows-server-2012/126721-server-2012-file-server-suddenly-stops-serving-requests-but-otherwise-looks-fine.html and hoping someone comes up with a solution sooner or later.

    As it currently stands, I wouldn't recommend deploying either 2012 or 2012R2 as a file server in any circumstance. Works great for everything else, but this pretty much shuts our entire workplace down, sometimes multiple times a day, since our key software has data files hosted on the file share. 



    • Edited by bergmbe Tuesday, March 04, 2014 5:47 PM
    Tuesday, March 04, 2014 5:10 PM
  • Setup:

    AD Domain 2008R2, 2 VM DCs (08r2) 1 Physical (2012)

    2 VM FS and 2 Physical.  They are set to cluster.  so clstr1 and 2 physical, 3 and 4 vm.

    \\fs is the file share server name.

    Storage network ISCSI jumbo frames.  Shares on a Dell PS6500.  VMS hosted on MD3220 (same network)

    When this occurs I can move the Node/FS Role to a new server and it comes back.  I have to reboot the original server if I want it to work again.

    Problems start with some people and spreads.  We do folder redirects and people are quick to tell us of the issues.

    We brought in consultants to help resolve the issue.  The packet capture was very interesting.  So very responsive to all other traffic but SMB (1 and 2 no 3) show crazy delay with negotiating protocol.

    So in simpleton terms, client say hey, server ack, client smb access, server waits, 50 seconds later client say forget you, server says fine.

    Problems are occurring more because we thought XP (since hotfix installed on server) was part of the issue, and we have migrated to Windows 7 heavily.  Now crashes have gone from once a week to 1 to 2 times a day.

    We increased the resources of the AD PDC (VM) thinking it was pegged (that was yesterday).  Today we had the typical issues right before outright failure.  Moved the role and services returned.  I will be taking this article to the powers that be so we can start a downgrade (I feel that is the best solution).  Oh, and these are brand new builds (fresh installs).

    We are tight on storage (SAN presenting lun to servers over ISCSI) might have to build and move the LUN to new server.  Anyone do this and have any issues?

    • Edited by tcgood Wednesday, March 05, 2014 12:03 PM
    Wednesday, March 05, 2014 4:20 AM
  • Also we use GPP to map drives - does anyone above have the issue with login scripts?


    Just trying to find a common denominator.
    • Edited by tcgood Wednesday, March 05, 2014 4:01 PM
    Wednesday, March 05, 2014 4:00 PM
  • Just an update:

    We've reproduced the problem twice today.  Both times we captured SMBServer tracing, network capture and full memory dump from both the server and a client.

    Microsoft support are now analysing these - hopefully they'll be able to pinpoint something!

    Wednesday, March 05, 2014 5:46 PM
  • Team,

    We are also seeing the same issue a couple of times per day with very similar symptoms.

    Yesterday I changed the autodisconnect registry settings, which seemed to make the dropouts occur less often; however, the issue did re-occur today, once! I know this article doesn't relate directly to Server 2012 R2 but nonetheless it could have an effect. See KB297684 (I can't post links yet)

    I would be interested to know if this makes a difference for those who can re-create the issue, as my dropouts occur without warning and I cannot re-create them.

    DJL; please keep us updated on your contact with MS, hopefully they find the issue and release a patch.

    Thursday, March 06, 2014 5:33 AM
  • Can you give me your Case Number? Maybe we can link our cases.
    Thursday, March 06, 2014 8:23 AM
  • My case number is 114022411211053.  Can you post yours and I'll try and link it from my side as well

    Thursday, March 06, 2014 11:50 AM
  • I know I came in late with this, but my case is 114030511237424.  I really am trying to find a common denominator here with our configuration.
    Thursday, March 06, 2014 2:32 PM
  • Thanks - I've sent your case number to the engineer dealing with my case.

    I'm not sure there is a common denominator other than Windows 2012/2012R2 - I think it's just a bug in the SMB Server or associated components. I'm seeing the problem with clean installs of Server 2012 R2 core - no av, backup, monitoring etc.

    I'm going to stick Server 2012 R2 on a desktop tomorrow and see if I can reproduce the problem on that - if I can then it'll rule out any problems with virtual machines, iSCSI, passthrough disks etc


    • Edited by DJL Thursday, March 06, 2014 10:06 PM
    Thursday, March 06, 2014 10:05 PM
  • Microsoft says they are calling me but I get nothing on my phone.  I did a netstat from my DC and found that several computers were connected with hundreds of LDAP sessions.  Today we had issues with people gradually losing connections and with powershell I ran netstat -an | Select-String -pattern ":389" .

    I found that the file server was no longer connected to a DC.  It is strange; it was like my AD was experiencing a DDoS on LDAP.  So I tracked down one of those PCs and ran netstat -b to find out why there were so many connections to the DC on 389.  svchost was running gpsvc.dll with tons of connections.
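    If you want to see which clients are holding the most LDAP sessions, the netstat output can be tallied per remote address.  A minimal Python sketch, assuming the text comes from netstat -an run on the DC and uses the usual Windows column layout (protocol, local address, foreign address, state); the sample addresses below are made up.

```python
from collections import Counter

def ldap_connections_per_client(netstat_output, port=389):
    """Count ESTABLISHED TCP connections to the local `port`, per remote host."""
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Expect: TCP  <local addr:port>  <foreign addr:port>  <state>
        if len(fields) < 4 or fields[0] != "TCP":
            continue
        local, remote, state = fields[1], fields[2], fields[3]
        if state != "ESTABLISHED":
            continue
        # rsplit on the last colon so IPv6 addresses like [::1]:389 also work
        if local.rsplit(":", 1)[-1] == str(port):
            counts[remote.rsplit(":", 1)[0]] += 1
    return counts

# Made-up sample of netstat -an output as seen on the DC.
sample = """
  TCP    10.0.0.5:389     10.0.0.21:49712    ESTABLISHED
  TCP    10.0.0.5:389     10.0.0.21:49713    ESTABLISHED
  TCP    10.0.0.5:389     10.0.0.30:50100    ESTABLISHED
  TCP    10.0.0.5:445     10.0.0.30:50101    ESTABLISHED
  TCP    0.0.0.0:389      0.0.0.0:0          LISTENING
"""
for client, n in ldap_connections_per_client(sample).most_common():
    print(client, n)
```

    Clients at the top of that list are the ones worth checking with netstat -b.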

    Anyway blah blah blah

    http://support.microsoft.com/kb/2561285

    Still verifying the fix will work - have to spend the week applying this hotfix to problem machines.  I will let everyone know if we are good for awhile. Oh, and this is a year old fix that is not part of updates for windows 7.

    • Edited by tcgood Friday, March 07, 2014 3:00 AM
    Friday, March 07, 2014 2:56 AM
  • I swear working with Microsoft Support is torturous sometimes! No progress here yet..
    Monday, March 10, 2014 1:16 PM
  • Yeah it is.  Did you check your DCs netstat connections over 389?  We have hundreds of computers to attempt to apply the hotfix to.  Also during my troubleshooting of the server I have found that I can move the node over to a working server without an issue coming back to life - after which trying to \\clstr2\ no response until reboot.  At this point I am just ranting...
    Monday, March 10, 2014 8:17 PM
  • Applied that hotfix to our Windows 7 PCs, where we're experiencing most of the issues. We only have ~4 on site, so it was fast. Encountered the same issues about 2 hours after applying it, so no dice for our site at least. Let me know if it works for yours. Good luck with the rollout! That sounds like a nightmare. 

    I had not been seeing the :389 issues on netstat that you are though, either before, during or after the fileshare issues manifest. 

    • Edited by bergmbe Monday, March 10, 2014 9:10 PM
    Monday, March 10, 2014 9:05 PM
  • There just has to be a common piece to our issues.  I am pushing the patch using PSexec and a text file - it should be on them by the end of the day.  We had another outage today and sent our capture to Microsoft.  I am really close to downgrading to 2k8r2.  For us it is happening daily now - and often several times in a day.

    How many devices do you have on your network?  Does your FS have a lot of traffic?

    Tuesday, March 11, 2014 3:06 PM
  • So for those with MS cases open, does MS have nothing to say yet?  I've seen at least one of you had generated a system dump.  You would think that's all that's needed for MS to figure it out.
    Tuesday, March 11, 2014 4:21 PM
  • Afternoon all,

    Give this registry setting a try:

    reg add HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters /v DisableLeasing /t REG_DWORD /d 1 /f

    I've just tested it on one of our servers and I can't reproduce the issue at the moment so it looks like it may solve the problem! (I'm trying to leave the office for the day so haven't tested extensively!)


    • Edited by DJL Wednesday, March 12, 2014 12:36 PM
    Tuesday, March 11, 2014 5:24 PM
  • Can you tell me more about the key?  I hate to add a reg key and not know what it does.

    Thanks

    Tuesday, March 11, 2014 6:04 PM
  • @TCGood

    Single server, ~14 workstations plus 5 additional devices. Not a ton of traffic, but at any given time we have 20-30 files open with locks. Mostly small files under 1mb. The busier days with lots of opens and saves seem to coincide with this problem manifesting more frequently. 

    Then again, I came in to the issue this morning and no one had been using the network at all for 12 hours. Only 3 active logins when I arrived, and 2 of them were in the process of stalling out. All other active file share log ons were unresponsive. In the interest of getting people back to work, I just restarted the server.

    I'm with you on being at the point where downgrading makes the most sense. But my business won't put up funds for a 2008 R2 license, so in lieu of that, I recently set up a second VM and installed Ubuntu LTS. I'll be setting it up as a simple file server later this week until I see some resolution from Microsoft. Not ideal, but it can't be helped at this point.

    Tuesday, March 11, 2014 7:02 PM
  • that is bad, at least I am at 600 devices.  Traffic is between 5 and 100mb mostly.

    1) Are you using Quotas (File Server Resource Manager)

    2) Install disc where/how acquired

    3)Special settings ie continuous availability, Volume Shadow Copy, ABE etc.

    I am still looking for a common factor.  MS has my captures and keeps asking the same questions, "when the cluster goes down can you access the node?" NO! I can RDP, ping, and otherwise responds.  Admin Share and all SMB unavailable.  I am going to spin up a VM with a different install disc to stress test.

    Tuesday, March 11, 2014 8:02 PM
  • The registry key disables leasing.  More info on leasing can be found here: http://technet.microsoft.com/en-us/library/ff625695(v=ws.10).aspx

    The following will be logged in the event log when you add the reg key:

    File leasing has been disabled for the SMB2 and SMB3 protocols.  This reduces functionality and can decrease performance.

    Registry Key: HKLM\System\CurrentControlSet\Services\LanmanServer\Parameters
    Registry Value: DisableLeasing
    Default Value: 0 (or not present)
    Current Value: non-zero

    Guidance:

    You should expect this event when disabling SMB 3 Leasing. Microsoft does not recommend disabling SMB Leasing. Once disabled, traffic from client to server may increase since metadata and data may no longer be retrieved from a local cache.

    So far this seems to have solved the problem for us.  I'll try and get more info out of Msft later once I've fully confirmed it solves the problem for us.

    Wednesday, March 12, 2014 10:16 AM
  • was this key a MS recommendation or something you discovered?  Also in the past have you been able to consistently reproduce the problem? 
    Wednesday, March 12, 2014 2:58 PM
  • When we hired our consultant this was the first thought he had without looking into the issue very far.  It is extremely reliable, but we lose the benefits of SMB 2.0 and 3 (which we don't use).  At 1pm I will have a call back from MS.
    Wednesday, March 12, 2014 4:03 PM
  • Yes - the key was recommended by Microsoft support; apparently they have quite a few people reporting this issue at the moment ;) and yes, we have been able to reliably reproduce the problem.

    Today was quite promising: we brought both our file servers back up to full load and we didn't see any problems once the registry key was set - we haven't managed 20 mins at full load since the problem surfaced so definitely better.  I'm still slightly on edge about it though as Wednesday is generally a quieter day for us - if we manage to make Friday afternoon without it resurfacing I'll be more confident.

    tcgood - you won't lose all the benefits of SMB 2.0, 2.1, 3.00 and 3.02.  The reg key is only disabling leasing; all the other improvements will still be available and 3.02 will still be negotiated (client OS version dependent). 

    Having said that I don't consider the reg key a fix, merely a work around.  Once I'm confident our servers are stable with the reg key i'll push MS to see what they plan on doing about it permanently. 

    The odd thing is file leasing is in SMB 2.1 which was available in Windows 7/Server 2008 R2 and as far as I'm aware this issue didn't affect 2008 R2.  I guess it could be directory leasing which was introduced in SMB 3...


    • Edited by DJL Wednesday, March 12, 2014 8:34 PM
    Wednesday, March 12, 2014 8:28 PM
  • Good to know, thanks for the quick reply.  I'm going to wait till Friday before i try the key.  if you're still stable after then, i'm going to try it in our environment.  

    What do you guys do to actually make the problem occur?  Would running IO Meter on a network share trigger it?

    Wednesday, March 12, 2014 10:13 PM
  • Ok - never mind.  We've just had the problem reoccur - the reg key doesn't fix it :(
    Thursday, March 13, 2014 10:12 AM
  • So now what does MS think?  It blows my mind that they can't figure this out after getting dumps from multiple folks.
    Friday, March 14, 2014 5:42 PM
  • They want us to capture logs, memory dumps and tracing again! - so another late night tonight.

    It's happening every f**king 20 mins on our server at the moment.  I'm royally pissed this morning!


    Edit: Apologies, I'm getting very frustrated with this problem and being told there isn't a problem by PSS.  If it wasn't for the fact our users love work folders i'd be back on 2008 R2 by now.
    • Edited by DJL Monday, March 17, 2014 9:54 PM
    Monday, March 17, 2014 12:05 PM
  • We've reproduced the problem again twice this afternoon and provided PSS with two new complete sets of memory dumps, SMB tracing, event logs, network captures and screen shots from both the server and client.

    They now have 4 sets of this data from us.  Hopefully they'll find something this time!

    This is the weird 100% active time symptom we're seeing in taskmgr:


    • Edited by DJL Monday, March 17, 2014 9:59 PM
    • Proposed as answer by TheOriginalHB3 Tuesday, March 18, 2014 2:48 PM
    • Unproposed as answer by TheOriginalHB3 Tuesday, March 18, 2014 2:48 PM
    Monday, March 17, 2014 5:09 PM
  • We're experiencing the same issue as everyone else on this forum, where our 2012 servers (not all of them, only two of the eight in our environment so far) will not accept SMB connections, but all other connections are fine.  Much like everyone else, we've tried several things (listed below) and the only temporary solution is restarting the server:

    Actions taken so far with no success!

    1.) Restarting the Server Service - The service doesn't start back up, which leads to a reboot anyway.

    2.) Verified the following Rollups were installed.  

    http://support.microsoft.com/kb/2883201/en-gb
    http://support.microsoft.com/kb/2889784/en-gb

    3.) Turned off Background Optimization for Data Deduplication

    We are currently working with MS (Case #114020511159226) on this issue and they have no idea.  They are just having us collect logs and dumps.  We updated the case to a SEV A case, so hopefully we have something today.  So I thought it wouldn't hurt to post on this site as well, to see if anyone else had any thoughts.  I will keep you guys informed as we try things to come to a solution.  

    I came across a possible hotfix (http://support.microsoft.com/kb/2928360) that deals with SMB2 and SMB3 and addresses a memory leak.  I was wondering if anyone noticed during their issue whether the NonPaged Pool memory was exhausted. 

    Tuesday, March 18, 2014 3:16 PM
  • Hi - I came across that article a few days ago and have been monitoring our paged and non-paged pool. Both seem normal and don't change when the problem is occurring.

    The following potential solution was posted on edugeek.  No help for me as we're running Hyper-V, but interesting nonetheless.

    >> Solved the problem on my end. I don't know why it should matter, but I was using an E1000 network adapter on the VM's giving me this issue, I switched them to VMXNET 3 and have not seen the issue re-occur. This was almost a daily problem and it has not happened for the 8 days that I have been running the VMXNET 3 adapters. I know some of you are not using VMWare ESXI hosts, but for those that are, give this a shot! ESXI 5.1 Server 2012R2


    • Edited by DJL Wednesday, March 19, 2014 12:08 AM
    Wednesday, March 19, 2014 12:08 AM
  • @DJL. I saw that on the edugeek forum as well. I'm going to give it a try this weekend, when I can take down our VMs and make sure the changeover won't affect anything else. Like the guy that posted it, we're using VMWare's ESXI 5.5 and assigned our NICs as E1000 network adapters in the windows environment. I still have printers and a small subset of user files running on the share, even though a majority of our files are now hosted on a second VM running Ubuntu Server 12.04 (LTS release). That way I can still do some testing to see if solutions work for us. I'll post back if switching the NICs over helps at all. 

    Thursday, March 20, 2014 9:57 PM
  • RESOLVED!

    OK People! I hope I can help everyone out with this solution MS have provided me. I have been dealing closely with the Network Team, in particular the most skilled escalation tech of Asia Pacific, who happened to be an expert in SMB. My system has been stable for around 10 days now.

    My Environment

    • vmware ESXi 5.1.0 build 1123961
    • Server 2012 R2

    My Symptoms

    MS have recommended two components to this fix. However, with the VM driver fix applied I was still experiencing the issue; it wasn’t until I made the change to srv2.sys that the fix became permanent.

    VMware driver: MS believe that this driver (vsepflt) could have been conflicting with the srv2.sys driver. To disable it, follow this article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2034490

    Srv2.sys: In my opinion this is the actual change that fixed the issue. The srv2.sys driver handles SMB 2 traffic at the kernel level. In operating systems before Server 2012 R2 the driver is set to auto start; Microsoft changed this to ‘start on demand’ in Server 2012 R2, and the driver does not seem to start gracefully when a request is made over SMB 2 or above.

    To change srv2.sys to auto start, open cmd and run sc config srv2 start= auto (sc's documented syntax puts a space after start=).  A reboot is required after running this command.
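    Before and after making that change it can help to confirm what the driver's start type actually is. The following is only a minimal sketch in Python, assuming the text comes from running sc qc srv2 on the server; the sample output is illustrative rather than captured from a real machine, and the helper names are hypothetical.

```python
import re

# Illustrative `sc qc srv2` output (an assumption, not a capture from a real server).
SAMPLE_SC_QC = """\
[SC] QueryServiceConfig SUCCESS

SERVICE_NAME: srv2
        START_TYPE         : 3   DEMAND_START
        ERROR_CONTROL      : 1   NORMAL
"""


def parse_start_type(sc_qc_output):
    """Return (code, name) from the START_TYPE line, or None if it is absent."""
    match = re.search(r"START_TYPE\s*:\s*(\d+)\s+(\w+)", sc_qc_output)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


def needs_auto_start(sc_qc_output):
    """True when a start type was found and it is not AUTO_START (code 2)."""
    parsed = parse_start_type(sc_qc_output)
    return parsed is not None and parsed[0] != 2


print(parse_start_type(SAMPLE_SC_QC))
print(needs_auto_start(SAMPLE_SC_QC))
```

    If the start type still reads DEMAND_START after the change, the sc config command most likely did not apply (for example, because of the missing space after start=).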

    When I talked to the lead network tech about this fix, I referred him to this thread and mentioned how many people it was affecting.  He advised that in every case of this issue he had seen, changing srv2.sys to auto start had worked 100% of the time.  I asked why MS haven’t released an official patch: because there have not been enough cases to warrant an official fix.

    I’m very interested in whether this fixes your issues. Please mark this as an answer if it brings you success! Best of luck!



    Monday, March 24, 2014 5:20 AM
  • Thanks for sharing all the feedback and progress.  I've applied the service fix above to our server and will wait and see. 

    As for the reg fix, I would rather not disable leases just yet!

    Monday, March 24, 2014 9:29 AM
  • I have been having the same issue running ESXi 5.5.0 1331820 with Windows Server 2012 R2 Essentials and this is the only fix so far that has made my system stable. We have been having problems of the system crashing multiple times a day, but it has been running for 2 days continuously.

    Thanks

    Tuesday, March 25, 2014 8:44 AM
  • Thanks for sharing all the feedback and progress.  I've applied the service fix above to our server and will wait and see. 

    As for the reg fix, I would rather not disable leases just yet!

    What reg fix?  I know you mentioned one above, but I thought it wasn't working.
    Tuesday, March 25, 2014 4:27 PM

  • When I talked to the lead network tech about this fix, I referred him to this thread and mentioned how many people it was affecting.  He advised that in every case of this issue he had seen, changing srv2.sys to auto start had worked 100% of the time.  I asked why MS haven’t released an official patch: because there have not been enough cases to warrant an official fix.


    Rant:

    To be fair, the only reason I haven't officially reported/opened a case with Microsoft is because my company can't afford the service contracts they charge for issues like this. I understand the need for service contracts during personalized troubleshooting, but it seems counter-intuitive to me in a case like this where it's quite clearly their buggy SMB2 & 3 "upgrades" causing the problems.

    On topic:

    Once others post their experiences with this fix, I'll try it on our system as well. At the moment, I'm verifying stability after changing our network adapters from the VMware-assigned E1000 to VMXNET3. So far so good on that fixing our issues, but I had to move the bulk of our fileshare to an Ubuntu LTS release running Samba in order to get a reprieve, so I'm not sure I'm really testing it. If that change that DJL mentioned from the edugeek thread doesn't fix the issue, I'll try your solution, which I'm extremely glad to have as a backup option. Thanks very much for your report on this. 


    • Edited by bergmbe Tuesday, March 25, 2014 11:36 PM
    Tuesday, March 25, 2014 11:36 PM
  • Thanks for sharing all the feedback and progress.  I've applied the service fix above to our server and will wait and see. 

    As for the reg fix, I would rather not disable leases just yet!

    What reg fix?  I know you mentioned one above, but I thought it wasn't working.
    Just want to confirm that the registry fix for leasing does NOT work. At least 3 or 4 of us tried it and it didn't actually solve it. All of us that did it altered it back after we discovered it wasn't the fix. 
    Tuesday, March 25, 2014 11:37 PM
  • I'll check in next week.  If the fix is still working I'll move forward with making the change.

    My next question for MS would be, should we make this the default for all new builds?  What's the downside?

    Wednesday, March 26, 2014 1:34 PM
  • What reg fix?  I know you mentioned one above, but I thought it wasn't working.

    EricCSinger follow these instructions to fix your issue (see full description in my post above), let me know how it goes.

    MS have recommended two components to this fix. However, with the VM driver fix applied I was still experiencing the issue; it wasn’t until I made the change to srv2.sys that the fix became permanent.

    VMware driver: MS believe that this driver (vsepflt) could have been conflicting with the srv2.sys driver. To disable it, follow this article: http://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2034490

    Srv2.sys: In my opinion this is the actual change that fixed the issue. The srv2.sys driver handles SMB 2 traffic at the kernel level. In operating systems before Server 2012 R2 the driver is set to auto start; Microsoft changed this to ‘start on demand’ in Server 2012 R2, and the driver does not seem to start gracefully when a request is made over SMB 2 or above.

    To change srv2.sys to auto start, open cmd and run sc config srv2 start= auto (sc's documented syntax puts a space after start=).  A reboot is required after running this command.

    When I talked to the lead network tech about this fix, I referred him to this thread and mentioned how many people it was affecting.  He advised that in every case of this issue he had seen, changing srv2.sys to auto start had worked 100% of the time.  I asked why MS haven’t released an official patch: because there have not been enough cases to warrant an official fix.

    I’m very interested in whether this fixes your issues. Please mark this as an answer if it brings you success! Best of luck!

    Thursday, March 27, 2014 11:53 PM
  • I applied the sc config srv2 start=auto command and will report back if it does the trick.  TBH, it might be weeks before I know for sure, just depends on what randomly triggers the freeze.  I went weeks with no issues, now it's happening three times a day.
    Friday, March 28, 2014 12:58 PM
  • Thanks vitis_vinifera, I'll give sc config srv2 start=auto a try shortly.  I can reproduce the problem on demand so it should be fairly easy to see if it works or not.

    bergmbe, which country are you in?  In the UK you can pay for support on a case by case basis using a credit card.  It costs ~£240 per case and they only take payment if they fix the problem.  They also don't charge if it's a defect in their software.  Pretty good value I think.



    • Edited by DJL Friday, March 28, 2014 5:30 PM
    Friday, March 28, 2014 5:27 PM
  • I applied the sc config srv2 start=auto command and will report back if it does the trick.  TBH, it might be weeks before I know for sure, just depends on what randomly triggers the freeze.  I went weeks with no issues, now it's happening three times a day.

    For me, SMB crashed again after the "fix".

    I can't fathom why MS can't figure this out.  Seriously, it's their freaking code, and you guys have provided dumps; what else do they need?  For those that have cases, has anyone actually gotten this escalated to American tech support, in other words level 3?

    Friday, March 28, 2014 10:39 PM
  •  sc config srv2 start=auto isn't a fix - I can still reproduce the problem

    Eric, I share your frustration.  I've been escalated to level 2 and had to basically start from the beginning again with the new engineer.  After going through disabling all the advanced NIC features and blaming 3rd party software (there isn't any) we are back to memory dumps and tracing again.

    I've had to migrate one of our file servers back to 2008 R2 (a 4.5 yr old OS!), but I'm stuck with 2012 R2 on others as we're using work folders.


    Sunday, March 30, 2014 7:32 PM
  • We also have this issue now on two customer sites. Both are running 2012 R2 Essentials. One has VMware 5.5 on Dell T320 hardware and the other has VMware 5.1 on an HP ProLiant ML350p Gen8. Start=Auto didn't fix it, but with it the problem happens a little less often. No real solution yet.
    Monday, March 31, 2014 7:58 AM
  • How do you reproduce the problem?

    This just happens on our fileservers once every 10-14 days.

    Is there a trick or can you give us a hint how we can reproduce this to speed up debugging?

    Monday, March 31, 2014 11:59 AM
  • I don't know how to reproduce the problem, but one thing I've done as a test is rip out my AV (ESET).  I'll let you know if things seem stable afterwards.  The problem is that the hangs are intermittent.  Sometimes they happen every few hours, sometimes it's weeks.
    Monday, March 31, 2014 4:20 PM
  • So far things are good.  We noticed that two other file servers (Windows 2012) that have not had this problem have no AV installed.
    Monday, March 31, 2014 11:01 PM
  • To reproduce the problem we:

    • Map a share on the server to a workstation. Run IOMETER on the share to stress the server. IOMETER settings are: 2,000,000 sectors | 400 outstanding IO | 512B 100% read access specification | 4 workers. This takes the disk activity up to 100%.
    • We then log in a number of Windows 8.1 workstations simultaneously - the users' roaming profiles are stored on the same server/volume.
    • We normally log in to about 40 machines at the same time to make sure the problem happens, but it can happen with as few as 1 or 2 machines.

    We have no AV on our servers, or any other software for that matter.



    • Edited by DJL Tuesday, April 01, 2014 9:55 AM
    Tuesday, April 01, 2014 9:55 AM
  • To reproduce the problem we:

    • Map a share on the server to a workstation. Run IOMETER on the share to stress the server. IOMETER settings are: 2,000,000 sectors | 400 outstanding IO | 512B 100% read access specification | 4 workers. This takes the disk activity up to 100%.
    • We then log in a number of Windows 8.1 workstations simultaneously - the users' roaming profiles are stored on the same server/volume.
    • We normally log in to about 40 machines at the same time to make sure the problem happens, but it can happen with as few as 1 or 2 machines.

    We have no AV on our servers, or any other software for that matter.



    If you run fltmc in the command prompt, what shows up?
    Tuesday, April 01, 2014 12:50 PM
  • C:>fltmc

    Filter Name                     Num Instances    Altitude    Frame
    ------------------------------  -------------  ------------  -----
    DfsDriver                               0       405000         0
    Cbafilt                                 3       261150         0
    Datascrn                                0       261000         0
    Quota                                   0       125000         0
    npsvctrig                               1        46000         0



    • Edited by DJL Tuesday, April 01, 2014 8:57 PM
    Tuesday, April 01, 2014 8:57 PM
  • Hello,

    I have the identical problem with our 2012 R2 failover cluster. The installation itself is completely standard with only one registry "tweak": NtfsDisable8dot3NameCreation.

    Does everyone affected by this issue have this setting enabled?

    Best regards

    Wednesday, April 02, 2014 7:11 AM
  • Hello,

    I have the identical problem with our 2012 R2 failover cluster. The installation itself is completely standard with only one registry "tweak": NtfsDisable8dot3NameCreation.

    Does everyone affected by this issue have this setting enabled?

    Best regards


    Okay, it's not the problem. After I enabled 8dot3 name creation again, the problem occurred within a few hours (maybe due to the higher load?).
    Wednesday, April 02, 2014 10:33 AM
  • C:>fltmc

    Filter Name                     Num Instances    Altitude    Frame
    ------------------------------  -------------  ------------  -----
    DfsDriver                               0       405000         0
    Cbafilt                                 3       261150         0
    Datascrn                                0       261000         0
    Quota                                   0       125000         0
    npsvctrig                               1        46000         0



    Just to give you an idea, this is all I have.

    Filter Name                     Num Instances    Altitude    Frame
    ------------------------------  -------------  ------------  -----
    npsvctrig                               1        46000         0

    Knock on wood, I've been stable thus far.  You might want to try disabling file screening, quotas, etc. one at a time to see if things start behaving.
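    To see at a glance which filter drivers one server loads that another does not, the two fltmc listings can be compared. This is a rough sketch in Python using the two listings posted above as sample text; parse_fltmc is a hypothetical helper, and it assumes every data row has exactly four whitespace-separated columns as in those listings.

```python
def parse_fltmc(output):
    """Map filter name -> (instance count, altitude) from `fltmc` text output."""
    filters = {}
    for line in output.splitlines():
        parts = line.split()
        # Data rows have four columns and a numeric instance count, which
        # skips the prompt line, the header and the dashed separator.
        if len(parts) == 4 and parts[1].isdigit():
            name, instances, altitude, _frame = parts
            filters[name] = (int(instances), altitude)
    return filters


# Listing from the server that hangs (copied from the thread).
FULL = """\
Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
DfsDriver                               0       405000         0
Cbafilt                                 3       261150         0
Datascrn                                0       261000         0
Quota                                   0       125000         0
npsvctrig                               1        46000         0
"""

# Listing from the server that has been stable (copied from the thread).
MINIMAL = """\
Filter Name                     Num Instances    Altitude    Frame
------------------------------  -------------  ------------  -----
npsvctrig                               1        46000         0
"""

# Filters present only on the first server: candidates to disable one at a time.
extra = sorted(set(parse_fltmc(FULL)) - set(parse_fltmc(MINIMAL)))
print(extra)
```

    The names it prints are the ones worth unloading one at a time, which is the test suggested above.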

    Wednesday, April 02, 2014 1:15 PM
  • That's interesting - thanks Eric.  I'll give that a go.

    I've just sent off another load of memory dumps, tracing, net capture etc to MS PSS!

    Wednesday, April 02, 2014 4:00 PM
  • Just tried disabling the filters one at a time until npsvctrig was the only one left - I can still reproduce the problem :(
    • Edited by DJL Wednesday, April 02, 2014 4:38 PM
    Wednesday, April 02, 2014 4:38 PM
  • Just a quick update on our PSS case:

    We've managed to capture the required information and our case has been sent to the Microsoft Global Business Support - Windows Serviceability Team (GES was a shorter name!) for analysis.

    I can't imagine it's going to be a quick response given the millions of lines of code etc. they'll have to sift through.

    Sunday, April 06, 2014 9:47 AM
  • Any update?
    Monday, April 07, 2014 9:00 PM
  • Looks like it may have struck again for us, although it happened while I wasn't in the office, so I can't verify for sure.  If that is the case, then the AV isn't the cause for us.
    Tuesday, April 08, 2014 10:18 PM
  • sc config srv2 start=auto seemed to help for a few days. After the first crash I created a script that reboots the server every night. Now even that doesn't help: yesterday and today the shares stopped working even though the server had rebooted overnight. Has MS replied with anything about this bug?

    Tuesday, April 15, 2014 4:41 AM
  • No update yet - still waiting for the debugging team.  

    Our first memory.dmp we sent them turned out to be corrupt so we had to recapture all the info for them again hence extra delay. 

    Will update when i have more info

    Tuesday, April 15, 2014 11:40 AM
  • I've just read the entire thread today. We're experiencing the problem on a Windows Server 2008 R2 virtual machine (Windows 2012 cluster). Is this the case for anyone else?

    Fabiano Montelo

    Thursday, April 17, 2014 5:08 PM
  • Hi Everyone, 

    I thought I'd jump in here too. I'm glad to find this thread, I've been pulling out my hair for a while on this one. 

    I have a Win 2012 (not r2) server having the same SMB issue described here. I've been struggling with it for some time. I did a complete  fresh install of a Win 2012 VM, it worked for a while but then the issue popped back up. 

    Here's my situation: 

    Every day or so (sometimes more, sometimes less) the LANMANSERVER service (SMB/SERVER service) stops responding to Win7 clients. Accessing files from the console of the server via the volume drive letters (c, d, e etc) works fine; only the mapped SMB drives (M, S, H) do not work, and accessing via \\IPADDR\share does not work either. 

    -Win XP client seems to be able to access fine, which suggests a problem with SMB 2/3. Trying to Stop and restart the SERVER service does not work, hangs at STOPPING. Only solution seems to be a full reboot. 

    Server setup:

    Windows 2012 Standard server running on VMware ESXi 5.5

    Direct attached storage, RAID 5 on a Lenovo R700 HW RAID controller. Windows disks are just regular vmdk files from VMware, no passthrough, no iSCSI etc. Saw the same issue on a previous RAID 10 array, with a different controller. 

    Other (non win2012) VMs on the same datastore have no problems. 

    There has never been a AV on this server. The clients are running ESET v5. 

    A separate Win 2012 domain controller remains online; "netstat -b" shows 2 connections on 389 to the DC from the file server before and during the issue. I've never seen the file services on the DC have trouble, but they're not doing much.

    Thoughts:

    Like most here, I don't think it's a hardware issue - there are too many things that don't fit. I believe it's an issue with the SMB/LANMANSERVER/SERVER service or a related driver/service. It's hard to believe that after all this time MS doesn't seem to have a common patch.  

    Fixes just tried:

    I have just tried the following fixes suggested here and I'll report back on the success. 

    --Change the Virtual network adapter from E1000 to VMXNET3

    --Autostart SRV2 from an elevated CMD prompt: sc config srv2 start=auto

    --unload driver from an elevated CMD prompt:  fltmc unload vsepflt

    --Change:

    HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\services\vsepflt\Start

    to value "4"

    reboot
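    For reference, the "4" in the last step is the standard Start value for a disabled service or driver. Below is a small sketch in Python that only builds the equivalent reg add command line rather than running it; reg_add_start is a hypothetical helper name, and applying the resulting command on a real server is at your own risk.

```python
# Standard meanings of the Start value under
# HKLM\SYSTEM\CurrentControlSet\Services\<name>.
START_TYPES = {0: "Boot", 1: "System", 2: "Automatic", 3: "Manual", 4: "Disabled"}


def reg_add_start(service, start):
    """Build (but do not run) a `reg add` argument list that sets Start."""
    if start not in START_TYPES:
        raise ValueError(f"unknown start type: {start}")
    key = "HKLM\\SYSTEM\\CurrentControlSet\\Services\\" + service
    return ["reg", "add", key, "/v", "Start", "/t", "REG_DWORD", "/d", str(start), "/f"]


# Disabling the vsepflt driver, as in the steps above.
print(" ".join(reg_add_start("vsepflt", 4)))
print(START_TYPES[4])
```

    Setting the value back to what it was before the change undoes the step after the next reboot.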

    ***************************

    Update: going on 3 weeks, SMB's still working well. I think I've licked it, I was rebooting 1-2 times a day before. 







    • Proposed as answer by CloudThomas Friday, April 25, 2014 7:40 PM
    • Edited by CloudThomas Sunday, May 11, 2014 3:41 AM
    Tuesday, April 22, 2014 7:08 PM
  • Hi all,

    This is the response I have had back after they analysed our last memory dump:

    The SRV2 threads responsible for processing incoming SMB requests are stuck on an NTFS lock owned by another thread trying to perform file system IO. The file system is hung because a couple of IO requests containing 2043 packets to the device SCSI\Disk&Ven_EQLOGIC&Prod_100E-00\000000 have been blocked for over 17 minutes. This has caused many SRV2 threads to hang, since access to file system resources is serialized. With no more threads available in the SRV2 queues, a large number of SRV work items are queued up and the system is unable to process new SMB requests.

    Suggestion:

    1. Engage the vendor of the SCSI device (EQLOGIC) to verify whether there is any underlying issue with the disk.
    2. As a workaround, try increasing the number of SRV2 threads using the following registry key on the file server. This will delay the issue in the current circumstances but will not guarantee remediation.

    http://technet.microsoft.com/en-us/library/cc957460.aspx

    Key        : HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters

    Type      : DWORD

    Value    : MaxThreadsPerQueue

    Default value for MaxThreadsPerQueue is 20 you can try increasing to 1024.

    If issue still occurs please collect the kernel dump again as  before when issue occurs without running the IOMeter and send it for further analysis.

    I haven't tried the registry key yet.  It looks like there is an issue elsewhere which is causing this problem.  I'm going to try to capture another memory dump without running IOMeter.  I can't see that the issue is with our storage system, as everyone else here is seeing it on a variety of different hardware.
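    The reasoning in that response, that a bigger thread pool only delays the hang, can be illustrated with a toy model. This Python sketch is not how SRV2 actually schedules work; it simply assumes a fixed pool of workers, a steady arrival rate, and that every Nth request touches the stalled volume and never returns. All names and numbers other than the 20 and 1024 thread counts quoted above are made up for illustration.

```python
def ticks_until_exhaustion(max_threads, arrivals_per_tick, stalled_every):
    """Return the tick at which every worker is stuck behind the stalled IO."""
    blocked = 0  # workers waiting forever on the stalled volume
    seen = 0     # requests picked up so far
    tick = 0
    while blocked < max_threads:
        tick += 1
        free = max_threads - blocked
        # Only as many requests as there are free workers get picked up;
        # the rest are ignored in this simplified model.
        for _ in range(min(arrivals_per_tick, free)):
            seen += 1
            if seen % stalled_every == 0:
                blocked += 1  # this worker never comes back
            # Every other request completes within the tick and frees its worker.
    return tick


default_pool = ticks_until_exhaustion(20, arrivals_per_tick=10, stalled_every=5)
large_pool = ticks_until_exhaustion(1024, arrivals_per_tick=10, stalled_every=5)
print(default_pool, large_pool)
```

    In this model the larger pool lasts longer but still ends up fully blocked, which matches the "will delay but not remediate" wording in the response.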



    • Edited by DJL Wednesday, April 23, 2014 10:58 AM
    Wednesday, April 23, 2014 10:53 AM
  • I'm experiencing the same issue on a Windows 2008 Server running as a VM within a Windows 2012 Hyper-V cluster.   It's running off of Dell Servers and an Equallogic SAN.   Every week we experience the issue where no one can connect to that specific 2008 server's file shares.  When trying to stop the server service it hangs and won't shut down.  When shutting down the server it also hangs and I have to force a shut off.  After it restarts everything is fine.  I have not tried any of the fixes recommended so please report back if any of the fixes listed are continuing to work.   Thank you everyone for sharing.
    Wednesday, April 23, 2014 12:54 PM
  • For what it's worth, our SAN is Nimble Storage, so it's not the SAN vendor (unless they're both messed up).  I suspect in your case you have the EQL mounted directly via a software initiator?  I bet if it was a VMware virtual disk, they'd be blaming VMware.  Regardless, the storage vendor is a red herring.  To me, it still points back to something messed up in the SMB stack.  

    I was really hoping they were going to come back with something a little more solid (as I'm sure you were as well).  They're troubleshooting the symptom, not the problem.

    Wednesday, April 23, 2014 6:54 PM
  • We are having the same issue on a physical 2k8 R2 server with local storage. 
    Thursday, April 24, 2014 3:29 AM
  • Our server didn't have the MaxThreadsPerQueue DWORD value, so I've created it now; let's see if it helps.

    Key        : HKLM\SYSTEM\CurrentControlSet\Services\LanmanServer\Parameters

    Type      : DWORD

    Value    : MaxThreadsPerQueue

    Default value for MaxThreadsPerQueue is 20 you can try increasing to 1024.

    Thursday, April 24, 2014 4:15 AM
  • Just wanted to chime in that we are seeing this exact issue on Windows Server 2008 R2 Standard running on ESXi 5.0.0.

    Although it happens a lot less frequently for us, maybe once a month or so.

    The last time it happened we had no active AV and the only non-Windows software running is a Commvault backup agent.

    I'll be following this thread closely and look into using IOMeter to recreate the problem.


    -Mikael

    Thursday, April 24, 2014 2:25 PM
  • Eric - yes, definitely a red herring - they've tried to blame 3rd party software/hardware several times now.  We use the Microsoft iSCSI initiator, so literally the whole stack is Microsoft through from receiving/sending the SMB packets at guest level all the way through to iSCSI at the host level.  The only 3rd party code is Intel drivers and the Dell EqualLogic MPIO DSM on the hosts.

    Regardless I spent a day checking and updating our storage arrays.  Dell took diagnostic logging and couldn't find anything wrong (surprise!) so at least that should keep Microsoft happy.

    I've now got to try and capture the memory dump again, plus some new storage tracing, but without using IOMeter to recreate the problem. 

    Is anyone else with open support cases getting anywhere?

    Monday, April 28, 2014 9:10 AM
  • I can confirm we see the issue too,

    We are using an EMC VNX5200 SAN, FC to Server 2012 R2 Hyper-V cluster

    The actual error is occurring on our primary file server, Server 2012 R2 enterprise. We have fairly low IOPS, so the problem has only occurred twice this year, but the fact it has recurred is worrying.

    Like everyone else, the server responds to RDP and locally the drives are fine; it is an SMB access error.

    I look forward to getting a solid solution in the future.

    Thursday, May 01, 2014 1:07 AM
  • Microsoft are now analysing another memory dump and set of tracing from one of our servers...

    Tuesday, May 06, 2014 12:49 PM
  • We had exactly the same problem on 4 Windows 2012 clustered VMs on ESXi and have opened a case with MS.

    Before doing anything else, support asked us to update some components:

    srv2.sys, srvnet.sys, srvsvc.dll, mrxsmb.sys and mrxsmb20.sys.

    http://support.microsoft.com/kb/2899011

    We updated these components without much hope, because the description of the patch does not match our problem, but since then we have gone more than one month without an issue...

    If that helps...

    Wednesday, May 07, 2014 3:15 PM
  • We are also experiencing the same issue.

    Infrastructure running on Windows Server 2012 on VMware ESX 5.5 Update, IBM FLEX Chassis, Blade X240 and IBM V3700 Storage. Our AD is 2008 R2. FFL 2003 and DFL 2008 R2.

    We are also running DFS Namespace on top of the file server.

    Symptoms:

    Intermittent disconnection of mapped drives

    Cannot access share or slow to open

    Server hangs at restart

    Once hard rebooted, the server is up and running.

    Just logged a call with Microsoft.

    No relevant log on Windows, ESX or Storage.

    Anyone found a permanent fix so far?


    Irfan Goolab SALES ENGINEER (Microsoft UC) MCP, MCSA, MCTS, MCITP, MCT


    Thursday, May 08, 2014 8:00 AM
  • I'm wondering if anyone has found this related to backups of the server?  It seems that about the time the server shares start to have issues is about the time our backups start.  Again it doesn't happen every time our backups take place but when it does happen, it is during that backup time window of the server.  We are using Symantec Backup Exec 2012 using the remote agents.
    Thursday, May 08, 2014 1:55 PM
  • I don't think our case is related to backup. Another server is backed up with Veeam and another with Windows Backup. And sometimes servers are running for a week without issues, and backups are run daily
    Saturday, May 10, 2014 8:48 AM
  • Hi Everyone, please take a look at my post above from Apr 22. 3 weeks on, no forced reboots!
    Sunday, May 11, 2014 3:44 AM
  • Check the Windows Event logs for errors that might help with troubleshooting this.

    Lenora Moss Technical Support Engineer, SMB Partner Support, Symantec Corporation www.symantec.com

    Tuesday, May 13, 2014 6:15 PM
  • So my case has been escalated again "as it's more complex than normal" and I've also had the "we will only spend commercially reasonable efforts on this case going forward" disclaimer.

    I urge anyone experiencing this problem to open a support case with Microsoft (if you don't have a support contract it'll cost you £240...stick it on a credit card).  The more cases they have reported, the more worthwhile it is for them to spend time fixing the problem.


    I managed to make one of our SQL Servers fall over today - I tried to copy a backup of a database off the server using SMB...big mistake!  Yet it's quite happy when SQL hammers the drive with tens of thousands of iops and 300MB/s.  SMB is broken! I could be moving to Linux soon! :s
    • Edited by DJL Tuesday, May 20, 2014 6:18 PM
    Tuesday, May 20, 2014 6:15 PM
  • @DJL

    Moving to Linux is what I chose for the time being. The amount of time (translation: money) our company has wasted on troubleshooting this particular problem was enough to convince me that at least for a simple file share, it made sense to switch over. For you, the time constraints necessary to set up a Linux server are likely more daunting given how many users you have. A simple Samba file share was sufficient for us for the time being. I was hoping "time being" meant 4-6 weeks for a patch, but now I'm thinking I'll be lucky if a patch is issued by 2015. I don't understand how this isn't a bigger issue. If it's a fundamental flaw in how 2012 is handling SMB, which it appears to be, I can only assume it's a much more widespread issue than they're admitting to. If I can convince my boss to open a case with M$, I will. We have certain industry applications that ONLY run on Windows, so at some point we're going to NEED this to be fixed. For now, workarounds are enough.

    @LMosla

    I think we're a little beyond the whole "windows event log errors" stage at this point. 

    Wednesday, May 21, 2014 6:16 PM
  • I have had this issue since Jan 2014; the only solution is to go back to Server 2012 (not R2) or Server 2008 R2. All of the fixes mentioned here reduce the occurrences, but they also appear to affect performance, which degrades over time. After changing back to 2008 R2 everything works fine!
    Tuesday, June 03, 2014 8:57 AM
  • So my case has been escalated again "as it's more complex than normal" and I've also had the "we will only spend commercially reasonable efforts on this case going forward" disclaimer.

    I urge anyone experiencing this problem to open a support case with Microsoft (if you don't have a support contract it'll cost you £240...stick it on a credit card).  The more cases they have reported, the more worthwhile it is for them to spend time fixing the problem.


    I managed to make one of our SQL Servers fall over today - I tried to copy a backup of a database off the server using SMB...big mistake!  Yet it's quite happy when SQL hammers the drive with tens of thousands of iops and 300MB/s.  SMB is broken! I could be moving to Linux soon! :s
    I'll be getting a case open soon.  We had the issue come back after 2 months.  There's clearly something particular that triggers it, but I have not been able to reproduce it manually.
    Wednesday, June 04, 2014 6:13 PM
  • Hi all,

    So my case has just been archived for the time being.  I have been told that Microsoft have no fix for this at the moment. Apparently they are aware of some issue with SMB 3.02, although I have no further information other than that.

    Basically as soon as any updates to the relevant DLLs are produced the engineer will let me know so I can see if they fix the problem.

    So basically... that's it... I now just have to wait, or dump 2012 R2... not great news.

    I still think it's worth opening support cases if you can, as it will at the very least bring more attention to the problem, and they may just discover something that helps isolate the problem.

    Tuesday, June 10, 2014 10:19 AM
  • We have disabled SMB2 today. Don't know what will happen. 
    Thursday, June 19, 2014 7:29 AM
  • We have disabled SMB2 today. Don't know what will happen. 
    No luck. It stopped responding again.
    Friday, June 20, 2014 6:05 AM
  • I opened a case (finally).  114062611570165
    Thursday, June 26, 2014 12:32 PM
  • Just got off the phone with MS, has anyone tried this yet?

    http://support.microsoft.com/kb/2957623/en-nz

    They're pretty confident it fixes the problem.

    Thursday, June 26, 2014 5:14 PM
  • I had a 2012 R2 server experience "the issue" on 6/10.  It had the May rollup (KB2955164) installed and was rebooted on 6/4, so I don't think the rollup helps.  I think I will try the DisableLeasing registry key, which is listed as a workaround.  I have quite a few production 2012 R2 file servers out there, and so far 2 have experienced the SMB lockup.  I am unable to reproduce it, even when performing millions of scripted create/update/delete operations from a dozen clients.

    REG ADD HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters /v DisableLeasing /t REG_DWORD /d 1 /f
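
    For anyone applying the key across several servers, here is a minimal Python sketch for confirming the value is actually set. It only reads the value with reg.exe and parses the result. The helper names (build_query_command, parse_dword) are made up for illustration, and the parser assumes the usual "name  REG_DWORD  0x..." layout of reg query output, so treat it as a starting point rather than a tested tool.

```python
import subprocess
import sys

# Key and value name taken from the REG ADD command above.
KEY = r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\lanmanserver\parameters"
VALUE = "DisableLeasing"


def build_query_command(key=KEY, value=VALUE):
    """Return the reg.exe argument list that reads a single registry value."""
    return ["reg", "query", key, "/v", value]


def parse_dword(reg_output, value=VALUE):
    """Extract a REG_DWORD from `reg query` text; None if the value is absent.

    Assumes lines shaped like: '    DisableLeasing    REG_DWORD    0x1'.
    """
    for line in reg_output.splitlines():
        parts = line.split()
        if (
            len(parts) == 3
            and parts[0].lower() == value.lower()
            and parts[1] == "REG_DWORD"
        ):
            return int(parts[2], 16)
    return None


if __name__ == "__main__" and sys.platform == "win32":
    # Only attempt the query on Windows, where reg.exe exists.
    result = subprocess.run(build_query_command(), capture_output=True, text=True)
    print("DisableLeasing =", parse_dword(result.stdout))
```

    A result of None means the value is not present on that server, which is the state before the REG ADD above has been run.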

    Friday, June 27, 2014 9:54 PM
  • We have had our 2012 R2 Standard file server VMs in production since February, but only started seeing this issue on June 22, after I installed regular Windows updates.  Restarting the Server service hangs and the VM will not reboot, so we have to hard power off (but oddly we don't get startup errors on power-on).  Same basic symptoms (no sharing, can see files on the server via RDP, no pertinent errors in the error log); here's our spec:

    Dell 720 hosts, Compellent 10k fiber-attached storage

    ESXi 5.5, VMXNET3 NICs (since build)

    We are looking at downgrading to 2008 R2, but I'm worried about when that OS will go EOL, forcing us to upgrade to 2012 - maybe they will have a fix by then? ;)

    Thursday, July 24, 2014 12:28 PM
  • Don't forget that if you have an AV product installed, it is a possible suspect as well.  If you are still seeing the symptoms and you've applied all patches (including the one I linked), then you should make sure all third-party software has been ripped out of your server.  MS will make you do this anyway if you open a case with them.  

    That said, open a case with them.  The more exposure, the higher the likelihood of getting a resolution.

    Finally, so far (fingers crossed) our server has been stable since the patch.  Won't know for sure for at least a couple more weeks though.

    22 hours 34 minutes ago