Server 2012 NFS - 10 minutes to resume after failover?

Pergunta Server 2012 NFS - 10 minutes to resume after failover?

  • Thursday, December 06, 2012 11:52 AM
     
     

    I've got a Server 2012 Core cluster running an HA file server role with the new NFS service. The role has two associated clustered disks. When I fail the role between nodes, it takes 10-12 minutes for the NFS service to come back online - it'll sit in 'Online Pending' for several minutes, then transition to 'Failed', then finally come 'Online'. I've looked at the NFS event logs from the time period of failover and they look slightly odd. For example, in a failover at 18:41, I see this in the Admin log:

    Time EventID Description
    18:41:59 1076 Server for NFS successfully started virtual server {fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}
    18:50:47 2000 A new NFS share was created. Path:Y:\msd_build, Alias:msd_build, ShareFlags:0xC0AE00, Encoding:7, SecurityFlavorFlags:0x2, UnmappedUid:4294967294, UnmappedGid:4294967294
    18:50:47 2000 A new NFS share was created. Path:Z:\eas_build, Alias:eas_build, ShareFlags:0xC0AE00, Encoding:7, SecurityFlavorFlags:0x2, UnmappedUid:4294967294, UnmappedGid:4294967294
    18:50:47 2002 A previously shared NFS folder was unshared. Path:Y:\msd_build, Alias:msd_build
    18:50:47 2002 A previously shared NFS folder was unshared. Path:Z:\eas_build, Alias:eas_build
    18:50:47 1078 NFS virtual server {fc4bf5c0-c2c9-430f-8c44-4220ff6655bd} is stopped
    18:51:47 1076 Server for NFS successfully started virtual server {fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}
    18:51:47 2000 A new NFS share was created. Path:Y:\msd_build, Alias:msd_build, ShareFlags:0xC0AE00, Encoding:7, SecurityFlavorFlags:0x2, UnmappedUid:4294967294, UnmappedGid:4294967294
    18:51:47 2000 A new NFS share was created. Path:Z:\eas_build, Alias:eas_build, ShareFlags:0xC0AE00, Encoding:7, SecurityFlavorFlags:0x2, UnmappedUid:4294967294, UnmappedGid:4294967294

    In the Operational log, I see this:

    Time EventID Description
    18:41:51 1108 Server for NFS received an arrival notification for volume \Device\HarddiskVolume11.
    18:41:51 1079 NFS virtual server successfully created volume \Device\HarddiskVolume11 (ResolvedPath \Device\HarddiskVolume11\, VolumeId {69d0efca-c067-11e1-bbc5-005056925169}).
    18:41:58 1108 Server for NFS received an arrival notification for volume \Device\HarddiskVolume9.
    18:41:58 1079 NFS virtual server successfully created volume \Device\HarddiskVolume9 (ResolvedPath \Device\HarddiskVolume9\, VolumeId {c5014a4a-d0b8-11e1-bbcb-005056925167}).
    18:41:59 1079 NFS virtual server successfully created volume \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\ (ResolvedPath \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\, VolumeId {fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}).
    18:41:59 1105 Server for NFS started volume \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\ (ResolvedPath \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\, VolumeId {fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}).
    18:41:59 1079 NFS virtual server successfully created volume \DosDevices\Y:\ (ResolvedPath \Device\HarddiskVolume9\, VolumeId {c5014a4a-d0b8-11e1-bbcb-005056925167}).
    18:44:06 1116 Server for NFS discovered volume Z: (ResolvedPath \Device\HarddiskVolume11\, VolumeId {69d0efca-c067-11e1-bbc5-005056925169}) and added it to the known volume table.
    18:50:47 1116 Server for NFS discovered volume Y: (ResolvedPath \Device\HarddiskVolume9\, VolumeId {c5014a4a-d0b8-11e1-bbcb-005056925167}) and added it to the known volume table.
    18:50:47 1081 NFS virtual server successfully destroyed volume \DosDevices\Y:\.
    18:50:47 1105 Server for NFS started volume Y: (ResolvedPath \Device\HarddiskVolume9\, VolumeId {c5014a4a-d0b8-11e1-bbcb-005056925167}).
    18:50:47 1105 Server for NFS started volume Z: (ResolvedPath \Device\HarddiskVolume11\, VolumeId {69d0efca-c067-11e1-bbc5-005056925169}).
    18:50:48 1106 Server for NFS stopped volume \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\ (ResolvedPath \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\, VolumeId {fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}).
    18:50:48 1081 NFS virtual server successfully destroyed volume \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\.
    18:51:47 1079 NFS virtual server successfully created volume \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\ (ResolvedPath \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\, VolumeId {fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}).
    18:51:47 1105 Server for NFS started volume \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\ (ResolvedPath \Pfs\Volume{fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}\, VolumeId {fc4bf5c0-c2c9-430f-8c44-4220ff6655bd}).

    From this, I'm not sure what's going on between 18:41:59 and 18:44:06, between 18:44:06 and 18:50:47 or between 18:50:48 and 18:51:47. What's the NFS volume discovery doing and why does it take so long?

    Does anyone have any thoughts as to where I could start looking to work out what's happening here? Is there any tracing that can be enabled for the NFS services to indicate what's going on?

    Thanks in advance!

    • Changed Type sbutlerlee Thursday, December 06, 2012 11:52 AM
    •  

All Replies

  • Thursday, December 06, 2012 12:22 PM
     
     
    What type of shared storage do you have? Sounds like a long reservation conflict to me. Do you have any messages in system event log associated with shared storage?

    StarWind iSCSI SAN & NAS

  • Thursday, December 06, 2012 4:34 PM
     
     
    It's Dell/EqualLogic iSCSI. In the system log, there's a message at 18:41:51 from NTFS saying "Volume Z: (\Device\HarddiskVolume11) is healthy.  No action is needed." and one at 18:41:58 saying "Volume Y: (\Device\HarddiskVolume9) is healthy.  No action is needed.", so it sounds like the disks were brought online correctly.
  • Thursday, December 06, 2012 9:24 PM
     
     
    It's Dell/EqualLogic iSCSI. In the system log, there's a message at 18:41:51 from NTFS saying "Volume Z: (\Device\HarddiskVolume11) is healthy.  No action is needed." and one at 18:41:58 saying "Volume Y: (\Device\HarddiskVolume9) is healthy.  No action is needed.", so it sounds like the disks were brought online correctly.
    OK, and do you have the same failover time with or w/o heavy load? Say few clients are doing copy to NFS share and you initiate failover?

    StarWind iSCSI SAN & NAS

  • Friday, December 07, 2012 10:25 AM
     
     
    Yes, it's exactly the same regardless of load at the time of the failover.
  • Tuesday, December 11, 2012 6:17 PM
     
     

    I was able to get some downtime this afternoon so I tried

    • deleting the NFS share in question and recreating it
    • deleting all the NFS shares on the clustered file server (thus removing the NFS Server resource) and recreating them
    • deleting all the NFS shares and the ._nfs folder from all the associated drives and recreating them
    • deleting the clustered file server altogether, shutting down the cluster, starting it back up and recreating the file server.

    None of these made any difference - this particular NFS resource still took about 10 minutes to return to service. I'm therefore supposing it's some aspect of the disk or the data on it that NFS is taking a long time to enumerate, but it's annoying that I don't have any visibility into what's going on. I might try asking this same question in the MS partner forums to see if I get any answers there...

  • Thursday, March 28, 2013 8:02 PM
     
     
    We are also experiencing similar NFS load issues with a Dell/EqualLogic iSCSI (PS4000) SAN. Most recent incident required a rebuild of the NFS Shares in order to get the connection re-established. Any responses from the MS Partner forums on this?