Hyper-V Cluster issues after applying Win2008 R2 SP1 on a 3 node Cluster!

    Question

  • Hello,

    After applying Win2008 R2 SP1 and running "Validate this Cluster" I get these issues in the report.

    "List Potential Cluster Disks"

    Disk with identifier bd5a41af has a Persistent Reservation on it. The disk might be part of some other cluster. Removing the disk from validation set

    Disk with identifier 2eff8c0d has a Persistent Reservation on it. The disk might be part of some other cluster. Removing the disk from validation set

    Disk with identifier bd5a41ad has a Persistent Reservation on it. The disk might be part of some other cluster. Removing the disk from validation set

    Disk with identifier c5643d96 has a Persistent Reservation on it. The disk might be part of some other cluster. Removing the disk from validation set

     

    After checking the disk details, do I dare to run this command to get rid of these reservations without causing issues in the running environment?

    "cluster node node1 /clearpr:5"

    Disks eligible for cluster validation

    Disk (as referred to in subsequent tests)      Nodes where the disk is visible   Number of nodes where the disk is visible
    Cluster disk 0 has disk identifier bd5a41ac    All Nodes                         3

     



     "Validate SCSI device Vital Product Data (VPD)"

    Failed to get SCSI page 83h VPD descriptors for cluster disk 1 from node NOD1.company.local status 2


    How do I get rid of the above warning?
    I know that the SAN storage device (Promise VessRAID 1820i) does support SCSI-3 persistent reservations, so...
    Also, before I upgraded the nodes to Win2008 R2 SP1 this issue did not occur in the validation test.

    Please advise.

    thx /Tony
    Thursday, March 03, 2011 10:59 AM

Answers

All replies

  • How did you upgrade the cluster to SP1? Please check the following guides.

    Upgrading a Hyper-V R2 Cluster to Windows 2008 R2 SP1

    http://workinghardinit.wordpress.com/2011/02/17/upgrading-a-hyper-v-r2-cluster-to-windows-2008-r2-sp1/

    Hyper-V Cluster Is IN!

    http://grinding-it-out.blogspot.com/2011/02/hyper-v-cluster-is-in.html

     

    Friday, March 04, 2011 9:09 AM
  • I pretty much did what the first link you provided explains, but ran into some issues, which I explain below.

    1. I live migrated the virtual servers on nodes 2 & 3 over to node 1.

    2. Upgraded both node 2 & node 3 to SP1, rebooted and applied the remaining KBs from Windows Update.

    Here I ran into some issues: 3 virtual servers were set to "failed", probably because they tried to automatically live migrate over to their respective preferred owners (node 2 / node 3) while I ran Windows Update and rebooted... at least that's what I think was the problem. After this, whatever I tried, I could not start, turn off, move or live migrate the 3 failed virtual servers from node 1 to nodes 2 & 3.

    So I ended up having to recreate them in Failover Cluster Manager with slightly different names and attach the previous VHD files; after that I removed the old failed servers.

    Booted up the newly created ones and set IP address etc.; all up and running.

    3. Upgraded node 1 to SP1, rebooted and applied the remaining KBs from Windows Update.

    4. Upgraded the Integration Components on all virtual servers.

    5. Live migrated the virtual servers over to their preferred owners.

    Ran the Validate Cluster wizard and that's when I got these issues.

    Any ideas ?

    thx /Tony

     

     

    Friday, March 04, 2011 10:30 AM
  • Did you destroy / rebuild a cluster node without evicting it first anywhere? Those LUNs seem to have reservations from another (previous) cluster, or some other persistent reservation left by other software. Normally running "cluster node servername /clearpr:disknumber" can fix this, but make sure the LUN is not in use anywhere else. You can find out which disk it is via Disk Management or diskpart.
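
    For example (a rough sketch only; the disk number 5 and node name node1 are placeholders from the posts above, so adjust them to your environment), you can map the reported identifier to a disk number with diskpart's "uniqueid disk" before clearing anything:

        # Run on one node. In diskpart, "uniqueid disk" shows the signature
        # that Validate reports as the "disk identifier":
        #   DISKPART> list disk
        #   DISKPART> select disk 5
        #   DISKPART> uniqueid disk    (compare against e.g. bd5a41af)
        # Only once the signature matches a LUN that you know is NOT owned
        # by another live cluster, clear the reservation:
        cluster node node1 /clearpr:5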

    Good luck

    Saturday, March 05, 2011 10:11 PM
  • It might well be the case that one cluster node has been rebuilt. I'm not sure what the onsite admin has done, and of course he isn't either :(

    Judging by the error message, I suspect that is indeed the case. I will check with diskpart etc.

    Thx /Tony

     

    Thursday, March 17, 2011 8:00 AM
  • Hi,

    Generally, a "rolling" upgrade to SP1 with your failover cluster is fully supported, which is also documented here with a step-by-step guide -> http://ramazancan.wordpress.com/2011/03/07/hyper-v-failover-clusterstep-by-step-guide-rolling-upgrade-sp1/

    When you run Validate, what is the status of your cluster/shared disks? Are the resources online, or are the disks taken offline? Is there a disk in "Available Storage" which can be used for validation? Otherwise I would assume that these warnings in your validation report are by design, as the disks are in use. "Removing the disk from validation set" only tells you that this disk will not be used for the validation tests.
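
    As a quick check (a sketch using the FailoverClusters PowerShell module that ships with the failover clustering feature; names and output will differ per cluster), you can list the disk resources and their state before running Validate:

        Import-Module FailoverClusters
        # Physical Disk resources that are Online are in use and will be
        # skipped by the storage validation tests
        Get-ClusterResource | Where-Object { $_.ResourceType -like "Physical Disk" }
        # Cluster Shared Volumes are listed separately
        Get-ClusterSharedVolume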

    Check out the following TechNet articles, which give a good understanding of the validation tests in failover clusters:

    Understanding Cluster Validation Tests
    http://technet.microsoft.com/en-us/library/cc726064.aspx

    Use Validation Tests for Troubleshooting a Failover Cluster
    http://technet.microsoft.com/en-us/library/cc770807.aspx

    Regards

    Ramazan


    Ramazan Can [MVP Cluster] http://ramazancan.wordpress.com/
    Thursday, March 17, 2011 10:09 AM
  • Hi Tony,

    Very interesting. This is exactly what happened in my environment. I had a 5-node Windows 2008 R2 SP1 cluster running on Dell hardware in combination with a Dell MD3000i iSCSI SAN. To make it worse, in my case running a validation report even damaged the MBR on the LUNs, making them inaccessible to Cluster Shared Volumes. So I lost a good part of my production Hyper-V servers (thank you, backup).

    When calling Microsoft for support, they directed me to the hardware provider, telling me that they would not assist me without a good, error-free validation report.

    So I turned to Dell, hired several servers to migrate my Hyper-V production machines, rebuilt my cluster nodes completely with Windows Server 2008 R2 SP1, re-initialized my Dell MD3000i, recreated all LUNs and created a new cluster with Cluster Shared Volumes. Running the validation report for storage instantly generates Persistent Reservation warnings, leaving all other storage tests without results. Only if I take all CSVs offline does the storage test run correctly (except for the warning that all disks are offline).

    As my road to a solution currently runs via Dell support, I would like to see if your problems can be resolved by this Microsoft forum.

    Kind regards,

    Bas 

    Friday, March 18, 2011 7:30 AM
  • We hit the same issue when upgrading to SP1 on 3 separate clusters, and have talked to another customer that did as well.  At this point, I can only say that I don't think it's a storage problem, and that we've made Microsoft aware of the issue.  If I have any updates on our front, I'll update this post.  Here's hoping we can get to the bottom of this soon.

     

     


    Janssen Jones - Virtual Machine MVP - http://www.janssenjones.com - Please remember to mark answers as answers. :)
    Friday, March 18, 2011 1:43 PM
  • Hi Fahlis,

    Can you share more info on your SAN, and is this multisite clustering?

    Is this a production environment or a test environment?

    Though I understand the issue was not occurring before SP1, it may be good to engage the storage vendor too and recreate the issue while they capture logs in the background to see what is happening.


    Gaurav Anand Visit my Blog @ http://itinfras.blogspot.com/
    Saturday, March 19, 2011 11:47 AM
  • Hi, everyone.

     

    I have the same situation: a 4-node CSV cluster. Before installing SP1, all tests passed. After SP1, some tests fail: "Validate Disk Access Latency" and "Validate SCSI Device Vital Product Data (VPD)".

    And during validation one CSV LUN lost its MBR. A Microsoft support specialist tried to analyze it, but could not find the reason.

    Now we have an open case and I'll do my best to resolve it via Microsoft.

     


    • Edited by Freafor Wednesday, March 23, 2011 12:27 PM Grammar
    Wednesday, March 23, 2011 12:23 PM
  • I'm experiencing some of the same problems. My 3-node Hyper-V cluster was functioning fine before SP1. I am running Hyper-V on a Cisco UCS hardware platform with B250 servers and the Cisco Palo card. I have been working with Microsoft tech support for the last two weeks, and the engineer indicated it was a problem with my storage drivers for the Cisco virtual fiber adapter.

    Here is the list of problems I'm seeing:

    1. Cluster fails at 1 AM each day.
    2. The Reserved system partitions are going offline at random intervals.
    3. When running a validation report, I get errors indicating that I have disks with persistent reservations. Attempted to clear the reservations but that did not work.

    Does anyone have any additional information on this issue?

     

     

    Wednesday, March 23, 2011 2:55 PM
  • Hi,

    just some comments from my side.

    "...Cluster fails at 1 AM each day...."

    Ramazan: interesting. Any backups or other custom tasks running here? If you can share your logs, then depending on the current state of your call with CSS, I would be interested in following this up offline... :-)

    "....When running a validation report, I get errors indicating that I have disks with persistent reservations. Attempted to clear the reservations but that did not work.."

    Ramazan: are the "persistent reservation" messages warnings or errors? Did you have a LUN in "Available Storage" for the validation tests?

    PS: what is the current status of your call with Microsoft Support? I'm really sure they are interested in getting more insight into this case.

    Regards

    Ramazan

     

     


    Ramazan Can [MVP Cluster] http://ramazancan.wordpress.com/
    • Proposed as answer by 玉城広樹 Friday, May 20, 2011 12:27 AM
    Wednesday, March 23, 2011 9:40 PM
  • I made some progress on this issue. I disabled Symantec Endpoint Protection on all of the Hyper-V servers and the cluster did not fail last night.

    I am still seeing errors when validating the cluster. The validation screen stops at the section "List Potential Cluster Disks", and under the result column I have the following error message: "Disk with identifier b069c539 has a persistent reservation on it. The disk might be part of some other cluster. Removing the disk from validation set." Based on the disk signature provided, this is one of my Cluster Shared Volumes. The validation test appears to stall and does not finish. The only disk listed in Available Storage is my witness disk, which currently shows Online Pending and then after a minute or so goes to Online.

    As for the status of the case, I am trying to get someone to call me back. They engage for a few hours and then I do not hear from them for a day or two. Currently they are saying that I need to update my storage drivers, but I have the most recent version installed. I am in the process of contacting my Microsoft rep to see if I can get this case escalated.

    Thursday, March 24, 2011 4:50 PM
  • Here is my progress.

    I have reinstalled all four servers with Windows Server 2008 R2 SP1 and reconfigured all iSCSI connections to my SAN, created new LUNs and a new cluster. When running the validation report, it still shows me persistent reservation warnings or inaccessible disk warnings.

    The default installation was with iSCSI set to Least Queue Depth for the attached LUNs. After some burn-in tests with virtual machines and advice from Dell, we changed the iSCSI settings to Round Robin with Subset to achieve better performance. At first these changes seemed to work fine, but when I rebooted the server that had these changes in its iSCSI configuration, a lot of iSCSI connectivity failures appeared: event ID errors 20 and 7 for device iScsiPrt started to appear in the system log, followed by device driver problems for the manufacturer's drivers, for the MD3000i in my case. Whatever I changed to restore normal iSCSI connectivity for that server did not work. It even disrupted the Preferred Path for several LUNs on the SAN controllers. The only way to return to normal iSCSI connectivity for that server was to reinstall the complete OS and reconfigure iSCSI and the cluster. (Removing device drivers, removing the server from the SAN configuration and re-adding everything did not work.) I repeated all steps on an identical server in the cluster and it showed the same behavior. Again this server needed to be reinstalled to return to normal performance.

    So my advice is to not change any iSCSI or network controller card settings once SP1 is installed, because it seems that somewhere along the line old configurations are not completely wiped when doing this.

    On the other hand, when I stress test the whole cluster with a lot of Hyper-V machines, SP1 seems to run more stably than pre-SP1. Before SP1, under heavy load I occasionally had network cards that lost connection on the cluster servers, or cluster servers that slowed down to the point where they became inaccessible. Recent tests do not show that behaviour anymore.

    Regards,

    Bas

    Thursday, March 24, 2011 9:13 PM
  • Having the same issue here... this is starting to be a widespread issue.  Let's see what support has to say...
    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
    Friday, March 25, 2011 5:34 AM
  • A customer experiences the same problem after installing SP1 on his MS Hyper-V R2 cluster.

     


    blog: http://www.ivobeerens.nl
    Friday, March 25, 2011 8:20 PM
  • Hello Tony,

    I recently (3/25/2011) installed Windows 2008 R2 Service Pack 1 onto my 4-node cluster, ran the Validate Cluster task, and have the exact same warnings for List Potential Cluster Disks.  I have 70 LUNs defined, and each one reported this warning: "Disk with identifier ccc85569 has a Persistent Reservation on it. The disk might be part of some other cluster. Removing the disk from validation set".  The identifier was unique for each disk.

    I don't know if this is a real issue or not because all VMs are running and can be live migrated between nodes.  I'm attempting to add two additional nodes to the cluster and thought there was a problem because it is taking so long for the validation tests to complete.  But now I believe I need to just be patient and let the validation test run.

    I realize this isn't much help, but wanted to share my experience thus far too.  I hope Microsoft provides assistance if any is needed.

    Regards,

    Robert

    Saturday, March 26, 2011 4:29 PM
  • To gather some data I tried to reproduce it on various setups, 4 of them upgrades from Windows 2008 R2 RTM to Windows 2008 R2 SP1. My current path of investigation, based on the results below, would be a DSM or other software putting its own reservation on the LUN, perhaps remnants from an older version, as the PE2950s are the oldest hosts. But this is purely speculative.

    1) iSCSI SAN (W2K8R2 Storage Server), "ordinary" 1 Gbps Intel NICs used for I/O. This is a cluster upgraded to SP1 from W2K8R2 RTM. Test passes clean; could not reproduce the warning.

    2) FC SAN (EVA8000, HP Emulex A8002A on latest firmware/drivers, same as the two other FC SAN clusters), 3 nodes, DELL PE2950.

    Yes, here I saw this issue.

    3) FC SAN (EVA8000, HP Emulex A8002A on latest firmware/drivers, same as the two other FC SAN clusters), 2 nodes, DELL R710. This is a cluster upgraded to SP1 from W2K8R2 RTM. Test passes clean; could not reproduce the warning.

    4) FC SAN (EVA8000, HP Emulex A8002A on latest firmware/drivers, same as the two other FC SAN clusters), 2 nodes, DELL R710. This is a cluster upgraded to SP1 from W2K8R2 RTM. Test passes clean; could not reproduce the warning.

    5) DELL MD32X0i iSCSI SAN, "ordinary" 1 Gbps Intel NICs used for I/O, W2K8R2 SP1 clean install, 2 nodes. Test passes clean; could not reproduce the warning.

    Saturday, March 26, 2011 6:51 PM
  • Thanks for your detailed posting of your personal observations; this definitely helps a lot to get a better overall picture here!

    Stay tuned for further updates; there are some "investigations" running...

    Regards

    Ramazan


    Ramazan Can [MVP Cluster] http://ramazancan.wordpress.com/
    Saturday, March 26, 2011 8:36 PM
  • To add to what others have experienced - I have found the following:

    SAN - EqualLogic PS4000XV, v5.0.2 firmware and latest Host Integration Kit (using Dell MPIO)

    Nodes - 2 x Dell 2950III Servers

    One of the nodes is an SP1 upgrade, one is an SP1 install from scratch (both Datacenter edition). Cluster validation passes all tests, with a minor warning about the two iSCSI NICs being on the same subnet and a missing Windows update warning.

    Then I tried to add a Dell R610 into the cluster (SP1 upgrade), and the validation failed with "Disk with identifier x has a Persistent Reservation on it. The disk might be part of some other cluster. removing disk from validation set" and "SCSI page 83h VPD descriptors for cluster disk x and x match".

    So, I rebuilt the R610 server from scratch with R2 SP1 and I get the same issues. All NIC driver versions are identical (Broadcom and Intel NICs) across all servers.

    Pre-SP1 I had 2 x Dell 2950III and a Dell R610 all playing happily together in the cluster. I'm happy to provide the relevant cluster validation docs and logs to Microsoft if required.

    I did call EqualLogic support, who were unaware of any issues and couldn't help.

     

    Monday, March 28, 2011 1:55 PM
  • Hi Tony,

    Oddly enough, we don't see this with 2-node clusters.

    Very recently one of our customers hit exactly the same problem with an HP EVA and a 5-node cluster. Only one of the nodes was upgraded to SP1, and validation hit the same SCSI-3 PR problems as yours.

    The problem has been reported on a wide variety of storage arrays and vendors.

    I wouldn't be surprised to see a Post-SP1 hotfix addressing this issue in the near future.

    Regards,

    Hans Vredevoort
    Cluster MVP
    http://hyper-v.nu/blogs/hans
    http://twitter.com/hvredevoort

    Monday, March 28, 2011 4:36 PM
  • Does anyone have a 4-node and a 6-node cluster to take a look at? Just wondering :-) But Hans is onto something here... I'll edit the post above, as during copy/paste and passing the information along via mail from the data center to me, a mistake got in there. The two R710 Hyper-V clusters upgraded to W2K8R2 SP1 have only two nodes and don't see the issue.
    Monday, March 28, 2011 4:49 PM
  • As a test, I provisioned a blank disk on the SAN and presented it to all the nodes (to be used for validation testing). I only connected to this disk on the 3rd node I was trying to introduce.

    Obviously the test has warnings in "List Potential Cluster Disks"; however, there are no other problems.

    I'm heading towards a 4 node cluster - I have the 4th box waiting to be built and to join the cluster. It would indeed be very interesting to see if the problems disappear if I ignore the initial validation and add nodes 3 & 4 to the cluster.

     

    Tuesday, March 29, 2011 8:13 AM
  • 4-node cluster (Dell PowerEdge 2900/2950 nodes, Broadcom / Intel NICs, EMC AX4/5i iSCSI SAN); upgraded inactive nodes to SP1 one at a time and just tried adding a new disk to the cluster.  Ran the storage validation tests and got all the validation warnings / errors mentioned, plus a disk write access latency error for the new disk:

    Failed to access cluster disk 2, or disk access latency of 0 ms from node ...

    So far no actual problems with my VMs / cluster other than the above.

    Tuesday, March 29, 2011 12:10 PM
  • Wow, seeing so many people having problems after upgrading to SP1, I'll put my upgrade on hold for now until I see all the problems fixed.
    Tuesday, March 29, 2011 4:32 PM
  • We are aware of this issue and are actively investigating it.

    Thanks for your patience and understanding.


    Chuck Timon, Senior Support Escalation Engineer (SEE), Microsoft Corporation
    Tuesday, March 29, 2011 8:01 PM
    Moderator
  • I've been troubleshooting storage issues after installing SP1 for the past two weeks, and seeing a lot of the same problems as everyone else.  Mine is a SQL cluster, so no CSVs.  With two nodes in the cluster, validation runs with no problems, but after adding a third node to the cluster, validation fails with the following two errors:

    “Failed to access cluster disk 1, or disk access latency of 0ms…”

    “Failed to get the SCSI page 83h VPD descriptors for cluster disk 1 from…”

    The common factors seem to be more than two nodes and iSCSI storage.  I've been through support with Dell and am currently running the latest version of firmware on our EqualLogic PS-series SAN, the latest version of the EqualLogic HIT kit, and have configured the settings on our BCM57710 NICs to Dell's specifications, but the problem persists.

    EDIT:

    Uninstalled SP1 this morning, and I have a healthy 3-node cluster that passes validation.

    • Edited by Carlo Baldini Wednesday, March 30, 2011 2:37 PM Follow-up
    Tuesday, March 29, 2011 10:36 PM
  • Right.  It is the uninstall of SP1 that made you healthy.
    Chuck Timon, Senior Support Escalation Engineer (SEE), Microsoft Corporation
    Wednesday, March 30, 2011 4:38 PM
    Moderator
  • Another one to add to the mix. 5-node Hyper-V failover cluster with CSV. Dell M600 blades, EMC CX3-10 FC array.

    Persistent Reservation warning and 83h VPD error. No obvious issue with existing VMs and thankfully I don't need to add another node right now.....

    Thursday, March 31, 2011 6:55 AM
  • Does anyone have a 4-node and a 6-node cluster to take a look at? Just wondering :-) But Hans is onto something here... I'll edit the post above, as during copy/paste and passing the information along via mail from the data center to me, a mistake got in there. The two R710 Hyper-V clusters upgraded to W2K8R2 SP1 have only two nodes and don't see the issue.

    We experienced the same problem on both a three-node Dell M600 to Dell EMC AX4 SAN cluster as well as a four-node configuration. After running the validation tests we lost the partitions on the LUNs completely. All the data was gone. We had to recreate all our VMs from backups (even then we lost one day's worth of work). But the saving grace was this happened in our dev environment and not production. Then we tried to recreate it by installing a brand new 3-node cluster, but however we tried, we could not duplicate the problem. The only difference we can think of is that in the cluster that had issues we had made network changes after installing SP1, and on the crash test dummy we did not make any changes. We plan to do that and will report back with the results.

    Prahalad

    Friday, April 01, 2011 6:26 PM
  • Do you mean changes to the iSCSI network settings, or even to the cluster or Hyper-V guest network settings? Did you end up with "raw" LUNs again in Windows, or corrupt ones that couldn't mount or be read from / written to?
    Friday, April 01, 2011 6:31 PM
    Do you mean changes to the iSCSI network settings, or even to the cluster or Hyper-V guest network settings? Did you end up with "raw" LUNs again in Windows, or corrupt ones that couldn't mount or be read from / written to?

       If that was addressed to me, I meant network settings on the core. I was able to duplicate the issue on a dummy cluster after I changed the network settings on the core. We are not using iSCSI, we are using FC.

       When I had the problems, I ended up with "raw" LUNs. They did not even have partitions on them. The disk ID was 000000 and the disk had become read-only with no partitions.

    Prahalad

    • Proposed as answer by Jérémims Tuesday, April 05, 2011 9:39 AM
    • Unproposed as answer by Jérémims Tuesday, April 05, 2011 9:39 AM
    Friday, April 01, 2011 8:28 PM
  • Thx for the feedback Prahalad.

    Friday, April 01, 2011 8:53 PM
  • I have the same issue, and I have a new installation of the Hyper-V cluster (3-node Windows Server 2008 R2 with SP1).

    If I take the disks offline, the validation wizard is OK. Any news about this problem? Is this a real problem?
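
    For reference, a disk can be taken offline for validation from PowerShell as well (a sketch; "Cluster Disk 1" is a placeholder resource name, and a CSV must first be removed from Cluster Shared Volumes before it can be stopped):

        Import-Module FailoverClusters
        # Take a non-CSV clustered disk offline so the storage tests can use it
        Stop-ClusterResource "Cluster Disk 1"
        # ...run validation, then bring the disk back online
        Start-ClusterResource "Cluster Disk 1"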

    Thanks

    Thursday, April 07, 2011 8:41 PM
  • Hello All,

    I just got confirmation from Microsoft Support that this is in fact a bug in Windows Server 2008 R2 SP1.  A fix will be available around April 18, 2011.  If you are setting up a cluster or adding nodes to the cluster, the Microsoft tech recommended skipping storage validation in the meantime.  Of course it is at your own risk, but if you are confident that disks can fail over to all nodes, then you're most likely good to bypass storage validation.

    -Robert

    Thursday, April 07, 2011 8:54 PM
  • I was following this thread because I just built a 2-node cluster with R2 SP1, and had the same disk issues trying to add a third node.  I'll probably try to add the third node over a weekend and skip the disk tests, since I'm using an HP SAN and I had no issues getting the first two together. Plus, if I wreck it I'll have time to rebuild it.

     

    Is this the place to look for the solution or somewhere else?

    Friday, April 08, 2011 1:28 PM
  • Hello All,

    I just got confirmation from Microsoft Support that this is in fact a bug in Windows Server 2008 R2 SP1.  A fix will be available around April 18, 2011.  If you are setting up a cluster or adding nodes to the cluster, the Microsoft tech recommended skipping storage validation in the meantime.  Of course it is at your own risk, but if you are confident that disks can fail over to all nodes, then you're most likely good to bypass storage validation.

    -Robert

    I'm not sure this is the best advice.  I, like a few other people here, saw the disks I was using for validation become corrupted and show up as "raw" disks.
    • Proposed as answer by NeilRawlinson Tuesday, April 12, 2011 9:12 AM
    • Unproposed as answer by NeilRawlinson Tuesday, April 12, 2011 9:13 AM
    Friday, April 08, 2011 1:37 PM
  • Hello,

    As Robert posted above, a fix is coming around April 18th. MS has identified the bug. If you know your cluster is OK, you can go ahead. As far as I know, the bug is a false positive. I have seen 3+ clusters with this issue, but none of them have any other issues or problems and all are fully operational. I'm not worried. I'm just holding off new upgrades to SP1 until we have the fix. That fix will be tested on a test cluster that shows the issue before being rolled out into production, and I will post my findings. I'm sure Robert will also post more info when he gets it.
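
    If you do validate in the meantime, skipping the storage tests from PowerShell looks something like this (a sketch with the FailoverClusters module; the node names are placeholders):

        Import-Module FailoverClusters
        # Run all cluster validation tests except the storage category
        Test-Cluster -Node node1, node2, node3 -Ignore "Storage"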

    Best regards,

    Didier Van Hoye

    http://workinghardinit.wordpress.com

     

    Friday, April 08, 2011 1:37 PM
  • Spot on here... we hit the problem when validating the introduction of our first SP1 node. Simply bypassed the storage validation because I was confident everything would work fine. The benefits of Dynamic Memory are too good to turn down for the sake of waiting for this fix... even for a couple of weeks.
    Tuesday, April 12, 2011 9:15 AM
  • Just to be clear...and to set the proper expectations....there are no guarantees around the April 18 date for a public fix.

     

     


    Chuck Timon, Senior Support Escalation Engineer (SEE), Microsoft Corporation
    Tuesday, April 12, 2011 12:55 PM
    Moderator
  • I'm experiencing the same issue.  We have a 5-node cluster running against Dell EqualLogic storage (PS4000 and PS6000).  For our upgrade to SP1, I basically evicted each node, did a clean install of the OS, installed the latest drivers for the NICs and installed the EqualLogic HIT, multipath only.  Besides the cluster validation errors stated above, I also experience the following errors when live migrating:

    Event ID 4096

    The Virtual Machines configuration <GUID> at 'C:\ClusterStorage\Volume4\<VMname>' is no longer accessible:  The segment is already unlocked.  (0x8007009E)

    When this occurs, the VM will go offline for about 5 secs.  In Failover Cluster Manager, it will show the configuration as "failed"; then it will start on the destination node and resume, similar to a quick migration.  This is random, and doesn't happen all the time.  Besides this error, I've also had virtual machines fail when migrating.

    I'll receive an Event ID 4096, but it will state the following:

    The Virtual Machines configuration <GUID> at 'C:\ClusterStorage\Volume4\<VMname>' is no longer accessible:  The process cannot access the file because another process has locked a portion of the file. (0x80070021). 

    With our backup software (CA ARCserve r15), I have also experienced Windows 2000 or XP VMs failing to resume from their "saved state" with the previous Event ID 4096 ("The segment is already unlocked").

    This prompted a run of Cluster Validation, which shows the same errors as stated in this post.  Hopefully a fix can be found soon.

    Chris

    Friday, April 15, 2011 1:38 PM
  • Chris, have you opened up a case with Product Support Services (PSS)?  If not, I suggest that you do.

     

    Thanks

    William

     

     

     


    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
    Friday, April 15, 2011 1:41 PM
  • Hello All,

    I just got confirmation from Microsoft Support that this is in fact a bug in Windows Server 2008 R2 SP1.  A fix will be available around April 18, 2011.  If you are setting up a cluster or adding nodes to the cluster, the Microsoft tech recommended skipping storage validation in the meantime.  Of course it is at your own risk, but if you are confident that disks can fail over to all nodes, then you're most likely good to bypass storage validation.

    -Robert

    I'm not sure this is the best advice.  I, like a few other people here, saw the disks I was using for validation become corrupted and show up as "raw" disks.

    I agree. We lost entire partitions after the validation. We basically had to delete the cluster storage and redo the entire cluster storage, Cluster Shared Volumes, etc. all over again. Luckily this happened in a test environment which had backups. If I had to do it again, I would wait for the fix before touching SP1 with a 100 ft pole.

    Prahalad

    Friday, April 15, 2011 2:44 PM
  • I have 2 different clusters in 2 sites, on custom-built servers with Promise VessRAID 1840i iSCSI units.  Both are 4-node, and both were built using Server 2008 Datacenter with SP1 pre-installed (slipstreamed ISO).  They both experience exactly this same issue.  Both have all patches downloaded and installed.  Any word on the hotfix?
    Sunday, April 17, 2011 10:21 AM
  • Hi Kelly,

    The hotfix is currently being tested and will be available in the near future (depending on the results of the tests). Just to be sure, please don't use the Cluster Validation test if you have SP1 installed without this hotfix.

    Regards,

    Hans Vredevoort
    http://hyper-v.nu
    @hypervnu | @hvredevoort

     

    Sunday, April 17, 2011 11:34 AM
  • Chris, have you opened up a case with Product Support Services (PSS)?  If not, I suggest that you do.

     

    Thanks

    William

     

     

     


    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog


    I would, but this is basically the same issue as stated here.  I don't want to waste a support call, mainly the $$$, to fix an issue that will be resolved shortly. 

    Chris

    Tuesday, April 19, 2011 12:58 PM
  • Any word on when the hotfix will be posted?

     

    thank you

     

    ted

    Tuesday, April 19, 2011 7:47 PM
  • As always, it will be ready when it has been tested successfully. We're getting close though. I'll let you guys know as soon as I can.

    Regards, Hans

    Tuesday, April 19, 2011 8:06 PM
  • Thanks for the follow-up Hans, I'll just hang tight for the update.
    Wednesday, April 20, 2011 9:22 AM
  • I have the same problem... mine also hung.


    Wednesday, April 20, 2011 1:44 PM
  • Same issue adding a 3rd cluster node to my existing W2K8 R2 SP1 cluster with PowerPath for MPIO.

    Validate does not attempt to use any disk in a CSV, but any disk I take out of the CSV and take offline gets the errors described above when running cluster validation.  A 512 GB LUN survived the failed validation test, but a 1 TB LUN got wiped during the test (MBR gone, no signature, no data).  I am in a holding pattern on adding additional nodes to the cluster until the hotfix is ready.

     

    Wednesday, April 20, 2011 7:48 PM
  • Same issue here adding a 3rd node to a cluster with a NetApp SAN; removing a disk from the cluster and putting it in Available Storage leaves the disk errored after running the validation test. Deleting it and adding it back to Available Storage seems to be OK.

    Will try validation without the disk tests.

    Wednesday, April 20, 2011 10:42 PM
  • If I'm reading this thread correctly, this deserves to be a much higher priority issue on Microsoft's radar. 

    I've been planning our 4-node, 10-CSV Hyper-V SP1 upgrade all week, and this is the only potential roadblock I've yet to overcome. It threatens to delay my upgrade because I don't have good backups currently (due to an issue between SnapManager for Hyper-V and the VSS writer that we haven't been able to resolve). I don't have a test environment either; money is tight.

    Telling the cluster validation tool to omit storage sounds like my only surefire option if I don't want to postpone the SP1 upgrade. But then Microsoft won't support my cluster config. 

    We have 4 Dell R810s, use iSCSI pretty much exclusively, and 10 CSVs, which are NetApp LUNs. In addition, we use NetApp's DSM 4.0 instead of Microsoft's MPIO DSM. Any thoughts on whether this SP1 upgrade problem might affect our cluster?


    Wednesday, April 20, 2011 11:35 PM
  • Sonoma76,

     

    On your current backup issue, ping me.

     

    J

     

    Thursday, April 21, 2011 12:30 AM
  • I assure you, it has the highest possible priority with MS. It just takes time to test and verify that no other problems are caused by the QFE under development. Just wait for the white smoke and I'll let you know as soon as the QFE goes public.

    Regards, Hans

    Thursday, April 21, 2011 7:03 AM
  • Same issue for us. We have dozens of 16-node clusters using Dell blades. All are updated to SP1. We were adding hosts the other day to one of the clusters. The first one went fine; we saw the original errors from the first post in this thread and continued. On the second host it completely crashed one of the CSVs shared within that cluster (we usually have about 10-15 CSVs per cluster). 4 hours on the phone with Microsoft and we got it back online and running. This is a serious issue. Don't try to add hosts to your existing clusters if you're running SP1!

    Thursday, April 21, 2011 7:04 AM
  • Same issue for us. We have dozens of 16-node clusters using Dell blades. All are updated to SP1. We were adding hosts the other day to one of the clusters. The first one went fine; we saw the original errors from the first post in this thread and continued. On the second host it completely crashed one of the CSVs shared within that cluster (we usually have about 10-15 CSVs per cluster). 4 hours on the phone with Microsoft and we got it back online and running. This is a serious issue. Don't try to add hosts to your existing clusters if you're running SP1!

    Perhaps I can get our used support incident back since this is a confirmed bug.
    Thursday, April 21, 2011 7:05 AM
  • You don't have to pay for bugs, so just mention this to your PSS contact.

    Regards, Hans

    Thursday, April 21, 2011 7:07 AM
  • You don't have to pay for bugs, so just mention this to your PSS contact.

    Regards, Hans

    Figured. I just didn't mention it at the time. Thanks for the post.
    Thursday, April 21, 2011 7:11 AM
  • Thanks Hans. 

    It sounds like this problem manifests itself in two ways: either CSVs are affected immediately after the SP1 install during cluster validation, or they may be affected sometime in the future if we add more nodes to the cluster. Is that fair? 

    In other words, I can't decide whether to go through with the SP1 install this weekend. 

    Thursday, April 21, 2011 12:36 PM
  • No confirmation of the release of the fix yet. But it shouldn't be too long now unless they hit problems. So perhaps hold off another week. MS is on the job, and testing is critical; nobody wants a QFE to make matters worse or introduce other issues.
    Thursday, April 21, 2011 2:04 PM
  • Thanks Hans. 

    It sounds like this problem manifests itself in two ways: either CSVs are affected immediately after the SP1 install during cluster validation, or they may be affected sometime in the future if we add more nodes to the cluster. Is that fair? 

    In other words, I can't decide whether to go through with the SP1 install this weekend. 


    Sonoma76,

    My suggestion for you is the same plan I have: stay in a holding pattern.  If your cluster is 3 nodes or larger, postpone your upgrade.  If your cluster is 2 nodes, continue.  This is what we have done internally with great success.  We are waiting for updated information to be released from the MS teams.

     

    As Chuck said on March 29th, they are aware of the issue and actively working on it.

     

    If you do proceed with your upgrade and you run into the error, open an MS support case ASAP.


    William Bressette, Network Architect, Horn IT Solutions, http://wpjsplace.spaces.live.com/blog
    Thursday, April 21, 2011 2:11 PM
  • Hi all,

    Thanks for the checks regarding this issue. I'm very sad I didn't check back into this more often; I'll explain below.

    As I thought I might fix this issue by evicting the node that had the issues according to the validate report, I went on a journey to do just that, and that led to much more serious problems :(

    In this cluster the customer used the clustered file service to hold all their file shares, and after evicting this node that service went into a failed state. I really did check that no services or disks were currently owned by or running on the node I was to evict.

    Anyway, it was impossible to get the file service up and running again, and after checking the cluster events and the disks in Disk Management I found out why... that darn thing had somehow been formatted, or at least it said unallocated. I was somewhat shocked, as I'm sure you understand. Anyway, I knew now that my only choice was to restore from backup... next shock: no good backups onsite for the last 10 days :(

    That left the customer with the choice to use Ibas restore services, which they opted for. Now I'm sitting here after working like 20 hours during the Easter holidays to get it all up and running: restored all file shares from the last good DPM backup (11/4), and I am restoring all the files Ibas has recovered with their service.

    Which, btw, costs the customer approx. $6500 :(

    I wonder if they can get a refund from Microsoft... big chance... not :(

    I have 2 upcoming 2-node clusters to set up for new customers; judging by this thread, should I put SP1 on hold for these?

    I was otherwise planning to install with SP1 media. Any thoughts?

    /Tony

     

     

    Monday, April 25, 2011 4:39 PM
  • Dell MD3000i

    5 Node Cluster

    1 Library Node

    I ran into this issue the day that my entire cluster was updated to SP1. The major issue was that it changed the MBR of one of our Cluster Shared Volumes, which promptly vanished. I was able to attach the missing share to the library node and run a disk utility on it. This allowed me to change the partition type back to primary; I was then able to re-attach the storage to my cluster and regain access to my VMs. I'm still getting the errors reported by the OP, though.

    Tuesday, April 26, 2011 2:01 PM
  • A hotfix is now available that addresses the Win2008 R2 service pack 1 issue with Validate on a 3+ node cluster.  This is KB 2531907.  You can find more information and the hotfix to download at the following link:
    http://support.microsoft.com/kb/2531907
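
    A quick way to check afterwards whether a node already has the fix installed (a sketch; the KB number is taken from the link above):

        # List the installed hotfix entry for KB2531907, if present
        Get-WmiObject Win32_QuickFixEngineering |
            Where-Object { $_.HotFixID -eq "KB2531907" }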

    Thanks!
    Elden


    Tuesday, April 26, 2011 4:29 PM
    Owner
  • A hotfix is now available that addresses the Win2008 R2 service pack 1 issue with Validate on a 3+ node cluster.  This is KB 2531907.  The KB article and download link will be published shortly, in the mean time you can obtain this hotfix immediately free of charge by calling Microsoft support and referencing KB 2531907.

    Thanks!
    Elden


    And here it finally is :)

    http://support.microsoft.com/kb/2531907/en-us?sd=rss&spid=14134

    Thanks for that; has anyone tried it yet?


    Thx /Tony
    • Edited by fahlis Wednesday, April 27, 2011 11:56 AM
    Wednesday, April 27, 2011 9:28 AM
  • Hi

    I've tried to install the patch and get the message "The update is not applicable to your computer". I am running cluster services on Windows 2008 R2 SP1.

     

    Bob

    Wednesday, April 27, 2011 11:55 AM
  • I applied it on a prebuilt, not cluster-joined Win2008 R2 SP1 host; it worked without problems.

    Just hope the validation works later as well :)


    Thx /Tony
    Wednesday, April 27, 2011 11:57 AM
  • Can you let us know what SKU you are running? Datacenter, Enterprise, Hyper-V Server? Thx

     

    Wednesday, April 27, 2011 12:02 PM
  • Windows Server 2008 R2 SP1 Datacenter

     

    Wednesday, April 27, 2011 12:07 PM
  • Wouldn't install for me. Windows Server 2008 R2 Enterprise with SP1
    "The update is not applicable to your computer"

    EDIT:

    McFlyKDR is correct... I downloaded it on my PC instead of the server... got the 32-bit version.

    • Edited by jeffyb Wednesday, April 27, 2011 2:25 PM
    Wednesday, April 27, 2011 1:32 PM
  • Windows Server 2008 R2 Enterprise with SP1

    I have not tried it on a cluster-joined host yet...


    Thx /Tony
    Wednesday, April 27, 2011 1:33 PM
  • Make sure that you're downloading it from a computer that's x64 if your cluster is running on a 64-bit OS. Otherwise, if you're downloading from your desktop, you'll get the 32-bit version, and it won't run on your cluster hosts.
    Wednesday, April 27, 2011 1:34 PM
  • Should be OK. Is the cluster (service) up and running on the node where you try to install?
    Wednesday, April 27, 2011 1:37 PM
  • Please do; I think it needs the cluster role installed / the cluster service running.
    Wednesday, April 27, 2011 1:38 PM
  • I installed the hotfix on my management station, which is also our library node, and was able to validate the cluster. My question now is: should I continue and install the hotfix on the rest of my cluster hosts?
    Wednesday, April 27, 2011 2:16 PM
    Make sure that you're downloading it from a computer that's x64 if your cluster is running on a 64-bit OS. Otherwise, if you're downloading from your desktop, you'll get the 32-bit version, and it won't run on your cluster hosts.


    Couldn't download from an x64 computer, as none had access to the internet. Then I noticed "Show hotfixes for all platforms and languages (3)", expanded it, and there was the x64 download. Many thanks for leading me through my blindness ;)

    I've applied the patch to all 3 cluster-joined nodes and run the cluster test, which has worked!

    Thanks.

     

    Wednesday, April 27, 2011 2:22 PM
  • Possible dumb question, but it looks like this hotfix also applies to Windows 7 SP1 machines with the Failover Cluster Manager installed?

    Regards,

    Jim

    Wednesday, April 27, 2011 6:04 PM
  • Great news. 

    Glad I waited a week to apply SP1 to my Hyper-V cluster. Just got the green light to update it this weekend. We're looking forward to getting Dynamic Memory and hopefully a resolution for the virtual NIC crash issue that's plaguing us. 

    Wednesday, April 27, 2011 6:12 PM
  • I think that's indeed the case.
    Wednesday, April 27, 2011 7:27 PM
  • I get the error msg: "Update not applicable to your computer."

    I've verified that I'm running the x64 version of the fix.

     

    My OS version: Enterprise 2008 R2 Core

    Wednesday, April 27, 2011 7:39 PM
  • With SP1 and clustering installed, and the cluster service running, or not?
    Wednesday, April 27, 2011 7:43 PM
  • I just applied it on our own 2-node cluster with all services running; no problems and no reboot required. Thanks for that :)
    Thx /Tony
    Wednesday, April 27, 2011 8:09 PM
    With SP1 and clustering installed, and the cluster service running, or not?


    SP1 is installed.

     

    Clustering is not installed... it's a new server for a new cluster.

     

     

    Wednesday, April 27, 2011 8:15 PM
  • Could you install it? No need to configure anything yet, but I think you need it installed at least for the hotfix to install.

     

    Good luck,

    Didier Van Hoye

    http://workinghardinit.wordpress.com

     

    Wednesday, April 27, 2011 8:18 PM
  • 5 node CSV cluster VALIDATED following hotfix. Happy bunny.
    Wednesday, April 27, 2011 9:06 PM
  • Applied the hotfix and added my 3rd node successfully, with no issues with the original cluster disks.  The 1 TB LUN I used in the test 2 weeks ago failed the validation again today, stating it was corrupted (which likely happened during the failed validation 2 weeks ago).

    Removed, destroyed, rebuilt, and re-presented the LUN.  I was then able to add it back to the cluster and have it pass all tests with no issues.

     

    Wednesday, April 27, 2011 11:03 PM
  • Adding the clustering role fixed my issue. Thanks all!!
    Thursday, April 28, 2011 4:15 AM
  • Happy to hear that. Have a nice day,

    Didier Van Hoye

    Thursday, April 28, 2011 5:27 AM
  • Just tested the same issue by installing a 3-node cluster with a "Windows 2008 R2 SP1" OS installation (SP1 was included in the image, through VL).
    After the 1st validation, I had similar errors. I installed the patch, and after the reboot, on ALL LUNs (3) that I had created I received a message from the OS to scan and fix. I am now deleting the LUNs and re-creating them to see if they will pass validation.

    Has anyone tried a fresh installation from Win 2008 R2 with SP1 OS media?

    Thursday, April 28, 2011 7:29 AM
  • While this patch is beneficial, it does not correct the damage that has already been done.

    We have an 8-node cluster with 8 drives presented. When we tried to add an additional node, we had two drives that ended up with corrupted partition tables. The disks showed as unrecognized and it looked like complete data loss.

    Fortunately I found a tool that would analyze the disks and recreate the partition tables, and we were able to get the disks back online. There were over 20 servers on these two disks, which is a complete DR situation for our client.

    To resolve the issue I did the following:

    Unpresent the errored drives from the cluster and present them to a standalone server. (Do not initialize or format any drives, or you risk losing data.)

    Download and run the free utility TestDisk (http://www.cgsecurity.org/wiki/TestDisk)

    Let TestDisk rewrite the partition tables, rescan your drives, and you should now be able to access the data on the drive.

    Remove the cluster disks from the Microsoft cluster (there are no drives presented at this time, so no chance of data loss).

    Unpresent the drives from the standalone server, then re-present them and recreate the Cluster Shared Volumes. (Make sure you get the names exactly the same so the file paths match in VMM.)

    You can now start all your lost VMs and everything is back to normal.

    • Proposed as answer by balboa41 Monday, October 08, 2012 3:48 PM
    Friday, February 10, 2012 2:30 PM
  • A hotfix is now available that addresses the Win2008 R2 service pack 1 issue with Validate on a 3+ node cluster.  This is KB 2531907.  You can find more information and the hotfix to download at the following link:
    http://support.microsoft.com/kb/2531907

    Thanks!
    Elden


    HALLELUJAH!!!  Just got handed a half-configured SAN/cloud and was getting stuck on the failover cluster validation on this exact issue... 2008 R2 SP1, adding a 3rd node.  Knew JACK about a SAN/cloud until this and have been learning trial-by-fire!  Been looking at it for two days and couldn't find anything online.  Stumbled across this and BAM... FIXED!!!  Thank you guys!  
    Thursday, May 24, 2012 9:18 PM
  • Thank you very much for the information in this thread.  I did a cluster validation trying to add two new nodes to a cluster, and a whole CSV drive with about 30 VMs became "unallocated", as if it was not even formatted or initialized!

    I was thinking about restoring everything from backup, but... it could take about one week!!!  That was just unacceptable.  Well, thanks to Google I got to this thread and I found the tool "TestDisk", and that did the trick!  The affected disk had even been re-initialized, and the tool still recovered everything back to order in about one minute!  That is just awesome!

    Thank you for commenting about that tool.  And, for everyone out there, if you already ran the Cluster Validation tool without the hotfix, this tool will repair any damaged volumes.  Try it before doing a full restore.

    Regards,


    Jose Angel Rivera

    Monday, October 08, 2012 3:48 PM