CSV Volume disk signature changed.

    Question

  • Hello everyone

    I encountered this on one of our Hyper-V clusters a few days ago. We had a node failure caused by either a network or an FC HBA failure (the root cause is uncertain; the logs are inconclusive). We lost contact with all machines that resided on the disks the failing node owned.

    The disks were moved to other nodes and the machines were brought back online, but oddly enough "duplicates" of some machines appeared across nodes; this was fixed. The failing node was taken offline, we did a driver/firmware update, and we placed the node back into the cluster.

    A few days later we noticed that one of the disks was "Online, Redirected", and we were unable to bring it back to direct access mode. So we migrated it to another node and rebooted the node. Now the volume was online, BUT the cluster was not able to find the disk, because the disk signature had changed!? Fortunately there were no critical servers on the disk, but we were not able to get it back online.

    I've come across some "old" KB articles and I wonder if they can still apply (they say Windows 2000 and 2003), because they describe what I have experienced:

    http://support.microsoft.com/kb/293778

    http://support.microsoft.com/kb/280425

    Any other suggestions on what we can do to bring the disk back online without re-initializing it?

    Thanks for your time.

    Wednesday, December 05, 2012 9:37 AM


All replies

  • Hi,

    You mentioned a CSV volume, so your Hyper-V cluster should be running at least Windows Server 2008 R2.

    > We lost contact with all machines that resided on the disks the failing node owned. The disks were moved to other nodes and the machines were brought back online.

    When a failure occurs, the resources on the failed node should fail over to other nodes, including the VMs that reside on these CSVs. This should happen without user intervention, but according to your description it seems like you did that manually. What actually happened?

    > but oddly enough "duplicates" of some machines appeared across nodes; this was fixed.

    How did you fix that?

    > a few days later we noticed that one of the disks was "Online, Redirected".

    Storage connectivity failures sometimes prevent a given node from communicating directly with the storage. To maintain function until the failure is corrected, the node redirects the disk I/O through a cluster network (the preferred network for CSV) to the node where the disk is currently mounted. This is called CSV redirected I/O mode.

    So it seems this cluster still has a storage connectivity issue. Check Event Viewer for related events and post them.
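    For example, the events and the cluster debug log can be collected from an elevated command prompt roughly like this (just a sketch; adjust the event count and output file names as you like):

    REM Dump the 50 most recent FailoverClustering events from the System log, newest first
    wevtutil qe System /q:"*[System[Provider[@Name='Microsoft-Windows-FailoverClustering']]]" /c:50 /rd:true /f:text > FailoverClustering-events.txt

    REM Generate the cluster debug log (written to %windir%\Cluster\Reports\Cluster.log on each node)
    cluster log /gen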

    > so we migrated it to another node and rebooted the node. Now the volume was online BUT the cluster was not able to find the disk

    What did you mean by "migrated", a manual failover? Is the volume now online in direct access mode? On which node, the failed node or the failover node?

    A CSV volume starts out as an offline disk on each cluster node, then becomes shared cluster storage, and then a Cluster Shared Volume. Check the disk status in Disk Management on the cluster nodes and give us feedback. If possible, give us some screen captures to help us understand your cluster issue.
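    For a quick look without screenshots, something like this from an elevated command prompt also shows both views (again just a sketch):

    REM Cluster's view: list all resources, including the physical disk resources, with their owner node and state
    cluster resource

    REM Each node's own view: list the disks the operating system can see
    echo list disk> dp-list.txt
    diskpart /s dp-list.txt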

    For more information, please refer to the following Microsoft articles:

    Understanding redirected I/O mode in CSV communication
    http://technet.microsoft.com/en-us/library/ff182358(v=WS.10).aspx#BKMK_redirected
    Using Cluster Shared Volumes in a Failover Cluster in Windows Server 2008 R2
    http://technet.microsoft.com/en-us/library/ff182346(v=ws.10).aspx

    Hope this helps!

    TechNet Subscriber Support

    If you are a TechNet Subscription user and have any feedback on our support quality, please send your feedback here.


    Lawrence

    TechNet Community Support

    Thursday, December 06, 2012 2:51 AM
  • Hi, thanks for your reply. I see I forgot to mention some things in my original post.

    >You mentioned a CSV volume, so your Hyper-V cluster should be running at least Windows Server 2008 R2.

    Yes, of course we are running Windows Server 2008 R2 with Hyper-V and CSV.

    >When a failure occurs, the resources on the failed node should fail over to other nodes, including the VMs that reside on these CSVs. This should happen without user intervention, but according to your description it seems like you did that manually. What actually happened?

    The machines on the node that failed (node3) did start up on the other nodes automatically, but some techs were too fast and started machines manually before the cluster had time to converge (or at least that's what we think caused the duplication; maybe some MPIO error caused it?), so we ended up with duplicates. We fixed the duplicates by live-migrating the machines to the other nodes that contained a duplicate.

    >So it seems this cluster still has a storage connectivity issue. Check Event Viewer for related events and post them.

    When we failed the redirected volume over to another node (from node1 to node5), we received this error:

    Event id: 1034

    Cluster physical disk resource 'VM-Volume5' cannot be brought online because the associated disk could not be found. The expected signature of the disk was '<hex-id>'. If the disk was replaced or restored, in the Failover Cluster Manager snap-in, you can use the Repair function (in the properties sheet for the disk) to repair the new or restored disk. If the disk will not be replaced, delete the associated disk resource.

    We can't seem to find the Repair function on any sheet/tab whatsoever, and failing it over to another node (any node in the cluster) gives the same error.

    The disk is now listed as Unknown in Disk Management.

    So my question is: does the knowledge base article describing that MPIO errors may cause a new signature to be written to the disk still apply to 2008 R2, even though the article states it is for Windows 2000? It coincides with what we have experienced on our cluster. And is it possible to write the expected signature back to the disk on Windows 2008 R2, as described in the KB, and save the data on the disk?

    Thanks


    • Edited by Blinkage17 Thursday, December 06, 2012 9:06 AM typo
    Thursday, December 06, 2012 9:03 AM
  • Hi,

    Was that screenshot taken on node 1? Was node 1 the owner of this disk resource? What is the disk status on the other nodes?

    Event ID 1034 indicates a storage issue; you may refer to the following procedure to resolve it:

    1. On each node in the cluster, open Disk Management (which is in Server Manager under Storage) and see if the disk is visible from one of the nodes (it should be visible from one node but not multiple nodes). If it is visible to a node, continue to the next step. If it is not visible from any node, still in Disk Management on a node, right-click any volume, click Properties, and then click the Hardware tab. Click the listed disks or LUNs to see if all expected disks or LUNs appear. If they do not, check cables, multi-path software, and the storage device, and correct any issues that are preventing one or more disks or LUNs from appearing. If this corrects the overall problem, skip all the remaining steps and procedures.
    2. Review the event log for any events that indicate problems with the disk. If an event provides information about the disk signature expected by the cluster, save this information and skip to the last step in this procedure.
    3. To open the failover cluster snap-in, click Start, click Administrative Tools, and then click Failover Cluster Management. If the User Account Control dialog box appears, confirm that the action it displays is what you want, and then click Continue.
    4. In the Failover Cluster Management snap-in, if the cluster you want to manage is not displayed, in the console tree, right-click Failover Cluster Management, click Manage a Cluster, and then select or specify the cluster that you want.
    5. If the console tree is collapsed, expand the tree under the cluster you want to manage, and then click Storage.
    6. In the center pane, find the disk resource whose configuration you want to check, and record the exact name of the resource for use in a later step.
    7. Click Start, point to All Programs, click Accessories, right-click Command Prompt, and then click Run as administrator.
    8. Type:

    CLUSTER RESOURCE DiskResourceName /PRIV >path\filename.TXT

    For DiskResourceName, type the name of the disk resource, and for path\filename, type a path and a new filename of your choosing.

    9. Locate the file you created in the previous step and open it. For a master boot record (MBR) disk, look in the file for DiskSignature. For a GPT disk, look in the file for DiskIdGuid.
    10. Use the software for your storage to determine whether the signature of the disk matches either the DiskSignature or DiskIdGuid for the disk resource. If it does not, the disk configuration needs to be repaired; a rough command-line sketch of this comparison (and a possible repair) follows below.
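    To make steps 8-10 more concrete, here is a rough sketch using the resource name from your post. The disk number is a placeholder, it assumes an MBR disk, and the actual signature write-back is left commented out; it should only be considered if the partition table is known to be intact, and I would confirm it with Microsoft support first.

    REM Step 8: dump the private properties of the disk resource; look for DiskSignature (MBR) or DiskIdGuid (GPT)
    CLUSTER RESOURCE "VM-Volume5" /PRIV > VM-Volume5-priv.txt

    REM Steps 9-10: compare with what the disk itself reports (take the disk number from "list disk")
    (
      echo select disk 5
      echo uniqueid disk
    ) > dp-check.txt
    diskpart /s dp-check.txt

    REM If ONLY the signature changed and the partitions are intact, the expected MBR signature
    REM could in principle be written back with a diskpart script containing:
    REM   select disk 5
    REM   uniqueid disk id=ABCDEF12
    REM where ABCDEF12 stands for the DiskSignature value from the /PRIV dump (without the 0x prefix).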

    You may also try the command mentioned in this article to remove the disk configuration from the node and then re-add the disk:
    Windows 2008 R2 Disks and SAN policies
    http://social.technet.microsoft.com/Forums/en/winserverfiles/thread/5fb7b833-7952-4c49-889a-6be01298e923
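    The SAN policy mentioned in that thread, and the disk's offline/read-only attributes, can be checked with a quick diskpart script, for example (disk 5 is again just a placeholder number):

    REM Show the current SAN policy and the offline/read-only attributes of the selected disk
    (
      echo san
      echo select disk 5
      echo attributes disk
    ) > dp-san.txt
    diskpart /s dp-san.txt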

    For more information, please refer to the following Microsoft articles:

    Event ID 1034 — Cluster Storage Functionality
    http://technet.microsoft.com/en-us/library/dd353968(v=WS.10).aspx
    Unable to add disk to cluster
    http://social.technet.microsoft.com/forums/en-us/winserverClustering/thread/24C2FA76-34BA-4F03-A51D-054C13712877

    Hope this helps!

    TechNet Subscriber Support

    If you are a TechNet Subscription user and have any feedback on our support quality, please send your feedback here.


    Lawrence

    TechNet Community Support

    Friday, December 07, 2012 6:25 AM
  • Hi,

    I would like to confirm the current situation. Have you resolved the problem, or do you have any further progress?

    If there is anything that we can do for you, please do not hesitate to let us know, and we will be happy to help.


    Lawrence

    TechNet Community Support

    Monday, December 10, 2012 6:48 AM
  • Hi,

    As this thread has been quiet for a while, we assume that the issue has been resolved. At this time we will mark it as 'Answered', as the previous steps should be helpful for many similar scenarios.

    If the issue still persists and you want to return to this question, please reply to this post directly so we will be notified and can follow up. You can also choose to unmark the answer if you wish.

    In addition, we'd love to hear your feedback about the solution. By sharing your experience you can help other community members facing similar problems.

    Thanks!


    Lawrence

    TechNet Community Support

    Friday, December 14, 2012 9:21 AM
  • Just to tell everyone how we solved the problem.

    We actually had this issue happen again just a few days ago, with another disk on the same cluster.

    Two different solutions were tried in parallel; both were a success.

    We removed the disks as clustered disks and then deleted them from Available Storage. We then unexported them from the SAN and re-exported the two disks to one node only, which was vacated and paused. We also stopped the Cluster service on this host.
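    For anyone repeating this: pausing the node and stopping the Cluster service can be done from an elevated command prompt roughly like this (node5 is just a placeholder name, and the VMs were moved off the node first):

    REM Pause the node so that nothing fails over to it
    cluster node node5 /pause

    REM Stop the Cluster service on this host
    net stop clussvc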

    We examined the disks with diskpart and, behold, there was no partition table on them; the disk ID was "0000000". While filing a ticket with Microsoft we started a third-party partition recovery tool on one of the disks; this process would take an estimated 3-4 hours to complete.
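    For reference, this is roughly how such an inspection looks with diskpart (a sketch, not the exact commands we ran; take the disk number from "list disk", "uniqueid disk" prints the disk signature, and "list partition"/"detail disk" show whether any partitions or volumes are left):

    REM Build a small diskpart script and run it from an elevated command prompt
    (
      echo list disk
      echo select disk 5
      echo uniqueid disk
      echo list partition
      echo detail disk
    ) > dp-inspect.txt
    diskpart /s dp-inspect.txt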

    When we got help from Microsoft, they used a tool called DiskProbe to manually write the partition table back to the second disk. This was a far quicker process than the partition recovery tool; it took maybe 30 minutes to complete.

    After that we just had to put the disks back into Available Storage, add them as Cluster Shared Volumes, rename them back to their old names, and start up the machines again. We had to fix the networking on a lot of the machines; the network read "Configuration error", which was fixed by just setting the correct network, and voila!

    Hope this can help someone else caught in this pinch.

    • Marked as answer by Blinkage17 Saturday, December 22, 2012 11:05 PM
    Saturday, December 22, 2012 11:05 PM
  • Hi,

    Thanks for sharing your experience!

    Your experience and solution can help other community members facing similar problems.

    Thanks for your contribution to Windows Server Forum!

    Have a nice day!

    Lawrence

    TechNet Community Support

    Monday, December 24, 2012 2:54 AM