Device has a bad block and VM randomly losing access to disk

    Question

  • Hi,

    We have a Hyper-V cluster with 3 nodes running Windows 2012 R2. The cluster uses shared disks hosted on 2 QuantaStor devices that run RAID-10.

    We are currently getting lots of errors visible in the event viewer of one of the cluster nodes:

    Event ID: 7

    The device, \Device\Harddisk4\DR4, has a bad block.
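
    To confirm which disk number these events name and how often it appears, the message text can be tallied. The following is a minimal Python sketch, assuming the Event ID 7 messages have been exported as plain strings. The `tally_bad_blocks` helper and the sample list are illustrative and are not taken from the actual log.

```python
import re
from collections import Counter

# Matches the device path used in the Event ID 7 message text,
# e.g. \Device\Harddisk4\DR4, and captures the disk number.
DEVICE_RE = re.compile(r"\\Device\\Harddisk(\d+)\\DR\d+")


def tally_bad_blocks(messages):
    """Count "bad block" messages per Harddisk number."""
    counts = Counter()
    for msg in messages:
        match = DEVICE_RE.search(msg)
        if match and "bad block" in msg:
            counts[int(match.group(1))] += 1
    return counts


# Hypothetical export: two copies of the message quoted above plus noise.
sample = [
    r"The device, \Device\Harddisk4\DR4, has a bad block.",
    r"The device, \Device\Harddisk4\DR4, has a bad block.",
    "Some unrelated message.",
]
counts = tally_bad_blocks(sample)
print(counts)
```

    If only one disk number shows up, and only on one node, that narrows the search to that node's path to that one disk.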

    Additionally, one of the VMs is randomly unable to access one of its drives (in one case we even had a Windows BSOD on the VM).

    We had a failed disk in the array which has been replaced and now all disks are reported as having no errors, but the cluster node event error is still occurring.

    To me this now seems to make sense; that is, the failed disk had nothing to do with the Event Viewer errors, because a single disk failure in RAID-10 should not return such an error.
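
    The RAID-10 reasoning can be checked with a toy model. This is only a sketch under stated assumptions: a hypothetical 4-disk array of two mirrored pairs, with made-up disk names. It says nothing about how the QuantaStor devices actually lay out or rebuild their arrays.

```python
from itertools import combinations


def raid10_readable(mirror_pairs, failed):
    """Data stays readable while every mirror pair keeps one healthy disk."""
    failed = set(failed)
    return all(any(d not in failed for d in pair) for pair in mirror_pairs)


# Hypothetical array: two mirrored pairs striped together.
pairs = [("d0", "d1"), ("d2", "d3")]
disks = [d for pair in pairs for d in pair]

# Every single-disk failure leaves the array readable.
single_ok = all(raid10_readable(pairs, [d]) for d in disks)

# Two-disk failures are survivable only if they hit different pairs.
double_ok = [c for c in combinations(disks, 2) if raid10_readable(pairs, c)]

print(single_ok)       # True
print(len(double_ok))  # 4 of the 6 possible two-disk failures
```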

    The problem is that we just don't understand what is causing this error. Any ideas on where to start looking for clues?

    Wednesday, August 08, 2018 4:22 AM

All replies

  • Hi,

    Please run the CHKDSK command to check for disk errors.

    Are there any other error messages in the Event Viewer?

    It is hard for us to troubleshoot the issue further, since Event ID 7 points back to the disk itself.

    I would suggest you contact the hardware vendor to confirm the health of the hard disks and the RAID configuration.

    Best Regards,

    Candy


    Please remember to mark the replies as answers if they help.
    If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com   

    Wednesday, August 08, 2018 6:42 AM
  • Hi,

    Thanks for the answer. I ran a CHKDSK on the VM disk that randomly goes away and it returned no errors.

    We are gonna get someone to double check all disks on the arrays (but this has been done).

    I was curious whether there are any known issues with such configurations. The drive that is supposedly failing is connected to the cluster using iSCSI. So far it has only caused issues on one VM; the other VMs also use this disk and have not had that problem.

    There are no other errors in the Event Viewer. Unfortunately, this particular error is happening a lot and is currently pretty much filling the log.

    Thanks again

    Fernando



    • Edited by fernando06 Wednesday, August 08, 2018 7:39 AM added images
    Wednesday, August 08, 2018 7:31 AM
  • Hi,

    >>And so far it has only caused issues on one VM, the other VMs also use this disk and have not had that problem.

    Does the issue occur randomly on this VM?

    Please understand that random issues are hard for us to research further, since we cannot reproduce the problem.

    >>The Drive that is supposedly failing is hooked to the cluster using iSCSI.

    If possible, could you please rebuild the problematic VM's connection to the CSV and then check the results?

    Best Regards,

    Candy



    Wednesday, August 08, 2018 8:41 AM
  • Hi Candy,

    Thanks for your response, I do understand that the randomness doesn't help at all.

    Unfortunately, rebuilding the connection to the CSV will mean an outage (I suspect), so it is a last resort at the moment, considering the VMs are working and the problem is that one of them temporarily loses access to drive E. This is a problem that we can temporarily live with (unless it starts affecting the drive that hosts the database server, which is our major concern).

    Wednesday, August 08, 2018 10:40 AM
  • >>in the event viewer of one of the cluster nodes

    This suggests to me that you have ailing network hardware on that node or some physical connection associated with that node (like a switch port). Also possible that the back-end device does not support multiple simultaneous connections from multiple hosts, but the Cluster Validation Wizard should have caught that. You did run the wizard before going into production, right?

    If you want to try a software solution, you should be able to evacuate the failing node and completely rebuild its iSCSI connections without impacting any VM or the cluster. If you can't, then you have a configuration problem.
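
    The order of operations for that approach can be sketched as follows. This is a hedged outline, not a tested procedure: the Python helper `plan_iscsi_rebuild` only assembles PowerShell command strings and runs nothing, and the node name and target IQN are placeholders. The cmdlets are the standard FailoverClusters and iSCSI ones; check their parameters against your own environment before running anything.

```python
def plan_iscsi_rebuild(node, target_iqn):
    """Return the ordered PowerShell commands for one node (nothing is run)."""
    return [
        # 1. Evacuate the node so its VMs move to the other nodes.
        f"Suspend-ClusterNode -Name {node} -Drain",
        # 2. Drop the existing iSCSI session to the target on this node.
        f"Disconnect-IscsiTarget -NodeAddress {target_iqn} -Confirm:$false",
        # 3. Reconnect and keep the connection across reboots.
        f"Connect-IscsiTarget -NodeAddress {target_iqn} -IsPersistent $true",
        # 4. Bring the node back and let its roles return.
        f"Resume-ClusterNode -Name {node} -Failback Immediate",
    ]


# Placeholder names, for illustration only.
steps = plan_iscsi_rebuild("NODE1", "iqn.example:placeholder-target")
for step in steps:
    print(step)
```

    Printing the plan first makes it easy to review the sequence and adapt it to one node at a time.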


    Eric Siron
    Altaro Hyper-V Blog
    I am an independent contributor, not an Altaro employee. I accept all responsibility for the content of my posts. You accept all responsibility for any actions that you take based on the content of my posts.

    Wednesday, August 08, 2018 5:25 PM
  • Hi Eric,

    Thanks for your response. This cluster has been running for over 5 years now (I don't know if the validation wizard was run by whoever built it, but I think it is very likely it was). The issue just started happening a couple of weeks ago.

    Our next move is to run some further hardware checks on the cluster nodes, make sure they are fully patched, and restart them one at a time (which we sometimes do). Our concern is of course that, because this issue is unexplained so far and it affects one of the main drives, restarting and moving VMs around may cause other connectivity issues (but I guess we will have to take this risk in the absence of an explanation). Additionally, this may not actually solve the problem.

    Thanks again,

    Fernando

    Friday, August 10, 2018 1:34 AM
  • Hi Fernando,

    If you have any updates during this process, please feel free to let me know.

    Best Regards,

    Candy



    Wednesday, August 15, 2018 7:57 AM