Hyper-V host and all guests become unresponsive, event ID 153 fills system log

  • Question

  • We have several Windows Server 2012 R2 Hyper-V hosts with dozens of VMs running legacy systems and some departmental systems (XP, Win2003, Win2008, Win2012R2). They had worked for years without any problem.

    Last December, we approved the monthly Windows Updates for the guest systems. The next day, our IBM x3650 hosts reported disk errors and hung, and the disk array was lost after a restart.

    The host system logs showed these records in Windows Event Viewer:

    • Megasas2 – ID 129 – reset to device, \device\raidport0, was issued.
    • Disk – ID 153 - the IO operation at logical block address 0x… for disk 2 (PDO: \device\0000005d) was retried.
    • Megasas2 – ID 11 - the driver detected a controller error on \device\raidport0
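
    To see how often each of these events occurs, the System log can be exported as XML (for example with `wevtutil qe System /f:xml`) and tallied per provider and event ID. The following is only a minimal Python sketch: the function name `tally_events` and the `SAMPLE` events are made up for illustration and are trimmed to the two fields the tally needs.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

EVENT_NS = "http://schemas.microsoft.com/win/2004/08/events/event"
NS = {"e": EVENT_NS}

# Illustrative data only: hand-written, trimmed events, not taken from a real log.
SAMPLE = """
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System><Provider Name="megasas2" /><EventID>129</EventID></System>
</Event>
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System><Provider Name="disk" /><EventID>153</EventID></System>
</Event>
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System><Provider Name="disk" /><EventID>153</EventID></System>
</Event>
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System><Provider Name="megasas2" /><EventID>11</EventID></System>
</Event>
"""


def tally_events(export_text):
    """Count events per (provider name, event ID) in an XML export."""
    # An export may be a bare sequence of <Event> elements, so drop any
    # XML declaration and wrap everything in a single root element.
    body = re.sub(r"<\?xml[^>]*\?>", "", export_text)
    root = ET.fromstring("<Events>" + body + "</Events>")

    counts = Counter()
    # iter() also finds events nested under a root element of the export.
    for event in root.iter("{" + EVENT_NS + "}Event"):
        provider = event.find("e:System/e:Provider", NS).get("Name")
        event_id = int(event.findtext("e:System/e:EventID", namespaces=NS))
        counts[(provider, event_id)] += 1
    return counts


for (provider, event_id), count in tally_events(SAMPLE).most_common():
    print(provider, event_id, count)
```

    A tally like this does not prove where the fault lies, but it makes it easier to compare a failing host with a healthy one.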

    The systems running on HP DL160/360 were not affected. The affected environment is IBM x3650 M4 HD 5460, configured to boot from the internal LSI MegaRAID M5210e. We are using fixed VHDX.

    All diagnostics run after a reboot show no errors, and the hosts work fine after a reinstall until an updated guest is started. The VM guests run normally in another environment.

    Given this, we have tried reinstalling the host system with every version of the LSI MEGASAS drivers we could find, with and without Windows updates on the host OS. In every attempt, the server crashed and the disk array was lost when a guest started.

    Has anyone experienced this problem? How can we confirm that this is a driver issue?


    Wednesday, March 11, 2015 7:49 PM

Answers

  • Hi guys,

    The hardware is 2012R2 certified (http://windowsservercatalog.com/item.aspx?idItem=a8ee4fd3-a6de-8d5c-9426-b8f8d1cba0d0&bCatID=1282). 

    Continuing our trial-and-error approach, some tests this week revealed new information about the issue. The problem occurs when the Hyper-V Storage Accelerator (HSA) driver is enabled in a 32-bit XP/2003 guest OS. If it is disabled, everything runs fine. x64 guests also run fine, even with the HSA driver enabled.

    Yesterday, we tested another option: disabling data protection (T10-DIF PI) in the RAID BIOS. Since then, we have had no errors or issues. It looks like a workaround.

    However, T10 data protection prevents silent corruption caused by FIFO overruns and underruns or by firmware errors (such as arithmetic overflow or incorrect pointer usage). Is it safe to disable it?

    Regards,
    Wednesday, March 25, 2015 7:26 PM

All replies

  • Hi Sir,

    >>All diagnosis made after reboot have no errors and the hosts work fine after reinstall until an updated guest is started. VM guests run normally in another environment.

    Do you mean that if you don't install the updates in that VM, the Hyper-V host works without issue?

    I searched for similar event IDs. Please give these a try:

    http://blogs.technet.com/b/kevinholman/archive/2013/06/21/event-id-129-storachi-reset-to-device-device-raidport0-was-issued.aspx

    http://blogs.msdn.com/b/ntdebugging/archive/2013/04/30/interpreting-event-153-errors.aspx

    If you need any further information, please feel free to let us know.

    Best Regards,

    Elton Ji


    Please remember to mark the replies as answers if they help and unmark them if they provide no help. If you have feedback for TechNet Subscriber Support, contact tnmff@microsoft.com .

    Thursday, March 12, 2015 9:59 AM
    Moderator
  • Hi Elton,

    The VM images restored from tape backup without updates work fine.

    However, it's not so simple. Some updated guests work. Others run OK only in compatibility mode.

    We are now trying to isolate the cause by comparing images and update histories. It's hard: after each test that triggers the problem, we lose the host and have to reinstall it.

    It's strange that a guest problem can ruin the host. The hypervisor should provide isolation between the environments, shouldn't it?

    We already knew the links you pointed to. We sent them to the IBM support team, and they were very useful.

    Thanks.
    Thursday, March 12, 2015 1:02 PM
  • Hi Sir,

    >>Now, we are trying to isolate the cause comparing images and updates history. It´s hard, after each test that has the problem we lose the host and we have to reinstall it.

    I can understand that.

    I would suggest you try to analyze the dump file with Debugging Tools yourself. You can install it and its Symbol Packages from the following link:
     
    http://www.microsoft.com/whdc/Devtools/Debugging/default.mspx
     
    WinDbg will tell you the possible cause. For more information, please read Microsoft KB Article: 
    How to read the small memory dump files that Windows creates for debugging
     
    http://support.microsoft.com/kb/315263

    In addition, please keep the Hyper-V host up to date, as well as the host's hardware drivers.

    If there is no clue, please contact Microsoft Customer Service and Support (CSS) by telephone so that a dedicated Support Professional can assist with your request.
    To obtain the phone numbers for a specific technology request, please refer to the web site listed below:
    http://support.microsoft.com/default.aspx?scid=fh;EN-US;OfferProPhone#faq607

    Best Regards,

    Elton Ji


    Friday, March 13, 2015 2:40 AM
    Moderator
  • Hi Elton,

    Sorry for my bad English. I'm working on improving it. Let me clarify.

    When the problem occurs, the host stops responding. There is no dump or blue screen. Any application we try to interact with stops responding. Resource Monitor shows zero activity on the disk write queue, while the read queue length reaches the top of the scale. When CHKDSK runs, it reports many “file record segment ### is unreadable” errors. Event Viewer shows errors 11, 129 and 153. Within a few minutes, the host freezes.

    After restarting, the boot partition is unreachable and the recovery process fails. All disk data is lost, including the virtual machines. However, if we install again, the system works fine and passes all diagnostics and disk checks without errors.

    Yesterday, we found some situations that trigger the problem almost instantly: for example, running a full antivirus scan in a guest, or just formatting a volume in a guest without antivirus. This happens only in XP or Windows 2003 guests.

    The host OS and the host's hardware drivers are up to date. The problem occurs only on our xSeries servers. We think it's a driver bug. There is an open support request with the hardware supplier, but they are not convinced by the driver bug hypothesis. They suggest it's a Hyper-V bug.

    Is this a driver or hypervisor issue? How can we confirm it?

    Regards.

    Friday, March 13, 2015 12:56 PM
  • This sounds like a driver issue. Have you installed the latest IBM drivers from the vendor? Another good log to look at is the Hyper-V VMMS operational log, under Event Viewer\Applications and Services Logs\Microsoft\Windows\Hyper-V-VMMS.

    Friday, March 13, 2015 7:46 PM
  • We have tried the latest driver versions provided by MS, IBM and LSI.

    • MS: lsi_sas2 v2.00.60.82 / lsi_sas3 v2.50.65.01 / megasas2 v6.600.21.8
    • IBM: lsi_sas2 v2.00.69.01 / lsi_sas3 v2.50.75.00 / megasas2 v6.704.12.00
    • LSI: lsi_sas2 v2.00.72.00 / lsi_sas3 v2.50.92.00 / megasas2 v6.705.05.00

    The Hyper-V VMMS-Storage log shows: “Failed to open attachment ... Error: 'The file or directory is corrupted and unreadable.'”

    The problem occurs only when Integration Components are installed.

    Monday, March 16, 2015 4:40 PM
  • Hi Sir,

    >>The resource monitor shows zero disk usage on write queue. Read Queue Length reaches the scale. When CHKDSK is running, it shows a lot of “file record segment ### is unreadable”. Event Viewer shows 11, 129 and 153 errors. In a few minutes, the host freezes.

    After restarting, the boot partition is unreachable and the recover process fails. All disk data was lost, and Virtual Machines too. However, if we install again, the system works fine, it runs all diagnostics or disk checks without errors.

    >>The problem occurs only in our xSeries.

    Sorry for the delay.

    As you mentioned, the issue only happens on xSeries. Based on my experience, this suggests a compatibility conflict.

    Please refer to the following web site to check whether the x3650 is supported for Windows Server 2012 R2:

    http://windowsservercatalog.com/results.aspx?text=IBM+x3650+&=Go&bCatID=1282&avc=10&ava=0&OR=5&chtext=&cstext=&csttext=&chbtext=

    Best Regards,

    Elton Ji


    Wednesday, March 25, 2015 6:09 AM
    Moderator
  • Hi Sir,

    >>However, T10 data protection prevents silent corruption like FIFO overruns and underruns or firmware errors (such as arithmetic overflow or incorrect pointer usage). Is it safe disable it?

    Glad to hear that you have found a workaround.

    For this hardware setting, however, I would suggest you contact the hardware vendor to check whether there is any potential issue.

    Best Regards,

    Elton JI


    Friday, March 27, 2015 1:29 PM
    Moderator
  • Hi Rafael,

    As you know, these events can be logged for many reasons. The underlying cause can be as simple as a large file copy performed as buffered I/O when it should be unbuffered, an insufficient controller command queue (queue depth), or a workload issue, or it can go all the way down to firmware, drivers and hardware.

    Q: Where are these events being logged (within host server or the VMs)?

    Q: Can you copy the "Details" of Event 153 and paste it here?

    Right click the event 153 > Select Copy  > Copy Details as text

    Also, what is the workload on this server?

    What type of IO operation is occurring when this happens?

    Thanks

    Thursday, April 2, 2015 4:21 AM
  • "When CHKDSK is running, it shows a lot of “file record segment ### is unreadable”. Event Viewer shows 11, 129 and 153 errors. In a few minutes, the host freezes."

    I'd say your array is hosed, either the member disks or the controller itself. It could be firmware or drivers, but as only one system is affected, my money would be on the tin. Backup, backup, backup, and run a RAID consistency check from the controller's BIOS. You may also find there are other extended hardware tests, depending on the controller in question.

    Thursday, April 2, 2015 1:16 PM
  • Hi Rona,

    All the log events that I reported were logged on the host server. We sent these logs to the IBM support team and a local MS partner. They can be downloaded from here:

    https://drive.google.com/file/d/0B9wVZhtQCLu8eFdnaWtVUEhkV0k/view?usp=sharing

    There are also videos that show the issue. The IO operation is not always the same and the workload is usually low. Formatting a volume or running a full antivirus scan (within the guest OS) were the actions that caused the most failures.

    The IBM support team answered our question: they suggested applying the workaround (disabling T10 data protection) and confirmed that this is safe. There have been no more incidents here with this configuration.

    These are the “details” of event 153:

    <Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
      <System>
        <Provider Name="disk" />
        <EventID Qualifiers="32772">153</EventID>
        <Level>3</Level>
        <Task>0</Task>
        <Keywords>0x80000000000000</Keywords>
        <TimeCreated SystemTime="2015-03-18T13:48:14.205538700Z" />
        <EventRecordID>1083</EventRecordID>
        <Channel>System</Channel>
        <Computer>IVAI-VM5.ivai.intranet</Computer>
        <Security />
      </System>
      <EventData>
        <Data>\Device\Harddisk0\DR0</Data>
        <Data>a8</Data>
        <Data>0</Data>
        <Data>\Device\0000005a</Data>
        <Binary>0F01040004003400000000009900048000000000000000000000000000000000000000000000000000020B2A</Binary>
      </EventData>
    </Event>

    Regards,

    Thursday, April 2, 2015 1:37 PM
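
    For anyone comparing many such records, the fields of an event 153 export can be pulled out with a few lines of Python. This is a minimal sketch, not a definitive decoder. The function name `parse_disk_153` is made up for illustration, and the mapping of the four `Data` values (device, logical block address in hex, disk number, PDO name) is an assumption based on the wording of the event message quoted in the question. The embedded XML is a trimmed copy of the event above, without the `Binary` payload, and must be well formed (`<Event xmlns=...>` with a space).

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the event posted in this thread (raw string keeps backslashes).
EVENT_XML = r"""<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="disk" />
    <EventID Qualifiers="32772">153</EventID>
    <TimeCreated SystemTime="2015-03-18T13:48:14.205538700Z" />
    <Computer>IVAI-VM5.ivai.intranet</Computer>
  </System>
  <EventData>
    <Data>\Device\Harddisk0\DR0</Data>
    <Data>a8</Data>
    <Data>0</Data>
    <Data>\Device\0000005a</Data>
  </EventData>
</Event>"""


def parse_system_time(stamp):
    """Parse a SystemTime value, truncating the fraction to microseconds."""
    base, _, fraction = stamp.rstrip("Z").partition(".")
    micros = (fraction + "000000")[:6]  # strptime's %f accepts at most 6 digits
    parsed = datetime.strptime(base + "." + micros, "%Y-%m-%dT%H:%M:%S.%f")
    return parsed.replace(tzinfo=timezone.utc)


def parse_disk_153(xml_text):
    """Extract the main fields of a disk event 153 record."""
    root = ET.fromstring(xml_text)
    data = [d.text for d in root.findall("e:EventData/e:Data", NS)]
    stamp = root.find("e:System/e:TimeCreated", NS).get("SystemTime")
    return {
        "event_id": int(root.findtext("e:System/e:EventID", namespaces=NS)),
        "time_utc": parse_system_time(stamp),
        # Assumed field order: device, LBA (hex), disk number, PDO name.
        "device": data[0],
        "lba": int(data[1], 16),
        "disk_number": int(data[2]),
        "pdo": data[3],
    }


info = parse_disk_153(EVENT_XML)
print(info["time_utc"].isoformat(), info["device"], hex(info["lba"]), info["pdo"])
```

    Collecting these fields from every event 153 shows whether the retries cluster on one disk or one block range, which may help when discussing the case with the hardware vendor.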