How to look at Storport ETW Trace logs?
What are Storport Trace Logs? It is a new feature implemented in the STORPORT.SYS binaries released on KB KB 979764 for Windows Server 2008 and KB 978000 for Windows Server 2008 R2. It is always available on Windows Server 2008 R2 SP1 and Windows Server 2012. It provides a measure of time delay (latency) for requests made to a system’s disk unit. The measurements are taken form the Storport.sys driver, the lowest Microsoft component in the Storage Stack. This means It only includes the latency for the components BELOW the Storport driver, normally they only include: The Storage Adapter Miniport Driver (Hardware device driver) The hardware layers (Storage Controller, Cables, Disk, etc)
Why is that useful? The most common use is to determine how much of the IO latency is happening below the Microsoft components. This is very useful to diagnose hardware problems.
How to obtain a Storport trace? For Windows Server 2008 and 2008 R2 follow the instructions from Bob Golding on the blog below. Storport ETW Logging to Measure Requests Made to a Disk Unit http://blogs.msdn.com/b/ntdebugging/archive/2010/04/22/etw-storport.aspx (Update) For Windows Server 2012 and Windows 8 follow the instructions from Paul Reynolds on the blog below. Tracing with Storport in Windows 2012 and Windows 8 http://blogs.technet.com/b/askcore/archive/2013/01/12/tracing-with-storport-in-windows-2012-and-windows-8.aspx
Pre-work You should have the .ETL file in your hands by now, make sure it is larger than 8K or chances are it is an empty file. A simple way to confirm it has data is to open the ETL on event viewer, but event viewer won’t let you manipulate the data, this is a problem, especially when there are thousands of events.
Step 1. Export the ETL to a CSV. This will enable you to manipulate the data on excel and make pretty graphs (everyone loves graphs).
How to export the Storport Trace ETL to a CSV:
Step 2. Look at the CSV file:
Step 3. Decode the SCSI commands to learn what the system is doing:
b. You may see vendor specific commands that are not part of the standard. In that case you may need to find the vendor specific reference.
Step 4. Look for a pattern that allows you to take action.
You may learn only Persistent Reservation commands are taking a very long time, this may be a problem with the storage. You may learn only reads are taking a very long time and that is most likely due to saturated disks. You may find writes are taking a long time and this may indicate hardware write cache is disabled. You may see only a certain LUN, port number or path is having trouble. Then you draw your conclusions from this point on, maybe a bad cable. Also mind the SCSI Status and SRB. SCSI Status = 0 means the command completed successfully, this is the most common status.
Below you will find 3 quick reference tables: Table 1: SCSI Command Codes Table 2: SCSI Status Codes Table 3: SCSI Request Block (SRB) Codes.
For additional information on SCSI commands, refer to the following documents from T10 Technical Committee (www.T10.org)
SCSI-3 Block Commands (SBC)
SCSI Primary Commands - 3 (SPC-3)
SCSI Architecture Model - 3 (SAM-3)
PS: Thank you Tate Calhoun for the help with xperf processing tips and James Burrage for the quick reference tables.
Table1: SCSI Command Codes
00 TEST UNIT READY (mandatory) 01 RECALIBRATE or REWIND 03 REQUEST SENSE (mandatory except RBC) 04 FORMAT UNIT 05 READ BLOCK LIMITS 07 REASSIGN BLOCKS 07 INITIALIZE ELEMENT STATUS 08 READ (6) 0A WRITE (6) 0B SEEK (6) 0F READ REVERSE (6) 10 WRITE FILEMARKS (6) 11 SPACE (6) 12 INQUIRY (mandatory) 13 VERIFY(6) 14 RECOVER BUFFERED DATA 15 MODE SELECT (6) 16 RESERVE (6) 17 RELEASE (6) 18 COPY 19 ERASE (6) 1A MODE SENSE (6) 1B START/STOP UNIT 1B LOAD UNLOAD 1C RECEIVE DIAGNOSTIC RESULTS 1D SEND DIAGNOSTIC 1E PREVENT/ALLOW MEDIUM REMOVAL 23 READ FORMAT CAPACITIES (MMC) 24 SET WINDOW 25 READ CAPACITY (10) 28 READ (10) 29 READ GENERATION 2A WRITE (10) 2B SEEK (10) 2C ERASE (10) 2D READ UPDATED BLOCK 2E WRITE AND VERIFY (10) 2F VERIFY (10) 30 SEARCH DATA HIGH (10) 31 SEARCH DATA EQUAL (10) 32 SEARCH DATA LOW (10) 33 SET LIMITS (10) 34 PRE-FETCH (10) 35 SYNCHRONIZE CACHE (10) 36 LOCK/UNLOCK CACHE (10) 37 READ DEFECT DATA (10) 37 INITIALIZE ELEMENT STATUS WITH RANGE 38 MEDIUM SCAN 39 COMPARE 3A COPY AND VERIFY 3B WRITE BUFFER 3C READ BUFFER 3D UPDATE BLOCK 3E READ LONG 3F WRITE LONG 40 CHANGE DEFINITION 41 WRITE SAME (10) 44 REPORT DENSITY SUPPORT 45 PLAY AUDIO (10) 46 GET CONFIGURATION 47 PLAY AUDIO MSF 48 AUDIO TRACK INDEX (not mentioned in T10 overview) 49 AUDIO TRACK RELATIVE 10 (not mentioned in T10 overview) 4A GET EVENT STATUS NOTIFICATION 4B PAUSE / RESUME 4C LOG SELECT 4D LOG SENSE 50 XDWRITE (10) 51 XPWRITE (10) /*READ DISC INFORMATION CDB (in mmc5r02c 6.22.2)*/ 52 XDREAD (10) 53 XDWRITEREAD (10) 54 SEND OPC INFORMATION 55 MODE SELECT (10) 56 RESERVE (10) 57 RELEASE (10) 58 REPAIR TRACK 5A MODE SENSE (10) 5B CLOSE TRACK / SESSION 5C READ BUFFER CAPACITY 5D SEND CUE SHEET 5E PERSISTENT RESERVE IN 5F PERSISTENT RESERVE OUT 7E EXTENDED CDB 7F VARIABLE LENGTH CDB 80 XDWRITE EXTENDED (16) 80 WRITE FILEMARKS (16) 81 REBUILD (16) 81 READ REVERSE (16) 82 REGENERATE (16) 83 EXTENDED COPY 84 RECEIVE COPY RESULTS 85 ATA COMMAND PASS THROUGH (16) 86 ACCESS CONTROL IN 87 ACCESS CONTROL OUT 88 READ (16) 89 COMPARE AND WRITE 8A WRITE (16) 8B ORWRITE 8C READ ATTRIBUTE 8D WRITE ATTRIBUTE 8E WRITE AND VERIFY (16) 8F VERIFY (16) 90 PRE-FETCH (16) 91 SYNCHRONIZE CACHE (16) 91 SPACE (16) 92 LOCK UNLOCK CACHE (16) 93 WRITE SAME (16) 9E SERVICE ACTION IN (16) 9F SERVICE ACTION OUT (16) A0 REPORT LUNS A1 ATA COMMAND PASS THROUGH (12) A2 SECURITY PROTOCOL IN A2 SEND EVENT (not mentioned in T10 overview) A3 REPORT SUPPORTED OPCODES A4 MAINTENANCE (OUT) (REPORT_KEY) A5 MOVE MEDIUM A5 PLAY AUDIO 12 (not mentioned in T10 overview) A6 EXCHANGE MEDIUM A7 MOVE MEDIUM ATTACHED A8 READ (12) A9 SERVICE ACTION OUT (12) A9 AUDIO TRACK RELATIVE 12 (not mentioned in T10 overview) AA WRITE (12) AB SERVICE ACTION IN (12) AC ERASE (12) AD READ DVD STRUCTURE AE WRITE AND VERIFY (12) AF VERIFY (12) B0 SEARCH DATA HIGH (12) B1 SEARCH DATA EQUAL (12) B2 SEARCH DATA LOW (12) B3 SET LIMITS (12) B4 READ ELEMENT STATUS ATTACHED B5 SECURITY PROTOCOL OUT B6 SEND VOLUME TAG B7 READ DEFECT DATA (12) B8 READ ELEMENT STATUS B9 READ CD MSF BA REDUNDANCY GROUP (IN) BB REDUNDANCY GROUP (OUT) BC SPARE (IN) BC PLAY CD (not mentioned in T10 overview) BD SPARE (OUT) BE VOLUME SET (IN) BF VOLUME SET (OUT)
Table 2: SCSI Status Codes. 00h OK This status code indicates that the device has completed the task successfully.
02h Check condition When the target returns a Check Condition in response to a command it is indicating that it has entered a contingent allegiance condition. This means that an error occurred when it attempted to execute a SCSI command. The initiator usually then issues a SCSI Request Sense command in order to obtain a Key Code Qualifier (KCQ) from the target.
04h Condition Met This status code is returned on successful completion of a Search Data command or a Pre-fetch Command.
08h Busy The target returns Busy if it cannot complete a command at that time, for example if it is in the contingent allegiance condition.
10h Intermediate The target returns Intermediate when it successfully completes a linked command (except the last command).
14h Intermediate - Condition met As the name suggests, this status code is simply a combination of 10h Intermediate and 04h Condition Met.
18h Reservation conflict The target returns this status code if an initiator attempts to access a LUN that has been previously reserved by another initiator using the Reserve or Reserve Unit command.
22h Command terminated The target returns this status if the target has to terminate the current I/O process because it received a Terminate I/O Process message. This status code is obsolete as of SAM-3.
28h Queue (task set) full Similar to 08h Busy status, this status code is returned when the logical unit lacks the resources to accept a received task from an I_T nexus. This status code implies that the logical unit already has at least one task from that I_T nexus, where the 08h Busy status code implies that the logical unit currently has no tasks from that I_T nexus.
30h ACA active The target returns this status code if an auto-contingent allegiance condition has occurred.
40h Task aborted The target returns this status code when a task is aborted by another I_T nexus and the TAS bit in the Control mode page is set to one.
Table 3: SCSI Request Block (SRB) Status Codes
Returns the status of the completed request.
SRB status codes:
0x00 = SRB_STATUS_PENDING
0x01 = SRB_STATUS_SUCCESS
0x02 = SRB_STATUS_ABORTED
0x03 = SRB_STATUS_ABORT_FAILED
0x04 = SRB_STATUS_ERROR
0x05 = SRB_STATUS_BUSY
0x06 = SRB_STATUS_INVALID_REQUEST
0x07 = SRB_STATUS_INVALID_PATH_ID
0x08 = SRB_STATUS_NO_DEVICE
0x09 = SRB_STATUS_TIMEOUT
0x0A = SRB_STATUS_SELECTION_TIMEOUT
0x0B = SRB_STATUS_COMMAND_TIMEOUT
0x0D = SRB_STATUS_MESSAGE_REJECTED
0x0E = SRB_STATUS_BUS_RESET
0x0F = SRB_STATUS_PARITY_ERROR
0x10 = SRB_STATUS_REQUEST_SENSE_FAILED
0x11 = SRB_STATUS_NO_HBA
0x12 = SRB_STATUS_DATA_OVERRUN
0x13 = SRB_STATUS_UNEXPECTED_BUS_FREE
0x14 = SRB_STATUS_PHASE_SEQUENCE_FAILURE
0x15 = SRB_STATUS_BAD_SRB_BLOCK_LENGTH
0x16 = SRB_STATUS_REQUEST_FLUSHED
0x20 = SRB_STATUS_INVALID_LUN
0x21 = SRB_STATUS_INVALID_TARGET_ID
0x22 = SRB_STATUS_BAD_FUNCTION
0x23 = SRB_STATUS_ERROR_RECOVERY
0x24 = SRB_STATUS_NOT_POWERED
Known Issues: On some CPU models the processor timer may report invalid entries and display a very high value for the request duration, this is not a software issue. As a workaround for this specific CPUs, on could switch from the default timer to the HPET timer (iuf supported by the computer), but we recommend that for troubleshoot purposes you just ignore the entries. Example (times in millisecond): RequestDuration : 52 RequestDuration : 87 RequestDuration : 57 RequestDuration : 65 RequestDuration : 51 RequestDuration : 207 RequestDuration : 116 RequestDuration : 16777210 Note: Any entry above 8 minutes should be suspected as a probable processor time problem. RequestDuration : 67 RequestDuration : 114 RequestDuration : 61
this does not work, my etl is always 8K
Wayne,
make sure you are accessing a SCSI device.
For example, for IDE or ATA devices the port driver will be ATAPORT and you will get no result. An empty return file may also mean no IO has exceeded your threshold. Sometimes the IO may de split at a layer above the storport and complete really quick in the hardware while queued inside the operating system.