How to look at Storport ETW Trace logs?

What are Storport Trace Logs?
It is a new feature implemented in the STORPORT.SYS binaries released on KB KB 979764 for Windows Server 2008 and KB 978000 for Windows Server 2008 R2. It is always available on Windows Server 2008 R2 SP1 and Windows Server 2012.
It provides a measure of time delay (latency) for requests made to a system’s disk unit. The measurements are taken form the Storport.sys driver, the lowest Microsoft component in the Storage Stack. This means It only includes the latency for the components BELOW the Storport driver, normally they only include:
The Storage Adapter Miniport Driver (Hardware device driver)
The hardware layers (Storage Controller, Cables, Disk, etc)

Why is that useful?
The most common use is to determine how much of the IO latency is happening below the Microsoft components. This is very useful to diagnose hardware problems.

How to obtain a Storport trace?
For Windows Server 2008 and 2008 R2 follow the instructions from Bob Golding on the blog below.
Storport ETW Logging to Measure Requests Made to a Disk Unit
http://blogs.msdn.com/b/ntdebugging/archive/2010/04/22/etw-storport.aspx

(Update) For Windows Server 2012 and Windows 8 follow the instructions from Paul Reynolds on the blog below.
Tracing with Storport in Windows 2012 and Windows 8
http://blogs.technet.com/b/askcore/archive/2013/01/12/tracing-with-storport-in-windows-2012-and-windows-8.aspx

How to look at Storport ETW Trace logs?

Pre-work
You should have the .ETL file in your hands by now, make sure it is larger than 8K or chances are it is an empty file. A simple way to confirm it has data is to open the ETL on event viewer, but event viewer won’t let you manipulate the data, this is a problem, especially when there are thousands of events.

Step 1. Export the ETL to a CSV. This will enable you to manipulate the data on excel and make pretty graphs (everyone loves graphs).

How to export the Storport Trace ETL to a CSV:

  1. Get your storport trace ETL file (http://blogs.msdn.com/b/ntdebugging/archive/2010/04/22/etw-storport.aspx)
  2. You should have the latest version of xperf on your computer. It comes with Windows Performance Toolkit 5.
    1. Open an elevated command prompt and run “xperf -i mystorporttrace.ETL -o myoutputfile.csv
  3. You will get a CSV file with multiple columns including:
    *PortNumber
    *PathID
    *TargetID
    *LUN
    *RequestDuration (in milliseconds)
    *Command
    *ScsiStatus
     *SrbStatus

 

Step 2. Look at the CSV file:

  1. Open the file in excel
    1. The columns are self-explanatory, but you want to pay special attention to the “RequestDuration” and the “Command” columns.
    2. RequestDuration shows the measured delay.
    3. Command shows the type of scsi command.
    4. Order the list to see the largest delays. You should be interested in the types of scsi commands are being delayed, their SCSI Status and SRB Status condes.

 

Step 3. Decode the SCSI commands to learn what the system is doing:

  1. Know that Storport Trace logs all SCSI commands in decimal values.
  2. The tables below (and every other table you will ever find) lists all SCSI commands in Hexadecimal, you need to convert to decimal to match the Storport Log.
    1. Example: PERSISTENT RESERVE IN: 5Eh = shows in storport trace as 94

b.      You may see vendor specific commands that are not part of the standard. In that case you may need to find the vendor specific reference.

Step 4. Look for a pattern that allows you to take action.

You may learn only Persistent Reservation commands are taking a very long time, this may be a problem with the storage.
You may learn only reads are taking a very long time and that is most likely due to saturated disks.
You may find writes are taking a long time and this may indicate hardware write cache is disabled.
You may see only a certain LUN, port number or path is having trouble. Then you draw your conclusions from this point on, maybe a bad cable.
Also mind the SCSI Status and SRB. SCSI Status = 0 means the command completed successfully, this is the most common status.

Below you will find 3 quick reference tables:
Table 1: SCSI Command Codes
Table 2: SCSI Status Codes
Table 3: SCSI Request Block (SRB) Codes.

For additional information on SCSI commands, refer to the following documents from T10 Technical Committee (www.T10.org)

SCSI-3 Block Commands (SBC)

SCSI Primary Commands - 3 (SPC-3)

SCSI Architecture Model - 3 (SAM-3)

PS: Thank you Tate Calhoun for the help with xperf processing tips and James Burrage for the quick reference tables.

 

Table1: SCSI Command Codes 

00 TEST UNIT READY (mandatory)
01 RECALIBRATE or REWIND
03 REQUEST SENSE (mandatory except RBC)
04 FORMAT UNIT
05 READ BLOCK LIMITS
07 REASSIGN BLOCKS
07 INITIALIZE ELEMENT STATUS
08 READ (6)
0A WRITE (6)
0B SEEK (6)
0F READ REVERSE (6)
10 WRITE FILEMARKS (6)
11 SPACE (6)
12 INQUIRY (mandatory)
13 VERIFY(6)
14 RECOVER BUFFERED DATA
15 MODE SELECT (6)
16 RESERVE (6)
17 RELEASE (6)
18 COPY
19 ERASE (6)
1A MODE SENSE (6)
1B START/STOP UNIT
1B LOAD UNLOAD
1C RECEIVE DIAGNOSTIC RESULTS
1D SEND DIAGNOSTIC
1E PREVENT/ALLOW MEDIUM REMOVAL
23 READ FORMAT CAPACITIES (MMC)
24 SET WINDOW
25 READ CAPACITY (10)
28 READ (10)
29 READ GENERATION
2A WRITE (10)
2B SEEK (10)
2C ERASE (10)
2D READ UPDATED BLOCK
2E WRITE AND VERIFY (10)
2F VERIFY (10)
30 SEARCH DATA HIGH (10)
31 SEARCH DATA EQUAL (10)
32 SEARCH DATA LOW (10)
33 SET LIMITS (10)
34 PRE-FETCH (10)
35 SYNCHRONIZE CACHE (10)
36 LOCK/UNLOCK CACHE (10)
37 READ DEFECT DATA (10)
37 INITIALIZE ELEMENT STATUS WITH RANGE
38 MEDIUM SCAN
39 COMPARE
3A COPY AND VERIFY
3B WRITE BUFFER
3C READ BUFFER
3D UPDATE BLOCK
3E READ LONG
3F WRITE LONG
40 CHANGE DEFINITION
41 WRITE SAME (10)
44 REPORT DENSITY SUPPORT
45 PLAY AUDIO (10)
46 GET CONFIGURATION
47 PLAY AUDIO MSF
48 AUDIO TRACK INDEX (not mentioned in T10 overview)
49 AUDIO TRACK RELATIVE 10 (not mentioned in T10 overview)
4A GET EVENT STATUS NOTIFICATION
4B PAUSE / RESUME
4C LOG SELECT
4D LOG SENSE
50 XDWRITE (10)
51 XPWRITE (10) /*READ DISC INFORMATION CDB (in mmc5r02c 6.22.2)*/
52 XDREAD (10)
53 XDWRITEREAD (10)
54 SEND OPC INFORMATION
55 MODE SELECT (10)
56 RESERVE (10)
57 RELEASE (10)
58 REPAIR TRACK
5A MODE SENSE (10)
5B CLOSE TRACK / SESSION
5C READ BUFFER CAPACITY
5D SEND CUE SHEET
5E PERSISTENT RESERVE IN
5F PERSISTENT RESERVE OUT
7E EXTENDED CDB
7F VARIABLE LENGTH CDB
80 XDWRITE EXTENDED (16)
80 WRITE FILEMARKS (16)
81 REBUILD (16)
81 READ REVERSE (16)
82 REGENERATE (16)
83 EXTENDED COPY
84 RECEIVE COPY RESULTS
85 ATA COMMAND PASS THROUGH (16)
86 ACCESS CONTROL IN
87 ACCESS CONTROL OUT
88 READ (16)
89 COMPARE AND WRITE
8A WRITE (16)
8B ORWRITE
8C READ ATTRIBUTE
8D WRITE ATTRIBUTE
8E WRITE AND VERIFY (16)
8F VERIFY (16)
90 PRE-FETCH (16)
91 SYNCHRONIZE CACHE (16)
91 SPACE (16)
92 LOCK UNLOCK CACHE (16)
93 WRITE SAME (16)
9E SERVICE ACTION IN (16)
9F SERVICE ACTION OUT (16)
A0 REPORT LUNS
A1 ATA COMMAND PASS THROUGH (12)
A2 SECURITY PROTOCOL IN
A2 SEND EVENT (not mentioned in T10 overview)
A3 REPORT SUPPORTED OPCODES
A4 MAINTENANCE (OUT) (REPORT_KEY)
A5 MOVE MEDIUM
A5 PLAY AUDIO 12 (not mentioned in T10 overview)
A6 EXCHANGE MEDIUM
A7 MOVE MEDIUM ATTACHED
A8 READ (12)
A9 SERVICE ACTION OUT (12)
A9 AUDIO TRACK RELATIVE 12 (not mentioned in T10 overview)
AA WRITE (12)
AB SERVICE ACTION IN (12)
AC ERASE (12)
AD READ DVD STRUCTURE
AE WRITE AND VERIFY (12)
AF VERIFY (12)
B0 SEARCH DATA HIGH (12)
B1 SEARCH DATA EQUAL (12)
B2 SEARCH DATA LOW (12)
B3 SET LIMITS (12)
B4 READ ELEMENT STATUS ATTACHED
B5 SECURITY PROTOCOL OUT
B6 SEND VOLUME TAG
B7 READ DEFECT DATA (12)
B8 READ ELEMENT STATUS
B9 READ CD MSF
BA REDUNDANCY GROUP (IN)
BB REDUNDANCY GROUP (OUT)
BC SPARE (IN)
BC PLAY CD (not mentioned in T10 overview)
BD SPARE (OUT)
BE VOLUME SET (IN)
BF VOLUME SET (OUT)

Table 2: SCSI Status Codes.
00h OK
This status code indicates that the device has completed the task successfully.

02h Check condition
When the target returns a Check Condition in response to a command it is indicating that it has entered a contingent allegiance condition. This means that an error occurred when it attempted to execute a SCSI command. The initiator usually then issues a SCSI Request Sense command in order to obtain a Key Code Qualifier (KCQ) from the target.

04h Condition Met
This status code is returned on successful completion of a Search Data command or a Pre-fetch Command.

08h Busy
The target returns Busy if it cannot complete a command at that time, for example if it is in the contingent allegiance condition.

10h Intermediate
The target returns Intermediate when it successfully completes a linked command (except the last command).

14h Intermediate - Condition met
As the name suggests, this status code is simply a combination of 10h Intermediate and 04h Condition Met.

18h Reservation conflict
The target returns this status code if an initiator attempts to access a LUN that has been previously reserved by another initiator using the Reserve or Reserve Unit command.

22h Command terminated
The target returns this status if the target has to terminate the current I/O process because it received a Terminate I/O Process message. This status code is obsolete as of SAM-3.

28h Queue (task set) full
Similar to 08h Busy status, this status code is returned when the logical unit lacks the resources to accept a received task from an I_T nexus. This status code implies that the logical unit already has at least one task from that I_T nexus, where the 08h Busy status code implies that the logical unit currently has no tasks from that I_T nexus.

30h ACA active
The target returns this status code if an auto-contingent allegiance condition has occurred.

40h Task aborted
The target returns this status code when a task is aborted by another I_T nexus and the TAS bit in the Control mode page is set to one.

Table 3: SCSI Request Block (SRB) Status Codes

Returns the status of the completed request.

SRB status codes:

0x00 = SRB_STATUS_PENDING

0x01 = SRB_STATUS_SUCCESS

0x02 = SRB_STATUS_ABORTED

0x03 = SRB_STATUS_ABORT_FAILED

0x04 = SRB_STATUS_ERROR

0x05 = SRB_STATUS_BUSY

0x06 = SRB_STATUS_INVALID_REQUEST

0x07 = SRB_STATUS_INVALID_PATH_ID

0x08 = SRB_STATUS_NO_DEVICE

0x09 = SRB_STATUS_TIMEOUT

0x0A = SRB_STATUS_SELECTION_TIMEOUT

0x0B = SRB_STATUS_COMMAND_TIMEOUT

0x0D = SRB_STATUS_MESSAGE_REJECTED

0x0E = SRB_STATUS_BUS_RESET

0x0F = SRB_STATUS_PARITY_ERROR

0x10 = SRB_STATUS_REQUEST_SENSE_FAILED

0x11 = SRB_STATUS_NO_HBA

0x12 = SRB_STATUS_DATA_OVERRUN

0x13 = SRB_STATUS_UNEXPECTED_BUS_FREE

0x14 = SRB_STATUS_PHASE_SEQUENCE_FAILURE

0x15 = SRB_STATUS_BAD_SRB_BLOCK_LENGTH

0x16 = SRB_STATUS_REQUEST_FLUSHED

0x20 = SRB_STATUS_INVALID_LUN

0x21 = SRB_STATUS_INVALID_TARGET_ID

0x22 = SRB_STATUS_BAD_FUNCTION

0x23 = SRB_STATUS_ERROR_RECOVERY

0x24 = SRB_STATUS_NOT_POWERED

 
Known Issues:
On some CPU models the processor timer may report invalid entries and display a very high value for the request duration, this is not a software issue. As a workaround for this specific CPUs, on could switch from the default timer to the HPET timer (iuf supported by the computer), but we recommend that for troubleshoot purposes you just ignore the entries.

Example (times in millisecond):
RequestDuration : 52
RequestDuration : 87
RequestDuration : 57 
RequestDuration : 65
RequestDuration : 51
RequestDuration : 207
RequestDuration : 116
RequestDuration : 16777210 Note: Any entry above 8 minutes should be suspected as a probable processor time problem.
RequestDuration : 67
RequestDuration : 114
RequestDuration : 61