Finding busy disks with iostat

by | January 6, 2012

The iostat(1M) utility reports a range of I/O statistics that are useful for analyzing I/O workloads and troubleshooting performance problems. When investigating an I/O problem, I usually start with the number of reads and writes issued to each device per second, which iostat reports in the “r/s” and “w/s” columns:

$ iostat -zxnM 5

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0
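
To compare devices at a glance, the two columns can be added together into a single operations-per-second figure. The following is a minimal sketch, not part of iostat itself: the function name iops_per_device is made up for illustration, and it assumes the 11-column layout produced by the -xn flags shown above (r/s in field 1, w/s in field 2, device name in field 11).

```shell
# Hypothetical helper: print "device total_iops" for every data line.
# Assumes the 11-column "iostat -xn" layout (r/s, w/s, ..., device).
iops_per_device() {
    # Header lines are skipped because their first field is not numeric.
    awk 'NF == 11 && $1 ~ /^[0-9.]+$/ {
        printf "%s %.1f\n", $11, $1 + $2
    }'
}
```

It reads iostat output on standard input, for example: iostat -zxnM 5 | iops_per_device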

Once I know how many reads and writes are being issued, I like to find the number of megabytes read from and written to each device per second. This information is available in iostat’s “Mr/s” and “Mw/s” columns:

$ iostat -zxnM 5

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0
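
Dividing throughput by operation count gives a rough average I/O size, which helps tell many small requests apart from a few large ones. The sketch below is illustrative only: avg_io_size is a made-up name, the field positions assume the -xn layout shown above, and it assumes that the megabytes reported by -M are 1024 KB each.

```shell
# Hypothetical helper: print the average read and write size in KB.
# Assumes fields 1-4 are r/s, w/s, Mr/s, Mw/s and field 11 is the device,
# and that 1 MB = 1024 KB in iostat's -M output (an assumption).
avg_io_size() {
    awk 'NF == 11 && $1 ~ /^[0-9.]+$/ {
        # Guard against division by zero on idle read or write paths.
        rkb = ($1 > 0) ? $3 * 1024 / $1 : 0
        wkb = ($2 > 0) ? $4 * 1024 / $2 : 0
        printf "%s read=%.1fKB write=%.1fKB\n", $11, rkb, wkb
    }'
}
```

For the sample line above this works out to roughly 127 KB per read (10.6 * 1024 / 85.2) and 119 KB per write (2.6 * 1024 / 22.3).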

After reviewing these items, I like to check iostat’s “wait” value, which shows the average number of I/O requests queued and waiting for service on each device:

$ iostat -zxnM 5

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
   85.2   22.3   10.6    2.6  7.2  1.4   67.0   13.5  18  89 c0t0d0
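
On a system with many disks it can be easier to print only the devices whose wait queue exceeds some limit. A minimal sketch follows: the name queued_devices and the default limit of 10 are arbitrary choices for illustration, not recommended values, and the field positions assume the -xn layout shown above.

```shell
# Hypothetical helper: print devices whose "wait" column exceeds a limit.
# Usage: iostat -zxnM 5 | queued_devices 10
# The default limit of 10 is an arbitrary example, not a tuning guideline.
queued_devices() {
    awk -v limit="${1:-10}" 'NF == 11 && $1 ~ /^[0-9.]+$/ && ($5 + 0) > (limit + 0) {
        printf "%s wait=%s\n", $11, $5
    }'
}
```

A sensible limit depends on the storage behind the device, so it is worth watching the values on a healthy system before picking one.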

To see how these can be applied to a real problem, I captured the following data from device c0t0d0 a week or two back:

$ iostat -zxnM 5

                    extended device statistics
    r/s    w/s   Mr/s   Mw/s wait actv wsvc_t asvc_t  %w  %b device
    0.2   71.2    0.0    7.7 787.3  2.0 11026.8   28.0 100 100 c0t0d0

Device c0t0d0 was overloaded: on average about 787 I/O operations were waiting to be serviced, each spent roughly 11 seconds in the wait queue (wsvc_t is reported in milliseconds), and the disk was 100% busy (%b). This was causing application latency, since the application in question performed lots of reads and writes and its files were opened with O_SYNC. Once iostat returned the above statistics, I used ps(1) to find the processes that were causing the excessive disk activity, and kill(1) to terminate them.
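
The queue length and the wait time in that sample are consistent with each other: dividing the average queue length by the number of operations completed per second (Little's law) gives the average time an operation spends queued. For the line above, 787.3 / (0.2 + 71.2) is about 11.03 seconds, which matches the reported wsvc_t of 11026.8 ms. A small sketch of that sanity check, using a made-up function name and assuming the -xn field layout shown above:

```shell
# Hypothetical helper: estimate wsvc_t (ms) as wait / (r/s + w/s) * 1000
# and print it next to the value iostat reported in field 7.
estimate_wsvc_ms() {
    # Lines with no completed operations are skipped to avoid dividing by zero.
    awk 'NF == 11 && $1 ~ /^[0-9.]+$/ && ($1 + $2) > 0 {
        printf "%s est_wsvc_t=%.0fms reported=%sms\n", $11, $5 / ($1 + $2) * 1000, $7
    }'
}
```

Small differences between the two figures are expected, because the columns are rounded before they are printed.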

* The Solaris iostat utility was used to produce this output.
** The first report iostat prints contains averages since the system was booted, and should be ignored.
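
When iostat output is fed to a script, the since-boot report can be dropped automatically. The sketch below assumes that every report begins with an "extended device statistics" banner, as in the output shown above; the function name skip_first_report is made up for illustration.

```shell
# Hypothetical helper: discard the first (since-boot) iostat report.
# Counts the "extended device statistics" banners and passes lines
# through only once the second banner has been seen.
skip_first_report() {
    awk '/extended device statistics/ { n++ } n > 1'
}
```

For example: iostat -zxnM 5 | skip_first_report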