WHEN TO (AND NOT TO) USE RAID-Z
RAID-Z is the technology used by ZFS to implement a data-protection scheme
which is less costly than mirroring in terms of block
overhead.
Here, I'd like to go over, from a theoretical standpoint, the
performance implications of using RAID-Z. The goal of this technology
is to allow a storage subsystem to keep delivering the stored data
in the face of one or more disk failures. This is accomplished by
joining multiple disks into an N-way RAID-Z group. Multiple RAID-Z
groups can be dynamically striped to form a larger storage pool.
To store file data onto a RAID-Z group, ZFS will spread a filesystem
(FS) block onto the N devices that make up the group. So for each FS
block, (N - 1) devices will hold file data and 1 device will hold
parity information. That parity is what later allows data to be
reconstructed (or resilvered) after a device failure. Thus 1/N of
the available disk blocks are used to store parity information: a
10-disk RAID-Z group has 9/10th of its blocks effectively available
to applications.
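For concreteness, that space arithmetic can be sketched in a couple of
lines of Python (purely illustrative, not ZFS code; the function name
is mine):

    # Fraction of an N-way RAID-Z group's blocks left for file data,
    # assuming one parity column per FS block as described above.
    def raidz_usable_fraction(n: int) -> float:
        return (n - 1) / n

    print(raidz_usable_fraction(10))   # 0.9 -> 9/10th of the blocks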
A common alternative for data protection is the use of mirroring. In
this technology, a filesystem block is stored onto 2 (or more) mirror
copies. Here again, the system will survive a single disk failure (or
more with N-way mirroring). So a 2-way mirror actually delivers similar
data protection, at the expense of providing applications access to
only one half of the disk blocks.
Now let's look at this from the performance angle, in particular that
of delivered filesystem blocks per second (FSBPS). An N-way RAID-Z
group achieves its protection by spreading a ZFS block onto the N
underlying devices. That means that a single ZFS block I/O must be
converted into N device I/Os. To be more precise, in order to access a
ZFS block, we need N device I/Os for output and (N - 1) device I/Os for
input, since the parity data need not generally be read in.
Now, after a request for a ZFS block has been spread this way, the I/O
scheduling code will take control of all the device I/Os that need to
be issued. At this stage, the ZFS code is capable of aggregating
adjacent physical I/Os into fewer ones. Because of the ZFS
Copy-On-Write (COW) design, we actually do expect this reduction in
the number of device-level I/Os to work extremely well for just about
any write-intensive workload. We also expect it to help streaming input
loads significantly. The situation of random inputs is one that needs
special attention when considering RAID-Z.
Effectively, as a first approximation, an N-disk RAID-Z group will
behave as a single device in terms of delivered random input
IOPS. Thus a 10-disk group of devices, each capable of 200 IOPS, will
globally act as a single 200-IOPS-capable RAID-Z group. This is the price to
pay to achieve proper data protection without the 2X block overhead
associated with mirroring.
With 2-way mirroring, each FS block output must be sent to 2 devices.
Half of the available IOPS are thus lost to mirroring. However, for
inputs, each side of a mirror can service read calls independently of
the other, since each side holds the full information. Given a
proper software implementation that balances the inputs between the
sides of a mirror, the FS blocks per second delivered by a mirrored
group are actually no fewer than what a simple non-protected RAID-0
stripe would give.
So, looking at a random-access input load and the number of FS blocks
per second (FSBPS) it delivers: given N devices to be grouped either in
RAID-Z, 2-way mirrored, or simply striped (a.k.a. RAID-0, no data
protection!), the comparison works out as follows (where dev represents
the capacity of a single device, in blocks for the first column and in
IOPS for the second):
                Blocks Available      Random FS Blocks / sec
                ----------------      ----------------------
    RAID-Z      (N - 1) * dev         1 * dev
    Mirror      (N / 2) * dev         N * dev
    Stripe      N * dev               N * dev
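For readers who prefer code to tables, here is a minimal Python sketch
of the same back-of-the-envelope model (the function and parameter
names are mine, not ZFS code; dev_blocks and dev_iops stand for the
per-device capacity written as "dev" above):

    # Rough model of the table above for N grouped devices.
    def raidz(n, dev_blocks, dev_iops):
        # One FS block spans all N devices (N - 1 data + 1 parity), so a
        # random read of one FS block keeps the whole group busy.
        return (n - 1) * dev_blocks, 1 * dev_iops

    def mirror(n, dev_blocks, dev_iops):
        # Every block is stored twice, but either copy can service a read,
        # so random reads scale with all N devices.
        return (n // 2) * dev_blocks, n * dev_iops

    def stripe(n, dev_blocks, dev_iops):
        # RAID-0: no protection, full capacity and full read IOPS.
        return n * dev_blocks, n * dev_iops

    # Example: 10 devices, 1000 blocks and 200 IOPS each.
    print(raidz(10, 1000, 200))    # (9000, 200)
    print(mirror(10, 1000, 200))   # (5000, 2000)
    print(stripe(10, 1000, 200))   # (10000, 2000)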
Now let's take 100 disks of 100 GB, each capable of 200 IOPS, and
look at different possible configurations. In the table below, the
configuration labeled

    "Z 5 x (19+1)"

refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disks + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.
    Config             Blocks Available     Random FS Blocks / sec
    -------------      ----------------     ----------------------
    Z  1 x (99+1)          9900 GB                   200
    Z  2 x (49+1)          9800 GB                   400
    Z  5 x (19+1)          9500 GB                  1000
    Z 10 x  (9+1)          9000 GB                  2000
    Z 20 x  (4+1)          8000 GB                  4000
    Z 33 x  (2+1)          6600 GB                  6600
    M  2 x   (50)          5000 GB                 20000
    S  1 x  (100)         10000 GB                 20000
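As a sanity check, the RAID-Z and mirror rows can be recomputed with a
few lines of Python (illustrative only, using the 100 disks x 100 GB x
200 IOPS assumed above):

    DISKS, DISK_GB, DISK_IOPS = 100, 100, 200

    def raidz_pool(groups, data_disks_per_group):
        # 'groups' dynamically striped RAID-Z groups of (data + 1 parity) disks.
        blocks_gb = groups * data_disks_per_group * DISK_GB
        random_fsbps = groups * DISK_IOPS   # each group ~ one device of random IOPS
        return blocks_gb, random_fsbps

    print(raidz_pool(5, 19))    # (9500, 1000)  -> "Z  5 x (19+1)"
    print(raidz_pool(10, 9))    # (9000, 2000)  -> "Z 10 x  (9+1)"
    print(raidz_pool(33, 2))    # (6600, 6600)  -> "Z 33 x  (2+1)"

    # 2-way mirror of all 100 disks: half the space, all the read IOPS.
    print((DISKS // 2) * DISK_GB, DISKS * DISK_IOPS)    # 5000 20000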
So RAID-Z gives you at most 2X the number of blocks that mirroring
provides, but at the cost of far fewer delivered IOPS. That means
that, as the number of devices in a group N increases, the expected
gain over mirroring (disk blocks) is bounded (to at most 2X) but the
expected cost in IOPS is not bounded (cost in the range of [N/2, N]
fewer IOPS).
Note that for wide RAID-Z configurations, ZFS takes into account the
sector size of the devices (typically 512 bytes) and dynamically adjusts
the effective number of columns in a stripe. So even if you request a
99+1 configuration, the actual data will probably be stored on far
fewer data columns than that. Hopefully this article will contribute
to steering deployments away from those types of configurations.
In conclusion, when preserving IOPS capacity is important, RAID-Z
groups should be kept to smaller sizes, and one must accept some level
of disk block overhead.
When performance matters most, mirroring should be highly favored. If
mirroring is considered too costly but performance is nevertheless
required, one could proceed like this:
Given N devices each capable of X IOPS.
Given a target of delivered Y FS blocks per second
for the storage pool.
Build your storage using dynamically striped RAID-Z groups of
(N * X) / Y devices.
For instance:
Given 50 devices each capable of 200 IOPS.
Given a target of delivered 1000 FS blocks per second
for the storage pool.
Build your storage using dynamically striped RAID-Z groups of
(50 * 200) / 1000 = 10 devices.
Each of the 5 resulting groups delivers roughly 200 random-read FS
blocks per second, meeting the 1000 FSBPS target, and we then have 10%
block overhead lost to maintain RAID-Z-level parity.
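That sizing rule, in sketch form (again just illustrative Python; the
names are mine):

    def raidz_group_width(n_devices, dev_iops, target_fsbps):
        # Each RAID-Z group behaves like a single device for random reads,
        # so we need about target_fsbps / dev_iops groups; spreading the
        # devices across that many groups gives the width below.
        width = (n_devices * dev_iops) // target_fsbps
        return max(width, 2)    # at least 1 data + 1 parity device

    width = raidz_group_width(n_devices=50, dev_iops=200, target_fsbps=1000)
    print(width)        # 10 -> 5 groups of 9 data + 1 parity disks
    print(1 / width)    # 0.1 -> 10% of the blocks go to parity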
RAID-Z is a great technology not only when disk blocks are your most
precious resource but also when your available IOPS far exceed your
expected needs. But beware that if you get your hands on fewer, very
large disks, IOPS capacity can easily become your most precious
resource. Under those conditions, mirroring should be strongly favored,
or alternatively a dynamic stripe of RAID-Z groups, each made up of a
small number of devices.