WHEN TO (AND NOT TO) USE RAID-Z

By user13278091 on mai 31, 2006

		WHEN TO (AND NOT TO) USE RAID-Z

RAID-Z is the technology  used by ZFS  to implement a data-protection  scheme
which is less  costly  than  mirroring  in  terms  of  block
overhead.

Here,  I'd  like  to go  over,    from a theoretical standpoint,   the
performance implication of using RAID-Z.   The goal of this technology
is to allow a storage subsystem to be able  to deliver the stored data
in  the face of one  or more disk   failures.  This is accomplished by
joining  multiple disks into  a  N-way RAID-Z  group. Multiple  RAID-Z
groups can be dynamically striped to form a larger storage pool.

To store file data onto  a RAID-Z group, ZFS  will spread a filesystem
(FS) block onto the N devices that make up the  group.  So for each FS
block,  (N - 1) devices  will  hold file  data  and 1 device will hold
parity  information.   This information  would eventually   be used to
reconstruct (or  resilver) data in the face  of any device failure. We
thus  have 1 / N  of the available disk  blocks that are used to store
the parity  information.   A 10-disk  RAID-Z group  has 9/10th of  the
blocks effectively available to applications.

A common alternative for data protection, is  the use of mirroring. In
this technology, a filesystem block is  stored onto 2 (or more) mirror
copies.  Here again,  the system will  survive single disk failure (or
more with N-way mirroring).  So 2-way mirror actually delivers similar
data-protection at   the expense of   providing applications access to
only one half of the disk blocks.

Now  let's look at this from  the performance angle in particular that
of  delivered filesystem  blocks  per second  (FSBPS).  A N-way RAID-Z
group  achieves it's protection  by spreading a  ZFS block  onto the N
underlying devices.  That means  that a single  ZFS block I/O must  be
converted to N device I/Os.  To be more precise,  in order to acces an
ZFS block, we need N device I/Os for Output and (N - 1) device I/Os for
input as the parity data need not generally be read-in.

Now after a request for a  ZFS block has been spread  this way, the IO
scheduling code will take control of all the device  IOs that needs to
be  issued.  At this  stage,  the ZFS  code  is capable of aggregating
adjacent  physical   I/Os  into   fewer ones.     Because of  the  ZFS
Copy-On-Write (COW) design, we   actually do expect this  reduction in
number of device level I/Os to work extremely well  for just about any
write intensive workloads.  We also expect  it to help streaming input
loads significantly.  The situation of random inputs is one that needs
special attention when considering RAID-Z.

Effectively,  as  a first approximation,  an  N-disk RAID-Z group will
behave as   a single   device in  terms  of  delivered    random input
IOPS. Thus  a 10-disk group of devices  each capable of 200-IOPS, will
globally act as a 200-IOPS capable RAID-Z group.  This is the price to
pay to achieve proper data  protection without  the 2X block  overhead
associated with mirroring.

With 2-way mirroring, each FS block output must  be sent to 2 devices.
Half of the available IOPS  are thus lost  to mirroring.  However, for
Inputs each side of a mirror can service read calls independently from
one another  since each  side   holds the full information.    Given a
proper software implementation that balances  the inputs between sides
of a mirror, the  FS blocks delivered by a  mirrored group is actually
no less than what a simple non-protected RAID-0 stripe would give.

So looking  at random access input  load, the number  of FS blocks per
second (FSBPS), Given N devices to be grouped  either in RAID-Z, 2-way
mirrored or simply striped  (a.k.a RAID-0, no  data protection !), the
equation would  be (where dev  represents   the capacity in  terms  of
blocks of IOPS of a single device):

					Random
		Blocks Available	FS Blocks / sec
		----------------	--------------
RAID-Z		(N - 1) \* dev		1 \* dev		
Mirror		(N / 2) \* dev		N \* dev		
Stripe		N \* dev			N \* dev		

Now lets take 100 disks of  100 GB, each each  capable of 200 IOPS and
look  at different  possible configurations;  In the   table below the
configuration labeled:

	"Z 5 x (19+1)"

refers to a dynamic striping of 5 RAID-Z groups, each group made of 20
disks (19 data disk + 1 parity). M refers to a 2-way mirror and S to a
simple dynamic stripe.

						Random
	 Config		Blocks Available	FS Blocks /sec
	 ------------	----------------	--------- 
	 Z 1  x (99+1) 	9900 GB		  	  200	  
	 Z 2  x (49+1)	9800 GB		  	  400	  
	 Z 5  x (19+1)	9500 GB			 1000	  
	 Z 10 x (9+1)	9000 GB			 2000	  
	 Z 20 x (4+1)	8000 GB			 4000	  
	 Z 33 x (2+1)	6600 GB			 6600	  

	 M  2 x (50) 	5000 GB			20000	  
	 S  1 x (100)   10000 GB		20000	  

So RAID-Z  gives you  at most 2X  the number  of blocks that mirroring
provides  but hits you  with  much fewer  delivered IOPS.  That  means
that, as the number of  devices in a  group N increases, the  expected
gain over mirroring (disk blocks)  is bounded (to  at most 2X) but the
expected cost  in IOPS is not  bounded (cost in  the range of [N/2, N]
fewer IOPS).  

Note  that for wide RAID-Z configurations,  ZFS takes into account the
sector  size of devices  (typically 512 Bytes)  and dynamically adjust
the effective number of columns in a stripe.  So even if you request a
99+1  configuration, the actual data  will probably be  stored on much
fewer data columns than that.   Hopefully this article will contribute
to steering deployments away from those types of configuration.

In conclusion, when preserving IOPS capacity is important, the size of
RAID-Z groups    should be restrained  to smaller   sizes and one must
accept some level of disk block overhead.

When performance matters most, mirroring should be highly favored.  If
mirroring  is considered too   costly but performance  is nevertheless
required, one could proceed like this:

	Given N devices each capable of X IOPS.

	Given a target of delivered  Y FS blocks per second
	for the storage pool.

	Build your storage using dynamically  striped RAID-Z groups of
	(Y / X) devices.

For instance: 

	Given 50 devices each capable of 200 IOPS.

	Given a target of delivered 1000 FS blocks per second
	for the storage pool.

	Build your storage using dynamically striped RAID-Z groups of
	(1000 / 200) = 5 devices.

In that system we then would have  20% block overhead lost to maintain
RAID-Z level parity.

RAID-Z is a great  technology not only  when disk blocks are your most
precious resources but also  when your available  IOPS far exceed your
expected needs.  But beware  that if you  get your hands on fewer very
large  disks, the IOPS capacity  can  easily become your most precious
resource. Under those conditions, mirroring should be strongly favored
or alternatively a  dynamic stripe of RAID-Z  groups each made up of a
small number of devices.