
Testing of Program RAID Arrays for NVMe Devices by SNIA Methodology

04.09.2018

Several months ago, while working on a regular scheduled project, our research lab team was studying NVMe drives and software in order to find the best way of assembling software arrays. The test results we obtained at the time were surprisingly confusing: the performance of the software in use did not come close to the huge potential of NVMe drives and their speed. Our developers did not like that at all and decided to create their own solution.

More than 10 companies worldwide already manufacture servers adapted for NVMe, and the market of products that support and build on this technology clearly has great potential. The G2M analytical report provides figures suggesting that this data protocol will be the leading one in the near future.

In this article we will share the results of our tests of the Intel hardware and of MDRAID and a zvol on ZFS RAIDZ2 software arrays, compared with our new product, which was afterwards named RAIDIX ERA by our marketing team. The main goal is to evaluate how well the existing software can manage innovative NVMe hardware.

The tests were arranged with the assistance of Promobit, our partner and a manufacturer of servers and data storage systems under the BITBLAZE trademark. The hardware platform came from Intel, one of the leading manufacturers of NVMe-compliant components. The testing was done according to the SNIA methodology.

 

Hardware configuration

An Intel® Server System R2224WFTZS, equipped with 2 sockets for Intel Xeon Scalable processors and 12 DDR4 memory channels (24 DIMMs in total) at frequencies up to 2666 MHz, was used as the base. Check the Intel website for more information about the platform. All NVMe drives were connected through 3 F2U8X25S3PHS backplanes; our system had 12 Intel SSDPD2MD800G4 NVMe drives with firmware CVEK6256004E1P0BGN. The server platform had 2 Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz processors with Hyper-Threading enabled, allowing 2 computing threads per core. As a result, we had 64 processing threads.
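
For reference, this topology can be confirmed with standard utilities. The commands below are illustrative only (they assume the lscpu and nvme-cli tools are installed) and are not part of the original test procedure:

    # Show sockets, cores and NUMA nodes (64 logical CPUs expected with Hyper-Threading on)
    lscpu

    # Enumerate the NVMe drives visible to the operating system (12 expected)
    nvme list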

 

Preparation for testing

All tests described in this note were performed according to the SNIA SSS PTSe v1.1 specification. SNIA allows the number of threads and the queue depth to be set, so we chose 64/32, keeping in mind that we had 64 computing threads on 32 cores. Every test was performed 16 times in order to smooth out random variation.

The storage was preconditioned in advance in order to get stable and clear results. We prepared the system before the start of the tests as follows (example commands follow the list):

  1. Installed Linux kernel 4.11 on top of CentOS 7.4.
  2. Turned off C-STATES and P-STATES.
  3. Ran the tuned-adm utility and set the latency-performance profile.
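
The tuned profile is applied with a single command. The note does not specify how exactly C-STATES and P-STATES were disabled (it is commonly done in BIOS or via kernel boot parameters), so the boot options below are only one possible way to do it:

    # Step 3: apply and verify the low-latency profile
    tuned-adm profile latency-performance
    tuned-adm active

    # Step 2 (one possible OS-side approach, if not done in BIOS):
    # add kernel boot parameters such as
    #   intel_idle.max_cstate=0 processor.max_cstate=1 intel_pstate=disable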

Testing of each product and element went through the following stages (an illustrative load-generator command follows the list):

  1. Preparation of the devices according to SNIA (load-dependent and load-independent).
  2. IOps tests with 4k, 8k, 16k, 32k, 64k, 128k and 1m blocks and read/write mixes of 0/100, 5/95, 35/65, 50/50, 65/35, 95/5 and 100/0.
  3. Latency tests with 4k, 8k and 16k blocks and read/write mixes of 0/100, 65/35 and 100/0, with 1 thread and a queue depth of 1. Results were recorded as average and maximum latency.
  4. Throughput tests with 128k and 1M blocks, using 64 queues with 8 commands each.
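
The note does not state which load generator was used. Purely as an illustration, a single point of the IOps matrix (4k blocks, 65/35 read/write mix, 64 jobs with a queue depth of 32) could be reproduced with an fio job like the one below; the device path is a placeholder:

    fio --name=snia_iops_4k_r65 \
        --filename=/dev/nvme0n1 \
        --ioengine=libaio --direct=1 \
        --rw=randrw --rwmixread=65 --bs=4k \
        --numjobs=64 --iodepth=32 \
        --time_based --runtime=60 --group_reporting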

We started with the hardware performance, latency, and throughput tests. This allowed us to evaluate the potential of the hardware and to compare it with the capabilities of the software under test.

 

Test 1. Hardware Tests

To start with, we decided to see what a single Intel DC D3700 NVMe drive is capable of.

Manufacturer’s specification reports the following performance parameters:

    • Random Read (100% Span) 450000 IOPS
    • Random Write (100% Span) 88000 IOPS

Test 1.1 Single NVMe drive IOps test

The results of the IOps test (IOps, by Read/Write mix %) are shown in the table below.

 Block size   R0%/W100%   R5%/W95%   R35%/W65%   R50%/W50%   R65%/W35%   R95%/W5%   R100%/W0%
 4k           84017.8     91393.8    117271.6    133059.4    175086.8    281131.2   390969.2
 8k           42602.6     45735.8    58980.2     67321.4     101357.2    171316.8   216551.4
 16k          21618.8     22834.8    29703.6     33821.2     52552.6     89731.2    108347
 32k          10929.4     11322      14787       16811       26577.6     47185.2    50670.8
 64k          5494.4      5671.6     7342.6      8285.8      13130.2     23884      27249.2
 128k         2748.4      2805.2     3617.8      4295.2      6506.6      11997.6    13631
 1m           351.6       354.8      451.2       684.8       830.2       1574.4     1702.8

 

The results we got were slightly lower than those in the manufacturer's specification, most likely because of NUMA (a memory architecture used in multiprocessor systems in which memory access time depends on the memory's location relative to the processor), but we are going to disregard that here.
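
One common way to reduce the NUMA effect (not something we claim was done in these tests) is to pin the load generator to the NUMA node that owns the PCIe slot of the drive under test, for example with numactl:

    numactl --hardware                                   # show the NUMA nodes in the system
    numactl --cpunodebind=0 --membind=0 <benchmark cmd>  # run the workload on the node local to the drive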

Test 1.2 Single NVMe Drive Latency Tests

Average response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           0.02719     0.072134    0.099402
 8k           0.029864    0.093092    0.121582
 16k          0.046726    0.137016    0.16405

 

Maximum response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           6.9856      4.7147      1.5098
 8k           7.0004      4.3118      1.4086
 16k          7.0068      4.6445      1.1064

 

Test 1.3 Throughput Test

The last stage of the first test was throughput evaluation. We got the following results:

1MB sequential write — 634 MBps

1MB sequential read — 1707 MBps

128KB sequential write — 620 MBps

128KB sequential read — 1704 MBps

Having finished with a single drive, we moved on to evaluating the whole platform of 12 drives.

Test 1.4 System of 12 Drives IOps Test

Here we decided to save some time and show the results for the 4k block only (at the moment it is the most widespread and representative scenario of performance evaluation).

 Block size   R0%/W100%   R5%/W95%   R35%/W65%   R50%/W50%   R65%/W35%   R95%/W5%   R100%/W0%
 4k           1363078.6   1562345    1944105     2047612     2176476     3441311    4202364

 

Test 1.5 System of 12 Drives. Throughput Tests

1MB seq write — 8612 MBps

1MB seq read — 20481 MBps

128KB seq write — 7500 MBps

128KB seq read — 20400 MBps

We will return to the hardware performance results at the end of this note, comparing them with the results of the software tests.

 

Test 2. MDRAID Tests

MDRAID is the first thing that comes to mind when talking about software arrays. Remember that this is the basic software RAID for Linux and is distributed for free.

Let's have a look at how MDRAID handles a 12-drive system in a RAID 0 array. You are absolutely right to think that RAID 0 across 12 drives is a bit much, but we need this level to demonstrate the maximum that MDRAID can deliver.
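
The note lists the exact build command only for RAID 6 (see Test 2.4 below); a RAID 0 array over the same 12 drives would be built along the same lines, with the chunk size here being our assumption:

    # hypothetical RAID 0 build over all 12 NVMe drives
    mdadm --create --verbose /dev/md0 --level=0 --chunk=16K \
          --raid-devices=12 /dev/nvme{0..11}n1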

Test 2.1 MDRAID. RAID 0. IOps Test

 Block size   R0%/W100%   R5%/W95%    R35%/W65%   R50%/W50%   R65%/W35%   R95%/W5%    R100%/W0%
 4k           1010396     1049306.6   1312401.4   1459698.6   175086.8    2692752.8   2963943.6
 8k           513627.8    527230.4    678140      771887.8    101357.2    1894547.8   2526853.2
 16k          261087.4    263638.8    343679.2    392655.2    52552.6     1034843.2   1288299.6
 32k          131198.6    130947.4    170846.6    216039.4    309028.2    527920.6    644774.6
 64k          65083.4     65099.2     85257.2     8285.8      154839.8    268425      322739
 128k         32550.2     32718.2     43378.6     4295.2      78935.8     136869.8    161015.4
 1m           3802        3718.4      3233.4      684.8       3546        6150.8      8193.2

 

Test 2.2 MDRAID. RAID 0. Latency Tests

Average response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           0.03015     0.067541    0.102942
 8k           0.03281     0.082132    0.126008
 16k          0.050058    0.114278    0.170798

 

Maximum response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           6.7042      3.7257      0.8568
 8k           6.5918      2.2601      0.9004
 16k          6.3466      2.7741      2.5678

 

Test 2.3 MDRAID. RAID 0. Throughput Tests

1MB sequential write — 7820 MBps

1MB sequential read — 20418 MBps

128KB sequential write — 7622 MBps

128KB sequential read — 20380 MBps

Test 2.4 MDRAID. RAID 6. IOps Test

Let's have a look at what the system can achieve at RAID 6 level.

Array build options: mdadm --create --verbose --chunk 16K /dev/md0 --level=6 --raid-devices=12 /dev/nvme0n1 /dev/nvme1n1 /dev/nvme2n1 /dev/nvme3n1 /dev/nvme4n1 /dev/nvme5n1 /dev/nvme8n1 /dev/nvme9n1 /dev/nvme10n1 /dev/nvme11n1 /dev/nvme6n1 /dev/nvme7n1

The total array capacity is 7450.87 GiB. We ran the test after the preliminary initialization of the RAID array had completed.
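
For reference, the progress of that initial synchronization and the final array layout can be checked with the standard md tools:

    cat /proc/mdstat          # shows resync progress for /dev/md0
    mdadm --detail /dev/md0   # shows level, chunk size and member devices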

 Block size   R0%/W100%   R5%/W95%   R35%/W65%   R50%/W50%   R65%/W35%   R95%/W5%   R100%/W0%
 4k           39907.6     42849      61609.8     78167.6     108594.6    641950.4   1902561.6
 8k           19474.4     20701.6    30316.4     39737.8     57051.6     394072.2   1875791.4
 16k          10371.4     10979.2    16022       20992.8     29955.6     225157.4   1267495.6
 32k          8505.6      8824.8     12896       16657.8     23823       173261.8   596857.8
 64k          5679.4      5931       8576.2      11137.2     15906.4     109469.6   320874.6
 128k         3976.8      4170.2     5974.2      7716.6      10996       68124.4    160453.2
 1m           768.8       811.2      1177.8      1515        2149.6      4880.4     5499

 

Test 2.5 MDRAID. RAID 6. Latency Tests

Average response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           0.193702    0.145565    0.10558
 8k           0.266582    0.186618    0.127142
 16k          0.426294    0.281667    0.169504

 

Maximum response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           6.1306      4.5416      4.2322
 8k           6.2474      4.5197      3.5898
 16k          5.4074      5.5861      4.1404

 

It is worth noting that MDRAID's latency results here were quite good.

Test 2.6 MDRAID. RAID 6. Throughput Tests

1MB sequential write — 890 MBps

1MB sequential read — 18800 MBps

128KB sequential write — 870 MBps

128KB sequential read — 10400 MBps

 

Test 3. ZVOL on ZFS RAIDZ2

ZFS has a built-in RAID capability and a built-in volume manager that creates virtual block devices, and both are used by many storage vendors. We will use these features too, creating a pool protected by RAIDZ2 (similar to RAID 6) and a virtual block volume on top of it.

We used ZFS on Linux version 0.7.9. Array and volume build options:

ashift=12 / compression=off / dedup=off / volblocksize=16k / atime=off / cachefile=none / RAIDZ2
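
As an illustration, commands along these lines would produce a pool and zvol with the listed options; the pool name, device list and volume size here are assumptions, not taken from the note:

    zpool create -o ashift=12 -o cachefile=none \
          -O compression=off -O dedup=off -O atime=off \
          tank raidz2 /dev/nvme{0..11}n1

    # virtual block volume (zvol) with a 16k block size on top of the pool
    zfs create -V 7450G -o volblocksize=16k tank/vol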

ZFS showed good results with a freshly created pool, but after repeated rewrites performance dropped significantly.

The SNIA approach is actually useful here: it makes it possible to see the real, steady-state results of this kind of copy-on-write design (which is what ZFS is built on) after multiple rewrites.

Test 3.1 ZVOL (ZFS). RAIDZ2. IOps Test

Performance results on a RAIDZ2 zvol (ZFS):

 Block size   R0%/W100%   R5%/W95%   R35%/W65%   R50%/W50%   R65%/W35%   R95%/W5%   R100%/W0%
 4k           15719.6     15147.2    14190.2     15592.4     17965.6     44832.2    76314.8
 8k           15536.2     14929.4    15140.8     16551       17898.8     44553.4    76187.4
 16k          16696.6     15937.2    15982.6     17350       18546.2     44895.4    75549.4
 32k          11859.6     10915      9698.2      10235.4     11265       26741.8    38167.2
 64k          7444        6440.2     6313.2      6578.2      7465.6      14145.8    19099
 128k         4425.4      3785.6     4059.8      3859.4      4246.4      7143.4     10052.6
 1m           772         779.6      779.6       784         824.4       995.8      1514.2

 

The performance figures obtained are quite unimpressive. A clean zvol (before rewrites) gives much better results (5-6 times higher); here the test shows that performance drops after the first overwrite.

Test 3.2 ZVOL (ZFS). RAIDZ2. Latency Tests

Average response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           0.332824    0.255225    0.218354
 8k           0.3299      0.259013    0.225514
 16k          0.139738    0.180467    0.233332

 

Maximum response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           90.55       69.9718     84.4018
 8k           91.6214     86.6109     104.7368
 16k          108.2192    86.2194     105.658

 

The maximum zvol latencies were much higher than those of MDRAID.

Test 3.3 ZVOL (ZFS). RAIDZ2. Throughput Tests

1MB sequential write — 1150 MBps

1MB sequential read — 5500 MBps

128KB sequential write — 1100 MBps

128KB sequential read — 5300 MBps

The zvol could provide higher throughput with a tuned volblocksize parameter, but we decided to run all tests with a single set of settings.
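
For example, a zvol intended mainly for large sequential I/O could be created with a larger volblocksize; the name and size below are hypothetical, and we did not do this in the tests above:

    zfs create -V 1T -o volblocksize=128k tank/vol_seq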

 

Test 4. RAIDIX ERA

Now let's have a look at the tests of our new product, RAIDIX ERA.

We created a RAID 6 array with a 16 KB stripe size and ran the tests once initialization was complete.

Test 4.1 RAIDIX ERA. RAID 6. IOps Test

 Block size   R0%/W100%   R5%/W95%   R35%/W65%   R50%/W50%   R65%/W35%   R95%/W5%    R100%/W0%
 4k           354887      363830     486865.6    619349.4    921403.6    2202384.8   4073187.8
 8k           180914.8    185371     249927.2    320438.8    520188.4    1413096.4   2510729
 16k          92115.8     96327.2    130661.2    169247.4    275446.6    763307.4    1278465
 32k          59994.2     61765.2    83512.8     116562.2    167028.8    420216.4    640418.8
 64k          27660.4     28229.8    38687.6     56603.8     76976       214958.8    299137.8
 128k         14475.8     14730      20674.2     30358.8     40259       109258.2    160141.8
 1m           2892.8      3031.8     4032.8      6331.6      7514.8      15871       19078

 

Test 4.2 RAIDIX ERA. RAID 6. Latency Tests

Average response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           0.16334     0.136397    0.10958
 8k           0.207056    0.163325    0.132586
 16k          0.313774    0.225767    0.182928

 

Maximum response time (ms). Read / Write Mix %.

 Block size   R0%/W100%   R65%/W35%   R100%/W0%
 4k           5.371       3.4244      3.5438
 8k           5.243       3.7415      3.5414
 16k          7.628       4.2891      4.0562

 

The latency is similar to what MDRAID showed. To obtain more accurate figures, latency would need to be evaluated under a heavier load.

Test 4.3 RAIDIX ERA. RAID 6. Throughput Tests

1MB sequential write — 8160 MBps

1MB sequential read — 19700 MBps

128KB sequential write — 6200 MBps

128KB sequential read — 19700 MBps

 

In The End

Finally, let's compare the figures from the software tests with the results from the hardware platform.

To analyze random-load performance, we compare the RAID 6 (RAIDZ2) results for 4k blocks.

 IOps             MD RAID 6   RAIDZ2   RAIDIX ERA RAID 6   Hardware
 4k R100% / W0%   1902561     76314    4073187             4494142
 4k R65% / W35%   108594      17965    921403              1823432
 4k R0% / W100%   39907       15719    354887              958054

 

To analyze sequential-load performance, let's have a look at RAID 6 (RAIDZ2) with 128k block I/O. We used a 10 GB offset between the streams to exclude the cache and show the real performance (a hypothetical fio job illustrating this follows the table).

 MBps              MD RAID 6   RAIDZ2   RAIDIX ERA RAID 6   Hardware
 128k seq read     10400       5300     19700               20400
 128k seq write    870         1100     6200                7500
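
As with the IOps tests, the load generator is not named in the note. Purely as an illustration, the 128k sequential read case with a 10 GB shift between streams could look like the following fio job, with the device path a placeholder:

    fio --name=seq_read_128k --filename=/dev/md0 \
        --ioengine=libaio --direct=1 \
        --rw=read --bs=128k --numjobs=64 --iodepth=8 \
        --offset_increment=10G \
        --time_based --runtime=60 --group_reporting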

 

 

Conclusion

Popular and affordable software RAID arrays for NVMe devices cannot deliver performance adequate to the hardware potential. There is a clear need for software that lets NVMe drives be as productive and flexible as possible.

RAIDIX ERA is focusing on the following tasks:

  • High read and write performance (millions of IOps) on parity arrays and under mixed load
  • Streaming performance starting from 30 GBps, including during failure and recovery
  • Support for RAID levels 5, 6 and 7.3
  • Background initialization and reconstruction
  • Flexible, user-adjustable settings for different types of load

At the moment we can say that these goals have been met and the product is ready to use. Recognizing the demand, we have also prepared a free license release, which can rightfully be used for both NVMe and SSD drives.