Micro… Supermicro!

03.11.2021

We were once asked to test a Supermicro platform based on the E1.S form factor, and, as luck would have it, we had just such servers available in a laboratory in the Netherlands. So we ran the tests and are now ready to tell you how it went.
But before we dive into the tests, let us remind you what these E1.S drives are and what they are made for. So let's start with the basics.

How the EDSFF appeared

As the market moved away from outdated data access methods, NVMe SSD manufacturers turned their attention to the form factor. This is how the new EDSFF standard appeared, which, according to its developers at Intel, is better adapted to the data center environment. EDSFF (which stands for Enterprise and Data Center SSD Form Factor, and is known to many as the Ruler) was invented to do just that: to minimize the total cost of Flash storage at data center scale, following the principle ‘less space, more drives per 1U (up to 1PB), lower costs’.

But, of course, something had to be sacrificed with this approach: per-drive performance, for instance.

Supermicro servers support two types of drives: long and short. These form factors are described in the SNIA specifications:

  • ‘short’ — SFF-TA-1006;
  • ‘long’ — SFF-TA-1007.

In addition to a higher density of TB, IOps and GBps, drives from the EDSFF family can significantly reduce power consumption: in their marketing materials, Intel and Supermicro claim savings of dozens of percent compared to U.2.

How did our tests go?

So, our colleagues from Supermicro installed an SSG-1029P-NES32R server in that Dutch laboratory. The server is positioned as hardware for databases, IOps-intensive applications and HPC infrastructures. It is based on the X11DSF-E motherboard with two sockets for second-generation Intel Xeon Scalable processors. In our case there were two Intel® Xeon® Gold 6252 processors, eight 32GB DDR4-2933 memory modules and 32 Intel® SSD DC P4511 Series drives. The platform, by the way, also supports Intel® Optane™ DCPMM.

For communicating with the ‘outside world’, there were these interfaces:

  • 2x PCI-E 3.0 x16 (FHHL) slots,
  • 1x PCI-E 3.0 x4 (LP) slot.

We will not list the rest of the technical characteristics as all the information is available on the vendor’s website. It is better to pay attention to the configuration and the results.

It is worth mentioning that Supermicro supplies such platforms only fully assembled. As a vendor, we can understand this policy, but as a potential buyer, we are not exactly excited 🙂

FIO configuration

[global]
filename=/dev/era_dimec
ioengine=libaio
direct=1
group_reporting=1
runtime=900
norandommap
random_generator=tausworthe64

[seq_read]
stonewall
rw=read
bs=1024k
offset_increment=10%
numjobs=8

[seq_write]
stonewall
rw=write
bs=1024k
offset_increment=10%
numjobs=8

[rand_read]
stonewall
rw=randread
bs=4k
numjobs=48

[rand_write]
stonewall
rw=randwrite
bs=4k
numjobs=48

The iodepth was varied by a wrapper script.
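A minimal sketch of what such a wrapper might look like (it assumes the job file above is saved as fio.job and that a line iodepth=${IODEPTH} is added to its [global] section; both of these are our illustrative choices, not part of the original setup):

#!/bin/bash
# run every fio section at several queue depths and save JSON results
for qd in 1 2 4 8 16 32 64 128; do
    for section in seq_read seq_write rand_read rand_write; do
        IODEPTH="$qd" fio --section="$section" \
            --output-format=json --output="${section}_qd${qd}.json" fio.job
    done
done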

System configuration

OS: Ubuntu Server 20.04.3 LTS

Kernel: 5.11.0-34-generic

raidix-era-3.3.1-321-dkms-ubuntu-20.04-kver-5.11

BOOT_IMAGE=/vmlinuz-5.11.0-34-generic root=/dev/mapper/ubuntu--vg-ubuntu--lv ro noibrs noibpb nopti nospectre_v2 nospectre_v1 l1tf=off nospec_store_bypass_disable no_stf_barrier mds=off tsx_async_abort=off mitigations=off

So, we had 32 Intel® SSD DC P4511 installed in our system.
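Before benchmarking, it does not hurt to confirm that the OS actually sees what you think it sees; standard tools are enough for that (the grep patterns are simply one way to count the devices):

cat /proc/cmdline                      # check that the mitigation-related boot parameters took effect
nvme list | grep -c '^/dev/nvme'       # should report 32 data drives (plus any boot NVMe, if present)
lspci | grep -ic 'non-volatile memory controller'   # NVMe controllers visible on the PCIe bus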

As we always suggest, you need to do the math first and ask yourself: how much performance can I get in theory? (We run the numbers right after the spec list below.)

According to the specification, the capabilities of each drive are as follows:

  • maximum sequential read speed — 2800 MB/s;
  • maximum sequential write speed — 2400 MB/s;
  • random read speed — 610 200 IOps (4K blocks);
  • random write speed — 75 000 IOps (4K blocks).
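Multiplying the per-drive numbers by 32 gives the naive theoretical ceiling we were hoping for (this ignores PCIe topology, RAID overhead and CPU limits):

# naive ceiling for 32 drives, straight from the datasheet figures above
echo $((32 * 610200))   # random read:      19 526 400 IOps, i.e. close to 20M
echo $((32 * 75000))    # random write:      2 400 000 IOps
echo $((32 * 2800))     # sequential read:  89 600 MB/s
echo $((32 * 2400))     # sequential write: 76 800 MB/s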

 

But when we ran a simultaneous performance test of all the drives, we reached only 9 999 000 IOps.

Almost 10 million! Although the expected performance should have been close to 20 million IOps… At first, we thought it was just us, not the system. But after we examined everything thoroughly, it turned out that the problem lay in PCIe lane oversubscription: in such a system, the maximum per drive can be obtained only when the drives are half-loaded.
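A rough lane-budget estimate shows why. Our assumptions here: each E1.S P4511 sits on a PCIe 3.0 x4 link, and each second-generation Xeon Scalable CPU provides 48 PCIe 3.0 lanes; the exact on-board switch topology will shift the numbers, but not the conclusion:

# back-of-the-envelope lane budget (assumptions, not measurements)
drives=32
lanes_per_drive=4                              # PCIe 3.0 x4 per E1.S drive
lanes_needed=$((drives * lanes_per_drive))     # 128 lanes of demand
lanes_available=$((2 * 48))                    # 96 lanes from two CPUs, some of them taken by the PCIe slots
echo "needed: $lanes_needed, available: at most $lanes_available"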

 

By reducing the block size to 512b, we managed to achieve a total drive performance of 12 million IOps for reads and 5 million IOps for writes.

Sure, it hurt to lose half the performance, but 12 million IOps per 1U is more than enough. It is unlikely that you will ever find an application that would put such a workload on a two-socket storage system.

But it wouldn’t be us…

if we hadn’t run two tests with RAIDIX on board!

As usual, we tested RAIDIX ERA in comparison with mdraid. Here is a summary of the results per 1U:

| Parameter | RAIDIX ERA RAID 5 / 50 | Linux SW RAID 10 | Linux SW RAID 5 |
| --- | --- | --- | --- |
| Rand Read Performance (IOps / Latency) | 11 700 000 / 0.2 ms | 2 700 000 / 1.7 ms | 2 000 000 / 1.5 ms |
| Rand Write Performance (IOps / Latency) | 2 500 000 / 0.6 ms | 350 000 / 5.3 ms | 150 000 / 10 ms |
| Useful capacity | 224 TB | 128 TB | 250 TB |
| Sequential Read Performance | 53 GBps | 56.2 GBps | 53.1 GBps |
| Sequential Write Performance | 45 GBps | 24.6 GBps | 1.7 GBps |
| Sequential Read Performance in degraded mode | 42.5 GBps | 42.6 GBps | 1.4 GBps |
| Mean CPU load at max performance | 13% | 24% | 37% |
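For completeness, the mdraid counterparts are ordinary md arrays; the sketch below shows how such arrays are typically created and degraded for the rebuild test (device names are illustrative, and the exact chunk-size and tuning options we used are omitted):

# Linux SW RAID 10 and RAID 5 across all 32 NVMe drives
mdadm --create /dev/md0 --level=10 --raid-devices=32 /dev/nvme{0..31}n1
mdadm --create /dev/md1 --level=5  --raid-devices=32 /dev/nvme{0..31}n1
cat /proc/mdstat                          # wait for the initial sync before benchmarking
mdadm /dev/md1 --fail /dev/nvme0n1        # kick out one drive for the degraded-mode runs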

We also got results for ERA under various workloads:

| Workload / Configuration | Performance |
| --- | --- |
| 4k Random Reads / 32 drives, RAID 5 | 9 999 000 IOps, latency 0.25 ms |
| 512b Random Reads / 32 drives, RAID 5 | 11 700 000 IOps, latency 0.2 ms |
| 4k Random Reads / 16 drives, RAID 5 | 5 380 000 IOps, latency 0.25 ms |
| 512b Random Reads / 16 drives, RAID 5 | 8 293 000 IOps, latency 0.2 ms |
| 4k Random Writes / 32 drives, RAID 50 | 2 512 000 IOps, latency 0.6 ms |
| 512b Random Writes / 32 drives, RAID 50 | 1 644 000 IOps, latency 0.7 ms |
| 4k Random Writes / 16 drives, RAID 50 | 1 548 000 IOps, latency 0.6 ms |
| 512b Random Writes / 16 drives, RAID 50 | 859 000 IOps, latency 0.7 ms |
| 1024k Sequential Reads / RAID 5, RAID 50 | 53 GBps |
| 1024k Sequential Writes / RAID 50, ss=64, merges=1, mm=7000, mw=7000 | 45.6 GBps |

Of course, all the drives had been preconditioned in accordance with the SNIA methodology beforehand. And on each run we varied the load, that is, the queue depth.
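By ‘preconditioned in accordance with the SNIA methodology’ we mean the usual purge, fill and steady-state preparation; a minimal sketch for a single drive looks roughly like this (not our exact PTS scripts, and the device name is illustrative):

# WARNING: destroys data on the drive
nvme format /dev/nvme0n1 --ses=1                           # purge (secure erase)
fio --name=fill --filename=/dev/nvme0n1 --rw=write --bs=128k \
    --iodepth=32 --direct=1 --loops=2                      # fill the whole LBA range twice
fio --name=precond --filename=/dev/nvme0n1 --rw=randwrite --bs=4k \
    --numjobs=4 --iodepth=32 --direct=1 --group_reporting \
    --time_based --runtime=3600                            # apply the target pattern until steady state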

At what queue depth did we get this latency, you may ask.

It was the queue depth at which we hit the performance peak at the minimum latency; on average, that is about 16.
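If the results are collected as JSON (as in the wrapper sketch above), picking that knee of the curve is a quick jq exercise; the field names below are from fio 3.x JSON output, and jq is assumed to be installed:

# print IOps and mean completion latency (ms) for every random-read result file
for f in rand_read_qd*.json; do
    printf '%s: ' "$f"
    jq -r '.jobs[0].read | "\(.iops) IOps, \(.clat_ns.mean / 1e6) ms"' "$f"
done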

 

During the tests, we discovered another peculiarity of the platform (or rather of the drives): with a low offset_increment value, their performance began to drop, and quite noticeably. Since offset_increment shifts each job's starting point (the real offset is offset + offset_increment * job number), a low value packs the sequential jobs close together, and the drives clearly dislike it when too little time passes between accesses to the same LBA.

Final Thoughts

The use scenarios for a system based on the SSG-1029P-NES32R platform are, of course, not limitless. The reasons lie in the rather high cost of the system and the small number of PCIe slots for such a large storage subsystem.

On the other hand, we managed to achieve excellent performance results, which is rare for E1.S drives. We knew, of course, that RAIDIX ERA would boost IOps, but witnessing (again) a 5-10x increase in IOps on random workloads at a modest 13% CPU load is always nice.

Do you need all this? Ask yourself (and a couple of other people at work). It may be that everything suits you as it is; in that case, let your servers, form factor and performance stay as they are. But if you want something more modern and faster, you have just read about one alternative.