RIKEN: Multi-Petabyte RAIDIX-Based Storage Implemented at the Center for Computational Science
About the Project
The largest RAIDIX-based storage system to date (Core Micro Systems Integrated Storage Appliance / HyperSTOR Flex) was installed last year. The system, consisting of 11 high-availability clusters, was deployed at the RIKEN Center for Computational Science (Japan). Its main purpose is to store data for HPCI (High Performance Computing Infrastructure), which has been implemented as part of a massive national project called Academic Cloud (based on the SINET network) that aims to accelerate the exchange of academic data.
The outstanding feature of the project is its total capacity: 65PB, of which 51.04PB is usable. To put this figure in perspective, it corresponds to 6,512 drives of 10TB each (the most capacious available at the time of installation). That's a lot.
The project was developed over the course of a year, after which the system's stability was monitored for another year. The results matched the target values, so we can now report the project a success and a significant victory for the team.
Key Project Figures
Usable capacity per cluster: 4.64PB (58 × RAID6 8D+2P LUNs)
Usable capacity of the whole system: 51.04PB (4.64PB × 11 clusters)
Total raw capacity of the whole system: 65PB
Actual system performance: 17GB/s for writes, 22GB/s for reads
Total throughput of the Gfarm file system across all 11 storage clusters: 250GB/s
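These figures are internally consistent; the following sketch (using only the numbers quoted in this article, with 1PB taken as 1,000TB) reproduces them:

```python
# Sanity check of the capacity figures quoted above.
# All inputs come from this article.

TB_PER_DRIVE = 10
DATA_DRIVES_PER_LUN = 8          # RAID6 8D+2P
PARITY_DRIVES_PER_LUN = 2
LUNS_PER_CLUSTER = 58
CLUSTERS = 11
HOT_SPARES_PER_CLUSTER = 12

usable_per_cluster_tb = LUNS_PER_CLUSTER * DATA_DRIVES_PER_LUN * TB_PER_DRIVE
print(usable_per_cluster_tb / 1000)          # 4.64 PB usable per cluster

usable_total_tb = usable_per_cluster_tb * CLUSTERS
print(usable_total_tb / 1000)                # 51.04 PB usable in total

drives_total = CLUSTERS * (LUNS_PER_CLUSTER
                           * (DATA_DRIVES_PER_LUN + PARITY_DRIVES_PER_LUN)
                           + HOT_SPARES_PER_CLUSTER)
print(drives_total)                          # 6512 drives
print(drives_total * TB_PER_DRIVE / 1000)    # 65.12 PB raw (the "65PB" above)
```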
Supercomputer in RIKEN Center for Computational Science
The supercomputer is central to the large-scale, complex research conducted by the Institute: it helps model climate, weather conditions, and molecular behavior; calculate and analyze nuclear physics reactions; forecast earthquakes; and much more. The supercomputer's power is also used for more "casual" applied research, such as exploring oil deposits and predicting stock market trends.
Such calculations and experiments generate huge volumes of data whose significance cannot be overstated. To maximize the benefit, Japanese scientists developed the concept of a unified information space in which HPC professionals from different research centers share access to HPC resources.
High Performance Computing Infrastructure (HPCI)
HPCI is powered by SINET (The Science Information Network), the backbone network for exchanging scientific data among Japanese universities and research centers. SINET currently connects about 850 colleges and universities, creating enormous opportunities for information exchange in nuclear physics, astronomy, surveying, seismology, and computer science.
HPCI is a unique infrastructure project that forms a unified information-exchange environment for high-performance computing among universities and research centers across Japan.
By combining the capabilities of the K supercomputer with those of other scientific centers, the research community gains an obvious benefit: collaborative access to the valuable data generated by supercomputer computations.
To ensure effective shared access to the HPCI environment, the requirements for storage access speed were high. And because of the K computer's enormous output, the storage cluster at the RIKEN Center for Computational Science was specified to provide no less than 50PB of usable capacity.
The HPCI infrastructure was built on the Gfarm file system, which provides a high level of performance and unifies disparate storage clusters into a single namespace for shared access.
Gfarm File System
Gfarm is a distributed open-source file system developed by Japanese engineers. It was introduced by the National Institute of Advanced Industrial Science and Technology (AIST), and its name refers to the Grid Data Farm architecture.
This file system combines a number of seemingly incompatible properties:
- High scalability in terms of capacity and performance
- Network distribution over long distances, with a single namespace spanning several remote research centers
- POSIX API support
- The high performance required for parallel computing
- Secure data storage
Gfarm builds a virtual file system from the resources of multiple storage servers. Data placement is handled by a metadata server, and the distribution scheme is hidden from users. It is worth noting that Gfarm comprises not only a storage cluster but also a computing grid that uses the resources of the same servers.
The file system architecture is asymmetric, with clearly separated roles: storage server, metadata server, and client. At the same time, all three roles can be performed by the same machine. Storage servers keep multiple copies of each file, and metadata servers operate in master-slave mode.
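As an illustration of this asymmetric design, here is a toy model — not Gfarm's actual code; the class names and the round-robin placement policy are invented for this sketch — in which a metadata server assigns file replicas to storage servers while the client never sees the placement:

```python
# Toy model of Gfarm's asymmetric architecture: the metadata server
# decides where file replicas live; the client only works with paths.
# Round-robin placement is an illustrative stand-in for Gfarm's real
# placement and replication logic.
import itertools

class MetadataServer:
    def __init__(self, storage_servers, replicas=2):
        self.servers = storage_servers          # plain dicts stand in for storage nodes
        self.catalog = {}                       # path -> list of holding servers
        self._cycle = itertools.cycle(storage_servers)
        self.replicas = replicas

    def place(self, path):
        # Choose servers for the replicas and record the mapping.
        chosen = [next(self._cycle) for _ in range(self.replicas)]
        self.catalog[path] = chosen
        return chosen

class Client:
    def __init__(self, mds):
        self.mds = mds

    def write(self, path, data):
        # The client asks the metadata server where to write;
        # the distribution scheme stays hidden from the user.
        for server in self.mds.place(path):
            server[path] = data

mds = MetadataServer([{}, {}, {}])              # three storage servers
Client(mds).write("/hpci/results.dat", b"...")
print(len(mds.catalog["/hpci/results.dat"]))    # 2 (two file copies)
```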
Core Micro Systems, RAIDIX's strategic partner and exclusive distributor in Japan, was the main integrator of the system at the RIKEN Center for Computational Science. The active phase of the project took about 12 months of hard work by Core Micro Systems specialists and the RAIDIX team.
RAIDIX demonstrated consistently high performance and efficiency with data volumes of this scale, as confirmed by lengthy tests and inspections.
The modifications deserve a mention: the storage not only had to be integrated with the Gfarm file system, but some of the software's functionality had to be extended as well. For example, Automatic Write-Through technology had to be developed and implemented on short notice to meet the technical specifications.
The actual deployment went smoothly: Core Micro Systems engineers carried out each stage of testing carefully and accurately, gradually scaling the system up. The first phase of deployment was completed in August 2017, when the system's capacity reached 18PB. The second phase followed in October of the same year, bringing the capacity to a record 51PB.
The successful implementation of such a long and laborious project was made possible by the collaborative work and close interaction of all participants.
DC platform configuration:

| Component | Specification |
|---|---|
| CPU | Intel Xeon E5-2637 |
| Motherboard | Must support the chosen processor and PCI Express 3.0 x8/x16 |
| Internal cache memory | 256GB per node |
| SAS controller (extra ports may be used for JBOD connectivity) | Broadcom 9305-16e, 9300-8e |
| HDD | HGST Helium 10TB SAS HDD |
| HeartBeat | 1 GbE Ethernet |
| CacheSync | 6 × SAS 12G |
Fig. 1. Single data storage cluster in HPCI system
The team managed to build scalable storage from 11 DC RAIDIX systems in conjunction with Gfarm.
Connection to the Gfarm servers is made via 8 × SAS 12G.
Fig. 2. Image of cluster with separate data server for each node.
(1) 48Gbps x 8 Mesh SAN connections (bandwidth: 384Gbps)
(2) 48Gbps x 40 Mesh FABRIC connections (bandwidth: 1920Gbps)
Both nodes in every cluster are connected to JBOD enclosures (60 disks, 10TB each) through 20 SAS 12G ports for each node.
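The quoted link bandwidths are consistent with SAS 12G wide ports, assuming the usual 4-lane bundling (4 × 12Gbps = 48Gbps per connection) — that bundling is an inference, not something stated in the article. A quick check of the figures:

```python
# Sanity check of the SAS fabric bandwidth figures above.
# Assumption: each 48Gbps connection is a 4-lane SAS 12G wide port.
LANE_GBPS = 12
LANES_PER_WIDE_PORT = 4
wide_port_gbps = LANE_GBPS * LANES_PER_WIDE_PORT   # 48 Gbps per connection

san_gbps = wide_port_gbps * 8        # (1) 8 mesh SAN connections
fabric_gbps = wide_port_gbps * 40    # (2) 40 mesh FABRIC connections
print(san_gbps, fabric_gbps)         # 384 1920
```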
Fig. 3. Fault-tolerant cluster with scheme of 10 JBOD connection
On these shelves, 58 RAID 6 arrays (8 data (D) + 2 parity (P) drives, 10TB each) were created, plus 12 hot-spare HDDs per cluster.