Certified Storage for Nvidia DGX SuperPOD
As of July 2023, the Nvidia DGX SuperPOD is Nvidia's flagship AI product. It is delivered as a complete, turnkey AI supercomputer, including compute, storage, networking, software, and services.
The Nvidia DGX SuperPOD consists of between 20 and 140 DGX A100 or DGX H100 systems. Each DGX A100 system, for example, packs 8 A100 GPUs with 640 GB of total GPU RAM, 2 AMD Rome 7742 CPUs with a combined 128 cores, and 1 TB of system RAM into a 6RU box. To act as a storage fabric, each DGX system also carries a dual-port Mellanox ConnectX-6 NIC, with each port capable of 200 Gb/s. Details of the GPU fabric that interconnects the GPUs are skipped to keep the focus of this blog on storage. Each DGX A100 system also has 2 x 1.92 TB NVMe drives for the OS and 4 x 3.84 TB NVMe drives for data.
Any storage for the SuperPOD must deliver low-latency, consistently high-speed responses and must ensure the GPUs do not spend cycles waiting for data to arrive from storage.
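To get a rough sense of scale, here is a back-of-envelope calculation based on the per-node figures above (a sketch only; real sustained storage throughput will be lower and depends on the storage system and workload):

```c
/* Back-of-envelope: peak storage-fabric bandwidth of a SuperPOD,
 * using the per-node ConnectX-6 figures quoted above.
 * Actual sustained throughput depends on the storage system. */
#include <stdio.h>

int main(void) {
    const double port_gbps = 200.0;                    /* one ConnectX-6 port, Gb/s     */
    const int    ports     = 2;                        /* storage-fabric ports per DGX  */
    const double node_gbs  = port_gbps * ports / 8.0;  /* per-node GB/s (Gb -> GB)      */

    int sizes[] = {20, 140};                           /* smallest and largest SuperPOD */
    for (int i = 0; i < 2; i++) {
        printf("%3d DGX systems: up to %.0f GB/s of storage-fabric bandwidth\n",
               sizes[i], sizes[i] * node_gbs);
    }
    return 0;
}
```

Even the smallest configuration can, in principle, pull on the order of a terabyte per second from storage, which is why the certified systems below lean so heavily on RDMA and GPUDirect.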
One feature that helps with high-speed data transfer is that the Mellanox ConnectX NICs are RDMA capable. Beyond that, Nvidia has written the drivers and supporting software needed to make GPU RAM directly addressable by these Mellanox NICs. This means data from storage equipped with Mellanox ConnectX NICs can land directly in GPU RAM via the Mellanox NIC in the DGX A100/H100 system. Nvidia calls this feature Magnum I/O GPUDirect.
In the absence of GPUDirect, the alternative is to RDMA data from storage into system RAM (CPU main memory) and then have the CPU copy the data from system RAM into GPU RAM. VAST has published results showing that GPUDirect copies data 3X faster and consumes 2/3 fewer CPU cycles than moving data via RDMA into CPU RAM and then copying it into GPU RAM. Storage systems other than VAST Data should also see a similar improvement, but the exact numbers will have to be examined on a case-by-case basis.
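To make the difference concrete, here is a minimal sketch of the two data paths from an application's point of view, using Nvidia's cuFile (GPUDirect Storage) API for the direct path. The file path is hypothetical and error handling is omitted; this illustrates the I/O pattern, not any particular vendor's implementation.

```c
/* Sketch of the two data paths described above (error handling omitted).
 * Compile/link against CUDA and libcufile; the file path is hypothetical. */
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <cuda_runtime.h>
#include <cufile.h>

#define IO_SIZE (16UL << 20)                       /* 16 MiB per read, illustrative */

/* Path 1: bounce buffer -- storage -> CPU RAM -> cudaMemcpy -> GPU RAM. */
static void read_via_cpu(const char *path, void *gpu_buf) {
    int fd = open(path, O_RDONLY);
    void *host_buf = malloc(IO_SIZE);
    ssize_t n = pread(fd, host_buf, IO_SIZE, 0);              /* lands in CPU RAM        */
    cudaMemcpy(gpu_buf, host_buf, n, cudaMemcpyHostToDevice); /* extra copy, burns CPU   */
    free(host_buf);
    close(fd);
}

/* Path 2: GPUDirect Storage -- data is DMAed straight into GPU RAM. */
static void read_via_gds(const char *path, void *gpu_buf) {
    int fd = open(path, O_RDONLY | O_DIRECT);
    CUfileDescr_t descr;
    memset(&descr, 0, sizeof(descr));
    descr.handle.fd = fd;
    descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

    CUfileHandle_t handle;
    cuFileHandleRegister(&handle, &descr);
    cuFileBufRegister(gpu_buf, IO_SIZE, 0);                   /* register the GPU buffer */
    cuFileRead(handle, gpu_buf, IO_SIZE, 0, 0);               /* no CPU-side copy        */
    cuFileBufDeregister(gpu_buf);
    cuFileHandleDeregister(handle);
    close(fd);
}

int main(void) {
    void *gpu_buf;
    cudaMalloc(&gpu_buf, IO_SIZE);
    cuFileDriverOpen();
    read_via_cpu("/mnt/storage/sample.dat", gpu_buf);         /* hypothetical path */
    read_via_gds("/mnt/storage/sample.dat", gpu_buf);
    cuFileDriverClose();
    cudaFree(gpu_buf);
    return 0;
}
```

The second path is what Magnum I/O GPUDirect enables when the storage system and its client-side file system support it.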
IBM
IBM and Nvidia have collaborated to qualify the IBM ESS3200 for use with the Nvidia DGX SuperPOD. The IBM ESS3200 is an all-flash array that runs IBM Spectrum Scale software (previously called GPFS) and, running Spectrum Scale, supports Nvidia Magnum I/O GPUDirect.
NetApp
NetApp and Nvidia have jointly qualified the NetApp all-flash E-Series EF600 array running BeeGFS to work with the Nvidia flagship DGX SuperPOD. BeeGFS was originally developed at the Fraunhofer Institute and is currently maintained and developed by ThinkParQ. The Nvidia DGX SuperPOD + NetApp EF600 combination is expected to be installed on premises. The NetApp EF600 running BeeGFS supports Nvidia Magnum I/O GPUDirect.
Curiously, Nvidia also offers a subscription called “Nvidia DGX Foundry AI Service,” a cloud-hosted development environment based on the Nvidia DGX SuperPOD. In this instance, however, the storage used is the NetApp AFF A800, which has slightly lower performance. It is not clear what file system protocol runs between the SuperPOD and the NetApp A800, but presumably it is also BeeGFS.
DDN
DDN and Nvidia have collaborated to qualify the DDN AI400X2 array to work with the Nvidia DGX SuperPOD. The DDN AI400X2 array is Nvidia Magnum I/O GPUDirect capable.
The DDN AI400X2 comes in four versions: a highest-speed all-flash TLC version, a lower-priced and lower-performing all-flash QLC version, an HDD version, and a hybrid flash/HDD version. Of these, the TLC version is the one certified with the Nvidia DGX SuperPOD. The QLC version has been announced and will ship before the end of August 2023. It is likely this version will also be SuperPOD certified, but only time will tell.
VAST Data
VAST Data and Nvidia have jointly qualified VAST Data storage as an Nvidia SuperPOD certified storage solution. VAST lays claim to being the first Enterprise NAS qualified for the Nvidia DGX SuperPOD; all other solutions use a Parallel File System. VAST Data explains why this matters by claiming that customers building out supercomputing infrastructure must compromise among:
Performance: Is the storage fast enough? Are latency and response time low enough? Is the throughput large enough and consistent?
Scalability: Can the storage scale to meet my capacity needs (petabytes/exabytes)?
Capabilities: Does the storage have all the capabilities I need? Examples include immutable snapshots, ACLs, security, uptime (five 9s or higher), etc.
Simplicity: Is the storage simple to administer and use?
VAST claims to be the first AI/HPC storage that meets all 4 criteria without compromise. VAST Data's competitors will, of course, have something to say about that.
VAST uses NFS as the protocol to move data between VAST Data storage and the Nvidia DGX SuperPOD. VAST has made some changes to the NFS protocol stack to make it perform better (a minimal mount sketch follows the list below). In particular:
VAST Data uses the nconnect mount option to establish multiple connections between the Nvidia DGX SuperPOD and the VAST storage. VAST has backported this capability to older Linux kernels and older NFSv3 protocol stacks.
VAST Data has modified the NFS stack to use Multi-Path. Since VAST has a “Disaggregated Shared Everything” architecture, all VAST nodes (VAST refers to them as CNodes) have access to all files and directories.
Both nconnect and Multi-Path work over both TCP and RDMA. Effectively, NFS has been turned into a Parallel File System, if you think of it that way, but without having to set up separate metadata servers and data servers!
The VAST Data NFS stack also supports GPUDirect.
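For illustration, here is roughly what mounting such an NFS export with multiple connections looks like from a DGX node. The server name, export path, mount point, and option values are hypothetical; nconnect requires a kernel that supports it (upstream 5.3+, or a backported client on older kernels), and VAST's Multi-Path extensions require their modified client and are not shown.

```c
/* Sketch: mounting an NFS export with several parallel connections.
 * Server, export, mount point, and option values are illustrative. */
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    const char *cmd =
        "mount -t nfs "
        "-o vers=3,"        /* NFSv3, as described above                      */
        "nconnect=8,"       /* open 8 connections instead of 1 (kernel 5.3+,  */
                            /* or a backported client on older kernels)       */
        "proto=rdma "       /* use RDMA transport; drop for plain TCP         */
        "cnode-vip.example.com:/datasets /mnt/datasets";

    printf("running: %s\n", cmd);
    return system(cmd);     /* must be run as root on the DGX node */
}
```

With several connections spread across CNodes via the Multi-Path changes, a single client sees many storage servers behind one namespace, which is the sense in which NFS starts to behave like a Parallel File System.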
Things to ponder
Look for more storage vendors to qualify their offerings with the Nvidia DGX SuperPOD.
VAST Data has achieved a milestone in matching the speed/latency of Parallel File Systems with the simplicity and availability of Enterprise NAS (NFS in particular). Will other storage vendors take advantage of the NFS modifications VAST Data has contributed?
Will any storage vendor look at OTHER Enterprise NAS protocols, SMB 3 in particular? The author believes the amount of effort needed is less than the effort VAST Data likely put into modifying NFS.