Nvidia GPUDirect Part 1 – GPUDirect and Storage Vendors
Nvidia popularized the use of GPUs for AI. These chips specialize in performing mathematical operations efficiently, matrix multiplication being one of the easier examples to understand. Of course, Nvidia GPUs are about much more than just graphics or matrix operations.
The classic model is a CPU with a very limited amount of on-chip memory (cache) that can access a fair or large amount of memory (RAM) over a relatively high-speed bus. Nvidia evolved from this architecture to one where the GPU has a large amount of RAM (on the order of tens of GB) attached directly to the GPU itself.
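As a quick illustration of how much RAM a modern GPU carries, a few lines of CUDA C can query it directly; on current data-center GPUs the total is on the order of tens of GB.

    /* Query how much RAM the GPU itself carries. Minimal sketch;
       error handling elided. */
    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        size_t free_bytes, total_bytes;
        cudaSetDevice(0);                            /* first GPU in the system */
        cudaMemGetInfo(&free_bytes, &total_bytes);   /* free and total GPU RAM */
        printf("GPU RAM: %.1f GB total, %.1f GB free\n",
               total_bytes / 1e9, free_bytes / 1e9);
        return 0;
    }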
Classic storage solutions dealt with moving data between storage and system RAM. The more advanced ones used RDMA to move data to/from RAM with minimal assistance from the CPU. Hence, when Nvidia shipped the first GPUs with a substantial amount of RAM, the I/O pattern was as follows (a code sketch follows the list):
· Data is moved from storage into system RAM (for a read operation). This may or may not involve RDMA.
· From RAM, the data is copied into GPU RAM
· Finally, the GPU can perform operations on the data
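To make the pattern concrete, here is a minimal sketch in CUDA C of the classic two-hop read path; the file path and buffer size are hypothetical and error handling is elided.

    /* Classic two-hop read path: storage -> CPU RAM -> GPU RAM.
       Minimal sketch; file path and size are hypothetical, errors unchecked. */
    #include <fcntl.h>
    #include <unistd.h>
    #include <cuda_runtime.h>

    int main(void) {
        const size_t size = 1 << 20;              /* 1 MiB, for illustration */
        void *host_buf, *dev_buf;

        cudaMallocHost(&host_buf, size);          /* pinned buffer in CPU RAM */
        cudaMalloc(&dev_buf, size);               /* buffer in GPU RAM */

        int fd = open("/mnt/data/sample.bin", O_RDONLY);
        read(fd, host_buf, size);                 /* hop 1: storage -> CPU RAM */
        cudaMemcpy(dev_buf, host_buf, size,
                   cudaMemcpyHostToDevice);       /* hop 2: CPU RAM -> GPU RAM */
        close(fd);

        /* ... only now can kernels operate on dev_buf ... */

        cudaFree(dev_buf);
        cudaFreeHost(host_buf);
        return 0;
    }

Both hops cost wall-clock time and the CPU must orchestrate them, which is precisely the pair of drawbacks described next.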
The drawbacks of this approach are:
· It takes longer, and meanwhile the expensive GPU sits idle
· It consumes CPU cycles, so the CPU cannot be deployed for other tasks
To overcome these drawbacks, Nvidia announced GPUDirect, which is both a solution outline and part of an Nvidia certification. With GPUDirect, data moves directly between storage and GPU RAM, completely bypassing CPU RAM. Note that this requires RDMA capabilities not just in the storage system and the GPU server, but also in all the intervening network devices. And of course, everything needs to be configured properly.
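Nvidia exposes this direct path to applications through the cuFile API, which ships as part of GPUDirect Storage. Below is a minimal sketch of a direct read into GPU memory; the file path and size are hypothetical, error handling is elided, and a GPUDirect-capable filesystem plus the nvidia-fs kernel driver are assumed to be in place.

    /* Direct read path: storage -> GPU RAM, bypassing CPU RAM (cuFile API).
       Minimal sketch; assumes a GPUDirect Storage capable setup. */
    #define _GNU_SOURCE                           /* for O_DIRECT */
    #include <fcntl.h>
    #include <unistd.h>
    #include <string.h>
    #include <cuda_runtime.h>
    #include <cufile.h>

    int main(void) {
        const size_t size = 1 << 20;              /* 1 MiB, for illustration */
        void *dev_buf;
        cudaMalloc(&dev_buf, size);               /* buffer in GPU RAM */

        cuFileDriverOpen();

        int fd = open("/mnt/data/sample.bin", O_RDONLY | O_DIRECT);
        CUfileDescr_t descr;
        memset(&descr, 0, sizeof(descr));
        descr.handle.fd = fd;
        descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;

        CUfileHandle_t handle;
        cuFileHandleRegister(&handle, &descr);
        cuFileBufRegister(dev_buf, size, 0);      /* register GPU buffer for DMA */

        /* Data lands in GPU RAM without ever touching a CPU-side buffer */
        cuFileRead(handle, dev_buf, size, 0 /* file offset */, 0 /* buffer offset */);

        cuFileBufDeregister(dev_buf);
        cuFileHandleDeregister(handle);
        close(fd);
        cuFileDriverClose();
        cudaFree(dev_buf);
        return 0;
    }

Link against libcufile (for example, nvcc gds_read.cu -lcufile). Note that if the underlying filesystem cannot support the direct path, cuFile may fall back to a compatibility mode that bounces data through CPU RAM, so a successful call does not by itself prove the bypass happened.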
We will look at GPUDirect from 3 different perspectives – AI Storage vendors, LLM Model developers, and AI solution developers. The remainder of this blog will focus on AI Storage vendors; subsequent blogs will cover LLM Model developers and AI solution developers.
A number of storage vendors, including VAST Data, DDN, PEAK:AIO, NetApp, and Dell, have announced support for GPUDirect. This is NOT an exhaustive list. The odd object storage vendor has also announced support for GPUDirect, albeit with objects rather than files.
Storage vendors seem to look at GPUDirect support as table stakes for participating in the AI storage segment. Those that do not yet have a shipping solution have announced their intention to deliver one, though in many cases the exact timeframe is unspecified. In general, GPUDirect can be implemented by:
· Using RDMA with NFS
· Using RDMA with the vendor-chosen network file system protocol
· Using RDMA with the SMB 3 protocol, an approach sometimes referred to as SMB Direct. There are a number of advantages here, starting with the fact that every Linux kernel since 2018 (or maybe 2017?) includes an SMB 3 client, and that this client is actively maintained and enhanced by a number of corporations including Microsoft. Happy to share more thoughts on the pros and cons.
The next blog will continue the series by discussing GPUDirect from the perspective of generative AI and LLM model developers in particular.