Nvidia GPUDirect Part 2 – GPUDirect and LLM model developers
Part 1 of this blog series explained GPUDirect; please read it first if you have not already.
LLM model development can involve hundreds or thousands of GPUs running for weeks. Beyond the raw GPU count, model developers need several other capabilities:
· A high-speed, low-latency network connecting all the GPUs (and their GPU RAM), since many operations require multiple GPUs to collaborate
· GPUDirect to move data quickly between storage and GPU RAM
· Note that LLM model developers frequently create checkpoints, so that completed work is preserved and, if required, training can be restarted from a checkpoint. Each checkpoint writes TBs of data to disk (the GPU RAM contents plus other state), so the storage must be able to sustain this write I/O. GPUDirect helps.
· When training is restarted from a checkpoint, and also when model parameters are loaded, there is an intense read I/O load. Again, the storage must be able to sustain it, and again GPUDirect helps.
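To get a feel for the checkpoint I/O load, here is a back-of-envelope sketch. The byte counts are assumptions for illustration only: mixed-precision training with Adam, keeping fp16 weights (2 B/param), an fp32 master copy (4 B/param), and two fp32 Adam moments (8 B/param), for roughly 14 bytes per parameter. Real frameworks vary.

```python
# Rough checkpoint sizing under the (assumed) layout described above:
# fp16 weights + fp32 master weights + two fp32 Adam moment buffers.
BYTES_PER_PARAM = 2 + 4 + 4 + 4  # = 14 bytes per parameter

def checkpoint_size_bytes(num_params: int) -> int:
    """Approximate on-disk size of one full training-state checkpoint."""
    return num_params * BYTES_PER_PARAM

def required_write_gbps(num_params: int, window_seconds: float) -> float:
    """Aggregate write bandwidth (GB/s) needed to finish a checkpoint
    within the given time window."""
    return checkpoint_size_bytes(num_params) / window_seconds / 1e9

# Example: a 70B-parameter model, checkpointed within a 5-minute window.
params_70b = 70_000_000_000
size_tb = checkpoint_size_bytes(params_70b) / 1e12
bw = required_write_gbps(params_70b, 300)
print(f"~{size_tb:.2f} TB per checkpoint, ~{bw:.1f} GB/s sustained write")
```

Under these assumptions a 70B-parameter checkpoint is close to 1 TB, and writing it in five minutes needs over 3 GB/s of sustained write bandwidth — the kind of storage throughput that GPUDirect is meant to help deliver.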
So, if you are an LLM model developer, how do you get a hardware setup that satisfies these requirements? The choices seem to be:
· Deploy your own hardware. Of course, this requires the appropriate budget, engineering resources, datacenter premises, etc.
· Rent the setup. The problem is that none of the hyperscalers (Microsoft, AWS, Google, Oracle) offers a way to specify GPUDirect storage! You can specify the GPU, CPU, and RAM, but that's about it. Nor do the GPU cloud "challengers" such as CoreWeave, Lambda Labs, Vast.ai, etc. let you specify GPUDirect storage when you rent. So you will have to work directly/privately with a cloud GPU rental vendor to ensure you get an adequate hardware setup.
· Some other arrangement, e.g. renting/leasing from a company that is not in the cloud GPU rental business.
To summarize, LLM model development requires GPUDirect storage and certain other hardware characteristics, and procuring them will take some effort.