Get More Out of Your Cloud GPUs
Stop letting them idle waiting for data!
As the Cambrian explosion in generative AI gains momentum, so does the demand placed on its core enabling technologies: data, networks, and, most importantly, the accelerated compute that GPUs provide. However, a rapidly growing shortage of GPUs is putting companies of every size under increasing pressure. Today, most organizations with AI initiatives are focused on limited GPU availability and the resulting high cost of access to the most powerful GPUs suited for training Large Language Models (LLMs). Yet leading organizations are starting to realize they are missing a trick: the GPUs they do have are sitting idle, starved for data.
GPUs Are Starved for Data
According to reports from Google, Microsoft, and organizations around the world, up to 70% of model training time is taken up by data staging operations. Put another way, your GPUs are spending up to 70% of their time sitting idle, starved of the data needed to train the model. It’s no surprise when you look at the typical generative AI data pipeline.
The diagram below shows the Generative AI data pipeline commonly used by enterprise customers and recently described by leading generative AI researchers at Google.
As shown in the diagram above, at the beginning of each training epoch, training data kept on high-capacity object storage is typically moved to a file staging tier and then moved again to GPU local storage, which is used as scratch space for GPU calculations. Each “hop” adds data-copy latency and management intervention, slowing each training epoch considerably. Valuable GPU processing resources sit idle waiting for data, and vital training time is needlessly extended.
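In practice, the multi-hop pattern often looks something like the sketch below: objects are pulled down from the object store into a shared staging tier, then copied again onto each GPU node’s local scratch before training can begin. The bucket name and paths here are hypothetical, and boto3 simply stands in for whichever object-store client a team actually uses.

```python
# Illustrative sketch of a typical "multi-hop" staging step (not WEKA-specific).
# Hop 1: object storage -> shared staging tier.  Hop 2: staging tier -> local scratch.
import shutil
from pathlib import Path
import boto3

s3 = boto3.client("s3")
BUCKET = "training-data"                      # hypothetical bucket name
STAGING = Path("/mnt/staging/dataset")        # shared file staging tier
SCRATCH = Path("/local/nvme/scratch")         # GPU node's local scratch space

STAGING.mkdir(parents=True, exist_ok=True)
SCRATCH.mkdir(parents=True, exist_ok=True)

for obj in s3.list_objects_v2(Bucket=BUCKET).get("Contents", []):
    key = obj["Key"]
    staged = STAGING / key
    staged.parent.mkdir(parents=True, exist_ok=True)
    s3.download_file(BUCKET, key, str(staged))    # hop 1: copy out of object storage
    shutil.copy(staged, SCRATCH / staged.name)    # hop 2: copy again to local scratch

# Only after both hops finish can the training job start reading from SCRATCH.
```

Every file is copied at least twice before a single GPU cycle is spent on training, and the same dance repeats whenever the data set changes.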
WEKA Has a Better Way: The Data Platform for AI
The primary design objective in deep learning model training is to keep the GPUs doing the training constantly saturated by providing the highest throughput at the lowest latency. The more training data a model can work through in a given amount of time, the faster it can converge on a solution and the greater its accuracy will be.
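One way to check whether your own GPUs are actually saturated is to time how much of each training step is spent waiting on the data pipeline versus running GPU compute. The sketch below assumes a PyTorch training loop; the model, loss function, optimizer, and data loader are generic placeholders, not part of any WEKA-specific API.

```python
# Minimal sketch (assumes PyTorch): estimate what fraction of each training
# step is spent waiting on the data pipeline versus running GPU compute.
import time
import torch

def profile_data_wait(model, loss_fn, optimizer, loader, device="cuda"):
    model.to(device).train()
    data_time, compute_time = 0.0, 0.0
    end = time.perf_counter()
    for inputs, targets in loader:
        t0 = time.perf_counter()
        data_time += t0 - end                 # time blocked waiting for the next batch

        inputs, targets = inputs.to(device), targets.to(device)
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        torch.cuda.synchronize()              # finish GPU work before timing it
        end = time.perf_counter()
        compute_time += end - t0

    total = data_time + compute_time
    print(f"waiting on data: {100 * data_time / total:.1f}% of training time")
```

If the data-wait percentage is high, faster GPUs will not shorten your epochs; faster data delivery will.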
Born in the cloud, the WEKA Data Platform is a software solution that keeps your GPUs constantly saturated during model training by delivering the highest throughput at the lowest latency. WEKA collapses the typical GPU-starving “multi-hop” AI data pipeline into a single namespace where your entire data set is stored.
This zero-copy architecture eliminates the multiple steps needed to stage data prior to training. Your GPUs gain fast access to the data needed for training, while WEKA automatically manages tiering of data between high-performance NVMe-based storage and low-cost object storage. Incorporating the WEKA Data Platform for AI into deep learning data pipelines saturates data transfer rates to NVIDIA GPU systems and eliminates wasteful data copying and transfer time between storage silos, dramatically increasing the number of training data sets that can be analyzed per day.
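As a contrast to the staging sketch above, the snippet below illustrates what a zero-copy approach can look like from the training job’s point of view: with the entire data set visible through one namespace (shown here as a hypothetical POSIX mount point), the data loader reads files in place and there is no per-epoch staging step to run or manage. The directory layout and file format are assumptions for illustration, not a prescribed WEKA workflow.

```python
# Minimal sketch: training reads directly from a single shared namespace
# (represented here as a POSIX mount point); no per-epoch staging copies.
from pathlib import Path
import torch
from torch.utils.data import Dataset, DataLoader

DATA_ROOT = Path("/mnt/weka/training-data")   # hypothetical mount of the namespace

class FileDataset(Dataset):
    def __init__(self, root: Path):
        self.files = sorted(root.rglob("*.pt"))     # assumes preprocessed tensor files

    def __len__(self):
        return len(self.files)

    def __getitem__(self, idx):
        sample = torch.load(self.files[idx])        # read in place, no staging hop
        return sample["inputs"], sample["target"]   # assumed keys, for illustration

loader = DataLoader(FileDataset(DATA_ROOT), batch_size=64,
                    num_workers=8, pin_memory=True)
```

The training loop consumes batches from `loader` exactly as before; what changes is that the data path has no copy steps left to wait on.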
Optimized Performance that Drives Every Step in Generative AI Pipelines
Every step in the Generative AI workflow has a different performance requirement: high-bandwidth stages such as data ingest; low-latency stages such as model training, model evaluation, and fine-tuning; and mixed I/O patterns during model validation and inferencing. Further, as Generative AI workflows grow in complexity, applications need to switch between models optimized for different outcomes. Legacy solutions optimized for a single step in the workflow ultimately force organizations either to create multiple data silos, at very high cost, to support the entire AI pipeline, or to suffer through slow pipelines overall.
WEKA collapses the typical GPU-starving “multi-hop” AI data pipeline into a single namespace where your entire data set is stored, eliminating the need to stand up a separate environment for every stage of the AI pipeline. WEKA customers see anywhere from 7x to 10x performance improvement at every stage of the pipeline and are able to collapse total epoch times by 90% overall.
A 20X Reduction in Epoch Time By Switching to the WEKA Data Platform for AI
One example of an organization using WEKA to accelerate their AI data pipeline is Atomwise, a pharmaceutical research company that uses artificial intelligence for structure-based drug discovery. Atomwise uses 3D structural analysis to train their model, which is then used to identify a pipeline of small-molecule drug candidates that advance into preclinical trials. Model training typically relies on millions of structures, tens of millions of individual small files, and 30 to 50 epochs, requiring as many as 12 data scientists to manage an AI pipeline that could take as long as 4 days to complete. A deep dive into the data pipeline revealed a major I/O bottleneck that, once cleared, could dramatically improve the pipeline. That’s when Atomwise adopted the WEKA Data Platform, running in AWS, for their model training and drug discovery workflows. With the new solution, Atomwise was able to shift from a traditional multi-copy data pipeline, where each training cycle took 80 hours, to WEKA’s zero-copy data pipeline. They reduced their epoch time to 4 hours, a 20X improvement in model training times. This allowed them to do in 12 days what would have taken a year on their old infrastructure, drastically speeding their final product to market.
“We wanted to train a model on the 30 million files we had, but the models are fairly large, with 30-50 epochs, a timeline of up to four days, and a lot of random-access-file lookups. GPUs are quite fast and hungry for data – you want to feed them as much data as you can,” says Jon Sorenson, VP of Technology Development at Atomwise. With WEKA, “We could now consider experiments that earlier – because of all these headaches – might take us three months to figure out how to run. Now we can do this exact same experiment in less than a week.”
Contact your WEKA representative for more information on how WEKA can accelerate your Generative AI data pipelines.