Meet Stability AI
Accelerating Large Language Model Training in the Cloud
Stability AI is a visionary open-source generative AI company on a mission to build an intelligent foundation to activate humanity’s potential. The company delivers breakthrough, open-access AI models with minimal resource requirements for imaging, video, 3D, language, code, and audio.
Faster, Safer AI Through Open Models in Every Modality
By providing modular, open models in every modality, Stability AI can deliver new AI models to market faster, more sustainably, and with greater safety controls. Stability AI customers can bring pre-trained models into their domain and fine-tune them using their own IP-rich datasets in a secure and controlled manner. This approach improves AI researchers’ ability to deeply understand model dynamics, reduce the risk of model hallucination, and build AI applications that are safer and more responsible.
In the Generative AI space, the pace of innovation, already off the charts, is expected to remain high for years as foundation model providers create the AI building blocks that will drive next-generation applications. Stability AI seeks to win the generative AI race based on market speed, model accuracy, responsiveness, and accessibility.
“We can now reach 93% GPU utilization when running our AI model training environment using WEKA on AWS.”
The Challenge
Delivering fast time to market for features and new versions of the Stable Diffusion model requires access to the fastest GPU infrastructure. However, Stability AI found that the scale and performance limitations of its legacy Lustre file system deployment in the cloud prevented it from fully utilizing its GPU infrastructure and drove big cost surprises.
Idle GPU Infrastructure
The fastest networking, compute, and GPUs required to support model training and tuning have become scarce commodities, commanding high prices. Performance and scale limitations in legacy data management caused storage bottlenecks that left GPU infrastructure sitting idle, starved for data.
High Cloud Storage Costs
Rapid growth in researcher and customer interest drove explosive data growth in the legacy Lustre environment. However, storage over-provisioning requirements and limited data management capabilities led to unexpected cost overruns for storage in the cloud.
“Before WEKA, we’d push a button, and the cloud storage cost would just blow up like $100K per month in a matter of hours.”
The Solution
Stability AI uses WEKA Converged Mode on AWS for its generative AI model training and tuning environment. In Converged Mode, the data storage environment resides on the same infrastructure resources as the model training environment, in contrast to a traditional architecture where model training and data storage reside in separate silos of infrastructure. This innovative solution enables Stability AI to increase resource efficiency and realize massive savings in the cloud storage costs associated with its GPU infrastructure.
“When we switched to WEKA Converged Mode, we got 15 times more cloud storage capacity at about 80% of the previous cost.”
Outcomes with WEKA
Using the WEKA Data Platform, Stability AI transformed its infrastructure strategy, driving increased resource utilization. The new approach reduced data storage costs by 95% on a cost-per-TB basis while helping improve GPU utilization and accelerate model training times. With the new approach, Stability AI is also moving forward on its goals for sustainable AI.
95%
Reduction in Cost per TB
Stability AI reduced their storage costs by 95% by switching to WEKA Converged Mode, which uses existing NVMe resources available in the GPU cluster.
93%
GPU utilization efficiency
Stability AI achieves 93% GPU utilization by eliminating storage bottlenecks associated with small file handling and metadata look-ups.
35%
Faster Model Training
Stability AI found that AI model epoch times were reduced by three weeks on average with a faster data platform that eliminated manual data management tasks.
Simplified Data Operations
Manual data copying, movement, and management have been eliminated by relying on WEKA's zero-copy, zero-tuning data architecture.
Increased Sustainability
Stability AI is able to reduce the carbon footprint associated with model training through improved GPU utilization and a reduced storage footprint.
Increased Productivity
Researchers at Stability AI spend less time manually loading and pre-processing data. Model checkpoints are more reliable.
Accelerate LLM Training in AWS
Learn more about Stability AI and WEKA in AWS.