Meet Center for AI Safety
Accelerating AI Safety Research at Lower Cost in the Cloud
The Center for AI Safety (CAIS — pronounced ‘case’) is a San Francisco-based nonprofit that supports research and field building to promote safe and responsible artificial intelligence (AI). CAIS believes that while AI has the potential to benefit the world profoundly, many fundamental problems in AI safety remain unsolved. CAIS’s mission is to reduce societal-scale risks from AI by conducting safety research, building the field of AI safety researchers, and advocating for safety standards.
The Rising Importance of AI Safety Research
CAIS-supported research spans topics central to the development of safe AI. Research on the robustness of safety guardrails in large language models (LLMs) highlights the need to prevent third-party developers from bypassing safety controls. Technical methods for identifying and measuring the tendency of LLMs to hallucinate can help make AI systems more truthful and reliable. And work on the extent to which AI systems act on reward signals versus ethical considerations helps researchers understand when and why models behave ethically.
Artificial Intelligence Research in an Era of GPU Scarcity
AI safety researchers who want to experiment on the latest LLMs face a dilemma. Conducting relevant research requires access to the latest GPU infrastructure to run experiments resembling real-world scenarios. However, the cost, complexity, space, and infrastructure skill sets needed to build an AI research cluster create high barriers for most AI-safety researchers.
The CAIS Compute Cluster is a dedicated GPU-accelerated cluster that provides AI safety researchers with subsidized, on-demand access to state-of-the-art infrastructure for LLM training and other AI safety projects. The cluster is designed specifically for researchers working on the safety of machine learning systems and supports a diverse range of research interests and collaborators.
“We immediately saw our cloud storage costs drop by 90% when we switched to WEKA.”
The Challenge
AI safety research is a hugely diverse and rapidly growing field that requires affordable access to the latest GPU-accelerated infrastructure. As the CAIS Compute Cluster has grown to support an expanding set of researchers, challenges around scale, performance, cost control, and data management have come to the fore.
Performance Bottlenecks
Scaling the CAIS Compute Cluster was blocked by storage I/O bottlenecks caused by slow metadata handling across millions of small files. Fast metadata processing is critical for LLM training and tuning.
Storage Cost Spiral
CAIS experienced data copy sprawl, born of a lack of data management controls and quotas. Researchers each managed model data in their own way, often storing multiple copies of the same training data set.
Low GPU Utilization
Storage bottlenecks and poor metadata handling in the legacy Lustre storage left GPUs sitting idle, starved for data during training and tuning.
“80% of the storage attached to our GPU cluster was going unused.”
The Solution
CAIS deployed WEKA Converged Mode in Oracle Cloud for all operations in the CAIS Compute Cluster to reduce costs through more efficient resource utilization. In Converged Mode, data storage resides on the same infrastructure as the model training environment, in contrast to a traditional architecture where training and storage sit in separate infrastructure silos. The goals: drive significant cost savings across the data infrastructure, increase resource utilization in the GPU cluster, and manage data more efficiently.
“We would need a faster network to even stress the WEKA environment.”
Outcomes with WEKA
The WEKA Data Platform deployment enabled CAIS to increase utilization of their GPU-intensive cloud infrastructure, reduce cloud storage costs, and increase researcher productivity.
90%
Reduced Storage Costs
CAIS eliminated its dedicated storage silos by using WEKA Converged Mode, which takes advantage of otherwise unused NVMe storage on the GPU/CPU nodes in the compute cluster.
5x
Faster Data Storage
CAIS started with a small WEKA cluster delivering 1.7 million IOPS. Today it seamlessly handles the tens of millions of small files in CAIS’s AI training data sets.
500%
Growth in Research
The CAIS compute environment now supports over 200 researchers globally, a 6x increase in just 18 months, and it is still growing.
Increased GPU Utilization
CAIS has increased utilization of the local NVMe storage in its GPU instances from 20% to 100%.
Eliminate Data Bottlenecks
Growth in the CAIS Compute Cluster is no longer storage-constrained, thanks to a data environment optimized for metadata-intensive operations.
Faster AI Model Training
The new CAIS data environment accelerates AI model training, enabling researchers to investigate new aspects of AI safety more quickly.
Center for AI Safety
Learn how the Center for AI Safety empowers research into safe and responsible AI through the CAIS compute cluster.