Meet Center for AI Safety
Accelerating AI Safety Research at Lower Cost in the Cloud
The Center for AI Safety (CAIS — pronounced ‘case’) is a San Francisco-based nonprofit that supports research and field building to promote safe and responsible artificial intelligence (AI). CAIS believes that while AI has the potential to benefit the world profoundly, many fundamental problems in AI safety remain unsolved. CAIS’s mission is to reduce societal-scale risks from AI by conducting safety research, building the field of AI safety researchers, and advocating for safety standards.
The Rising Importance of AI Safety Research
CAIS-supported research spans topics central to the development of safe AI. Research on the robustness of safety guardrails in large language models (LLMs) highlights the need to prevent third-party developers from bypassing safety controls. Technical methods for identifying and measuring the tendency of LLMs to hallucinate can help make AI systems more truthful and reliable. And work on the extent to which AI systems act on reward signals versus ethical considerations helps researchers understand when and why models behave ethically.
Artificial Intelligence Research in an Era of GPU Scarcity
AI safety researchers who want to experiment on the latest LLMs face a dilemma. Conducting relevant research requires access to the latest GPU infrastructure to run experiments resembling real-world scenarios. However, the cost, complexity, space, and infrastructure skill sets needed to build an AI research cluster create high barriers for most AI-safety researchers.
The CAIS Compute Cluster is a dedicated GPU-accelerated cluster that provides AI safety researchers with subsidized, on-demand access to state-of-the-art infrastructure for LLM training and other AI safety projects. The cluster is designed specifically for researchers working on the safety of machine learning systems and supports a diverse range of research interests and collaborators.
“We immediately saw our cloud storage costs drop by 90% when we switched to WEKA.”
The Challenge
AI safety research is a hugely diverse and rapidly growing field that requires affordable access to the latest GPU-accelerated infrastructure. As the CAIS Compute Cluster has grown to support an expanding set of researchers, challenges around scale, performance, cost control, and data management have come to the fore.
Performance Bottlenecks
Scaling the CAIS Compute Cluster was blocked by storage I/O bottlenecks caused by slow metadata handling across millions of small files. Fast metadata processing is critical for LLM training and tuning.
Storage Cost Spiral
CAIS experienced data copy sprawl, born of a lack of data management controls and quotas. Researchers each managed model data in their own way, often storing multiple copies of the same training data set.
Low GPU Utilization
Storage bottlenecks and poor metadata handling in the legacy Lustre storage left GPUs sitting idle, starved for data during training and tuning.
“80% of the storage attached to our GPU cluster was going unused.”
The Solution
CAIS deployed WEKA Converged Mode in Oracle Cloud for all operations in the CAIS Compute Cluster to reduce costs through more efficient resource utilization. In Converged Mode, data storage resides on the same infrastructure as the model training environment, in contrast to a traditional architecture where training and storage sit in separate infrastructure silos. The goals: drive significant cost savings across the data infrastructure, increase resource utilization in the GPU cluster, and manage data more efficiently.
“We would need a faster network to even stress the WEKA environment.”
Outcomes with WEKA
The WEKA Data Platform deployment enabled CAIS to increase utilization of their GPU-intensive cloud infrastructure, reduce cloud storage costs, and increase researcher productivity.
90%
Reduced Storage Costs
CAIS eliminated its dedicated storage silos by using WEKA Converged Mode, which takes advantage of otherwise unused NVMe storage on the GPU/CPU nodes in the compute cluster.
5x
Faster Data Storage
CAIS started with a small WEKA cluster delivering 1.7 million IOPS. Today it seamlessly handles the tens of millions of small files in CAIS’s AI training data sets.
500%
Growth in Research
The CAIS compute environment now supports over 200 researchers globally, a 6x increase in just 18 months, and it is still growing.
Increased GPU Utilization
CAIS has increased utilization of the local NVMe storage in its GPU instances from 20% to 100%.
Eliminate Data Bottlenecks
Growth in the CAIS Compute Cluster is no longer storage-constrained, thanks to a data environment optimized for metadata-intensive operations.
Faster AI Model Training
The new CAIS data environment accelerates AI model training, enabling researchers to investigate new aspects of AI safety more quickly.
Center for AI Safety
Learn how the Center for AI Safety empowers research into safe and responsible AI through the CAIS compute cluster.