Hot Take: Google Leveling Up AI in the Cloud
The first few months of 2024 have picked up right where we left off last year, with unparalleled levels of innovation and no end in sight. If anything, the pace has accelerated. At the recent NVIDIA GTC conference, now fondly referred to as the "Woodstock of AI," the new NVIDIA Blackwell GPU architecture and Grace-Blackwell superchips marked a massive step forward in AI infrastructure, and the NIM and NeMo developer tools promise to make building easier for the many AI developers bought into the CUDA ecosystem. WEKA's Joel Kaufman unpacks the GTC announcements beautifully here. Now the team at Google has picked up the mantle, introducing hundreds of meaningful new capabilities for customers.
Fast Cloud Infrastructure for AI Workloads
On the compute side, the focus has shifted in a few ways. The race for the fastest processing power for model training, tuning, and inference has separated into several elements – NVIDIA GPUs, custom silicon, and purpose-built compute instances – with the leading cloud providers pursuing an all-of-the-above strategy, calling it customer choice, but usually with a few twists of their own.
The race to Blackwell is critical to continued progress on large language models (LLMs). As we accelerate beyond the trillion-parameter models now in development, Grace-Blackwell is viewed as the answer for driving the next wave of model training and serving. With NVIDIA claiming up to 30x faster LLM inference and up to 25x lower cost and energy consumption versus the prior generation, Blackwell is central to the largest of the LLMs, and time to market will be huge. For Google customers, Blackwell availability arrives in early 2025. It will be fascinating to watch how the Blackwell roll-out proceeds across traditional NVIDIA customers on-prem, within DGX Cloud, and natively in the cloud. Until then, Google Cloud customers will be pleased that the A3 Mega instance, powered by NVIDIA H100 GPUs, will be available next month, joining similar instance types already available to AWS and Azure customers.
In custom silicon for AI applications, Google was a pioneer with its 2016 introduction of the Tensor Processing Unit (TPU), a custom ASIC built specifically for machine learning and tailored for TensorFlow. The Google team continues that tradition with the general availability of Cloud TPU v5p, a next-generation accelerator purpose-built to train some of the largest and most demanding generative AI models. Versus TPU v4, the new TPU v5p offers 2x the chips per pod, 2x higher FLOPS and 3x more high-bandwidth memory per chip, and 12x more bandwidth per VM. The performance improvements, along with security enhancements, make this an excellent option for enterprises looking to tune their own custom models.
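For teams that want to kick the tires, a minimal sketch like the one below (Python with JAX, which targets TPUs natively) is one common way to confirm the accelerators are visible and to run a compiled computation on them. Nothing here is specific to v5p; the array shapes and dtype are purely illustrative, and it assumes a Cloud TPU VM with jax[tpu] installed.

```python
# Minimal JAX sketch: confirm TPU devices are visible and run a compiled matmul.
# Assumes a Cloud TPU VM with jax[tpu] installed; nothing here is specific to
# TPU v5p -- the shapes and dtype are purely illustrative.
import jax
import jax.numpy as jnp

devices = jax.devices()           # lists the local accelerator chips/cores
print(f"Backend: {jax.default_backend()}, devices: {len(devices)}")

@jax.jit                          # XLA-compile the computation for the accelerator
def matmul(a, b):
    return a @ b

ka, kb = jax.random.split(jax.random.PRNGKey(0))
a = jax.random.normal(ka, (4096, 4096), dtype=jnp.bfloat16)
b = jax.random.normal(kb, (4096, 4096), dtype=jnp.bfloat16)

out = matmul(a, b)
out.block_until_ready()           # wait for the asynchronous TPU execution to finish
print(out.shape, out.dtype)
```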
Of course, custom silicon in the cloud is also about the price-performance benefits of ARM, so it's great to see Google Cloud jump in with Axion. While ARM is the headline announcement here, the real news is the custom architecture that backs Axion up. Axion is underpinned by Titanium, a system of purpose-built custom silicon microcontrollers that offloads platform operations for networking, security, and storage I/O through Hyperdisk. The innovation here goes well beyond ARM: taking some inspiration from the AWS Nitro System, Google is rethinking every layer of the infrastructure stack on behalf of its customers.
Handling HPC and Specialized Workloads
One smart story Google just pushed was a set of launches around workload-optimized infrastructure, and I love the customer obsession behind these new offerings from the Google Cloud compute team. While generative AI and buzzy new methods like multi-modal LLMs, RAG, MoE, and the like grab the headlines, accelerated compute and GPUs have been steadily improving for high-performance workloads in the cloud. Pharmas use accelerated compute for drug discovery and genomics processing. Media firms use accelerated compute for most stages of their production pipelines. Oil and gas firms use accelerated compute for geothermal exploration, seismic analysis, and a host of processing applications. Accelerated compute is also hugely useful in EDA and CAD/CAM applications, among many others. As many have pointed out, we're reaching the end of Moore's Law, and quantum computing is still pretty far out. In the meantime, accelerated computing is an incredibly innovative way to solve some really difficult problems for customers – particularly in the cloud, where customers get the performance benefits of the new architectures along with the scale economies of the cloud.
Smart Networks to Improve the AI User Experience
One final area where we expect to see more innovation soon is the cloud network. On the one hand, there is already a ton of it – 400/800 Gbps networks rolling out across most cloud providers, cross-cloud interconnects, and major expansions coming in the performance and resilience of the global network backbone. The infrastructure goodness is there. However, as organizations deploy their AI models in the real world, the demands on the network will change dramatically. Larger context windows for inference mean more data flowing to and from that fully trained, deployed model. The larger the context window, the larger the data set retained during a session, and the more the network needs to flex and scale to handle extremely large data sets even within a single session. So it's great to see the Google networking team think hard about this problem and offer up Model as a Service Endpoints, giving model developers a set of tools to optimize the network for these unique conditions – including load balancing, private service endpoints, and service discovery.
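To put a rough number on that scaling, here is a back-of-envelope sketch (in Python) of how the per-session KV cache of a transformer-style model grows with context length. The layer count, KV-head count, and head dimension below are hypothetical placeholders rather than any particular model, and the actual traffic a deployed endpoint generates depends heavily on the serving architecture.

```python
# Back-of-envelope: KV-cache size per session as context windows grow.
# All model dimensions below are hypothetical placeholders, not any specific LLM.

def kv_cache_bytes(context_len, n_layers=80, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Keys + values: one entry per layer per KV head per token (2 bytes each, e.g. bf16)."""
    return 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value

for ctx in (8_192, 32_768, 128_000, 1_000_000):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"context {ctx:>9,} tokens -> ~{gib:6.1f} GiB of KV cache per session")
```

Even with these toy numbers, the per-session state grows linearly with the context window, which is exactly the kind of pressure that load balancing, private endpoints, and service discovery need to absorb.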
The Big Emerging Problem: Fast Access to Data
The amazing innovations in compute and accelerators have rightfully captured the world's attention, but compute is just one leg of the infrastructure stool. As the performance demands on compute scale linearly, the demands on data storage multiply exponentially. The problem is twofold. First is the simple need to handle more data (now trillions of parameters for the largest models) and to do it faster. Second, the workload patterns for AI model training, tuning, and inference are all different from each other – and different from most workload patterns we're used to. Model training, for example – which Joel Kaufman and Mike Bloom each address extremely well – is all about fast metadata handling across millions of tiny files.
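To see why metadata handling dominates, consider that each tiny sample costs a listing/stat/open round trip before a single byte moves. The toy Python timing below illustrates the pattern on a local directory; it is not a benchmark of any particular storage system, and the file counts and sizes are arbitrary.

```python
# Toy illustration: with tiny files, per-file metadata operations (list/stat/open)
# dominate over the actual byte transfer. Not a benchmark of any real storage system.
import os, time, tempfile

N_FILES, FILE_SIZE = 20_000, 4 * 1024            # 20k files of 4 KiB each

with tempfile.TemporaryDirectory() as root:
    payload = os.urandom(FILE_SIZE)
    for i in range(N_FILES):
        with open(os.path.join(root, f"sample_{i:07d}.bin"), "wb") as f:
            f.write(payload)

    t0 = time.perf_counter()
    names = os.listdir(root)                     # metadata: enumerate the dataset
    for name in names:
        os.stat(os.path.join(root, name))        # metadata: one stat per sample
    t_meta = time.perf_counter() - t0

    t0 = time.perf_counter()
    total = 0
    for name in names:
        with open(os.path.join(root, name), "rb") as f:   # open is metadata too
            total += len(f.read())               # the data transfer itself
    t_read = time.perf_counter() - t0

print(f"metadata pass: {t_meta:.2f}s, read pass: {t_read:.2f}s, "
      f"{total / 2**20:.0f} MiB moved")
```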
So it's great to see Google raise its game on the storage side this week, helping customers with AI model training, tuning, and inference through several additions to its portfolio of native storage offerings. Cloud Storage FUSE (now GA) and Parallelstore (in preview) both gained caching capabilities expected to improve throughput by 3x to 4x and model training times by 2x to 3x. For inference, Hyperdisk ML (in preview) is a new block storage service optimized for AI inference/serving workloads. Google claims Hyperdisk ML will dramatically accelerate model load times compared to common alternatives and will offer cost efficiency through read-only, multi-attach, and thin provisioning.
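As a sketch of how a training job might consume the Cloud Storage FUSE improvements, the snippet below reads a dataset twice through a path assumed to be a gcsfuse mountpoint (for example, mounted beforehand with `gcsfuse my-training-bucket /mnt/gcs`); with the new file cache enabled, the second pass should largely be served locally. The bucket name, mount path, and any speedup are placeholders and assumptions, not measurements.

```python
# Sketch: training-style reads through an assumed Cloud Storage FUSE mountpoint.
# Assumes the bucket was mounted beforehand (e.g. `gcsfuse my-training-bucket /mnt/gcs`)
# with the file cache enabled; bucket and path names are placeholders.
import os, time

MOUNT = "/mnt/gcs/dataset"                       # hypothetical mountpoint + prefix

def read_epoch(root):
    total = 0
    for name in sorted(os.listdir(root)):        # FUSE turns this into object listing
        with open(os.path.join(root, name), "rb") as f:
            total += len(f.read())
    return total

for epoch in range(2):
    t0 = time.perf_counter()
    nbytes = read_epoch(MOUNT)
    dt = time.perf_counter() - t0
    # With the file cache warm, the second pass should be noticeably faster.
    print(f"epoch {epoch}: {nbytes / 2**20:.0f} MiB in {dt:.1f}s")
```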
These are exciting announcements, and they follow on the heels of the speed bumps AWS added to Amazon S3 and FSx last fall. While AI in the cloud is grabbing all the headlines, the broader trend is to leverage cloud deployment models for all sorts of high-performance workloads. Forward-looking customers would be wise to proceed with caution, though. Incremental improvements in the form of speed bumps to existing capabilities (even seemingly impressive ones) are more likely to just kick the can down the road. For example, adding block storage to accelerate ML training and tuning times will be extremely challenging in a world where file-based storage is the lingua franca for AI. It's more likely to create another – slightly faster – silo of data that AI scientists will have to manage across their pipeline. And the complexity of multiple file storage offerings (AWS offers four flavors; Google now has three) is more likely to confuse customers and lead to even more silos of sub-optimized storage than ever before.
Building data solutions that meet these same requirements – incredibly high performance at massive scale – is an area the team at WEKA has been focused on since our founding in 2013. We've learned that the data management techniques used to drive more traditional high-performance computing workloads are ideal for solving the data problems of AI model training. It's why we built a software-first approach to data management based on a massively parallel data processing architecture. The WEKA® Data Platform combines the speed of the NVMe flash cache available right on the compute (or the GPU/TPU accelerator in the case of AI) with the economics of object storage in a single namespace. Along the way, we learned that customers didn't want to manage data operations across a pipeline and certainly didn't have time to worry about storage tiering, so we built it into our software and handle it all automatically – eliminating the need for silos between file, block, and object storage and simplifying data operations.
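To make the single-namespace idea concrete, here is a deliberately simplified toy in Python: reads are served from a small "flash" cache when they can be and fall back to an injected object-store fetch when they can't, with promotion on miss. It is a conceptual illustration of tiering behind one namespace, not how the WEKA Data Platform (or any real system) is implemented.

```python
# Toy illustration of tiering behind a single namespace: serve reads from a local
# NVMe-style cache when possible, fall back to an object store, and promote on miss.
# Conceptual sketch only -- not how the WEKA Data Platform is implemented.
from typing import Callable, Dict

class TieredNamespace:
    def __init__(self, fetch_from_object_store: Callable[[str], bytes], cache_capacity: int):
        self._fetch = fetch_from_object_store      # e.g. a GCS/S3 GET, injected by the caller
        self._cache: Dict[str, bytes] = {}         # stands in for an NVMe flash tier
        self._capacity = cache_capacity

    def read(self, path: str) -> bytes:
        if path in self._cache:                    # hot tier hit: served at "flash" speed
            return self._cache[path]
        data = self._fetch(path)                   # cold tier: fetch from object storage
        if len(self._cache) >= self._capacity:     # naive eviction keeps the sketch short
            self._cache.pop(next(iter(self._cache)))
        self._cache[path] = data                   # promote so the next read is a hit
        return data

# Usage: the application sees one namespace; tiering happens behind the read() call.
ns = TieredNamespace(lambda p: b"bytes for " + p.encode(), cache_capacity=1024)
print(ns.read("datasets/train/shard-0001"))        # miss -> object store, then cached
print(ns.read("datasets/train/shard-0001"))        # hit -> served from the cache tier
```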