Extracting the Signals from the Noise
“The signal is the truth. The noise is what distracts us from the truth.”
– Nate Silver, The Signal and the Noise (2012)
One of the most important concepts in communication systems is the signal-to-noise ratio (SNR). Every communication system has some degree of background static – the noise. So, when you want to transmit or convey information – your signal – it must stand out from that background. The concept shows up in communication, radar, imaging, data acquisition, and even information theory. It is also a useful mental model for human-to-human communication of all kinds, including an event like AWS re:Invent. With hundreds of new product and feature announcements, thousands of sessions, and countless talking heads, extracting the signal from the noise is an incredibly difficult task. Here are a few of the key customer signals from AWS re:Invent.
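For readers who want the textbook definition: SNR compares signal power to noise power, and is usually quoted in decibels.

\[
\mathrm{SNR} = \frac{P_{\text{signal}}}{P_{\text{noise}}},
\qquad
\mathrm{SNR}_{\mathrm{dB}} = 10\,\log_{10}\!\left(\frac{P_{\text{signal}}}{P_{\text{noise}}}\right)
\]

A signal ten times more powerful than the background noise, for example, gives an SNR of 10 dB.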
Signal #1: Customers demand an exponential leap forward in data performance
AWS re:Invent kicked off Monday morning with a big bang of data and storage announcements, including a faster option for S3 storage and updates to the Amazon FSx family, which we discussed last week. It’s great to see AWS innovating on behalf of customers to deliver faster-performing storage solutions. All this innovation points to a problem we help customers solve every day: high-performance workloads are moving to the cloud, but existing solutions are not up to the challenge. Generative AI, HPC, and industry-specific applications for drug discovery, VFX rendering, electronic design automation, high-frequency trading, and many more use cases all demand an exponential leap forward in data performance. We love the customer focus AWS shows in these new offerings.
However, in our view, the massive leap forward in data performance, scale, cost control, and simplicity requires a completely new approach. WEKA’s Joel Kaufman recently did a deep dive into the emerging data performance requirements for Generative AI model training. These scenarios require massive IO, high throughput, AND the ability to process millions of tiny files across multiple read/write profiles. Oh, and by the way, the system also needs to handle many data pipelines, all processing data at different stages in parallel, accessed by multiple users spread around the world. While AI model training is the “hot” use case for these capabilities, it turns out it’s not the only one: customers across nearly every industry are looking for them. It’s the fundamental reason we built WEKA in the first place, and it’s why we’re now seeing such traction with customers in a wide range of industries. Even with these new announcements, WEKA remains the fastest, most scalable, and most affordable data option for AWS and for any cloud (or on-prem). WEKA is still the fastest data platform in the cloud by about 10x, still the only offering that scales to exabytes while scaling performance and capacity independently, AND the only offering that scales both up and down to meet customers’ changing workload needs.
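To make the small-file requirement concrete, here is a minimal sketch of the kind of microbenchmark that exposes this bottleneck: many parallel readers pulling tiny files, which stresses metadata and IOPS rather than raw bandwidth. It uses only the Python standard library; the mount path and worker count are hypothetical placeholders, and this is illustrative code, not part of WEKA’s platform.

```python
# Minimal sketch: measure small-file read throughput and files/s.
# Illustrative only -- the path and worker count are hypothetical.
import os
import time
from concurrent.futures import ThreadPoolExecutor

DATA_DIR = "/mnt/dataset"   # hypothetical mount point
NUM_WORKERS = 64            # parallel readers, mimicking multiple pipelines

def read_file(path: str) -> int:
    """Read one file fully and return the number of bytes read."""
    with open(path, "rb") as f:
        return len(f.read())

def run_benchmark() -> None:
    paths = [os.path.join(DATA_DIR, name) for name in os.listdir(DATA_DIR)]
    paths = [p for p in paths if os.path.isfile(p)]  # skip subdirectories
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=NUM_WORKERS) as pool:
        total_bytes = sum(pool.map(read_file, paths))
    elapsed = time.perf_counter() - start
    print(f"{len(paths)} files, {total_bytes / 1e9:.2f} GB in {elapsed:.1f} s")
    print(f"~{len(paths) / elapsed:,.0f} files/s, {total_bytes / elapsed / 1e9:.2f} GB/s")

if __name__ == "__main__":
    run_benchmark()
```

On a dataset of millions of small files, the files/s figure – not GB/s – is usually what collapses first on traditional storage, which is exactly the profile AI training pipelines generate.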
Signal #2: Costs in the cloud are still not under control
It’s worth noting that the entirety of Werner Vogels’ Thursday keynote at AWS re:Invent focused on “Architecting to control costs”. Following the keynote, a new microsite, The Frugal Architect, went up, linked from Werner’s All Things Distributed blog. AWS is clearly hearing from customers that costs in AWS are high and that most customers have difficulty controlling them. The seven laws Werner introduced are a good framework for guiding architects, and they read as an excellent set of Tenets for controlling costs in the cloud. However, as anyone familiar with Amazon Tenets knows, they are guiding principles rather than practical advice – the starting point, not the finish line.
At WEKA, we work with customers every day to help them control costs in the cloud, and a few very predictable patterns emerge. The most common of these is – surprisingly enough – resource over-provisioning. Fundamentally, improvements in network capacity (Amazon EFA now delivers up to 3,200 Gbps of aggregate throughput) and compute capacity (the new Amazon EC2 P5 instances have 192 vCPUs, 640 GB of GPU memory, and 30 TB of local NVMe storage) have outstripped the ability of traditional storage offerings to keep up. This produces a set of knock-on effects that drive up costs. First, customers deploy massive amounts of capacity they don’t actually use: scale and performance limits push customers to deploy idle capacity that exists only to support peak storage IO and bandwidth profiles. Second, and even more importantly, the network, compute, and GPU resources sit idle, starved for data (and wasting energy)! This is another area where WEKA is ideally situated. WEKA is truly unique in its ability to be hyper-efficient in utilizing AWS resources for capacity or performance (or both) and to scale up and down. WEKA customers never over-provision cloud resources, never pay for resources they don’t use, and can save on energy consumption. All this means WEKA customers can reduce their cloud storage spend by half (usually it’s more than that) with a more sustainable solution.
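To see why over-provisioning is so expensive, consider a back-of-envelope model. The numbers below are hypothetical placeholders – not actual AWS pricing or measured WEKA results – and the point is the structure of the waste, not the specific figures.

```python
# Back-of-envelope model of over-provisioning cost. All numbers are
# illustrative placeholders, not actual AWS pricing or WEKA results.
PRICE_PER_TB_MONTH = 100.0  # hypothetical $/TB-month for provisioned storage

provisioned_tb = 1000       # capacity deployed to hit peak IOPS/bandwidth
actually_used_tb = 400      # capacity the workload really needs

idle_tb = provisioned_tb - actually_used_tb
monthly_waste = idle_tb * PRICE_PER_TB_MONTH
waste_fraction = idle_tb / provisioned_tb

print(f"Idle capacity: {idle_tb} TB ({waste_fraction:.0%} of spend)")
print(f"Monthly cost of idle capacity: ${monthly_waste:,.0f}")

# If performance and capacity scale independently, the idle tier disappears:
right_sized_cost = actually_used_tb * PRICE_PER_TB_MONTH
print(f"Right-sized monthly cost: ${right_sized_cost:,.0f}")
```

When capacity must be over-deployed purely to buy peak performance, the idle fraction is pure waste; decoupling the two is what makes the “cut spend in half” outcome possible.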
Signal #3: The best developer tools are winning the AI race in the cloud
AWS has a solid story across the three layers of the AI stack and had meaningful launches in every layer: AI infrastructure, AI developer tools, and AI applications. The most interesting customer signal in all these announcements was AWS’ intention to offer NVIDIA DGX Cloud to its customers. This turnabout from previously stated plans reflects NVIDIA’s growing influence over AI use cases in the cloud. Together with the other NVIDIA announcements about deeper integration with AWS (the Grace Hopper supercomputer, Project Ceiba, and the new developer and training tools in AWS), it signals an emerging key selection criterion in the race to win AI workloads in the cloud: who has the best developer tools. At the moment, NVIDIA’s decade-long investment in the CUDA developer stack is showing its strength. AWS customers are clearly demanding native access to more of the NVIDIA stack of software and developer tools.
We’re excited about this development, as WEKA has made it our mission to provide AI architects and builders with the fastest, most affordable, and most scalable data platform for Gen AI model training, tuning, and inferencing. The WEKA Data Platform in the cloud was the first such offering to deliver high-performance data management using infrastructure-as-code principles – first through automated cloud deployments with AWS CloudFormation, and, just this past week, through Terraform templates. These capabilities enable builders to get started in as little as 30 minutes. Leading Gen AI companies like Stability AI are accelerating their time to results while reducing infrastructure costs, all driven by the simplicity, scalability, performance, and sustainability of our data platform.
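As an illustration of what infrastructure-as-code deployment looks like in practice, here is a minimal sketch using boto3 to launch a CloudFormation stack. The template URL, stack name, region, and parameter names are hypothetical placeholders, not WEKA’s published template.

```python
# Minimal sketch: launch a CloudFormation stack from a template URL.
# URL, stack name, region, and parameters are hypothetical placeholders.
import boto3

cfn = boto3.client("cloudformation", region_name="us-east-1")

response = cfn.create_stack(
    StackName="weka-data-platform-demo",
    TemplateURL="https://example-bucket.s3.amazonaws.com/weka-template.yaml",
    Parameters=[
        {"ParameterKey": "InstanceType", "ParameterValue": "i3en.6xlarge"},
        {"ParameterKey": "ClusterSize", "ParameterValue": "6"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the stack creates IAM roles
)
print("Stack ARN:", response["StackId"])

# Block until stack creation finishes.
waiter = cfn.get_waiter("stack_create_complete")
waiter.wait(StackName="weka-data-platform-demo")
```

A Terraform-based deployment follows the same pattern: the template or module encodes the cluster configuration, and a single command stands it up, which is how a full environment can come together in roughly 30 minutes.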
It’s great to see AWS focusing its innovation on rapidly evolving customer needs for high-performance data, while helping control costs and giving builders deeper toolkits. We’re excited to see what the next wave of builders can accomplish with these features and with the capabilities WEKA is bringing to their AWS workloads.