Machine Learning Stacks in the Cloud
Not long ago, we reviewed the latest in ML infrastructure in the cloud. Since that post, the race for AI leadership among cloud providers has continued to heat up. Now, on the eve of major customer conferences from Microsoft and AWS, it is a good time to check in on the latest developments. In the marathon to win AI workloads, the major cloud providers are starting to play their cards, and we're already witnessing some surprising entrants.
Forward-thinking organizations continue to scrutinize their AI and cloud strategies and ask hard questions. What are the intersection points? How can the cloud accelerate my AI initiatives? What are the risks? What are the blind spots? Three main criteria are emerging that leading organizations use to answer these questions and set themselves apart from the competition: AI infrastructure, AI developer platforms, and AI-driven applications. Our first blog went deep on AI infrastructure. In this post, we'll unpack where things stand for AI developer platforms and uncover surprising opportunities to capitalize on rapid shifts in the market.
Machine Learning Developer Stacks in the Cloud
When it comes to developer platforms, the choice between Azure ML, Amazon SageMaker, and Google's Vertex AI largely comes down to familiarity, vertical integration, and framework selection. The differences between the three offerings expose the strategies each cloud provider is using to convince developers to build on its platform. Azure ML offers the deepest integration with the broader Microsoft ecosystem, including Azure Cloud and Microsoft developer tools like VS Code. Developers can build natively on Azure ML using the TensorFlow or PyTorch frameworks and can take advantage of deep integration with Azure OpenAI – which is Microsoft's true ace in the hole at the moment (more on this below). A great example of the depth of integration that sets Microsoft apart is its Copilots. By deploying OpenAI foundation models for Security and Microsoft 365, Microsoft is delighting its customers with highly targeted, specific AI applications that surface AI capability in the contexts Azure customers already know and love.
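To make that Azure OpenAI integration concrete, here is a minimal sketch of what calling a deployed model from Python can look like, assuming the openai v1.x SDK; the endpoint, environment variable names, and the "gpt-4" deployment name are illustrative placeholders, not details from this post.

```python
# Minimal sketch: calling an Azure OpenAI deployment from Python.
# Assumes the openai v1.x SDK; the endpoint, API version, and
# deployment name are placeholders for your own Azure resource.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],  # e.g. https://<resource>.openai.azure.com
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2023-05-15",
)

response = client.chat.completions.create(
    model="gpt-4",  # the *deployment* name chosen in Azure OpenAI Studio, not the raw model name
    messages=[{"role": "user", "content": "Summarize our Q3 incident reports."}],
)
print(response.choices[0].message.content)
```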
In contrast, AWS developer tools for ML workloads and AI training continue to offer developers the most flexibility but remain highly fragmented. Over time, AWS has brought its ML developer tools together under Amazon SageMaker – 18 (and counting) major capabilities spanning everything from RStudio and Experiments to low-code tooling – and recently launched Amazon Bedrock to provide access to an impressively large catalog of foundation models. As with all the major cloud providers, developers can also get started building AI applications directly on Amazon EC2 using Deep Learning AMIs, building custom models and applications either with the catalog of models from Amazon Bedrock or, increasingly, with the Hugging Face catalog of models, which integrates directly with SageMaker. In the AI assistant space, Amazon CodeWhisperer is an intriguing entry, but the jury is still out on customer adoption.
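As a rough illustration of the Bedrock path, here is a minimal sketch of invoking a hosted foundation model with boto3; the region, the Claude v2 model ID, and the request shape are assumptions based on Anthropic's models on Bedrock, not specifics from this post.

```python
# Minimal sketch: invoking a foundation model through Amazon Bedrock.
# Assumes Bedrock model access is enabled in your account and region;
# the model ID and prompt format follow Anthropic's Claude v2 on Bedrock.
import json
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

body = json.dumps({
    "prompt": "\n\nHuman: Suggest three names for a data pipeline service.\n\nAssistant:",
    "max_tokens_to_sample": 256,
})

response = bedrock.invoke_model(
    modelId="anthropic.claude-v2",
    contentType="application/json",
    accept="application/json",
    body=body,
)
print(json.loads(response["body"].read())["completion"])
```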
Meanwhile, Google's Vertex AI platform arguably has the deepest capabilities for AI developers and provides access to some of the most mature foundation models. Google recently claimed that 70% of AI start-ups rely on Google Cloud infrastructure and AI capabilities. This is likely driven by Vertex AI's support for, and continued adoption of, the most popular developer frameworks as they evolve, including TensorFlow and PyTorch. Google also had an impressive set of updates around Duet AI earlier this summer, embedding its AI assistant into the full suite of Google Workspace apps.
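For a sense of the Vertex AI developer experience, here is a minimal sketch of calling one of Google's foundation models through the Vertex AI SDK; the project ID is a placeholder, and the "text-bison" (PaLM 2) model name is an assumption about what is available in your project and region.

```python
# Minimal sketch: calling a Google foundation model through Vertex AI.
# Assumes the google-cloud-aiplatform SDK and that "text-bison" (PaLM 2)
# is available in your project and region; the project ID is a placeholder.
import vertexai
from vertexai.language_models import TextGenerationModel

vertexai.init(project="my-gcp-project", location="us-central1")

model = TextGenerationModel.from_pretrained("text-bison")
result = model.predict("Draft a one-line product description for a GPU scheduler.")
print(result.text)
```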
Over the last 6 to 9 months, NVIDIA, on the strength of CUDA, has emerged as an unlikely but increasingly dominant leader in the race to win AI developers. The time, energy, and investment NVIDIA has poured into CUDA are well documented. Developers can build AI models using their preferred tools, like Jupyter notebooks or PyTorch, while training them on what remains (at least for now) the de facto compute platform: NVIDIA H100 and the just-announced H200 GPUs. NVIDIA continues to extend the advantage it has in the CUDA stack by now offering NVIDIA DGX Cloud and the NVIDIA Omniverse software stack through most of the major cloud providers (including Microsoft, Google, and Oracle). Further, as NVIDIA doubles down on investment in GPU cloud providers like Applied Digital, Iris Energy, and NextGen Cloud, a fourth cloud provider – laser-focused on fractional GPU delivery for ML and AI workloads at an affordable price point and with great performance – could emerge over the next 18 to 24 months.
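What that looks like in practice: a minimal PyTorch sketch of how the same code picks up an NVIDIA GPU through CUDA when one is present, whether that is a laptop card or an H100 node; the layer sizes and batch are arbitrary placeholders.

```python
# Minimal sketch: PyTorch code finding NVIDIA GPUs through CUDA.
# The same training step runs on CPU or GPU, switching on what
# torch detects at runtime; shapes here are arbitrary placeholders.
import torch
import torch.nn as nn

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Training on: {torch.cuda.get_device_name(0) if device.type == 'cuda' else 'CPU'}")

model = nn.Linear(1024, 10).to(device)        # move parameters onto the GPU
batch = torch.randn(32, 1024, device=device)  # allocate the batch there too
loss = model(batch).sum()
loss.backward()                               # gradients computed via CUDA kernels
```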
Where do we go from here?
With Microsoft Ignite upon us and AWS re:Invent only days away, we're bound to see a raft of new offerings that may shake up the landscape further. The extent to which Microsoft is leading in the race for AI leadership is largely thanks to the combination of a deeply integrated stack of developer tools and the OpenAI partnership, which gives builders access to the most popular foundation model in GPT. AWS, having ceded early ground to the rest of the field, is rapidly regaining traction as the flexibility and choice it offers customers continue to attract developers. Meanwhile, the most momentum is with NVIDIA. The developer toolkits NVIDIA has built into CUDA and its NVIDIA AI Enterprise stack will look increasingly attractive to developers, and the just-announced NVIDIA H200 GPUs will likely extend its lead. The transformation being unleashed by AI is just beginning, and we're excited to see what the next leg in the race brings for our customers and for our world.