Bridging the Gap: How WEKA Redefines Networking for AI and HPC Workloads using NVIDIA Spectrum-X
Choosing between InfiniBand and Ethernet for HPC workloads often feels like a matter of faith: InfiniBand devotees swear by its unmatched performance and ultra-low latency, while Ethernet enthusiasts champion its versatility, scalability, and ever-improving capabilities. WEKA takes a pragmatic approach to this “religious” debate by supporting both InfiniBand and Ethernet. Its highly optimized network stack delivers performance and flexibility on either fabric, addressing two critical needs: high-bandwidth networked access to petabytes of data for model training, and low-latency access to storage for inference tasks like Retrieval-Augmented Generation (RAG).
Ethernet is emerging as the backbone of AI factories, providing scalable, high-throughput networking for massive datasets and complex neural networks. The NVIDIA Spectrum-X AI networking platform brings low latency and high effective bandwidth to Ethernet through advanced adaptive routing and congestion control, delivering consistent, predictable performance for demanding AI and HPC workloads. Paired with WEKA’s optimized architecture, Spectrum-X helps minimize data bottlenecks and keep data flowing smoothly, making it well suited to multi-node, GPU-intensive applications and large-scale AI/ML pipelines. Together, WEKA and NVIDIA technologies let researchers focus on innovation rather than infrastructure constraints.
WEKA has conducted proof-of-concept tests with Spectrum-X, yielding impressive results that demonstrate the transformative potential of these technologies. Early testing revealed up to a 42% increase in throughput, underscoring the combined power of Spectrum-X’s advanced networking and WEKA’s high-performance data platform. These results highlight the ability of the solution to handle even the most demanding AI and HPC workloads with efficiency and reliability.
Storage operations are the backbone of the clusters that power AI factories, supporting the end-to-end data lifecycle required for training, inference, and optimization. This includes critical tasks such as data ingestion for bulk loading and streaming, preprocessing for staging and caching, and efficient training access for managing metadata and performing sequential or random reads. AI clusters also rely on robust storage for model checkpoints, artifacts, and versioning, as well as distributed storage for managing gradients, parameters, and shared buffers.
Operational excellence adds further tasks: logging training metrics, collecting system telemetry, and performing backups through snapshots and archival. AI factories require a performant data infrastructure that addresses all of these needs so that they can maintain seamless data flow, minimize latency, and enhance overall productivity.
Spectrum-X’s advanced networking capabilities keep WEKA storage traffic flowing without interruption even under heavy loads. This is critical in AI and HPC environments where data availability and consistent performance are paramount. By dynamically rerouting traffic to avoid bottlenecks and leveraging technologies like RDMA over Converged Ethernet (RoCE) for low-latency data transfers, Spectrum-X delivers smooth and efficient storage operations. This translates directly into faster AI training cycles, reduced time to insight, and better performance for other demanding AI workloads.
WEKA’s testing with Spectrum-X has demonstrated significant improvements in networking performance through adaptive routing (AR) and congestion control. In scenarios using three inter-switch links (ISLs) with a combined capacity of 300Gbps, performance without AR was limited to 23.5GB/s. With Spectrum-X’s AR capabilities enabled, WEKA achieved the full network bandwidth of 33.5GB/s, a 42% increase in throughput. This enhancement allows AI and HPC workloads to fully leverage available network capacity without requiring any modifications to WEKA’s software, streamlining deployment and maximizing efficiency.
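The arithmetic behind these figures can be checked with a short sketch. It assumes the 300Gbps is split evenly across the three ISLs and that 1GB means 10^9 bytes; the function and variable names are illustrative and do not come from WEKA or NVIDIA tooling. It also converts the combined link capacity to bytes per second, which gives a raw line rate of 37.5GB/s. Measured payload throughput normally sits somewhat below raw line rate because packet headers consume part of it; the source does not break down that gap.

```python
def gbps_to_gbytes_per_s(gbps: float) -> float:
    """Convert gigabits per second to gigabytes per second (8 bits per byte)."""
    return gbps / 8.0


def percent_increase(before: float, after: float) -> float:
    """Relative gain of `after` over `before`, in percent."""
    return (after - before) / before * 100.0


# Figures quoted in the article.
ISL_COUNT = 3
ISL_GBPS = 100.0          # assumption: 300 Gbps divided evenly over three ISLs
WITHOUT_AR_GBYTES = 23.5  # measured throughput without adaptive routing
WITH_AR_GBYTES = 33.5     # measured throughput with adaptive routing

raw_line_rate = gbps_to_gbytes_per_s(ISL_COUNT * ISL_GBPS)
gain = percent_increase(WITHOUT_AR_GBYTES, WITH_AR_GBYTES)
utilization_without_ar = WITHOUT_AR_GBYTES / raw_line_rate
utilization_with_ar = WITH_AR_GBYTES / raw_line_rate

print(f"raw line rate:        {raw_line_rate:.1f} GB/s")
print(f"throughput gain:      {gain:.1f}%")
print(f"utilization (no AR):  {utilization_without_ar:.0%}")
print(f"utilization (AR on):  {utilization_with_ar:.0%}")
```

Under these assumptions the gain works out to roughly 42.6%, consistent with the 42% quoted above, and link utilization rises from about 63% to about 89% of raw line rate.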
Furthermore, testing revealed that without AR, a single ISL often became congested, creating a performance bottleneck that restricted overall throughput. When AR was enabled, traffic was intelligently distributed across all ISLs, ensuring balanced load distribution and unlocking the full potential of the network. As a result, once the ISLs were saturated at 256K reads and above, performance remained consistent, further underscoring the ability of Spectrum-X to dynamically optimize data flow and reduce networking bottlenecks.
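To make the single-congested-ISL effect concrete, here is a deliberately simplified toy model, not Spectrum-X's actual algorithm. It compares pinning each flow to one link by a static hash (modeled here as `flow_id % links`, a stand-in for an ECMP-style hash) against sending each chunk over the currently least-loaded link. The flow sizes, the flow IDs, the 12.5GB/s per-link rate (100Gbps per ISL), and all function names are assumptions chosen for illustration.

```python
LINKS = 3
LINK_GBYTES_PER_S = 12.5  # assumption: 100 Gbps per ISL, expressed in GB/s


def static_hash_loads(flow_sizes: dict, links: int) -> list:
    """Pin every flow to a single link chosen by a hash of its ID."""
    loads = [0.0] * links
    for flow_id, size in flow_sizes.items():
        loads[flow_id % links] += size  # toy hash: flows may collide
    return loads


def adaptive_loads(flow_sizes: dict, links: int, chunk: float = 1.0) -> list:
    """Send each chunk of every flow over the currently least-loaded link."""
    loads = [0.0] * links
    for size in flow_sizes.values():
        remaining = size
        while remaining > 0:
            piece = min(chunk, remaining)
            target = loads.index(min(loads))  # pick the least-loaded link
            loads[target] += piece
            remaining -= piece
    return loads


def aggregate_throughput(loads: list, link_rate: float) -> float:
    """Total data divided by the time the busiest link needs to drain."""
    finish_time = max(loads) / link_rate
    return sum(loads) / finish_time


# Four hypothetical 30 GB flows; IDs 0 and 3 hash to the same link.
flows = {0: 30.0, 3: 30.0, 1: 30.0, 2: 30.0}

static = static_hash_loads(flows, LINKS)   # one link carries twice the load
adaptive = adaptive_loads(flows, LINKS)    # load spread evenly

static_rate = aggregate_throughput(static, LINK_GBYTES_PER_S)
adaptive_rate = aggregate_throughput(adaptive, LINK_GBYTES_PER_S)

print("static hash loads:", static, "->", round(static_rate, 1), "GB/s")
print("adaptive loads:   ", adaptive, "->", round(adaptive_rate, 1), "GB/s")
```

In this idealized model the colliding flows cap the static case well below the aggregate link capacity, while the balanced case reaches it. Real fabrics add protocol overhead, congestion control, and packet reordering concerns that the sketch ignores, so its numbers should not be read as a prediction of the measured results.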
NVIDIA Spectrum-X enables exceptional scale, supporting up to 256 200G ports in a single hop or 16,000 ports in a two-tier leaf/spine topology. This scalability, combined with WEKA’s ability to support thousands of data nodes and dynamically distribute data across them, means that AI clusters can grow while maintaining high performance and operational efficiency.
NVIDIA BlueField-3 SuperNICs and DPUs are critical components that enable WEKA to harness the power of Spectrum-X. These advanced network accelerators bring intelligence to networking and storage by offloading and accelerating data-intensive tasks, ensuring optimal performance and scalability. BlueField-3 SuperNICs and DPUs work with Spectrum-X to enable the adaptive routing, real-time telemetry, and congestion management features needed to maintain peak efficiency in AI and HPC environments.
WEKA’s commitment to innovation includes full support for SuperNICs and DPUs, highlighted by its recent announcement about BlueField-3 integration. This integration means that WEKA’s high-performance data platform leverages the full capabilities of BlueField-3, enabling customers to achieve exceptional efficiency, visibility, and performance. WEKA, with BlueField-3 SuperNICs and DPUs, delivers a robust foundation for next-generation AI factories and HPC workloads.
NVIDIA Air and What Just Happened (WJH) technologies represent pivotal advancements in network visibility and simulation, enabling detailed insights and streamlined configuration for complex AI and HPC environments. NVIDIA Air provides a virtualized testing and simulation environment for enterprises to model and validate their AI storage fabric before deployment, reducing errors and optimizing performance. WJH offers in-band telemetry tools that capture real-time performance data and pinpoint bottlenecks so that AI factories can run at peak efficiency.
WEKA recognizes the transformative potential of these tools and is excited to support them in the future. By integrating these capabilities, WEKA aims to provide customers with unparalleled network visibility and performance validation, further enhancing the reliability and efficiency of AI and HPC infrastructures.
WEKA is thrilled to announce that full support for Spectrum-X will be coming to market later this year. This development will bring even greater performance enhancements and expanded capabilities to AI and HPC environments. By fully leveraging Spectrum-X’s advanced features, WEKA aims to empower customers with a cutting-edge solution that optimally integrates networking, compute, and storage for seamless, high-performance operations.
WEKA’s high-performance data platform, working with NVIDIA Spectrum-X, exemplifies how modern AI factories can achieve extraordinary efficiency, scalability, and innovation. By offering seamless integration of networking, compute, and storage, these technologies eliminate bottlenecks, enhance throughput, and enable dynamic growth to meet the demands of AI and HPC workloads. As the scale and complexity of AI environments continue to grow, solutions like WEKA and Spectrum-X will play a pivotal role in empowering organizations to innovate faster, optimize resources, and stay ahead in an increasingly competitive, data-driven world.