For The Want Of A Nail – Part 5 Of 5: Enabling AI Storage For Organizations Of All Sizes

Liran Zvibel. March 2, 2018

For the Want of a Nail Part 1 – How Infrastructure May Be Limiting AI Adoption
For the Want of a Nail Part 2 – Aligning Data Center Storage with the Needs of AI Workloads
For the Want of a Nail – Part 3 – AI Depends on Large Scale Storage
For the Want of a Nail – Part 4 – Want AI? You’ll Need a Modular Approach to Maximize GPU Performance

Over my past few blogs I’ve discussed how Artificial Intelligence (AI) is all around us and how it can unlock latent knowledge deep within your organization’s data stores. Whatever industry your business is in, AI can help.

If you want to take full advantage of the business transformation that AI affords, then you’ll need to think differently about your infrastructure. AI and learning applications (deep learning and machine learning) require massive amounts of compute power, network bandwidth, and fast AI storage. Fortunately, GPU-based servers are ideally suited to these applications, InfiniBand delivers extremely low latency and high network bandwidth, and parallel file systems make all the server-centric storage and data shareable.

However, not all parallel file systems are created equal. In fact, the traditional file systems used in high-performance computing (Lustre and IBM Spectrum Scale) were not designed to take advantage of the performance and low latency of NVMe flash. This is important because AI is one of the most demanding workloads today: it mixes large and small files, random and sequential access, and structured and unstructured data. AI applications are also very metadata intensive, so the file system must consistently deliver very high metadata performance, which is not an easy task. To perform acceptably with these legacy file systems, AI storage systems must be over-engineered and augmented with large caching devices just to provide decent small-file and metadata performance. The result is an overly expensive solution.

GPU servers are an expensive and scarce resource, but they can process data hundreds of times faster than a comparable CPU-based server. The table below from an article in The Next Platform illustrates this point well. Note the extreme difference between the performance of a Xeon CPU-based server and that of Nvidia’s DGX-1 GPU server. This difference in performance puts a huge demand on the supporting network and AI storage infrastructure.

Newest-generation GPU servers consume data at a rate of 40 to 80 gigabytes per second, so a 10-node GPU cluster requires an interconnect and AI storage system that can sustain 400 to 800 gigabytes per second in aggregate. Such an infrastructure would be quite expensive using legacy storage solutions. However, it doesn’t have to be.
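The 800 gigabytes per second figure comes from multiplying the node count by the upper end of the per-server range. As a minimal sketch of that arithmetic (the function and variable names here are illustrative, and the calculation assumes every node reads at its full rate at the same time, with no overhead):

```python
def aggregate_gb_per_s(nodes: int, per_node_gb_per_s: float) -> float:
    """Total sustained throughput if every node reads at full rate simultaneously."""
    return nodes * per_node_gb_per_s


NODES = 10                 # cluster size used in the text
PER_NODE_LOW = 40          # GB/s, lower end of the per-server range in the text
PER_NODE_HIGH = 80         # GB/s, upper end of the per-server range in the text

low = aggregate_gb_per_s(NODES, PER_NODE_LOW)
high = aggregate_gb_per_s(NODES, PER_NODE_HIGH)

print(f"{NODES} nodes need {low} to {high} GB/s of sustained throughput")
```

Real clusters rarely hit the peak on every node at once, so treat the upper figure as a sizing ceiling rather than a typical load.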

You can position your organization for the future while protecting your existing investments by taking a software-centric approach to AI, learning systems, and data management. WekaIO has developed an AI storage solution well suited to demanding machine learning workloads and has achieved the #1 rank as the world’s fastest file system. When coupled with an InfiniBand network, it provides over 10 gigabytes per second of bandwidth per GPU network link and scales to multiple connections in a single GPU system. In fact, this combination provides performance that is over 3x faster than a local file system with a direct-attached all-flash array.
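To relate the per-link figure to the 40 to 80 gigabytes per second a GPU server can consume, here is a rough sketch of how many links one server would need. It assumes exactly 10 gigabytes per second per link (the text says "over 10", so these counts are upper bounds) and ignores protocol overhead; the names are illustrative only.

```python
import math

LINK_GB_PER_S = 10  # assumed floor per network link, taken from "over 10 GB/s" in the text


def links_needed(node_gb_per_s: float, link_gb_per_s: float = LINK_GB_PER_S) -> int:
    """Smallest whole number of links whose combined bandwidth covers one node's demand."""
    return math.ceil(node_gb_per_s / link_gb_per_s)


links_low = links_needed(40)    # lower end of the per-server range
links_high = links_needed(80)   # upper end of the per-server range

print(f"One GPU server needs roughly {links_low} to {links_high} links")
```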

As a shareable file system, Weka is also cloud native, meaning that you can easily burst your AI workloads to a Weka-enabled GPU cluster in AWS using the Snap-to-S3 feature. This allows you to avoid investing in a huge AI cluster of your own: simply spin up a GPU cluster in AWS on demand. Weka leverages S3-compatible object storage to scale cost-effectively as your training data sets grow. Data management is point-and-click easy, or you can run your automated scripts using our CLI. A single admin without any special training can easily manage petabytes of AI storage.

Overcoming the infrastructure challenge means that AI is no longer just for the big guys; it is within cost-effective reach of organizations like yours.

If this sounds intriguing to you, or if you’d just like to learn a bit more, I suggest you check out these resources or our website in general. You can learn in detail how Weka fundamentally changes AI and data management for the better. In addition, you can see real-world applications of Weka technology and how we partner with some of the leading supercomputer centers and server and networking vendors to build out an optimized AI storage solution. Better yet, if you are ready to embrace the future of AI, give us a call. We’d be happy to discuss your needs further.

You, too, can maximize your business potential through AI applications. You can do this with WekaIO, the fastest, most scalable AI storage file system for compute-intensive applications. WekaIO: Intelligence Accelerated. Thanks for reading.