For the Want of a Nail – Part 4 of 5: Want AI? You’ll need a modular approach to maximize GPU performance
- For the Want of a Nail – Part 1 – How Infrastructure May Be Limiting AI Adoption
- For the Want of a Nail – Part 2 – Aligning Data Center Storage with the Needs of AI Workloads
- For the Want of a Nail – Part 3 – AI Depends on Large Scale Storage
- For the Want of a Nail – Part 5 – Enabling AI For Organizations Of All Sizes
Previously I’ve discussed how AI is all around us. It’s obvious that AI storage scalability is essential, perhaps at a level you might not have experienced in the past. AI workloads are highly parallel, with continuously interrelated activities that act as feedback loops to deliver more refined and actionable data. However, simply expanding your existing infrastructure will not enable a successful AI deployment. While you can achieve scale through Cloud Service Providers (CSPs), their offerings are rarely tailored to a specific need, and WAN connectivity is not cost effective at the requisite I/O bandwidth and latency. Onsite high-performance NAS isn’t the answer either: while it might appear to be an easy and immediate solution, design limitations severely restrict its scalability, and hence its suitability. Scaling is more than raw capacity; it’s the ability to handle millions of directories with billions of files per directory, while delivering consistent performance and latency whether a directory contains five 10 TB files or 7 million 4 KB files. And consideration must be paid to optimizing GPU performance: don’t let these expensive resources sit idle.
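To put that last point in perspective, here is a quick back-of-the-envelope comparison (plain Python, purely illustrative): the large-file directory holds roughly 1,800x more data, while the small-file directory carries 1.4 million times more metadata entries.

```python
# Back-of-the-envelope comparison of two directories that stress a file
# system in opposite dimensions: raw capacity vs. metadata (file count).
TB = 10**12
GB = 10**9
KB = 10**3

large_file_dir_bytes = 5 * 10 * TB         # five 10 TB files
small_file_dir_bytes = 7_000_000 * 4 * KB  # seven million 4 KB files

print(f"Large-file directory: {large_file_dir_bytes / TB:.0f} TB in 5 files")
print(f"Small-file directory: {small_file_dir_bytes / GB:.0f} GB in 7,000,000 files")
```

A system tuned only for streaming large files will struggle with the second directory’s metadata load, even though it holds a tiny fraction of the capacity.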
Data management is an important part of any AI storage solution. While data layout is not as critical as in the past (locality, for example, has been overcome by flash and low-latency interconnects such as InfiniBand), it remains a consideration in the overall design. Effective management includes appropriate storage tiering that aligns with performance and cost considerations; data protection commensurate with data value; and support for concurrent access, multiple protocols, and differing data types, whether file or object, structured or unstructured. Further, disparate workloads require support for multiple file systems: logical partitions may differ in Quality of Service needs such as performance, data protection level (N+2, N+4, etc.), and use case (scratch vs. library vs. working data set, etc.).
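As a rough illustration of what multiple file systems with differing QoS might look like, here is a minimal Python sketch. The class, field names, and values are hypothetical, not any particular product’s API:

```python
from dataclasses import dataclass

@dataclass
class FileSystemPartition:
    """Hypothetical per-file-system QoS policy (illustrative only)."""
    name: str
    protection: str  # erasure-coding level, e.g. "N+2" or "N+4"
    tier: str        # e.g. "flash" for hot data, "object" for cold
    use_case: str    # scratch, library, working data set, ...

partitions = [
    FileSystemPartition("scratch",  "N+2", "flash",  "scratch"),
    FileSystemPartition("library",  "N+4", "object", "library"),
    FileSystemPartition("training", "N+2", "flash",  "working data set"),
]

for p in partitions:
    print(f"{p.name}: {p.protection} protection on the {p.tier} tier ({p.use_case})")
```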
It’s time for something new. But what is a viable alternative?
A successful solution needs to prioritize fast, efficient data storage and scale to petabytes of capacity. This implies a modular approach, one that balances capacity and performance while retaining the ability to scale either independently of the other. Innovative data management tools that enable dynamic, tailored, high-performance AI storage across a multi-petabyte namespace are essential. Keep in mind that scalability is a two-way street: unused capacity represents money and resources that could be better spent elsewhere. An efficient data protection scheme is therefore critical. Why settle for triple replication, which consumes 3x raw capacity for every usable byte, when erasure coding can deliver the same protection at significantly lower capital and operating expense?
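The arithmetic behind that claim is simple. A minimal sketch, with illustrative stripe widths (real systems also reserve capacity for metadata and rebuild spares):

```python
def raw_per_usable(data_units: int, parity_units: int) -> float:
    """Raw capacity consumed per byte of usable data."""
    return (data_units + parity_units) / data_units

# 3-way replication is effectively 1 data unit plus 2 full copies.
print(f"3-way replication: {raw_per_usable(1, 2):.2f}x raw per usable byte")
print(f"16+2 erasure code: {raw_per_usable(16, 2):.3f}x raw per usable byte")
print(f"16+4 erasure code: {raw_per_usable(16, 4):.2f}x raw per usable byte")
```

The same N+2 or N+4 protection levels mentioned earlier cost a small fraction of replication’s raw capacity.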
Critically, any solution must deliver high bandwidth and low latency while maintaining a consistent performance profile across all file sizes. AI relies heavily on GPU-based servers that can cost between ~$100k and ~$250k each. Inadequate storage performance starves the GPUs of data, leaving compute cycles idle, a very inefficient use of such expensive resources. Since AI data can be found along the edges of the network, support for distributed environments, including parallel access via POSIX, NFS, SMB, and HDFS protocols, is key. Coherent, consistent performance regardless of file location is also an essential part of any data management solution. “What about data locality?” you might be thinking. Data locality is no longer relevant: local-copy architectures (e.g., Hadoop, or caching solutions) were developed in an era when networking was 1 Gbit Ethernet and HDD was the default storage medium. Modern 10 Gbit and 100 Gbit Ethernet networks are 10x to 100x faster than any single SSD, and it is much easier today to create distributed algorithms where locality doesn’t matter. With the right networking stack, shared distributed storage is faster than local storage.
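To see how storage throughput translates directly into GPU dollars, consider this toy model; every number below is an assumption chosen for illustration, not a benchmark of any system:

```python
# Toy model of GPU starvation: when aggregate data demand exceeds what
# storage can sustain, the GPUs sit idle waiting on I/O.
num_gpus = 8
per_gpu_demand_gb_s = 3.0     # assumed GB/s each GPU consumes at full load
storage_delivery_gb_s = 12.0  # assumed GB/s the storage system sustains

aggregate_demand = num_gpus * per_gpu_demand_gb_s
utilization = min(1.0, storage_delivery_gb_s / aggregate_demand)

print(f"Aggregate demand: {aggregate_demand:.0f} GB/s")
print(f"GPU utilization:  {utilization:.0%}")
print(f"Idle compute:     {1 - utilization:.0%} of a ~$100k-$250k server")
```

Under these assumptions, half of every GPU dollar is wasted waiting on storage; raising storage throughput, not buying more GPUs, is the fix.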
Furthermore, the solution must achieve these goals cost-effectively, so that organizations of almost any size can economically justify investments in AI. This precludes simply buying additional equipment and hiring more personnel, as in the past. Instead, why not rethink how the resources already within an organization are deployed, to maximize the ROI on existing investments? After all, AI is all about unleashing latent knowledge and commercial value. By solving data management and AI storage challenges in a new way, at a fraction of the cost of traditional approaches, AI initiatives come within the economic reach not only of large enterprises but of small and mid-sized organizations as well.
Rather than investing in additional hardware or CSP services, consider leveraging your existing infrastructure and spare resources to build a software-centric data management approach that meets your AI needs. Does this sound too good to be true? Well, it’s not. In the final blog in this series I will show exactly how WekaIO is enabling AI storage for organizations of all sizes. I invite you to join me there.