Artificial intelligence (AI) and machine learning (ML) promise a step change in the automation that is fundamental to IT, with applications ranging from simple chatbots to content generation and control at almost unimaginable levels of complexity.
Storage forms a key part of AI: it provides data for training, stores the potentially large volumes of data generated, and serves data during inference, when the results of AI are applied to real workloads.
In this article, we look at the key characteristics of AI workloads, their storage input/output (I/O) profile, the types of storage suitable for AI, the suitability of cloud and object storage for AI, and storage provider strategy and products for AI.
What are the key characteristics of AI workloads?
AI and ML are based on training an algorithm to detect patterns in data, gain insight into data, and often trigger responses based on those findings. The responses can be very simple, such as the “people who bought this also bought” recommendations based on sales data. Or they can be the kind of complex content we see from large language models (LLMs) in generative AI (GenAI), trained on multiple large datasets that enable them to create compelling text, images and video.
There are three key phases and deployment types for AI workloads:
- Training, where recognition is built into the algorithm from the AI model dataset, with varying degrees of human supervision;
- Inference, during which the patterns identified in the training phase are put into action, either in standalone AI deployments or embedded in applications; and
- Deployment of AI to an application or set of applications.
Where and how AI and ML workloads are trained and executed can vary significantly. At one extreme, this can mean batch or one-off training and inference runs that resemble high-performance computing (HPC) jobs on specific datasets in science and research environments. At the other, a trained model can be applied to continuous application workloads, such as the sales and marketing operations described above.
The types of data in training and operational datasets can range from large numbers of small files, such as sensor readings in internet of things (IoT) workloads, to very large objects, such as image and movie files or discrete batches of scientific data. File size on ingest also depends on the AI frameworks used (see below).
Datasets can also form part of primary or secondary data storage, such as sales records or data held in backups, which are increasingly seen as a valuable source of corporate information.
What are the I/O characteristics of AI workloads?
Training and inference in AI workloads typically require massively parallel processing, using graphics processing units (GPUs) or similar hardware that offload processing from central processing units (CPUs).
Processing performance must be exceptional to handle AI training and inference within a reasonable time frame and with as many iterations as possible to maximize quality.
Infrastructure also needs to be massively scalable to handle very large training datasets and the outputs of training and inference. It requires fast I/O between storage and processing, and possibly the ability to move data between locations so it can be processed where that is most efficient.
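To illustrate the scale of the I/O requirement, a back-of-envelope calculation shows how aggregate storage throughput grows with GPU count. Both figures below are assumptions for illustration, not vendor specifications:

```python
# Back-of-envelope: aggregate read throughput needed to keep GPUs busy.
# Assumed figures for illustration only; real numbers vary by model,
# framework and data pipeline.
gpus = 32              # GPUs in the training cluster
per_gpu_gbps = 2.0     # assumed ingest rate per GPU, in GB/s

aggregate_gbps = gpus * per_gpu_gbps
print(f"Storage must sustain ~{aggregate_gbps:.0f} GB/s of reads")  # ~64 GB/s
```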
Data is likely to be unstructured and in large volumes, rather than structured and in databases.
What kind of storage do AI workloads need?
As we have seen, massively parallel processing using GPUs is at the heart of AI infrastructure. So, in short, the job of storage is to feed data to those GPUs as quickly as possible, to ensure these very expensive hardware items are optimally used.
More often than not, this means flash storage for low latency in I/O. The required capacity will vary according to the scale of workloads and the likely scale of the results of AI processing, but hundreds of terabytes, even petabytes, are likely.
Sufficient throughput is also a factor because different AI frameworks store data differently: PyTorch tends towards large numbers of smaller files, while TensorFlow packs data into fewer, larger ones. So it’s not just a case of getting data to GPUs quickly, but also at the right volume and with the right I/O profile.
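As a rough sketch of why those profiles differ, a typical PyTorch dataset reads each sample from its own small file, while TensorFlow’s TFRecord format packs many samples into a few large files that stream sequentially. The paths and directory layout here are hypothetical:

```python
import os
from torch.utils.data import Dataset

class ImageFolderDataset(Dataset):
    """Each sample is a separate small file, so training issues many small random reads."""
    def __init__(self, root):
        self.paths = [os.path.join(root, f) for f in sorted(os.listdir(root))]

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        with open(self.paths[idx], "rb") as f:  # one small-file read per sample
            return f.read()

# TensorFlow, by contrast, typically streams a few large packed files sequentially:
# dataset = tf.data.TFRecordDataset(["train-00000.tfrecord"])
```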
Recently, storage vendors have been pushing flash-based storage—often using high-density QLC flash—as a potential general-purpose storage, including for data sets that have until now been considered “secondary,” such as backup data, because customers may now want to access them at higher speed using AI.
Storage for AI projects will range from very high-performance tiers for training and inference to various forms of long-term retention, because it will not always be clear at the start of a project which data will prove useful.
Is cloud storage good for AI workloads?
Cloud storage can be a viable option for AI workload data. Keeping data in the cloud brings an element of portability, because data can be “moved” closer to where it is processed.
Many AI projects start in the cloud because GPUs can be rented only for as long as they are needed. The cloud isn’t cheap, but deploying on-premises hardware means committing to capital spend before a project has proven itself in production.
All the key cloud providers offer AI services, including pre-trained models, application programming interfaces (APIs) to those models, AI/ML compute with scalable GPU deployment (Nvidia’s and their own), and storage infrastructure that scales to multiple petabytes.
Is object storage good for AI workloads?
Object storage handles unstructured data well, scales massively, is often found in the cloud, and can accommodate almost any data type as an object. This makes it well suited to the large, unstructured data workloads likely in AI and ML applications.
Rich metadata is another plus for object storage: it can be searched and read to help find and organize the right data for AI training models. Data can be kept almost anywhere, including in the cloud, with communication via the S3 protocol.
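As a minimal sketch of metadata-driven selection over S3 (the bucket, key and metadata field are hypothetical), an object’s user-defined metadata can be checked before pulling the object into a training set:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket/key; user-defined metadata is attached at upload time.
head = s3.head_object(Bucket="training-data", Key="images/cat-0001.jpg")
meta = head["Metadata"]  # user-defined key/value pairs, e.g. {"label": "cat"}

if meta.get("label") == "cat":
    # Only fetch the object itself if its metadata says it belongs in the set.
    obj = s3.get_object(Bucket="training-data", Key="images/cat-0001.jpg")
    payload = obj["Body"].read()
```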
But metadata, for all its benefits, can also overwhelm storage controllers and hurt performance. And if object storage lives in the cloud, cloud costs must be considered as data is accessed and moved.
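As a back-of-envelope illustration of that cost consideration (the per-gigabyte rate is an assumption for illustration; actual pricing varies by provider, region and tier):

```python
# Rough cloud egress cost for repeatedly pulling a training dataset out of
# the cloud. The rate below is an assumed figure, not any provider's price.
dataset_tb = 100        # dataset size, in TB
egress_per_gb = 0.09    # assumed egress rate, in $/GB

cost = dataset_tb * 1024 * egress_per_gb
print(f"One full read out of the cloud: ~${cost:,.0f}")  # ~$9,216
```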
What do storage providers offer for AI?
Nvidia provides reference architectures and hardware stacks that include servers, GPUs, and networking. These are the DGX BasePOD reference architecture and DGX SuperPOD turnkey infrastructure stack, which can be specified for industry verticals.
Storage vendors have also focused on the I/O bottleneck so that data can be efficiently delivered to large numbers of (very expensive) GPUs.
These efforts range from integrations with Nvidia infrastructure – the key player in GPU and AI server technology – via microservices such as NeMo for training and NIM for inference, to storage product validation with Nvidia AI infrastructure, and to entire storage infrastructure stacks aimed at AI.
Supplier initiatives have also centered on the development of retrieval-augmented generation (RAG) pipelines and hardware architectures to support them. RAG grounds the output of AI models by referring to external, trusted information, in part to tackle so-called hallucinations.
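Stripped to its essentials, a RAG pipeline retrieves the stored documents most relevant to a query and passes them to the model as grounding context. The following is a toy sketch, with a stand-in embedding function where a real pipeline would use a trained embedding model:

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Stand-in embedding: real RAG pipelines use a trained embedding model."""
    vec = np.zeros(256)
    for token in text.lower().split():
        vec[hash(token) % 256] += 1.0
    return vec / (np.linalg.norm(vec) or 1.0)  # unit-normalize

# A tiny in-memory "document store"; in practice this lives on storage.
docs = [
    "storage throughput limits GPU utilization",
    "object storage scales to petabytes",
    "RAG grounds model output in trusted documents",
]
doc_vecs = np.stack([embed(d) for d in docs])

query = "how does rag reduce hallucinations"
scores = doc_vecs @ embed(query)        # cosine similarity (vectors are unit length)
context = docs[int(np.argmax(scores))]  # best-matching document

prompt = f"Answer using this context: {context}\n\nQ: {query}"
# The prompt, now grounded in retrieved text, is what gets sent to the LLM.
```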
Which storage vendors offer products validated for Nvidia DGX?
Numerous storage providers have products validated with Nvidia’s DGX offerings, including the following.
DataDirect Networks (DDN) pairs its A³I AI400X2 all-NVMe storage appliances with SuperPOD. Each appliance delivers up to 90GBps of throughput and three million IOPS.
Dell’s AI Factory is an integrated hardware stack that spans desktop and laptop computing, PowerEdge XE9680 servers, PowerScale F710 storage, and software and services, powered by Nvidia’s AI infrastructure. It is available via Dell’s Apex as-a-service scheme.
IBM has Spectrum Storage for AI with Nvidia DGX, a converged but separately scalable compute, storage and networking solution validated for Nvidia BasePOD and SuperPOD.
Backup provider Cohesity announced at Nvidia’s GTC 2024 event that it will integrate Nvidia NIM microservices and Nvidia AI Enterprise into its Gaia multicloud data platform, enabling backup and archive data to serve as a source of training data.
Hammerspace has GPUDirect certification with Nvidia. Hammerspace markets its Hyperscale NAS as a global file system built for AI/ML workloads and GPU-driven processing.
Hitachi Vantara has its Hitachi iQ, which provides industry-specific AI systems that use Nvidia DGX and HGX GPUs with the company’s storage.
HPE has GenAI supercomputing and enterprise systems with Nvidia components, a RAG reference architecture, and plans to build in NIM microservices. In March 2024, HPE upgraded its Alletra MP storage arrays to support twice the number of servers and four times the capacity in the same rack space, with 100Gbps connectivity between nodes in a cluster.
NetApp has product integrations with BasePOD and SuperPOD. At GTC 2024, NetApp announced integration of Nvidia’s NeMo Retriever microservice, a RAG software offering, with its OnTap hybrid cloud storage.
Pure Storage has AIRI, a flash-based AI infrastructure certified with DGX and Nvidia OVX servers and using Pure’s FlashBlade//S storage. At GTC 2024, Pure announced that it has created a RAG pipeline that uses Nvidia NeMo-based microservices with Nvidia GPUs and its storage, plus RAG pipelines for specific industry verticals.
Vast Data introduced its Vast Data Platform in 2023, which combines its QLC flash storage subsystems and fast cache with database-like capabilities at the native storage I/O level, and carries DGX certification.
In March 2024, hybrid cloud NAS manufacturer Weka announced a hardware device certified to work with Nvidia’s DGX SuperPod AI data center infrastructure.