NVIDIA NeMo: How it Works and More

What is NVIDIA NeMo?

What does NVIDIA NeMo stand for? NVIDIA NeMo stands for “neural modules.” These are the basic components of the custom models users can build and train with the NeMo framework. NeMo is part of the NVIDIA AI Enterprise platform and is available through the NVIDIA AI Foundry service.

What is NeMo NVIDIA and what is it used for generally? 

NVIDIA NeMo is an open-source, state-of-the-art, enterprise-grade toolkit for building, customizing, and deploying generative AI models. The cloud-native NVIDIA NeMo framework includes tools for data curation, model customization and pretraining, retrieval-augmented generation (RAG), and guardrailing.

The key components of NVIDIA NeMo include:

  • NeMo Core. These foundational elements support training and inference, and provide a streamlined process for developing generative AI models.
  • NeMo Collections. These specialized modules for natural language processing (NLP), automatic speech recognition (ASR), and text-to-speech (TTS) applications include both training scripts and pre-trained models for a range of tasks (a short loading example follows this list).
  • Neural modules. These are the interconnectable building blocks that define trainable components of comprehensive models.
  • Application scripts. NVIDIA NeMo offers ready-to-use scripts that allow users to train or fine-tune models on specific datasets more rapidly.
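
As a brief illustration of how these components fit together, here is a minimal sketch that loads a pre-trained model from a NeMo collection and runs inference. It assumes the open-source nemo_toolkit package is installed; the checkpoint name and audio file path are illustrative, and exact return types vary by NeMo version.

```python
# A minimal sketch, assuming the open-source nemo_toolkit package is installed
# (pip install "nemo_toolkit[asr]"). Checkpoint name and audio path are illustrative.
import nemo.collections.asr as nemo_asr

# Pull a pre-trained speech recognition checkpoint from the ASR collection.
asr_model = nemo_asr.models.EncDecCTCModel.from_pretrained(
    model_name="QuartzNet15x5Base-En"  # example checkpoint; many others are available
)

# Run inference on a local audio file (path is a placeholder).
transcripts = asr_model.transcribe(["sample.wav"])
print(transcripts[0])  # exact return type varies by NeMo version
```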

The NVIDIA NeMo platform can build a variety of generative AI models, including large language models (LLMs), vision language models (VLMs), video models, and speech AI in clouds, data centers, and edge environments.

NeMo offers a fast, cost-effective way for enterprises to adopt generative AI. It provides precise data curation, advanced options for creating custom models, high performance, and API stability.

NVIDIA NeMo Framework Explained

What is the NVIDIA NeMo framework and how does it work? The NeMo framework from NVIDIA is a cloud-native kit for developing, customizing, and deploying generative AI applications and models.

Its modular design encourages users to mix and match core components. NVIDIA NeMo uses existing code and pre-trained model checkpoints to simplify the user experience and accelerate model training and development.

The NVIDIA NeMo framework simplifies the training, fine-tuning, and deployment of LLMs. Here is a brief summary of how the process works:

  • Data collection and preprocessing. NVIDIA NeMo collects and preprocesses massive quantities of textual data from various sources, and then cleans and formats it so it is suitable for training.
  • Training. NVIDIA NeMo relies on high-performance GPUs to accelerate the model training process. Self-supervised learning on this text allows the model to learn nuanced linguistic patterns.
  • Fine-tuning. After training, the model can be fine-tuned for specific domains or tasks. NVIDIA NeMo fine-tuning involves training the model on a select, focused dataset so it can adapt to specific use cases such as analysis of legal documents, medical diagnosis based on image analysis, or customer service (see the dataset sketch after this list).
  • Inference and deployment. NVIDIA NeMo tools integrate models into various applications easily, allowing for real-time inference. For example, the model can process and respond to text inputs from chatbots or virtual assistants in real-time.
  • Continuous learning. NVIDIA NeMo supports ongoing learning, so models can be updated with new data, remain relevant and accurate, and adapt to new trends and patterns as they emerge.
  • Retrieval-augmented generation (RAG). NVIDIA NeMo Retriever is a service that retrieves information and can be deployed in the cloud or on-premises. It integrates enterprise-grade RAG capabilities into customized AI applications simply and securely. NVIDIA NeMo Retriever also features a production-ready pipeline for information retrieval and a core of models trained using auditable, curated data sources.
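
To make the fine-tuning step above concrete, the sketch below writes a handful of prompt/response pairs to a JSONL file, which is the general shape of dataset that supervised fine-tuning workflows consume. The field names and file path are illustrative assumptions; the exact format depends on the NeMo recipe and version.

```python
# A minimal sketch of preparing a supervised fine-tuning dataset.
# The "input"/"output" field names and the file path are illustrative assumptions;
# check the NeMo documentation for the exact format a given recipe expects.
import json

examples = [
    {"input": "Summarize: The contract renews annually unless cancelled in writing.",
     "output": "The contract auto-renews each year unless cancelled in writing."},
    {"input": "Classify the ticket: 'My invoice total looks wrong.'",
     "output": "billing"},
]

with open("train.jsonl", "w", encoding="utf-8") as f:
    for ex in examples:
        f.write(json.dumps(ex) + "\n")
```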

NVIDIA NeMo Architecture

NVIDIA NeMo is composed of a suite of microservices. Here are some key NVIDIA NeMo microservices and some information about how they work:

NVIDIA NeMo Curator improves the accuracy of generative AI models by processing image, text, and video data at scale for customization and training. NeMo Curator also offers pre-built pipelines that generate synthetic data, which allows users to evaluate and customize generative AI systems.
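
As a rough, purely conceptual illustration of the kind of work Curator automates at much larger scale, the snippet below normalizes a small batch of documents and drops exact duplicates. It uses plain Python rather than Curator's own API.

```python
# A conceptual sketch of text curation: normalization plus exact deduplication.
# This is plain Python for illustration only, not the NeMo Curator API, which
# performs these steps (and many more) in parallel over very large corpora.
import hashlib

raw_docs = [
    "  NVIDIA NeMo is a framework for generative AI.  ",
    "NVIDIA NeMo is a framework for generative AI.",
    "Short",
]

seen_hashes = set()
curated = []
for doc in raw_docs:
    text = " ".join(doc.split())            # collapse whitespace
    if len(text.split()) < 3:               # drop very short documents
        continue
    digest = hashlib.md5(text.encode()).hexdigest()
    if digest in seen_hashes:               # drop exact duplicates
        continue
    seen_hashes.add(digest)
    curated.append(text)

print(curated)
```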

NVIDIA NeMo fine-tuning is handled by NeMo Customizer, a scalable, high-performance microservice that simplifies alignment and fine-tuning of LLMs for domain-specific use cases. The current LLM collection, which has adopted the newer Python-based NeMo 2.0 API, supports the Nemotron, GPT, Llama, and Mistral NeMo models. This LLM collection supports pretraining, Parameter-Efficient Fine-Tuning (PEFT), and Supervised Fine-Tuning (SFT). The NVIDIA NeMo documentation offers more information about fine-tuned models.
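
As a conceptual sketch of why PEFT is so much cheaper than full fine-tuning, the snippet below wraps a frozen linear layer with a small low-rank adapter in plain PyTorch. This is not the Customizer or NeMo PEFT implementation, just the underlying idea: only the two small adapter matrices are trained.

```python
# A conceptual LoRA-style adapter in plain PyTorch, not NeMo's PEFT implementation.
# The frozen base weight stays fixed; only the two small low-rank matrices train.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)       # freeze the pretrained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen path plus a trainable low-rank correction.
        return self.base(x) + self.scale * (x @ self.lora_a.T @ self.lora_b.T)

layer = LoRALinear(nn.Linear(1024, 1024))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
total = sum(p.numel() for p in layer.parameters())
print(f"trainable params: {trainable} of {total}")
```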

NVIDIA NeMo Retriever is a collection of generative AI microservices that seamlessly connects custom models to diverse business data, allowing NVIDIA NeMo RAG applications to deliver more accurate responses. NVIDIA NeMo supports a basic RAG pipeline and is adding new RAG features and models on an ongoing basis.
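
The sketch below shows the retrieval half of a RAG pipeline in miniature: embed a few documents, embed the question, and pick the closest passage to ground the prompt. The hash-based embed() function is a deliberately crude stand-in for a real embedding model; NeMo Retriever provides production-grade embedding and retrieval microservices for this step.

```python
# A toy retrieval step for RAG: embed documents, embed the query, rank by cosine
# similarity, and build a grounded prompt. The embed() function is a crude
# stand-in for a real embedding model, used only to keep the sketch self-contained.
import math

def embed(text, dim=64):
    # Hash words into a fixed-size vector (illustrative only).
    vec = [0.0] * dim
    for word in text.lower().split():
        vec[hash(word) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    # Vectors are already normalized, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

documents = [
    "Refunds are processed within five business days.",
    "Support is available 24/7 through the customer portal.",
    "Enterprise plans include a dedicated account manager.",
]

question = "How long does a refund take?"
doc_vectors = [(doc, embed(doc)) for doc in documents]
query_vec = embed(question)

# Retrieve the single closest document and use it as context for the generator.
context, _ = max(doc_vectors, key=lambda pair: cosine(query_vec, pair[1]))
prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
print(prompt)
```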

NVIDIA NeMo Guardrails handles dialog management to ensure appropriateness, accuracy, and safety in LLM-powered applications. This NVIDIA NeMo tool helps organizations keep the generative AI systems they oversee secure.

NVIDIA NeMo models include everything necessary for training and reproducing conversational AI models, including data augmentors, data pre- and postprocessing, datasets/data loaders, language models, neural network architectures, optimizers and schedulers, and tokenizers. Hydra is used to configure both the PyTorch Lightning Trainer and any NVIDIA NeMo model.
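
The snippet below is a minimal sketch of that configuration pattern: an OmegaConf config of the kind Hydra loads from YAML drives the PyTorch Lightning Trainer, with a separate section reserved for the model. The keys and values shown are illustrative; real NeMo configs are much larger and are documented per model.

```python
# A minimal sketch of the Hydra/OmegaConf configuration pattern NeMo uses.
# The keys and values are illustrative, not a complete NeMo config.
from omegaconf import OmegaConf
import pytorch_lightning as pl

cfg = OmegaConf.create(
    """
    trainer:
      devices: 1
      accelerator: auto
      max_epochs: 3
    model:
      optim:
        name: adamw
        lr: 2.0e-5
    """
)

# The Trainer is built from the "trainer" section; NeMo models receive the
# "model" section, much as a Hydra-driven training script would pass it along.
trainer = pl.Trainer(**cfg.trainer)
print(OmegaConf.to_yaml(cfg.model))
```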

The NeMo-Aligner codebase supports efficient model alignment within the NVIDIA NeMo Framework. All NeMo-Aligner algorithms work with any GPT-based NeMo model from NVIDIA. Start with one of three pretrained NVIDIA NeMo LLMs to align: 2B GPT, Llama2-7B, or Nemotron-340B.

Finally, use TensorRT-LLM and Triton to deploy NVIDIA NeMo LLM models. The provided script exports the model from a NeMo checkpoint to TensorRT-LLM and then starts the service session on Triton.
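
Once a model is serving on Triton, applications can query it over HTTP. The sketch below uses the standard tritonclient package; the endpoint, model name, and tensor names are hypothetical placeholders that depend on how the model was exported.

```python
# A minimal sketch of querying a Triton-served model over HTTP with tritonclient.
# The endpoint, model name ("nemo_llm"), and tensor names ("prompts", "outputs")
# are hypothetical placeholders; actual names depend on how the model was exported.
import numpy as np
import tritonclient.http as httpclient

client = httpclient.InferenceServerClient(url="localhost:8000")

prompt = np.array([["Write a haiku about GPUs."]], dtype=object)
infer_input = httpclient.InferInput("prompts", list(prompt.shape), "BYTES")
infer_input.set_data_from_numpy(prompt)

result = client.infer(model_name="nemo_llm", inputs=[infer_input])
print(result.as_numpy("outputs"))
```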

NVIDIA NeMo Megatron

NeMo Megatron from NVIDIA is a framework for training large, powerful transformer models, and it currently supports three types of models: decoder-only GPT-style models, encoder-only BERT-style models, and encoder-decoder T5/BART-style models.

The NVIDIA NeMo Megatron library is highly optimized and efficient for training LLMs. NVIDIA NeMo automatically handles parallel checkpoints for pretrained models from Megatron-LM, which share the same features as other NVIDIA NeMo models.

The parallelism of Megatron models allows users to train language models with billions of weights and then use them in NeMo for downstream tasks. NVIDIA recommends that users pretrain, tune, and run inference with NeMo Megatron containers for large Megatron models (1 billion parameters and above).
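
To illustrate the idea behind that parallelism, the snippet below splits a linear layer's weight matrix column-wise into two shards and shows that the concatenated partial results match the unsharded computation. Megatron-style tensor parallelism applies the same principle across physical GPUs, at the scale of billions of parameters.

```python
# A toy illustration of Megatron-style tensor parallelism: a weight matrix is
# split column-wise into shards, each shard computes its slice of the output,
# and the slices are concatenated. Real training does this across GPUs.
import torch

torch.manual_seed(0)
x = torch.randn(4, 8)            # a batch of activations
weight = torch.randn(8, 16)      # full weight matrix (in_features x out_features)

# Shard the output dimension across two "devices".
w_shard_0, w_shard_1 = weight.chunk(2, dim=1)

out_full = x @ weight                                      # unsharded reference
out_parallel = torch.cat([x @ w_shard_0, x @ w_shard_1], dim=1)

print(torch.allclose(out_full, out_parallel, atol=1e-6))   # True
```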

NVIDIA NeMo Guardrails

NVIDIA NeMo Guardrails is an open-source toolkit that helps developers add programmable guardrails to conversational applications built on LLMs. NeMo Guardrails uses embedding search or vector databases for its knowledge base functionality and to implement the guardrails process.

Guardrails are safety measures that monitor and control how a user interacts with an LLM application. They help ensure that the AI model produces output that’s appropriate, accurate, and secure and operates within defined principles.

NVIDIA NeMo Guardrails can do a number of things, including:

  • Control the topics an application can discuss
  • Respond in a particular way to specific user requests
  • Define and follow a dialog path
  • Use a specific language style
  • Extract structured data
  • Call APIs to ensure information is accurate

NVIDIA NeMo Guardrails is usable with all LLMs, including ChatGPT. It also integrates with NVIDIA NeMo microservices and NVIDIA NIM to help developers control the LLM applications they build and deploy.
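
Here is a minimal sketch of that workflow using the open-source nemoguardrails package: a small Colang flow steers the bot away from a disallowed topic. The model named in the YAML and the example phrasings are illustrative, and the call assumes the corresponding LLM provider is configured and reachable.

```python
# A minimal NeMo Guardrails sketch using the open-source nemoguardrails package.
# The model name in the YAML is illustrative, and the example assumes the
# corresponding LLM provider credentials are configured in the environment.
from nemoguardrails import LLMRails, RailsConfig

yaml_content = """
models:
  - type: main
    engine: openai
    model: gpt-3.5-turbo-instruct
"""

colang_content = """
define user ask politics
  "what do you think about the election?"
  "who should I vote for?"

define bot refuse politics
  "I'm sorry, I can't discuss political topics."

define flow politics
  user ask politics
  bot refuse politics
"""

config = RailsConfig.from_content(colang_content=colang_content, yaml_content=yaml_content)
rails = LLMRails(config)

response = rails.generate(messages=[{"role": "user", "content": "Who should I vote for?"}])
print(response["content"])
```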

NVIDIA NeMo Pricing

NVIDIA NeMo is available as part of an NVIDIA AI Enterprise subscription. The actual cost is based on factors like the deployment environment, number of nodes, and desired support level, and includes access to the comprehensive AI suite.

NVIDIA NeMo vs NIM

Both NVIDIA NIM and NVIDIA NeMo are part of NVIDIA’s AI platform. NVIDIA NeMo is a framework for building and customizing generative AI models, while the NVIDIA NIM microservices deploy inference models across various environments. NVIDIA NeMo trains AI models, while NVIDIA NIM places them into production and runs them in applications.
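
In practice, that division of labor often looks like the sketch below: a model built or customized with NeMo is served as a NIM microservice, and the application queries it through the NIM's OpenAI-compatible endpoint. The base URL and model name are illustrative assumptions for a locally deployed NIM.

```python
# A minimal sketch of calling a deployed NIM microservice through its
# OpenAI-compatible API. The base URL and model name are illustrative and
# depend on which NIM is deployed and where.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used")

completion = client.chat.completions.create(
    model="meta/llama3-8b-instruct",   # example NIM model identifier
    messages=[{"role": "user", "content": "Summarize NVIDIA NeMo in one sentence."}],
    max_tokens=100,
)
print(completion.choices[0].message.content)
```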

Learn more about how to customize NVIDIA NIM for domain-specific uses in this NeMo NVIDIA tutorial.

NVIDIA Riva vs NeMo

The difference between NVIDIA NeMo and Riva is essentially that Riva deploys full pipelines, made up of one or more supported NVIDIA NeMo models plus pre- and post-processing components, while NVIDIA NeMo trains individual models and exposes more of their PyTorch internals.

Users must export NVIDIA Riva pipelines to an efficient inference engine and optimize them for their target platform. The Riva server cannot directly use unsupported NVIDIA NeMo models, but it can import supported NeMo-trained models using the nemo2riva tool, available via the Riva Quick Start scripts.

NVIDIA NeMo Examples

Some NVIDIA NeMo examples include the following:

  • Build a voice translation application using the NVIDIA Riva and NVIDIA NeMo Docker containers
  • Use generative AI models in NVIDIA NeMo to accelerate content creation
  • Use the NVIDIA AI Enterprise platform and its components for fraud detection
  • Accelerate the extraction of data from documents with NVIDIA NeMo
  • Deliver highly personalized ecommerce experiences with NVIDIA NeMo together with NVIDIA Merlin, RAPIDS, Triton, and NIM
  • Create intelligent virtual assistants and AI chatbots with RAG

WEKA and NVIDIA NeMo

Running NVIDIA NeMo with the WEKA Data Platform delivers exceptional performance and scalability for AI applications, particularly those leveraging large language models (LLMs) and generative AI workflows. WEKA enhances NeMo deployments by providing ultrafast, low-latency access to massive datasets, a critical requirement for the high-throughput demands of NeMo’s training and inferencing pipelines. Its advanced architecture supports NVIDIA Magnum IO GPUDirect Storage, allowing GPUs to directly access data from WEKA without CPU intervention, significantly reducing latency and boosting data transfer speeds.

WEKA’s ability to manage large-scale model repositories and vector databases ensures faster embedding retrieval and efficient storage of high-dimensional data vectors. This is essential for NeMo’s capabilities, such as training custom models, generating embeddings, and delivering high-quality inferencing results. The platform also accelerates NeMo’s synthetic data generation, context retrieval, and fine-tuning processes, reducing the time-to-market for AI-driven solutions.

With WEKA’s unified approach to training and inferencing environments, NeMo workflows can operate on shared infrastructure, eliminating silos and cutting operational costs. Its cloud-agnostic design enables seamless deployment of NeMo across hybrid and multi-cloud environments, providing consistent performance regardless of infrastructure. By combining WEKA’s optimized data pipeline with NeMo’s robust AI frameworks, organizations achieve unparalleled speed, efficiency, and scalability in building and deploying generative AI applications.