Introducing Craylm v0.5: Unifying LLM Inference and Training for RL Agents
Today we are excited to announce the release of Craylm v0.5 (pronounced KRAY-lem).
Craylm is a fully open source, CC0-licensed (unrestricted commercial use), integrated LLM inference and training platform.
We created Craylm to simplify the development of reinforcement learning agents with advanced reasoning and memory capabilities, similar to those of DeepSeek R1. By integrating inference and training engines into a single platform, Craylm enables the seamless generation and utilization of reasoning trajectories for training updates, streamlining the development process.
- Contact us for help deploying Craylm on your GPU supercomputer: Contact
- Access the source code at: GitHub
- Read the documentation: Docs
- Join the community on Discord
Craylm builds on top of the vLLM inference engine, the Megatron-LM training framework, and the Hugging Face model hub. It unifies the capabilities of these tools into a single platform, enabling users to easily perform LLM inference and training, and to build higher-level applications such as LLM agents.
Craylm is designed for high performance, with out-of-the-box support for AMD ROCm GPUs and NVIDIA CUDA GPUs, plus development-mode support for x86 and ARM CPUs. It inherits the distributed training capabilities of Megatron-LM and the optimized inference engine of vLLM. Craylm is also designed to be easy to use: it provides an OpenAI-compatible server and a simple command line interface for interacting with the platform.
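As a rough sketch of what interacting with the OpenAI-compatible server looks like, using the standard openai Python client (the base URL, port, and model name below are placeholders, not Craylm defaults):

```python
# Minimal sketch of calling an OpenAI-compatible server with the openai client.
# The base_url and model name are illustrative assumptions, not Craylm defaults.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",  # assumed address of the local server
    api_key="unused",                     # many local servers ignore the key
)

response = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct",  # whichever model the cluster serves
    messages=[{"role": "user", "content": "Summarize reinforcement learning in one sentence."}],
)
print(response.choices[0].message.content)
```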
Craylm is inspired by the work of Seymour Roger Cray, an American electrical engineer and supercomputer architect who designed a series of computers that were the fastest in the world for decades, and founded Cray Research, which built many of these machines. Called “the father of supercomputing”, Cray has been credited with creating the supercomputer industry.
We hope that Craylm will help to democratize access to large language models, and enable more people to build and deploy AI systems that push the boundaries of accuracy to benefit society. We are excited to see what you will build with Craylm!
If you are interested in contributing to Craylm, please reach out to us at Contact.
v0.5 - Functionality and Feedback
Craylm v0.5 is a fully functional prototype that demonstrates the core capabilities of the platform. It is intended to show the platform's potential and to gather feedback from the community: we are actively seeking suggestions for new features, bug reports, and ideas for improvement.
We expect to release v1.0 in the coming months, which will include additional features, performance optimizations, benchmarks, and bug fixes. We are also planning to release a series of blog posts and tutorials that will help users get started with Craylm and build their own LLMs with advanced reasoning capabilities.
Lessons Learned
Over the last ten years, we have built seven generations of training and inference stacks at Baidu, Apple, AI Fund, MLCommons, several startups including Lamini, and multiple enterprise partners. We have learned a lot about what works and what doesn’t. Craylm is a clean slate rewrite of a unified inference and training stack incorporating the lessons learned. Here are some of the key lessons:
- Training and Inference are both essential, but most frameworks focus on one or the other.
- SLURM and Kubernetes have different worldviews, but pieces of both are useful.
- A model exchange format has not yet emerged; the best we have is the Hugging Face model hub.
- The main dependency is the GPU, and it is possible to achieve a simpler and more secure design by cutting out the rest.
These infrastructure issues are so difficult that many organizations focus on either training or inference, leaving significant accuracy on the table. Our vision for Craylm is to enable organizations to build LLM systems that leverage backpropagation in addition to in-context learning, just like humans, enabling them to solve more challenging problems.
Unified Training and Inference
Training is essential because it enables models to learn. Training creates flywheels: a model that cannot be trained can never be updated or improve.
Inference is essential for models to have practical value. Inference is used when a model is deployed in production to serve predictions to users.
Craylm unifies training and inference into a single platform. This enables users to simultaneously train and deploy models using the same GPU cluster, and to easily switch between the two modes without changing their code.
It does this by building on PyTorch models stored in the Hugging Face model hub, which allows training any model that Hugging Face supports. The inference engine is based on vLLM, which loads the saved Hugging Face checkpoints. It is possible to train new models from scratch or to continue training existing models.
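For example (the checkpoint path below is hypothetical), serving a Hugging Face checkpoint with vLLM looks the same whether the weights came from the hub or from a training run:

```python
# Sketch: serving a Hugging Face checkpoint with vLLM's offline API.
# The model path is hypothetical; it could be a hub ID or a local checkpoint
# directory produced by a training run.
from vllm import LLM, SamplingParams

llm = LLM(model="/shared/checkpoints/my-finetuned-llama")  # or "meta-llama/Llama-3.1-8B"
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Explain RDMA in two sentences."], params)
print(outputs[0].outputs[0].text)
```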
A Craylm cluster will spin up inference and training workers as needed. Training jobs are submitted to the SLURM queue and pulled out by the training workers as batch jobs. Inference requests are submitted to a persistent queue and pulled out by the inference workers as needed. As checkpoints are produced by the training workers, they are picked up by the inference workers, and new inference requests can immediately use the new model.
Craylm follows the API design of the Llama Stack, a set of APIs designed to make it easy to build agent applications on top of open source LLMs. The Llama Stack includes safety guards, alongside the AILuminate benchmark suite, a set of benchmarks designed to measure the performance of LLMs. Craylm is designed to be compatible with the Llama Stack and to work well with the AILuminate benchmarks.
SLURM and Kubernetes are Better Together
Many software developers are familiar with Kubernetes for cloud and enterprise apps. So the natural temptation is to use Kubernetes as the job queue, e.g. with the MPI Operator to get the network to work. This leaves network performance on the table and forces the user to set up the RDMA network stack manually.
Kubernetes without SLURM can work, but it forces the infra team to reinvent the wheel.
Some HPC clusters run SLURM on bare metal. This also works and simplifies the network configuration, but deployments without containers create version and dependency management hell.
In Craylm, we run a virtual SLURM cluster inside a Kubernetes application, providing the best of both worlds. We are closely monitoring development of the SLINKY project and expect to fold in better approaches as they become available.
Let’s look at SLURM and Kubernetes in more detail.
SLURM is a batch job scheduler that is widely used in high performance computing environments. It is designed to run large numbers of jobs on a cluster of machines with a high performance network. The high performance network is essential for training large models, because it allows the GPUs to communicate with each other quickly using RDMA primitive operations that are composed into collective operations like allreduce and allgather. Superficially it seems like SLURM only handles job scheduling, but it also sets up the MPI environment that brings up the high performance network that distributed training frameworks like Megatron-LM tap into.
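To make this concrete, here is a minimal sketch of how a distributed training process bootstraps its collectives from the environment SLURM provides when launched via srun; the master-address handling is simplified, and the actual mechanics in Megatron-LM involve much more machinery:

```python
# Sketch: initializing NCCL collectives from the environment SLURM sets per task.
# MASTER_ADDR/MASTER_PORT handling here is simplified for illustration.
import os
import torch
import torch.distributed as dist

rank = int(os.environ["SLURM_PROCID"])         # global rank of this task
world_size = int(os.environ["SLURM_NTASKS"])   # total number of tasks
local_rank = int(os.environ["SLURM_LOCALID"])  # rank within this node

os.environ.setdefault("MASTER_ADDR", os.environ.get("SLURM_LAUNCH_NODE_IPADDR", "127.0.0.1"))
os.environ.setdefault("MASTER_PORT", "29500")

torch.cuda.set_device(local_rank)
dist.init_process_group(backend="nccl", rank=rank, world_size=world_size)

# A collective over the high performance fabric that SLURM/MPI brought up.
x = torch.ones(1, device="cuda")
dist.all_reduce(x)  # x now equals world_size on every rank
```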
Kubernetes is a container orchestration system that is widely used in cloud computing environments. It handles the problem of deploying containers onto a cluster of machines. Containers are essential to manage the complex software dependencies. If you want to upgrade the software on the cluster, you can just upgrade the container image and restart the container. Kubernetes allows you to do this without affecting the other containers running on the cluster.
Putting them together may seem difficult, and indeed, there are many subtle details to take care of, including how SLURM daemons manage hostnames and handle cgroups, which make conflicting assumptions with the lower layers of Kubernetes networking, containerd, and accelerator management. However, SLURM is essentially a set of lightweight C programs that coordinate over TCP sockets and tap into RDMA interfaces exposed by Linux network drivers. It is possible to entirely encapsulate these SLURM daemons in containers, and run those containers on Kubernetes. Craylm does this.
When the Craylm worker containers start, they have access to a virtual SLURM network and control plane. We build the orchestration, distributed training framework, and collectives on top of these interfaces. So a user can `helm install cray` and get a fully functional SLURM cluster.
Hugging Face is the model exchange format
New models come up almost daily, and it is important to be able to train and deploy them with minimal porting effort.
The Hugging Face model hub is the closest thing we have to a model exchange format. The transformers library has PyTorch source code implementations of hundreds of classes of models. Even though most LLMs are based on transformers, there is no standard implementation of a transformer: different models may choose different position embeddings or layer shapes. One of the biggest points of friction between training and inference teams in large companies is resolving these differences. Craylm modifies Megatron-LM to use the Hugging Face model hub as the source of truth for model implementations, which makes it possible to immediately load most trained models directly into the inference engine.
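As a sketch of the round trip (the checkpoint path is hypothetical and the training step is omitted), the model is pulled from the hub, trained, and saved back as a standard Hugging Face checkpoint directory that the inference engine can load directly:

```python
# Sketch: a Hugging Face checkpoint directory as the exchange format.
# The output path is hypothetical; training details are omitted.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16)

# ... training happens here (e.g. a Megatron-LM job driving these weights) ...

out_dir = "/shared/checkpoints/llama-3.1-8b-step-1000"
model.save_pretrained(out_dir)      # weights + config.json
tokenizer.save_pretrained(out_dir)  # tokenizer files

# The inference engine can load `out_dir` directly, because it is a standard
# Hugging Face checkpoint rather than a framework-specific format.
```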
Secure Enclave
Multiple previous generations of Craylm used a variety of cloud services, including parallel distributed object storage such as Azure Blob or GCS, database services like BigQuery, cloud load balancers, and more. These led to significant complexity and reduced portability between supercomputers, which are often optimized for efficiency and have different cloud stacks. The current generation of Craylm requires GPU compute nodes, a high performance interconnect, and nothing else.
This allows Craylm to be deployed in a secure enclave, with no external dependencies.
We had to address several design issues to make this possible.
Training State Versioning without a Database or Object Store
Training jobs need to produce a model that is passed to the inference engine, and state needs to be tracked between the training and inference workers. We uniquely identify each model by a hash of the training data, the model code, and the hyperparameters. This state is saved to a shared filesystem that is accessible by both the training and inference workers. The training system picks up this training state and launches a SLURM job to train the model. As it trains, it produces checkpoints that are saved to the shared filesystem. The inference system picks up these checkpoints as they become available. This enables fault tolerance, because if a training worker fails, the training system can restart from a checkpoint. It also avoids the need for a database to track the state of the training jobs.
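Here is a minimal sketch of the idea, with hypothetical paths and layouts: the model ID is a content hash of the training data, model code, and hyperparameters, and it doubles as the checkpoint directory name on the shared filesystem.

```python
# Sketch: content-addressed training state on a shared filesystem.
# Paths and the hyperparameter layout are illustrative assumptions.
import hashlib
import json
from pathlib import Path

SHARED_FS = Path("/shared/craylm")  # filesystem visible to training and inference workers

def model_id(data_path: Path, code_path: Path, hparams: dict) -> str:
    """Derive a stable ID from training data, model code, and hyperparameters."""
    h = hashlib.sha256()
    h.update(data_path.read_bytes())
    h.update(code_path.read_bytes())
    h.update(json.dumps(hparams, sort_keys=True).encode())
    return h.hexdigest()[:16]

def checkpoint_dir(data_path: Path, code_path: Path, hparams: dict) -> Path:
    """The trainer (writer) and the inference workers (readers) derive the same
    directory name, so no database is needed to track job state."""
    return SHARED_FS / "models" / model_id(data_path, code_path, hparams)
```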
Massive Inference without Cloud Queues or Load Balancers
LLMs are big and expensive, even on GPUs; generating thousands of tokens can take minutes. A user who submits a few requests in a loop from their laptop can quickly overwhelm a GPU cluster that costs millions of dollars. Submitting these requests as individual HTTP REST requests leads to load imbalance and timeouts, while submitting too few requests leads to underutilization. vLLM uses continuous batching to handle load imbalance, but that can still overwhelm the workers. Craylm instead pushes inference requests into a persistent queue, which can grow very large. The requests are pulled out by vLLM workers as they have capacity, and the user client polls the queue for results. This allows efficient, lossless processing of the very high request volumes common in inference pipelines.
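The pattern looks roughly like the sketch below, which uses a plain directory on the shared filesystem as a stand-in for the persistent queue; Craylm's actual queue implementation may differ.

```python
# Sketch: queue-based inference instead of synchronous HTTP requests.
# A directory stands in for the persistent queue; the real backing store may differ.
import json
import time
import uuid
from pathlib import Path

QUEUE = Path("/shared/craylm/inference")

def submit(prompt: str) -> str:
    """Client side: enqueue a request and return its ID immediately."""
    request_id = uuid.uuid4().hex
    (QUEUE / "pending").mkdir(parents=True, exist_ok=True)
    (QUEUE / "pending" / f"{request_id}.json").write_text(json.dumps({"prompt": prompt}))
    return request_id

def poll(request_id: str, interval: float = 2.0) -> str:
    """Client side: wait until a worker has written the result."""
    result_path = QUEUE / "results" / f"{request_id}.json"
    while not result_path.exists():
        time.sleep(interval)  # workers drain the queue at their own pace
    return json.loads(result_path.read_text())["completion"]

# Thousands of requests can be enqueued at once without overwhelming the workers.
request_ids = [submit(f"Question {i}: what is 2 + {i}?") for i in range(1000)]
answers = [poll(rid) for rid in request_ids]
```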
Datasets without an Object Store
Training requires training data, which typically means uploading a training dataset. The natural place to put it is in a cloud object store, and it initially seems hard to get around this. But how hard is it, really?
- How big is each data item that an LLM is trained on? A few hundred tokens.
- How many bytes does that take up? Say one byte per token compressed, so a few hundred bytes per item.
- How many training items are there in a dataset like Alpaca? 52K.
- So how big is the dataset? 52K (Q&A pairs) × ~300 (bytes per pair) ≈ 15.6 MB.
- How long does it take to upload 15.6 MB over a cloud internet connection, e.g. 1 Gbps? About 0.124 seconds.
- How long does it take to train a 70B parameter model on that data on an H100 at 40% MFU? About 6.5 hours.
So upload performance isn't nearly as big a practical problem as it seems.
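The same back-of-envelope arithmetic in code, with the GPU peak throughput as an explicit assumption:

```python
# Back-of-envelope check of the numbers above; assumptions are noted inline.

dataset_items = 52_000            # Alpaca-sized instruction dataset
bytes_per_item = 300              # ~300 tokens per item at ~1 byte/token compressed
dataset_bytes = dataset_items * bytes_per_item
print(f"dataset size: {dataset_bytes / 1e6:.1f} MB")          # ≈ 15.6 MB

link_bps = 1e9                    # 1 Gbps cloud uplink
print(f"upload time: {dataset_bytes * 8 / link_bps:.3f} s")   # ≈ 0.12 s

# Training time, using the common ~6 * params * tokens FLOP estimate.
params = 70e9
tokens = dataset_items * 300
train_flops = 6 * params * tokens
peak_flops = 989e12               # assumed H100 SXM BF16 dense peak; adjust per GPU
mfu = 0.40
print(f"train time: {train_flops / (peak_flops * mfu) / 3600:.1f} h")
# A few hours; the exact figure depends on the assumed peak throughput.
```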
We built a fast file uploader in Craylm that can handle 10s to 100s of GBs of data over common cloud connections. It hashes the data to avoid uploading duplicates. No need for an object store.
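The deduplication idea, sketched with a hypothetical transport callback: the client hashes each chunk and only ships chunks the server has not already seen.

```python
# Sketch: hash-based deduplication for dataset uploads.
# The server-side check is represented by a set of known hashes; the real
# uploader's transport and chunk size are implementation details.
import hashlib
from pathlib import Path
from typing import Iterator

CHUNK_SIZE = 8 * 1024 * 1024  # 8 MiB chunks (assumed)

def chunks(path: Path) -> Iterator[bytes]:
    with path.open("rb") as f:
        while chunk := f.read(CHUNK_SIZE):
            yield chunk

def upload(path: Path, already_uploaded: set[str], send) -> None:
    """Send only chunks whose hash the server has not seen before."""
    for chunk in chunks(path):
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in already_uploaded:
            send(digest, chunk)           # ship the new chunk
            already_uploaded.add(digest)  # remember it to skip duplicates
```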
Our main insight is as follows: if you are running Craylm, you have built a supercomputer capable of training an LLM. That computer can handle moving around a few GB, and so could most laptops today. An object store managing thousands of parallel disks is overkill.
Future
Our target for v1.0 release is to include performance optimizations and kernel level benchmarks on modern GPUs including MI300X and H200.
We plan to keep Craylm up to date with the latest advances in LLM research, including new models, new training stack optimizations, and new inference stack optimizations.
We are excited to build a new generation of LLM supercomputers that democratize access to training language models. We are particularly interested in applications built on top of Craylm that support LLM agents that train and deploy themselves.
Contributing
Craylm is an open source project developed by a small team at TensorWave.
- Greg Diamos, PhD, is a computer architect and deep learning researcher. He has worked at NVIDIA.
- Sudnya Diamos is a 3x startup founding engineer and former NVIDIA GPU architect.
- Naila Farooqui, PhD, is a deep learning researcher. She is the co-author of NVIDIA Cutlass.
- Suhabe Bugrara, PhD, is a 3x CTO.
Contact Us to explore opportunities to work with us.
- Check out the examples - Example - Coming Soon
- Read deeper into Tokenformer Adaptors, a new method for efficiently training LLMs: Tokenformer Blog - Coming Soon
We welcome community contributions. Some areas that could use help:
- Help completing the Llama Stack client
- Running AILuminate evals
- Performance optimizations in the training and inference engines
- New deployment targets, benchmarks, and regression tests
- Hosted cray-lm clusters on your cloud
- A refactor of the unified vLLM build that makes it easier to accept upstream changes
- Integration with SLINKY