Is your AI infrastructure built for the future? Guest writer Brian Dawson, product manager at CIQ, reveals surprising lessons learned while creating an OS designed for AI workloads.
As a product manager, my colleagues and I have been spending a lot of time thinking about infrastructure problems that don’t get enough attention. One that stood out to us is that the majority of AI workloads run on Linux, and production inference costs — the cost of running machine learning models in production — are growing faster than development costs. Even so, most organizations are running these critical workloads on operating systems that weren’t designed for them.
That led us to build a variant of Rocky Linux optimized specifically for AI workloads. Rocky Linux from CIQ – AI is not just another Linux build with some extra packages, but something designed from the ground up to handle the unique demands of AI and machine learning workloads, including a long-term upstream kernel and AI-focused optimizations.
When we started this project, we had some strong ideas about what enterprises needed — better hardware support, streamlined framework integration, etc. — but there was a lot we had yet to discover. After weeks of intensive work getting our first release candidate ready, we’ve learned some things that validated our original assumptions and others that completely surprised us.
We’re still deep in development — RLC-AI is a work in progress — but I want to share some of the key insights we’ve gathered along the way. If you’re dealing with AI infrastructure challenges, some of these lessons might sound familiar, and if you’re not there yet, consider this a heads-up about what’s coming.
PyTorch Isn’t Just a Framework: It’s a Fulcrum for AI Workloads
The first thing that became obvious? The open source machine learning framework PyTorch is the center of gravity for AI workloads. Everything either sits on top of it or needs to talk to it, which makes PyTorch, along with its web of dependencies, a critical pain point as well as an opportunity.
Supporting PyTorch isn’t just about making sure it runs. It means ensuring it is compatible with the hardware and drivers below it, as well as the frameworks and apps above it. That includes framework compatibility, hardware support, driver management, and performance tuning. It turns out that packaging those elements the right way (we chose RPMs, not Python’s package manager, pip) makes a big difference in security, reproducibility, and ease of support.
While pip and Python virtual environments work well for development and many production scenarios, enterprise deployments benefit from additional controls around dependency management, security scanning, and reproducible builds. Our RPM-based approach provides signed packages and controlled supply chains, which complement Python packaging tools for organizations with strict compliance requirements.
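One concrete benefit of signed, pinned system packages is that a build can be proven reproducible: identical dependency pins should always yield an identical fingerprint. Here is a minimal, illustrative sketch of that idea in Python — this is not CIQ’s tooling, and the package names and versions are made up:

```python
import hashlib

def lockfile_digest(pinned_deps):
    """Compute a stable digest over exact package pins so two builds
    can be compared for reproducibility (illustrative sketch only)."""
    # Sort by name so the digest is independent of insertion order.
    canonical = "\n".join(f"{name}=={ver}" for name, ver in sorted(pinned_deps.items()))
    return hashlib.sha256(canonical.encode()).hexdigest()

# Two builds with the same pins, declared in a different order.
build_a = {"torch": "2.3.1", "numpy": "1.26.4", "triton": "2.3.1"}
build_b = {"numpy": "1.26.4", "triton": "2.3.1", "torch": "2.3.1"}

assert lockfile_digest(build_a) == lockfile_digest(build_b)
```

An RPM-based supply chain effectively bakes this guarantee in at the distribution level, with GPG signatures on top, rather than leaving it to per-project tooling.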
Driver Management: The Silent Killer of Human and GPU Cycles
We suspected driver dependencies would be tough. We didn’t realize just how tough.
In general-purpose Linux, GPU drivers are often bolted on after the fact. That means you’re on your own for debugging, version mismatches, and subtle incompatibilities. More than half of the users we spoke with reported that driver installation or compatibility issues were among the top blockers when standing up AI infrastructure.
It surprised us that the figure was more than half — this is not a minor issue. If your GPU isn’t recognized or underperforms because of a missing kernel module, you are effectively burning dollars on both configuration time and compute spend. Evidence of this shows up in the variance in open source model performance across model providers.
Most distros punt here, telling users to fetch drivers directly from vendor repositories. But that’s a brittle approach. If your security policy doesn’t allow for external repos, you’re stuck. Worse yet, if the upstream package changes or disappears, you’re not just stuck, you’re probably moving backwards. The challenge isn’t just integration, it’s keeping current. Even containers don’t fully solve this. You still need host-level driver installation and version matching.
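The version-matching problem mentioned above is easy to illustrate: NVIDIA’s userspace libraries and its kernel module must come from the same driver release, or tools like `nvidia-smi` fail to initialize. A small sketch of that check, using made-up sample strings of the kind you would read from `nvidia-smi` output and `/proc/driver/nvidia/version` on a real host:

```python
import re

# Hypothetical sample strings; on a live system you would capture these
# from `nvidia-smi` and /proc/driver/nvidia/version.
smi_output = "Driver Version: 550.90.07   CUDA Version: 12.4"
proc_version = "NVRM version: NVIDIA UNIX x86_64 Kernel Module  550.90.07"

def extract_driver_version(text):
    """Pull a driver version like 550.90.07 out of a status string."""
    m = re.search(r"\b(\d{3}\.\d+(?:\.\d+)?)\b", text)
    return m.group(1) if m else None

userspace = extract_driver_version(smi_output)
kernel_module = extract_driver_version(proc_version)

# A mismatch here is the classic "Failed to initialize NML/NVML"-style
# failure mode after a partial driver or kernel update.
assert userspace == kernel_module
```

Pre-integrating and validating drivers in the image is essentially doing this matching once, centrally, instead of every user rediscovering it per host.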
We made the decision to pre-integrate AMD, NVIDIA, and Intel drivers into our builds. So far, it hasn’t been easy. It meant taking on the pain of building and validating images so that they work out of the box across a variety of hardware, which turned out to be a complex and expensive effort.
To us, that work validates the market need and the benefit RLC-AI will provide, because when an engineer downloads our AI-optimized enterprise Linux variant, they’re getting a tested, supported foundation that saves them the time, effort, and expense that we incurred.
Confidential Computing and Security: An Industry Priority
One thing that did surprise us was the value the companies we spoke to placed on security, and specifically support for confidential computing.
We went into this project expecting performance to be the headline, but increasingly the people we’re talking to are just as concerned about encryption at rest, in transit, and during inference. They want protection from tampering or leakage, especially in the multi-tenant environments where many AI solution providers deliver their inference services.
While Linux has made strides here, support for confidential computing features is still uneven. Some hardware vendors are ahead of the curve; others are still catching up. Accordingly, we see a need in the market to deliver support for emerging confidential computing hardware needs, along with other security hardening, in a trusted, mostly turnkey package.
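That unevenness is visible even in how you detect support: each vendor’s confidential-computing mode surfaces through different kernel interfaces. The sysfs paths below are real locations on recent kernels, but whether they exist and read as enabled depends on hardware, firmware, and kernel configuration — treat this as a hedged detection sketch, not a complete inventory:

```python
from pathlib import Path

def confidential_modes(sys_root="/sys"):
    """Report which confidential-computing modes the kernel exposes.
    `sys_root` is parameterized so the logic can be exercised without
    real hardware."""
    checks = {
        # AMD SEV / SEV-SNP: kvm_amd module parameters read "Y" or "1" when on.
        "AMD SEV": "module/kvm_amd/parameters/sev",
        "AMD SEV-SNP": "module/kvm_amd/parameters/sev_snp",
        # Intel TDX: kvm_intel exposes a `tdx` parameter on TDX-capable kernels.
        "Intel TDX": "module/kvm_intel/parameters/tdx",
    }
    enabled = []
    for name, rel in checks.items():
        p = Path(sys_root) / rel
        if p.exists() and p.read_text().strip() in ("Y", "1"):
            enabled.append(name)
    return enabled
```

On a stock distro, an empty result can mean missing hardware, missing firmware, or just a kernel built without the feature — which is exactly the ambiguity a pre-hardened image is meant to remove.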
AI Engineers Need an OS, Too
Initially, we thought our core audience would be Linux system administrators. After all, they’re the ones who know how to tune a kernel, install a driver, or troubleshoot boot logs. And yes, we got interest from that camp, but not as much as we expected.
The bigger group, it turns out, was AI and ML engineers. These are folks who don’t necessarily want to become Linux experts. They just want to run models, experiment with frameworks, and get results. And they were hitting real friction trying to do that on generic operating systems: things aren’t compatible, stop working after updates, or break when they move to production.
What that told us is that we weren’t just building for sysadmins. We were building for people who write models, deploy inference endpoints, and test performance benchmarks. That means rethinking documentation, packaging, and support to address these needs. It also means creating examples, starter workloads, and the tools AI/ML engineers actually use, like vLLM and MLPerf. In other words, we’re providing everything engineers need to move from “downloaded the ISO” to “running models” faster, so they can focus on innovation. We’re tracking time from “OS download to first token” as a key metric.
AI-Ready Linux Needs to Stay Flexible
The other bet we made (and are seeing validated) is that no one wants to be locked into a particular model or framework. Today you might be using Mistral. Tomorrow it might be LLaMA or Mixtral or something we haven’t seen yet. Same with hardware. Maybe you start on NVIDIA, but your next deployment lands on AMD or Intel accelerators.
If your OS can’t keep up with that churn, it becomes a liability. That’s why we built Rocky Linux from CIQ – AI to be model- and hardware-agnostic as a design principle. We are focusing on a BYM (bring your own model) approach.
RLC-AI is not perfect yet, but it’s improving fast. And because we’re pulling the latest long-term kernel from upstream, we will be able to deliver support for emerging hardware and workloads better than the industry is doing today.
We are not trying to build a whole new stack for your AI workloads — that is your choice — but we are making sure the layer underneath is rock solid, performant, and secure. That includes secure images, pre-integrated drivers, and reproducible builds.
This Is Harder Than It Looks
I’ve been in open source and enterprise software for a long time. And I’ve learned the hard way that “just package it up and make it work” is never as easy as it sounds. Especially not when it involves fast-moving frameworks, proprietary drivers, confidential compute, and performance-critical workloads.
That’s exactly why I think a purpose-built Linux for AI matters. It’s not about doing something flashy or on trend; it’s about filling a gap in the industry to address a foundational need, one we believe will deliver wide and meaningful impact.
The reality is that if your operating system is dragging down AI performance, that’s more than an inconvenient technical issue; it’s a material impact on your business.

Guest writer Brian Dawson is a product manager at CIQ, where he helps define and deliver enterprise-grade Linux solutions for AI workloads. A longtime open source advocate, Brian has more than 25 years of experience in software development and is a recognized voice in DevOps and developer productivity.