
AI Buyer’s Remorse: Why Enterprises Keep Getting Stuck with Research Hardware

Chris Wolf

Many customers are buying AI systems originally designed for research labs and trying to integrate them into production environments, only to wind up with a bad case of AI buyer’s remorse. 

Once these boxes land in the data center, the AI team or data science team may be thrilled, but I can’t say the same for IT operations. Research hardware may look like a production platform at first glance, but that impression collapses as soon as real workloads appear.

A big part of the problem is expectations.

Enterprise buyers see the word “appliance” and assume it means a feature-rich, automated platform hardened for production, as capable and complete as VMware Cloud Foundation (VCF). However, these “research” appliances were never designed to support long-lived, business-critical applications. Core enterprise capabilities required for production inference workloads are either missing or immature: patching and lifecycle management, observability, firewalls, high availability, automation and logging, role-based access controls, auditing, basic backup and restore, and more. When it comes to production workloads, the GPU is the easy part.

Operational Pain

You see this when teams attempt their first real operational task. For pure R&D, you don’t worry about logging or change control. But the moment someone tries to patch software, update firmware, or manage dependencies, the illusion breaks. They suddenly realize they need another tool for something that should have been built-in. 

From there, the operational pain grows. IT has to layer on backup, HA, monitoring, security, and compliance tools after the fact. The economics deteriorate just as quickly. Utilization on these appliances often stalls at 40 to 60 percent because the hardware footprint is oversized for the inference work being done, and automation tooling that could drive up utilization is either absent or immature. By contrast, virtualized platforms routinely hit around 80 percent or higher utilization. That 30-point gap matters: it means roughly 30 percent fewer servers, 30 percent less network hardware, and 30 percent less power and cooling. The true penalty is the cascading cost of buying the wrong appliance—and that’s far more than the sticker price.
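
To make the utilization math concrete, here is a back-of-the-envelope sketch. The daily demand, server size, and utilization figures are illustrative assumptions, not measurements from any particular deployment.

```python
import math

def servers_needed(demand_gpu_hours_per_day: float,
                   gpu_hours_per_server_per_day: float,
                   utilization: float) -> int:
    """Servers required to meet a daily inference demand at a given utilization."""
    effective = gpu_hours_per_server_per_day * utilization
    return math.ceil(demand_gpu_hours_per_day / effective)

demand = 2000.0        # assumed daily inference demand, in GPU-hours
per_server = 8 * 24.0  # an 8-GPU server offers 192 GPU-hours per day

appliance = servers_needed(demand, per_server, utilization=0.50)    # stalled appliance estate
virtualized = servers_needed(demand, per_server, utilization=0.80)  # virtualized platform

print(appliance, virtualized)                      # 21 vs. 14 servers
print(f"{1 - virtualized / appliance:.0%} fewer")  # roughly a third fewer servers
```

The specific numbers don’t matter; the ratio does, because it flows straight through to network ports, rack space, and power.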

Misaligned incentives compound the problem. GPU sales teams get compensated on volume, not fit. Server vendors want to sell the biggest possible configuration. Meanwhile, many customers misread the moment and assume they need training-scale hardware, even though most enterprises should be leaning on foundation models and focusing on inference. So they buy appliances designed for AI model training and quickly discover they can’t operate production inference workloads. All the things that “just work” in VCF simply aren’t there.

Last year’s board-level pressure to “do something with AI” accelerated this trend. Some teams didn’t even know what their use cases were yet—they just bought the hardware. Now CIOs worry they’ll be held responsible for oversizing data centers or buying co-lo capacity based on the wrong assumptions. Much of that sizing was driven by sales teams motivated by how many GPUs they could move.

Another source of confusion comes from the “AI factory” narrative. Vendors present AI nodes as token-generating factories, as if generating tokens were the whole job. But enterprises need far more: secure token pipelines, logging, observability, availability, change control, and full lifecycle management. None of these come with research appliances, and once the appliance cost is sunk, the cost of bolting on all of those missing capabilities starts piling up.

Course Corrections

Some organizations discover this the hard way. A large European research institution built its sovereign cloud on VCF, then bought AI factory appliances thinking that’s simply what you’re supposed to do for AI. They immediately ran into integration issues. Their enterprise software automation couldn’t interact with the appliances, and because those appliances were now a sunk cost, they had to build an entire parallel management plane to support them. Going forward, their new AI inference infrastructure will be based on VCF.

Others have already corrected course. One of the largest manufacturers in the world, after years on bare metal, wanted to be leaner, more agile, and more cost-optimized. They’re in the process of moving nearly half of their HPC estate—about 10,000 cores—to VCF, and will move all of their AI workloads there as well. Their alternative was to continue with bare-metal “AI factory” footprints, but they understood the additional costs that come with those architectures. A major regional cloud provider we work with reached the same conclusion and is rejecting bare-metal AI factories after uncovering their hidden operational expense.

The industry is also shifting away from the belief that enterprises should build foundation models. Inference workloads are overtaking training, yet many organizations continue buying training-scale hardware as if that were their role. Only a handful of companies should be training large foundation models. Most enterprises don’t need to build models—they need to optimize them, and open source has plenty of strong options, along with commercial models that you can run on your own infrastructure.

Many assumptions still trip people up. Buyers assume lifecycle management, RBAC, high availability, secure pipelines, and version control automatically come with the appliance. They assume they can patch and update systems the way they always have. They assume the primary metric is how many tokens the system can generate. None of that reflects what’s actually required in production.

Production AI needs observability, log management, workload scheduling, firewalling, encryption, maintenance workflows, and intelligent GPU placement. Scheduling behavior matters—some schedulers place workloads more or less at random and leave you with fragmented GPUs. Firewalls matter. Encryption matters. These are everyday enterprise problems. AI is no exception. You can’t base architecture decisions purely on quantity (i.e., token generation) without giving equal consideration to quality (e.g., availability, security and compliance). Both go hand in hand.
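
To see why placement policy matters, here is a toy Python model of GPU scheduling. It is not any vendor’s scheduler; the server count, job sizes, and the two placement policies (a naive “spread to the least-loaded server” versus simple best-fit packing) are assumptions chosen to illustrate fragmentation.

```python
SERVERS, GPUS_PER_SERVER = 4, 8
JOBS = [4, 2, 1, 3, 2, 4, 1, 2, 3, 2]  # assumed GPU requests, 24 GPUs in total

def place(jobs, choose):
    """Place each job on one server; `choose` picks among the servers that can fit it."""
    free = [GPUS_PER_SERVER] * SERVERS
    for need in jobs:
        fits = [i for i in range(SERVERS) if free[i] >= need]
        if fits:
            free[choose(fits, free)] -= need
    return free

# "Spread" sends each job to the least-loaded server; "pack" uses best fit.
spread = place(JOBS, lambda fits, free: max(fits, key=lambda i: free[i]))
packed = place(JOBS, lambda fits, free: min(fits, key=lambda i: free[i]))

print("spread:", spread)  # [1, 2, 2, 3] -> 8 GPUs free, yet no server can take a 4-GPU job
print("packed:", packed)  # [0, 1, 1, 6] -> the same 8 GPUs free, and a 4-GPU job still fits
```

Both runs leave the same eight GPUs free, but the spread policy scatters them so no server can host a four-GPU job. That stranded capacity is exactly the kind of quality problem a token-count metric never surfaces.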

Private AI’s Answer

There are three questions that help prevent buyer’s remorse:

  • Can I operate this solution using my tools of choice? If not, I’ll need different people.
  • Can I run other AI models and software on this? If not, I could be walking down the path toward many disparate “AI islands” that I’ll have to operate independently at higher cost.
  • What is the real total cost once tools, utilization, and power are factored in?
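
To make the third question concrete, here is a rough tally that folds tooling, utilization, and power into one annual number. Every figure is a placeholder assumption for illustration (the 21 and 14 server counts carry over from the earlier utilization sketch); none of it is vendor pricing.

```python
def true_annual_cost(servers: int,
                     server_price: float,       # capex per server (assumed)
                     amortization_years: int,
                     avg_kw_per_server: float,  # average draw incl. cooling overhead (assumed)
                     price_per_kwh: float,
                     tooling_per_server: float  # ops stack licensed per server (assumed)
                     ) -> float:
    capex = servers * server_price / amortization_years
    power = servers * avg_kw_per_server * 24 * 365 * price_per_kwh
    tooling = servers * tooling_per_server
    return capex + power + tooling

# Appliance route: more servers at lower utilization, plus add-on backup,
# monitoring, HA, and security tooling bolted on per server.
appliance = true_annual_cost(21, 300_000, 4, 10.0, 0.12, 25_000)
# Virtualized route: fewer servers, with the operational stack priced as a
# single platform license per server.
virtualized = true_annual_cost(14, 300_000, 4, 10.0, 0.12, 15_000)

print(f"appliance:   ${appliance:,.0f}/yr")    # ~$2.3M
print(f"virtualized: ${virtualized:,.0f}/yr")  # ~$1.4M
```

The point isn’t the specific numbers. It’s that the sticker price of the appliance is the smallest part of the answer.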

This brings us to the solution. VCF on compatible systems like HGX can virtualize the capacity, unlock stranded resources, and restore the enterprise-grade controls customers expected in the first place.

Broadcom anticipated this shift years ago. We believed AI models would move to where the data resides and that inference, not training, would become the dominant enterprise use case. The features we built at the time may have seemed boring, but the strategy was right. Private AI isn’t just a Broadcom term anymore; the industry has embraced it. That’s because AI without automation, resiliency, privacy and compliance isn’t production AI. It’s a research project. Together, we can do better than that.