
Beyond Bare Metal: How Automation Is Redefining AI Infrastructure

Chris Wolf

Whenever I meet with CIOs and AI leaders to discuss scaling AI, the conversation almost always starts with performance. The most common myth I hear is that you can't have performance and flexibility at the same time.

The thinking goes: If you want speed, you've got to sacrifice everything else, settling for rigid infrastructure, underused hardware, and ballooning costs. However, our latest MLPerf Inference v5.0 results show that this trade-off is no longer a given.

We showed that AI workloads can run as fast in a virtualized environment as on dedicated hardware. Using VMware Cloud Foundation and NVIDIA H100 GPUs, we delivered top-tier performance on large models like Mixtral-8x7B and GPT-J, covering tasks in computer vision, medical imaging, and natural language processing. We did it while using only a portion of the CPU, which means there's still room to run other applications. 

Virtualized AI infrastructure isn't a science project anymore. It's hitting its stride, and automation is the engine driving that momentum.

Why Automation Matters Now

Without automation that can respond and adapt quickly, enterprises are either burning money or failing. Or, in many cases, both.

Running AI infrastructure requires more than spinning up containers or provisioning a few servers. You're also juggling compute, GPU, network, and storage resources, often under unpredictable, high-demand conditions that shift in real time.

That's where Broadcom's VMware Cloud Foundation (VCF) platform earns its keep. Our distributed resource scheduler does more than balance workloads; it actively adjusts them on the fly. It tracks memory usage, I/O, GPU saturation, and other signals to reallocate resources automatically and keep systems running smoothly.
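
To make the idea concrete, here is a minimal, purely illustrative control loop in Python. The Host class, the metric names, and the threshold are assumptions for illustration; this is not the actual DRS implementation or any VMware API.

```python
# A minimal, purely illustrative control loop in the spirit of a resource
# scheduler. The Host class, metric names, and threshold are assumptions
# for illustration; this is not the VMware DRS implementation or API.
from dataclasses import dataclass, field

@dataclass
class Host:
    name: str
    gpu_util: float                      # 0.0 - 1.0 GPU saturation
    mem_util: float                      # 0.0 - 1.0 memory pressure
    workloads: list = field(default_factory=list)

def pressure(host: Host) -> float:
    """Collapse the monitored signals into a single hotspot score."""
    return max(host.gpu_util, host.mem_util)

def rebalance(hosts: list, threshold: float = 0.9) -> None:
    """Move one workload from the hottest host to the coolest host
    whenever the hottest host crosses the pressure threshold."""
    ranked = sorted(hosts, key=pressure)
    coolest, hottest = ranked[0], ranked[-1]
    if pressure(hottest) >= threshold and hottest.workloads:
        moved = hottest.workloads.pop()
        coolest.workloads.append(moved)
        print(f"migrated {moved} from {hottest.name} to {coolest.name}")

if __name__ == "__main__":
    cluster = [
        Host("host-01", gpu_util=0.97, mem_util=0.80, workloads=["llm-inference"]),
        Host("host-02", gpu_util=0.40, mem_util=0.35, workloads=["batch-etl"]),
    ]
    rebalance(cluster)   # prints: migrated llm-inference from host-01 to host-02
```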

Inside Broadcom, we've pushed our clusters to 95% utilization and kept them there. That level of efficiency doesn't happen by accident. It comes from having automation woven into every layer of the stack.

Alan Davidson, CIO at Broadcom, shares the results of Broadcom deploying a private cloud platform

Virtualization Without Compromise

Skeptics often assume that virtualizing AI workloads means giving something up. After all, AI workloads have historically run on bare metal for good reasons. But those old assumptions don't hold up once skeptics see the performance we're delivering, which is on par with, and sometimes better than, bare metal.

So what's behind the results? It's not magic. It's the result of decades of hard-won experience in scheduling and resource sharing. VMware has always focused on getting the most out of infrastructure. Our hypervisor doesn't just keep workloads running; it keeps them running smart.

That means we're not wasting resources or building separate silos every time a new workload comes along. Instead, we can pull from a shared pool and keep everything running smoothly. That kind of flexibility is a big deal for enterprise teams balancing legacy systems, core business apps, and AI inference all at once. You don't have to retrain teams or overhaul your tools. The workflows you already know still apply. The result is steadier operations, fewer fire drills, and costs that make sense.

We often think about automation in narrow terms: faster deployment and fewer manual steps. That's true, but the real value comes from end-to-end continuity. In our architecture, automation governs everything (a simplified sketch follows this list):

  • Model deployment and versioning;
  • Security scanning and compliance;
  • Data encryption in transit and at rest;
  • High availability and failover; and
  • Backup and disaster recovery.
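
One way to picture that continuity is as a single declarative policy the platform continuously reconciles. The sketch below is a hypothetical Python rendering of such a policy; the field names and values are assumptions for illustration, not an actual VCF configuration schema.

```python
# Illustrative only: a single declarative policy spanning the automation
# domains above. Field names and values are hypothetical, not a VCF schema.
AUTOMATION_POLICY = {
    "model_deployment": {"strategy": "rolling", "version_pinning": True},
    "security":         {"image_scanning": True, "compliance_profile": "cis"},
    "encryption":       {"in_transit": "tls1.3", "at_rest": "aes-256"},
    "availability":     {"ha_enabled": True, "failover_target_seconds": 60},
    "protection":       {"backup_interval_hours": 4, "dr_site": "secondary"},
}

def enforce(policy: dict) -> list:
    """Walk the policy and list the controls a platform would reconcile
    against live state; a real system would act on each drift it finds."""
    actions = []
    for domain, controls in policy.items():
        for control, value in controls.items():
            actions.append(f"{domain}: ensure {control} = {value}")
    return actions

if __name__ == "__main__":
    for action in enforce(AUTOMATION_POLICY):
        print(action)
```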

If a node fails, workloads are rebalanced automatically with no downtime. If a performance hotspot pops up, we shift resources in real time. Should ransomware strike, we can quickly identify the issue and roll back to a known-good state.
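
For readers who like to see the shape of that logic, here is a hypothetical sketch of the two reactions just described: reassigning a failed node's workloads and choosing a known-good snapshot to roll back to. The function names and data structures are assumptions for illustration, not the platform's actual mechanisms.

```python
# Hypothetical failure-handling sketch; the function names and data shapes
# are assumptions for illustration, not VMware HA or a vSphere API.
from datetime import datetime

def handle_node_failure(failed_node, placements, healthy_nodes):
    """Reassign every workload from a failed node across surviving nodes,
    round-robin, so nothing stays offline."""
    targets = [n for n in healthy_nodes if n != failed_node]
    updated, i = dict(placements), 0
    for workload, node in placements.items():
        if node == failed_node:
            updated[workload] = targets[i % len(targets)]
            i += 1
    return updated

def pick_rollback_point(snapshots, compromise_detected_at):
    """Return the most recent known-good snapshot taken before the
    compromise was detected."""
    return max(s for s in snapshots if s < compromise_detected_at)

if __name__ == "__main__":
    placements = {"inference-a": "node-1", "etl": "node-2"}
    print(handle_node_failure("node-1", placements, ["node-1", "node-2", "node-3"]))
    snaps = [datetime(2025, 5, 1), datetime(2025, 5, 2), datetime(2025, 5, 3)]
    print(pick_rollback_point(snaps, datetime(2025, 5, 2, 12, 0)))
```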

Proof, Not Promises

We understand that some AI architects are still hesitant. They believe serious AI work needs to live on bare metal. They assume performance losses are inevitable or that orchestration is too hard. Some continue to cling to the outdated notion that private infrastructure will always be more complex than the public cloud.

The MLPerf results challenge that.

The numbers prove that virtualized environments can match bare metal on performance and that automation can remove complexity, not add to it. The results also make clear that you can build AI infrastructure that's secure, resilient, and flexible without needing to hand over your data or rack up cloud bills. 

These are independently verified benchmarks, not corner-case scenarios. We ran MLPerf Inference v5.0 across eight virtualized H100 GPUs using vSphere 8.0.3 and showed virtually no performance degradation. In some cases, we even outperformed bare metal. The surprise isn't that it worked; it's how efficiently we can now do this at scale.

A New Beginning

Think about what those numbers suggest is now possible. Our customers can run AI and non-AI workloads side by side, all within the same trusted security controls, backup routines, and audit frameworks they already use. They get the ease and consistency of the cloud without giving up control of their infrastructure.

The conversation has shifted. Enterprises are no longer asking whether virtualization can handle AI. They want to know how fast they can get started. Once they see what we can deliver, they’re off and running. In some cases, they’re going from pilot to production in a single day.

That kind of momentum doesn’t happen by accident. It’s driven by automation built into every layer of the platform. As model architectures become more specialized and iteration cycles accelerate, automation isn’t a nice-to-have. It’s the thing that keeps you from falling behind.

Our job is to help customers stay ahead. That means giving them a platform that’s flexible, resilient, and ready to adapt in real time. With Broadcom’s Private AI architecture and VMware Cloud Foundation at the core, we’re helping teams run smarter, move faster, and stay in control of their infrastructure.