GPU Virtualization for AI Workloads

Are Your Expensive GPUs Sitting Idle While Teams Fight for Resources?

GPU virtualization for AI workloads sits at the intersection of underutilized GPU capacity, rising licensing costs, and teams demanding more resources.

The architecture you choose — vGPU, passthrough, or hybrid — determines whether your GPU infrastructure becomes a productivity enabler or an expensive bottleneck.

[Diagram] GPU Virtualization for AI Workloads — strategic architecture decision with four approaches (VMware vGPU, XCP-ng Passthrough, Hybrid Model, Gigabyte GIGAPOD scale-out) mapped to workload types: model training, real-time inference, VDI/CAD rendering, HPC simulations, and multi-tenant teams. Outcomes range from native CUDA performance (no vGPU licensing with XCP-ng) to scaling up to 256 GPUs at 400Gb/s NDR InfiniBand.

Cost complexity

Enterprise GPU hardware represents a significant capital investment, virtualization licensing adds recurring costs, and VMware's post-Broadcom uncertainty forces platform re-evaluation. The wrong architecture compounds these expenses without delivering the expected performance.

Utilization paradox

Your monitoring shows GPUs aren't fully utilized, yet data scientists still complain they're blocked waiting for resources. The problem isn't capacity; it's allocation.

No clear playbook

CPU virtualization is mature and well-understood; GPU virtualization involves non-obvious tradeoffs between performance, sharing, and operational flexibility that most infrastructure teams haven't navigated before.

We evaluate your workload mix, infrastructure constraints, and cost parameters
to deliver a GPU virtualization approach tailored to your requirements.

Why Is GPU Virtualization Critical for
AI Infrastructure ROI?

GPU virtualization determines whether your AI infrastructure investment delivers productivity gains or becomes an expensive bottleneck.

Unlike mature CPU virtualization, GPU sharing involves fundamental tradeoffs between performance, cost, and operational flexibility.

The architecture you choose—vGPU, passthrough, or hybrid—directly impacts your TCO (Total Cost of Ownership), time-to-deployment for AI projects, and team productivity.

Most infrastructure teams underestimate this complexity because
traditional virtualization patterns don’t apply to GPU workloads.

How Does GPU Virtualization Strategy Affect Your Infrastructure TCO?

Direct Cost Impact

Enterprise GPU hardware represents major capital investment. Your virtualization approach determines operational costs through three mechanisms: licensing fees (vGPU requires NVIDIA licensing on top of VMware/virtualization platform costs), utilization efficiency (shared GPUs reduce hardware requirements but add management complexity), and operational overhead (different approaches require different operational models and skillsets).
The wrong architectural choice compounds these costs without delivering the expected performance.

Time-to-Value Impact

Beyond infrastructure costs, GPU allocation delays impact project velocity.
When data scientists wait hours or days for GPU resources, you're paying expensive engineering time for idle waiting.
Effective GPU virtualization with proper governance reduces provisioning time from days to minutes, directly impacting your AI project time-to-market.
This operational efficiency often delivers greater ROI than raw hardware cost savings.

Capacity Planning Risk

GPU underutilization wastes capital; GPU oversubscription blocks productivity.
Virtualization strategy determines your buffer requirements.
Passthrough requires more hardware buffer (dedicated GPUs may sit idle). vGPU enables higher density but requires careful contention management.
Hybrid approaches balance these tradeoffs but add architectural complexity.
Your workload patterns and governance maturity determine optimal approach.

GPU Infrastructure ROI: Virtualization Approaches

To maximize your AI infrastructure efficiency, we analyze two key financial metrics: CapEx (Capital Expenditures for hardware and GPU acquisition) and OpEx (Operating Expenditures including NVIDIA licensing and energy consumption). This strategic view helps IT Managers optimize their Total Cost of Ownership (TCO).

Virtualization Approach | CapEx Impact | OpEx Impact | Time-to-Value | Best For
VMware vGPU | Lower (high hardware density) | Higher (recurring licensing) | Medium | Enterprise multi-tenant AI
XCP-ng Passthrough | Higher (1:1 GPU allocation) | Lower (no licensing fees) | Fast | Dedicated model training
Hybrid Model | Balanced | Optimized | Variable | Scalable production environments

What Makes GPU Virtualization Fundamentally Different
from CPU Virtualization?

Memory Architecture Challenge

CPUs share system memory with transparent paging. GPUs use dedicated high-bandwidth VRAM that must be explicitly managed. There's no memory overcommitment—once GPU memory is allocated to a VM, it's locked. This rigidity means vGPU profile sizing directly impacts utilization and performance. Wrong profile creates waste (too large) or contention (too small).
Unlike CPU RAM, you can't dynamically adjust GPU memory allocation without VM restart.
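Because there is no overcommitment, profile sizing can be reasoned about with simple arithmetic. A minimal sketch — the GPU size, profile size, and VM counts below are illustrative assumptions, not recommendations:

```python
# Sketch: vGPU profile sizing on a fixed-VRAM GPU (no overcommitment).
# All figures below are illustrative assumptions.

def plan_profiles(gpu_vram_gb: int, profile_gb: int, vms_needed: int):
    """Return (vms_per_gpu, gpus_needed, stranded_vram_gb_per_gpu)."""
    if profile_gb > gpu_vram_gb:
        raise ValueError("profile larger than physical VRAM")
    vms_per_gpu = gpu_vram_gb // profile_gb          # fixed slices, no paging
    gpus_needed = -(-vms_needed // vms_per_gpu)      # ceiling division
    stranded = gpu_vram_gb - vms_per_gpu * profile_gb
    return vms_per_gpu, gpus_needed, stranded

# 80 GB GPU sliced into 12 GB profiles for 20 VMs:
# 6 VMs fit per GPU, 4 GPUs are needed, and 8 GB per GPU is stranded (waste).
print(plan_profiles(80, 12, 20))
```

The stranded-VRAM term is the "too large" waste from the paragraph above; pick a profile too small and the contention shows up as out-of-memory errors inside the VMs instead.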

Driver and Kernel Integration

CPU virtualization operates cleanly at the hypervisor level. GPU drivers require kernel-mode, direct hardware access. Virtualization layers must mediate this access carefully, creating driver compatibility dependencies between GPU firmware, hypervisor version, and guest OS. Version mismatches cause instability or feature unavailability.
This integration complexity requires thorough compatibility validation before deployment—you can't assume "it will work" like CPU virtualization.

Performance Sensitivity to Infrastructure

GPUs demand sustained high-bandwidth data transfer through PCIe, memory, storage, and network. Small infrastructure mistakes cause large performance degradation. Cross-NUMA GPU placement reduces throughput significantly. Insufficient PCIe lanes create bottlenecks. Inadequate storage I/O starves GPUs despite adequate compute.
Unlike CPU workloads that tolerate some infrastructure suboptimality, GPU workloads expose every bottleneck immediately.
This sensitivity requires infrastructure validation before GPU deployment.

Architecture Deep-Dive: CPU vs. GPU Virtualization

While CPU virtualization is a mature technology based on hardware abstraction and resource overcommitment, GPU virtualization requires specialized kernel-mode drivers and is highly sensitive to infrastructure components like PCIe topology and NUMA nodes. Understanding these architectural differences is critical for scaling AI and ML workloads.

Technical Aspect | CPU Virtualization | GPU Virtualization
Memory Model | Shared, overcommittable (RAM ballooning) | Dedicated, fixed VRAM allocation
Driver Complexity | Standardized abstraction | Version-sensitive kernel-mode drivers
Context Switching | Low overhead (microseconds) | Heavy overhead (milliseconds)
Infrastructure Sensitivity | General-purpose hardware | Topology-aware (NUMA, PCIe Gen4/5)
System Maturity | Fully mature & standardized | Dynamic, architecture-specific

What Infrastructure Requirements
Must Be Validated Before
GPU Virtualization Deployment?

PCIe and NUMA Configuration

GPU performance depends on correct PCIe topology and NUMA alignment. Each GPU requires minimum x16 PCIe lanes with direct CPU connection preferred. GPUs physically connect to specific CPU sockets (NUMA nodes)—cross-NUMA traffic incurs performance penalty. Hypervisor must schedule VM CPUs on matching NUMA node and allocate memory from same node. Common failure: GPU in NUMA node 1, VM on node 0 = automatic performance degradation regardless of GPU capability.

Platform Compatibility Matrix

Not all hardware supports all virtualization approaches. vGPU requires NVIDIA datacenter GPUs with GRID support (not all models). Passthrough requires CPU/chipset IOMMU support (Intel VT-d or AMD-Vi) enabled in BIOS. Driver compatibility between specific GPU model, hypervisor version, and guest OS must be validated. Assumptions about compatibility cause deployment failures.
Pre-deployment compatibility audit prevents expensive surprises.

Power, Cooling, and Throughput

Modern datacenter GPUs draw sustained high power under AI workloads. Power delivery must handle multi-GPU peaks; cooling must manage continuous full utilization (not just "meets TDP spec"). Storage throughput must match GPU data consumption rate—training workloads stream large datasets continuously. Network bandwidth must support distributed training communication patterns.
Common mistake: adequate GPU compute but infrastructure bottleneck limits actual performance.

Infrastructure Validation Checklist

Critical Pre-Deployment Validations
  • PCIe topology mapping (which GPU connects to which CPU socket)
  • NUMA configuration and alignment testing
  • IOMMU capability verification (for passthrough deployments)
  • Power delivery adequacy assessment (peak multi-GPU load)
  • Thermal management validation (sustained full load, not idle)
  • Storage I/O bandwidth testing (dataset streaming capability)
  • Network fabric assessment (distributed training requirements)
  • Driver compatibility matrix validation (GPU + hypervisor + guest OS)
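Several of these checks can be scripted against standard Linux sysfs paths before any hypervisor work begins. A minimal sketch, assuming a Linux host and a known GPU PCI address — the helper functions and the example address are ours, not a vendor tool:

```python
# Sketch: pre-deployment passthrough readiness checks read from Linux sysfs.
# The sysfs locations are standard; substitute your GPU's PCI address,
# e.g. iommu_enabled() and gpu_numa_node("0000:41:00.0").
import os

def iommu_enabled(sysfs_root: str = "/sys") -> bool:
    """IOMMU is active when /sys/class/iommu contains at least one entry."""
    path = os.path.join(sysfs_root, "class", "iommu")
    return os.path.isdir(path) and len(os.listdir(path)) > 0

def gpu_numa_node(pci_addr: str, sysfs_root: str = "/sys") -> int:
    """NUMA node the GPU is attached to (-1 means no NUMA information)."""
    path = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr, "numa_node")
    with open(path) as f:
        return int(f.read().strip())

def link_width_ok(pci_addr: str, sysfs_root: str = "/sys", want: int = 16) -> bool:
    """Check the negotiated PCIe width, which can be below the physical slot spec."""
    path = os.path.join(sysfs_root, "bus", "pci", "devices", pci_addr,
                        "current_link_width")
    with open(path) as f:
        return int(f.read().strip()) >= want
```

The `numa_node` result is what must match the VM's CPU and memory placement; a negotiated link width below x16 is exactly the "actual speed differs from slot spec" failure the checklist warns about.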

How Does GPU Governance Impact Virtualization ROI?

Resource Allocation Without Governance

GPU virtualization technology alone doesn't solve allocation problems.
Without governance, shared GPU environments create new issues: users monopolize resources, priority workloads get blocked, no cost accountability. Effective governance requires resource quotas, scheduling policies (fair-share, priority queues), usage tracking (showback/chargeback), and access control. Technology choice (vGPU vs passthrough) interacts with governance model—vGPU enables fine-grained sharing, passthrough provides stronger isolation but less flexibility.

Team Enablement Requirements

GPU virtualization changes operational procedures from traditional IT infrastructure. Teams need training on: GPU resource management (allocation, monitoring, troubleshooting), workload classification (which jobs need which GPU approach), performance optimization (identifying bottlenecks), and cost optimization (right-sizing allocations).
Underestimating this enablement requirement leads to underutilized infrastructure and frustrated users. Plan for knowledge transfer and documentation as part of deployment.

Measuring Success - KPIs for GPU Infrastructure

Effective GPU virtualization delivers measurable improvements. Track: GPU utilization rates (before vs after), resource provisioning time (days to minutes), cost per GPU-hour (TCO divided by useful work), team satisfaction (reduced waiting time), project velocity (time-to-deployment for AI models). These metrics justify investment and guide optimization.
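Two of these KPIs reduce to simple arithmetic once monitoring and finance data are collected. A sketch with illustrative placeholder figures:

```python
# Sketch: cost per useful GPU-hour, one of the KPIs named above.
# All figures are illustrative placeholders, not benchmarks or quotes.

def cost_per_useful_gpu_hour(monthly_tco: float, gpu_count: int,
                             avg_utilization: float,
                             hours_per_month: float = 730.0) -> float:
    """TCO divided by useful work: idle hours inflate the effective rate."""
    useful_hours = gpu_count * hours_per_month * avg_utilization
    return monthly_tco / useful_hours

# Same fleet and spend, utilization raised from 30% to 60%:
before = cost_per_useful_gpu_hour(monthly_tco=50_000, gpu_count=16,
                                  avg_utilization=0.30)
after = cost_per_useful_gpu_hour(monthly_tco=50_000, gpu_count=16,
                                 avg_utilization=0.60)
# Doubling utilization halves the cost of each useful GPU-hour.
print(round(before, 2), round(after, 2))
```

This is why utilization improvements often beat raw hardware savings: the denominator moves without touching CapEx.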
Without measurement, you can't prove ROI or identify improvement opportunities.

The Fundamental Tradeoffs You're Forced to Navigate

Unlike CPU virtualization, where "more cores = more capacity" scales predictably, GPU virtualization forces you to choose between competing priorities:

Performance vs. Sharing
  • Direct GPU access (passthrough) delivers maximum performance but limits flexibility
  • Mediated access (vGPU) enables sharing across multiple VMs but introduces overhead
  • There's no "free lunch"—you optimize for raw speed or resource efficiency, not both
Isolation vs. Utilization
  • Dedicated GPUs guarantee predictable performance but typically run underutilized
  • Shared GPUs improve utilization but create potential for resource contention
  • Your governance model determines which tradeoff is acceptable

The Decision Point

Most infrastructure teams haven’t navigated these tradeoffs before because CPU virtualization doesn’t force these choices.

Understanding which constraint matters most for your organization determines the right approach.

Architectural Differences That Make GPU Virtualization Complex

GPUs weren’t designed with virtualization in mind.

Their architecture creates challenges that don’t exist with CPU virtualization:

Memory Architecture

CPUs use shared system memory with transparent paging and swapping. GPUs use dedicated high-bandwidth memory (VRAM) that must be explicitly managed:

  • No transparent memory overcommitment like CPU RAM
  • Data must be explicitly transferred to/from GPU memory
  • Memory allocation is rigid—once assigned, it can't be dynamically shared
Impact: vGPU profile sizing becomes critical; wrong choice creates waste or contention

Driver Complexity

CPU virtualization happens at the hypervisor level. GPU virtualization requires deep driver integration:

  • GPU drivers operate in kernel mode with direct hardware access
  • Virtualization layers must carefully mediate this access
  • Driver compatibility varies significantly between hypervisor platforms
Impact: Version mismatches cause instability; thorough compatibility validation required

Performance Characteristics

CPUs optimize for latency (fast individual operations). GPUs optimize for throughput (massive parallelism):

  • Thousands of parallel threads executing simultaneously
  • High bandwidth requirements for sustained performance
  • Sensitive to data transfer patterns and PCIe topology
Impact: Infrastructure bottlenecks (storage, network, PCIe) starve GPUs despite adequate compute

Execution Model

CPU workloads context-switch efficiently. GPU workloads don't:

  • GPU context switches are expensive (state save/restore)
  • Long-running GPU kernels can monopolize resources
  • Preemption is limited compared to CPU scheduling
Impact: vGPU environments require careful scheduler tuning to prevent one VM from blocking others
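The scheduling problem can be illustrated with a toy simulation: one long training kernel queued ahead of short inference kernels, on a non-preemptive GPU versus a time-sliced scheduler. The durations and the round-robin model are simplifying assumptions, not a model of any vendor's actual scheduler:

```python
# Sketch: why coarse GPU preemption hurts multi-tenant latency.
# Kernel durations are in milliseconds and purely illustrative.

def fifo_completion(kernels):
    """Non-preemptive: each kernel waits for everything queued before it."""
    t, done = 0, []
    for k in kernels:
        t += k
        done.append(t)
    return done

def round_robin_completion(kernels, slice_ms=1):
    """Preemptive time-slicing: short kernels finish early."""
    remaining = list(kernels)
    done = [0] * len(kernels)
    t = 0
    while any(r > 0 for r in remaining):
        for i, r in enumerate(remaining):
            if r > 0:
                step = min(slice_ms, r)
                t += step
                remaining[i] -= step
                if remaining[i] == 0:
                    done[i] = t
    return done

# One 100 ms training kernel queued ahead of two 2 ms inference kernels:
print(fifo_completion([100, 2, 2]))         # inference waits behind training
print(round_robin_completion([100, 2, 2]))  # inference completes early
```

In the FIFO case the inference kernels finish at 102 and 104 ms; with time-slicing they finish at 5 and 6 ms, at the cost of slightly delaying the long kernel — the tradeoff vGPU scheduler tuning navigates.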

The Technical Reality

These aren’t configuration issues—they’re fundamental architectural differences requiring different design approaches.

Infrastructure Prerequisites That Determine Success

Most "GPU virtualization doesn't work" problems stem from infrastructure issues, not software bugs. Critical prerequisites must be validated before deployment:

PCIe Configuration

GPUs require sustained high-bandwidth PCIe connectivity:

  • Minimum x16 lanes per GPU required
  • Direct CPU-to-GPU connections preferred over PCIe switches
  • Actual negotiated speed often differs from physical slot specs
  • Inadequate PCIe bandwidth creates immediate bottlenecks
  • Validation required: Physical topology mapping and speed testing before GPU assignment

NUMA Topology Alignment

Modern servers have Non-Uniform Memory Architecture (NUMA):

  • GPUs physically connect to one CPU socket (NUMA node)
  • Cross-NUMA traffic incurs significant performance penalty
  • Hypervisor must schedule VM CPUs on correct NUMA node
  • Memory must be allocated from matching NUMA node
  • Common failure: GPU in NUMA node 1, VM CPUs scheduled on node 0

Platform Compatibility

Not all hardware supports GPU virtualization equally:

  • For vGPU: Requires NVIDIA datacenter GPUs with vGPU support
  • For Passthrough: Requires CPU/chipset IOMMU support
  • BIOS configuration must enable virtualization features
  • Driver compatibility between GPU, hypervisor, and guest OS
  • Pre-deployment requirement: Hardware compatibility validation

Power and Thermal Management

Modern datacenter GPUs draw substantial power:

  • Power delivery must handle peak multi-GPU loads
  • Cooling must be adequate for continuous full utilization
  • Thermal throttling reduces effective performance invisibly
  • Design requirement: Power and cooling validation under realistic workload

vGPU vs GPU Passthrough: Technical Comparison

Software GPU Slicing

VMware vGPU (NVIDIA GRID)

Technology: NVIDIA vGPU software partitions physical GPUs into virtual GPU instances. Each vGPU profile allocates a portion of GPU framebuffer, compute resources, and memory bandwidth.

Supported workloads:

  • VDI (Virtual Desktop Infrastructure)
  • CAD/CAM rendering applications
  • Graphics-accelerated remote workstations
  • AI inference workloads with moderate GPU requirements

Technical characteristics:

  • Multiple VMs can share a single physical GPU
  • vMotion support for live migration
  • Profiles define resource allocation (e.g., A100-8GB)
  • Scheduling managed by NVIDIA & ESXi hypervisor

Licensing requirements:

  • NVIDIA vGPU software license (subscription)
  • VMware vSphere Enterprise Plus license
  • Broadcom licensing structure applies

Limitations:

  • CUDA performance overhead
  • Framework compatibility constraints
  • Static profiles (resizing requires VM restart)
Discuss VMware vGPU Architecture
PCI Device Assignment

XCP-ng GPU Passthrough

Technology: PCI passthrough (VT-d/AMD-Vi) assigns an entire physical GPU directly to a single VM for native, bare-metal performance.

Supported workloads:

  • AI model training (PyTorch, TensorFlow)
  • Deep learning with full CUDA access
  • GPU-accelerated HPC simulations
  • Latency-sensitive workloads

Technical characteristics:

  • 1:1 mapping (Dedicated Hardware)
  • Zero virtualization overhead
  • Full hardware feature access
  • IOMMU security isolation

Licensing requirements:

  • XCP-ng Vates Enterprise Edition
  • No NVIDIA software license required
  • Standard native drivers

Limitations:

  • No live migration support
  • Reduced scheduling flexibility
  • Requires more physical hardware
  • VM shutdown for reassignment
Discuss XCP-ng GPU Architecture

Which Architecture Fits Your Infrastructure?

The choice between vGPU and GPU passthrough depends on
workload characteristics, team size, and operational requirements.

Here’s how to decide:

Choose VMware vGPU
When:

  • You have many concurrent users running VDI, CAD, or light AI inference requiring GPU sharing
  • Your workloads require live migration (vMotion) for maintenance windows
  • You already have VMware vSphere Enterprise Plus infrastructure
  • GPU workloads are graphics-intensive rather than CUDA compute-intensive
  • You can absorb NVIDIA vGPU subscription licensing costs (check current NVIDIA pricing)

Choose XCP-ng Passthrough
When:

  • Your workloads are AI model training requiring full CUDA performance (PyTorch, TensorFlow)
  • You have departmental AI teams who can work with dedicated GPU assignment
  • You want to eliminate NVIDIA vGPU licensing and control virtualization TCO
  • Your teams accept no live migration in exchange for native GPU performance
  • You're evaluating VMware alternatives due to Broadcom licensing changes

Licensing Cost Comparison

GPU hardware costs are identical regardless of virtualization choice.
The Total Cost of Ownership (TCO) difference comes from virtualization platform licensing:

Licensing Component | VMware vGPU | XCP-ng Passthrough
Hypervisor Platform | VMware vSphere Enterprise Plus (Broadcom subscription) | XCP-ng Vates Enterprise Edition
NVIDIA vGPU Software | Required subscription (per GPU or per user) | Not required
GPU Driver in Guest VM | NVIDIA vGPU driver (included in vGPU license) | Standard NVIDIA driver (no additional license)
Licensing Model | Recurring (annual subscription) | One-time + support contract

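The structural difference between recurring and one-time licensing can be modeled before requesting quotes. A sketch in which every figure is a placeholder input to be replaced with real vendor pricing — nothing here reflects actual VMware, Vates, or NVIDIA prices:

```python
# Sketch: comparing recurring vs. one-time licensing structures over N years.
# All prices are placeholder inputs; substitute real quotes.

def tco(hw_capex, annual_platform, annual_vgpu_per_gpu, gpu_count, years):
    """Hardware CapEx is identical across approaches; the OpEx structure differs."""
    return hw_capex + years * (annual_platform + annual_vgpu_per_gpu * gpu_count)

gpus, years, hw = 8, 5, 250_000   # placeholder fleet and hardware cost
vgpu_path = tco(hw, annual_platform=40_000, annual_vgpu_per_gpu=2_000,
                gpu_count=gpus, years=years)
passthrough_path = tco(hw, annual_platform=10_000, annual_vgpu_per_gpu=0,
                       gpu_count=gpus, years=years)
print(vgpu_path, passthrough_path)  # per-GPU subscriptions scale with fleet size
```

The point of the model is the shape, not the numbers: the vGPU path has a term that grows linearly with GPU count every year, which is why the gap widens as the fleet scales.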
Scaling from Shared GPUs to AI Clusters

GPU virtualization addresses departmental-scale GPU sharing. For multi-rack AI infrastructure, we deploy Gigabyte GIGAPOD architecture integrating GPU compute, high-speed networking, and storage.

GIGAPOD scalable unit specifications (source: Gigabyte GIGAPOD documentation):

Component | Specification
GPU servers | 32x Gigabyte G593 series (8 GPUs per server = 256 GPUs total)
GPU options | NVIDIA HGX H200/B200/B300, AMD Instinct MI300/MI350 Series, Intel Gaudi 3
Intra-server interconnect | NVIDIA NVLink (900GB/s GPU-to-GPU) or AMD Infinity Fabric Link
Inter-server networking | NVIDIA Quantum-2 QM9700 switches (400Gb/s NDR InfiniBand), fat-tree topology
Network topology | Non-blocking fat-tree: 8 leaf switches (middle layer), 4 spine switches (top layer)
Cooling options | Air-cooled (8 compute racks, 50-100kW/rack) or liquid-cooled (4 compute racks, 90-120kW/rack with DLC)
Management | Gigabyte POD Manager (GPM) for DCIM, workload orchestration, MLOps integration

The foundation of this architecture is the Gigabyte G593 series, a specialized 8-GPU compute node engineered specifically for the thermal and power demands of high-density AI training. Whether deployed in air-cooled or liquid-cooled configurations, these servers provide the raw compute power and I/O throughput required for the GIGAPOD’s non-blocking fabric.

Gigabyte G593 series server specifications:
  • Form factor: 5U chassis (industry-leading density for air-cooled 8-GPU configuration)
  • CPU: Dual Intel Xeon Scalable (4th/5th gen) or AMD EPYC 9004/9005 series
  • Memory: 24 DIMMs (AMD) or 32 DIMMs (Intel) with DDR5 support
  • Storage: 8x 2.5" Gen5 NVMe/SATA/SAS-4 hot-swap bays
  • PCIe expansion: 4x PCIe Gen5 switches for RDMA, NVMe direct GPU access
  • Power: 4+2 redundant 3000W 80 PLUS Titanium PSUs
  • Network: 8x NVIDIA ConnectX-7 NICs (one per GPU) for InfiniBand/Ethernet RDMA

Direct Liquid Cooling (DLC) variant: 4U chassis with cold plates on CPU, GPU, and NVSwitch. Achieves higher rack density by removing air-cooling components.

Reference: GIGAPOD One-Stop Service Documentation

The Virtualtek Way

There’s no free lunch in GPU virtualization, but we make sure you’re not paying for the whole restaurant.

GPU Virtualization — Frequently Asked Questions

Independent technical guidance — vGPU, passthrough, hybrid, GIGAPOD.

When should you choose VMware vGPU versus GPU passthrough?

Direct answer: Use VMware vGPU when your workload requires GPU sharing, live migration capability, or you're deploying VDI/graphics workloads. Use GPU passthrough when your workload requires native CUDA performance, you want to avoid NVIDIA vGPU licensing, or you're deploying on XCP-ng.

Decision Factor | VMware vGPU | XCP-ng Passthrough
Workload type | VDI, CAD, light AI inference | AI training, HPC, full CUDA compute
GPU sharing | Multiple VMs per GPU | 1:1 dedicated assignment
Live migration | Supported (vMotion) | Not supported
Performance overhead | CUDA virtualization overhead | Native bare-metal speed
NVIDIA licensing | vGPU subscription required | Not required
Hypervisor licensing | VMware vSphere Enterprise Plus | XCP-ng Vates Enterprise

For organizations evaluating VMware alternatives due to post-Broadcom licensing changes, XCP-ng passthrough is often the most cost-effective path. For complete platform comparison, see our XCP-ng Enterprise Virtualization capabilities.

Need an architecture recommendation? Book an AI infrastructure consultation.

What hardware is required for GPU passthrough?

Direct answer: GPU passthrough requires specific CPU, BIOS, GPU, and PCIe topology configurations. Most "passthrough doesn't work" issues come from misconfiguration, not software bugs.

Hardware prerequisites checklist:

  • CPU + chipset — IOMMU support required (Intel VT-d or AMD-Vi)
  • BIOS/UEFI — IOMMU enabled, virtualization extensions enabled
  • GPU — supports PCI passthrough (most NVIDIA Tesla/A-series and AMD Instinct GPUs do)
  • PCIe lanes — minimum x16 lanes per GPU, direct CPU connection preferred
  • NUMA topology — GPU and VM must align on the same NUMA node
  • Guest OS — appropriate native NVIDIA/AMD drivers installed
  • Hypervisor — XCP-ng host must not use the GPU (no display output on passed-through GPU)

Pre-deployment infrastructure audit prevents expensive surprises. For enterprise IT infrastructure with GPU-ready Gigabyte servers configured by Virtualtek, all these prerequisites are validated before delivery.

Need a hardware compatibility audit? Schedule a technical consultation.

Does GPU virtualization integrate with Kubernetes?

Direct answer: Yes. Kubernetes GPU device plugins work with both vGPU and GPU passthrough. The integration approach differs slightly between the two architectures.

Component | vGPU integration | Passthrough integration
Device plugin | NVIDIA GPU Operator with vGPU support | Standard NVIDIA device plugin
Time-slicing | Native support via vGPU profiles | Available through MIG (A100/H100)
Multi-instance GPU (MIG) | Supported on H100, A100 | Supported on H100, A100
Live migration of pods | Supported (via vMotion) | Pod-level only (VM rebind required)

GPU scheduling policies (time-slicing, multi-instance GPU) apply at the Kubernetes layer regardless of underlying virtualization. We design GPU-aware Kubernetes clusters on both XCP-ng and VMware infrastructure, with full integration into your AI infrastructure stack.

Designing GPU-aware Kubernetes? Book an AI infrastructure consultation.

Does GPU virtualization alone solve allocation problems?

Direct answer: GPU virtualization technology alone doesn't solve allocation problems. Without governance, shared GPU environments create new issues: users monopolize resources, priority workloads get blocked, no cost accountability.

Effective GPU governance combines four layers:

  • Resource quotas — per team, per project, per user limits
  • Scheduling policies — fair-share, priority queues, preemption rules
  • Usage tracking — showback or chargeback by team/project
  • Access control — who can request what GPU profile

Implementation by architecture:

  • vGPU — configure appropriate profiles matching workload requirements; use VMware resource pools and reservations
  • Passthrough — use Kubernetes resource limits and node affinity; implement job queuing systems (SLURM, Kubernetes batch scheduling)
  • Both — monitor GPU utilization, memory usage, and queue depths to identify bottlenecks before they become incidents
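The quota layer can be prototyped independently of the underlying scheduler. A minimal sketch of per-team concurrency limits — the team names and limits are invented for illustration, and a real system would sit behind a job queue:

```python
# Sketch: per-team GPU quota enforcement, the first governance layer listed
# above. Teams and limits are illustrative assumptions.

class GpuQuota:
    def __init__(self, limits: dict):
        self.limits = limits                       # max concurrent GPUs per team
        self.in_use = {team: 0 for team in limits}

    def request(self, team: str, gpus: int) -> bool:
        """Grant only if the team stays within quota; otherwise queue or reject."""
        if self.in_use[team] + gpus > self.limits[team]:
            return False
        self.in_use[team] += gpus
        return True

    def release(self, team: str, gpus: int) -> None:
        self.in_use[team] = max(0, self.in_use[team] - gpus)

quota = GpuQuota({"research": 4, "prod-inference": 2})
print(quota.request("research", 3))   # True  — within quota
print(quota.request("research", 2))   # False — would exceed the limit of 4
quota.release("research", 1)
print(quota.request("research", 2))   # True  — 2 in use + 2 requested <= 4
```

Scheduling policies and showback then build on exactly this kind of accounting: the `in_use` counters are what usage tracking reports per team.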

Technology choice (vGPU vs passthrough) interacts with governance model — vGPU enables fine-grained sharing, passthrough provides stronger isolation but less flexibility. Your team enablement strategy matters as much as the architecture itself.

Need a governance framework for AI infrastructure? Explore RAIGF — Responsible AI Governance Framework.

How much does NVIDIA vGPU licensing add to TCO?

Direct answer: NVIDIA vGPU adds recurring subscription licensing on top of GPU hardware costs and hypervisor licensing. The actual cost depends on workload type and current NVIDIA pricing — but for AI training workloads, it can represent a significant portion of TCO.

Cost Component | VMware vGPU | XCP-ng Passthrough
GPU hardware | Identical (CapEx) | Identical (CapEx)
Hypervisor | VMware vSphere Enterprise Plus subscription | XCP-ng Vates Enterprise (lower)
NVIDIA vGPU software | Required subscription (per-GPU or per-user) | Not required
Guest OS GPU driver | NVIDIA vGPU driver (included with subscription) | Standard NVIDIA driver (free)
Licensing model | Recurring annual | One-time + support contract
Scaling cost | Linear with GPU count | Flat per-socket

For AI training workloads where you need full CUDA performance, passthrough often delivers better TCO. For mixed VDI + light inference where GPU sharing matters more, vGPU may justify its licensing cost. We provide detailed TCO modeling during the AI infrastructure consultation.

Want a real TCO comparison for your workload? Schedule a consultation.

What network infrastructure do GPU clusters require?

Direct answer: Network requirements scale with cluster size. Small deployments (2-4 servers) work with 100GbE. Large GIGAPOD-scale clusters require 400Gb/s InfiniBand with non-blocking fat-tree topology and RDMA support.

Cluster Scale | Network Fabric | Topology
1-2 GPU servers | 100GbE Ethernet | Direct or single switch
2-4 GPU servers | 100GbE with RoCE v2 | Spine-leaf, RDMA enabled
4-32 GPU servers | 400Gb/s InfiniBand or RoCE v2 | Non-blocking fat-tree
GIGAPOD scale (256 GPUs) | NVIDIA Quantum-2 QM9700 (400Gb/s NDR InfiniBand) | Fat-tree: 8 leaf + 4 spine switches

Each GPU server requires one NIC per GPU (8 NICs for 8-GPU server). RDMA support is essential for GPU-to-GPU communication without CPU involvement — this is what enables distributed training to scale linearly with GPU count instead of plateauing.
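The one-NIC-per-GPU rule makes fabric sizing mechanical. A sketch for the 400Gb/s NDR InfiniBand case (the helper function is ours, for illustration):

```python
# Sketch: fabric sizing from the rule stated above — one NIC per GPU,
# each NIC at the fabric line rate. Figures shown are for 400 Gb/s NDR.

def fabric_requirements(servers: int, gpus_per_server: int = 8,
                        link_gbps: int = 400):
    """Return (total NICs, aggregate injection bandwidth in Tb/s)."""
    nics = servers * gpus_per_server
    return nics, nics * link_gbps / 1000

# GIGAPOD scalable unit: 32 servers x 8 GPUs = 256 NICs, 102.4 Tb/s injection.
print(fabric_requirements(32))
```

The non-blocking fat-tree requirement means the switch layers must carry this full injection bandwidth without oversubscription, which is what drives the 8-leaf/4-spine QM9700 layout.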

Network design integrates with enterprise storage architecture for AI workloads — storage and network must be sized together to avoid GPU starvation.

Designing AI cluster networking? Book a consultation.

What storage performance do AI training workloads require?

Direct answer: AI training workloads require storage that sustains throughput matching GPU consumption rates. Under-provisioned storage creates GPU idle time — you pay for compute that waits on data instead of training.

Storage throughput by GPU cluster size:

GPU Cluster | Required Throughput | Storage Architecture
4× H100 | 12-20 GB/s | Single NVMe all-flash appliance
8× H100 | 24-40 GB/s | Dual NVMe clustered (Infortrend EonStor GSx)
16× H100 | 48-80 GB/s | Multi-node parallel cluster
GIGAPOD (256 GPUs) | 200+ GB/s aggregate | Dedicated parallel storage rack
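The table rows follow from a per-GPU throughput rule of roughly 3-5 GB/s sustained per H100-class GPU. A sizing sketch based on that assumption:

```python
# Sketch: storage throughput sizing from an assumed per-GPU rule of
# ~3-5 GB/s sustained read per H100-class GPU during training.

def storage_throughput_gbps(gpu_count: int,
                            per_gpu_low: float = 3.0,
                            per_gpu_high: float = 5.0):
    """Aggregate sustained read throughput range, in GB/s."""
    return gpu_count * per_gpu_low, gpu_count * per_gpu_high

for gpus in (4, 8, 16):
    low, high = storage_throughput_gbps(gpus)
    print(f"{gpus} GPUs -> {low:.0f}-{high:.0f} GB/s")
# Reproduces the cluster rows: 12-20, 24-40, and 48-80 GB/s.
```

The per-GPU figure is workload-dependent (dataset format, caching, checkpoint frequency), so treat it as a planning input to validate against your own training jobs.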

For reference: NVIDIA A100 (80GB) can consume 2 TB/s memory bandwidth internally; external storage should minimize data loading bottlenecks. Storage architecture must support parallel access from multiple GPU nodes — this is where parallel NVMe storage like Infortrend EonStor GSx matters.

Gigabyte GIGAPOD integrates dedicated storage servers in the management rack, sized for the full cluster's training workload. See our complete AI infrastructure solutions for detailed reference architectures.

Need help sizing AI storage? Schedule a consultation.

How much performance overhead does GPU virtualization add?

Direct answer: Performance overhead depends entirely on the architecture chosen. Passthrough is near bare-metal; vGPU adds measurable but acceptable overhead for most workloads.

Architecture | Compute Overhead | Memory Bandwidth | Best For
Bare-metal | 0% (baseline) | Full | Single-tenant max performance
Passthrough | ~1-2% (negligible) | Full | AI training, HPC, full CUDA
vGPU (full profile) | ~5-10% | Allocated portion | Single-VM dedicated profile
vGPU (shared) | ~10-25% | Per-profile allocation | Multi-tenant VDI, inference

The overhead numbers above are typical ranges — actual performance depends on workload patterns, PCIe topology, NUMA alignment, and driver versions. Most "GPU virtualization is slow" complaints stem from infrastructure misconfiguration (cross-NUMA placement, insufficient PCIe lanes, driver mismatches), not from virtualization overhead itself.

Pre-deployment validation includes physical topology mapping, NUMA configuration testing, and baseline performance benchmarking — making sure your GPUs actually deliver their rated performance before workloads go live.

Concerned about GPU performance? Get an architecture review.

When should you scale beyond GPU virtualization to cluster architecture?

Direct answer: GPU virtualization addresses departmental-scale GPU sharing on single or small clusters. For multi-rack AI infrastructure, Gigabyte GIGAPOD architecture replaces virtualization with dedicated cluster nodes orchestrated through specialized fabrics.

Scale | Approach | Why this fits
Single 8-GPU server | Virtualization (vGPU or passthrough) | Multi-tenant team sharing
2-4 GPU servers | Hybrid virtualization + cluster | Mixed VDI + training workloads
4-32 GPU servers | Bare-metal cluster + Kubernetes | Distributed training scale
GIGAPOD (256+ GPUs) | Dedicated AI Factory architecture | Foundation model training

GIGAPOD scalable unit specifications: 32× Gigabyte G593 series (8 GPUs per server = 256 GPUs total), NVIDIA HGX H200/B200/B300 or AMD Instinct MI300/MI350, 400Gb/s InfiniBand fat-tree, air-cooled or direct liquid-cooled (DLC) options. Managed via Gigabyte POD Manager (GPM) for DCIM, workload orchestration, and MLOps integration.

For complete AI Factory infrastructure design, see our AI Solutions page. As an Official Gigabyte AI partner, we deliver GIGAPOD architecture from assessment through deployment.

Designing AI infrastructure beyond single-server scale? Book a consultation.

Deploy GPU Virtualization Infrastructure

We design GPU virtualization and cluster architectures for XCP-ng and VMware platforms.
Independent technical guidance covering vGPU, GPU passthrough, and GIGAPOD deployments.

You bring the business challenges.

We design the ICT architecture to address them.

Partner of Medium Business Success

AI Infrastructure & Virtualization Experts

Specialized in:
– AI Infrastructure (Official Gigabyte & NVIDIA Partner)
– Virtualization (VMware Expert + Official Vates MSP)
– Enterprise Storage (Open-e, StorONE, Infortrend, AIC)
– RAIGF™ Governance (Exclusive European Distributor)
