Independently Reproducible

Real Hardware, Real Results

Every number on this page was measured on physical hardware in our lab. No simulations, no projections, no cherry-picked runs. Full methodology and hardware specs are disclosed below so you can reproduce these results.

Methodology & Transparency

We believe performance claims require context. Here's exactly how we measured.

How We Test

Baseline

Default compiler flags, default OS scheduler settings, default block sizes. No manual tuning. This represents what most users get out of the box.

Optimized

Parameters selected by Mesh Optimizer's behavior atlas after profiling 83,000+ configurations. The optimizer discovers optimal block sizes, ILP chains, tile dimensions, and memory access patterns per kernel type.

Measurement

Each benchmark runs 20+ iterations after warmup. We report the median result. GPU benchmarks use hardware event counters (rocprofv3 on AMD, nvprof on NVIDIA) rather than wall-clock timing where possible.
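The measurement protocol above (warmup, then 20+ timed iterations, reporting the median) can be sketched as a minimal harness. This is an illustrative wall-clock version, not the actual probe code; the real measurements prefer hardware event counters where available:

```python
import statistics
import time


def benchmark(kernel, warmup=5, iterations=20):
    """Run `kernel` a few times untimed, then report the median of
    `iterations` timed runs. The median resists outliers from OS jitter
    better than the mean or the minimum.

    Wall-clock fallback only; hardware event counters (rocprofv3,
    nvprof) replace this timing where possible.
    """
    for _ in range(warmup):
        kernel()
    samples = []
    for _ in range(iterations):
        start = time.perf_counter()
        kernel()
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)
```

The warmup runs matter on GPUs in particular, since the first dispatches absorb JIT compilation and clock ramp-up that would otherwise skew the samples.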

Reproducibility

All probe kernels are open source. The atlas database files used for these measurements contain the raw data points. Results will vary by hardware, driver version, and system load.

Important: Percentage improvements are measured against unoptimized defaults on the same hardware. Larger improvements (e.g., atomics, WMMA) reflect cases where default parameters are far from optimal for the specific operation. Smaller improvements (e.g., overlap, reduction) reflect operations where defaults are already reasonable. Your results will depend on your workloads, hardware, and baseline configuration.

Test Hardware

Primary GPU

AMD Radeon RX 7900 XTX

Architecture: RDNA 3 (gfx1100)
Compute Units: 96 (48 WGPs)
VRAM: 24 GB GDDR6
Peak BW: 960 GB/s
Peak FP32: 61 TFLOPS
Driver: ROCm 7.2.0
Data Points: 81,305
NVIDIA GPU

Quadro RTX 5000

Architecture: Turing (sm_75)
CUDA Cores: 3,072
VRAM: 16 GB GDDR6
Peak BW: 448 GB/s
Peak FP32: 11.2 TFLOPS
Driver: CUDA 13.0
Data Points: 266
CPU

Intel Xeon Silver 4108

Cores / Threads: 8C / 16T
Frequency: 1.8–3.0 GHz
RAM: 96 GB DDR4
NUMA Nodes: 1
Data Points: 410
Remote Node

AMD Ryzen 9 9950X

Cores / Threads: 16C / 32T (Zen 5)
RAM: 124 GB DDR5
Accelerator: Xilinx FPGA
iGPU: RDNA 3.5 (2 GB)
Data Points: 285
Remote Node

2× Intel Xeon E5-2640 v2

Cores / Threads: 16C / 32T (2-socket)
RAM: 220 GB DDR3
Platform: Dell PowerEdge R620
Role: FPGA synthesis host
Data Points: 283
Budget GPU

NVIDIA GeForce GTX 1050

Architecture: Pascal (sm_61)
CUDA Cores: 640
VRAM: 2 GB GDDR5
Peak FP32: 1.86 TFLOPS
Driver: CUDA 12.6
Data Points: 451

Total atlas: 83,174 data points · 95,520 invariants · 29 kernel types · 8 hardware sources

AMD RX 7900 XTX — Kernel-Level Results

Before/after comparison across 11 kernel types. "Before" uses default parameters (block_size=256, no ILP, stride=1). "After" uses atlas-discovered optimal parameters.

Kernel Type Before (Default) After (Optimized) Improvement Key Optimization
Atomic Operations 2.9 GOPS 431 GOPS +5,100% Uncontended partitioning, optimal block size
WMMA (Tensor Ops) ~1.8 TFLOPS 88.8 TFLOPS +4,776% 6 WMMA chains, k_tiles=34
Tiled GEMM ~0.3 TFLOPS ~9 TFLOPS +2,831% Tile size + shared memory padding
Wave Scheduling ~0.5 TFLOPS ~11 TFLOPS +2,106% Wavefront occupancy balancing
Strided Access ~55 GB/s ~884 GB/s +1,506% stride=1, coalesced access
Memory Bandwidth ~148 GB/s 867 GB/s +484% Vectorized loads, block=128
LDS (Shared Memory) ~237 GB/s 878 GB/s +271% stride=22 (avoids bank conflicts)
FP32 Compute ~16 TFLOPS 50 TFLOPS +208% 7 FMA chains, 2.33x ILP speedup
FP16 Compute ~28 TFLOPS 57.9 TFLOPS +102% 11 half2 chains (less ILP-sensitive)
Reduction ~450 GB/s ~780 GB/s +73% Block=512, warp-level reduction
Overlap (Roofline) ~595 GB/s 781 GB/s +31% compute_intensity=2 saturation point

Note on large improvements: Atomic (+5,100%) and WMMA (+4,776%) show the largest gains because their default configurations are particularly suboptimal. Atomics default to fully contended access patterns; WMMA defaults don't utilize instruction-level parallelism. These represent the upper bound of what optimization can achieve. Workloads that are already memory-bandwidth-bound (reduction, overlap) show more modest but still significant gains.
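The percentage figures in the table map directly to speedup multiples (a +5,100% improvement is a 52x speedup). A quick converter, useful as a sanity check when comparing rows:

```python
def improvement_pct(baseline, optimized):
    """Percentage improvement over baseline:
    ((optimized - baseline) / baseline) * 100."""
    return (optimized - baseline) / baseline * 100


def pct_to_speedup(pct):
    """Convert a percentage improvement to a speedup multiple.
    e.g. +5,100% -> 52x, +31% -> 1.31x."""
    return 1 + pct / 100
```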

NVIDIA Quadro RTX 5000 — cuBLAS & cuDNN

Oracle-verified measurements against vendor library implementations. These validate that our optimization framework produces correct results at near-peak performance.

cuBLAS (GEMM)

Matrix Size          GFLOPS    % of Peak
128 × 128            152       1.4%
1024 × 1024          7,836     70.0%
4096 × 4096          10,821    96.6%
Batched (64×1024)    9,652     86.2%
LLaMA FFN shape      10,386    92.7%

Peak: 11.2 TFLOPS FP32. 31/31 correctness tests pass (100%).
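The %-of-peak column is simply measured throughput divided by the 11.2 TFLOPS FP32 peak; reproducing it from the table's GFLOPS numbers:

```python
PEAK_GFLOPS = 11_200  # Quadro RTX 5000 FP32 peak: 11.2 TFLOPS


def pct_of_peak(gflops, peak=PEAK_GFLOPS):
    """Measured throughput as a percentage of theoretical peak."""
    return gflops / peak * 100


# 4096 x 4096 GEMM: 10,821 GFLOPS -> ~96.6% of peak
```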

cuDNN (Convolution & Normalization)

Operation                  GFLOPS / GB/s     % of Peak
Conv2d (ResNet res4a)      14,391 GFLOPS     128%*
Conv2d (ResNet res3a)      12,899 GFLOPS     115%*
Conv2d (ResNet conv1)      4,934 GFLOPS      44%
Linear (LLaMA up)          9,826 GFLOPS      87.7%
BatchNorm                  290 GB/s          64.8%
Softmax                    354 GB/s          79.0%

*Exceeds theoretical FP32 peak via Tensor Core acceleration. 44/48 tests pass; 2 are disabled (MIOpen JIT bug) and 2 are flaky due to timing variance.

GTX 1050 (Budget GPU) — Optimization Still Matters

Even a seven-year-old GPU with 2 GB of VRAM benefits from atlas-driven optimization. Measured with CUDA 12.6 (sm_61).

Metric Measured % of Peak Key Finding
Memory Bandwidth 89 GB/s 119%* Cache-line hits exceed DRAM peak; block=256 optimal
FP32 Compute 581 GFLOPS 31% 15 ILP chains at block=32
GEMM (cuBLAS) 1,772 GFLOPS 95% 2048×2048 optimal for 2GB VRAM
Shared Memory 311 GB/s stride=15 best; 79% of configs hit bank conflict cliffs
Occupancy 2,000 GFLOPS block=1024 with 205 blocks; steep cliff at low block counts

*Bandwidth exceeds DRAM peak (75 GB/s) because L2 cache serves repeated accesses. 451 data points, 209 invariants, 70 performance cliffs detected.

CPU & Memory Subsystem Optimization

Mesh Optimizer doesn't just tune GPUs. System-level optimizations target scheduler, hugepages, NUMA, and memory hierarchy.

System Tuning Recommendations

Generated per-node based on hardware inventory and JEPA confidence data:

  • CPU governor → performance (vs. default powersave)
  • Hugepages scaled to RAM (128GB → 4096×2MB = 8GB)
  • NUMA balancing enabled for multi-socket (E5-2640v2)
  • Swappiness reduced (60 → 1 on 64GB+ systems)
  • I/O scheduler selection per device type
  • Network buffer scaling for FPGA data streaming
  • Scheduler autogroup disabled, util_clamp raised for 16+ core systems
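The hugepage rule above (128 GB RAM → 4096 × 2 MB pages = 8 GB reserved) can be sketched as a simple scaling function. The 1/16 reserve fraction is an assumption inferred from that single example, not a documented constant of the tuner:

```python
def hugepage_count(ram_gb, page_mb=2, reserve_fraction=1 / 16):
    """Number of hugepages to reserve, scaling with installed RAM.

    reserve_fraction=1/16 is inferred from the 128 GB -> 8 GB
    (4096 x 2 MB pages) example; the real recommendation engine
    may use a different policy per node.
    """
    reserve_mb = ram_gb * 1024 * reserve_fraction
    return int(reserve_mb // page_mb)
```

On Linux the resulting count would typically be applied via `vm.nr_hugepages`.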

Memory Hierarchy Discoveries

Metric                       Value
Infinity Cache (7900 XTX)    2,562 GB/s at <96 MB working set (2.7× DRAM)
IC cliff point               96 MB (beyond this, bandwidth falls to DRAM speeds)
LDS optimal stride           22 (avoids 32-bank conflicts)
LDS bank conflict penalty    60% throughput loss at stride=32
PCIe throughput              13–14 GB/s (43–45% of PCIe 4.0 x16)
Launch latency               2.6 µs per kernel dispatch
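LDS bank conflicts arise when multiple lanes in a wavefront map to the same bank, serializing the access. A toy model of the effect, assuming 32 four-byte banks and 32 lanes (real RDNA 3 LDS arbitration differs in detail), counting the worst-case number of lanes that collide on one bank for a given element stride:

```python
from collections import Counter


def max_bank_conflict(stride, lanes=32, banks=32):
    """Worst-case number of lanes hitting the same bank when lane i
    reads 4-byte element i * stride (one element per bank slot).
    1 means conflict-free; `banks` means fully serialized."""
    hits = Counter((lane * stride) % banks for lane in range(lanes))
    return max(hits.values())


# stride=32: every lane lands on bank 0 (full serialization)
# stride=1:  each lane hits a distinct bank (conflict-free)
```

This toy model explains why power-of-two strides hit the worst cliffs; the measured optimum of stride=22 on this hardware reflects device-specific behavior beyond the simple model.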

Aggregate Impact — 3.15x Average Speedup

Geometric mean across all workload types on the RX 7900 XTX. Individual results vary significantly by workload type.

3.15x
Average Speedup
Geometric mean, all kernel types
52x
Maximum Speedup
Atomic operations (worst-case default)
1.31x
Minimum Speedup
Overlap/roofline (already near-optimal)
83,174
Data Points
8 hardware sources, 29 kernel types

Interpreting the 3.15x average: This is the geometric mean across all 11 GPU kernel types. It is heavily influenced by kernels with large improvements (atomics, WMMA). For real-world applications that primarily use GEMM, convolution, and memory-bound operations, expect improvements in the 1.5–3x range depending on how far your current configuration is from optimal. Workloads that are already well-tuned will see smaller gains.
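The aggregate is a geometric mean of per-kernel speedups, which damps outliers far more than an arithmetic mean would. A minimal sketch, computed in log space for numerical stability (the example values are illustrative, not the exact per-kernel set behind the 3.15x figure):

```python
import math


def geometric_mean(speedups):
    """nth root of the product of speedups, via the mean of logs."""
    return math.exp(sum(math.log(s) for s in speedups) / len(speedups))


# A single 52x outlier among mostly small gains barely moves the
# geometric mean, while it would dominate an arithmetic mean.
```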

JEPA Neural Model Performance

The Joint-Embedding Predictive Architecture model that drives continuous optimization.

Model Architecture

Property             Value
Parameters           249K
Input dimensions     28 (workload features)
Latent space         256D
Output               6D (performance prediction)
Training loss        0.136 → 0.020 (60 epochs)
Inference latency    <1 ms (CPU)
Online learning      <1 ms per feedback call

Training Data Coverage

Hardware Source               Data Points
AMD RX 7900 XTX               81,305
NVIDIA GTX 1050               451
Intel Xeon 4108 (CPU)         410
AMD Ryzen 9 9950X             285
Intel E5-2640v2 (2-socket)    283
NVIDIA Quadro RTX 5000        266
DDR4 Memory                   208
Total                         83,174

Model continues learning from live workload observations via online feedback. Sample count grows as nodes run real workloads.

Software Environment

Exact versions used for all benchmarks on this page.

Operating System

Pop!_OS 24.04 LTS
Kernel 6.18.7

AMD GPU Stack

ROCm 7.2.0
rocprofv3, MIOpen, rocBLAS

NVIDIA GPU Stack

CUDA 13.0 (RTX 5000)
CUDA 12.6 (GTX 1050, sm_61)

ML Framework

PyTorch 2.10.0+cu128

Compiler

GCC 13, hipcc (ROCm)
nvcc (CUDA)

Profiling

rocprofv3 (AMD)
nvprof / nsight (NVIDIA)

Disclaimers

  • Results may vary. Performance improvements depend on your specific hardware, driver versions, workload characteristics, and baseline configuration. The numbers on this page represent measurements on the specific hardware listed above.
  • Micro-benchmarks vs. applications. Kernel-level benchmarks (atomic, WMMA, bandwidth) measure isolated operations. Real application performance depends on the mix of operations, data transfer overhead, and host-device synchronization. Application-level improvements are typically lower than peak kernel improvements.
  • "Before" baselines. Default parameters (e.g., block_size=256, stride=1, no ILP unrolling) represent common out-of-the-box configurations. If you have already manually tuned your workloads, your baseline will be higher and improvements will be smaller.
  • Percentage calculations. All percentage improvements are calculated as: ((optimized - baseline) / baseline) × 100. We use geometric mean for aggregate comparisons across different metric types (throughput, bandwidth, latency).
  • Hardware-specific optimizations. Some optimizations (e.g., LDS stride=22 for RDNA 3, Infinity Cache 96MB working set) are specific to the tested hardware architecture. Mesh Optimizer discovers these parameters per-device, so your optimal values will differ.
  • Data current as of March 2026. These benchmarks were collected during February–March 2026. Driver updates, firmware changes, and software updates may affect reproducibility.

See What Your Hardware Can Do

Deploy Mesh Optimizer and let it discover your hardware's actual performance profile. Free tier includes one-time probing and behavior atlas.