Huawei vs Nvidia: The 2026-2028 AI Accelerator Race

Decoding the BIS-TPP numbers behind the Beijing summit

Published May 15, 2026 · Flopper.io Research

Donald Trump arrived in Beijing this week with a delegation of seventeen US business leaders — Jensen Huang of Nvidia, Elon Musk, and Tim Cook among them — and met Xi Jinping in the Great Hall of the People. Xi told the room China's door “will only open wider and wider.” According to the Financial Times, Huang is using the trip to revive Chinese orders for the H200 after years of export-control whiplash. Behind the diplomacy sits the actual question for anyone building AI infrastructure in 2026: how far has Huawei closed the gap, and how far is it from closing the rest?

The Receipt: Seven Ascend Chips vs Seven Years of Nvidia

Huawei's published Ascend roadmap stretches from the 2019 Ascend 910 to the projected 970 in Q4 2028. Flopper.io now indexes all seven generations alongside the contemporary Nvidia flagship for each year. The table below normalises every figure to native TFLOPS at its headline precision — not the inflated BIS-TPP numbers that dominate third-party coverage. (More on that in the next section.)

Year | Huawei chip | Huawei peak TFLOPS | Huawei memory | Huawei bandwidth | Nvidia chip | Nvidia peak TFLOPS | Nvidia memory
2019 | Ascend 910A | 256 FP16 | 32 GB HBM2 | 1.23 TB/s | V100 SXM2 | 125 FP16 | 32 GB HBM2
2024 | Ascend 910B | 400 FP16 | 64 GB HBM2e | 1.6 TB/s | H100 SXM5 | 1,979 FP8 (sparse) | 80 GB HBM3
2025 | Ascend 910C | 800 FP16 / 780 BF16 | 128 GB HBM2e | 3.2 TB/s | H200 SXM | 1,979 FP8 (sparse) | 141 GB HBM3e
2026 | Ascend 950PR | 2,000 FP4 / 1,000 FP8 | 128 GB HiBL 1.0 | 1.6 TB/s | GB200 (NVL72 unit) | 5,000 FP4 (sparse) | 192 GB HBM3e
2026 Q4 | Ascend 950DT | 2,000 FP4 | 144 GB HiZQ 2.0 | 4.0 TB/s | GB300 | 7,500 FP4 (sparse) | 288 GB HBM3e
2027 Q4 | Ascend 960 | 4,000 FP4 | 288 GB | 9.6 TB/s | Rubin | 33,300 FP4 (sparse) | 288 GB HBM4
2028 Q4 | Ascend 970 | 8,000 FP4 | 288 GB | 14.4 TB/s | Rubin Ultra (projected) | 66,700 FP4 (sparse) | 365 GB HBM4

Memory bandwidth figures are per-package. Nvidia FP4 figures are peak throughput with structured sparsity; native dense throughput is half. All Huawei figures are sourced from public Huawei Connect 2025 disclosures and indexed in Flopper.io's GPU database.

What “BIS-TPP” Actually Means

Most viral comparison tables circulating on X and LinkedIn cite Huawei and Nvidia in a unit called TPP (Total Processing Performance). That is not raw TFLOPS. TPP is the US Commerce Department Bureau of Industry and Security (BIS) export-control metric, and it equals peak TFLOPS multiplied by the bit length of the operation being counted, taken at whichever supported precision gives the largest product. A chip rated at FP4 gets its TFLOPS multiplied by 4. A chip rated at FP8 gets multiplied by 8. The metric exists to put accelerators of different precisions on a single regulatory axis.

Two problems follow. First, TPP makes lower-precision chips look much larger than they are in flops-of-work-done terms. Second, when commentators report “Ascend 970 hits 32,000 TPP” or “Rubin Ultra hits 266,800 TPP,” almost no reader is converting back to native throughput. The decode:

  • Ascend 970: 32,000 TPP ÷ 4 (FP4) = 8,000 TFLOPS FP4
  • Rubin Ultra: 266,800 TPP ÷ 4 (FP4) = 66,700 TFLOPS FP4
  • Ascend 910C: 12,800 TPP ÷ 16 (FP16; 910C lacks native FP4) = 800 TFLOPS FP16
  • H100 SXM5: 15,832 TPP ÷ 8 (FP8) = 1,979 TFLOPS FP8

Run the arithmetic on the 2028 endpoint and the picture sharpens. On raw chip-level FP4 throughput, Nvidia's flagship still delivers roughly eight times what Huawei's does (66,700 vs 8,000 TFLOPS), even after another three years of execution. That gap was twelve to fifteen times in 2024 (910B vs H100), so the trajectory is right; the absolute distance still favours Nvidia by an order of magnitude.
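The decode above is simple enough to script. The following is a minimal sketch, assuming the TPP multiplier is just the bit length of the quoted precision; the function names and the `quoted` dictionary are illustrative and not part of any BIS or Flopper.io tooling.

```python
# Bit length used as the TPP multiplier for each precision.
BIT_LENGTH = {"FP4": 4, "FP8": 8}


def tpp_to_tflops(tpp: float, precision: str) -> float:
    """Convert a quoted TPP figure back to TFLOPS at the stated precision."""
    return tpp / BIT_LENGTH[precision]


def tflops_to_tpp(tflops: float, precision: str) -> float:
    """Convert TFLOPS at the stated precision into the equivalent TPP figure."""
    return tflops * BIT_LENGTH[precision]


# TPP figures and precisions as quoted in the article.
quoted = {
    "Ascend 970": (32_000, "FP4"),
    "Rubin Ultra": (266_800, "FP4"),
    "H100 SXM5": (15_832, "FP8"),
}

decoded = {chip: tpp_to_tflops(tpp, prec) for chip, (tpp, prec) in quoted.items()}
for chip, tflops in decoded.items():
    print(f"{chip}: {tflops:,.0f} TFLOPS {quoted[chip][1]}")

# Chip-level FP4 ratio at the 2028 endpoint.
gap_2028 = decoded["Rubin Ultra"] / decoded["Ascend 970"]
print(f"2028 chip-level FP4 ratio: {gap_2028:.1f}x")
```

Going the other way with `tflops_to_tpp` is a quick sanity check on any viral table: if the TPP figure divided by the bit length does not match the vendor's TFLOPS number, the table is mixing precisions.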

Where Huawei Is Actually Competitive

The chip-vs-chip comparison undersells Huawei's strategy. Three areas of genuine parity deserve attention:

Memory capacity. The Ascend 910C ships with 128 GB of HBM2e: an older memory generation than the H100 SXM5's HBM3, but 60% more capacity than its 80 GB. By the 950DT in late 2026, Huawei reaches 144 GB on its proprietary HiZQ 2.0 stacks. The Ascend 960 targets 288 GB, matching Nvidia's GB300 and Rubin on package memory for inference workloads where capacity dictates which models fit. For a buyer choosing between “hold one trillion-parameter model in memory” and “hit a particular tokens/second SLA,” capacity is the binary constraint and throughput is the gradient.
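To make the "which models fit" point concrete, here is a rough sketch that counts how many packages are needed to hold a model's weights alone. It deliberately ignores KV cache, activations, and runtime overhead, and the one-trillion-parameter, 8-bit workload is a hypothetical chosen for illustration; only the package capacities come from the article.

```python
import math


def packages_for_weights(params_billion: float, bits_per_param: int, package_gb: float) -> int:
    """Minimum number of packages whose combined memory holds the weights alone."""
    weight_gb = params_billion * bits_per_param / 8  # 1e9 params x bytes each = GB
    return math.ceil(weight_gb / package_gb)


# Package capacities quoted in the article (GB).
capacity_gb = {
    "H100 SXM5": 80,
    "Ascend 910C": 128,
    "Ascend 950DT": 144,
    "Ascend 960": 288,
}

# Hypothetical workload: one trillion parameters stored at 8 bits each (1,000 GB).
needed = {chip: packages_for_weights(1_000, 8, gb) for chip, gb in capacity_gb.items()}
for chip, count in needed.items():
    print(f"{chip}: {count} packages")
```

Under these assumptions the count falls from 13 packages at 80 GB to 8 at 128 GB and 4 at 288 GB, which is the sense in which capacity acts as a step function while throughput varies smoothly.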

System-level scale. Huawei's CloudMatrix 384 rack lashes 384 910C chips together with a custom optical fabric. Sum the per-package memory and the system total exceeds an equivalently provisioned GB200 NVL72: roughly 49 TB of HBM at the rack level versus 13.8 TB on the Nvidia side. The catch is power. CloudMatrix 384 reportedly draws ~559 kW; an NVL72 rack draws ~120 kW. Huawei wins on aggregate memory and loses on FLOPS-per-watt by a wide margin. For a buyer with cheap, plentiful Chinese grid power and constrained Nvidia access, that trade is currently rational.
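The rack-level figures above follow directly from package counts and per-package memory. The sketch below reproduces that arithmetic; the `Rack` class and its field names are illustrative, and the power values are the approximate figures quoted in the article rather than measured draw.

```python
from dataclasses import dataclass


@dataclass
class Rack:
    name: str
    packages: int
    gb_per_package: float
    power_kw: float  # reported or approximate system draw

    @property
    def memory_tb(self) -> float:
        """Aggregate package memory in decimal terabytes."""
        return self.packages * self.gb_per_package / 1000

    @property
    def gb_per_kw(self) -> float:
        """Aggregate memory available per kilowatt of system power."""
        return self.packages * self.gb_per_package / self.power_kw


cloudmatrix = Rack("CloudMatrix 384", packages=384, gb_per_package=128, power_kw=559)
nvl72 = Rack("GB200 NVL72", packages=72, gb_per_package=192, power_kw=120)

for rack in (cloudmatrix, nvl72):
    print(f"{rack.name}: {rack.memory_tb:.1f} TB, {rack.gb_per_kw:.0f} GB per kW")

memory_ratio = cloudmatrix.memory_tb / nvl72.memory_tb
power_ratio = cloudmatrix.power_kw / nvl72.power_kw
print(f"memory ratio: {memory_ratio:.2f}x, power ratio: {power_ratio:.2f}x")
```

On these inputs CloudMatrix carries about 3.6 times the memory for about 4.7 times the power, so memory per kilowatt is lower (roughly 88 GB per kW against 115). The aggregate-memory advantage is bought with electricity rather than efficiency.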

Inference economics on long-context workloads. For frontier inference, the bottleneck is increasingly KV-cache size rather than raw matrix multiplication. A 128k-token context window on a 405B-parameter model needs roughly 80–100 GB of KV cache per replica before the model weights are even loaded. Capacity at the package level determines how many concurrent sessions a single accelerator can serve and how aggressively the scheduler must evict. A 128 GB Ascend 910C gives a fleet operator 48 GB more room per package than an 80 GB H100 to co-locate a weight shard and its cache in on-package memory. This is why Chinese hyperscalers running DeepSeek and Qwen at scale are publicly comfortable with Ascend: on their workload mix, raw FP8 throughput is not the binding constraint.
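KV-cache size can be estimated from model shape with the standard formula: two tensors (key and value) per layer, per token, per KV head. The sketch below applies it under assumed values for a 405B-class model (126 layers, 8 KV heads with grouped-query attention, head dimension 128, 16-bit cache entries); none of these come from the article, and the result moves with cache precision, attention layout, and batch size, so it will not necessarily reproduce the 80–100 GB range quoted above.

```python
def kv_cache_gb(tokens: int, layers: int, kv_heads: int, head_dim: int,
                bytes_per_value: int = 2) -> float:
    """KV-cache size in GB for one sequence: a key and a value vector per layer per token."""
    bytes_per_token = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * bytes_per_token / 1e9


def concurrent_sessions(free_gb: float, per_session_gb: float) -> int:
    """Upper bound on sessions whose caches fit in the given free memory."""
    return int(free_gb // per_session_gb)


# Assumed 405B-class shape (hypothetical values, see lead-in).
shape = dict(layers=126, kv_heads=8, head_dim=128)

per_session_128k = kv_cache_gb(tokens=128_000, **shape)
per_session_32k = kv_cache_gb(tokens=32_000, **shape)
print(f"128k-token session: {per_session_128k:.1f} GB")
print(f"32k-token session:  {per_session_32k:.1f} GB")

# Upper bound only: treats the whole package as free for cache, ignoring weights.
sessions_910c = concurrent_sessions(128, per_session_32k)
sessions_h100 = concurrent_sessions(80, per_session_32k)
print(f"32k sessions per package: 910C {sessions_910c}, H100 {sessions_h100}")
```

With these assumptions a single 128k-token session needs about 66 GB of cache, and a 128 GB package holds at most seven 32k-token caches against four on an 80 GB part. Real deployments subtract the weight shard first, which shrinks both numbers but preserves the ordering.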

Where Nvidia Still Dominates

Four moats remain hard to cross:

  • FP4 throughput per watt. Nvidia's Blackwell-generation GB300 delivers roughly 7,500 TFLOPS FP4 sparse at ~1,400 W. The Ascend 950PR hits 2,000 TFLOPS FP4 dense at an undisclosed but estimated 350–500 W TDP. The per-watt gap is roughly 3–4x in Nvidia's favour and grows with Rubin. For hyperscalers paying $0.08–$0.12 per kWh and amortising hardware over 4–5 years, FLOPS-per-watt drops to the bottom line immediately.
  • Software stack. CUDA, cuDNN, NCCL, TensorRT-LLM, Triton, and a decade of framework integration are not replicated by Huawei's CANN. Buyers porting workloads to Ascend pay an engineering tax that does not show up in any spec sheet — typically measured in months of compiler tuning per model family before training throughput approaches the published numbers.
  • HBM access. Every Huawei Ascend chip from the 950 generation onward uses proprietary stacked memory (HiBL 1.0, HiZQ 2.0) because Huawei cannot reliably source HBM3e from SK Hynix, Samsung, or Micron under current export controls. The bandwidth numbers reflect that constraint. The 950DT's 4.0 TB/s sits about 17% below the H200's 4.8 TB/s on HBM3e, and the gap to Rubin's HBM4-driven 20 TB/s is structural unless the export regime changes.
  • Scale-up fabric. NVLink 5 and NVSwitch on the GB300 deliver 1.8 TB/s of GPU-to-GPU bandwidth per accelerator within an NVL72 rack — effectively presenting 72 GPUs as one large coherent compute unit. Huawei's optical interconnect on CloudMatrix 384 is impressive at the rack level but does not yet match NVLink's per-link latency or its software abstraction for collective operations. For dense training across thousands of GPUs, that distinction matters.
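The electricity prices and amortisation periods in the first item above translate into a per-accelerator energy bill with one line of arithmetic. This is a rough sketch, assuming the part runs at its quoted power for the whole period (utilisation of 1.0) and ignoring cooling and facility overhead; the function name and defaults are illustrative.

```python
HOURS_PER_YEAR = 8_760


def lifetime_energy_cost(watts: float, usd_per_kwh: float, years: float,
                         utilisation: float = 1.0) -> float:
    """Electricity cost in USD for one accelerator over its amortisation period."""
    kwh = watts / 1000 * HOURS_PER_YEAR * years * utilisation
    return kwh * usd_per_kwh


# Power and price ranges as quoted in the article; full utilisation is an assumption.
gb300_low = lifetime_energy_cost(1_400, 0.08, 4)
gb300_high = lifetime_energy_cost(1_400, 0.12, 5)
ascend_low = lifetime_energy_cost(350, 0.08, 4)
ascend_high = lifetime_energy_cost(500, 0.12, 5)

print(f"1,400 W part: ${gb300_low:,.0f} to ${gb300_high:,.0f}")
print(f"350-500 W part: ${ascend_low:,.0f} to ${ascend_high:,.0f}")
```

On these inputs a 1,400 W part costs roughly $3,900 to $7,400 in electricity over four to five years, and a 350–500 W part roughly $980 to $2,600. Dividing each figure by the work delivered over the same period is what turns a FLOPS-per-watt number into a cost-per-token number.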

A Procurement Matrix for 2026

Strip the geopolitics out and the buyer-facing decision tree is relatively crisp. The right chip depends on three variables: workload type, jurisdiction, and time horizon.

Workload | Outside China | Inside China | Why
Frontier model pre-training | GB300 / Rubin | 910C CloudMatrix | Scale-out fabric and software maturity decide
Fine-tuning & RLHF | H200 / GB300 | 910C / 950PR | FP8/BF16 throughput and memory capacity matter most
Production LLM inference | H200 | 950PR (when available) | FP4 path and memory-per-replica drive token economics
Long-context serving | GB300 | 950DT (Q4 2026) | Package memory caps concurrent session count
Scientific HPC (FP64) | H100 / GH200 | Limited: Ascend lacks competitive FP64 | Huawei optimised away from double-precision HPC

Inside-China column assumes buyers without reliable Nvidia channel access. For Chinese buyers with cleared H200 or H800 inventory, mixed fleets are common in practice.

Two notes on how to read this. First, time horizon flips the answer. A two-year depreciation schedule favours the GB300 today over waiting on Rubin; a five-year schedule for a Chinese hyperscaler favours holding 910C capacity until the 960 hits volume in 2027. Second, workload mix dominates. A fleet that runs 70% inference and 30% fine-tuning should optimise around tokens-per-dollar on FP8, not peak FP4 sparse. That changes which spec-sheet number deserves the most weight, and FP8 tokens-per-dollar is the number on which Huawei has closed fastest.

What the H200 Deal Would Actually Mean

Huang's reported pitch in Beijing is to resume sales of the H200, or a China-specific variant of it, into the Chinese market. Three observations for buyers tracking this:

First, the H200 is a 2023-designed chip. By the time it ships to Chinese customers in meaningful volume, the relevant Huawei comparison is the Ascend 950PR (already shipping in Atlas 350 systems) and the 950DT (Q4 2026). On FP4 throughput the 950PR already exceeds the H200, which has no native FP4 path. On FP16, the 910C's 800 TFLOPS trails the H200's 989 TFLOPS BF16 chip for chip and pulls ahead only at the system level via CloudMatrix scale-out.

Second, the deal does not displace Ascend. Huawei retains its domestic floor through government procurement mandates and the simple fact that 910C and 950PR are already installed across China Mobile, China Telecom, Baidu, and ByteDance. The H200 deal, if it happens, opens an additional channel for Chinese frontier-lab customers who want CUDA compatibility for specific workloads. It does not reverse the localisation of inference compute that has already happened.

Third, watch the precision tier. Any H200 variant cleared for the Chinese market is likely to be a regulated SKU — lower TPP, lower interconnect bandwidth, or both — designed to fall under whatever new BIS threshold emerges from the summit. The H200's published 4.8 TB/s HBM3e bandwidth and 900 GB/s NVLink are exactly the numbers most likely to be trimmed. A China-spec H200 with a 600 GB/s NVLink cap would still outperform the 910C on FP8 single-chip throughput, but would lose much of its scale-out advantage at the cluster level, which is precisely where Nvidia's moat normally widens. Buyers should evaluate the cleared variant on its own spec sheet, not the global H200's.

The 2028 Question

Project forward to the Ascend 970 in Q4 2028. On Huawei's published roadmap it delivers 8,000 TFLOPS FP4 with 288 GB of package memory and 14.4 TB/s bandwidth. The contemporary Nvidia part — Rubin Ultra, projected for Q3 2027 and likely refreshed by late 2028 — sits at 66,700 TFLOPS FP4 with 365 GB and 53 TB/s. The raw silicon gap does not close.

Two variables determine whether that matters. HBM access is the first: if Huawei gains reliable HBM3e or HBM4 supply (either by domestic production via CXMT or by an export-control thaw of the kind this week's summit is ostensibly negotiating), the bandwidth gap closes meaningfully and the 970's effective throughput on real workloads rises. System-level architecture is the second: if CloudMatrix's successor racks scale the chip count from 384 to 1,000+ while staying within manageable power envelopes, total system performance becomes the relevant comparison, not chip TFLOPS.

For an infrastructure buyer making procurement calls in 2026, the operational read is simpler: Nvidia remains the default for any frontier training workload anywhere outside China, the Ascend 910C is a credible production inference platform for buyers willing to pay the CANN tax, and the 950 generation is worth tracking closely as Atlas 350 systems begin shipping. The 2028 race is a function of policy as much as silicon.

Three concrete things to track over the next twelve months. One: whether CXMT can deliver HBM3e in volume by mid-2027, which determines whether the 960 ships at its claimed 9.6 TB/s or lands closer to 6 TB/s. Two: the post-summit BIS framework — if FP4 throughput rather than aggregate TPP becomes the gating metric, Nvidia gets meaningfully more flexibility on what it can sell into China. Three: whether Huawei publishes per-watt benchmarks for the 950PR in production CloudMatrix deployments; the absence of those numbers is currently the loudest signal in the spec sheets.
