I. Introduction
The deployment of large language models at data center scale has introduced a category of reliability challenges that existing power integrity methodologies were not designed to address. LLM inference workloads execute highly structured, repetitive computational patterns—matrix multiplications through nested loops, attention computations with regular memory access patterns, kernel-level phase transitions between compute-saturated and memory-bound operation—that produce periodic power oscillations at frequencies ranging from single-digit MHz to tens of MHz [1, 2].
These oscillations become a first-order reliability concern when their frequency aligns with the resonant modes of the GPU's power delivery network. At resonance, even modest power swings of 10 W can produce voltage droops exceeding 100 mV—sufficient to violate timing margins and trigger silent data corruption [1]. This phenomenon is not hypothetical: industry measurements have documented voltage droops reaching 28% of nominal supply voltage during production workload execution [1], and large-scale GPU deployments at Meta, Google, and Alibaba have reported systematic silent data corruption events linked to voltage instability [3, 4, 5].
The resonant frequency of a power delivery network is determined by the interaction of its inductive and capacitive elements. For a simple lumped RLC circuit:

$$f_0 = \frac{1}{2\pi\sqrt{LC}},$$

where $L$ is the effective series inductance and $C$ is the effective shunt capacitance seen by the load.
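As a numerical convenience used for the estimates throughout this report, this expression can be evaluated with a short standalone Python helper (a sketch of ours, not part of PDNLab):

```python
import math

def resonant_frequency_hz(L: float, C: float) -> float:
    """Undamped resonant frequency f0 = 1/(2*pi*sqrt(L*C)) of a lumped LC pair."""
    return 1.0 / (2.0 * math.pi * math.sqrt(L * C))

# Example: 2 nH of effective inductance against 21.64 uF of capacitance
# (the H100 first-droop parameters derived in Section IV).
print(f"{resonant_frequency_hz(2e-9, 21.64e-6) / 1e6:.2f} MHz")  # ~0.77 MHz
```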
In real GPU PDNs, the situation is considerably more complex. The network is distributed across multiple physical layers—die, package substrate, board—each with distinct impedance characteristics. Multiple resonant modes exist, corresponding to different LC interactions at different levels of the hierarchy. The “first droop” resonance, typically in the tens of MHz range, arises from the interaction of package inductance with on-die decoupling capacitance. The “second droop” resonance, typically in the low MHz range, involves board-level inductance resonating with package-level capacitance [6, 7].
Despite the critical importance of these resonant modes, characterizing them for a specific chip architecture remains difficult. Full-chip SPICE-level PDN simulation is computationally prohibitive for early-stage analysis. Simplified lumped models sacrifice the spatial distribution effects that determine where droop concentrates on-die. And no widely available tool integrates distributed PDN modeling with workload-aware current profile generation.
This work addresses this gap in three ways:
- We present an analytical methodology for extracting resonant frequency parameters from distributed 2D PDN models, accounting for grid inductance, via connectivity, and hierarchical capacitance across die, package, and board levels.
- We apply this methodology to a full-stack model of the NVIDIA H100 GPU, constructed in PDNLab, extracting all relevant L and C parameters from the model’s 3,421 circuit elements spanning 160 grids, 147 capacitor blocks, 100 ideal capacitors, 1,013 transmission lines, and 175 current sources.
- We describe workload profile construction strategies—step response and resonance-targeted periodic excitation—for exposing resonant modes through time-domain simulation, with specific parameter recommendations for the H100 architecture.
The experimental validation of these resonant frequency predictions through PDNLab time-domain simulation is ongoing work. This report presents the analytical framework, the H100 model parameters, and the derived resonant frequency estimates.
II. Background and Related Work
A. GPU Power Delivery Architecture
Modern GPU power delivery networks consist of cascaded stages from the voltage regulator module (VRM) on the motherboard, through board-level bulk capacitors and package substrate planes, to on-die decoupling capacitance distributed across the chip’s functional blocks. Each stage introduces parasitic resistance and inductance through its interconnect structures—PCB traces, BGA solder balls, C4 bumps or microbumps, and on-die metal routing.
The PDN can be modeled as a ladder RLC network, where each stage presents a series R-L impedance and a shunt capacitance [6, 7]. The resonant behavior of this cascaded network produces multiple impedance peaks at different frequencies, each corresponding to a different LC interaction within the hierarchy. At each resonant frequency, the PDN impedance rises sharply, meaning the network cannot effectively supply the rapid current changes demanded by the load—resulting in amplified voltage droop.
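To make this concrete, the following standalone sketch folds a two-stage ladder from the VRM toward the die and locates the resulting impedance peak. The element values are illustrative round numbers, not the H100 parameters of Section III:

```python
import numpy as np

def ladder_impedance(freqs_hz, stages):
    """|Z(f)| seen by the die for a ladder of (R_series, L_series, C_shunt) stages.

    The ladder is folded from the VRM end (treated as an ideal source, Z = 0)
    toward the die: each stage adds its series R + jwL, then parallels its
    shunt decoupling capacitance.
    """
    w = 2 * np.pi * np.asarray(freqs_hz)
    z = np.zeros_like(w, dtype=complex)   # ideal VRM at the far end
    for r, l, c in stages:                # VRM-side stage first
        z = z + r + 1j * w * l            # series interconnect impedance
        zc = 1.0 / (1j * w * c)           # shunt decoupling capacitor
        z = z * zc / (z + zc)             # parallel combination
    return np.abs(z)

# Illustrative two-stage PDN: board-to-package stage, then package-to-die stage.
stages = [(1.0e-3, 10e-9, 1e-6),   # 1 mOhm, 10 nH, 1 uF
          (0.5e-3, 2e-9, 20e-6)]   # 0.5 mOhm, 2 nH, 20 uF
f = np.logspace(5, 8, 400)         # 100 kHz .. 100 MHz
zmag = ladder_impedance(f, stages)
print(f"peak |Z| = {zmag.max() * 1e3:.1f} mOhm at {f[zmag.argmax()] / 1e6:.1f} MHz")
```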
B. The LLM-PDN Resonance Problem
Jiang et al. [1] provided the first comprehensive characterization of this threat, demonstrating that LLM inference kernels produce intra-kernel power oscillations in the 4–30 MHz range—directly overlapping with typical GPU PDN resonant frequencies. Their analysis of an A100-class GPU PDN identified resonant frequencies at 35 MHz (first droop) and 1.5 MHz (second droop). When synthetic stressors were tuned to operate at these resonant frequencies, voltage droops reached 250 mV, approximately 2× larger than those produced by real LLM workloads. Critically, real LLM kernels operating even near (not exactly at) these frequencies still generated droops exceeding 100 mV.
Several converging trends in semiconductor technology make this problem progressively worse [1]:
- Decreasing supply voltages. Scaling from 1.8 V in mature nodes toward 0.7 V and below in advanced nodes means the same absolute voltage droop represents an increasingly larger fraction of the available noise margin.
- Increasing die sizes. Larger dies carry more on-die capacitance, pushing PDN resonant frequencies downward into frequency ranges more easily excited by software workloads.
- Increasing operating frequencies. GPU boost clocks have risen from ~1400 MHz (Volta) to ~1800 MHz (Hopper), increasing the rate at which computational patterns can excite PDN resonances.
Together, these trends create a convergence scenario in which workload oscillation frequencies and PDN resonant frequencies approach each other, increasing the probability of resonance excitation during normal operation.
C. The Voltage-Reliability Connection
Voltage droop has direct consequences for chip reliability. When supply voltage sags below the minimum required for correct logic operation, timing violations occur: signals arrive too late at flip-flop inputs, and incorrect values are latched. Unlike hard failures, these errors can be transient and data-dependent—they corrupt computation results without triggering hardware fault detection, producing what the industry terms silent data corruption (SDC).
Large-scale GPU deployments have documented this threat at production scale. Meta reported systematic SDC events requiring novel detection infrastructure [3, 4], and Google has documented similar challenges in fleet-wide monitoring. Alibaba’s Minder system [5] addressed faulty machine detection motivated in part by voltage-induced computation errors during distributed training. These reports establish that resonance-induced voltage droop is not merely a design-time concern but an operational reliability threat affecting production AI infrastructure.
D. Distributed PDN Modeling
Lumped-element PDN models, while useful for first-order analysis, cannot capture spatial effects that are critical in large-die GPUs. Voltage droop concentrates at locations far from power entry points and near simultaneously switching current sources. The propagation delay of voltage disturbances across the die creates interference patterns—“rogue waves”—where droops from multiple switching events can constructively interfere at specific locations, producing localized voltage collapse worse than any individual event would predict.
PDNLab addresses this by modeling the PDN as a distributed 2D RLC network, where each grid tile represents a local RC + L circuit element with its own sheet resistance, inductance per square, and distributed capacitance per unit area. Grids at different z-levels represent different physical layers (die, interposer, package substrate), connected by via elements with explicit R and L parameters. This distributed approach preserves spatial voltage variation across the die surface while remaining computationally tractable for multi-microsecond time-domain simulation.
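As an illustration of this parameterization (field names are ours for exposition, not PDNLab’s actual schema), a single grid can be summarized as:

```python
from dataclasses import dataclass

@dataclass
class Grid2D:
    """One distributed 2D RLC grid (illustrative field names, not PDNLab's schema)."""
    name: str
    width_cm: float
    height_cm: float
    z_level: int                 # physical layer: die, interposer, package, ...
    r_sheet_ohm_per_sq: float    # sheet resistance of the metal mesh
    l_sheet_h_per_sq: float      # inductance per square
    c_area_f_per_cm2: float      # distributed capacitance per unit area

    @property
    def distributed_capacitance_f(self) -> float:
        return self.width_cm * self.height_cm * self.c_area_f_per_cm2

# The global on-die mesh from Section III-B, as an example instance.
global_pdn = Grid2D("global_pdn", 2.85, 2.85, 4, 5e-3, 3e-9, 40e-9)
print(f"{global_pdn.distributed_capacitance_f * 1e9:.0f} nF")  # ~325 nF
```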
III. H100 GPU Power Delivery Model
We constructed a full-stack PDN model of the NVIDIA H100 GPU in PDNLab, based on published die dimensions, SM count, and standard packaging parameters for high-performance GPU designs. The model captures three hierarchical levels: on-die power distribution, package substrate, and die-to-package interconnect.
A. Model Architecture
The H100 model comprises 3,421 circuit elements organized across 160 grids at multiple z-levels:
| Element Type | Count | Description |
|---|---|---|
| G (Grid) | 160 | 2D distributed RLC grid tiles |
| C (Capacitor) | 147 | Discrete decoupling capacitors |
| L (Ideal Cap) | 100 | Package-level ideal capacitors |
| T (Tline/Via) | 1,013 | Transmission lines and vias |
| I (Source) | 175 | Current sources (workload) |
| N (Node) | 1,826 | Circuit nodes |
B. Die-Level Network (z-Level 1–4)
The die-level network models the on-chip power distribution grid and the individual functional blocks of the H100 GPU.
Global power grid. A single grid named global_pdn spans the full 2.85 × 2.85 cm die area at z-level 4. This represents the top-level on-die metal power mesh. Its parameters:
| Parameter | Value |
|---|---|
| Dimensions | 2.85 × 2.85 cm |
| Sheet resistance | 5 mΩ/sq |
| Inductance | 3 nH/sq |
| Default capacitance | 40 nF/cm² |
| Total area | 8.12 cm² |
| Distributed capacitance | 325 nF |
SM grids. 144 individual streaming multiprocessor grids, each 0.184 × 0.184 cm at z-level 1, model the local power distribution within each SM. Each carries 5 nH/sq inductance, 0.3 Ω/sq sheet resistance, and 200 nF/cm² default capacitance, yielding approximately 6.8 nF distributed capacitance per SM and ~975 nF total across all 144 SMs.
Peripheral blocks. The model includes dedicated grids for L2 cache partitions (2 grids, 1.2 × 0.3 cm each), memory controllers (10 grids, 0.1 × 0.35 cm each), NVLink bus (2.75 × 0.12 cm), and PCIe interface (2.1 × 0.1 cm), all at z-level 1 with parameters appropriate to their function.
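The distributed-capacitance figures above follow directly from area multiplied by capacitance per unit area; as a quick standalone check:

```python
# Sanity check of the die-level distributed capacitance (values from Sec. III-B).
c_global = 2.85 * 2.85 * 40e-9   # global_pdn: 8.12 cm^2 at 40 nF/cm^2
c_sm = 0.184 * 0.184 * 200e-9    # one SM grid at 200 nF/cm^2
print(f"global_pdn: {c_global * 1e9:.0f} nF")                              # ~325 nF
print(f"per SM: {c_sm * 1e9:.1f} nF, all 144: {144 * c_sm * 1e9:.0f} nF")  # 6.8 / 975 nF
```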
C. Package-Level Network (z-Level 7)
The package substrate is modeled as a single grid named package_pdn spanning the same 2.85 × 2.85 cm footprint at z-level 7:
| Parameter | Value |
|---|---|
| Sheet resistance | 2 mΩ/sq |
| Inductance | 2 nH/sq |
| Default capacitance | 20 nF/cm² |
| Distributed capacitance | 162 nF |
100 ideal capacitors (each 10 nF) are placed on the package grid, representing package-level decoupling capacitance, totaling 1.0 μF.
D. Interconnect: Vias and Die Bumps
The die-to-package interconnect is modeled through 1,013 transmission line elements, categorized as follows:
| Connection | Count | R per element | L per element |
|---|---|---|---|
| SM ↔ global_pdn (vias) | 720 | 30 μΩ | 5 nH |
| global_pdn ↔ package_pdn (dbumps) | 100 | 3 mΩ | 5 nH |
| package_pdn ↔ BGA (sballs) | 100 | 3 mΩ | 5 nH |
| Peripheral ↔ global_pdn (vias) | 93 | 30 μΩ–5 mΩ | 1–5 nH |
The 100 die bumps (dbumps) connecting global_pdn to package_pdn are the critical inductance path for the first droop resonance. Each carries 5 nH individual inductance; in parallel, their effective inductance is

$$L_{\text{bump,eff}} = \frac{5\;\text{nH}}{100} = 50\;\text{pH}.$$
E. Decoupling Capacitance Budget
The total capacitance in the model, categorized by location:
| Source | Value | Location |
|---|---|---|
| 144 SM decaps (110 nF each) | 15.84 μF | Die (SM grids) |
| L2 cache decaps | 4.00 μF | Die (L2 grids) |
| PCIe GT decap | 0.50 μF | Die (PCIe grid) |
| SM grid distributed | 0.98 μF | Die (144 SM grids) |
| Global grid distributed | 0.32 μF | Die (global_pdn) |
| Total on-die | 21.64 μF | |
| Package grid distributed | 0.16 μF | Package |
| Package ideal caps (100 × 10 nF) | 1.00 μF | Package |
| Total package | 1.16 μF | |
The on-die capacitance is dominated by the SM decoupling capacitors (73% of total), consistent with the H100’s design focus on sustained high current delivery to its 144 streaming multiprocessors.
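The budget arithmetic can be verified directly; a minimal sketch with values transcribed from the table above:

```python
# Decoupling capacitance budget check (uF), per the table in Sec. III-E.
on_die = {"SM decaps (144 x 110 nF)": 15.84, "L2 cache decaps": 4.00,
          "PCIe GT decap": 0.50, "SM grid distributed": 0.98,
          "global grid distributed": 0.32}
package = {"package grid distributed": 0.16, "ideal caps (100 x 10 nF)": 1.00}
print(f"on-die total:  {sum(on_die.values()):.2f} uF")        # 21.64 uF
print(f"package total: {sum(package.values()):.2f} uF")       # 1.16 uF
print(f"SM decap share: {15.84 / sum(on_die.values()):.0%}")  # ~73%
```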
IV. Resonant Frequency Analysis
A. Analytical Framework
For a distributed 2D PDN model, the resonant frequency of each droop mode is determined by the effective inductance and capacitance participating in that mode. We identify the relevant LC pairs for each stage of the PDN hierarchy.
The first droop resonance is the oscillation between the package plane’s effective inductance and the total on-die capacitance. Current flowing from the package plane through the die bumps to supply the on-die load encounters two series inductance contributions:
- Die bump inductance $L_{\text{bump,eff}}$: The parallel combination of all die bump via inductances.
- Package sheet inductance $L_{\text{pkg,sheet}}$: The effective inductance of the package substrate plane, which depends on the current path length and sheet inductance per square.
For a square grid with uniform current entry, the effective sheet inductance is approximately equal to the sheet inductance $L_{\square}$ (inductance per square), since the aspect ratio of the current path is approximately unity:

$$L_{\text{pkg,sheet}} \approx L_{\square} = 2\;\text{nH}.$$
The first droop resonant frequency is then:

$$f_{0,\text{first}} = \frac{1}{2\pi\sqrt{\left(L_{\text{bump,eff}} + L_{\text{pkg,sheet}}\right)\,C_{\text{die}}}},$$

where $C_{\text{die}} = 21.64\;\mu\text{F}$ is the total on-die capacitance from Section III-E.
B. H100 First Droop Calculation
Substituting the parameters extracted from the H100 model:

$$f_{0,\text{first}} = \frac{1}{2\pi\sqrt{(50\;\text{pH} + 2\;\text{nH}) \times 21.64\;\mu\text{F}}} \approx 0.8\;\text{MHz}.$$
This represents the lower-bound estimate, dominated by the package sheet inductance. If we consider only the die bump inductance path (package plane with negligible sheet inductance, as suggested by the “Low-R Package” model variant name):

$$f_{0,\text{first}} = \frac{1}{2\pi\sqrt{50\;\text{pH} \times 21.64\;\mu\text{F}}} \approx 4.8\;\text{MHz}.$$

Together, these bounds define the predicted first droop band of 0.8–4.8 MHz used in the remainder of this report.
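Both bounds can be reproduced with a few lines of standalone Python (note that the distributed network may spread this into a band of closely spaced modes rather than two sharp peaks; see Section VI-B):

```python
import math

def f0(L: float, C: float) -> float:
    """Lumped LC resonant frequency in Hz."""
    return 1.0 / (2.0 * math.pi * math.sqrt(L * C))

C_die = 21.64e-6     # total on-die capacitance (F), Sec. III-E
L_bump = 5e-9 / 100  # 100 parallel 5 nH die bumps -> 50 pH
L_pkg = 2e-9         # package sheet inductance, ~1 square at 2 nH/sq

print(f"lower bound (bumps + package sheet): {f0(L_bump + L_pkg, C_die) / 1e6:.2f} MHz")
print(f"upper bound (bumps only):            {f0(L_bump, C_die) / 1e6:.2f} MHz")
# -> ~0.76 MHz and ~4.84 MHz, i.e. the 0.8-4.8 MHz band quoted in the text
```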
C. Comparison to Prior Work
Jiang et al. [1] reported first droop resonant frequencies of ~35 MHz for an A100-class GPU PDN. The significantly lower resonant frequency we derive for the H100 model is consistent with several architectural differences:
- The H100 has substantially more on-die capacitance (~21.6 μF vs. the ~2.5 μF in the lumped A100 model from [1]), reflecting its larger die and 144 SMs (vs. 108 SMs in A100).
- The Jiang et al. model used a lumped RLC circuit with scaled parameters from earlier CPU PDN studies [7], while our model uses distributed 2D grids with explicit die bump and package plane parameterization.
- Higher on-die capacitance pushes $f_0 = 1/(2\pi\sqrt{LC})$ to lower frequency, a trend that Jiang et al. themselves noted—as die sizes increase, resonant frequencies shift downward into ranges more easily excited by software workloads.
The implication is significant: for H100-class GPUs, the first droop resonance may be low enough that even inter-kernel power transitions during LLM inference—not just intra-kernel oscillations—could excite resonant amplification. LLM inference engines like vLLM produce kernel launch cadences in the microsecond range, corresponding to frequencies of 0.5–5 MHz, which directly overlap with the predicted resonant band.
D. Second Droop Estimate
The second droop resonance involves board/VRM inductance resonating with package-level capacitance. The H100 model analyzed here does not include board-level components, but we can estimate the second droop frequency using typical board inductance values from the literature:
| Board L | Package C | $f_{0,\text{second}}$ |
|---|---|---|
| 0.5 nH | 1.16 μF | 6.6 MHz |
| 1.0 nH | 1.16 μF | 4.7 MHz |
| 2.0 nH | 1.16 μF | 3.3 MHz |
| 5.0 nH | 1.16 μF | 2.1 MHz |
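The table values follow from the same lumped-LC expression; to reproduce:

```python
import math

# Second droop estimates: board inductance against total package capacitance.
C_pkg = 1.16e-6  # total package capacitance (F), Sec. III-E
for L_board_nH in (0.5, 1.0, 2.0, 5.0):
    f0 = 1.0 / (2.0 * math.pi * math.sqrt(L_board_nH * 1e-9 * C_pkg))
    print(f"L_board = {L_board_nH:>3} nH -> f0 = {f0 / 1e6:.1f} MHz")
# -> 6.6, 4.7, 3.3, 2.1 MHz, matching the table above
```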
V. Workload Profile Methodology for Resonance Exposure
A. Existing Workload Profiles
The H100 PDN model includes 279 pre-defined current profiles across 9 simulation scenarios. The default SM current profile models clock-correlated draw as a train of Gaussian pulses at 690 MHz (1.45 ns period) with 11.6 A peak current per SM. Table VII summarizes representative scenario configurations:
| Scenario | SM Profile | tmax | Character |
|---|---|---|---|
| Realistic Baseline | Ramped, per-GPC stagger | 10 ns | Multi-cycle ramp-up at 690 MHz |
| Burst (2× Peak) | 23.3 A peak | 3 ns | Maximum current stress |
| LLM Training | Burst + NVLink | 10 ns | Full utilization + interconnect |
| Rogue Wave | Phased clusters | 10 ns | Constructive interference test |
| Single SM | 1 SM burst | 10 ns | Isolation / wave propagation |
None of these existing scenarios are designed to expose the sub-5 MHz resonant modes identified in Section IV. All operate with simulation windows of 3–10 ns, while a single oscillation cycle at 1 MHz requires 1,000 ns. The resonant behavior is present in the model but invisible in the current simulation configurations.
B. Step Response Method
The most direct method for revealing PDN resonant frequencies is a step response test. All current sources simultaneously transition from zero to peak current with a fast rise time (~1 ns), then hold constant. The resulting voltage waveform exhibits damped oscillation at the PDN’s natural frequencies:

$$V(t) \approx V_{\text{final}} + \Delta V\, e^{-\alpha t}\cos(2\pi f_0 t + \phi),$$

where $V_{\text{final}}$ is the post-step steady-state voltage, $\Delta V$ and $\phi$ set the initial droop amplitude and phase, and $\alpha = R/(2L)$ is the damping coefficient determined by the resistive losses in the PDN. The oscillation frequency is the resonant frequency $f_0$, and can be measured directly from the waveform by timing successive peaks.
A step current has broadband frequency content—its Fourier transform is a sinc function spanning all frequencies—so it excites all resonant modes simultaneously. The dominant oscillation visible in the voltage waveform corresponds to the mode with the highest impedance peak, typically the first droop resonance.
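Before giving concrete simulator settings, the sketch below illustrates the extraction on a single lumped series R-L-C stage using the Section IV-B parameters. This is an illustrative single-mode stand-in for the distributed model, and the 1 mΩ series resistance is an assumed loss term:

```python
import numpy as np

# Single-mode stand-in for the step-response test: one series R-L path from the
# package into the total on-die capacitance. L and C are the Section IV-B values;
# R = 1 mOhm is assumed. Semi-implicit Euler keeps the oscillation numerically stable.
R, L, C = 1e-3, 2.05e-9, 21.64e-6  # ohm, henry, farad
I_load = 11.6                      # per-SM step (A); f0 does not depend on amplitude

dt, n = 1e-9, 5000                 # 1 ns steps over a 5 us window
v = np.zeros(n)                    # die voltage deviation from nominal (V)
i = 0.0                            # current delivered through the inductor (A)
for k in range(1, n):
    i += dt / L * (-v[k-1] - R * i)        # L di/dt = -(v + R*i); source at 0 deviation
    v[k] = v[k-1] + dt / C * (i - I_load)  # C dv/dt = i - I_load

# Read f0 from the spacing of successive droop minima.
t = np.arange(n) * dt
mins = [k for k in range(1, n - 1) if v[k] < v[k-1] and v[k] < v[k+1]]
print(f"f0 ~ {1.0 / np.diff(t[mins]).mean() / 1e6:.2f} MHz")  # ~0.76 MHz
```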
The required simulation parameters for the H100 model:
- Simulation window: $t_{\max} = 5\;\mu\text{s}$ (5,000 ns)—sufficient for approximately 4 cycles at 0.8 MHz or 24 cycles at 4.8 MHz.
- Plot resolution: $n_{\text{plots}} = 500$, giving 10 ns per frame.
- Current step: 0 → 11.6 A per SM, 1 ns rise time.
- All non-SM sources: Idle (constant low current).
C. Resonance-Targeted Periodic Excitation
Once the resonant frequency is identified from the step response, a confirmation test uses a periodic current pulse train at the measured frequency. If the pulse frequency matches the PDN resonance, the voltage droop grows with each successive pulse—constructive interference between the excitation and the PDN’s natural response:

$$\Delta V_n = \Delta V_1 \sum_{k=0}^{n-1} e^{-\pi k / Q} = \Delta V_1\,\frac{1 - e^{-\pi n / Q}}{1 - e^{-\pi / Q}},$$

where $n$ is the cycle number, $\Delta V_1$ is the droop produced by a single pulse, and $Q$ is the quality factor of the resonant mode, so that each cycle’s contribution decays by a factor of $e^{-\pi/Q}$ per period. This progressive amplification is the hallmark of resonance and matches the behavior documented by Jiang et al. [1], who observed “progressive escalation with each subsequent oscillation” at their identified 35 MHz resonance.
For the H100 model, if step response testing reveals a resonant frequency of, say, 4 MHz:
- Pulse period: 250 ns (125 ns ON, 125 ns OFF).
- Simulation window: $t_{\max} = 2{,}500\;\text{ns}$ (10 complete cycles).
- Expected behavior: Voltage droop grows over the first 4–6 cycles before damping stabilizes the oscillation, with the steady-state amplitude determined by the PDN’s Q-factor.
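A minimal sketch of constructing such a pulse-train current profile (standalone; the 4 MHz value is the hypothetical example above, pending the actual step-response measurement):

```python
import numpy as np

# Resonance-targeted pulse train: square-wave SM current at the measured resonance.
f_res = 4e6               # hypothetical measured resonant frequency (Hz)
period = 1.0 / f_res      # 250 ns
dt, t_max = 1e-9, 2.5e-6  # 1 ns resolution, 10 complete cycles
t = np.arange(0.0, t_max, dt)
i_sm = np.where((t % period) < period / 2, 11.6, 0.0)  # 125 ns ON at 11.6 A, 125 ns OFF
print(f"{len(t)} samples, {t_max / period:.0f} cycles, duty = {i_sm.mean() / 11.6:.0%}")
```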
D. Connection to Real LLM Workloads
The practical significance of these test methodologies lies in their connection to real workload behavior. Jiang et al. [1] demonstrated that LLM GEMM kernels produce oscillations at 4.09 MHz and 29.14 MHz on A100 GPUs. The 4.09 MHz component falls squarely within the H100’s predicted resonant band of 0.8–4.8 MHz.
Furthermore, LLM inference engines produce inter-kernel transition patterns whose cadence depends on model architecture, sequence length, and batch size. For transformer models with hundreds of layers, the kernel dispatch rate can produce power oscillation harmonics in the low-MHz range. The step response and resonance-targeted tests described above provide a controlled methodology for determining whether a specific PDN design is vulnerable to these workload-induced oscillations before silicon fabrication.
VI. Discussion
A. Implications for PDN Design
The derived resonant frequency range for the H100 model carries several design implications:
Decoupling capacitance scaling. The 21.6 μF total on-die capacitance in the H100 model is an order of magnitude larger than the ~2.5 μF in the lumped A100 PDN model from Jiang et al. [1]. This increased capacitance lowers the resonant frequency but does not eliminate the resonance—it shifts it into a frequency range that may be more dangerous, because low-MHz oscillations can be sustained for more cycles (lower damping) and are more easily excited by inter-kernel workload transitions.
Package inductance dominance. In the H100 model, the package sheet inductance (2 nH/sq) dominates over the parallel die bump inductance (50 pH). This means that package substrate design—layer count, plane geometry, via density—is the primary lever for controlling the first droop resonant frequency. Die-level changes (more decaps, lower on-die resistance) have diminishing returns if the package inductance is not simultaneously addressed.
Workload-PDN co-design. The overlap between predicted PDN resonant frequencies and measured LLM kernel oscillation frequencies suggests that future GPU architectures may require workload-aware PDN design—or equivalently, PDN-aware workload scheduling. The warp-level staggering approach proposed by Jiang et al. [1] is one such technique; others might include frequency-aware kernel dispatch scheduling or dynamic decoupling capacitance modulation.
B. Limitations
Several limitations of the current analysis should be noted:
- The model parameters are based on published specifications and typical packaging values, not physical measurements of an actual H100 GPU. Actual PDN impedance characteristics may differ.
- The analytical resonant frequency derivation uses a simplified lumped-equivalent model of what is fundamentally a distributed system. The distributed nature of the PDN may produce a band of resonant frequencies rather than a single sharp peak.
- The model does not yet include board-level components (VRM, bulk capacitors, PCB traces), which would be necessary for accurate second droop characterization.
- Time-domain simulation validation of the predicted resonant frequencies has not yet been performed. This is the subject of ongoing work.
VII. Ongoing and Future Work
This report presents the analytical foundation. The following experimental and modeling extensions are in progress:
- Step response simulation. Run the H100 model in PDNLab with a 5 μs step response scenario to directly measure the resonant frequency from the voltage waveform and compare against the analytical predictions.
- Frequency-domain impedance extraction. Develop PDNLab tooling for computing the frequency-domain impedance $Z_{\text{PDN}}(f)$ from the time-domain step response via FFT, producing impedance-vs-frequency plots analogous to those in Jiang et al. [1]. A minimal sketch of this extraction follows this list.
- Board-level model extension. Add VRM, bulk capacitor, and PCB trace modeling to the H100 project, enabling full first-through-third droop characterization.
- LLM workload profile integration. Develop current profiles derived from actual LLM inference power measurements (via NVML or AccelWattch), and simulate their interaction with the H100 PDN model to identify whether real workload oscillations excite the predicted resonant modes.
- Mitigation evaluation. Use the PDNLab scenario system to evaluate workload-level mitigations (kernel staggering, frequency-aware dispatch) and PDN-level mitigations (targeted decoupling, package inductance reduction) against the identified resonant vulnerabilities.
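As a preview of the FFT-based extraction in the second item above, the following is a standalone sketch (our own helper operating on exported waveform arrays, not existing PDNLab tooling), demonstrated on a synthetic damped-oscillation response of the Section V-B form:

```python
import numpy as np

def impedance_from_step(t, v, i, v_nominal):
    """Estimate |Z_PDN(f)| = |V(f)| / |I(f)| from step-response waveforms.

    t, v, i: uniformly sampled time (s), die voltage (V), and load current (A),
    e.g. exported from a step-response run. Differentiating both signals
    removes the step's DC content before the FFT ratio is taken.
    """
    dt = t[1] - t[0]
    dv = np.diff(v - v_nominal)      # derivative of the voltage deviation
    di = np.diff(i)                  # derivative of current (impulse at the step)
    freqs = np.fft.rfftfreq(len(dv), dt)
    z = np.fft.rfft(dv) / np.fft.rfft(di)
    return freqs[1:], np.abs(z[1:])  # drop the DC bin

# Demonstration on a synthetic waveform: 11.6 A step, ~0.76 MHz damped ringing.
dt = 1e-9
t = np.arange(0.0, 5e-6, dt)
step = t > 10e-9
i = np.where(step, 11.6, 0.0)
v = 0.9 - (0.02 * step * np.exp(-2.4e5 * (t - 10e-9))
           * np.cos(2 * np.pi * 0.76e6 * (t - 10e-9)))
f, zmag = impedance_from_step(t, v, i, v_nominal=0.9)
print(f"peak |Z| at {f[zmag.argmax()] / 1e6:.1f} MHz")  # ~0.8 MHz (0.2 MHz bin width)
```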
VIII. Conclusion
We have presented an analytical framework for characterizing the resonant frequencies of GPU power delivery networks from distributed 2D models, and applied it to a full-stack model of the NVIDIA H100 GPU constructed in PDNLab. The analysis identifies a primary package-to-die resonant frequency in the range of 0.8–4.8 MHz, determined by the interaction of approximately 2 nH effective package inductance with 21.6 μF total on-die capacitance.
This resonant band overlaps directly with the intra-kernel power oscillation frequencies documented in LLM workloads by Jiang et al. [1], reinforcing the concern that modern AI workloads can inadvertently excite PDN resonance during normal operation. The shift toward lower resonant frequencies in larger-die GPUs like the H100, compared to the 35 MHz reported for A100-class designs, suggests that as die sizes continue to grow, the resonance threat evolves rather than diminishes—moving from intra-kernel oscillations to inter-kernel workload cadences as the primary excitation mechanism.
The step response and resonance-targeted excitation methodologies described here provide a practical path for exposing these resonant modes in time-domain simulation, enabling PDN designers and reliability engineers to characterize vulnerability before silicon fabrication. Simulation-based validation is ongoing.
References
[1] Z. Jiang, J. Garrigus, A. Seigler, E. Syed, Y.-L. Huang, M. Sadi, T. Rahal-Arabi, and L. K. John, “Exploration of LLM Workload Reliability based on di/dt Effects and Voltage Droops,” IEEE International Symposium on High-Performance Computer Architecture (HPCA), 2026.
[2] E. Choukse et al., “Power stabilization for AI training datacenters,” arXiv:2508.14318, 2025.
[3] H. D. Dixit et al., “Detecting silent data corruptions in the wild,” arXiv:2203.08989, Meta, 2022.
[4] H. D. Dixit et al., “Silent data corruptions at scale,” arXiv:2102.11245, Facebook, 2021.
[5] Y. Deng et al., “Minder: Faulty Machine Detection for Large-scale Distributed Model Training,” USENIX NSDI, Alibaba, 2025.
[6] R. Joseph, D. Brooks, and M. Martonosi, “Control techniques to eliminate voltage emergencies in high performance processors,” HPCA, 2003.
[7] M. D. Powell and T. N. Vijaykumar, “Exploiting resonant behavior to reduce inductive noise,” ISCA, 2004.
[8] J. Leng, T. Hetherington, A. ElTantawy, S. Gilani, N. S. Kim, T. M. Aamodt, and V. J. Reddi, “GPUWattch: Enabling energy optimizations in GPGPUs,” ISCA, 2013.
[9] NVIDIA Corporation, “NVIDIA A100 Tensor Core GPU Architecture,” 2020.
[10] NVIDIA Corporation, “NVIDIA H100 Tensor Core GPU Architecture,” 2022.