Timing and Memory Telemetry on GPUs for AI Governance

Notice: This research summary and analysis were automatically generated using AI technology. For absolute accuracy, please refer to the original arXiv source.

The rapid expansion of GPU-accelerated computing has enabled major advances in large-scale artificial intelligence (AI), while heightening concerns about how accelerators are observed or governed once deployed. Governance is essential to ensure that large-scale compute infrastructure is not silently repurposed for training models, circumventing usage policies, or operating outside legal oversight. Because current GPUs expose limited trusted telemetry and can be modified or virtualized by adversaries, we explore whether compute-based measurements can provide actionable signals of utilization when host and device are untrusted. We introduce a measurement framework that leverages architectural characteristics of modern GPUs to generate timing- and memory-based observables that correlate with compute activity. Our design draws on four complementary primitives: (1) a probabilistic, workload-driven mechanism inspired by Proof-of-Work (PoW) to expose parallel effort, (2) sequential, latency-sensitive workloads derived via Verifiable Delay Functions (VDFs) to characterize scalar execution pressure, (3) General Matrix Multiplication (GEMM)-based tensor-core measurements that reflect dense linear-algebra throughput, and (4) a VRAM-residency test that distinguishes on-device memory locality from off-chip access through bandwidth-dependent hashing. These primitives provide statistical and behavioral indicators of GPU engagement that remain observable even without trusted firmware, enclaves, or vendor-controlled counters. We evaluate their responses to contention, architectural alignment, memory pressure, and power overhead, showing that timing shifts and residency latencies reveal meaningful utilization patterns. Our results illustrate why compute-based telemetry can complement future accountability mechanisms by exposing architectural signals relevant to post-deployment GPU governance.


💡 Research Summary

The paper tackles a pressing problem in modern AI infrastructure: how to monitor the utilization of powerful GPUs after they have been deployed, especially when neither the host system nor the GPU firmware can be trusted. Current commercial GPUs lack tamper‑evident telemetry, and even vendor‑signed firmware can be altered or virtualized by a determined adversary. As a result, GPUs can be silently repurposed for training or inference that violates usage policies, circumvents legal oversight, or supports malicious activities. The authors therefore ask whether compute‑based measurements—purely software‑driven probes that exploit architectural characteristics—can serve as a reliable source of “soft” telemetry in such hostile environments.

To answer this, they design a measurement framework built around four complementary primitives, each targeting a distinct aspect of GPU behavior:

  1. Proof‑of‑Work (PoW) Parallel‑Effort Probe – A memory‑hard hash function (e.g., Argon2) is used to generate a puzzle whose difficulty is calibrated to require a predictable amount of parallel work on the GPU. The challenger sends a random salt and timestamp, the GPU performs a brute‑force search across thousands of threads, and the response time is recorded. Longer solve times indicate that the GPU was already heavily loaded or that the probe could not fully utilize the device's parallelism.

  2. Verifiable Delay Function (VDF) Sequential‑Latency Probe – A VDF forces the GPU to execute a strictly sequential, memory‑hard computation. Because parallelism is deliberately suppressed, any contention from other workloads manifests as a measurable increase in solve latency. This probe is useful for detecting whether the GPU is currently occupied with serial workloads or is being throttled by co‑resident jobs.

  3. GEMM‑Based Tensor‑Core Saturation Probe – Large‑scale matrix‑multiplication kernels are dispatched to engage the tensor cores (FP8/FP16/BF16) that dominate modern AI workloads. By measuring execution time, power draw, and achieved FLOPS, the probe quantifies how saturated the tensor‑core pipeline is. Compared with PoW, this method is far more power‑efficient while still providing a strong utilization signal.

  4. VRAM‑Residency Memory‑Latency Test – This test distinguishes whether data resides in on‑device high‑bandwidth memory (HBM) or must be streamed over PCIe/NVLink. A memory‑hard hash is performed on data placed either locally or remotely; the resulting bandwidth‑limited latency reveals the memory locality of the workload.
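
To make the first primitive concrete, here is a minimal CPU-only sketch of a PoW-style timing probe. It brute-forces a nonce whose hash falls below a difficulty target and records the wall-clock solve time; the paper's actual probe uses a memory-hard hash such as Argon2 running across thousands of GPU threads, so SHA-256, the 16-bit difficulty, and all names below are illustrative assumptions.

```python
import hashlib
import time

def solve_pow(salt: bytes, difficulty_bits: int) -> tuple[int, float]:
    """Brute-force a nonce so that SHA-256(salt || nonce) falls below a
    target with `difficulty_bits` leading zero bits; return nonce and time."""
    target = 1 << (256 - difficulty_bits)  # hashes below this value qualify
    start = time.perf_counter()
    nonce = 0
    while True:
        digest = hashlib.sha256(salt + nonce.to_bytes(8, "big")).digest()
        if int.from_bytes(digest, "big") < target:
            return nonce, time.perf_counter() - start
        nonce += 1

nonce, elapsed = solve_pow(b"fresh-random-salt", difficulty_bits=16)
print(f"nonce={nonce}, solved in {elapsed:.4f}s")
```

On a loaded device the same calibrated puzzle takes measurably longer to solve, which is the utilization signal the probe extracts.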

These primitives are combined into a “challenge–response” protocol. The challenger (e.g., a cloud‑orchestrator) periodically issues a challenge containing a fresh nonce and a deadline, the GPU executes the appropriate primitive, and the challenger validates both correctness and timing. Deviations from expected solve windows are interpreted as evidence of under‑utilization, covert repurposing, or measurement evasion.
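
The challenge–response loop described above can be sketched as follows, using repeated modular squaring as a CPU stand-in for the sequential (VDF-style) workload. The modulus, iteration count, and the naive re-evaluation check are assumptions for illustration; a real VDF ships a succinct proof so the challenger does not redo the work.

```python
import secrets
import time

MODULUS = 2**127 - 1   # illustrative prime modulus
ITERATIONS = 50_000    # sequential squarings; calibrated per device in practice

def respond(nonce: int) -> int:
    """Prover: evaluate the inherently sequential map x -> x^(2^T) mod N."""
    x = nonce % MODULUS
    for _ in range(ITERATIONS):
        x = (x * x) % MODULUS
    return x

def challenge(deadline_s: float) -> bool:
    """Challenger: issue a fresh nonce, then check correctness and timing."""
    nonce = secrets.randbits(64)
    start = time.perf_counter()
    answer = respond(nonce)
    elapsed = time.perf_counter() - start
    correct = answer == respond(nonce)  # naive check: re-evaluate to verify
    return correct and elapsed <= deadline_s

print(challenge(deadline_s=5.0))
```

Responses that are correct but arrive outside the expected solve window are the anomaly of interest: they suggest contention, throttling, or an attempt to proxy the computation elsewhere.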

The authors evaluate the approach on recent NVIDIA Hopper and Ampere GPUs under a variety of realistic conditions: single‑tenant workloads, multi‑tenant interference (LLM inference, synthetic compute), MIG and MPS virtualization, and power‑capped scenarios. Key findings include:

  • Contention Detection – When a heavy LLM inference job runs concurrently, PoW and VDF solve times increase by 30%–70%, clearly exposing that the GPU is already saturated.
  • Virtualization Robustness – Even with MIG partitions, the timing signatures of all four primitives remain distinguishable; the VRAM‑Residency test, in particular, highlights bandwidth throttling at partition boundaries.
  • Power/Heat Overhead – PoW incurs the highest peak power (~350 W) and thermal load, whereas the GEMM probe stays within ~20 % of peak power, making it suitable for continuous monitoring. VDF and VRAM tests sit in the middle.
  • Statistical Reliability – Across 10,000 repetitions, 95 % confidence intervals for solve times are tight, and anomalous spikes are observed when the GPU is rebooted or its firmware is tampered with.
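
The statistical-reliability claim amounts to computing tight confidence intervals over repeated solve times. A hedged sketch of that computation, using the normal approximation (z = 1.96) on synthetic timings, since the paper's raw traces are not reproduced here:

```python
import math
import statistics

def ci95(samples: list[float]) -> tuple[float, float]:
    """95% confidence interval for the mean solve time
    (normal approximation, standard error of the mean)."""
    mean = statistics.fmean(samples)
    sem = statistics.stdev(samples) / math.sqrt(len(samples))
    return mean - 1.96 * sem, mean + 1.96 * sem

# Synthetic solve times in milliseconds, for illustration only
times = [12.1, 11.9, 12.3, 12.0, 12.2, 11.8, 12.1, 12.0]
lo, hi = ci95(times)
print(f"95% CI: [{lo:.2f}, {hi:.2f}] ms")
```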

The paper acknowledges several limitations. First, the measurement itself consumes GPU resources; continuous PoW probing could degrade service quality. Second, a sophisticated adversary might attempt to spoof or delay responses, so the primitives are best viewed as heuristic indicators rather than cryptographic guarantees. Third, the current implementation is tuned for NVIDIA GPUs; extending the methodology to AMD Instinct or Intel Xe architectures will require additional work.

Future research directions proposed include: (1) designing lightweight PoW/VDF variants to reduce overhead, (2) applying machine‑learning models to the timing/power traces for automated anomaly detection, (3) developing collaborative telemetry protocols that aggregate measurements across multiple GPUs in a data‑center, and (4) standardizing an API and policy framework so that cloud providers can expose trustworthy utilization metrics without relying on hardware roots of trust.
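
For the proposed anomaly detection over timing traces, even a simple robust-statistics baseline illustrates the idea before any machine learning is applied. The sketch below flags solve times by a median/MAD modified z-score; the 3.5 threshold and 0.6745 scaling constant are conventional choices, not values from the paper.

```python
import statistics

def flag_anomalies(trace: list[float], threshold: float = 3.5) -> list[int]:
    """Return indices of solve times whose modified z-score
    (median/MAD based, robust to the very outliers being hunted)
    exceeds `threshold`."""
    med = statistics.median(trace)
    mad = statistics.median(abs(t - med) for t in trace)
    if mad == 0:
        return []
    return [i for i, t in enumerate(trace)
            if 0.6745 * abs(t - med) / mad > threshold]

# A sudden spike (e.g. contention or tampering) stands out:
trace = [10.0, 10.1, 9.9, 10.0, 10.2, 25.0, 10.1, 9.8]
print(flag_anomalies(trace))  # the spike at index 5 is flagged
```

A median-based score is used instead of mean/standard deviation because a single large spike inflates the standard deviation enough to hide itself.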

In conclusion, the study demonstrates that even in the most adversarial setting—where neither host nor GPU can be trusted—software‑driven, architecture‑aware timing and memory probes can yield actionable signals about GPU activity. These signals can complement existing attestation mechanisms and form a practical foundation for AI governance, compliance auditing, and fairness enforcement in large‑scale, multi‑tenant AI deployments.

