Authors: Prakash KL, Chenguang Zhu, Shyam Sundar Chandrasekaran, Jon Dyer
Ensuring ML code efficiency is a multi-million-dollar problem at Meta's scale. Common PyTorch (https://pytorch.org) anti-patterns such as unpinned DataLoader memory, redundant host-to-device transfers, and suboptimal optimizer configurations silently degrade training performance across thousands of jobs. Triton GPU kernel code exhibits analogous pathologies at the kernel layer, including flattened loops that forgo proper tiling. A single one-line misconfiguration can cost thousands of dollars per trainer; replicated fleet-wide — and across the broader AI industry, where hyperscalers collectively spend tens of billions of dollars annually on GPU compute and draw megawatts of grid power — the waste compounds into massive losses of capital and energy. As AI coding agents now author a growing share of production ML code, these inefficiencies are being reproduced at machine speed and at unprecedented volume, making detection at the code-authoring phase the only scalable defense.
We present Citrine, an always-on, zero-overhead efficiency system that detects 45+ ML efficiency anti-patterns spanning both PyTorch core and Triton kernel code through static analysis and AST transformation built on top of LibCST (https://github.com/Instagram/LibCST), and is integrated directly into Meta's arc lint pipeline. Every detector ships with a deterministic AST rewriter that surfaces a one-click suggested edit on every diff under review. In the past 90 days, Citrine has landed 4,700+ accepted fixes across 10+ product groups — including Generative AI, Reality Labs, Instagram, Monetization and Ads — saving millions of dollars annually in GPU compute waste and yielding measured improvements of up to 43% on affected Triton kernels.
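To make the detector-plus-rewriter pairing concrete, the following is a minimal sketch of one such rule, not Citrine's actual implementation. It assumes Python 3.9+ and swaps in the standard-library `ast` module for LibCST so that it runs without dependencies. The class name `PinMemoryFixer`, the helper `lint_and_fix`, and the name-only matching heuristic are all hypothetical.

```python
import ast

# Sample input: one DataLoader call lacks pin_memory, the other sets it.
SOURCE = """
from torch.utils.data import DataLoader

train_loader = DataLoader(dataset, batch_size=64, num_workers=4)
eval_loader = DataLoader(dataset, batch_size=64, pin_memory=True)
"""


def _is_dataloader_call(node: ast.Call) -> bool:
    """Match DataLoader(...) or <module>.DataLoader(...) by name only (assumption)."""
    func = node.func
    if isinstance(func, ast.Name):
        return func.id == "DataLoader"
    if isinstance(func, ast.Attribute):
        return func.attr == "DataLoader"
    return False


class PinMemoryFixer(ast.NodeTransformer):
    """Hypothetical rule: flag DataLoader calls without pin_memory and add it."""

    def __init__(self) -> None:
        self.findings = []  # line numbers of flagged calls

    def visit_Call(self, node: ast.Call) -> ast.Call:
        self.generic_visit(node)
        if not _is_dataloader_call(node):
            return node
        if any(kw.arg == "pin_memory" for kw in node.keywords):
            return node  # the author chose a value explicitly; leave it alone
        if any(kw.arg is None for kw in node.keywords):
            return node  # **kwargs may already carry pin_memory; do not guess
        self.findings.append(node.lineno)
        node.keywords.append(
            ast.keyword(arg="pin_memory", value=ast.Constant(value=True))
        )
        return node


def lint_and_fix(source: str):
    """Return (flagged line numbers, rewritten source) for one file."""
    tree = ast.parse(source)
    fixer = PinMemoryFixer()
    fixed_tree = ast.fix_missing_locations(fixer.visit(tree))
    return fixer.findings, ast.unparse(fixed_tree)


findings, fixed_source = lint_and_fix(SOURCE)
print(findings)
print(fixed_source)
```

One design note: `ast.unparse` discards comments and original formatting, which is acceptable for a sketch but not for a suggested edit on a diff. A concrete syntax tree such as LibCST preserves both, which is presumably why a production rewriter would build on it.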
Beyond Meta-internal impact, Citrine has become the de facto home for TorchFix (https://github.com/pytorch/torchfix), the open-source PyTorch hygiene project: 13 TorchFix rules (covering deprecated-symbol migrations, the unsafe torch.load deserialization vector, common API typos, and TorchVision migrations) now ship as first-class Citrine patterns, giving developers a single integration point for PyTorch-ecosystem hygiene alongside a vetted set of efficiency rules. Through a partnership with the open-source Triton compiler team (https://github.com/triton-lang/triton), Citrine also integrates 12+ Triton-lint rules that detect kernel-level pathologies such as missing @triton.autotune decorators, accumulator-precision regressions, warp-divergent control flow, and barrier deadlocks, unifying static analysis across the high-level PyTorch layer and the low-level GPU kernel layer in a single linter.
To address the rapid growth of agent-authored ML code, we further transformed Citrine from a reactive lint tool into a proactive efficiency system by shifting left into the model-authoring workflow. By encoding anti-pattern knowledge directly into Meta's LLM code-authoring framework, we inject PyTorch-specific efficiency rules into AI coding agents at code-generation time, so that both experimental and production code avoids canonical inefficiencies from the moment it is written. We present this end-to-end architecture in the context of the software development lifecycle (agentic codegen → linting → CI/Diff → ship), together with an attribution methodology that connects individual lint fixes, both human- and AI-authored, to validated efficiency wins at fleet scale.
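As a rough illustration of injecting rules at code-generation time, the sketch below renders a small rule catalog into guidance text that is prepended to an agent's task prompt. The abstract does not describe how Meta's framework encodes rules, so everything here is an assumption made for illustration: the `EfficiencyRule` type, the `EX001`/`EX002` identifiers, the rule wording, and the prompt layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EfficiencyRule:
    rule_id: str  # hypothetical identifier, not a real Citrine rule ID
    avoid: str    # the anti-pattern, phrased for the agent
    prefer: str   # the recommended alternative


# Illustrative catalog; the real rule set and wording are not given in the text.
RULES = [
    EfficiencyRule(
        "EX001",
        "constructing a DataLoader without pin_memory when training on GPU",
        "pass pin_memory=True",
    ),
    EfficiencyRule(
        "EX002",
        "moving the same tensor to the device on every loop iteration",
        "move it once, outside the loop",
    ),
]


def render_guidance(rules) -> str:
    """Turn the rule catalog into plain-text guidance for a coding agent."""
    lines = ["Follow these PyTorch efficiency rules when writing code:"]
    for rule in rules:
        lines.append(f"- [{rule.rule_id}] Avoid {rule.avoid}; instead, {rule.prefer}.")
    return "\n".join(lines)


def build_prompt(task: str, rules=RULES) -> str:
    """Prepend the rendered guidance to the user's task description."""
    return f"{render_guidance(rules)}\n\nTask: {task}"


prompt = build_prompt("Write a training loop for an image classifier.")
print(prompt)
```

Keeping the catalog as data rather than free text would let the same entries drive both the lint detectors and the agent guidance, so the two layers do not drift apart.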
The approach Citrine pioneers — uniting static analysis with AST-driven remediation at the moment of authorship and reinforcing that knowledge in the LLM-codegen layer — generalizes well beyond Meta. As AI training infrastructure expands toward multi-gigawatt scales and agentic development becomes the standard mode of ML engineering, code-time efficiency enforcement is, in our view, a necessary layer in any sustainable AI compute stack. We continue to upstream generalizable rules through TorchFix and the Triton-lint project so the broader ecosystem benefits. We are extending Citrine in three directions: (1) coverage of emerging hardware and software stacks beyond PyTorch and Triton; (2) tighter integration with LLM coding agents so efficiency knowledge evolves alongside model capabilities; and (3) publishing the attribution methodology so independent researchers and other infrastructure teams can replicate fleet-scale efficiency measurement.