arXiv.org
CODA: Rewriting Transformer Blocks as GEMM-Epilogue Programs
Transformer training systems are built around dense linear algebra, yet a nontrivial fraction of end-to-end time is spent on surrounding memory-bound operators. Normalization, activations, residual updates, reductions, and related computations repeatedly move large intermediate tensors through gl...
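To make the memory-traffic concern concrete, the following is a minimal NumPy sketch of the general idea of a GEMM epilogue. It is not taken from the paper: the function names (`unfused`, `fused_epilogue`), the tile sizes, and the choice of bias + GELU + residual as the epilogue are all assumptions made for illustration. NumPy does not model GPU memory, so the sketch shows only the restructuring of the computation, not any speedup.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def unfused(x, w, bias, residual):
    # Each step is a separate pass that reads and writes a full (m, n) tensor.
    y = x @ w            # GEMM materializes the full intermediate
    y = y + bias         # pass 1: bias add
    y = gelu(y)          # pass 2: activation
    return y + residual  # pass 3: residual update

def fused_epilogue(x, w, bias, residual, tile_m=32, tile_n=32):
    # Hypothetical tiling: the elementwise work is applied to each output
    # tile while it is still "local", and only the final tile is stored.
    m, n = x.shape[0], w.shape[1]
    out = np.empty((m, n), dtype=np.result_type(x, w))
    for i in range(0, m, tile_m):
        for j in range(0, n, tile_n):
            acc = x[i:i + tile_m] @ w[:, j:j + tile_n]      # tile of the GEMM
            acc = gelu(acc + bias[j:j + tile_n])            # epilogue: bias + activation
            acc = acc + residual[i:i + tile_m, j:j + tile_n]  # epilogue: residual
            out[i:i + tile_m, j:j + tile_n] = acc           # single store per tile
    return out

rng = np.random.default_rng(0)
x = rng.standard_normal((70, 48))
w = rng.standard_normal((48, 50))
bias = rng.standard_normal(50)
residual = rng.standard_normal((70, 50))

same = np.allclose(unfused(x, w, bias, residual),
                   fused_epilogue(x, w, bias, residual))
print("fused and unfused results match:", same)
```

Both functions compute the same values. The difference is that the fused variant never materializes the pre-activation tensor as a whole, which is the property that matters on hardware where these elementwise passes are memory-bound.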