What is C-Transformer?
C-Transformer is a pure C runtime for transformer models, meticulously engineered for high-performance inference and training on modern x86-64 CPUs. It serves as both a practical engine and an educational tool for exploring advanced CPU optimization techniques.
This file implements:
- A unified memory layout for weights, activations, and gradients in a single contiguous block.
- Both inference and backpropagation logic, implemented from first principles.
- Advanced CPU optimizations, including:
  - SIMD vectorization using AVX-512 (see the dot-product sketch after this list).
  - Multi-threading with OpenMP, designed for NUMA and cache awareness.
  - Hugepage-backed memory to minimize TLB misses.
- Hybrid parallelism, combining:
  - Token-level parallelism for prompt processing.
  - Head-level parallelism for attention computation (sketched below).
- A fixed batch size of 1, optimized for real-time inference.
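To make the SIMD bullet concrete, here is a minimal sketch of an AVX-512 dot product with FMA, in the spirit of the 1D tensor kernels used throughout the runtime. The function name and the assumption that n is a multiple of 16 are illustrative simplifications, not the actual C-Transformer API:

```c
#include <immintrin.h>
#include <stddef.h>

/* Illustrative AVX-512 dot product with FMA. For brevity, n is assumed to be a
 * multiple of 16; a production kernel would handle the tail with a masked load. */
static float dot_avx512(const float *a, const float *b, size_t n)
{
    __m512 acc = _mm512_setzero_ps();
    for (size_t i = 0; i < n; i += 16) {
        __m512 va = _mm512_loadu_ps(a + i);  /* 16 floats from a                */
        __m512 vb = _mm512_loadu_ps(b + i);  /* 16 floats from b                */
        acc = _mm512_fmadd_ps(va, vb, acc);  /* acc += va * vb, fused, per lane */
    }
    return _mm512_reduce_add_ps(acc);        /* horizontal sum of the 16 lanes  */
}
```

Built with the flags listed under System Architecture (-O3 -march=native -mavx512f), the loop body reduces to two loads and one fused multiply-add per 16 floats.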
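Head-level parallelism can be sketched in the same spirit: each attention head writes a disjoint slice of the output, so an OpenMP parallel-for over heads needs no locks or atomics. The memory layout, function name, and scalar inner loops below are simplifying assumptions for illustration; the runtime's real kernels vectorize the inner products and allocate from the unified memory arena.

```c
#include <math.h>
#include <omp.h>

/* Single-token attention with head-level parallelism (illustrative layout):
 *   q      : [n_heads * head_dim]            query for the current token
 *   k, v   : [seq_len * n_heads * head_dim]  cached keys and values
 *   out    : [n_heads * head_dim]            per-head attention output
 *   scores : [n_heads * seq_len]             scratch for attention weights   */
void attention_heads(const float *q, const float *k, const float *v,
                     float *out, float *scores,
                     int n_heads, int head_dim, int seq_len)
{
    const float scale = 1.0f / sqrtf((float)head_dim);
    const int stride = n_heads * head_dim;        /* elements per cached token */

    /* Heads are independent: each thread owns a disjoint output slice. */
    #pragma omp parallel for schedule(static)
    for (int h = 0; h < n_heads; h++) {
        const float *qh = q + h * head_dim;
        float *sh = scores + h * seq_len;
        float *oh = out + h * head_dim;

        /* Scaled dot-product scores against every cached position. */
        float max = -INFINITY;
        for (int t = 0; t < seq_len; t++) {
            const float *kt = k + t * stride + h * head_dim;
            float s = 0.0f;
            for (int d = 0; d < head_dim; d++) s += qh[d] * kt[d];
            sh[t] = s * scale;
            if (sh[t] > max) max = sh[t];
        }

        /* Numerically stable softmax over the sequence dimension. */
        float sum = 0.0f;
        for (int t = 0; t < seq_len; t++) { sh[t] = expf(sh[t] - max); sum += sh[t]; }

        /* Weighted sum of the cached values. */
        for (int d = 0; d < head_dim; d++) oh[d] = 0.0f;
        for (int t = 0; t < seq_len; t++) {
            const float w = sh[t] / sum;
            const float *vt = v + t * stride + h * head_dim;
            for (int d = 0; d < head_dim; d++) oh[d] += w * vt[d];
        }
    }
}
```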
Why Focus on CPUs? The Strategic Bet
The prevailing narrative suggests GPUs are the only viable platform for serious AI. This project challenges that notion by arguing that the true comparison is not "CPU vs. GPU," but rather the Open CPU Ecosystem versus the Proprietary NVIDIA Ecosystem.
The CPU Advantage: A Bet on Open, Commodity Hardware
We believe that the relentless pace of open hardware development will make CPUs an increasingly powerful and cost-effective platform for AI.
- Massive, Affordable Memory: The latest Intel Xeon and AMD EPYC CPUs support up to 12 channels of DDR5 memory per socket, with DDR6 on the horizon. This allows entire large models (175B+ parameters) to reside in high-bandwidth, commodity DRAM, eliminating the PCIe bottleneck and HBM capacity limits inherent to GPUs.
- Explosive Parallelism: With core counts exceeding 128 per socket and advanced SIMD instruction sets like AVX-512 for 1D tensor math and AMX for 2D matrix math, CPUs are becoming massively parallel compute engines in their own right.
- Accelerated Data Movement: On-chip accelerators like Intel's Data Streaming Accelerator (DSA) and DMA engines on ARM offload memory copy operations, freeing up compute cores to focus on arithmetic.
- Freedom from Vendor Lock-In: The entire CPU ecosystem is built on commodity hardware and open standards. There is no proprietary lock-in equivalent to NVIDIA's CUDA. This fosters competition, drives down cost, and guarantees that performance gains from new hardware generations (e.g., DDR6, 800 Gb/s Ethernet) are immediately accessible without being tied to a single vendor's roadmap.
The Long-Term Vision
As AI models become more efficient and capable, the raw performance gap between CPUs and specialized accelerators will narrow. The combination of ever-improving commodity hardware and sophisticated, cache-aware software design makes the CPU a powerful, open, and increasingly competitive contender for both inference and training. This project is a bet on that future.
System Architecture
- Memory: Single 2MB hugepage-backed contiguous arena.
- Allocator: Bump allocator with a dry-run mode for size estimation (see the arena sketch after this list).
- Parallelism: OpenMP with static thread-to-core binding.
- SIMD: AVX-512 with FMA (fallback to AVX2 possible).
- Compiler: GCC/ICC with -O3 -march=native -mavx512f.
- Target: Intel Xeon (Skylake-SP or newer) / AMD EPYC (Zen 4+).
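The arena and allocator rows above can be sketched roughly as follows. The struct, function names, and fallback behavior are illustrative assumptions rather than the exact implementation; the essential ideas are a single mmap of 2MB hugepages and a bump pointer that, in dry-run mode, only accumulates the requested sizes.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Illustrative arena: one contiguous, hugepage-backed block plus a bump pointer. */
typedef struct {
    uint8_t *base;     /* start of the arena (NULL in dry-run mode)     */
    size_t   offset;   /* current bump position / total bytes requested */
    size_t   capacity; /* usable size in bytes (0 in dry-run mode)      */
} Arena;

#define HUGEPAGE (2UL * 1024 * 1024)

/* Reserve a 2MB-hugepage-backed arena; fall back to normal pages if the
 * kernel has no hugepages reserved. Returns 0 on success, -1 on failure. */
static int arena_init(Arena *a, size_t bytes)
{
    size_t size = (bytes + HUGEPAGE - 1) & ~(HUGEPAGE - 1);  /* round up to 2MB */
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
    if (p == MAP_FAILED)
        p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) return -1;
    a->base = p; a->offset = 0; a->capacity = size;
    return 0;
}

/* Bump-allocate with 64-byte (cache line) alignment. With base == NULL this
 * acts as the dry run: it only measures how large the arena must be. */
static void *arena_alloc(Arena *a, size_t bytes)
{
    size_t aligned = (a->offset + 63) & ~(size_t)63;
    a->offset = aligned + bytes;
    if (a->base == NULL) return NULL;                /* dry run: count only */
    return (a->offset <= a->capacity) ? a->base + aligned : NULL;
}
```

A typical pattern: replay every allocation once against a dry-run arena (base == NULL) to learn the total size, call arena_init with that size, then run the same allocation sequence again to hand out real, cache-line-aligned pointers inside the hugepage block.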
See also
- layout_transformer for the detailed memory layout.
- transformer_layer_forward for the end-to-end data flow.