Transformer Efficiency Lab
⚠️ Warning: UNDER CONSTRUCTION ⚠️
This is a project to systematically manage, use, and compare the different subsystems that go into the modern transformer. The experiments are organized into four distinct phases:
- Phase 0 : Ground Work, reading the literature and building the mind, muscle, and intuition for training LLMs from the ground up. ✅
- Phase 1 : Dense Decoder Lab, focused on pre-training and architecture.
- Phase 2 : Inference Efficiency Lab: KV cache, inference, and serving.
- Phase 3 : Sparse Pre-Training Lab: everything MoE.
There are several other topics I wish to explore that are adjacent but currently out of scope: parallelism (DDP, TP, EP; maybe with JAX, or along the way as I train bigger models), state space models (Mamba), etc.
The first order of business is to establish a solid baseline for Phase 1, i.e. a model_spec and eval metrics; let's call this Phase 0. Once this is fortified, we can step into ablation studies in Phase 1: encoding, attention, KV cache, etc. After we have experimented through Phase 1, the best version of the model will be used as the baseline for Phase 2.
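To make the baseline concrete, a model_spec can simply be a frozen dataclass that every experiment consumes. This is only a minimal sketch; the class name, fields, and default values are illustrative assumptions, not the project's actual spec:

```python
from dataclasses import dataclass, asdict

# Hypothetical Phase-0 model spec; field names and defaults are
# illustrative assumptions, not the project's real configuration.
@dataclass(frozen=True)
class ModelSpec:
    vocab_size: int = 32_000
    d_model: int = 512
    n_layers: int = 8
    n_heads: int = 8
    seq_len: int = 1024

    def n_params_approx(self) -> int:
        # Rough count: token embeddings plus ~12 * d_model^2 weights
        # per transformer block (attention + MLP).
        return self.vocab_size * self.d_model + 12 * self.n_layers * self.d_model ** 2

spec = ModelSpec()
print(asdict(spec))          # serializable, so it can be logged alongside every run
print(spec.n_params_approx())  # ~41.5M params for these defaults
```

Freezing the dataclass keeps the baseline immutable; each ablation then derives a new spec via `dataclasses.replace`, so every run's configuration is explicit and loggable.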
Some pointers to self:
- Avoid one-off notebooks and scripts; build up a run-and-eval pipeline and use that for experiments. Develop with config-driven design in mind.
- Have compulsory reporting “If it’s not logged, it didn’t happen!”
- Define data constraints: a fixed token budget shared between Phase 1 and Phase 2.
- Be hypothesis driven and document everything.
Some similar projects out there that enable rapid experimentation:
- https://github.com/pytorch/torchtitan
- https://github.com/google-deepmind/simply
- https://github.com/karpathy/autoresearch (senpai dropped this yesterday!)