Transformer Efficiency Lab

⚠️ Warning: UNDER CONSTRUCTION ⚠️

This is a project to systematically manage, use, and compare the different subsystems that go into the modern transformer. The experiments are organized into four distinct phases:

  1. Phase 0: Groundwork. Reading the literature and building the mind, muscle, and intuition for training LLMs from first principles. ✅
  2. Phase 1: Dense Decoder Lab. Focus on pre-training and architecture.
  3. Phase 2: Inference Efficiency Lab. KV cache, inference, and serving.
  4. Phase 3: Sparse Pre-training Lab. Everything MoE.

There are several other topics I wish to explore that are within the realm but currently just out of scope: parallelism (DDP, TP, EP, perhaps with JAX or along the way as I train bigger models), state space models (Mamba), etc.

The first order of business is to establish a solid baseline for Phase 1, i.e. a model_spec and eval metrics; this is Phase 0. Once that is fortified, we can step into the Phase 1 ablation studies: encoding, attention, KV cache, etc. After we have experimented through Phase 1, the best version of the model will be used as the baseline for Phase 2.


Some pointers to self:

  • Avoid one-off notebooks and scripts; build a run-and-eval pipeline and use it for all experiments. Develop with configs in mind from the start.
  • Make reporting compulsory: "If it's not logged, it didn't happen!"
  • Define data constraints: a fixed token budget shared between Phase 1 and Phase 2.
  • Be hypothesis-driven and document everything.
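The pointers above can be sketched as a config-driven run spec. This is a minimal illustration, not code from this repo: `ModelSpec`, `ExperimentConfig`, and their fields are hypothetical names, but the idea is that every run is a validated config object (with a compulsory log location and an explicit token budget) rather than a one-off script.

```python
# Hypothetical sketch of a config-driven experiment spec; names and
# fields are illustrative, not from this project.
from dataclasses import dataclass, field, asdict
import json


@dataclass
class ModelSpec:
    # Baseline architecture knobs to be fixed in Phase 0.
    n_layers: int = 12
    d_model: int = 768
    n_heads: int = 12


@dataclass
class ExperimentConfig:
    name: str
    model: ModelSpec = field(default_factory=ModelSpec)
    token_budget: int = 10_000_000_000  # shared across Phase 1/2 ablations
    log_dir: str = "runs"               # reporting is compulsory

    def validate(self) -> None:
        # Enforce the data constraint and the "if it's not logged,
        # it didn't happen" rule before any compute is spent.
        if self.token_budget <= 0:
            raise ValueError("token_budget must be positive")
        if not self.log_dir:
            raise ValueError("every run must log somewhere")

    def to_json(self) -> str:
        # Serialize the full config so the run is reproducible from its log.
        return json.dumps(asdict(self), indent=2)


cfg = ExperimentConfig(name="baseline-dense-decoder")
cfg.validate()
print(cfg.to_json())
```

Each ablation then only swaps fields (e.g. a different attention variant in `ModelSpec`) while the pipeline, budget, and logging stay identical.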

Some other similar projects out there that allow rapid experimenting:
