Count tokens in The Pile training data.
- Download and decompress jsonl files.
- Compile count_tokens.c
- Run count_tokens data.jsonl temp.bin [output.csv]
Output CSV file format (one line per token):
<token_index>,<quantity>
<token_index>,<quantity>
All zero-quantity tokens will be at the end.
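The output format above can be read back with a few lines of C. The following is only a minimal sketch, not part of count_tokens.c: the helper names `parse_count_line` and `sum_quantities` are hypothetical, and it assumes that both fields are non-negative decimal integers that fit in the chosen types, which the format description does not state explicitly.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical helper: parse one "<token_index>,<quantity>" line.
   Returns 1 if both fields were read, 0 otherwise.
   Assumes both fields are non-negative decimal integers. */
int parse_count_line(const char *line, unsigned long *token_index,
                     unsigned long long *quantity)
{
    return sscanf(line, "%lu,%llu", token_index, quantity) == 2;
}

/* Hypothetical helper: sum the quantities of all well-formed lines
   read from `in` and count how many tokens have a non-zero quantity.
   Lines that do not match the format are skipped. */
unsigned long long sum_quantities(FILE *in, unsigned long *nonzero_tokens)
{
    char line[128];
    unsigned long idx;
    unsigned long long qty, total = 0;

    assert(in != NULL && nonzero_tokens != NULL);
    *nonzero_tokens = 0;
    while (fgets(line, sizeof line, in) != NULL) {
        if (!parse_count_line(line, &idx, &qty))
            continue;
        total += qty;
        if (qty > 0)
            (*nonzero_tokens)++;
    }
    return total;
}
```

To use it, open the csv file with `fopen("output.csv", "r")`, pass the stream to `sum_quantities`, and close it afterwards. The total is the overall token count, and `nonzero_tokens` is the number of distinct tokens that occur in the data.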
On Linux, just execute the run.sh script. It will do all the work automatically. ~60 GB of free disk space is required.