automainint/count_tokens

count_tokens

Count tokens in The Pile training data.

Usage

  • Download and decompress the jsonl files.
  • Compile count_tokens.c.
  • Run count_tokens data.jsonl temp.bin [output.csv].

Output CSV file format (one line per token):

<token_index>,<quantity>
<token_index>,<quantity>

All zero-quantity tokens will be at the end.
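
As a rough illustration of consuming this output, the following C sketch parses lines of the form <token_index>,<quantity> and totals the quantity column. It assumes both fields are non-negative decimal integers and that the file has no header row; the README does not state either. The function names parse_count_line and summarize_counts are hypothetical helpers and are not part of count_tokens.c.

```c
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: parse one "<token_index>,<quantity>" line.
   Assumes both fields are non-negative decimal integers.
   Returns 1 on success and 0 if the line is malformed. */
static int parse_count_line(const char *line,
                            unsigned long *token_index,
                            unsigned long long *quantity)
{
    char *end;
    const char *q;

    /* The first field must start with a digit (no sign, no spaces). */
    if (*line < '0' || *line > '9') return 0;
    *token_index = strtoul(line, &end, 10);
    if (*end != ',') return 0;

    /* The second field starts right after the comma. */
    q = end + 1;
    if (*q < '0' || *q > '9') return 0;
    *quantity = strtoull(q, &end, 10);

    /* Accept only end of string or a line terminator after the number. */
    return *end == '\0' || *end == '\n' || *end == '\r';
}

/* Hypothetical helper: read every line from `in`, skip malformed ones,
   and report the sum of all quantities and the number of tokens with a
   nonzero quantity. Returns the number of well-formed rows. */
static unsigned long summarize_counts(FILE *in,
                                      unsigned long long *total,
                                      unsigned long *nonzero)
{
    char line[128];
    unsigned long rows = 0;

    *total = 0;
    *nonzero = 0;
    while (fgets(line, sizeof line, in)) {
        unsigned long idx;
        unsigned long long qty;

        if (!parse_count_line(line, &idx, &qty)) continue;
        rows++;
        *total += qty;
        if (qty != 0) (*nonzero)++;
    }
    return rows;
}
```

A caller would open the output file with fopen, pass the stream to summarize_counts, and print the resulting totals. Because zero-quantity tokens are grouped at the end, a reader interested only in tokens that occur could also stop at the first row whose quantity is zero.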

Quick run

On Linux, run the run.sh script; it performs all of the steps above automatically. About 60 GB of free disk space is required.
