Lindera
A morphological analysis library in Rust. Lindera is forked from kuromoji-rs and aims to provide easy installation and concise APIs for tokenizing text in multiple languages.
Key Features
| Feature | Description |
|---|---|
| Morphological Analysis | Viterbi-based segmentation and part-of-speech tagging |
| Multi-language Support | Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba) |
| Dictionary System | Pre-built dictionaries, user dictionaries, and custom dictionary training |
| Text Processing Pipeline | Composable character filters and token filters for flexible text normalization |
| CRF Training | Train custom CRF models for dictionary cost estimation |
| Python Bindings | Use Lindera from Python via PyO3 |
| WebAssembly | Run Lindera in the browser via wasm-bindgen |
| Pure Rust | No C/C++ dependencies; works on any platform Rust supports |
Tokenization Flow
graph LR
subgraph Your Application
T["Text"]
end
subgraph Lindera
CF["Character Filters"]
SEG["Segmenter\n(Dictionary + Viterbi)"]
TF["Token Filters"]
end
T --> CF --> SEG --> TF --> R["Tokens"]
Document Map
| Section | Description |
|---|---|
| Getting Started | Installation, quick start, and examples |
| Dictionaries | Available dictionaries and how to use them |
| Configuration | YAML-based tokenizer configuration |
| Advanced Usage | User dictionaries, filters, and CRF training |
| CLI | Command-line interface reference |
| Architecture | Crate structure and design overview |
| API Reference | Rust API documentation |
| Contributing | How to contribute to Lindera |
Quick Example
use lindera::dictionary::load_dictionary; use lindera::mode::Mode; use lindera::segmenter::Segmenter; use lindera::tokenizer::Tokenizer; use lindera::LinderaResult; fn main() -> LinderaResult<()> { let dictionary = load_dictionary("embedded://ipadic")?; let segmenter = Segmenter::new(Mode::Normal, dictionary, None); let tokenizer = Tokenizer::new(segmenter); let text = "関西国際空港限定トートバッグ"; let mut tokens = tokenizer.tokenize(text)?; println!("text:\t{}", text); for token in tokens.iter_mut() { let details = token.details().join(","); println!("token:\t{}\t{}", token.surface.as_ref(), details); } Ok(()) }
Run the example:
cargo run --features=embed-ipadic --example=tokenize
Output:
text: 関西国際空港限定トートバッグ
token: 関西国際空港 名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token: 限定 名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token: トートバッグ 名詞,一般,*,*,*,*,*,*,*
License
Lindera is released under the MIT License.