# Lindera


A morphological analysis library in Rust. Lindera is a fork of kuromoji-rs and aims to provide easy installation and concise APIs for tokenizing text in multiple languages.

## Key Features

| Feature | Description |
|---------|-------------|
| Morphological Analysis | Viterbi-based segmentation and part-of-speech tagging |
| Multi-language Support | Japanese (IPADIC, IPADIC NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba) |
| Dictionary System | Pre-built dictionaries, user dictionaries, and custom dictionary training |
| Text Processing Pipeline | Composable character filters and token filters for flexible text normalization |
| CRF Training | Train custom CRF models for dictionary cost estimation |
| Python Bindings | Use Lindera from Python via PyO3 |
| WebAssembly | Run Lindera in the browser via wasm-bindgen |
| Pure Rust | No C/C++ dependencies; works on any platform Rust supports |

## Tokenization Flow

```mermaid
graph LR
    subgraph Your Application
        T["Text"]
    end
    subgraph Lindera
        CF["Character Filters"]
        SEG["Segmenter\n(Dictionary + Viterbi)"]
        TF["Token Filters"]
    end
    T --> CF --> SEG --> TF --> R["Tokens"]
```
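The three stages above can be sketched in plain Rust. The trait and type names below (`CharacterFilter`, `TokenFilter`, `segment`) are illustrative stand-ins, not Lindera's actual API, and the "segmenter" here is a whitespace split rather than a dictionary-driven Viterbi search:

```rust
// Illustrative sketch of the character-filter -> segmenter -> token-filter
// pipeline. Names are hypothetical; see the Quick Example for the real API.

trait CharacterFilter {
    fn apply(&self, text: String) -> String;
}

trait TokenFilter {
    fn apply(&self, tokens: Vec<String>) -> Vec<String>;
}

/// Example character filter: lowercase the raw text before segmentation.
struct Lowercase;
impl CharacterFilter for Lowercase {
    fn apply(&self, text: String) -> String {
        text.to_lowercase()
    }
}

/// Example token filter: drop empty tokens from the stream.
struct DropEmpty;
impl TokenFilter for DropEmpty {
    fn apply(&self, tokens: Vec<String>) -> Vec<String> {
        tokens.into_iter().filter(|t| !t.is_empty()).collect()
    }
}

/// Stand-in for the dictionary + Viterbi segmenter: a whitespace split.
fn segment(text: &str) -> Vec<String> {
    text.split(' ').map(String::from).collect()
}

fn tokenize(
    text: &str,
    char_filters: &[&dyn CharacterFilter],
    token_filters: &[&dyn TokenFilter],
) -> Vec<String> {
    // 1. Character filters normalize the raw text.
    let mut normalized = text.to_string();
    for f in char_filters {
        normalized = f.apply(normalized);
    }
    // 2. The segmenter turns normalized text into tokens.
    let mut tokens = segment(&normalized);
    // 3. Token filters transform the token stream.
    for f in token_filters {
        tokens = f.apply(tokens);
    }
    tokens
}

fn main() {
    let tokens = tokenize("Hello  Lindera", &[&Lowercase], &[&DropEmpty]);
    println!("{:?}", tokens); // ["hello", "lindera"]
}
```

The key design point the diagram encodes is ordering: character filters see raw text before the dictionary lookup, while token filters see the already-segmented stream.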

## Document Map

| Section | Description |
|---------|-------------|
| Getting Started | Installation, quick start, and examples |
| Dictionaries | Available dictionaries and how to use them |
| Configuration | YAML-based tokenizer configuration |
| Advanced Usage | User dictionaries, filters, and CRF training |
| CLI | Command-line interface reference |
| Architecture | Crate structure and design overview |
| API Reference | Rust API documentation |
| Contributing | How to contribute to Lindera |

## Quick Example

```rust
use lindera::dictionary::load_dictionary;
use lindera::mode::Mode;
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    let segmenter = Segmenter::new(Mode::Normal, dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let text = "関西国際空港限定トートバッグ";
    let mut tokens = tokenizer.tokenize(text)?;
    println!("text:\t{}", text);
    for token in tokens.iter_mut() {
        let details = token.details().join(",");
        println!("token:\t{}\t{}", token.surface.as_ref(), details);
    }

    Ok(())
}
```

Run the example:

```sh
cargo run --features=embed-ipadic --example=tokenize
```

Output:

```text
text:   関西国際空港限定トートバッグ
token:  関西国際空港    名詞,固有名詞,組織,*,*,*,関西国際空港,カンサイコクサイクウコウ,カンサイコクサイクーコー
token:  限定    名詞,サ変接続,*,*,*,*,限定,ゲンテイ,ゲンテイ
token:  トートバッグ    名詞,一般,*,*,*,*,*,*,*
```
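The example above runs in `Mode::Normal`, which keeps compound words such as 関西国際空港 as single tokens. Lindera also offers a decompose mode that penalizes long tokens so compounds split into their components. The sketch below assumes the same crate setup as the quick example and that a `Penalty` type is exported from `lindera::mode` alongside `Mode::Decompose`; check the API reference for the exact signature in your version:

```rust
use lindera::dictionary::load_dictionary;
use lindera::mode::{Mode, Penalty};
use lindera::segmenter::Segmenter;
use lindera::tokenizer::Tokenizer;
use lindera::LinderaResult;

fn main() -> LinderaResult<()> {
    let dictionary = load_dictionary("embedded://ipadic")?;
    // Decompose mode applies a length penalty during Viterbi search,
    // encouraging compounds like 関西国際空港 to split (e.g. 関西 / 国際 / 空港).
    let segmenter = Segmenter::new(Mode::Decompose(Penalty::default()), dictionary, None);
    let tokenizer = Tokenizer::new(segmenter);

    let mut tokens = tokenizer.tokenize("関西国際空港限定トートバッグ")?;
    for token in tokens.iter_mut() {
        println!("token:\t{}", token.surface.as_ref());
    }
    Ok(())
}
```

Decompose mode is commonly preferred for search indexing, where matching on the components of a compound improves recall.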

## License

Lindera is released under the MIT License.