sentencepiece 0.1.97
Channel: guix
Home page: https://github.com/google/sentencepiece
Licenses: ASL 2.0
Synopsis: Unsupervised tokenizer for neural network-based text generation
Description:
SentencePiece is an unsupervised text tokenizer and detokenizer, mainly for neural network-based text generation systems in which the vocabulary size is fixed before the neural model is trained. SentencePiece implements subword units, e.g., byte-pair encoding (BPE) and the unigram language model, with the extension of direct training from raw sentences. SentencePiece allows us to build a purely end-to-end system that does not depend on language-specific pre- or post-processing.
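
To make the idea of learning subword units directly from raw sentences concrete, here is a minimal toy sketch of byte-pair-encoding merges in pure Python. It is not SentencePiece's implementation or API: the function names (learn_bpe_merges, apply_merge, encode, decode) are hypothetical, the whitespace pre-splitting and tie-breaking rule are simplifying assumptions, and only the use of "▁" (U+2581) as a word-boundary marker follows SentencePiece's convention.

```python
from collections import Counter

BOUNDARY = "\u2581"  # "▁", the word-boundary marker SentencePiece uses


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with one merged symbol."""
    merged = []
    i = 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return tuple(merged)


def learn_bpe_merges(sentences, num_merges):
    """Learn up to `num_merges` merge rules from raw sentences (toy version)."""
    # Assumption: split on whitespace and start from single characters.
    word_counts = Counter()
    for sentence in sentences:
        for word in sentence.split():
            word_counts[tuple(BOUNDARY + word)] += 1

    merges = []
    for _ in range(num_merges):
        # Count how often each adjacent symbol pair occurs in the corpus.
        pair_counts = Counter()
        for symbols, count in word_counts.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += count
        if not pair_counts:
            break  # every word is already a single symbol
        # Pick the most frequent pair; ties are broken by the pair itself
        # so that the result is deterministic (an arbitrary choice).
        best = max(pair_counts, key=lambda p: (pair_counts[p], p))
        merges.append(best)
        new_counts = Counter()
        for symbols, count in word_counts.items():
            new_counts[apply_merge(symbols, best)] += count
        word_counts = new_counts
    return merges


def encode(word, merges):
    """Split one word into subword pieces by replaying the learned merges in order."""
    symbols = tuple(BOUNDARY + word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


def decode(pieces):
    """Join pieces back into text; the boundary marker becomes a space."""
    return "".join(pieces).replace(BOUNDARY, " ").strip()
```

For example, calling learn_bpe_merges(["low lower lowest", "low low"], 3) and then encode("lowest", merges) yields pieces that decode(...) joins back into "lowest" without any language-specific rules, which is the end-to-end property the description refers to.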
Total results: 3