TextNormalizer is a string normalizer that uses SentenceTransformers as a backbone to obtain vector representations of sentences.
It is designed for repeated normalization tasks against a large corpus of strings.
The main contribution of TextNormalizer is to save time by eliminating the need to recompute the embeddings of the normalized strings on every call.
pip install t-normalizer
- Create an instance of TextNormalizer. It can be initialized with a SentenceTransformer model or the path to a SentenceTransformer model.
- Obtain the vector representations of the normalized strings with the .fit method.
- Transform each string to its most similar normalized form using the .transform method.
from textnormalizer import TextNormalizer
normalizer = TextNormalizer()
normalized_text = ['senior software engineer', 'solutions architect', 'junior software developer']
to_normalize = ['experienced software engineer', 'software architect', 'entry level software engineer']
normalizer.fit(normalized_text)
transformed = normalizer.transform(to_normalize)
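To make the time saving concrete, here is a minimal sketch of the fit-once, transform-many pattern described above. It is not TextNormalizer's actual implementation: CachedNormalizer and the embed and cosine helpers are hypothetical names, and embed is a toy bag-of-words encoder standing in for a SentenceTransformer model so that the example runs with the standard library alone.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for a sentence encoder: unit-length bag-of-words counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}  # empty input has no direction
    return {word: c / norm for word, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors."""
    return sum(weight * b.get(word, 0.0) for word, weight in a.items())


class CachedNormalizer:
    """Hypothetical sketch: embed the reference strings once, reuse them per query."""

    def fit(self, normalized):
        self.normalized = list(normalized)
        # Computed a single time here, not on every transform call.
        self.vectors = [embed(s) for s in self.normalized]
        return self

    def transform(self, texts):
        results = []
        for text in texts:
            query = embed(text)  # only the incoming string is embedded
            scores = [cosine(query, v) for v in self.vectors]
            best = max(range(len(scores)), key=scores.__getitem__)
            results.append(self.normalized[best])
        return results


sketch = CachedNormalizer().fit(
    ['senior software engineer', 'solutions architect', 'junior software developer']
)
print(sketch.transform(['experienced software engineer', 'software architect']))
# prints ['senior software engineer', 'solutions architect']
```

Only the incoming strings are embedded on each transform call, so the cost of encoding the reference corpus is paid once in fit. A real sentence encoder captures meaning beyond shared words, which this toy encoder does not.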
The model, along with the normalized strings and their vector representations, can be saved and loaded with the .save and .load methods.
# save
normalizer.save('path/to/model')
# load
model = TextNormalizer.load('path/to/model')