Skip to content
yooper edited this page Oct 28, 2017 · 5 revisions

Helpful Shortcut Methods

Helpers help simplify the process of text analysis.

Tokenization

$tokens = tokenize($text);

You can customize which type of tokenizer to tokenize with by passing in the name of the tokenizer class

$tokens = tokenize($text, \TextAnalysis\Tokenizers\PennTreeBankTokenizer::class);

The default tokenizer is \TextAnalysis\Tokenizers\GeneralTokenizer::class . Some tokenizers require parameters to be set upon instantiation.

Normalization

By default, normalize_tokens uses the function strtolower to lowercase all the tokens. To customize the normalize function, pass in either a function or a string to be used by array_map.

$normalizedTokens = normalize_tokens(array $tokens); 
$normalizedTokens = normalize_tokens(array $tokens, 'mb_strtolower');

$normalizedTokens = normalize_tokens(array $tokens, function($token){ return mb_strtoupper($token); });

Frequency Distributions

The call to freq_dist returns a FreqDist instance.

$freqDist = freq_dist(tokenize($text));

Ngram Generation

By default bigrams are generated.

$bigrams = ngrams($tokens);

Customize the ngrams

// create trigrams with a pipe delimiter in between each word
$trigrams = ngrams($tokens,3, '|');