The word AI is classified as the word be during POS tagging. #141

Open
moskaliukua opened this issue Aug 21, 2024 · 1 comment

@moskaliukua

Hi,
I have run into a problem with POS tagging in sentences like:
"It is an AI"
It seems to be consistent in other sentences as well:

"it made a lot of waves in the AI field."
I would expect the word "AI" to be tagged as PROPN, but instead I get AUX, and the lemma is "be".

import winkNLP from 'wink-nlp';
import model from 'wink-eng-lite-web-model';
const nlp = winkNLP(model);
// Helper namespace for extracting token properties such as the lemma.
const its = nlp.its;
const doc = nlp.readDoc('It is an AI.');
console.log(doc.tokens().out(its.lemma));
 // [ 'it', 'be', 'an', 'be', '.' ]
doc.printTokens();

token      p-spaces   prefix  suffix  shape   case    nerHint type     normal/pos
———————————————————————————————————————————————————————————————————————————————————————
It                0   It      It      Xx      3       0       word     it / PRON
is                1   is      is      xx      1       0       word     is / AUX
an                1   an      an      xx      1       0       word     an / DET
AI                1   AI      AI      XX      2       0       word     ai / **AUX**
.                 0   .       .       .       0       0       punctuat . / PUNCT


total number of tokens: 5

versions of packages:
"wink-eng-lite-web-model": "^1.8.0",
"wink-nlp": "^2.3.0",

@moskaliukua changed the title from 'The word AI is classified as the word be during POS tagging."' to 'The word AI is classified as the word be during POS tagging.' on Aug 22, 2024
@rachnachakraborty
Member

Hi @moskaliukua,

Thanks for highlighting this issue.

The lexicon was trained on a corpus containing archaic words like "Ain't". This gets tokenised as two tokens, 'Ai' and 'not', where 'Ai' is an auxiliary verb. As a result, "AI" is also tagged as AUX with the lemma "be".

We plan to rebuild the lexicon soon with this correction incorporated.

We will keep you posted.

Best,
Rachna
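
Until a rebuilt model is available, one possible stopgap is to patch the tags after extraction. The sketch below is only an illustration: `patchAiTags` is a hypothetical helper, not part of wink-nlp, and it assumes you have already pulled parallel arrays out of the document with `doc.tokens().out(its.value)`, `doc.tokens().out(its.pos)` and `doc.tokens().out(its.lemma)`. Retagging as PROPN follows the expectation stated in the report, and using "ai" as the lemma is an assumption based on the normal form shown by `printTokens()`.

```javascript
// Hypothetical helper (not part of wink-nlp): takes parallel arrays of token
// values, POS tags and lemmas, and returns patched copies of the last two.
function patchAiTags(values, pos, lemmas) {
  const patchedPos = pos.slice();
  const patchedLemmas = lemmas.slice();
  values.forEach((value, i) => {
    // Only touch the exact upper-case acronym that was tagged AUX, so the
    // "Ai" produced by tokenising "Ain't" keeps its auxiliary reading.
    if (value === 'AI' && pos[i] === 'AUX') {
      patchedPos[i] = 'PROPN';   // tag expected in the report
      patchedLemmas[i] = 'ai';   // assumption: reuse the normal form
    }
  });
  return { pos: patchedPos, lemmas: patchedLemmas };
}

// Values copied from the report above for "It is an AI."
const values = ['It', 'is', 'an', 'AI', '.'];
const pos = ['PRON', 'AUX', 'DET', 'AUX', 'PUNCT'];
const lemmas = ['it', 'be', 'an', 'be', '.'];

const patched = patchAiTags(values, pos, lemmas);
console.log(patched.pos);    // [ 'PRON', 'AUX', 'DET', 'PROPN', 'PUNCT' ]
console.log(patched.lemmas); // [ 'it', 'be', 'an', 'ai', '.' ]
```

The helper returns new arrays and leaves the wink-nlp document untouched, so it can be dropped once the corrected lexicon ships.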
