Formatting Issues while converting NLM-Chem corpus & question about converting relations #8

Open
mobashgr opened this issue Jul 4, 2022 · 5 comments


mobashgr commented Jul 4, 2022

Hi Lenz!
Great library, and a real life saver. However, I have had to do extensive post-processing after converting BioC XML to CoNLL, even with byte_offsets set to False. Briefly, the first problem is that many tokens and their corresponding labels are merged as if they were a single token. The second problem is that some labels look like "B-Chemicalentity;B-Chemicalentity". Here is the corpus that I am using.

Regarding the relations, can you provide an example or extra hints, beyond the ones in the documentation, for converting relations from BioC XML to CoNLL?

Best,
Ghadeer

Owner

lfurrer commented Jul 4, 2022

Hi Ghadeer,
Converting BioC to CoNLL can be tricky, because CoNLL doesn't have the same expressive power as a stand-off format like BioC. In BioC, annotations may have gaps or they may overlap, but in CoNLL this doesn't work nicely, so simplification is needed, as described in the entity-flattening docs. The "B-Chemicalentity;B-Chemicalentity" label suggests that you specified avoid_overlaps=None instead of the default "keep-longer" strategy.
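
To illustrate what a "keep-longer" strategy means for overlapping annotations, here is a minimal, library-independent sketch. It is not bconv's implementation: the helper names `keep_longer` and `iob_labels`, the tuple layout, and the sample offsets are all assumptions made up for this example.

```python
from typing import List, Tuple

# (start, end, label) with an exclusive end offset; layout is an assumption.
Span = Tuple[int, int, str]


def keep_longer(spans: List[Span]) -> List[Span]:
    """Keep the longest spans and drop any span overlapping a kept one."""
    kept: List[Span] = []
    # Visit the longest spans first; ties are broken by start offset.
    for span in sorted(spans, key=lambda s: (-(s[1] - s[0]), s[0])):
        if all(span[1] <= k[0] or span[0] >= k[1] for k in kept):
            kept.append(span)
    return sorted(kept)


def iob_labels(tokens: List[Tuple[int, int]], spans: List[Span]) -> List[str]:
    """Assign exactly one IOB label per token from non-overlapping spans."""
    labels = []
    for start, end in tokens:
        label = "O"
        for s_start, s_end, s_type in spans:
            if start >= s_start and end <= s_end:
                prefix = "B" if start == s_start else "I"
                label = f"{prefix}-{s_type}"
                break
        labels.append(label)
    return labels


text = "sodium chloride solution"
tokens = [(0, 6), (7, 15), (16, 24)]               # token offsets in `text`
spans = [(0, 15, "Chemical"), (0, 6, "Chemical")]  # nested annotations
flat = keep_longer(spans)
labels = iob_labels(tokens, flat)
```

Without the flattening step, the first token would be covered by two annotations at once, which is the situation where a CoNLL writer has to either pick one label or join several into a single cell.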

But it's also possible that there's a bug or an edge case I didn't consider. I'm sure I can get to the root of the problem if you provide the code you used for conversion and a minimal excerpt of the BioC file along with the unexpected output (e.g. one paragraph with one or two annotations). The same goes for the many-to-one mappings you mentioned.

Relations aren't supported in the CoNLL format, as far as bconv is concerned at least. I wouldn't know how to represent relations in CoNLL (maybe derive something from the dependency notation used in the scheme for syntax parsing?). Have you seen an example of relations encoded in CoNLL in the bio/med domain?

Author

mobashgr commented Jul 5, 2022

Hi Lenz,
Thanks for the heads-up. I will try it out and keep you posted. Sure, I can provide the code I used for conversion, a sample of the BioC file, and a snippet of the buggy CoNLL output.

Yes, you are completely right, I mixed things up. I meant a simple way to extract relations from BioC XML. The documentation isn't clear to me, or at least it doesn't have an example or hint on relation conversion (please correct me if I am wrong).

Best,
Ghadeer

Owner

lfurrer commented Jul 6, 2022

I had a look at the BC7 corpus and realised it can't be parsed by bconv. I then found out that BioC allows annotations without a <location> element, i.e. entities that aren't text-bound. I wasn't aware of that, but the DTD clearly allows it, so this is definitely an issue in bconv. I'm not so sure how to deal with this, because the assumption that entities are anchored in the text is built deep into bconv's data model... I'll try to come up with a solution eventually.
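
To check whether a given BioC XML file contains such annotations, something like the following sketch may help. It uses only the Python standard library, not bconv. The inline XML snippet, the annotation type "Indexing", and the function name `unanchored_annotation_ids` are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up BioC excerpt: annotation 1 is text-bound, annotation 2 is not.
BIOC_SNIPPET = """\
<collection>
  <source>example</source>
  <date>2022</date>
  <key>example.key</key>
  <document>
    <id>1</id>
    <passage>
      <offset>0</offset>
      <text>Lidocaine was given.</text>
      <annotation id="1">
        <infon key="type">Chemical</infon>
        <location offset="0" length="9"/>
        <text>Lidocaine</text>
      </annotation>
      <annotation id="2">
        <infon key="type">Indexing</infon>
      </annotation>
    </passage>
  </document>
</collection>
"""


def unanchored_annotation_ids(xml_text):
    """Return the ids of all annotations lacking a <location> child."""
    root = ET.fromstring(xml_text)
    return [
        anno.get("id")
        for anno in root.iter("annotation")
        if anno.find("location") is None
    ]


missing = unanchored_annotation_ids(BIOC_SNIPPET)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. Whether to drop or keep the reported annotations before conversion depends on the use case.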

Owner

lfurrer commented Jul 6, 2022

For the relations, this little REPL log may be of help:

>>> import bconv
>>> coll = bconv.load('test/data/bioc_xml/BC5CDR-example.xml', fmt='bioc_xml')
>>> doc = coll[0]
>>> doc
<Document with 2 sections at 0x7f567ee241c0>
>>> rel = next(doc.iter_relations())
>>> rel
<Relation with 2 members at 0x7f565ef857c0>
>>> rel.type
'CID'
>>> rel[0]
RelationMember(refid='1', role='Chemical')
>>> rel[1]
RelationMember(refid='2', role='Disease')
>>> entities_by_refid = {e.id: e for e in doc.iter_entities()}
>>> e = entities_by_refid[rel[0].refid]
>>> e
<bconv.doc.document.Entity at 0x7f565ee6a540>
>>> e.text
'Lidocaine'
>>> e.metadata
{'type': 'Chemical', 'cui': 'D008012'}
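
The lookup shown in the log above can be wrapped into a small helper that writes one tab-separated row per relation. The sketch below uses stand-in namedtuples instead of real bconv objects so that it runs on its own. The `Relation.members` field, the function name `relation_rows`, and the text and cui values of the second entity are placeholders, not taken from bconv or from the corpus.

```python
from collections import namedtuple

# Stand-ins mirroring the attributes seen in the REPL log (assumed shapes).
RelationMember = namedtuple("RelationMember", "refid role")
Entity = namedtuple("Entity", "id text metadata")
Relation = namedtuple("Relation", "type members")

entities = [
    Entity("1", "Lidocaine", {"type": "Chemical", "cui": "D008012"}),
    # Placeholder values: the log does not show the second entity.
    Entity("2", "(disease mention)", {"type": "Disease", "cui": "PLACEHOLDER"}),
]
relations = [
    Relation("CID", [RelationMember("1", "Chemical"),
                     RelationMember("2", "Disease")]),
]


def relation_rows(relations, entities):
    """Resolve each member's refid to its entity and build one TSV row."""
    by_id = {e.id: e for e in entities}
    rows = []
    for rel in relations:
        row = [rel.type]
        for member in rel.members:
            ent = by_id[member.refid]
            row.extend([member.role, ent.text, ent.metadata.get("cui", "")])
        rows.append("\t".join(row))
    return rows


rows = relation_rows(relations, entities)
```

With real bconv documents, the entities and relations would presumably come from `doc.iter_entities()` and `doc.iter_relations()` as in the log, and the members would be reached by indexing the relation (`rel[0]`, `rel[1]`) rather than through a `members` field.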

Author

mobashgr commented Jul 7, 2022

bconv was able to convert NLM-Chem (track 2) in BC7, for example, but I had to do a lot of post-processing, as I explained before. This corpus has <location> elements. I will try the solution that you proposed and get back to you. Many thanks for the heads-up.
