Formatting Issues while converting NLM-Chem corpus & question about converting relations #8
Comments
Hi Ghadeer, But it's also possible that there's a bug or an edge case I didn't consider. I'm sure I can get to the root of the problem if you provide the code you used for conversion and a minimal excerpt of the BioC file along with the unexpected output (e.g. one paragraph with one or two annotations). The same goes for the many-to-one mappings you mentioned. Relations aren't supported in the CoNLL format, as far as |
Hi Lenz, Yes, you are completely right. I just mixed things up. I meant the trivial way to extract relations from BioC XML. The documentation isn't clear to me, or at least it doesn't have an example or hint on converting relations (please correct me if I am wrong). Best, |
I had a look at the BC7 corpus and realised it can't be parsed by |
For the relations, maybe this little REPL log will be of help:
>>> import bconv
>>> coll = bconv.load('test/data/bioc_xml/BC5CDR-example.xml', fmt='bioc_xml')
>>> doc = coll[0]
>>> doc
<Document with 2 sections at 0x7f567ee241c0>
>>> rel = next(doc.iter_relations())
>>> rel
<Relation with 2 members at 0x7f565ef857c0>
>>> rel.type
'CID'
>>> rel[0]
RelationMember(refid='1', role='Chemical')
>>> rel[1]
RelationMember(refid='2', role='Disease')
>>> entities_by_refid = {e.id: e for e in doc.iter_entities()}
>>> e = entities_by_refid[rel[0].refid]
>>> e
<bconv.doc.document.Entity at 0x7f565ee6a540>
>>> e.text
'Lidocaine'
>>> e.metadata
{'type': 'Chemical', 'cui': 'D008012'}
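The steps in the REPL log can be wrapped into a small helper that resolves every relation member to its entity. This is only a sketch: it relies solely on the attributes shown in the log above (`iter_relations()`, `iter_entities()`, `rel.type`, `refid`, `role`, `id`, `text`, `metadata`) and assumes that a relation can be iterated over to yield its members, which the log only shows via indexing. The names `resolve_relations` and `print_relations` are made up for this example and are not part of bconv.

```python
def resolve_relations(doc):
    """Yield (relation_type, members) for each relation in a document.

    `members` is a list of (role, entity_text, entity_metadata) tuples.
    Assumption: iterating over a relation yields its members in order.
    """
    # Same lookup table as in the REPL log: entity ID -> entity object.
    entities_by_refid = {e.id: e for e in doc.iter_entities()}
    for rel in doc.iter_relations():
        members = []
        for member in rel:
            entity = entities_by_refid.get(member.refid)
            if entity is None:
                # The refid does not point to a known entity; skip it.
                continue
            members.append((member.role, entity.text, entity.metadata))
        yield rel.type, members


def print_relations(path):
    """Hypothetical usage: print all relations of the first document."""
    import bconv  # imported here so the helper above works without bconv

    coll = bconv.load(path, fmt='bioc_xml')
    for rel_type, members in resolve_relations(coll[0]):
        print(rel_type, members)
```

Calling `print_relations('test/data/bioc_xml/BC5CDR-example.xml')` should list each relation type together with the text and metadata of its members. From there, the tuples can be written to whatever tabular format is needed, since CoNLL itself has no place for relations.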
For example, bconv was able to convert NLM-Chem (track 2) in BC7, but I had to do a lot of post-processing, as I explained before. This corpus has |
Hi Lenz!
Great library and a life saver! However, I want to point out that I have been doing extensive post-processing after converting BioC XML to CoNLL, even when I set byte_offsets to False. Briefly, the first problem is that many tokens and their corresponding labels appear merged, as if they were a single token. The second problem is that the labels look something like "B-Chemicalentity;B-Chemicalentity". Here is the corpus that I am using.
Regarding the relations, can you provide an example or extra hints, other than the ones in the documentation, for converting relations from BioC XML to CoNLL?
Best,
Ghadeer
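For reference, the kind of post-processing described for the duplicated labels could look like the sketch below. It assumes that the labels are joined with ";" and that exact duplicates can simply be dropped; it works around the symptom only and does not address why the duplicates appear. The function name `collapse_label` is made up for this example.

```python
def collapse_label(label, sep=";"):
    """Drop repeated parts of a joined label, keeping first-seen order."""
    parts = label.split(sep)
    # dict.fromkeys removes duplicates while preserving insertion order.
    unique = list(dict.fromkeys(parts))
    return sep.join(unique)
```

Labels that combine genuinely different tags (for instance two different entity types on one token) are left unchanged and still need a manual decision.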