Strange header misclassification from DL #1209

Open
lfoppiano opened this issue Dec 18, 2024 · 1 comment

@lfoppiano
Collaborator

I'm seeing a strange misclassification from the DL model (the CRF model works better in this specific case).

The label changes from <availability> to <abstract>, but the new segment does not start with the I- prefix:

:	:	:	:	:	:	:	:	:	:	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	0	0	0	0	0	0	0	PUNCT	0	0	1	0	<availability>
10	10	1	10	10	10	0	10	10	10	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	0	0	0	0	0	1	0	0	NOPUNCT	0	0	1	0	<abstract>

Here the larger context:

Availability	availability	A	Av	Ava	Avai	y	ty	ity	lity	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	1	0	INITCAP	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<other>
Statement	statement	S	St	Sta	Stat	t	nt	ent	ment	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	1	0	INITCAP	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<other>
:	:	:	:	:	:	:	:	:	:	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	1	0	ALLCAP	NODIGIT	1	0	0	0	0	0	0	0	PUNCT	0	0	1	0	<other>
The	the	T	Th	The	The	e	he	The	The	BLOCKIN	LINEIN	ALIGNEDLEFT	NEWFONT	SAMEFONTSIZE	0	0	INITCAP	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	I-<availability>
raw	raw	r	ra	raw	raw	w	aw	raw	raw	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
sequencing	sequencing	s	se	seq	sequ	g	ng	ing	cing	BLOCKIN	LINEEND	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	0	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
reads	reads	r	re	rea	read	s	ds	ads	eads	BLOCKIN	LINESTART	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
for	for	f	fo	for	for	r	or	for	for	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
metagenomic	metagenomic	m	me	met	meta	c	ic	mic	omic	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	0	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
samples	samples	s	sa	sam	samp	s	es	les	ples	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	1	1	0	0	1	0	0	NOPUNCT	0	0	1	0	<availability>
used	used	u	us	use	used	d	ed	sed	used	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	1	0	0	NOPUNCT	0	0	1	0	<availability>
in	in	i	in	in	in	n	in	in	in	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	1	1	0	0	1	0	0	NOPUNCT	0	0	1	0	<availability>
this	this	t	th	thi	this	s	is	his	this	BLOCKIN	LINEEND	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	1	0	0	NOPUNCT	0	0	1	0	<availability>
study	study	s	st	stu	stud	y	dy	udy	tudy	BLOCKIN	LINESTART	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
were	were	w	we	wer	were	e	re	ere	were	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
downloaded	downloaded	d	do	dow	down	d	ed	ded	aded	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	0	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
from	from	f	fr	fro	from	m	om	rom	from	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
public	public	p	pu	pub	publ	c	ic	lic	blic	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
repositories	repositories	r	re	rep	repo	s	es	ies	ries	BLOCKIN	LINEEND	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
listed	listed	l	li	lis	list	d	ed	ted	sted	BLOCKIN	LINESTART	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
in	in	i	in	in	in	n	in	in	in	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	1	1	0	0	1	0	0	NOPUNCT	0	0	1	0	<availability>
the	the	t	th	the	the	e	he	the	the	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
following	following	f	fo	fol	foll	g	ng	ing	wing	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
publications	publications	p	pu	pub	publ	s	ns	ons	ions	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	NODIGIT	0	0	1	0	0	0	0	0	NOPUNCT	0	0	1	0	<availability>
:	:	:	:	:	:	:	:	:	:	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	0	0	0	0	0	0	0	PUNCT	0	0	1	0	<availability>
10	10	1	10	10	10	0	10	10	10	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	0	0	0	0	0	1	0	0	NOPUNCT	0	0	1	0	<abstract>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	0	0	0	0	0	0	0	DOT	0	0	1	0	<abstract>
1038	1038	1	10	103	1038	8	38	038	1038	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	0	0	0	1	0	0	0	0	NOPUNCT	0	0	1	0	<abstract>
/	/	/	/	/	/	/	/	/	/	BLOCKIN	LINEEND	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	0	0	0	0	0	0	0	NOPUNCT	0	0	1	0	<abstract>
nature11209	nature11209	n	na	nat	natu	9	09	209	1209	BLOCKIN	LINESTART	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	CONTAINSDIGITS	0	0	0	1	0	0	0	0	NOPUNCT	0	0	1	0	<abstract>
,	,	,	,	,	,	,	,	,	,	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	0	0	0	0	0	0	0	COMMA	0	0	1	0	<abstract>
10	10	1	10	10	10	0	10	10	10	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	0	0	0	0	0	1	0	0	NOPUNCT	0	0	1	0	<abstract>
.	.	.	.	.	.	.	.	.	.	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	0	0	0	0	0	0	0	DOT	0	0	1	0	<abstract>
1038	1038	1	10	103	1038	8	38	038	1038	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	ALLDIGIT	0	0	0	1	0	0	0	0	NOPUNCT	0	0	1	0	<abstract>
/	/	/	/	/	/	/	/	/	/	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	ALLCAP	NODIGIT	1	0	0	0	0	0	0	0	NOPUNCT	0	0	1	0	<abstract>
nature11450	nature11450	n	na	nat	natu	0	50	450	1450	BLOCKIN	LINEIN	ALIGNEDLEFT	SAMEFONT	SAMEFONTSIZE	0	0	NOCAPS	CONTAINSDIGITS	0	0	0	1	0	0	0	0	NOPUNCT	0	0	1	0	<abstract>

The portion that is misclassified as <abstract> is then silently dropped from the output.
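
To spot this kind of problem in a labelled dump like the one above, a small check along these lines might help. It is only a sketch: it assumes the rows are tab-separated with the label in the last column, and `find_unmarked_label_changes` is a hypothetical helper name, not part of GROBID or DeLFT.

```python
def find_unmarked_label_changes(lines):
    """Return (index, previous_label, label) for every row whose label
    differs from the previous row's label but lacks the I- prefix."""
    violations = []
    previous = None
    for index, line in enumerate(lines):
        if not line.strip():
            # Assumption: a blank line separates two independent sequences.
            previous = None
            continue
        # Assumption: the label is the last tab-separated column.
        label = line.rstrip("\n").split("\t")[-1]
        has_prefix = label.startswith("I-")
        base = label[2:] if has_prefix else label
        if previous is not None and base != previous and not has_prefix:
            violations.append((index, previous, label))
        previous = base
    return violations


# Rows shortened to three columns for illustration; real rows carry all features.
rows = [
    "The\tthe\tI-<availability>",
    "raw\traw\t<availability>",
    ":\t:\t<availability>",
    "10\t10\t<abstract>",
    ".\t.\t<abstract>",
]
print(find_unmarked_label_changes(rows))
```

On the shortened rows, the only row reported is the `10` token, where the label switches to `<abstract>` without an `I-` prefix.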

PDF: journal.pbio.3002472.pdf

@lfoppiano
Collaborator Author

A quick fix that checks that the first token after a change of label always carries the I- prefix would fix the issue, but the DeLFT model should never output this incorrect labelling in the first place.

<div type="availability">
                <div
                    xmlns="http://www.tei-c.org/ns/1.0">
                    <p>The raw sequencing reads for the metagenomic samples used in this study were downloaded from public repositories listed in the following publications: 10.1038/ nature11209, 10.1038/nature11450, 10.1016/j. cels.2016.10.004, and 10.1101/gr.233940.117. Data underlying all figures, such as the numerical values of bar plots, can be found in 10.5281/ zenodo.10304481. All other metadata, as well as the source code for the sequencing pipeline, downstream analyses, and figure generation are available at Zenodo (10.5281/zenodo.10368227) or GitHub (
                        <ref type="url" target="https://github.com/zhiru-liu/microbiome_evolution">https://github.com/zhiru-liu/microbiome_ evolution</ref>).
                    </p>
                </div>
            </div>
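
The quick fix mentioned in the comment above could look roughly like the following post-processing step. This is a sketch under assumptions, not the actual GROBID patch: `enforce_begin_prefix` is a hypothetical name, the input is assumed to be the list of predicted labels for one sequence, and the first label of the sequence is left untouched.

```python
def enforce_begin_prefix(labels):
    """Add the I- prefix to any label that differs from the previous
    label but was emitted without it. Other labels are kept as-is."""
    fixed = []
    previous = None
    for label in labels:
        has_prefix = label.startswith("I-")
        base = label[2:] if has_prefix else label
        if previous is not None and base != previous and not has_prefix:
            # Label changed without a begin marker: repair it.
            fixed.append("I-" + base)
        else:
            fixed.append(label)
        previous = base
    return fixed


predicted = ["I-<availability>", "<availability>", "<abstract>", "<abstract>"]
print(enforce_begin_prefix(predicted))
```

Note that this only repairs the segmentation. The `10.1038/...` tokens would still be labelled `<abstract>`, so the underlying model error remains.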
