Duplicates in data? #44

Open
geraltofrivia opened this issue Aug 18, 2021 · 0 comments

Hi,

The following is most likely a misunderstanding on my part, but I notice that there are many duplicates and near-duplicates in the jsonl files.

For instance, this line in lama/TREx/P17.jsonl:

{
	"uuid": "df10f035-6269-4cdf-88df-26395e0dc3b4",
	"obj_uri": "Q16",
	"obj_label": "Canada",
	"sub_uri": "Q7517499",
	"sub_label": "Simcoe Composite School",
	"predicate_id": "P17",
	"evidences": [{
		"sub_surface": "Simcoe Composite School",
		"obj_surface": "Canada",
		"masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK]."
	}, {
		"sub_surface": "Simcoe Composite School",
		"obj_surface": "Canada",
		"masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK]."
	}]
}

has two evidences that are identical. This is not always the case: in many other records the evidences are different sentences.
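A minimal sketch of how exact-duplicate evidences could be dropped from a single record, assuming each record is shaped like the one above. `dedupe_evidences` is a hypothetical helper written for illustration, not part of LAMA, and whether removing them is the intended handling is exactly the open question here.

```python
import json


def dedupe_evidences(record):
    """Return a copy of `record` with exact-duplicate evidences removed.

    Hypothetical helper (not part of LAMA). Two evidences count as duplicates
    only if all of their fields match; the first occurrence is kept.
    """
    seen = set()
    unique = []
    for evidence in record.get("evidences", []):
        # Serialize with sorted keys so field order does not affect equality.
        key = json.dumps(evidence, sort_keys=True)
        if key not in seen:
            seen.add(key)
            unique.append(evidence)
    return {**record, "evidences": unique}


# The P17 record quoted above, with its two identical evidences.
evidence = {
    "sub_surface": "Simcoe Composite School",
    "obj_surface": "Canada",
    "masked_sentence": "Simcoe Composite School is a high school in Simcoe, Ontario, [MASK].",
}
record = {
    "uuid": "df10f035-6269-4cdf-88df-26395e0dc3b4",
    "obj_uri": "Q16",
    "obj_label": "Canada",
    "sub_uri": "Q7517499",
    "sub_label": "Simcoe Composite School",
    "predicate_id": "P17",
    "evidences": [dict(evidence), dict(evidence)],
}

cleaned = dedupe_evidences(record)
print(len(record["evidences"]), "->", len(cleaned["evidences"]))
```

Records whose evidences are genuinely different sentences pass through unchanged, since only fully identical evidence objects are collapsed.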

Further, in the conceptnet corpora, apparently every UUID appears twice. As an example, here are two instances with the same UUID:

{
	"sub": "alive",
	"obj": "think",
	"pred": "HasSubevent",
	"masked_sentences": ["One of the things you do when you are alive is [MASK]."],
	"obj_label": "think",
	"uuid": "d4f11631dde8a43beda613ec845ff7d1"
}

and

{
	"pred": "HasSubevent",
	"masked_sentences": ["One of the things you do when you are alive is [MASK]."],
	"obj_label": "think",
	"uuid": "d4f11631dde8a43beda613ec845ff7d1",
	"sub_label": "alive"
}

Here, the second instance lacks the sub and obj fields (and carries sub_label instead) but otherwise appears unchanged.
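One way to check for and collapse such repeated UUIDs is sketched below. It assumes one JSON object per line and that records sharing a UUID are meant to describe the same instance, which is an assumption on my part rather than something the repository states. `merge_by_uuid` is a hypothetical helper.

```python
import json
from collections import defaultdict


def merge_by_uuid(lines):
    """Merge JSONL records that share a uuid (hypothetical helper).

    Fields are unioned across records with the same uuid. If two records
    disagree on a field, the first value is kept and the field name is
    reported in `conflicts` instead of being silently overwritten.
    """
    merged = {}
    conflicts = defaultdict(list)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        uid = record["uuid"]
        target = merged.setdefault(uid, {})
        for key, value in record.items():
            if key in target and target[key] != value:
                conflicts[uid].append(key)
            else:
                target[key] = value
    return merged, dict(conflicts)


# The two ConceptNet instances quoted above, as JSONL lines.
lines = [
    '{"sub": "alive", "obj": "think", "pred": "HasSubevent", '
    '"masked_sentences": ["One of the things you do when you are alive is [MASK]."], '
    '"obj_label": "think", "uuid": "d4f11631dde8a43beda613ec845ff7d1"}',
    '{"pred": "HasSubevent", '
    '"masked_sentences": ["One of the things you do when you are alive is [MASK]."], '
    '"obj_label": "think", "uuid": "d4f11631dde8a43beda613ec845ff7d1", '
    '"sub_label": "alive"}',
]

merged, conflicts = merge_by_uuid(lines)
print(len(lines), "lines ->", len(merged), "unique uuid(s)")
print(sorted(merged["d4f11631dde8a43beda613ec845ff7d1"]))
```

For the pair above, the merged record keeps sub and obj from the first line and sub_label from the second, and no conflicting fields are reported.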


So, based on this, my question is:

Are the duplicates intentional? For instance, when computing my model's metrics over the probe, should I treat the data as-is and, if need be, make predictions twice for the same instance?

Alternatively, I could easily remove the duplicates when processing the files. Should I do that instead? Have others done so?

I know for a fact that LAMA on HuggingFace datasets (https://huggingface.co/datasets/lama) contains these duplicates.
