Record overlap: rewe-shop, rewe-group-com #2150
Comments
Thank you for opening this issue (based on my email).
Should we turn this into a test? @baltpeter
I haven't looked into that particular case yet. Are we sure that this is a mistake? Either way, we can't generally forbid two records from having identical entries, see https://github.com/datenanfragen/data/blob/master/companies/amazon-de.json
Haven't looked either. 😅
Yeah, we should only do that test if there is no overlap in the countries. 🤔
If there is overlap in the countries, you mean, right? But even then, I'm not sure whether there can never be a case where that is valid…
Yes, oops. I wrote a little script to implement this:

from collections import defaultdict
import os
import json

# Map every company name and every "runs" entry to the slugs of the records that use it.
hashmap = defaultdict(list)
for file in os.listdir("companies/"):
    with open("companies/" + file, "r") as f:
        company = json.load(f)
    slug = company["slug"]
    hashmap[company["name"]].append(slug)
    if "runs" in company:
        for run in company["runs"]:
            hashmap[run].append(slug)

# Names that are listed more than once across all records.
simple_overlap = {k: v for k, v in hashmap.items() if len(v) > 1}
print("simple", len(simple_overlap.keys()))

# For each such name, group the affected records by relevant country.
for name, slugs in simple_overlap.items():
    used_rvs = defaultdict(list)
    alls = set()
    for slug in slugs:
        with open("companies/" + slug + ".json", "r") as f:
            company = json.load(f)
        if "relevant-countries" in company:
            if company["relevant-countries"] == ["all"]:
                # This record applies to every country.
                alls.add(name)
            else:
                for rv in company["relevant-countries"]:
                    used_rvs[rv].append(slug)
    # Keep countries listed more than twice for this name, or every country
    # if one of the records applies to all countries.
    filtered_overlap = {k: v for k, v in used_rvs.items() if len(v) > 2 or name in alls}
    if filtered_overlap:
        print(name, filtered_overlap, alls)
Yeah, we'd also have to check if the websites are different. And probably every other key as well. However, we can close this issue: the REWE Group collision is okay, since the websites are different.
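The stricter condition discussed here (shared name, overlapping countries, and the same website) could be sketched roughly as follows. This is only an illustration, not the project's actual test. The helper names are made up, the `web` key and the choice to treat a missing `relevant-countries` as `["all"]` are assumptions about the record format, and the example records are invented.

```python
def claimed_names(record):
    """All names a record claims: its own name plus every "runs" entry."""
    return {record["name"], *record.get("runs", [])}


def countries_overlap(a, b):
    """True if two records share a country or one of them covers all countries."""
    # Assumption: a record without "relevant-countries" applies everywhere.
    ca = set(a.get("relevant-countries", ["all"]))
    cb = set(b.get("relevant-countries", ["all"]))
    return "all" in ca or "all" in cb or bool(ca & cb)


def normalize_web(url):
    """Hypothetical normalization so trailing slashes and case don't matter."""
    return url.strip().lower().rstrip("/") if url else None


def suspicious_overlap(a, b):
    """Return the shared names that look like a real conflict, else an empty set."""
    shared = claimed_names(a) & claimed_names(b)
    if not shared or not countries_overlap(a, b):
        return set()
    web_a = normalize_web(a.get("web"))
    web_b = normalize_web(b.get("web"))
    # Assumption: records with two different, known websites are a valid overlap.
    if web_a is not None and web_b is not None and web_a != web_b:
        return set()
    return shared


# Invented example records, not taken from the database.
shop = {
    "slug": "example-shop",
    "name": "Example Shop",
    "runs": ["Example Markt GmbH"],
    "relevant-countries": ["de"],
    "web": "https://shop.example.com",
}
group = {
    "slug": "example-group",
    "name": "Example Group",
    "runs": ["Example Markt GmbH"],
    "relevant-countries": ["de", "at"],
    "web": "https://group.example.com/",
}

print(suspicious_overlap(shop, group))
print(suspicious_overlap(shop, dict(group, web="https://shop.example.com/")))
```

With different websites the first call reports nothing. Once both records point at the same website, the shared `runs` entry is reported. Comparing "every other key as well" would need more rules than this sketch covers.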
I see my original concern as unresolved. The database currently shows two entries for REWE Markt GmbH.
As I understand it, that cannot be right, because it is then ambiguous which entry is the correct one.
Both have "Rewe Markt GmbH" in the "runs" array. Seems like a mistake we should resolve?