Commit

[feature][m] Adding wikidata-id scripts
gradedSystem committed Sep 30, 2024
1 parent 6951093 commit bb8fc49
Showing 4 changed files with 55 additions and 2 deletions.
6 changes: 5 additions & 1 deletion Makefile
@@ -13,6 +13,9 @@ all: data/country-codes.csv

.SECONDARY:

data/wd_countries.csv:
	./scripts/wd_countries.sh

data/iso3166.json:
	python3 scripts/iso3166.py # Calls your custom iso3166 script
	python3 scripts/csvtojson.py data/iso3166.csv data/iso3166-flat.json # Use your csvtojson function
@@ -54,7 +57,8 @@ data/country-codes.csv: data/country-codes.json data/geoname.csv data/cldr.csv d
	python3 scripts/reorder_columns.py
	python3 scripts/reorder_rows.py
	cp data/country-codes-reordered-sorted.csv data/country-codes.csv
	python3 scripts/cleanup.py # Ensure final column order
	python3 scripts/wd_countries.py
	python3 scripts/cleanup.py
	cp data/country-codes.csv data/previous-country-codes.csv

clean:
2 changes: 1 addition & 1 deletion scripts/cleanup.py
@@ -27,7 +27,7 @@ def cleanup():
"Intermediate Region Name", "official_name_es", "UNTERM English Formal", "official_name_cn",
"official_name_en", "ISO4217-currency_country_name", "Least Developed Countries (LDC)", "Region Name",
"UNTERM Arabic Short", "Sub-region Name", "official_name_ru", "Global Name", "Capital",
"Continent", "TLD", "Languages", "Geoname ID", "CLDR display name", "EDGAR"
"Continent", "TLD", "Languages", "Geoname ID", "CLDR display name", "EDGAR", "wikidata_id"
]

# Only reorder the columns that exist in both the dataframe and the desired order
20 changes: 20 additions & 0 deletions scripts/wd_countries.py
@@ -0,0 +1,20 @@
#!/usr/bin/env python
# -*- coding: utf-8 -*-

import pandas as pd

def run():
    """
    Look up Wikidata IDs by ISO 3166-1 alpha-2 code and add them to the country-codes CSV file.
    """
    # Paths are relative to the repository root, where make runs this script.
    wd_countries = pd.read_csv('data/wd_countries.csv')
    country_codes = pd.read_csv('data/country-codes.csv')

    # Left join: every country row is kept, whether or not a Wikidata ID was found.
    merged_data = pd.merge(country_codes, wd_countries, left_on='ISO3166-1-Alpha-2', right_on='iso2_code', how='left')

    # wd_countries.sh has already turned wd_id into a full URL, so no prefix is added here;
    # rows without a match get an empty string.
    merged_data['wikidata_id'] = merged_data['wd_id'].fillna('')

    merged_data.to_csv('data/country-codes.csv', index=False)

if __name__ == '__main__':
    run()
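
The join performed by wd_countries.py can be sketched with the standard library alone, which also makes it easy to list the codes that found no Wikidata match. This is only an illustration under stated assumptions: the two CSV snippets are hypothetical sample data (the ZZ row is invented), and `left_join` is a helper name that does not exist in the repository. Note that the csv module keeps the code "NA" as a plain string, whereas pandas parses "NA" as a missing value by default, so that row is worth checking in the real output.

```python
import csv
import io

# Hypothetical sample data standing in for data/country-codes.csv
# and data/wd_countries.csv (the ZZ row is invented for illustration).
COUNTRY_CODES = """ISO3166-1-Alpha-2,official_name_en
FR,France
NA,Namibia
ZZ,Placeholder
"""
WD_COUNTRIES = """iso2_code,wd_id
FR,https://www.wikidata.org/wiki/Q142
NA,https://www.wikidata.org/wiki/Q1030
"""

def left_join(country_csv, wd_csv):
    """Return the country rows with a wikidata_id column, plus the unmatched codes."""
    # Build a lookup table from ISO code to Wikidata URL.
    lookup = {row["iso2_code"]: row["wd_id"]
              for row in csv.DictReader(io.StringIO(wd_csv))}
    rows, unmatched = [], []
    for row in csv.DictReader(io.StringIO(country_csv)):
        code = row["ISO3166-1-Alpha-2"]
        # Left join: keep every country row; an empty string marks "no match".
        row["wikidata_id"] = lookup.get(code, "")
        if not row["wikidata_id"]:
            unmatched.append(code)
        rows.append(row)
    return rows, unmatched

rows, unmatched = left_join(COUNTRY_CODES, WD_COUNTRIES)
for row in rows:
    print(row["ISO3166-1-Alpha-2"], row["wikidata_id"] or "(no match)")
print("unmatched:", unmatched)
```

Reporting the unmatched codes separately is a design choice for this sketch; the script in the commit writes the merged file without such a check.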
29 changes: 29 additions & 0 deletions scripts/wd_countries.sh
@@ -0,0 +1,29 @@
#!/bin/bash

## Retrieve the Wikidata dataset via SPARQL:
curl -o data/wd_countries.csv -G 'https://query.wikidata.org/sparql' \
  --header "Accept: text/csv" \
  --data-urlencode query='
  SELECT DISTINCT (?simple_value AS ?iso2_code) ?wd_id
  WHERE {
    ?item p:P297 ?statement .
    ?statement ps:P297 ?simple_value .
    OPTIONAL { ?statement pq:P582 ?qualifier . }
    FILTER ( !bound(?qualifier) )
    BIND ( strafter(str(?item), str(wd:)) AS ?wd_id ).
  } ORDER BY ?iso2_code
  '

# Eliminate duplicates (codes shared by kingdoms and their territories).
# In the future we can use "P31 Q417175" to eliminate the kingdom duplicates, but "territory vs nation" needs some checking.
# So filter out the invalid duplicates and save under the same name (sponge comes from moreutils).
# -w matches whole words only, so Q1246 does not also remove longer IDs that merely start with Q1246.
grep -vwE 'Q756617|Q29999|Q407199|Q240592|Q83286|Q1246' data/wd_countries.csv | sponge data/wd_countries.csv

# Use awk to modify the second column, write to a temporary file
awk -F, 'BEGIN {OFS = FS} {if (NR > 1) $2="https://www.wikidata.org/wiki/" $2; print}' "data/wd_countries.csv" > "data/wd_countries.tmp.csv"

# Replace original file with the updated file
mv data/wd_countries.tmp.csv data/wd_countries.csv

# The grep filter above also drops the last two IDs, whose codes are not in use at ISO: Q83286 = YU (former Yugoslavia); Q1246 = XK (Kosovo).
# It also drops the wrong duplicate Q240592 (Macedonia).
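
The two post-processing steps of wd_countries.sh (dropping the excluded QIDs, then turning each QID into a URL) can be sketched in Python. This version compares QIDs exactly, so a longer ID such as Q12460 is not caught by the entry for Q1246. It is a rough sketch, not the repository's code: the sample rows are illustrative (the ZZ,Q12460 row is invented), and `clean` and `duplicate_codes` are hypothetical helper names.

```python
import csv
import io

# QIDs excluded by the shell script's filter.
EXCLUDED = {"Q756617", "Q29999", "Q407199", "Q240592", "Q83286", "Q1246"}
PREFIX = "https://www.wikidata.org/wiki/"

# Illustrative sample of the raw query output; the ZZ row is invented
# to show an ID that merely starts with an excluded one.
SAMPLE = """iso2_code,wd_id
NL,Q55
NL,Q29999
XK,Q1246
ZZ,Q12460
"""

def clean(raw_csv):
    """Drop excluded QIDs (exact match) and prefix the rest with the wiki URL."""
    out = []
    for row in csv.DictReader(io.StringIO(raw_csv)):
        if row["wd_id"] in EXCLUDED:
            continue
        out.append({"iso2_code": row["iso2_code"], "wd_id": PREFIX + row["wd_id"]})
    return out

def duplicate_codes(rows):
    """Return the ISO codes that appear on more than one row, sorted."""
    seen, dups = set(), set()
    for row in rows:
        code = row["iso2_code"]
        if code in seen:
            dups.add(code)
        seen.add(code)
    return sorted(dups)

raw_rows = list(csv.DictReader(io.StringIO(SAMPLE)))
cleaned = clean(SAMPLE)
print("duplicates before:", duplicate_codes(raw_rows))
print("duplicates after:", duplicate_codes(cleaned))
for row in cleaned:
    print(row["iso2_code"], row["wd_id"])
```

Running `duplicate_codes` on the cleaned rows is one way to notice when a new duplicate appears in Wikidata and the exclusion list needs another entry.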

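For reference, the request that the curl command sends can be assembled in Python with the standard library. The sketch below only builds the URL and headers and makes no network call. The query selects items with an ISO 3166-1 alpha-2 code (P297) whose statement has no end time (P582) qualifier. `build_request` is a hypothetical helper name, not something in the repository.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

ENDPOINT = "https://query.wikidata.org/sparql"

# Same query as in scripts/wd_countries.sh.
QUERY = """
SELECT DISTINCT (?simple_value AS ?iso2_code) ?wd_id
WHERE {
  ?item p:P297 ?statement .
  ?statement ps:P297 ?simple_value .
  OPTIONAL { ?statement pq:P582 ?qualifier . }
  FILTER ( !bound(?qualifier) )
  BIND ( strafter(str(?item), str(wd:)) AS ?wd_id ).
} ORDER BY ?iso2_code
"""

def build_request(query):
    """Return the GET URL and headers equivalent to the curl invocation."""
    # urlencode percent-encodes the query, like curl's --data-urlencode with -G.
    url = ENDPOINT + "?" + urlencode({"query": query.strip()})
    # Ask the endpoint for CSV output, as the curl command does.
    headers = {"Accept": "text/csv"}
    return url, headers

url, headers = build_request(QUERY)
# Decode the URL again to confirm the query survives the round trip.
decoded = parse_qs(urlsplit(url).query)["query"][0]
print(decoded == QUERY.strip())
print(headers)
```

To actually fetch the data, the URL and headers could be passed to `urllib.request.Request(url, headers=headers)` and opened with `urllib.request.urlopen`.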