-
Notifications
You must be signed in to change notification settings - Fork 3
Data sources
One of the first questions I had when I joined the project in January 2020 was: is there any real world data we can use in the map and directory? The answer was in the affirmative and I began by importing the ioo.coop data. There was a lot of manual manipulation to do: trim leading and following spaces, parse addresses into cities and countries, and a lot of other standardization as data entry at ioo.coop was largely freeform.
An unfinished task was to try and convert ioo.coop Activities into Directory Sectors. The Directory uses the same taxonomy as the Resource Library and some Activities lined up exactly with Sectors. But many did not. Luckily, ioo.coop Activities were all lower case so they were easy to query using SIMILAR TO '[a-z]%'
.
This query schlepped the leftover Activities into the Organization.notes field.
UPDATE mdi_organization o
SET notes = ss.notes
FROM (
SELECT o.id AS oid, concat('ioo.coop "Activities": ', string_agg(s.name,', ' ORDER BY s.name)) AS notes
FROM mdi_organization o
JOIN mdi_organization_sectors os on os.organization_id = o.id
JOIN mdi_sector s on s.id = os.sector_id
WHERE s.name SIMILAR TO '[a-z]%'
GROUP BY o.id) ss
WHERE ss.oid = o.id
;
All non-taxonomy Sectors were then purged from the database.
DELETE
FROM mdi_organization_sectors os
WHERE os.sector_id IN (
SELECT id
FROM mdi_sector
WHERE name SIMILAR TO '[a-z]%'
)
;
DELETE
FROM mdi_sector
WHERE name SIMILAR TO '[a-z]%'
;
Organizations imported from the ioo.coop without a Sector in the Resource Library taxonomy can be identified using this query:
SELECT o.name, o.source_id, os.id
FROM mdi_organization o
LEFT JOIN mdi_organization_sectors os ON os.organization_id = o.id
WHERE os.id IS NULL
ORDER BY o.name
;
Adding people to auth_user
. Finessing middle_name
manually.
INSERT INTO auth_user (date_joined, username, first_name, last_name, email, password, is_superuser, is_staff, is_active, middle_name, country, address, bio, city, notes, postal_code, state, updated_at, url, source_id, affiliation, field_of_study, projects, phone, has_profile)
SELECT a, d, b, c, d, MD5(random()::text), false, false, false, '', '', '', '', '', '', '', '', now(), '', 3, '', '', '', '', false
FROM surveys_ecosystem2020
WHERE d IN (
SELECT d
FROM surveys_ecosystem2020
WHERE d like '%@%' AND id > 30
EXCEPT
SELECT email
FROM auth_user)
ORDER BY a
ON CONFLICT (email) DO NOTHING
;
Adding organizations to mdi_organization
. This query does not populate many-to-many
tables, and assigns VA/Holy See
to Organizations lacking country
. This were corrected manually through the Django admin.
INSERT INTO mdi_organization (created_at, updated_at, admin_email, name, url, email, address, city, state, postal_code, country, founded_min_date, founded_max_date, media_url,
stage_id, geo_scope, geo_scope_city, geo_scope_region, geo_scope_country, num_workers,
impacted_exact_number, code_availability, notes, description, logo_url, source_id
)
SELECT a, a,
CASE WHEN d NOT LIKE '%@%' THEN '[email protected]'
ELSE d
END,
f, g, h, q, r, cb, cc,
CASE
WHEN s IN ('Turkey','Türkiye') THEN 'TR'
WHEN s IN ('United Kingdom') THEN 'GB'
WHEN s IN ('U.S.A.', 'US & Philippines', 'USA', 'United STates', 'United States', 'United States of Amercia', 'United States of America') THEN 'US'
ELSE 'VA'
END,
concat(t,'-01-01')::date, concat(t,'-12-31')::date, x,
CASE
WHEN ad LIKE 'Phase 1%' THEN 1
WHEN ad LIKE 'Phase 2%' THEN 2
WHEN ad LIKE 'Phase 3%' THEN 3
WHEN ad LIKE 'Phase 4%' THEN 4
WHEN ad LIKE 'Phase 5%' THEN 5
ELSE NULL
END,
CASE
WHEN ao = 'local' THEN 'Local'
WHEN ao = 'regional' THEN 'Regional'
WHEN ao = 'national' THEN 'National'
WHEN ao = 'local, national' THEN 'National'
WHEN ao = 'international' THEN 'International'
END,
aq, ar,
CASE
WHEN "as_field" = 'Spain' THEN 'ES'
WHEN "as_field" = 'United Kingdom' THEN 'GB'
ELSE "as_field"
END,
ay::int,
CASE
WHEN bz = '' THEN NULL
WHEN bz LIKE '%,%' OR bz LIKE '%+' THEN regexp_replace(bz, '[,+]','','g')::int4
ELSE bz::int
END,
bd, ae, '', '', 3
FROM surveys_ecosystem2020
WHERE a > '2020-03-14 16:43:01.000000'
ORDER BY f
;