Skip to content

Data sources

Eric Theise edited this page May 18, 2020 · 11 revisions

Data sources

ioo.coop

One of the first questions I had when I joined the project in January 2020 was: is there any real world data we can use in the map and directory? The answer was in the affirmative and I began by importing the ioo.coop data. There was a lot of manual manipulation to do: trim leading and following spaces, parse addresses into cities and countries, and a lot of other standardization as data entry at ioo.coop was largely freeform.

An unfinished task was to try and convert ioo.coop Activities into Directory Sectors. The Directory uses the same taxonomy as the Resource Library and some Activities lined up exactly with Sectors. But many did not. Luckily, ioo.coop Activities were all lower case so they were easy to query using SIMILAR TO '[a-z]%'.

This query schlepped the leftover Activities into the Organization.notes field.

UPDATE mdi_organization o
    SET notes = ss.notes
FROM (
    SELECT o.id AS oid, concat('ioo.coop "Activities": ', string_agg(s.name,', ' ORDER BY s.name)) AS notes
    FROM mdi_organization o
    JOIN mdi_organization_sectors os on os.organization_id = o.id
    JOIN mdi_sector s on s.id = os.sector_id
    WHERE s.name SIMILAR TO '[a-z]%'
    GROUP BY o.id) ss
WHERE ss.oid = o.id
;

All non-taxonomy Sectors were then purged from the database.

DELETE
FROM mdi_organization_sectors os
WHERE os.sector_id IN (
    SELECT id
    FROM mdi_sector
    WHERE name SIMILAR TO '[a-z]%'
    )
;

DELETE
FROM mdi_sector
WHERE name SIMILAR TO '[a-z]%'
;

Organizations imported from the ioo.coop without a Sector in the Resource Library taxonomy can be identified using this query:

SELECT o.name, o.source_id, os.id
FROM mdi_organization o
LEFT JOIN mdi_organization_sectors os ON os.organization_id = o.id
WHERE os.id IS NULL
ORDER BY o.name
;

Ecosystem 2020 Survey

Adding people to auth_user. Finessing middle_name manually.

INSERT INTO auth_user (date_joined, username, first_name, last_name, email, password, is_superuser, is_staff, is_active, middle_name, country, address, bio, city, notes, postal_code, state, updated_at, url, source_id, affiliation, field_of_study, projects, phone, has_profile)
SELECT a, d, b, c, d, MD5(random()::text), false, false, false, '', '', '', '', '', '', '', '', now(), '', 3, '', '', '', '', false
FROM surveys_ecosystem2020
WHERE d IN (
    SELECT d
    FROM surveys_ecosystem2020
    WHERE d like '%@%' AND id > 30
    EXCEPT
    SELECT email
    FROM auth_user)
ORDER BY a
ON CONFLICT (email) DO NOTHING
;

Adding organizations to mdi_organization. This query does not populate many-to-many tables, and assigns VA/Holy See to Organizations lacking country. This were corrected manually through the Django admin.

INSERT INTO mdi_organization (created_at, updated_at, admin_email, name, url, email, address, city, state, postal_code, country, founded_min_date, founded_max_date, media_url,
                              stage_id, geo_scope, geo_scope_city, geo_scope_region, geo_scope_country, num_workers,
                              impacted_exact_number, code_availability, notes, description, logo_url, source_id
                              )
SELECT a, a,
    CASE WHEN d NOT LIKE '%@%' THEN '[email protected]'
       ELSE d
    END,
    f, g, h, q, r, cb, cc,
    CASE
       WHEN s IN ('Turkey','Türkiye') THEN 'TR'
       WHEN s IN ('United Kingdom') THEN 'GB'
       WHEN s IN ('U.S.A.', 'US & Philippines', 'USA', 'United STates', 'United States', 'United States of Amercia', 'United States of America') THEN 'US'
       ELSE 'VA'
    END,
    concat(t,'-01-01')::date, concat(t,'-12-31')::date, x,
    CASE
       WHEN ad LIKE 'Phase 1%' THEN 1
       WHEN ad LIKE 'Phase 2%' THEN 2
       WHEN ad LIKE 'Phase 3%' THEN 3
       WHEN ad LIKE 'Phase 4%' THEN 4
       WHEN ad LIKE 'Phase 5%' THEN 5
       ELSE NULL
    END,
    CASE
       WHEN ao = 'local' THEN 'Local'
       WHEN ao = 'regional' THEN 'Regional'
       WHEN ao = 'national' THEN 'National'
       WHEN ao = 'local, national' THEN 'National'
       WHEN ao = 'international' THEN 'International'
    END,
    aq, ar,
    CASE
       WHEN "as_field" = 'Spain' THEN 'ES'
       WHEN "as_field" = 'United Kingdom' THEN 'GB'
       ELSE "as_field"
    END,
    ay::int,
    CASE
       WHEN bz = '' THEN NULL
       WHEN bz LIKE '%,%' OR bz LIKE '%+' THEN regexp_replace(bz, '[,+]','','g')::int4
       ELSE bz::int
    END,
    bd, ae, '', '', 3
FROM surveys_ecosystem2020
WHERE a > '2020-03-14 16:43:01.000000'
ORDER BY f
;
Clone this wiki locally