Some birthdays are inconsistent in legislators-historical.yaml #490

Open
muxspace opened this issue Jul 6, 2017 · 6 comments

muxspace commented Jul 6, 2017

I noticed that the birthdays for the following legislators fall after, or only months before, their start dates in Congress. This is in legislators-historical.yaml. Not sure if it's a parse issue or a source data issue.

         first      last   birthday      start        end
1432   William   Mayrant 1816-10-21 1815-12-04 1817-03-03
1935  Jeremiah    Cosden 1821-03-04 1821-12-03 1823-03-03
2618      Espy Van Horne 1825-03-04 1825-12-05 1827-03-03
2619      Espy Van Horne 1825-03-04 1827-12-03 1829-03-03
8074    Willis    Machen 1810-04-10 1797-05-15 1799-03-03
10838   Wilbur   Sanders 1834-05-02 1823-12-01 1825-03-03
11098   Edward     White 1845-11-03 1813-05-24 1815-03-03
17169  William  McMaster 1877-05-10 1817-12-01 1819-03-03
20272  William  Smathers 1891-01-07 1805-12-02 1807-03-03
20273  William  Smathers 1891-01-07 1807-10-26 1809-03-03
20274  William  Smathers 1891-01-07 1809-05-22 1811-03-03
21592    James   Huffman 1894-09-13 1809-05-22 1811-03-03
21593    James   Huffman 1894-09-13 1811-11-04 1813-03-03
21594    James   Huffman 1894-09-13 1813-05-24 1815-03-03
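
A check along these lines can be sketched in Python. This is only an illustration, not the code that produced the table: the three rows are copied from the table above, while the helper name `classify` and the 20-year age threshold are assumptions made for the example. It separates birthdays that fall after a term ended, inside a term, or implausibly close to its start.

```python
from datetime import date

# Rows copied from the table above: (first, last, birthday, start, end).
rows = [
    ("William", "Mayrant", "1816-10-21", "1815-12-04", "1817-03-03"),
    ("Jeremiah", "Cosden", "1821-03-04", "1821-12-03", "1823-03-03"),
    ("James", "Huffman", "1894-09-13", "1809-05-22", "1811-03-03"),
]

# Assumed plausibility threshold, not a rule taken from the data set.
MIN_AGE_AT_START = 20


def classify(birthday, start, end, min_age=MIN_AGE_AT_START):
    """Describe how a birthday relates to one term (hypothetical helper)."""
    born, began, ended = (date.fromisoformat(d) for d in (birthday, start, end))
    if born > ended:
        return "born after term ended"
    if born >= began:
        return "born during term"
    # Whole years between the birthday and the start of the term.
    age = began.year - born.year - ((began.month, began.day) < (born.month, born.day))
    if age < min_age:
        return "too young at term start"
    return "ok"


for first, last, birthday, start, end in rows:
    print(first, last, classify(birthday, start, end))
```

The three sample rows land in three different buckets (Mayrant during the term, Cosden too young, Huffman after the term), which hints that more than one kind of error may be involved.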

JoshData commented Jul 6, 2017

I just looked at one (Huffman) and it appears that the birthday is correct but the terms aren't: there are term records that are completely bogus (wrong century, wrong type, wrong state; they seem to be duplicates of another person's terms). git blame shows that they go back to the first commit in this repo, when I imported GovTrack's scrape of bioguide. Either bioguide had bad data or the scrape got it wrong.

There could be a bigger problem here than just the terms identified in your table.

muxspace commented Jul 6, 2017

I see what you mean. For Huffman, the Senate data looks correct but not the HoR.

  terms:
  - type: rep
    start: '1809-05-22'
    end: '1811-03-03'
    state: NJ
    district: -1
    party: Federalist
  - type: rep
    start: '1811-11-04'
    end: '1813-03-03'
    state: NJ
    district: -1
    party: Federalist
  - type: rep
    start: '1813-05-24'
    end: '1815-03-03'
    state: NJ
    district: -1
    party: Federalist
  - type: sen
    start: '1945-01-03'
    end: '1947-01-03'
    state: OH
    class: 1
    party: Democrat

Another one, William Mayrant, appears to use the resignation date as the birthday: October 21, 1816. The page says "served until his resignation on October 21, 1816". Does the scraper parse the literal text on the page?
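
Bogus terms like Huffman's could also be surfaced without looking at the birthday at all, by flagging implausibly long gaps between consecutive terms. The sketch below is a rough illustration under assumptions: the term dicts mirror the YAML above (only the fields used here), and the helper name `suspicious_gaps` and the 50-year threshold are made up for the example. It only reports that a gap exists; deciding which side is wrong still needs a human or a second source.

```python
from datetime import date

# Term records mirroring the YAML above (only the fields used here).
terms = [
    {"type": "rep", "start": "1809-05-22", "end": "1811-03-03", "state": "NJ"},
    {"type": "rep", "start": "1811-11-04", "end": "1813-03-03", "state": "NJ"},
    {"type": "rep", "start": "1813-05-24", "end": "1815-03-03", "state": "NJ"},
    {"type": "sen", "start": "1945-01-03", "end": "1947-01-03", "state": "OH"},
]

# Assumed threshold: a gap this long between terms is treated as suspicious.
MAX_GAP_YEARS = 50


def suspicious_gaps(terms, max_gap_years=MAX_GAP_YEARS):
    """Return (earlier, later) pairs of consecutive terms separated by a long gap."""
    # ISO dates sort correctly as strings, so sorting on "start" is enough.
    ordered = sorted(terms, key=lambda t: t["start"])
    flagged = []
    for earlier, later in zip(ordered, ordered[1:]):
        gap = date.fromisoformat(later["start"]) - date.fromisoformat(earlier["end"])
        if gap.days > max_gap_years * 365.25:
            flagged.append((earlier, later))
    return flagged


for earlier, later in suspicious_gaps(terms):
    print(earlier["end"], "->", later["start"], later["type"], later["state"])
```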

JoshData commented Jul 7, 2017 via email

muxspace commented Jul 8, 2017

I was going to suggest building an NER model to detect and distinguish between birthdays and terms, but it looks like it's possible to scrape the salient information from the table of search results. It would be easy to extend to all congresses. The only limitation is that the birth and death dates from this approach are limited to years instead of full dates.

Here's an R snippet:

> library(httr)
> library(rvest)

> url <- 'http://bioguide.congress.gov/biosearch/biosearch1.asp'
> rsp <- POST(url, body=list(congress=1), encode='form')
> x <- html_table(html_nodes(content(rsp),'table')[[2]],fill=TRUE)
No encoding supplied: defaulting to UTF-8.
> head(x)
          Member Name Birth-Death       Position               Party State
1        AMES, Fisher   1758-1808 Representative  Pro-Administration    MA
2 ASHE, John Baptista   1748-1802 Representative Anti-Administration    NC
3    BALDWIN, Abraham   1754-1807 Representative Anti-Administration    GA
4    BASSETT, Richard   1745-1815        Senator Anti-Administration    DE
5      BENSON, Egbert   1746-1833 Representative  Pro-Administration    NY
6   BLAND, Theodorick   1742-1790 Representative Anti-Administration    VA
  Congress(Year)
1   1(1789-1790)
2   1(1789-1790)
3   1(1789-1790)
4   1(1789-1790)
5   1(1789-1790)
6   1(1789-1790)
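
To compare these columns against the YAML, the "Birth-Death" and "Congress(Year)" strings would first need to be split into numbers. A minimal Python sketch follows, assuming the cells always look like the six rows printed above; the function names are hypothetical, and accepting a missing death year (e.g. "1950-") is a guess about how other rows might look, not something shown in this output.

```python
import re


def parse_birth_death(text):
    """Parse a cell like '1758-1808' into (birth_year, death_year).

    Returns None when the cell does not match; death_year is None when
    it is absent (an assumed form, not seen in the sample output).
    """
    m = re.fullmatch(r"\s*(\d{4})-(\d{4})?\s*", text)
    if m is None:
        return None
    birth, death = m.groups()
    return int(birth), (int(death) if death else None)


def parse_congress(text):
    """Parse a cell like '1(1789-1790)' into (congress, first_year, last_year)."""
    m = re.fullmatch(r"\s*(\d+)\((\d{4})-(\d{4})\)\s*", text)
    if m is None:
        return None
    return tuple(int(g) for g in m.groups())


print(parse_birth_death("1758-1808"))
print(parse_congress("1(1789-1790)"))
```

Returning None for anything unexpected, rather than raising, keeps a bulk run over all congresses going while still making odd cells easy to list afterwards.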

JoshData commented Jul 8, 2017

Running that to at least detect big errors (birthday years and term years) would be useful, even if we can't automatically fill in the complete date. (This script should be able to get complete birthdays, or it might be responsible for the error.)
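
A year-level cross-check of that kind might look like the Python sketch below. It is an illustration only: the helper name `big_errors` is hypothetical, the birthday and term start dates are Huffman's from this thread, and the search-table birth year of 1894 is an assumption (taken to agree with the YAML birthday, which appeared correct above) rather than a value scraped from bioguide.

```python
def big_errors(birthday, term_starts, table_birth_year):
    """List year-level disagreements between one YAML record and the search table.

    birthday and term_starts are ISO date strings from the YAML;
    table_birth_year is the birth year from the 'Birth-Death' column.
    """
    problems = []
    if int(birthday[:4]) != table_birth_year:
        problems.append(f"birth year {birthday[:4]} != {table_birth_year}")
    for start in term_starts:
        # Only years are available, so this catches gross errors, not near misses.
        if int(start[:4]) <= table_birth_year:
            problems.append(
                f"term starting {start} is not after birth year {table_birth_year}"
            )
    return problems


# Huffman's term start dates from the YAML quoted earlier in this thread.
huffman_starts = ["1809-05-22", "1811-11-04", "1813-05-24", "1945-01-03"]

# 1894 is assumed here, not scraped.
for problem in big_errors("1894-09-13", huffman_starts, 1894):
    print(problem)
```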

bycoffe commented Jul 10, 2017

Here's a script for getting birth dates from Wikidata (and the resulting data), in case we want to use that to help check for errors: https://gist.github.com/bycoffe/3f19b94a35785fd766a29b7454f38018

(This is the script I used to fix a similar issue with death dates/term end dates here in #475)
