914 spider il finance authority #995
base: main
`.gitignore`

```diff
@@ -22,6 +22,7 @@ parts/
 sdist/
 var/
 wheels/
+city_scrapers/get-pip.py
 *.egg-info/
 .installed.cfg
 *.egg
```
New spider file (171 lines added):

```python
import re
from datetime import datetime
from io import BytesIO, StringIO

import scrapy
from city_scrapers_core.constants import BOARD, COMMISSION, COMMITTEE
from city_scrapers_core.items import Meeting
from city_scrapers_core.spiders import CityScrapersSpider
from pdfminer.high_level import extract_text_to_fp
from pdfminer.layout import LAParams
from pdfminer.pdfparser import PDFSyntaxError


class IlFinanceAuthoritySpider(CityScrapersSpider):
    name = "il_finance_authority"
    agency = "Illinois Finance Authority"
    timezone = "America/Chicago"
    start_urls = ["https://www.il-fa.com/public-access/board-documents/"]

    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
```

**Reviewer:** Is there a reason for leaving this `__init__` in? Right now it has no effect, but I'm not sure if it was used earlier.
```python
    def parse(self, response):
        for item in response.css("tr:nth-child(1n+2)"):
```

**Reviewer:** We don't have to use this much, but because there are so many meetings here it would be good to use our usual approach of only requesting meetings within a range of the current date, like everything in the past year.

**Author:** Can you explain more about the meaning behind this idea? I added it to my code and now it doesn't work. Could you also say more about the settings object? Thank you.

**Reviewer:** Sure! We want to be careful about spamming sites with a ton of requests at once, and typically when we scrape a site we're only interested in the last few past meetings and the next few upcoming. To reduce the number of requests we make, as well as simplify the output for anyone using the feeds directly, we try to set ranges of time relative to the current date that we're interested in, like everything in the past year in that example. Scrapy's settings are a way of managing configuration across spiders, like where the output is written or how quickly requests should be made. You can find more info on them in the Scrapy documentation on settings. Could you explain more about what isn't working? It's hard for me to debug without an example.
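A sketch of what that date filtering could look like, where `within_last_year` is a hypothetical helper and the one-year window is an assumed example rather than a project requirement:

```python
from datetime import datetime, timedelta


def within_last_year(date_str):
    """Return True if a date string like "Jan 14, 2020" falls within
    the past year (one year is an assumed cutoff for illustration)."""
    parsed = datetime.strptime(date_str.replace(", ", ","), "%b %d,%Y")
    return parsed >= datetime.now() - timedelta(days=365)


# Inside parse(), a row could then be skipped before any PDF request:
#
#     if not within_last_year(self._parse_date(item)):
#         continue
```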
```python
            pdf_link = self._get_pdf_link(item)
            if pdf_link is None or not pdf_link.endswith(".pdf"):
                continue
            title = self._parse_title(item)
            date = self._parse_date(item)

            yield scrapy.Request(
                response.urljoin(pdf_link),
                callback=self._parse_schedule,
                dont_filter=True,
                meta={"title": title, "date": date},
            )

    def _parse_schedule(self, response):
        """Parse PDF and then yield to meeting items"""
        pdf_text = self._parse_agenda_pdf(response)
        location = self._parse_location(pdf_text)
        time = self._parse_start(pdf_text)
        meeting_dict = dict()
        meeting_dict["title"] = response.meta["title"]
        meeting_dict["date"] = response.meta["date"]
        meeting_dict["location"] = location
        meeting_dict["time"] = time

        yield scrapy.Request(
            response.url,
            callback=self._parse_meeting,
            dont_filter=True,
            meta={"meeting_dict": meeting_dict},
        )
```

**Reviewer:** Is this last request necessary? It looks like we're trying to re-request the same URL just to pass values along to another callback.

**Author:** The last request is necessary for the code to work.

**Reviewer:** Could you explain more about what you mean? When you create a new request to the same URL you're fetching the page a second time without getting any new information, so the meeting could be yielded here directly instead.
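If the extra request really is just a way of handing values to `_parse_meeting`, one possible restructuring is to build and yield the `Meeting` directly in `_parse_schedule`. This is a sketch that folds the existing `_parse_meeting` logic into `_parse_schedule`, not code from the PR:

```python
    def _parse_schedule(self, response):
        """Parse the agenda PDF and yield the meeting without re-requesting."""
        pdf_text = self._parse_agenda_pdf(response)
        title = response.meta["title"]
        date = response.meta["date"]

        meeting = Meeting(
            title=title,
            description="",
            classification=self._parse_classification(title),
            start=self._meeting_datetime(date, self._parse_start(pdf_text)),
            end=None,
            all_day=False,
            time_notes="",
            location=self._parse_location(pdf_text),
            links=self._parse_links(response.url, title),
            source=self._parse_source(response),
        )
        meeting["status"] = self._get_status(meeting)
        meeting["id"] = self._get_id(meeting)
        yield meeting
```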
```python
    def _parse_agenda_pdf(self, response):
        try:
            lp = LAParams(line_margin=0.1)
            out_str = StringIO()
            extract_text_to_fp(
                inf=BytesIO(response.body),
                outfp=out_str,
                maxpages=1,
                laparams=lp,
                codec="utf-8",
            )

            pdf_content = out_str.getvalue().replace("\n", "")
            # Remove duplicate spaces
            clean_text = re.sub(r"\s+", " ", pdf_content)
            # Remove underscores
            clean_text = re.sub(r"_*", "", clean_text)
            return clean_text

        except PDFSyntaxError as e:
            print("~~Error: " + str(e))
```

**Reviewer:** It looks like this can return `None` when a `PDFSyntaxError` is raised, since the `except` branch falls through without returning anything.
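One possible shape for that fix, assuming an empty string is an acceptable fallback and using Scrapy's built-in spider logger instead of `print`:

```python
    def _parse_agenda_pdf(self, response):
        """Extract first-page text, or "" if the PDF cannot be parsed."""
        try:
            out_str = StringIO()
            extract_text_to_fp(
                inf=BytesIO(response.body),
                outfp=out_str,
                maxpages=1,
                laparams=LAParams(line_margin=0.1),
                codec="utf-8",
            )
            text = re.sub(r"\s+", " ", out_str.getvalue().replace("\n", ""))
            return re.sub(r"_+", "", text)
        except PDFSyntaxError as e:
            # Return "" so callers never receive None
            self.logger.error("Error parsing agenda PDF: %s", e)
            return ""
```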
```python
    def _parse_meeting(self, response):
        meeting_dict = response.meta["meeting_dict"]
        title = meeting_dict["title"]
        date = meeting_dict["date"]
        time = meeting_dict["time"]
        location = meeting_dict["location"]

        meeting = Meeting(
            title=title,
            description="",
            classification=self._parse_classification(title),
            start=self._meeting_datetime(date, time),
            end=None,
            all_day=False,
            time_notes="",
            location=location,
            links=self._parse_links(response.url, title),
            source=self._parse_source(response),
        )
        meeting["status"] = self._get_status(meeting)
        meeting["id"] = self._get_id(meeting)
        yield meeting

    def _meeting_datetime(self, date, time):
        meeting_start = date + " " + time
        meeting_start = meeting_start.replace(", ", ",").strip()
        return datetime.strptime(meeting_start, "%b %d,%Y %I:%M %p")

    def _get_pdf_link(self, item):
        pdf_tag = item.css("td:nth-child(4) > a")
        if not pdf_tag:
            return None
        pdf_link = pdf_tag[0].attrib["href"]
        return pdf_link

    def _parse_title(self, item):
        """Parse or generate meeting title."""
        try:
            title = item.css("td:nth-child(3)::text").extract_first()
            return title
        except TypeError:
            return ""
```

**Reviewer:** Small thing, but if possible it would be good to replace "Comm" with "Committee" for committee meetings. To be safe we'd want to replace it only when it's at the end of the title, so something like this could work:

```python
title = re.sub(r"Comm$", "Committee", title)
```
```python
    def _parse_classification(self, title):
        """Parse or generate classification from allowed options."""
        if "Comm" in title:
            return COMMITTEE
        if "Board" in title:
            return BOARD
        return COMMISSION
```

**Reviewer:** We can default to `BOARD` here instead of `COMMISSION`, since everything on this page that isn't a committee meeting is a meeting of the board.
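If that default changes as suggested, the method collapses to a two-way check (a sketch; it assumes no other meeting types appear in the titles):

```python
    def _parse_classification(self, title):
        """Classify committee meetings by title; default to BOARD."""
        if "Comm" in title:
            return COMMITTEE
        return BOARD
```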
```python
    def _parse_location(self, pdf_content):
        """Parse or generate location."""
        try:
            address_match = re.search(
                r"(?:in\s*the|at\s*the) .*(\. | \d{5})", pdf_content
            )
            address = address_match.group(0)
            name = re.findall(r"(?:in\s*the|at\s*the).*?,", pdf_content)[0]
        except Exception:
            address = "Address Not Found"
            name = "Name Not Found"
        return {"address": address, "name": name}
```

**Reviewer:** We can just return blank strings in that case, but it looks like the location is pretty consistent, so it could be easier to use a hardcoded default location and only override it when the PDF shows something different.
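A sketch of the hardcoded-default approach, following the City Scrapers convention of a class-level `location`. The name and address below are placeholders, not verified values for this agency:

```python
    # Hypothetical default; fill in the address that actually appears
    # in most of the agendas
    location = {
        "name": "Illinois Finance Authority",
        "address": "Chicago, IL",  # placeholder, not a verified address
    }

    def _parse_location(self, pdf_content):
        """Use the parsed address when present, else the default."""
        match = re.search(r"(?:in\s*the|at\s*the) .*? \d{5}", pdf_content)
        if match:
            return {"name": "", "address": match.group(0)}
        return self.location
```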
```python
    def _parse_date(self, item):
        """Parse the meeting date string from the table row."""
        try:
            date_str = item.css("td:nth-child(2)::text").extract_first()
            return date_str
        except TypeError:
            return ""

    def _parse_start(self, pdf_content):
        """Parse the start time from the PDF text, defaulting to midnight."""
        try:
            time = re.findall(
                r"\d{1,2}:\d{2}\s?(?:A\.M\.|P\.M\.|PM|AM)", pdf_content
            )[0]
            # Drop periods so "A.M."/"P.M." match strptime's %p directive
            return time.replace(".", "")
        except Exception:
            return "12:00 AM"

    def _parse_end(self, item):
        """Parse end datetime as a naive datetime object. Added by pipeline if None."""
        return None

    def _parse_all_day(self, item):
        """Parse or generate all-day status. Defaults to False."""
        return False

    def _parse_links(self, link, title):
        """Parse or generate links."""
        return [{"href": link, "title": title}]

    def _parse_source(self, response):
        """Parse or generate source."""
        return response.url
```

**Reviewer:** It looks like there are multiple links here, so ideally we would want to pull the notice, minutes, and any other information as links even if we aren't parsing them.

**Author:** Are you talking about the other PDF links on the website? Could you elaborate more?

**Reviewer:** Sure, I'm seeing up to 4 PDF links in each row for the Agenda, Board Book, Minutes, and Voting Record. Ideally we'll want to include all of those in the meeting's `links` list.
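A sketch of collecting every PDF in a row. Note this changes `_parse_links` to take the row selector and the response instead of a single link, so the call site would change too; taking each link's text as its title is an assumption about the page's markup:

```python
    def _parse_links(self, item, response):
        """Collect all PDF links in the row (Agenda, Board Book,
        Minutes, Voting Record) rather than just the agenda."""
        links = []
        for link in item.css("td a"):
            href = link.attrib.get("href", "")
            if href.endswith(".pdf"):
                links.append({
                    "href": response.urljoin(href),
                    "title": " ".join(link.css("*::text").getall()).strip(),
                })
        return links
```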
**Reviewer (on the `.gitignore` change):** Usually we'll want to keep unrelated `.gitignore` changes out of PRs. If you want to ignore something locally you can add it to `.git/info/exclude` in your repo instead.
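For example, the stray file from the diff above could be ignored locally, without touching the shared `.gitignore`, by adding one line to that file:

```
# .git/info/exclude — per-clone ignore rules that are never committed
city_scrapers/get-pip.py
```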