WIP 0880 spider chi ssa 20 #989

Draft
wants to merge 25 commits into
base: main
from
105 changes: 105 additions & 0 deletions city_scrapers/spiders/chi_ssa_20.py
@@ -0,0 +1,105 @@
import re
from datetime import datetime

from city_scrapers_core.constants import NOT_CLASSIFIED
from city_scrapers_core.items import Meeting
from city_scrapers_core.spiders import CityScrapersSpider


class ChiSsa20Spider(CityScrapersSpider):
    name = "chi_ssa_20"
    agency = "Chicago Special Service Area #20 South Western Avenue"
    timezone = "America/Chicago"
    start_urls = ["https://www.mpbhba.org/business-resources/"]
    location = {
        "name": "Beverly Bank & Trust",
        "address": "10258 S. Western Ave.",
    }

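    def parse(self, response):
        # Collect the text of every <p>, <strong> and <h3> node on the page.
        base = response.xpath(
            "//*[self::p or self::strong or self::h3]/text()"
        ).getall()

        # Collapse whitespace runs and lowercase for easier matching.
        base = [re.sub(r"\s+", " ", item).lower() for item in base]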

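        # Drop everything before the "ssa meetings" heading. Stop at the first
        # match: the list must not be mutated while iteration continues.
        for index, line in enumerate(base):
            if "ssa meetings" in line:
                del base[:index]
                break

        # Drop everything from the "ssa 64" section onward.
        for index, line in enumerate(base):
            if "ssa 64" in line:
                del base[index:]
                break

        # A line whose only digits are a four-digit year sets the year.
        for item in base:
            if re.match(r"^\D*\d{4}\D*$", item):
                # search, not match: the year may follow non-digit characters
                year = re.search(r"\d{4}", item)[0]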

        for item in base:

            # don't pass empty lines to methods
            if re.match(r"^\s*$", item):
                continue

            start = self._parse_start(item, year)
            if not start:
                continue

            meeting = Meeting(
                title=self._parse_title(item),
                description=self._parse_description(item),
                classification=self._parse_classification(item),
                start=start,
                end=self._parse_end(item),
                all_day=self._parse_all_day(item),
                time_notes=self._parse_time_notes(item),
                location=self.location,
                links=self._parse_links(item),
                source=self._parse_source(response),
            )

            meeting["status"] = self._get_status(meeting)
            meeting["id"] = self._get_id(meeting)

            yield meeting

    def _parse_title(self, item):
        """Parse or generate meeting title."""
        return "SSA 20"

    def _parse_description(self, item):
        """Parse or generate meeting description."""
        return ""

    def _parse_classification(self, item):
        """Parse or generate classification from allowed options."""
        return NOT_CLASSIFIED

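    def _parse_start(self, item, year):
        """Parse start datetime as a naive datetime object, or None."""
        # Skip lines that name the venue or the SSA rather than a date.
        if any(word in item for word in ["beverly", "ssa"]):
            return None
        # Strip commas and periods so that "a.m." becomes "am" for %p.
        item = re.sub(r"[,.]", "", item).strip()
        try:
            return datetime.strptime(item + " " + year, "%A %B %d %I %p %Y")
        except ValueError:
            # Not a date line (for example, the year heading itself).
            return None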

    def _parse_end(self, item):
        """Parse end datetime as a naive datetime object. Added by pipeline if None"""
        return None

    def _parse_time_notes(self, item):
        """Parse any additional notes on the timing of the meeting"""
        return ""

    def _parse_all_day(self, item):
        """Parse or generate all-day status. Defaults to False."""
        return False

    def _parse_links(self, item):
        """Parse or generate links."""
        return [{"href": "", "title": ""}]
Collaborator

We should try to parse these from the section on the right and attempt to match them to meetings. You can see an example of this in chi_il_medical_district.

Author

I will try to accomplish this. I'm afraid the logic I chose may be less than optimal if the links need to be included
(we had a short conversation about this on Slack, IIRC).

I will see how this can be factored in.

Author

Another item worth mentioning: the links on the right do not carry dates, only labels such as "1st quarter 2018", "1st quarter 2019", and so on.
The meetings posted for 2019 ("current" at the time, I assume) were in June and July, which does not necessarily correspond to a quarterly schedule, so that could be a challenge too.
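
One possible way to compare those labels with meeting dates is to reduce both to a (year, quarter) pair. This is only a sketch, assuming labels shaped like "1st quarter 2018"; `parse_quarter_label` and `quarter_of` are hypothetical helpers, not part of the spider.

```python
import re
from datetime import datetime

# Hypothetical helper: assumes labels such as "1st quarter 2018".
QUARTER_RE = re.compile(r"(\d)(?:st|nd|rd|th)\s+quarter\s+(\d{4})", re.IGNORECASE)


def parse_quarter_label(label):
    """Return (year, quarter) from a link label, or None if it doesn't match."""
    match = QUARTER_RE.search(label)
    if not match:
        return None
    quarter, year = int(match.group(1)), int(match.group(2))
    if not 1 <= quarter <= 4:
        return None
    return year, quarter


def quarter_of(start):
    """Return (year, quarter) for a meeting's start datetime."""
    return start.year, (start.month - 1) // 3 + 1
```

A June meeting lands in the second quarter under this scheme, so it would only pair with a "2nd quarter" link, which may not reflect how the minutes are actually labeled.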

Author

Since they've added 2021 meetings, things have changed a bit. Before the change, all the meetings were in the recent past, and some of them had matching links on the right.

Now there are only future (2021) meetings on the left, and everything on the right is past, with no matching meetings on the left.
So it seems this is a new challenge.

Author
@guytet, Jan 27, 2021

@pjsier With the added 2021 meetings:

A meeting is either current/upcoming, with time data but without meeting minutes, or past, with meeting minutes but without time data.
What should we do in this case?
https://www.mpbhba.org/business-resources/

An example of a past meeting link (without time data):

 <div class="et_pb_text_inner"><p>SSA 20 1st Quarter 2020 meeting minutes</p></div>
</div> <!-- .et_pb_text -->
</div> <!-- .et_pb_column --><div class="et_pb_column et_pb_column_1_4 et_pb_column_inner et_pb_column_inner_6 et-last-child">
<div class="et_pb_module et_pb_image et_pb_image_5">


<a href="https://www.mpbhba.org/wp-content/uploads/SSA-64-Q1-2020-meeting-minutes.docx" target="_blank"><span class="et_pb_image_wrap 
"><img src="data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E" alt="" 
title="" height="auto" width="auto" class="wp-image-277" data-lazy-src="https://www.mpbhba.org/wp-content/uploads/adobe-pdf-icon.jpg" 
/><noscript><img src="https://www.mpbhba.org/wp-content/uploads/adobe-pdf-icon.jpg" alt="" title="" 
height="auto" width="auto" class="wp-image-277" /></noscript></span></a>
</div><div class="et_pb_module et_pb_text et_pb_text_9  et_pb_text_align_left et_pb_bg_layout_light">

Collaborator

In general, if the time data isn't present on past meetings, we can go with a reasonable default based on the current meetings. It looks like 8am should work?
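
A minimal sketch of that fallback, assuming the date of a past meeting is already known and only the time is missing. `DEFAULT_START_TIME` and `with_default_time` are hypothetical names, not existing spider code.

```python
from datetime import date, datetime, time

# Assumption: 8:00 am, taken from the currently listed meetings.
DEFAULT_START_TIME = time(8, 0)


def with_default_time(meeting_date, start_time=None):
    """Combine a date with a start time, falling back to the default."""
    if start_time is None:
        start_time = DEFAULT_START_TIME
    return datetime.combine(meeting_date, start_time)
```

If a default is used, the `time_notes` field might be a reasonable place to say that the time was assumed rather than scraped.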

Author

8am works for me. What about the date? :)

Collaborator

Ah, missed that. For some of them it looks like the date is in the URL itself, and for the others we might just need to ignore them for now.
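
A rough sketch of that approach, assuming a date in the URL would appear as month-day-year digits (for example `3-4-2020`). That format and the `date_from_url` name are assumptions, since the only URL quoted in this thread carries a quarter rather than a date.

```python
import re
from datetime import datetime

# Assumed pattern: M-D-YYYY or MM-DD-YYYY somewhere in the file name.
URL_DATE_RE = re.compile(r"(?<!\d)(\d{1,2})-(\d{1,2})-(\d{4})(?!\d)")


def date_from_url(url):
    """Return a naive datetime at the default 8 am start, or None to skip the link."""
    match = URL_DATE_RE.search(url)
    if not match:
        return None
    month, day, year = (int(part) for part in match.groups())
    try:
        return datetime(year, month, day, 8, 0)
    except ValueError:
        # Digits that don't form a real calendar date, e.g. month 13.
        return None
```

Links for which this returns `None`, such as the quarterly minutes quoted above, would simply be skipped for now.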


    def _parse_source(self, response):
        """Parse or generate source."""
        return response.url