Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP 0880 spider chi ssa 20 #989

Draft
wants to merge 25 commits into
base: main
Choose a base branch
from

Conversation

guytet
Copy link

@guytet guytet commented Nov 6, 2020

Summary

Issue: #ISSUE_NUMBER

Replace "ISSUE_NUMBER" with the number of your issue so that GitHub will link this pull request with the issue and make review easier.

Checklist

All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.

  • Tests are implemented
  • All tests are passing
  • Style checks run (see documentation for more details)
  • Style checks are passing
  • Code comments from template removed

Questions

An issue I'm encountering while trying to recurs over <h3>2019 SSA Meetings</h3> , as this tag and its siblings are where meeting information is presented.
I was able to move around the tree using: h2 = response.xpath("//h2[contains(text(), 'SPECIAL SERVICE AREAS')]/following-sibling::p/text()") - but, I understood contains() may not be the best route in this case (white space changes).

In current state, I'm able to parse all h3 in the page, get the one I'm interested in 2019 SSA Meetings, but from that point on - I cannot use xpath in order to traverse the tree, as the h3 is an iterated item with no xpath siblings.

I've tried to use relative xpath paths such as (./) within the iterated item, but it seems clear that when the regex matches - I'm holding a single xpath element, and possibly nothing more.

I'd be happy to receive some guidelines as to how I can keep on solving this issue.
Thank you.

Copy link
Collaborator

@pjsier pjsier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made some quick comments, let me know if there's anything in particular you want me to look at!

.gitignore Outdated Show resolved Hide resolved
city_scrapers/spiders/chi_ssa_20.py Outdated Show resolved Hide resolved
@guytet
Copy link
Author

guytet commented Nov 13, 2020

Changed the logic and pushed.
for white spaces and lower case, added

base = [ re.sub(r"\s+", " ", item).lower() for item in base ]

When the spider is run, the result is now:

2019 ssa meetings
ssa 20:

wednesday, june 5, 9 a.m.
beverly bank & trust, 10258 s. western ave.
wednesday, july 10, 9 a.m.
beverly bank & trust, 10258 s. western ave.

I hope that from this point onward, I can move to processing the start times using the parse_start method.
Does this look decent enough in order to move onto parse_start :) ?

@pjsier
Copy link
Collaborator

pjsier commented Nov 19, 2020

@guytet sorry for the delay, I think that's a good next step!

@guytet
Copy link
Author

guytet commented Nov 19, 2020

@guytet sorry for the delay, I think that's a good next step!

No problem @pjsier :) . Thank you for the feedback.

@guytet
Copy link
Author

guytet commented Nov 26, 2020

At current state, I believe(hope), the naive datetime object is being passed correctly by _parse_start(), if this seems satisfactory I can move forward, if not - please note what should I improve/change and i'll be happy to keep working on it.

@guytet
Copy link
Author

guytet commented Nov 29, 2020

At current state, I believe(hope), the naive datetime object is being passed correctly by _parse_start(), if this seems satisfactory I can move forward, if not - please note what should I improve/change and i'll be happy to keep working on it.

Hmm.. something is broken in the last commit, please ignore the above, I'm back to working on it :)

guy.tet added 2 commits November 28, 2020 21:22
@guytet
Copy link
Author

guytet commented Nov 29, 2020

Updates for the last commit:

  • Took care of liniting requirements per pipenv run flake8
  • The spider's run does seems to yield the expected result (from my inexperienced POV, obviously needs review)
  • pytest exists without errors

I've looked in other spiders under `city_scrapers/spiders. Please review and let me know which items require more work.
Thank you.

@pjsier
Copy link
Collaborator

pjsier commented Nov 30, 2020

@guytet Thanks for the updates! It looks like the test file isn't in version control though?

@guytet
Copy link
Author

guytet commented Dec 5, 2020

@guytet Thanks for the updates! It looks like the test file isn't in version control though?

@pjsier You're right. I will look into that. (Sorry for delayed response, for some reason I didn't see an email from github about your comment).

@guytet
Copy link
Author

guytet commented Dec 5, 2020

Anther comment I'd like to make - I'm working on another version of this spider, however it could take more time. The main challenge I see is - this page handles two SSA's (20 and 64), and so - not all meetings and dates are data we'd like to process or deliver.

I'm sure the next item is not unique, none the less it is challenging: The meeting info is posted using separate HTML tags which may not keep the same logic when newer meetings are added. (Trying to look into the future being only 2019 meetings are posted for now), and somehow make sure when future year meetings are posted, the spider is still valid. )

@guytet
Copy link
Author

guytet commented Dec 11, 2020

Added the test file. Please let me know if anything needs improving.

Copy link
Collaborator

@pjsier pjsier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your work on this! Let me know if any of the comments aren't clear

city_scrapers/spiders/chi_ssa_20.py Outdated Show resolved Hide resolved
city_scrapers/spiders/chi_ssa_20.py Outdated Show resolved Hide resolved
city_scrapers/spiders/chi_ssa_20.py Show resolved Hide resolved
city_scrapers/spiders/chi_ssa_20.py Outdated Show resolved Hide resolved
city_scrapers/spiders/chi_ssa_20.py Outdated Show resolved Hide resolved
city_scrapers/spiders/chi_ssa_20.py Outdated Show resolved Hide resolved

def _parse_links(self, item):
"""Parse or generate links."""
return [{"href": "", "title": ""}]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should try and parse these from the section on the right and attempt to match them to meetings. You can see an example of this in chi_il_medical_district

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will try to accomplish this. I'm afraid I may have chosen a less than optimal logic if including the links should be involved,
(We had a small conversation about this on Slack, irrc).

I will try and see how this option can be factored in.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another item that's worth mentioning: The links on the right do not have dates posted on them but rather "1st quarter 2018" , "1st quarter 2019" and so on.
The meetings posted for 2019 ("current" at the time, I assume) were in June and July - not necessarily corresponding to a quarterly schedule, that could be a challenge, too.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As they've added 2021 meetings, things changed up a bit. Before the change, all the meetings took place in recent history, with some of them - having matching links on the right.

Now, there are only future meetings on the left (2021), and everything on the right is past, with no matching meetings on the left (future).
So it's a new challenge, so it seems.

Copy link
Author

@guytet guytet Jan 27, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pjsier With the added 2021 meetings:

A meeting is either current/upcoming without meeting minutes, but with time data or, it's past with meeting minutes and without time data.
What should we do in this case ?
https://www.mpbhba.org/business-resources/

An example to a past meeting link (without time data)

 <div class="et_pb_text_inner"><p>SSA 20 1st Quarter 2020 meeting minutes</p></div>
</div> <!-- .et_pb_text -->
</div> <!-- .et_pb_column --><div class="et_pb_column et_pb_column_1_4 et_pb_column_inner et_pb_column_inner_6 et-last-child">
<div class="et_pb_module et_pb_image et_pb_image_5">


<a href="https://www.mpbhba.org/wp-content/uploads/SSA-64-Q1-2020-meeting-minutes.docx" target="_blank"><span class="et_pb_image_wrap 
"><img src="data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E" alt="" 
title="" height="auto" width="auto" class="wp-image-277" data-lazy-src="https://www.mpbhba.org/wp-content/uploads/adobe-pdf-icon.jpg" 
/><noscript><img src="https://www.mpbhba.org/wp-content/uploads/adobe-pdf-icon.jpg" alt="" title="" 
height="auto" width="auto" class="wp-image-277" /></noscript></span></a>
</div><div class="et_pb_module et_pb_text et_pb_text_9  et_pb_text_align_left et_pb_bg_layout_light">

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general if the time data isn't present on past meetings we can go with a reasonable default based on the current meetings. It looks like 8am should work?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

8am works for me. What about the date :) ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah missed that, for some of them it looks like the date is in the URL itself, and for the others we might just need to ignore them for now

@guytet
Copy link
Author

guytet commented Dec 14, 2020

Thanks for your work on this! Let me know if any of the comments aren't clear

Thank you @pjsier , I will review your insightful input and adjust accordingly, will of course ask if anything's unclear.

guytet added 2 commits December 19, 2020 17:13
@guytet
Copy link
Author

guytet commented Jan 27, 2021

Well, this just in: SSA 20/64 has meetings scheduled for 2021 ! (and things have changed a bit, on the page) I guess it's back to the drawing table :)
https://www.mpbhba.org/business-resources/

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants