-
-
Notifications
You must be signed in to change notification settings - Fork 311
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
WIP 0880 spider chi ssa 20 #989
base: main
Are you sure you want to change the base?
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made some quick comments, let me know if there's anything in particular you want me to look at!
Changed the logic and pushed.
When the spider is run, the result is now:
I hope that from this point onward, I can move to processing the start times using the |
@guytet sorry for the delay, I think that's a good next step! |
At current state, I believe(hope), the naive datetime object is being passed correctly by |
Hmm.. something is broken in the last commit, please ignore the above, I'm back to working on it :) |
Updates for the last commit:
I've looked in other spiders under `city_scrapers/spiders. Please review and let me know which items require more work. |
@guytet Thanks for the updates! It looks like the test file isn't in version control though? |
Anther comment I'd like to make - I'm working on another version of this spider, however it could take more time. The main challenge I see is - this page handles two SSA's (20 and 64), and so - not all meetings and dates are data we'd like to process or deliver. I'm sure the next item is not unique, none the less it is challenging: The meeting info is posted using separate HTML tags which may not keep the same logic when newer meetings are added. (Trying to look into the future being only 2019 meetings are posted for now), and somehow make sure when future year meetings are posted, the spider is still valid. ) |
Added the test file. Please let me know if anything needs improving. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for your work on this! Let me know if any of the comments aren't clear
|
||
def _parse_links(self, item): | ||
"""Parse or generate links.""" | ||
return [{"href": "", "title": ""}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should try and parse these from the section on the right and attempt to match them to meetings. You can see an example of this in chi_il_medical_district
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will try to accomplish this. I'm afraid I may have chosen a less than optimal logic if including the links should be involved,
(We had a small conversation about this on Slack, irrc).
I will try and see how this option can be factored in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another item that's worth mentioning: The links on the right do not have dates posted on them but rather "1st quarter 2018" , "1st quarter 2019" and so on.
The meetings posted for 2019 ("current" at the time, I assume) were in June and July - not necessarily corresponding to a quarterly schedule, that could be a challenge, too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As they've added 2021 meetings, things changed up a bit. Before the change, all the meetings took place in recent history, with some of them - having matching links on the right.
Now, there are only future meetings on the left (2021), and everything on the right is past, with no matching meetings on the left (future).
So it's a new challenge, so it seems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@pjsier With the added 2021 meetings:
A meeting is either current/upcoming without meeting minutes, but with time data or, it's past with meeting minutes and without time data.
What should we do in this case ?
https://www.mpbhba.org/business-resources/
An example to a past meeting link (without time data)
<div class="et_pb_text_inner"><p>SSA 20 1st Quarter 2020 meeting minutes</p></div>
</div> <!-- .et_pb_text -->
</div> <!-- .et_pb_column --><div class="et_pb_column et_pb_column_1_4 et_pb_column_inner et_pb_column_inner_6 et-last-child">
<div class="et_pb_module et_pb_image et_pb_image_5">
<a href="https://www.mpbhba.org/wp-content/uploads/SSA-64-Q1-2020-meeting-minutes.docx" target="_blank"><span class="et_pb_image_wrap
"><img src="data:image/svg+xml,%3Csvg%20xmlns='http://www.w3.org/2000/svg'%20viewBox='0%200%200%200'%3E%3C/svg%3E" alt=""
title="" height="auto" width="auto" class="wp-image-277" data-lazy-src="https://www.mpbhba.org/wp-content/uploads/adobe-pdf-icon.jpg"
/><noscript><img src="https://www.mpbhba.org/wp-content/uploads/adobe-pdf-icon.jpg" alt="" title=""
height="auto" width="auto" class="wp-image-277" /></noscript></span></a>
</div><div class="et_pb_module et_pb_text et_pb_text_9 et_pb_text_align_left et_pb_bg_layout_light">
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general if the time data isn't present on past meetings we can go with a reasonable default based on the current meetings. It looks like 8am should work?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
8am works for me. What about the date :) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah missed that, for some of them it looks like the date is in the URL itself, and for the others we might just need to ignore them for now
Thank you @pjsier , I will review your insightful input and adjust accordingly, will of course ask if anything's unclear. |
Well, this just in: SSA 20/64 has meetings scheduled for 2021 ! (and things have changed a bit, on the page) I guess it's back to the drawing table :) |
Summary
Issue: #ISSUE_NUMBER
Replace "ISSUE_NUMBER" with the number of your issue so that GitHub will link this pull request with the issue and make review easier.
Checklist
All checks are run in GitHub Actions. You'll be able to see the results of the checks at the bottom of the pull request page after it's been opened, and you can click on any of the specific checks listed to see the output of each step and debug failures.
Questions
An issue I'm encountering while trying to recurs over
<h3>2019 SSA Meetings</h3>
, as this tag and its siblings are where meeting information is presented.I was able to move around the tree using:
h2 = response.xpath("//h2[contains(text(), 'SPECIAL SERVICE AREAS')]/following-sibling::p/text()")
- but, I understoodcontains()
may not be the best route in this case (white space changes).In current state, I'm able to parse all
h3
in the page, get the one I'm interested in2019 SSA Meetings
, but from that point on - I cannot usexpath
in order to traverse the tree, as theh3
is an iterated item with no xpath siblings.I've tried to use relative xpath paths such as
(./)
within the iterated item, but it seems clear that when the regex matches - I'm holding a single xpath element, and possibly nothing more.I'd be happy to receive some guidelines as to how I can keep on solving this issue.
Thank you.