Spider: Illinois Health Facilities and Services Review Board #1001
Comments
I would like to work on this issue. |
@masoodqq sounds great! Assigning you now |
Hi, |
Hi, |
@palakshivlani-11 Hi. Thanks so much for checking out the project. Go for it. |
willing to help |
@yawar1101 hi. thanks so much for checking us out. go for it. |
Hi @haileyhoyat! I'd like to try my hand at this one. |
@godclause Hello! Go for it. Cheers. |
@haileyhoyat @palakshivlani-11 @yawar1101 @pjsier Hi! I hope I'm not overthinking this question: the challenge with URL: https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx is that the "juicy" meeting details are only available as downloadable PDFs linked from ASP.NET web pages. I have 'discovered' only a few Python libraries useful for scraping PDFs, but none seem to work for remote scraping, if that makes sense.
|
No need to store them locally. You're going to get a stream of bytes regardless, so you can just read the byte stream directly from the request. I don't know how package management is applied here (e.g. pypdf)
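A minimal sketch of reading the byte stream directly, assuming the PDF bytes have already been downloaded and that pypdf is installed. The helper names `pdf_stream_from_bytes` and `extract_pdf_text` are made up for illustration and are not part of city-scrapers.

```python
import io


def pdf_stream_from_bytes(data: bytes) -> io.BytesIO:
    """Wrap downloaded PDF bytes in a seekable, file-like object."""
    # PDF files normally carry a "%PDF-" marker near the start; fail early
    # if the body looks like an HTML error page or other non-PDF content.
    if b"%PDF-" not in data[:1024]:
        raise ValueError("response body does not look like a PDF")
    return io.BytesIO(data)


def extract_pdf_text(data: bytes) -> str:
    """Extract text from in-memory PDF bytes (assumes pypdf is installed)."""
    from pypdf import PdfReader  # imported lazily: third-party dependency

    reader = PdfReader(pdf_stream_from_bytes(data))
    # extract_text() may yield nothing for image-only (scanned) pages.
    return "\n".join(page.extract_text() or "" for page in reader.pages)
```

In a Scrapy callback the raw bytes are available as `response.body`, so something like `extract_pdf_text(response.body)` should work without touching the disk.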
|
I have come to the view that running Python operations for city-scrapers is not wholly shell-agnostic (e.g. zsh, bash, csh). With this in mind, I believe package management has proven to be problematic and ought to be considered when starting a project. The problem is that zsh is now the default shell on Macs, while the commands provided in our docs support bash. For the folks running zsh, should there be verbiage in our docs reminding us to consider changing our shell if we're running Catalina or later? If not, should the docs be updated to explain the following scenarios: For Macs
For Windows-based machines
Is any of this at all necessary? |
Yes, there should be verbiage guiding or reminding us to either switch to bash or use a virtual machine, just to save us the agony of continuous frustration. I believe it is necessary. |
@godclause @appills @onyangojerry Hi All. I want to introduce you to Dan (@SimmonsRitchie). Dan has officially taken over the role of project lead for the entire City Scrapers project. Dan, idk if this conversation is relevant for you, particularly as you fix a lot of infrastructure things. Cheers, all. |
Thank you for the introduction @hails. Hi Dan, nice to meet you here. |
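Hi there, @godclause! Nice to meet you too! And thanks for the intro, @haileyhoyat. @godclause My apologies, I took over the city-scrapers project very recently and I'm juggling a lot of fixes and upgrades right now across the project's 15 repos. I overlooked this issue and conversation.
Re: saving PDFs as files before parsing
I think @appills may have already answered your question, but yes, you can just parse the in-memory byte sequence of the PDF rather than writing it to a file and then parsing it. This is generally more efficient.
Re: shell/OS issues
I am a Mac user but I have experienced my own headaches with a number of the city-scraper repos. To my mind, it may make a lot of sense to dockerize all the projects. I hope this will make them OS-agnostic and improve the dev experience overall (especially for newcomers). I'd be very interested in any feedback on this subject though. Let me know if you have thoughts! |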
@SimmonsRitchie Hi! I have 'some' thoughts...
I'm hoping I'm within scope on these concerns. |
Python module/package dependencies should work regardless of platform. Are you having problems?
On Thu, Feb 1, 2024 at 5:22 PM shinda wrote:
Hi there, @godclause! Nice to meet you too! And thanks for the intro, @haileyhoyat.
@godclause My apologies, I took over the city-scrapers project very recently and I'm juggling a lot of fixes and upgrades right now across the project's 15 repos. I overlooked this issue and conversation.
*Re: saving PDFs as files before parsing* I think @appills may have already answered your question, but yes, you can just parse the in-memory byte sequence of the PDF rather than writing it to a file and then parsing it. This is generally more efficient.
*Re: shell/OS issues* I am a Mac user but I have experienced my own headaches with a number of the city-scraper repos. To my mind, it may make a lot of sense to dockerize all the projects. I hope this will make them OS-agnostic and improve the dev experience overall (especially for newcomers). I'd be very interested in any feedback on this subject though. Let me know if you have thoughts!
@SimmonsRitchie Hi! I have 'some' thoughts...
1. For my parsing issue, @appills's answer did result in an initial inquiry into a need for clarity about OS (Mac) updates and how those are affecting Python dependencies.
2. How does Docker compare / contrast in 'usability' versus Vagrant for platform (OS) agnosticism? What's the expected long-term benefit attributed to Docker for City Scrapers projects versus Vagrant, from a support perspective? I mean, I'm all for improved performance, a better experience for newcomers, etc., but what will implementing Docker instead of Vagrant, and vice versa, cost City Scrapers' repos?
I'm hoping I'm within scope on these concerns.
|
@appills I have edited my comment above. Please excuse the error. Thank you in advance. To your question, I do not believe I have personal problems with module / package dependencies. Dependencies 'should' work regardless of platform (OS) or shell environment, though the case I did encounter suggests otherwise for zsh. Also, there is an evolving consensus that containerizing city-scrapers addresses that concern. |
Hello: It seems the expected behavior of this code snippet is to parse text from only a single file. How does our spider parse PDFs for all future / additional meetings, given the start URL 'https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx'? |
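One possible approach, sketched under assumptions: treat the start URL as a listing page, collect every link that points at a PDF, and then parse each PDF in turn. The sketch below uses only the standard library and assumes the meeting documents are exposed as ordinary `<a href="...pdf">` anchors, which has not been verified against the live page. `PdfLinkCollector` and `find_pdf_links` are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of anchors whose path ends in .pdf."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the listing page URL.
        absolute = urljoin(self.base_url, href)
        # Compare on the path only, so query strings do not hide the extension.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_urls:
            self.pdf_urls.append(absolute)


def find_pdf_links(html: str, base_url: str) -> list[str]:
    """Return de-duplicated PDF URLs found in the page, in document order."""
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls
```

Because the links are rediscovered on every run, PDFs posted for future meetings would be picked up without code changes, as long as the page keeps the same link structure. Inside a Scrapy spider the same idea would more likely be written with the response's selectors and one follow-up request per link.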
URL: https://www2.illinois.gov/sites/hfsrb/events/Pages/Board-Meetings.aspx
Spider Name: il_health_facilities
Agency Name: Illinois Health Facilities and Services Review Board