Support #1

Open
ghost opened this issue Mar 16, 2022 · 16 comments
Comments

@ghost

ghost commented Mar 16, 2022

Thanks for the tool. ❤❤

Can you add support for downloading files from the https://www.coursehero.com/tutors-problems/* endpoint? For example, the tool can download https://www.coursehero.com/file/61519475/Human-Services-Assignment-1-docx/ but not https://www.coursehero.com/tutors-problems/Social-Psychology/38685580-What-is-a-civic-professional-in-relation-to-the-Human-Service/

@daijro
Owner

daijro commented Mar 16, 2022

Hello, thank you for bringing attention to this! Sadly, I don't believe it is possible to scrape information from these endpoints.

On https://www.coursehero.com/file/* links, the pages are hosted as blurred images that are split up into unblurred previews. This tool gathers unblurred parts of each page hosted on CourseHero servers and rebuilds the document behind the paywall.

On tutors-problems endpoints, the previews shown on CourseHero are actually randomly generated blurred text:

image

From what I could tell, there wasn't any way for me to gather any previews or split segments of the original answer. The only way to access the information behind the paywall would have to be using a CourseHero premium account token :(

Thank you so much for using my tool!!! I'm glad to see people using it. Currently working on a major update!!!
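
To make the distinction above concrete, here is a minimal sketch that only sorts a URL into the two endpoint families discussed in this thread. The function name `classify_coursehero_url` is hypothetical and is not part of the tool; it performs no network requests and only inspects the first path segment.

```python
from urllib.parse import urlparse


def classify_coursehero_url(url: str) -> str:
    """Return 'file', 'tutors-problems', or 'other' for a given URL.

    Hypothetical helper for illustration only: it looks at the host and
    the first path segment and nothing else.
    """
    parsed = urlparse(url)
    if parsed.netloc.lower() not in ("coursehero.com", "www.coursehero.com"):
        return "other"
    # Drop empty segments produced by leading/trailing slashes.
    segments = [s for s in parsed.path.split("/") if s]
    if segments and segments[0] in ("file", "tutors-problems"):
        return segments[0]
    return "other"


if __name__ == "__main__":
    for link in (
        "https://www.coursehero.com/file/61519475/Human-Services-Assignment-1-docx/",
        "https://www.coursehero.com/tutors-problems/Social-Psychology/38685580-What-is-a-civic-professional-in-relation-to-the-Human-Service/",
    ):
        print(classify_coursehero_url(link), link)
```

A check like this could let a caller report up front that a tutors-problems link is not supported, rather than failing later.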

@ghost
Author

ghost commented Mar 16, 2022

Thank you for your update. We shall wait for the update.

@ghost ghost closed this as completed Mar 16, 2022
@ghost ghost reopened this Mar 26, 2022
@ghost
Author

ghost commented Mar 26, 2022

Screenshot 2022-03-26 194123

Using the same IP, I was able to download one homework solution, but the other raised an error.
What could be the issue? I even tried multiple NordVPN IPs, but the same thing kept happening.

daijro added a commit that referenced this issue Mar 26, 2022
- Add back the Accept-Encoding header (#1)
- Fix get_remaining time from starting elapsed time
@daijro
Owner

daijro commented Mar 26, 2022

Just released a fix, hopefully it works now

@ghost
Author

ghost commented Mar 27, 2022

It is still giving the same error. Another thing to mention is that I am using Python 3.10.4 on Ubuntu.

@daijro
Owner

daijro commented Mar 28, 2022

Hello, sorry for the late response. There were a few questions I'd like to ask:

  1. Are you able to reach this endpoint in a browser? If you are, I'll need to fix my request headers.

  2. Does it only fail on this specific CourseHero link (https://www.coursehero.com/file/p20cefc/D-Re-direct-behavior-by-providing-choices-or-options-for-alternative-activities/) or all of them? Do any other CourseHero links fail?

  3. I was able to successfully run this using Python 3.8.9 on Windows shown below (I wasn't able to reproduce your error). Perhaps the version of Python you are running could be the issue? This script wasn't built with Python 3.10 compatibility in mind.

image

Thanks!

@ghost
Author

ghost commented Mar 28, 2022

Thanks for the reply.

  1. Yes, I am able to reach the endpoints in my browser.
  2. I am able to download https://www.coursehero.com/file/p20cefc/D-Re-direct-behavior-by-providing-choices-or-options-for-alternative-activities/ but unable to download https://www.coursehero.com/file/80230572/Corporal-Punishment-Law-Of-Children-in-t-1docx/
  3. I reverted my Python to 3.8.9 as well.

@daijro
Owner

daijro commented Mar 28, 2022

Can you run the command with --debug as a flag? Sorry, I can't seem to find a way to reproduce that error.

@ghost
Author

ghost commented Mar 28, 2022

image

@daijro
Owner

daijro commented Mar 28, 2022

Thanks! I just found what's causing the issue. Fixing it right now.

@daijro daijro closed this as completed in 41c98d6 Mar 28, 2022
@ghost
Author

ghost commented Apr 1, 2022

Hi, the regex is better now.

However, changing the IP does not resolve the error, because the Incapsula firewall blocks requests based on headers, cookies, and much more. Below is a curl request I made manually to the website.
image

Do you think using Selenium and Scrapy would solve this?

@daijro
Owner

daijro commented Apr 1, 2022

Hello, I had planned on using a QWebEngine, and passing the arguments to the requests session (similar to this), but I didn't think Incapsula Firewall would get in the way.

I'll be sure to add it as a fallback next time I have the chance to work on it!

@daijro daijro reopened this Apr 1, 2022
@ghost
Author

ghost commented Apr 3, 2022

I looked around for how to bypass the WAF and came across Imperva_gzip_WAF_Bypass; CourseHero appears to be vulnerable to it.

image

@daijro
Owner

daijro commented Apr 4, 2022

Hi, I had no luck using this bypass. It seems to be falsely taking the 200 response code as a success:

image

I think the best way for me to bypass this would have to be through pyppeteer or some other JavaScript web engine to run captchas.

Thanks for showing me this! ❤️

@abdouhl

abdouhl commented Apr 14, 2022

I have the same problem.
How can I fix it?
I am using Python 3.10 on Ubuntu too.

@weakall0999

Got this error:

image


3 participants