500px now rips non-water marked images #492

Open
wants to merge 1 commit into
base: master

cyian-1756
Contributor

The 500px ripper now rips images without a watermark, closing issue #491. There are still some issues with the ripper (it takes a long while to start ripping and doesn't save the image titles), but those can be fixed later.

Test link: http://500px.com/david-foto

@cyian-1756 cyian-1756 mentioned this pull request Mar 27, 2017
@metaprime metaprime self-assigned this Apr 25, 2017
@metaprime metaprime added this to the On-deck for 1.4.x milestone Apr 25, 2017
@metaprime metaprime removed their assignment Apr 25, 2017
Collaborator

@metaprime metaprime left a comment

Looks good overall. Still managed to get an adult content placeholder image (although I got 67 actual photos).

It's also missing the obvious titles, which could be taken from the image file names if available.

Also it looks like after one rip of the example link I exceeded the rate limit, so I can't test again.

As usual, I think we can definitely start ripping before all of the URLs are parsed out. It's a common pattern in various rippers anyway, so it seems worth making that change. But I suppose it could wait.
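
As a rough illustration of starting downloads before all of the URLs are parsed, here is a minimal sketch that hands each image URL to a worker pool as soon as its page has been parsed. It is not ripme's actual code: `IncrementalRipSketch`, `PageParser`, and the `downloader` callback are hypothetical names used only for this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Consumer;

// Hypothetical sketch: none of these names come from ripme itself.
final class IncrementalRipSketch {

    /** Stand-in for a page parser: returns the image URLs found on one page. */
    interface PageParser {
        List<String> parse(String pageUrl);
    }

    /**
     * Parses pages one at a time and submits every image URL for download as
     * soon as it is found, so downloads overlap with the parsing of later pages.
     * Returns the number of downloads that were submitted.
     */
    static int rip(List<String> pageUrls, PageParser parser,
                   Consumer<String> downloader, int threads) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<?>> pending = new ArrayList<>();
        try {
            for (String pageUrl : pageUrls) {
                for (String imageUrl : parser.parse(pageUrl)) {
                    // The download is queued immediately; parsing continues.
                    pending.add(pool.submit(() -> downloader.accept(imageUrl)));
                }
            }
            // Wait for every queued download before returning.
            for (Future<?> f : pending) {
                f.get();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while ripping", e);
        } catch (ExecutionException e) {
            throw new IllegalStateException("a download failed", e.getCause());
        } finally {
            pool.shutdown();
        }
        return pending.size();
    }
}
```

With a strict rate limit on the site, the pool size would probably need to stay small; the point is only that the first download does not have to wait for the last page to be parsed.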

@metaprime
Collaborator

My log btw: https://pastebin.com/ZpCgqdFC

@cyian-1756
Contributor Author

@metaprime

Looks good overall. Still managed to get an adult content placeholder image

Also it looks like after one rip of the example link I exceeded the rate limit, so I can't test again.

It looks like there have been some changes to the site since I wrote the ripper. I'll get on fixing these.

@Hrxn

Hrxn commented Apr 25, 2017

Maybe it's best to avoid using images = doc.select("meta[property=og:image]"); completely, so we don't rely on <meta og:image... at all.

Then this check can be discarded: if (imageURL.contains("https://500px.com/graphics/nude/img_3"))

Because this placeholder URL could be different, or could change any time.

Instead, always extract the target URL(s) from here:

for (Element script : doc.select("head > script")) {
    if (script.html().contains("window.PxPreloadedData")) {

 ........

Because that script element with window.PxPreloadedData should always be present.
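
To make the suggestion above a bit more concrete, here is a minimal, standard-library-only sketch of pulling the data out of such a script body once it has been located (for example from `script.html()` in the loop above). It assumes, without confirmation from the site, that the script assigns a JSON-style object literal to `window.PxPreloadedData`; the class and method names are hypothetical, and the structure of the data itself is not known here, so the sketch stops at returning the raw literal.

```java
// Hypothetical sketch: assumes the script contains something like
//   window.PxPreloadedData = { ... };
// The real layout of the 500px page may differ.
final class PreloadedDataSketch {

    static final String MARKER = "window.PxPreloadedData";

    /**
     * Returns the first brace-balanced object literal that follows the marker,
     * or null if the marker is absent or the braces never balance.
     * Only double-quoted strings are recognised, as in JSON.
     */
    static String extractObjectLiteral(String scriptBody) {
        int markerAt = scriptBody.indexOf(MARKER);
        if (markerAt < 0) {
            return null;
        }
        int start = scriptBody.indexOf('{', markerAt);
        if (start < 0) {
            return null;
        }
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = start; i < scriptBody.length(); i++) {
            char c = scriptBody.charAt(i);
            if (inString) {
                // Braces inside string values must not affect the depth count.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}' && --depth == 0) {
                return scriptBody.substring(start, i + 1);
            }
        }
        return null; // unbalanced braces
    }
}
```

The returned string could then be handed to whatever JSON parser the project already uses in order to read out the image URLs.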

@metaprime
Collaborator

@cyian-1756 any update on this one?

@cyian-1756
Contributor Author

They implemented some insane rate limiting (I was still getting IP banned after waiting 10 seconds between requests), so I haven't really been able to do much testing (I get banned pretty much instantly).

@metaprime
Collaborator

Maybe we need to make the wait interval long and slightly randomized to get around bot-detection?

@rautamiekka

rautamiekka commented Aug 12, 2017

^ 10 seconds and getting insta-banned is already a lot, so the base waiting time would have to be something like 15 or 20 seconds at minimum with 5-10 seconds range of randomization at minimum ... And those might not even be enough.

Tbh I'm very surprised at how strict the limiting they suddenly implemented is.
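
For reference, a long, slightly randomized wait of the kind discussed above could be sketched as follows. This is only an illustration: the class and method names are hypothetical, the 15 second base and 10 second jitter are taken from the rough figures in this thread, and nothing here has been tested against the site's actual limits.

```java
import java.util.Random;

// Hypothetical sketch: the names and the concrete numbers are assumptions.
final class RandomizedDelaySketch {

    /**
     * Returns a delay of at least baseMillis and at most
     * baseMillis + jitterMillis, chosen uniformly at random.
     */
    static long nextDelayMillis(long baseMillis, long jitterMillis, Random rng) {
        if (baseMillis < 0 || jitterMillis < 0) {
            throw new IllegalArgumentException("delays must be non-negative");
        }
        // nextDouble() is in [0, 1), so the added jitter never exceeds jitterMillis.
        return baseMillis + (long) (rng.nextDouble() * jitterMillis);
    }

    /** Sleeps for a 15 second base plus up to 10 seconds of random jitter. */
    static void pauseBetweenRequests(Random rng) throws InterruptedException {
        Thread.sleep(nextDelayMillis(15_000L, 10_000L, rng));
    }
}
```

Calling `pauseBetweenRequests` before each page request would space the requests unevenly, which is the idea behind the randomization; whether that is enough to avoid a ban is an open question.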

@cyian-1756
Contributor Author

Maybe we need to make the wait interval long and slightly randomized to get around bot-detection?

That might work, I'll look into it.

Tbh I'm very surprised at how strict the limiting they suddenly implemented is.

I wouldn't be shocked if they did it to combat ripme, considering it went into effect pretty much right after I fixed this ripper and added watermark-free ripping.

4 participants