500px now rips non-water marked images #492

cyian-1756 · 2017-03-25T12:07:32Z

The 500px ripper now rips images without a watermark, closing issue #491. There are still some issues with the ripper (it takes a long while to start ripping and doesn't save the image titles), but those can be fixed later.

Test link http://500px.com/david-foto

metaprime

Looks good overall. Still managed to get an adult content placeholder image (although I got 67 actual photos).

Also missing out on obvious titles which could be taken from the image file names if available.

Also it looks like after one rip of the example link I exceeded the rate limit, so I can't test again.

As usual, I think we can definitely start ripping before all of the URLs are parsed out. It's a common pattern in various rippers anyway, so it seems worth making that change. But I suppose it could wait.
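A rough sketch of what "start ripping before all of the URLs are parsed out" could look like, assuming the parser can hand over URLs one at a time. `StreamingRipSketch`, `rip`, and the `downloader` callback are hypothetical names for illustration, not the project's actual classes, and the one-hour wait is an arbitrary placeholder.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.Consumer;

/** Hypothetical sketch; not the ripper code in this pull request. */
final class StreamingRipSketch {
    private StreamingRipSketch() {}

    /**
     * Submits a download task for each URL as soon as the iterable yields it,
     * instead of collecting every URL first and downloading afterwards.
     */
    static void rip(Iterable<String> discoveredUrls, Consumer<String> downloader, int threads)
            throws InterruptedException {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            for (String url : discoveredUrls) {
                // Downloads run in the pool while the loop keeps pulling URLs.
                pool.submit(() -> downloader.accept(url));
            }
        } finally {
            // Stop accepting new tasks; tasks already queued still run.
            pool.shutdown();
        }
        // Arbitrary upper bound on how long to wait for queued downloads.
        pool.awaitTermination(1, TimeUnit.HOURS);
    }
}
```

If the iterable is lazy (for example, backed by page-by-page parsing), the first downloads would begin while later pages are still being fetched.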

metaprime · 2017-04-25T10:29:16Z

My log btw: https://pastebin.com/ZpCgqdFC

cyian-1756 · 2017-04-25T10:50:12Z

@metaprime

> Looks good overall. Still managed to get an adult content placeholder image

> Also it looks like after one rip of the example link I exceeded the rate limit, so I can't test again.

It looks like there have been some changes to the site since I wrote the ripper. I'll get on fixing these.

Hrxn · 2017-04-25T16:14:57Z

Maybe it's best to avoid using images = doc.select("meta[property=og:image]"); completely, so we don't rely on <meta og:image... at all.

Then this check can be discarded: if (imageURL.contains("https://500px.com/graphics/nude/img_3"))

Because this placeholder URL could be different, or could change any time.

Instead, always extract the target URL(s) from here:

for (Element script : doc.select("head > script")) {
    if (script.html().contains("window.PxPreloadedData")) {
        // ........
    }
}

Because that script element with window.PxPreloadedData should always be present.
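As a sketch of how the payload might be pulled out of that script once it is found: this assumes the script assigns an object literal to `window.PxPreloadedData` and that strings inside it are double-quoted, neither of which is confirmed in this thread. `PreloadedDataExtractor` and `extractObjectLiteral` are hypothetical names, and the sketch uses only plain string scanning rather than any particular JSON library.

```java
/** Hypothetical helper; the payload format is an assumption, not confirmed here. */
final class PreloadedDataExtractor {
    static final String MARKER = "window.PxPreloadedData";

    private PreloadedDataExtractor() {}

    /**
     * Returns the first brace-balanced {...} block that follows MARKER,
     * or null if the marker or a balanced block is not found.
     */
    static String extractObjectLiteral(String scriptText) {
        int marker = scriptText.indexOf(MARKER);
        if (marker < 0) {
            return null;
        }
        int start = scriptText.indexOf('{', marker);
        if (start < 0) {
            return null;
        }
        int depth = 0;
        boolean inString = false;
        boolean escaped = false;
        for (int i = start; i < scriptText.length(); i++) {
            char c = scriptText.charAt(i);
            if (inString) {
                // Braces inside double-quoted strings must not affect the depth.
                if (escaped) {
                    escaped = false;
                } else if (c == '\\') {
                    escaped = true;
                } else if (c == '"') {
                    inString = false;
                }
            } else if (c == '"') {
                inString = true;
            } else if (c == '{') {
                depth++;
            } else if (c == '}') {
                depth--;
                if (depth == 0) {
                    return scriptText.substring(start, i + 1);
                }
            }
        }
        return null; // braces never balanced
    }
}
```

The returned text would still need to be parsed to get at the image URLs; which keys hold them is not something this thread specifies.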

metaprime · 2017-08-11T09:45:09Z

@cyian-1756 any update on this one?

cyian-1756 · 2017-08-11T11:30:02Z

They implemented some insane rate limiting (I was still getting IP banned after waiting 10 secs between requests), so I haven't really been able to do much testing (I get pretty much insta-banned).

metaprime · 2017-08-12T10:28:35Z

Maybe we need to make the wait interval long and slightly randomized to get around bot-detection?

rautamiekka · 2017-08-12T11:06:59Z

^ 10 seconds and getting insta-banned is already a lot, so the base waiting time would have to be something like 15 or 20 seconds at minimum with 5-10 seconds range of randomization at minimum ... And those might not even be enough.

Tbh I'm very surprised how strict limiting they suddenly implemented.
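A minimal sketch of the randomized wait discussed above, assuming a fixed base delay plus uniform random jitter. `JitteredDelay` and its methods are hypothetical and not part of the ripper; the 15 to 20 second base and 5 to 10 second range are just the figures floated in this thread, not tested values.

```java
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical helper; not part of the ripper in this pull request. */
final class JitteredDelay {
    private JitteredDelay() {}

    /**
     * Returns a wait time in milliseconds: baseMillis plus a random extra
     * drawn uniformly from [0, jitterMillis].
     */
    static long nextDelayMillis(long baseMillis, long jitterMillis) {
        if (baseMillis < 0 || jitterMillis < 0) {
            throw new IllegalArgumentException("delays must be non-negative");
        }
        // nextLong(bound) excludes the bound, so add 1 to make jitterMillis reachable.
        return baseMillis + ThreadLocalRandom.current().nextLong(jitterMillis + 1);
    }

    /** Sleeps for a randomized interval between two requests. */
    static void sleepBetweenRequests(long baseMillis, long jitterMillis)
            throws InterruptedException {
        Thread.sleep(nextDelayMillis(baseMillis, jitterMillis));
    }
}
```

For example, `JitteredDelay.sleepBetweenRequests(15_000, 10_000)` would wait somewhere between 15 and 25 seconds before the next request. Whether that is enough to avoid a ban is untested.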

cyian-1756 · 2017-08-12T14:51:06Z

> Maybe we need to make the wait interval long and slightly randomized to get around bot-detection?

That might work, I'll look into it.

> Tbh I'm very surprised how strict limiting they suddenly implemented.

I wouldn't be shocked if they did it to combat ripme, considering it went into effect pretty much right after I fixed this ripper and added watermark-free ripping.

Now rips non-water marked images

2aaaf81

cyian-1756 mentioned this pull request Mar 27, 2017

500px Integration #396

Closed

metaprime self-assigned this Apr 25, 2017

metaprime added this to the On-deck for 1.4.x milestone Apr 25, 2017

metaprime removed their assignment Apr 25, 2017

metaprime requested changes Apr 25, 2017

metaprime added the waiting-author label Apr 25, 2017

metaprime modified the milestones: On-deck for 1.4.x, On-deck for 1.5.x Jun 25, 2017
