
Following too many Redirects Is Problematic #1

Open
BrendanIAB opened this issue Jul 11, 2017 · 4 comments

Comments

@BrendanIAB

The test case results in 4-5 redirects due to a paywall. This misconfiguration should be handled gracefully, with a clear maximum on redirects, rather than failing while trying to parse the paywall text.

  • Consider updating adstxt_crawler.py line 126 to pass "allow_redirects=False" to the requests.get call. Note that this also disables useful redirects (domain.com > www.domain.com / HTTP > HTTPS).
  • Consider updating adstxt_crawler.py line 129 to examine "r.history" and enforce a maximum number of redirects (1-2). This requires additional code to interpret the r.history return value.
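The second suggestion can be sketched as follows. This is a hypothetical helper, not code from adstxt_crawler.py, and the cap of 2 is the 1-2 maximum proposed above:

```python
MAX_REDIRECTS = 2  # assumption: the 1-2 cap proposed above

def too_many_redirects(history):
    """Hypothetical helper: requests populates r.history with one
    Response object per redirect hop, so its length is the number
    of redirects that were followed."""
    return len(history) > MAX_REDIRECTS
```

After `r = requests.get(url)`, the crawler could call `too_many_redirects(r.history)` and discard the response when it returns True.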

mgriego commented Jul 27, 2017

Per the spec, these changes make sense. In fact, based on my reading of the spec, only a single 301, 302, or 307 redirect should be allowed: a scheme change from http to https. Everything else must remain the same; otherwise you risk an invalid redirect. I would argue that allowing a redirect from domain.com to www.domain.com risks running afoul of section 5.5. If www.domain.com is indeed the canonical domain for the site being crawled, then www.domain.com will be the domain driving significant requests for advertising, so it should be the domain that the crawler performs its request against.
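Under that reading of the spec, the redirect check reduces to comparing the original and target URLs. A minimal sketch (a hypothetical helper, not crawler code) that allows only an http -> https upgrade with everything else unchanged:

```python
from urllib.parse import urlsplit

def is_scheme_only_upgrade(original, target):
    # Allow the redirect only when the sole change is http -> https;
    # host, path, and query must be identical.
    o, t = urlsplit(original), urlsplit(target)
    return (o.scheme == "http" and t.scheme == "https"
            and o.netloc == t.netloc
            and o.path == t.path
            and o.query == t.query)
```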

iantri (Collaborator) commented Jul 29, 2017

In my data set, we see that "www." is routinely stripped (for the domain presented in the domain field), so this change would, while ensuring greater literal spec compliance, result in a poorer practical outcome. Perhaps a quick revision to the spec to also permit "domain.com" -> "www.domain.com" redirects is the cleanest solution.


mgriego commented Aug 10, 2017

I suggest we only allow two styles of redirect (I made this same comment in the Slack channel):

  1. Scheme modifications
  2. Redirects to a different host within the same domain tree (i.e. hostname.domain.tld -> domain.tld and vice versa, or even hostname.subdomain.domain.tld -> domain.tld or subdomain.domain.tld). As long as the redirect stays inside the domain tree of the ORIGINAL DOMAIN, it should be allowed. Checking against the original domain ensures that we don't redirect up the domain tree and then back down. This seems like a good compromise.

The number of redirects allowed should probably be limited as well.
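Rule 2 can be sketched as a pure check against the original host. This naive version (a hypothetical helper, not crawler code) takes the last two labels as the tree root, and would need a public-suffix list in a real crawler to handle multi-label suffixes such as co.uk:

```python
from urllib.parse import urlsplit

def in_same_domain_tree(original_url, target_url):
    # Naive root extraction: take the last two labels of the ORIGINAL
    # host. Assumption: a real implementation would use a public-suffix
    # list, since this version mishandles suffixes such as co.uk.
    root = ".".join(urlsplit(original_url).hostname.split(".")[-2:])
    target_host = urlsplit(target_url).hostname
    return target_host == root or target_host.endswith("." + root)
```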


ghost commented Sep 27, 2017

We implemented the redirect limitation using requests.Session() and it worked great.
We also need to consider adding a timeout value, as well as validating the page type (e.g. an HTML error page returned with response code 200); I encountered several sites that return huge HTML files that cause an exception.
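For reference, a sketch of that approach. The Session calls are shown as comments because they need network access, and looks_like_adstxt() and the 1 MB cap are hypothetical assumptions, not the commenter's actual code:

```python
# Redirect and timeout limits via requests (sketch):
#     s = requests.Session()
#     s.max_redirects = 2        # requests raises TooManyRedirects beyond this
#     r = s.get(url, timeout=5)  # seconds; guards against slow hosts
#
# Page-type validation as a pure helper:

MAX_BYTES = 1_000_000  # assumption: size cap to reject huge HTML pages

def looks_like_adstxt(content_type, body):
    """Hypothetical check for HTML error pages served with a 200
    status, and for oversized bodies that crash the parser."""
    if content_type and "text/html" in content_type.lower():
        return False
    if len(body) > MAX_BYTES:
        return False
    return True
```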
