
Update for NYT / User Agent change #64

Open
iamvishnurajan opened this issue May 1, 2022 · 1 comment

Comments

@iamvishnurajan
Contributor

In the last few weeks, the NYT scraper has stopped working; it appears that NYT is now blocking requests that use the default "python-requests" user agent.

This can be fixed by editing ~/parsers/baseparser.py and adding the code below to grab_url. The snippet rotates randomly among ten user agents; everything between opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) and retry = False is new. You can, of course, substitute whatever user agents you like.

# Note: baseparser.py already imports cookielib, urllib2, socket, and time;
# 'random' must be added to those imports for the rotation below.
def grab_url(url, max_depth=5, opener=None):
    if opener is None:
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
        'Googlebot/2.1 (+http://www.google.com/bot.html)',
        'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)',
        'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Mozilla/5.0 (X11; Linux x86_64)',
        ]
        # pick one user agent at random for this opener
        user_agent = random.choice(user_agent_list)
        opener.addheaders = [('User-Agent', user_agent)]
    retry = False
    try:
        text = opener.open(url, timeout=5).read()
        if '<title>NY Times Advertisement</title>' in text:
            retry = True
    except socket.timeout:
        retry = True
    if retry:
        if max_depth == 0:
            raise Exception('Too many attempts to download %s' % url)
        time.sleep(0.5)
        return grab_url(url, max_depth-1, opener)
    return text
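For anyone running under Python 3, the same rotation idea can be sketched against the renamed stdlib modules (urllib2 became urllib.request, cookielib became http.cookiejar). This is only an illustration, not code from the repo; the helper name and the trimmed agent list are my own:

```python
import random
import urllib.request
from http.cookiejar import CookieJar

# Illustrative subset of the agent list from the patch above.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def build_opener_with_random_ua(agents=USER_AGENTS):
    """Return a cookie-aware opener whose User-Agent is picked at random."""
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # Assigning addheaders replaces the opener's default header list entirely.
    opener.addheaders = [('User-Agent', random.choice(agents))]
    return opener
```

Building a fresh opener per retry (rather than reusing one) would also re-roll the user agent on each attempt, which may help if a particular agent string gets blocked.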

I've updated this on my fork, but I have several other changes that are fairly specific and that others may or may not want, so I was hesitant to open a pull request. I wanted to document the fix here in case others were wondering why the system isn't catching new NYT articles.

Thanks,
Vishnu

@MaxBittker

Thank you @iamvishnurajan! Incorporating this to fix nyt_first_said.
