
Update for NYT / User Agent change #64

Open
iamvishnurajan opened this issue May 1, 2022 · 1 comment

Comments

@iamvishnurajan
Contributor

In the last few weeks, the NYT scraper has stopped working; it appears that NYT is now blocking requests that use the default "python-requests" user agent.

This can be fixed by editing ~/parsers/baseparser.py and adding the code below to grab_url. The snippet rotates randomly among ten user agents; everything between opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) and retry = False is new. You can, of course, substitute whatever user agents you like.

# Note: baseparser.py already imports cookielib, urllib2, socket, and time;
# 'random' must be added to those imports for the rotation below.
def grab_url(url, max_depth=5, opener=None):
    if opener is None:
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        user_agent_list = [
        'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
        'Googlebot/2.1 (+http://www.google.com/bot.html)',
        'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)',
        'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
        'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36',
        'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
        'Mozilla/5.0 (X11; Linux x86_64)',
        ]
        # pick one user agent at random for this opener
        user_agent = random.choice(user_agent_list)
        opener.addheaders = [('User-Agent', user_agent)]
    retry = False
    try:
        text = opener.open(url, timeout=5).read()
        if '<title>NY Times Advertisement</title>' in text:
            retry = True
    except socket.timeout:
        retry = True
    if retry:
        if max_depth == 0:
            raise Exception('Too many attempts to download %s' % url)
        time.sleep(0.5)
        return grab_url(url, max_depth-1, opener)
    return text
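For anyone running under Python 3, the same rotation idea can be sketched against the renamed stdlib modules (urllib2 became urllib.request, cookielib became http.cookiejar). This is only an illustration, not code from the repo; the helper name and the trimmed agent list are my own:

```python
import random
import urllib.request
from http.cookiejar import CookieJar

# Illustrative subset of the agent list from the patch above.
USER_AGENTS = [
    'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
    'Mozilla/5.0 (X11; Linux x86_64)',
]

def build_opener_with_random_ua(agents=USER_AGENTS):
    """Return a cookie-aware opener whose User-Agent is picked at random."""
    cj = CookieJar()
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
    # Assigning addheaders replaces the opener's default header list entirely.
    opener.addheaders = [('User-Agent', random.choice(agents))]
    return opener
```

Building a fresh opener per retry (rather than reusing one) would also re-roll the user agent on each attempt, which may help if a particular agent string gets blocked.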

I've updated this on my fork, but I have several other changes that are fairly specific and that others may or may not want, so I was hesitant to open a pull request. I wanted to document the fix here in case others were wondering why the system isn't catching new NYT articles.

Thanks,
Vishnu

@MaxBittker

Thank you @iamvishnurajan! Incorporating this to fix nyt_first_said.
