In the past few weeks, the NYT scraper has stopped working: it appears the NYT is now blocking requests that use the default "python-requests" user agent.
This can be fixed by editing ~/parsers/baseparser.py and modifying grab_url as follows, so that it randomly rotates among ten different user agents. The section between opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj)) and retry = False is what has been added; of course, you can put whatever user agents you want here.
def grab_url(url, max_depth=5, opener=None):
    if opener is None:
        cj = cookielib.CookieJar()
        opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cj))
        # Pick a random user agent for this opener instead of the default.
        # Make sure random is imported at the top of baseparser.py
        # (alongside cookielib, urllib2, socket, and time).
        user_agent_list = [
            'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/80.0.3987.149 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_14_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36 OPR/72.0.3815.400',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.14; rv:83.0) Gecko/20100101 Firefox/83.0',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9',
            'Googlebot/2.1 (+http://www.google.com/bot.html)',
            'UCWEB/2.0 (compatible; Googlebot/2.1; +google.com/bot.html)',
            'facebookexternalhit/1.1 (+http://www.facebook.com/externalhit_uatext.php)',
            'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko; Google Page Speed Insights) Chrome/41.0.2272.118 Safari/537.36',
            'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:77.0) Gecko/20100101 Firefox/77.0',
            'Mozilla/5.0 (X11; Linux x86_64)',
        ]
        user_agent = random.choice(user_agent_list)
        opener.addheaders = [('User-Agent', user_agent)]
    retry = False
    try:
        text = opener.open(url, timeout=5).read()
        # NYT sometimes serves an ad interstitial instead of the article;
        # treat that as a failed fetch and retry.
        if '<title>NY Times Advertisement</title>' in text:
            retry = True
    except socket.timeout:
        retry = True
    if retry:
        if max_depth == 0:
            raise Exception('Too many attempts to download %s' % url)
        time.sleep(0.5)
        # Retry with the same opener, keeping its cookies and user agent.
        return grab_url(url, max_depth - 1, opener)
    return text
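As a quick sanity check (not part of the repo; httpbin.org's /user-agent endpoint simply echoes back the User-Agent header it received), you can confirm the rotated agent is actually being sent:

    # Hypothetical sanity check: this should print one of the agents from
    # user_agent_list rather than the urllib2 default.
    print grab_url('https://httpbin.org/user-agent')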
I've updated this on my fork, but I have several other changes that are fairly specific to my setup and that others may or may not want, so I was hesitant to submit a pull request. I wanted to document the fix here, though, in case others were wondering why the system isn't catching new NYT articles.
Thanks,
Vishnu