handle case when tags inside the anchor
facundoolano committed Jan 6, 2024
1 parent ae1457d commit 8418e2c
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions feedi/scraping.py
@@ -83,8 +83,10 @@ def all_meta(soup):

 def extract_links(html):
     soup = BeautifulSoup(html, 'lxml')
-    # text = True to skip images
-    links = soup.find_all('a', text=True, href=lambda url: url and url.startswith('http'))
+    # checks tag.text so it skips image links
+    # checks startswith http to exclude local links (not sure if it's the best assumption?)
+    links = soup.find_all(lambda tag: tag.name ==
+                          'a' and tag.text and tag['href'].startswith('http'))
     return [a['href'] for a in links]
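
For reference, the filter this change is after (anchors with an http href and some text, even when that text sits inside nested tags) can be sketched with only the standard library. This is not the code in feedi/scraping.py: `_LinkCollector` and `extract_links_sketch` are hypothetical names, and `html.parser` stands in for BeautifulSoup. The old `text=True` filter most likely missed anchors with mixed children because string matching goes through `tag.string`, which is None in that case. The sketch also skips anchors without an href, where `tag['href']` would raise a KeyError, and unlike `tag.text` it treats whitespace-only anchors as empty.

```python
from html.parser import HTMLParser


class _LinkCollector(HTMLParser):
    """Collect hrefs of <a> tags that contain text, including text in nested tags."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None        # href of the anchor currently open, if it has one
        self._has_text = False   # whether that anchor has seen non-blank text

    def handle_starttag(self, tag, attrs):
        if tag == 'a':
            # .get() returns None for anchors without an href instead of raising
            self._href = dict(attrs).get('href')
            self._has_text = False

    def handle_data(self, data):
        # text inside nested tags (<b>, <span>, ...) also arrives here
        if self._href is not None and data.strip():
            self._has_text = True

    def handle_endtag(self, tag):
        if tag == 'a':
            if self._href and self._href.startswith('http') and self._has_text:
                self.links.append(self._href)
            self._href = None


def extract_links_sketch(html):
    """Hypothetical stand-in for extract_links, for illustration only."""
    parser = _LinkCollector()
    parser.feed(html)
    parser.close()
    return parser.links
```

Fed a plain text link, a link with mixed text and markup, an image-only link, a local link, and an anchor without an href, the sketch keeps only the first two.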

