Commit

Merge pull request #1576 from vespa-engine/thomasht86/exclude-non-pdf
(colpalidemo) do not download non pdf links
thomasht86 authored Nov 12, 2024
2 parents 0b3cdfc + e434916 commit 71b5688
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions visual-retrieval-colpali/prepare_feed_deploy.py
@@ -178,8 +178,10 @@
     for a_tag in year_div.select("a.button.button--download-secondary[href]"):
         href = a_tag["href"]
         full_url = urljoin(url, href)
-        links.append(full_url)
-        url_to_year[full_url] = year
+        # exclude non-pdf links
+        if full_url.endswith(".pdf"):
+            links.append(full_url)
+            url_to_year[full_url] = year
 links, url_to_year
 # -
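
For readers who want to try the filter outside the notebook, here is a minimal standalone sketch of the same idea. It is not part of the commit: `is_pdf_link`, `collect_pdf_links`, and the example.com URLs are hypothetical names and data chosen for illustration. It also differs from the committed check in one respect, which is an assumption about what might be useful rather than something the change does: it tests the URL path case-insensitively, so links such as `report.PDF` or `file.pdf?download=1` are kept, whereas `full_url.endswith(".pdf")` would skip them.

```python
from urllib.parse import urljoin, urlparse


def is_pdf_link(url: str) -> bool:
    """Return True if the URL's path ends in .pdf (case-insensitive).

    Hypothetical helper; the commit itself uses full_url.endswith(".pdf").
    """
    # Look only at the path so query strings and fragments are ignored.
    return urlparse(url).path.lower().endswith(".pdf")


def collect_pdf_links(base_url: str, hrefs_by_year: dict) -> tuple:
    """Resolve hrefs against base_url and keep only PDF links.

    hrefs_by_year maps a year to the list of href values found for it,
    standing in for the a_tag["href"] values scraped in the original loop.
    """
    links = []
    url_to_year = {}
    for year, hrefs in hrefs_by_year.items():
        for href in hrefs:
            full_url = urljoin(base_url, href)
            # exclude non-pdf links
            if is_pdf_link(full_url):
                links.append(full_url)
                url_to_year[full_url] = year
    return links, url_to_year


# Made-up sample data, used only to exercise the helpers.
sample_links, sample_years = collect_pdf_links(
    "https://example.com/reports/",
    {
        2023: ["/files/annual-2023.pdf?download=1", "summary-2023.xlsx"],
        2022: ["annual-2022.PDF", "https://example.com/press/2022"],
    },
)
```

With the sample data above, only the two PDF URLs should end up in `sample_links`, each mapped to its year in `sample_years`.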
