
Use multithreaded YouTube downloader from zimscraperlib #301

Closed

Conversation

dan-niles
Collaborator

Close #118

Use the newer multithreaded YouTube downloader from zimscraperlib to download videos, thumbnails, and subtitles.

Changes:

  • Remove "batch" downloading of videos and the download_video_files_batch method.
  • Update the download_video, download_thumbnail, and download_subtitle methods to use the new multithreaded downloader from scraperlib.
  • Update the above methods to download files asynchronously, as sketched below.
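
For context, a minimal sketch of how this downloader is typically driven (assuming the `YoutubeDownloader`/`BestMp4` API from zimscraperlib's `download` module; the video ID and paths are illustrative, not this PR's actual code):

```python
from pathlib import Path

from zimscraperlib.download import BestMp4, YoutubeDownloader

# one shared thread pool for every video/thumbnail/subtitle download
downloader = YoutubeDownloader(threads=4)

video_id = "aaaaaaaaaaa"  # hypothetical video ID
options = BestMp4.get_options(
    target_dir=Path("videos") / video_id,
    filepath=Path("video.mp4"),
)

# wait=False submits the job to the pool and returns a Future immediately
future = downloader.download(video_id, options, wait=False)

future.result()  # blocks until done; re-raises any yt-dlp error
downloader.shutdown()
```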

@dan-niles dan-niles self-assigned this Aug 30, 2024

codecov bot commented Aug 30, 2024

Codecov Report

Attention: Patch coverage is 0% with 77 lines in your changes missing coverage. Please review.

Project coverage is 1.54%. Comparing base (eedd0b2) to head (69cb71c).
Report is 3 commits behind head on main.

Files with missing lines           | Patch % | Lines
-----------------------------------|---------|--------------
scraper/src/youtube2zim/scraper.py | 0.00%   | 77 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##            main    #301      +/-   ##
========================================
+ Coverage   1.52%   1.54%   +0.01%     
========================================
  Files         11      11              
  Lines       1112    1102      -10     
  Branches     166     158       -8     
========================================
  Hits          17      17              
+ Misses      1095    1085      -10     


@dan-niles dan-niles marked this pull request as ready for review August 30, 2024 19:05
@dan-niles dan-niles requested a review from rgaudin August 30, 2024 19:07
@rgaudin rgaudin requested review from benoit74 and removed request for rgaudin August 31, 2024 14:25
@rgaudin
Member

rgaudin commented Aug 31, 2024

Changed reviewer, as @benoit74 is coming back.

Collaborator

@benoit74 benoit74 left a comment


This is not working exactly as intended: when a video fails to download, it is not counted as failed, and the final ZIM is hence created with no videos at all (in my test, where I not-so-intentionally had an issue with yt-dlp).

download_video and download_thumbnail now return a future, so one has to check its status after completion (not an easy feat, tbh).
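
For reference, a sketch of what that status bookkeeping could look like (same assumed zimscraperlib API as in the PR description; IDs and paths are illustrative):

```python
from concurrent.futures import wait
from pathlib import Path

from zimscraperlib.download import BestMp4, YoutubeDownloader

downloader = YoutubeDownloader(threads=2)
video_ids = ["aaaaaaaaaaa", "bbbbbbbbbbb"]  # hypothetical IDs

# map each Future back to its video ID so failures can be attributed
futures = {
    downloader.download(
        video_id,
        BestMp4.get_options(target_dir=Path("videos") / video_id),
        wait=False,
    ): video_id
    for video_id in video_ids
}

wait(futures.keys())  # block until every download has finished or raised

succeeded = [vid for fut, vid in futures.items() if fut.exception() is None]
failed = [vid for fut, vid in futures.items() if fut.exception() is not None]
downloader.shutdown()
```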

@dan-niles dan-niles force-pushed the use-multithreaded-yt-downloader-from-zimscraperlib branch 2 times, most recently from 9d3e4e4 to b1304aa Compare September 5, 2024 06:36
@dan-niles dan-niles force-pushed the use-multithreaded-yt-downloader-from-zimscraperlib branch from b1304aa to 962b6b3 Compare September 6, 2024 12:24
@dan-niles dan-niles marked this pull request as draft September 6, 2024 18:02
@dan-niles
Collaborator Author

@benoit74 I've updated the code to handle errors correctly. It now properly returns both the succeeded and failed video_ids. I've also listed below the time taken to create a ZIM of the openZIM_testing channel with different concurrency values (without using the S3 cache).

Concurrency | Time (s)
----------- | --------
1           | 40.16
2           | 25.29
4           | 21.11

However, I realized that most ZIMs on the Zimfarm are built using the S3 cache. With this approach, only the videos that are not in the S3 cache, and are therefore downloaded directly with yt-dlp, benefit from the parallelization. This is because I'm using the ThreadPoolExecutor in YoutubeDownloader from python-scraperlib, rather than downloading in batches as suggested in #118 (comment). This change will therefore only provide a performance gain when videos are missing from the S3 cache, as sketched below.
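
```python
# Illustrative sketch of the flow described above; the two helpers are
# hypothetical stand-ins, not real kiwixstorage or zimscraperlib calls.

def download_from_s3_cache(s3_storage, video_id) -> bool:
    """Hypothetical: return True if the video could be fetched from the S3 cache."""
    ...

def options_for(video_id) -> dict:
    """Hypothetical: build the yt-dlp options for this video."""
    ...

def schedule_downloads(video_ids, downloader, s3_storage):
    """Return {future: video_id}, with futures created for S3 cache misses only."""
    futures = {}
    for video_id in video_ids:
        if download_from_s3_cache(s3_storage, video_id):
            # S3 hit: fetched synchronously; the thread pool is never involved
            continue
        # S3 miss: only these submissions share YoutubeDownloader's ThreadPoolExecutor
        future = downloader.download(video_id, options_for(video_id), wait=False)
        futures[future] = video_id
    return futures
```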

Given that most downloads rely on the S3 cache, would it be better to keep the current approach of downloading in batches and using the YoutubeDownloader from scraperlib without concurrent execution?

WDYT?

@benoit74
Collaborator

benoit74 commented Sep 9, 2024

> Given that most downloads rely on the S3 cache, would it be better to keep the current approach of downloading in batches and using the YoutubeDownloader from scraperlib without concurrent execution?

This is indeed a very important point! The current implementation allows concurrent download of both S3-cached videos and YouTube videos, as well as concurrent re-encoding.

I'm sorry not to have noticed this before. I feel a bit ashamed to have wasted your time on this. Thank you for speaking up; it would have been a bad move indeed!

I propose we close the issue and the PR as "non-relevant", WDYT? Or, at the least, we should enhance the scraperlib first, but that is not a small change to implement.

@dan-niles
Collaborator Author

No worries at all! I should have considered this earlier, before moving forward with the implementation. I'm fine with closing this issue as "non-relevant". That said, I agree that using the YouTube downloader from scraperlib prevents us from achieving concurrency across S3 downloads and video re-encoding.

@benoit74 benoit74 closed this Sep 10, 2024