multi page marketing site scraping #2196

Merged: 8 commits, Apr 14, 2025

shanbady
Contributor

What are the relevant tickets?

Closes https://github.com/mitodl/hq/issues/7081

Description (What does it do?)

This PR allows us to scrape multiple pages for a given marketing site (for purposes of embedding).
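As a rough illustration of what multi-page scraping involves, here is a minimal sketch using only the standard library. It is not the implementation in this PR: `scrape_site_pages`, the `fetch` callable, and `max_pages` are hypothetical names, and the rule for which links count as part of the same marketing site (same host, under the starting path) is an assumption.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkAndTextParser(HTMLParser):
    """Collect href values and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def scrape_site_pages(start_url, fetch, max_pages=10):
    """Return the combined text of start_url and the same-site pages it links to.

    `fetch` is any callable mapping a URL to an HTML string (hypothetical;
    it could wrap an HTTP client or a webdriver).
    """
    root = urlparse(start_url)
    queue, seen, chunks = [start_url], set(), []
    while queue and len(seen) < max_pages:
        url = queue.pop(0)
        if url in seen:
            continue
        seen.add(url)
        parser = _LinkAndTextParser()
        parser.feed(fetch(url))
        chunks.append(" ".join(parser.text_parts))
        for href in parser.links:
            link = urljoin(url, href).split("#")[0]
            parsed = urlparse(link)
            # Assumption: follow only links on the same host and under the start path.
            if parsed.netloc == root.netloc and parsed.path.startswith(root.path):
                queue.append(link)
    return "\n\n".join(chunks)
```

Passing `fetch` in as a parameter keeps the crawl logic independent of how a page is retrieved, so the same loop works whether pages come from plain HTTP requests or a rendered browser session.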

How can this be tested?

  1. Check out this branch
  2. Rebuild your web and celery containers
  3. Set settings.EMBEDDINGS_EXTERNAL_FETCH_USE_WEBDRIVER = True
  4. Bring the celery container down and back up with docker compose
  5. Run the task to fetch marketing page data:

      from learning_resources.tasks import scrape_marketing_pages
      scrape_marketing_pages.run()

  6. Inspect the content of the marketing pages:

      from learning_resources.models import ContentFile
      cfs = ContentFile.objects.filter(file_type="marketing_page")
      print(cfs.first().content)

  7. For MicroMasters program pages, the content should include text from all the pages (the tabs at the top):

      [Screenshot 2025-04-11 at 4 37 09 PM]

      from learning_resources.models import ContentFile
      ContentFile.objects.filter(file_type="marketing_page", learning_resource__url__icontains='micromasters')
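To make the last check less of an eyeball exercise, a small helper along these lines could report which tab headings are absent from the scraped text. This is only a sketch: `missing_sections` is a hypothetical name and is not part of this PR, and the list of expected headings has to be supplied by whoever is testing.

```python
def missing_sections(content, expected_headings):
    """Return the expected headings that do not appear in content (case-insensitive)."""
    haystack = content.casefold()
    return [h for h in expected_headings if h.casefold() not in haystack]
```

In a Django shell it could be called on the `content` of a queried `ContentFile`, passing the tab names visible on the program page. An empty list would suggest every tab was captured.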

OpenAPI Changes

No detectable change.

@shanbady changed the title from "multi marketing page scraping" to "multi page marketing site scraping" on Apr 11, 2025
@shanbady marked this pull request as ready for review on April 11, 2025 20:40
@abeglova self-assigned this on Apr 14, 2025
Contributor

@abeglova left a comment

lgtm

@shanbady merged commit 4f82990 into main on Apr 14, 2025 (12 checks passed)
@shanbady deleted the shanbady/multi-page-scraping branch on April 14, 2025 18:45
@odlbot mentioned this pull request on Apr 29, 2025 (19 tasks)
2 participants