fix: ketelhuis
ckuijjer committed Dec 22, 2023
1 parent 6d5384d commit 6a8ff03
Showing 4 changed files with 30 additions and 6 deletions.
10 changes: 10 additions & 0 deletions README.md
@@ -73,3 +73,13 @@ aws dynamodb scan --table-name expatcinema-scrapers-analytics --profile casper >
### Favicon

Use https://favicongrabber.com/ to grab a favicon for the cinema.json file

+## Troubleshooting
+
+When running a Puppeteer-based scraper locally, e.g. `yarn tsx scrapers/ketelhuis.ts`, you may get an error like:
+
+```
+Error: Failed to launch the browser process! spawn /tmp/localChromium/chromium/mac_arm-1205129/chrome-mac/Chromium.app/Contents/MacOS/Chromium ENOENT
+```
+
+This means Chromium is not installed at the expected path. Run `yarn install-local-chromium` to install it, then update `LOCAL_CHROMIUM_EXECUTABLE_PATH` in `browser.ts` to point to the Chromium executable.
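
Note on the troubleshooting step above: the path contains a build ID (`1205129` in the error message, `1240459` after this commit), so it presumably has to be updated again whenever `chromium@latest` resolves to a newer build. A minimal sketch, assuming the macOS ARM directory layout shown in the error message, of deriving the path from a build ID instead of hard-coding it. The helper `localChromiumPath` and its parameters are hypothetical and not part of this commit:

```typescript
// Hypothetical helper, not part of this commit: builds the executable path
// for a given Chromium build ID. Assumes the macOS ARM directory layout that
// appears in the error message; other platforms use a different layout.
const localChromiumPath = (
  buildId: string,
  cacheDir = '/tmp/localChromium',
): string =>
  `${cacheDir}/chromium/mac_arm-${buildId}/chrome-mac/Chromium.app/Contents/MacOS/Chromium`

// The build ID this commit switches to in browser.ts.
const LOCAL_CHROMIUM_EXECUTABLE_PATH = localChromiumPath('1240459')

console.log(LOCAL_CHROMIUM_EXECUTABLE_PATH)
```

With this shape, only the build ID needs to change after re-running the install script.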
2 changes: 1 addition & 1 deletion cloud/browser.ts
@@ -9,7 +9,7 @@ import { Logger } from '@aws-lambda-powertools/logger'
 // see https://github.com/Sparticuz/chromium#running-locally--headlessheadful-mode
 // for how to install a locally running chromium
 const LOCAL_CHROMIUM_EXECUTABLE_PATH =
-  '/tmp/localChromium/chromium/mac_arm-1205129/chrome-mac/Chromium.app/Contents/MacOS/Chromium'
+  '/tmp/localChromium/chromium/mac_arm-1240459/chrome-mac/Chromium.app/Contents/MacOS/Chromium'

 const createBrowserSingleton = () => {
   let instance: Browser
3 changes: 2 additions & 1 deletion cloud/package.json
@@ -22,7 +22,8 @@
     "tail-cloudwatch-logs": "lumigo-cli tail-cloudwatch-logs -n /aws/lambda/expatcinema-dev --profile casper --region eu-west-1",
     "prettify-log": "awk -F'\\t' '{if ($5 && $5 ~ /^{/ && !system(\"echo \\x27\" $5 \"\\x27 | jq -e > /dev/null 2>&1\")) { print $1\"\\t\"$2\"\\t\"$3\"\\t\"$4 system(\"echo \\x27\" $5 \"\\x27 | jq\") } else { print $0 }}'",
     "prettify-log:test": "cat tail-cloudwatch-logs.example.log | yarn prettify-log",
-    "log": "yarn tail-cloudwatch-logs|yarn prettify-log"
+    "log": "yarn tail-cloudwatch-logs|yarn prettify-log",
+    "install-local-chromium": "npx @puppeteer/browsers install chromium@latest --path /tmp/localChromium"
   },
   "author": "Casper Kuijjer",
   "license": "ISC",
21 changes: 17 additions & 4 deletions cloud/scrapers/ketelhuis.ts
@@ -53,7 +53,18 @@ const extractFromMainPage = async () => {
   )
   logger.info('scraped /deutsches-kino', { deutschesKinoResults })

-  const results = [...expatCinemaResults, ...deutschesKinoResults]
+  const italianCineclubResults = await xray(
+    'https://www.ketelhuis.nl/specials/italian-cineclub/',
+    '.c-default-page-content a[href^="https://www.ketelhuis.nl/films/"]',
+    selector,
+  )
+  logger.info('scraped /italian-cineclub', { italianCineclubResults })
+
+  const results = [
+    ...expatCinemaResults,
+    ...deutschesKinoResults,
+    ...italianCineclubResults,
+  ]
   const uniqueResults = R.uniq(results)

   logger.info('results', { results })
@@ -66,9 +77,11 @@ const extractFromMainPage = async () => {
   return extracted
 }

-const hasEnglishSubtitles = ({ metadata, mainContent }) =>
-  metadata?.includes('English subtitles') ||
-  mainContent?.includes('Engels ondertiteld')
+const hasEnglishSubtitles = ({ metadata, mainContent, title }) =>
+  title?.toLowerCase().includes('english subs') ||
+  metadata?.toLowerCase().includes('english subtitles') ||
+  metadata?.toLowerCase().includes('engels ondertiteld') ||
+  mainContent?.toLowerCase().includes('engels ondertiteld')

 const splitFirstDate = (date: string) => {
   if (date === 'Vandaag') {
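
Note on the `hasEnglishSubtitles` change in this file: lowercasing each field before matching makes the checks case-insensitive, and the new `title` check covers films whose title carries the marker itself. A minimal, self-contained sketch of that logic follows. The `ScrapedFilm` type, the `Boolean(...)` wrapper, and the sample inputs are assumptions added for illustration and are not part of this commit:

```typescript
// Hypothetical shape of the scraped fields; any of them may be missing.
type ScrapedFilm = {
  title?: string
  metadata?: string
  mainContent?: string
}

// Same four checks as the updated function in this commit. The Boolean(...)
// wrapper is an addition here so the result is always true or false rather
// than possibly undefined.
const hasEnglishSubtitles = ({
  metadata,
  mainContent,
  title,
}: ScrapedFilm): boolean =>
  Boolean(
    title?.toLowerCase().includes('english subs') ||
      metadata?.toLowerCase().includes('english subtitles') ||
      metadata?.toLowerCase().includes('engels ondertiteld') ||
      mainContent?.toLowerCase().includes('engels ondertiteld'),
  )

// Invented sample inputs, for illustration only.
const samples: ScrapedFilm[] = [
  { title: 'La Strada (English Subs)' },
  { metadata: 'Italian, ENGLISH SUBTITLES' },
  { mainContent: 'Nederlands ondertiteld' },
  {},
]

for (const sample of samples) {
  console.log(JSON.stringify(sample), hasEnglishSubtitles(sample))
}
```

The first two samples match regardless of casing, while the last two do not match any of the four checks.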