build scrapper passed #73

Closed
wants to merge 14 commits into from

Conversation

heytulsiprasad
Contributor

Description

This PR is a work in progress (WIP) aimed at solving #68. The scraper.js file in the routes directory contains the code to fetch data from blog.zairza.in. It fetches the title, href, author, release-date, and cover-img-link of every blog post and stores them as objects.

Dependencies Added

  • request
  • cheerio

Work remaining:

  • Build: scraper that fetches content from blog.zairza.in and stores it as objects
  • Fix: above mentioned issue
  • Make this code accessible through the app.js file
  • Create a template in blog.ejs for rendering the blog sections
  • Run a forEach method on blogs.json at the end to render the scraped data on our website

All suggestions are appreciated. 👍

scraper.js scraps from blog.zairza.in info regarding all blog posts
build a scrapper with details to fetch into blogs.ejs
layout of blogs section is made responsive and date mentioned in blog cards is rendered using regex syntax
@heytulsiprasad
Contributor Author

heytulsiprasad commented Dec 21, 2019

Feature Display

[Screenshot: Annotation 2019-12-21 212013]

Concerns

  • background image appears zoomed out
  • blog title is shifted a little to the right (not aligned with the author name)
  • date format is now yyyy/mm/dd, but was originally dd/mm/yyyy

Help and suggestions on the concerns above are appreciated. 👍
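
For the date concern, a minimal sketch of turning the ISO timestamp stored in the "release" field into DD/MM/YYYY with a regex. The helper name `formatReleaseDate` is hypothetical and not taken from this PR, and the sketch assumes the value always starts with `YYYY-MM-DD`. It reads the date exactly as written in the string, with no timezone conversion.

```javascript
// Hypothetical helper: converts an ISO 8601 timestamp such as
// "2019-08-25T12:13:49.122Z" into "DD/MM/YYYY" using a regex only.
function formatReleaseDate(isoString) {
  const match = /^(\d{4})-(\d{2})-(\d{2})/.exec(isoString);
  if (!match) {
    // Leave unrecognised values untouched instead of guessing.
    return isoString;
  }
  const [, year, month, day] = match;
  return `${day}/${month}/${year}`;
}

console.log(formatReleaseDate("2019-08-25T12:13:49.122Z")); // 25/08/2019
```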

date used in blog cards are now of the format DD/MM/YYYY
the latest post from medium is now also fetched and added to data.json file
Collaborator

@ankitjena ankitjena left a comment

@tulsi-prasad Good work. Please check whether what I mentioned in the review comments can be done.

@@ -0,0 +1,105 @@
// The scraper for the blog.ejs section in the application.
Collaborator

Where do you run this file?

routes/data.json Outdated
"href": "https://blog.zairza.in/oauth-using-mevn-stack-4b4a383dae08?source=collection_home---6------0-----------------------",
"author": "Ramakrishna Pattnaik",
"release": "2019-08-25T12:13:49.122Z",
"cover": "https://cdn-images-1.medium.com/fit/t/1600/480/1*zqCh8ZNR-LjBzaacpiIyUA.png"
Collaborator

Since you are fetching the cover image, it's huge, which is why it looks zoomed out. We need the first image inside the blog post instead. WDYT? Can this be done?

Contributor Author

Yes, I think so. We need to scrape the href of each blog post to get the right image. Working on it now.

Contributor Author

We'll most likely do this only for the 4 most recent posts, as an optimization.

Collaborator

yeah, that should reduce unnecessary requests
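
As a rough illustration of "first image inside the blog", here is a dependency-free sketch that pulls the first `<img>` `src` out of an HTML string with a regex. This is only an approximation for demonstration: in this PR the page would be parsed with cheerio, where something like `$('img').first().attr('src')` is the usual way to do it. The function name `firstImageSrc` and the sample markup are assumptions, not code from the PR.

```javascript
// Hypothetical helper: returns the src of the first <img> tag in an HTML
// string, or null when there is none. A regex is a simplification; a real
// HTML parser such as cheerio is more robust.
function firstImageSrc(html) {
  // Requires whitespace before "src" so attributes like data-src are skipped.
  const match = /<img\b[^>]*?\ssrc\s*=\s*(["'])(.*?)\1/i.exec(html);
  return match ? match[2] : null;
}

// Made-up sample markup standing in for a fetched blog post.
const sampleHtml =
  '<article><h1>Title</h1><p>Intro</p>' +
  '<img alt="first" src="https://example.com/one.png">' +
  '<img src="https://example.com/two.png"></article>';

console.log(firstImageSrc(sampleHtml)); // https://example.com/one.png
```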

add moment js as a dependency to work with date time objects
write datetime objects to be rendered using moment package
Scrapes the first image from each blog posts and forms a cover object.
added cover objects with img urls in cover.json file of first 4 blogs
@heytulsiprasad
Contributor Author

heytulsiprasad commented Dec 25, 2019

Work to Do

Fetch the cover image URLs (the first image) from each individual blog post. This is taken care of in 567dc39, and the next commit adds them to the cover.json file for convenience.

Problem currently faced

The order of the scraped image URLs does not match the order of the blog posts. This seems to come out of nowhere, since the while loop iterates from count = 0 and fetches from the data array, which is also ordered. Only after fixing this can we render them in the ejs template.
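
One plausible explanation, assuming the loop starts every request without waiting and each callback appends its result to a shared array: the responses arrive in whatever order the network returns them, so the array ends up in completion order, not loop order. The sketch below simulates that with random delays and shows one way around it, storing each result at the index of the link that produced it. `fakeRequest` and `collectCovers` are hypothetical names, not code from this PR.

```javascript
// Stand-in for an HTTP request: calls back after a random delay, the way
// real responses come back in an unpredictable order. Hypothetical helper.
function fakeRequest(url, callback) {
  const delay = Math.floor(Math.random() * 20);
  setTimeout(() => callback(null, `first-image-of:${url}`), delay);
}

// Collects one cover per link. Each result is written to the slot of the
// link that produced it, so arrival order no longer matters.
function collectCovers(links, done) {
  const covers = new Array(links.length);
  let pending = links.length;
  let failed = false;
  if (pending === 0) {
    done(null, covers);
    return;
  }
  links.forEach((link, index) => {
    fakeRequest(link, (err, cover) => {
      if (failed) return;
      if (err) {
        failed = true;
        done(err);
        return;
      }
      covers[index] = cover; // instead of covers.push(cover)
      pending -= 1;
      if (pending === 0) done(null, covers);
    });
  });
}

collectCovers(["post-1", "post-2", "post-3", "post-4"], (err, covers) => {
  if (err) throw err;
  console.log(covers); // same order as the input links on every run
});
```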

Fix to Try

I am thinking of making a separate coverScraper.js file which imports data.json as it is, loops through its top 4 hrefs, and fetches the first image of each. I'll update on this by tonight.

EDIT 1

Refactored the code in the next commit, 2b837e5. ./json/cover.json stores the cover image URLs, which are still not in order. After every server run, the order in the json file changes. The reason is still unclear.

Any suggestions are appreciated. 👍

json folder stores the scraped data and scrapcover is fetches cover images from each blogs
bloglinks array contains four urls in order to be scraped for cover image
@ankitjena
Collaborator

I am thinking of making a separate coverScraper.js file which imports data.json as it is, loops through its top 4 hrefs, and fetches the first image of each

You should keep the entire logic in one file. When you fetch the blog url for the cover image, make another request and parse the response with cheerio. async/await will help.
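
A minimal sketch of what that single-file flow could look like with async/await. The two fetch functions are hypothetical stand-ins that only simulate network latency; a real version would request the page and parse it with cheerio. The example.com URLs and the `limit` parameter are placeholders as well. `Promise.all` resolves to its results in input order, so the covers line up with the posts regardless of which request finishes first.

```javascript
// Small promise-based delay used to simulate network latency.
const wait = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Hypothetical stand-in for scraping the blog's home page.
async function fetchPostList() {
  await wait(5);
  return ["post-1", "post-2", "post-3", "post-4", "post-5"].map((slug) => ({
    title: slug,
    href: `https://example.com/${slug}`,
  }));
}

// Hypothetical stand-in for requesting one post and reading its first image.
async function fetchFirstImage(href) {
  await wait(Math.floor(Math.random() * 20));
  return `${href}/first-image.png`;
}

// Both steps live in one function: list the posts, keep the most recent
// `limit` of them, then fetch their covers in parallel.
async function scrapeBlogs(limit = 4) {
  const posts = (await fetchPostList()).slice(0, limit);
  const covers = await Promise.all(
    posts.map((post) => fetchFirstImage(post.href))
  );
  return posts.map((post, i) => ({ ...post, cover: covers[i] }));
}

scrapeBlogs().then((blogs) => console.log(blogs));
```

Limiting the second step to the first 4 posts keeps the number of extra requests small, as discussed in the review thread.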

@heytulsiprasad
Contributor Author

This PR is taken further in #74 to avoid any local conflicts.

@heytulsiprasad
Contributor Author

I have worked on a new branch and will update in another PR.
