-
As a follow-up, I would say it would be necessary to, for instance, start 5 HttpSpiders, each with its own params to be used inside the parse function.
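Roughly what I have in mind, as a sketch only: the module names, the selector params, and the shared GenericParse helper are made up; the only Crawly pieces assumed are the standard Crawly.Spider callbacks, Crawly.Utils.request_from_url/1, and Crawly.Engine.start_spider/1. Each spider is a thin wrapper that carries its own params and delegates the actual parsing to one shared function.

```elixir
defmodule GenericParse do
  # Shared parse logic: the per-spider params decide which selectors to apply.
  def parse_item(response, params) do
    {:ok, document} = Floki.parse_document(response.body)

    items =
      document
      |> Floki.find(params.item_selector)
      |> Enum.map(fn element -> %{title: Floki.text(element)} end)

    requests =
      document
      |> Floki.find(params.link_selector)
      |> Floki.attribute("href")
      # Assumes the hrefs are already absolute URLs.
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    %{items: items, requests: requests}
  end
end

defmodule BlogSpider do
  use Crawly.Spider

  # Params baked into this particular spider module.
  @params %{item_selector: "article h2", link_selector: "a.next"}

  @impl Crawly.Spider
  def base_url, do: "https://blog.example.com"

  @impl Crawly.Spider
  def init, do: [start_urls: ["https://blog.example.com/"]]

  @impl Crawly.Spider
  def parse_item(response), do: GenericParse.parse_item(response, @params)
end

# The other four spiders would be the same thin wrapper with different @params,
# and each one is started with:
# Crawly.Engine.start_spider(BlogSpider)
```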
-
Hi,
I've been studying the code and playing around with Crawly.
I would like to create a spider that computes both the item rules and the new_links from data I fetch from the DB.
For instance, I could create a pool of HTTP spiders, one per domain, and those spiders would get the item properties and new links from an element described by a DB field. This way, I could let users personalize the crawl themselves using a couple of params.
But from what I saw in crawly-ui, it uses fixed (already compiled) spider modules for this purpose.
Is there an easy way to do this, or should I refactor the module for my purpose?
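To make the idea concrete, here is a rough sketch of what such a DB-driven spider could look like. The CrawlConfigs.get_by_domain/1 lookup and its fields (item_selectors, link_selector) are hypothetical stand-ins for the DB row; the rest sticks to the standard Crawly.Spider callbacks and Floki, and it assumes one item per page and absolute link URLs.

```elixir
defmodule DbDrivenSpider do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url, do: "https://example.com"

  @impl Crawly.Spider
  def init, do: [start_urls: ["https://example.com/"]]

  @impl Crawly.Spider
  def parse_item(response) do
    # Hypothetical DB lookup: one row per domain holding the selectors to apply.
    host = URI.parse(response.request.url).host
    config = CrawlConfigs.get_by_domain(host)

    {:ok, document} = Floki.parse_document(response.body)

    # Build one item from the field -> CSS selector mapping stored in the DB.
    item =
      Map.new(config.item_selectors, fn {field, selector} ->
        {field, document |> Floki.find(selector) |> Floki.text()}
      end)

    # new_links: follow whatever the stored link selector matches.
    requests =
      document
      |> Floki.find(config.link_selector)
      |> Floki.attribute("href")
      # Assumes absolute URLs; relative ones would need to be resolved first.
      |> Enum.map(&Crawly.Utils.request_from_url/1)

    %{items: [item], requests: requests}
  end
end
```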
I really like the way you set up the state machine (middlewares, pipelines, backoff times, etc.).
Thank you for your attention.