* Describe how to configure FileLogging
* Documentation updates:
  1. Explain how to run Crawly as a standalone application
  2. Explain how to create spiders as YML files
1 parent f7ff7d8 · commit b8bd2a9 · 9 changed files with 167 additions and 48 deletions
# Defining spiders in YML

Starting from version 0.15.0, Crawly supports defining spiders as YML files directly from Crawly's management interface. The main idea is to reduce the amount of boilerplate when defining simple spiders.

You should not have to write code just to get titles and descriptions from Reddit (we hope :)
## Quickstart

1. Start the Crawly application (either using the classical, dependency-based approach or as a standalone application).
2. Open the Crawly Management interface (localhost:4001).
   ![Create YML Spider](./assets/create_yml_spider.png)
3. Define the spider using the following structure:
``` yml
name: BooksSpiderForTest
base_url: "https://books.toscrape.com/"
start_urls:
  - "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"
fields:
  - name: title
    selector: ".product_main"
  - name: price
    selector: ".product_main .price_color"
links_to_follow:
  - selector: "a"
    attribute: "href"
```
4. Click the Preview button to see what the extracted data will look like once the spider is created:
![Preview YML Spider](./assets/preview_yml_spider.png)
5. After saving the spider, you can schedule it from the Crawly Management interface.
## YML Spider Structure

* "name" (required): A string representing the name of the spider.
* "base_url" (required): A string representing the base URL of the website being scraped. The value must be a valid URI.
* "start_urls" (required): An array of strings representing the URLs to start scraping from. Each URL must be a valid URI.
* "links_to_follow" (required): An array of objects representing the links to follow when scraping a page. Each object must have the following properties:
  * "selector": A string representing the CSS selector for the links to follow.
  * "attribute": A string representing the attribute of the link element that contains the URL to follow.
* "fields" (required): An array of objects representing the fields to scrape from each page. Each object must have the following properties:
  * "name": A string representing the name of the field.
  * "selector": A string representing the CSS selector for the field to scrape.
# Running Crawly as a standalone application

This approach abstracts all scraping tasks into a separate entity (or service), allowing you to extract your desired data without adding Crawly to your mix file.

In other words:
```
Run Crawly as a Docker container with spiders mounted from the outside.
```

# Getting started

Here we will show how to re-implement the example from the Quickstart, achieving the same results with a standalone version of Crawly.

1. Make a folder for your project: `mkdir myproject`
2. Make a spiders folder inside the project folder: `mkdir ./myproject/spiders`
3. Copy the spider code into a file inside the folder created in the previous step, `myproject/spiders/books_to_scrape.ex`:
``` elixir
defmodule BooksToScrape do
  use Crawly.Spider

  @impl Crawly.Spider
  def base_url(), do: "https://books.toscrape.com/"

  @impl Crawly.Spider
  def init() do
    [start_urls: ["https://books.toscrape.com/"]]
  end

  @impl Crawly.Spider
  def parse_item(response) do
    # Parse the response body into a document
    {:ok, document} = Floki.parse_document(response.body)

    # Create items (for pages where items exist)
    items =
      document
      |> Floki.find(".product_pod")
      |> Enum.map(fn x ->
        %{
          title: Floki.find(x, "h3 a") |> Floki.attribute("title") |> Floki.text(),
          price: Floki.find(x, ".product_price .price_color") |> Floki.text(),
          url: response.request_url
        }
      end)

    # Extract pagination links and convert them into new requests
    next_requests =
      document
      |> Floki.find(".next a")
      |> Floki.attribute("href")
      |> Enum.map(fn url ->
        Crawly.Utils.build_absolute_url(url, response.request.url)
        |> Crawly.Utils.request_from_url()
      end)

    %{items: items, requests: next_requests}
  end
end
```
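The map returned by `parse_item/1` drives the crawl: everything under `:items` is sent through the configured item pipelines, while everything under `:requests` is scheduled for fetching.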
4. Now create a configuration file using the Erlang config format
   (https://www.erlang.org/doc/man/config.html).

For example: `myproject/crawly.config`
``` erlang
[{crawly, [
    {closespider_itemcount, 500},
    {closespider_timeout, 20},
    {concurrent_requests_per_domain, 2},

    {middlewares, [
        'Elixir.Crawly.Middlewares.DomainFilter',
        'Elixir.Crawly.Middlewares.UniqueRequest',
        'Elixir.Crawly.Middlewares.RobotsTxt',
        {'Elixir.Crawly.Middlewares.UserAgent', [
            {user_agents, [<<"Crawly BOT">>]}
        ]}
    ]},

    {pipelines, [
        {'Elixir.Crawly.Pipelines.Validate', [{fields, [title, url]}]},
        {'Elixir.Crawly.Pipelines.DuplicatesFilter', [{item_id, title}]},
        {'Elixir.Crawly.Pipelines.JSONEncoder'},
        {'Elixir.Crawly.Pipelines.WriteToFile', [{folder, <<"/tmp">>}, {extension, <<"jl">>}]}
    ]}
]}].
```
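For comparison, in a regular Mix project the same settings would live in `config/config.exs`. A rough Elixir equivalent of the Erlang config above (shown only to make the mapping clear; it is not needed for the standalone setup):

``` elixir
import Config

# Same Crawly settings as in crawly.config, in Mix config style
config :crawly,
  closespider_itemcount: 500,
  closespider_timeout: 20,
  concurrent_requests_per_domain: 2,
  middlewares: [
    Crawly.Middlewares.DomainFilter,
    Crawly.Middlewares.UniqueRequest,
    Crawly.Middlewares.RobotsTxt,
    {Crawly.Middlewares.UserAgent, user_agents: ["Crawly BOT"]}
  ],
  pipelines: [
    {Crawly.Pipelines.Validate, fields: [:title, :url]},
    {Crawly.Pipelines.DuplicatesFilter, item_id: :title},
    Crawly.Pipelines.JSONEncoder,
    {Crawly.Pipelines.WriteToFile, folder: "/tmp", extension: "jl"}
  ]
```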
5. Now let's start Crawly (TODO: Insert link to crawly Docker repos):
```
docker run --name crawlyApp1 -e "SPIDERS_DIR=/app/spiders" \
 -it -p 4001:4001 -v $(pwd)/spiders:/app/spiders \
 -v $(pwd)/crawly.config:/app/config/crawly.config \
 crawly
```

** The SPIDERS_DIR environment variable specifies a folder from which additional spiders are fetched; `./spiders` is used by default.
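The command above assumes it is run from inside `myproject`, so that the mounted host paths resolve to the files created in the earlier steps:

```
myproject/
├── crawly.config
└── spiders/
    └── books_to_scrape.ex
```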
6. Open the Crawly Web Management interface in your browser: http://localhost:4001/

Here you can schedule a spider using the Schedule button. The interface also gives access to other useful information, such as:
1. History of your jobs
2. Items
3. Logs of the given spider
![Crawly Management](./assets/management_ui.png)
![Crawly Management](./assets/management_ui2.png)