Zineb is a lightweight tool for simple and efficient web scraping and crawling, built around BeautifulSoup and Pandas. Its main purpose is to help you quickly structure your data so that it can be used as fast as possible in data science or machine learning projects.
Zineb gets your custom spider, creates a set of HTTPRequest objects for each URL, sends the requests and caches a BeautifulSoup object of the page within an HTMLResponse class of that request.
Most of your interactions with the HTML page will be done through the HTMLResponse class.
When the spider starts crawling the page, each response and request is passed through the start function:
def start(self, response, **kwargs):
    request = kwargs.get('request')
    images = response.images
To create a project, run python -m zineb startproject <project name>, which will create a directory with the following structure:
myproject
|-- media
|-- models
|   |-- base.py
|-- __init__.py
|-- manage.py
|-- settings.py
|-- spiders.py
Once the project folder is created, all your interactions with Zineb will be made through the management commands that are executed through python manage.py from your project's directory.
The models directory allows you to place the elements that will help structure the data that you have scraped from the internet.
The manage.py
file will allow you to run all the required commands from your project.
Finally, the spiders module will contain all the spiders for your project.
On startup, Zineb implements a set of basic settings (zineb.settings.base) that will be overridden by the values that you have defined in the settings.py file located in your project.
You can read more about this in the settings section of this file.
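The override behaviour described above can be pictured as a simple dictionary merge. The sketch below is only an illustration: `BASE_SETTINGS`, `PROJECT_SETTINGS`, `load_settings` and the setting names are hypothetical and are not taken from Zineb's code.

```python
# Hypothetical defaults standing in for zineb.settings.base
BASE_SETTINGS = {
    "ENSURE_HTTPS": False,
    "RETRY": False,
    "TIME_ZONE": "America/Chicago",
}

# Hypothetical values a user might define in the project's settings.py
PROJECT_SETTINGS = {
    "RETRY": True,
}


def load_settings(base, project):
    """Return a new dict in which project values override the base ones."""
    merged = dict(base)
    merged.update(project)
    return merged


settings = load_settings(BASE_SETTINGS, PROJECT_SETTINGS)
print(settings["RETRY"])      # True: overridden by the project
print(settings["TIME_ZONE"])  # America/Chicago: inherited from the base
```

Copying the base dictionary before updating it keeps the defaults untouched, so they can be reused by another project.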
Creating a spider is extremely easy and requires a set of starting URLs that can be used to scrape one or many HTML pages.
from zineb.app import Zineb

class Celebrities(Zineb):
    start_urls = ['http://example.com']

    def start(self, response, request=None, soup=None, **kwargs):
        # Do something here
        pass
Once the Celebrities class is called, each request is passed through the start method. In other words, the zineb.http.responses.HTMLResponse, the zineb.http.request.HTTPRequest and the BeautifulSoup HTML page object are sent through the function.
You are not required to use all these parameters at once. They are just there for convenience.
If you only need one of them, you can also write the start method like so:
def start(self, response, **kwargs):
    # Do something here
    pass
Other objects can be passed through the function, such as the models that you have created, the settings of the application, etc.
Meta options allow you to customize certain very specific behaviours [not found in the settings.py file] related to the spider.
class Celebrities(Zineb):
    start_urls = ['http://example.com']

    class Meta:
        domains = []
This option limits a spider to a very specific set of domains.
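To make the idea of a domain restriction concrete, here is a minimal sketch of how a URL could be checked against a list of domains with the standard library. The `is_allowed` helper is hypothetical and does not reflect how Zineb performs the check internally.

```python
from urllib.parse import urlparse


def is_allowed(url, domains):
    """Return True when the URL's host belongs to one of the given domains.

    An empty list is treated here as "no restriction", which is an
    assumption of this sketch.
    """
    if not domains:
        return True
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in domains)


print(is_allowed("http://example.com/page", ["example.com"]))  # True
print(is_allowed("http://other.org/page", ["example.com"]))    # False
```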
This option, written as verbose_name, specifies a different name for your spider.
Triggers the execution of all the spiders present in the given project.
Starts an IPython shell in which you can test various elements on the HTML page.
When the shell is started, the zineb.http.HTTPRequest
, the zineb.response.HTMLResponse
, and the BeautifulSoup instance of the page are injected.
Extractors are passed using aliases:
- links: LinkExtractor
- images: ImageExtractor
- multilinks: MultiLinkExtractor
- tables: TableExtractor
The extractors are also all passed within the shell in addition to the project settings.
In that regard, the shell becomes an interesting place where you can test various queries before using them in your project. For example, using the shell with http://example.com.
We can get a simple URL:
IPython 7.19.0
In [1]: response.find("a")
Out[1]: <a href="https://www.iana.org/domains/example">More information...</a>
We can find all urls on the page:
IPython 7.19.0
In [2]: extractor = links()
In [3]: extractor.resolve(response)
In [4]: str(extractor)
Out [4]: [Link(url=https://www.iana.org/domains/example, valid=True)]
In [5]: response.links
Out [5]: [Link(url=https://www.iana.org/domains/example, valid=True)]
Or simply get the page title:
IPython 7.19.0
In [6]: response.page_title
Out [6]: 'Example Domain'
Remember that in addition to the custom functions created for the class, all the other functions called on zineb.response.HTMLResponse are BeautifulSoup ones (find, find_all, find_next, next_sibling...).
As said previously, the majority of your interactions with the HTML page will be done through the HTMLResponse object or zineb.http.responses.HTMLResponse class.
This class implements some very basic general functionalities that you can use throughout your project. To illustrate this, let's create a basic Zineb HTTP response from a request:
from zineb.http.requests import HTTPRequest
request = HTTPRequest("http://example.com")
Requests, when created, are not sent [or resolved] automatically if the _send function is not called. In that case, they are marked as being unresolved, e.g. HTTPRequest("http://example.com", resolved=False).
Once the _send
method is called, by using the html_page
attribute or calling any BeautifulSoup function on the class, you can do all the classic querying on the page e.g. find, find_all...
request._send()
request.html_response
# -> Zineb HTMLResponse object
request.html_response.html_page
# -> BeautifulSoup object
request.find("a")
# -> BeautifulSoup Tag
If you do not know about BeautifulSoup please read the documentation here.
For instance, suppose you have a spider and want to get the first link present on http://example.com. This is what you would do:
from zineb.app import Zineb

class MySpider(Zineb):
    start_urls = ["http://example.com"]

    def start(self, response=None, request=None, soup=None, **kwargs):
        link = response.find("a")

        # Or, you can also use this technique through
        # the request object
        link = request.html_response.find("a")

        # Or you can directly use the soup
        # object like so
        link = soup.find("a")
In order to understand what the Link, Image and Table objects represent, please read the following section of this page.
Zineb HTTPRequest objects are better explained in the following section.
request.html_response.links
# -> [Link(url=http://example.com, valid=True)]
request.html_response.images
# -> [Image(url=https://example.com/1.jpg)]
request.html_response.tables
# -> [Table(url=https://example.com/1)]
Finally, you can retrieve all the text of the web page at once.
request.html_response.text
# -> '\n\n\nExample Domain\n\n\n\n\n\n\n\nExample Domain\nThis domain is for use in illustrative examples in documents. You may use this\n domain in literature without prior coordination or asking for permission.\nMore information...\n\n\n\n'
There might be situations where you have a set of HTML files in your project directory that you want to crawl. Zineb provides a spider for such cases.
NOTE: Ensure that the directory to use is within your project.
class Spider(FileCrawler):
    start_files = ["media/folder/myfile.html"]
You might have thousands of files and certainly might not want to reference each file one by one. You can then also use a utility function collect_files
.
from zineb.utils.iterator import collect_files
class Spider(FileCrawler):
    start_files = collect_files("media/folder")
Read more on collect_files
here.
Models are a simple way to structure your scraped data before eventually saving it to a file (generally JSON or CSV). The Model class is an interface to an internal container called SmartDict that actually contains the data, and to fields whose purpose is to clean and normalize the incoming values.
By using models, you are then assured to have clean usable data for data analysis.
In order to create a model, subclass the Model object from zineb.models.Model
and then add fields to it:
from zineb.models import fields
from zineb.models.datastructure import Model
class Player(Model):
    name = fields.CharField()
    date_of_birth = fields.DateField()
    height = fields.IntegerField()
On its own however, a model does nothing. In order to make it work, you have to add values to it and then resolve the fields [or data]. There are multiple ways to do this.
Adding a new value generally requires two main parameters: the name of the field to use and the incoming data to be added.
Each model gets instantiated with an underlying container that does the heavy work of storing and aggregating the data. The default container is called SmartDict.
The SmartDict container ensures that each row is well balanced, with the same number of fields, when values are added.
For instance, if your model has two fields name
and surname
, suppose you add name
but not surname
, the final result should be {"name": ['Kendall'], "surname": [None]}
which in return will be saved as [{"name": "Kendall", "surname": null}]
.
In the same manner, if you supply values for both fields your final result would be {"name": ['Kendall'], "surname": ["Jenner"]}
which in return will be saved as [{"name": "Kendall", "surname": "Jenner"}]
.
In other words, whichever fields are supplied, the final result will always be a well-balanced list of dictionaries with no missing fields.
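The balancing behaviour can be illustrated with a toy column store. This is a minimal sketch, not the SmartDict implementation: the `BalancedColumns` class and its `add_row` and `as_rows` methods are hypothetical names introduced only for this example.

```python
class BalancedColumns:
    """Toy column store that keeps every field the same length."""

    def __init__(self, *fields):
        self.columns = {name: [] for name in fields}

    def add_row(self, **values):
        # Fields that were not supplied are padded with None
        for name, column in self.columns.items():
            column.append(values.get(name))

    def as_rows(self):
        # Transpose the columns into a list of dictionaries
        names = list(self.columns)
        return [dict(zip(names, row)) for row in zip(*self.columns.values())]


store = BalancedColumns("name", "surname")
store.add_row(name="Kendall")
print(store.columns)    # {'name': ['Kendall'], 'surname': [None]}
print(store.as_rows())  # [{'name': 'Kendall', 'surname': None}]
```

Because every column receives exactly one entry per row, transposing the columns can never produce a dictionary with a missing key.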
Deprecated. This class does the following:
- Before the data is added, it runs any field constraint present on the model
- It then adds the value to the existing container via the update function
- Finally, once execute_save is called, it applies any sorting specified on the fields in the Meta class of the model and returns the corresponding data
The first method consists of using add_value
.
player.add_value('name', 'Kendall Jenner')
Adding expression-based values requires a BeautifulSoup HTML page object. You can add one value at a time.
player.add_using_expression('name', 'a', attrs={'class': 'title'})
When you want to add a value to the model based on certain conditions, use add_case in combination with a function class.
For instance, suppose you are scraping a fashion website and you want to replace certain prices, let's say 25, with 25.5. You can do the following:
from zineb.models.expressions import When
my_model.add_case(25, When(25, 25.5))
If you wish to operate a calculation on a field before passing the data to your model, you can use math function classes in combination with the add_calculated_value
.
from zineb.models.expressions import Add
my_model.add_calculated_value('price', 25, Add(5))
You can also run multiple arithmetic operations on the field:
my_model.add_calculated_value('price', 25, Add(5), Substract(1))
You can save the data within a model by calling the save
method. It takes the following arguments:
filename
commit
The save method does the following things in order:
- Call full_clean in order to apply general modifications to the final data
- full_clean then calls the clean method to apply any custom user modifications to the resulting data
- Finally, save the data to a file if commit is true, or return the elements as a list
By adding a Meta to your model, you can pass custom behaviours.
- Ordering
- Template model
- Constraints
If a model's only purpose is to implement additional fields to a child model, use the template_model
option to indicate this state.
class TemplateModel(Model):
    name = fields.CharField()

    class Meta:
        template_model = True

class MainModel(TemplateModel):
    surname = fields.CharField()
This technique is useful when you need to implement common fields to multiple models at a time.
Order your data in a specific way based on certain fields before saving your model.
You can ensure that the data on your model is unique using the UniqueConstraint class. The constraint check is done before the data is saved, and the saving process is skipped if a similar value is found.
class UserModel(Model):
    name = fields.CharField()
    email = fields.EmailField()

    class Meta:
        constraints = [
            UniqueConstraint(fields=['name'], name='unique_name')
        ]
Multiple fields can be constrained creating a unique together directive. In the example below, both name and email have to be unique in order to be saved.
class UserModel(Model):
    name = fields.CharField()
    email = fields.EmailField()

    class Meta:
        constraints = [
            UniqueConstraint(fields=['name', 'email'], name='unique_name')
        ]
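As a rough illustration of a unique-together check, the sketch below skips a row when another row already holds the same combination of values. It assumes that the constraint applies to the combination of fields rather than to each field separately; `add_unique` is a hypothetical helper, not part of Zineb.

```python
def add_unique(rows, new_row, fields):
    """Append new_row unless an existing row has the same values for fields."""
    key = tuple(new_row.get(f) for f in fields)
    for row in rows:
        if tuple(row.get(f) for f in fields) == key:
            return False  # duplicate found: skip saving
    rows.append(new_row)
    return True


rows = []
add_unique(rows, {"name": "Kendall", "email": "k@example.com"}, ["name", "email"])
add_unique(rows, {"name": "Kendall", "email": "k@example.com"}, ["name", "email"])
add_unique(rows, {"name": "Kendall", "email": "other@example.com"}, ["name", "email"])
print(len(rows))  # 2
```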
You can also implement a constraint function on the fields:
class UserModel(Model):
    name = fields.CharField()
    email = fields.EmailField()

    class Meta:
        constraints = [
            UniqueConstraint(fields=['name', 'email'], name='unique_name', condition=lambda x: x != 'Kendall')
        ]
Fields are the main entrypoint for passing a raw value from the internet to the underlying SmartDict
container of your model. They guarantee cleanliness and consistency.
Zineb comes with a number of preset fields that you can use out of the box:
- CharField
- TextField
- NameField
- EmailField
- UrlField
- ImageField
- IntegerField
- DecimalField
- DateField
- AgeField
- CommaSeparatedField
- ListField
- BooleanField
- Value
- RelatedModelField
Each field comes with a resolve function which gets called by the model. The resulting data is then passed on to the model's data store.
The resolve function will then do the following things.
First, it will run all cleaning functions on the original value, for example by stripping tags like "<" or ">", which normalizes the value before additional processing.
Second, a deep_clean function is run on the result, removing any useless spaces and escape characters and finally reconstructing the value to ensure that any undetected white space is eliminated.
Finally, all the registered validators (default and custom) are called.
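The three steps above can be sketched with the standard library. This is an approximation of the described pipeline, not Zineb's actual code: the `resolve` function below and the `not_empty` validator are hypothetical.

```python
import re


def not_empty(value):
    # Example validator: turn empty strings into None
    return value or None


def resolve(raw_value, validators=()):
    # 1. Strip HTML tags such as <b> or </b>
    value = re.sub(r"<[^>]+>", "", raw_value)
    # 2. Deep clean: split() drops newlines, tabs and repeated spaces,
    #    then the value is rebuilt with single spaces
    value = " ".join(value.split())
    # 3. Run every validator in order
    for validator in validators:
        value = validator(value)
    return value


print(resolve("<b> Kendall \n\t Jenner </b>", [not_empty]))  # Kendall Jenner
```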
You can access the data of a declared field directly on the model by calling the field's name.
class PlayerModel(Model):
    name = fields.CharField()
    surname = fields.CharField()

model = PlayerModel()
model.add_value('name', 'Shelly-Ann')
model.add_value('surname', 'Fraiser')
# -> model.name -> ["Shelly-Ann"]
# -> model.surname -> ["Fraiser"]
By calling model.name you will receive an array containing all the values that were registered in the data container, e.g. ["Shelly-Ann"]. Each field has a descriptor, FieldDescriptor.
The CharField represents the normal character element on an HTML page.
CharField(max_length=None, null=None, default=None, validators=[])
The text field is meant for longer values, which allows you to add paragraphs of text.
TextField(max_length=None, null=None, default=None, validators=[])
The name field allows you to implement capitalized text in your model. The title
method is called on the string in order to represent the value correctly e.g. Kendall Jenner.
NameField(max_length=None, null=None, default=None, validators=[])
The email field represents emails. The default validator, validators.validate_email, is automatically called in the resolve function of the class in order to ensure that the value is indeed an email.
- limit_to_domains: Check if the email corresponds to the list of specified domains
EmailField(limit_to_domains=[], max_length=None, null=None, default=None, validators=[])
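A domain restriction of this kind could look like the sketch below. The `check_email_domain` helper is hypothetical, and returning None for a rejected value is an assumption of this example; the real field may behave differently, for instance by raising an error.

```python
def check_email_domain(email, limit_to_domains=()):
    """Return the email if its domain is allowed, otherwise None."""
    local, sep, domain = email.rpartition("@")
    if not sep or not local or not domain:
        return None  # not shaped like an email at all
    if limit_to_domains and domain.lower() not in limit_to_domains:
        return None
    return email


print(check_email_domain("kendall@example.com", ["example.com"]))  # kendall@example.com
print(check_email_domain("kendall@other.org", ["example.com"]))    # None
```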
The url field is specific for urls. Just like the email field, the default validator, validators.validate_url
is called in order to validate the url.
The image field holds the url of an image exactly like the UrlField with the sole difference that you can download the image directly when the field is evaluated.
- download: Download the image to your media folder while the scraping is performed
- as_thumnail: Download image as a thumbnail
- download_to: Download image to a specific path
class MyModel(Model):
    avatar = ImageField(download=True, download_to="/this/path")
This field allows you to pass an integer into your model.
- default: Default value if None
- max_value: Implements a maximum value constraint
- min_value: Implements a minimum value constraint
This field allows you to pass a float value into your model.
- default: Default value if None
- max_value: Implements a maximum value constraint
- min_value: Implements a minimum value constraint
The date field allows you to pass dates to your model. This field uses a preset of custom date formats to identify the structure of the incoming date value. For instance, %d-%m-%Y will be able to resolve 1-1-2021.
- date_format: Additional format that can be used to parse the incoming value
class MyModel(Model):
    date = DateField("%d-%m-%Y")
Generally speaking, most date formats are covered so you wouldn't need to implement a generally used format.
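Trying a list of formats one after the other is a common way to implement this kind of parsing. The sketch below shows the idea with `datetime.strptime`; `DATE_FORMATS` and `parse_date` are hypothetical, and the formats Zineb actually ships with may differ.

```python
from datetime import datetime

# Hypothetical preset of formats; the real list used by Zineb may differ
DATE_FORMATS = ["%d-%m-%Y", "%Y-%m-%d", "%d/%m/%Y"]


def parse_date(value, extra_format=None):
    """Try each known format in turn and return the first date that parses."""
    formats = DATE_FORMATS + ([extra_format] if extra_format else [])
    for fmt in formats:
        try:
            return datetime.strptime(value, fmt).date()
        except ValueError:
            continue
    raise ValueError(f"Could not parse date: {value!r}")


print(parse_date("1-1-2021"))                # 2021-01-01
print(parse_date("2021.01.05", "%Y.%m.%d"))  # 2021-01-05
```

The extra format plays the role of the date_format parameter: it is only consulted once the preset formats have failed.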
The age field works like the DateField but instead of returning the date, it returns the difference between the date and the current date, which corresponds to the age.
- date_format: Indicates how to parse the incoming value
- default: Default value if None
- tz_info: Timezone information
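The age calculation itself can be sketched as follows. This example assumes that the age is expressed in full years, which the text above does not state explicitly; `age_from` is a hypothetical helper and accepts an explicit `today` so that the result is reproducible.

```python
from datetime import date


def age_from(birth_date, today=None):
    """Return the number of full years between birth_date and today."""
    today = today or date.today()
    # Subtract one year if the birthday has not happened yet this year
    before_birthday = (today.month, today.day) < (birth_date.month, birth_date.day)
    return today.year - birth_date.year - int(before_birthday)


print(age_from(date(1995, 11, 3), today=date(2021, 1, 1)))   # 25
print(age_from(date(1995, 11, 3), today=date(2021, 11, 3)))  # 26
```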
An array field will store an array of values that are all evaluated to an output field that you have specified.
N.B. The value of an ArrayField is implemented as is in the final DataFrame. Make sure you are using this field correctly in order to avoid unwanted results.
Create a comma separated field in your model.
N.B. The value of a CommaSeparatedField is implemented as is in the final DataFrame. Make sure you are using this field correctly in order to avoid unwanted results.
Parse an element within a given value using a regex expression before storing it in your model.
RegexField(r'(\d+)(?<=\€)')
Adds a boolean-based value to your model. Uses classic boolean representations such as on, off, 1, 0, True, true, False or false to resolve the value.
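Resolving these representations amounts to a lookup after normalizing the value to lowercase text. The sketch below is illustrative only; `to_boolean` and the two sets are hypothetical names, and raising an error on unknown values is an assumption.

```python
TRUE_VALUES = {"on", "1", "true"}
FALSE_VALUES = {"off", "0", "false"}


def to_boolean(value):
    """Map a classic boolean representation to True or False."""
    text = str(value).strip().lower()
    if text in TRUE_VALUES:
        return True
    if text in FALSE_VALUES:
        return False
    raise ValueError(f"Not a boolean representation: {value!r}")


print(to_boolean("on"), to_boolean(0), to_boolean("True"))  # True False True
```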
This field allows you to create a direct relationship with any existing models of your project. Suppose you have the given models:
from zineb.models.datastructure import Model
from zineb.models import fields
class Tournament(Model):
    location = fields.CharField()

class Player(Model):
    full_name = fields.CharField()
You might be tempted when scraping your data to instantiate both models in order to add values like this:
class MySpider(Spider):
    def start(self, soup, **kwargs):
        player = Player()
        tournament = Tournament()
        player.add_value('full_name', 'Kendall Jenner')
        tournament.add_value('location', 'Paris')
There's lots of code and this is not necessarily the most efficient way for this task. The RelatedModelField
allows us then to create both a forward and backward relationship between two different models.
The above technique can then be simplified to the code below:
from zineb.models.datastructure import Model
from zineb.models import fields

class Tournament(Model):
    location = fields.CharField()

class Player(Model):
    full_name = fields.CharField()
    tournament = fields.RelatedModelField(Tournament)
Which would then allow us to do the following:
class MySpider(Spider):
    def start(self, soup, **kwargs):
        player = Player()
        player.add_value('full_name', 'Kendall Jenner')
        player.tournament.add_value('location', 'Paris')
        player.save(commit=False)
        # -> [{"full_name": "Kendall Jenner", "tournament": [{"location": "Paris"}]}]
The field does not keep track of the individual relationship between the main model and the related model. In other words, unlike a database foreign key, all data from the main model will receive the same data from the related model.
This is ideal for creating nested data within your model.
You can also create a custom field by subclassing zineb.models.fields.Field. When doing so, your custom field has to provide a resolve function in order to determine how the value should be parsed and a _to_python_object function in order to know under which Python type the data should be represented (str, int...).
class MyCustomField(Field):
    _dtype = str

    def _to_python_object(self, clean_value):
        # Code here
        pass

    def resolve(self, value):
        initial_result = super().resolve(value)
        # Rest of your code here
Validators make sure that the value that was passed respects the constraints that were implemented as keyword arguments on the field class. The following basic validations can run if they are specified:
- Maximum length (max_length)
- Nullity (null)
- Defaultness (default)
- Validity (validators)
The maximum or minimum length check ensures that the value does not exceed a certain length using validators.max_length_validator
or validators.min_length_validator
.
The nullity validation ensures that the value is not null and that if a default is provided, that null value be replaced by the latter. It uses validators.validate_is_not_null
.
The defaultness validation provides a default value for null or nonexistent ones.
For instance, suppose you want only values that do not exceed a certain length:
name = CharField(max_length=50)
Or suppose you want a default value for fields that are empty or blank:
name = CharField(default='Kylie Jenner')
Remember that validators will validate the value itself, for example by making sure that a URL is indeed a URL or that an email follows the pattern that you would expect from an email.
Suppose you want only values that would be Kendall Jenner. Then you could create a custom validator that would do the following:
def check_name(value):
    if value == "Kylie Jenner":
        return None
    return value

name = CharField(validators=[check_name])
You can also create validators that match a specific regex pattern using the zineb.models.validators.regex_compiler
decorator:
from zineb.models.datastructure import Model
from zineb.models.fields import IntegerField
from zineb.models.validators import regex_compiler

@regex_compiler(r'\d+')
def custom_validator(value):
    if value > 10:
        return value
    return 0

class Player(Model):
    age = IntegerField(validators=[custom_validator])
NOTE: The result of the regex compiler is reinjected into your custom validator on which you can then do your custom checks.
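The re-injection mechanism can be pictured with a small decorator built on the standard library. This is a sketch of the general pattern, not the regex_compiler source: `regex_validator` is a hypothetical name, and converting the match to an integer inside the validator is a choice made for this example.

```python
import functools
import re


def regex_validator(pattern):
    """Hypothetical decorator: pass the first regex match to the validator."""
    compiled = re.compile(pattern)

    def decorator(func):
        @functools.wraps(func)
        def wrapper(value):
            match = compiled.search(str(value))
            if match is None:
                return None
            # The matched text, not the raw value, reaches the validator
            return func(match.group())
        return wrapper
    return decorator


@regex_validator(r"\d+")
def older_than_ten(value):
    age = int(value)
    return age if age > 10 else 0


print(older_than_ten("Age: 24 years"))  # 24
print(older_than_ten("Age: 7 years"))   # 0
```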
In order to get the complete structured data, you need to call resolve_fields, which will return the values stored in the SmartDict container as a list.
player.add_value("name", "Kendall Jenner")
player.resolve_fields()
# -> List
Practically though, you'll be using the save
method which then calls the resolve_fields
under the hood:
player.save(commit=True, filename=None, **kwargs)
# -> List // New File
By calling the save method, you'll also be able to store the data directly to a JSON or CSV file.
Functions are built-in elements that can modify the incoming value in some way before sending it to the SmartDict container through your model.
Allows you to run an arithmetic operation on an incoming value.
from zineb.models.functions import Add, Substract, Divide, Multiply

player.add_calculated_value('height', 175, Add(5))
# -> {'height': [180]}
player.add_calculated_value('height', 175, Substract(5))
# -> {'height': [170]}
player.add_calculated_value('height', 175, Divide(1))
# -> {'height': [175]}
player.add_calculated_value('height', 175, Multiply(1))
# -> {'height': [175]}
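One way to think about these function classes is as small callables that are applied to the value in sequence. The sketch below uses stand-in classes written for this example; they are not Zineb's own `Add` and `Substract`, and `calculate` is a hypothetical helper.

```python
class Add:
    """Stand-in class: adds a fixed amount to a value."""

    def __init__(self, by):
        self.by = by

    def __call__(self, value):
        return value + self.by


class Subtract:
    """Stand-in class: subtracts a fixed amount from a value."""

    def __init__(self, by):
        self.by = by

    def __call__(self, value):
        return value - self.by


def calculate(value, *functions):
    """Apply each function in order, feeding the result to the next one."""
    for function in functions:
        value = function(value)
    return value


print(calculate(175, Add(5)))              # 180
print(calculate(25, Add(5), Subtract(1)))  # 29
```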
From a string that contains a date, extract the year, the month or the day.
from zineb.models.functions import ExtractYear, ExtractMonth, ExtractDay

player.add_value('competition_year', ExtractYear('11-1-2021'))
# -> {'competition_year': [2021]}
player.add_value('competition_month', ExtractMonth('11-1-2021'))
# -> {'competition_month': [11]}
player.add_value('competition_day', ExtractDay('11-1-2021'))
# -> {'competition_day': [1]}
Allows you to conditionally implement a value in the model if it respects a set of conditions.
from zineb.models.functions import When
player.add_value('age', When(21, 25, else_condition=21))
From a set of incoming data, pick the smallest or the greatest one. This requires that all the incoming values be of the same type.
from zineb.models.functions import Smallest, Greatest
player.add_value('name', Smallest('Kendall', 'Kylie', 'Hailey'))
player.add_value('revenue', Greatest(12000, 5000, 156000))
Zineb uses a special built-in HTTPRequest class which wraps the following for better cohesion:
- The requests.Request response class
- The bs4.BeautifulSoup object
In general, you will not need to interact with this class because it's just an interface for implementing additional functionalities on top of the base Request class from the requests module.
- follow: create a new instance of the class whose response will be the one created from the URL that was followed
- follow_all: create new instances of the class whose responses will be the ones created from the URLs that were followed
- urljoin: join a domain to a given path
It wraps the BeautifulSoup object in order to implement some small additional functionalities:
- page_title: return the page's title
- links: return all the links of the page
- images: return all the images of the page
- tables: return all the tables of the page
Most of the time, when you retrieve links from a page, they are returned as relative paths. The urljoin method reconciles the URL of the visited page with that path.
# <a href="/kendall-jenner">Kendall Jenner</a>
# Now we want to reconcile the relative path from this link to
# the main url that we are visiting e.g. https://example.com
request.urljoin("/kendall-jenner")
# -> https://example.com/kendall-jenner
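For comparison, the standard library offers the same kind of reconciliation through `urllib.parse.urljoin`. Whether Zineb's own urljoin delegates to it is not stated here; the example only shows how relative and absolute paths resolve against a page URL, and `page_url` is an illustrative value.

```python
from urllib.parse import urljoin

page_url = "https://example.com/celebrities/"

# A path starting with "/" is resolved from the root of the domain
from_root = urljoin(page_url, "/kendall-jenner")
# A path without a leading "/" is resolved from the current directory
from_page = urljoin(page_url, "kendall-jenner")
# An absolute URL is returned unchanged
absolute = urljoin(page_url, "https://other.org/a")

print(from_root)  # https://example.com/kendall-jenner
print(from_page)  # https://example.com/celebrities/kendall-jenner
print(absolute)   # https://other.org/a
```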
Collect files within a specific directory using collect_files
. Collect files also takes an additional function that can be used to filter or alter the final results.
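A directory walk with an optional filter can be sketched with `os.walk`. The function below is an illustration of the idea and is not the collect_files implementation: the name `collect_html_files` and its `filter_func` parameter are hypothetical.

```python
import os


def collect_html_files(directory, filter_func=None):
    """Walk directory and return the sorted paths of the .html files.

    filter_func, when given, receives each path and must return True
    for the paths to keep.
    """
    found = []
    for root, _dirs, files in os.walk(directory):
        for name in files:
            if name.endswith(".html"):
                found.append(os.path.join(root, name))
    if filter_func is not None:
        found = [path for path in found if filter_func(path)]
    return sorted(found)


# Keep only the files whose name mentions "player"
files = collect_html_files("media/folder", lambda path: "player" in os.path.basename(path))
```

If the directory does not exist, `os.walk` yields nothing and the function simply returns an empty list.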
This section will talk about all the available settings for your project and how they should be used.
Represents the current path for your project. This setting is not to be changed.
In order for your spiders to be executed, they should be registered here. The name of the spider class serves as the name of the spider to be run.
SPIDERS = [
    "MySpider"
]
You can restrict your project to a specific set of domains by ensuring that no request is sent if it matches one of the domains within this list.
DOMAINS = [
    "example.com"
]
Enforce that every link in your project is a secured HTTPS link. This setting is set to False by default.
A user agent is a characteristic string that lets servers and network peers identify the application, operating system, vendor, and/or version of the requesting user agent (MDN Web Docs).
Implement additional sets of user agents in your project in addition to those that were already created.
Specifies whether to use one user agent for every request or to randomize user agents on every request. This setting is set to False by default.
Specify additional default headers to use for each request.
The default initial headers are:
- Accept-Language: en
- Accept: text/html,application/json,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
- Referrer: None
Allows every request to be sent via a proxy. A random proxy is selected and implemented within each request.
PROXIES accepts a list of tuples, each containing a scheme (e.g. http, https) and the IP address to be used.
PROXIES = [
    ("http", "127.0.0.1"),
    ("https", "127.0.0.1")
]
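Selecting a random proxy from such a list could look like the sketch below. The `pick_proxy` helper is hypothetical; it returns a mapping shaped like the `proxies` argument of the requests library, which is an assumption about how the selected tuple would be used.

```python
import random

PROXIES = [
    ("http", "127.0.0.1"),
    ("https", "127.0.0.1"),
]


def pick_proxy(proxies, rng=random):
    """Pick one (scheme, address) tuple at random and return it as a
    {scheme: url} mapping."""
    scheme, address = rng.choice(proxies)
    return {scheme: f"{scheme}://{address}"}


print(pick_proxy(PROXIES))
```

Passing a seeded `random.Random` instance as `rng` makes the selection reproducible, which helps when testing.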
Specifies the retry policy. This is set to False by default. In other words, the request silently fails and never retries.
Specifies the number of times the request is sent before eventually failing.
Indicates which status codes should trigger a retry. By default, the following codes: 500, 502, 503, 504, 522, 524, 408 and 429 will trigger it.
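A retry policy driven by status codes can be sketched as a simple loop. The example below is illustrative and does not reproduce Zineb's retry logic: `send_with_retry` and `retry_times` are hypothetical names, and the transport is faked with an iterator so that the example runs without a network.

```python
RETRY_CODES = {500, 502, 503, 504, 522, 524, 408, 429}


def send_with_retry(send, retry_times=2):
    """Call send() until it returns a status code outside RETRY_CODES
    or the retries are exhausted. Returns (status, attempts)."""
    attempts = 0
    while True:
        attempts += 1
        status = send()
        if status not in RETRY_CODES or attempts > retry_times:
            return status, attempts


# Fake transport that fails twice before succeeding
responses = iter([503, 429, 200])
print(send_with_retry(lambda: next(responses)))  # (200, 3)
```

Here `retry_times` counts the retries made after the first attempt, so a value of 2 allows three requests in total.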
Indicates which timezone to use when manipulating dates and times in the application. The default is America/Chicago
.