Skip to content

Latest commit

 

History

History
650 lines (521 loc) · 19.3 KB

README.md

File metadata and controls

650 lines (521 loc) · 19.3 KB

gurkha

Data extraction module for Yakuza

Description

gurkha is a simple data extraction tool designed to standarize html parsing. While it was thought with web scraping in mind, it can be used as a standalone tool. Let's say you have the following html table:

Product Code Price
Apple 2001 $0.40
Orange 2002 $0.44
Banana 2003 $0.50

Let's also say you wish to generate an array of 'product objects' like so:

[{'name': 'Apple', 'code': '2001', 'price': '0.40'},
 {'name': 'Orange', 'code': '2002', 'price': '0.44'},
 {'name': 'Banana', 'code': '2003', 'price': '0.50'}]

With gurkha, you just have to specify the object structure (called schema object) you desire, and a set of rules to indicate where in the html you can find the desired data for each object member. You can also specify post-processing functions to sanitize the data after it is retrieved (for example, if the price component of each object has to be stripped of the dollar signs). After that, you're just a function call away from your precious object array!

Usage

Considering the previous example, this is what the table html would look like:

<html>
  <table id='fruit'>
    <thead>
      <tr>
        <td>Product</td>
        <td>Code</td>
        <td>Price</td>
      </tr>
    </thead>
    <tbody>
      <tr>
        <td>Apple</td>
        <td>2001</td>
        <td>$0.40</td>
      </tr>
      <tr>
        <td>Orange</td>
        <td>2002</td>
        <td>$0.44</td>
      </tr>
      <tr>
        <td>Banana</td>
        <td>2003</td>
        <td>$0.50</td>
      </tr>
    </tbody>
  </table>
</html>

And this is how you would parse it using gurkha:

var Gurkha = require('gurkha');
var gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  'name': 'td:nth-child(1)',
  'code': 'td:nth-child(2)',
  'price': {
    '$rule': 'td:nth-child(3)',
    '$sanitizer': function ($elem) {
      return $elem.text().replace(/\$/, '');
    }
  }
});

var products = gk.parse(someHtml);

console.log(products[0].name);
// 'Apple'

gurkha automatically detects whether the selection returns multiple elements and in that case, the value of that object member will be an array.

You can also pass an extra object to the Gurkha constructor if you need to modify any of the default parsing options and/or pass extra parameters to be used in the schema's functions. We will look into parameters in depth in another section. The syntax for the options object is the following:

{
  'options': {
    // parsing options
  },
  'params': {
    // parameters
  }
}

These parsing options are taken directly from htmlparser2, therefore any options that can be used in htmlparser2 are valid in cheerio as well. The default options are:

{
    normalizeWhitespace: false,
    xmlMode: false,
    decodeEntities: true
}

The result of a .parse() call will always be an array.

Smart Parsing

gurkha is smart. You can instruct it to parse html into any kind of object structure. This includes nested objects and arrays, single values or an array of any mix of valid object structures. The only current limitation is that you cannot have dynamic object member names (they have to be predefined).

Nested Objects

Following the previous example, let's say you wish to create a different structure, like so:

[{
  'fruit1:' {
    'name': 'Apple',
    'code': '2001',
    'price': '$0.40'
  },
  'fruit2': {
    'name': 'Orange',
    'code': '2002',
    'price': '$0.44'
  },
  'fruit3': {
    'name': 'Banana',
    'code': '2003',
    'price': '$0.50'
  }
}]

Your schema object would then look something like this:

var gk = new Gurkha({
  '$rule': 'table#fruit > tbody',
  'fruit1': {
    '$rule': 'tr:nth-child(1)',
    'name': 'td:nth-child(1)',
    'code': 'td:nth-child(2)',
    'price': 'td:nth-child(3)'
  },
  'fruit2': {
    '$rule': 'tr:nth-child(2)',
    'name': 'td:nth-child(1)',
    'code': 'td:nth-child(2)',
    'price': 'td:nth-child(3)'
  },
  'fruit3': {
    '$rule': 'tr:nth-child(3)',
    'name': 'td:nth-child(1)',
    'code': 'td:nth-child(2)',
    'price': 'td:nth-child(3)'
  }
});

This particular example is kind of redundant, but it shows how you can build infinitely nested objects with gurkha.

Nested Arrays

Let's say you now just want your fruit data in array format, like so:

[
  {
    'fruit': ['Apple', '2001', '$0.40']
  },
  {
    'fruit': ['Orange', '2002', '$0.44']
  },
  {
    'fruit': ['Banana', '2003', '$0.50']
  }
]

Your schema object would then look something like this:

gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  'fruit': [
    'td:nth-child(1)',
    'td:nth-child(2)',
    'td:nth-child(3)'
  ]
});

Single Array

If you just wished to get an array of the fruit names:

['Apple', 'Orange', 'Banana']

Your schema object would then look something like this:

gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr > td:nth-child(1)'
});

String

You can pass a single selector to gurkha and it will return an array of the text attributes of all elements selected.

So this schema object

gk = new Gurkha('table#fruit > tbody > tr > td:nth-child(1)');

would return the same as the previous example, that is

['Apple', 'Orange', 'Banana']

Mixed Array

You can give gurkha an array of valid schema objects and it will yield an array of the results of parsing each of those objects separately! You can even nest these arrays and gurkha will follow suit.

Just play around with the schema object and the selectors and you should be able to create just about any kind of structure you desire!

Options

gurkha has a bunch of reserved keywords in the schema object to help you build your structure. This section will explain them in detail.

Rule

The '$rule' object member specifies the CSS selector that will retrieve the element in question. If absent, the selection will begin at the top level of the DOM.

When a member of the schema object has a string for a value, the string value is implicitly bound to '$rule'.

{
  'name': 'table#fruit > tbody > tr > td:nth-child(1)'
}

Is equivalent to

{
  'name': {
    '$rule': 'table#fruit > tbody > tr > td:nth-child(1)'
  }
}

It must be noted that nested '$rule' values are equivalent to a jquery .find() method.

For example, in this schema object

{
  '$rule': 'table#fruit > tbody > tr',
  'name': {
    '$rule': 'td:nth-child(1)'
  }
}

The 'name' object member will be extracted using

$('table#fruit > tbody > tr').find('td:nth-child(1)')

Which is equivalent to concatenating the selectors.

This holds true to implicit rules, such as

{
  '$rule': 'table#fruit > tbody > tr',
  'name': 'td:nth-child(1)'
}

Use of '$rule' is optional, but not using it at the top level of the schema object will result on a performance hit.

For example, this is a perfectly valid schema object:

{
  'name': 'table#fruit > tbody > tr > td:nth-child(1)'
}

Sanitizing function

The '$sanitizer' object member specifies a sanitizing function to be applied to the data retrieved by the selector specified in '$rule'. It always receives a cheerio object representing the selected element as a parameter and must return the sanitized value. This is useful if you wish to perform any operations on the data, like removing special characters, trimming or even to continue traversing the DOM and retrieve other values to perform mixed operations.

For instance, a good use of '$sanitizer' would be to remove the dollar signs on the price of the fruit, like in the first example.

'$sanitizer' will apply to the element selected by '$rule' only if the schema object of which it is a member has no unreserved object members. If there are unreserved members, '$sanitizer' will apply only to those which do not have unreserved members themselves. In other words, it will only apply to single-rule schema objects within the schema object where it is defined, so if you have this schema object

gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  '$sanitizer': function ($elem) {
    return $elem.text() + ' gurkha rocks!';
  },
  'name': 'td:nth-child(1)',
  'name2': {
    '$rule': 'td:nth-child(1)'
  },
  'name-price': [
    'td:nth-child(1)',
    'td:nth-child(3)'
  ],
  'price-code': {
    'code': 'td:nth-child(2)',
    'price': 'td:nth-child(3)'
  }
});

the sanitizing function will not propagate to the inner members of 'price-code', but will apply to 'name', 'name2' and 'name-price', yielding the following result:

[ { name: 'Apple gurkha rocks!',
    name2: 'Apple gurkha rocks!',
    'name-price': [ 'Apple gurkha rocks!', '$0.40 gurkha rocks!' ],
    'price-code': { code: '2001', price: '$0.40' } },
  { name: 'Orange gurkha rocks!',
    name2: 'Orange gurkha rocks!',
    'name-price': [ 'Orange gurkha rocks!', '$0.44 gurkha rocks!' ],
    'price-code': { code: '2002', price: '$0.44' } },
  { name: 'Banana gurkha rocks!',
    name2: 'Banana gurkha rocks!',
    'name-price': [ 'Banana gurkha rocks!', '$0.50 gurkha rocks!' ],
    'price-code': { code: '2003', price: '$0.50' } } ]

Moreover, inner sanitizing functions override outer ones in their scope.

Top level selection

A schema object that has the '$topLevel' object member set to true will override '$rule' concatenation, meaning that it will select from the top level of the DOM. This is useful if the data you wish to extract is not all in one place.

Let's continue with the example but add an anchor tab outside the table:

<html>
  <a href="https://www.npmjs.com/package/gurkha">gurkha</a>
  <table>
    ...
  </table>
</html>

Let's say that for some reason, you wish your fruit objects to also have the link as a property. Your object array would look like this:

[{'name': 'Apple', 'code': '2001', 'price': '0.40', link: 'https://www.npmjs.com/package/gurkha'},
 {'name': 'Orange', 'code': '2002', 'price': '0.44', link: 'https://www.npmjs.com/package/gurkha'},
 {'name': 'Banana', 'code': '2003', 'price': '0.50', link: 'https://www.npmjs.com/package/gurkha'}]

With '$rule' concatenation, any selector you added to extract the 'link' attribute would start selecting from the rows of the table, yielding no results. Here is where you can use '$topLevel' to specify that the selection should start from the top level of the DOM.

Your schema object would look like this

gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  'name': 'td:nth-child(1)',
  'code': 'td:nth-child(2)',
  'price': {
    '$rule': 'td:nth-child(3)',
    '$sanitizer': function ($elem) {
      return $elem.text().replace(/\$/, '');
    }
  },
  'link': {
    '$rule': 'a',
    '$topLevel': true,
    '$sanitizer': function ($elem) {
      return $elem.attr('href');
    }
  }
});

Post-processing

Sometimes you may want to perform some post-processing operations on your object before gurkha returns it. You may even want to perform them on one of the inner objects in your master object.

Let's take our first example and say that you wish to add the word 'Premium' at the beginning of the name of any fruit with a price of more than $0.42. Normally, you'd have to wait until gurkha returns your object and then pass it to a function that does this operation. With '$post', you may simply declare a function and gurkha will pass the object to that function before it returns it.

Your schema object would look something like this:

var gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  '$post': function (obj) {
    if (obj.price > 0.42) {
      obj.name = 'Premium ' + obj.name;
    }

    return obj;
  },
  'name': 'td:nth-child(1)',
  'code': 'td:nth-child(2)',
  'price': {
    '$rule': 'td:nth-child(3)',
    '$sanitizer': function ($elem) {
      return $elem.text().replace(/\$/, '');
    }
  }
});
``

And the result would be

```javascript
[{'name': 'Apple', 'code': '2001', 'price': '0.40'},
 {'name': 'Premium Orange', 'code': '2002', 'price': '0.44'},
 {'name': 'Premium Banana', 'code': '2003', 'price': '0.50'}]

It must be noted that post-processing functions must receive the object and return a value, so even if you just wish to perform operations on the object members, you must return the object at the end.

As you might expect, you can define post-processing functions for inner objects in your schema. For example, let's say you have this object structure

{
  'names': [
    'Apple',
    'Orange',
    'Banana'
  ],
  'price-codes': [
    {
      'price': '$0.40',
      'code': '2001'
    },
    {
      'price': '$0.44',
      'code': '2002'
    },
    {
      'price': '$0.50',
      'code': '2003'
    }
  ]
}

but you wish to compress both price and code into a single value (i.e. '0.40-2001'). You can accomplish this with '$post' in the following way:

gk = new Gurkha({
  'names': 'table#fruit > tbody > tr > td:nth-child(1)',
  'price-codes': {
    '$rule': 'table#fruit > tbody > tr',
    '$post': function (obj) {
      return obj.price + '-' + obj.code;
    },
    'price': 'td:nth-child(3)',
    'code': 'td:nth-child(2)'
  }
});

and the result would be

{
  'names': [
    'Apple',
    'Orange',
    'Banana'
  ],
  'price-codes': [
    '$0.40-2001',
    '$0.44-2002',
    '$0.50-2003'
  ]
}

Be mindful that post-processing functions apply over each element of the object array, so you cannot post-process the entire array.

Filters

Let's say you want gurkha to ignore any rows with the word 'Apple' in them. You could use the :nth-child pseudo selector, but there is a better way. Gurkha now supports filtering functions. A filtering function is a simple function which receives a cheerio object as an argument and returns true if that element should be filtered out. For instance, for this particular example, your schema object would look like this:

var gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  'name': 'td:nth-child(1)',
  'code': 'td:nth-child(2)',
  'price': {
    '$rule': 'td:nth-child(3)',
    '$sanitizer': function ($elem) {
      return $elem.text().replace(/\$/, '');
    }
  },
  '$ignore: function ($elem) {
    return $elem.text().indexOf('Apple') > -1;
  }'
});

and the result would of course be:

[{'name': 'Orange', 'code': '2002', 'price': '0.44'},
 {'name': 'Banana', 'code': '2003', 'price': '0.50'}]

Constants

Sometimes not all of your desired object members are obtained via parsing. Sometimes you want your parsed object to contain some constant values. Let's say you wish to add a setter method for the ever fluctuating prices of fruit. You could always simply add it to the object with a post-processing function, or a sanitizer function that returns said function, but there is an easier and more readable way. Enter the 'constant' option:

var gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  'name': 'td:nth-child(1)',
  'code': 'td:nth-child(2)',
  'price': {
    '$rule': 'td:nth-child(3)',
    '$sanitizer': function ($elem) {
      return $elem.text().replace(/\$/, '');
    }
  },
  'setPrice': {
    '$constant': function (val) {
      this.price = val;
    }
  }
});

By using the 'constant' option, you are specifying that any preceding rules are to be ignored and that the result of parsing with that schema object will be whichever value its '$constant' member has. Be mindful that if '$constant' is set, any other options on that schema object are ignored, including post-processing functions.

Parameters

As it was mentioned in the introduction, you can pass parameters to the gurkha constructor along with the parsing options. These parameters are meant to be used by the sanitizer, filter and post-processing functions. Here are a couple of use cases that will illustrate how you can use parameters to your benefit.

Let's take the first example and say you wish to convert the price to another currency, but the specific currency conversion rate you wish to use is dynamic. You could of course define the schema within the scope of your currency conversion variable, but there is a better, more readable way.

First, you set up your schema like so:

var mySchema = {
  '$rule': 'table#fruit > tbody > tr',
  'name': 'td:nth-child(1)',
  'code': 'td:nth-child(2)',
  'price': {
    '$rule': 'td:nth-child(3)',
    '$sanitizer': function ($elem, params) {
      return $elem.text().replace(/\$/, '') * params.conversionRate;
    }
  }
};

And then you simply pass the gurkha constructor an options object which contains your conversion rate as a parameter, like this:

var gk = new Gurkha(mySchema,
{
  'params': {
    'conversionRate': someRate // defined elsewhere
  }
});

The params object will be passed to every sanitizing function in the schema object, no matter how nested it is.

In another example, let's say you now wish to keep the original price but add a function to your object that logs out the converted price to the console. You could achieve this with parameters and the post-processing function.

This is how the constructor call would look like:

var gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  'name': 'td:nth-child(1)',
  'code': 'td:nth-child(2)',
  'price': {
    '$rule': 'td:nth-child(3)',
    '$sanitizer': function ($elem) {
      return $elem.text().replace(/\$/, '');
    }
  },
  '$post': function (obj, params) {
    obj.printConvertedPrice = function () {
      console.log(this.price * params.conversionRate);
    };

    return obj;
  }
},
{
  'params': {
    'conversionRate': someRate // defined elsewhere
  }
});

As with the sanitizers, the params object is also passed to every post-processing function in the schema object.

Finally, let's say you wish to filter out all rows that contain a certain string of text, but this text is also dynamic.

This is how the constructor call would look like:

var gk = new Gurkha({
  '$rule': 'table#fruit > tbody > tr',
  'name': 'td:nth-child(1)',
  'code': 'td:nth-child(2)',
  'price': {
    '$rule': 'td:nth-child(3)',
    '$sanitizer': function ($elem) {
      return $elem.text().replace(/\$/, '');
    }
  },
  '$ignore': function ($elem, params) {
    return $elem.text().indexOf(params.forbiddenText) > -1;
  }
},
{
  'params': {
    'forbiddenText': someText // defined elsewhere
  }
});

As before, the params object is passed to every filtering function in the schema object.

Since you can only pass a single params object to the gurkha constructor, it is shared among all functions in the schema. This allows you to mix-and-match.

As a final note, parameters passed to the constructor can be overriden by passing a new params object to a .parse() call.

Contributing

If you wish to contribute to this module, feel free to branch out from the development branch, I'll be glad to go over your contributions and add them if they're reasonable. Please test your code before sending in a pull request. Feel free to write any tests you need to check new functionality you wish to add. All tests are written using jasmine-node. Any requests for features or bug fixes can be made by adding a new issue.