feat: Add switch schema type
nzakas committed Sep 17, 2021
1 parent 1fd5e14 commit 094caeb
Showing 6 changed files with 129 additions and 6 deletions.
48 changes: 45 additions & 3 deletions README.md
@@ -113,7 +113,7 @@ If you want a more specific conversion, you should use `"string"` and specify a
}
```

-### `"array"`
+### `"array"` Type

The `"array"` type lets you specify a collection of elements whose text should be extracted and the results put into an array. Here, the `selector` property is expected to return more than one element, and there is an additional `items` property that contains another schema that will be used for each item in the array. For example:

@@ -164,7 +164,7 @@ The elements of the array are always an object, but if you'd like them to be a p
}
```

-### `"object"`
+### `"object"` Type

The `"object"` type lets you specify a collection of properties whose text should be extracted and the results put into an object. There is an additional `properties` property that contains another schema. For example:

@@ -205,7 +205,7 @@ Here, an object is created with three properties. As with `"array"`, the selecto

You can also use a `convert` function.

-### `"custom"`
+### `"custom"` Type

The `"custom"` type lets you control exactly how data is extracted from the page by specifying an `extract` function. The `extract` function receives the element indicated by `selector` and is executed in the context of the Puppeteer page, meaning it does not act as a closure. The `element` passed in is an `HTMLElement` instance that you can interrogate to find the data you want. Then, return a JSON-serializable value from `extract`. For example:

@@ -232,6 +232,48 @@ The `"custom"` type lets you control exactly how data is extracted from the page

You can also use a `convert` function with `"custom"`, and that function does not execute inside of the Puppeteer page, so you can make further customizations to the returned data.

### `"switch"` Type

The `"switch"` type lets you specify multiple possible schemas for a key; the first case that matches supplies the value. You do so by providing a `cases` array in which each case has a CSS selector to look for (`if`) and a schema to apply when that selector is found (`then`). For example:

```js
{
    references: {
        type: "switch",
        cases: [
            {
                if: "ol.references",
                then: {
                    type: "array",
                    selector: "ol.references > li",
                    items: {
                        name: {
                            type: "string"
                        }
                    }
                }
            },
            {
                if: "#references",
                then: {
                    type: "array",
                    selector: "#references + ol > li",
                    items: {
                        name: {
                            type: "string"
                        }
                    }
                }
            }
        ]
    }
}
```

In this example, the key `references` has two possible cases. The first creates an array from the selector `ol.references > li` and the second creates an array from the selector `#references + ol > li`. If the first case's `if` selector (`ol.references`) matches an element, that case's `then` schema is applied and the second case is never checked; if it doesn't match, the second case's `if` selector (`#references`) is checked.

Note: If no cases match, an error is thrown.
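
The first-match rule described above can be sketched without Puppeteer. This is only an illustration, not the library's code: `pickCase` and the `present` set are hypothetical names, and a `Set` of selectors stands in for the lookups that would normally run against the page.

```js
// Hypothetical helper: returns the `then` schema of the first case whose
// `if` selector is "present". A Set of selectors stands in for the page.
function pickCase(cases, present) {
    for (const caseDef of cases) {
        if (present.has(caseDef.if)) {
            return caseDef.then;
        }
    }
    throw new Error("No cases matched.");
}

const cases = [
    {
        if: "ol.references",
        then: { type: "array", selector: "ol.references > li", items: { name: { type: "string" } } }
    },
    {
        if: "#references",
        then: { type: "array", selector: "#references + ol > li", items: { name: { type: "string" } } }
    }
];

// Only the second case's `if` selector exists on this pretend page,
// so the second `then` schema is the one that gets picked.
const chosen = pickCase(cases, new Set(["#references"]));
console.log(chosen.selector);
```

When both selectors are present the first case wins, because the loop returns as soon as it finds a match.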

## Developer Setup

1. Fork the repository
2 changes: 1 addition & 1 deletion src/converters.js
@@ -18,7 +18,7 @@ export function identity(value) {
}

export function stringToNumber(value) {
-    return Number(value.replace(/[^\d\.\-]/g, ""));
+    return Number(value.replace(/[^\d.-]/g, ""));
}
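
The regex change in `stringToNumber` looks like a pure cleanup: inside a character class a dot is always literal, and a hyphen at the end of the class is literal too, so the backslashes were redundant. A quick check along these lines (the sample strings are made up for illustration, not taken from the test suite) compares the two patterns:

```js
// The two character classes from the diff. Inside [...] the dot is literal
// and a trailing hyphen is literal, so the escapes in `before` are redundant.
const before = /[^\d\.\-]/g;
const after = /[^\d.-]/g;

// Illustrative inputs only; these are assumptions, not project fixtures.
const samples = ["$1,234.50", "-42", "3.14 kg", "abc"];

// Convert each sample with both patterns and record whether they agree.
const results = samples.map((sample) => {
    const oldValue = Number(sample.replace(before, ""));
    const newValue = Number(sample.replace(after, ""));
    return { sample, value: newValue, same: Object.is(oldValue, newValue) };
});

for (const { sample, value, same } of results) {
    console.log(sample, "->", value, same ? "(unchanged)" : "(differs)");
}
```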

export function stringToBoolean(value) {
44 changes: 44 additions & 0 deletions src/schema-types.js
@@ -19,14 +19,21 @@ import { stringToBoolean, stringToNumber, identity } from "./converters.js";
* @typedef {import("puppeteer").ElementHandle} ElementHandle
*
* @typedef {Object<string,JSONValue>|Array<JSONValue>|string|number|boolean|null} JSONValue
* @typedef {SchemaDef|ArraySchemaDef|ObjectSchemaDef|TableSchemaDef|SwitchSchemaDef|CustomSchemaDef} AnySchemaDef
*
* @typedef {Object} CaseIf
* @property {string} if The CSS selector to locate.
* @property {AnySchemaDef} then The schema definition to apply if `if` is found.
*
* @typedef {Object} SchemaDef
* @property {string} type The type of schema.
* @property {string} selector The CSS selector to locate the element.
* @property {boolean} [optional=false] Indicates if the selector may not exist.
* @property {Function?} convert A conversion function that will initially
* receive the extracted data before placing it in the data structure
*
* @typedef {Object} ArraySchemaDef
* @property {string} type The type of schema.
* @property {string} selector The CSS selector to locate the element.
* @property {boolean} [optional=false] Indicates if the selector may not exist.
* @property {Function?} convert A conversion function that will initially
@@ -35,6 +42,7 @@ import { stringToBoolean, stringToNumber, identity } from "./converters.js";
* in the array.
*
* @typedef {Object} CustomSchemaDef
* @property {string} type The type of schema.
* @property {string} selector The CSS selector to locate the element.
* @property {HTMLElement => JSONValue} extract A function that receives an HTML element as
* an argument and must return a serializable value. This function runs
@@ -43,14 +51,20 @@ import { stringToBoolean, stringToNumber, identity } from "./converters.js";
* receive the extracted data before placing it in the data structure
*
* @typedef {Object} ObjectSchemaDef
* @property {string} type The type of schema.
* @property {string} selector The CSS selector to locate the element.
* @property {boolean} [optional=false] Indicates if the selector may not exist.
* @property {Function?} convert A conversion function that will initially
* receive the extracted data before placing it in the data structure
* @property {Object<string,SchemaDef>} properties The schema for each
* property in the object.
*
* @typedef {Object} SwitchSchemaDef
* @property {string} type The type of schema.
* @property {Array<CaseIf>} cases The cases to check.
*
* @typedef {Object} TableSchemaDef
* @property {string} type The type of schema.
* @property {string} selector The CSS selector to locate the element.
* @property {boolean} [optional=false] Indicates if the selector may not exist.
* @property {Function?} convert A conversion function that will initially
@@ -125,8 +139,14 @@ export const schemaTypes = {
     * @param {Page|ElementHandle} root The page or element handle to query from.
     * @param {ArraySchemaDef} def The schema definition for the array.
     * @returns {Array} An array of data matching the definition.
     * @throws {TypeError} If required information is missing.
     */
    async array(root, { selector, optional, items, convert = identity }) {

        if (typeof items === "undefined") {
            throw new TypeError(`Array definition for "${selector}" is missing "items" property.`);
        }

        const itemHandles = await root.$$(selector);

        if (itemHandles.length === 0) {
@@ -249,6 +269,30 @@ export const schemaTypes = {
});
},

    /**
     * Chooses the value from the first case that matches.
     * @param {Page|ElementHandle} root The page or element handle to query from.
     * @param {SwitchSchemaDef} def The schema definition for the switch.
     * @returns {*} The value returned from the first matching case.
     * @throws {TypeError} If required information is missing.
     */
    async switch(root, { cases }) {

        if (!Array.isArray(cases)) {
            throw new TypeError("Switch definition is missing 'cases' array.");
        }

        for (const caseDef of cases) {
            const handle = await root.$(caseDef.if);
            if (handle) {
                return this[caseDef.then.type](root, caseDef.then);
            }
        }

        throw new Error("No cases matched.");

    },
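
To see how the `switch` method delegates through `this[caseDef.then.type]`, it can be exercised against a stand-in for the Puppeteer page. The sketch below rests on several assumptions: `makeFakeRoot`, `types`, `def`, and `demo` are hypothetical names, the `array` stub only echoes its selector instead of extracting anything, and the fake `$` simply resolves to `null` for unknown selectors.

```js
// Stand-in for a Puppeteer page: `$` resolves to a truthy handle when the
// selector is known and to null otherwise.
function makeFakeRoot(knownSelectors) {
    return {
        async $(selector) {
            return knownSelectors.includes(selector) ? { selector } : null;
        }
    };
}

const types = {
    // Stub for the real "array" type; it only echoes the selector it was given.
    async array(root, { selector }) {
        return [`items from ${selector}`];
    },

    // Same control flow as the `switch` method in the diff.
    async switch(root, { cases }) {
        if (!Array.isArray(cases)) {
            throw new TypeError("Switch definition is missing 'cases' array.");
        }
        for (const caseDef of cases) {
            const handle = await root.$(caseDef.if);
            if (handle) {
                return this[caseDef.then.type](root, caseDef.then);
            }
        }
        throw new Error("No cases matched.");
    }
};

const def = {
    cases: [
        { if: "#foo", then: { type: "array", selector: "#foo > ol > li" } },
        { if: "#references", then: { type: "array", selector: "#references + ol > li" } }
    ]
};

async function demo() {
    // Only "#references" exists, so the second case is delegated to `array`.
    const matched = await types.switch(makeFakeRoot(["#references"]), def);

    // With no known selectors, every case is skipped and the method throws.
    let noMatchMessage = null;
    try {
        await types.switch(makeFakeRoot([]), def);
    } catch (error) {
        noMatchMessage = error.message;
    }
    return { matched, noMatchMessage };
}

demo().then((result) => console.log(result));
```

Because the lookup is `this[caseDef.then.type]`, a `then` schema can name any type registered on the same object, including another `"switch"`.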

/**
* Creates an object containing information from an HTML table.
* @param {Page|ElementHandle} root The page or element handle to query from.
29 changes: 29 additions & 0 deletions tests/data-extractor.test.js
@@ -165,6 +165,35 @@ const salaryPost = {
                selector: "meta[name='og:type']"
            }
        }
    },
    references: {
        type: "switch",
        cases: [
            {
                if: "#foo",
                then: {
                    type: "array",
                    selector: "#foo > ol > li",
                    items: {
                        name: {
                            type: "string"
                        }
                    }
                }
            },
            {
                if: "#references",
                then: {
                    type: "array",
                    selector: "#references + ol > li",
                    items: {
                        name: {
                            type: "string"
                        }
                    }
                }
            }
        ]
    }
};

1 change: 0 additions & 1 deletion tests/fixtures/blog-somewhat-complete-salary-history.html
@@ -424,7 +424,6 @@ <h2 class="no-margin-top">Join the Mailing List</h2>

<div id="sidebar" class="sidebar-width sidebar-background gutters hide-on-small-screens">
<h1 class="hide-offscreen">Additional Information</h1>
-<script async type="text/javascript" src="//cdn.carbonads.com/carbon.js?serve=CKYIEK3Y&placement=humanwhocodescom" id="_carbonads_js"></script>



11 changes: 10 additions & 1 deletion tests/fixtures/blog-somewhat-complete-salary-history.json
@@ -169,5 +169,14 @@
        "title": "My (somewhat) complete salary history as a software engineer",
        "description": "",
        "type": "article"
-    }
+    },
    "references": [
        {
            "name": "By the Numbers: What pay inequality looks like for women in tech (forbes.com)"
        },
        {
            "name": "Women Know When Negotiating Isn’t Worth It (theatlantic.com)"
        }

    ]
}
