Merge pull request #43 from Narzerus/development
Development
Rafael Vidaurre authored and Rafael Vidaurre committed Apr 16, 2015
2 parents 9608b50 + 5db4d0e commit 78e1a7b
Showing 13 changed files with 164 additions and 382 deletions.
99 changes: 43 additions & 56 deletions Readme.md
@@ -103,23 +103,19 @@ Remember the `plan` property we mentioned before? Now is a good time to use that
This plan runs `login` and `getArticlesList` sequentially:

```javascript
Yakuza.agent('articles', 'techCrunch').setup(function (config) {
config.plan = [
'login',
'getArticlesList'
];
});
Yakuza.agent('articles', 'techCrunch').plan([
'login',
'getArticlesList'
]);
```

This one runs `login` before the other tasks, but runs `getArticlesList` and `getUsersList` in parallel as they are in the same sub-array:

```javascript
Yakuza.agent('articles', 'techCrunch').setup(function (config) {
config.plan = [
'login',
['getArticlesList', 'getUsersList']
];
});
Yakuza.agent('articles', 'techCrunch').plan([
'login',
['getArticlesList', 'getUsersList']
]);
```
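
The two plans above mix bare task ids with sub-arrays. As a minimal sketch of how such a plan could be normalized into a uniform list of execution blocks, consider the helper below. `normalizePlan` is a hypothetical name used for illustration and is not Yakuza's actual implementation; the only detail taken from the project is the error message used when the plan is not an array.

```javascript
// Hypothetical helper, not part of Yakuza's API.
// Turns every plan element into an array of task objects.
function normalizePlan(plan) {
  if (!Array.isArray(plan)) {
    throw new Error('Agent plan must be an array of task ids');
  }

  return plan.map(function (element) {
    // A bare element is treated as a block containing a single task
    var block = Array.isArray(element) ? element : [element];

    return block.map(function (task) {
      // Strings become task objects so every entry has the same shape
      return typeof task === 'string' ? {taskId: task} : task;
    });
  });
}

var blocks = normalizePlan(['login', ['getArticlesList', 'getUsersList']]);
console.log(JSON.stringify(blocks));
// prints [[{"taskId":"login"}],[{"taskId":"getArticlesList"},{"taskId":"getUsersList"}]]
```

With every element in the same shape, code that walks the plan does not need to special-case strings, objects and sub-arrays.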

Agents can also define `routines`, each of which names a set of tasks to be run. For example, you might want to define three routines:
@@ -336,15 +332,13 @@ Hooks are run in specific moments of an instanced `task`'s life (before emitting
To specify a `task`'s hooks use its `hooks` method.

```javascript
Yakuza.task('scraper', 'agent', 'someTask').setup(function (config) {
config.hooks = {
'onFail': function (task) {
// ... do stuff
},
'onSuccess': function (task) {
// ... do stuff
}
};
Yakuza.task('scraper', 'agent', 'someTask').hooks({
'onFail': function (task) {
// ... do stuff
},
'onSuccess': function (task) {
// ... do stuff
}
});
```

@@ -369,18 +363,16 @@ The `task` object passed to the `onSuccess` hook has the following properties:
Here's an example on when this could be useful:

```javascript
Yakuza.task('scraper', 'agent', 'login').setup(function (config) {
config.hooks = {
'onSuccess': function (task) {
// We stop the job if the loginStatus returns `wrongPassword`
// remember: in many cases wrongPassword might NOT be an error, identifying what's the login status
// can be part of a successful scraping process as well.

if (task.data.loginStatus === 'wrongPassword') {
task.stopJob();
}
Yakuza.task('scraper', 'agent', 'login').hooks({
'onSuccess': function (task) {
// We stop the job if the loginStatus returns `wrongPassword`
// remember: in many cases wrongPassword might NOT be an error, identifying what's the login status
// can be part of a successful scraping process as well.

if (task.data.loginStatus === 'wrongPassword') {
task.stopJob();
}
};
}
}).main(function (task, http, params) {
var opts;

@@ -393,8 +385,8 @@ Here's an example on when this could be useful:
};

http.post(opts)
.then(function (res, body) {
if (body === 'wrong password') {
.then(function (result) {
if (result.body === 'wrong password') {
task.success({loginStatus: 'wrongPassword'});
} else {
task.success({loginStatus: 'authorized'});
@@ -537,22 +529,21 @@ Running task instances sequentially
Sometimes because of server limitations, we might want several instances of the same task to run sequentially. Take our previous example about articles, where we instanced `getArticleData` multiple times. Let's say the server doesn't allow us to view multiple articles in parallel because god knows why. We would need to change the default behavior of task instances and run them one after the other.

This can be achieved in the agent plan by changing the `selfSync` property:

```javascript
Yakuza.agent('articles', 'fooBlog').setup(function (config) {
config.plan = [
'getArticlesList',
{taskId: 'getArticleData', selfSync: true}
];
});
Yakuza.agent('articles', 'fooBlog').plan([
'getArticlesList',
{taskId: 'getArticleData', selfSync: true}
]);
```

Saving cookies
--------------
A lot of times we need to preserve cookies so that they exist for other tasks. This can be achieved by a method called `saveCookies()`.

Example:
```javascript
Yakuza.task('scraper', 'agent', 'login').main(function (task, http, params) {
```javascript
// .. Send a login form
task.saveCookies();
// .. Do more stuff
@@ -569,15 +560,13 @@ In many cases the websites we scrape are sloppy, implemented in very wrong ways
When a task is rerun, it is reset to the state it had when it was instanced, except for some properties such as `startTime`, which marks the moment the task was first run.

```javascript
Yakuza.task('scraper', 'agent', 'login').setup(function (config) {
config.hooks = {
onFail: function (task) {
if (task.runs <= 5) {
// Will retry the task a maximum amount of 5 times
task.rerun();
}
Yakuza.task('scraper', 'agent', 'login').hooks({
onFail: function (task) {
if (task.runs <= 5) {
// Will retry the task a maximum amount of 5 times
task.rerun();
}
};
}
});
```
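
To see how a bounded retry of this kind behaves, here is a small self-contained simulation. `runTask` is a hypothetical harness written for this example, and the way it counts `runs` (incremented once per execution, starting at 1) is an assumption rather than a statement about Yakuza's internals.

```javascript
// Hypothetical harness, not part of Yakuza: executes `attempt` and calls
// `onFail(task)` every time the attempt reports a failure.
function runTask(attempt, onFail) {
  var task = {
    runs: 0,
    succeeded: false,
    rerun: function () {
      execute();
    }
  };

  function execute() {
    task.runs += 1; // assumption: `runs` counts executions, starting at 1
    if (attempt(task.runs)) {
      task.succeeded = true;
    } else {
      onFail(task);
    }
  }

  execute();
  return task;
}

// Same check as the hook: rerun while the task has run at most 5 times
function onFail(task) {
  if (task.runs <= 5) {
    task.rerun();
  }
}

// An attempt that never succeeds: 1 initial run plus 5 reruns
var alwaysFailing = runTask(function () { return false; }, onFail);
console.log(alwaysFailing.runs); // prints 6

// An attempt that succeeds on its third run stops retrying immediately
var flaky = runTask(function (run) { return run === 3; }, onFail);
console.log(flaky.runs, flaky.succeeded); // prints 3 true
```

Under these assumptions the guard allows five reruns after the initial run, so a task that always fails is executed six times in total.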

@@ -590,13 +579,11 @@ Execution Block
---------------
An execution block is a set of tasks that run in parallel. For example, take the following plan:
```javascript
Yakuza.agent('scraper', 'agent').setup(function (config) {
config.plan = [
'task1', // Execution block 1
['task2', 'task3'], // Execution block 2
'task4' // Execution block 3
];
});
Yakuza.agent('scraper', 'agent').plan([
'task1', // Execution block 1
['task2', 'task3'], // Execution block 2
'task4' // Execution block 3
])
```

Execution blocks run sequentially, meaning one execution block will only run when the previous block was run or **skipped**.
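
The ordering described above can be sketched with plain promises. `runPlan` and `makeTask` are hypothetical helpers written for this illustration; they do not reflect how Yakuza schedules tasks internally, and skipping of blocks is left out.

```javascript
// Hypothetical runner, not part of Yakuza. `tasks` maps task ids to
// functions that return a promise.
function runPlan(plan, tasks) {
  return plan.reduce(function (previous, block) {
    var ids = Array.isArray(block) ? block : [block];

    // Wait for the previous block, then start every task of this block at once
    return previous.then(function () {
      return Promise.all(ids.map(function (id) {
        return tasks[id]();
      }));
    });
  }, Promise.resolve());
}

// Builds a fake task that records when it starts and finishes
function makeTask(id, log) {
  return function () {
    log.push(id + ' started');
    return new Promise(function (resolve) {
      setTimeout(function () {
        log.push(id + ' finished');
        resolve();
      }, 10);
    });
  };
}

var log = [];
runPlan(['task1', ['task2', 'task3'], 'task4'], {
  task1: makeTask('task1', log),
  task2: makeTask('task2', log),
  task3: makeTask('task3', log),
  task4: makeTask('task4', log)
}).then(function () {
  console.log(log.join('\n'));
});
```

In the printed log, `task2` and `task3` both start before either finishes, while `task4` starts only after both have finished.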
45 changes: 8 additions & 37 deletions agent.js
@@ -23,12 +23,6 @@ function Agent (id) {
*/
this.__applied = false;

/**
* List of functions which modify the Agent's configuration (provided by setup())
* @private
*/
this.__configCallbacks = [];

/**
* Agent's configuration object (set by running all configCallback functions)
* @private
@@ -58,17 +52,6 @@ function Agent (id) {
this.id = id;
}

/**
* Run functions passed via config(), thus applying their config logic
* @private
*/
Agent.prototype.__applyConfigCallbacks = function () {
var _this = this;
_.each(_this.__configCallbacks, function (configCallback) {
configCallback(_this.__config);
});
};

/**
* Turns every element in the execution plan into an array for type consistency
* @private
@@ -108,41 +91,29 @@ Agent.prototype.__formatPlan = function () {
this._plan = formattedPlan;
};

/**
* Applies all task definitions
* @private
*/
Agent.prototype.__applyTaskDefinitions = function () {
_.each(this._taskDefinitions, function (taskDefinition) {
taskDefinition._applySetup();
});
};

/**
* Applies all necessary processes regarding the setup stage of the agent
*/
Agent.prototype._applySetup = function () {
if (this.__applied) {
return;
}
this.__applyConfigCallbacks();
this.__applyTaskDefinitions();

this.__formatPlan();
this.__applied = true;
};

/**
* Saves a configuration function into the config callbacks array
* @param {function} cbConfig method which modifies the agent's config object (passed as argument)
* Sets the agent's execution plan
* @param {Array} executionPlan array representing the execution plan for this agent
*/
Agent.prototype.setup = function (cbConfig) {
if (!_.isFunction(cbConfig)) {
throw new Error('Setup argument must be a function');
Agent.prototype.plan = function (executionPlan) {
// TODO: Validate execution plan format right away
if (!_.isArray(executionPlan)) {
throw new Error('Agent plan must be an array of task ids');
}

this.__configCallbacks.push(cbConfig);

return this;
this.__config.plan = executionPlan;
};

/**
1 change: 0 additions & 1 deletion job.js
@@ -631,7 +631,6 @@ Job.prototype.__applyComponents = function () {
return;
}

this._scraper._applySetup();
this.__agent._applySetup();

this.__componentsApplied = true;
2 changes: 1 addition & 1 deletion package.json
@@ -1,6 +1,6 @@
{
"name": "yakuza",
"version": "0.2.1",
"version": "1.0.0",
"description": "",
"main": "yakuza.js",
"repository": {
55 changes: 0 additions & 55 deletions scraper.js
@@ -16,24 +16,6 @@ Agent = require('./agent');
* @class
*/
function Scraper () {
/**
* Determines if the setup processes have been applied
* @private
*/
this.__applied = false;

/**
* Array of callbacks provided via config() which set the Scraper's configuration variables
* @private
*/
this.__configCallbacks = [];

/**
* Config object, contains configuration data and is exposed via the setup() method
* @private
*/
this.__config = {};

/**
* Object which contains scraper-wide routine definitions, routines are set via the routine()
* method
@@ -70,43 +52,6 @@ Scraper.prototype.__createAgent = function (agentId) {
return this._agents[agentId];
};

/**
* Run functions passed via config(), thus applying their config logic
* @private
*/
Scraper.prototype.__applyConfigCallbacks = function () {
var _this = this;
_.each(_this.__configCallbacks, function (configCallback) {
configCallback(_this.__config);
});
};

/**
* Applies all necessary processes regarding the setup stage of the scraper
*/
Scraper.prototype._applySetup = function () {
if (this.__applied) {
return;
}
this.__applyConfigCallbacks();
this.__applied = true;
};

/**
* Used to configure the scraper, it enqueues each configuration function meaning it
* allows a scraper to be configured in multiple different places
* @param {function} cbConfig function which will modify config parameters
*/
Scraper.prototype.setup = function (cbConfig) {
if (!_.isFunction(cbConfig)) {
throw new Error('Config argument must be a function');
}

this.__configCallbacks.push(cbConfig);

return Scraper;
};

/**
* Creates or gets an agent based on the id passed
* @param {string} agentId Id of the agent to retrieve/create