Add candidatesFilters option
hankliu62 committed Jun 21, 2018
1 parent 3f5ee1e commit 08a65ac
Showing 5 changed files with 568 additions and 4 deletions.
17 changes: 17 additions & 0 deletions README.md
@@ -81,6 +81,23 @@ read(url, {
});
```

- `candidatesFilters` which allows you to set your own filters for candidate tags. Each filter is called for every candidate; return `false` to drop that candidate.

options.candidatesFilters = [callback(obj, index)]
```javascript
read(url, {
candidatesFilters: [
function (obj) {
if (obj.tagName === 'ARTICLE' && obj.getAttribute('type') === 'video') {
return false;
}
return true;
}
]}, function(err, article, response) {
//...
});
```

- `preprocess` which should be a function to check or modify downloaded source before passing it to readability.

options.preprocess = callback(source, response, contentType, callback);
13 changes: 12 additions & 1 deletion src/helpers.js
@@ -22,11 +22,16 @@ exports.debug = function(debug) {
};

var cleanRules = [];
var candidatesFilters = [];

module.exports.setCleanRules = function(rules) {
cleanRules = rules;
};

module.exports.setCandidatesFilters = function(filters) {
candidatesFilters = filters;
};

/**
* Prepare the HTML document for readability to scrape it.
* This includes things like stripping javascript, CSS, and handling terrible markup.
@@ -65,7 +70,7 @@ var prepDocument = module.exports.prepDocument = function(document) {
}
}
}

// Strip out all <script> tags, as they *should* be useless
var scripts = document.getElementsByTagName('script');
[].forEach.call(scripts, function (node) {
@@ -182,6 +187,12 @@ var grabArticle = module.exports.grabArticle = function(document, preserveUnlike
grandParentNode.readability.contentScore += contentScore / 2;
}

if (candidatesFilters.length) {
candidatesFilters.forEach(function(filterBy) {
candidates = candidates.filter(filterBy);
});
}


/**
* After we've calculated scores, loop through all of the possible candidate nodes we found
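The filtering loop added to `grabArticle` above can be tried in isolation. The following is only a minimal sketch, not code from this commit: `makeCandidate` and `applyCandidatesFilters` are hypothetical helpers introduced for illustration, and the candidates are plain objects standing in for DOM elements. It shows that filters run in order, that a candidate survives only if every filter returns a truthy value, and that the `index` argument refers to the list as already narrowed by earlier filters.

```javascript
// Hypothetical stand-in for a candidate DOM element (not part of the library).
function makeCandidate(tagName, attrs) {
  return {
    tagName: tagName,
    getAttribute: function (name) {
      return Object.prototype.hasOwnProperty.call(attrs, name) ? attrs[name] : null;
    }
  };
}

// Hypothetical helper mirroring the loop in grabArticle:
// each filter narrows the candidate list in turn.
function applyCandidatesFilters(candidates, filters) {
  filters.forEach(function (filterBy) {
    candidates = candidates.filter(filterBy);
  });
  return candidates;
}

var candidates = [
  makeCandidate('ARTICLE', { type: 'video' }),
  makeCandidate('DIV', {}),
  makeCandidate('ARTICLE', {})
];

var kept = applyCandidatesFilters(candidates, [
  // Drop <article type="video"> candidates, as in the README example.
  function (elem) {
    return !(elem.tagName === 'ARTICLE' && elem.getAttribute('type') === 'video');
  },
  // Keep only the first remaining candidate; index is relative to the
  // list left over after the previous filter, not the original list.
  function (elem, index) {
    return index < 1;
  }
]);

// kept now holds only the DIV candidate.
console.log(kept.map(function (elem) { return elem.tagName; }));
```

Because the second filter sees the narrowed list, filters that depend on `index` are order-sensitive; filters that only inspect the element are not.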
1 change: 1 addition & 0 deletions src/readability.js
@@ -18,6 +18,7 @@ function Readability(window, options) {
this.bodyCache = null;
this._articleContent = '';
helpers.setCleanRules(options.cleanRulers || []);
helpers.setCandidatesFilters(options.candidatesFilters || []);

this.cache = {};

43 changes: 40 additions & 3 deletions test/article-tests.js
@@ -16,7 +16,7 @@ describe('Regression Tests', function() {
],
notInclude: [
'Donate to Wikipedia'
-]
+],
},
{
fixture: 'mediashift',
@@ -80,11 +80,48 @@ describe('Regression Tests', function() {
'最赞回应',
'最新话题',
'北京豆网科技有限公司',
-]
+],
},
{
fixture: 'ifeng',
title: '熊玲:什么样的婚姻才是鸡肋婚姻?',
include: [
'沃尔沃“憋”不住了,最高狂降8万,性能不输BBA,白菜价愣没人',
'打开APP',
],
notInclude: [
'它是“迷恋婚姻又排拒婚姻”的一种复杂婚姻情感心理状态。它意味着即便你有千条理由走出婚姻,背后却有万种吸引力把你留在围城里。',
'在婚姻十字路口的人,你若要想你们的关系和好如初,就必须有重修婚姻的姿态,即必须有妥协的态度。',
'重修婚姻的办法很多很多,但最简单也是最核心的办法只有一个,那就是接受。',
],
},
{
fixture: 'ifeng',
title: '熊玲:什么样的婚姻才是鸡肋婚姻?',
include: [
'它是“迷恋婚姻又排拒婚姻”的一种复杂婚姻情感心理状态。它意味着即便你有千条理由走出婚姻,背后却有万种吸引力把你留在围城里。',
'在婚姻十字路口的人,你若要想你们的关系和好如初,就必须有重修婚姻的姿态,即必须有妥协的态度。',
'重修婚姻的办法很多很多,但最简单也是最核心的办法只有一个,那就是接受。',
],
notInclude: [
'沃尔沃“憋”不住了,最高狂降8万,性能不输BBA,白菜价愣没人',
'打开APP',
],
options: {
candidatesFilters: [
function (elem) {
if (elem.tagName === 'ARTICLE' && elem.getAttribute('type') === 'video') {
return false;
}

return true;
}
],
},
}].forEach(function(testCase) {
it('can extract ' + testCase.fixture + ' articles', function(done) {
var html = fs.readFileSync(articleFixtures + '/' + testCase.fixture + '.html').toString();
-read(html, function(error, article) {
+read(html, testCase.options || {}, function(error, article) {
if(error) {
done(error)
} else {