Add candidateFilters option #107

Open
wants to merge 1 commit into
base: master
18 changes: 18 additions & 0 deletions README.md
@@ -81,6 +81,24 @@ read(url, {
});
```

- `candidateFilters`, an array of functions used to filter candidate nodes. Each function receives a candidate node and returns `true` to keep it or `false` to discard it.

options.candidateFilters = [callback(candidateNode, index)]
```javascript
read(url, {
  candidateFilters: [
    // Filter out any article tags with a type of "video"
    function (candidateNode) {
      if (candidateNode.tagName === 'ARTICLE' && candidateNode.getAttribute('type') === 'video') {
        return false;
      }
      return true;
    }
  ]
}, function(err, article, response) {
  //...
});
```
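When several similar filters are needed, they can be generated rather than written by hand. The following is a minimal sketch, not part of the library: `rejectByAttribute` and `makeNode` are hypothetical helpers introduced here for illustration, and `makeNode` only stands in for a DOM element so the example runs without a DOM.

```javascript
// Hypothetical helper: builds a candidate filter that rejects nodes with the
// given tag name whose attribute equals the given value.
function rejectByAttribute(tagName, attrName, attrValue) {
  var upperTag = tagName.toUpperCase();
  return function (candidateNode) {
    return !(candidateNode.tagName === upperTag &&
      candidateNode.getAttribute(attrName) === attrValue);
  };
}

// Hypothetical stand-in for a DOM element (tagName plus getAttribute only).
function makeNode(tagName, attrs) {
  return {
    tagName: tagName,
    getAttribute: function (name) {
      return Object.prototype.hasOwnProperty.call(attrs, name) ? attrs[name] : null;
    }
  };
}

// Options object in the shape shown above.
var options = {
  candidateFilters: [
    rejectByAttribute('article', 'type', 'video'),
    rejectByAttribute('div', 'role', 'banner')
  ]
};

var videoArticle = makeNode('ARTICLE', { type: 'video' });
var textArticle = makeNode('ARTICLE', {});

console.log(options.candidateFilters[0](videoArticle)); // false
console.log(options.candidateFilters[0](textArticle));  // true
```

An `options` object built this way would be passed as the second argument to `read`, as in the example above.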

- `preprocess` which should be a function to check or modify downloaded source before passing it to readability.

options.preprocess = callback(source, response, contentType, callback);
13 changes: 12 additions & 1 deletion src/helpers.js
@@ -22,11 +22,16 @@ exports.debug = function(debug) {
};

var cleanRules = [];
var candidateFilters = [];

module.exports.setCleanRules = function(rules) {
cleanRules = rules;
};

module.exports.setCandidateFilters = function(filters) {
candidateFilters = filters;
};

/**
* Prepare the HTML document for readability to scrape it.
* This includes things like stripping javascript, CSS, and handling terrible markup.
@@ -65,7 +70,7 @@ var prepDocument = module.exports.prepDocument = function(document) {
}
}
}

// Strip out all <script> tags, as they *should* be useless
var scripts = document.getElementsByTagName('script');
[].forEach.call(scripts, function (node) {
@@ -182,6 +187,12 @@ var grabArticle = module.exports.grabArticle = function(document, preserveUnlike
grandParentNode.readability.contentScore += contentScore / 2;
}

if (candidateFilters.length) {
candidateFilters.forEach(function(filterBy) {
candidates = candidates.filter(filterBy);
});
}


/**
* After we've calculated scores, loop through all of the possible candidate nodes we found
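Because each filter in the loop added to `grabArticle` runs on the output of the previous one, a candidate is kept only when every filter returns a truthy value, and the `index` argument a later filter receives refers to the already-filtered array, not the original candidate list. The sketch below illustrates this with strings in place of DOM nodes; `applyCandidateFilters` is a hypothetical standalone version of that loop, not a function in the patch.

```javascript
// Hypothetical standalone version of the filtering loop, taking its inputs as
// arguments instead of reading module-level state.
function applyCandidateFilters(candidates, candidateFilters) {
  candidateFilters.forEach(function (filterBy) {
    candidates = candidates.filter(filterBy);
  });
  return candidates;
}

// Records the (value, index) pairs seen by the second filter.
var seenBySecond = [];

var filters = [
  // First filter: drop 'b'.
  function dropB(name) { return name !== 'b'; },
  // Second filter: keep everything, but record the index it was given.
  function recordIndex(name, index) {
    seenBySecond.push(name + '@' + index);
    return true;
  }
];

var kept = applyCandidateFilters(['a', 'b', 'c'], filters);

console.log(kept);         // [ 'a', 'c' ]
console.log(seenBySecond); // [ 'a@0', 'c@1' ]
```

Here `'c'` started at index 2 but is reported at index 1 to the second filter, so filters that rely on `index` should not assume it matches the original scoring order.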
1 change: 1 addition & 0 deletions src/readability.js
@@ -18,6 +18,7 @@ function Readability(window, options) {
this.bodyCache = null;
this._articleContent = '';
helpers.setCleanRules(options.cleanRulers || []);
helpers.setCandidateFilters(options.candidateFilters || []);

this.cache = {};

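The filters are stored in a single module-level variable in `src/helpers.js`, and the constructor line above passes `options.candidateFilters || []`. Each construction therefore replaces the previous list, and omitting the option resets it to empty. A minimal sketch of that pattern follows; `configure` and `getCandidateFilters` are hypothetical names used only to make the behavior observable.

```javascript
// Stand-in for the module-level state in src/helpers.js.
var candidateFilters = [];

function setCandidateFilters(filters) {
  candidateFilters = filters;
}

// Hypothetical accessor, added only so the sketch can inspect the state.
function getCandidateFilters() {
  return candidateFilters;
}

// Hypothetical stand-in for the line added to the Readability constructor.
function configure(options) {
  setCandidateFilters(options.candidateFilters || []);
}

var rejectAll = function () { return false; };

configure({ candidateFilters: [rejectAll] });
console.log(getCandidateFilters().length); // 1

// Constructing again without the option clears the earlier filters.
configure({});
console.log(getCandidateFilters().length); // 0
```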
43 changes: 40 additions & 3 deletions test/article-tests.js
@@ -16,7 +16,7 @@ describe('Regression Tests', function() {
],
notInclude: [
'Donate to Wikipedia'
- ]
+ ],
},
{
fixture: 'mediashift',
@@ -80,11 +80,48 @@ describe('Regression Tests', function() {
'最赞回应',
'最新话题',
'北京豆网科技有限公司',
- ]
+ ],
},
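{
// Baseline: with no candidateFilters, the promotional text is extracted and the article body is not.
fixture: 'ifeng',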
title: '熊玲:什么样的婚姻才是鸡肋婚姻?',
include: [
'沃尔沃“憋”不住了,最高狂降8万,性能不输BBA,白菜价愣没人',
'打开APP',
],
notInclude: [
'它是“迷恋婚姻又排拒婚姻”的一种复杂婚姻情感心理状态。它意味着即便你有千条理由走出婚姻,背后却有万种吸引力把你留在围城里。',
'在婚姻十字路口的人,你若要想你们的关系和好如初,就必须有重修婚姻的姿态,即必须有妥协的态度。',
'重修婚姻的办法很多很多,但最简单也是最核心的办法只有一个,那就是接受。',
],
},
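{
// Same fixture with a filter that rejects <article type="video"> candidates: the article body is extracted instead.
fixture: 'ifeng',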
title: '熊玲:什么样的婚姻才是鸡肋婚姻?',
include: [
'它是“迷恋婚姻又排拒婚姻”的一种复杂婚姻情感心理状态。它意味着即便你有千条理由走出婚姻,背后却有万种吸引力把你留在围城里。',
'在婚姻十字路口的人,你若要想你们的关系和好如初,就必须有重修婚姻的姿态,即必须有妥协的态度。',
'重修婚姻的办法很多很多,但最简单也是最核心的办法只有一个,那就是接受。',
],
notInclude: [
'沃尔沃“憋”不住了,最高狂降8万,性能不输BBA,白菜价愣没人',
'打开APP',
],
options: {
candidateFilters: [
function (candidateNode) {
if (candidateNode.tagName === 'ARTICLE' && candidateNode.getAttribute('type') === 'video') {
return false;
}

return true;
}
],
},
}].forEach(function(testCase) {
it('can extract ' + testCase.fixture + ' articles', function(done) {
var html = fs.readFileSync(articleFixtures + '/' + testCase.fixture + '.html').toString();
- read(html, function(error, article) {
+ read(html, testCase.options || {}, function(error, article) {
if(error) {
done(error)
} else {