-
Notifications
You must be signed in to change notification settings - Fork 113
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Ignore empty OpenGraph props #120
base: master
Are you sure you want to change the base?
Conversation
{ | ||
"@value": "Buy tickets for an upcoming Elysian Fields concert near you. List of all Elysian Fields tickets and tour dates for 2017." | ||
} | ||
], | ||
"http://ogp.me/ns#image": [ | ||
{ | ||
"@value": "http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg" | ||
"@value": "http://images.sk-static.com/SECONDARY_IMAGE.jpg" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changing this order will create a bug in function test_rdfa_not_preserving_order
. Why do you have to change the order here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I guess the reason is that it is because you are comparing with the sorting values... could you sort the values in the test function instead? That way test_rdfa_not_preserving_order
would keep working as xfail case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm trying to sort the list in this failing function to fix it. I'm almost there.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@croqaz the function test_rdfa_not_preserving_order
should be preserved as it is. It is an xfail
function, which means it is a test that should be passing, but it is not. So don't sort on this function, sort it in test_all
.
Probably you understood well, but just in case :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Well, to fix the test_rdfa_not_preserving_order() all we need to do is:
for rdf in data['rdfa']:
for key, pairs in rdf.items():
if ':' in key and isinstance(pairs, list):
rdf[key] = sorted(pairs, key=lambda e: e["@value"])
self.assertEqual(jsonize_dict(data)['rdfa'], expected['rdfa'])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
test_rdfa_not_preserving_order()
fails randomly as it is exactly now, with my changes or not, because the order of the duplicate values is random.
So it doesn't matter if the duplicate values are sorted correctly or not in the JSON file.
The function is still a good way to replicate the bug.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. Anyway, still, the order of the values for this case in elysianfields.json
should be reversed. The first image should be http://images.sk-static.com/images/media/img/col4/20100330-103600-169450.jpg
and the second image should be http://images.sk-static.com/SECONDARY_IMAGE.jpg
as it was before, because this is the appearance order in the page.
This change will require some more code in the function test_all
so that test passes, but this is the correct way. Otherwise test_rdfa_not_preserving_order
would be incorrect, so wouldn't represent properly a case that should pass but is failing. (with current PR code test_rdfa_not_preserving_order
represents a case that is failing and would also fail even in case #116 is fixed)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. I'm working on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, this is done, I forgot to mention.
All the meta tags are converted to JSON objects and the sorting of JSON in the tests matches the actual HTML.
843b2e6
to
fdb5a4a
Compare
Codecov Report
@@ Coverage Diff @@
## master #120 +/- ##
==========================================
+ Coverage 87.63% 87.68% +0.05%
==========================================
Files 11 11
Lines 469 471 +2
Branches 101 102 +1
==========================================
+ Hits 411 413 +2
Misses 52 52
Partials 6 6
Continue to review full report at Codecov.
|
@@ -28,6 +28,7 @@ | |||
<meta property="og:type" content="songkick-concerts:artist"> | |||
<meta property="og:title" content="Elysian Fields"> | |||
<meta property="og:description" content="Buy tickets for an upcoming Elysian Fields concert near you. List of all Elysian Fields tickets and tour dates for 2017."> | |||
<meta property="og:description" content="" /> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe you could have the tests also check that whitespace characters, like spaces, are now ignored as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you! That's a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Gallaecio Do you want to take a look or I can just go ahead and merge this?
This is my first contribution to Extruct ;)
I believe I adressed all the comments. The Travis build seems to constantly fail with Python 3.4 at |
@croqaz I have noticed that the same is happening to master, see: https://travis-ci.org/scrapinghub/extruct/builds/558857852?utm_source=github_status&utm_medium=notification Digging deeper, the failure in the logs is:
I googled and this could be relevant: tox-dev/tox-travis#76. Could you have a look? |
Updated the tests
86cc2cb
to
d4645ef
Compare
Squashed all commits to have a clean history on merge. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank @croqaz for implementing that! I see everything right after the changes.
In #117 idea was to only drop empty values when deciding which value to choose, to be on a safe side. But maybe we can drop them always, as in this PR. @croqaz could you plese check that properties with empty values are always ignored, e.g. using https://developers.facebook.com/tools/debug/sharing/? |
@kmike Just to make sure I understand this correctly, because I think I might have misunderstood how to solve the problem. This is the problem: <meta property="og:description" content="" />
<blah-blah>... </blah-blah> ...
<meta property="og:description" content="My cool website" /> Duplicate OpenGraph with the same name, but one of them has an empty value. Solution:
In this PR I implemented the first option. |
Oh, Facebook's Debug tool only shows one of the OpenGraph meta tags. |
The question is: if there is only a single property with an empty value, is it returned or dropped? |
Ok, I understand now. I wonder if we can quickly craft a page like that and feed that to Facebook and see how it works? Actually I might craft a page like that relatively easy. |
For:
Facebook Sharing Debugger shows:
|
I also tested with: <meta property="og:image" content="" />
<meta property="og:image" content="/images/logo.png" /> And Facebook Sharing Debugger shows: "Provided og:image URL, was not a valid URL" So if the image is first and the empty is second, OR if the image is second and the empty is first, the Sharing Debugger still finds the correct image. |
Hey guys! Is there something which prevents us from merging that fix? |
My cent: Looking at http://ogp.me/ and https://developers.facebook.com/docs/sharing/webmasters/#markup I don't see any definition of a property that could contain an empty value and still have some meaning. It seems that properties cannot be used to label a page somehow. So I would include the removal of every single property without value from the output. What do you think the rest of reviewers? |
@ivanprado You mean in this case of: |
@croqaz well, I have some doubts regarding the fix. By one side, I would say that it should be the user using But by the other side, I have not seen anything in the protocol definition pointing that is possible to create properties with empty values, so removing all of them could be reasonable (option 2). But the question is, it is the same a document with The solution of cleaning empty values but only if there is at least one non-empty value for the duplicated property seems like a strange mix (option 3). But maybe could do the job:
But is weird as the user still will have to deal with empty value for the particular case. It is a little bit strange. So in conclusión, I have not a clear view about what we should do, the three options make some sense. The prefered one by my side would be the option 1, but I'm also ok with option 2 and 3. @croqaz |
I have the same concerns as Ivan. For OpenGraph we return a single value for an attribute, and empty values affect prioritization in case of multiple meta tags, so doing a correct prioritization (non-empty over empty) is fixing a bug. Removing all empty values is a different thing; it solves the above mention prioritization problem, but it is a larger change, with unclear outcome - it can be a right thing to do (if having an empty value is the same as not having tag at all), or it can be a data loss (if a presence of a tag with empty value is meaningful information). To decide we need to answer this question, and I'm not sure how to do it :) My suggestion was a tool by Facebook (as standard says nothing), and it seems to confirm that empty value is meaningful - if I'm reading @croqaz's results properly, it does show an empty description if a description is present, but empty - and shows nothing if description is absent. |
Ignore empty OpenGraph properties or content.
Fixes #117
Upgraded six>=1.11 to make Python 3.4 tests pass.
This was also done in #119