two ideas for taming robots #6093
{
    String p = e.nextElement();
    if (p.contains(".") && !LAST_FILTER_PARAM.equals(p))
        getPageConfig().setNoIndex();
maybe getPageConfig().setNoFollow() as well?
What's the use case here? An HTML wiki that uses QueryWebPart via a <script> tag? Something that uses the server-side webpart include syntax?
Exactly. This certainly isn't the only place this could be relevant (portal pages are probably more important). But letting the crawler go nuts on the pages with multiple data regions seems to exacerbate the combinatorics.
Open to other ideas, just trying to keep up with the bots.
This seems like it's worth trying. It shouldn't hurt and might help.
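For reference, the check in the diff can be read as "mark the page noindex if any request parameter looks like a data region parameter". Below is a minimal standalone sketch of that reading. The class name, the method name, and the value given to LAST_FILTER_PARAM are assumptions for illustration only; the real constant and the surrounding page code are not shown in this diff.

```java
// Sketch only: names and the constant's value are illustrative, not the PR's code.
final class NoIndexCheck
{
    // Assumed placeholder value; the real LAST_FILTER_PARAM is not shown in the diff.
    static final String LAST_FILTER_PARAM = ".lastFilter";

    // Returns true if any parameter name contains a dot (which is how the diff
    // recognizes data region parameters) and is not the last-filter parameter.
    static boolean shouldNoIndex(String... parameterNames)
    {
        for (String p : parameterNames)
        {
            if (p.contains(".") && !LAST_FILTER_PARAM.equals(p))
                return true;
        }
        return false;
    }
}
```

Under these assumptions, a request carrying `query.sort` would be marked noindex, while one carrying only `pageId` or only the last-filter parameter would not.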
else
    out.write(" href=\"#\"");
if (null != dataHref)
BTW I don't actually believe this is sufficient to hide the URL, but I think it could be with a little more effort.
I'm not following
{
    try
    {
        var context = HttpView.currentContext();
It looks like this might help the Container Filter menu items? Looks like they're href and not JS handlers. They already have rel="nofollow" but that's not enough?
From my reading, nofollow doesn't actually mean "pretend this target link or page doesn't exist". It just means "don't give this link any weight in your magic SEO algorithm, I don't vouch for it". Google is going to crawl every link it can find.
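To keep the two mechanisms apart: rel="nofollow" is a hint on a single link, while noindex and nofollow in a robots meta tag are page-level directives. Neither one stops a crawler from requesting a URL it has already discovered. A rough sketch of how the page-level tag could be assembled from two flags follows; the class and method names are made up for illustration and are not the PageConfig API.

```java
// Sketch only: RobotsMeta is a hypothetical helper, not part of the PR.
final class RobotsMeta
{
    // Builds the content attribute for <meta name="robots" content="...">.
    // Returns null when neither directive applies, meaning no tag is needed.
    static String content(boolean noIndex, boolean noFollow)
    {
        if (noIndex && noFollow)
            return "noindex, nofollow";
        if (noIndex)
            return "noindex";
        if (noFollow)
            return "nofollow";
        return null;
    }

    // Renders the full tag, or an empty string when there is nothing to say.
    static String tag(boolean noIndex, boolean noFollow)
    {
        String content = content(noIndex, noFollow);
        return content == null ? "" : "<meta name=\"robots\" content=\"" + content + "\">";
    }
}
```

Note that a crawler has to fetch the page to see this tag at all, so it reduces what gets indexed and followed rather than what gets requested.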
I see.
Why the split between the href and the data-query?
Yes, this is an important question. The interwebs seem to think Google is very good at finding links in attributes and JavaScript. I am not sure that this change is sufficient, but it seemed worth trying to separate the parts so that the varying part does not look like a URL.
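A sketch of the idea, as I understand it: the anchor gets a dummy href, the stable path goes in one data attribute, and the varying query string goes in another, so no single attribute holds a complete URL. The attribute names data-href and data-query, the class name, and the example path are assumptions for illustration; a client-side handler (not shown) would have to rejoin the two parts on click.

```java
// Sketch only: FilterLink and its attribute names are illustrative assumptions.
final class FilterLink
{
    // Minimal HTML attribute escaping, enough for this sketch.
    static String escape(String s)
    {
        return s.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Renders an anchor whose real target is split across two data attributes.
    static String render(String basePath, String query, String label)
    {
        StringBuilder out = new StringBuilder("<a href=\"#\"");
        if (null != basePath)
            out.append(" data-href=\"").append(escape(basePath)).append('"');
        if (null != query)
            out.append(" data-query=\"").append(escape(query)).append('"');
        out.append(" rel=\"nofollow\">").append(escape(label)).append("</a>");
        return out.toString();
    }
}
```

Whether a crawler that executes JavaScript would still reconstruct the URL is exactly the open question above; this only removes the complete URL from the static markup.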
@@ -228,21 +230,9 @@ public void doFilter(ServletRequest request, ServletResponse response, FilterCha

QueryService.get().setEnvironment(QueryService.Environment.USER, user);

if (AppProps.getInstance().isOptionalFeatureEnabled("experimental-unsafe-inline"))
Did you mean to delete this? Maybe since I don't see it being registered as an optional feature anywhere?
return false;
var count = robotLimiter.getCount();
var delay = robotLimiter.getDelay();
if (count >= 10 && delay > 0)
This overall approach seems great to me, but I'll admit that I can't figure out what the effective limits will be. I'm confused on how 10 and 0 connect with the arguments to Rate above.
Regardless, to move this forward, we might want config to control this. Either a simple on/off switch or a way to control the allowed rate.
Yes. This needs startup properties/configuration before being released into the wild. I'll add comments.
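One possible reading of the `count >= 10 && delay > 0` check, sketched as a standalone class: the first few requests are always let through (the 10), and after that a request is throttled only if the limiter says it is ahead of the allowed rate (delay greater than 0). This is a guess at the intent, not the PR's limiter. RobotThrottle, its constructor arguments, and the fixed-spacing arithmetic are all assumptions, and the on/off switch and rate are plain parameters here where real code would read startup properties.

```java
// Sketch only: a hypothetical limiter, not the robotLimiter/Rate classes in the PR.
final class RobotThrottle
{
    private final boolean enabled;       // simple on/off switch
    private final int minCount;          // requests allowed before any throttling
    private final long millisPerRequest; // allowed rate, as spacing between requests
    private long windowStart = -1;       // time of the first recorded request
    private int count = 0;               // requests recorded so far

    RobotThrottle(boolean enabled, int minCount, int permits, long perMillis)
    {
        this.enabled = enabled;
        this.minCount = minCount;
        this.millisPerRequest = perMillis / permits;
    }

    // Records one request at the given time and returns how many milliseconds
    // it should be delayed; 0 means let it through immediately.
    synchronized long record(long nowMillis)
    {
        if (windowStart < 0)
            windowStart = nowMillis;
        count++;
        // Earliest time the count-th request fits within the allowed rate.
        long earliestAllowed = windowStart + (long) count * millisPerRequest;
        long delay = Math.max(0, earliestAllowed - nowMillis);
        return (enabled && count >= minCount && delay > 0) ? delay : 0;
    }
}
```

With `new RobotThrottle(true, 10, 1, 1000L)`, a burst of ten requests at the same instant passes the first nine and asks the tenth to wait; a client that spaces requests two seconds apart is never delayed.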
Rationale
what do you think?
Related Pull Requests
Changes