two ideas for taming robots #6093

Closed
wants to merge 5 commits into from

Conversation

labkey-matthewb
Contributor

Rationale

What do you think?

Related Pull Requests

Changes

{
String p = e.nextElement();
if (p.contains(".") && !LAST_FILTER_PARAM.equals(p))
getPageConfig().setNoIndex();
Contributor Author

maybe getPageConfig().setNoFollow() as well?

Contributor

What's the use case here? An HTML wiki that uses QueryWebPart via a <script> tag? Something that uses the server-side webpart include syntax?

Contributor Author

Exactly. This certainly isn't the only place this could be relevant (portal pages are probably more important). But letting the crawler go nuts on the pages with multiple data regions seems to exacerbate the combinatorics.

Contributor Author

Open to other ideas, just trying to keep up with the bots.

Contributor

This seems like it's worth trying. It shouldn't hurt and might help.
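For readers following the thread, here is a minimal sketch of the check under discussion, pulled out into a standalone helper. `RobotsHint`, `shouldNoIndex`, and `robotsMetaContent` are hypothetical names, not part of the PR, and the value given to `LAST_FILTER_PARAM` is an assumption because the diff does not show the constant's definition.

```java
import java.util.ArrayList;
import java.util.Enumeration;
import java.util.List;

// Hypothetical helper for illustration only; not the PR's actual code.
final class RobotsHint
{
    // Assumed value; the real constant's definition is not shown in the diff.
    static final String LAST_FILTER_PARAM = ".lastFilter";

    private RobotsHint() {}

    // Mirrors the condition in the diff: any parameter name containing "."
    // (other than the exempt one) marks the page as noindex.
    static boolean shouldNoIndex(Enumeration<String> parameterNames)
    {
        while (parameterNames.hasMoreElements())
        {
            String p = parameterNames.nextElement();
            if (p.contains(".") && !LAST_FILTER_PARAM.equals(p))
                return true;
        }
        return false;
    }

    // Builds the content value for a <meta name="robots"> tag.
    // Returns an empty string when neither directive applies.
    static String robotsMetaContent(boolean noIndex, boolean noFollow)
    {
        List<String> parts = new ArrayList<>();
        if (noIndex)
            parts.add("noindex");
        if (noFollow)
            parts.add("nofollow");
        return String.join(", ", parts);
    }
}
```

Passing `true` for both flags corresponds to the suggestion above of calling `setNoFollow()` alongside `setNoIndex()`.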

else
out.write(" href=\"#\"");
if (null != dataHref)
Contributor Author

BTW I don't actually believe this is sufficient to hide the URL, but I think it could be with only a little more effort.

Contributor

labkey-jeckels left a comment

I'm not following.

{
try
{
var context = HttpView.currentContext();
Contributor

It looks like this might help the Container Filter menu items? Looks like they're href and not JS handlers. They already have rel="nofollow" but that's not enough?

Contributor Author

From my reading, nofollow doesn't actually mean "pretend this target link or page doesn't exist". It just means "don't give this link any weight in your magic SEO algorithm, I don't vouch for it". Google is going to crawl every link it can find.

Contributor

I see.

Why the split between the href and the data-query?

Contributor Author

Yes, this is an important question. The interwebs seem to think Google is very good at finding links in attributes and JavaScript. I am not sure that this change is sufficient, but it seemed worth trying to separate the parts so that the varying part does not look like a URL.
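A rough sketch of the split described above, assuming the anchor keeps `href="#"` and carries the path and the query string in separate `data-` attributes. The class and method names, the `data-href` attribute name, and the example URL are all hypothetical. Only `href="#"`, `dataHref`, and `data-query` are mentioned in this conversation, and the PR's real rendering code is not shown in full.

```java
// Hypothetical rendering helper for illustration only; not the PR's actual code.
final class MenuLinkSketch
{
    private MenuLinkSketch() {}

    // Minimal escaping for text placed in a double-quoted attribute or element body.
    static String escapeHtml(String s)
    {
        return s.replace("&", "&amp;")
                .replace("\"", "&quot;")
                .replace("<", "&lt;")
                .replace(">", "&gt;");
    }

    // Renders an anchor whose href is "#". The target path and the varying
    // query string live in separate data attributes, so neither attribute
    // holds a complete URL on its own.
    static String renderAnchor(String url, String text)
    {
        int q = url.indexOf('?');
        String path = q < 0 ? url : url.substring(0, q);
        String query = q < 0 ? "" : url.substring(q + 1);

        StringBuilder sb = new StringBuilder("<a href=\"#\" rel=\"nofollow\"");
        sb.append(" data-href=\"").append(escapeHtml(path)).append('"');
        if (!query.isEmpty())
            sb.append(" data-query=\"").append(escapeHtml(query)).append('"');
        sb.append('>').append(escapeHtml(text)).append("</a>");
        return sb.toString();
    }
}
```

A client-side click handler would then have to rejoin the two attributes before navigating. Whether a crawler still reconstructs the URL from them is exactly the open question raised above.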

@@ -228,21 +230,9 @@ public void doFilter(ServletRequest request, ServletResponse response, FilterCha

QueryService.get().setEnvironment(QueryService.Environment.USER, user);

if (AppProps.getInstance().isOptionalFeatureEnabled("experimental-unsafe-inline"))
Contributor

Did you mean to delete this? Maybe since I don't see it being registered as an optional feature anywhere?

return false;
var count = robotLimiter.getCount();
var delay = robotLimiter.getDelay();
if (count >= 10 && delay > 0)
Contributor

This overall approach seems great to me, but I'll admit that I can't figure out what the effective limits will be. I'm confused on how 10 and 0 connect with the arguments to Rate above.

Regardless, to move this forward, we might want config to control this. Either a simple on/off switch or a way to control the allowed rate.

Contributor Author

Yes. This needs startup properties/configuration before being released into the wild. I'll add comments.
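To make the question about effective limits concrete, here is one way a `count >= 10 && delay > 0` test could relate to a configured rate. This is a sketch under stated assumptions, not the PR's limiter: `RobotLimiterSketch` is a hypothetical class, the meaning of `getCount()` and `getDelay()` in the real code is not shown in the diff, and the arguments to `Rate` are unknown. The sketch assumes the count is the number of requests seen so far and the delay is how far the caller has run ahead of the permitted schedule.

```java
// Hypothetical limiter for illustration only; the PR's limiter may behave differently.
final class RobotLimiterSketch
{
    private final long intervalNanos; // time budget per request at the allowed rate
    private long count;               // requests recorded so far
    private long nextFreeNanos;       // earliest time the next request is "on schedule"
    private long lastDelayNanos;      // how far ahead of schedule the latest request was

    // Allows `permits` requests per `perSeconds` seconds.
    RobotLimiterSketch(long permits, long perSeconds)
    {
        this.intervalNanos = perSeconds * 1_000_000_000L / permits;
    }

    // Records one request arriving at nowNanos (passed in so the sketch is deterministic).
    synchronized void record(long nowNanos)
    {
        count++;
        if (nextFreeNanos < nowNanos)
            nextFreeNanos = nowNanos;      // caller is at or behind schedule
        lastDelayNanos = nextFreeNanos - nowNanos;
        nextFreeNanos += intervalNanos;    // reserve the next slot
    }

    synchronized long getCount() { return count; }

    synchronized long getDelay() { return lastDelayNanos; }

    // Same shape as the condition in the diff: ignore the first few requests,
    // then throttle only while the caller is ahead of the allowed schedule.
    synchronized boolean shouldThrottle()
    {
        return count >= 10 && lastDelayNanos > 0;
    }
}
```

Under these assumptions the `10` is a grace allowance of initial requests and the `delay > 0` test is what actually enforces the rate, so a client that stays at or below the configured rate is never throttled no matter how many requests it makes.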
