Robots.txt default message is confusing #345

fe-lix- · 2023-07-05T15:16:37Z

Is your feature request related to a problem? Please describe.

Description

The current default robots.txt can confuse users. It does not explain why a production website would return it instead of the robots.txt configured in the site repository.

This is the current default robots.txt:

# Helix robots.txt FAQ
#
# Q: This looks like a default robots.txt, how can I provide my own?
# A: Put a file named robots.txt into the root of your GitHub 
# repo, Franklin will serve it from there.
#
# Q: Why am I'm seeing this robots.txt instead of the one I 
# configured?
# A: You are visiting from *.hlx.page or *.hlx.live - in order 
# to prevent these sites from showing up in search engines and 
# giving you a duplicate content penalty on your real site we 
# exclude all robots 
# 
# Q: What do you mean with "real site"?
# A: If you add a custom domain to this site (e.g. 
# example.com), then Franklin detects that you are ready for 
# production and serves your own robots.txt - but only on 
# example.com
#
# Q: This does not answer my questions at all. What can I do?
# A: head over to #franklin-chat on Slack or 
# github.com/adobe/helix-home/issues and ask your question 
# there.
User-agent: *
Disallow: /

Phrasing issue in the default robots.txt

The problem is the part defining the "real site". The message states that:

Problem 1 - `Franklin detects that you are ready for production`

This is not actually the case. Whether the default robots.txt is returned is determined by the presence of the x-forwarded-host header in the BYOCDN configuration. As a result, a client ends up trying to find out where in Helix to configure this example.com domain.

The domain is not mentioned anywhere in the Helix documentation except in the Push invalidation configuration. The BYOCDN configuration does not mention that x-forwarded-host is what defines the "real site"; it only includes a screenshot with the header configured.
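To make the rule concrete, here is a minimal sketch of the decision as this issue describes it. It is not the actual Helix implementation: the function name `select_robots_txt`, the plain mapping of request headers, and the `DEFAULT_ROBOTS` constant are assumptions made for illustration.

```python
from typing import Mapping

# Stand-in for the default file quoted above (FAQ comments omitted).
DEFAULT_ROBOTS = "User-agent: *\nDisallow: /\n"


def select_robots_txt(headers: Mapping[str, str], repo_robots: str) -> str:
    """Pick the robots.txt body for a request, following the behavior reported in this issue."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    forwarded_host = normalized.get("x-forwarded-host", "").strip()
    # Assumption: only the presence of a non-empty header matters, not its value.
    if forwarded_host:
        return repo_robots
    return DEFAULT_ROBOTS
```

Under this reading, nothing in the origin "detects" production readiness; a request that arrives without the header simply gets the default file.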

Problem 2 - `but only on example.com`

This is not accurate. Once the CDN is correctly configured, any domain hooked up to that CDN will serve the robots.txt from the repository. I believe rephrasing this passage would help users understand the issue.

For example, if you are using CloudFront, the repository robots.txt would be returned from both the "real site" domain (e.g. example.com) and your CloudFront distribution domain (randomid123.cloudfront.net).
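One way to see why both hostnames behave the same is the sketch below. It assumes the CDN attaches a fixed x-forwarded-host value to every origin request, as a statically configured origin header would. The function names and the simplified origin rule are hypothetical and only mirror the behavior reported here.

```python
def cdn_origin_headers(viewer_host: str, forwarded_host: str = "example.com") -> dict:
    """Headers the (simulated) CDN sends to the origin."""
    # viewer_host is deliberately unused: in this sketch the hostname the visitor
    # typed never reaches the origin, only the statically configured value does.
    return {"x-forwarded-host": forwarded_host}


def origin_serves_repo_robots(headers: dict) -> bool:
    """Assumed origin rule: serve the repository robots.txt when the header is present."""
    return bool(headers.get("x-forwarded-host"))


for host in ("example.com", "randomid123.cloudfront.net"):
    print(host, origin_serves_repo_robots(cdn_origin_headers(host)))
```

Both iterations print `True`, which matches the observation that the repository robots.txt is not limited to example.com.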

Behaviour in which the problem appears

The current problematic behavior is the following:

1. Create a new website.
2. Configure the BYOCDN but omit the x-forwarded-host header (by mistake, let's say).
3. See the default robots.txt.
4. After reading the message, commit a robots.txt to your repository.
5. Everything works as expected, except that the default robots.txt is still returned.
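The last step can be checked quickly with a small script like the one below. It is only a sketch: the helper names are hypothetical, and it assumes the default file still begins with the `# Helix robots.txt FAQ` line quoted in this issue.

```python
from urllib.request import urlopen

# Assumption: first line of the default robots.txt, as quoted in this issue.
DEFAULT_MARKER = "# Helix robots.txt FAQ"


def is_default_robots(body: str) -> bool:
    """Return True when the body starts with the default file's first line."""
    lines = body.splitlines()
    first_line = lines[0].strip() if lines else ""
    return first_line == DEFAULT_MARKER


def fetch_robots(base_url: str) -> str:
    """Download /robots.txt from the given site (performs a network request)."""
    with urlopen(f"{base_url.rstrip('/')}/robots.txt", timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

For example, `is_default_robots(fetch_robots("https://example.com"))` returning `True` on a production domain would suggest that the x-forwarded-host header is missing from the CDN configuration, rather than a problem with the committed robots.txt.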
