Add support for alternate links with hreflang #55

Merged
merged 6 commits into from
Jan 20, 2025
7 changes: 7 additions & 0 deletions docs/changelog.rst
@@ -1,6 +1,13 @@
Changelog
=========

Upcoming
--------

**New Features**

* Added support for :ref:`alternate localised pages <sitemap-extra-localisation>` with ``hreflang``.

v1.0.0 (2025-01-13)
-------------------

23 changes: 23 additions & 0 deletions docs/reference/formats.rst
@@ -132,6 +132,8 @@ The Google News extension provides additional information to describe the news s

If the page contains Google News data, it is stored as a :class:`~usp.objects.page.SitemapNewsStory` object in :attr:`SitemapPage.news_story <usp.objects.page.SitemapPage.news_story>`.

.. _google-image-ext:

Google Image
""""""""""""

@@ -150,6 +152,27 @@ If the page contains Google Image data, it is stored as a list of :class:`~usp.o

Additional Features
^^^^^^^^^^^^^^^^^^^

Beyond the Sitemap specification, USP also supports some non-standard features used by large sitemap consumers (e.g. Google).

.. _sitemap-extra-localisation:

Alternate Localised Pages
"""""""""""""""""""""""""

- `Google documentation <https://developers.google.com/search/docs/specialty/international/localized-versions#sitemap>`__

.. dropdown:: Example
    :class-container: flush

    .. literalinclude:: formats_examples/hreflang.xml
        :emphasize-lines: 3,7-10,15-18
        :language: xml

Alternate localised pages specified with ``<xhtml:link>`` elements are stored as a list of ``(language code, URL)`` tuples in :attr:`SitemapPage.alternates <usp.objects.page.SitemapPage.alternates>`, which is ``None`` if the page has no valid alternates. Elements missing ``rel="alternate"``, ``hreflang`` or ``href`` are skipped with a logged warning. Language codes are not validated.

.. _xml date:

Date Time Parsing
^^^^^^^^^^^^^^^^^

20 changes: 20 additions & 0 deletions docs/reference/formats_examples/hreflang.xml
@@ -0,0 +1,20 @@
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <url>
        <loc>https://example.org/en/page</loc>
        <lastmod>2024-01-01</lastmod>
        <xhtml:link
            rel="alternate"
            hreflang="fr-FR"
            href="https://example.org/fr/page"/>
    </url>
    <url>
        <loc>https://example.org/fr/page</loc>
        <lastmod>2024-01-02</lastmod>
        <xhtml:link
            rel="alternate"
            hreflang="en-GB"
            href="https://example.org/en/page"/>
    </url>
</urlset>
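
To illustrate what the example file encodes, here is a minimal standard-library sketch that pulls the (language code, URL) pairs out of each ``<url>`` entry. It is not how USP parses sitemaps; `alternates_by_page`, `NS`, and `SITEMAP_XML` are names assumed for this illustration, and the embedded XML is a condensed copy of the example.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List, Tuple

# Namespace URIs for the sitemap protocol and for xhtml:link elements.
NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "xhtml": "http://www.w3.org/1999/xhtml",
}

# Condensed copy of the hreflang example (assumption: <lastmod> omitted for brevity).
SITEMAP_XML = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:xhtml="http://www.w3.org/1999/xhtml">
    <url>
        <loc>https://example.org/en/page</loc>
        <xhtml:link rel="alternate" hreflang="fr-FR" href="https://example.org/fr/page"/>
    </url>
    <url>
        <loc>https://example.org/fr/page</loc>
        <xhtml:link rel="alternate" hreflang="en-GB" href="https://example.org/en/page"/>
    </url>
</urlset>
"""


def alternates_by_page(xml_text: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each <loc> URL to its (hreflang, href) pairs (hypothetical helper)."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    result: Dict[str, List[Tuple[str, str]]] = {}
    for url in root.findall("sm:url", NS):
        loc = url.findtext("sm:loc", default="", namespaces=NS).strip()
        # Keep only links that carry rel="alternate" plus both required attributes.
        result[loc] = [
            (link.attrib["hreflang"], link.attrib["href"])
            for link in url.findall("xhtml:link", NS)
            if link.get("rel") == "alternate"
            and "hreflang" in link.attrib
            and "href" in link.attrib
        ]
    return result


if __name__ == "__main__":
    for page_url, alternates in alternates_by_page(SITEMAP_XML).items():
        print(page_url, alternates)
```

Unlike the `SitemapPage.alternates` property added in this change, this sketch returns an empty list rather than `None` for a page without alternates.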
118 changes: 118 additions & 0 deletions tests/tree/test_xml_exts.py
@@ -105,3 +105,121 @@ def test_xml_image(self, requests_mock):
        print(tree)

        assert tree == expected_sitemap_tree


class TestXMLHrefLang(TreeTestBase):
    def test_hreflang(self, requests_mock):
        requests_mock.add_matcher(TreeTestBase.fallback_to_404_not_found_matcher)

        requests_mock.get(
            self.TEST_BASE_URL + "/robots.txt",
            headers={"Content-Type": "text/plain"},
            text=textwrap.dedent(
                f"""
                User-agent: *
                Disallow: /whatever

                Sitemap: {self.TEST_BASE_URL}/sitemap.xml
                """
            ).strip(),
        )

        requests_mock.get(
            self.TEST_BASE_URL + "/sitemap.xml",
            headers={"Content-Type": "text/xml"},
            text=textwrap.dedent(
                f"""
                <?xml version="1.0" encoding="UTF-8"?>
                <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
                    <url>
                        <loc>{self.TEST_BASE_URL}/en/page</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link rel="alternate" hreflang="fr-FR" href="{self.TEST_BASE_URL}/fr/page"/>
                    </url>
                    <url>
                        <loc>{self.TEST_BASE_URL}/fr/page</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link rel="alternate" hreflang="en-GB" href="{self.TEST_BASE_URL}/en/page"/>
                    </url>
                </urlset>
                """
            ).strip(),
        )

        tree = sitemap_tree_for_homepage(self.TEST_BASE_URL)

        pages = list(tree.all_pages())
        assert pages[0].alternates == [
            ("fr-FR", f"{self.TEST_BASE_URL}/fr/page"),
        ]
        assert pages[1].alternates == [
            ("en-GB", f"{self.TEST_BASE_URL}/en/page"),
        ]

    def test_missing_attrs(self, requests_mock):
        requests_mock.add_matcher(TreeTestBase.fallback_to_404_not_found_matcher)

        requests_mock.get(
            self.TEST_BASE_URL + "/robots.txt",
            headers={"Content-Type": "text/plain"},
            text=textwrap.dedent(
                f"""
                User-agent: *
                Disallow: /whatever

                Sitemap: {self.TEST_BASE_URL}/sitemap.xml
                """
            ).strip(),
        )

        requests_mock.get(
            self.TEST_BASE_URL + "/sitemap.xml",
            headers={"Content-Type": "text/xml"},
            text=textwrap.dedent(
                f"""
                <?xml version="1.0" encoding="UTF-8"?>
                <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" xmlns:xhtml="http://www.w3.org/1999/xhtml">
                    <url>
                        <loc>{self.TEST_BASE_URL}/en/page</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link rel="alternate" href="{self.TEST_BASE_URL}/fr/page"/>
                    </url>
                    <url>
                        <loc>{self.TEST_BASE_URL}/en/page2</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link hreflang="fr-FR" href="{self.TEST_BASE_URL}/fr/page2"/>
                    </url>
                    <url>
                        <loc>{self.TEST_BASE_URL}/fr/page</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link rel="alternate" hreflang="en-GB"/>
                    </url>
                    <url>
                        <loc>{self.TEST_BASE_URL}/fr/page2</loc>
                        <lastmod>{self.TEST_DATE_STR_ISO8601}</lastmod>
                        <changefreq>monthly</changefreq>
                        <priority>0.8</priority>
                        <xhtml:link hreflang="en-GB" href="{self.TEST_BASE_URL}/en/page2"/>
                    </url>
                </urlset>
                """
            ).strip(),
        )

        tree = sitemap_tree_for_homepage(self.TEST_BASE_URL)

        pages = list(tree.all_pages())
        # Every <link> above lacks a required attribute, so no alternates are stored.
        assert pages[0].alternates is None  # missing hreflang
        assert pages[1].alternates is None  # missing rel
        assert pages[2].alternates is None  # missing href
        assert pages[3].alternates is None  # missing rel
20 changes: 20 additions & 0 deletions usp/fetch_parse.py
@@ -643,6 +643,7 @@ class Page:
            "news_keywords",
            "news_stock_tickers",
            "images",
            "alternates",
        ]

        def __init__(self):
@@ -659,6 +660,7 @@ def __init__(self):
            self.news_keywords = None
            self.news_stock_tickers = None
            self.images = []
            self.alternates = []

        def __hash__(self):
            return hash(
@@ -763,13 +765,18 @@ def page(self) -> Optional[SitemapPage]:
                    for image in self.images
                ]

            # Expose None rather than an empty list when no alternates were found.
            alternates = None
            if len(self.alternates) > 0:
                alternates = self.alternates

            return SitemapPage(
                url=url,
                last_modified=last_modified,
                change_frequency=change_frequency,
                priority=priority,
                news_story=sitemap_news_story,
                images=sitemap_images,
                alternates=alternates,
            )

    __slots__ = ["_current_page", "_pages", "_page_urls", "_current_image"]
@@ -801,6 +808,19 @@ def xml_element_start(self, name: str, attrs: Dict[str, str]) -> None:
                    "Page is expected to be set before <image:image>."
                )
            self._current_image = self.Image()
        elif name == "link":
            if not self._current_page:
                raise SitemapXMLParsingException(
                    "Page is expected to be set before <link>."
                )
            # Only rel="alternate" links with both hreflang and href are recorded.
            if "rel" not in attrs or attrs["rel"] != "alternate":
                log.warning(
                    f'<link> element is missing rel="alternate" attribute: {attrs}.'
                )
            elif "hreflang" not in attrs or "href" not in attrs:
                log.warning(
                    f"<link> element is missing hreflang or href attributes: {attrs}."
                )
            else:
                self._current_page.alternates.append((attrs["hreflang"], attrs["href"]))

    def __require_last_char_data_to_be_set(self, name: str) -> None:
        if not self._last_char_data:
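
The attribute checks in the `link` branch above can be read as a small pure function. The following is a sketch under that reading, not code from this change: `parse_alternate` is a hypothetical helper name, and the logger setup is an assumption made so the block runs on its own.

```python
import logging
from typing import Dict, Optional, Tuple

log = logging.getLogger(__name__)


def parse_alternate(attrs: Dict[str, str]) -> Optional[Tuple[str, str]]:
    """Return (hreflang, href) for a valid alternate link, else None (hypothetical helper)."""
    # A link counts only if rel is present and equal to "alternate".
    if attrs.get("rel") != "alternate":
        log.warning('<link> element is missing rel="alternate" attribute: %s.', attrs)
        return None
    # Both the language code and the target URL are required.
    if "hreflang" not in attrs or "href" not in attrs:
        log.warning("<link> element is missing hreflang or href attributes: %s.", attrs)
        return None
    # The language code is passed through without validation.
    return attrs["hreflang"], attrs["href"]


if __name__ == "__main__":
    print(parse_alternate({"rel": "alternate", "hreflang": "fr-FR", "href": "https://example.org/fr/page"}))
    print(parse_alternate({"hreflang": "fr-FR", "href": "https://example.org/fr/page"}))
```

Keeping the check free of parser state makes the three outcomes (wrong or missing `rel`, missing `hreflang` or `href`, valid link) easy to test in isolation, which mirrors the cases covered by `test_missing_attrs`.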
32 changes: 29 additions & 3 deletions usp/objects/page.py
@@ -3,7 +3,7 @@
import datetime
from decimal import Decimal
from enum import Enum, unique
-from typing import List, Optional
from typing import List, Optional, Tuple

SITEMAP_PAGE_DEFAULT_PRIORITY = Decimal("0.5")
"""Default sitemap page priority, as per the spec."""
@@ -331,6 +331,7 @@ class SitemapPage:
        "__change_frequency",
        "__news_story",
        "__images",
        "__alternates",
    ]

    def __init__(
@@ -341,6 +342,7 @@ def __init__(
        change_frequency: Optional[SitemapPageChangeFrequency] = None,
        news_story: Optional[SitemapNewsStory] = None,
        images: Optional[List[SitemapImage]] = None,
        alternates: Optional[List[Tuple[str, str]]] = None,
    ):
        """
        Initialize a new sitemap-derived page.
@@ -357,6 +359,7 @@ def __init__(
        self.__change_frequency = change_frequency
        self.__news_story = news_story
        self.__images = images
        self.__alternates = alternates

    def __eq__(self, other) -> bool:
        if not isinstance(other, SitemapPage):
@@ -380,6 +383,9 @@ def __eq__(self, other) -> bool:
        if self.images != other.images:
            return False

        if self.alternates != other.alternates:
            return False

        return True

    def __hash__(self):
@@ -442,10 +448,30 @@ def change_frequency(self) -> Optional[SitemapPageChangeFrequency]:

    @property
    def news_story(self) -> Optional[SitemapNewsStory]:
-        """Get the Google News story attached to the URL."""
        """Get the Google News story attached to the URL.

        See :ref:`google-news-ext` reference
        """
        return self.__news_story

    @property
    def images(self) -> Optional[List[SitemapImage]]:
-        """Get the images attached to the URL."""
        """Get the images attached to the URL.

        See :ref:`google-image-ext` reference
        """
        return self.__images

    @property
    def alternates(self) -> Optional[List[Tuple[str, str]]]:
        """Get the alternate URLs for the URL.

        A list with one (language code, URL) tuple for each ``<xhtml:link>`` element that has a ``rel="alternate"`` attribute, or ``None`` if the page has no such elements.

        See :ref:`sitemap-extra-localisation` reference

        Example::

            [('fr', 'https://www.example.com/fr/page'), ('de', 'https://www.example.com/de/page')]
        """
        return self.__alternates
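
Because `alternates` is either `None` or a list of `(language code, URL)` tuples, callers need to handle both shapes. The sketch below shows one possible lookup over such a value; `find_alternate` is a hypothetical helper that is not part of USP, and the fallback to the primary language subtag is an assumption made for illustration only.

```python
from typing import List, Optional, Tuple


def find_alternate(
    alternates: Optional[List[Tuple[str, str]]], language: str
) -> Optional[str]:
    """Return the URL for a language code, or None if nothing matches (hypothetical helper)."""
    wanted = language.lower()
    # Treat None (no alternates) the same as an empty list.
    pairs = [(code.lower(), url) for code, url in alternates or []]

    # First pass: exact, case-insensitive match on the full code.
    for code, url in pairs:
        if code == wanted:
            return url

    # Second pass: match on the primary subtag only, e.g. "fr" for "fr-FR".
    primary = wanted.split("-")[0]
    for code, url in pairs:
        if code.split("-")[0] == primary:
            return url

    return None


if __name__ == "__main__":
    example = [
        ("fr", "https://www.example.com/fr/page"),
        ("de", "https://www.example.com/de/page"),
    ]
    print(find_alternate(example, "de-AT"))
    print(find_alternate(None, "fr"))
```

Since language codes are stored unvalidated, any normalisation such as the lower-casing here is left to the caller.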