
feat: meta-tags audit #382

Open · wants to merge 32 commits into main
Conversation

@dipratap dipratap commented Aug 27, 2024

This PR adds a meta-tags audit that checks the top pages' tags, specifically the title, description, and H1 tags. The tags are scraped by the content-scraper and stored in S3; the audit worker fetches them from S3 and runs checks to detect whether any of the three tags is missing or has a suboptimal length.
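
For a rough sense of what "missing or length-wise suboptimal" means per page, here is a minimal sketch of such checks. The threshold values and result shape are illustrative assumptions only; the real checks presumably live in the PR's seo-checks.js.

// Illustrative only - limits and issue format are assumed, not taken from this PR.
const TITLE_LENGTH = { min: 30, max: 60 };        // assumed bounds
const DESCRIPTION_LENGTH = { min: 70, max: 160 }; // assumed bounds

function checkPageTags(pageUrl, { title, description, h1 = [] } = {}) {
  const issues = [];
  if (!title) {
    issues.push({ tag: 'title', issue: 'missing' });
  } else if (title.length < TITLE_LENGTH.min || title.length > TITLE_LENGTH.max) {
    issues.push({ tag: 'title', issue: 'length' });
  }
  if (!description) {
    issues.push({ tag: 'description', issue: 'missing' });
  } else if (description.length < DESCRIPTION_LENGTH.min
    || description.length > DESCRIPTION_LENGTH.max) {
    issues.push({ tag: 'description', issue: 'length' });
  }
  if (!h1.length) {
    issues.push({ tag: 'h1', issue: 'missing' });
  }
  return { pageUrl, issues };
}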

JIRA: https://jira.corp.adobe.com/browse/SITES-23343

Related Docs

https://wiki.corp.adobe.com/display/AEMSites/SEO+%7C+Title%2C+Description+and+H1+Tags+Optimisation

Related Issues

adobe/spacecat-api-service#470
https://github.com/adobe/spacecat-content-scraper/pull/137

Thanks for contributing!

@dipratap dipratap mentioned this pull request Aug 28, 2024

This PR will trigger a minor release when merged.

@dipratap dipratap requested a review from a team September 4, 2024 20:34
@dipratap dipratap changed the title feat: seo tags audit feat: meta-tags audit Sep 4, 2024
import { getObjectFromKey, getObjectKeysUsingPrefix } from '../utils/s3-utils.js';
import SeoChecks from './seo-checks.js';

async function fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log) {


Can you please add a functional comment above this definition?

Author

Sure
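
For illustration, such a functional comment could look like this. This is a sketch only; the return shape is inferred from the code later in this thread, and the derivation of the page URL from the key and prefix is an assumption.

/**
 * Fetches a scraped page object from S3 and extracts its SEO tags.
 * Returns an object of the shape { [pageUrl]: { title, description, h1 } },
 * or a falsy value when the object contains no usable tags.
 *
 * @param {object} s3Client - S3 client used to fetch the object.
 * @param {string} bucketName - Name of the S3 bucket holding the scrape results.
 * @param {string} key - Key of the scraped page object.
 * @param {string} prefix - Key prefix, assumed to be used to derive the page URL.
 * @param {object} log - Logger instance.
 * @returns {Promise<object|null>} Extracted tags keyed by page URL, or null.
 */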


async function fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log) {
  const object = await getObjectFromKey(s3Client, bucketName, key, log);
  if (!object?.scrapeResult?.tags || typeof object.scrapeResult.tags !== 'object') {


The ?. optional chaining is great for preventing errors, but after checking object?.scrapeResult?.tags, the additional typeof object.scrapeResult.tags !== 'object' check could be unnecessary:

if (!object?.scrapeResult?.tags) 

Author

Since the tags object is coming from an external source, validating that it is actually an object safeguards against type errors.

Member
@solaris007 solaris007 Sep 9, 2024

also, we have isObject or isNonEmptyObject in spacecat-shared-utils
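
With that helper, the check could then be written roughly as follows. This is a sketch assuming the package is imported as @adobe/spacecat-shared-utils and that the early-return body matches the existing one, which is truncated above.

import { isNonEmptyObject } from '@adobe/spacecat-shared-utils';

if (!isNonEmptyObject(object?.scrapeResult?.tags)) {
  // assumed early return; the original if-body is not shown in this excerpt
  log.error(`No valid tags found in S3 object ${key}`);
  return null;
}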

[pageUrl]: {
  title: object.scrapeResult.tags.title,
  description: object.scrapeResult.tags.description,
  h1: object.scrapeResult.tags.h1 || [],


Use optional chaining to avoid potential errors if h1 doesn’t exist in tags.

h1: object.scrapeResult.tags?.h1 || [],

Author

Since we have already checked in line 22 that the tags object exists, we can safely use object.scrapeResult.tags here.

Comment on lines 46 to 52
if (!site) {
  return notFound('Site not found');
}
if (!site.isLive()) {
  log.info(`Site ${siteId} is not live`);
  return ok();
}


Can we combine this into a single return?

if (!site || !site.isLive()) {
  log.info(`Site ${siteId} not found or not live`);
  return notFound('Site not found or not live');
}

Author

In this case, we won't be able to identify which exact error the audit encountered - whether the site is not found or not live. We need this info in the response so that the user knows exactly what went wrong.

Member

more explicit guard statements are preferable

Comment on lines +63 to +69
for (const key of scrapedObjectKeys) {
  // eslint-disable-next-line no-await-in-loop
  const pageMetadata = await fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log);
  if (pageMetadata) {
    Object.assign(extractedTags, pageMetadata);
  }
}


Using await inside a loop can lead to performance issues as it blocks the iteration. Consider refactoring to Promise.all() for parallel execution, improving efficiency.

const pageMetadataPromises = scrapedObjectKeys.map(key => fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log));
const pageMetadataResults = await Promise.all(pageMetadataPromises);
pageMetadataResults.forEach(pageMetadata => {
  if (pageMetadata) {
    Object.assign(extractedTags, pageMetadata);
  }
});

Author

The reason I'm doing sequential invocations instead of parallel is that, with parallel execution, all S3 objects would be fetched into memory simultaneously, which could exceed the allocated memory. With sequential calls, the Node.js garbage collector has more opportunities to clean up memory after each fetchAndProcessPageObject invocation finishes, reducing the risk of high memory usage. The audits complete in less than 8 seconds, so the execution time remains within acceptable limits.

Member

maybe Promise.all is worth a test, as I doubt we'll quickly exceed the allocated 4 GB for the top 200 pages - that budget allows roughly 20 MB of DOM per page... an additional consideration would be rate limiting towards the S3 API, though.
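
A batched variant is one way to test that trade-off: it bounds how many S3 objects are in memory at once while still parallelising within each batch. This is a sketch only, not code from this PR, and the batch size is an arbitrary assumption that would need tuning.

// Assumes this runs inside the same async function as the existing loop.
const BATCH_SIZE = 20; // assumed value
for (let i = 0; i < scrapedObjectKeys.length; i += BATCH_SIZE) {
  const batch = scrapedObjectKeys.slice(i, i + BATCH_SIZE);
  // eslint-disable-next-line no-await-in-loop
  const results = await Promise.all(
    batch.map((key) => fetchAndProcessPageObject(s3Client, bucketName, key, prefix, log)),
  );
  results.filter(Boolean).forEach((pageMetadata) => {
    Object.assign(extractedTags, pageMetadata);
  });
}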

src/metatags/handler.js (resolved comment)
Comment on lines +16 to +26
try {
  const params = {
    Bucket: bucketName,
    Prefix: prefix,
    MaxKeys: 1000,
  };
  const data = await s3Client.send(new ListObjectsV2Command(params));
  data?.Contents?.forEach((obj) => {
    objectKeys.push(obj.Key);
  });
  log.info(`Fetched ${objectKeys.length} keys from S3 for bucket ${bucketName} and prefix ${prefix}`);


If more than 1,000 objects are expected, you should handle pagination by repeatedly calling ListObjectsV2 with a ContinuationToken:


let continuationToken = null;
do {
  const params = {
    Bucket: bucketName,
    Prefix: prefix,
    MaxKeys: 1000,
    ContinuationToken: continuationToken,
  };
  const data = await s3Client.send(new ListObjectsV2Command(params));
  data?.Contents?.forEach((obj) => {
    objectKeys.push(obj.Key);
  });
  continuationToken = data.IsTruncated ? data.NextContinuationToken : null;
} while (continuationToken);

Author

The ListObjectsV2 command doesn't return the object contents, just object metadata such as the key, so memory-wise it is not much. The ListObjectsV2 API has a limit of 1,000 objects per request, and since we are scraping top pages only, which are expected to be < 200, we should be good here in my opinion. Let me know your thoughts.
