Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor sync script #175

Merged
merged 21 commits into from
Nov 13, 2024
Merged

Refactor sync script #175

merged 21 commits into from
Nov 13, 2024

Conversation

dgorodnichy
Copy link
Collaborator

@dgorodnichy dgorodnichy commented Oct 19, 2024

Closes #172

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced modules for article synchronization, cleanup, and downloading images from the Dev.to API.
    • Added a new script for manual synchronization with the Dev.to platform.
    • Added a structured interface for interacting with the Dev.to API.
    • New JSON files added for testing article synchronization.
  • Improvements

    • Enhanced workflow for GitHub Actions to support manual execution and improved synchronization logic.
    • Improved image downloading functionality with better error handling.
    • Added a retry mechanism for operations that may fail.
  • Bug Fixes

    • Improved error handling and retry mechanisms for downloading articles and images.
  • Removals

    • Deprecated an older script for processing articles from Dev.to, consolidating functionality into new modules.

Copy link
Contributor

coderabbitai bot commented Oct 19, 2024

Important

Review skipped

More than 25% of the files skipped due to max files limit. The review is being skipped to prevent a low-quality review.

141 files out of 225 files are above the max files limit of 75. Please upgrade to Pro plan to get higher limits.

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Walkthrough

The pull request introduces several modules and classes for managing article synchronization with the Dev.to platform. Key additions include the ArticleCleaner, ArticleSyncChecker, ArticleUpdater, and DevToAdapter modules, which encapsulate functionality for cleaning up articles, synchronizing with an API, and downloading new articles. A new entry script sync_with_devto is provided to initiate synchronization, replacing an older script. Additionally, a new ImagesDownloader class handles image downloads, while the workflow configuration is updated to support manual execution with a force option.

Changes

File Change Summary
bin/sync/article_cleaner.rb Added module ArticleCleaner, constants SYNC_STATUS_FILE, ARTICLE_FILE, and methods cleanup_renamed_articles, load_slugs_from_yaml.
bin/sync/article_sync_checker.rb Added module ArticleSyncChecker, constants USERNAME, SYNC_STATUS_FILE, USELESS_WORDS, and methods for synchronizing articles.
bin/sync/article_updater.rb Added module ArticleUpdater, constants for API and status, and methods for downloading new articles and updating status.
bin/sync/dev_to_adapter.rb Added class DevToAdapter with methods for interacting with the Dev.to API.
bin/sync/images_downloader.rb Added class ImagesDownloader for downloading images from markdown files.
bin/sync/sync.rb Added class Sync that orchestrates article synchronization, including a perform method.
bin/sync_with_devto New script to initiate synchronization with Dev.to, replacing bin/from_devto.
.github/workflows/sync-and-publish.yml Updated workflow to include manual trigger and modified synchronization command to use sync_with_devto.
bin/from_devto Removed class SaveArticleToMarkdown and related methods for fetching and saving articles.
bin/upload_assets_to_github Removed classes ImageDownloader and BlogProcessor and their associated methods.

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant Sync
    participant ArticleSyncChecker
    participant ArticleUpdater
    participant ArticleCleaner
    participant DevToAdapter
    participant ImagesDownloader

    User->>Sync: perform(force)
    Sync->>ArticleSyncChecker: update_sync_status()
    Sync->>ArticleUpdater: download_new_articles(force)
    Sync->>ArticleCleaner: cleanup_renamed_articles()
    ArticleUpdater->>DevToAdapter: fetch_articles()
    ArticleUpdater->>ImagesDownloader: download_images()
Loading

Assessment against linked issues

Objective Addressed Explanation
Synchronization in one iteration (#172)

Possibly related PRs

  • Refactor workflows #148: The changes in the main PR involve the management and cleanup of article directories, while PR Refactor workflows #148 focuses on refactoring workflows related to article synchronization and deployment. The connection lies in the shared context of managing articles, although they address different aspects of the workflow.

Suggested reviewers

  • pftg

Poem

🐰 In the land of code where articles roam,
A cleaner was born to tidy the home.
With sync and download, we gather with glee,
Images and articles, all set free!
So hop along, dear friends, let's sync with delight,
For the blog shines brighter, all ready for flight! 🌟


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 14

🧹 Outside diff range and nitpick comments (10)
bin/sync_with_devto (3)

1-5: LGTM! Consider using absolute paths for requires.

The shebang is correct, allowing the script to be executed directly. However, using relative paths for requires might cause issues if the script is moved or executed from a different directory.

Consider using absolute paths for requires to make the script more robust:

require_relative File.expand_path('../sync/article_sync_checker', __FILE__)
require_relative File.expand_path('../sync/article_updater', __FILE__)
require_relative File.expand_path('../sync/article_cleaner', __FILE__)

7-9: LGTM! Consider adding error handling and logging.

The synchronization process is implemented concisely and follows the Single Responsibility Principle. The sequential execution aligns with the PR objective of performing synchronization in a single iteration.

Consider adding error handling and logging to improve debuggability and provide feedback on the synchronization process:

def run_step(step_name, step_class)
  puts "Starting #{step_name}..."
  step_class.new.call
  puts "#{step_name} completed successfully."
rescue StandardError => e
  puts "Error in #{step_name}: #{e.message}"
  exit 1
end

run_step("Article Sync Check", ArticleSyncChecker)
run_step("Article Update", ArticleUpdater)
run_step("Article Cleanup", ArticleCleaner)

puts "Synchronization completed successfully."

1-9: Overall, good implementation. Consider additional improvements for production readiness.

The script successfully achieves the PR objective of performing synchronization in a single iteration. It's concise and follows good design principles. To make it more production-ready, consider the following additional improvements:

  1. Add command-line options for flexibility (e.g., dry run mode, verbose logging).
  2. Implement proper exit codes for success/failure scenarios.
  3. Add a configuration file for customizable settings.
  4. Implement retry logic for network-related operations.
  5. Add progress indicators for long-running operations.

These enhancements would make the script more robust, flexible, and user-friendly.

bin/sync/article_cleaner.rb (4)

4-7: LGTM: Well-defined class and constants.

The ArticleCleaner class is appropriately named, and the use of constants for file and directory paths is a good practice. The freeze method is correctly used to ensure immutability of the string constants.

Consider moving these constants to a separate configuration file or environment variables for easier management and potential reuse across different classes.


19-22: LGTM: Efficient YAML parsing and slug extraction.

The load_slugs_from_yaml method effectively loads the YAML file and extracts the slugs. Its private status is appropriate for an internal helper method.

Consider adding error handling for cases where the YAML file might be missing or malformed. For example:

def load_slugs_from_yaml
  yaml_data = YAML.load_file(YAML_FILE)
  yaml_data.values.map { |article| article[:slug] }
rescue Errno::ENOENT, Psych::SyntaxError => e
  puts "Error loading YAML file: #{e.message}"
  []
end

This will return an empty array if the file is missing or invalid, preventing potential runtime errors.


24-34: LGTM: Effective cleanup of unused folders.

The clean_up_unused_folders method efficiently iterates through the blog directories and removes those that are not present in the slugs list. The use of FileUtils.rm_rf for deletion and the logging of deleted folders are good practices.

Consider the following improvements:

  1. Add a dry-run option to preview what would be deleted without actually removing files.
  2. Implement logging using a proper logging framework instead of puts for better control and flexibility.
  3. Consider adding a safety check to prevent accidental deletion of all folders if @Slugs is empty.

Here's a sample implementation incorporating these suggestions:

def clean_up_unused_folders(dry_run: false)
  if @slugs.empty?
    logger.warn("No slugs found. Aborting to prevent accidental deletion of all folders.")
    return
  end

  Dir.glob("#{BLOG_DIR}/*").each do |folder_path|
    next unless File.directory?(folder_path) && File.exist?("#{folder_path}/#{ARTICLE_FILE}")

    folder_name = File.basename(folder_path)
    unless @slugs.include?(folder_name)
      if dry_run
        logger.info("Would delete folder: #{folder_name}")
      else
        FileUtils.rm_rf(folder_path)
        logger.info("Deleted folder: #{folder_name}")
      end
    end
  end
end

This refactored version adds a dry-run option, uses a logger (which you'd need to set up), and includes a safety check for empty @Slugs.


1-35: Overall: Well-structured and effective ArticleCleaner class.

The ArticleCleaner class is well-designed, adhering to the Single Responsibility Principle and providing a clear, focused implementation for cleaning up unused blog folders. The code is clean, easy to understand, and uses appropriate Ruby conventions.

Consider the following general improvements to enhance the robustness and flexibility of the class:

  1. Implement dependency injection for the file paths (BLOG_DIR, YAML_FILE, ARTICLE_FILE) to make the class more flexible and easier to test.
  2. Add comprehensive error handling and logging throughout the class to make it more robust and easier to debug.
  3. Consider breaking down the clean_up_unused_folders method into smaller, more focused methods to improve readability and maintainability.
  4. Add unit tests to ensure the correct functionality of the class, especially for edge cases like empty directories or missing files.

Example of dependency injection:

class ArticleCleaner
  def initialize(blog_dir:, yaml_file:, article_file:)
    @blog_dir = blog_dir
    @yaml_file = yaml_file
    @article_file = article_file
    @slugs = load_slugs_from_yaml
  end

  # ... rest of the class ...
end

This approach would make the class more flexible and easier to use in different contexts or test environments.

bin/sync/images_processor.rb (2)

Line range hint 9-112: Extract magic numbers into constants

Magic numbers like max_attempts = 5 are hard-coded throughout the class. Extracting these into constants at the top enhances readability and maintainability.

Define constants at the beginning of the file:

+MAX_ATTEMPTS = 5

 class ImagesProcessor

And update the usages:

 def download_image(url, dest)
-    max_attempts = 5
+    max_attempts = MAX_ATTEMPTS

Line range hint 9-112: Replace puts statements with a logger

Using puts for logging can clutter the console output and isn't suitable for production environments. Utilize Ruby's Logger class for better log management.

At the top of the file, require the logger and initialize it:

 require 'fileutils'
 require 'net/http'
 require 'uri'
+require 'logger'

 IMG_REGEX = %r{!\[(?<alt>(?:[^\[\]]|\[(?:[^\[\]]|\[[^\[\]]*\])*\])*)\]\((?<url>https?:\/\/[^\s\)]+)\)}
 BLOG_DIR = 'content/blog'
 REPO_URL = 'https://raw.githubusercontent.com/jetthoughts/jetthoughts.github.io/master'
+LOGGER = Logger.new($stdout)

Update puts statements:

-puts "Failed to download #{url}: #{response.code} #{response.message}"
+LOGGER.warn("Failed to download #{url}: #{response.code} #{response.message}")

Repeat for other puts instances.

bin/sync/article_updater.rb (1)

120-120: Remove unnecessary directory existence check before mkdir_p.

FileUtils.mkdir_p already handles existing directories gracefully. The check unless Dir.exist?(dir_path) is redundant and can be removed.

Simplify the code:

-FileUtils.mkdir_p(dir_path) unless Dir.exist?(dir_path)
+FileUtils.mkdir_p(dir_path)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 30849f3 and a7b0264.

📒 Files selected for processing (6)
  • bin/sync/article_cleaner.rb (1 hunks)
  • bin/sync/article_sync_checker.rb (1 hunks)
  • bin/sync/article_updater.rb (1 hunks)
  • bin/sync/images_processor.rb (3 hunks)
  • bin/sync_with_devto (1 hunks)
  • sync_status.yml (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • sync_status.yml
🧰 Additional context used
🪛 rubocop
bin/sync/article_sync_checker.rb

[warning] 42-42: Literal interpolation detected.

(Lint/LiteralInInterpolation)

bin/sync/article_updater.rb

[convention] 54-87: Assignment Branch Condition size for update_canonical_url_on_dev_to is too high. [<10, 20, 6> 23.15/23]

(Metrics/AbcSize)

🔇 Additional comments (3)
bin/sync/article_cleaner.rb (3)

1-2: LGTM: Necessary imports are present.

The required libraries 'yaml' and 'fileutils' are correctly imported for handling YAML files and file system operations respectively.


9-11: LGTM: Clear initialization.

The initialize method is concise and loads the slugs at object creation, which is efficient if the slugs are used multiple times during the object's lifecycle.


13-15: LGTM: Clear and conventional primary action method.

The call method follows the Ruby convention for a primary action method and has a clear, single responsibility of initiating the cleanup process.

bin/sync/article_sync_checker.rb Outdated Show resolved Hide resolved
bin/sync/article_sync_checker.rb Outdated Show resolved Hide resolved
bin/sync/article_sync_checker.rb Outdated Show resolved Hide resolved
bin/sync/article_sync_checker.rb Outdated Show resolved Hide resolved
bin/sync/images_processor.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 9

🧹 Outside diff range and nitpick comments (5)
bin/sync/article_sync_checker.rb (1)

1-74: Summary of ArticleSyncChecker review

The ArticleSyncChecker class is well-structured and effectively implements the synchronization logic for articles from the Dev.to API. It aligns well with the PR objective of refactoring the sync script and ensuring synchronization occurs in a single iteration.

Key strengths:

  1. Good use of dependency injection in the initializer.
  2. Clear separation of concerns across methods.
  3. Effective management of sync status for articles.

Areas for improvement (addressed in previous comments):

  1. More flexible constant definitions.
  2. Robust handling of API requests and caching prevention.
  3. Better handling of potential nil values and edge cases in methods like calculate_hash and slug.

By implementing the suggested improvements, this class will be more robust, maintainable, and aligned with the goal of efficient synchronization as outlined in issue #172.

Consider adding logging throughout the class to aid in debugging and monitoring the synchronization process. This could help track the progress of synchronization and identify any issues that may arise during the process.

🧰 Tools
🪛 rubocop

[warning] 42-42: Literal interpolation detected.

(Lint/LiteralInInterpolation)

bin/sync/images_processor.rb (3)

Line range hint 22-31: Enhance error handling for file operations

The process_blog method correctly handles the flow of processing a blog post. However, it lacks error handling for potential I/O operations. Consider wrapping the file operations in a begin-rescue block to handle exceptions gracefully.

Here's a suggested improvement:

def process_blog(file_path)
  begin
    file_name = File.basename(File.dirname(file_path))
    content = File.read(file_path)

    content = process_cover_image(file_name, content, file_path)
    content = process_images(file_name, content)

    File.write(file_path, content)
  rescue Errno::ENOENT => e
    puts "Error: File not found - #{e.message}"
  rescue IOError => e
    puts "Error: IO operation failed - #{e.message}"
  end
end

This change will make the method more robust by handling potential file-related errors.


Line range hint 33-50: Unresolved issue: Handle failed cover image downloads

The method correctly processes the cover image, but it doesn't address a previously raised concern about handling failed downloads. When a download fails, the original cover_image URL remains unchanged in the content, which might be invalid or point to an external resource.

Please implement the suggestion from the previous review to remove the cover_image attribute when the download fails. Here's the suggested change:

      if download_image(cover_image_url, cover_path)
        updated_content = content.sub(cover_image_match[:url], "#{@repo_url}/#{cover_path}")
        File.write(file_path, updated_content)
        return updated_content
      else
        FileUtils.rm_f(cover_path)
+       updated_content = content.sub(/cover_image: ".*"\n/, '')
+       File.write(file_path, updated_content)
+       return updated_content
      end

This change ensures that the cover_image attribute is removed from the content if the download fails, preventing broken links.


Line range hint 1-111: Summary: Improvements needed in error handling and previous review implementations

The ImagesProcessor class is well-structured and effectively combines the functionalities of image downloading and blog processing. However, there are several areas that require attention:

  1. Error handling: Consider adding more robust error handling, especially for file operations in methods like call and process_blog.
  2. Unresolved issues: Several suggestions from previous reviews have not been fully implemented, particularly in process_cover_image, process_images, and download_image methods. These should be addressed to improve the robustness and reliability of the code.
  3. Edge case management: Pay attention to edge cases, such as file name collisions in process_images and consistent retry logic in download_image.

Addressing these points will significantly enhance the overall quality and reliability of the ImagesProcessor class.

sync_status.yml (1)

1-817: LGTM! Consider adding a file header comment.

The structure and consistency of the sync_status.yml file look good. Each entry follows a consistent format with a unique numeric key and three attributes (:hash, :slug, and :synced). This structure appears suitable for tracking the synchronization status of multiple articles.

Consider adding a comment at the beginning of the file to explain its purpose and structure. This will help other developers understand the file's role in the system. For example:

# sync_status.yml
# This file tracks the synchronization status of articles.
# Each entry represents an article with the following structure:
# <unique_id>:
#   :hash: SHA256 hash of the article content
#   :slug: URL-friendly identifier for the article
#   :synced: Boolean indicating whether the article has been synchronized

Consider implementing version control for this file to track changes over time. This could be helpful for auditing and troubleshooting synchronization issues.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between a7b0264 and 4d8401e.

📒 Files selected for processing (6)
  • bin/sync/article_cleaner.rb (1 hunks)
  • bin/sync/article_sync_checker.rb (1 hunks)
  • bin/sync/article_updater.rb (1 hunks)
  • bin/sync/images_processor.rb (3 hunks)
  • bin/sync_with_devto (1 hunks)
  • sync_status.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • bin/sync/article_cleaner.rb
  • bin/sync_with_devto
🧰 Additional context used
🪛 rubocop
bin/sync/article_sync_checker.rb

[warning] 42-42: Literal interpolation detected.

(Lint/LiteralInInterpolation)

bin/sync/article_updater.rb

[convention] 54-87: Assignment Branch Condition size for update_canonical_url_on_dev_to is too high. [<10, 20, 6> 23.15/23]

(Metrics/AbcSize)

🔇 Additional comments (10)
bin/sync/article_sync_checker.rb (6)

9-15: Well-structured initialize method with good dependency injection.

The initialize method is well-designed:

  • It uses dependency injection for http_client, file_manager, and yaml_parser, which enhances testability and flexibility.
  • It ensures the sync status file exists and loads the existing data.

This approach allows for easy mocking in tests and potential customization of dependencies. Good job!


17-20: Concise and well-structured call method.

The call method effectively orchestrates the main functionality of the class:

  1. It fetches articles and updates the sync status.
  2. It saves the updated sync status.

This method adheres to the Single Responsibility Principle by delegating specific tasks to other methods. Its simplicity makes the class's primary function clear and easy to understand.


61-73: Well-implemented update_sync_status method

The update_sync_status method effectively manages the synchronization status of articles:

  1. It iterates through each article, calculating its hash.
  2. For new articles, it initializes the sync status.
  3. For existing articles, it compares the new hash with the stored hash and updates the sync status if there are changes.

This implementation ensures that only articles with changes will be marked for synchronization, which is efficient and prevents unnecessary updates.


41-45: ⚠️ Potential issue

Improve API request handling and caching prevention

The current implementation of fetch_articles has a couple of issues:

  1. The use of &per_page=#{999..9999} is not a reliable way to prevent caching and may cause issues with the API.
  2. The HTTP request is not properly handled for the Net::HTTP client.

Let's address these issues:

  1. Replace the caching prevention hack with a proper cache buster:
uri = URI("#{API_URL}&cache_buster=#{Time.now.to_i}")
  1. Properly handle the HTTP request:
def fetch_articles
  uri = URI("#{API_URL}&cache_buster=#{Time.now.to_i}")
  response = @http_client.get_response(uri)
  JSON.parse(response.body)
end

These changes will improve the reliability of the API requests and prevent caching issues more effectively.

🧰 Tools
🪛 rubocop

[warning] 42-42: Literal interpolation detected.

(Lint/LiteralInInterpolation)


47-49: ⚠️ Potential issue

Handle potential nil values in hash calculation

The current implementation of calculate_hash may raise a TypeError if article["edited_at"] is nil. Additionally, we could make the hash calculation more robust by including more article data.

Here's an improved version that addresses these issues:

def calculate_hash(article)
  hash_data = [
    article["id"],
    article["created_at"],
    article["edited_at"],
    article["title"],
    article["description"]
  ].compact.join('|')
  Digest::SHA256.hexdigest(hash_data)
end

This implementation:

  1. Handles potential nil values by using compact.
  2. Includes more article data for a more comprehensive hash.
  3. Uses a separator (|) to prevent unintended hash collisions.

52-59: ⚠️ Potential issue

Improve slug generation to handle edge cases

The current slug method has a few potential issues:

  1. It doesn't handle nil values for article['tags'].
  2. It assumes there are at least two tags, which may not always be true.
  3. The logic could be simplified for better readability and maintainability.

Here's an improved version that addresses these issues:

def slug(article)
  slug_parts = article['slug'].to_s.split('-')[0..-2]
  tags = article['tags'].to_s.split(', ').first(2)
  
  (slug_parts + tags)
    .flatten
    .uniq
    .reject { |segment| USELESS_WORDS.include?(segment.downcase) }
    .compact
    .join('-')
end

This implementation:

  1. Safely handles potential nil values for both slug and tags.
  2. Uses first(2) to get up to two tags without raising an error if there are fewer.
  3. Simplifies the logic by combining slug parts and tags before processing.
  4. Uses downcase when checking against USELESS_WORDS for case-insensitive comparison.

These changes make the method more robust and less prone to errors in edge cases.

bin/sync/images_processor.rb (2)

10-14: LGTM: Well-structured initialization method

The initialize method is well-implemented, providing flexibility with default values for blog_dir and repo_url while allowing custom values to be passed. The use of instance variables is appropriate for maintaining the object's state.


72-83: LGTM: Robust URL decoding implementation

The remove_cdn method is well-implemented. It uses a while loop with a maximum number of attempts to decode the URL, which effectively handles cases where the URL might be encoded multiple times. The use of a maximum number of attempts prevents potential infinite loops. This approach is robust and should handle various CDN-encoded URLs effectively.

sync_status.yml (2)

1-817: Overall, the sync_status.yml file is well-structured and fit for purpose.

The sync_status.yml file effectively serves its purpose of tracking synchronization status for articles. Its structure is consistent and straightforward, making it easy to read and maintain in its current state.

Key points:

  1. The file structure is appropriate for the current number of entries.
  2. All entries are consistently formatted with the required attributes.
  3. The use of SHA256 hashes provides a reliable way to track content changes.

While the current implementation is solid, consider the suggested improvements for better scalability and maintainability as the number of articles grows. Additionally, ensure that the synchronization process correctly updates the :synced status and that the hash values don't expose any sensitive information.


1-817: Ensure hash values don't expose sensitive information.

The file contains SHA256 hash values for each article. While these hashes likely represent the content of public articles, it's important to verify that they don't inadvertently expose any sensitive information.

Please confirm that:

  1. The hash values only represent public content and don't include any private or sensitive information.
  2. The hashing process doesn't incorporate any secret keys or sensitive metadata.

To help verify this, you can run the following script to check the consistency of the hash length and format:

#!/bin/bash
# Description: Verify consistency of hash values in sync_status.yml

echo "Checking hash consistency..."
grep ":hash:" sync_status.yml | cut -d' ' -f4 | while read -r hash; do
  if ! [[ $hash =~ ^[a-f0-9]{64}$ ]]; then
    echo "Inconsistent hash found: $hash"
    exit 1
  fi
done

if [ $? -eq 0 ]; then
  echo "All hash values are consistent with SHA256 format"
else
  echo "Inconsistent hash values found"
fi

This script checks if all hash values are 64 characters long and only contain hexadecimal digits, which is consistent with SHA256 hashes. Any deviations might indicate potential issues with the hashing process or the presence of sensitive information.

bin/sync/article_sync_checker.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
@dgorodnichy dgorodnichy force-pushed the refactor-sync-script branch 2 times, most recently from 0f0e8a6 to 19c9185 Compare October 21, 2024 14:06
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (1)
sync_status.yml (1)

1-817: Inconsistent quoting of hash values.

Some hash values are enclosed in single quotes, while others are not. For consistency and to avoid potential parsing issues, it's recommended to use a uniform approach for all hash values.

Consider removing the single quotes from hash values where they are present, for example:

-  :hash: '0923ca2669c7ee41af195404e968d148afbc4e3d0d2ae7cce5e7a74a86180d16'
+  :hash: 0923ca2669c7ee41af195404e968d148afbc4e3d0d2ae7cce5e7a74a86180d16
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 4d8401e and 19c9185.

📒 Files selected for processing (6)
  • bin/sync/article_cleaner.rb (1 hunks)
  • bin/sync/article_sync_checker.rb (1 hunks)
  • bin/sync/article_updater.rb (1 hunks)
  • bin/sync/images_processor.rb (3 hunks)
  • bin/sync_with_devto (1 hunks)
  • sync_status.yml (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • bin/sync/article_cleaner.rb
  • bin/sync/article_sync_checker.rb
  • bin/sync_with_devto
🧰 Additional context used
🪛 rubocop
bin/sync/article_updater.rb

[convention] 54-87: Assignment Branch Condition size for update_canonical_url_on_dev_to is too high. [<10, 20, 6> 23.15/23]

(Metrics/AbcSize)

🔇 Additional comments (7)
bin/sync/images_processor.rb (4)

9-13: LGTM: Class definition and initialization

The ImagesProcessor class is well-defined with a clear initialization method. The use of default values for blog_dir and repo_url provides flexibility while maintaining sensible defaults.


Line range hint 22-31: LGTM: Blog processing logic

The process_blog method follows a clear and logical sequence of operations. It correctly handles reading the file content, processing the cover image and other images, and writing the updated content back to the file.


Line range hint 33-50: LGTM: Cover image processing with improved error handling

The process_cover_image method now correctly handles failed cover image downloads as suggested in the previous review. If the download fails, the cover_image attribute is removed from the content, preventing broken links. This improvement enhances the robustness of the cover image processing.


72-83: LGTM: Robust URL decoding

The remove_cdn method effectively handles URLs that may be encoded multiple times. The use of a loop with a maximum number of attempts ensures that the decoding process doesn't get stuck in an infinite loop. This implementation is robust and should handle various URL encoding scenarios well.

sync_status.yml (3)

1-817: Well-structured and consistent YAML file.

The file maintains a consistent structure for each entry, with proper indentation and clear organization. This enhances readability and maintainability.


1-817: Verify accuracy of synchronization statuses.

All 136 entries in the file have :synced: true. While this could mean that all articles have been successfully synchronized, it's worth verifying that this accurately reflects the current state of the system.

Please confirm that:

  1. All these articles have indeed been successfully synchronized.
  2. The synchronization process correctly updates the :synced status to false when synchronization fails for any article.

To help verify this, you can run the following script to check for any entries with :synced: false:

#!/bin/bash
# Description: Check for any entries with :synced: false in sync_status.yml

grep -n ":synced: false" sync_status.yml

if [ $? -eq 1 ]; then
  echo "All entries have :synced: true"
else
  echo "Found entries with :synced: false"
fi

83-84: Verify hash format for flagged entries.

In a previous review, some hash values were flagged as potentially invalid. Let's verify if these have been corrected.

Also applies to: 123-124, 443-444, 671-672

✅ Verification successful

All flagged hash entries have valid hash formats.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the format of previously flagged hash values

# Function to check if a string is a valid 64-character hexadecimal hash
is_valid_hash() {
    [[ $1 =~ ^[a-f0-9]{64}$ ]]
}

# Array of previously flagged hashes
declare -a hashes=(
    "'0923ca2669c7ee41af195404e968d148afbc4e3d0d2ae7cce5e7a74a86180d16'"
    "'059e12737c6f8933224c9960a50ef9f91124ff4d5f12bb99aa107e1d7537229c'"
    "'089c69ac85e5c7c009151aa563bcbec7533badb4ccc96652035d6be0680a39ce'"
    "'082b3f44a1b938f68c02b51dfa9578fbbed9174863f537509ac7ef2fcbdaba22'"
    "'02180b6bf9def0d1809317b5e02ca67e392229419f7924d0f21497ec33038db0'"
)

# Check each hash
for hash in "${hashes[@]}"; do
    # Remove single quotes if present
    clean_hash=$(echo "$hash" | tr -d "'")
    if is_valid_hash "$clean_hash"; then
        echo "Valid hash: $hash"
    else
        echo "Invalid hash: $hash"
    fi
done

Length of output: 288


Script:

#!/bin/bash
# Description: Verify the format of previously flagged hash values

# Function to check if a string is a valid 64-character hexadecimal hash
is_valid_hash() {
    [[ $1 =~ ^[a-f0-9]{64}$ ]]
}

# Array of previously flagged hashes
declare -a hashes=(
    "0923ca2669c7ee41af195404e968d148afbc4e3d0d2ae7cce5e7a74a86180d16"
    "059e12737c6f8933224c9960a50ef9f91124ff4d5f12bb99aa107e1d7537229c"
    "089c69ac85e5c7c009151aa563bcbec7533badb4ccc96652035d6be0680a39ce"
    "082b3f44a1b938f68c02b51dfa9578fbbed9174863f537509ac7ef2fcbdaba22"
    "02180b6bf9def0d1809317b5e02ca67e392229419f7924d0f21497ec33038db0"
)

# Check each hash
for hash in "${hashes[@]}"; do
    if is_valid_hash "$hash"; then
        echo "Valid hash: $hash"
    else
        echo "Invalid hash: $hash"
    fi
done

Length of output: 2185


Script:

#!/bin/bash
# Description: Verify the format of previously flagged hash values

# Function to check if a string is a valid 64-character hexadecimal hash
is_valid_hash() {
    [[ $1 =~ ^[a-f0-9]{64}$ ]]
}

# Array of previously flagged hashes
hashes=(
    "0923ca2669c7ee41af195404e968d148afbc4e3d0d2ae7cce5e7a74a86180d16"
    "059e12737c6f8933224c9960a50ef9f91124ff4d5f12bb99aa107e1d7537229c"
    "089c69ac85e5c7c009151aa563bcbec7533badb4ccc96652035d6be0680a39ce"
    "082b3f44a1b938f68c02b51dfa9578fbbed9174863f537509ac7ef2fcbdaba22"
    "02180b6bf9def0d1809317b5e02ca67e392229419f7924d0f21497ec33038db0"
)

# Verify each hash
for hash in "${hashes[@]}"; do
    if is_valid_hash "$hash"; then
        echo "Valid hash: $hash"
    else
        echo "Invalid hash: $hash"
    fi
done

Length of output: 2165

bin/sync/images_processor.rb Outdated Show resolved Hide resolved
bin/sync/images_processor.rb Outdated Show resolved Hide resolved
bin/sync/images_processor.rb Outdated Show resolved Hide resolved
sync_status.yml Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
bin/sync/images_processor.rb (1)

Line range hint 22-31: Consider adding error handling for file operations

The process_blog method performs several file operations without error handling. Consider wrapping these operations in begin-rescue blocks to handle potential exceptions, such as file not found or permission issues. For example:

def process_blog(file_path)
  begin
    file_name = File.basename(File.dirname(file_path))
    content = File.read(file_path)

    content = process_cover_image(file_name, content, file_path)
    content = process_images(file_name, content)

    File.write(file_path, content)
  rescue Errno::ENOENT
    puts "Error: File not found - #{file_path}"
  rescue Errno::EACCES
    puts "Error: Permission denied - #{file_path}"
  rescue StandardError => e
    puts "Error processing blog: #{e.message}"
  end
end

This change will improve the robustness of the method by gracefully handling potential file operation errors.

sync_status.yml (1)

1-817: LGTM! Well-structured and consistent YAML file.

The sync_status.yml file is well-organized and maintains a consistent structure throughout its 136 entries. Each entry correctly includes the required :hash, :slug, and :synced attributes.

Consider structuring the file for better scalability as the number of articles grows. Some options to consider:

  1. Group entries by date or category:

        :hash: e0bc897eb52c216a372d256185bed8406048657e268105bb69c8300afc2beb37
        :slug: develop-minimalistic-process-for-development-teams-innovate-startup-productivity
        :synced: true
      # ... other entries for this month
      # ... entries for September 2023
  2. Use a more structured format, possibly separating concerns:

    articles:
      - id: 2040573
        hash: e0bc897eb52c216a372d256185bed8406048657e268105bb69c8300afc2beb37
        slug: develop-minimalistic-process-for-development-teams-innovate-startup-productivity
    
    sync_status:
      # ... other sync statuses

These structures could make the file easier to manage and allow for more efficient updates and queries, especially as the number of articles grows.

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 19c9185 and 9c308a3.

📒 Files selected for processing (7)
  • bin/from_devto (0 hunks)
  • bin/sync/article_cleaner.rb (1 hunks)
  • bin/sync/article_sync_checker.rb (1 hunks)
  • bin/sync/article_updater.rb (1 hunks)
  • bin/sync/images_processor.rb (3 hunks)
  • bin/sync_with_devto (1 hunks)
  • sync_status.yml (1 hunks)
💤 Files with no reviewable changes (1)
  • bin/from_devto
🚧 Files skipped from review as they are similar to previous changes (2)
  • bin/sync/article_cleaner.rb
  • bin/sync_with_devto
🧰 Additional context used
🪛 rubocop
bin/sync/article_updater.rb

[convention] 56-89: Assignment Branch Condition size for update_canonical_url_on_dev_to is too high. [<10, 20, 6> 23.15/23]

(Metrics/AbcSize)

🔇 Additional comments (11)
bin/sync/article_sync_checker.rb (5)

9-15: LGTM: Well-structured initialize method with dependency injection.

The initialize method is well-implemented, using dependency injection for http_client, file_manager, and yaml_parser. This approach enhances testability and flexibility. The method also follows the Single Responsibility Principle by delegating tasks to separate methods.


17-20: LGTM: Concise and well-structured call method.

The call method effectively orchestrates the main functionality of the class. It follows the Single Responsibility Principle by delegating specific tasks to other methods (fetch_articles, update_sync_status, save_sync_status).


24-39: LGTM: Well-implemented file management methods.

The file management methods (ensure_sync_status_file_exists, load_sync_status, and save_sync_status) are well-implemented. They handle file operations safely and use the injected dependencies (file_manager and yaml_parser), which enhances testability.


47-62: LGTM: Well-implemented calculate_hash and slug methods.

Both calculate_hash and slug methods are well-implemented:

  1. calculate_hash safely handles potential nil values in timestamps.
  2. slug efficiently processes the article's slug and tags, filtering out useless words.

These implementations address concerns raised in previous reviews and provide robust solutions for their respective tasks.


64-76: LGTM: Efficient and correct implementation of update_sync_status.

The update_sync_status method is well-implemented:

  1. It efficiently iterates through articles, updating their sync status.
  2. It correctly identifies changes in articles by comparing newly calculated hashes with stored ones.
  3. The method properly marks articles as unsynced when changes are detected.

This implementation aligns well with the overall purpose of the ArticleSyncChecker class.

bin/sync/images_processor.rb (6)

10-14: LGTM: Well-structured initialization method

The initialize method is well-implemented, allowing for flexible instantiation with default values for blog_dir and repo_url. This approach enhances the reusability of the class.


Line range hint 33-50: Handle failed cover image downloads gracefully

As pointed out in the previous review, the method still doesn't handle the case where the cover image download fails. When the download fails, the original cover_image URL remains unchanged, which might be invalid or point to an external resource. Consider implementing the suggestion from the previous review:

         else
             FileUtils.rm_f(cover_path)
+            updated_content = content.sub(/cover_image: ".*"\n/, '')
+            File.write(file_path, updated_content)
+            return updated_content
         end

This modification removes the cover_image attribute from the content if the download fails, preventing broken links and improving the robustness of the blog post.


72-83: LGTM: Robust URL decoding implementation

The remove_cdn method is well-implemented with a loop for decoding the URL multiple times if necessary. This approach enhances the robustness of the URL processing, ensuring that even deeply encoded URLs are properly handled.


16-18: ⚠️ Potential issue

Add error handling for file existence

As suggested in the previous review, the call method should include error handling for cases where the file might not exist. Consider implementing the following change:

def call
  file_path = "#{@blog_dir}/#{@slug}/index.md"
  if File.exist?(file_path)
    process_blog(file_path)
  else
    puts "Error: File not found - #{file_path}"
    # or raise an appropriate exception
  end
end

This modification will make the method more robust by handling cases where the file might not exist, improving error reporting and debugging.


52-70: ⚠️ Potential issue

Prevent potential filename collisions

As highlighted in the previous review, the process_images method still doesn't fully address the concern about potential filename collisions. To ensure unique filenames regardless of download success, move the index increment outside the conditional block:

    content.gsub(IMG_REGEX) do |match|
      alt_text = $~[:alt]
      img_url = remove_cdn($~[:url])
      ext = File.extname(URI(img_url).path)
      new_file = "file_#{index}#{ext}"
      new_path = "#{@blog_dir}/#{file_name}/#{new_file}"

      FileUtils.mkdir_p(File.dirname(new_path))
      if download_image(img_url, new_path)
-       index += 1
        "![#{alt_text}](#{new_file})"
      else
        match
      end
+     index += 1
    end

This change ensures that each image has a unique filename regardless of the download outcome, preventing potential filename collisions.


85-110: 🛠️ Refactor suggestion

Improve retry logic for non-exception failures

While the download_image method has been improved, it still doesn't fully address the concern about consistent retry behavior for non-exception failures. Consider implementing the following changes to ensure consistent retry logic:

def download_image(url, dest)
  max_attempts = 5
  attempts = 0

  while attempts < max_attempts
    attempts += 1
    begin
      uri = URI(url)
      Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == 'https', open_timeout: 20, read_timeout: 60) do |http|
        request = Net::HTTP::Get.new(uri)
        response = http.request(request)
        if response.is_a?(Net::HTTPSuccess)
          File.open(dest, 'wb') { |io| io.write(response.body) }
          return true
        else
          puts "Attempt #{attempts} of #{max_attempts} failed: #{response.code} #{response.message}"
        end
      end
    rescue StandardError => e
      puts "Attempt #{attempts} of #{max_attempts} failed with message: #{e.message}"
    end
  end

  false
end

This revision ensures that attempts is incremented with each attempt, and retries occur whether an exception is raised or not. It also simplifies the file writing process and provides more consistent error reporting.

bin/sync/article_sync_checker.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.github/workflows/sync-and-publish.yml (1)

36-38: Approve script update with a minor suggestion.

The update from 'bin/from_devto' to 'bin/sync_with_devto' aligns with the PR objectives of refactoring the sync script. The addition of the '-f' flag for force synchronization is a good feature.

Consider using a variable for the script name to improve readability and maintainability:

+SYNC_SCRIPT="bin/sync_with_devto"
 if [ "${{ github.event.inputs.force }}" = "true" ]
 then
-  bin/sync_with_devto -f
+  $SYNC_SCRIPT -f
 else
-  bin/sync_with_devto
+  $SYNC_SCRIPT
 fi
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 9c308a3 and 8b8c04c.

📒 Files selected for processing (1)
  • .github/workflows/sync-and-publish.yml (1 hunks)
🧰 Additional context used

@dgorodnichy dgorodnichy force-pushed the refactor-sync-script branch 2 times, most recently from 2f069b3 to 1929ba1 Compare October 22, 2024 08:24
bin/sync/article_cleaner.rb Outdated Show resolved Hide resolved
bin/sync/article_cleaner.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
@pftg
Copy link
Member

pftg commented Oct 22, 2024

Let's split it into more manageable smaller PRs https://www.loom.com/share/99f0d49cf1bd4601bfa4ec3d3f79b874

Copy link
Member

@pftg pftg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 23

🧹 Outside diff range and nitpick comments (6)
bin/sync/dev_to_adapter.rb (1)

1-23: Consider adding rate limiting protection.

The Dev.to API has rate limits, and the adapter should handle them gracefully to prevent service disruption.

Consider implementing:

  1. Rate limit detection from response headers
  2. Exponential backoff for retries
  3. Rate limit tracking across requests

Would you like me to provide an example implementation of rate limiting handling?

bin/sync/article_cleaner.rb (2)

3-5: Add module documentation.

Consider adding a module documentation block to describe its purpose, responsibilities, and usage examples. This will help other developers understand how to properly include and use this mixin.

 require 'fileutils'

+# Provides functionality for cleaning up renamed articles in a blog directory.
+# This module is intended to be mixed into a class that provides a `working_dir` method
+# which should return the path to the blog's root directory.
+#
+# @example
+#   class BlogManager
+#     include ArticleCleaner
+#     
+#     def working_dir
+#       '/path/to/blog'
+#     end
+#   end
 module ArticleCleaner

1-27: Consider restructuring the sync functionality as a proper gem.

The current location in bin/sync suggests this is part of executable scripts. However, this business logic would be better suited in a proper gem structure under lib/, with the executable scripts in bin/ only handling CLI interaction.

This would:

  1. Improve testability
  2. Enable better dependency management
  3. Make the code more reusable
  4. Follow Ruby community best practices

Consider this directory structure:

lib/
  devto_sync/
    article_cleaner.rb
    article_sync_checker.rb
    article_updater.rb
    dev_to_adapter.rb
bin/
  sync_with_devto
bin/sync/article_sync_checker.rb (1)

1-62: Consider architectural improvements for efficient synchronization.

To achieve the PR objective of completing sync in one iteration efficiently:

  1. Implement batch processing for better performance with large article sets
  2. Add proper pagination handling to ensure no articles are missed
  3. Consider adding a progress indicator for long-running operations

Would you like me to provide a detailed implementation for these improvements?

bin/sync/images_downloader.rb (2)

100-104: Rescue specific exceptions instead of StandardError

Catching broad exceptions can mask other issues. Specify the exceptions you expect.

For example:

rescue Net::OpenTimeout, Net::ReadTimeout, SocketError => e

92-92: Ensure the destination directory exists before writing the file

Although directories are created elsewhere, ensuring the directory exists here enhances code safety.

Apply this change:

FileUtils.mkdir_p(File.dirname(dest))
File.open(dest, 'wb') { |file| file.write(response.body) }
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 8b8c04c and 4f156a1.

📒 Files selected for processing (10)
  • .github/workflows/sync-and-publish.yml (1 hunks)
  • bin/from_devto (0 hunks)
  • bin/sync/article_cleaner.rb (1 hunks)
  • bin/sync/article_sync_checker.rb (1 hunks)
  • bin/sync/article_updater.rb (1 hunks)
  • bin/sync/dev_to_adapter.rb (1 hunks)
  • bin/sync/images_downloader.rb (1 hunks)
  • bin/sync/sync.rb (1 hunks)
  • bin/sync_with_devto (1 hunks)
  • bin/upload_assets_to_github (0 hunks)
💤 Files with no reviewable changes (2)
  • bin/from_devto
  • bin/upload_assets_to_github
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/sync-and-publish.yml
  • bin/sync_with_devto
🧰 Additional context used
🪛 rubocop
bin/sync/article_updater.rb

[convention] 4-149: Module has too many lines. [115/100]

(Metrics/ModuleLength)

🔇 Additional comments (5)
bin/sync/dev_to_adapter.rb (1)

8-22: ⚠️ Potential issue

Add error handling, input validation, and security improvements.

The API methods need several improvements for robustness and security:

  1. Error handling for API responses
  2. Input validation
  3. URL parameter encoding
  4. Consistent timeout handling
  5. Return type documentation

Here's a suggested implementation:

   def get_articles(username, page)
-    @http_client.get("#{BASE_URL}/articles?username=#{username}&page=#{page}")
+    raise ArgumentError, 'username cannot be blank' if username.to_s.strip.empty?
+    raise ArgumentError, 'page must be positive' unless page.to_i.positive?
+    
+    response = @http_client.get(
+      "#{BASE_URL}/articles",
+      query: { username: username, page: page },
+      timeout: 60
+    )
+    handle_response(response)
   end

   def get_article(id)
-    @http_client.get(id)
+    raise ArgumentError, 'id cannot be blank' if id.to_s.strip.empty?
+    
+    response = @http_client.get(
+      "#{BASE_URL}/articles/#{CGI.escape(id.to_s)}",
+      timeout: 60
+    )
+    handle_response(response)
   end

   def download(url)
-    @http_client.get(url, timeout: 60)
+    raise ArgumentError, 'url cannot be blank' if url.to_s.strip.empty?
+    
+    response = @http_client.get(url, timeout: 60)
+    handle_response(response)
   end

   def update_article(url, headers:, body:)
-    @http_client.put(url, headers: headers, body: body)
+    raise ArgumentError, 'url cannot be blank' if url.to_s.strip.empty?
+    raise ArgumentError, 'headers must be a hash' unless headers.is_a?(Hash)
+    
+    response = @http_client.put(
+      url,
+      headers: headers,
+      body: body,
+      timeout: 60
+    )
+    handle_response(response)
   end

+  private
+
+  def handle_response(response)
+    case response.code
+    when 200..299
+      response
+    when 404
+      raise NotFoundError, "Resource not found: #{response.message}"
+    when 401
+      raise UnauthorizedError, "Unauthorized: #{response.message}"
+    else
+      raise ApiError, "API error (#{response.code}): #{response.message}"
+    end
+  end

Let's verify the API endpoint usage across the codebase:

bin/sync/sync.rb (4)

1-8: LGTM! Well-organized dependencies.

The imports are properly structured with standard libraries first, followed by local modules. The separation of concerns into distinct modules promotes maintainability.


9-13: LGTM! Well-structured class with clear separation of concerns.

The class follows SOLID principles by delegating specific responsibilities to dedicated modules. This modular approach will make the synchronization process more maintainable and testable.


29-31: LGTM! Proper encapsulation of internal state.

The private attribute readers follow Ruby best practices for encapsulation.


23-27: Verify single-iteration synchronization requirement.

The implementation looks clean and appears to fulfill the PR objective of completing synchronization in a single iteration. However, we should verify that the included modules (ArticleSyncChecker, ArticleUpdater, ArticleCleaner) don't introduce any loops or multiple passes.

✅ Verification successful

Let me try to locate these files first since the previous pattern didn't work.


Now that we've found the correct files, let's check their implementations for any potential loops or multiple iterations.


Single-iteration synchronization requirement is verified

After reviewing the implementations, I can confirm that the synchronization process completes in a single pass. Here's the verification breakdown:

  1. ArticleSyncChecker: Uses a single iteration to update the sync status of articles
  2. ArticleUpdater: Processes articles in a single pass, with each article being processed exactly once
  3. ArticleCleaner: Performs a single directory scan to remove renamed articles

The iterations found (each, map, select) are all single-pass operations over collections, not multiple passes over the same data. The only loops present are for:

  • Processing individual articles once
  • Retrying failed API calls (with a maximum retry limit)
  • Processing images for each article
  • Cleaning up renamed articles

There are no recursive calls, nested iterations, or multiple passes that would violate the single-iteration requirement.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential loops or multiple iterations in the sync-related modules
echo "Checking for potential loops in sync modules..."
rg -A 5 "while|until|loop|each" --type ruby "article_(sync_checker|updater|cleaner)\.rb$"

echo "Checking for recursive calls..."
rg -A 5 "def.*perform|call" --type ruby "article_(sync_checker|updater|cleaner)\.rb$"

Length of output: 525


Script:

#!/bin/bash
# First, let's find the actual files
echo "Locating sync-related files..."
fd -e rb -E vendor . | grep -i "article.*\(sync\|update\|clean\)"

echo "Checking the directory structure around bin/sync..."
ls -R bin/sync/

echo "Looking for requires/includes in sync.rb to find the correct paths..."
cat bin/sync/sync.rb

Length of output: 1449


Script:

#!/bin/bash
# Check the implementation of the three main modules
echo "=== ArticleSyncChecker implementation ==="
cat bin/sync/article_sync_checker.rb

echo -e "\n=== ArticleUpdater implementation ==="
cat bin/sync/article_updater.rb

echo -e "\n=== ArticleCleaner implementation ==="
cat bin/sync/article_cleaner.rb

# Also look for any iteration patterns that might indicate multiple passes
echo -e "\n=== Checking for iteration patterns in all sync files ==="
rg -A 5 "while|until|loop|each|map|select" bin/sync/article_*.rb

Length of output: 9880

bin/sync/dev_to_adapter.rb Outdated Show resolved Hide resolved
bin/sync/dev_to_adapter.rb Show resolved Hide resolved
bin/sync/article_cleaner.rb Outdated Show resolved Hide resolved
bin/sync/article_cleaner.rb Outdated Show resolved Hide resolved
bin/sync/sync.rb Outdated Show resolved Hide resolved
bin/sync/images_downloader.rb Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 21

🧹 Outside diff range and nitpick comments (7)
test/support/fake_http_client.rb (2)

1-4: Remove unused HTTParty requirement

The httparty gem is required but not used in this fake client implementation.

 require "json"
 require 'ostruct'
-require 'httparty'

5-5: Add class documentation and tests

Consider adding:

  1. Class documentation explaining its purpose and usage
  2. Tests for the fake client itself to ensure it behaves consistently
+# FakeHttpClient simulates HTTP responses for testing the Dev.to API integration.
+# It provides mock responses for article listing, retrieval, updates, and image downloads.
+#
+# @example
+#   client = FakeHttpClient.new
+#   response = client.get_articles('/articles', 1)
+#   articles = JSON.parse(response.body)
 class FakeHttpClient
bin/sync/article_sync_checker.rb (2)

6-11: Add method documentation

Add YARD-style documentation to explain the method's purpose, usage, and return value. This is particularly important for public methods.

+  # Updates the sync status for all articles fetched from Dev.to
+  # @return [void]
+  # @raise [StandardError] if file operations or API calls fail
   def update_sync_status
     ensure_sync_status_file_exists
     @sync_status = load_sync_status

48-60: Consider batch processing for large article sets

Processing articles one by one might be inefficient for large sets. Consider implementing batch processing.

   def update_status(articles)
+    return if articles.empty?
+    
+    # Process articles in batches of 100
+    articles.each_slice(100) do |batch|
+      batch_updates = {}
+      
+      batch.each do |article|
+        id = article['id']
+        edited_at = article["edited_at"] || article["created_at"]
+        
+        batch_updates[id] = {
+          edited_at: edited_at,
+          slug: slug(article),
+          synced: false
+        }
+      end
+      
+      # Bulk update sync status
+      @sync_status.merge!(batch_updates) do |_, old_val, new_val|
+        if old_val[:edited_at] != new_val[:edited_at]
+          new_val
+        else
+          old_val
+        end
+      end
+    end
   end
bin/sync/article_updater.rb (1)

4-149: Consider splitting the module into smaller, focused classes.

The module is handling multiple responsibilities and has grown too large. Consider extracting these responsibilities into separate classes:

  • ArticleFetcher - Handle API interactions and retries
  • ArticleFormatter - Handle markdown and YAML generation
  • ArticleSynchronizer - Orchestrate the synchronization process

This separation would improve maintainability and testability of the codebase. Would you like me to help create a detailed plan for this refactoring?

🧰 Tools
🪛 rubocop

[convention] 4-149: Module has too many lines. [115/100]

(Metrics/ModuleLength)

test/fixtures/responses/1879395.json (2)

2-14: Consider using placeholder URLs in test fixtures.

Using production URLs (dev.to domain) in test fixtures could make tests environment-dependent. Consider using placeholder domains (e.g., example.com) to ensure tests remain isolated from external dependencies.

Example modification:

-  "url": "https://dev.to/jetthoughts/recent-searches-sorting-hashes-how-they-are-connected-4965",
+  "url": "https://example.com/articles/recent-searches-sorting-hashes-how-they-are-connected-4965",

15-16: Simplify image URLs in test fixtures.

The image URLs contain complex dynamic image processing parameters. For test fixtures, consider using simpler URLs to improve readability and reduce complexity.

Example modification:

-  "cover_image": "https://media2.dev.to/dynamic/image/width=1000,height=420,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fjetthoughts%2Fjetthoughts.github.io%2Fmaster%2Fstatic%2Fassets%2Fimg%2Fblog%2Frecent-searches-sorting-hashes-how-they-are-connected-ruby-rails%2Ffile_0.jpeg",
+  "cover_image": "https://example.com/images/cover.jpeg",
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 4f156a1 and 3924291.

⛔ Files ignored due to path filters (1)
  • test/fixtures/images/sample.jpeg is excluded by !**/*.jpeg
📒 Files selected for processing (12)
  • bin/sync/article_cleaner.rb (1 hunks)
  • bin/sync/article_sync_checker.rb (1 hunks)
  • bin/sync/article_updater.rb (1 hunks)
  • bin/sync/dev_to_adapter.rb (1 hunks)
  • bin/sync/images_downloader.rb (1 hunks)
  • bin/sync/sync.rb (1 hunks)
  • bin/sync_with_devto (1 hunks)
  • test/fixtures/responses/1877720.json (1 hunks)
  • test/fixtures/responses/1879395.json (1 hunks)
  • test/fixtures/responses/index.json (1 hunks)
  • test/integration/sync_with_devto_test.rb (1 hunks)
  • test/support/fake_http_client.rb (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • test/fixtures/responses/index.json
🚧 Files skipped from review as they are similar to previous changes (5)
  • bin/sync/article_cleaner.rb
  • bin/sync/dev_to_adapter.rb
  • bin/sync/images_downloader.rb
  • bin/sync/sync.rb
  • bin/sync_with_devto
🧰 Additional context used
🪛 rubocop
bin/sync/article_updater.rb

[convention] 4-149: Module has too many lines. [115/100]

(Metrics/ModuleLength)

test/integration/sync_with_devto_test.rb

[convention] 66-66: Prefer using YAML.safe_load over YAML.load.

(Security/YAMLLoad)

🔇 Additional comments (8)
bin/sync/article_updater.rb (2)

1-8: LGTM! Dependencies and constants are well-defined.

The module's dependencies and constants are appropriate for its responsibilities.


24-26: ⚠️ Potential issue

Inject working_dir dependency.

The working_dir variable is undefined. It should be injected as a dependency.

Apply this fix:

-def download_images(slug, http_client)
+def download_images(slug, http_client:, working_dir:)
-  ImagesDownloader.new(slug, http_client, working_dir).call
+  ImagesDownloader.new(slug, http_client: http_client, working_dir: working_dir).call
end

Likely invalid or redundant comment.

test/integration/sync_with_devto_test.rb (3)

1-13: LGTM! Well-structured test setup.

The test class setup follows Ruby best practices with proper requires and frozen constants.


15-25: LGTM! Good test hygiene practices.

The lifecycle methods ensure proper test isolation through setup and cleanup.


59-175: LGTM! Comprehensive test coverage.

The test suite provides good coverage of the sync functionality, including:

  • Status file creation and updates
  • Article content synchronization
  • Image downloads and path updates
  • Custom slug handling
🧰 Tools
🪛 rubocop

[convention] 66-66: Prefer using YAML.safe_load over YAML.load.

(Security/YAMLLoad)

test/fixtures/responses/1877720.json (1)

15-16: 🛠️ Refactor suggestion

Consider making image dimensions configurable.

The cover and social image URLs have hardcoded dimensions (width=1000). Consider making these dimensions configurable through environment variables or configuration files to allow for flexibility in different environments or use cases.

-  "cover_image": "https://media2.dev.to/dynamic/image/width=1000,height=420,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fjetthoughts%2Fjetthoughts.github.io%2Fmaster%2Fstatic%2Fassets%2Fimg%2Fblog%2Fmyth-or-reality-can-test-driven-development-in-agile-replace-qa-programming%2Ffile_0.png",
-  "social_image": "https://media2.dev.to/dynamic/image/width=1000,height=500,fit=cover,gravity=auto,format=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fjetthoughts%2Fjetthoughts.github.io%2Fmaster%2Fstatic%2Fassets%2Fimg%2Fblog%2Fmyth-or-reality-can-test-driven-development-in-agile-replace-qa-programming%2Ffile_0.png",
+  "cover_image": "https://media2.dev.to/dynamic/image/width=${COVER_IMAGE_WIDTH},height=${COVER_IMAGE_HEIGHT},fit=cover,gravity=auto,format=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fjetthoughts%2Fjetthoughts.github.io%2Fmaster%2Fstatic%2Fassets%2Fimg%2Fblog%2Fmyth-or-reality-can-test-driven-development-in-agile-replace-qa-programming%2Ffile_0.png",
+  "social_image": "https://media2.dev.to/dynamic/image/width=${SOCIAL_IMAGE_WIDTH},height=${SOCIAL_IMAGE_HEIGHT},fit=cover,gravity=auto,format=auto/https%3A%2F%2Fraw.githubusercontent.com%2Fjetthoughts%2Fjetthoughts.github.io%2Fmaster%2Fstatic%2Fassets%2Fimg%2Fblog%2Fmyth-or-reality-can-test-driven-development-in-agile-replace-qa-programming%2Ffile_0.png",

Likely invalid or redundant comment.

test/fixtures/responses/1879395.json (2)

1-49: LGTM: Well-structured JSON fixture.

The JSON structure follows the Dev.to API response format and includes all required fields.


32-48: LGTM: Well-structured user and organization data.

The user and organization sections are well-structured with appropriate null values where applicable, which is good for testing edge cases.

test/support/fake_http_client.rb Outdated Show resolved Hide resolved
test/support/fake_http_client.rb Outdated Show resolved Hide resolved
test/support/fake_http_client.rb Outdated Show resolved Hide resolved
test/support/fake_http_client.rb Outdated Show resolved Hide resolved
bin/sync/article_sync_checker.rb Outdated Show resolved Hide resolved
test/integration/sync_with_devto_test.rb Outdated Show resolved Hide resolved
test/fixtures/responses/1877720.json Show resolved Hide resolved
test/fixtures/responses/1877720.json Show resolved Hide resolved
test/fixtures/responses/1879395.json Show resolved Hide resolved
test/fixtures/responses/1879395.json Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 16

🧹 Outside diff range and nitpick comments (2)
bin/sync/images_downloader.rb (1)

1-108: Consider security implications of downloading external content.

When downloading external content:

  1. Consider implementing file size limits to prevent DOS attacks
  2. Add virus scanning for downloaded content if possible
  3. Implement rate limiting to prevent overwhelming external servers
  4. Consider caching successfully downloaded images

These measures will help protect both your system and the external services you're interacting with.

bin/sync/article_updater.rb (1)

4-158: Consider splitting the module into focused classes.

The module handles multiple responsibilities and exceeds the recommended length. Consider extracting these responsibilities into separate classes:

  1. ArticleFetcher - API interactions
  2. ArticleFormatter - Markdown/YAML generation
  3. ArticlePersister - File operations
  4. CanonicalUrlManager - URL updates

This would improve:

  • Code organization
  • Testability
  • Maintainability
🧰 Tools
🪛 rubocop

[convention] 4-158: Module has too many lines. [126/100]

(Metrics/ModuleLength)


[convention] 13-36: Assignment Branch Condition size for download_new_articles is too high. [<6, 22, 7> 23.85/23]

(Metrics/AbcSize)


[convention] 13-36: Cyclomatic complexity for download_new_articles is too high. [8/7]

(Metrics/CyclomaticComplexity)

📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 56f4d54 and 01b97f3.

📒 Files selected for processing (8)
  • bin/sync/article_cleaner.rb (1 hunks)
  • bin/sync/article_sync_checker.rb (1 hunks)
  • bin/sync/article_updater.rb (1 hunks)
  • bin/sync/dev_to_adapter.rb (1 hunks)
  • bin/sync/images_downloader.rb (1 hunks)
  • bin/sync/sync.rb (1 hunks)
  • test/integration/sync_with_devto_test.rb (1 hunks)
  • test/support/fake_http_client.rb (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • bin/sync/article_cleaner.rb
  • bin/sync/dev_to_adapter.rb
  • test/integration/sync_with_devto_test.rb
🧰 Additional context used
🪛 rubocop
bin/sync/article_updater.rb

[convention] 4-158: Module has too many lines. [126/100]

(Metrics/ModuleLength)


[convention] 13-36: Assignment Branch Condition size for download_new_articles is too high. [<6, 22, 7> 23.85/23]

(Metrics/AbcSize)


[convention] 13-36: Cyclomatic complexity for download_new_articles is too high. [8/7]

(Metrics/CyclomaticComplexity)

🔇 Additional comments (7)
bin/sync/article_sync_checker.rb (1)

8-14: ⚠️ Potential issue

Fix undefined method reference

The method references sync_status which is undefined. This should use load_sync_status instead.

Add the following method and update the reference:

-    @sync_status = sync_status
+    @sync_status = load_sync_status

+  private
+
+  def load_sync_status
+    YAML.load_file(File.join(working_dir, SYNC_STATUS_FILE)) || {}
+  end

Likely invalid or redundant comment.

bin/sync/images_downloader.rb (1)

7-12: LGTM! Well-structured class with proper dependency injection.

The class follows good practices by accepting dependencies through the constructor, making it easier to test and maintain.

bin/sync/article_updater.rb (1)

1-12: LGTM! Well-structured constants and configuration.

The constants are properly defined and frozen, and the retry configuration is well-structured with reasonable default values.

bin/sync/sync.rb (4)

13-13: Ensure consistent naming of synchronization status file constant

There is inconsistency in naming the constant for the synchronization status file between YAML_STATUS_FILE and SYNC_STATUS_FILE in other modules. Please use consistent naming across modules to avoid confusion.


20-22: Consider reducing parameter duplication with the constructor

The default values for http_client and working_dir are duplicated from the constructor. Consider using class constants or delegating to the constructor's defaults to reduce redundancy.


27-30: Use logger instead of puts for warnings and errors

Using puts for warnings and error messages is not recommended in production code. Consider using a logging framework for better control over log levels and output destinations.


34-38: Consider adding progress reporting and return value

The perform method could benefit from adding progress reporting to track operations and a return value indicating success or failure of the sync operation.

test/support/fake_http_client.rb Show resolved Hide resolved
test/support/fake_http_client.rb Outdated Show resolved Hide resolved
test/support/fake_http_client.rb Show resolved Hide resolved
test/support/fake_http_client.rb Outdated Show resolved Hide resolved
bin/sync/article_sync_checker.rb Show resolved Hide resolved
bin/sync/images_downloader.rb Show resolved Hide resolved
bin/sync/images_downloader.rb Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/article_updater.rb Outdated Show resolved Hide resolved
bin/sync/sync.rb Show resolved Hide resolved
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (2)
bin/sync/images_downloader.rb (2)

5-5: Consider making REPO_URL configurable

Hardcoding the repository URL makes the code less flexible and harder to reuse. Consider making this configurable through environment variables or configuration files.


7-7: Add documentation and type checking

The class lacks documentation and type checking for its parameters. Consider adding YARD documentation and parameter validation.

+# Downloads and processes images from markdown blog posts
+# @param slug [String] The blog post slug
+# @param http_client [#download] An HTTP client object that responds to download
+# @param working_dir [String] The working directory path
 class ImagesDownloader
   def initialize(slug, http_client, working_dir)
+    raise ArgumentError, "slug must be a String" unless slug.is_a?(String)
+    raise ArgumentError, "http_client must respond to download" unless http_client.respond_to?(:download)
+    raise ArgumentError, "working_dir must be a String" unless working_dir.is_a?(String)
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 01b97f3 and bfdf395.

📒 Files selected for processing (1)
  • bin/sync/images_downloader.rb (1 hunks)

bin/sync/images_downloader.rb Outdated Show resolved Hide resolved
bin/sync/images_downloader.rb Show resolved Hide resolved
bin/sync/images_downloader.rb Show resolved Hide resolved
bin/sync/images_downloader.rb Show resolved Hide resolved
bin/sync/images_downloader.rb Show resolved Hide resolved
dgorodnichy and others added 4 commits October 25, 2024 12:08
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (4)
bin/sync/retryable.rb (1)

1-23: Consider additional reliability patterns.

Since this module is critical for network operations, consider these architectural improvements:

  1. Circuit Breaker pattern to prevent overwhelming failing services
  2. Metrics collection for monitoring retry patterns
  3. Integration with a proper observability system

Example implementation of a circuit breaker integration:

module Retryable
  class CircuitBreaker
    def self.around_operation
      return unless block_given?
      
      if healthy?
        yield
        record_success
      else
        raise CircuitOpenError
      end
    rescue => e
      record_failure
      raise
    end
  end
end

Consider integrating with monitoring systems to track:

  • Retry attempt distribution
  • Failure rates by operation
  • Recovery patterns
  • Circuit breaker state changes
bin/sync/article_updater.rb (2)

5-11: Add module documentation and initialize instance variables.

The module lacks documentation explaining its purpose and usage. Additionally, it relies on instance variables that are not explicitly initialized.

Add documentation and initialize instance variables:

+# Handles synchronization of articles between Dev.to and local markdown files.
+# Requires http_client and working_dir to be initialized before use.
 module ArticleUpdater
   JT_BLOG_HOST = 'https://jetthoughts.com/blog/'.freeze
   DEV_TO_API_URL = 'https://dev.to/api/articles'.freeze
   YAML_STATUS_FILE = 'sync_status.yml'.freeze

   include Retryable
+
+  def initialize(http_client:, working_dir:)
+    @http_client = http_client
+    @working_dir = working_dir
+  end
🧰 Tools
🪛 rubocop

[convention] 5-140: Module has too many lines. [108/100]

(Metrics/ModuleLength)


5-140: Consider splitting the module into smaller focused classes.

The module is slightly over the recommended length (108 lines vs 100) and handles multiple responsibilities. Consider extracting these responsibilities into separate classes:

  • ArticleFetcher - Handling Dev.to API interactions
  • ArticleFormatter - Managing markdown generation
  • ArticlePersister - Handling file operations
  • SyncStatusManager - Managing synchronization status

This would improve maintainability and adhere better to the Single Responsibility Principle.

🧰 Tools
🪛 rubocop

[convention] 5-140: Module has too many lines. [108/100]

(Metrics/ModuleLength)


[convention] 12-35: Assignment Branch Condition size for download_new_articles is too high. [<6, 22, 7> 23.85/23]

(Metrics/AbcSize)


[convention] 12-35: Cyclomatic complexity for download_new_articles is too high. [8/7]

(Metrics/CyclomaticComplexity)

test/integration/sync_with_devto_test.rb (1)

1-14: LGTM! Consider documenting test data constants.

The test file structure and setup look good. However, consider adding documentation comments for the test data constants to explain their purpose and structure.

# Article with basic metadata and content
FAKE_API_ARTICLE_1 = {
  slug: "recent-searches-sorting-hashes-how-they-are-connected-ruby-rails"
}.freeze

# Article used for testing custom slug functionality
FAKE_API_ARTICLE_2 = {
  slug: "myth-or-reality-can-test-driven-development-in-agile-replace-qa-programming"
}.freeze
📜 Review details

Configuration used: .coderabbit.yaml
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between aa6b268 and 73007ad.

📒 Files selected for processing (4)
  • bin/sync/article_updater.rb (1 hunks)
  • bin/sync/images_downloader.rb (1 hunks)
  • bin/sync/retryable.rb (1 hunks)
  • test/integration/sync_with_devto_test.rb (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • bin/sync/images_downloader.rb
🧰 Additional context used
🪛 rubocop
bin/sync/article_updater.rb

[convention] 5-140: Module has too many lines. [108/100]

(Metrics/ModuleLength)


[convention] 12-35: Assignment Branch Condition size for download_new_articles is too high. [<6, 22, 7> 23.85/23]

(Metrics/AbcSize)


[convention] 12-35: Cyclomatic complexity for download_new_articles is too high. [8/7]

(Metrics/CyclomaticComplexity)

🔇 Additional comments (9)
bin/sync/retryable.rb (1)

7-23: ⚠️ Potential issue

Improve error handling and logging robustness.

The current implementation has several areas for improvement:

  1. Catching StandardError is too broad and could mask critical errors
  2. Missing cleanup in case of interruption
  3. Limited error logging
  4. No handling of interrupt signals

Consider this enhanced implementation:

   def with_retries(operation:)
     attempts = 0
     begin
       attempts += 1
       yield
-    rescue StandardError => e
+    rescue Net::HTTP::Persistent::Error, 
+           Errno::ETIMEDOUT,
+           Errno::ECONNRESET => e
       if attempts < RETRY_CONFIG[:max_attempts]
-        delay = RETRY_CONFIG[:base_delay] * attempts
-        puts "#{operation} failed, attempt #{attempts}/#{RETRY_CONFIG[:max_attempts]}. Retrying in #{delay}s..."
+        delay = calculate_delay(attempts)
+        logger.warn "#{operation} failed (#{e.class}: #{e.message}), " \
+                   "attempt #{attempts}/#{RETRY_CONFIG[:max_attempts]}. " \
+                   "Retrying in #{delay}s..."
         sleep(delay)

         retry
       end
-      puts "#{operation} failed after #{RETRY_CONFIG[:max_attempts]} attempts: #{e.message}"
+      logger.error "#{operation} failed after #{RETRY_CONFIG[:max_attempts]} " \
+                   "attempts: #{e.class} - #{e.message}\n#{e.backtrace.join("\n")}"
+      raise
+    rescue Interrupt => e
+      logger.info "#{operation} interrupted, cleaning up..."
+      raise
+    ensure
+      # Add any necessary cleanup here
     end
   end

Additionally, consider:

  1. Using a proper logger instead of puts
  2. Adding metrics/monitoring for retry attempts
  3. Implementing circuit breaker pattern for persistent failures

Let's verify the error types being handled in the codebase:

bin/sync/article_updater.rb (2)

43-54: ⚠️ Potential issue

Use atomic file operations for YAML updates.

The current implementation of writing to the YAML file is not atomic and could lead to file corruption if the process is interrupted.

 def update_article_edited_at(article_id, updated_article)
   data = sync_status

   if data[article_id]
     data[article_id][:edited_at] = updated_article["edited_at"]
     data[article_id][:synced] = true
-    File.open(File.join(working_dir, YAML_STATUS_FILE), 'w') { |f| f.write(data.to_yaml) }
+    temp_file = File.join(working_dir, "#{YAML_STATUS_FILE}.tmp")
+    File.write(temp_file, data.to_yaml)
+    FileUtils.mv(temp_file, File.join(working_dir, YAML_STATUS_FILE))
     puts "Article ID: #{article_id} updated successfully."
   else
     puts "Article ID: #{article_id} not found."
   end
 end

Likely invalid or redundant comment.


113-139: 🛠️ Refactor suggestion

Improve YAML front matter generation.

The current implementation of generating markdown content could be improved to handle YAML front matter more robustly.

 def generate_markdown(article_data, slug)
   cover_image = article_data['cover_image']
   metatags_image = {}

   unless cover_image.to_s.empty?
     metatags_image = {
       'metatags' => {
         'image' => "cover#{File.extname(cover_image)}"
       }
     }
   end

   article_hash = {
     'dev_to_id' => article_data['id'],
     'title' => article_data['title'],
     'description' => article_data['description'],
     'created_at' => article_data['created_at'],
     'edited_at' => article_data['edited_at'],
     'draft' => false,
     'tags' => article_data['tags'],
     'canonical_url' => article_data['canonical_url'],
     'cover_image' => cover_image,
     'slug' => slug
   }.merge(metatags_image)

-  "#{article_hash.to_yaml(line_width: -1)}---\n#{article_data['body_markdown']}"
+  [
+    "---\n",
+    article_hash.to_yaml.gsub(/^---\n/, ''),
+    "---\n",
+    article_data['body_markdown']
+  ].join
 end

Likely invalid or redundant comment.

test/integration/sync_with_devto_test.rb (6)

15-26: LGTM! Setup and teardown methods are well-implemented.

The test environment setup and cleanup are properly handled, following testing best practices.


28-58: Refer to the previous review comment about DRYing up test data setup.


60-68: LGTM! Proper implementation of YAML file handling.

Good use of YAML.safe_load with permitted classes for secure YAML parsing.


70-117: Refer to the previous review comment about breaking down complex tests.


119-142: Refer to the previous review comment about using test fixtures.


164-177: LGTM! Thorough testing of custom slug functionality.

The test properly verifies the directory structure changes when applying custom slugs.

Comment on lines +1 to +6
module Retryable
RETRY_CONFIG = {
max_attempts: 5,
base_delay: 2
}.freeze

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider enhancing configuration flexibility and documentation.

The configuration is well-structured but could benefit from these improvements:

  1. Allow configuration override through environment variables or initialization options
  2. Add jitter to the base delay to prevent synchronized retries across multiple processes
  3. Document the configuration parameters

Consider this implementation:

 module Retryable
+  # Configuration for retry behavior:
+  # max_attempts: Maximum number of retry attempts before giving up
+  # base_delay: Base delay in seconds between retries (actual delay will include jitter)
   RETRY_CONFIG = {
-    max_attempts: 5,
-    base_delay: 2
+    max_attempts: ENV.fetch('RETRY_MAX_ATTEMPTS', 5).to_i,
+    base_delay: ENV.fetch('RETRY_BASE_DELAY', 2).to_i
   }.freeze

+  private
+
+  def calculate_delay(attempt)
+    jitter = rand(0.1..0.3)
+    (RETRY_CONFIG[:base_delay] * attempt) + jitter
+  end

Committable suggestion was skipped due to low confidence.

Comment on lines 12 to 35
def download_new_articles(force = false)
raise ArgumentError, "http_client is required" if http_client.nil?
raise ArgumentError, "working_dir is required" if working_dir.nil?

articles = force ? all_articles : unsynced_articles

begin
articles.each do |article_id, attributes|
article_data = fetch_article(article_id)
next unless article_data

save_article_as_markdown(article_data, attributes[:slug])
download_images(attributes[:slug], http_client, working_dir)
updated_article = update_canonical_url_on_dev_to(article_id, attributes[:slug])
next unless updated_article

update_article_edited_at(article_id, updated_article)
end

rescue StandardError => e
puts "Error processing articles: #{e.message}"
raise
end
end
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Refactor method to reduce complexity and improve clarity.

The method has high cyclomatic complexity and uses instance variables directly. Consider extracting the article processing logic and using keyword arguments.

-def download_new_articles(force = false)
+def download_new_articles(force: false)
   raise ArgumentError, "http_client is required" if http_client.nil?
   raise ArgumentError, "working_dir is required" if working_dir.nil?

   articles = force ? all_articles : unsynced_articles

   begin
-    articles.each do |article_id, attributes|
-      article_data = fetch_article(article_id)
-      next unless article_data
-
-      save_article_as_markdown(article_data, attributes[:slug])
-      download_images(attributes[:slug], http_client, working_dir)
-      updated_article = update_canonical_url_on_dev_to(article_id, attributes[:slug])
-      next unless updated_article
-
-      update_article_edited_at(article_id, updated_article)
-    end
+    articles.each { |article_id, attributes| process_article(article_id, attributes) }
   rescue StandardError => e
     puts "Error processing articles: #{e.message}"
     raise
   end
 end

+private
+
+def process_article(article_id, attributes)
+  article_data = fetch_article(article_id)
+  return unless article_data
+
+  save_article_as_markdown(article_data, attributes[:slug])
+  download_images(attributes[:slug], http_client, working_dir)
+  updated_article = update_canonical_url_on_dev_to(article_id, attributes[:slug])
+  return unless updated_article
+
+  update_article_edited_at(article_id, updated_article)
+end

Committable suggestion was skipped due to low confidence.

🧰 Tools
🪛 rubocop

[convention] 12-35: Assignment Branch Condition size for download_new_articles is too high. [<6, 22, 7> 23.85/23]

(Metrics/AbcSize)


[convention] 12-35: Cyclomatic complexity for download_new_articles is too high. [8/7]

(Metrics/CyclomaticComplexity)

test/integration/sync_with_devto_test.rb Outdated Show resolved Hide resolved
@@ -2,5 +2,7 @@ early_access: true
reviews:
path_filters:
- "!content/blog/!*.md"
- "!content/blog/sync_status.yml"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's avoid to store data files from content, it should be located in data folder as I remember, please check hugo docs about this

@pftg
Copy link
Member

pftg commented Oct 28, 2024

@dgorodnichy, could you add a screenshot of what files would be changed after running this script with the force (but make sure not to update dev.to (by the way, create some flag for this)? And what will be changed in the selected post

Copy link
Member

@pftg pftg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One post has the issue:

jetthoughts github io – Commit: index md 2024-11-12 13-49-08

synced: true
}
end

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dgorodnichy still has time changes without content updates. Please add a test that there is no dev.to API calls if there are no changes

jetthoughts github io – Commit: index md 2024-11-12 13-40-16

@pftg
Copy link
Member

pftg commented Nov 12, 2024

@dgorodnichy please rebase

@pftg
Copy link
Member

pftg commented Nov 13, 2024

@dgorodnichy we can fix the problem with that post in separate pr. Please rebase and merge

@pftg pftg merged commit d182950 into master Nov 13, 2024
@pftg pftg deleted the refactor-sync-script branch November 13, 2024 09:26
@dgorodnichy
Copy link
Collaborator Author

@pftg , Can't to reproduce. At this moment I see this article on jt blog.

@dgorodnichy
Copy link
Collaborator Author

Sorry, Reproduced.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Refactor Sync script
2 participants