
Feat: transitFeedSyncProcessing implementation #819

Open. Wants to merge 3 commits into base: main.
Conversation

@AlfredNwolisa AlfredNwolisa commented Nov 12, 2024

Summary:

The pull request addresses feed sync processing, ensuring proper handling and consistency of feed data using Pub/Sub messages. It includes necessary configuration files, comprehensive tests, and documentation.

Key implementations include:

  • Created FeedProcessor class with comprehensive database interaction capabilities
  • Implemented idempotent feed processing logic that handles both new and existing feeds
  • Added support for feed URL change detection and deprecation workflow
  • Integrated with Google Cloud Pub/Sub for dataset batch processing
  • Implemented stable ID generation for feed tracking across updates
  • Added comprehensive logging and error handling at all processing stages
  • Implemented database transaction management with rollback support

Added support for:

  • Authentication type handling for feeds
  • External ID mapping and management
  • Feed redirection tracking
  • URL duplication checking
  • Feed status management (active/deprecated)
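
As a concrete illustration of the stable ID generation mentioned above, the scheme can be sketched as follows (the helper name is hypothetical; the `<source>-<external_id>` format matches the code shown later in this diff):

```python
def make_stable_id(source: str, external_id: str) -> str:
    """Build a stable ID of the form "<source>-<external_id>".

    Unlike the per-record UUID, this value stays the same across
    feed updates, so it can be used to track a feed over time.
    """
    return f"{source}-{external_id}"
```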

This pull request addresses the functionality described in issue https://github.com/MobilityData/product-tasks/issues/102, which is part of the https://github.com/MobilityData/product-tasks/issues/95 epic.

Expected behavior:

Feed Processing Flow:

  1. The Cloud Function receives a Pub/Sub event containing feed information:
  • Decodes the base64-encoded message
  • Validates and parses the payload into a FeedPayload object
  2. For each feed processing request:
  • Checks whether the feed exists, using external ID and source
  • Validates the feed URL for duplicates across the system
  3. New feed processing, if the feed doesn't exist:
  • Generates a new UUID and stable ID
  • Creates the feed record with active status
  • Creates the external ID mapping
  • Publishes to the dataset batch topic if not authenticated
  4. Feed update processing, if the feed exists with a different URL:
  • Creates a new feed record with the updated URL
  • Deprecates the old feed record
  • Updates the external ID mapping
  • Creates a redirect mapping between the old and new feed IDs
  • Publishes the update to the dataset batch topic if not authenticated
  5. Database transaction handling:
  • Commits successful operations
  • Rolls back on any errors
  • Maintains data consistency across all operations
  6. Error handling:
  • Provides detailed logging at each step
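
The decode-and-parse step at the top of this flow can be sketched as below. This is a minimal illustration assuming the standard Pub/Sub push envelope (`message.data` carrying a base64-encoded JSON body), not the PR's exact code:

```python
import base64
import json


def parse_pubsub_event(cloud_event_data: dict) -> dict:
    """Decode a Pub/Sub push event into a feed payload dict.

    Pub/Sub delivers the message body base64-encoded under
    message.data; the caller would then validate the dict and
    build a FeedPayload from it.
    """
    raw = cloud_event_data["message"]["data"]
    return json.loads(base64.b64decode(raw).decode("utf-8"))
```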

Testing tips:

Provide tips, procedures and sample files on how to test the feature.
Testers are invited to follow the tips AND to try anything they deem relevant outside the bounds of the testing tips.

Please make sure these boxes are checked before submitting your pull request - thanks!

  • Run the unit tests with ./scripts/api-tests.sh to make sure you didn't break anything
  • Add or update any needed documentation to the repo
  • Format the title like "feat: [new feature short description]". The title must follow the Conventional Commits specification (https://www.conventionalcommits.org/en/v1.0.0/).
  • Linked all relevant issues
  • Include screenshot(s) showing how this pull request works and fixes the issue(s)

This commit:
- Implements feed sync processing for Pub/Sub messages
- Ensures database consistency during sync operations
- Adds configuration files for feed sync settings
- Includes comprehensive test coverage
- Documents sync process and configuration options
@AlfredNwolisa AlfredNwolisa self-assigned this Nov 12, 2024
@AlfredNwolisa changed the title from "feat: Add Transitland feed sync processor" to "Feat: transitFeedSyncProcessing implementation" on Nov 12, 2024
Replaced raw SQL queries with SQLAlchemy ORM models for handling database operations in feed processing. Enhanced test coverage and updated mock configurations to align with the new ORM-based approach.
@AlfredNwolisa AlfredNwolisa marked this pull request as ready for review November 18, 2024 20:06
FEEDS_DATABASE_URL=postgresql://postgres:postgres@localhost:54320/MobilityDatabase
PROJECT_ID=my-project-id
PUBSUB_TOPIC_NAME=my-topic
TRANSITLAND_API_KEY=your-api-key
Contributor:
Is this variable used?

Comment on lines +33 to +40
logger = logging.getLogger("feed_processor")
if not logger.handlers:
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("%(asctime)s - %(name)s - %(levelname)s - %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
Contributor:
Please init the logger using the helpers.Logger.init_logger()
Example usage: https://github.com/MobilityData/mobility-feed-api/blob/main/functions-python/preprocessed_analytics/src/main.py#L36
This is important to make sure the logs are printed correctly when running in a GCP instance

Comment on lines +49 to +65
class FeedPayload:
    """Data class for feed processing payload"""

    external_id: str
    feed_id: str
    feed_url: str
    execution_id: Optional[str]
    spec: str
    auth_info_url: Optional[str]
    auth_param_name: Optional[str]
    type: Optional[str]
    operator_name: Optional[str]
    country: Optional[str]
    state_province: Optional[str]
    city_name: Optional[str]
    source: str
    payload_type: str
Contributor:
This seems to be duplicated from feed_sync_dispatcher_transitland. To avoid code duplication we should move it to a common location and keep the naming consistent (it's renamed from TransitFeedSyncPayload).

logger.error(error_msg)
if "payload" in locals():
    self.session.rollback()
    logger.debug("Database transaction rolled back due to error")
@cka-y (Contributor) commented Nov 19, 2024:
We should change this to an error level logging as it is critical

Comment on lines +306 to +323
result = (
    self.session.query(Feed.id, Feed.producer_url)
    .join(Externalid)
    .filter(
        Externalid.associated_id == external_id,
        Externalid.source == source,
        Feed.status == "active",
    )
    .first()
)
if result:
    logger.debug(
        f"Retrieved current feed info for external_id: {external_id}"
    )
    return result[0], result[1]

logger.debug(f"No existing feed found for external_id: {external_id}")
return None, None
Contributor:
[suggestion to avoid table join]

Suggested change (replacing the join-based query with a relationship filter):

result = (
    self.session.query(Feed)
    .filter(
        Feed.externalids.any(
            associated_id=external_id,
            source=source,
        ),
        Feed.status == "active",
    )
    .first()
)
if result is not None:
    logger.info(
        f"Retrieved feed {result.stable_id} info for external_id: {external_id}"
    )
    return result.id, result.producer_url
logger.info(f"No existing feed found for external_id: {external_id}")
return None, None

# Update old feed status to deprecated
old_feed = self.session.get(Feed, old_feed_id)
if old_feed:
    old_feed.status = "deprecated"
Contributor:
[question] just to confirm @emmambd in the case of the update of an existing feed during scraping. What should the status of the old feed be changed to?

Comment on lines +237 to +250
existing_external_id = (
    self.session.query(Externalid)
    .filter(
        Externalid.associated_id == payload.external_id,
        Externalid.source == payload.source,
    )
    .first()
)

if existing_external_id:
    existing_external_id.feed_id = new_feed_id
    logger.debug(
        f"Updated external ID mapping to new feed_id: {new_feed_id}"
    )
@cka-y (Contributor) commented Nov 19, 2024:
We should actually create a new entity of Externalid linked to the updated (new) feed.
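
A minimal sketch of that suggestion, using stand-in SQLAlchemy models (the real Externalid model in this repo has more columns; this only illustrates adding a new mapping row for the new feed instead of mutating the old one in place):

```python
from sqlalchemy import Column, String, create_engine
from sqlalchemy.orm import Session, declarative_base

Base = declarative_base()


class Externalid(Base):
    """Stand-in for the repo's Externalid model (columns simplified)."""

    __tablename__ = "externalid"
    associated_id = Column(String, primary_key=True)
    source = Column(String, primary_key=True)
    feed_id = Column(String, primary_key=True)


engine = create_engine("sqlite://")  # in-memory DB for illustration
Base.metadata.create_all(engine)

with Session(engine) as session:
    # The mapping for the old feed stays in place...
    session.add(Externalid(associated_id="ext-1", source="tld", feed_id="old-feed-id"))
    # ...and the update path ADDS a new mapping for the new feed,
    # rather than overwriting existing_external_id.feed_id.
    session.add(Externalid(associated_id="ext-1", source="tld", feed_id="new-feed-id"))
    session.commit()
    n_mappings = (
        session.query(Externalid).filter_by(associated_id="ext-1").count()
    )
```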

        logger.debug("Database transaction rolled back due to error")
        raise

    def process_new_feed(self, payload: FeedPayload) -> None:
Contributor:
[question] what about location information?

try:
    # Create new feed with updated URL
    new_feed_id = str(uuid.uuid4())
    stable_id = f"{payload.source}-{payload.external_id}"
Contributor:
If I'm not mistaken, following the code logic, the updated feed would have the same stable_id as the old version. Is this behaviour OK, given that we use the stable_id as a key in a lot of processes, including the feed page URL in the web UI? @emmambd @davidgamez

Comment on lines +336 to +337
{"feed_id": payload.feed_id, "execution_id": payload.execution_id}
).encode("utf-8")
Contributor:
This is not the message format that the dataset processing topic expects. It should be:

{
    "message": {
        "data": {
            "execution_id": "execution_id",
            "producer_url": "producer_url",
            "feed_stable_id": "feed_stable_id",
            "feed_id": "feed_id",
            "dataset_id": "dataset_id",
            "dataset_hash": "dataset_hash",
            "authentication_type": "authentication_type",
            "authentication_info_url": "authentication_info_url",
            "api_key_parameter_name": "api_key_parameter_name"
        }
    }
}

Refer to functions-python/batch_process_dataset for more information.
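
A hedged sketch of assembling that envelope from the payload fields (the field mapping here is my assumption; functions-python/batch_process_dataset is the authoritative reference for the format):

```python
import json


def build_dataset_batch_message(payload: dict) -> bytes:
    """Wrap feed fields in the envelope the dataset topic expects.

    dataset_id / dataset_hash are None here, since a freshly synced
    feed has no dataset yet; the consumer must tolerate that.
    """
    return json.dumps(
        {
            "message": {
                "data": {
                    "execution_id": payload.get("execution_id"),
                    "producer_url": payload.get("feed_url"),
                    "feed_stable_id": payload.get("stable_id"),
                    "feed_id": payload.get("feed_id"),
                    "dataset_id": None,
                    "dataset_hash": None,
                    "authentication_type": payload.get("type"),
                    "authentication_info_url": payload.get("auth_info_url"),
                    "api_key_parameter_name": payload.get("auth_param_name"),
                }
            }
        }
    ).encode("utf-8")
```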

@cka-y (Contributor) commented Nov 19, 2024:
New feeds and feed updates should have status="wip" so they can be manually validated before becoming public.
