New lists: Big ass change #287

Merged
merged 88 commits into from
Nov 12, 2024

Conversation

perfectly-preserved-pie
Owner

  • Add support for new list formats (just Lease for now)
  • Reverse engineered API for The Agency
  • Implement strict dtypes for Pandas dataframe
  • Normalizing a bunch of column names and dtypes
  • Better way of installing uv in the container

…ve error handling; enhance listing expiration checks for BHHS and The Agency
Add support for new list format, new dataset from 10/21, other fixes
@perfectly-preserved-pie added the enhancement (New feature or request) and fix (Fixing an issue or problem) labels on Nov 12, 2024
@perfectly-preserved-pie linked an issue on Nov 12, 2024 that may be closed by this pull request
@perfectly-preserved-pie merged commit 13c02c0 into master on Nov 12, 2024
3 checks passed
mls_number = str(getattr(row, 'mls_number', ''))

# Check if the listing is expired on BHHS
if 'bhhscalifornia.com' in listing_url:

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string bhhscalifornia.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix AI about 2 months ago

To fix the problem, we should parse the URL and check the hostname instead of using a substring check. This ensures that the check is performed on the actual host part of the URL, preventing bypasses through embedding the allowed host in an unexpected location.

The best way to fix this is to use the urlparse function from the urllib.parse module to extract the hostname from the URL and then check if it matches the allowed host. This change should be made in the remove_inactive_listings function.
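As a rough illustration of the hostname check described above, here is a minimal sketch. The helper name host_matches is hypothetical and is not part of functions/dataframe_utils.py. It also accepts subdomains such as www.bhhscalifornia.com, because an exact comparison against 'bhhscalifornia.com' would skip listing URLs served from a www. host. Whether the dataset's URLs carry that prefix is an assumption; the PR does not show sample URLs.

```python
from urllib.parse import urlparse


def host_matches(url, domain):
    """Return True if the URL's hostname is `domain` or one of its subdomains.

    Hypothetical helper for illustration only; not taken from the PR.
    """
    if not isinstance(url, str):
        # Missing values (e.g. NaN in a DataFrame column) are never a match.
        return False
    hostname = urlparse(url).hostname  # None when the URL has no network location
    if hostname is None:
        return False
    hostname = hostname.rstrip(".")  # tolerate a trailing-dot FQDN
    # Exact match, or a suffix match on a label boundary (".domain"),
    # so "notbhhscalifornia.com" and "bhhscalifornia.com.evil.example" fail.
    return hostname == domain or hostname.endswith("." + domain)
```

A plain substring test would accept a URL like https://evil.example/?next=bhhscalifornia.com, which is what the CodeQL alert warns about; the sketch rejects it because only the parsed hostname is compared.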

Suggested changeset 1
functions/dataframe_utils.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/functions/dataframe_utils.py b/functions/dataframe_utils.py
--- a/functions/dataframe_utils.py
+++ b/functions/dataframe_utils.py
@@ -3,2 +3,3 @@
 from loguru import logger
+from urllib.parse import urlparse
 import asyncio
@@ -29,3 +30,4 @@
         # Check if the listing is expired on BHHS
-        if 'bhhscalifornia.com' in listing_url:
+        parsed_url = urlparse(listing_url)
+        if parsed_url.hostname == 'bhhscalifornia.com':
             is_expired = check_expired_listing_bhhs(listing_url, mls_number)
@@ -35,3 +37,3 @@
         # Check if the listing is expired on The Agency
-        elif 'theagencyre.com' in listing_url:
+        elif parsed_url.hostname == 'theagencyre.com':
             is_sold = check_expired_listing_theagency(listing_url, mls_number)
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
indexes_to_drop.append(row.Index)
logger.success(f"Removed MLS {mls_number} (Index: {row.Index}) from the DataFrame because the listing has expired on BHHS.")
# Check if the listing is expired on The Agency
elif 'theagencyre.com' in listing_url:

Check failure

Code scanning / CodeQL

Incomplete URL substring sanitization High

The string theagencyre.com may be at an arbitrary position in the sanitized URL.

Copilot Autofix AI about 2 months ago

To fix the problem, we need to parse the URL and check the hostname to ensure it matches the expected domain. This can be done using the urlparse function from the urllib.parse module. Specifically, we will:

  1. Parse the listing_url to extract the hostname.
  2. Check if the hostname matches the expected domain (bhhscalifornia.com or theagencyre.com).

This approach ensures that the check is performed on the actual hostname, preventing bypasses through substring manipulation.
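The two steps above could also be arranged so that each URL is parsed once and then routed to a per-site checker. The sketch below is only an assumption about how that might look: find_inactive_indexes, the Row tuple, and the checkers mapping are hypothetical names, and the real check_expired_listing_bhhs and check_expired_listing_theagency functions are stood in for by caller-supplied callables taking (listing_url, mls_number). Note that any such change needs from urllib.parse import urlparse at the top of the module, which the patch for this alert does not add on its own.

```python
from collections import namedtuple
from urllib.parse import urlparse

# Stand-in for the rows yielded by DataFrame.itertuples(); hypothetical shape.
Row = namedtuple("Row", ["Index", "listing_url", "mls_number"])


def find_inactive_indexes(rows, checkers):
    """Return the Index of every row whose site-specific checker reports it gone.

    `checkers` maps a bare domain (no "www.") to a callable
    (listing_url, mls_number) -> bool. Both names are assumptions.
    """
    indexes_to_drop = []
    for row in rows:
        listing_url = str(getattr(row, "listing_url", ""))
        mls_number = str(getattr(row, "mls_number", ""))
        # Step 1: parse the URL and extract the hostname ("" if absent).
        hostname = urlparse(listing_url).hostname or ""
        # Treat "www.example.com" and "example.com" as the same site.
        if hostname.startswith("www."):
            hostname = hostname[len("www."):]
        # Step 2: dispatch on the exact hostname, never on a substring.
        checker = checkers.get(hostname)
        if checker is not None and checker(listing_url, mls_number):
            indexes_to_drop.append(row.Index)
    return indexes_to_drop
```

Rows whose hostname is not in the mapping are left alone, so a URL that merely mentions an allowed domain in its path or query string never reaches a checker.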

Suggested changeset 1
functions/dataframe_utils.py

Autofix patch

Run the following command in your local git repository to apply this patch
cat << 'EOF' | git apply
diff --git a/functions/dataframe_utils.py b/functions/dataframe_utils.py
--- a/functions/dataframe_utils.py
+++ b/functions/dataframe_utils.py
@@ -29,3 +29,4 @@
         # Check if the listing is expired on BHHS
-        if 'bhhscalifornia.com' in listing_url:
+        parsed_url = urlparse(listing_url)
+        if parsed_url.hostname == 'bhhscalifornia.com':
             is_expired = check_expired_listing_bhhs(listing_url, mls_number)
@@ -35,3 +36,3 @@
         # Check if the listing is expired on The Agency
-        elif 'theagencyre.com' in listing_url:
+        elif parsed_url.hostname == 'theagencyre.com':
             is_sold = check_expired_listing_theagency(listing_url, mls_number)
EOF
Copilot is powered by AI and may make mistakes. Always verify output.
Unable to commit as this autofix suggestion is now outdated
Labels
enhancement New feature or request fix Fixing an issue or problem
Projects
None yet
Development

Successfully merging this pull request may close these issues.

New list format - Lease
1 participant