Add support for new list format, new dataset from 10/21, other fixes #285

Merged: 85 commits, Nov 12, 2024
2269821
Bump deps
perfectly-preserved-pie Oct 14, 2024
961219a
Initial commit
perfectly-preserved-pie Oct 28, 2024
a27f3c8
Add a function to webscrape TLA
perfectly-preserved-pie Oct 28, 2024
e7e98a6
Add a function to use The Agency API to get property details
perfectly-preserved-pie Oct 30, 2024
3847346
Fall back to fetching property data from The Agency if it doesn't exi…
perfectly-preserved-pie Oct 30, 2024
72f260f
Refactor fetch_the_agency_data to include row_index and total_rows pa…
perfectly-preserved-pie Oct 30, 2024
fa8b0cd
Fetch the first property listing image from TA
perfectly-preserved-pie Oct 30, 2024
ed34e4e
fetch and transform the first property listing image from The Agency API
perfectly-preserved-pie Oct 30, 2024
0e839e4
Return an additional None
perfectly-preserved-pie Oct 30, 2024
6193a17
Move webscraping logic into its own function
perfectly-preserved-pie Oct 30, 2024
ffbcf59
Add logging
perfectly-preserved-pie Oct 30, 2024
1e16c6b
Rename parking key
perfectly-preserved-pie Oct 30, 2024
987ccbe
Clean up lease dataset. Rename columns to match the new lists'. Drop …
perfectly-preserved-pie Oct 30, 2024
1ea0421
Change column dtypes to the optimal dtype
perfectly-preserved-pie Oct 31, 2024
f47d304
Change pet_policy and laundry and subtype to category dtype
perfectly-preserved-pie Oct 31, 2024
718947b
Cast dtypes of new list to (hopefully) match those of the old dataset
perfectly-preserved-pie Oct 31, 2024
dae6a9b
Rename "Garage Spaces" to "Parking Spaces" in popup.js
perfectly-preserved-pie Nov 1, 2024
c873f51
Convert senior_community into boolean
perfectly-preserved-pie Nov 1, 2024
fc9ba9a
Normalize 'Sqft' column name to 'sqft' in LeaseFilters and BuyFilters
perfectly-preserved-pie Nov 1, 2024
c8dece3
Rename 'garage_spaces' to 'parking_spaces' in LeaseFilters for consis…
perfectly-preserved-pie Nov 1, 2024
4c48ccf
Rename 'PetsAllowed' to 'pet_policy' in LeaseFilters for consistency
perfectly-preserved-pie Nov 1, 2024
754d66e
Normalize 'Furnished' column name to 'furnished' in LeaseFilters for …
perfectly-preserved-pie Nov 1, 2024
b73dc94
Normalize 'DepositSecurity' column name to 'security_deposit' in Leas…
perfectly-preserved-pie Nov 1, 2024
e99ef62
Normalize 'DepositPets' column name to 'pets_deposit' in LeaseFilters…
perfectly-preserved-pie Nov 1, 2024
f0aa22d
Normalize 'DepositKey' column name to 'key_deposit' in LeaseFilters f…
perfectly-preserved-pie Nov 1, 2024
d8dd2f2
Normalize 'DepositOther' column name to 'other_deposit' in LeaseFilte…
perfectly-preserved-pie Nov 1, 2024
e3e18f9
Normalize 'Terms' column name to 'terms' in LeaseFilters for consistency
perfectly-preserved-pie Nov 1, 2024
39f42b7
Normalize 'LaundryCategory' column name to 'laundry' in LeaseFilters …
perfectly-preserved-pie Nov 1, 2024
cba3864
Change latitude and longitude column names
perfectly-preserved-pie Nov 4, 2024
caa5151
Add .venv/ to .gitignore to exclude virtual environment files
perfectly-preserved-pie Nov 4, 2024
b62a828
Change slider parameter types in sqft_radio_button method from float …
perfectly-preserved-pie Nov 4, 2024
71f253a
Normalize 'YrBuilt' column name to 'year_built' in LeaseFilters for c…
perfectly-preserved-pie Nov 4, 2024
d658b0c
Change slider parameter types in ppsqft_radio_button method from floa…
perfectly-preserved-pie Nov 4, 2024
9b62e8d
Refactor map update logic to use new column names
perfectly-preserved-pie Nov 4, 2024
8d50813
Refactor the rest of the lease filters
perfectly-preserved-pie Nov 4, 2024
4393e57
Normalize column names and update references in LeaseComponents for c…
perfectly-preserved-pie Nov 4, 2024
c66186c
Revert buy components to how they were before
perfectly-preserved-pie Nov 5, 2024
7c7a516
Update column names in LeaseComponents and lease_page for consistency
perfectly-preserved-pie Nov 5, 2024
a59d704
Update property details in popup.js to use MLS number, MLS photo, and…
perfectly-preserved-pie Nov 5, 2024
1936ecb
Update popup.js to display total bathrooms instead of individual bath…
perfectly-preserved-pie Nov 5, 2024
bfd76a6
Capitalize "Bedrooms" in the bedrooms slider header
perfectly-preserved-pie Nov 5, 2024
a35e49d
Remove check for 'ppsqft' column in lease_dataframe.py
perfectly-preserved-pie Nov 5, 2024
decac21
Change full street address column to use StringDtype in lease_datafra…
perfectly-preserved-pie Nov 6, 2024
3992052
Refactor zip code retrieval in geocoding_utils.py and update zipcode …
perfectly-preserved-pie Nov 6, 2024
b28a36a
Refactor data type handling in lease_dataframe.py: clean numeric colu…
perfectly-preserved-pie Nov 7, 2024
af4f3b2
Refactor agency data handling in update_dataframe_with_listing_data: …
perfectly-preserved-pie Nov 7, 2024
a73fc34
Improve error handling and logging for JSON parsing in fetch_the_agen…
perfectly-preserved-pie Nov 7, 2024
44b73ee
Enhance JSON parsing and error handling in fetch_the_agency_data: add…
perfectly-preserved-pie Nov 7, 2024
d40c936
Enhance MLS number matching in fetch_the_agency_data: implement fuzzy…
perfectly-preserved-pie Nov 7, 2024
c31955a
Sort imports alphabetically
perfectly-preserved-pie Nov 7, 2024
1dcabb6
Change log level from debug to info for MLS number matching in fetch_…
perfectly-preserved-pie Nov 7, 2024
a2fd701
Fix not being able to find the property image src
perfectly-preserved-pie Nov 7, 2024
eb86ce7
Use regex to find the correct property based on street name
perfectly-preserved-pie Nov 8, 2024
422619c
Refactor fetch_the_agency_data function:
perfectly-preserved-pie Nov 8, 2024
55fab2d
DOCSTRINGS BABY!!!!!
perfectly-preserved-pie Nov 8, 2024
9e20dc6
Minor edit in the docstring
perfectly-preserved-pie Nov 8, 2024
455adaa
Minor edit again dammit
perfectly-preserved-pie Nov 8, 2024
7f86b38
Refactor logging in fetch_the_agency_data and update_dataframe_with_l…
perfectly-preserved-pie Nov 8, 2024
d013b8a
Update docstring for fetch agency data
perfectly-preserved-pie Nov 8, 2024
21741da
Remove dead comments
perfectly-preserved-pie Nov 8, 2024
17b86e1
Preliminary function to remove expired listings on The Agency
perfectly-preserved-pie Nov 9, 2024
d83863a
Sort imports alphabetically
perfectly-preserved-pie Nov 9, 2024
0d72883
Consolidate expired listings check into one function
perfectly-preserved-pie Nov 9, 2024
b55a18e
Remove unneeded imports
perfectly-preserved-pie Nov 9, 2024
88231d5
Move The Agency removal check to webscraping_utils and change the log…
perfectly-preserved-pie Nov 9, 2024
d53892e
Remove unneeded imports
perfectly-preserved-pie Nov 9, 2024
ab2c1c6
Ensure listing_url and mls_number are strings in remove_inactive_list…
perfectly-preserved-pie Nov 12, 2024
0c1861b
Refactor check_expired_listing_bhhs to use synchronous requests and i…
perfectly-preserved-pie Nov 12, 2024
42411cc
Fix wrong expired listing check message on BHHS
perfectly-preserved-pie Nov 12, 2024
74ce223
Use specific domains for removing inactive listings
perfectly-preserved-pie Nov 12, 2024
75f1f6a
Refactor web scraping functions to use synchronous requests and impro…
perfectly-preserved-pie Nov 12, 2024
935910d
Fix categorize_laundry_features to handle NaN values using pd.isna
perfectly-preserved-pie Nov 12, 2024
3805db8
New lease dataset for 10/21
perfectly-preserved-pie Nov 12, 2024
5a2e118
Enhance rental terms checklist to handle 'Unknown' category and impro…
perfectly-preserved-pie Nov 12, 2024
2949f19
Drop old/redundant columns
perfectly-preserved-pie Nov 12, 2024
48abd93
Remove redundant bedrooms_bathrooms field from lease map data
perfectly-preserved-pie Nov 12, 2024
c14ae1d
Changing column dtypes
perfectly-preserved-pie Nov 12, 2024
ded1f13
Drop a row with wild ass bedrooms/bathrooms. Fuck this i'm not dealin…
perfectly-preserved-pie Nov 12, 2024
b30e34e
Drop another row with fucked up sqft
perfectly-preserved-pie Nov 12, 2024
6422e7c
Remove trailing .0 in zipcode
perfectly-preserved-pie Nov 12, 2024
2ea2313
Remove trailing .0 in full_street_address
perfectly-preserved-pie Nov 12, 2024
4dea27a
Fix missing city and zipcode
perfectly-preserved-pie Nov 12, 2024
339618c
fix city
perfectly-preserved-pie Nov 12, 2024
faebf07
Cast 'sqft' to UInt32 and update numeric columns to use UInt16Dtype
perfectly-preserved-pie Nov 12, 2024
64a86a8
Copy missing values from their old column counterpart, set dtypes, ma…
perfectly-preserved-pie Nov 12, 2024
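
Two of the commits above (6422e7c and 2ea2313) remove a trailing `.0` from string values such as zip codes. That artifact usually appears when a numeric column is read as float and later cast to string. The following is a minimal sketch of that kind of cleanup on plain Python strings, not the project's actual DataFrame code; `strip_trailing_dot_zero` is a hypothetical helper name chosen for illustration.

```python
import re


def strip_trailing_dot_zero(value: str) -> str:
    """Remove one trailing '.0' left over from a float-to-string cast.

    Hypothetical helper: the PR's real cleanup operates on DataFrame columns.
    """
    # Trim surrounding whitespace first so the anchor matches the real end.
    return re.sub(r"\.0$", "", value.strip())


zipcodes = ["90210.0", "90012", " 91001.0 "]
cleaned = [strip_trailing_dot_zero(z) for z in zipcodes]
print(cleaned)
```

Anchoring the pattern at the end of the string leaves values such as `10.05` untouched, so only the cast artifact is removed.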
1 change: 1 addition & 0 deletions .gitignore
@@ -5,6 +5,7 @@ __pycache__/larentals.cpython-310.pyc
 *.csv
 *.pyc
 *.xlsx
+.venv/
 env
 hdf
 larentals-checkpoint.py
Binary file modified assets/datasets/lease.parquet
Binary file not shown.
Binary file not shown.
14 changes: 7 additions & 7 deletions assets/javascript/popup.js
@@ -22,7 +22,7 @@ window.dash_props = Object.assign({}, window.dash_props, {
 return `
 <tr>
 <th style="text-align:left;padding:8px;border-bottom:1px solid #ddd;">Listing ID (MLS#)</th>
-<td style="padding:8px;border-bottom:1px solid #ddd;">Not Available</td>
+<td style="padding:8px;border-bottom:1px solid #ddd;">${data.mls_number}</td>
 </tr>
 `;
 }
@@ -47,9 +47,9 @@ window.dash_props = Object.assign({}, window.dash_props, {
 const listingUrlBlock = getListingUrlBlock(data);

 // Conditionally include the property image row if the image URL is available
-const imageRow = data.image_url ? `
+const imageRow = data.mls_photo ? `
 <a href="${data.listing_url}" target="_blank" referrerPolicy="noreferrer">
-<img src="${data.image_url}" alt="Property Image" style="width:100%;height:auto;">
+<img src="${data.mls_photo}" alt="Property Image" style="width:100%;height:auto;">
 </a>
 ` : '';

@@ -64,7 +64,7 @@ window.dash_props = Object.assign({}, window.dash_props, {
 <div>
 ${imageRow}
 <div style="text-align: center;">
-<h5>${data.address}</h5>
+<h5>${data.full_street_address}</h5>
 </div>
 <table style="width:100%;border-collapse:collapse;">
 <tr>
@@ -106,11 +106,11 @@ window.dash_props = Object.assign({}, window.dash_props, {
 </tr>
 <tr>
 <th style="text-align:left;padding:8px;border-bottom:1px solid #ddd;">Bedrooms/Bathrooms</th>
-<td style="padding:8px;border-bottom:1px solid #ddd;">${data.bedrooms}/${data.bathrooms}</td>
+<td style="padding:8px;border-bottom:1px solid #ddd;">${data.bedrooms}/${data.total_bathrooms}</td>
 </tr>
 <tr>
-<th style="text-align:left;padding:8px;border-bottom:1px solid #ddd;">Garage Spaces</th>
-<td style="padding:8px;border-bottom:1px solid #ddd;">${data.garage_spaces || "Unknown"}</td>
+<th style="text-align:left;padding:8px;border-bottom:1px solid #ddd;">Parking Spaces</th>
+<td style="padding:8px;border-bottom:1px solid #ddd;">${data.parking_spaces || "Unknown"}</td>
 </tr>
 <tr>
 <th style="text-align:left;padding:8px;border-bottom:1px solid #ddd;">Pets Allowed?</th>
123 changes: 91 additions & 32 deletions functions/dataframe_utils.py
@@ -1,5 +1,5 @@
-from aiolimiter import AsyncLimiter
-from functions.webscraping_utils import check_expired_listing
+from functions.mls_image_processing_utils import imagekit_transform
+from functions.webscraping_utils import check_expired_listing_bhhs, check_expired_listing_theagency, webscrape_bhhs, fetch_the_agency_data
 from loguru import logger
 import asyncio
 import pandas as pd
@@ -8,40 +8,99 @@
 # Initialize logging
 logger.add(sys.stderr, format="{time} {level} {message}", filter="my_module", level="INFO")

-async def remove_expired_listings(df: pd.DataFrame, limiter: AsyncLimiter) -> pd.DataFrame:
+def remove_inactive_listings(df: pd.DataFrame) -> pd.DataFrame:
     """
-    Asynchronously checks each listing URL in the DataFrame to determine if it has expired,
-    and removes rows with expired listings, applying rate limiting. Also counts the number of expired listings removed.
+    Checks each listing to determine if it has expired or been sold, and removes inactive listings.
+    If 'bhhs' is in the 'listing_url', it checks for expired listings.
+    If 'idcrealestate' is in the 'listing_url', it checks for sold listings.

     Parameters:
     df (pd.DataFrame): The DataFrame containing listing URLs and MLS numbers.
-    limiter (AsyncLimiter): The rate limiter to control request frequency.

     Returns:
-    pd.DataFrame: The DataFrame with expired listings removed.
+    pd.DataFrame: The DataFrame with inactive listings removed.
     """
-    async def check_and_mark_expired(row):
-        async with limiter:
-            expired = await check_expired_listing(row.listing_url, row.mls_number)
-            return (row.Index, expired)
-
-    # Gather tasks for all rows that need to be checked
-    tasks = [check_and_mark_expired(row) for row in df[df.listing_url.notnull()].itertuples()]
-    results = await asyncio.gather(*tasks)
-
-    # Determine indexes of rows to drop (where listing has expired)
-    indexes_to_drop = [index for index, expired in results if expired]
-
-    # Counter for expired listings
-    expired_count = len(indexes_to_drop)
-
-    # Log success messages for dropped listings and the count of expired listings
-    for index in indexes_to_drop:
-        mls_number = df.loc[index, 'mls_number']
-        logger.success(f"Removed {mls_number} (Index: {index}) from the dataframe because the listing has expired.")
-
-    logger.info(f"Total expired listings removed: {expired_count}")
-
-    # Drop the rows from the DataFrame and return the modified DataFrame
-    df_dropped_expired = df.drop(indexes_to_drop)
-    return df_dropped_expired
+    indexes_to_drop = []
+
+    for row in df.itertuples():
+        listing_url = str(getattr(row, 'listing_url', ''))
+        mls_number = str(getattr(row, 'mls_number', ''))
+
+        # Check if the listing is expired on BHHS
+        if 'bhhscalifornia.com' in listing_url:
+            is_expired = check_expired_listing_bhhs(listing_url, mls_number)
+            if is_expired:
+                indexes_to_drop.append(row.Index)
+                logger.success(f"Removed MLS {mls_number} (Index: {row.Index}) from the DataFrame because the listing has expired on BHHS.")
+        # Check if the listing is expired on The Agency
+        elif 'theagencyre.com' in listing_url:
+            is_sold = check_expired_listing_theagency(listing_url, mls_number)
+            if is_sold:
+                indexes_to_drop.append(row.Index)
+                logger.success(f"Removed MLS {mls_number} (Index: {row.Index}) from the DataFrame because the listing has expired on The Agency.")
+
+    inactive_count = len(indexes_to_drop)
+    logger.info(f"Total inactive listings removed: {inactive_count}")
+
+    df_active = df.drop(indexes_to_drop)
+    return df_active.reset_index(drop=True)
+
+def update_dataframe_with_listing_data(
+    df: pd.DataFrame, imagekit_instance
+) -> pd.DataFrame:
+    """
+    Updates the DataFrame with listing date, MLS photo, and listing URL by scraping BHHS and using The Agency's API.
+
+    Parameters:
+    df (pd.DataFrame): The DataFrame to update.
+    imagekit_instance: The ImageKit instance for image transformations.
+
+    Returns:
+    pd.DataFrame: The updated DataFrame.
+    """
+    for row in df.itertuples():
+        mls_number = row.mls_number
+        try:
+            webscrape = webscrape_bhhs(
+                url=f"https://www.bhhscalifornia.com/for-lease/{mls_number}-t_q;/",
+                row_index=row.Index,
+                mls_number=mls_number,
+                total_rows=len(df)
+            )
+
+            if not all(webscrape):
+                logger.warning(f"BHHS did not return complete data for MLS {mls_number}. Trying The Agency.")
+                agency_data = fetch_the_agency_data(
+                    mls_number,
+                    row_index=row.Index,
+                    total_rows=len(df),
+                    full_street_address=row.full_street_address
+                )
+
+                if agency_data and any(agency_data):
+                    listed_date, listing_url, mls_photo = agency_data
+                    if listed_date:
+                        df.at[row.Index, 'listed_date'] = listed_date
+                    if listing_url:
+                        df.at[row.Index, 'listing_url'] = listing_url
+                    if mls_photo:
+                        df.at[row.Index, 'mls_photo'] = imagekit_transform(
+                            mls_photo,
+                            mls_number,
+                            imagekit_instance=imagekit_instance
+                        )
+                    else:
+                        logger.warning(f"No photo URL found for MLS {mls_number} from The Agency.")
+                else:
+                    pass
+            else:
+                df.at[row.Index, 'listed_date'] = webscrape[0]
+                df.at[row.Index, 'mls_photo'] = imagekit_transform(
+                    webscrape[1],
+                    mls_number,
+                    imagekit_instance=imagekit_instance
+                )
+                df.at[row.Index, 'listing_url'] = webscrape[2]
+        except Exception as e:
+            logger.error(f"Error processing MLS {mls_number} at index {row.Index}: {e}")
+    return df
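
The new `remove_inactive_listings` picks a site-specific check based on which domain appears in `listing_url`. The sketch below illustrates that dispatch pattern on plain dictionaries instead of a DataFrame. It is an illustration under stated assumptions, not the PR's implementation: `fake_check_bhhs`, `fake_check_theagency`, `CHECKERS`, `remove_inactive`, and the sample URLs are all hypothetical, and the stand-in checkers only inspect the URL so the example runs offline (the real checkers make HTTP requests).

```python
from typing import Callable, Dict, List


# Hypothetical stand-ins for check_expired_listing_bhhs and
# check_expired_listing_theagency. They only look at the URL text.
def fake_check_bhhs(listing_url: str, mls_number: str) -> bool:
    return "expired" in listing_url


def fake_check_theagency(listing_url: str, mls_number: str) -> bool:
    return "sold" in listing_url


# Map each domain substring to the checker that understands that site.
CHECKERS: Dict[str, Callable[[str, str], bool]] = {
    "bhhscalifornia.com": fake_check_bhhs,
    "theagencyre.com": fake_check_theagency,
}


def remove_inactive(listings: List[dict]) -> List[dict]:
    """Keep listings that no site-specific checker flags as inactive."""
    active = []
    for listing in listings:
        # Coerce to str, as the PR does, so missing values cannot raise.
        url = str(listing.get("listing_url", ""))
        mls = str(listing.get("mls_number", ""))
        # Use the first checker whose domain appears in the URL, if any.
        checker = next((fn for domain, fn in CHECKERS.items() if domain in url), None)
        if checker is not None and checker(url, mls):
            continue  # flagged inactive: drop it
        active.append(listing)
    return active


listings = [
    {"mls_number": "A1", "listing_url": "https://www.bhhscalifornia.com/listing/A1"},
    {"mls_number": "B2", "listing_url": "https://www.bhhscalifornia.com/listing/B2-expired"},
    {"mls_number": "C3", "listing_url": "https://www.theagencyre.com/listing/C3"},
    {"mls_number": "D4", "listing_url": "https://example.com/listing/D4-sold"},
    {"mls_number": "E5", "listing_url": "https://www.theagencyre.com/listing/E5-sold"},
]
active = remove_inactive(listings)
print([item["mls_number"] for item in active])
```

Listings from domains without a registered checker are kept, which matches the `if`/`elif` structure in the diff: rows that match neither domain are never added to `indexes_to_drop`.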
50 changes: 25 additions & 25 deletions functions/geocoding_utils.py
@@ -66,39 +66,39 @@ def fetch_missing_city(address: str, geolocator: GoogleV3) -> Optional[str]:

return city

def return_postalcode(address: str, geolocator: GoogleV3) -> Optional[Union[int, type(pd.NA)]]:
def return_zip_code(address: str, geolocator: GoogleV3) -> Optional[str]:
"""
Fetches the postal code for a given short address using forward and reverse geocoding.
Fetches the postal code for a given address using geocoding.

Parameters:
address (str): The short address.
geolocator (GoogleV3): An instance of a GoogleV3 geocoding class.
address (str): The full street address.
geolocator (GoogleV3): An instance of the GoogleV3 geocoding class.

Returns:
Optional[Union[int, type(pd.NA)]]: The postal code as an integer, or pd.NA if unsuccessful.
Optional[str]: The postal code as a string, or None if unsuccessful.
"""
# Initialize postalcode variable
postalcode = None

try:
geocode_info = geolocator.geocode(address, components={'administrative_area': 'CA', 'country': 'US'})
components = geolocator.geocode(f"{geocode_info.latitude}, {geocode_info.longitude}").raw['address_components']

# Create a dataframe from the list of dictionaries
components_df = pd.DataFrame(components)

# Iterate through rows to find the postal code
for row in components_df.itertuples():
if row.types == ['postal_code']:
postalcode = int(row.long_name)

logger.info(f"Fetched postal code {postalcode} for {address}.")
except AttributeError:
logger.warning(f"Geocoding returned no results for {address}.")
return pd.NA
geocode_info = geolocator.geocode(
address, components={'administrative_area': 'CA', 'country': 'US'}
)
if geocode_info:
raw = geocode_info.raw['address_components']
# Find the 'postal_code'
postalcode = next(
(addr['long_name'] for addr in raw if 'postal_code' in addr['types']),
None
)
if postalcode:
logger.info(f"Fetched zip code ({postalcode}) for {address}.")
else:
logger.warning(f"No postal code found in geocoding results for {address}.")
else:
logger.warning(f"Geocoding returned no results for {address}.")
except Exception as e:
logger.warning(f"Couldn't fetch postal code for {address} because {e}.")
return pd.NA
logger.warning(f"Couldn't fetch zip code for {address} because of {e}.")
postalcode = None

return postalcode
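
The rewritten `return_zip_code` reads the postal code from the `address_components` list of a single geocoding result and returns it as a string. The sketch below isolates that lookup so it can run without a network call. It is a minimal illustration, assuming components shaped like the Google Geocoding API's `address_components` entries; `extract_postal_code` and `sample_components` are hypothetical names and sample data, not part of the PR.

```python
from typing import Any, Dict, List, Optional


def extract_postal_code(address_components: List[Dict[str, Any]]) -> Optional[str]:
    """Return the long_name of the first component typed 'postal_code', else None."""
    return next(
        (
            component["long_name"]
            for component in address_components
            # List membership is an exact match, so 'postal_code_suffix' is skipped.
            if "postal_code" in component.get("types", [])
        ),
        None,
    )


# Hypothetical sample data in the shape of a geocoding result.
sample_components = [
    {"long_name": "Los Angeles", "short_name": "LA", "types": ["locality", "political"]},
    {"long_name": "1234", "short_name": "1234", "types": ["postal_code_suffix"]},
    {"long_name": "90012", "short_name": "90012", "types": ["postal_code"]},
]

print(extract_postal_code(sample_components))
print(extract_postal_code([]))
```

Using `next` with a default of `None` avoids building a temporary DataFrame just to scan a short list, and keeping the value as a string means no `int()` cast can fail on an unexpected format.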
