Write out megastock table with sample size tag (#31)
Co-authored-by: mikivee <mikivee>
mikivee authored Feb 11, 2025
1 parent 58cc254 commit 0763726
Showing 3 changed files with 15 additions and 19 deletions.
8 changes: 3 additions & 5 deletions scripts/megastock/README.md
@@ -8,7 +8,7 @@ A. Generate resstock building samples using the resstock repo.
- See the [resstock github repo](https://github.com/NREL/resstock/tree/develop?tab=readme-ov-file), and the [relevant documentation](https://resstock.readthedocs.io/en/latest/basic_tutorial/architecture.html#sampling).
- Follow their installation instructions -- you'll have to install OpenStudio and the appropriate Ruby version to match what is defined in the resstock repo. They use [rbenv](https://github.com/rbenv/rbenv#readme) to manage Ruby versions.
- generate building metadata CSV files using their sampling script
- Sampled files using v3.3.0 are currently on GCS at `the-cube/data/processed/sampling_resstock/resstock_v3.3.0`. There are files corresponding to multiple sample sizes including N=10k, 1M, 2M, and 5M.
- Sampled files using v3.3.0 are currently on GCS at `the-cube/data/processed/sampling_resstock/resstock_v3.3.0`. There are files corresponding to multiple sample sizes including N=10k, 1M, 2M, 5M, 10M, 15M, 20M.
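As a rough illustration of how these sampled CSVs are consumed downstream (a hedged sketch, not code from the repo; the file name pattern under the version folder is assumed):

```python
# Sketch only: loading one of the sampled building metadata CSVs from GCS
# in a Databricks notebook. The `buildstock_{n_sample_tag}.csv` file name
# pattern is a hypothetical example, not the actual layout.
n_sample_tag = "5M"  # matches the `n_sample_tag` job parameter used below

buildstock_path = (
    "gs://the-cube/data/processed/sampling_resstock/resstock_v3.3.0/"
    f"buildstock_{n_sample_tag}.csv"
)
building_samples = spark.read.csv(buildstock_path, header=True, inferSchema=True)
print(building_samples.count())
```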

B. Run the [MegaStock Job](https://4617764665359845.5.gcp.databricks.com/jobs/724743198057405?o=4617764665359845) with the job parameter `n_sample_tag` set to the sample size suffix of the CSV from step 1 (e.g., '5M'). This will perform the following (a short parameter-handling sketch follows the list):

@@ -18,10 +18,8 @@
2. Run `feature_extract_02`, referencing appropriate file names based on the job parameter. This includes functions/code that:
- transform building features and add upgrades and weather city
- write out building metadata and upgrades to the feature store
3. Run `write_databricks_to_bigquery_03`, referencing appropriate file names based on the job parameter. The code will write out two tables to BQ, *which will overwrite the current tables with whichever sample size was chosen*.
- `cube-machine-learning.ds_api_datasets.megastock_metadata`
- `cube-machine-learning.ds_api_datasets.megastock_features`

3. Run `write_databricks_to_bigquery_03`, referencing appropriate file names based on the job parameter. The code will write the following table to BQ:
- `cube-machine-learning.ds_api_datasets.megastock_combined_baseline_{n_sample_tag}`
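As referenced above, here is a minimal sketch of how the `n_sample_tag` job parameter is assumed to flow into the table names used in steps 2 and 3 (the widget name and wiring are illustrative, not the job's exact code):

```python
# Sketch only: Databricks job parameters are typically exposed via dbutils widgets.
n_sample_tag = dbutils.widgets.get("n_sample_tag")  # e.g. "5M"

# Feature store table written by feature_extract_02 (step 2)
feature_table = f"ml.megastock.building_features_{n_sample_tag}"

# BigQuery table written by write_databricks_to_bigquery_03 (step 3)
bq_table = (
    "cube-machine-learning.ds_api_datasets."
    f"megastock_combined_baseline_{n_sample_tag}"
)
```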

## Useful info
- [Reference figma diagram](https://www.figma.com/board/HbgKjS4P6tHGDLmz84fxTK/SuMo%2FDoyho?node-id=9-429&node-type=section&t=UCFHhbgvIyBZKoQM-0)
18 changes: 8 additions & 10 deletions scripts/megastock/feature_extract_02.py
@@ -84,13 +84,11 @@
# DBTITLE 1,Write out building metadata feature store
table_name = f"ml.megastock.building_features_{N_SAMPLE_TAG}"
df = building_metadata_upgrades
if spark.catalog.tableExists(table_name):
    fe.write_table(name=table_name, df=df, mode="merge")
else:
    fe.create_table(
        name=table_name,
        primary_keys=["building_id", "upgrade_id", "weather_file_city"],
        df=df,
        schema=df.schema,
        description="megastock building metadata features",
    )
spark.sql(f"DROP TABLE IF EXISTS {table_name}")
fe.create_table(
name=table_name,
primary_keys=["building_id", "upgrade_id", "weather_file_city"],
df=df,
schema=df.schema,
description="megastock building metadata features",
)
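For context, the snippet above assumes `fe` is a Databricks Feature Engineering client and `building_metadata_upgrades` is the transformed Spark DataFrame built earlier in the notebook; a hedged sketch of that setup (not shown in this diff) could look like:

```python
from databricks.feature_engineering import FeatureEngineeringClient

# Client used by fe.create_table above (assumed setup, not shown in the diff).
fe = FeatureEngineeringClient()

# `building_metadata_upgrades` is assumed to be the Spark DataFrame of
# transformed building features joined with upgrades and weather city,
# keyed by (building_id, upgrade_id, weather_file_city).
```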
8 changes: 4 additions & 4 deletions scripts/megastock/write_databricks_to_bigquery_03.py
Expand Up @@ -16,7 +16,7 @@
# MAGIC - `ml.megastock.building_features_{n_sample_tag}`
# MAGIC
# MAGIC ## Outputs: tables on BigQuery
# MAGIC - `cube-machine-learning.ds_api_datasets.megastock_combined_baseline`
# MAGIC - `cube-machine-learning.ds_api_datasets.megastock_combined_baseline_{n_sample_tag}`
# MAGIC

# COMMAND ----------
@@ -62,7 +62,7 @@
# set up paths to write to
bq_project = "cube-machine-learning"
bq_dataset = "ds_api_datasets"
bq_megastock_table = 'megastock_combined_baseline'
bq_megastock_table = f'megastock_combined_baseline_{N_SAMPLE_TAG}'
bq_write_path = f"{bq_project}.{bq_dataset}.{bq_megastock_table}"

# COMMAND ----------
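The actual write step is below the fold in this diff; as a rough, hedged sketch, a Spark DataFrame is typically written from Databricks to that BigQuery path with the spark-bigquery connector along these lines (the DataFrame name and staging bucket are assumptions):

```python
# Sketch only: writing the combined-baseline DataFrame to the BigQuery table
# path built above. `combined_baseline_df` and the staging bucket are
# hypothetical names, not the repo's actual code.
(
    combined_baseline_df.write.format("bigquery")
    .option("table", bq_write_path)
    .option("temporaryGcsBucket", "the-cube-temp-staging")
    .mode("overwrite")
    .save()
)
```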
@@ -110,7 +110,7 @@
# optimize the table by partitioning and clustering
query = f"""
CREATE TABLE `{bq_write_path}_optimized`
PARTITION BY RANGE_BUCKET(climate_zone_int__m, GENERATE_ARRAY(1, {len(climate_zone_mapping)+1}, 1))
PARTITION BY RANGE_BUCKET(climate_zone_int__m, GENERATE_ARRAY(1, {len(CLIMATE_ZONE_TO_INDEX)+1}, 1))
CLUSTER BY heating_fuel__m, geometry_building_type_acs__m, geometry_floor_area__m, vintage__m AS
SELECT *,
FROM `{bq_write_path}`
@@ -144,7 +144,7 @@
rows = query_job.result() # Waits for query to finish

for row in rows:
    print(row) #3,234,218
    print(row)

# COMMAND ----------
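For reference, `query_job` above is assumed to come from the google-cloud-bigquery client; here is a minimal sketch of running the optimization DDL and then checking the optimized table's row count (the count query itself is illustrative):

```python
from google.cloud import bigquery

# Sketch only: run the CREATE TABLE ... PARTITION BY / CLUSTER BY statement
# above, then sanity-check the row count of the optimized table.
client = bigquery.Client(project=bq_project)

client.query(query).result()  # run the DDL and wait for it to finish

count_query = f"SELECT COUNT(*) AS n FROM `{bq_write_path}_optimized`"
query_job = client.query(count_query)
rows = query_job.result()  # Waits for query to finish

for row in rows:
    print(row.n)
```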

