Consider more lightweight DB system #268

Open
lyricnz opened this issue Sep 21, 2023 · 4 comments
Labels: enhancement (New feature or request)

Comments

lyricnz commented Sep 21, 2023

Following on from the work in #255

Ideas for reducing the size of the DB. Perhaps it's time to consider:

  • using something more lightweight than PostgreSQL: maybe SQLite, compressed and read-only (see the sketch after this list)
  • maybe storing the DB in the filesystem rather than in the Docker image
  • preprocessing the DB into a format more specific to the application (there is only one main query)
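A rough sketch of what the application side could look like against a read-only SQLite file (table and column names as in the experiment below; test.db is just a placeholder name):

import sqlite3

# Open the pre-built DB read-only via a URI; the app never writes to it at runtime.
conn = sqlite3.connect("file:test.db?mode=ro", uri=True)

def lookup(locality_name: str, state: str):
    # The single main query: all addresses for a locality/state pair.
    cur = conn.execute(
        "SELECT gnaf_pid, address, postcode, latitude, longitude "
        "FROM address_principals WHERE locality_name = ? AND state = ?",
        (locality_name, state),
    )
    return cur.fetchall()

print(len(lookup("SOMERVILLE", "VIC")))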
lyricnz commented Sep 21, 2023

Notes (just experimenting):

  • gnaf-202308.dmp 2.1G
  • address_principals.csv.gz 306M
  • address_principals.csv 1.1GB
  • current DB image 3.01GB
  • test.db DB per below (no index) 1.1GB
  • test.db DB per below (with index) 1.4GB
  • gzip test.db with index 507M (see the decompress sketch after the SQL session below)
sqlite3 test.db

-- table holding just the columns the application actually uses
CREATE TABLE address_principals
(
  gnaf_pid text NOT NULL,
  address text NOT NULL,
  locality_name text NOT NULL,
  postcode text NULL,
  state text NOT NULL,
  latitude numeric(10,8) NOT NULL,
  longitude numeric(11,8) NOT NULL
);

-- bulk-load the CSV export
sqlite> .mode csv
sqlite> .import address_principals.csv address_principals

-- index for the one main query (lookup by locality + state)
CREATE INDEX address_name_state ON address_principals(locality_name, state);

SELECT gnaf_pid, address, postcode, latitude, longitude
FROM address_principals
WHERE locality_name = 'SOMERVILLE' AND state = 'VIC';
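A gzipped test.db can't be opened by SQLite directly, so shipping the 507M file would mean decompressing it once at startup, roughly like this (paths are placeholders):

import gzip
import os
import shutil

DB_GZ = "/app/data/test.db.gz"   # placeholder: compressed DB shipped with the image
DB_PATH = "/tmp/test.db"         # placeholder: decompressed working copy

# Decompress once at startup; later runs reuse the existing file.
if not os.path.exists(DB_PATH):
    with gzip.open(DB_GZ, "rb") as src, open(DB_PATH, "wb") as dst:
        shutil.copyfileobj(src, dst)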

lyricnz commented Sep 22, 2023

Related idea - host the latest version of the DB as a static+queryable source.
Per https://phiresky.github.io/blog/2021/hosting-sqlite-databases-on-github-pages/
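For context, that approach serves the raw .db file as a static asset and the client-side library (sql.js-httpvfs) fetches only the pages it needs via HTTP range requests. A tiny illustration of the underlying mechanism (the URL is hypothetical):

import requests

URL = "https://example.github.io/data/test.db"  # hypothetical static host

# Fetch just the 100-byte SQLite header rather than the whole ~1.4GB file;
# the httpvfs client does the same thing page by page as queries run.
resp = requests.get(URL, headers={"Range": "bytes=0-99"})
print(resp.status_code)    # 206 Partial Content when range requests are supported
print(resp.content[:16])   # b'SQLite format 3\x00'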

lyricnz commented Sep 25, 2023

Possibly a pandas DataFrame saved as a Parquet file? That is a compressed and queryable format.

import pandas as pd

# Load the CSV export and write it straight out as Parquet.
csv_file = "..../extra/db/address_principals.csv"
df = pd.read_csv(csv_file)
df.to_parquet("rows.parquet")  # 484MB

# Read back only the columns the app needs, with the one main query pushed down
# as filters (the two tuples are ANDed together).
columns = ['gnaf_pid', 'address', 'postcode', 'latitude', 'longitude']
filters = [
    ('locality_name', '==', 'BLI BLI'),
    ('state', '==', 'QLD'),
]
df2 = pd.read_parquet("rows.parquet", columns=columns, filters=filters)
print(df2.head())
print(len(df2))

emits

2023-09-25 11:37:43,566 INFO     start
         gnaf_pid                       address  ...   latitude   longitude
0  GAQLD720868370              19 WILLOW STREET  ... -26.616855  153.031108
1  GAQLD720861910            21 HILLGROVE COURT  ... -26.618337  153.030099
2  GAQLD720864676               8 WILLOW STREET  ... -26.617568  153.030009
3  GAQLD719559927  UNIT 1, 134 KINGFISHER DRIVE  ... -26.601149  153.015579
4  GAQLD720857697             4 HILLGROVE COURT  ... -26.617756  153.030490

[5 rows x 5 columns]
4844
2023-09-25 11:37:45,287 INFO     elapsed: 0 days 00:00:01.718875
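Untested variation on the above: sorting by the filter columns before writing should cluster matching rows into fewer row groups, and zstd compression might shrink the file further (the compression choice and row_group_size here are guesses, not measured):

import pandas as pd

df = pd.read_csv("..../extra/db/address_principals.csv")

# Sort by the columns used in the one main query so a filtered read touches few row groups.
df = df.sort_values(['state', 'locality_name'])
df.to_parquet("rows_sorted.parquet", compression="zstd", row_group_size=200_000)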

lyricnz commented Sep 25, 2023

We can't use Git LFS on GitHub, because the free tier only provides 1GB of storage and 1GB/month of bandwidth. It's $5/month for an extra 50GB+50GB, but if GitHub Actions traffic counts against that, it would be used up within a day or two.

https://docs.github.com/en/repositories/working-with-files/managing-large-files/about-storage-and-bandwidth-usage
