Improve BRAC model performance #82
Current performance running DBT with the models defined here:
As you can see in the logs above, the insertions into two of the three incremental tables take the longest time. The query plans for those two queries are below.
couchdb
data_record
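For reference, this is roughly the shape of an incremental base model of the kind being discussed; it is a minimal sketch with assumed source, column, and document-type names, not the exact model in this repo. Note the subquery against the target table in the incremental filter, which is what shows up in the slow query plans.

```sql
-- models/data_record.sql -- minimal sketch, names are assumptions
{{ config(materialized='incremental', unique_key='uuid') }}

select
  doc ->> '_id'   as uuid,
  saved_timestamp,
  doc
from {{ source('couchdb', 'couchdb') }}
where doc ->> 'type' = 'data_record'
{% if is_incremental() %}
  -- this subquery is evaluated against the growing target table on every run
  and saved_timestamp > (select max(saved_timestamp) from {{ this }})
{% endif %}
```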
For additional context, below are the logs for the dbt runs immediately before and after the one posted above.
After:
I honestly don't know. I have tried rewriting the incremental query with no luck.
It's because of the block range index (BRIN) for some reason, but if I try with a btree index, it uses that and the query takes <1 ms.
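For illustration, swapping the BRIN index for a btree one would look roughly like this; the table, column, and index names here are assumptions, not the actual cht-sync schema.

```sql
-- The existing block range index the planner was ignoring (name assumed):
-- CREATE INDEX couchdb_saved_timestamp_brin ON couchdb USING brin (saved_timestamp);

-- Replace it with a btree index, which the planner does pick up:
DROP INDEX IF EXISTS couchdb_saved_timestamp_brin;
CREATE INDEX couchdb_saved_timestamp_btree ON couchdb USING btree (saved_timestamp);
```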
I'm going to test this locally to try to figure out why the indexes are not working. Which branch of cht-sync are you using? Do you create indexes also for …?
Also, on the couchdb model, did you try the explain plan using … ?
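In case it helps, a typical way to capture a plan with timing and buffer statistics is shown below; the query is only a stand-in for the actual incremental insert, and the table and column names are assumptions.

```sql
-- Stand-in query; the real model SQL and names may differ
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM couchdb
WHERE saved_timestamp > (SELECT max(saved_timestamp) FROM data_record);
```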
I tested this locally.
Using the following .env:
Connected to the Postgres database and ran the following:
The output:
For data_record:
The output:
If I understand correctly, this is using the indexes.
The time on my local instance with approximately 4 million docs in the db:
@witash changing the type of index does indeed help. The query plan for data_record below now makes use of the index.
The incremental logic can potentially be improved by further filtering by the …
Creating partitioned tables in DBT is not straightforward and there is an open PR to add partitioning support. There seems to be a workaround by creating a function in DBT and having that create the partition tables, but I have not yet gotten that to work. For now I have removed the couchdb model and created a new contact model to try and reduce the number of duplicates we store in the base tables.
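Outside of dbt, the plain Postgres DDL for the kind of range partitioning being discussed would look roughly like this (table names, column names, and range bounds are assumptions); the open question is how to get dbt to manage it.

```sql
-- Declarative range partitioning on the timestamp column (names assumed)
CREATE TABLE couchdb_partitioned (
  _id             TEXT NOT NULL,
  saved_timestamp TIMESTAMPTZ NOT NULL,
  doc             JSONB
) PARTITION BY RANGE (saved_timestamp);

-- One partition per quarter, created up front or by a helper function
CREATE TABLE couchdb_2024_q1 PARTITION OF couchdb_partitioned
  FOR VALUES FROM ('2024-01-01') TO ('2024-04-01');
```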
After changing the root table to have an autoincrement primary key, below is the current query plan:
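For context, adding an autoincrementing key to an existing root table can be done with an identity column; the table and column names below are assumptions, and the unique index could be promoted to a primary key if the table does not already have one.

```sql
-- Backfills sequential values for existing rows as the column is added
ALTER TABLE couchdb ADD COLUMN auto_id BIGINT GENERATED ALWAYS AS IDENTITY;
CREATE UNIQUE INDEX couchdb_auto_id_idx ON couchdb (auto_id);
```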
@njuguna-n Can you share the full query? Are you querying based on the primary key or the timestamp?
@garethbowen the query:
There shouldn't be anything fundamentally different between using an auto_id and a timestamp. The pattern that @lorerod shows here uses a dbt macro to get a static value to compare against the column in the where clause. The postgres optimiser is able to make more effective use of an index that way than when using a subquery. Unless things have changed over the last 2 or so years.
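A rough sketch of that macro pattern is below: the macro runs the max() lookup at compile time and inlines the result as a literal, so the model's where clause compares against a constant instead of a subquery. The macro and column names are assumptions, not the exact code from the linked pattern.

```sql
-- macros/max_existing.sql -- hypothetical macro name
{% macro max_existing(relation, column) %}
  {% if execute %}
    {% set result = run_query("select coalesce(max(" ~ column ~ "), '1970-01-01') from " ~ relation) %}
    {{ return(result.columns[0][0]) }}
  {% else %}
    {{ return('1970-01-01') }}
  {% endif %}
{% endmacro %}

-- In the incremental model the filter then becomes a plain literal comparison:
-- {% if is_incremental() %}
--   where saved_timestamp > '{{ max_existing(this, "saved_timestamp") }}'
-- {% endif %}
```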
@njuguna-n, what error were you getting?
@lorerod the error
Update: The … Even an explain query like … I'll temporarily remove …
Is the long run-time only affecting the first load or does it also affect the incremental loading?
It also affects incremental loading, with runs taking longer as more data accumulates.
Three incremental tables and two materialized views now take about 9 minutes to refresh. Tomorrow I will test if using the mat views is faster than the macros for the incremental queries.
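In plain Postgres terms, the mat-view alternative being compared here amounts to something like the following (names are assumptions); the trade-off is that each refresh recomputes the whole view rather than inserting only new rows.

```sql
-- Define once (names assumed)
CREATE MATERIALIZED VIEW data_record_mv AS
SELECT doc ->> '_id' AS uuid, saved_timestamp, doc
FROM couchdb
WHERE doc ->> 'type' = 'data_record';

-- CONCURRENTLY avoids locking readers but requires a unique index on the view
CREATE UNIQUE INDEX data_record_mv_uuid ON data_record_mv (uuid);
REFRESH MATERIALIZED VIEW CONCURRENTLY data_record_mv;
```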
Mat views ended up causing a cyclical dependency based on how I used them. I tried using mat views for …
Latest dbt run with about 10m docs synced; results below. The good news is that it seems to be holding up better than previous tries, and the lambda view makes sure the data is up to date even if the dbt run is delayed. The not-so-good news is that the run time is slowly inching upwards as the DB size grows. The sync rate is currently about 700 docs per second. I'll let it run overnight and check back in the morning.
Using the lambda table and not running DBT in a continuous loop has so far yielded the best results (a 10-minute interval in this instance). The base table has ~44m docs now and the longest DBT run I see from a cursory look at the logs is 59 minutes, and even then the latest data is still available from the lambda. I will clean up the PR and have it form the basis for other models, and hopefully it holds up with more models added. CC: @derickl
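For readers unfamiliar with the lambda pattern mentioned above, the idea is roughly the following: serve everything already processed into the incremental table, plus any newer rows straight from the raw synced table, so queries see fresh data even while a long dbt run is still in flight. Table and column names are assumptions.

```sql
-- Lambda view sketch: historical rows from the dbt table plus fresh rows from the source
CREATE VIEW data_record_lambda AS
SELECT uuid, saved_timestamp, doc
FROM data_record                -- dbt-managed incremental table
UNION ALL
SELECT doc ->> '_id' AS uuid, saved_timestamp, doc
FROM couchdb                    -- raw table kept up to date by cht-sync
WHERE doc ->> 'type' = 'data_record'
  AND saved_timestamp > (SELECT coalesce(max(saved_timestamp), '1970-01-01') FROM data_record);
```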
PR merged.
Summary of the testing tried out in this ticket:

A typical DBT run with 44 million documents already synced takes about 14 minutes. This is when there is no additional data being synced, or when the rate of data sync is low. Previous iterations were taking a few hours, and many of them did not even complete the initial sync.

All tests were conducted on an EKS instance with the Postgres pod having 6GB of memory and 6 vCPUs as resource limits. The DBT pod has 4GB of memory and 4 vCPUs. The test data is from a clone of the production instance of BRAC Uganda. CouchDB records were being copied to Postgres at a rate of approximately 500 documents per second.

The idea of the branch was to restart the modelling process with minimal models, keeping only the few models required for a specific BRAC dashboard widget, in order to establish a basic pattern for modelling that would work. This is important to note because any production deployment will likely have more models, which might impact the performance.

The basic structure recommended right now is to have base models defined as incremental tables. The first draft of recommended base tables can be found here. We also added a lambda view for the main base table to try and ensure that downstream models always have the latest data even when DBT takes a long time to run. An important note is that a lambda view has the most benefit if it is queried directly by the dashboard or by another model that is a view. If a model that creates a table reads from a lambda view, then the data is only added to the table once …

The biggest gains in terms of reducing the time it takes for DBT to complete a run have been caused by:

The next questions we need to answer are:
We noticed that dbt run performance becomes increasingly worse as the document count increases. This issue will hold the discussion on what we are trying or have tried to improve the performance.