Improves distribution of dummy data #2326
base: main
Conversation
Dummy data generation will generate a larger population and then sample from it to improve the distribution of patients. This parameter controls how much larger. Lower values will be faster to generate, while larger values will get closer to the target distribution.
I think it'd be helpful to users to give the default and an idea of what a small/large value is, e.g. 1 means no oversampling, 2 means oversampling by up to 2x the specified population size.
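For context on what different values mean, here is a minimal sketch of how the parameter might be used. The `oversample` keyword and its placement on `configure_dummy_data` are assumptions based on this PR's docs; import paths follow current ehrQL and may differ in this repo.

```python
# A hedged sketch: `oversample` on configure_dummy_data is assumed from this
# PR's docs; import paths follow current ehrQL.
from ehrql import create_dataset
from ehrql.tables.core import patients

dataset = create_dataset()
dataset.define_population(patients.date_of_birth.is_not_null())

# oversample=1 would mean no oversampling; oversample=2 generates up to twice
# the requested population before sampling back down to 10 patients.
dataset.configure_dummy_data(population_size=10, oversample=2)
```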
Defines a "weight" expression that lets you control the distribution of patients. | ||
Ideally a patient row will be generated with probability proportionate to its weight, | ||
although this ideal will be imperfectly realised in practice. The higher the value of | ||
``oversample`` the closer this ideal will to being realised. |
An example as we have for additional_population_constraints would be helpful. Will this always be a case()?
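For illustration, a weight expression could be written with ehrQL's `case()`/`when()` helpers, as in the hedged sketch below. Whether it must always be a `case()` is exactly the open question above; the `dummy_data_weight` keyword is a placeholder rather than the real parameter name, and the import paths are assumptions.

```python
# A hedged sketch of a weight expression; names marked as placeholders are
# not part of the actual API proposed in this PR.
from ehrql import case, create_dataset, when
from ehrql.tables.core import patients

dataset = create_dataset()
dataset.define_population(patients.date_of_birth.is_not_null())

# Weight expression: make patients recorded as female roughly twice as likely
# to survive the sampling-down step as everyone else.
weight = case(
    when(patients.sex == "female").then(2.0),
    otherwise=1.0,
)

dataset.configure_dummy_data(
    population_size=10,
    oversample=2,
    # Placeholder keyword: the actual way the weight expression is passed
    # comes from this PR's API and may differ.
    dummy_data_weight=weight,
)
```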
This improves the distribution of dummy data through the strategy of oversampling: We produce a larger population of patients than requested, and then sample down to a subset of it according to some weighted distribution of values.
This weighting scheme is calculated in two ways:
It's also a good place to insert heuristics we might want to add (e.g. we could add default age and sex distributions, or choose what proportion of nullable columns should be null). This PR doesn't do any of that.
It does come with the cost of making dummy data generation slower. I think this is an OK tradeoff, and it will be improved by anything that speeds up dummy data generation (in particular, future work on making constraints satisfied more often).
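As a rough illustration of the strategy (not the generator's actual implementation), the sampling-down step could look like the following plain-Python sketch, which uses Efraimidis–Spirakis keys for weighted sampling without replacement; all names here are made up for the example.

```python
import random


def sample_down(candidates, weights, population_size):
    """Weighted sampling without replacement using Efraimidis-Spirakis keys:
    each candidate gets key u ** (1 / w) for uniform u, and the largest keys
    win, so higher-weighted candidates are more likely to be kept."""
    keyed = [
        (random.random() ** (1.0 / w) if w > 0 else 0.0, patient)
        for patient, w in zip(candidates, weights)
    ]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [patient for _, patient in keyed[:population_size]]


# Toy usage: a candidate pool twice the requested size, with female patients
# weighted 2.0 so they end up roughly twice as likely to survive the cut.
candidates = [{"sex": random.choice(["female", "male"])} for _ in range(200)]
weights = [2.0 if p["sex"] == "female" else 1.0 for p in candidates]
dummy_patients = sample_down(candidates, weights, population_size=100)
```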