I think we can create some very generic fake extractors and define a schema for them, for example (a rough sketch follows the list):
- `df()->read(from_flow_orders(limit: 1_000))`
- `flow_orders_schema() : Schema`
- do the same for products
- do the same for customers
- do the same for inventory
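A minimal sketch of what this could look like, assuming the fake extractor is a thin wrapper around deterministic data. `from_flow_orders()` and `flow_orders_schema()` do not exist yet, and the DSL helpers used here (`from_array()`, `schema()`, `*_schema()`) are assumptions that may need to be adjusted to the current Flow API:

```php
<?php

declare(strict_types=1);

// Hypothetical sketch only: the proposed helpers do not exist yet, and the
// exact schema DSL functions may differ between Flow versions.

use function Flow\ETL\DSL\{bool_schema, datetime_schema, df, float_schema, from_array, schema, str_schema};

// Fake extractor: $limit fully deterministic orders (no randomness at all).
function from_flow_orders(int $limit = 1_000) : \Flow\ETL\Extractor
{
    $start = new \DateTimeImmutable('2023-01-01 00:00:00+00:00');
    $orders = [];

    for ($i = 0; $i < $limit; $i++) {
        $orders[] = [
            'order_id' => \sprintf('00000000-0000-0000-0000-%012d', $i),
            'created_at' => $start->modify(\sprintf('+%d minutes', $i)),
            'total_price' => \round(10.0 + ($i % 100) * 0.5, 2),
            'cancelled' => $i % 10 === 0, // every 10th order is cancelled
        ];
    }

    return from_array($orders);
}

// The schema this dataset promises to keep backward compatible.
// A real implementation would cover all available entry types.
function flow_orders_schema() : \Flow\ETL\Row\Schema
{
    return schema(
        str_schema('order_id'),
        datetime_schema('created_at'),
        float_schema('total_price'),
        bool_schema('cancelled'),
    );
}

// Usage, as proposed above; from_flow_orders(limit: 1) would double as a
// "single row" helper for stand-alone scalar function tests.
df()
    ->read(from_flow_orders(limit: 1_000))
    ->run();
```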
We should make sure all of those datasets keep a consistent schema and use all possible entry types. Those virtual datasets would need to follow a very strict backward compatibility policy and proper schema evolution.
This would make it much easier and more realistic to test not only the entire pipeline but also stand-alone scalar functions, since we could also create helpers that return just a single row (and use them inside the fake extractors).
I would put those into src/core/etl/tests/Flow/ETL/Tests/Double/Fake/Dataset
The important part here is that those datasets can't be totally random; they need to be fully predictable.
For example, orders can't start at a random point in time, and it should also be possible to configure at the extractor level things like the number of orders per day, the covered time period, the percentage of cancelled orders, etc.
Do we want the datasets to be mostly static files that we manipulate, or could we utilize libraries such as https://fakerphp.org/ to bring some controllable randomness into play?
For example, we could provide a "schema" to Faker and let it fill in the data for us.
Great question!
IMO the data should be 100% generated by Faker, but we should expose some options, as explained above, to make those datasets more predictable.
Tests using those datasets should not rely on specific values but rather on the shape and size of the data.
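To make the Faker-based approach concrete, here is a minimal sketch of seeded, "controllable" randomness, assuming fakerphp/faker. The generator options (seed, start date, orders per day, cancellation rate) are hypothetical knobs, not an existing API:

```php
<?php

declare(strict_types=1);

// Sketch only: a seeded Faker generator makes the dataset reproducible while
// still looking realistic. All option names below are hypothetical.

use Faker\Factory;

/**
 * @return \Generator<array<string, mixed>>
 */
function generate_fake_orders(
    int $seed = 1,
    ?\DateTimeImmutable $start = null,
    int $days = 7,
    int $ordersPerDay = 100,
    int $cancelledPercent = 10,
) : \Generator {
    $start ??= new \DateTimeImmutable('2023-01-01 00:00:00+00:00');

    $faker = Factory::create();
    $faker->seed($seed); // same seed => same dataset on every run

    for ($day = 0; $day < $days; $day++) {
        for ($i = 0; $i < $ordersPerDay; $i++) {
            yield [
                'order_id' => $faker->uuid(),
                'created_at' => $start->modify(\sprintf('+%d days +%d seconds', $day, $faker->numberBetween(0, 86_399))),
                'total_price' => $faker->randomFloat(2, 5, 500),
                'cancelled' => $faker->boolean($cancelledPercent), // ~10% cancelled by default
            ];
        }
    }
}

// Tests assert on shape and size, not on concrete values:
\assert(\iterator_count(generate_fake_orders(seed: 1)) === 7 * 100);
```

With a fixed seed the values stay stable across runs, but tests would still only assert on row counts and schema, as suggested above.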