Enable basic stats for non-delta tables #2

Open
GeekSheikh opened this issue Jul 21, 2021 · 2 comments
Labels
enhancement New feature or request

Comments

@GeekSheikh
Contributor

Curious to get your thoughts on collecting and storing stats the old-fashioned way for non-delta tables. Obviously this would be less performant, but I think some customers would be willing to pay for it. There are a lot of extended options for Analyze Table that could be used to collect proper metrics outside of delta. :) Thoughts?
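For context, one way the "old-fashioned" approach could look: generate Spark SQL `ANALYZE TABLE` statements (including the per-column extended option) for a list of non-Delta tables and run them via `spark.sql(...)` in a notebook. The helper below is a hypothetical sketch, not anything from this repo; only the SQL syntax itself is standard Spark SQL.

```python
# Hypothetical helper: builds Spark SQL ANALYZE TABLE statements for a list
# of non-Delta tables. In a Databricks notebook, each returned statement
# would be executed with spark.sql(stmt).
def build_analyze_statements(tables, columns=None, noscan=False):
    """Return statements to collect and then read back table stats."""
    stmts = []
    for t in tables:
        if noscan:
            # NOSCAN gathers size metadata without scanning the data
            stmts.append(f"ANALYZE TABLE {t} COMPUTE STATISTICS NOSCAN")
        elif columns:
            # Extended option: per-column stats (min/max/ndv/null counts)
            cols = ", ".join(columns)
            stmts.append(f"ANALYZE TABLE {t} COMPUTE STATISTICS FOR COLUMNS {cols}")
        else:
            stmts.append(f"ANALYZE TABLE {t} COMPUTE STATISTICS")
        # Stats land in the catalog; read them back with DESCRIBE EXTENDED
        stmts.append(f"DESCRIBE EXTENDED {t}")
    return stmts
```

The stats would then be persisted to a side table on whatever schedule the customer accepts the scan cost for.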

@GeekSheikh GeekSheikh added the enhancement New feature or request label Jul 21, 2021
@ronanstokes-db

I created a table analyzer notebook some time ago to give to customers in advance of tuning exercises, in situations where we don't have hands-on access to their workspace. It's very useful in helping drive conversations around the potential benefit they would get by moving to delta (i.e. showing them small-file issues, data size disparity, etc.).

This notebook performs analysis to break down tables / datasets by:

  • average / min / max size of files
  • numbers of files and partitions

I'll add a link to it.

@ronanstokes-db

Another thought is to create views that instrument access to tables, to help with optimization decisions.

How would you do this?

You could have a dataframe pipeline that creates a temporary view, but as part of the pipeline, write stats on the logical primary keys accessed etc. to a side table. We'd have to experiment with this to see how to do it efficiently.
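A minimal sketch of the side-table idea, with plain Python standing in for the dataframe pipeline: a filter step records which logical primary keys were requested into an in-memory "side table" counter. All names here (`AccessStats`, `instrumented_filter`) are hypothetical; a real version would write to an actual Delta side table.

```python
from collections import Counter

class AccessStats:
    """Hypothetical side-table stand-in: accumulates counts of which
    logical primary keys a pipeline touches, so access patterns can
    later inform optimization decisions (e.g. what to ZORDER/partition by)."""

    def __init__(self):
        self.key_counts = Counter()

    def record(self, keys):
        self.key_counts.update(keys)

def instrumented_filter(rows, key_col, wanted, stats):
    """Filter rows by key, logging the requested keys as a side effect.

    In a real pipeline this would wrap the step that builds the
    temporary view, so consumers see normal results while stats
    accumulate in the side table.
    """
    stats.record(wanted)
    wanted = set(wanted)
    return [r for r in rows if r[key_col] in wanted]
```

Example use: after a few pipeline runs, `stats.key_counts.most_common()` shows the hot keys, analogous to a coverage report for data.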

My general thinking behind this comes from experimenting with code coverage: the question is, how would you do the same thing for data use? I haven't thought this through fully.
