Initial commit of NHANES 2011 data #1

Open
wants to merge 2 commits into
base: main
from


kshedden

Not sure what you have in mind in terms of data format, metadata, etc. Let me know and I will revise the PR as needed.

@josef-pkt
Member

@vincentarelbundock
I think when we build this out we will use something like
https://github.com/vincentarelbundock/Rdatasets
which would put nhanes one level lower into a csv folder

in general:
There was discussion on the nipy mailing list about making installable Python dataset packages. That makes sense if users will want to use most of the data available or if the datasets don't get too large, but not so much if we want to use just a few datasets, as in Rdatasets.
I didn't pay a lot of attention to the details of dataset packages and meta information. For now the Rdatasets pattern plus our datasets inside statsmodels seems to be enough.
It's possible to rethink this in the future if someone is interested. I saw that there are also related dataset packages for Julia (one of them a translation of Vincent's Rdatasets), which will have similar installation and license/copyright questions as we do.

@josef-pkt
Member

On a specific question:
Is the NHANES .gz file an archive with a single csv file, or does it contain a collection of csv files?
What's the advantage of using an archive instead of a plain csv file?

I'm fine either way, but AFAIK we would have to write py2/py3-compatible helper functions to read the data from an archive file. (The statespace notebooks do that, and it was what prompted me to look into creating smdatasets.)

@kshedden
Author

There's just a single file in there, in csv format. It's only compressed to save space/bandwidth.

I don't feel strongly about this.
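
For reference, reading a single gzip-compressed csv file needs very little helper code on Python 3. The following is only a rough sketch using the standard library, not code from this PR: the function name `read_gzipped_csv` and the file name in the usage note are hypothetical, and it does not attempt the py2 compatibility mentioned above.

```python
import csv
import gzip


def read_gzipped_csv(path, encoding="utf-8"):
    """Return the rows of a gzip-compressed csv file as a list of dicts.

    Hypothetical helper for illustration; assumes the .gz file wraps
    exactly one csv file with a header row (Python 3 only).
    """
    # Text mode ("rt") decompresses and decodes in one step; newline=""
    # lets the csv module handle line endings itself.
    with gzip.open(path, mode="rt", encoding=encoding, newline="") as handle:
        return list(csv.DictReader(handle))
```

A call would look like `rows = read_gzipped_csv("nhanes_2011.csv.gz")`, where the file name is a placeholder. If pandas is available, `pandas.read_csv` can also read a `.gz` file directly, since it infers gzip compression from the file extension by default.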
