Adding Example/Tutorial of importing data to Xarray (Merge/concat/etc) #1391

Open
NicWayand opened this issue May 1, 2017 · 11 comments · May be fixed by #3131

Comments

@NicWayand

NicWayand commented May 1, 2017

I love xarray for analysis, but getting my data into xarray often takes a lot more time than I think it should. I am a hydrologist, and hydro data is very often poorly stored or formatted, which means I need multiple merge/concat/combine_first operations to get to a nice xarray Dataset. I think more examples of importing different types of data would be helpful (for me and possibly others); my current approach often entails trial and error.

I can start by providing an example of importing funky hydrology data that would hopefully be general enough for others to use. Maybe we can compile other examples as well, with the end goal of adding them to the docs on readthedocs.

@klapo @jhamman

@shoyer
Member

shoyer commented May 2, 2017

I'm certainly always happy to see more tutorials and examples for the docs, especially when they hit on common workflows.

@klapo

klapo commented May 10, 2017

I have an example that I just struggled through that might be relevant to this idea. I'm running a point model with an arbitrary number of experiments (28 in the example below). Each experiment is opened and then stored in a dictionary, resultsDict. The excerpt below extracts all of my scalar variables, concatenates them along an experiment dimension, and finally combines all scalar variables into a Dataset. I often find myself struggling to combine data (for instance, meteorological stations) into a Dataset, and I can never remember how to use merge and/or concat.

import xarray as xr

# resultsDict maps experiment name -> Dataset; scalar_data_vars lists variable names
resultsDataSet = xr.Dataset()
for k in scalar_data_vars:
    if 'scalar' not in k:
        continue

    # Concatenate variable k across all experiments along a new expID dimension
    darray = xr.concat([resultsDict[scen][k] for scen in resultsDict], dim='expID')
    # Remove hru dimension, as it is unused
    darray = darray.squeeze('hru')

    resultsDataSet[k] = darray
print(resultsDataSet)

which yields

<xarray.Dataset>
Dimensions:                (expID: 28, time: 8041)
Coordinates:
  * time                   (time) datetime64[ns] 2008-10-01 ...
    hru                    int32 1
Dimensions without coordinates: expID
Data variables:
    scalarRainPlusMelt     (expID, time) float64 -9.999e+03 -9.999e+03 ...
    scalarSWE              (expID, time) float64 -9.999e+03 -9.999e+03 ...
    scalarSnowSublimation  (expID, time) float64 -9.999e+03 -9.999e+03 ...
    scalarInfiltration     (expID, time) float64 -9.999e+03 -9.999e+03 ...
    scalarSurfaceRunoff    (expID, time) float64 -9.999e+03 -9.999e+03 ...
    scalarSurfaceTemp      (expID, time) float64 -9.999e+03 -9.999e+03 ...
    scalarSenHeatTotal     (expID, time) float64 -9.999e+03 -9.999e+03 ...
    scalarLatHeatTotal     (expID, time) float64 -9.999e+03 -9.999e+03 ...
    scalarSnowDepth        (expID, time) float64 -9.999e+03 -9.999e+03 ...

And here is a helper function that can do this more generally, which I wrote a while back.

def combinevars(ds_in, dat_vars, new_dim_name='new_dim', combinevarname='new_var'):
    # Stack the selected variables into one DataArray along a new dimension
    da_out = xr.concat([ds_in[dv] for dv in dat_vars], dim=new_dim_name)
    # Label each position along the new dimension with its source variable name
    da_out.coords[new_dim_name] = list(dat_vars)
    da_out.name = combinevarname

    return da_out
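
For comparison, xarray's built-in Dataset.to_array appears to cover the same use case as the helper above: it stacks a Dataset's data variables along a new dimension and labels that dimension with the variable names. A minimal sketch, with a made-up Dataset and made-up names (`t_air`, `t_soil`, `sensor`, `temperature`):

```python
import numpy as np
import xarray as xr

# Illustrative Dataset; variable names and values are invented for this sketch.
ds = xr.Dataset(
    {
        "t_air": ("time", np.array([1.0, 2.0, 3.0])),
        "t_soil": ("time", np.array([4.0, 5.0, 6.0])),
    },
    coords={"time": [0, 1, 2]},
)

# Select the variables to combine, then stack them along a new "sensor"
# dimension whose labels are the original variable names.
stacked = ds[["t_air", "t_soil"]].to_array(dim="sensor", name="temperature")
print(stacked)
```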

@klapo

klapo commented May 10, 2017

Also, just a small thing in the docs for concat.

The example includes this snippet:
xr.concat([arr[0], arr[1]], pd.Index([-90, -100], name='new_dim'))
but as far as I can tell, name is not an argument accepted by concat.

@shoyer
Member

shoyer commented May 10, 2017

> Also, just a small thing in the docs for concat

This is certainly confusing, but actually correct. name is a parameter to pd.Index(), not to concat.
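
A small sketch of what that docs snippet does, using a made-up 2x3 array in place of the docs' `arr`: the pd.Index is passed as concat's dim argument, so its name becomes the name of the new dimension and its values become the coordinate labels.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Made-up array standing in for `arr` from the docs example.
arr = xr.DataArray(np.arange(6).reshape(2, 3), dims=("x", "y"))

# name= belongs to pd.Index; concat only receives the finished index as `dim`.
new_dim = pd.Index([-90, -100], name="new_dim")
out = xr.concat([arr[0], arr[1]], new_dim)

print(out.dims)               # the new dimension is called "new_dim"
print(out["new_dim"].values)  # its labels are the index values
```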

@darothen

@klapo! Great to see you here!

Happy to iterate with you on documenting this functionality. For reference, I wrote a package for my dissertation work to help automate the task of constructing multi-dimensional Datasets which include dimensions corresponding to experimental/ensemble factors. One of my ongoing projects is to fully abstract this (I have a not-uploaded branch of the project which tries to build the notion of an "EnsembleDataset", which has the same relationship to a Dataset that a pandas Panel used to have to a DataFrame).

@klapo

klapo commented May 10, 2017

@darothen That sounds great!

I think we should be clearer. The issue that @NicWayand and I are highlighting is coercing observational data, which often comes with some fairly heinous formatting issues, into an xarray format. Stacking these data along a new dimension is usually the last step in this process, and one that can be frustrating. An example of this in practice can be found in this notebook (please be forgiving, it is one of the first things I ever wrote in Python).

https://github.com/klapo/CalRad/blob/master/CR.SurfObs.DataIngest.xray.ipynb

The data flow looks like this:

  • read the csv summarizing each station
  • read data from one set of stations using pandas
  • clean the data
  • assign the data in a pandas DataFrame to a dictionary of DataFrames
  • rinse and repeat for the other set of data
  • concat the dictionary of DataFrames into a single DataFrame
  • convert to an xarray Dataset

This example is a little ludicrous because I didn't know what I was doing, but I think that's the point. There is a lot of ambiguity about which tools to use at what point. Concatenating a dictionary of DataFrames into a single DataFrame and then converting to a Dataset was the only solution I could get to work, after a lot of trial and error, for putting these data in an xarray Dataset.
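
The last two steps of that data flow can be sketched roughly as follows. This is not the notebook's code: the station names, column names, and values are invented, and it assumes each station's DataFrame shares the same columns and a time index.

```python
import pandas as pd
import xarray as xr

# Hypothetical cleaned per-station DataFrames keyed by station name.
time = pd.date_range("2017-05-01", periods=3, freq="D", name="time")
frames = {
    "station_a": pd.DataFrame(
        {"sw_down": [0.0, 120.0, 340.0], "t_air": [4.1, 5.0, 6.3]}, index=time
    ),
    "station_b": pd.DataFrame(
        {"sw_down": [0.0, 95.0, 310.0], "t_air": [2.7, 3.9, 5.2]}, index=time
    ),
}

# Passing the dict to pd.concat turns its keys into the outer level of a
# (station, time) MultiIndex.
long_df = pd.concat(frames, names=["station", "time"])

# Each MultiIndex level becomes a dimension; each column becomes a data variable.
ds = long_df.to_xarray()
print(ds)
```

Doing the concatenation in pandas first means xarray only sees one tidy table, which may be easier to reason about than merging many small Datasets.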

@stale

stale bot commented Apr 10, 2019

In order to maintain a list of currently relevant issues, we mark issues as stale after a period of inactivity.

If this issue remains relevant, please comment here or remove the stale label; otherwise it will be marked as closed automatically.

@stale stale bot added the stale label Apr 10, 2019
@shoyer
Member

shoyer commented Apr 10, 2019

This is still relevant

@stale stale bot removed the stale label Apr 10, 2019
@rabernat rabernat self-assigned this Jul 12, 2019
@rabernat
Contributor

I feel like I am a ninja on this issue and I think I could write a good tutorial for this.

@dcherian
Contributor

+20 @rabernat

@dcherian
Contributor

I think you are a good candidate for expanding the tutorial on multidimensional coordinates too. This is such a common use case for model output...
