This is either a feature request or a request for help with current functionality. I am doing some work with unbalanced panel data that involves using patsy to forecast some series. Here's a basic example:
import io, pandas, patsy

# Raw panel data indexed on ID, YEAR. Y is the forecast variable of interest.
# There are no gaps in the data for an individual entity, but the panel is
# potentially unbalanced (meaning different start/end dates).
data = '''ID,YEAR,Y,B,C,D
1,1999,0,2,3,4
1,2000,.,2,3,4
1,2001,.,2,3,4
1,2002,.,2,3,4
2,1996,1,2,3,4
2,1997,.,2,3,4
3,1998,3,2,3,4
3,1999,3,2,3,4
3,2000,.,2,3,4
3,2001,3,2,3,4
'''
data = io.StringIO(data)
df = pandas.read_csv(data, index_col=['ID', 'YEAR'], na_values=['.'])
print(df)

def lag(series, n=1):
    return series.groupby(level=0).shift(n)

formula = '1+lag(Y)+B+C+D'  # This is the forecast equation for Y
x = patsy.dmatrix(formula, df, return_type='dataframe')
params = pandas.Series([1, 2, 3, 4, 5], index=x.columns)  # these are the coefficients on the forecast vars

# Now forecast year by year
for yr in range(1997, 2010):
    ind = df.index.get_level_values('YEAR') == yr
    x = patsy.dmatrix(formula, df, return_type='dataframe').reindex(df.index)
    x = x.loc[ind]
    df.loc[ind, 'Y'] = df.loc[ind, 'Y'].fillna(x @ params)
    print('================')
    print(yr)
    print(df)
Note that to produce the entire forecast we need to call dmatrix over and over. Calling dmatrix on the entire DataFrame repeatedly is quite inefficient, but because the forecast formula can contain an arbitrary number of lags, I can't just pass in a df filtered to the current year (or to a set number of lags from the current year). What would be ideal is if I could replace the dmatrix call inside the loop with a version of dmatrix that takes a boolean rows argument and only evaluates and returns the rows that are needed:
ind = df.index.get_level_values('YEAR') == yr
# Proposed: evaluates only on rows where ind==True and returns a dataframe with only those rows
x = patsy.dmatrix(formula, df, return_type='dataframe', rows=ind)
I thought incr_dbuilder might be able to handle this, but it seems to expect each chunk to be completely independent of the previous chunks. That won't work in the time series/panel context.
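Until something like a rows argument exists, one possible workaround is sketched below. It is not a patsy feature: dmatrix_rows is a hypothetical helper, and the sketch assumes that every transform in the formula (here lag) only looks backward within a single ID, so rows from other entities and from later years can be dropped before evaluation. It also parses the formula once and reuses the resulting design_info through patsy.build_design_matrices instead of calling dmatrix in the loop. The data is a shortened version of the example above.

```python
import io

import pandas as pd
import patsy

raw = """ID,YEAR,Y,B,C,D
1,1999,0,2,3,4
1,2000,.,2,3,4
1,2001,.,2,3,4
2,1996,1,2,3,4
2,1997,.,2,3,4
"""
df = pd.read_csv(io.StringIO(raw), index_col=["ID", "YEAR"], na_values=["."])


def lag(series, n=1):
    # Shift within each entity (index level 0) so lags never cross IDs.
    return series.groupby(level=0).shift(n)


formula = "1 + lag(Y) + B + C + D"

# Parse the formula once and keep the design metadata for later reuse.
design_info = patsy.dmatrix(formula, df, return_type="dataframe").design_info
params = pd.Series([1, 2, 3, 4, 5], index=design_info.column_names)

# Treat nothing as missing so patsy keeps every row; NaN lags simply propagate.
keep_all = patsy.NAAction(NA_types=[])


def dmatrix_rows(design_info, data, rows):
    """Hypothetical helper: design matrix for the rows flagged True in `rows`.

    Assumes the formula only needs same-ID history up to the target year.
    """
    years = data.index.get_level_values("YEAR")
    ids = data.index.get_level_values("ID")
    target_year = years[rows].max()
    # Keep only entities that appear in the target rows, and only their past.
    keep = ids.isin(ids[rows].unique()) & (years <= target_year)
    (x,) = patsy.build_design_matrices(
        [design_info], data[keep], NA_action=keep_all, return_type="dataframe"
    )
    return x.reindex(data.index[rows])


for yr in range(1997, 2002):
    ind = df.index.get_level_values("YEAR") == yr
    if not ind.any():
        continue  # no entity is observed in this year
    x = dmatrix_rows(design_info, df, ind)
    df.loc[ind, "Y"] = df.loc[ind, "Y"].fillna(x @ params)

print(df)
```

The amount of data evaluated per year shrinks to the history of the entities that actually have a row in that year. If the formula contained forward-looking or cross-entity transforms, the filter in dmatrix_rows would have to be widened accordingly.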
@spillz Did you find a solution to your problem? I've run into a similar issue. In my case, I am calling dmatrix repeatedly (e.g., tens of thousands of times), passing a different DataFrame each time. The DataFrame is small (e.g., 4 rows), but the repeated calls are quite slow. See the attached call graph from profiling.
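For the many-small-frames case, a minimal sketch of one thing that may help: parse the formula once, keep the design_info attached to the first result, and build every later matrix with patsy.build_design_matrices. The formula, column names, and frames below are made up for illustration. Note that the design_info remembers categorical levels and stateful transforms from the first frame, so this only fits when all frames share the same structure; how much time it saves would need to be confirmed by profiling.

```python
import pandas as pd
import patsy

formula = "1 + B + C"  # hypothetical formula, for illustration only

# Hypothetical stand-ins for the many small (4-row) DataFrames.
frames = [
    pd.DataFrame(
        {
            "B": [1.0, 2.0, 3.0, 4.0],
            "C": [k + 0.0, k + 1.0, k + 2.0, k + 3.0],
        }
    )
    for k in range(3)
]

# Parse the formula once; the first frame supplies the design metadata.
first = patsy.dmatrix(formula, frames[0], return_type="dataframe")
design_info = first.design_info

# Reuse the stored design_info for every frame instead of re-parsing.
matrices = [
    patsy.build_design_matrices([design_info], frame, return_type="dataframe")[0]
    for frame in frames
]

print(matrices[1])
```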