You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
I have an experiment design that does not include all combinations of its categorical variables, and ran into some difficulties getting a full-rank design matrix for statsmodels. I included a simplified version below.
importnumpyasnpimportnumpy.linalgaslaimportpandasaspdimportpatsyindex_vals=tuple("abc")
level_names=list("ABD")
n_samples=2defdescribe_design_matrix(design_matrix):
print("Shape:", design_matrix.shape)
print("Rank: ", la.matrix_rank(design_matrix))
print(
"Approximate condition number: {0:.2g}".format(
np.divide(*la.svd(design_matrix)[1][[0, -1]])
)
)
ds_simple=pd.DataFrame(
index=pd.MultiIndex.from_product(
[index_vals] *len(level_names) + [range(n_samples)],
names=level_names+ ["sample"],
),
columns=["y"],
data=np.random.randn(len(index_vals) **len(level_names) *n_samples),
).reset_index()
print("All sampled")
simple_X=patsy.dmatrices("y ~ (A + B + D) ** 3", ds_simple)[1]
describe_design_matrix(simple_X)
print("Only some sampled")
simple_X=patsy.dmatrices(
"y ~ (A + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)
print("Reduced X")
simple_X=patsy.dmatrices(
"y ~ (A + B + D) ** 3",
ds_simple.query("A != 'a' or B == 'a'"),
return_type="dataframe",
)[1]
reduced_X=simple_X.loc[
:, [colforcolinsimple_X.columnsifnotcol.startswith("A[T.b]:B")]
]
describe_design_matrix(reduced_X)
print("Only some sampled: alternate method")
simple_X=patsy.dmatrices(
"y ~ (C(A, Treatment('b')) + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)
print("Number of nonzero elements:", (simple_X!=0).sum(axis=0))
print("Number of all-zero columns:", np.count_nonzero((simple_X!=0).sum(axis=0) ==0))
print("Reduced X: alternate method")
simple_X=patsy.dmatrices(
"y ~ (C(A, Treatment('b')) + B + D) ** 3",
ds_simple.query("A != 'a' or B == 'a'"),
return_type="dataframe",
)[1]
reduced_X=simple_X.loc[
:,
[
colforcolinsimple_X.columnsifnotcol.startswith("C(A, Treatment('b'))[T.a]:B")
],
]
describe_design_matrix(reduced_X)
produces as output
All sampled
Shape: (54, 27)
Rank: 27
Approximate condition number: 52
Only some sampled
Shape: (42, 27)
Rank: 21
Approximate condition number: 3.8e+16
Reduced X
Shape: (42, 21)
Rank: 21
Approximate condition number: 37
Only some sampled: alternate method
Shape: (42, 27)
Rank: 21
Approximate condition number: 3.4e+16
Number of nonzero elements: [42 6 18 12 12 14 14 0 6 0 6 2 6 2 6 4 4 4 4 0 2 0 2 0
2 0 2]
Number of all-zero columns: 6
Reduced X: alternate method
Shape: (42, 21)
Rank: 21
Approximate condition number: 39
I don't mind spending the time to find the representation that produces all-zero columns, but there doesn't seem to be a way within patsy to say "I know some of these columns are going to be all zeros" or "These columns will be linear dependent on others". Since some statsmodels functions require the formula information from patsy.DesignInfo objects, I wanted to see what could be done within patsy.
matthewwardrop/formulaic#19 is a related issue, with some discussion of how to generalize the "Reduced X" method in the script.
The text was updated successfully, but these errors were encountered:
Is there a way to get a stateful transform to operate on a categorical interaction term (e.g. drop_unused_interactions(A:B), drop_unused_interactions((A + B) ** 2), or drop_all_zero_columns((C(A, Treatment('b')) + B) ** 2))? That might be the easiest way to accomplish this if such a thing is possible.
I have an experiment design that does not include all combinations of its categorical variables, and ran into some difficulties getting a full-rank design matrix for
statsmodels
. I included a simplified version below.produces as output
I don't mind spending the time to find the representation that produces all-zero columns, but there doesn't seem to be a way within patsy to say "I know some of these columns are going to be all zeros" or "These columns will be linear dependent on others". Since some
statsmodels
functions require the formula information frompatsy.DesignInfo
objects, I wanted to see what could be done withinpatsy
.matthewwardrop/formulaic#19 is a related issue, with some discussion of how to generalize the "Reduced X" method in the script.
The text was updated successfully, but these errors were encountered: