Is there a way to get dmatrix to drop all-zero columns? #161

DWesl · 2020-08-30T20:59:41Z

I have an experiment design that does not include all combinations of its categorical variables, and ran into some difficulties getting a full-rank design matrix for statsmodels. I included a simplified version below.

import numpy as np
import numpy.linalg as la
import pandas as pd
import patsy

index_vals = tuple("abc")
level_names = list("ABD")
n_samples = 2


def describe_design_matrix(design_matrix):
    print("Shape:", design_matrix.shape)
    print("Rank: ", la.matrix_rank(design_matrix))
    print(
        "Approximate condition number: {0:.2g}".format(
            np.divide(*la.svd(design_matrix)[1][[0, -1]])
        )
    )


ds_simple = pd.DataFrame(
    index=pd.MultiIndex.from_product(
        [index_vals] * len(level_names) + [range(n_samples)],
        names=level_names + ["sample"],
    ),
    columns=["y"],
    data=np.random.randn(len(index_vals) ** len(level_names) * n_samples),
).reset_index()

print("All sampled")
simple_X = patsy.dmatrices("y ~ (A + B + D) ** 3", ds_simple)[1]
describe_design_matrix(simple_X)

print("Only some sampled")
simple_X = patsy.dmatrices(
    "y ~ (A + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)

print("Reduced X")
simple_X = patsy.dmatrices(
    "y ~ (A + B + D) ** 3",
    ds_simple.query("A != 'a' or B == 'a'"),
    return_type="dataframe",
)[1]
reduced_X = simple_X.loc[
    :, [col for col in simple_X.columns if not col.startswith("A[T.b]:B")]
]
describe_design_matrix(reduced_X)

print("Only some sampled: alternate method")
simple_X = patsy.dmatrices(
    "y ~ (C(A, Treatment('b')) + B + D) ** 3", ds_simple.query("A != 'a' or B == 'a'")
)[1]
describe_design_matrix(simple_X)
print("Number of nonzero elements:", (simple_X != 0).sum(axis=0))
print("Number of all-zero columns:", np.count_nonzero((simple_X != 0).sum(axis=0) == 0))

print("Reduced X: alternate method")
simple_X = patsy.dmatrices(
    "y ~ (C(A, Treatment('b')) + B + D) ** 3",
    ds_simple.query("A != 'a' or B == 'a'"),
    return_type="dataframe",
)[1]
reduced_X = simple_X.loc[
    :,
    [
        col
        for col in simple_X.columns
        if not col.startswith("C(A, Treatment('b'))[T.a]:B")
    ],
]
describe_design_matrix(reduced_X)

produces as output

All sampled
Shape: (54, 27)
Rank:  27
Approximate condition number: 52
Only some sampled
Shape: (42, 27)
Rank:  21
Approximate condition number: 3.8e+16
Reduced X
Shape: (42, 21)
Rank:  21
Approximate condition number: 37
Only some sampled: alternate method
Shape: (42, 27)
Rank:  21
Approximate condition number: 3.4e+16
Number of nonzero elements: [42  6 18 12 12 14 14  0  6  0  6  2  6  2  6  4  4  4  4  0  2  0  2  0
  2  0  2]
Number of all-zero columns: 6
Reduced X: alternate method
Shape: (42, 21)
Rank:  21
Approximate condition number: 39

I don't mind spending the time to find the representation that produces all-zero columns, but there doesn't seem to be a way within patsy to say "I know some of these columns are going to be all zeros" or "These columns will be linear dependent on others". Since some statsmodels functions require the formula information from patsy.DesignInfo objects, I wanted to see what could be done within patsy.

matthewwardrop/formulaic#19 is a related issue, with some discussion of how to generalize the "Reduced X" method in the script.

The text was updated successfully, but these errors were encountered:

DWesl · 2020-08-30T21:26:41Z

Is there a way to get a stateful transform to operate on a categorical interaction term (e.g. drop_unused_interactions(A:B), drop_unused_interactions((A + B) ** 2), or drop_all_zero_columns((C(A, Treatment('b')) + B) ** 2))? That might be the easiest way to accomplish this if such a thing is possible.

DWesl mentioned this issue Aug 30, 2020

Is there some way to specify that certain interactions never occur? matthewwardrop/formulaic#19

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is there a way to get dmatrix to drop all-zero columns? #161

Is there a way to get dmatrix to drop all-zero columns? #161

DWesl commented Aug 30, 2020

DWesl commented Aug 30, 2020

Is there a way to get dmatrix to drop all-zero columns? #161

Is there a way to get dmatrix to drop all-zero columns? #161

Comments

DWesl commented Aug 30, 2020

DWesl commented Aug 30, 2020