Adding interaction terms to the design matrix #181

khalilouardini · 2023-10-17T12:51:35Z

This PR has several purposes:

it leverages formulaic to allow interaction terms of the form a:b:...:z in design_factors, as well as formulas combining single design factors with such interaction terms, e.g. a + b + a:b:...:z (no support yet for the more complex structures or alternative syntax enabled by formulaic's grammar, such as ~ C(X, contr.treatment("x")), a * b, contr.poly, ...)
allows the user to pass a design_matrix to a dds dataset
fixes pyproject.toml syntax (linters complained)

Completion milestones:

adds @anaischossegros test on edge-case from deseq2 vignette
test to check that passing a design_matrix populates all necessary attributes such as design_factors (more or less covered, since the end-to-end integration tests pass)
continuous factors well-handled
deseq2 examples matched (at least at the level of the design matrix)
arbitrary number of interactions well handled
deseq2 examples matched end2end

pydeseq2/dds.py

pydeseq2/interaction_utils.py

jeandut · 2023-10-19T17:13:57Z

Yay, the tests pass! @khalilouardini and @BorisMuzellec, you're up!

BorisMuzellec · 2023-10-20T08:09:02Z

Hi @jeandut and @khalilouardini, thanks for this PR!

It's nice that you went all the way to supporting recursive interactions like "a:b:c".

I experimented a bit with your code, and there seem to be a few remaining issues though.
Running

from pydeseq2.utils import build_design_matrix, load_example_data

counts = load_example_data()
metadata = load_example_data(modality="metadata")

# Main effect plus its interaction with a second factor
design = build_design_matrix(metadata=metadata,
                   design_factors=["condition", "condition:group"])

I get the following design:

          intercept  condition_B_vs_A  condition:group_AY_vs_A_vs_AX  \
sample1            1                 0                              0   
sample2            1                 0                              1   
sample3            1                 0                              0   
sample4            1                 0                              1   
sample5            1                 0                              0   
...              ...               ...                            ...   
sample96           1                 1                              0   
sample97           1                 1                              0   
sample98           1                 1                              0   
sample99           1                 1                              0   
sample100          1                 1                              0   

           condition:group_BX_vs_A_vs_AX  condition:group_BY_vs_A_vs_AX  
sample1                                0                              0  
sample2                                0                              0  
sample3                                0                              0  
sample4                                0                              0  
sample5                                0                              0  
...                                  ...                            ...  
sample96                               0                              1  
sample97                               1                              0  
sample98                               0                              1  
sample99                               1                              0  
sample100                              0                              1  

[100 rows x 5 columns]

The columns seem to contain the intended values, but:

There's an issue with variable names. We would expect something like condition:group_AY_vs_AX instead of condition:group_AY_vs_A_vs_AX.
The design is not full rank: e.g. ((design.iloc[:,-1] + design.iloc[:,-2] - design.iloc[:,1])**2).sum() returns 0, i.e. condition_B_vs_A equals the sum of the last two columns. This means that one of the last two columns is redundant. I'm not sure how to detect this automatically, though; perhaps looking at how DESeq2 handles it could help.

Let me know if you need help with this :)

EDIT: I think that in the example above issue 2 is due to the fact the design is of the form "~a + a:b", which creates a redundancy that probably wouldn't be there if it was just "~a:b".
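To make the point in the EDIT concrete, here is a minimal sketch on synthetic data (not the example dataset, and not the code in this PR): it builds the indicator columns by hand with NumPy and compares the column rank of a "~a + a:b"-style design with that of a "~a:b"-style design. The column layout is an assumption that mirrors the output shown above.

```python
import numpy as np

# Synthetic stand-in for the metadata: two conditions (A, B) and two groups (X, Y).
condition = np.array(list("AAAABBBB"))
group = np.array(list("XYXYXYXY"))
combo = np.char.add(condition, group)  # "AX", "AY", "BX", "BY"

intercept = np.ones(len(condition))
interaction_cols = [combo == "AY", combo == "BX", combo == "BY"]  # AX is the reference

# "~a + a:b"-style design: intercept, main effect, then the interaction indicators.
with_main_effect = np.column_stack(
    [intercept, condition == "B", *interaction_cols]
).astype(float)

# "~a:b"-style design: intercept and interaction indicators only.
interaction_only = np.column_stack([intercept, *interaction_cols]).astype(float)

rank_with = np.linalg.matrix_rank(with_main_effect)
rank_without = np.linalg.matrix_rank(interaction_only)
print(with_main_effect.shape[1], rank_with)     # 5 columns, rank 4
print(interaction_only.shape[1], rank_without)  # 4 columns, rank 4
```

On this toy data the redundancy only appears once the main-effect column is added next to the interaction indicators.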

BorisMuzellec

See my comment above

jeandut · 2023-10-20T14:56:04Z

@BorisMuzellec see the modifications I made. LGTM, but it's hard to be sure without test data available.
To converge faster and minimize review iterations, if the PR is still not doing exactly what you want, I would appreciate it if you could give me pairs of the form:
(design_factors, metadata, continuous_factors) -> expected_design_matrix
for a sufficiently representative set of design_factors and continuous_factors.

jeandut · 2023-10-20T16:07:52Z

As a start, does this match DESeq2? And if not, what should be changed?

jeandut · 2023-10-20T16:15:20Z

And introducing continuous factors, same question:

BorisMuzellec · 2023-10-23T12:40:39Z

Thanks @jeandut for the code updates. The variable names look fine now :).

Matrix rank

We still have the rank issue with designs of the form "~ factor1 + factor1:factor2" though. E.g., in the same example as above,

from pydeseq2.utils import build_design_matrix, load_example_data
counts = load_example_data()
metadata = load_example_data(modality="metadata")
design = build_design_matrix(metadata=metadata,
                   design_factors=["condition", "condition:group"])

we have the following design:

    intercept  condition_B_vs_A  condition:group_AY_vs_AX  \
sample1            1                 0                         0   
sample2            1                 0                         1   
sample3            1                 0                         0   
sample4            1                 0                         1   
sample5            1                 0                         0   
...              ...               ...                       ...   
sample96           1                 1                         0   
sample97           1                 1                         0   
sample98           1                 1                         0   
sample99           1                 1                         0   
sample100          1                 1                         0   

           condition:group_BX_vs_AX  condition:group_BY_vs_AX  
sample1                           0                         0  
sample2                           0                         0  
sample3                           0                         0  
sample4                           0                         0  
sample5                           0                         0  
...                             ...                       ...  
sample96                          0                         1  
sample97                          1                         0  
sample98                          0                         1  
sample99                          1                         0  
sample100                         0                         1

which does not have full column rank, because condition_B_vs_A = condition:group_BX_vs_AX + condition:group_BY_vs_AX (since group has only two values X and Y, knowing BX and BY is enough to know B).

In comparison, the design matrix output by DESeq2 only has the following columns ["intercept", "condition_B_vs_A", "condition:group_AY_vs_AX", "condition:group_BY_vs_AX"].

I think that when there are interaction terms in the design, we need to check whether those variables are also present on their own, and if so remove an additional column.
I added a test to check that the design has full rank in this case, which is why the CI now fails.
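As a rough illustration of the column set described above, the following sketch uses synthetic data and hand-built indicator columns (an assumption, not the PR's implementation) to check that dropping the BX interaction column leaves a full-rank design with the four DESeq2-style columns.

```python
import numpy as np

# Synthetic metadata: every (condition, group) combination appears twice.
condition = np.array(list("AAAABBBB"))
group = np.array(list("XYXYXYXY"))
combo = np.char.add(condition, group)

# Column names follow the naming used in this thread.
columns = {
    "intercept": np.ones(len(condition)),
    "condition_B_vs_A": condition == "B",
    "condition:group_AY_vs_AX": combo == "AY",
    "condition:group_BX_vs_AX": combo == "BX",
    "condition:group_BY_vs_AX": combo == "BY",
}


def column_rank(names):
    """Numerical rank of the design restricted to the given column names."""
    matrix = np.column_stack([columns[name] for name in names]).astype(float)
    return int(np.linalg.matrix_rank(matrix))


all_names = list(columns)
# Column set reported for DESeq2 above: the BX interaction column is dropped.
deseq2_names = [name for name in all_names if name != "condition:group_BX_vs_AX"]

print(len(all_names), column_rank(all_names))        # 5 columns, rank 4
print(len(deseq2_names), column_rank(deseq2_names))  # 4 columns, rank 4
```

A full-rank test along these lines only needs to compare the rank with the number of columns.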

In-place modification

On a side note, the present code modifies the metadata that is passed in (it adds columns). E.g.:

from pydeseq2.utils import build_design_matrix, load_example_data
counts = load_example_data()
metadata = load_example_data(modality="metadata")
print(metadata)

 condition group
sample1           A     X
sample2           A     Y
sample3           A     X
sample4           A     Y
sample5           A     X
...             ...   ...
sample96          B     Y
sample97          B     X
sample98          B     Y
sample99          B     X
sample100         B     Y

_ = build_design_matrix(metadata=metadata, 
                   design_factors=["condition", "condition:group"])
print(metadata)

      condition group condition:group
sample1           A     X              AX
sample2           A     Y              AY
sample3           A     X              AX
sample4           A     Y              AY
sample5           A     X              AX
...             ...   ...             ...
sample96          B     Y              BY
sample97          B     X              BX
sample98          B     Y              BY
sample99          B     X              BX
sample100         B     Y              BY

It would be better to avoid this, e.g. by adding an inplace argument to the interaction term utilities, or deleting added columns after the code is done running.
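One way the inplace idea could look is sketched below. The helper name add_interaction_column and its signature are hypothetical (they are not taken from pydeseq2/interaction_utils.py); the point is only that the combined column is written to a copy unless the caller opts in.

```python
import pandas as pd


def add_interaction_column(metadata, factors, inplace=False):
    """Hypothetical helper: add a combined column such as 'condition:group'.

    With inplace=False (the default) the caller's DataFrame is left untouched.
    """
    target = metadata if inplace else metadata.copy()
    combined = target[factors[0]].astype(str)
    for name in factors[1:]:
        combined = combined + target[name].astype(str)
    target[":".join(factors)] = combined
    return target


metadata = pd.DataFrame(
    {"condition": ["A", "A", "B", "B"], "group": ["X", "Y", "X", "Y"]},
    index=["sample1", "sample2", "sample3", "sample4"],
)
augmented = add_interaction_column(metadata, ["condition", "group"])

print(list(metadata.columns))   # the original columns only
print(list(augmented.columns))  # original columns plus 'condition:group'
```

Copying costs some memory on large metadata tables, which is the trade-off against deleting the added columns afterwards.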

I'm not a huge fan of adding too many dependencies, but I'm starting to wonder if we could save ourselves some pain by relying on formulaic, as suggested in #125...

pydeseq2/dds.py

pydeseq2/interaction_utils.py

thondeboer · 2024-01-05T22:15:21Z

Has this attempt to introduce interaction terms been abandoned? I was hoping not to have to resort to R to get interaction designs to work, since they are a very important part of DESeq2 in R, and I was quite surprised this was not part of the original pyDESeq2. Is this specifically hard to implement in Python for some reason? Just curious.

jeandut · 2024-01-05T22:35:59Z

This attempt has not been abandoned. I have not been able to make more progress for lack of bandwidth; I don't want to make hard commitments, but I hope this gets done in Q1.
That said, this did turn out to be more complicated than I expected, mainly because of the versatility of the formula syntax and its interaction with the rank-reduction step. For this reason I will rely on formulaic.
In the meantime, all those manipulations can be done in Python on a case-by-case basis outside of pydeseq2.
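For instance, a design of the form "~condition + group + condition:group" with two-level factors can be assembled by hand with pandas. This is only a sketch on toy data: the column names are illustrative, and it does not show how the resulting matrix would be handed to pydeseq2.

```python
import pandas as pd

# Toy metadata with two binary factors.
metadata = pd.DataFrame(
    {"condition": ["A", "A", "B", "B"], "group": ["X", "Y", "X", "Y"]},
    index=["sample1", "sample2", "sample3", "sample4"],
)

# Treatment coding against the reference levels A and X.
is_b = (metadata["condition"] == "B").astype(int)
is_y = (metadata["group"] == "Y").astype(int)

design = pd.DataFrame(
    {
        "intercept": 1,
        "condition_B_vs_A": is_b,
        "group_Y_vs_X": is_y,
        # For two binary factors, the interaction is the product of the indicators.
        "condition_B:group_Y": is_b * is_y,
    }
)
print(design)
```

With more than two levels per factor, each non-reference level needs its own indicator and the interaction columns are the pairwise products of those indicators.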

examples/plot_pandas_io_example.py

tests/test_build_design_matrix.py

Marwansha · 2024-07-18T11:13:40Z

Hi,

I was wondering if this will be implemented soon? I have done all my analysis in Python but was asked for some interaction terms, and I was wondering whether I should switch entirely to R, or whether this will be implemented soon.

Thanks

jeandut · 2024-08-19T15:13:49Z

Hi @Marwansha, this PR is pretty much finished but, as the changes are substantial, we want to spend some extra time reviewing it before releasing it (we are even thinking of doing a pre-release). Fingers crossed, this will be merged soonish.
In the meantime you can check out this branch and install it from source to test it against your usual workflow. We would be super happy to get your feedback!

abearab · 2024-09-19T05:48:22Z

I'll be happy to do some analysis using this branch. Is there any specific concern you guys have in mind? :)

abearab · 2024-09-19T06:52:46Z

Here is a quick try on my data:

build_design_matrix(
    metadata=rnaseq_data_wt.obs,
    design_factors = '`Treatment`+`Time`+`Treatment:Time`',
    ref_level=[('Treatment','DMSO'),('Time','8hr')]
)

I'm trying to use "Treatment" and "Time" as covariates, but I ran into an error while setting the ref_level:

FormulaSyntaxError: Missing operator between `C(` and `Treatment`.

⧛`C(`Treatment⧚`, contr.treatment(base='DMSO'))`+`C(`Time`, contr.treatment(base='8hr'))`+`C(`Treatment`, contr.treatment(base='DMSO')):C(`Time`, contr.treatment(base='8hr'))`

Full Error:

---------------------------------------------------------------------------
FormulaSyntaxError                        Traceback (most recent call last)
Cell In[145], line 1
----> 1 build_design_matrix(
      2     metadata=rnaseq_data_wt.obs,
      3     design_factors = '`Treatment`+`Time`+`Treatment:Time`',
      4     ref_level=[('Treatment','DMSO'),('Time','8hr')]
      5 )

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/pydeseq2/utils.py:235, in build_design_matrix(metadata, design_factors, ref_level)
    232         all_metadata_ref_levels[col] = sorted(metadata[col].unique())[0]
    234 try:
--> 235     design_matrix = model_matrix(design_factors, metadata)
    236 except formulaic.errors.FactorEvaluationError:
    237     # It is a design choice due to the fact that forumalaic doesn't handle
    238     # well expressions with hyphens
    239     warnings.warn(
    240         "It seems one of the factor of the formula could not be"
    241         "well parsed by formulaic trying to fix it",
    242         UserWarning,
    243         stacklevel=2,
    244     )

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/sugar.py:51, in model_matrix(spec, data, context, **spec_overrides)
     16 """
     17 Generate a model matrix directly from a formula or model spec.
     18 
   (...)
     48     nominated structure.
     49 """
     50 _context = capture_context(context + 1) if isinstance(context, int) else context
---> 51 return ModelSpec.from_spec(spec, **spec_overrides).get_model_matrix(
     52     data, context=_context
     53 )

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/model_spec.py:107, in ModelSpec.from_spec(cls, spec, **attrs)
    104     return ModelSpec(formula=formula, **attrs)
    106 if isinstance(spec, Formula) or not isinstance(spec, Structured):
--> 107     return prepare_model_spec(spec)
    108 return cast(ModelSpecs, spec._map(prepare_model_spec, as_type=ModelSpecs))

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/model_spec.py:99, in ModelSpec.from_spec.<locals>.prepare_model_spec(obj)
     97 if isinstance(obj, ModelSpec):
     98     return obj.update(**attrs)
---> 99 formula = Formula.from_spec(obj)
    100 if not formula._has_root or formula._has_structure:
    101     return cast(
    102         ModelSpec, formula._map(prepare_model_spec, as_type=ModelSpecs)
    103     )

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/formula.py:117, in Formula.from_spec(cls, spec, parser, nested_parser, ordering)
    115 if isinstance(spec, Formula):
    116     return spec
--> 117 return Formula(
    118     spec, _parser=parser, _nested_parser=nested_parser, _ordering=ordering
    119 )

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/formula.py:132, in Formula.__init__(self, _parser, _nested_parser, _ordering, *args, **kwargs)
    130 self._nested_parser = _nested_parser or _parser or self.DEFAULT_NESTED_PARSER
    131 self._ordering = OrderingMethod(_ordering)
--> 132 super().__init__(*args, **kwargs)
    133 self._simplify(unwrap=False, inplace=True)

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/parser/types/structured.py:101, in Structured.__init__(self, root, _metadata, **structure)
     96     raise ValueError(
     97         "Substructure keys cannot start with an underscore. "
     98         f"The invalid keys are: {set(key for key in structure if key.startswith('_'))}."
     99     )
    100 if root is not _MISSING:
--> 101     structure["root"] = self.__prepare_item("root", root)
    102 self._metadata = _metadata
    104 self._structure = {
    105     key: self.__prepare_item(key, item) for key, item in structure.items()
    106 }

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/parser/types/structured.py:115, in Structured.__prepare_item(self, key, item)
    113 if isinstance(item, tuple):
    114     return tuple(self.__prepare_item(key, v) for v in item)
--> 115 return self._prepare_item(key, item)

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/formula.py:152, in Formula._prepare_item(self, key, item)
    136 """
    137 Convert incoming formula items into either a list of Terms or a nested
    138 `Formula` instance.
   (...)
    145     item: The specification to convert.
    146 """
    148 if isinstance(item, str):
    149     item = cast(
    150         FormulaSpec,
    151         (self._parser if key == "root" else self._nested_parser)
--> 152         .get_terms(item)
    153         ._simplify(),
    154     )
    156 if isinstance(item, Structured):
    157     formula_or_terms = Formula(
    158         _parser=self._nested_parser, **item._structure
    159     )._simplify()

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/parser/parser.py:132, in DefaultFormulaParser.get_terms(self, formula)
    119 def get_terms(self, formula: str) -> Structured[List[Term]]:
    120     """
    121     Assemble the `Term` instances for a formula string. Depending on the
    122     operators involved, this may be an iterable of `Term` instances, or
   (...)
    130         formula: The formula for which an AST should be generated.
    131     """
--> 132     terms = super().get_terms(formula)
    134     def check_terms(terms: Iterable[Term]) -> None:
    135         seen_terms = set()

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/parser/types/formula_parser.py:72, in FormulaParser.get_terms(self, formula)
     63 def get_terms(self, formula: str) -> Structured[List[Term]]:
     64     """
     65     Assemble the `Term` instances for a formula string. Depending on the
     66     operators involved, this may be an iterable of `Term` instances, or
   (...)
     70         formula: The formula for which an AST should be generated.
     71     """
---> 72     ast = self.get_ast(formula)
     73     if ast is None:
     74         return Structured([])

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/parser/types/formula_parser.py:58, in FormulaParser.get_ast(self, formula)
     50 """
     51 Assemble an abstract syntax tree for the nominated `formula` string.
     52 
     53 Args:
     54     formula: The formula for which an AST should be generated.
     55 """
     56 from ..algos.tokens_to_ast import tokens_to_ast
---> 58 return tokens_to_ast(
     59     self.get_tokens(formula),
     60     operator_resolver=self.operator_resolver,
     61 )

File ~/miniconda3/envs/rnaseq/lib/python3.11/site-packages/formulaic/parser/algos/tokens_to_ast.py:135, in tokens_to_ast(tokens, operator_resolver)
    133 if output_queue:
    134     if len(output_queue) > 1:
--> 135         raise exc_for_missing_operator(output_queue[0], output_queue[1])
    136     return output_queue[0]
    138 return None

FormulaSyntaxError: Missing operator between `C(` and `Treatment`.

⧛`C(`Treatment⧚`, contr.treatment(base='DMSO'))`+`C(`Time`, contr.treatment(base='8hr'))`+`C(`Treatment`, contr.treatment(base='DMSO')):C(`Time`, contr.treatment(base='8hr'))`

jeandut · 2024-09-19T15:15:43Z

Thank you for the feedback! Two things I would specifically look for are:

differences of default settings with interaction terms (what happens when you do not specify reference levels)
limited support for the full syntax offered by Wilkinson formulas (maybe what happened to you in your example)

jeandut · 2024-09-19T15:17:05Z

Do you have sample data for a minimal working example (MWE) reproducing your error?
I think the backtick quotes you added around the factor names might be the source of the error, but this should not happen.
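To illustrate why the backticks are a plausible culprit, here is a sketch of one mechanism that would produce the formula string shown in the error. The function wrap_with_ref_levels is hypothetical and is not taken from pydeseq2's source; it simply wraps each factor name in a C(...) call by text substitution, which nests backticks when the input names are already quoted.

```python
import re


def wrap_with_ref_levels(formula, ref_level):
    """Hypothetical substitution: wrap each factor name in a C(...) contrast call."""
    wrapped = {
        name: f"C(`{name}`, contr.treatment(base='{base}'))"
        for name, base in ref_level
    }
    # Single pass over the formula, so text already substituted is not rescanned.
    pattern = re.compile("|".join(re.escape(name) for name in wrapped))
    return pattern.sub(lambda match: wrapped[match.group(0)], formula)


ref_level = [("Treatment", "DMSO"), ("Time", "8hr")]

# Input as written in the report: names already wrapped in backticks.
quoted = wrap_with_ref_levels("`Treatment`+`Time`+`Treatment:Time`", ref_level)
# Same formula without the user-supplied backticks.
plain = wrap_with_ref_levels("Treatment+Time+Treatment:Time", ref_level)

print(quoted)
print(plain)
```

Under this assumption, quoted comes out identical to the formula in the error message (minus the position markers), with backticks nested inside backticks, while plain has each name quoted exactly once.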

khalilouardini requested review from BorisMuzellec, maikia and arthurPignetOwkin as code owners October 17, 2023 12:51

khalilouardini requested a review from a user October 17, 2023 12:51

khalilouardini marked this pull request as draft October 17, 2023 12:51

This was referenced Oct 17, 2023

Refactor utils.py #182

Open

Tutorial including interactions terms in the design #183

Open

[Enhancement] expanded argument not used in utils.build_design_matrix #184

Open