Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Series.drop_nil #864

Open
josevalim opened this issue Feb 20, 2024 · 8 comments
Open

Add Series.drop_nil #864

josevalim opened this issue Feb 20, 2024 · 8 comments

Comments

@josevalim
Copy link
Member

There are two possible versions of this function:

  • Operates on every series and removes nils from the series
  • Operates on list series and only drops elements inside lists (it is unclear if it is any level)

We need to decide the API. If we call it drop_nil/1 for series, how would we call it the list version? We could use the list_ prefix but we have generally avoided those.

Another option is to pass an atom to drop_nil/2, such as Series.drop_nil(series, :list) but then I am not sure how to call the regular version. Ideas are welcome.

@cigrainger
Copy link
Member

Do we need the regular version of this? It feels like clutter. The idea of minimal verbs is that you can do everything you need with them. Why not just Series.filter(not is_nil(_))?

@cigrainger
Copy link
Member

cigrainger commented Feb 20, 2024

Just thinking out loud here: I think there's got to be a more elegant way of dealing with lists than cleaving to Polars's API too closely. I'm not sure I like everything about purrr but I wonder if something like map_depth might help us out here?

There's also modify_tree. Something that indicates it's recursive might be useful?

@josevalim
Copy link
Member Author

josevalim commented Feb 20, 2024

@cigrainger you are right, we don't need drop_nil at the root. However, the issue presented here is also available for many of the aggregate functions. How to distinguish between max of a list and maxof the series? Perhaps some sort of prefix or suffix for all lists operations?

map_depth/modify_tree is definitely interesting. Implementing it is a bit less trivial. We would need to introduce some sort of LazyDepthSeries, that collects operations in a series, but as it relates to a struct or list field, and then translate that to polars. Do we want to go down this route?

@cigrainger
Copy link
Member

Maybe. I need to explore how Polars handles this itself and in py polars.

@cigrainger
Copy link
Member

So putting this a bit to the test with Python Polars:

It seems like nested lists may not be supported? That would fix the recursion problem pretty cleanly.

In [16]: df = pl.DataFrame({"values": [[None, 1, None, 2], [None], [3, 4], [[None, 1], [2], [None]]]})

In [17]: df
Out[17]:
shape: (4, 1)
┌────────────────────┐
│ values             │
│ ---                │
│ list[i64]          │
╞════════════════════╡
│ [null, 1, … 2]     │
│ [null]             │
│ [3, 4]             │
│ [null, null, null] │
└────────────────────┘

@josevalim
Copy link
Member Author

@cigrainger all lists need to be nested equally. When it fails to cast to a certain type, it returns null instead of raising.

@billylanchantin
Copy link
Contributor

@cigrainger Also, in case you missed it, there was an interesting saga of us discovering what Polars was doing WRT nested lists here:

Some take-aways:

  • The first element of a nested list is the tie-breaker when the dtype is ambiguous.
  • We decided to be more strict with our inference code than py-polars in certain situations.

@cigrainger
Copy link
Member

I did! Thanks @billylanchantin

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants