Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Introduce group_by #228

Closed
8 of 14 tasks
dmpetrov opened this issue Aug 2, 2024 · 4 comments · Fixed by #482
Closed
8 of 14 tasks

Introduce group_by #228

dmpetrov opened this issue Aug 2, 2024 · 4 comments · Fixed by #482
Assignees
Labels
enhancement New feature or request priority-p2

Comments

@dmpetrov
Copy link
Member

dmpetrov commented Aug 2, 2024

These has to work inside DB:

from datachain import func

chain.group_by(
        name=func.first(Column("name")),
        total=func.sum(Column("num")),
        cnt=func.count(),
        partition_by=Column("class"),
)

Open question:

  • Should it be a separate group_by() or a part of agg()?

Functions to implement:

  • count
  • sum
  • avg
  • min
  • max
  • any_value
  • collect / list
  • concat - concatenate strings

Later:

  • first
  • last
  • std
  • var
  • min - for not int/float should return None is None in a list, otherwise - any value(like max([None, None, "Cat", "Dog", None]) --> None)
  • max - for not int/float should return any value but None (like max([None, None, "Cat", "Dog", None]) --> "Cat")
@dmpetrov dmpetrov added enhancement New feature or request priority-p2 labels Aug 2, 2024
@EdwardLi-coder
Copy link
Contributor

Hi @dmpetrov. I think group_by should be implemented as a separate method, rather than as part of agg(). This approach would provide a clearer API and maintain consistency with the conventional usage in most data processing libraries (such as pandas and SQL).

@dmpetrov
Copy link
Member Author

dmpetrov commented Aug 7, 2024

@EdwardLi-coder agree, it seems a cleaner API. In general, I like the idea of separating DB/CPU compute from application/GPU compute. Like mutate() and map().

@dreadatour dreadatour self-assigned this Sep 25, 2024
@dreadatour
Copy link
Contributor

dreadatour commented Sep 27, 2024

Intermediate results:

group_by.py:

import os
from datachain import DataChain, func


def path_ext(path):
    _, ext = os.path.splitext(path)
    return (ext.lstrip("."),)


(
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .map(
        path_ext,
        params=["file.path"],
        output={"path_ext": str},
    )
    .group_by(
        total_size=func.sum("file.size"),
        cnt=func.count(),
        partition_by="path_ext",
    )
    .show()
)

Running:

~/playground $ python group_by.py
  path_ext  total_size    cnt
0      jpg  1079645149  43042
1     json    29743128  43047
2  parquet    15378208      5
3      txt     2927814  43042
~/playground $

TBD: cleanup the code, add more aggregate functions, add tests and create PR. Draft PR: #482

@dreadatour dreadatour linked a pull request Oct 20, 2024 that will close this issue
@dreadatour
Copy link
Contributor

Merged. Closing this issue as work will continue in the follow-up #523 issue.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request priority-p2
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants