Introduce group_by #228

dmpetrov · 2024-08-02T23:32:35Z

EdwardLi-coder · 2024-08-07T09:28:54Z

Hi @dmpetrov. I think group_by should be implemented as a separate method, rather than as part of agg(). This approach would provide a clearer API and maintain consistency with the conventional usage in most data processing libraries (such as pandas and SQL).

dmpetrov · 2024-08-07T19:27:20Z

@EdwardLi-coder agreed, that seems like a cleaner API. In general, I like the idea of separating DB/CPU compute from application/GPU compute, as with mutate() and map().
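
To illustrate that split, here is a minimal sketch that uses Python's built-in sqlite3 module rather than DataChain itself. The extension is computed per row in plain Python (the application side, in the spirit of map()), while grouping and aggregation run inside the database (the DB side). The table name, column names, and sample rows are made up for illustration and do not reflect DataChain internals.

```python
import os
import sqlite3

# Hypothetical sample rows: (path, size in bytes). Not real data.
FILES = [
    ("a/1.jpg", 100),
    ("a/1.json", 10),
    ("b/2.jpg", 250),
    ("b/2.txt", 5),
]


def path_ext(path):
    """Application-side step: plain Python, run once per row."""
    _, ext = os.path.splitext(path)
    return ext.lstrip(".")


def summarize(files):
    """DB-side step: grouping and aggregation are done by SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (path TEXT, size INTEGER, path_ext TEXT)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [(path, size, path_ext(path)) for path, size in files],
    )
    rows = conn.execute(
        "SELECT path_ext, SUM(size) AS total_size, COUNT(*) AS cnt "
        "FROM files GROUP BY path_ext ORDER BY path_ext"
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for row in summarize(FILES):
        print(row)
```

Keeping the per-row function and the aggregation in separate steps means the grouping never needs to call back into Python, which is the property a standalone group_by method would preserve.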

dreadatour · 2024-09-27T13:55:59Z

Intermediate results:

group_by.py:

import os
from datachain import DataChain, func


def path_ext(path):
    # File extension without the leading dot, returned as a 1-tuple.
    _, ext = os.path.splitext(path)
    return (ext.lstrip("."),)


(
    DataChain.from_storage("s3://dql-50k-laion-files/")
    .map(
        path_ext,
        params=["file.path"],
        output={"path_ext": str},
    )
    .group_by(
        total_size=func.sum("file.size"),
        cnt=func.count(),
        partition_by="path_ext",
    )
    .show()
)

Running:

~/playground $ python group_by.py
  path_ext  total_size    cnt
0      jpg  1079645149  43042
1     json    29743128  43047
2  parquet    15378208      5
3      txt     2927814  43042
~/playground $
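
For readers unfamiliar with the semantics, the aggregation above behaves roughly like the following pure-Python sketch: rows are partitioned by extension, then sizes are summed and rows counted per partition. The helper name group_by_ext and the sample rows are hypothetical, and this is not how DataChain implements group_by.

```python
import os
from collections import defaultdict


def group_by_ext(files):
    """Partition (path, size) pairs by extension; sum sizes and count rows."""
    totals = defaultdict(lambda: {"total_size": 0, "cnt": 0})
    for path, size in files:
        ext = os.path.splitext(path)[1].lstrip(".")
        totals[ext]["total_size"] += size
        totals[ext]["cnt"] += 1
    # Sort by extension so the result order is deterministic.
    return dict(sorted(totals.items()))


if __name__ == "__main__":
    # Made-up sample rows, not taken from the bucket used in the run above.
    sample = [("a/1.jpg", 100), ("a/1.json", 10), ("b/2.jpg", 250)]
    for ext, stats in group_by_ext(sample).items():
        print(ext, stats["total_size"], stats["cnt"])
```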

TBD: clean up the code, add more aggregate functions, add tests, and create a PR. Draft PR: #482

dreadatour · 2024-10-20T03:23:14Z

Merged. Closing this issue, as work will continue in the follow-up issue #523.

dmpetrov added the enhancement (New feature or request) and priority-p2 labels Aug 2, 2024

dreadatour self-assigned this Sep 25, 2024

dreadatour linked a pull request Oct 20, 2024 that will close this issue

Implement chain group_by #482

Merged

dreadatour mentioned this issue Oct 20, 2024

Implement more group_by functions #523

Open

6 tasks

dreadatour closed this as completed Oct 20, 2024

dreadatour mentioned this issue Oct 21, 2024

Finish SQL functions refactoring #525

Closed
