-
Notifications
You must be signed in to change notification settings - Fork 104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Introduce group_by #228
Comments
Hi @dmpetrov. I think group_by should be implemented as a separate method, rather than as part of agg(). This approach would provide a clearer API and maintain consistency with the conventional usage in most data processing libraries (such as pandas and SQL). |
@EdwardLi-coder agree, it seems a cleaner API. In general, I like the idea of separating DB/CPU compute from application/GPU compute. Like mutate() and map(). |
Intermediate results:
import os
from datachain import DataChain, func
def path_ext(path):
_, ext = os.path.splitext(path)
return (ext.lstrip("."),)
(
DataChain.from_storage("s3://dql-50k-laion-files/")
.map(
path_ext,
params=["file.path"],
output={"path_ext": str},
)
.group_by(
total_size=func.sum("file.size"),
cnt=func.count(),
partition_by="path_ext",
)
.show()
) Running:
TBD: cleanup the code, add more aggregate functions, add tests and create PR. Draft PR: #482 |
Merged. Closing this issue as work will continue in the follow-up #523 issue. |
These has to work inside DB:
Open question:
group_by()
or a part ofagg()
?Functions to implement:
Later:
max([None, None, "Cat", "Dog", None]) --> None
)max([None, None, "Cat", "Dog", None]) --> "Cat"
)The text was updated successfully, but these errors were encountered: