[FEA] Add min_by aggregate support #16139

Open
firestarman opened this issue Jul 1, 2024 · 0 comments · May be fixed by #16163
Labels
feature request New feature or request

Comments

Contributor

firestarman commented Jul 1, 2024

Is your feature request related to a problem? Please describe.
Spark supports an aggregate called min_by, which is used to return the value of one column associated with the minimum value of another column.

> SELECT min_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 20) AS tab(x, y);
 a

It would be great if cuDF supported min_by natively, so that it can be combined with other common aggregates (e.g. max, min) in a single groupby-aggregation execution.

Describe the solution you'd like
We can leverage argmin to get, for each group, the index of the row holding the minimum value of the ordering column, then gather the corresponding values from the value column.
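As a rough illustration of the argmin-then-gather idea, here is a minimal pure-Python sketch. It is not cuDF code: the helper names `group_argmin` and `group_min_by` are hypothetical, nulls are ignored, and keeping the first row on ties is an assumption rather than documented Spark or cuDF behavior.

```python
from typing import Dict, Hashable, Sequence


def group_argmin(keys: Sequence[Hashable], order: Sequence) -> Dict[Hashable, int]:
    """Map each group key to the row index of its minimum ordering value.

    Hypothetical helper standing in for a groupby argmin aggregation.
    """
    best: Dict[Hashable, int] = {}
    for row, (key, o) in enumerate(zip(keys, order)):
        # Strict '<' keeps the first row on ties (an assumption for this sketch).
        if key not in best or o < order[best[key]]:
            best[key] = row
    return best


def group_min_by(keys: Sequence[Hashable], values: Sequence, order: Sequence) -> Dict:
    """Per-group min_by(values, order): argmin first, then gather."""
    indices = group_argmin(keys, order)
    # Gather step: pick the value at each group's argmin row.
    return {key: values[row] for key, row in indices.items()}


keys = ["g1", "g1", "g1", "g2", "g2"]
x = ["a", "b", "c", "d", "e"]
y = [10, 50, 20, 7, 3]

result = group_min_by(keys, x, y)
print(result)  # {'g1': 'a', 'g2': 'e'}
```

Group g1 reuses the rows from the SQL example above, so it yields 'a'; a native implementation would run the argmin together with the other aggregates and perform the gather internally.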

Describe alternatives you've considered
argmin is already exposed to users, so min_by could be implemented outside of cuDF.
But that is likely to make users' code more complicated, because when min_by runs alongside other common aggregates (count, max, min), the aggregates cannot all be computed in a single pass.
It may also lead to higher memory pressure and worse performance, because the original device memory cannot be released until the gather operation, which runs after all the aggregates complete, has finished.

@firestarman firestarman added the feature request New feature or request label Jul 1, 2024
@firestarman firestarman changed the title [FEA] add min_by aggregate support [FEA] Add min_by aggregate support Jul 1, 2024
@thirtiseven thirtiseven linked a pull request Jul 2, 2024 that will close this issue
3 tasks
Projects
Status: In Progress