[FEA] Add `min_by` aggregate support #16139

firestarman · 2024-07-01T08:08:40Z

Is your feature request related to a problem? Please describe.
Spark supports an aggregate called min_by, which is used to return the value of one column associated with the minimum value of another column.

> SELECT min_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 20) AS tab(x, y);
 a

It would be great if cuDF supported min_by, so that we could combine it with other common aggregates (e.g. max, min) in a single groupby-aggregation execution.

Describe the solution you'd like
We can use argmin to get the row indices of the minimum values in the ordering column, then gather the corresponding values from the value column.
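To illustrate the argmin-then-gather idea, here is a minimal sketch in plain Python. It does not use cuDF; the function name `groupby_min_by` and the sample data are hypothetical. It also assumes there are no nulls and that ties resolve to the first row seen, which may differ from what Spark or an eventual cuDF implementation does.

```python
def groupby_min_by(keys, values, ordering):
    """Return {key: value from the row whose ordering is smallest in that group}."""
    # Step 1 (argmin): per group, remember the row index of the smallest ordering value.
    argmin = {}
    for row, (key, order) in enumerate(zip(keys, ordering)):
        best = argmin.get(key)
        # Strict '<' keeps the first row on ties (an assumption of this sketch).
        if best is None or order < ordering[best]:
            argmin[key] = row
    # Step 2 (gather): read the value column at the saved row indices.
    return {key: values[row] for key, row in argmin.items()}


# Hypothetical sample data: group "g1" mirrors the SQL example above.
keys = ["g1", "g1", "g1", "g2"]
x = ["a", "b", "c", "d"]
y = [10, 50, 20, 5]
result = groupby_min_by(keys, x, y)
```

In this sketch the gather can only run after the argmin pass has finished, so the value column has to stay alive until then. This is the dependency that a built-in aggregate could handle internally.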

Describe alternatives you've considered
argmin is already exposed to users, so min_by could be implemented outside of cuDF.
However, this is likely to make users' code more complicated, because when min_by runs alongside other common aggregates (count, max, min), not all of the aggregates can be computed in a single pass.
It may also lead to higher memory pressure and worse performance, because the original device memory cannot be released until the gather operation, which runs after all the aggregates complete, has finished.

firestarman added the "feature request" label Jul 1, 2024

firestarman changed the title ~~[FEA] add min_by aggregate support~~ [FEA] Add min_by aggregate support Jul 1, 2024

firestarman mentioned this issue Jul 1, 2024

[FEA] support min_by function NVIDIA/spark-rapids#10968

Open

thirtiseven linked a pull request Jul 2, 2024 that will close this issue

Support min_by group by aggregate #16163

Draft

3 tasks
