Is your feature request related to a problem? Please describe.
Spark supports an aggregate called min_by, which is used to return the value of one column associated with the minimum value of another column.
> SELECT min_by(x, y) FROM VALUES ('a', 10), ('b', 50), ('c', 20) AS tab(x, y);
a
It would be great if cuDF supported min_by, so that it can be mixed with other common aggregates (e.g. max, min) in a single groupby-aggregation execution.
Describe the solution you'd like
We can leverage argmin to get the indices of the minimum values in the ordering column, then gather the corresponding values from the value column.
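The argmin-then-gather idea can be sketched in plain Python as below. This is only an illustration of the semantics, not cuDF code: the function name `groupby_min_by` and its parameters are hypothetical, ties are resolved by keeping the first row seen, and null handling is ignored.

```python
def groupby_min_by(keys, values, ordering):
    """Hypothetical helper: for each key, return the entry of `values`
    taken from the row where `ordering` is smallest within that group."""
    # Step 1: per-group argmin over the ordering column.
    # Maps each key to the row index of its minimum ordering value.
    argmin = {}
    for row, key in enumerate(keys):
        best = argmin.get(key)
        # Strict "<" keeps the first row on ties (an assumption of this sketch).
        if best is None or ordering[row] < ordering[best]:
            argmin[key] = row

    # Step 2: gather the value column at the argmin row indices.
    return {key: values[row] for key, row in argmin.items()}


# Same data as the Spark example above, with all rows in one group "g".
result = groupby_min_by(
    keys=["g", "g", "g"],
    values=["a", "b", "c"],
    ordering=[10, 50, 20],
)
```

With this data the single group's minimum ordering value is 10, so the gathered value is 'a', matching the Spark result.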
Describe alternatives you've considered
argmin is already exposed to users, so min_by can be implemented outside of cuDF.
But this is likely to make user code more complicated, because when min_by runs alongside other common aggregates (count, max, min), the aggregates cannot all be computed in a single pass.
It may also lead to higher memory pressure and worse performance, because the original device memory cannot be released until the gather operation, which has to run after all the aggregates complete, is done.