Support min_by group by aggregate #16163

thirtiseven · 2024-07-02T10:34:58Z

Description

This PR adds support for min_by, an aggregation that returns the value of one column associated with the minimum value of another column. It will be useful for spark-rapids.

Currently this PR supports only sort-based group by. I will try to add a hash-based group by too, but it is not yet clear to me how to do that, because the input column from Spark will be a struct column of value and order.

Related pr in spark-rapids: NVIDIA/spark-rapids#11123

For Spark, all orderable types (basic types and array/struct) are supported, except float and double columns containing NaN values, because Spark handles NaN specially in non-nested floating-point types.
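To make the intended semantics concrete, here is a minimal Python sketch of a sort-based min_by over plain lists. It is not the libcudf implementation in this PR; the function name `sort_based_min_by` and the list-based inputs are assumptions for illustration, and nulls and NaN are deliberately left out.

```python
from itertools import groupby
from operator import itemgetter


def sort_based_min_by(keys, values, orders):
    """For each key, return the value paired with the smallest order.

    Hypothetical helper: assumes no nulls and no NaN in `orders`.
    """
    # Sort rows by (key, order) so each key forms a contiguous run
    # whose first row carries the minimum order for that key.
    rows = sorted(zip(keys, orders, values), key=itemgetter(0, 1))
    return {key: next(group)[2] for key, group in groupby(rows, key=itemgetter(0))}


result = sort_based_min_by(
    keys=["a", "a", "b", "b", "b"],
    values=["x", "y", "p", "q", "r"],
    orders=[3, 1, 5, 2, 4],
)
print(result)  # {'a': 'y', 'b': 'q'}
```

The (value, order) pairs zipped together here play the role of the struct column that Spark would pass in.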

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2024-07-02T10:35:03Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


bdice

Can you explain an example of where this new aggregation is a better choice than argmin + gather? libcudf already provides the essential building blocks for this kind of operation, and I don't see how this specialized implementation provides a significant benefit. I'm not sure if I find the claims in #16139 about memory pressure, single-pass aggregation, and performance to be compelling from a surface level view.

If we do go this route and decide this feature is necessary, we should also implement a max_by aggregation at the same time, for symmetry.

thirtiseven · 2024-07-08T03:18:03Z

> Can you explain an example of where this new aggregation is a better choice than argmin + gather? libcudf already provides the essential building blocks for this kind of operation, and I don't see how this specialized implementation provides a significant benefit. I'm not sure if I find the claims in #16139 about memory pressure, single-pass aggregation, and performance to be compelling from a surface level view.

min_by is an aggregation in Spark but not in pandas, so we would like to match Spark's behavior directly on the cuDF side. argmin/argmax + gather has a few feature gaps relative to Spark's min_by/max_by that are difficult to handle on the spark-rapids side, such as:

All nulls in a grouped order column: Spark returns null, but argmin/argmax + gather returns the first element of the grouped value column.
NaN in float aggregations: in Spark, NaN is the maximum float value, but in cuDF the result of a calculation involving NaN is undefined.
Ties: in Spark, min_by and max_by return the value for the last minimum (or maximum) order value. argmin matches this, but cuDF's argmax returns the first maximum order value.
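The three behaviors listed above can be sketched for a single group in plain Python. This follows only what this comment describes (it is not taken from Spark or cuDF source), and `spark_like_min_by` and `_rank` are hypothetical names.

```python
import math


def _rank(order):
    # Assumption from the comment above: NaN sorts above every other float.
    is_nan = isinstance(order, float) and math.isnan(order)
    return (1, 0.0) if is_nan else (0, order)


def spark_like_min_by(values, orders):
    """Return the value paired with the minimum non-null order, or None."""
    best = None  # index of the current minimum order
    for i, order in enumerate(orders):
        if order is None:  # null orders never win
            continue
        # "<=" keeps the last row among ties, as the comment describes.
        if best is None or _rank(order) <= _rank(orders[best]):
            best = i
    # If every order was null, the result is null (None).
    return None if best is None else values[best]


print(spark_like_min_by(["a", "b"], [None, None]))
print(spark_like_min_by(["a", "b", "c"], [float("nan"), 2.0, 3.0]))
print(spark_like_min_by(["a", "b", "c"], [1, 1, 2]))
```

The three calls exercise the all-null group, NaN treated as the largest value, and the last-wins tie rule, in that order.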

We have handled the all-nulls case by modifying the null masks and using the argmin + gather route, to quickly support our customer's need. As a next step, we'd like an independent implementation that supports float aggregation and max_by.

Another reason is that min_by and max_by are two special aggregations in Spark that operate on two different columns. We therefore need to pack the two columns into one struct column for cuDF to handle, because AFAIK cuDF only supports aggregating a single column. Otherwise, spark-rapids would need special post-processing: check whether the aggregations include min_by/max_by, gather their value columns, and concatenate the results back into the original aggregation result table, which would be much slower and harder to implement.

> If we do go this route and decide this feature is necessary, we should also implement a max_by aggregation at the same time, for symmetry.

Will do if we go this route.

bdice · 2024-07-08T11:20:07Z

In terms of semantics, the proposed min_by would need to match the argmin + gather implementation exactly. In libcudf, null and NaN control are handled by separate arguments. For example, the collect set aggregation (https://docs.rapids.ai/api/libcudf/stable/group__aggregation__factories#gaebe680a414f3c942a631f609bcfb5781) accepts null_policy, null_equality, and nan_equality arguments. This is a better route to address the desired semantics if you can express it. There are also some examples of ordering policies in libcudf, like null_order.

We need to work on some improvements to argmin and argmax anyway, so this would be a good joint project for us. Currently we have “experimental” row comparators that we use for everything except for argmin calls, and we need to adopt those to expand support for nested list and struct columns. I can dig up the issue where this is discussed.


bdice · 2024-07-09T22:06:03Z

@thirtiseven Here is the other issue I was thinking of: #14412 (comment)

thirtiseven · 2024-07-10T03:45:07Z

> In terms of semantics, the proposed min_by would need to match the argmin + gather implementation exactly. In libcudf, null and NaN control are handled by separate arguments. For example, the collect set aggregation (https://docs.rapids.ai/api/libcudf/stable/group__aggregation__factories#gaebe680a414f3c942a631f609bcfb5781) accepts null_policy, null_equality, and nan_equality arguments. This is a better route to address the desired semantics if you can express it. There are also some examples of ordering policies in libcudf, like null_order.

That sounds good; adding those arguments should work. I will try to do this, maybe in a separate PR just for argmin/argmax, and then make min_by/max_by simply: 1. unpack the struct column; 2. call argmin/argmax with the appropriate arguments; 3. do a gather with the struct column. Does this make sense to you?
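A rough Python sketch of that three-step plan follows, with plain lists standing in for cuDF columns. The `nan_is_max` flag is a hypothetical stand-in for whatever policy arguments argmin/argmax might gain; it is not an existing libcudf parameter, and the grouping dictionary stands in for the real group-by machinery.

```python
import math
from collections import defaultdict


def argmin(orders, rows, nan_is_max=True):
    """Index (from `rows`) of the smallest non-null order, or None.

    `nan_is_max` is a hypothetical policy argument for illustration.
    With nan_is_max=False, NaN goes through ordinary float comparison,
    whose outcome is not well defined.
    """
    def rank(i):
        o = orders[i]
        if nan_is_max and isinstance(o, float) and math.isnan(o):
            return (1, 0.0)
        return (0, o)

    candidates = [i for i in rows if orders[i] is not None]
    # min() keeps the first minimum on ties.
    return min(candidates, key=rank) if candidates else None


def min_by(keys, struct_col, nan_is_max=True):
    # Step 1: unpack the (value, order) struct column to get the orders.
    orders = [s[1] for s in struct_col]
    # Group row indices by key.
    groups = defaultdict(list)
    for i, k in enumerate(keys):
        groups[k].append(i)
    out = {}
    for k, rows in groups.items():
        # Step 2: argmin over the order child with the chosen policy.
        idx = argmin(orders, rows, nan_is_max)
        # Step 3: gather the winning struct row and keep its value field.
        out[k] = None if idx is None else struct_col[idx][0]
    return out


result = min_by(
    keys=["a", "a", "b", "b"],
    struct_col=[("x", 2.0), ("y", float("nan")), ("p", None), ("q", None)],
)
print(result)  # {'a': 'x', 'b': None}
```

Keeping the policy in the argmin call means min_by itself stays a thin wrapper, which is the appeal of the approach proposed here.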

> We need to work on some improvements to argmin and argmax anyway, so this would be a good joint project for us. Currently we have “experimental” row comparators that we use for everything except for argmin calls, and we need to adopt those to expand support for nested list and struct columns. I can dig up the issue where this is discussed.
> @thirtiseven Here is the other issue I was thinking of: #14412 (comment)

Thank you! I was wondering how to write a hash-based implementation for min_by, since the value will always be a struct, but I have no idea yet.
Currently, for spark-rapids, the sort-based approach is good enough according to perf tests. We'd love to write a hash-based min_by/max_by as the next step, once argmin/argmax support nested types with row comparators.


github-actions bot added the libcudf, CMake, and Java labels Jul 2, 2024

firestarman added the feature request and non-breaking labels Jul 3, 2024

firestarman mentioned this pull request Jul 3, 2024

Support MinBy on GPU NVIDIA/spark-rapids#11123

Draft

Support min_by agg sort based

101a929

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven force-pushed the min_by_cudf_sort_only branch from aa6c36b to 101a929 Compare July 3, 2024 06:09

thirtiseven added 2 commits July 4, 2024 15:19

Handle nulls in orders and values

961917a

Signed-off-by: Haoyang Li <[email protected]>

format

8b8ecda

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven marked this pull request as ready for review July 5, 2024 10:31

thirtiseven requested review from a team as code owners July 5, 2024 10:31

thirtiseven requested review from shrshi and pmattione-nvidia July 5, 2024 10:31

bdice requested changes Jul 5, 2024

max_by wip

9a4179d

Signed-off-by: Haoyang Li <[email protected]>

thirtiseven marked this pull request as draft July 9, 2024 08:26

thirtiseven added 2 commits July 10, 2024 17:27

upmerge

307998e

Signed-off-by: Haoyang Li <[email protected]>

max_by support

2da948d

Signed-off-by: Haoyang Li <[email protected]>

github-actions bot added the Python, cudf.pandas, cudf.polars, and pylibcudf labels Jul 11, 2024

upmerge

3b2fe1d

Signed-off-by: Haoyang Li <[email protected]>

github-actions bot removed the cudf.pandas and cudf.polars labels Jul 11, 2024

thirtiseven added 3 commits July 11, 2024 10:52

clean up

a5f4a1f

Signed-off-by: Haoyang Li <[email protected]>

clean up

7f3a024

Signed-off-by: Haoyang Li <[email protected]>

clean up

0f3e464

Signed-off-by: Haoyang Li <[email protected]>

github-actions bot removed the Python and pylibcudf labels Jul 11, 2024

thirtiseven added 2 commits July 11, 2024 15:17

max_by tests

2f53b70

Signed-off-by: Haoyang Li <[email protected]>

fix test

8ae253a

Signed-off-by: Haoyang Li <[email protected]>
