Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FEAT-#4605: Implementation of Small Query Compiler to support small and empty DataFrames #5113

Open
wants to merge 28 commits into
base: master
Choose a base branch
from

Conversation

billiam-wang
Copy link
Collaborator

@billiam-wang billiam-wang commented Oct 10, 2022

What do these changes do?

  • commit message follows format outlined here
  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Handle Empty/Small Data DataFrames as a separate case #4605
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@lgtm-com
Copy link

lgtm-com bot commented Oct 31, 2022

This pull request introduces 9 alerts when merging fe123de71f23dfe852bbfa6bb74ae154f3c53f94 into f492ba9 - view on LGTM.com

new alerts:

  • 7 for First parameter of a method is not named 'self'
  • 1 for Unused import
  • 1 for Variable defined multiple times

@billiam-wang billiam-wang changed the title Implementation of Small Query Compiler to support small and empty DataFrames FEAT-#4605: Implementation of Small Query Compiler to support small and empty DataFrames Oct 31, 2022
@lgtm-com
Copy link

lgtm-com bot commented Oct 31, 2022

This pull request introduces 7 alerts when merging e0f36929ab8b4a6b5e7755714eaa72c61d891cf6 into 6f0ff79 - view on LGTM.com

new alerts:

  • 7 for First parameter of a method is not named 'self'

@billiam-wang billiam-wang marked this pull request as ready for review October 31, 2022 18:17
@billiam-wang billiam-wang requested a review from a team as a code owner October 31, 2022 18:17
@lgtm-com
Copy link

lgtm-com bot commented Oct 31, 2022

This pull request introduces 6 alerts when merging a3bac2c06ba1e6ce1d467f0281daf7ac870ada17 into 6f0ff79 - view on LGTM.com

new alerts:

  • 6 for First parameter of a method is not named 'self'

Copy link
Collaborator

@devin-petersohn devin-petersohn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks pretty good! Left a couple of high level comments.

modin/core/storage_formats/base/query_compiler.py Outdated Show resolved Hide resolved
modin/core/storage_formats/pandas/query_compiler.py Outdated Show resolved Hide resolved
@Garra1980
Copy link
Collaborator

Do these changes need to be reflected in docs?

@billiam-wang billiam-wang force-pushed the bill-sqc branch 4 times, most recently from 912654e to 140d0d8 Compare November 2, 2022 22:16
Comment on lines +29 to +32
test-small-query-compiler:
- 'modin/experimental/core/storage_formats/pandas/small_query_compiler.py'
- 'modin/core/storage_formats/pandas/query_compiler.py'
- 'modin/core/storage_formats/base/query_compiler.py'
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks clever, but I think it feels temporary to me - query compilers both use stuff from lower level and are used by higher level, so this integration should eventually be tested upon change in any of the things used throughout the pipeline... this should probably stay for now, but be removed towards the shiny future where the "auto-switch" would happen

queries for small data and empty ``PandasDataFrame``.
"""

from modin.config.envvars import InitializeWithSmallQueryCompilers
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the current design for configuration is that all config variables come from modin.config regardless of what value source they have (even though we have one value source so far)

Comment on lines +30 to +34
from modin.utils import MODIN_UNNAMED_SERIES_LABEL
from modin.utils import (
_inherit_docstrings,
try_cast_to_pandas,
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

there's no need to import from single module in multiple statements

)


def _get_axis(axis):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm pretty sure we have this code somewhere already, we should reduce the copy-pasteness of this

by_names.append(by[i].name)
elif isinstance(by[i], str):
by_names.append(by[i])
if isinstance(by, pandas.DataFrame):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how can it be if by is squeezed at line 268?

Comment on lines +202 to +206
if query_compiler is None and InitializeWithSmallQueryCompilers.get():
small_dataframe = pandas.DataFrame(
data=data, index=index, columns=columns, dtype=dtype, copy=copy
)
self._query_compiler = SmallQueryCompiler(small_dataframe)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels weird in terms of architecture

self._query_compiler = query_compiler.columnarize()
if name is not None:
self._query_compiler = self._query_compiler
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this removal feels completely wrong

also this is leaking architecture IMHO

Comment on lines -2262 to +2288
def _validate_dtypes_sum_prod_mean(self, axis, numeric_only, ignore_axis=False):
def _validate_dtypes_sum_prod_mean(
self, axis, numeric_only, ignore_axis=False
): # noqa: PR01
"""
Validate data dtype for `sum`, `prod` and `mean` methods.

Parameters
Parameter
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why change?

Comment on lines +499 to +500
InitializeWithSmallQueryCompilers.get(),
reason="SmallQueryCompiler does not currently support IO functions.",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this "not supporting" make current PR a draft until it really does support I/O?

Comment on lines +87 to +88
if InitializeWithSmallQueryCompilers.get():
return DataFrame(query_compiler=SmallQueryCompiler(df))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this feels completely off; I thought small query compiler should go the usual route for other QCs (in this case it should get picked by FactoryDispatcher according to user's settings)

@YarShev
Copy link
Collaborator

YarShev commented Dec 5, 2023

@billiam-wang, do you have a plan to continue this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Handle Empty/Small Data DataFrames as a separate case
5 participants