You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
In some case LLM will generate code to retrieve different subset of a same dataframe. In the example below it's generating the "first_purchases" dataframe as well as "repurchases" dataframe. Then these two dataframes are merged for further analysis, which makes perfect sense:
# Assuming dfs[0] is the DataFrame we need to work with
df = dfs[0]
# Filter out first-time purchases and repurchases
first_purchases = df[df['buyer_purchase_sequence'] == 1][['buyer_uid', 'product_type']]
repurchases = df[df['buyer_purchase_sequence'] > 1][['buyer_uid', 'product_type']]
# Merge first purchases with repurchases to track the product type of the first purchase against repurchases
merged_df = pd.merge(repurchases, first_purchases, on='buyer_uid', suffixes=('_repurchase', '_first_purchase'))
In the code_execution phase, two filters are extracted: "buyer_purchase_sequence == 1" (for "first_purchases" dataframe) and "buyer_purchase_sequence > 1" (for "repurchases" dataframe).
However in pandasai/connectors/sql.py, this will concatenate a where clause: where buyer_purchase_sequence == 1 AND buyer_purchase_sequence > 1. This will result an empty dataframe in dfs, and eventually an empty chart.
I understand the design of filtering dataframe is for better system performance. But considering all possible cases, the "where clause" needs to use OR instead of AND to ensure all data gets loaded in dfs.
The text was updated successfully, but these errors were encountered:
Hey @xuyuanme! 馃憢 I'm here to help you with any bugs, questions, or contributions you have in mind. Let's tackle this together!
To address the issue where the where clause in pandasai/connectors/sql.py uses AND instead of OR, you need to modify the _build_query method. Specifically, replace the " AND ".join(condition_strings) with " OR ".join(condition_strings).
Here is the modified _build_query method:
def_build_query(self, limit=None, order=None):
base_query=select("*").select_from(text(self.cs_table_name))
ifself.config.whereorself._additional_filters:
# conditions is the list of where + additional filtersconditions= []
ifself.config.where:
conditions+=self.config.whereifself._additional_filters:
conditions+=self._additional_filtersquery_params= {}
condition_strings= []
valid_operators= ["=", ">", "<", ">=", "<=", "LIKE", "!=", "IN", "NOT IN"]
fori, conditioninenumerate(conditions):
iflen(condition) ==3:
column_name, operator, value=conditionifoperatorinvalid_operators:
self._validate_column_name(column_name)
condition_strings.append(f"{column_name}{operator} :value_{i}")
query_params[f"value_{i}"] =valueifcondition_strings:
where_clause=" OR ".join(condition_strings) # Changed from AND to ORbase_query=base_query.where(
text(where_clause).bindparams(**query_params)
)
iforder:
base_query=base_query.order_by(asc(text(order)))
iflimit:
base_query=base_query.limit(limit)
returnbase_query
This change ensures that the conditions in the where clause are combined using OR instead of AND, which will load all relevant data into dfs[1].
System Info
The current version of pandasai being used: 2.2.7
馃悰 Describe the bug
In some case LLM will generate code to retrieve different subset of a same dataframe. In the example below it's generating the "first_purchases" dataframe as well as "repurchases" dataframe. Then these two dataframes are merged for further analysis, which makes perfect sense:
In the
code_execution
phase, two filters are extracted: "buyer_purchase_sequence == 1" (for "first_purchases" dataframe) and "buyer_purchase_sequence > 1" (for "repurchases" dataframe).However in
pandasai/connectors/sql.py
, this will concatenate a where clause:where buyer_purchase_sequence == 1 AND buyer_purchase_sequence > 1
. This will result an empty dataframe in dfs, and eventually an empty chart.I understand the design of filtering dataframe is for better system performance. But considering all possible cases, the "where clause" needs to use
OR
instead ofAND
to ensure all data gets loaded in dfs.The text was updated successfully, but these errors were encountered: