UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6: invalid continuation byte #29457

Open
2 of 3 tasks
Habeeb556 opened this issue Jul 2, 2024 · 12 comments
Labels
#bug:regression Bugs that are identified as regressions sqllab Namespace | Anything related to the SQL Lab

Comments

@Habeeb556

Bug description

I encountered the following error when querying select * from table in SQL Lab. This issue occurred after upgrading from Superset version 2.1.3 to version 4.0.1.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 6: invalid continuation byte

How to reproduce the bug

Any table column with a TIMESTAMP datatype generates this error.

Screenshots/recordings

image

Superset version

4.0.1

Python version

3.11

Node version

I don't know

Browser

Chrome

Additional context

No response

Checklist

  • I have searched Superset docs and Slack and didn't find a solution to my problem.
  • I have searched the GitHub issue tracker and didn't find a similar bug report.
  • I have checked Superset's logs for errors and if I found a relevant Python stacktrace, I included it here as text in the "additional context" section.
@dosubot dosubot bot added #bug:regression Bugs that are identified as regressions sqllab Namespace | Anything related to the SQL Lab labels Jul 2, 2024

dosubot bot commented Jul 2, 2024

To resolve the UnicodeDecodeError when querying a table with a TIMESTAMP datatype in SQL Lab after upgrading from Superset version 2.1.3 to version 4.0.1, ensure that the timestamp columns are properly normalized to a pandas-supported datetime format. This can be done using the utils.normalize_dttm_col function in the get_df method of the BaseViz class.

Here is the relevant code snippet from the get_df method:

if not df.empty:
    utils.normalize_dttm_col(
        df=df,
        dttm_cols=tuple(
            [
                DateColumn.get_legacy_time_column(
                    timestamp_format=timestamp_format,
                    offset=self.datasource.offset,
                    time_shift=self.form_data.get("time_shift"),
                )
            ]
        ),
    )

    if self.enforce_numerical_metrics:
        self.df_metrics_to_num(df)

    df.replace([np.inf, -np.inf], np.nan, inplace=True)

Ensure that this normalization step is correctly applied to handle the timestamp columns and avoid UnicodeDecodeError [1].

Additionally, the Pinot database connector is known to cause issues with TIMESTAMP columns in Superset. The Pinot driver infers TIMESTAMP columns as LONG, which can lead to problems such as UnicodeDecodeError. Here is a quick fix for the Pinot driver:

# Pinot driver infers TIMESTAMP column as LONG, so make the quick fix.
# When the Pinot driver fixes this bug, the current method could be removed.
if isinstance(sqla_column_type, types.TIMESTAMP):
    return sqla_column_type.compile().upper()

This fix ensures that TIMESTAMP columns are correctly interpreted and processed [2].


@mistercrunch
Member

#28266 was first cherry-picked in 4.0.2

Please try upgrading and confirm the issue is fixed

@Habeeb556
Author

@mistercrunch unfortunately, this did not fix the issue. I upgraded to version 4.0.2 and encountered the same error.

UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte

Additionally, I noticed that this issue occurs only when selecting columns with the TIMESTAMP datatype. All other columns work fine. It worked correctly with version 2.1.3 when I switched back.
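For readers following along, here is a minimal sketch of how a non-text value can raise exactly these messages. It assumes, and this is not confirmed in the thread, that the driver returns the TIMESTAMP cell as raw 8-byte bytes. That would fit SQL Server, where TIMESTAMP is a synonym for the binary rowversion type rather than a date/time type. The byte values below are invented for illustration.

```python
# Hypothetical 8-byte cells standing in for SQL Server TIMESTAMP (rowversion) values.
# The byte values are invented so that they reproduce the two messages in this thread.
SAMPLE_A = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xFF, 0x01])
SAMPLE_B = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x00, 0xD4, 0x01])


def describe_decode(value: bytes) -> str:
    """Try to read the bytes as UTF-8 text and report what happens."""
    try:
        return "decoded: " + repr(value.decode("utf-8"))
    except UnicodeDecodeError as exc:
        return f"UnicodeDecodeError: {exc}"


# 0xFF can never start a UTF-8 sequence; 0xD4 starts a two-byte sequence
# but is followed here by a byte that is not a continuation byte.
print(describe_decode(SAMPLE_A))
print(describe_decode(SAMPLE_B))
```

If this assumption holds, the error is about serializing binary data as text, not about date handling.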

@mistercrunch
Member

mistercrunch commented Jul 3, 2024

Full stack trace, please! I'm also curious which database engine, driver, and version you are using.

@Habeeb556
Author

Database engine: mssql+pyodbc
Version: 5.1.0

Stack trace:

'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Traceback (most recent call last):
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1484, in full_dispatch_request
   rv = self.dispatch_request()
        ^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask/app.py", line 1469, in dispatch_request
   return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/security/decorators.py", line 95, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 127, in wraps
   raise ex
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/views/base_api.py", line 121, in wraps
   duration, response = time_function(f, self, *args, **kwargs)
                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/core.py", line 1470, in time_function
   response = func(*args, **kwargs)
              ^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/flask_appbuilder/api/__init__.py", line 183, in wraps
   return f(self, *args, **kwargs)
          ^^^^^^^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/utils/log.py", line 255, in wrapper
   value = f(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/superset/sqllab/api.py", line 346, in get_results
   payload = json.dumps(
             ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/__init__.py", line 395, in dumps
   **kw).encode(obj)
         ^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 298, in encode
   chunks = self.iterencode(o)
            ^^^^^^^^^^^^^^^^^^
 File "/swloc/.virtualenvs/supersetvenv4/lib/python3.11/site-packages/simplejson/encoder.py", line 379, in iterencode
   return _iterencode(o, 0)
          ^^^^^^^^^^^^^^^^^
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
2024-07-03 20:26:50,670:ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
Triggering query_id: 41782
2024-07-03 20:26:50,944:INFO:superset.commands.sql_lab.execute:Triggering query_id: 41782
Query 41782: Running query on a Celery worker
2024-07-03 20:26:50,954:INFO:superset.sqllab.sql_json_executer:Query 41782: Running query on a Celery worker
'utf-8' codec can't decode byte 0xff in position 6: invalid start byte
2024-07-03 20:26:59,507:ERROR:superset.views.base:'utf-8' codec can't decode byte 0xff in position 6: invalid start byte

@mistercrunch
Member

Oh, it appears 4.0.2 does not include the large JSON refactor that centralized all calls in superset/utils/json.py; see #28702.

This should make it into 4.1.x, I believe. I don't recommend bringing in this large refactor as a cherry-pick, as it will merge-conflict heavily.

@mistercrunch
Member

mistercrunch commented Jul 3, 2024

@Habeeb556 if you have the ability to test against the master branch, you could confirm that it's working there. I'm tempted to close the issue, but will wait until you confirm the fix.

@Habeeb556
Author

@mistercrunch, I have some good news and bad news. The good news is that I think I have successfully pushed to the master branch, and the query is running fine. However, the bad news is that the output is incorrectly formatted with Chinese characters.

image

I'm not sure if this is a bug or if my push was incorrect and missed something.

@mistercrunch
Member

This is where the [bytes] value comes from:
https://github.com/apache/superset/blob/master/superset/utils/json.py#L102

The Chinese characters would show if/when your binary blob is decodable as utf-8 or utf-16.

What is in your binary blob? What do you expect to see?

Maybe you're using some funky other encoding or "collation". At this point, if you're using something other than utf-N in this day and age, you may want to standardize, or wrap the column with some database function that brings things to a modern encoding.
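The behavior described above can be sketched as a decode-with-fallback hook for the standard library's json module. This is an illustration only, not the code at the linked line: the function name bytes_fallback, the sample bytes, and the explicit little-endian UTF-16 codec are all assumptions made for the example.

```python
import json

# Hypothetical rowversion-like value: invalid as UTF-8, but decodable as UTF-16.
RAW = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x1E, 0x84, 0x80])


def bytes_fallback(obj):
    """json.dumps `default` hook: decode bytes if possible, else use a placeholder."""
    if isinstance(obj, bytes):
        # Little-endian is spelled out so the result does not depend on the platform.
        for encoding in ("utf-8", "utf-16-le"):
            try:
                return obj.decode(encoding)
            except UnicodeDecodeError:
                continue
        return "[bytes]"
    raise TypeError(f"Object of type {type(obj).__name__} is not JSON serializable")


payload = json.dumps({"ts": RAW, "blob": b"\xff"}, default=bytes_fallback)
decoded = json.loads(payload)
print(repr(decoded["ts"][-1]))  # a CJK ideograph, U+8084
print(decoded["blob"])          # [bytes]
```

Arbitrary binary data often happens to be valid UTF-16, because almost any pair of bytes maps to some code point. That is one plausible reason for the Chinese-looking output: the bytes are being displayed as text they were never meant to be.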

@Habeeb556
Author

Yes, I checked this now with the old version 2.1.3, and it returned the same value, [bytes], when run. So, I can confirm that this master push with version 4.x is working.

Regarding the binary blob, here's what I expect to see when running directly from the SQL server.

image

@mistercrunch
Member

But what's in there? Some other language/character set? I'm guessing these bytes represent something intelligible (?)

Having worked with SQL Server a long time ago, I'm guessing this has to do with "collation" and Microsoft SQL Server's deep support for different character sets. From my understanding, all this is pretty much obsolete with the rise of the utf-8 / utf-16 standards.

Given that, Apache Superset probably shouldn't go out of its way to support the intricacies of how different databases support different character sets, and should just tell people to convert to utf-x (either physically in their tables or by casting in views) in order to get Superset to deal with non-ASCII characters.
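As a rough illustration of the "convert it before it reaches Superset" idea, the sketch below renders a binary value in two text-friendly forms, similar in spirit to what a cast in a view could produce. It assumes the column holds an 8-byte binary value; the sample bytes and both helper names are hypothetical, and the thread does not confirm what the SQL Server screenshot showed.

```python
# Hypothetical 8-byte rowversion value; the bytes are invented for illustration.
RAW = bytes([0x00, 0x00, 0x00, 0x00, 0x00, 0x1E, 0x84, 0x80])


def to_hex_literal(value: bytes) -> str:
    """Render bytes as a 0x-prefixed uppercase hex string."""
    return "0x" + value.hex().upper()


def to_int(value: bytes) -> int:
    """Interpret the bytes as an unsigned big-endian integer."""
    return int.from_bytes(value, byteorder="big")


print(to_hex_literal(RAW))  # 0x00000000001E8480
print(to_int(RAW))          # 2000000
```

Either form is plain ASCII, so it serializes to JSON without any guessing about encodings.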

@Habeeb556
Author

I agree with you. I'm not exactly sure about the business logic here since I'm a DBA focused on database support for analytical tools. They encountered the error because of a SELECT * FROM table query, and they might not need that column, or it could reference something within the application — I'm not sure.

Overall, it's good that we can skip this error now when using SELECT *.
