
[BUG] %%pretty with Chinese character error #767

Open
kongLingming06 opened this issue Aug 7, 2022 · 11 comments

@kongLingming06

Describe the bug
When using the %%pretty magic, if there are Chinese characters in the table, the table cannot be displayed properly and the following error is returned.

An internal error was encountered.
Please file an issue at https://github.com/jupyter-incubator/sparkmagic
Error:
Expected DF rows to be uniform width (581)

To Reproduce
%%pretty
df.show()

Versions:

  • SparkMagic
  • Livy (if you know it)
  • Spark
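A likely explanation for the width mismatch: CJK characters occupy two terminal columns each but count as a single Python character, so a parser comparing len() against the width of the ASCII table border will see rows containing Chinese text as too narrow. A minimal stdlib check (unicodedata classifies these characters as East Asian Wide):

```python
import unicodedata

# "你好" is two Python characters, but each renders as two terminal
# columns because its East Asian width class is 'W' (Wide).
s = "你好"
print(len(s))                                        # 2
print([unicodedata.east_asian_width(c) for c in s])  # ['W', 'W']
```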
@xuansontrinh

I experience the same issue.

@baixinzxl

I experience the same issue too. When I try to display Chinese characters, it returns the error:
Expected DF rows to be uniform width (804)

Thanks for any help; it really confuses me.

@devstein
Collaborator

This issue should be fixed by this PR, which I just released as part of the 0.20.4 release.

I'm marking this as resolved for now, but please let me know if this is not the case after you upgrade.

@baixinzxl

baixinzxl commented Jan 31, 2023

> This issue should be fixed by this PR, which I just released as part of the 0.20.4 release.
>
> I'm marking this as resolved for now, but please let me know if this is not the case after you upgrade.

Thanks for your reply!
But after upgrading to the 0.20.4 release, I still hit the same problem:
Expected DF rows to be uniform width (11)
but found | a| 你好| (9)

I found the output is correct on the Livy server, but the notebook display returns the error.

Maybe you can use this dataframe to reproduce the problem:
```
%%pretty

df = spark.createDataFrame([("a","你好"),("b","你好")],("key","value"))

df.show(5)
```

@devstein

And these are my versions:
spark 2.4.5
sparkmagic 0.20.4
I don't know the Livy version.
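The counts in that error line up with display-width arithmetic: the ASCII header `|key|value|` is 11 characters and 11 display columns, while the CJK data row has only 9 characters yet still renders 11 columns wide. A stdlib sketch (the row's exact spacing is reconstructed from the error message, so treat it as illustrative):

```python
import unicodedata

def display_width(s):
    # East Asian Wide ('W') / Fullwidth ('F') characters render as 2 columns
    return sum(2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
               for ch in s)

header = "|key|value|"  # ASCII: character count equals display width
row = "|  a| 你好|"      # reconstructed data row (spacing approximate)

print(len(header), display_width(header))  # 11 11
print(len(row), display_width(row))        # 9 11 -> len()-based check fails
```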

@devstein devstein reopened this Jan 31, 2023
@devstein
Collaborator

Thanks for the code snippet @baixinzxl. I will investigate once I have bandwidth in the coming weeks. Contributions are welcome if you want to dive into the code!

@baixinzxl

baixinzxl commented Feb 2, 2023

Thanks, take your time!
I guess the problem may be related to differences in Chinese character encoding between Livy and Jupyter. Hope that helps.

@baixinzxl

Sorry to disturb you, but I wonder if there are any findings about this problem?
Looking forward to hearing from you @devstein. Thank you!

@devstein
Collaborator

devstein commented Apr 5, 2023

Hey @baixinzxl I haven't forgotten about this. I've been stretched for time and have tried to tackle this twice without success. The relevant code is in this file if you want to take a stab at it!

@pancodia

I am using sparkmagic 0.21.0.

Same issue when showing a dataframe with a Japanese column.

@devstein
Collaborator

@dosu how would you update https://github.com/jupyter-incubator/sparkmagic/blob/master/sparkmagic/sparkmagic/utils/dataframe_parser.py to handle characters with varying widths like Chinese and Japanese?


dosubot bot commented May 27, 2024

Hey @devstein! 👋 I'm here to help with any bugs, questions, or contributions you have for the repository. I'm a bot designed to assist you while you wait for a human maintainer. Let me know how I can assist you!

To update dataframe_parser.py to handle characters with varying widths like Chinese and Japanese, you need to account for the fact that these characters typically take up more space than Latin characters. This can be done by using the wcwidth library, which provides a way to determine the display width of a string.

Here is how you can modify the extractors function and the DataframeHtmlParser class to handle characters with varying widths:

  1. Install the wcwidth library if you haven't already:

    pip install wcwidth
  2. Update the extractors function to use wcwidth to calculate the correct column widths:

import re
from collections import OrderedDict
from functools import partial

from wcwidth import wcwidth, wcswidth


def _display_slice(row, left, right):
    """Slices `row` by display columns [left, right) rather than by
    character index, so wide (e.g. CJK) characters stay aligned with
    the ASCII-only header border."""
    out, col = [], 0
    for ch in row:
        if col >= right:
            break
        if col >= left:
            out.append(ch)
        col += max(wcwidth(ch), 0)
    return "".join(out)


def extractors(header_top, header_content):
    """Creates functions to pull column values out of Spark DF rows.

    Based on the top of a Dataframe header, identifies the start and end
    display column of each column value.

    012345678901
    +---+------+
    | id|animal|
    +---+------+
    |  1|   cat|
    |  2| mouse|
    |  3| horse|
    +---+------+

    For example, the `id` column is row[0:4] and `animal` is row[4:11].

    :param header_top The header border top, comprising `+` and `-`
                       characters demarcating the columns,
                       eg `+---+------+`
    :param header_content The string following the header_top, containing
                            the actual column names, eg `| id|animal|`
    :return A dict of column: function that can be applied to the string
            representation of a Dataframe row, eg `|  1|   cat|`

            In our example:
            {'id': lambda row: row[0:4], 'animal': lambda row: row[4:11]}
    """
    # header_top contains only ASCII `+` and `-`, so the character
    # positions of its `+` markers are also its display columns.
    header_pluses = list(re.finditer(r"\+", header_top))
    column_widths = [
        end.start() - start.start()
        for start, end in zip(header_pluses, header_pluses[1:])
    ]

    def _extract(l, r, row, offset=0):
        return _display_slice(row, offset + l, offset + r).strip()

    def _extractor_iter():
        start = 0
        for width in column_widths:
            end = start + width
            yield partial(_extract, start, end)
            start = end

    return OrderedDict((x(header_content), x) for x in _extractor_iter())
  3. Update the DataframeHtmlParser class to use the new extractors function:
class DataframeHtmlParser:
    """Parses a Spark Dataframe and presents it as a HTML table."""

    header_top_r = re.compile(header_top_pattern)
    header_content_r = re.compile(header_content_pattern)

    def __init__(self, cell, start=0, end=None):
        """Creates a Dataframe parser for a single dataframe.

        :param cell The evaluated output of a cell.
                    Cell can contain more than one dataframe, but a single
                    DataframeHtmlParser can only parse table headers/rows for a
                    single dataframe in the substring cell[start:end]
        """
        self.cell_contents = cell
        end = end or len(self.cell_contents)
        header_spans = DataframeHtmlParser.header_top_r.finditer(
            self.cell_contents, start, end
        )
        parts = {
            "header_top": next(header_spans).span(),
            "header_content": DataframeHtmlParser.header_content_r.search(
                self.cell_contents, start, end
            ).span(),
            "header_bottom": next(header_spans).span(),
            "footer": next(header_spans).span(),
        }
        self.header_content_span = parts["header_content"]
        header_content = self._cell_span(self.header_content_span)

        self.expected_width = wcswidth(header_content.strip())

        header_top = self._cell_span(parts["header_top"])
        self.extractors = extractors(header_top.strip(), header_content.strip())
        # The content is between the header-bottom and the footer
        self.content_span = (parts["header_bottom"][1], parts["footer"][0])

    def _cell_span(self, span):
        s, e = span
        return self.cell_contents[s:e]

    def _rowspan_iter(self):
        """Extract each row from the contents of a Dataframe."""
        row_delimiters = re.compile(r"\n").finditer(
            self.cell_contents, self.content_span[0], self.content_span[1]
        )
        start = self.content_span[0]
        for row_delimiter in row_delimiters:
            end, next_start = row_delimiter.span()[0], row_delimiter.span()[1]
            yield (start, end)
            start = next_start

    def row_iter(self, transform=None):
        """Extract and transform each row from a Dataframe.

        Defaults to converting a row to a dict {colName: value}
        """
        _transform = transform or (
            lambda r: {col: x(r) for col, x in self.extractors.items()}
        )
        for rowspan in self._rowspan_iter():
            row = self._cell_span(rowspan).strip()
            if wcswidth(row) != self.expected_width:
                raise ValueError(
                    """Expected DF rows to be uniform width (%d)
                                 but found %s (%d)"""
                    % (self.expected_width, row, wcswidth(row))
                )
            yield _transform(row)

    def to_table(self):
        """Converts the contents of a notebook cell to a HTML table."""

        header_content = self._cell_span(self.header_content_span)
        table_header_html = self._to_tr(header_content.strip(), is_header=True)

        table_row_iter = self.row_iter(transform=self._to_tr)
        table_body = "".join([r for r in table_row_iter])
        return "<table>%s%s</table>" % (table_header_html, table_body)

    def _to_tr(self, row, is_header=False):
        """Converts a spark dataframe row to a HTML row."""
        tag = "th" if is_header else "td"
        row_content = [x(row) for x in self.extractors.values()]
        row_html = "".join(
            [
                '<%s><pre style="word-break: unset; background-color: unset;">%s</pre></%s>'
                % (tag, rc, tag)
                for rc in row_content
            ]
        )
        return "<tr>%s</tr>" % row_html

These changes ensure that the parser correctly handles characters with varying widths, such as Chinese and Japanese characters, by using the wcwidth library to calculate the display width of each string [1].
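The core idea can also be checked in isolation with a standalone, stdlib-only sketch that slices a row by display columns instead of character indices (`display_slice` is a hypothetical helper mirroring what the wcwidth-based extractor does; the sample row's spacing is reconstructed from the thread's example):

```python
import unicodedata

def display_slice(row, left, right):
    """Return the part of `row` covering display columns [left, right),
    counting East Asian Wide/Fullwidth characters as two columns."""
    out, col = [], 0
    for ch in row:
        if col >= right:
            break
        if col >= left:
            out.append(ch)
        col += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return "".join(out)

# The header border for columns key/value is "+---+-----+", with `+`
# markers at display columns 0, 4, and 10, so the values live in
# display columns [0, 4) and [4, 10).
row = "|  a| 你好|"
print(display_slice(row, 0, 4))   # "|  a"
print(display_slice(row, 4, 10))  # "| 你好"
```

Character-index slicing (`row[4:10]`) would cut through the CJK value here; display-column slicing keeps each cell intact.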

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.


5 participants