Avoid empty lines with spaces to be transformed to empty string #59155

ritwizsinha · 2024-07-01T13:15:01Z

closes BUG: read_html returns empty list #59147
Tests added and passed if fixing a bug or adding a new feature
All code checks passed.
Added type annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.

Aloqeely · 2024-07-01T22:18:03Z

Thanks for the PR! I'm not sure if this fixes the problem in the linked issue. Can you write a test that asserts the result of read_html on '<table><tr><td> </td></tr></table>' is not an empty list?
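To make the requested test input concrete, here is a minimal sketch of what that markup contains, using only the standard library's html.parser rather than pandas. The CellCollector class and extract_rows helper are hypothetical names made up for this illustration and are not part of pandas. The sketch only shows that the table holds one row with a single whitespace-only cell; it does not reproduce what read_html does with it.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the raw text of every <td>/<th> cell, grouped by <tr> row.

    Hypothetical helper for illustration; not how pandas parses HTML.
    """

    def __init__(self):
        super().__init__()
        self.rows = []          # list of rows, each a list of cell strings
        self._in_cell = False   # True while inside a <td> or <th>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:   # tolerate a cell that appears without a <tr>
                self.rows.append([])
            self.rows[-1].append("")
            self._in_cell = True

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        # Keep whitespace exactly as written; no stripping happens here.
        if self._in_cell:
            self.rows[-1][-1] += data


def extract_rows(html):
    parser = CellCollector()
    parser.feed(html)
    parser.close()
    return parser.rows


rows = extract_rows("<table><tr><td> </td></tr></table>")
print(rows)  # [[' ']]
```

The single cell survives as a one-character string, so whether it ends up in the result depends entirely on how later stages treat whitespace-only rows.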

Aloqeely · 2024-07-02T07:23:58Z

pandas/tests/io/test_html.py

+            </table>
+        """
+            ),
+            skip_blank_lines=False,


Is this necessary? If it's left as the default then what will the result be?

Leaving it at the default, True, reproduces the bug: when it is True, python_parser calls _remove_empty_lines, which drops lines that contain only spaces.

So running read_html without passing skip_blank_lines=False will produce an empty list, but the doc states that no empty list should be returned, so the bug isn't really fixed.

I see. So I can remove the new argument from parse_html and instead pass skip_blank_lines=False by default to the _parse function, so that space-only lines are always kept as rows instead of being dropped.
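The behaviour discussed in this thread can be sketched roughly as follows. This is a simplified approximation written for illustration, not pandas' actual implementation: remove_empty_lines and parse_rows are hypothetical stand-ins, and the real _remove_empty_lines may differ in detail.

```python
def remove_empty_lines(lines):
    """Drop rows made of a single empty or whitespace-only string cell.

    Hypothetical stand-in for the filtering step described in the thread.
    """
    return [
        line
        for line in lines
        if len(line) > 1
        or (len(line) == 1 and (not isinstance(line[0], str) or line[0].strip()))
    ]


def parse_rows(lines, skip_blank_lines=True):
    # With skip_blank_lines=True, whitespace-only rows are filtered out;
    # with False, every row is kept as-is.
    return remove_empty_lines(lines) if skip_blank_lines else list(lines)


rows = [[" "]]
print(parse_rows(rows, skip_blank_lines=True))   # []
print(parse_rows(rows, skip_blank_lines=False))  # [[' ']]
```

Under these assumptions, a table whose only row is a single space-only cell yields no rows at all when blank lines are skipped, which is consistent with the empty list reported in the linked issue.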

Aloqeely · 2024-07-02T07:26:16Z

pandas/io/html.py

@@ -1027,6 +1030,7 @@ def read_html(
    extract_links: Literal[None, "header", "footer", "body", "all"] = None,
    dtype_backend: DtypeBackend | lib.NoDefault = lib.no_default,
    storage_options: StorageOptions = None,
+    skip_blank_lines: bool = True,


How does this relate to fixing the bug? Generally to introduce new parameters to a function you'd have to open an enhancement issue.

This argument is needed because callers may want to choose whether columns containing only spaces count as valid columns. If it is True, any blank column is skipped as before. Here is the place where this is checked to remove whitespace-only lines.

ritwizsinha · 2024-07-04T06:16:50Z

@Aloqeely addressed your comments and removed the new named argument

Aloqeely · 2024-07-04T22:09:40Z

Are there any implications of passing skip_blank_lines=False as the default now? I'm sure that would break some existing code.
To be quite frank I'm not very familiar with the read_html code, ping @mroeschke

ritwizsinha · 2024-07-05T06:03:18Z

> Are there any implications of passing skip_blank_lines=False as the default now? I'm sure that would break some existing code. To be quite frank I'm not very familiar with the read_html code, ping @mroeschke

The difference now is that every line containing only spaces would be included as a new row in the DataFrame.

mroeschke · 2024-07-05T17:21:45Z

I am not too familiar with the read_html code either, but given the issue I lean toward @Aloqeely's opinion that this PR is more of a behavior change than a "bug fix". And reading the original issue, it seems the docs were clear enough about this edge case.

I would lean more toward updating the docs than changing the code.

ritwizsinha · 2024-07-07T09:05:52Z

Understood; I will close this PR accordingly. @Aloqeely and @mroeschke, do you suggest updating the docs to say that data elements containing only spaces will not be interpreted as rows, with a side note that all spaces in the HTML elements are stripped?

ritwizsinha added 5 commits July 1, 2024 18:43

Fix issue 59147

619fced

Remove useless prints

6eeb058

Fix ci

0982da9

Merged main

e5032f3

Fix ci

f2dee95

ritwizsinha added 2 commits July 2, 2024 12:42

Add tests

75e342a

merged main

461ebf0

Aloqeely reviewed Jul 2, 2024


ritwizsinha added 2 commits July 2, 2024 19:26

Remove key argument skip_blank_lines and pass it implicitly

ece2f7b

Merge branch 'main' into Issue#59147

e9b8530

ritwizsinha closed this Jul 8, 2024
