Avoid empty lines with spaces to be transformed to empty string #59155
Thanks for the PR! I'm not sure if this fixes the problem in the linked issue. Can you write a test that asserts the result of
pandas/tests/io/test_html.py
Outdated
</table>
"""
),
skip_blank_lines=False,
Is this necessary? If it's left as the default then what will the result be?
Leaving it at the default, which is True, would reproduce the bug: when skip_blank_lines is True, python_parser calls _remove_empty_lines, which discards lines containing only spaces.
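To make the behaviour described in the comment above concrete, here is a minimal sketch of blank-row filtering. The function `drop_blank_rows` and its sample data are hypothetical stand-ins written for illustration; this is not pandas' actual `_remove_empty_lines` implementation, only an approximation of the effect discussed in this thread.

```python
from typing import List


def drop_blank_rows(
    rows: List[List[str]], skip_blank_lines: bool = True
) -> List[List[str]]:
    """Hypothetical stand-in for the blank-line filtering discussed here.

    When skip_blank_lines is True, rows whose cells are all empty or
    whitespace-only are dropped; otherwise every row is kept unchanged.
    """
    if not skip_blank_lines:
        return list(rows)
    # Keep a row only if at least one cell has non-whitespace content.
    return [row for row in rows if any(cell.strip() for cell in row)]


# Assumed sample data: a one-column table whose middle cell holds one space.
rows = [["a"], [" "], ["b"]]

kept_default = drop_blank_rows(rows)                      # space-only row dropped
kept_all = drop_blank_rows(rows, skip_blank_lines=False)  # space-only row kept

# A table whose only row is whitespace ends up with no rows by default.
only_blank = drop_blank_rows([[" "]])
```

Under these assumptions, a table consisting solely of space-only rows is left with no rows when the flag is True, which matches the empty result reported in the linked issue.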
So running read_html without passing skip_blank_lines=False will produce an empty list, but the doc states that no empty list should be returned, so the bug isn't really fixed.
I see. So I can remove the new argument to parse_html and instead pass skip_blank_lines=False by default to the _parse function, so that space-only lines always become rows instead of being dropped.
pandas/io/html.py
Outdated
@@ -1027,6 +1030,7 @@ def read_html(
extract_links: Literal[None, "header", "footer", "body", "all"] = None,
dtype_backend: DtypeBackend | lib.NoDefault = lib.no_default,
storage_options: StorageOptions = None,
skip_blank_lines: bool = True,
How does this relate to fixing the bug? Generally to introduce new parameters to a function you'd have to open an enhancement issue.
This argument is needed because callers of this function may want to choose whether columns containing only spaces count as valid columns. If it is True, any blank column is skipped as before. Here is the place where this is checked to remove whitespace-only lines.
@Aloqeely addressed your comments and removed the new named argument
Are there any implications of passing
The difference now would be that every line with only spaces would be included as a new row in the DataFrame.
I am not too familiar with the I would lean more toward updating the docs as opposed to changing the code.
I understand; accordingly, I will close this PR. @Aloqeely and @mroeschke, do you suggest updating the docs to say, as a side note, that data elements with only spaces will not be interpreted as rows and that all spaces in the HTML elements will be stripped?
doc/source/whatsnew/vX.X.X.rst
file if fixing a bug or adding a new feature.