Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid empty lines with spaces to be transformed to empty string #59155

Closed
wants to merge 9 commits into from

Conversation

ritwizsinha
Copy link
Contributor

@ritwizsinha ritwizsinha commented Jul 1, 2024

@Aloqeely
Copy link
Member

Aloqeely commented Jul 1, 2024

Thanks for the PR! I'm not sure if this fixes the problem in the linked issue. Can you write a test that asserts the result of read_html on '<table><tr><td> </td></tr></table>' is not an empty list?

</table>
"""
),
skip_blank_lines=False,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this necessary? If it's left as the default then what will the result be?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Leaving it as default which is true would give the same behaviour as the bug, because if this is true, python_parser calls _remove_empty_lines which gets rid of lines with only spaces.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So running read_html without passing skip_blank_lines=False will produce an empty list, but the doc states that no empty list should be returned, so the bug isn't really fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, so removing the new argument to parse_html, I can instead pass the skip_blank_lines to be False by default to _parse function to always get space only lines as a row instead of avoiding them

@@ -1027,6 +1030,7 @@ def read_html(
extract_links: Literal[None, "header", "footer", "body", "all"] = None,
dtype_backend: DtypeBackend | lib.NoDefault = lib.no_default,
storage_options: StorageOptions = None,
skip_blank_lines: bool = True,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does this relate to fixing the bug? Generally to introduce new parameters to a function you'd have to open an enhancement issue.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This argument is needed because this function might need to specify the option whether they want to consider columns with just spaces as a valid column or not. If this is true, any column which is blank would be skipped as before. Hereis the place where this is checked to remove whitespaced lines

@ritwizsinha
Copy link
Contributor Author

@Aloqeely addressed your comments and removed the new named argument

@Aloqeely
Copy link
Member

Aloqeely commented Jul 4, 2024

Are there any implications of passing skip_blank_lines=False as the default now? I'm sure that would break some existing code.
To be quite frank I'm not very familiar with the read_html code, ping @mroeschke

@ritwizsinha
Copy link
Contributor Author

Are there any implications of passing skip_blank_lines=False as the default now? I'm sure that would break some existing code. To be quite frank I'm not very familiar with the read_html code, ping @mroeschke

The difference now would be that, every line with only spaces would be included as a new row in the DataFrame

@mroeschke
Copy link
Member

I am not too familiar with the read_html code either, but given the issue I lean toward @Aloqeely opinion that this PR might cause more of a behavior change as opposed to a "bug fix". And reading the original issue it seems like the docs were clear enough given this edge case.

I would lean toward more updating this docs as opposed to changing the code.

@ritwizsinha
Copy link
Contributor Author

I understand, accordingly I will close this pr, so do you suggest @Aloqeely and @mroeschke to update the docs to say that data elements with only spaces will not be interpreted as rows and all spaces in the html elements would be stripped as a sidenote?

@ritwizsinha ritwizsinha closed this Jul 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: read_html returns empty list
3 participants