BUG: read_html returns empty list #59147

Closed
2 of 3 tasks
Fredrik-M opened this issue Jun 29, 2024 · 3 comments · Fixed by #59209
Assignees
Labels
Bug IO HTML read_html, to_html, Styler.apply, Styler.applymap

Comments

Fredrik-M commented Jun 29, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas
from io import StringIO

table = '<table><tr><td> </td></tr></table>'
res = pandas.read_html(StringIO(table), flavor='lxml')
print(len(res))

Issue Description

From the read_html docstring:

This function will always return a list of :class:`DataFrame` or
it will fail, i.e., it will not return an empty list.

The behavior depends on the space inside the <td> tag in the example: with the space, read_html returns an empty list. Removing the space causes the function to fail instead.

Expected Behavior

The function should either fail, or return a list containing a DataFrame representing a 1x1 table (either empty or containing the space character in its only cell). Don't know which is more appropriate.
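The two acceptable outcomes can be written down as a small check. This is only a sketch: `satisfies_contract` is a hypothetical helper, not part of pandas, and it takes any zero-argument reader callable so it can be tried without pandas installed.

```python
from typing import Any, Callable


def satisfies_contract(reader: Callable[[], Any]) -> bool:
    """Return True if `reader` either raises or returns a non-empty list.

    Hypothetical helper (not part of pandas) expressing the docstring's
    promise: read_html returns a list of DataFrames or fails, but it
    never returns an empty list.
    """
    try:
        result = reader()
    except Exception:
        # Failing is one of the two documented outcomes.
        return True
    # The other documented outcome: a list with at least one table.
    return isinstance(result, list) and len(result) > 0
```

With pandas and lxml installed, the reproducer could be wrapped as `satisfies_contract(lambda: pandas.read_html(StringIO(table), flavor='lxml'))`. On the affected version this would be expected to return False, since the call returns an empty list.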

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.9.19.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-30-amd64
Version : #1 SMP Debian 5.10.218-1 (2024-06-01)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.24.1
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.5.0
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : 2.0.30
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

Fredrik-M added the Bug and Needs Triage labels Jun 29, 2024
@Siddharth-Latthe-07

@Fredrik-M The issue you've encountered seems to relate to how pandas.read_html handles HTML tables whose cells contain only whitespace. According to the read_html documentation, the function should always return a list of DataFrame objects or fail.
You can use this snippet to test the behavior:

import pandas as pd
from io import StringIO

table = '<table><tr><td> </td></tr></table>'
res = pd.read_html(StringIO(table), flavor='lxml')
print(len(res))  # Expected: 1 (the bug report shows an empty list here)
print(res[0])    # Expected: a 1x1 DataFrame; raises IndexError while the list is empty

When processing an HTML table with a space character in a <td> tag, pandas.read_html should either:

  1. Return a list containing a DataFrame that represents a 1x1 table with the space character, or
  2. Fail gracefully with an appropriate error message.

Expected Behavior
According to the read_html docstring:

  1. The function should always return a list of DataFrame objects or fail.
  2. In this specific case, the function should return a list containing a DataFrame that represents a 1x1 table with the space character in its only cell.

Hope this helps. Please comment if the issue persists.
Thanks

Contributor

ritwizsinha commented Jun 30, 2024

@Fredrik-M it seems that read_html does not expose a skip_blank_lines flag, and the parser default for it is True. So when a cell contains only spaces, it is skipped as a blank line, and the result is an empty list.

Moreover, the HTMLParser code that parses HTML data elements has a specific condition that strips whitespace from each line, so a string of spaces is reduced to an empty string before it is passed downstream.
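To make the two steps described above concrete, here is a minimal pure-Python sketch of how stripping whitespace followed by skipping blank lines can leave nothing behind. The functions `strip_cells` and `drop_blank_rows` are hypothetical stand-ins for illustration only; they are not the actual pandas code paths.

```python
from typing import List


def strip_cells(rows: List[List[str]]) -> List[List[str]]:
    """Strip surrounding whitespace from every cell (hypothetical stand-in)."""
    return [[cell.strip() for cell in row] for row in rows]


def drop_blank_rows(rows: List[List[str]]) -> List[List[str]]:
    """Drop rows whose cells are all empty, mimicking a skip-blank-lines step."""
    return [row for row in rows if any(cell != "" for cell in row)]


# The table from the reproducer: one row with one cell containing a space.
rows = [[" "]]
stripped = strip_cells(rows)            # the space becomes an empty string
remaining = drop_blank_rows(stripped)   # the now-blank row is skipped
print(stripped, remaining)  # prints [['']] []
```

Under these assumptions, no rows survive, which would be consistent with read_html ending up with no table to return.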

@ritwizsinha
Contributor

take

Aloqeely added the IO HTML label and removed the Needs Triage label Jun 30, 2024
@ritwizsinha ritwizsinha mentioned this issue Jul 8, 2024
5 tasks
4 participants