BUG: read_html returns empty list #59147

Fredrik-M · 2024-06-29T14:40:55Z

Pandas version checks

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas
from io import StringIO

table = '<table><tr><td> </td></tr></table>'
res = pandas.read_html(StringIO(table), flavor='lxml')
print(len(res))

Issue Description

From the read_html docstring:

This function will always return a list of :class:DataFrame or
it will fail, i.e., it will not return an empty list.

It has something to do with the space in the <td> tag in the example. Removing the space causes the function to fail instead.

Expected Behavior

The function should either fail or return a list containing a DataFrame representing a 1x1 table (either empty or containing the space character in its only cell). I don't know which is more appropriate.
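
Until the behavior is settled, calling code that relied on the "never returns an empty list" wording could check the result itself. Below is a minimal sketch of such a guard. The helper name `require_tables` and its error message are hypothetical and not part of pandas; it only assumes the result is a list.

```python
def require_tables(tables):
    """Return `tables` unchanged, or raise if the list is empty.

    Hypothetical helper (not a pandas API): it enforces the
    "a list of DataFrames or an error" contract on the caller's side.
    """
    if len(tables) == 0:
        raise ValueError("expected at least one table, got an empty list")
    return tables


# Intended use (requires pandas and lxml, so it is left as a comment):
# tables = require_tables(pandas.read_html(StringIO(table), flavor='lxml'))
```

With this guard, the example above would raise a ValueError instead of silently handing an empty list to downstream code.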

Installed Versions

INSTALLED VERSIONS

commit : d9cdd2e
python : 3.9.19.final.0
python-bits : 64
OS : Linux
OS-release : 5.10.0-30-amd64
Version : #1 SMP Debian 5.10.218-1 (2024-06-01)
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 2.2.2
numpy : 1.24.1
pytz : 2024.1
dateutil : 2.8.2
setuptools : 69.5.1
pip : 24.0
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 5.2.2
html5lib : None
pymysql : None
psycopg2 : 2.9.9
jinja2 : 3.1.4
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : 2024.5.0
gcsfs : None
matplotlib : 3.8.4
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : 1.13.0
sqlalchemy : 2.0.30
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None


Siddharth-Latthe-07 · 2024-06-30T07:58:42Z

@Fredrik-M The issue you've encountered seems to relate to how pandas.read_html handles HTML tables whose cells contain only whitespace. According to the read_html documentation, the function should always return a list of DataFrame objects or fail.
You can use this snippet to test it:

import pandas as pd
from io import StringIO

table = '<table><tr><td> </td></tr></table>'
res = pd.read_html(StringIO(table), flavor='lxml')
print(len(res))  # Expected: 1 (the report above gets an empty list here instead)
print(res[0])    # Expected: the DataFrame for the 1x1 table; fails if res is empty

When processing an HTML table with a space character in a <td> tag, pandas.read_html should either:

Return a list containing a DataFrame that represents a 1x1 table with the space character. OR
Fail gracefully with an appropriate error message.

Expected Behavior
According to the read_html docstring:

The function should always return a list of DataFrame objects or fail.
In this specific case, the function should return a list containing a DataFrame that represents a 1x1 table with the space character in its only cell.

Hope this helps.
Please comment if the issue persists.
Thanks

ritwizsinha · 2024-06-30T15:49:05Z

@Fredrik-M it seems that read_html exposes no skip_blank_lines flag, and the parser defaults set it to True. So when a cell contains only spaces, the row is treated as a blank line and skipped, which leaves an empty list.

Moreover, the HTMLParser code that parses HTML data elements has a specific condition that strips whitespace from each line, so a string made up of spaces is reduced to an empty string before it is passed downstream.
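
The two steps described above (strip the whitespace, then skip blank lines) can be reproduced in a small standalone sketch. This is not pandas' code: it uses the standard-library `html.parser`, and the names `CellCollector`, `parse_rows`, `strip_cells` and `skip_blank_rows` are made up for illustration. It only shows how a whitespace-only cell can end up as zero rows.

```python
from html.parser import HTMLParser


class CellCollector(HTMLParser):
    """Collect the raw text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th"):
            if not self.rows:  # tolerate a cell that appears outside a <tr>
                self.rows.append([])
            self._in_cell = True
            self.rows[-1].append("")

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._in_cell = False

    def handle_data(self, data):
        if self._in_cell:
            self.rows[-1][-1] += data


def parse_rows(html, strip_cells=True, skip_blank_rows=True):
    """Hypothetical stand-in for the strip-then-skip steps described above."""
    collector = CellCollector()
    collector.feed(html)
    collector.close()
    rows = collector.rows
    if strip_cells:
        # Step 1: a cell holding only spaces becomes an empty string.
        rows = [[cell.strip() for cell in row] for row in rows]
    if skip_blank_rows:
        # Step 2: a row whose cells are all empty is dropped as a blank line.
        rows = [row for row in rows if any(cell != "" for cell in row)]
    return rows


table = '<table><tr><td> </td></tr></table>'
print(parse_rows(table))                         # both steps applied
print(parse_rows(table, strip_cells=False))      # whitespace kept
print(parse_rows(table, skip_blank_rows=False))  # blank row kept
```

In this sketch, turning off either step keeps the single row, which matches the two outcomes discussed in the issue: a 1x1 table holding the space, or a 1x1 table holding an empty cell.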

ritwizsinha · 2024-06-30T16:25:52Z

take

Fredrik-M added the Bug and Needs Triage labels Jun 29, 2024

github-actions bot assigned ritwizsinha Jun 30, 2024

Aloqeely added the IO HTML label and removed the Needs Triage label Jun 30, 2024

ritwizsinha mentioned this issue Jul 1, 2024: Avoid empty lines with spaces to be transformed to empty string #59155 (Closed)

ritwizsinha mentioned this issue Jul 8, 2024: Update read_html docs #59209 (Merged)

mroeschke closed this as completed in #59209 Jul 8, 2024
