`LinkContentFetcher` uses wrong encoding for html responses where encoding is not explicitly defined in `Content-Type` header. #7976

tstadel · 2024-07-04T12:18:10Z

Describe the bug
LinkContentFetcher uses the requests library under the hood. If requests receives a Content-Type header without an explicit encoding, it infers ISO-8859-1. This is in line with https://datatracker.ietf.org/doc/html/rfc2616#section-3.7.1 but is wrong most of the time, as UTF-8 has become the de facto standard (see https://stackoverflow.com/a/52615216 and https://stackoverflow.com/a/44203633).

This is especially a problem with HTML, which lets you declare the encoding at the HTML layer and does not rely on the Content-Type header.

As we want to get the bytes anyway, there's no need to handle the response as a string: we can simply access the bytes via response.content.
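To illustrate the effect described above, here is a minimal sketch using only the standard library. The sample body is made up, and the snippet does not call requests or LinkContentFetcher; it only shows what decoding UTF-8 bytes with the ISO-8859-1 fallback does to non-ASCII characters.

```python
# Hypothetical sample body: UTF-8 bytes, as a server might send them.
raw = "<p>café</p>".encode("utf-8")

# Decoding with the ISO-8859-1 fallback never fails, but it turns every
# multi-byte UTF-8 sequence into several unrelated characters.
as_latin1 = raw.decode("iso-8859-1")

# Decoding with the encoding the page actually uses.
as_utf8 = raw.decode("utf-8")

print(as_latin1)
print(as_utf8)

# The raw bytes themselves are untouched by the wrong guess, which is why
# working with the bytes directly sidesteps the problem.
assert as_latin1.encode("iso-8859-1") == raw
```

Because the bytes are unaffected by the wrong guess, passing them on unchanged leaves the decoding decision to whatever consumes them later.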

Error message
No error message, but the content is decoded with the wrong encoding

Expected behavior
Encoding is correct

To Reproduce
As an example where this is an issue, take https://www.mckinsey.com/about-us/new-at-mckinsey-blog/equal-at-mckinsey : the encoding is correctly declared as utf-8 in the HTML, but the Content-Type is just text/html.
Browsers and the httpx library handle this site correctly as UTF-8; LinkContentFetcher does not.
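For a page like this, a consumer of the raw bytes could pick the encoding roughly as sketched below. This is an illustration only, not the fix from #7975 and not what browsers or httpx actually implement. The function name `sniff_html_encoding`, the 1024-byte scan window, and the UTF-8 default are all assumptions made for the example.

```python
import re
from email.message import Message

# Matches both <meta charset="..."> and the http-equiv form with charset=...
_META_CHARSET = re.compile(rb"<meta[^>]+charset=[\"']?([\w-]+)", re.IGNORECASE)


def sniff_html_encoding(content_type: str, body: bytes, default: str = "utf-8") -> str:
    """Hypothetical helper: choose an encoding for an HTML response body."""
    # 1. An explicit charset in the Content-Type header wins.
    if content_type:
        msg = Message()
        msg["Content-Type"] = content_type
        charset = msg.get_param("charset")
        if isinstance(charset, str) and charset:
            return charset.lower()

    # 2. Otherwise look for a charset declaration near the top of the HTML.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()

    # 3. Fall back to UTF-8 (an assumption of this sketch) instead of ISO-8859-1.
    return default


# Made-up response resembling the case above: header without charset,
# encoding declared only in the HTML.
body = b'<html><head><meta charset="utf-8"></head><body>caf\xc3\xa9</body></html>'
encoding = sniff_html_encoding("text/html", body)
text = body.decode(encoding)
```

With the header `text/html` and the meta tag present, the sketch selects UTF-8, so the body decodes as the page author intended.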

tstadel mentioned this issue Jul 4, 2024

fix: LinkContentFetcher html text encoding #7975

Merged

tstadel changed the title ~~LinkContentFetcher uses wrong encoding for responses where encoding is not explicitly defined in Content-Type header.~~ LinkContentFetcher uses wrong encoding for html responses where encoding is not explicitly defined in Content-Type header. Jul 4, 2024

vblagoje closed this as completed in #7975 Jul 9, 2024
