LinkContentFetcher uses wrong encoding for html responses where encoding is not explicitly defined in Content-Type header. #7976

Closed
tstadel opened this issue Jul 4, 2024 · 0 comments · Fixed by #7975

Describe the bug
LinkContentFetcher uses the requests library under the hood. When requests receives a Content-Type header without an explicit encoding, it infers ISO-8859-1. This is in line with https://datatracker.ietf.org/doc/html/rfc2616#section-3.7.1 but is wrong most of the time, as utf-8 has become a de facto standard (see https://stackoverflow.com/a/52615216 and https://stackoverflow.com/a/44203633).

This is especially a problem with HTML, which lets you declare the encoding in the document itself and does not rely on the Content-Type header.

Since we want the bytes anyway, there is no need to handle the response as a string: we can simply access the bytes via response.content.
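
The effect described above can be reproduced without any network access. The following is a minimal sketch using only the standard library; the sample body is made up for illustration and stands in for the raw bytes that response.content would return.

```python
# Hypothetical sample: the raw body of an HTML page that was authored as UTF-8.
body = '<meta charset="utf-8"><p>café, naïve, Zürich</p>'.encode("utf-8")

# What a client ends up with if it assumes ISO-8859-1 because the
# Content-Type header carries no charset parameter.
as_latin1 = body.decode("iso-8859-1")

# What the page author intended.
as_utf8 = body.decode("utf-8")

print(as_latin1)  # non-ASCII characters come out garbled
print(as_utf8)

# The raw bytes are unaffected by any encoding guess, so passing them on
# unchanged avoids the problem entirely.
assert as_latin1 != as_utf8
assert as_latin1.encode("iso-8859-1") == body
```

Decoding as ISO-8859-1 never raises, since every byte maps to some character, which is why the bug shows up as garbled text rather than as an error.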

Error message
No error message, but the content is decoded with the wrong encoding.

Expected behavior
Encoding is correct

To Reproduce
Example page where this is an issue: https://www.mckinsey.com/about-us/new-at-mckinsey-blog/equal-at-mckinsey
The encoding is correctly declared as utf-8 in the HTML, but the Content-Type is just text/html.
Browsers and the httpx library handle this site correctly as utf-8; LinkContentFetcher does not.
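
For illustration, here is one way a client could arrive at utf-8 for such a page: prefer a charset from the Content-Type header, then look for a charset declaration near the start of the HTML, then fall back to a default. This is a simplified sketch under those assumptions, not the algorithm used by browsers, httpx, or the eventual fix; the names charset_from_content_type and guess_html_encoding are hypothetical.

```python
import re
from email.message import Message
from typing import Optional

# Matches <meta charset="..."> as well as
# <meta http-equiv="Content-Type" content="text/html; charset=...">.
_META_CHARSET = re.compile(
    rb'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE
)


def charset_from_content_type(content_type: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, or None."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset()


def guess_html_encoding(content_type: str, body: bytes, default: str = "utf-8") -> str:
    """Pick an encoding: header charset, then HTML meta tag, then default."""
    header_charset = charset_from_content_type(content_type)
    if header_charset:
        return header_charset
    # Only scan the beginning of the document, where the declaration belongs.
    match = _META_CHARSET.search(body[:1024])
    if match:
        return match.group(1).decode("ascii").lower()
    return default


# Header says only text/html, but the document declares utf-8 itself.
page = b'<html><head><meta charset="utf-8"></head><body>...</body></html>'
print(guess_html_encoding("text/html", page))
```

Keeping the body as bytes, as proposed above, sidesteps this guessing altogether and leaves the decision to whatever consumes the HTML.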
