Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

In windows prompts cannot be chinese refer to pathlib read_text #380

Closed
wants to merge 1 commit into from

Conversation

liseri
Copy link

@liseri liseri commented Jul 5, 2024

Description

In windows , prompts cannot be chinese , because pathlib read_text default to use gbk to read prompts file;

Proposed Changes

pathlib read_text add encoding=utf-8 params

Checklist

  • [ *] I have tested these changes locally.
  • [ *] I have reviewed the code changes.
  • [ *] I have updated the documentation (if necessary).
  • [ *] I have added appropriate unit tests (if applicable).

@liseri liseri requested a review from a team as a code owner July 5, 2024 10:34
Copy link
Author

@liseri liseri left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's ok

@liseri
Copy link
Author

liseri commented Jul 5, 2024

@microsoft-github-policy-service agree

@microsoft-github-policy-service agree

@jgbradley1
Copy link
Contributor

jgbradley1 commented Jul 5, 2024

Thanks for the contribution @liseri! Should errors be set here as well? And do you think it would be more appropriate/helpful here to have errors=‘strict’ (which would raise an exception and exit when non-utf8 is encountered) or errors=‘ignore’?

@glide-the
Copy link
Contributor

Thanks for the contribution @liseri! Should errors be set here as well? And do you think it would be more appropriate/helpful here to have errors=‘strict’ (which would raise an exception and exit) or errors=‘ignore’?

This is a common bug with the Windows system in China, caused by various factors, primarily summarized as follows: GPK, as a universal code, is widely used, but the Win32 interface at the bottom of Windows defaults to using system variables. This variable can be modified to other encodings, but it may cause some programs to crash because some programs do not use this Win32 interface when running. This issue is mostly resolved by upstream frameworks specifying an encoding parameter or by directly using binary reading.

In this case, binary reading should be used instead of file stream reading.

@liseri
Copy link
Author

liseri commented Jul 8, 2024

Thanks for the contribution @liseri! Should errors be set here as well? And do you think it would be more appropriate/helpful here to have errors=‘strict’ (which would raise an exception and exit when non-utf8 is encountered) or errors=‘ignore’?

It is sufficient to set the utf-8 parameter ; because the prompts file is generated by graphrag itself in utf-8, ensuring that read_text reads in utf-8 will guarantee correctness; if the utf-8 parameter is not set for read_text, it may read using the operating system’s default character encoding (such as gbk encoding on Windows systems).

1 similar comment
@liseri
Copy link
Author

liseri commented Jul 8, 2024

Thanks for the contribution @liseri! Should errors be set here as well? And do you think it would be more appropriate/helpful here to have errors=‘strict’ (which would raise an exception and exit when non-utf8 is encountered) or errors=‘ignore’?

It is sufficient to set the utf-8 parameter ; because the prompts file is generated by graphrag itself in utf-8, ensuring that read_text reads in utf-8 will guarantee correctness; if the utf-8 parameter is not set for read_text, it may read using the operating system’s default character encoding (such as gbk encoding on Windows systems).

@AlonsoGuevara
Copy link
Contributor

@liseri what are your thoughts on @glide-the 's comment? We could replace this by binary reading

@liseri
Copy link
Author

liseri commented Jul 9, 2024

@liseri what are your thoughts on @glide-the 's comment? We could replace this by binary reading

I think that solution is better; I just chose the simplest approach, as I didn’t want to implement something too complicated; as long as the issue can be resolved, I’m open to any solution; I’ll go ahead and close this pull later.

@liseri liseri closed this Jul 9, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants