Error when using Simplified Chinese material #2

Open
cauzp opened this issue May 4, 2024 · 4 comments

@cauzp

cauzp commented May 4, 2024

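Hello! Your project has helped me a lot. I cloned it, but after switching to a Simplified Chinese corpus, some topics show garbled text. Could you offer some support? If I want to use your project to process a Simplified Chinese corpus, which parts specifically need to be modified? I ask because I noticed that your project makes quite a few changes relative to the original sample.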

image

@Aidenzich
Owner

Hi! Thanks for raising this. Could you share the data you are using? I suspect the data may be in an encoding other than "utf-8".

@cauzp
Author

cauzp commented May 22, 2024

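> Hi! Thanks for raising this. Could you share the data you are using? I suspect the data may be in an encoding other than "utf-8".

Thank you very much for your reply. I am using UTF-8 encoding, and the error only appears in topic3, which confuses me a lot. I am sharing the data I used through the following GitHub repository: https://github.com/cauzp/data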

@Aidenzich
Owner

OK, I'll look into it.

@Aidenzich
Owner

Hello, I checked your file with the following code, and some lines contain data that is not UTF-8 encoded:

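import chardet


def detect_line_encoding(line):
    # chardet.detect takes raw bytes and returns a dict with
    # 'encoding' and 'confidence' keys ('encoding' may be None).
    return chardet.detect(line)


file_path = 'YOUR_FILE_PATH.csv'

# Open in binary mode so each line is inspected as raw bytes, before any decoding.
with open(file_path, 'rb') as f:
    for line_number, line in enumerate(f, start=1):
        encoding_info = detect_line_encoding(line)
        encoding = encoding_info['encoding']
        confidence = encoding_info['confidence']
        # Report every line whose guess is not 'utf-8'. Pure-ASCII lines are
        # reported too, although ASCII is a valid subset of UTF-8.
        if encoding != 'utf-8':
            print(f"Line {line_number}: Detected encoding: {encoding}, Confidence: {confidence}")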

Results:

Line 1: Detected encoding: ascii, Confidence: 1.0
Line 6: Detected encoding: None, Confidence: 0.0
Line 93: Detected encoding: None, Confidence: 0.0
Line 104: Detected encoding: None, Confidence: 0.0
Line 109: Detected encoding: None, Confidence: 0.0
Line 113: Detected encoding: None, Confidence: 0.0
Line 144: Detected encoding: None, Confidence: 0.0
Line 155: Detected encoding: None, Confidence: 0.0
Line 162: Detected encoding: None, Confidence: 0.0
Line 177: Detected encoding: None, Confidence: 0.0
Line 366: Detected encoding: None, Confidence: 0.0
Line 369: Detected encoding: None, Confidence: 0.0
Line 373: Detected encoding: None, Confidence: 0.0
Line 404: Detected encoding: None, Confidence: 0.0
Line 448: Detected encoding: None, Confidence: 0.0
Line 452: Detected encoding: None, Confidence: 0.0
Line 474: Detected encoding: None, Confidence: 0.0
Line 497: Detected encoding: None, Confidence: 0.0
Line 508: Detected encoding: None, Confidence: 0.0
Line 598: Detected encoding: None, Confidence: 0.0
Line 647: Detected encoding: None, Confidence: 0.0
Line 701: Detected encoding: None, Confidence: 0.0
Line 710: Detected encoding: None, Confidence: 0.0
Line 746: Detected encoding: None, Confidence: 0.0
Line 759: Detected encoding: None, Confidence: 0.0
Line 770: Detected encoding: None, Confidence: 0.0
Line 819: Detected encoding: None, Confidence: 0.0
Line 827: Detected encoding: None, Confidence: 0.0
Line 835: Detected encoding: None, Confidence: 0.0
Line 889: Detected encoding: None, Confidence: 0.0
Line 892: Detected encoding: TIS-620, Confidence: 0.20892844569841748
Line 959: Detected encoding: None, Confidence: 0.0
Line 990: Detected encoding: None, Confidence: 0.0
Line 992: Detected encoding: None, Confidence: 0.0
Line 1043: Detected encoding: None, Confidence: 0.0
Line 1056: Detected encoding: None, Confidence: 0.0
Line 1105: Detected encoding: None, Confidence: 0.0
Line 1290: Detected encoding: None, Confidence: 0.0
Line 1365: Detected encoding: None, Confidence: 0.0
Line 1404: Detected encoding: None, Confidence: 0.0
Line 1426: Detected encoding: None, Confidence: 0.0
Line 1463: Detected encoding: None, Confidence: 0.0
Line 1490: Detected encoding: None, Confidence: 0.0
Line 1507: Detected encoding: Windows-1252, Confidence: 0.2509375
Line 1528: Detected encoding: None, Confidence: 0.0
Line 1545: Detected encoding: None, Confidence: 0.0
Line 1551: Detected encoding: None, Confidence: 0.0
Line 1554: Detected encoding: None, Confidence: 0.0
Line 1575: Detected encoding: None, Confidence: 0.0
Line 1581: Detected encoding: None, Confidence: 0.0
Line 1591: Detected encoding: None, Confidence: 0.0
Line 1597: Detected encoding: None, Confidence: 0.0
Line 1602: Detected encoding: None, Confidence: 0.0
Line 1626: Detected encoding: None, Confidence: 0.0
Line 1711: Detected encoding: None, Confidence: 0.0
Line 1712: Detected encoding: None, Confidence: 0.0
Line 1713: Detected encoding: TIS-620, Confidence: 0.30224981811037727
Line 1734: Detected encoding: None, Confidence: 0.0
Line 1758: Detected encoding: None, Confidence: 0.0
Line 1808: Detected encoding: None, Confidence: 0.0
Line 1817: Detected encoding: None, Confidence: 0.0
Line 1827: Detected encoding: None, Confidence: 0.0
Line 1845: Detected encoding: None, Confidence: 0.0
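
A note on reading this output: an encoding of None with confidence 0.0 means chardet could not make a guess for that line, which is not by itself proof of what the bytes are. A more direct test is to attempt a strict UTF-8 decode of each line. The following is a minimal sketch of that idea; the function name `find_invalid_utf8_lines` and the sample bytes are made up for illustration and are not part of the project.

```python
def find_invalid_utf8_lines(lines):
    """Return (line_number, reason) for each bytes line that is not valid UTF-8.

    `lines` is any iterable of bytes objects, e.g. a file opened with 'rb'.
    """
    bad = []
    for line_number, line in enumerate(lines, start=1):
        try:
            line.decode("utf-8")  # strict error handling is the default
        except UnicodeDecodeError as exc:
            bad.append((line_number, exc.reason))
    return bad


# Illustrative data only: line 2 is GBK-encoded, so it is not valid UTF-8.
sample_lines = [
    "第一行\n".encode("utf-8"),
    "第二行\n".encode("gbk"),
    b"plain ascii\n",
]
bad_lines = find_invalid_utf8_lines(sample_lines)
print(bad_lines)
```

For a real file, the same function can be called as `find_invalid_utf8_lines(open(file_path, 'rb'))`. Unlike the chardet check, pure-ASCII lines are not reported here, because they decode as UTF-8 without error.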

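Chinese data that was scraped directly from the web, or that is fairly old, often contains records in a non-UTF-8 encoding. Because most language models are trained on UTF-8 encoded data, you first need to make sure the data is correctly decoded and converted to UTF-8.
You could start by removing these lines to see whether the garbled text disappears, and then work out how to convert them to UTF-8 as well.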
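
The two steps suggested above (drop the problem lines first, then try to convert them) could be sketched roughly as follows. This is only an illustration under assumptions: the helper name `split_and_recover` is hypothetical, and `gb18030` as the fallback encoding is a guess that is often reasonable for Simplified Chinese text but has not been verified against the data in this issue.

```python
def split_and_recover(lines, fallback_encoding="gb18030"):
    """Sort bytes lines into clean UTF-8, recovered via fallback, and dropped.

    fallback_encoding is an assumption; choose it based on where the data came from.
    """
    clean, recovered, dropped = [], [], []
    for line_number, line in enumerate(lines, start=1):
        try:
            clean.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            try:
                # Second attempt: interpret the bytes in the fallback encoding.
                recovered.append((line_number, line.decode(fallback_encoding)))
            except UnicodeDecodeError:
                # Neither encoding fits, so only record the line number.
                dropped.append(line_number)
    return clean, recovered, dropped


# Illustrative data only, not taken from the shared repository.
sample_lines = [
    "簡體中文\n".encode("utf-8"),
    "简体中文\n".encode("gb18030"),
    b"\xff\xff\n",
]
clean, recovered, dropped = split_and_recover(sample_lines)
print(len(clean), recovered, dropped)
```

Training on `clean` alone corresponds to the "remove these lines" experiment. A fallback decode that succeeds can still produce wrong characters if the guessed encoding is not the real one, so the `recovered` lines are worth checking by eye before they are written back out as UTF-8.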
