Will the data-cleaning code for the raw web pages be released? #2

Open
wqcabjkcuh opened this issue Mar 6, 2020 · 1 comment

Comments

@wqcabjkcuh

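The web page data in Common Crawl contains a lot of dirty data, and it needs careful filtering to yield clean Chinese text. I see that the technical document you provided describes several processing methods, but only in fairly general terms. Will the data-cleaning code be open-sourced at some point?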

@michael-wzhu

Same question: will the data-cleaning code be made public?
