Losing POS Tagging & Other Token Attributes when Segmenting with Jieba or Pkuseg #12846
creolio started this conversation in Language Support · 1 comment, 1 reply
Hi @creolio, in the first example (with the default segmenter) you're using the […]. In your other two snippets, using […]
I'm trying to ensure that I have accurate word segmentation/tokenization for Chinese while retaining access to token attributes such as part of speech, but it seems that when I switch segmenters from the default, I lose most of the token attribute data. I'm not training any custom models or anything like that.
My base jupyter notebook code looks like this:
With the above, I'm able to get both segmentation and token attributes, but am confused because I thought the default segmenter was "char". I'm using this as my solution for now, but would like to be able to play with other segmenters:
[screenshot: output table of segmented tokens with POS, tag, and dependency attributes populated]
When I change:
`nlp = spacy.load("zh_core_web_sm")`
To:
I get this output:
[screenshot: output with segmented tokens but token attributes missing]
Or if I use the following instead:
Or:
I get this output:
[screenshot: output with segmented tokens but token attributes missing]
Using:
- Python 3.10
- spaCy 3.6