Update Extended OMW #183

Open
wants to merge 3 commits into
base: gh-pages

@ekaf
Contributor

ekaf commented Feb 17, 2022

This PR updates the "extended_omw" package with additional wordnets from the "wns" folder in the recent OMW 1.4 source release (retrieved from https://github.com/omwn/omw-data/archive/refs/tags/v1.4.zip).

In particular, this PR corrects large numbers of errors in the Tosk Albanian ('als'), Standard Arabic ('arb') and Castilian ('spa') Wiktionary wordnets in the 'wikt' folder.

Initially added, then retracted following #183 (comment): Persian ("fas") and an alternative Chinese wordnet ("qcn"), which are included in NLTK's "omw" package but were left out of omw-1.4 because of quality concerns (cf. discussions at #171).

Everything in this PR was just copied verbatim from the upstream source release. As a consequence, all folders now include LICENSE and citation.bib files, so that the standard citation() and license() functions return appropriate information about the languages covered in extended_omw.

Sample use, assuming nltk/nltk#2946:

import nltk
from nltk.corpus import wordnet as wn
print(f"Loaded Wordnet v. {wn.get_version()} with {len(wn.langs())} languages from OMW-1.4")

Loaded Wordnet v. 3.0 with 32 languages from OMW-1.4

wn.add_exomw()
print(f"Loaded {len(wn.langs())} languages in total with Extended OMW")

Loaded 1192 languages in total with Extended OMW

ss = wn.synset('example.n.01')
print(ss.lemma_names(lang="cmn"))

['事例', '例', '例子', '例证']

print(ss.lemma_names(lang="cmn_wikt"))

['例子', '例', '榜样', '例证']

Retracted:

print(ss.lemma_names(lang="qcn"))

['例子', '比方']

@fcbond
Contributor

fcbond commented Feb 17, 2022 via email

@ekaf
Contributor Author

ekaf commented Feb 17, 2022

@fcbond: Of course I will remove these two languages if you insist.
My understanding was that these files were ok to put in an omw-extra package, since you still distribute them in the OMW-data source release.
Also, since they are still included in NLTK's old "omw" package, there is a concern that users might miss them when upgrading to newer NLTK versions, and that old code would break.
Finally, the "fas" and "qcn" wordnets are supported by scientific papers (cf. their "citation.bib"), so the quality issues might not be much worse than in some other languages (e.g. French, which also has big quality issues).
There may still be a concern about the vague licensing terms of the Persian wordnet, but maybe this could be resolved by asking the authors?

@ekaf
Contributor Author

ekaf commented Feb 18, 2022

According to #171 (comment), "Native speakers of Farsi and Mandarin have pointed out that these two resources have some quality issues".

It would be interesting to hear more about the severity of the alleged issues.

And wouldn't the same argument apply to all wordnets? In particular, many quality issues have been reported about Princeton Wordnet. Issues are also often raised in OEWN. Discussing the issues openly is a way to eventually solve them...

@ekaf
Contributor Author

ekaf commented Feb 18, 2022

Two languages ('fas' and 'qcn') were retracted, since @fcbond clearly does not allow their redistribution, cf. #183 (comment).

The big wordnetwiktionaryalignments-2013-02-19.tsv file is not included, since there is no handler for it.

So now, the proposed update consists of adding citation.bib files to the wikt and cldr folders, plus 3 updated Wiktionary wordnets, with the following numbers of lemmas:

2567 wn-wikt-als.tab (Tosk Albanian)
9337 wn-wikt-arb.tab (Standard Arabic)
25311 wn-wikt-spa.tab (Castilian)
37215 total
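
For anyone who wants to reproduce these counts, here is a minimal sketch. It assumes the usual OMW tab layout (an optional '#' header line, then rows of synset id, a label such as "spa:lemma", and the word, separated by tabs). The function name and the example path are hypothetical, not part of the package.

```python
def count_lemmas(tab_path):
    """Count lemma rows in an OMW-style .tab file.

    Assumption: rows look like  <synset-id> TAB <label> TAB <word>,
    where the label contains the token 'lemma' (e.g. 'spa:lemma').
    """
    count = 0
    with open(tab_path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # skip the header/comment line
            fields = line.rstrip("\n").split("\t")
            # Only rows with at least three columns and a 'lemma' label count
            if len(fields) >= 3 and "lemma" in fields[1].split(":"):
                count += 1
    return count


if __name__ == "__main__":
    import sys

    # Usage (hypothetical path): python count_lemmas.py wikt/wn-wikt-spa.tab
    for path in sys.argv[1:]:
        print(count_lemmas(path), path)
```

Rows with other labels (definitions, examples) and malformed rows are ignored, so the result should be comparable to the lemma totals above only if the files contain nothing else that is labelled as a lemma.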

@stevenbird
Member

@ekaf: sorry for the delay. I don't want to blow away the existing zipfile with a new one, but to replace individual files. Would you please help me out with a list of the required files? Is it:

wikt/wn-wikt-als.tab
wikt/wn-wikt-arb.tab
wikt/wn-wikt-spa.tab
wikt/citation.bib
cldr/citation.bib

I'm confused because you say: "3 new wiktionary wordnets", but those 3 files already exist. Also, I see a new top-level citation.bib file apart from wikt/citation.bib and cldr/citation.bib... what should happen to that?
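
Replacing individual files rather than the whole zipfile could be sketched as below. Python's zipfile module cannot overwrite a member in place, so this copies the archive and swaps in new content on the way. The function name, archive paths and member names are assumptions for illustration only.

```python
import zipfile


def replace_members(src_zip, dst_zip, replacements):
    """Copy src_zip to dst_zip, swapping in new bytes for selected members.

    replacements maps archive member names to their new content (bytes).
    Names that are not yet in the archive are appended at the end.
    """
    pending = dict(replacements)
    with zipfile.ZipFile(src_zip) as src, \
            zipfile.ZipFile(dst_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            # Use the replacement if one was given, else keep the old bytes
            data = pending.pop(info.filename, None)
            if data is None:
                data = src.read(info.filename)
            dst.writestr(info, data)
        # Whatever is left was not in the old archive: add it as new
        for name, data in pending.items():
            dst.writestr(name, data)
```

A call might look like `replace_members("extended_omw.zip", "extended_omw_new.zip", {"extended_omw/wikt/citation.bib": new_bytes})`, where the member path is a guess at the archive layout and should be checked against the actual zipfile.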

@ekaf
Contributor Author

ekaf commented Jul 14, 2022

@stevenbird, yes, your list is accurate. The top-level citation.bib refers to the whole OMW project and should be added as well.
It is true that the old package already contains files for the 3 "new" wordnets, but the old ones have huge issues: almost all the als lemmas are misplaced into the spa file, while the arb lemmas are erroneously marked lemma:arn, which is the identifier for the Mapuche, Mapudungun language.
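
This kind of mislabeling can be spotted with a small check like the following sketch. It assumes that the language code is the last hyphen-separated part of the filename (wn-wikt-arb.tab gives "arb") and that the label column pairs a language code with one of the types lemma, def or exe. All function names here are hypothetical.

```python
from collections import Counter
from pathlib import Path

# Assumed set of row types that can appear next to the language code
KNOWN_TYPES = {"lemma", "def", "exe"}


def expected_code(tab_path):
    """Derive the language code from a name like 'wn-wikt-arb.tab'."""
    return Path(tab_path).stem.rsplit("-", 1)[-1]


def language_tags(tab_path):
    """Tally the language codes found in the label column of a .tab file."""
    tags = Counter()
    with open(tab_path, encoding="utf-8") as fh:
        for line in fh:
            if line.startswith("#"):
                continue  # header/comment line
            fields = line.rstrip("\n").split("\t")
            if len(fields) < 3:
                continue
            # Keep whatever part of the label is not a row type
            codes = [p for p in fields[1].split(":") if p not in KNOWN_TYPES]
            if len(codes) == 1:
                tags[codes[0]] += 1
    return tags


def unexpected_tags(tab_path):
    """Return the tags (with counts) that do not match the filename."""
    expected = expected_code(tab_path)
    return {t: n for t, n in language_tags(tab_path).items() if t != expected}
```

An empty result from `unexpected_tags` suggests that the file only contains rows for the language named in its filename; a non-empty one points at rows worth inspecting.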
Everything in the new package is copied verbatim from the newer OMW-1.4 source, which brings one additional change: many of the wordnet filenames in the wikt folder contained an asterisk '*', which is now replaced by an 'X'. Avoiding '*' in filenames may not always be crucial, but it is safer.
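
For illustration, the '*' to 'X' substitution can be sketched as below. This is not how the upstream release was produced, just a minimal equivalent; the function names are hypothetical.

```python
from pathlib import Path


def safe_name(name):
    """Replace every '*' in a filename with 'X'."""
    return name.replace("*", "X")


def rename_starred(folder):
    """Rename files in folder whose names contain '*'.

    Returns a list of (old_name, new_name) pairs. Refuses to overwrite
    an existing file.
    """
    renamed = []
    for path in sorted(Path(folder).iterdir()):
        target = path.with_name(safe_name(path.name))
        if target != path:
            if target.exists():
                raise FileExistsError(target)
            path.rename(target)
            renamed.append((path.name, target.name))
    return renamed
```

Keeping the name mapping in a separate pure function makes it easy to apply the same rule wherever the filenames are referenced, for example when a loader builds the list of files to read.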

@stevenbird
Member

Hi @ekaf, @fcbond, sorry for the long delay on this! Can you please suggest the simplest way for me to get the current files? Perhaps a full drop-in replacement for NLTK's extended_omw.zip (minus anything @fcbond doesn't want included)?

@ekaf
Contributor Author

ekaf commented Jun 19, 2024

Hi @stevenbird, thanks for your interest :) Yes, this package is a drop-in update of @ExplorerFreda's original package. I think it is ok, except that there is now a newer webpage URL (https://omwn.org) to include in extended_omw.xml. The topmost README file might also benefit from some editing.

4 participants