
Maintainability of WikimediaLanguageCodes.java #413

Open
wetneb opened this issue May 31, 2019 · 6 comments

wetneb (Member) commented May 31, 2019

The class WikimediaLanguageCodes contains a hand-crafted mapping between Wikimedia language codes and IETF language codes.
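
For readers who have not opened the class, here is a minimal sketch of the shape of that mapping. The entries and the lookup method below are illustrative assumptions about its structure, not excerpts from the actual source:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the hand-crafted mapping in WikimediaLanguageCodes.
// The entries and method name are assumptions, not the actual class contents.
public class WikimediaLanguageCodesSketch {

    private static final Map<String, String> CODES = new HashMap<>();

    static {
        // Wikimedia language code -> IETF (BCP 47) language tag
        CODES.put("als", "gsw");            // Alemannic Wikipedia
        CODES.put("be-x-old", "be-tarask"); // Taraškievica Belarusian
        CODES.put("no", "nb");              // the "no" wiki content is Bokmål
        // ... hundreds more hand-curated entries
    }

    public static String getLanguageCode(String wikimediaCode) {
        String ietfTag = CODES.get(wikimediaCode);
        if (ietfTag == null) {
            throw new IllegalArgumentException(
                    "Unknown Wikimedia language code: " + wikimediaCode);
        }
        return ietfTag;
    }
}
```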

Maintaining this mapping here is not ideal: it requires curation effort, and the data is not easy to reuse for people who need it outside the Java ecosystem. This raises the question: maybe we could use some sort of generic, collaborative open data project to maintain a mapping between identifier schemes? What on earth could that project be?

… so we need to push the existing data to Wikidata and update the mapping periodically by running a SPARQL query. How meta!

Example query: https://w.wiki/4Vy
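
A query in this spirit could drive the regeneration, using P424 (Wikimedia language code) and P305 (IETF language tag). The actual query behind the short link may differ, so treat this as a sketch:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch of how the mapping could be regenerated from Wikidata.
// P424 (Wikimedia language code) and P305 (IETF language tag) are real
// Wikidata properties, but the actual query at https://w.wiki/4Vy may differ.
public class LanguageCodeQuery {

    static final String MAPPING_QUERY =
            "SELECT ?wikimediaCode ?ietfTag WHERE { " +
            "  ?lang wdt:P424 ?wikimediaCode ; " + // Wikimedia language code
            "        wdt:P305 ?ietfTag . " +       // IETF language tag
            "}";

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Print a ready-to-run Wikidata Query Service URL for this query.
        System.out.println("https://query.wikidata.org/sparql?query="
                + URLEncoder.encode(MAPPING_QUERY, "UTF-8"));
    }
}
```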

Tpt (Collaborator) commented Jun 4, 2019

Indeed. It's very painful to maintain. +1 for the SPARQL query.

wetneb (Member, Author) commented Jun 4, 2019

Of course, the difficult part is importing the existing data, because Wikidata does not yet contain all the Wikimedia language codes… I have started doing this, creating items such as https://www.wikidata.org/wiki/Q64363007.

Tpt (Collaborator) commented Jun 4, 2019

I'm not sure it's useful to import all language codes. New language codes should follow BCP 47, so properly formatting them should be enough to convert most language tags. We could then keep a dictionary of exceptions, either extracted from a SPARQL query or hardcoded in WikidataToolkit.
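
A minimal sketch of this "format, then fall back to an exception dictionary" approach, using the JDK's built-in BCP 47 support. The exception entries are illustrative, not an authoritative list:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of the "format plus exception dictionary" approach suggested above.
// The exception entries are illustrative examples, not an authoritative list.
public class LanguageTagConverter {

    private static final Map<String, String> EXCEPTIONS = new HashMap<>();

    static {
        // Legacy Wikimedia codes that BCP 47 formatting alone cannot fix.
        EXCEPTIONS.put("als", "gsw");
        EXCEPTIONS.put("be-x-old", "be-tarask");
    }

    public static String toIetfTag(String wikimediaCode) {
        String exception = EXCEPTIONS.get(wikimediaCode);
        if (exception != null) {
            return exception;
        }
        // Locale.forLanguageTag normalizes case and structure of well-formed
        // tags, and maps grandfathered tags to their canonical replacements.
        return Locale.forLanguageTag(wikimediaCode).toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(toIetfTag("zh-min-nan")); // "nan" (grandfathered tag)
        System.out.println(toIetfTag("als"));        // "gsw" (exception dictionary)
        System.out.println(toIetfTag("EN-us"));      // "en-US" (normalization)
    }
}
```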

wetneb (Member, Author) commented Jun 4, 2019

Makes sense! I am currently using this mapping in OpenRefine to check whether a Wikimedia language code exists at all, so for that application I do need completeness… But that is not what this class is intended for, so I might as well store the allowed language codes there directly.

Tpt (Collaborator) commented Jun 4, 2019

If you need the full list, let's keep it there. There is no point in writing the conversion code if you already have to maintain the list of language tags.

wetneb (Member, Author) commented Jun 6, 2019

Yeah, but I am not exactly sure how the existing mapping was constructed, so it is not clear to me that I can safely import it into Wikidata. If I don't import it, some mappings will disappear once the data is generated from SPARQL, which would be a regression.

I still think it could be a good idea to maintain this in Wikidata, but since I don't know the specifics of these different language codes or the application @mkroetzsch had in mind when writing this class, I will abstain for now.

Instead, I will store the list of valid language codes for terms and monolingual text directly.
