
Maintainability of WikimediaLanguageCodes.java #413

Open
wetneb opened this issue May 31, 2019 · 6 comments

wetneb (Member) commented May 31, 2019

The class WikimediaLanguageCodes contains a hand-crafted mapping between Wikimedia language codes and IETF language codes.
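
For readers who have not opened the class, here is a minimal sketch of the shape of that mapping. The entries and the lookup method below are illustrative assumptions about its structure, not excerpts from the actual source:

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the hand-crafted mapping in WikimediaLanguageCodes.
// The entries and method name are assumptions, not the actual class contents.
public class WikimediaLanguageCodesSketch {

    private static final Map<String, String> CODES = new HashMap<>();

    static {
        // Wikimedia language code -> IETF (BCP 47) language tag
        CODES.put("als", "gsw");            // Alemannic Wikipedia
        CODES.put("be-x-old", "be-tarask"); // Taraškievica Belarusian
        CODES.put("no", "nb");              // the "no" wiki content is Bokmål
        // ... hundreds more hand-curated entries
    }

    public static String getLanguageCode(String wikimediaCode) {
        String ietfTag = CODES.get(wikimediaCode);
        if (ietfTag == null) {
            throw new IllegalArgumentException(
                    "Unknown Wikimedia language code: " + wikimediaCode);
        }
        return ietfTag;
    }
}
```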

Maintaining this mapping here is not ideal: it requires curation effort, and the data is not easy to reuse for people who need it outside the Java ecosystem. This raises the question: maybe we could use some sort of generic, collaborative open data project to maintain a mapping between identifier schemes? What on earth could that project be?

… so we need to push the existing data to Wikidata and update the mapping periodically by running a SPARQL query. How meta!

Example query: https://w.wiki/4Vy
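
A query in this spirit could drive the regeneration, using P424 (Wikimedia language code) and P305 (IETF language tag). The actual query behind the short link may differ, so treat this as a sketch:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Sketch of how the mapping could be regenerated from Wikidata.
// P424 (Wikimedia language code) and P305 (IETF language tag) are real
// Wikidata properties, but the actual query at https://w.wiki/4Vy may differ.
public class LanguageCodeQuery {

    static final String MAPPING_QUERY =
            "SELECT ?wikimediaCode ?ietfTag WHERE { " +
            "  ?lang wdt:P424 ?wikimediaCode ; " + // Wikimedia language code
            "        wdt:P305 ?ietfTag . " +       // IETF language tag
            "}";

    public static void main(String[] args) throws UnsupportedEncodingException {
        // Print a ready-to-run Wikidata Query Service URL for this query.
        System.out.println("https://query.wikidata.org/sparql?query="
                + URLEncoder.encode(MAPPING_QUERY, "UTF-8"));
    }
}
```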

Tpt (Collaborator) commented Jun 4, 2019

Indeed. It's very painful to maintain. +1 for the SPARQL query.

wetneb (Member, Author) commented Jun 4, 2019

Of course, the difficult part is importing the existing data, because Wikidata does not yet contain all the Wikimedia language codes… I have started doing this, creating items such as https://www.wikidata.org/wiki/Q64363007.

Tpt (Collaborator) commented Jun 4, 2019

I'm not sure it's useful to import all language codes. New language codes should follow BCP 47, so properly formatting them should be enough to convert most language tags. We could then keep a dictionary of exceptions, either extracted from a SPARQL query or hardcoded in WikidataToolkit.
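
A minimal sketch of this "format, then fall back to an exception dictionary" approach, using the JDK's built-in BCP 47 support. The exception entries are illustrative, not an authoritative list:

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Sketch of the "format plus exception dictionary" approach suggested above.
// The exception entries are illustrative examples, not an authoritative list.
public class LanguageTagConverter {

    private static final Map<String, String> EXCEPTIONS = new HashMap<>();

    static {
        // Legacy Wikimedia codes that BCP 47 formatting alone cannot fix.
        EXCEPTIONS.put("als", "gsw");
        EXCEPTIONS.put("be-x-old", "be-tarask");
    }

    public static String toIetfTag(String wikimediaCode) {
        String exception = EXCEPTIONS.get(wikimediaCode);
        if (exception != null) {
            return exception;
        }
        // Locale.forLanguageTag normalizes case and structure of well-formed
        // tags, and maps grandfathered tags to their canonical replacements.
        return Locale.forLanguageTag(wikimediaCode).toLanguageTag();
    }

    public static void main(String[] args) {
        System.out.println(toIetfTag("zh-min-nan")); // "nan" (grandfathered tag)
        System.out.println(toIetfTag("als"));        // "gsw" (exception dictionary)
        System.out.println(toIetfTag("EN-us"));      // "en-US" (normalization)
    }
}
```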

wetneb (Member, Author) commented Jun 4, 2019

Makes sense! I am currently using this mapping in OpenRefine to check whether a Wikimedia language code exists at all, so for that application I do need completeness… But that is not what this class is intended for, so I might as well store the allowed language codes there directly.

Tpt (Collaborator) commented Jun 4, 2019

If you need the full list, let's keep it there. There is no point in writing the conversion code if you already have to maintain the list of language tags.

wetneb (Member, Author) commented Jun 6, 2019

Yeah, but I am not exactly sure how the existing mapping was constructed, so it is not clear to me that I can safely import it into Wikidata. If I don't import it, some mappings will disappear once the data is generated from SPARQL, which would be a regression.

I still think it could be a good idea to maintain this in Wikidata, but since I don't know the specifics of these different language codes or the application @mkroetzsch had in mind when writing this class, I will abstain for now.

Instead, I will store the list of valid language codes for terms and monolingual text directly.
