wikidatastats.py

This module works with pre-downloaded Wikipedia data, specifically name pairs for transliteration.

The purpose of this module is to gather script statistics from all the languages, so we can get a measure of script similarity.

We also romanized all name pairs using uroman, by Ulf Hermjakob at ISI.

The Wikipedia name pairs will be available for download soon.

In the future, we will also build support for reading name pairs described in [Irvine2012].

Module Documentation

wikidatastats.compare(langid1, langid2, langdists)

Get the script similarity between two languages. This retrieves the utils.Language objects whose ISO codes are langid1 and langid2, then calls simdist() on them.

If the languages are not present in langdists, then the returned score is -1.

Parameters:
  • langid1 – 3-letter language code
  • langid2 – 3-letter language code
  • langdists – output from wikidatastats.loaddump()
Returns: a score of script similarity
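The lookup-then-score behaviour described above could be sketched as follows. This is a minimal illustration, not the module's code: the Language namedtuple, its iso3 field, and simdist_sketch() are hypothetical stand-ins for utils.Language and simdist(), and returning -1 when either code is missing is one reading of the rule above.

```python
from collections import namedtuple

# Hypothetical stand-in for utils.Language; only the fields used here.
Language = namedtuple("Language", ["iso3", "charfreqs"])


def simdist_sketch(d1, d2):
    # Placeholder similarity (histogram intersection); an assumption,
    # not necessarily what wikidatastats.simdist() computes.
    return sum(min(f, d2.get(ch, 0.0)) for ch, f in d1.items())


def compare_sketch(langid1, langid2, langdists):
    """Look up two languages by ISO code and score their scripts."""
    by_iso = {lang.iso3: lang for lang in langdists.values()}
    if langid1 not in by_iso or langid2 not in by_iso:
        return -1  # sentinel for a language missing from langdists
    return simdist_sketch(by_iso[langid1].charfreqs,
                          by_iso[langid2].charfreqs)


# langdists is keyed by wiki name, as loaddump() is documented to return.
langdists = {
    "English": Language("eng", {"a": 0.5, "b": 0.5}),
    "German": Language("deu", {"a": 0.5, "b": 0.25, "ü": 0.25}),
}
found = compare_sketch("eng", "deu", langdists)    # both codes present
missing = compare_sketch("eng", "xxx", langdists)  # unknown code
```

Callers would need to check for the -1 sentinel before treating the result as a similarity.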

wikidatastats.countscripts(langdists)

This counts the number of scripts in the data. This is a convenience method meant to be run from the command line. It prints the results.

The method is a bottom-up clustering. For each new language, it uses simdist() to get a similarity score between that language and each existing cluster. If all scores are below a threshold (arbitrarily set at 0.5), the language starts a new cluster. Otherwise it joins the best-scoring cluster.

Parameters: langdists – output from wikidatastats.loaddump()
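The clustering procedure described above might look roughly like this. It is a sketch under assumptions: each cluster is represented by the distribution of its first member (the source does not say how clusters are represented), simdist_sketch() is a placeholder similarity, and cluster_scripts() is a hypothetical name that returns the clusters rather than printing them.

```python
def simdist_sketch(d1, d2):
    # Placeholder similarity (histogram intersection); an assumption.
    return sum(min(f, d2.get(ch, 0.0)) for ch, f in d1.items())


def cluster_scripts(charfreqs_by_lang, threshold=0.5):
    """Greedy bottom-up clustering of languages by script similarity."""
    clusters = []  # each cluster: {"rep": distribution, "members": [names]}
    for name, dist in charfreqs_by_lang.items():
        scores = [simdist_sketch(dist, c["rep"]) for c in clusters]
        if not scores or max(scores) < threshold:
            # No cluster is similar enough: start a new one.
            clusters.append({"rep": dist, "members": [name]})
        else:
            # Join the cluster with the highest similarity score.
            best = scores.index(max(scores))
            clusters[best]["members"].append(name)
    return [c["members"] for c in clusters]


data = {
    "English": {"a": 0.5, "b": 0.5},
    "German": {"a": 0.5, "b": 0.25, "ü": 0.25},
    "Russian": {"д": 0.5, "ж": 0.5},
}
groups = cluster_scripts(data)
```

Because the procedure is greedy, the resulting clusters can depend on the order in which languages are visited.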
wikidatastats.getclosest(lang, langdists)

This calculates script similarities between lang and all other languages in langdists.

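Parameters:
  • lang – the language to compare against all others
  • langdists – output from wikidatastats.loaddump()
Returns: a map of form {langcode : float, ...}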

wikidatastats.listsizes(limit=0)

Lists all the languages whose data file has more than limit lines.

Parameters: limit – integer lower limit on the number of lines in the data file
wikidatastats.loaddump(dumpname='data/wikilanguages.pkl')

This loads a pickle file that was previously created with wikidatastats.makedump(). Most importantly, each returned utils.Language object has its charfreqs field set.

Parameters: dumpname – name of the pickle file to load from.
Returns: a map of form {wikiname : utils.Language, ...}
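A pickle round trip of the kind described above can be sketched as follows. Plain dicts stand in for utils.Language objects, and loaddump_sketch() is a hypothetical name; the real loader may do more than this.

```python
import os
import pickle
import tempfile


def loaddump_sketch(dumpname):
    """Load a {wikiname: language record} map from a pickle file."""
    with open(dumpname, "rb") as f:
        return pickle.load(f)


# Plain dicts stand in for utils.Language objects (an assumption).
records = {"English": {"charfreqs": {"a": 0.5, "b": 0.5}}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "wikilanguages.pkl")
    # Write the dump, as makedump() is described as doing.
    with open(path, "wb") as f:
        pickle.dump(records, f)
    # Read it back before the temporary directory is removed.
    loaded = loaddump_sketch(path)
```

Unpickling can execute arbitrary code, so dump files should only be loaded from trusted sources.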
wikidatastats.loadnamemap()

Produces a map from two-letter code to Wikipedia name. This reads data/wikilanguages.

Returns: a map of form {two-letter code : wiki name, ...}
wikidatastats.makedump(mypath, outname='data/wikilanguages.pkl')

This collects and dumps information about every wikidata file. The name of the output file is given by outname.

Parameters: mypath – path to the wikidata/ folder.
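One piece of information such a dump needs is the charfreqs distribution for each language. The source does not say how it is computed; the sketch below assumes simple relative character frequencies over the names, ignoring whitespace, and charfreqs_sketch() is a hypothetical helper.

```python
from collections import Counter


def charfreqs_sketch(names):
    """Relative frequency of each non-space character in a list of names."""
    counts = Counter(ch for name in names for ch in name
                     if not ch.isspace())
    total = sum(counts.values())
    # An empty input yields an empty distribution rather than dividing by zero.
    return {ch: n / total for ch, n in counts.items()} if total else {}


freqs = charfreqs_sketch(["ab", "a b", "aa"])  # 4 x "a", 2 x "b"
```

Choices such as case folding or dropping punctuation would change the resulting distribution and are left out here.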
wikidatastats.simdist(d1, d2)

This gives a similarity score between two character distributions. These distributions are usually taken from the charfreqs field of utils.Language objects.

Parameters:
  • d1 – map of form {character : float, ...}
  • d2 – map of form {character : float, ...}
Returns: similarity score (float)
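The source does not state which measure simdist() uses. As an illustration only, the sketch below uses histogram intersection, which gives 1.0 for identical normalized distributions and 0.0 for distributions with no characters in common; simdist_sketch() is a hypothetical name.

```python
def simdist_sketch(d1, d2):
    """Histogram intersection of two character distributions.

    Assumption: a stand-in measure, not necessarily the one
    wikidatastats.simdist() uses. Both inputs map character ->
    relative frequency and are expected to sum to roughly 1.0.
    """
    # Sum the probability mass the two distributions share per character.
    return sum(min(freq, d2.get(ch, 0.0)) for ch, freq in d1.items())


latin_a = {"a": 0.5, "b": 0.5}
latin_b = {"a": 0.25, "b": 0.25, "c": 0.5}
cyrillic = {"д": 0.5, "ж": 0.5}

same = simdist_sketch(latin_a, latin_a)       # identical distributions
partial = simdist_sketch(latin_a, latin_b)    # partly overlapping
disjoint = simdist_sketch(latin_a, cyrillic)  # no shared characters
```

A measure bounded in [0, 1] like this one is compatible with the 0.5 threshold mentioned for countscripts(), but other measures (for example cosine similarity) would fit the same interface.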

Bibliography

[Irvine2012] Ann Irvine, Chris Callison-Burch, and Alexandre Klementiev. “Transliterating from All Languages.” Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA). 2010.