wikidatastats.py
This module works with pre-downloaded Wikipedia data, specifically name pairs for transliteration.
Its purpose is to gather script statistics across all languages, which gives a measure of script similarity between them.
All name pairs were also romanized using uroman, by Ulf Hermjakob at ISI.
The Wikipedia name pairs will be available for download soon.
In the future, we will also build support for reading the name pairs described in [Irvine2012].
Module Documentation
-
wikidatastats.compare(langid1, langid2, langdists)
Get the script similarity between two languages. This retrieves the utils.Language objects that have ISO codes of langid1 and langid2, then calls simdist(). If the languages are not present in langdists, the returned score is -1.
Parameters:
- langid1 – 3-letter language code
- langid2 – 3-letter language code
- langdists – output from wikidatastats.loaddump()
Returns: a score of script similarity
-
wikidatastats.countscripts(langdists)
Count the number of scripts in the data. This is a convenience method meant to be run from the command line; it prints the results.
The method is a bottom-up clustering. For each new language, it uses simdist() to get a similarity score against each existing cluster. If all scores are below a threshold (arbitrarily set at 0.5), the language starts a new cluster. Otherwise it joins the best-scoring cluster.
Parameters: langdists – output from wikidatastats.loaddump()
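The clustering rule described above can be sketched as follows. This is a minimal illustration, not the module's code: cluster_scripts and overlap are hypothetical names, the similarity measure is an assumed histogram overlap, each cluster is assumed to be represented by its first member, and the input maps names directly to character distributions rather than to utils.Language objects.

```python
def overlap(d1, d2):
    """Assumed similarity: histogram overlap of two {char: freq} maps."""
    return sum(min(d1[ch], d2[ch]) for ch in set(d1) & set(d2))


def cluster_scripts(dists, sim, threshold=0.5):
    """Greedy bottom-up clustering of {name: char distribution}.

    Assumption: a cluster is scored against its first member only.
    """
    clusters = []  # each cluster is a list of language names
    for name, dist in dists.items():
        best_score, best_cluster = -1.0, None
        for cluster in clusters:
            score = sim(dist, dists[cluster[0]])
            if score > best_score:
                best_score, best_cluster = score, cluster
        if best_cluster is None or best_score < threshold:
            clusters.append([name])      # nothing similar enough: new script
        else:
            best_cluster.append(name)    # join the best-scoring cluster
    return clusters


# Toy distributions, invented for illustration only.
toy = {
    "eng": {"a": 0.5, "b": 0.5},
    "fra": {"a": 0.6, "b": 0.2, "c": 0.2},
    "rus": {"д": 0.5, "я": 0.5},
    "ukr": {"д": 0.4, "я": 0.4, "і": 0.2},
}
clusters = cluster_scripts(toy, overlap)
print(len(clusters), clusters)
```

Because the procedure is greedy, the result can depend on the order in which languages are visited.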
-
wikidatastats.getclosest(lang, langdists)
Calculate script similarities between lang and all other languages in langdists.
Parameters:
- lang – 3-letter language code
- langdists – output from wikidatastats.loaddump()
Returns: a map of form {langcode : float, ...}
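As a rough sketch of what such a map looks like and how it might be ranked, the snippet below scores one language against every other. The names closest_scripts and overlap are hypothetical, the similarity measure is an assumed histogram overlap, and the input holds plain character distributions instead of utils.Language objects.

```python
def overlap(d1, d2):
    """Assumed similarity: histogram overlap of two {char: freq} maps."""
    return sum(min(d1[ch], d2[ch]) for ch in set(d1) & set(d2))


def closest_scripts(lang, dists, sim):
    """Return {other_lang: score} for every language except lang itself."""
    return {
        other: sim(dists[lang], dist)
        for other, dist in dists.items()
        if other != lang
    }


# Toy distributions, invented for illustration only.
toy = {
    "eng": {"a": 0.5, "b": 0.5},
    "rus": {"д": 0.5, "я": 0.5},
    "ukr": {"д": 0.4, "я": 0.4, "і": 0.2},
}
scores = closest_scripts("rus", toy, overlap)

# Rank the other languages from most to least similar.
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
print(ranked)
```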
-
wikidatastats.listsizes(limit=0)
List all the languages with size greater than limit.
Parameters: limit – integer lower limit on the number of lines in the data file
-
wikidatastats.loaddump(dumpname='data/wikilanguages.pkl')
Load a pickle file previously created using wikidatastats.makedump(). Most importantly, each returned utils.Language object has the charfreqs field set.
Parameters: dumpname – name of the pickle file to load from
Returns: a map of form {wikiname : utils.Language, ...}
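The dump-and-load round trip can be illustrated with the standard pickle module. This is only a sketch of the general pattern: SimpleNamespace stands in for utils.Language, the iso3 attribute is a hypothetical name, and the file is written to a temporary directory rather than data/wikilanguages.pkl.

```python
import os
import pickle
import tempfile
from types import SimpleNamespace

# Stand-in for utils.Language; only charfreqs comes from the documentation.
langs = {
    "English": SimpleNamespace(iso3="eng", charfreqs={"a": 0.5, "b": 0.5}),
    "Russian": SimpleNamespace(iso3="rus", charfreqs={"д": 0.5, "я": 0.5}),
}

with tempfile.TemporaryDirectory() as tmpdir:
    dumpname = os.path.join(tmpdir, "wikilanguages.pkl")

    # Write the {wikiname: language} map, as a dump step would.
    with open(dumpname, "wb") as f:
        pickle.dump(langs, f)

    # Read it back, as a load step would.
    with open(dumpname, "rb") as f:
        loaded = pickle.load(f)

print(sorted(loaded))
print(loaded["Russian"].charfreqs)
```

Unpickling can execute arbitrary code, so a dump file should only be loaded if it comes from a trusted source.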
-
wikidatastats.loadnamemap()
Produce a map from two-letter code to Wikipedia name. This reads data/wikilanguages.
Returns: a map of form {two-letter code : wiki name, ...}
-
wikidatastats.makedump(mypath, outname='data/wikilanguages.pkl')
Collect and dump information about every wikidata file. The output is written to the file named by outname.
Parameters: mypath – the path to the wikidata/ folder
-
wikidatastats.simdist(d1, d2)
Give a similarity score between two character distributions. These character distributions are usually taken from the charfreqs field in utils.Language.
Parameters:
- d1 – map of form {character : float, ...}
- d2 – map of form {character : float, ...}
Returns: similarity score (float)
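The documentation does not say which measure simdist() computes. As a minimal sketch, assuming histogram overlap (the sum of the smaller of the two frequencies for each shared character), a score over {character : float} maps might look like the following. The names overlap_similarity and char_freqs are hypothetical and are not part of the module.

```python
def overlap_similarity(d1, d2):
    """Assumed measure: histogram overlap of two {char: freq} maps.

    If both maps hold relative frequencies summing to 1, the score lies
    between 0 (no shared characters) and 1 (identical distributions).
    """
    return sum(min(d1[ch], d2[ch]) for ch in set(d1) & set(d2))


def char_freqs(text):
    """Build a {character: relative frequency} map, ignoring whitespace."""
    counts = {}
    for ch in text:
        if not ch.isspace():
            counts[ch] = counts.get(ch, 0) + 1
    total = sum(counts.values())
    return {ch: n / total for ch, n in counts.items()}


latin_a = char_freqs("london paris berlin")
latin_b = char_freqs("madrid lisbon rome")
cyrillic = char_freqs("москва киев минск")

print(overlap_similarity(latin_a, latin_b))   # shared Latin letters
print(overlap_similarity(latin_a, cyrillic))  # no shared characters
```

Under this assumed measure, two languages written in the same script share most of their characters and score well above zero, while languages in different scripts score at or near zero.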
Bibliography
[Irvine2012] Ann Irvine, Chris Callison-Burch, and Alexandre Klementiev. “Transliterating from All Languages.” Proceedings of the Conference of the Association for Machine Translation in the Americas (AMTA), 2010.