Dr. Robert Reynolds
Resources
Resources I develop(ed)
  • UDAR - "UDAR Does Accented Russian"
    Open-source morphological analyzer/generator built using the two-level formalism in xfst/hfst. Can analyze and generate stressed wordforms. Code for the project is located here. Precompiled FSTs can be downloaded here. To easily test basic functionality, Giellatekno hosts a CGI interface for the Russian analyzer and generator. I welcome bug reports, as well as other discussion and collaboration.

  • VIEW
    Russian exercises built on Visual Input Enhancement of the Web (VIEW). Available on...

External Resources (things I find interesting/useful)
  • nlpub.ru
    Russian-language wiki dedicated to natural language processing. See links in its left sidebar.

  • CoCoCo
    Easily explore collocations using Mikhail Kopotev's website.

  • Dialog morphological analysis engine
    A well-known open-source Russian morphological analysis engine (Windows or Linux). Based on Zaliznjak's grammatical dictionary.

  • pymorphy2 by Mikhail Korobov and others
    A Python implementation of the morphological dictionary of OpenCorpora.org (which is an expanded version of Dialog). Includes guessing algorithms for unknown words. Actively developed, with an active discussion group.

  • 'Mocky' Russian taggers and parsers by Serge Sharoff et al.
    Statistical Russian NLP resources (Russian Multext-East-tagset, POS tagging, parsing, and corpora). Lots of useful links.

  • Frequency lists by Serge Sharoff et al.
    Especially note the RNC frequency lists here.

  • mystem morphological analysis engine (Яndex)
    A well-known, free (but closed-source) Russian morphological analysis engine. The latest version includes some contextual disambiguation and unknown word guessing. The URL changes frequently, so if the link is broken, just Google 'mystem'.

  • ETAP-3 parser online interface
    This online interface lets you parse sentences using the same tool used to generate the SynTagRus corpus.

  • odict.ru
    Sergey Slepov's Open Dictionary (based on Zaliznjak). Morpher.ru is based on this dictionary (and by extension, the RussianGram Chrome plugin).

  • Universal Cyrillic decoder
    A nice online tool by Petko Yotov that automatically detects which encodings are involved when Cyrillic text has been decoded with the wrong encoding (i.e. "кракозябры" or "mojibake").
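
The kind of repair such a decoder performs can be sketched in a few lines of Python. This is only a rough illustration, not how Petko Yotov's tool actually works: the candidate encoding list, the scoring heuristic, and the names `russian_score` and `repair_mojibake` are all assumptions made for this example. The idea is to undo the wrong decoding, redo it with each candidate encoding, and keep the result that looks most like Russian.

```python
# Candidate encodings to try (an assumed, non-exhaustive list).
CANDIDATES = ["utf_8", "cp1251", "koi8_r", "cp866", "iso8859_5"]

RUSSIAN_LETTERS = set("абвгдеёжзийклмнопрстуфхцчшщъыьэюя")


def russian_score(text):
    """Fraction of non-whitespace characters that are Russian letters."""
    chars = [ch for ch in text if not ch.isspace()]
    if not chars:
        return 0.0
    hits = sum(1 for ch in chars if ch.lower() in RUSSIAN_LETTERS)
    return hits / len(chars)


def repair_mojibake(garbled):
    """Try every (wrong decoding, true encoding) pair; keep the best-scoring text.

    Returns (score, text, wrong_decoding, true_encoding). The last two are
    None when no candidate pair beats the input itself.
    """
    best = (russian_score(garbled), garbled, None, None)
    for wrong in CANDIDATES:
        try:
            raw = garbled.encode(wrong)  # recover the original bytes
        except UnicodeEncodeError:
            continue  # the text cannot have come from this decoding
        for true in CANDIDATES:
            if true == wrong:
                continue
            try:
                fixed = raw.decode(true)
            except UnicodeDecodeError:
                continue
            score = russian_score(fixed)
            if score > best[0]:
                best = (score, fixed, wrong, true)
    return best


# Simulate the mistake: UTF-8 bytes wrongly decoded as Windows-1251.
garbled = "Привет".encode("utf-8").decode("cp1251")
score, text, wrong, true = repair_mojibake(garbled)
print(garbled, "->", text, f"(read as {wrong}, actually {true})")
```

A scoring heuristic this simple cannot reliably tell two single-byte Cyrillic encodings apart (for example KOI8-R text read as Windows-1251), since both readings consist almost entirely of Russian letters; a real detector would need letter-frequency or dictionary evidence.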

Russian Corpora
  • Google Books N-grams
    Very large corpus of automatically semi-annotated text taken from Google Books' Russian books. Although it is freely available and very large, there is a lot of noise from OCR artifacts, e.g. 0ффициальный.

  • Russian National Corpus
    The de facto Russian corpus. If only it were open... (Many parts of the corpus can be downloaded under license agreements: SynTagRus, the accentological corpus, a 1-million-word disambiguated corpus, and certainly other subcorpora.)

  • Four Russian Corpora from Serge Sharoff
    Web interface for four corpora, including part of the Russian National Corpus.

  • OpenCorpora.org
    As the name implies, a free corpus alternative to the Russian National Corpus. Includes an xml dictionary originally based on pymorphy, in turn based on the Dialog system (above). Exciting project, get involved!

  • ГИКРЯ - Генеральный Интернет-Корпус Русского Языка
    Over 15 billion tokens taken from RuNet (Russian Internet). A collaboration of multiple universities and ABBYY. Annotation by ABBYY Compreno.

  • ruTenTen
    A representative sample of RuNet (Russian Internet). 10 billion words.

  • OPUS-2
    Including this here to remind myself to look later.

  • Wortschatz (Leipzig University)
    Russian news corpus (2008). The online interface gives significant cooccurrence, left- and right-neighbor data. Many uncommon languages have corpora available here.

  • ruWac
    Scroll down to the bottom of the linked page to find the link to download this corpus. Syntactically annotated corpus (automatic, uncorrected) by Serge Sharoff. 2 billion words.

  • Aranea
    "A Family of Comparable Gigaword Web Corpora"

  • RULEC: Russian Learner Corpus of Academic Writing
    A longitudinal corpus of Russian learner language that includes written papers produced by advanced American students of Russian as a Foreign or Heritage Language.

  • Resources for Russian wiki page from ACL
    The Association for Computational Linguistics maintains a wiki page for Russian resources (including corpora).
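
For noisy corpora such as the Google Books N-grams listed above, a crude first-pass filter for OCR artifacts like 0ффициальный is to flag tokens that mix Cyrillic letters with digits or Latin letters. The sketch below is only an illustration under that assumption; the function name `looks_like_ocr_noise` and the sample tokens are made up for this example.

```python
import re

# A token is suspicious if it contains Cyrillic letters together with
# ASCII digits or Latin letters (an assumed heuristic, not a standard one).
CYRILLIC = re.compile(r"[\u0400-\u04FF]")
FOREIGN = re.compile(r"[0-9A-Za-z]")


def looks_like_ocr_noise(token):
    """True if the token mixes Cyrillic with digits or Latin characters."""
    return bool(CYRILLIC.search(token)) and bool(FOREIGN.search(token))


tokens = ["официальный", "0ффициальный", "1990", "Google"]
suspicious = [t for t in tokens if looks_like_ocr_noise(t)]
print(suspicious)
```

This heuristic also flags legitimate mixed tokens such as "5-й" or "Т-34", so in practice it would need a whitelist or a frequency threshold.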