Dr. Robert Reynolds
Resources I develop(ed)
  • UDAR - "UDAR Does Accented Russian"
    Open-source morphological analyzer/generator built using the two-level formalism in xfst/hfst. Can analyze/generate stressed wordforms. Code for the project is located here. Precompiled FSTs can be downloaded here. To easily test basic functionality, Giellatekno hosts a CGI interface for the Russian analyzer and generator. I welcome bug reports, discussion, and collaboration.

  • VIEW
    Russian exercises built on Visual Input Enhancement of the Web (VIEW). Available on...

External Resources (things I find interesting/useful)
  • nlpub.ru
    Russian-language wiki dedicated to natural language processing. See links in its left sidebar.

  • CoCoCo
    Easily explore collocations using Mikhail Kopotev's website.

  • Dialog morphological analysis engine
    A well-known open-source Russian morphological analysis engine (Windows or Linux). Based on Zaliznjak's grammar dictionary.

  • pymorphy2 by Mikhail Korobov and others
    A Python implementation of the morphological dictionary of OpenCorpora.org (which is an expanded version of Dialog). Includes guessing algorithms for unknown words. Actively developed, with an active discussion group.

  • 'Mocky' Russian taggers and parsers by Serge Sharoff et al.
    Statistical Russian NLP resources (Russian Multext-East-tagset, POS tagging, parsing, and corpora). Lots of useful links.

  • Frequency lists by Serge Sharoff et al.
    Especially note the RNC frequency lists here.

  • mystem morphological analysis engine (Яndex)
    A well-known, free (but closed-source) Russian morphological analysis engine. The latest version includes some contextual disambiguation and unknown word guessing. The URL changes frequently, so if the link is broken, just Google 'mystem'.

  • ETAP-3 parser online interface
    This online interface lets you parse sentences using the same tool used to generate the SynTagRus corpus.

  • odict.ru
    Sergey Slepov's Open Dictionary (based on Zaliznjak). Morpher.ru is based on this dictionary (and by extension, the RussianGram Chrome plugin).

  • Universal Cyrillic decoder
    This is a nice online tool by Petko Yotov that automatically detects which encodings are involved when Cyrillic text has been decoded using the wrong encoding (i.e. "кракозябры" or "mojibake").
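
To make the mojibake problem concrete, here is a minimal Python sketch of the general idea: re-encode the garbled text with the encoding that was wrongly used to read it, then decode it with the right one. This is not how Petko Yotov's decoder works internally (I have not looked); the function names, the short list of candidate encodings, and the "looks Russian" heuristic are all my own assumptions for illustration.

```python
import itertools

# Assumption: a short, hand-picked list of encodings commonly seen with Russian text.
CANDIDATE_ENCODINGS = ["utf-8", "cp1251", "koi8-r", "cp866"]


def looks_russian(text):
    """Crude heuristic (an assumption, not a real language detector):
    every character is a basic Russian letter, a space, or simple punctuation."""
    def ok(ch):
        return ("а" <= ch <= "я") or ("А" <= ch <= "Я") or ch in "ёЁ" or ch in " .,!?-"
    return bool(text) and all(ok(ch) for ch in text)


def repair_candidates(garbled):
    """Try every (wrongly used, actually correct) encoding pair and return
    the repairs that pass the heuristic as (wrong, right, repaired) tuples."""
    results = []
    for wrong, right in itertools.permutations(CANDIDATE_ENCODINGS, 2):
        try:
            # Undo the bad decode to recover the raw bytes, then decode them properly.
            repaired = garbled.encode(wrong).decode(right)
        except UnicodeError:
            continue  # this pair cannot have produced the garbled text
        if looks_russian(repaired):
            results.append((wrong, right, repaired))
    return results


# Simulate the damage: UTF-8 bytes misread as Windows-1251.
original = "кракозябры"
garbled = original.encode("utf-8").decode("cp1251")

for wrong, right, repaired in repair_candidates(garbled):
    print(f"read as {wrong}, really {right}: {repaired}")
```

A brute-force search like this only works when the bad decode was lossless. If the wrong codec dropped or replaced bytes along the way, the original text cannot be fully recovered.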

Russian Corpora
  • Google Books N-grams
    Very large corpus of automatically semi-annotated text taken from Google Books' Russian books. Although it is freely available and very large, there is a lot of noise from OCR artifacts, e.g. 0ффициальный.

  • Russian National Corpus
    The de facto Russian corpus. If only it were open.... (Many parts of the corpus can be downloaded with license agreements: SynTagRus, the accentological corpus, and a 1-million-word disambiguated corpus, and certainly other subcorpora.)

  • Four Russian Corpora from Serge Sharoff
    Web interface for four corpora, including part of the Russian National Corpus.

  • OpenCorpora.org
    As the name implies, a free corpus alternative to the Russian National Corpus. Includes an xml dictionary originally based on pymorphy, in turn based on the Dialog system (above). Exciting project, get involved!

  • ГИКРЯ - Генеральный Интернет-Корпус Русского Языка
    Over 15 billion tokens taken from RuNet (Russian Internet). Collaboration of multiple universities and ABBYY. Annotation by ABBYY Compreno.

  • ruTenTen
    A representative sample of RuNet (Russian Internet). 10 billion words.

  • OPUS-2
    Including this here to remind myself to look later.

  • Wortschatz (Leipzig University)
    Russian news corpus (2008). The online interface gives significant cooccurrence, left- and right-neighbor data. Many uncommon languages have corpora available here.

  • ruWac
    Scroll down to the bottom of the linked page to find the link to download this corpus. Syntactically annotated corpus (automatic, uncorrected) by Serge Sharoff. 2 billion words.

  • Aranea
    "A Family of Comparable Gigaword Web Corpora"

  • RULEC: Russian Learner Corpus of Academic Writing
    A longitudinal corpus of Russian learner language that includes written papers produced by advanced American students of Russian as a Foreign or Heritage Language.

  • Resources for Russian wiki page from ACL
    The Association for Computational Linguistics maintains a wiki page for Russian resources (including corpora).