SIMON KREK was the editor-in-chief of a new Oxford comprehensive English-Slovenian dictionary published in two volumes between 2005 and 2006, a project that introduced modern corpus-based lexicography in Slovenia. He is currently affiliated with the Amebis software company, which specializes in natural language processing products for Slovene, and with the Jožef Stefan Institute as a researcher in the field of language technologies. He is coordinating a five-year project whose results include a billion-word corpus of Slovene with a new tagger, parser and pedagogically oriented web concordancer, and a lexicon and lexical database, serving as a basis for a web-based pedagogical dictionary, grammar and manual of style.

Language data for digital natives: old wine in a new bottle or...?

If the eighties brought the first extensive use of digitized dictionaries for linguistic querying, and the nineties were dedicated to collecting and exploring huge amounts of language data in digital format, such as corpora, lexicons, ontologies and lexical databases, the first decade of the new century saw the explosion of freely available (crowd-sourced) web content such as Wikipedia and online dictionaries, the everyday use of NLP technologies, and the first move towards the abandonment of paper as the primary medium of written language transmission. At the beginning of the present decade it is therefore reasonable to ask what will fulfil the persisting human need to understand difficult parts of one's native language, or contribute to maintaining a common language standard. On the multilingual side, there is an equally important need to communicate with people or understand texts in languages other than one's own, an area where free web content and freely available statistical machine translation tools have taken a considerable step forward in recent years.

While dictionaries on paper will continue to have an important role in digitally underdeveloped environments, it is clear that in those parts of the world where access to the internet and mobile telephony is beginning to be understood as one of the basic human rights, the paper format may be abandoned. Consequently, it is necessary to conceptualize a new format which will satisfy the same needs, but will deliberately break away from the 18th- and 19th-century dictionary concept and the codex format. We will try to guesstimate what the new format could be, taking into account the language data and NLP technologies already available, as well as technologies that are still maturing. The format will be conceptualized as an interactive web portal where reliable information on all aspects of a particular language is available: an "all-about".

