|
SIMON KREK was the editor-in-chief of a new Oxford comprehensive
English-Slovenian dictionary published in two volumes between 2005 and
2006, a project that introduced modern corpus-based lexicography in
Slovenia. He is currently affiliated with the Amebis software company,
which specializes in natural language processing products for Slovene,
and with the Jožef Stefan Institute as a researcher in the field of
language technologies. He is coordinating a five-year project whose
results include a billion-word corpus of Slovene with a new tagger,
parser and pedagogically oriented web concordancer, a lexicon and
lexical database, serving as a basis for a web-based pedagogical
dictionary, grammar and manual of style.
|
Language data for digital natives: old wine in a new bottle or...?
If the eighties brought the first extensive use of digitized
dictionaries for linguistic querying, and the nineties were dedicated
to collecting and exploring huge amounts of language data in digital
format, such as corpora, lexicons, ontologies and lexical databases,
the first decade of the new century saw an explosion of freely
available (crowd-sourced) web content such as Wikipedia and online
dictionaries, the everyday use of NLP technologies, and the first move
towards abandoning paper as the primary medium of written language
transmission. At the beginning of the present decade it is therefore
reasonable to ask what will fulfil the persisting human need to
understand difficult parts of one's native language, or contribute to
maintaining a common language standard. On the multilingual side,
there is an equally important need to communicate with people or
understand texts in languages other than one's own, an area where free
web content and freely available statistical machine translation tools
have made considerable progress in recent years.
While dictionaries on paper will continue to play an important role in
digitally underdeveloped environments, it is clear that in those parts
of the world where access to the internet and mobile telephony is
beginning to be understood as one of the basic human rights, the paper
format may be abandoned. Consequently, it is necessary to conceptualize
a new format which will satisfy the same needs, but will deliberately
break away from the 18th- and 19th-century dictionary concept and the
codex format. We will try to guesstimate what the new format could be,
taking into account the language data and NLP technologies already
available, as well as those that are still maturing. The format will be
conceptualized as an interactive web portal where reliable information
on all aspects of a particular language is available – an "all-about".
|