Co-located event with TLT'07 on December 5–6, 2007:

Unified Linguistic Annotation – Transcontinental Perspectives

In the past decade, linguistic annotation of corpora has made tremendous progress. Whereas corpus annotation was previously restricted to morpho-syntactic and surface-syntactic phenomena, it now covers a whole range of phenomena, from syntactic and semantic structures to anaphora, coreference and discourse structure. These new annotation efforts are not restricted to English but include Norwegian, German, French, and Japanese, as well as other languages. For logistical and organizational reasons, the majority of these annotations are performed independently of each other, even when they use the same underlying text material (e.g. the Penn Treebank, PropBank, NomBank, and TimeBank), while some efforts are closely tied to specific applications (such as parse and generation ranking, e.g. in the LOGON translation project). This way of working often limits the reusability and interoperability of results. Some international frameworks, e.g. ParGram and DELPH-IN (to which the LOGON and TREPIL projects have ties), promote unified linguistic annotation across languages. These approaches have generated considerable interest.

Since linguistic annotation is a very time- and cost-intensive task, the availability of existing linguistic annotations for different linguistic levels raises the question of whether combining these levels into a single representation can add new information and enhance the usability of the existing resources. However, creating such a single representation is far from straightforward: the individual annotations generally use their own highly specialized representations, and in many cases the annotations on different levels do not agree with each other. Therefore, decisions must be made as to whether these annotations should be synchronized, whether one level of annotation should serve as the basis for the other levels, whether the representation should be language-specific or language-independent, which representation format should be used, etc. In order to establish a consensus on which a standard can be developed, we need to bring the major researchers from all continents together.

In the past three years, several workshops have started addressing these problems: the Frontiers in Annotation Workshops, the LREC-2006 Workshop on Merging and Layering Annotation, the ACL 2007 Linguistic Annotation Workshop, and the NSF workshop on Unified Linguistic Annotation (2007). Additionally, there is a large European initiative, CLARIN, whose goal is to establish an integrated and interoperable research infrastructure of language resources. It aims at overcoming the current fragmentation, offering a stable, persistent, accessible and extensible infrastructure, and thereby enabling eHumanities. The proposed workshop will extend the reach of CLARIN beyond the boundaries of Europe, with the aim of establishing a world-wide research effort to integrate existing linguistically annotated resources and make them available to the whole linguistic community. The workshop aims at strengthening the research ties between Norwegian partners in KUNSTI projects, European partners in the coming CLARIN infrastructure, and related efforts on other continents (primarily America and Asia).