
LLAVES

IST-1999-12219

Unlocking topicality in text – foreground and background information in text

 

 

Periodic Progress Report N°:1

Covering period 1.1.2000-30.12.2000

 

 

 

 

Report Preparation Date: December 17, 2000

 

Contract Start Date: January 1, 2000 Duration: 1 year

Project Co-ordinator: Jussi Karlgren, SICS

Partners: SICS, Conexor Ltd.

Project funded by the European Community under the "Information Society Technologies" Programme (1998-2002)


 

 

 



Executive Summary

This language technology project was designed to bring mature results from language technology together with well-motivated conjectures and plausible hypotheses from discourse and text linguistics, in order to broaden the research frontier of both. The proposal was for a first-year experiment to determine the viability of further research in the area. The direction is well motivated in terms of potential applications and holds great promise for fruitful exploitation if the technical and scientific questions can be solved.

A bottleneck for improving today's information management systems is that we know little of texts as text. Systems view texts as simple sets of words or terms, discarding information such as clause style and argument structure as noise. This project aims to bridge the gap from syntax to text, and to show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well. If we can chart some features of the topical progression in a text, we will provide a road map for algorithms for further processing: indexing and search, summarization, report generation, and optical text recognition are all application areas which would benefit from better knowledge of what makes texts texts.

To accomplish this, a small research project was proposed by two partners. SICS, the Swedish Institute of Computer Science, is a not-for-profit research organization based in Stockholm with a well-established track record in many international projects, including several funded by IST. Conexor Ltd is a recent high-technology start-up based in Helsinki, with a product palette based on cutting-edge research in language technology, and a participant in several international projects, both research and commercial. The partners had equal roles, with algorithm development mainly performed at Conexor and studies performed at SICS.

This first-year assessment project focussed on a simple evaluation of the utility of a new and untried extension of language technology: extracting semantic roles from text with the purpose of separating foreground from background in texts. The evaluation was performed using a well-established information retrieval test bench. The original aim was to use assessments of foregroundedness made by human judges as a foundation for the methods developed, but the project found better purchase for its development efforts in the theoretical linguistics literature. The evaluation was successful, and we must now extend the result to further languages as test corpora with accompanying assessments become available for them.


Work Progress Overview

Specific Objectives and Progress Overview

As per the original project proposal and project plan, the overall success criterion of the project was to find that all of the following apply:

All items are quantitative; all have built-in thresholds for hypothesis testing.

The first criterion was not fulfilled as regards human judges: distinguishing foreground from background proved unreliable when human judges were used. However, the clauses on which judges agreed did have common characteristics. Most simply, they were longer than other clauses (Mann-Whitney U test; p > 0.95). The second criterion was fulfilled. The third was not tested, owing to the same methodological problems that undermined the results for the first criterion. The first and third points remain unproven.
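The clause-length comparison mentioned above can be illustrated with a minimal sketch of a Mann-Whitney U test. This is not the project's analysis code: the function name `mann_whitney_u`, the use of a normal approximation without continuity correction, and the clause lengths in the usage lines are all assumptions made for illustration only.

```python
from math import erf, sqrt


def mann_whitney_u(xs, ys):
    """Return (U statistic for xs, two-sided p-value by normal approximation)."""
    # Pool both samples and sort by value; the second field marks the group.
    pooled = sorted([(v, 0) for v in xs] + [(v, 1) for v in ys])
    ranks = [0.0] * len(pooled)
    tie_term = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        avg_rank = (i + 1 + j) / 2.0  # mean of the 1-based ranks i+1 .. j
        for k in range(i, j):
            ranks[k] = avg_rank
        t = j - i
        tie_term += t ** 3 - t  # used for the tie correction of the variance
        i = j

    n1, n2 = len(xs), len(ys)
    n = n1 + n2
    rank_sum_x = sum(r for r, (_, g) in zip(ranks, pooled) if g == 0)
    u1 = rank_sum_x - n1 * (n1 + 1) / 2.0

    mean_u = n1 * n2 / 2.0
    var_u = n1 * n2 / 12.0 * ((n + 1) - tie_term / (n * (n - 1)))
    if var_u == 0:
        return u1, 1.0  # all values identical: no evidence of a difference
    z = (u1 - mean_u) / sqrt(var_u)
    p = 1.0 - erf(abs(z) / sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
    return u1, p


# Hypothetical clause lengths in words (invented for illustration only).
agreed_clauses = [12, 15, 18, 21, 25]
other_clauses = [5, 7, 8, 9, 11]
u, p = mann_whitney_u(agreed_clauses, other_clauses)
print(u, p)
```

A rank-based test is a natural choice for this kind of comparison because clause lengths are typically skewed, and the test makes no assumption of normally distributed lengths. For small samples an exact test would be preferable to the normal approximation used in this sketch.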

Deliverables

The deliverables listed below are those actually delivered. At the proposal stage, the project listed text corpora as well; this was due to a misunderstanding on the part of the coordinator concerning the nature of deliverables. The text corpora cannot be delivered to the Commission, for both practical and copyright reasons: we specified in the proposal that we would make use of TREC and CLEF corpora, which are made available by publishing houses to research institutions.

No. | Deliverable | Date | Type | Classification | Responsible partner | Delivery
2.1 | Report on existing literature on transitivity | 3 | R | Pub | S,C | LLAVES Report 1
4.1 | Report on inter-language and inter-judge agreement | 3 | R | Pub | S | LLAVES Report 2
5.1 | Report on clause characteristics | 5 | R | Pub | C,S | LLAVES Report 3
6.1 | Efficient tool for clausal discrimination | 5 | R | Pub | C | LLAVES Report 4
7.2 | Report of TREC evaluation | 9 | R | Pub | C,S | LLAVES Report 5
8.1 | Final report | 12 | R | Pub | C,S | This text
9.1 | Continuation proposal | 13 | R | Pub | C,S | To be delivered during the spring of 2001

State-of-the-art Update

For this, we refer to LLAVES Project Report 1. In short, at present, this project is unique in its approach.

Project Management and Coordination

We have no specific issues to report here. Project meetings were held at intervals, and a joint five-day working session was arranged in Helsinki in October for concentrated experimentation.

Information Dissemination and Exploitation

We intend to publish a full and comprehensive account of the experiment in a scientific journal or at a conference. The results are of a type that is suitable for publication rather than immediate commercialization.