
LLAVES

IST-1999-12219

Unlocking topicality in text – foreground and background information in text

Final Report for LLAVES Assessment year

Covering period 1.1.2000-30.12.2000

Report Preparation Date: March 1, 2001

Contract Start Date: January 1, 2000 Duration: 1 year

Project Co-ordinator: Jussi Karlgren, SICS

Partners: SICS, Conexor Ltd.

Project funded by the European Community under the "Information Society Technologies" Programme (1998-2002)



Executive Summary

This language technology project was designed to bring mature results from language technology together with well-motivated conjecture and plausible hypotheses from discourse and text linguistics in order to broaden the research frontier of both. The proposal was for a first-year experiment to determine the viability of further research in the area; the direction is well motivated in terms of potential applications and holds great promise for fruitful exploitation if the technical and scientific questions can be solved.

A bottleneck for improving today's information management systems is that we know little of texts as text. Systems view texts as simple sets of words or terms, discarding information such as clause style and argument structure as noise. This project aims to bridge the gap from syntax to text, and show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well. If we are able to chart some features of the topical progression in a text, we will provide a road map for algorithms for further processing: indexing and search, summarization, report generation, and optical text recognition are all application areas which would benefit from better knowledge of what makes texts texts.

To accomplish this, a small research project was proposed by two partners. SICS, the Swedish Institute of Computer Science, is a not-for-profit research organization based in Stockholm with a well-established track record in many international projects, including several funded by IST. Conexor Ltd is a recent high-technology start-up based in Helsinki with a product range based on cutting-edge research in language technology, and a participant in several international projects, both research and commercial. The partners had equal roles, with algorithm development mainly performed at Conexor and studies performed at SICS.

This first-year assessment project focussed on a simple evaluation of the utility of a new and untried extension of language technology: extracting semantic roles from text with the purpose of separating foreground from background in texts. The evaluation was performed using a well-established information retrieval test bench. The original aim was to use assessments of foregroundedness made by human judges as a foundation for the methods developed, but the project found better purchase for its development efforts in the literature of theoretical linguistics. The results of the evaluation were successful, and we must now extend the result to further languages as test corpora with accompanying assessments become available for them.


Project Objectives and Results

Below the project objectives are taken in order, as formulated in the original project plan, and discussed point by point.

Text Structure

This project will show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well. This first assessment year will aim at a set of experiments to determine the viability of one particular approach, well-motivated in terms of theory and seemingly feasible if recent engineering successes hold up to their promises. We will take a large number of texts in several languages and attempt to partition the clauses into a number of graded categories according to foregroundedness. These clause categories can then be used in different ways to improve standard statistical indexing methods, to generate multi-document summaries, and to calculate text similarities.

This result is positive, but at this date only for English. With rather simple experimentation, surprisingly clear benefits from using foreground grading of clauses were shown. An improvement in indexing resulted in improved retrieval results in a retrieval experiment using a simple lab system on a large number of texts.
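
One way to picture how graded clause categories could feed a standard statistical index is to weight each term occurrence by the grade of the clause in which it appears. The Python sketch below is illustrative only: the grade labels, the weight values, and the function name are assumptions made for this example and do not describe the tools developed in the project.

```python
from collections import Counter

# Hypothetical weights per clause grade; the values are illustrative assumptions.
GRADE_WEIGHTS = {"foreground": 2.0, "neutral": 1.0, "background": 0.5}


def weighted_term_frequencies(clauses):
    """Sum term occurrences, weighting each by the grade of its clause.

    `clauses` is a list of (grade, text) pairs, where grade is a key
    of GRADE_WEIGHTS. Tokenisation is a plain whitespace split.
    """
    totals = Counter()
    for grade, text in clauses:
        weight = GRADE_WEIGHTS[grade]
        for term in text.lower().split():
            totals[term] += weight
    return dict(totals)


# Two made-up clauses: one graded as foreground, one as background.
example = [
    ("foreground", "the minister resigned"),
    ("background", "the minister was elected in 1995"),
]
frequencies = weighted_term_frequencies(example)
```

In this toy input, terms from the foreground clause contribute more to the resulting frequencies than terms from the background clause, which is the intended effect of grading; how large the difference should be is an open tuning question.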

Solid Experimental Base

Most of the work in this first-year experiment will be patterned on the Text Retrieval Conferences (TRECs) organized annually by the US National Institute of Standards and Technology (NIST). TREC provides participants with a large collection of texts together with a yearly set of fifty queries. These queries are used for retrieval on the collection, and the resulting retrieved sets are evaluated by a panel of human judges in time for each year's conference. 1999 is the eighth year the conference is being held (SICS has participated for the last four years and Conexor for the past two), and there is a pool of several hundred queries with relevance judgments made on items in the text collection. These can be used retrospectively to tune information retrieval queries, which is what we will be doing in this project.

Our approach is geared to be included in a standard information retrieval tool. While our experiments build on a better understanding of text, the results can be seamlessly integrated in any retrieval engine architecture, which allows the results to be tested with and compared to approaches in use today. Our experiment will take a standard 'vanilla' retrieval engine which typically is based on statistical modelling of term occurrences, and compare it with the same engine using textual data analyzed and refined using our algorithms. The comparison will give a direct numerical result to evaluate the improvement given by the syntactic analysis we perform. This can be done in several languages.

As per above, we did run a retrospective TREC experiment.
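
A retrospective comparison of this kind reduces to scoring two ranked result lists against the same relevance judgments. The following Python sketch uses mean average precision, one commonly used retrieval measure; the choice of measure, the function names, and the toy data are assumptions for illustration and are not taken from the project's evaluation.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against judged relevant ids."""
    relevant = set(relevant_ids)
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(runs, judgments):
    """Mean of per-query average precision; both arguments are keyed by query id."""
    scores = [average_precision(runs[q], judgments[q]) for q in judgments]
    return sum(scores) / len(scores)


# Toy data: one query, two runs scored against the same judgments.
judgments = {"q1": ["d1", "d3"]}
baseline_run = {"q1": ["d2", "d1", "d3"]}
variant_run = {"q1": ["d1", "d3", "d2"]}

baseline_map = mean_average_precision(baseline_run, judgments)
variant_map = mean_average_precision(variant_run, judgments)
```

Because both runs are scored against identical judgments, any difference between the two numbers can be attributed to the change in indexing rather than to the test material.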

Subjective Bootstrap Data

For the definition of the structures we will be implementing we will need to begin by empirical studies of text understanding. We will distribute a small number of texts in several languages to a large number of people and ask them to mark topically central and peripheral clauses in the text. This will give us purchase to define algorithms to perform the same partitioning automatically.

This did not pan out. A simple pre-study made it apparent that the divergence between judges was too large for us to be able to use the results in a practical way.
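
Divergence between judges can be quantified with a chance-corrected agreement statistic. The Python sketch below computes Cohen's kappa for two judges; the choice of kappa, the label names, and the toy judgments are assumptions for illustration, and the measure actually used in the pre-study is the one described in LLAVES Project Report 2.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equally long, non-empty label lists")
    n = len(labels_a)
    # Proportion of items on which the two judges gave the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    # Agreement expected by chance from each judge's label distribution.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used a single identical label throughout
    return (observed - expected) / (1.0 - expected)


# Made-up judgments: "fg" for foreground, "bg" for background.
judge_1 = ["fg", "fg", "bg", "bg"]
judge_2 = ["fg", "bg", "bg", "fg"]
kappa = cohen_kappa(judge_1, judge_2)
```

A kappa near zero indicates agreement no better than chance, and a value near one indicates near-perfect agreement.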

Criteria for Success

The overall success criterion of the project is to find that any or all of the following apply:

All items are quantitative; all have built-in thresholds for hypothesis testing.

The first criterion was not fulfilled as regards human judges: distinguishing foreground from background proved unreliable when human judges were used. However, the clauses on which judges agreed did share characteristics: most simply, they were longer than other clauses (Mann-Whitney U; p > 0.95). The second criterion was fulfilled. The third was not tested, owing to the same methodological problems that undermined the first. The first and third points thus remain unproven.
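
For readers unfamiliar with the test, the Mann-Whitney U statistic compares two samples by rank rather than by value. The Python sketch below computes the statistic only, without a significance level; the function name is hypothetical and the clause lengths are made up for illustration and are not project data.

```python
def mann_whitney_u(sample_a, sample_b):
    """Return (U_a, U_b) for two samples, using average ranks for ties."""
    pooled = sorted(
        [(value, 0) for value in sample_a] + [(value, 1) for value in sample_b]
    )
    rank_sum_a = 0.0
    i = 0
    while i < len(pooled):
        # Find the run of tied values starting at position i.
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        members_of_a = sum(1 for k in range(i, j) if pooled[k][1] == 0)
        rank_sum_a += average_rank * members_of_a
        i = j
    n_a, n_b = len(sample_a), len(sample_b)
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    u_b = n_a * n_b - u_a
    return u_a, u_b


# Made-up clause lengths in words, purely for illustration.
agreed_lengths = [14, 18, 21, 25]
other_lengths = [6, 9, 11, 15]
u_agreed, u_other = mann_whitney_u(agreed_lengths, other_lengths)
```

Here U_a counts the pairs in which a value from the first sample exceeds a value from the second, so a U_a close to the product of the two sample sizes suggests that the first sample tends to hold the larger values.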

Risks

There are some research-oriented risks with this experiment which will need to be addressed during the first year assessment period.

Descriptive success:
The effects we seek and find are too weak to improve the rather rough-edged evaluation metrics. The results will be interesting for descriptive linguistics, but may not be applicable to system development. This question is the central hinge point for the entire experiment. We know the mechanisms are in place; the experiment is designed to measure their strength.
Genre:
Style-specific effects swamp efforts at generalization. We will need to concentrate on some specific genre, such as newsprint, to avoid this.
Idiosyncratic variation:
Between-document and between-author variation swamps between-clause variation. The material at hand must be well-edited to avoid this. Newsprint is a good candidate.

We avoided the risks by using a well-edited subcorpus for the experiment.


Methodologies

Referring to LLAVES Project Report 1, we submit that the language technology used in this project is unique in that no other research site we are familiar with performs semantic role analysis with tools of this robustness and scale for the purpose of information access. After further research and development, it will be useful as a base for research and commercial projects alike.

By contrast, the user studies performed and reported in LLAVES Project Report 2 proved disappointing.


Project Results and Achievements

By comparison to the original objectives, this assessment project is a qualified success. We intended to base our work on human judgments of texts, but found that the judgments of vague categories diverged too far for us to continue that track. However, we found enough purchase in the theoretical literature to base method development on, and found that the experiment as described in LLAVES Project Report 5 was a success.

Our additional objective was to investigate the utility of our approach for summarization; this is as yet untested, as are the multi-lingual aspects.

This type of technology is uniquely suited for the European market, where there is a large population with the linguistic competence to produce and consume information in any of a number of languages. The base technology one of the partners in this project produces -- as do other European players, in contrast with other market areas -- is well suited to be enhanced with the type of results found even in this first-year assessment project. This is the sort of technology that would be difficult to develop in a mono-lingual environment. As outlined in the project proposal, each language has its own specific mechanisms to realize the foregroundedness and backgroundedness of topical statements. This is a fruitful field to till, and nowhere better than in an area where a number of well-supported and well-researched languages with a high technological and educational base can be found.

With respect to the long-range benefits of this research track, the benefits of understanding text as text better are obvious and hardly need restating. Text is the major repository of human-generated information and the only reliable memory our society has with which we can transcend large distances in time, space, and human activities. If we manage to develop mechanisms to accommodate textual information better, several crucial activities of our information society will require less effort, time, and cost and thus - both directly and indirectly - raise the quality of life for us all.


Deliverables

The deliverables listed below are the deliverables actually delivered. In the proposal stage, the project listed text corpora as well; this was due to a misunderstanding on the part of the coordinator concerning the nature of deliverables. The text corpora naturally cannot be delivered to the Commission, for practical reasons and for copyright reasons: we specified in the proposal that we make use of TREC and CLEF corpora, which are made available by publishing houses to research institutions.
No.  Deliverable                                         Date  Type  Classification  Responsible partner  Delivery
2.1  Report on existing literature on transitivity       3     R     Pub             S,C                  LLAVES Report 1
4.1  Report on inter-language and inter-judge agreement  3     R     Pub             S                    LLAVES Report 2
5.1  Report on clause characteristics                    5     R     Pub             C,S                  LLAVES Report 3
6.1  Efficient tool for clausal discrimination           5     R     Pub             C                    LLAVES Report 4
7.2  Report of TREC evaluation                           9     R     Pub             C,S                  LLAVES Report 5
8.1  Final report                                        12    R     Pub             C,S                  This text
9.1  Continuation proposal                               13    R     Pub             C,S                  To be delivered during the spring of 2001

Future Outlook

We have several open research questions yet to address. We have great hopes for the utility of our methods for summarization -- this is as yet untested. Owing to practical problems with systematically graded test materials, we were not able to formally test the efficiency of the approach for several languages. Given a well-established and goal-oriented research consortium for a continuation project, we will be able to produce research prototypes to answer those questions. This first assessment year has shown that there is information there to be found -- now we must address the concrete question of putting it to work.