LLAVES IST-1999-12219 Unlocking topicality in text – foreground and background information in text
Final Report for LLAVES Assessment year
Covering period 1.1.2000-30.12.2000
Report Preparation Date: March 1, 2001
Contract Start Date: January 1, 2000 Duration: 1 year
Project Co-ordinator: Jussi Karlgren, SICS
Partners: SICS, Conexor Ltd.
Project funded by the European Community under the "Information Society Technologies" Programme (1998-2002)
A bottleneck for improving today's information management systems is that we know little of texts as text. Systems view texts as simple sets of words or terms, discarding information such as clause style and argument structure as noise. This project aims to bridge the gap from syntax to text, and to show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well. If we can chart some features of the topical progression in a text, we will provide a road map for algorithms for further processing: indexing and search, summarization, report generation, and optical text recognition are all application areas which would benefit from better knowledge of what makes texts texts.
To accomplish this, a small research project was proposed by two partners. SICS, the Swedish Institute of Computer Science, is a not-for-profit research organization based in Stockholm with a well-established track record in many international projects, including several funded by IST. Conexor Ltd is a recent high-technology start-up based in Helsinki with a product palette based on cutting-edge research in language technology, and a participant in several international projects, both research and commercial. The partners had equal roles, with algorithm development mainly performed at Conexor and studies performed at SICS.
This first-year assessment project focussed on a simple evaluation of the utility of a new and untried extension of language technology: extracting semantic roles from text with the purpose of separating foreground from background in texts. The evaluation was performed using a well-established information retrieval test bench. The original aim was to use assessments of foregroundedness made by human judges as a foundation for the methods developed, but the project found better purchase for its development efforts in the theoretical linguistics literature. The evaluation was successful, and we must now extend the result to further languages as test corpora with accompanying assessments become available for them.
This result is positive, but at this date only for English. With rather simple experimentation, surprisingly clear benefits from using foreground grading of clauses were shown. An improvement in indexing resulted in improved retrieval results in a retrieval experiment using a simple lab system on a large number of texts.
Most of the work in this first-year experiment will be patterned on the Text Retrieval Conferences (TRECs) organized annually by the US National Institute of Standards and Technology (NIST). TREC provides participants with a large collection of texts together with a yearly set of fifty queries. These queries are used for retrieval on the collection, and the resulting retrieved sets are evaluated by a panel of human judges in time for each year's conference. 1999 is the eighth year the conference is being held (SICS has participated for the last four years and Conexor for the past two), and there is a pool of several hundred queries with relevance judgments made on items in the text collection. These can be used retrospectively to tune information retrieval queries, which is what we will be doing in this project.
Our approach is geared to be included in a standard information retrieval tool. While our experiments build on a better understanding of text, the results can be seamlessly integrated in any retrieval engine architecture, which allows the results to be tested with and compared to approaches in use today. Our experiment will take a standard 'vanilla' retrieval engine which typically is based on statistical modelling of term occurrences, and compare it with the same engine using textual data analyzed and refined using our algorithms. The comparison will give a direct numerical result to evaluate the improvement given by the syntactic analysis we perform. This can be done in several languages.
As described above, we did run a retrospective TREC experiment.
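The comparison between a baseline run and an enhanced run can be illustrated with a minimal sketch. It assumes that the "direct numerical result" is mean average precision, a measure commonly reported in TREC evaluations; the report itself does not name the measure used. The query identifiers, document identifiers, and both runs below are hypothetical toy data, not project results.

```python
from typing import Dict, List, Set


def average_precision(ranking: List[str], relevant: Set[str]) -> float:
    """Non-interpolated average precision of one ranked result list."""
    if not relevant:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranking, start=1):
        if doc_id in relevant:
            hits += 1
            # Precision at the rank where a relevant document is found.
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(
    runs: Dict[str, List[str]], qrels: Dict[str, Set[str]]
) -> float:
    """Mean of the per-query average precision over all judged queries."""
    scores = [average_precision(runs.get(q, []), rel) for q, rel in qrels.items()]
    return sum(scores) / len(scores) if scores else 0.0


# Hypothetical toy data: relevance judgments for two queries and two runs.
qrels = {"q1": {"d1", "d3"}, "q2": {"d2"}}
baseline = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d2", "d3"]}
enhanced = {"q1": ["d1", "d3", "d2"], "q2": ["d2", "d1", "d3"]}

map_baseline = mean_average_precision(baseline, qrels)
map_enhanced = mean_average_precision(enhanced, qrels)
print(f"baseline MAP = {map_baseline:.4f}, enhanced MAP = {map_enhanced:.4f}")
```

Because both runs are scored against the same relevance judgments, the difference between the two figures can be attributed to the change in indexing alone.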
To define the structures we will be implementing, we will need to begin with empirical studies of text understanding. We will distribute a small number of texts in several languages to a large number of people and ask them to mark topically central and peripheral clauses in the text. This will give us purchase to define algorithms to perform the same partitioning automatically.
This did not pan out. A simple pre-study made it apparent that the divergence between judges was too large for the results to be of practical use.
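One common way to quantify divergence between two judges is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is an illustration only: the report does not state which agreement measure was used in the pre-study, and the labels ("F" for foreground, "B" for background) and the two judgment lists are hypothetical.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(labels_a: Sequence[str], labels_b: Sequence[str]) -> float:
    """Chance-corrected agreement between two judges over the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both judges must label the same non-empty set of items")
    n = len(labels_a)
    # Proportion of items on which the two judges chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    # Agreement expected if each judge labelled at random with their own rates.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one and the same label throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical judgments for six clauses: F = foreground, B = background.
judge_a = ["F", "F", "B", "B", "F", "B"]
judge_b = ["F", "B", "B", "F", "F", "B"]

kappa = cohens_kappa(judge_a, judge_b)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates strong agreement and a value near 0 indicates agreement no better than chance.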
The overall success criterion of the project is to find that any or all of the following apply:
All items are quantitative; all have built-in thresholds for hypothesis testing.
The first criterion was not fulfilled as regards human judges: distinguishing foreground from background proved unreliable when human judges were used. However, the clauses where judges agreed did have common characteristics: most simply, they were longer than other clauses (by Mann-Whitney U; p > 0.95). The second criterion was fulfilled. The third was not tested, owing to the same methodological problems that undermined the results for the first criterion. The first and third points remain unproven.
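The clause-length comparison can be sketched as a Mann-Whitney U computation. This is a minimal pure-Python illustration that returns only the U statistics, using midranks for ties; it does not compute a significance level, for which a statistics package or published tables would be needed. The clause lengths below are hypothetical, not data from the project.

```python
from typing import Sequence, Tuple


def mann_whitney_u(
    sample_x: Sequence[float], sample_y: Sequence[float]
) -> Tuple[float, float]:
    """Return (U_x, U_y) for two independent samples, using midranks for ties."""
    # Pool both samples, remembering which sample each value came from.
    pooled = sorted([(v, 0) for v in sample_x] + [(v, 1) for v in sample_y])
    rank_sum_x = 0.0
    i = 0
    while i < len(pooled):
        j = i
        while j < len(pooled) and pooled[j][0] == pooled[i][0]:
            j += 1
        # Tied values occupy ranks i+1 .. j and all receive the average rank.
        midrank = (i + 1 + j) / 2
        rank_sum_x += midrank * sum(1 for k in range(i, j) if pooled[k][1] == 0)
        i = j
    n_x, n_y = len(sample_x), len(sample_y)
    u_x = rank_sum_x - n_x * (n_x + 1) / 2
    return u_x, n_x * n_y - u_x


# Hypothetical clause lengths in words.
agreed_clauses = [12, 15, 9, 20]  # clauses on which judges agreed
other_clauses = [5, 7, 9, 6]      # all remaining clauses

u_agreed, u_other = mann_whitney_u(agreed_clauses, other_clauses)
print(f"U(agreed) = {u_agreed}, U(other) = {u_other}")
```

U for a sample counts the pairs in which a value from that sample exceeds a value from the other sample, with ties counted as one half, so a U close to the product of the two sample sizes indicates that one group is consistently larger.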
There are some research-oriented risks with this experiment which will need to be addressed during the first year assessment period.
We avoided the risks by using a well-edited subcorpus for the experiment.
By contrast, the user studies performed and reported in LLAVES Project Report 2 proved disappointing.
Our additional objective was to investigate the utility of our approach for summarization. This remains untested, as do the multi-lingual aspects.
This type of technology is uniquely suited for the European market, where there is a large population with the linguistic competence to produce and consume information in any of a number of languages. The base technology that one of the partners in this project produces (as do other European players, in contrast with other market areas) is well suited to be enhanced with the type of results found even in this first-year assessment project. This is the sort of technology that would be difficult to develop in a mono-lingual environment. As outlined in the project proposal, each language has its own specific mechanisms to realize the foregroundedness and backgroundedness of topical statements. This is a fruitful field to till, and nowhere better than in an area where a number of well-supported and well-researched languages with a high technological and educational base can be found.
With respect to the long-range benefits of this research track, the benefits of understanding text as text better are obvious and hardly need restating. Text is the major repository of human-generated information and the only reliable memory our society has, with which we can transcend large distances in time, space, and human activities. If we manage to develop mechanisms to accommodate textual information better, several crucial activities of our information society will require less effort, time, and cost and thus, both directly and indirectly, raise the quality of life for us all.
| No. | Deliverable | Date | Type | Classification | Responsible partner | Delivery |
| --- | --- | --- | --- | --- | --- | --- |
| 2.1 | Report on existing literature on transitivity | 3 | R | Pub | S,C | LLAVES Report 1 |
| 4.1 | Report on inter-language and inter-judge agreement | 3 | R | Pub | S | LLAVES Report 2 |
| 5.1 | Report on clause characteristics | 5 | R | Pub | C,S | LLAVES Report 3 |
| 6.1 | Efficient tool for clausal discrimination | 5 | R | Pub | C | LLAVES Report 4 |
| 7.2 | Report of TREC evaluation | 9 | R | Pub | C,S | LLAVES Report 5 |
| 8.1 | Final report | 12 | R | Pub | C,S | This text |
| 9.1 | Continuation proposal | 13 | R | Pub | C,S | To be delivered during the Spring of 2001. |