LLAVES IST-1999-12219 Unlocking topicality in text – foreground and background information in text
Periodic Progress Report N°:1
Covering period 1.1.2000-30.12.2000
Report Preparation Date: December 17, 2000
Contract Start Date: January 1, 2000 Duration: 1 year
Project Co-ordinator: Jussi Karlgren, SICS
Partners: SICS, Conexor Ltd.
Project funded by the European Community under the "Information Society Technologies" Programme (1998-2002)
A bottleneck for improving today's information management systems is that we know little of texts as text. Systems view texts as simple sets of words or terms, discarding information such as clause style and argument structure as noise. This project aims to bridge the gap from syntax to text and to show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well. If we can chart some features of the topical progression in a text, we will have a road map for algorithms for further processing: indexing and search, summarization, report generation, and optical text recognition are all application areas that would benefit from better knowledge of what makes texts texts.
To accomplish this, a small research project was proposed by two partners. SICS, the Swedish Institute of Computer Science, is a not-for-profit research organization based in Stockholm with a well-established track record in many international projects, including several funded by IST. Conexor Ltd is a recent high-technology start-up based in Helsinki with a product palette based on cutting-edge research in language technology, and a participant in several international projects, both research and commercial. The partners had equal roles, with algorithm development mainly performed at Conexor and studies performed at SICS.
This first-year assessment project focussed on a simple evaluation of the utility of a new and untried extension of language technology: extracting semantic roles from text with the purpose of separating foreground from background in texts. The evaluation was performed using a well-established information retrieval test bench. The original aim was to use assessments of foregroundedness made by human judges as a foundation for the methods developed, but the project found better purchase for its development efforts in the theoretical linguistics literature. The evaluation was successful, and the result must now be extended to further languages as test corpora with accompanying assessments become available for them.
All items are quantitative; all have built-in thresholds for hypothesis testing.
The first criterion was not fulfilled as regards human judges: distinguishing foreground from background proved unreliable when human judges were used. However, the clauses where judges agreed did have common characteristics: most simply, they were longer than other clauses (Mann-Whitney U; p > 0.95). The second criterion was fulfilled. The third was not tested, owing to the same methodological problems that undermined the results for the first criterion. The first and third points thus remain unproven.
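As an illustration of the length comparison reported above, the following is a minimal sketch of the Mann-Whitney U statistic in Python. The clause lengths are invented for the example and are not project data. The function name `mann_whitney_u` and the variable names are likewise assumptions of this sketch. It computes only the U statistic, not a significance level.

```python
from itertools import product


def mann_whitney_u(sample_a, sample_b):
    """Return the U statistic for sample_a: the number of (a, b) pairs
    in which a exceeds b, with each tie counted as one half."""
    u = 0.0
    for a, b in product(sample_a, sample_b):
        if a > b:
            u += 1.0
        elif a == b:
            u += 0.5
    return u


# Hypothetical clause lengths in words (illustrative values only,
# not taken from the project's material).
agreed_clauses = [14, 18, 11, 22, 16, 19]
other_clauses = [7, 9, 12, 6, 10, 8]

u_agreed = mann_whitney_u(agreed_clauses, other_clauses)
u_other = mann_whitney_u(other_clauses, agreed_clauses)
n_pairs = len(agreed_clauses) * len(other_clauses)

# u_agreed / n_pairs estimates the probability that a randomly chosen
# agreed-upon clause is longer than a randomly chosen other clause.
effect = u_agreed / n_pairs
print(u_agreed, u_other, effect)
```

The two U values always sum to the number of pairs, so a value of `u_agreed` close to `n_pairs` indicates that the agreed-upon clauses tend to be the longer ones. For a significance level, a statistics package would normally be used rather than this hand-rolled count.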
| No. | Deliverable | Date | Type | Classification | Responsible partner | Delivery |
| 2.1 | Report on existing literature on transitivity | 3 | R | Pub | S,C | LLAVES Report 1 |
| 4.1 | Report on inter-language and inter-judge agreement | 3 | R | Pub | S | LLAVES Report 2 |
| 5.1 | Report on clause characteristics | 5 | R | Pub | C,S | LLAVES Report 3 |
| 6.1 | Efficient tool for clausal discrimination | 5 | R | Pub | C | LLAVES Report 4 |
| 7.2 | Report of TREC evaluation | 9 | R | Pub | C,S | LLAVES Report 5 |
| 8.1 | Final report | 12 | R | Pub | C,S | This text |
| 9.1 | Continuation proposal | 13 | R | Pub | C,S | To be delivered during the Spring of 2001. |