Jussi Karlgren
October 1999
This language technology project will bring mature results from language technology together with well motivated conjecture and plausible hypotheses from discourse and text linguistics in order to broaden the research frontier of both. The proposal outlines a first year experiment to determine the viability of further research in the area; the direction is well motivated in terms of potential applications and holds great promise for fruitful exploitation if the technical and scientific questions can be solved.
A bottleneck for improving today's information management systems is that we know little of texts as text. Systems view texts as simple sets of words or terms, discarding information such as clause style and argument structure as noise. This project aims to bridge the gap from syntax to text, and show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well.
Once we are able to chart some features of the topical progression in a text we will give a road map for algorithms for further processing: indexing and search, summarization, report generation, and optical text recognition are all application areas which would benefit from better knowledge of what makes texts texts.
We will take a large number of texts in several languages and partition the clauses in them into a number of graded categories according to foregroundedness. These clause categories can then be used in different ways for indexing, multi-document summarization, and text item similarity calculation. This first assessment project takes the form of an experiment on text. If the experiment is successful, it opens up an entire research field, which we will continue examining in a future project.
This project will show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well. This first assessment year will aim at a set of experiments to determine the viability of one particular approach, well-motivated in terms of theory and seemingly feasible if recent engineering successes live up to their promise. We will take a large number of texts in several languages and attempt to partition the clauses into a number of graded categories according to foregroundedness. These clause categories can then be used in different ways to improve standard statistical indexing methods, to generate multi-document summaries, and to calculate text similarities.
Most of the work in this first-year experiment will be patterned on the Text Retrieval Conferences (TRECs) organized annually by the US National Institute of Standards and Technology (NIST). TREC provides participants with a large collection of texts together with a yearly set of fifty queries. These queries are used for retrieval on the collection, and the resulting retrieved sets are evaluated by a panel of human judges in time for each year's conference. 1999 is the eighth year the conference is being held - SICS has participated the last four years and Conexor the past two - and there is a pool of several hundred queries with relevance judgments made on items in the text collection. These can be used retrospectively to tune information retrieval queries, which is what we will be doing in this project.
Our approach is geared to be included in a standard information retrieval tool. While our experiments build on a better understanding of text, the results can be seamlessly integrated in any retrieval engine architecture, which allows the results to be tested with and compared to approaches in use today. Our experiment will take a standard 'vanilla' retrieval engine which typically is based on statistical modelling of term occurrences, and compare it with the same engine using textual data analyzed and refined using our algorithms. The comparison will give a direct numerical result to evaluate the improvement given by the syntactic analysis we perform. This can be done in several languages.
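The comparison described above can be sketched as follows. This is a minimal illustration, not the project's evaluation code: it assumes mean average precision as the numerical measure (the proposal does not name one), and the query ids, document ids, and rankings are invented toy data.

```python
def average_precision(ranked_ids, relevant_ids):
    """Average precision of one ranked result list against the relevant ids."""
    relevant_ids = set(relevant_ids)
    if not relevant_ids:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, doc_id in enumerate(ranked_ids, start=1):
        if doc_id in relevant_ids:
            hits += 1
            precision_sum += hits / rank  # precision at this relevant hit
    return precision_sum / len(relevant_ids)


def mean_average_precision(runs, judgments):
    """Mean of per-query average precision; `runs` maps query id -> ranking."""
    scores = [average_precision(runs[q], judgments[q]) for q in judgments]
    return sum(scores) / len(scores)


# Hypothetical toy data: relevance judgments and two system runs.
judgments = {"q1": {"d1", "d3"}, "q2": {"d2"}}
baseline = {"q1": ["d2", "d1", "d3"], "q2": ["d1", "d2", "d3"]}
refined = {"q1": ["d1", "d3", "d2"], "q2": ["d2", "d1", "d3"]}

map_baseline = mean_average_precision(baseline, judgments)
map_refined = mean_average_precision(refined, judgments)
print(f"baseline MAP = {map_baseline:.3f}, refined MAP = {map_refined:.3f}")
```

Running both engines over the same queries and judgments gives one number per engine, so any gain from the clause analysis shows up directly as a difference between the two.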
For the definition of the structures we will be implementing we will need to begin with empirical studies of text understanding. We will distribute a small number of texts in several languages to a large number of people and ask them to mark topically central and peripheral clauses in the text. This will give us purchase to define algorithms to perform the same partitioning automatically.
The overall success criterion of the project is to find that all of the following apply:
All items are quantitative; all have built-in thresholds for hypothesis testing.
There are some research-oriented risks with this experiment which will need to be addressed during the first year assessment period.
The consortium consists of two organizations with strong research backgrounds: 1) SICS, Stockholm and 2) Conexor, Helsinki. Both will participate in this project with an equal contribution of personnel from its outset to its completion. SICS is a private non-profit research laboratory and Conexor a research intensive newly formed independent private company. SICS is represented in this project by Jussi Karlgren and Conexor by Pasi Tapanainen. SICS will be the financial and scientific coordinator.
Texts are viewed as simple sets of words or terms by most contemporary information systems. Thus far most research in text and document management has focused on ingesting large numbers of texts and crunching them into tables of such terms, discarding as noise extraneous information such as text style and other less content bearing information such as clausal organization. This is a reasonable starting point for the enterprise of text understanding. The first document databases indexed bibliographical data on the basis of keywords selected by trained indexers. The next step involved selecting keywords from text, possibly with the aid of a thesaurus or terminological database. And the step we have been working on the last few years has been to go towards including all terms in a text into the index, weighted by occurrence statistics as an estimate of importance.
This has brought about a huge increase in usefulness of text databases. Indexing texts with a few chosen search terms or by title words alone often fails to acknowledge the full usefulness of text items: it is impossible to predict all future uses of a certain piece of information, and the flexibility afforded by full-text retrieval is impossible to match by keyword assignment, however judiciously chosen the keywords may be.
But as we all know post-WWW, full text retrieval is not a cure-all. The quality of search results suffers from using words as the basic element of analysis: words are both too exact and too broad as carriers of information. Often words have many meanings, and often a single meaning can be expressed using any of several words or terms. We know texts are more than bags of words.
Meanwhile, most of the active work in linguistics has concerned itself with the internal structure of clauses. In several important ways, and for many languages, linguists are near the point of having solved the problem of reliable clausal analysis.
However, this work has had no significant impact on either text analysis or information management techniques. On the contrary: texts are still studied ad hoc by philologists or literature scholars, while researchers in information retrieval are debating the utility of including term dependency analysis in indexing systems -- clause level analysis is not considered useful. It seems the general type of information we are able to glean from syntactic and semantic analysis is too vapid to be useful for large scale document processing and too general to be useful to text scholars. But to quote two prominent scholars in the field -- Karen Sparck Jones and Martin Kay: "We take heart particularly from two facts: first, linguists are turning their attention more and more to larger units of discourse than the sentence, and second, on-line retrieval systems are likely to involve retrievable units smaller than traditional documents. We believe that the relevance of these fields to one another will become more apparent as the size of the text units they deal with becomes more commensurable." These optimistic words were written more than twenty years ago (Sparck Jones and Kay, 1976), and yet today few linguists work with information retrieval and even fewer information scientists provide useful results to linguistics.
Our hope is we can find better indexing mechanisms using more knowledge about language -- we do not want to abandon the entire framework of information retrieval technology which has served us well the past forty years since its first formulation in the late fifties (Luhn, 1957). In essence we want to have a more clever estimate of what terms are important in a document than systems have today.
There are many things that can be improved with the information access systems of today, but one of the clear bottlenecks is that we have too little knowledge of texts. The threshold for providing better tools for text understanding lies not in formalizing text as if it were a macro-level clause, but in uncovering the less conventionalized mechanisms for information organization and topical structure in texts.
Text level processes are not formalizable in the same way clause level structure is. While the latter is susceptible to situation-, speaker-, and content-independent analysis, the former is not. Indeed -- this is what makes texts texts. Their situation-boundedness is precisely what distinguishes them from random word collections ("word salad").
This is the starting point of this project. Our premise is that we need to know more about texts to be able to understand them. The information we want is in the text: the problem is not to find out if the information is there but how to find it reliably.
Efficient and reliable computational models for clause level syntax can be, and recently have been, implemented by several research sites. Our project bases its analysis systems on ground-breaking work done at the University of Helsinki. We have working surface syntactic analysis for English, Finnish, German, Spanish, and Swedish and full syntactic analysis for English, with more languages in the pipeline.
There is an entire body of research put into uncovering the topical structure of clauses and texts. There is a long tradition of semantic and pragmatic study of clause structure from the Charles University in Prague, there are several results supporting our hypotheses using the general theory of transitivity (Hopper, 1979), there are a number of algorithms for anaphor resolution which touch on clausal categorization, there are studies of automatic summarization algorithms, and there are studies of text grammars which all have bearing on our work. However, no studies have been made specifically on clausal categorization for topical analysis, and the empirical validation of these ideas has been held back for lack of effective tools.
Our first research question could be posed as "What makes a text a text?" More specifically, we will want to trace the topicality patterns in a text to answer the question: "How do we know what this text is about -- what communicative mechanisms has the author employed to tell us what the topic is?" Our hypothesis at this juncture is that clauses bear different roles in this respect, and that we must examine clauses as regards both their form and their dependencies on context to determine whether a given clause is intended to convey background or foreground information.
While many clauses in a text will contribute both foreground and background information, many will be primarily oriented towards one of the two: some clauses set the stage for presenting more information; some take the stage to inform of some topical or timely particularity. The prototypical clauses are different not only as regards their content, but as regards their structure and arguments as well: some clauses are stative, progressive, and concern impersonal processes ("The weather was fine"); others are event-oriented, localized in time and space, and involve animate agents ("The baker burnt his finger and slammed the oven shut"). We will form a typology of clauses to tease out foreground and background themes in text, clause by clause.
The mechanisms we are investigating -- topical dependencies and clause dynamics -- are less language specific and less syntactically encoded than many other linguistic mechanisms. They are in our view brought about by the general communicative mechanisms common to all people. These are the sort of conventions that are poised on the verge of grammaticalization, but not easily generalizable and formalizable.
Given this, our practical questions have to do with investigating what processing tools and algorithms are needed to reliably understand these mechanisms. Some of these tools may already exist; others have to be developed for the purpose.
And finally, outside the formal scope of this proposal, but underlying the entire enterprise, we are convinced of the utility of pursuing this line of inquiry, and wish to prove this by demonstrating techniques and algorithms that could be included in a cycle of prototyping and evaluation.
Clauses have different information bearing roles.
The role of a clause is conveyed from author to reader by its surface form or position.
Clausal roles are not language specific, but heavily text ecology and domain specific.
Different sorts of clause can be coaxed apart using language and domain specific mechanisms.
These mechanisms can be applied in the development of practical tools.
We will bring together the body of solid theoretical work with the recent successful engineering results to validate the usefulness of using topicality and textuality as tools for practical processing of text. This has not been tried before, mainly for lack of processing power and robust tools, but also for lack of obvious application areas. All of these obstacles are tractable enough today to allow for a first cycle of experimentation.
This work has a strong European dimension. There are no areas in the world other than Europe where as many languages co-exist with comparable levels of technological sophistication, cultural equality, and functional diversity. We will be able to find textual samples from a large number of languages; we have the technology to process at least a number of them. Performing this type of experiment in a single language would not yield as fruitful results. Every language has its own mechanisms for clausal and textual organization; if we overfit our experiments to predict the results in a single language we risk drawing hasty conclusions that will not travel from language to language. As one concrete example which we will be looking at, aspectual differences between clauses are signalled in very different ways across languages: French and English use tense differences, Russian and other Slavic languages have verb pairs, Finnish uses object case, and Swedish uses -- largely lexicalized -- verb particles to signal much the same sort of effects. If we were to experiment in English only we would draw the conclusion that this sort of difference is largely marked by morphological variation and lexical choice in the verb. Were we to use Finnish as a starting point we would discuss qualities of the object instead. Neither approach would be fruitful on its own.
There is an obvious beneficial long range effect for speakers of languages other than English to promote research on cross-lingual techniques.
This project will tap into a reserve of linguistic competence to provide hitherto unexploited knowledge sources to process the vast repositories of knowledge stored digitally across the world. Any improvement over today's shoddy mono-lingual methods - however slight - will prove immensely effective, in economic terms. This project aims at treading the first steps down a new path: the economic returns are to be expected in the long range.
While we do not plan to include application work in this project, which is more of an exploratory nature, we do wish to point out some obvious applications, should the technology we will develop live up to its promise.
Indexing is the main example used to motivate this project, and indeed, the general problem of affixing useful content bearing labels on texts is what has motivated this research plan, and we believe we will be able to do this in a much better motivated way once we have a low level understanding of mechanisms for textual topic progression.
Summarization and report generation, both of single texts and, even more crucially, of several texts, is currently a very active research frontier -- we believe that an effort to find cross linguistic principles for text analysis will be useful especially for the multi-document task.
We will take a large number of texts in several languages and attempt to partition the clauses into a number of graded categories according to foregroundedness. These clause categories can then be used in different ways for indexing, multi-document summarization, and text item similarity calculation.
Most of the work in this first-year experiment will be patterned on the Text Retrieval Conferences (TRECs) organized annually by the US National Institute of Standards and Technology (NIST). TREC provides participants with a large collection of texts together with a yearly set of fifty queries. These queries are used for retrieval on the collection, and the resulting retrieved sets are evaluated by a panel of human judges in time for each year's conference. 1999 is the eighth year the conference is being held - SICS has participated the last four years and Conexor the past two - and there is a pool of several hundred queries with relevance judgments made on items in the text collection. These can be used retrospectively to tune information retrieval queries, which is what we will be doing in this project.
For the definition of the structures we will be implementing we will need to begin with empirical studies of text understanding. We will distribute a small number of texts to a large number of people and ask them to mark topically central and peripheral clauses in the text. This will give us purchase to define algorithms to perform the same partitioning automatically.
For the experimental work stages, we have defined specific measures of success to be applied in evaluating the success of the entire project. The success criteria have been outlined above, in the section on project objectives; they are to be measured as detailed below in each work package.
We will measure agreement between test subjects with the well-established kappa statistic (e.g. Carletta, 1996). We expect a reasonable level of agreement among subjects: this is a central hypothesis of the project and an important base for the continued experimentation.
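As an illustration of the agreement measure, the sketch below computes kappa for several judges labelling the same clauses. It assumes the Fleiss multi-rater formulation, which is our choice for the example; the proposal does not say which variant will be used, and the counts are invented toy data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table of category counts.

    counts[i][j] is the number of judges who put clause i in category j.
    Every clause must be rated by the same number of judges.
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_categories = len(counts[0])

    # Observed agreement: mean share of agreeing judge pairs per clause.
    per_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(per_item) / n_items

    # Chance agreement from the overall category proportions.
    totals = [sum(row[j] for row in counts) for j in range(n_categories)]
    p_expected = sum((t / (n_items * n_raters)) ** 2 for t in totals)
    if p_expected == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    return (p_observed - p_expected) / (1 - p_expected)


# Hypothetical data: 4 clauses, 3 judges, columns = (foreground, background).
toy_counts = [[3, 0], [0, 3], [2, 1], [3, 0]]
kappa = fleiss_kappa(toy_counts)
print(f"kappa = {kappa:.3f}")
```

A value of 1 means perfect agreement and 0 means agreement no better than chance; what counts as a "reasonable level" is a threshold the project itself would have to fix.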
Agreement on foregroundedness judgments across languages can be evaluated, as in the single-language case, using the kappa statistic. This is one of the success criteria of the entire project.
The tools exist; they merely need to be tuned to find the appropriate patterns. They can be evaluated system-internally, by precision and recall vis-à-vis material processed by human judges.
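A minimal sketch of such a system-internal evaluation is given below. The two-way label set and the example labels are assumptions made for illustration; the project's actual tagset is a later deliverable.

```python
def precision_recall(predicted, gold, target="foreground"):
    """Precision and recall of `target` labels against human (gold) labels."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must label the same clauses")
    pairs = list(zip(predicted, gold))
    true_pos = sum(1 for p, g in pairs if p == target and g == target)
    predicted_pos = sum(1 for p, _ in pairs if p == target)
    gold_pos = sum(1 for _, g in pairs if g == target)
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / gold_pos if gold_pos else 0.0
    return precision, recall


# Hypothetical labels for five clauses: human judgment vs. tool output.
gold = ["foreground", "background", "foreground", "background", "foreground"]
predicted = ["foreground", "foreground", "foreground", "background", "background"]

precision, recall = precision_recall(predicted, gold)
print(f"precision = {precision:.2f}, recall = {recall:.2f}")
```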
Analyses can be validated against the test subjects' judgments using standard non-parametric hypothesis testing methods (e.g. Wilcoxon or Mann-Whitney U). This is another success criterion for the entire project.
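One way such a test could be set up is sketched below. It assumes that the tool assigns each clause a numeric foregroundedness score and that we compare the scores of clauses the subjects marked as foreground with those marked as background; both the setup and the scores are hypothetical. The p-value uses the plain normal approximation without tie or continuity correction, which is only rough for samples this small.

```python
import math


def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def mann_whitney_u(sample_a, sample_b):
    """Return (U for sample_a, two-sided p from the normal approximation)."""
    n_a, n_b = len(sample_a), len(sample_b)
    ranks = average_ranks(list(sample_a) + list(sample_b))
    rank_sum_a = sum(ranks[:n_a])
    u_a = rank_sum_a - n_a * (n_a + 1) / 2
    mean_u = n_a * n_b / 2
    sd_u = math.sqrt(n_a * n_b * (n_a + n_b + 1) / 12)
    z = (u_a - mean_u) / sd_u
    p_two_sided = math.erfc(abs(z) / math.sqrt(2))
    return u_a, p_two_sided


# Hypothetical tool scores, grouped by the subjects' majority judgment.
foreground_scores = [0.9, 0.8, 0.7, 0.6]
background_scores = [0.5, 0.4, 0.3, 0.2]

u_stat, p_value = mann_whitney_u(foreground_scores, background_scores)
print(f"U = {u_stat}, p = {p_value:.4f}")
```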
Running information retrieval experiments is straightforward. We have a well established information retrieval test bench for testing indexing and retrieval behavior from our participation in TREC. If material which is indexed separately for foreground and background clauses can be shown to attain better retrieval results for some or most test bench queries than material which is not analyzed in terms of foreground and background, we can conclude that the clausal distinction between foreground and background does indeed carry information useful for text understanding on this basic level. This is the third and most important success criterion of the project.
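To make the idea concrete, the sketch below indexes the same toy documents twice, once with every clause weighted equally and once with foreground clauses counting for more than background clauses, and ranks them for a query. The weights, the whitespace tokenizer, the clause labels, and the scoring function are all assumptions chosen for brevity; they stand in for whatever the test bench actually uses.

```python
from collections import Counter

# Hypothetical weights: the proposal does not fix any numbers.
UNIFORM = {"foreground": 1.0, "background": 1.0}
FOREGROUND_BOOST = {"foreground": 2.0, "background": 0.5}


def index_document(clauses, category_weights):
    """Weighted term frequencies from (clause text, category) pairs."""
    vector = Counter()
    for text, category in clauses:
        for term in text.lower().split():
            vector[term] += category_weights[category]
    return vector


def score(query, vector):
    """Sum of the weighted frequencies of the query terms."""
    return sum(vector[term] for term in query.lower().split())


def rank(query, documents, category_weights):
    """Document ids ordered by descending score, ties broken by id."""
    scored = {
        doc_id: score(query, index_document(clauses, category_weights))
        for doc_id, clauses in documents.items()
    }
    return sorted(scored, key=lambda doc_id: (-scored[doc_id], doc_id))


# Toy collection with invented clause labels.
documents = {
    "bakery": [
        ("the baker burnt his finger", "foreground"),
        ("the weather was fine", "background"),
    ],
    "forecast": [
        ("the weather turned stormy", "foreground"),
        ("the baker stayed home", "background"),
        ("the baker kept a shop", "background"),
    ],
}

plain_ranking = rank("baker", documents, UNIFORM)
boosted_ranking = rank("baker", documents, FOREGROUND_BOOST)
print(plain_ranking, boosted_ranking)
```

With uniform weights the document that merely mentions the baker twice in the background ranks first; with the foreground boost the document whose foreground clause is about the baker moves to the top. Whether this kind of reordering improves retrieval on real queries is exactly what the experiment is meant to test.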
TREC has covered several languages besides English. There is experimental material available for French, German, Italian, and Spanish. To run indexing experiments on other languages than these we must define similar measures: this will most likely be beyond the capabilities of this one-year project.
| No. | Deliverable | Date (project month) | Requires | Responsible partner (C = Conexor, S = SICS) |
|---|---|---|---|---|
| 1.1 | English corpus | 1 | -- | C,S |
| 1.2 | Multilingual corpus | 2 | -- | C,S |
| 2.1 | Report on existing literature on transitivity | 3 | -- | S,C |
| 3.1 | Tagset for test corpora | 3 | 1.1, 1.2 | S |
| 4.1 | Report on inter-language and inter-judge agreement | 3 | 3.1 | S |
| 5.1 | Report on clause characteristics | 5 | 2.1,3.1 | C,S |
| 6.1 | Efficient tool for clausal discrimination | 5 | 2.1, 3.1, 5.1 | C |
| 7.1 | Indexed corpus | 8 | 6.1 | C |
| 7.2 | Report of TREC evaluation | 9 | 7.1 | C,S |
| 8.1 | Comprehensive report | 12 | All | C,S |
| 9.1 | Continuation proposal | 13 | All | C,S |
In this project SICS will be coordinating the work of evaluating and performing user studies, while Conexor will be customizing and developing its well-known and very successful technology for syntactic analysis. The theoretical basis of the project has been formulated in close cooperation, and will be further developed during the course of the project. While SICS can provide the stability and experience necessary for a cooperative project and long term research, Conexor will be able to provide new cutting-edge technology for project purposes.
SICS and Conexor have both participated in numerous international multi-site projects, most recently together with Rutgers University and General Electric in the TREC-7 evaluation in 1998, and currently participate with the same project group in the TREC-8 evaluation. This project can be run and coordinated with a minimum of effort thanks to previous experience and other current related engagements: it has a clear goal, a well defined time span, and clearly delimited work stages which have all been agreed upon.