LLAVES Assessment year deliverable

Current research in textual topic, foreground and background, with special attention paid to applications - LLAVES Report 1
Jussi Karlgren
jussi@sics.se
September 2000

The LLAVES project plans to show how syntactic mechanisms of language, which primarily concern clause-internal structure, carry text-level information as well. The plan is to find mechanisms which distinguish different types of clause, with the aim of teasing out foreground information from background in a text.

Foreground and Background: topic in text

There is an entire body of research devoted to uncovering the topical structure of clauses and texts. There is a long tradition of semantic and pragmatic study of clause structure at Charles University in Prague (e.g. Hajicová, 1993); there are several results supporting our hypotheses within the general theory of transitivity (Halliday, 1967, 1978; Hopper, 1979); there are a number of algorithms for anaphor resolution which touch on clausal categorization; there are studies of automatic summarization algorithms; and there are studies of text grammars, all of which have a bearing on our work. However, no studies have been made specifically of clausal categorization for topical analysis, and the empirical validation of these ideas has been held back for lack of effective tools.

Transitivity and clauses

Transitivity is one of the most basic notions in the system of language, but it is poorly formalized in the formal study of language. Clauses in language represent events and processes of various kinds, and transitivity is that characteristic of a clause which models the character of the process or event it represents. This systemic model was first formulated by Halliday (1967) and has since been elaborated theoretically by Hopper and others. However, very little empirical study on large numbers of texts has been performed, and no systematic, let alone quantitative, evaluation of the theories has even been proposed.

One of the basic conceptual structures of language in use is that actions are done by people and affect things. How the action is performed, by whom, and on what are all encoded in the clause by various syntactic mechanisms, in a general system of transitivity. For most non-linguists, transitivity is only explicitly mentioned in foreign-language classes when classifying verbs as transitive or intransitive, that is, according to whether the verb in question takes a direct object or prefers not to. This is of course central to the task of modeling action and effect, but transitivity covers more than this one aspect of process structure. Halliday's model mentions a number of specific factors or "systems" that make up the more general "system" of transitivity, among them the number, type, and role of participants (human or not? agent? benefactive?) and the process type (existence, possession, spatial/locative, spatial/mobile) (e.g. Halliday, 1978, p. 118). These aspects of clausal organization hook up with factors such as temporal, aspectual, or mood systems to produce a clause. This clause not only carries information about the event or process it represents, but it also crucially builds a text, together with adjacent clauses. In Halliday's model (most comprehensively delineated in his 1967 publication) a clause is the confluence of three systems of syntactic choice: transitivity, mood and theme. Transitivity, he writes, is the set of options relating to cognitive content, mood being the system for organizing the utterance into a speech situation, and theme being the system for organizing the utterance into a discourse.

While there is ample psycholinguistic evidence that the syntactic form of a clause is discarded after being processed by the hearer or reader (e.g. Jarvella, 1979), the communicative structure of the clause is retained to organize the information content of the text or discourse. The structure of a clause is not arbitrary, and cannot be determined in isolation from other clauses in the vicinity and other events, processes, and participants represented and mentioned in the text. This modeling is primarily done through measures of lexical cohesion as will be shown below.

But models of transitivity are lacking. Transitivity has been and is being studied only as a very theoretical construction, and little work has been done which would be of direct implementational quality. The theoretical work concentrates on syntactic modeling of languages about which rather little is known as yet, as a first stage towards building a more complete description. Practical clues as to how to make use of transitivity are mainly due to Hopper and Thompson (Hopper, 1979; Hopper and Thompson, 1980). Hopper argues for the distinction between background and foreground in narrative, signaled by variation in the qualities of the subject (such as animacy or humanness), of the predicate verb (such as aspect or tense marking), and in the voice of the clause. Many or even most of these factors cut across language divides (see e.g. Dahl and Karlsson, 1976). Hopper and Thompson then propose a number of characteristics along which transitivity is measured, some of which are directly quantifiable, as shown in the table below. We can make direct use of these factors in our implementation effort.

Feature                    High                      Low
Participants               2 or more                 fewer than 2
Kinesis                    action                    non-action
Aspect                     completed                 partial or imperfect
Punctuality                punctual                  continuous
Volitionality              volitional                non-volitional
Polarity                   affirmative               negative
Reality                    real                      non-real
Agency                     potent agent              non-potent agent
Effect on object           object totally affected   object not affected
Individuation of object    individual object         non-individual object
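
The parameters in the table lend themselves to a simple additive score. The following is a minimal sketch under stated assumptions, not Hopper and Thompson's own procedure: it assumes each clause has already been annotated (by hand or by a parser) with one boolean per parameter, True standing for the high-transitivity value. The names PARAMETERS, transitivity_score and is_foreground, the equal weighting of the parameters, and the 0.5 threshold are all illustrative choices rather than anything taken from the literature.

```python
# Hypothetical sketch: score a clause by the share of high-transitivity values.
# Parameter names follow the table above; equal weighting is an assumption.
PARAMETERS = (
    "participants",             # True if two or more participants
    "kinesis",                  # True if action
    "aspect",                   # True if completed
    "punctuality",              # True if punctual
    "volitionality",            # True if volitional
    "polarity",                 # True if affirmative
    "reality",                  # True if real
    "agency",                   # True if potent agent
    "effect_on_object",         # True if object totally affected
    "individuation_of_object",  # True if individual object
)


def transitivity_score(clause_features):
    """Return the fraction of parameters marked high; missing ones count as low."""
    high = sum(1 for name in PARAMETERS if clause_features.get(name, False))
    return high / len(PARAMETERS)


def is_foreground(clause_features, threshold=0.5):
    """Treat clauses at or above the (arbitrary) threshold as foreground candidates."""
    return transitivity_score(clause_features) >= threshold


# Hand-annotated illustrations (the annotations themselves are assumptions):
# "The boy broke the window." is high on every parameter.
broke = dict.fromkeys(PARAMETERS, True)
# "The window was not clean." is real, but low on everything else.
clean = {"reality": True}

print(transitivity_score(broke), is_foreground(broke))
print(transitivity_score(clean), is_foreground(clean))
```

A score of this kind only orders clauses along the high/low scale; whether a fixed threshold separates foreground from background in real text is exactly the sort of question that needs empirical study.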

Clauses and Topic

There is a large number of approaches to textual modeling, with widely varying bases in theories of language or syntax. Most models of text are statistically based, or follow some high-level model of argumentation irrespective of syntax; some take recourse to cue phrases or expressions specific to some domain to build a text model. Some use syntactic analysis as a low-level building block, but discard what is left of the syntactic analysis after the argument structure of the clause has been established.

Local coherence

Much of topical study centers on local coherence of discourse or text, such as research models of theme-rheme or topic-comment, or research strands such as recent projects in modeling centering (Grosz et al., 1995). In these approaches, topic is a feature of the clause, and is carried over to the next clause through relatively overt syntactic mechanisms such as argument organization, anaphora, or ordering. These types of model of local coherence, some of which have a fairly sophisticated theoretical base along the lines of Halliday's theme system, may well benefit from using transitivity as a factor.

Narrative models

Other studies try to understand topic from the top down, building argument structures or narrative frames (e.g. Lehnert, 1980). Lehnert, for instance, discussing application to summarization, argues that we must have a picture of plot progression throughout a text, with a model of the mental states of an implied reader which the text affects in various ways. This high-level type of approach, most often with less psychological modeling involved, was typical of the knowledge-based systems projects of the late eighties. The failing of such systems is often that they have too little actual text processing capacity, and stumble on text processing as a task. Many systems attempt to generate rhetorical structures of various flavors based on local coherence models (e.g. Marcu, 1994 and onward; Corston-Oliver, 1998; Liddy, 1993), but quite often need more syntactic competence. Simpler models, with a form-filling approach (e.g. Strzalkowski et al., 1998), perform quite well, up to a point, with much less investment in discourse modeling. There is a span of such models, ranging from completely general templates to very strictly task-oriented and tailored extraction patterns; the middle ground between them is claimed by hybrid approaches, which indeed are just that: combinations of both rather than bridges between them (Kan and McKeown, 1999). The greater effort in building a narration or discourse model has yet to prove useful: the gap between text and discourse model has not yet been usefully bridged. It is in this type of model that our contribution would most clearly be of benefit: these systems need to form a clearer understanding of the informational rationale of syntax.

Lexical chains

Several approaches try to establish lexical chains in text as a basis for understanding content, for either indexing for retrieval or summarization tasks. Lexical units in the text are picked out by some algorithm, possibly after the text is segmented (e.g. Barzilay and Elhadad, 1997), and relations between units are established using terminological models. Many of these models utilize text segmentation algorithms based on occurrence statistics (e.g. Hearst, 1994 or Reynar, 1994), thesauri and terminological databases, or cue or trigger phrases of some sort (e.g. Boguraev and Kennedy, 1997). These types of model tend to be quite successful, but are often quite atheoretical and will be difficult to improve using theoretical add-on models.
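
To make concrete what "occurrence statistics" can mean here, the following minimal sketch measures lexical cohesion between adjacent sentences as the cosine similarity of their word counts; a low value hints at a possible topic shift. This is a deliberately simplified illustration and not the algorithm of Hearst (1994) or Reynar (1994). The stopword list, the function names, and the example sentences are assumptions made for the sake of the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "in", "on", "and", "is", "was", "to", "it"}
)


def content_words(sentence):
    """Count lowercased alphabetic tokens, leaving out the stopwords."""
    tokens = re.findall(r"[a-z]+", sentence.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)


def cosine(a, b):
    """Cosine similarity of two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def adjacent_cohesion(sentences):
    """Similarity between each sentence and the one following it."""
    bags = [content_words(s) for s in sentences]
    return [cosine(x, y) for x, y in zip(bags, bags[1:])]


sentences = [
    "The parser reads the clause.",
    "The parser tags the clause.",
    "Rain fell on the harbour.",
]
scores = adjacent_cohesion(sentences)
print(scores)  # the second gap shares no content words with its neighbour
```

Note that nothing in this computation looks inside the clause: two sentences with the same words but different participants and processes receive the same score, which is the kind of information a transitivity-based model would add.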

Application area: Information extraction

A typical application for textual modeling is to extract pieces of information (see e.g. MUC or TDT proceedings, available from NIST). In general, information extraction systems are knowledge-intensive, putting a fair amount of effort into building a model of the domain that can be used to predict the content of the texts analyzed. The syntactic analysis, if any, is used to aid in retrieving the type of content that has already been predicted to occur; systems typically do not discover new informational structure but work with a given model of knowledge and an amorphous model of text structure. Variation in clause structure is typically viewed as noise, and the systems put some effort into normalizing that variation without retaining the information clause structure might carry for topical progression (e.g. Grishman, 1997). Information extraction uses a high-level model of topic combined with very local coherence analysis to solve its task. Since this application area is highly knowledge-intensive and task-tailored, general-purpose models will probably, at least initially, dampen results rather than improve them. Our models will be useful here only insofar as they improve the general syntactic processing tools the systems make use of.

Application area: Summarization

Summarization is most often implemented as the selection of sentences in text, based on those sentences' supply of content words (e.g. starting from Earl, 1970; Edmundson, 1969 and continuing to most available tools today). This is the area where qualities of single clauses might be most useful for future improvement of system performance; the drawback for our present project is that the evaluation of summarization tools is non-trivial and subjective, and will not easily lend itself to proving the utility of our approach.
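
Sentence selection by content words can be sketched in a few lines. The example below is a frequency-only baseline written for illustration: it scores each sentence by the mean document frequency of its content words and returns the top k sentences in their original order. It is not Edmundson's or Earl's method, and the stopword list, the function names, and the sample sentences are assumptions.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard resource).
STOPWORDS = frozenset(
    {"the", "a", "an", "of", "in", "on", "and", "is", "was", "to", "it"}
)


def tokenize(text):
    """Lowercased alphabetic tokens with the stopwords removed."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS]


def summarize(sentences, k=1):
    """Return the k highest-scoring sentences, kept in document order."""
    freq = Counter(t for s in sentences for t in tokenize(s))

    def score(sentence):
        # Mean frequency of the sentence's content words across the document.
        tokens = tokenize(sentence)
        return sum(freq[t] for t in tokens) / len(tokens) if tokens else 0.0

    ranked = sorted(
        range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True
    )
    return [sentences[i] for i in sorted(ranked[:k])]


document = [
    "Transitivity describes the clause.",
    "The clause carries transitivity and the clause builds text.",
    "Weather was fine.",
]
print(summarize(document, k=1))
```

A clause-level measure such as a transitivity score could in principle be added as a further term in the sentence score; whether that improves summaries is the open question, and one that is hard to settle given the evaluation problems noted above.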

Application area: Information retrieval

Pure retrieval systems usually invest a fair amount of effort into completely ignoring text as text. Some exceptions include experiments to statistically process syntactic relations in text (e.g. Strzalkowski et al., 1997) to find typical relations entities engage in (in a sort of small-scale version of extraction technology) and others trying to establish reference chains in text (e.g. Liddy, 1994) to sort out occurrence frequencies obscured by anaphora. Information retrieval systems typically have neither textual models nor local coherence models to guide their analysis of texts; word occurrence statistics are good enough for the tasks these systems are used for at present. While we are unlikely to impress information retrieval system engineers with syntactic and semantic niceties, the evaluation framework provided by information retrieval systems is useful enough for us to test our future algorithms with.

Bottlenecks

Bottlenecks for most approaches described above have to do with processing capacity. We have better basic processing power than most, through our access to the Conexor FDG parser, and must concentrate on using it to its best capacity. We should try to extend its processing power beyond the clause to the textual level.

Open Research Questions

The first research question we formulated in our research plan was "What makes a text a text?" Clearly, this is a question others have asked before. Current work in text understanding is plentiful and partially successful. Most of the work in our field, that of language engineering, is based on statistical models of term occurrence, whether along lexical chains, in sentence extraction algorithms, or using thesauri as a domain model. The main exception is centering and other anaphora resolution approaches, related or unrelated to it. Most of the effort being put into text analysis today is along the lines of the theme system in Halliday's analysis.

The arguably primary aspect of the clause is that of its cognitive content, and its relation to the other systems: this is at present measured using very simple statistically based models or thesaurus-based models of lexical cohesion. The study of transitivity would raise the sophistication of this system to match that of the study of theme and topicality per se. This gives us the task of concentrating primarily on transitivity as a high-level description of clause content, function, and structure; when we do, it can be connected to the discourse through the efforts of other projects as outlined above, and as an end result we gain more knowledge of the structure of texts.

Since our hypothesis is that clauses bear different roles in a text, and that these roles at least in part are communicated through their semantic role structure, this is where we should concentrate our efforts. The mechanisms modeled by transitivity are strongly encoded in syntax, and thus largely language specific in their encoding. However, their function is not. The utility of building a transitivity-based model of text will be language independent.

Our work seems to hold great promise of providing empirical data towards building more syntactic analysis tools, with the ambition of building a more complete yet practical model of text. Transitivity on the clause level is one of the key factors in understanding information organization on the textual level and is, as of now, an untapped resource.

Evaluation

It is clear that the most promising avenue of application lies in summarization, especially as applied to multiple documents. But we cannot currently evaluate summarization performance easily, even if we had recourse to a summarization tool to improve; we should instead concentrate on building a tool for semantic role analysis and see if we can evaluate it in retrieval applications as an indexing mechanism, which will be more straightforward as an evaluation and require less application-oriented work.

References

  • Barzilay, Regina and Michael Elhadad. 1997. "Using lexical chains for text summarization". In ACL/EACL Workshop on Intelligent Scalable Text Summarization, pages 10-17.
  • Boguraev, Branimir and Christopher Kennedy. 1997. "Salience-based content characterisation of text documents". In ACL/EACL Workshop on Intelligent Scalable Text Summarization.
  • Corston-Oliver, Simon. 1998. "Beyond string matching and cue phrases". In Proceedings of AAAI 98 Spring Symposium on Intelligent Text Summarization.
  • Dahl, Östen, and Fred Karlsson. 1976. "Verbien aspektit ja objektin sijamerkintä: vertailua suomen ja venäjän välillä". Sananjalka 18, 1976, 28-52.
  • Earl, Lois L. 1970. "Experiments in automatic extracting and indexing". Information Storage and Retrieval 6:313-334.
  • Edmundson, H. P. 1969. "New methods in automatic abstracting". Journal of the Association for Computing Machinery 16:264-285.
  • Grishman, Ralph. 1997. "Information Extraction: Techniques and Challenges". In Information Extraction (International Summer School SCIE-97), edited by Maria Teresa Pazienza. Springer-Verlag.
  • Grosz, Barbara J., Aravind K. Joshi, and Scott Weinstein. 1995. "Centering: A Framework for Modelling the Local Coherence of Discourse". Computational Linguistics 21:2, 203-227.
  • Hajicová, Eva. 1993. Issues of Sentence Structure and Discourse Patterns, volume 2 of Theoretical and Computational Linguistics. Prague: Institute of Theoretical and Computational Linguistics, Charles University.
  • Halliday, M. A. K. 1967. "Notes on Transitivity and Theme in English." Journal of Linguistics 3:37-81, 199-244.
  • Halliday, M. A. K. 1978. Language as social semiotic. London: Edward Arnold Ltd.
  • Hearst, Marti. 1994. "Multi-Paragraph Segmentation of Expository Text". In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Las Cruces, June 1994). ACL.
  • Hopper, Paul J. 1979. "Aspect and foregrounding in discourse". In Syntax and Semantics, Vol. 12, pp. 213-241. Academic Press.
  • Hopper, Paul J. and Sandra Thompson. 1980. "Transitivity in Grammar and Discourse", Language, 56:2, pp. 251-299.
  • Jarvella, Robert. 1979. "Immediate memory and discourse processing". In G. H. Bower (ed.), The psychology of learning and motivation, Vol. 13. New York: Academic Press.
  • Kan, Min-Yen and Kathleen McKeown. 1999. Information Extraction and Summarization: Domain Independence through Focus Types. Columbia University Technical Report CUCS-030-99.
  • Lehnert, Wendy G. 1980. "Narrative text summarization". In Proceedings of the First National Conference on Artificial Intelligence, 1980.
  • Liddy, Elizabeth. 1993. "Development and implementation of a discourse model for newspaper texts". Dagstuhl Seminar on Summarizing Text for Intelligent Communication.
  • Marcu, Daniel. 1996. Building up rhetorical structure trees. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1069-1074.
  • Marcu, Daniel. 1997. From discourse structures to text summaries. In Proceedings of the ACL/EACL Workshop on Intelligent Scalable Text Summarization.
  • Marcu, Daniel. 1997. From local to global coherence: A bottom-up approach to text planning. In Proceedings of the Fourteenth National Conference on Artificial Intelligence.
  • Marcu, Daniel. 1997. The rhetorical parsing of natural language texts. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics, pages 96-103.
  • Marcu, Daniel. 1998. To build text summaries of high quality, nuclearity is not sufficient. In AAAI-98 Spring Symposium on Intelligent Text Summarization.
  • Reynar, Jeffrey C. 1994. An Automatic Method of Finding Topic Boundaries. In Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Las Cruces, June 1994). ACL.
  • Strzalkowski, Tomek, Louise Guthrie, Jussi Karlgren, Jim Leistensnider, Fang Lin, Jose Perez-Carballo, Troy Straszheim, Jin Wang, Jon Wilding. 1997. "Natural Language Information Retrieval: TREC-5 Report". Proceedings of the fifth Text Retrieval Conference, Donna Harman (ed.), NIST Special Publication, Gaithersburg: NIST.
  • Strzalkowski, Tomek, Jin Wang, and Bowden Wise. 1998. "A robust practical text summarizer". In AAAI 98 Spring Symposium on Intelligent Text Summarization, pages 26-33.