LLAVES Assessment year deliverable
Current research in textual topic, foreground and background, with special attention
paid to applications - LLAVES Report 1
Jussi Karlgren
jussi@sics.se
September 2000
The LLAVES project plans to show how syntactic mechanisms of language,
which primarily concern clause-internal structure, carry text-level
information as well. The plan is to find mechanisms which distinguish
different types of clause, with the aim of teasing out foreground information
from background in a text.
Foreground and Background: topic in text
A substantial body of research has gone into uncovering the topical
structure of clauses and texts. There is a long tradition of semantic and
pragmatic study of clause structure at Charles University in Prague
(e.g. Hajicová, 1993), there are several results supporting our hypotheses
using the general theory of transitivity (Halliday, 1967, 1978; Hopper,
1979), there are a number of algorithms for anaphor resolution which touch
on clausal categorization, there are studies of automatic summarization
algorithms, and there are studies of text grammars, all of which have a
bearing on our work. However, no studies have been made specifically on
clausal categorization for topical analysis, and the empirical validation of
these ideas has been held back for lack of effective tools.
Transitivity and clauses
Transitivity is one of the most basic notions in the system of language,
but it is poorly formalized in the formal study of language. Clauses in
language represent events and processes of various kinds, and transitivity
is that characteristic of a clause which models the character of the process
or event it represents. This systemic model was first formulated by Halliday
(1967) and has since been elaborated by Hopper and others, but only in a
theoretical sense: very little empirical study on large numbers of texts has
been performed, and no systematic, let alone quantitative, evaluation of the
theories has even been proposed.
One of the basic conceptual structures of language in use is that actions
are done by people and affect things. How the action is performed, by whom,
and on what are all encoded in the clause by various syntactic mechanisms,
in a general system of transitivity. For most non-linguists, transitivity
is only explicitly mentioned in foreign-language classes when verbs are
classified as transitive or intransitive, according to whether the verb in
question takes a direct object or prefers not to. This is of course central
to the task of modeling action and effect, but transitivity covers more than
this one aspect of process structure. Halliday's model mentions a number of
specific factors or "systems" that together make up the more general
"system" of transitivity, among them the number, type, and role of
participants (human or not? agent? benefactive?) and the process type
(existence, possession, spatial/locative, spatial/mobile)
(e.g. 1978, p. 118). These aspects of clausal organization hook up with
factors such as temporal, aspectual, or mood systems to produce a clause.
This clause not only carries information about the event or process it
represents, but it also crucially builds a text, together with adjacent
clauses. In Halliday's model (most comprehensively delineated in his 1967
publication) a clause is the confluence of three systems of syntactic
choice: transitivity, mood and theme. Transitivity, he writes, is the
set of options relating to cognitive content, mood being the
system for organizing the utterance into a speech situation, and theme
being the system for organizing the utterance into a discourse.
While there is ample psycholinguistic evidence that the syntactic form of a
clause is discarded after being processed by the hearer or reader
(e.g. Jarvella, 1979), the communicative structure of the clause is
retained to organize the information content of the text or discourse. The
structure of a clause is not arbitrary, and cannot be determined in
isolation from other clauses in the vicinity and other events, processes,
and participants represented and mentioned in the text. This modeling is
primarily done through measures of lexical cohesion as will be shown below.
But models of transitivity are lacking. Transitivity has been and is being
studied only as a very theoretical construction, and little of the work
done is of a kind that could be implemented directly. The theoretical work
concentrates on syntactic modeling of languages of which there is rather
little knowledge yet, as a first stage towards building a more complete
description. Practical clues as to how to make use of transitivity are
mainly due to Hopper and Thompson (Hopper, 1979; Hopper and Thompson,
1980). Hopper argues for the distinction between background and foreground
in narrative, signaled by variation in the qualities of the subject (such as
animacy or humanness), of the predicate verb (such as aspect or tense
marking), and in the voice of the clause. Many or even most of these factors
cut across language divides (see e.g. Dahl and Karlsson, 1976). Hopper and
Thompson then propose a number of characteristics along which transitivity
is measured, some of which are directly quantifiable, as shown in the table
below. We can make direct use of these factors in our implementation effort.
| Feature | High | Low |
| Participants | 2 or more | less |
| Kinesis | action | non-action |
| Aspect | completed | partial or imperfect |
| Punctuality | punctual | continuous |
| Volitionality | volitional | non-volitional |
| Polarity | affirmative | negative |
| Reality | real | non-real |
| Agency | potent agent | non-potent agent |
| Effect on object | object totally affected | object not affected |
| Individuation of object | individual object | non-individual object |
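To make the idea of directly quantifiable features concrete, the following is a minimal sketch of how the ten features in the table could be turned into a per-clause score. It is an illustration only: the class and function names, the unweighted count, and the threshold of 6 are assumptions introduced here, not part of Hopper and Thompson's proposal or of any existing LLAVES tool, and the feature values are assumed to be annotated by hand rather than produced by a parser.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class ClauseFeatures:
    """One clause, annotated by hand. True marks the 'High' column of the table."""
    two_or_more_participants: bool  # Participants
    action: bool                    # Kinesis
    completed: bool                 # Aspect
    punctual: bool                  # Punctuality
    volitional: bool                # Volitionality
    affirmative: bool               # Polarity
    real: bool                      # Reality
    potent_agent: bool              # Agency
    object_totally_affected: bool   # Effect on object
    object_individuated: bool       # Individuation of object


def transitivity_score(clause: ClauseFeatures) -> int:
    """Count the features that take their 'High' value (0 to 10)."""
    return sum(1 for f in fields(clause) if getattr(clause, f.name))


def is_foreground_candidate(clause: ClauseFeatures, threshold: int = 6) -> bool:
    """Assumed decision rule; the threshold is arbitrary and would need tuning."""
    return transitivity_score(clause) >= threshold


# "The guard smashed the window." (illustrative hand annotation)
smashed = ClauseFeatures(True, True, True, True, True, True, True, True, True, True)
# "The window was not large." (illustrative hand annotation)
stative = ClauseFeatures(False, False, False, False, False, False, True, False, False, False)

print(transitivity_score(smashed), is_foreground_candidate(smashed))
print(transitivity_score(stative), is_foreground_candidate(stative))
```

An unweighted count is the simplest possible choice; whether some features should weigh more than others is exactly the kind of question that empirical study on large text collections would have to settle.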
Clauses and Topic
There is a large number of approaches to textual modeling, with widely
varying bases in theories of language or syntax. Most models of text are
statistically based, or have some high-level model of argumentation to
follow irrespective of syntax; some take recourse to cue phrases or
expressions specific to some domain to build a text model. Some use
syntactic analysis as a low-level building block, but discard what is left
of the syntactic analysis after the argument structure of the clause has
been established.
Local coherence
Much of the study of topic centers on the local coherence of discourse or
text: examples are models of theme-rheme or topic-comment structure, and
more recent research strands such as the modeling of centering
(Grosz et al, 1995). In these approaches, topic is a feature of the clause,
and is carried over to the next clause through relatively overt syntactic
mechanisms such as argument organization, anaphora, or ordering. Models of
local coherence of this type, some of which have a fairly sophisticated
theoretical base along the lines of Halliday's theme system, may
well benefit from using transitivity as a factor.
Narrative models
Other studies try to understand topic from the top down, building argument
structures or narrative frames (e.g. Lehnert, 1980). Lehnert, for instance,
discussing application to summarization, argues that we must have a picture
of plot progression throughout a text, with a model of mental states of an
implied reader which the text affects in various ways. This high-level type
of approach, most often with less psychological modeling involved, was
typical during the knowledge-based systems projects of the late
eighties. The failings of such systems are often that they have too little
actual text processing capacity, and stumble on text processing as a task.
Many systems attempt to generate rhetorical structures of various flavors
based on local coherence models (e.g. Marcu, 1994 and onward;
Corston-Oliver, 1998; Liddy, 1993), but quite often need more syntactic
competence. Simpler models, with a form-filling approach (e.g. Strzalkowski
et al, 1998) perform quite well, up to a point, with much less investment
in discourse modeling. There is a span of such models, ranging from
completely general templates to very strictly task-oriented and tailored
extraction patterns; the middle ground between them is claimed by hybrid
approaches, which indeed are just that: combinations of both rather than
bridges between them (Kan and McKeown, 1999). The greater effort of
building a narration or discourse model has yet to prove useful: the gap
between text and discourse model has not yet been usefully bridged. It is to
this type of model that our contribution would most clearly be of benefit:
these systems need to form a clearer understanding of the informational
rationale of syntax.
Lexical chains
Several approaches try to establish lexical chains in text as a basis for
understanding content for either indexing for retrieval or summarization
tasks. Lexical units in the text are picked out by some algorithm, possibly
after the text is segmented (e.g. Barzilay and Elhadad, 1997), and
relations between units are established using terminological models. Many
of these models utilize text segmentation algorithms, based on occurrence
statistics (e.g. Hearst, 1994 or Reynar, 1994), or thesauri and
terminological databases, or cue or trigger phrases (e.g. Boguraev and
Kennedy, 1997) of some sort. These types of model tend to be quite
successful, but they are often quite atheoretical and will be difficult to
improve using theoretical add-on models.
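The chaining step described above can be sketched in a few lines. This is a toy illustration under stated assumptions: the miniature thesaurus is invented for the example and stands in for the terminological models that real systems use, the function names are hypothetical, and no segmentation or word sense disambiguation is attempted. It is not the algorithm of any of the cited papers.

```python
import re
from collections import defaultdict

# Hypothetical miniature thesaurus: surface word -> concept label.
TOY_THESAURUS = {
    "car": "vehicle",
    "truck": "vehicle",
    "vehicle": "vehicle",
    "man": "person",
    "driver": "person",
}


def lexical_chains(sentences, thesaurus):
    """Group word occurrences by concept; each chain lists (sentence index, word)."""
    chains = defaultdict(list)
    for idx, sentence in enumerate(sentences):
        for word in re.findall(r"[a-z]+", sentence.lower()):
            concept = thesaurus.get(word)
            if concept is not None:
                chains[concept].append((idx, word))
    return dict(chains)


def strongest_chain(chains):
    """Pick the concept with the most occurrences, then the widest sentence spread."""
    return max(
        chains,
        key=lambda c: (len(chains[c]), len({idx for idx, _ in chains[c]})),
    )


sentences = [
    "The car stopped.",
    "A man left the vehicle.",
    "The truck followed the car.",
]
chains = lexical_chains(sentences, TOY_THESAURUS)
print(chains)
print(strongest_chain(chains))
```

Note that nothing in this procedure looks at clause structure: a mention in a backgrounded clause counts as much as one in a foregrounded clause, which is the gap the present project is concerned with.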
Application area: Information extraction
A typical application for textual modeling is to extract pieces of
information (see e.g. MUC or TDT proceedings, available from NIST). In
general, information extraction systems are knowledge-intensive, putting a
fair amount of effort into building a model of the domain that can be used
to predict content of texts analyzed. The syntactic analysis used, if any,
is used to aid in retrieving the type of content that already has been
predicted to occur; systems typically do not discover new informational
structure but work with a given model of knowledge and an amorphous model
of text structure. Variation in clause structure is typically viewed as
noise, and the systems put some effort into normalizing that variation
without retaining the information clause structure might carry for topical
progression (e.g. Grishman, 1997). Information extraction uses a
high-level model of topic combined with very local coherence analysis to
solve its task. Since this application area is highly knowledge-intensive
and task-tailored, general purpose models will probably, at least initially,
dampen results rather than improve them. Our models will be useful here
only insofar as they improve the general syntactic processing tools the
systems make use of.
Application area: Summarization
Summarization is most often implemented as the selection of sentences in
text, based on those sentences' supply of content words (e.g. starting from
Earl, 1970; Edmundson, 1969 and continuing to most available tools
today). This is the area where qualities of single clauses might be most
useful for future improvement of system performance; the drawback for our
present project is that the evaluation of summarization tools is
non-trivial and subjective, and will not easily lend itself to proving the
utility of our approach.
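Sentence selection by content-word supply can be illustrated with a short sketch. The stop word list, the sentence splitter, and the scoring function (average document frequency of a sentence's content words) are simplifying assumptions made for this example; they are in the spirit of the extraction approach described above but do not reproduce the method of any cited paper.

```python
import re
from collections import Counter

# Small illustrative stop word list; real systems use much larger ones.
STOPWORDS = frozenset(
    "a an and are as be by for in is it of on that the this to was with".split()
)


def split_sentences(text):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Lowercased alphabetic tokens that are not stop words."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def extract_summary(text, n=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    best = sorted(range(len(sentences)),
                  key=lambda i: score(sentences[i]), reverse=True)[:n]
    return [sentences[i] for i in sorted(best)]


text = ("Transitivity models events. "
        "Transitivity models clauses and events in text. "
        "Weather was nice.")
print(extract_summary(text, n=1))
```

A clause-level feature such as a transitivity score could enter this scheme as an additional term in `score`; how to weight it is an open question.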
Application area: Information retrieval
Pure retrieval systems usually invest a fair amount of effort into
completely ignoring text as text. Some exceptions include experiments to
statistically process syntactic relations in text (e.g. Strzalkowski et al,
1997) to find typical relations entities engage in (in a sort of
small-scale version of extraction technology) and others trying to
establish reference chains in text (e.g. Liddy, 1994) to sort out
occurrence frequencies obscured by anaphora. Information retrieval
systems typically have neither textual models nor local coherence models to
guide their analysis of texts; word occurrence statistics are good enough
for the tasks these systems are used for at present. While we are unlikely
to impress information retrieval system engineers with syntactic and
semantic niceties, the evaluation framework provided by information
retrieval systems is useful enough for testing our future algorithms.
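As a reminder of how little textual structure such systems need, here is a minimal sketch of ranking by word occurrence statistics alone. The weighting used (raw term frequency times the logarithm of inverse document frequency) is one common textbook variant chosen for illustration; the function names and the sample documents are assumptions made for this example.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercased alphabetic tokens; no stemming, no syntax."""
    return re.findall(r"[a-z]+", text.lower())


def rank(query, documents):
    """Return indices of documents with a non-zero score, best first."""
    counts = [Counter(tokenize(d)) for d in documents]
    n_docs = len(documents)

    def idf(term):
        df = sum(1 for c in counts if term in c)
        return math.log(n_docs / df) if df else 0.0

    query_terms = tokenize(query)
    scored = [
        (sum(c[t] * idf(t) for t in query_terms), i)
        for i, c in enumerate(counts)
    ]
    # Sort by descending score, breaking ties by document order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [i for score, i in scored if score > 0]


documents = [
    "the clause carries topic",
    "the parser reads the clause quickly",
    "retrieval ignores syntax",
]
print(rank("topic clause", documents))
```

In an evaluation of the kind discussed here, the term counts could be replaced or reweighted by clause-level information, and the resulting rankings compared against this baseline.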
Bottlenecks
Bottlenecks for most approaches described above have to do with processing
capacity. We have better basic processing power than most, through our
access to the Conexor FDG, and must concentrate on using it to its best
capacity. We should try to extend its processing power beyond the clause to
the textual level.
Open Research Questions
The first research question we formulated in our research plan was "What
makes a text a text?" Clearly, this is a question others have asked before.
Current work in text understanding is plentiful and partially successful.
Most of the work in our field - that of language engineering - is based on
statistical models of term occurrence, whether along lexical chains, in
sentence extraction algorithms, or using thesauri as a domain model. The
main exception is centering and other related and non-related anaphora
resolution approaches. Most of the effort being put into text analysis
today is along the lines of the theme system in Halliday's analysis.
The arguably primary aspect of the clause is that of its cognitive content,
and its relation to the other systems: this is currently measured using very
simple statistically based models or thesaurus-based models of lexical
cohesion. The study of transitivity would raise the sophistication of this
system to match that of the study of theme and topicality per se. This gives
us the task of primarily concentrating on transitivity as a high level
description of clause content, function, and structure; once we have done
so, it can be connected to the discourse through the efforts of other
projects as outlined above, and as an end result we will gain more knowledge
of the structure of texts.
Since our hypothesis is that clauses bear different roles in a text, and
that these roles at least in part are communicated through their semantic
role structure, this is where we should concentrate our efforts. The
mechanisms modeled by transitivity are strongly encoded in syntax, and thus
largely language specific in their encoding. However, their function is
not. The utility of building a transitivity-based model of text will be
language independent.
Our work seems to hold great promise for providing empirical data
towards building further syntactic analysis tools, with the ambition of
building a more complete yet practical model of text. Transitivity on the
clause level is one of the key factors in understanding information
organization on the textual level, and as of now, an untapped resource.
Evaluation
It is clear that the most promising avenue of application lies in
summarization, especially with application to multiple documents. But we
cannot currently evaluate summarization performance easily, even if we had
recourse to a summarization tool to improve; we should instead concentrate
on building a tool for semantic role analysis and see if we can evaluate it
in retrieval applications as an indexing mechanism, which will be more
straightforward as an evaluation and require less application-oriented work.
References
Barzilay, Regina and Michael Elhadad. 1997. "Using lexical chains for text summarization". In ACL/EACL Workshop on Intelligent Scalable Text Summarization,
pages 10-17.
Boguraev, Branimir and Christopher Kennedy. 1997. "Salience-based content characterisation of text documents". In ACL/EACL Workshop on Intelligent Scalable Text Summarization.
Corston-Oliver, Simon. 1998. "Beyond string matching and cue phrases". In
Proceedings of AAAI 98 Spring Symposium on Intelligent Text Summarization.
Dahl, Östen, and Fred Karlsson. 1976.
"Verbien aspektit ja objektin sijamerkintä: vertailua suomen ja venäjän välillä". Sananjalka 18, 1976, 28-52.
Earl, Lois L. 1970. "Experiments in automatic extracting and indexing". Information Storage and Retrieval 6:313-334.
Edmundson, H. P. 1969. "New methods in automatic abstracting".
Journal of the Association for Computing Machinery 16:264-285.
Grishman, Ralph. 1997.
"Information Extraction: Techniques and Challenges".
In Information Extraction (International Summer School SCIE-97), edited by Maria Teresa Pazienza. Springer-Verlag.
Grosz, Barbara J., Aravind K. Joshi, and Scott Weinstein. 1995.
"Centering: A Framework for Modelling the Local Coherence of
Discourse". Computational Linguistics 21:2, 203-227.
Hajicová, Eva. 1993. Issues of Sentence Structure and Discourse
Patterns, volume 2 of Theoretical and Computational
Linguistics. Prague: Institute of Theoretical and Computational
Linguistics, Charles University.
Halliday, M. A. K. 1967. "Notes on Transitivity and Theme in English." Journal of Linguistics 3:37-81, 199-244.
Halliday, M. A. K. 1978. Language as social semiotic. London: Edward Arnold Ltd.
Hearst, Marti. 1994. "Multi-Paragraph Segmentation of Expository Text".
Proceedings of the 32nd Annual Meeting of the Association for Computational
Linguistics (Las Cruces, June 1994). ACL.
Hopper, Paul J., 1979.
Aspect and foregrounding in discourse.
In Syntax and Semantics, Vol 12, pp. 213-241. Academic Press.
Hopper, Paul J. and Sandra Thompson. 1980. "Transitivity in Grammar and Discourse", Language, 56:2, pp. 251-299.
Jarvella, Robert. 1979. "Immediate memory and discourse processing" in
G.B. Bower (ed) The psychology of learning and motivation
Vol. 13. New York: Academic Press.
Kan, Min-Yen and Kathleen McKeown. 1999. Information Extraction and
Summarization: Domain Independence through Focus Types. Columbia University
Technical Report CUCS-030-99.
Lehnert, Wendy G. 1980. "Narrative text summarization". In Proceedings of the
First National Conference on Artificial Intelligence, 1980.
Liddy, Elizabeth. 1993. "Development and implementation of a discourse model
for newspaper texts". Dagstuhl Seminar on Summarizing Text for Intelligent Communication.
Marcu, Daniel. 1996. Building up rhetorical structure trees. In Proceedings of the Thirteenth National Conference on Artificial Intelligence, pages 1069-1074.
Marcu, Daniel. 1997. From discourse structures to text summaries. In Proceedings of the ACL/EACL Workshop on Intelligent Scalable Text Summarization.
Marcu, Daniel. 1997. From local to global coherence: A bottom-up approach to text planning. In Proceedings of the Fourteenth National Conference on Artificial
Intelligence.
Marcu, Daniel. 1997. The rhetorical parsing of natural language texts. In Proceedings of the 35th Annual Meeting of the Association for Computational Linguistics,
pages 96-103.
Marcu, Daniel. 1998. To build text summaries of high quality, nuclearity is not sufficient. In AAAI-98 Spring Symposium on Intelligent Text Summarization.
Reynar, Jeffrey C. 1994.
"An Automatic Method of Finding Topic Boundaries".
Proceedings of the 32nd Annual Meeting of the Association for Computational Linguistics (Las Cruces, June 1994). ACL.
Strzalkowski, Tomek, Louise Guthrie, Jussi Karlgren, Jim Leistensnider,
Fang Lin, Jose Perez-Carballo, Troy Straszheim, Jin Wang, Jon Wilding.
1997. "Natural Language Information Retrieval: TREC-5 Report".
Proceedings of the fifth Text Retrieval Conference, Donna Harman
(ed.), NIST Special Publication, Gaithersburg: NIST.
Strzalkowski, Tomek, Jin Wang, and Bowden Wise. 1998. "A robust practical
text summarizer". In AAAI 98 Spring Symposium on Intelligent Text
Summarization, pages 26-33.