Phase 3 (January 1998 - December 1999)

This section describes the current state of SVENSK, phase three. As of now, the work is funded until December 1999. The most urgent short-term goal is to wrap the system up so that it can be distributed. We have set the deadline for achieving this goal to April 25, 1999.

Currently Supplied LE Components

Currently, there are nine different components integrated in the SVENSK system, ranging in task from automated language identification to shallow semantic analysis.

TextCat

The TextCat program uses n-gram models to identify the language of a document, choosing among a fixed set of languages. Currently, the program knows about 69 different languages. TextCat was implemented by Gertjan van Noord.
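
As an illustration of n-gram-based language identification in general, the following is a minimal sketch of a rank-order ("out-of-place") comparison of character n-gram profiles. It is not TextCat's own code: the function names, the toy training texts, and the parameters `max_n` and `top_k` are assumptions made for this example, and a real profile would be built from far more text.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top_k=300):
    """Return the most frequent character n-grams (n = 1..max_n), best first."""
    counts = Counter()
    for word in text.lower().split():
        padded = f"_{word}_"  # underscores mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count, then alphabetically, so the ranking is deterministic.
    ranked = sorted(counts, key=lambda g: (-counts[g], g))
    return ranked[:top_k]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams unknown to the language get a fixed penalty."""
    rank = {g: i for i, g in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(abs(i - rank[g]) if g in rank else penalty
               for i, g in enumerate(doc_profile))

def identify(text, lang_profiles):
    """Pick the language whose profile is closest to the document's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))

# Toy training data (assumption): one sentence per language.
TRAINING = {
    "swedish": "det är en mycket vacker dag och jag vill gärna gå ut i skogen för att plocka svamp",
    "english": "it is a very beautiful day and i would like to go out into the forest to pick mushrooms",
}
PROFILES = {lang: ngram_profile(text) for lang, text in TRAINING.items()}

print(identify("jag vill gå ut och plocka svamp i dag", PROFILES))
```

The document is assigned to the language whose profile gives the smallest total rank distance, so adding a language only requires adding one more profile.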

SweTwol-2 & SweCG

SweTwol-2 is a comprehensive morphological description of Swedish. It consists of a set of finite-state rule automata and a dictionary of approximately 80,000 entries, where a typical entry covers all inflectional forms of one word, and possibly also forms of some words derived from the entry. The output is the words in the input text together with their associated tags. There may be several tag sets associated with one word, corresponding to ambiguity in the morphological analysis. Currently no tools exist for extending the SWETWOL lexicon.

SweCG (Swedish Constraint Grammar) is a surface-oriented syntactic description of Swedish grammar. At present the grammar consists of some 1400 rules for word-form (part of speech) disambiguation, and some 200 rules for syntactic tagging. The SWECG 1.0 component disambiguates multiple morphological readings, determines the location of intra-sentential clause boundaries, and assigns syntactic roles to individual words. It picks the correct alternative with an error rate not exceeding 0.3%, leaving some 5% of the morphological ambiguities undecided. Furthermore, SWECG provides a surface syntactic analysis in terms of dependency grammar. From the SWECG surface analysis, an explicit dependency graph representation is built: representations may include multiple graphs spanning the same input in cases of ambiguity, and the representation may be a set of dependency analyses in cases where the input is incomplete or fragmentary. This component, together with a procedure that finds the start and end positions of each token, provides a GATE standard interface to the components described below. Currently, however, SWECG has to be started and loaded each time the component is invoked in SVENSK. A future extension ought to include an API so that SWECG may be run in the background.
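
The general idea of constraint-based disambiguation, in which rules discard readings in context and may leave some ambiguity undecided, can be sketched as follows. This is only a toy illustration: the tag names, the single constraint, and the data layout are assumptions for the example and do not reflect the actual SWECG rule formalism.

```python
def no_verb_after_determiner(readings, i):
    """Hypothetical constraint: reject a verb reading right after an unambiguous determiner."""
    if i > 0 and readings[i - 1] == {"DET"}:
        return {"V"}
    return set()

def apply_constraints(cohorts, constraints):
    """cohorts: list of (word, set of candidate tags). Returns the reduced cohorts."""
    readings = [set(tags) for _, tags in cohorts]
    changed = True
    while changed:  # repeat until no constraint removes anything more
        changed = False
        for i in range(len(readings)):
            for constraint in constraints:
                remaining = readings[i] - constraint(readings, i)
                # A constraint may never remove the last remaining reading.
                if remaining and remaining != readings[i]:
                    readings[i] = remaining
                    changed = True
    return [(word, tags) for (word, _), tags in zip(cohorts, readings)]

# "vara" may be a noun ("product") or a verb ("to be"); "för" stays ambiguous
# because no constraint in this toy rule set applies to it.
cohorts = [("min", {"DET"}), ("vara", {"N", "V"}), ("för", {"PREP", "CONJ"})]
print(apply_constraints(cohorts, [no_verb_after_determiner]))
```

Because rules only remove readings, a word for which no rule fires simply keeps all of its readings, which corresponds to the undecided ambiguities mentioned above.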

Both SWETWOL-2 and SWECG 1.0 are developed by LingSoft, Inc. in Helsinki.

SweCG2CLE

SweCG2CLE is a conversion module that converts SweCG tags into the corresponding feature structures, which can serve as input to the Deep-Level Unification-Based Processor (DUP).
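
A conversion of this kind can be pictured as a table-driven mapping from tags to attribute-value pairs. The sketch below is an assumption-laden illustration: the tag inventory, the feature names, and the function `tags_to_features` are hypothetical and are not taken from SweCG2CLE or from the feature system used by DUP.

```python
# Hypothetical mapping from morphological tags to (attribute, value) pairs.
TAG_FEATURES = {
    "N": ("cat", "noun"),
    "V": ("cat", "verb"),
    "UTR": ("gender", "utrum"),
    "NEU": ("gender", "neutrum"),
    "DEF": ("def", "def"),
    "INDEF": ("def", "indef"),
    "SG": ("num", "sg"),
    "PL": ("num", "pl"),
    "NOM": ("case", "nom"),
    "GEN": ("case", "gen"),
}

def tags_to_features(lemma, tags):
    """Build a flat feature structure (a dict) from a lemma and its tags."""
    features = {"lemma": lemma}
    for tag in tags:
        if tag not in TAG_FEATURES:
            raise KeyError(f"no feature mapping for tag {tag!r}")
        attribute, value = TAG_FEATURES[tag]
        features[attribute] = value
    return features

# "bilen" (the car): lemma "bil" with five tags.
print(tags_to_features("bil", ["N", "UTR", "DEF", "SG", "NOM"]))
```

Raising an error on an unknown tag, rather than silently dropping it, makes gaps in the mapping table visible early.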

Deep-Level Unification-Based Processor (DUP)

This component is based upon a Swedish unification grammar and an LR parser developed at SICS. The grammar is supplemented with a fully-defined lexicon of about 1500 words. The component will provide a relatively 'deep' level of analysis but at the cost of robustness. Initially this component will build directly upon the morpho-syntactic analysis provided by SWECG, complemented with the further refined mappings allowed for by the fully-defined lexicon. Subsequent versions will use the dependency representation interface, although experiments will be conducted to determine how much of the dependency analysis is actually required.

Domain-Specific Processor (DSP)

This component builds directly upon the dependency analysis to provide semantic representations and is aimed at research projects which require a shallower, but robust, natural language interface for a specific application (unlike DUP, there is no compositional semantic analysis). In this component, the dependency graph analyses are transformed into semantic representations by matching the graphs against domain-specific templates. This component will adopt graph-matching techniques to yield a 'best-fit' analysis (if possible, using statistical information automatically derived from the domain corpora) and, depending on the application, predictions about the input could be used to guide the graph building process.
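
Matching a dependency graph against a domain-specific template can be sketched as a small backtracking matcher over labelled edges. Everything concrete here is an assumption for illustration: the edge format (head, relation, dependent), the relation names, the "?" variable convention, and the booking template are hypothetical, and the sketch does an exact match rather than the statistical 'best-fit' matching described above.

```python
def unify(pattern, edge, bindings):
    """Match one template edge against one graph edge, extending the bindings."""
    new = dict(bindings)
    for p, value in zip(pattern, edge):
        if p.startswith("?"):          # template variable
            if new.setdefault(p, value) != value:
                return None            # variable already bound to something else
        elif p != value:               # constant must match exactly
            return None
    return new

def match(template, graph, bindings=None):
    """Return variable bindings if every template edge matches some graph edge."""
    bindings = dict(bindings or {})
    if not template:
        return bindings
    first, rest = template[0], template[1:]
    for edge in graph:
        extended = unify(first, edge, bindings)
        if extended is not None:
            result = match(rest, graph, extended)
            if result is not None:
                return result
    return None

# Dependency edges for "boka ett rum i Stockholm" (book a room in Stockholm).
graph = [("boka", "obj", "rum"), ("rum", "det", "ett"), ("boka", "loc", "Stockholm")]
booking_template = [("boka", "obj", "?what"), ("boka", "loc", "?where")]
print(match(booking_template, graph))
```

The returned bindings play the role of a simple semantic representation: the template names the slots, and the graph supplies their fillers.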

A Tokeniser for Swedish (Svensktoken)

Svensktoken, which is based on the VIE tokeniser that comes with the GATE system, functions as a pre-processing module for the sentence splitter and UCP modules described below. It uses mainly structural information, and only a limited amount of linguistic information, to recognise tokens.

The problem of tokenization is often set aside even though tokens are the basic items in many text processing systems. Tokenization is far from trivial. Some, but not all, of the tokens may be described and identified without the use of linguistic knowledge.

The tokenizer primarily uses structural information to recognise tokens, but is also equipped with a stop list containing frequent abbreviations. The list is used to distinguish tokens in which a period (.) should be treated as a sentence delimiter from tokens in which it should not. The tokenizer also recognises and signals consecutive newline characters in the input, information that may be used at later processing steps to identify headings as sentence fragments.
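
The behaviour described above can be sketched roughly as follows. This is not the Svensktoken implementation: the token kinds, the `<NEWLINES>` marker, the regular expression, and the four sample abbreviations are assumptions chosen for the example.

```python
import re

# Illustrative abbreviation list (assumption); the real list is corpus-derived.
ABBREVIATIONS = {"t.ex.", "bl.a.", "dvs.", "osv."}

# Either a run of two or more newlines, or a run of non-whitespace characters.
TOKEN_RE = re.compile(r"\n{2,}|\S+")

def tokenize(text, abbreviations=ABBREVIATIONS):
    """Return (text, kind) pairs; kind is 'word', 'abbrev', 'delimiter' or 'break'."""
    tokens = []
    for m in TOKEN_RE.finditer(text):
        chunk = m.group()
        if chunk.startswith("\n"):
            # Signal consecutive newlines so later steps can spot headings.
            tokens.append(("<NEWLINES>", "break"))
        elif chunk.lower() in abbreviations:
            # The period belongs to the abbreviation, not to the sentence.
            tokens.append((chunk, "abbrev"))
        elif len(chunk) > 1 and chunk[-1] in ".!?":
            tokens.append((chunk[:-1], "word"))
            tokens.append((chunk[-1], "delimiter"))
        else:
            tokens.append((chunk, "word"))
    return tokens

print(tokenize("Rubrik\n\nVi ses t.ex. i dag."))
```

Only the abbreviation lookup uses linguistic knowledge; every other decision is based on the shape of the input.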

The stop list of abbreviations is obtained from a 300,000 word portion of the Stockholm-Umeå Corpus. After extracting them, the abbreviations were sorted according to frequency, and all those which occurred more than once and had a period in them were picked out.
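
The selection step amounts to a frequency count followed by a filter. A minimal sketch, assuming the abbreviations have already been extracted from the corpus as a flat list of strings (the function name and the sample data are hypothetical):

```python
from collections import Counter

def build_abbreviation_list(extracted_abbreviations):
    """Keep abbreviations that occur more than once and contain a period,
    ordered from most to least frequent."""
    counts = Counter(extracted_abbreviations)
    return [abbr for abbr, n in counts.most_common() if n > 1 and "." in abbr]

# Toy input (assumption): abbreviation occurrences as found in a corpus.
sample = ["t.ex.", "bl.a.", "t.ex.", "kl", "kl", "dvs.", "bl.a.", "t.ex."]
print(build_abbreviation_list(sample))
```

Here "kl" is dropped because it has no period, and "dvs." because it occurs only once in the sample.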

A Sentence Splitter for Swedish (Svensksplit)

This sentence splitter for Swedish is a pre-processing module intended for use in the GATE platform, but it may also be run as a stand-alone program. Its primary task is to provide the Brill tagger described later on with sentences and the annotations associated with them.

The problem of splitting a text into sentences is related to that of tokenization. Possible approaches to finding and, if necessary, disambiguating sentence boundaries include systems that use rules (e.g., a hand-made or induced grammar), probabilistic models (e.g., HMMs), or some sort of neural network. Common to all approaches is that they require one or more pre-processing modules such as a tokenizer, a morphological analyser, or a program that calculates word frequencies.

The sentence splitter for Swedish implements a rule-based approach using a minimum of information, since it is not geared towards any specific corpus (thus assuming a minimum of information to be present in the input). The information present is mostly structural. The amount of linguistic knowledge is small and implicitly provided by the tokenizer.
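
A rule-based splitter that relies only on structural token information might look like the sketch below. The input format, (text, kind) pairs with the kinds 'word', 'abbrev', 'delimiter' and 'break', is an assumption made for this example, as are the two rules; the actual Svensksplit rules are not reproduced here.

```python
def split_sentences(tokens):
    """Group (text, kind) tokens into sentences using two structural rules."""
    sentences, current = [], []
    for text, kind in tokens:
        if kind == "break":
            # Rule 1: consecutive newlines close whatever precedes them,
            # so a heading without final punctuation becomes its own fragment.
            if current:
                sentences.append(current)
                current = []
        else:
            current.append(text)
            if kind == "delimiter":
                # Rule 2: a sentence delimiter ends the current sentence.
                sentences.append(current)
                current = []
    if current:  # trailing material without a final delimiter
        sentences.append(current)
    return sentences

tokens = [
    ("Rubrik", "word"), ("<NEWLINES>", "break"),
    ("Vi", "word"), ("ses", "word"), ("t.ex.", "abbrev"),
    ("i", "word"), ("dag", "word"), (".", "delimiter"),
    ("Kom", "word"), ("!", "delimiter"),
]
print(split_sentences(tokens))
```

The abbreviation "t.ex." does not end a sentence because the decision about its period was already made when the token was classified, which is the sense in which the linguistic knowledge is provided implicitly by the tokenizer.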

A Brill PoS-tagger for Swedish (UppBrill)

The Brill tagger uses a method called transformation-based error-driven learning to infer rules from a tagged training corpus. The rules can then be applied to tag previously unseen text. The tagger was written by Eric Brill and is public domain software available from ftp://ftp.cs.jhu.edu/pub/brill/Programs.
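
The principle of transformation-based error-driven learning can be sketched in a few lines: start from an initial tagging, repeatedly pick the transformation that removes the most errors against the gold-standard tags, and stop when no transformation helps. The sketch below is a simplification and not Brill's program: it uses a single rule template ("change tag A to B when the previous tag is P"), and the class and function names are assumptions for this example.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    from_tag: str
    to_tag: str
    prev_tag: str  # the rule fires when the preceding tag equals this

def apply_rule(tags, rule):
    """Return a new tag list with the rule applied from left to right."""
    out = list(tags)
    for i in range(1, len(out)):
        if out[i] == rule.from_tag and out[i - 1] == rule.prev_tag:
            out[i] = rule.to_tag
    return out

def count_errors(tags, gold):
    return sum(t != g for t, g in zip(tags, gold))

def learn_rules(initial_tags, gold, max_rules=10):
    """Greedily learn an ordered list of rules that reduce tagging errors."""
    tags, learned = list(initial_tags), []
    tagset = sorted(set(gold) | set(tags))
    candidates = [Rule(a, b, p) for a in tagset for b in tagset if a != b for p in tagset]
    while len(learned) < max_rules:
        best = min(candidates, key=lambda r: count_errors(apply_rule(tags, r), gold))
        new_tags = apply_rule(tags, best)
        if count_errors(new_tags, gold) >= count_errors(tags, gold):
            break  # no candidate reduces the error count any further
        learned.append(best)
        tags = new_tags
    return learned

# "expected to race the car": "race" is initially mis-tagged as a noun.
initial = ["VBN", "TO", "NN", "DT", "NN"]
gold = ["VBN", "TO", "VB", "DT", "NN"]
print(learn_rules(initial, gold))
```

Tagging unseen text then consists of assigning initial tags and applying the learned rules in the order in which they were learned.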

The Department of Linguistics, Uppsala University, has been training a Brill tagger for Swedish within the framework of the ETAP project. The criteria for the training corpus are that the texts must be in Swedish, translated into other languages, correct, and available. The texts used for this purpose include, among others, Regeringsförklaringen (Eng. Statement of government policy) as presented to the Swedish parliament in 1994.

Uppsala Chart Processor (UCP)

The Uppsala Chart Processor, UCP, was originally developed at Uppsala University in the early 1980s. The current version is written in COMMON LISP, but there will be a C version as well.

UCP is able to process the input in a variety of ways and with respect to different levels of linguistic description, e.g., only morphological processing of the input, or syntactic processing using a top-down filtering, chart-guided bottom-up parsing approach. The version of UCP integrated in the SVENSK system is intended to function as a morphological analyzer.

UCP as such has been used within several research projects, most recently as a part of a grammar checker for controlled Swedish in the SCANIA project.

Although UCP can be told to use various rule invocation strategies, the most common one is bottom-up (possibly with top-down filtering) using a left-to-right search of the input. That is, the input is read from the left, and the tokens in it are matched against the lexicon and placed in the chart for further processing.
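
A generic bottom-up chart strategy of this kind, applied to morphological analysis, can be sketched as follows. This is not UCP or its rule formalism: the tiny lexicon, the two combination rules, and the edge format (start, end, label) are assumptions for illustration, and top-down filtering is left out.

```python
# Hypothetical lexicon and binary combination rules for a few Swedish noun forms.
LEXICON = {"bil": "Stem", "ar": "Plural", "na": "Definite"}
RULES = {("Stem", "Plural"): "Noun", ("Noun", "Definite"): "Noun"}

def analyse(word):
    """Return the chart: a set of (start, end, label) edges over the word."""
    chart, agenda = set(), []
    # Read the input from the left and enter every lexicon match as an edge.
    for start in range(len(word)):
        for form, label in LEXICON.items():
            if word.startswith(form, start):
                agenda.append((start, start + len(form), label))
    # Bottom-up: combine each new edge with adjacent edges already in the chart.
    while agenda:
        edge = agenda.pop()
        if edge in chart:
            continue
        chart.add(edge)
        start, end, label = edge
        for other_start, other_end, other_label in list(chart):
            if other_end == start and (other_label, label) in RULES:
                agenda.append((other_start, end, RULES[(other_label, label)]))
            if end == other_start and (label, other_label) in RULES:
                agenda.append((start, other_end, RULES[(label, other_label)]))
    return chart

# "bilarna" (the cars) = bil + ar + na
chart = analyse("bilarna")
print((0, len("bilarna"), "Noun") in chart)
```

An analysis counts as complete when the chart contains an edge spanning the whole word; partial edges remain in the chart and can be inspected when no complete analysis is found.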

Updated at 10:47:06 on 991020.