UPGRADE YOUR BROWSER! If you see this message, it means your browser, which is CCBot/1.0 (+http://www.commoncrawl.org/bot.html), does not support current webstandards. Please, see the webstandards project.

This web page contains Magnus Sahlgren's term paper proposal for the course on machine learning taught at the graduate school of language technology, fall 2003. Author of this page is Magnus Sahlgren.

A study of different vector-based representational schemes for natural language-based machine learning

The goal of the project is to investigate how different representations of a natural language problem affect different machine learning algorithms. Standard practice in NLP-oriented ML is to use traditional BoW (Bag-of-Words) representation of the data. This has proven to be a viable and easily implemented approach for many NLP problems. However, there are problems with the BoW approach, including:

  1. it is not scalable, since the dimensionality of the representations equals the size of the vocabulary (unless we do some, more or less ill-justified, feature selection), and
  2. it is liable to vocabulary variation, since it cannot handle synonymy and polysemy.

One way of solving the first problem is to use reduced BoW representations, where the dimensionality does not depend on the size of the data. Essentially, this boils down to using some form of dimension reduction, such as factor analytic methods (including SVD (Singular Value Decomposition) and PCA (Principal Component Analysis)), or random indexing methods such as Random Indexing (RI) or Random Mapping (RM). In my project, I wish to further investigate the effect of using RI and RM to reduce the dimensionality of BoW representations.

To solve the second problem (or at least the synonymy part of it), we can use coocurrence-based approaches that represent text data as Bag-of-Concepts (BoC). The idea here is to represent words by their distributional profiles, so that similar words get similar representations. This can be accomplished using, e.g., the Random Indexing or Random Mapping techniques mentioned above. In the project, I wish to study how BoC representations can be produced using different kinds of cooccurrences.

Thus, I propose to compare the following representational schemes:

  1. Standard BoW,
  2. BoW with feature selection (possibly different kinds of feature selection),
  3. Reduced representations (RI vs. RM),
  4. BoC with different definitions of context (possibly RI vs. RM).

I will use text categorization with the Reuters-21578 collection as the task, since this is a well understood problem and a well studied test collection with an abundance of published results. If time permits, I also want to use DACT (Dialogue ACT) recognition data, since this is a considerably more difficult problem than "standard" text categorization. In this case, I will use the English MapTask data and the Finnish Interact corpus. By using two typologically different languages, I can also investigate whether different representational schemes are more suited for different types of languages.

The working hypothesis (which seems fairly obvoius) is that different kinds of representations will be viable for different kinds of problems. It does not seem as a controversial assumption that higher-level NLP problems (such as DACT recognition) require more semantic/pragmatic knowledge than lower-level (or word-based) ones, such as text categorization. For example, it is common knowledge that standard BoW representations produce good results on the Reuters data, but one might conjecture that for some of the classes in the Reuters data, BoC representations might actually be more suitable. Initial experiments in DACT recognition supports this hypothesis.

The impact of using different representations for different tasks is an interesting subject in its own right. However, the focus in my proposed project is to first and foremost investigate whether different representations affect different machine learning algorithms in different ways. That is, I wish to raise (at least) the following questions:

In order to answer these questions, I need to select a pertinent set of learning algorithms. As a preliminary selection, I propose to compare a Support Vector Machine (SVM), an instance-based algorithm (kNN and/or Rocchio), and possible Learning Vector Quantization (LVQ) or some form of ANN.

The outcome of the project will be a nice little table (or tables, depending on how many tasks I end up doing) with performance measures for X algorithms using Y representations:

  BoW RedRep BoC
SVM      
kNN      
LVQ