This web page contains the solution to the third part (memory-based learning) of the course on machine learning taught at the graduate school of language technology, fall 2003. The authors of this page are Fredrik Olsson and Magnus Sahlgren.
Bag-of-concepts (BoC) is a method for text representation that aims at capturing the content of the text. In short, the idea is to use a higher-level representational scheme that allows the learning algorithm to overcome problems with vocabulary variation, and to base its decisions on the content of the text rather than on the surface word forms. This approach might be useful for higher-level (semantic/pragmatic) tasks where traditional word-based methods fail.
In the first set of experiments, we wanted to investigate how TiMBL reacted to different context parameters, and to using 90-D dense (Random Mapping) vs. 1000-D sparse (Random Indexing) vectors. We used the English MapTask and the Finnish InterAct data, excluded words with a frequency below 2, and used a single 90/10 train/test split (i.e. no cross-validation).
All results in the tables below are given in percentage of correct classifications (i.e. classification accuracy). BoI means reduced representations (i.e. BoW representations with reduced dimensionality), Doc means document-based contexts, and 2+2, 2+0 and 0+2 are the sizes of the word-based contexts. Also, the context windows (2+2, 2+0, 0+2) were distance-weighted.
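To make the word-based contexts concrete, here is a minimal sketch of how Random Indexing could accumulate context vectors from a distance-weighted window, and how a bag-of-concepts vector for a text could then be formed by summing the context vectors of its words. This is not the code used in the experiments. The function names, the number of non-zero elements per index vector, and the weighting `2 ** (1 - distance)` are assumptions made for illustration only; the page does not state which weighting function was used.

```python
import numpy as np

def index_vector(rng, dim, nonzeros):
    """Sparse ternary random index vector: a few +1/-1 entries, the rest 0."""
    vec = np.zeros(dim)
    positions = rng.choice(dim, size=nonzeros, replace=False)
    vec[positions] = rng.choice([-1.0, 1.0], size=nonzeros)
    return vec

def context_vectors(sentences, left=2, right=2, dim=1000, nonzeros=10, seed=0):
    """Accumulate one context vector per word from its neighbours' index vectors.

    left/right give the window size, so (2, 2), (2, 0) and (0, 2) correspond
    to the 2+2, 2+0 and 0+2 contexts.
    """
    rng = np.random.default_rng(seed)
    vocab = sorted({w for sent in sentences for w in sent})
    index = {w: index_vector(rng, dim, nonzeros) for w in vocab}
    context = {w: np.zeros(dim) for w in vocab}
    for sent in sentences:
        for i, word in enumerate(sent):
            lo = max(0, i - left)
            hi = min(len(sent), i + right + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                distance = abs(i - j)
                # Assumed weighting: 1.0 for adjacent words, 0.5 at distance 2.
                weight = 2.0 ** (1 - distance)
                context[word] += weight * index[sent[j]]
    return context

def bag_of_concepts(tokens, context, dim):
    """Represent a text as the sum of the context vectors of its known words."""
    doc = np.zeros(dim)
    for w in tokens:
        if w in context:
            doc += context[w]
    return doc
```

With `left=2, right=0`, the first word of an utterance receives no context at all, which is one way the asymmetric windows differ from the 2+2 window.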
As a comparison, BoW (Bag-of-Words) representations produce 48.66% for the English MapTask data, and 45.70% for the Finnish InterAct data.
| | EN/MapTask 90-D | EN/MapTask 1000-D | FI/InterAct 90-D | FI/InterAct 1000-D |
| BoI | 45.88 | 47.83 | 40.62 | 44.53 |
| Doc | 48.23 | 48.23 | 44.14 | 45.31 |
| 2 + 2 | 47.70 | 47.87 | 43.35 | 44.53 |
| 2 + 0 | 47.44 | 47.60 | 45.31 | 45.31 |
| 0 + 2 | 47.80 | 47.77 | 44.53 | 43.75 |
Things to note:
Next, we wanted to see if tweaking TiMBL's parameters could lead to an increase in performance. Preferably, the results should be comparable with those reached by our own kNN algorithm. Thus, we used different values of k and different class voting weights.
The results are produced using 90-D dense (RM) vectors with document-based contexts and the English MapTask data. We used the 90-D vectors in these runs because they did not require as much memory and execution time as the 1000-D vectors: each run with the 90-D vectors took 299MB RAM and between 1 and 2 hours to complete, depending on the size of k. M is for MVDM, O is for Overlap, MajV is for default majority voting, InvD is for Inverse Distance weighting, and InvL is for Inverse Linear weighting.
| | M/MajV | M/InvD | M/InvL | O/MajV | O/InvD | O/InvL |
| k = 1 | 45.72 | 45.75 | 45.69 | 48.23 | 48.26 | 48.26 |
| k = 3 | 46.21 | 46.35 | 45.82 | 49.55 | 49.25 | 48.39 |
| k = 5 | 46.84 | 47.27 | 46.41 | 50.11 | 50.04 | 49.38 |
| k = 7 | 46.97 | 47.90 | 47.01 | 50.41 | 50.61 | 49.75 |
| k = 9 | 47.80 | 48.00 | 47.07 | 50.34 | 50.87 | 50.01 |
| k = 15 | 47.34 | 47.50 | 47.80 | 49.68 | 50.64 | 50.77 |
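The three class voting schemes compared in the table above can be sketched as follows. This is an illustrative kNN classifier, not TiMBL's implementation: the function names are hypothetical, the Inverse Distance and Inverse Linear formulas follow the definitions commonly given for these schemes, and `k` here counts the k nearest instances, which is a simplification (TiMBL's k, as we understand it, refers to the k nearest distances).

```python
from collections import defaultdict

def overlap_distance(a, b):
    """Overlap metric: number of feature positions whose values differ."""
    return sum(1 for x, y in zip(a, b) if x != y)

def vote_weights(distances, scheme):
    """Vote weights for neighbours sorted by increasing distance."""
    if scheme == "MajV":
        # Majority voting: every neighbour counts equally.
        return [1.0] * len(distances)
    if scheme == "InvD":
        # Inverse Distance: closer neighbours get larger votes.
        eps = 1e-6  # assumed small constant to avoid division by zero
        return [1.0 / (d + eps) for d in distances]
    if scheme == "InvL":
        # Inverse Linear: weight falls linearly from 1 (nearest) to 0 (farthest).
        nearest, farthest = distances[0], distances[-1]
        if farthest == nearest:
            return [1.0] * len(distances)
        return [(farthest - d) / (farthest - nearest) for d in distances]
    raise ValueError(f"unknown voting scheme: {scheme}")

def knn_classify(train, query, k=9, scheme="InvD"):
    """Classify query from (features, label) pairs using weighted kNN voting."""
    ranked = sorted(train, key=lambda item: overlap_distance(item[0], query))[:k]
    distances = [overlap_distance(features, query) for features, _ in ranked]
    weights = vote_weights(distances, scheme)
    votes = defaultdict(float)
    for (_, label), weight in zip(ranked, weights):
        votes[label] += weight
    return max(votes, key=votes.get)
```

With majority voting, two distant neighbours of one class can outvote a single exact match of another class; both weighted schemes reduce that effect, which is one reason the choice of scheme matters more as k grows.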
The default Overlap metric consistently outperforms the MVDM metric. This is somewhat surprising, since the MVDM metric has been reported to outperform the Overlap metric in other published experiments on DACT classification (granted, using different data; Lendvai et al. 2003). Regarding the size of k, it seems as if optimal performance is reached when k > 7. This is consistent with our previous experiments, and with published reports (Lendvai et al. 2003). Turning to the class voting weights, it seems as if the Inverse Distance weighting generates the best results in this setup. This is not consistent with Lendvai et al.'s findings.
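For readers unfamiliar with the two metrics, the following sketch contrasts them on a single symbolic feature. Overlap only checks whether two values are identical, while MVDM (Modified Value Difference Metric) compares the class distributions observed with each value in the training data. The function names and the toy data are hypothetical, and the handling of unseen values is an assumption; how the vector elements were treated as feature values in our runs is not covered here.

```python
from collections import Counter, defaultdict

def class_distributions(train, feature):
    """Estimate P(class | value) for one feature position from training data."""
    counts = defaultdict(Counter)
    for features, label in train:
        counts[features[feature]][label] += 1
    return {
        value: {c: n / sum(counter.values()) for c, n in counter.items()}
        for value, counter in counts.items()
    }

def mvdm(dist, v1, v2):
    """MVDM: sum over classes of |P(c | v1) - P(c | v2)|."""
    # Assumption: an unseen value gets an empty (all-zero) distribution.
    p1, p2 = dist.get(v1, {}), dist.get(v2, {})
    classes = set(p1) | set(p2)
    return sum(abs(p1.get(c, 0.0) - p2.get(c, 0.0)) for c in classes)

def overlap(v1, v2):
    """Overlap: 0 if the two values are identical, 1 otherwise."""
    return 0.0 if v1 == v2 else 1.0

# Toy data: one feature (a word), with a class label per instance.
toy = [(("yes",), "ack"), (("yeah",), "ack"), (("no",), "reject"), (("yes",), "ack")]
dist = class_distributions(toy, feature=0)
```

In this toy data, "yes" and "yeah" always co-occur with the same class, so MVDM treats them as identical while Overlap treats them as maximally different. MVDM depends on reliable estimates of the class distributions, so it needs each feature value to recur often enough in the training data.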