Tense Prediction

Background

Past and present tense sound alike for most regular verbs in English. This makes it difficult to distinguish the two forms in speech recognition tasks. The surrounding text should contain enough information to predict which form is used with considerable accuracy. Most notably, the other verbs in the text should indicate whether the text is in the past or the present tense.

Tense Progression N-Grams?

One idea we discussed previously, and which Satoshi apparently tested, was to use tense n-grams. However, n-grams model only very local context. This is a problem, since many verbs in a text do not follow the tense of the text as a whole: e.g. all verbs in indirect contexts such as quoted speech, and verbs in attributive constructions or in subordinate clauses. Instead of n-grams, Christer Samuelsson suggested that I use the ratio between past and present tense forms in the entire text as a distinguishing criterion, since this measure is more global.

Material

I looked at one issue of the Wall Street Journal, processed by ENGCG (wsj_08*.engcg). The ratio of present to past tense form frequencies varies from about 0.1 (e.g. WSJ900801-0075: 9 present, 78 past: 0.115) to over 5 (e.g. WSJ900801-0061: 11 present, 2 past: 5.5), with an average of 1.2, suggesting that there is plenty of information to be tapped. The distribution of ratios for this issue is shown in the following graph.

0    ****************************************
0.16 **********
0.2  ***
0.25 *****
0.33 *****
0.5  *******************
1    *************************************
2    *************************
3    ************
4    *****
5    ****
6    ***
> 6  *********

So, given a text, I look at the ratio of present to past in the cases where the two forms are auditively distinct: there are plenty of irregular verbs (go - went) and verbs with regular stem changes (win - won), as well as verbs which end in dental stops (-t, -d), to use as fixed points.

I extract probabilities (train.perl) on the training material (wsj_0702.engcg) using several different settings.

Details

The probabilities are based on the distribution of observed relative frequencies. If f(pres,doc) and f(past,doc) are the frequencies of present and past tense forms in a document, respectively, and q(doc) is the ratio f(pres,doc)/(f(past,doc)+1), then the values of q for the training set are tabulated as in the following table.

q(doc)>k
k         |    6    5    4    3    2    1  0.5 0.33 0.25 0.20 0.17    0 |   sum
-------------------------------------------------------------------------------
f(pres)   | 2007  427  445 1151  719 1997 1554  743  193   78    9  301 |  9624
f(past)   |  322  104  115  465  325 1548 1966 1374  481  235   24 1117 |  8076
-------------------------------------------------------------------------------
sum per q | 2329  531  560 1616 1044 3545 3520 2117  674  313   33 1418 | 17700
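
As a rough sketch of how such a table could be built (this is not the original train.perl, whose details are not given here): the function names q, bucket and tabulate are hypothetical, the bucketing rule (largest k with q > k, everything else in the 0 column) is an assumption, and so is the choice to add a whole document's counts to a single column.

```python
from collections import defaultdict

# Column thresholds k, copied from the table header, largest first.
THRESHOLDS = [6, 5, 4, 3, 2, 1, 0.5, 0.33, 0.25, 0.20, 0.17, 0]

def q(pres, past):
    """Ratio f(pres,doc) / (f(past,doc) + 1) as defined in the text."""
    return pres / (past + 1)

def bucket(q_value):
    """Assumed rule: the largest k with q > k; q == 0 falls into column 0."""
    for k in THRESHOLDS:
        if q_value > k:
            return k
    return 0

def tabulate(docs):
    """docs: iterable of (pres, past) counts, one pair per document.

    Assumption: each document adds all of its counts to one column.
    """
    table = defaultdict(lambda: {"pres": 0, "past": 0})
    for pres, past in docs:
        k = bucket(q(pres, past))
        table[k]["pres"] += pres
        table[k]["past"] += past
    return dict(table)

# Illustrative (pres, past) counts taken from examples in the text.
table = tabulate([(40, 6), (9, 78), (11, 2)])
```

With these inputs the document with 40 present and 6 past forms lands in the k = 5 column, matching the worked example in the text.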

The probability that a verb in a text is of either tense, given the tense frequencies observed so far, is then simply approximated by the relative frequency of f(pres) or f(past) in the corresponding column of the table. If, for example, 40 present tense forms and 6 past tense forms have been observed in a text so far:

q(doc) = 40/(6+1) = 5.7

and

p(pres|q(doc)>5) = f(pres)/(f(past)+f(pres)) = 427/531 = 0.80
p(past|q(doc)>5) = f(past)/(f(past)+f(pres)) = 104/531 = 0.20
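
The same arithmetic can be checked for any column of the table. The following is a minimal sketch; the counts are copied from the table above, while the dictionary and function names are hypothetical.

```python
# Counts per column k, copied from the training table.
F_PRES = {6: 2007, 5: 427, 4: 445, 3: 1151, 2: 719, 1: 1997,
          0.5: 1554, 0.33: 743, 0.25: 193, 0.20: 78, 0.17: 9, 0: 301}
F_PAST = {6: 322, 5: 104, 4: 115, 3: 465, 2: 325, 1: 1548,
          0.5: 1966, 0.33: 1374, 0.25: 481, 0.20: 235, 0.17: 24, 0: 1117}

def tense_probs(k):
    """Return (p(pres), p(past)) for column k as relative frequencies."""
    pres, past = F_PRES[k], F_PAST[k]
    total = pres + past
    return pres / total, past / total

p_pres, p_past = tense_probs(5)
print(f"{p_pres:.2f} {p_past:.2f}")  # prints "0.80 0.20"
```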

Settings used

08probs.prespret

Uses all verbs as the observed criterion. In the real case this is no good: we will not have this information for the test material. (prespret? pres is for present, pret for preterite, another name for past.)

f(pres)/f(past)   p(pres)  p(past)
0                 0.085    0.915
0.17              0.167    0.833
0.2               0.186    0.814
0.25              0.233    0.767
0.33              0.302    0.698
0.5               0.425    0.575
1                 0.587    0.413
2                 0.720    0.281
3                 0.783    0.217
4                 0.827    0.173
5                 0.857    0.143
6                 0.906    0.094

08probs.ab

Uses "strong" verbs: those with a stem change or other irregularity. I filtered out those with identical present and past forms (e.g. put - put). The list is in the code; it should instead be obtained from COMLEX or some other more carefully compiled source. ("ab" stands for "ablaut", one of the paradigms for regular stem change.)

08probs.td

Uses only verbs ending in dental stops. These should have auditively very different past forms.
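
A minimal, spelling-based sketch of such a filter follows. It is an assumption that the check is made on the written base form; the function name is hypothetical, and a pronunciation lexicon would be needed to catch verbs such as "decide", whose stem ends in a dental stop only in speech.

```python
def ends_in_dental_stop(base_form):
    """Approximate test: does the written base form end in -t or -d?"""
    return base_form.lower().endswith(("t", "d"))

# Illustrative base forms, not taken from the original material.
verbs = ["want", "walk", "add", "report", "say"]
fixed_points = [v for v in verbs if ends_in_dental_stop(v)]
```

Here fixed_points keeps "want", "add" and "report", whose past forms gain a syllable (wanted, added, reported).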

08probs.tdab

Uses a simple sum of td + ab. This proved not to help. The sources should be merged differently.

08probs.ab.skip

Skips every finite verb immediately after a relative pronoun (which, that), on the assumption that such verbs are not part of the overall text tense system. This proved not to help.
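
The skipping rule can be sketched as follows. The input format, a list of (word, is_finite_verb) pairs, is an assumption made for illustration and is not the ENGCG output format; the function name is hypothetical.

```python
RELATIVE_PRONOUNS = {"which", "that"}

def finite_verbs_to_count(tokens):
    """Keep finite verbs, except those directly after 'which' or 'that'.

    tokens: list of (word, is_finite_verb) pairs (assumed format).
    """
    kept = []
    prev_word = ""
    for word, is_finite_verb in tokens:
        if is_finite_verb and prev_word.lower() not in RELATIVE_PRONOUNS:
            kept.append(word)
        prev_word = word
    return kept

# Illustrative sentence, not taken from the corpus.
sentence = [("The", False), ("company", False), ("that", False),
            ("won", True), ("reports", True), ("profits", False)]
counted = finite_verbs_to_count(sentence)
```

In this example "won" is skipped because it follows "that", and only "reports" is counted. Note that the rule cannot tell relative "that" from the complementizer, which may be one reason a better relative clause spotter is needed.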

Results

Using bare probability estimates

                  vrbs corr miss
results.prespret: 6973 4139 2834  59%
results.ab:       3194 1971 1223  62%
results.td:       5348 3188 2160  60%
results.tdab:     1569  972  591  62%
results.ab.skip:  2986 1655 1331  55%

A baseline estimate would be to always guess the more common form. For the material at hand this would be present tense, and would give us between 54% and 56% correct guesses.
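
The percentages above are simply corr / vrbs, and the baseline is the share of the more common tense. A small sketch follows; the function names are hypothetical, and the totals of the training table (9624 present, 8076 past) are used for the baseline only as an illustration, since the tense counts of the test material are not listed here.

```python
# (vrbs, corr) per setting, copied from the results table.
RESULTS = {
    "prespret": (6973, 4139),
    "ab": (3194, 1971),
    "td": (5348, 3188),
    "tdab": (1569, 972),
    "ab.skip": (2986, 1655),
}

def accuracy(total, correct):
    """Share of verbs whose tense was guessed correctly."""
    return correct / total

def majority_baseline(n_pres, n_past):
    """Accuracy of always guessing the more common tense."""
    return max(n_pres, n_past) / (n_pres + n_past)

for name, (total, correct) in RESULTS.items():
    print(f"{name:10s} {accuracy(total, correct):.0%}")

# Illustration only: training-table totals, not the test material.
print(f"baseline   {majority_baseline(9624, 8076):.0%}")
```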

Playing it safe

If we back off from making predictions in cases where the probabilities are inconclusive, we gain accuracy at the cost of coverage. This may be a sensible method to stave off some of the more embarrassing errors.

strong verbs

               vrbs corr miss
p > 0.5        3194 1971 1223  62%
p > 0.6        2058 1339  719  65%
p > 0.7        1603 1062  541  66%
p > 0.8         583  436  147  75%

dental stop verbs

               vrbs corr miss
p > 0.5        5348 3188 2160  60%
p > 0.6        2717 1734  983  64%
p > 0.7
p > 0.8         529  376  153  71%
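
The back-off rule and the resulting trade-off can be sketched like this. The function name predict is hypothetical, and coverage is taken here to be the share of the p > 0.5 verbs that still receive a prediction, which is an assumption about how the tables should be read.

```python
def predict(p_pres, threshold=0.5):
    """Return 'pres' or 'past', or None when neither probability is conclusive."""
    p_past = 1.0 - p_pres
    tense, p = ("pres", p_pres) if p_pres >= p_past else ("past", p_past)
    return tense if p > threshold else None

# (vrbs, corr) per threshold for the strong verbs, copied from the table.
STRONG = {0.5: (3194, 1971), 0.6: (2058, 1339),
          0.7: (1603, 1062), 0.8: (583, 436)}

all_verbs = STRONG[0.5][0]
for threshold, (vrbs, corr) in STRONG.items():
    coverage = vrbs / all_verbs
    print(f"p > {threshold}: coverage {coverage:.0%}, accuracy {corr / vrbs:.0%}")
```

For example, predict(0.587, 0.8) returns None, so a verb in a text with an inconclusive estimate is left undecided rather than guessed.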



What next?

  • Should do indirect context spotting test.
  • Use a better relative clause spotter.
  • Many errors could be solved using an adverbial spotter: a text which mainly reports in the present tense will have past tense clauses at unpredictable intervals clearly marked with an adverbial:
    ... most justices agree that ... But, as retired Justice Lewis Powell
    warned in a speech last year, ... (WSJ900702-0182)
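
A very rough idea of what an adverbial spotter might look like is sketched below. This was not implemented in the experiments above; the pattern list is a small, hypothetical sample, and the function name is an assumption.

```python
import re

# Hypothetical sample of past-time adverbials; a real list would be longer.
PAST_ADVERBIALS = re.compile(
    r"\b(?:last (?:year|month|week|night)|yesterday|\w+ ago)\b",
    re.IGNORECASE,
)

def has_past_adverbial(clause):
    """True if the clause contains one of the listed past-time adverbials."""
    return PAST_ADVERBIALS.search(clause) is not None

flagged = has_past_adverbial("warned in a speech last year")
unflagged = has_past_adverbial("most justices agree that")
```

A clause flagged in this way could be given a past tense reading even when the text as a whole reports in the present tense.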

New York University, December 1996
