Fredrik Olsson successfully defended his Ph.D. thesis on December 19, 2008
Title: Bootstrapping Named Entity Annotation by Means of Active Machine Learning: A method for creating corpora
Time: Dec 19, 2008, at 13.15
Place: Lilla hörsalen, Humanisten, University of Gothenburg
Opponent: Miles Osborne, The University of Edinburgh, School of Informatics
This thesis describes the development and in-depth empirical
investigation of a method, called BootMark, for bootstrapping the
marking up of named entities in textual documents. The reason for
working with documents, as opposed to for instance sentences or
phrases, is that the BootMark method is concerned with the creation of
corpora. The claim made in the thesis is that, to produce a named
entity recognizer with a given performance, BootMark requires a human
annotator to manually annotate fewer documents than would be needed if
the documents forming the basis for the recognizer were randomly drawn
from the same corpus. The intention is then to use the
created named entity recognizer as a pre-tagger and thus eventually
turn the manual annotation process into one in which the annotator
reviews system-suggested annotations rather than creating new ones
from scratch. The BootMark method consists of three phases: (1) manual
annotation of a set of documents; (2) bootstrapping -- active machine
learning for the purpose of selecting which document to annotate next;
(3) marking up the remaining unannotated documents of the original
corpus using pre-tagging with revision.
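
The document selection step in phase two can be illustrated with a minimal sketch. The abstract does not state how disagreement is measured or which learners make up the committee, so the sketch assumes vote entropy averaged over the tokens of a document, a common choice in committee-based active learning. All function names and the two toy taggers are hypothetical and serve only as illustration; this is not the BootMark implementation.

```python
import math
from collections import Counter


def vote_entropy(votes):
    """Entropy of the labels that the committee members assign to one token."""
    counts = Counter(votes)
    k = len(votes)
    return -sum((c / k) * math.log(c / k) for c in counts.values())


def document_disagreement(committee_labels):
    """Mean per-token vote entropy for one document.

    committee_labels holds one label sequence per committee member,
    all of the same length (one label per token).
    """
    per_token = [vote_entropy(votes) for votes in zip(*committee_labels)]
    return sum(per_token) / len(per_token) if per_token else 0.0


def select_next_document(unlabeled, committee):
    """Return the id of the document the committee disagrees on most.

    unlabeled: dict mapping document id -> list of tokens
    committee: list of callables, each mapping tokens -> list of labels
    """
    return max(
        unlabeled,
        key=lambda doc_id: document_disagreement(
            [member(unlabeled[doc_id]) for member in committee]
        ),
    )


# Toy committee members (assumptions, standing in for trained recognizers).
def tag_capitalized(tokens):
    return ["ENT" if t[0].isupper() else "O" for t in tokens]


def tag_nothing(tokens):
    return ["O"] * len(tokens)


docs = {
    "d1": ["the", "cat", "sat"],
    "d2": ["Fredrik", "visited", "Edinburgh"],
}
next_doc = select_next_document(docs, [tag_capitalized, tag_nothing])
```

In this toy setting the two taggers agree on every token of `d1` and disagree on two tokens of `d2`, so `d2` is the document handed to the annotator next. In an actual bootstrapping loop the committee would be retrained after each newly annotated document before the next selection is made.
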
Five emerging issues are identified, described and empirically
investigated in the thesis. Their common denominator is that they all
depend on the realization of the named entity recognition task, and as
such, require the context of a practical setting in order to be
properly addressed. The emerging issues are related to: (1) the
characteristics of the named entity recognition task and the base
learners used in conjunction with it; (2) the constitution of the set
of documents annotated by the human annotator in phase one in order to
start the bootstrapping process; (3) the active selection of the
documents to annotate in phase two; (4) the monitoring and termination
of the active learning carried out in phase two, including a new
intrinsic stopping criterion for committee-based active learning; and
(5) the applicability of the named entity recognizer created during
phase two as a pre-tagger in phase three.
The outcomes of the empirical investigations concerning the emerging
issues support the claim made in the thesis. The results also suggest
that while the recognizer produced in phases one and two is as useful
for pre-tagging as a recognizer created from randomly selected
documents, the applicability of the recognizer as a pre-tagger is best
investigated by conducting a user study involving real annotators
working on a real named entity recognition task.
