The EACL 2006 Workshop on New Text was hosted in conjunction with the 11th Conference of the European Chapter of the Association for Computational Linguistics on April 4, 2006, in Trento, Italy.
Recent advances in publication and dissemination systems have given rise to new types of text - dynamic, reactive, multi-lingual, with numerous cooperating or even adversarial authors and little or no editorial control. Many of these new types of text remain true to established textual genres; others break new ground, moving towards new emergent textual genres enabled by dramatically lowered thresholds for publication and distribution.
Most notable among these new forms of text, and the subject of considerable attention from traditional media, are *blogs* - texts written as a timely running commentary on public or private matters. Another well-established and remarkable new genre is the *wikipedia* - an encyclopaedia built by the cooperative effort of its readers. New forms of communication such as these raise a number of questions for researchers in a number of fields, and this past spring saw no fewer than two workshops on the analysis of new texts.
A workshop on "New Text - Wikis and blogs and other dynamic text sources" was held in conjunction with the 11th Conference of the European Chapter of the Association for Computational Linguistics (EACL) in Trento, Italy, on April 4, 2006. Marking the rising interest in these new data, the American Association for Artificial Intelligence hosted a related symposium on "Computational Approaches to Analyzing Weblogs" as part of its Spring Symposium series at Stanford, California, on March 27-29, 2006, a week before the New Text workshop and with some of the same participants.
One of the major themes of the New Text workshop was the question of how new text is different. How new is "new"? Didn't we use to have new text before? What is the difference between "new" and "old", really?
The presentations at the workshop addressed this question from different perspectives. It is a fact, easily ascertained by experiment and well established by several of the participating research groups, that blog text can be distinguished from other texts on the same topic. Blogs are involved, are more often than not written from a personal perspective, and quite often express the opinions or sentiments of the author. At the same time, blog texts often show characteristics of other established types of text, adhering, for example, to literary language. This might seem to contradict the popular perception that blog texts are hastily put together in a language reminiscent of brief notes, spoken asides, or short letters. But since language usage in general is based on speakers' and writers' previous experiences, the text produced by blog authors quite naturally reflects their linguistic experiences - and as such will include references, citations, and patterns from essays, high literature, and classics as well as diaries and personal letters.
At the same time, other presentations showed how authors of wikipedia articles aim to emulate the style of an encyclopaedia, and that they succeed in this aim to an impressive extent. Where the authors do not, wikipedias are edited by other authors and readers with this stylistic objective explicitly in mind. The meta-discussions on matters Wikipaedic were conducted in a discernibly different register.
In summary, it is quite clear that authors of new texts, just like authors of traditional ones, are aware of linguistic styles of various sorts and use them in ways they deem appropriate. When new genres emerge they either pattern themselves on existing ones, thereby drawing on the prestige and position of those existing genres, or cast around for forms suitable for their intended impact and stature. How to achieve this form where none exists is a matter yet to be resolved (one presentation at the New Text workshop proposed a model of textual editing tools intended to support collaboration between authors better than existing models allow).
But new texts are more than traditional texts: they have features that traditional texts lack. They are interconnected in a network made explicit by authors and readers in a complex interplay of explicit textual references; they position themselves much more explicitly in a context of other texts than texts previously have done. The study of this fabric of textuality is only taking its first steps.
In view of the less formalised publication process, the credibility of new texts can be called into question. When traditional texts are published in paper form, a number of steps - variable from one mode of publication to another - involve satisfying editors or publishers of the veracity, relevance, quality, and impact of a text. These steps may be more or less formal, more or less reliable, and can be circumvented by the determined author, but in most cases the process and the thresholds authors need to cross to publish ensure a certain level of oversight and coherence with accepted truth. (Whether this is a good or a bad thing is a different discussion entirely!) New texts lack this guarantee of having passed many eyes en route from author to reader. Some new texts turn out to have high social impact; some sink without a trace; some have high import in tightly knit circles and communities - but there is no simple measure of the impact or of the variable perceived intellectual status and quality of new texts. Understanding credibility, authority, and other facets of quality is central to any attempt at analysing the impact of new texts - more so than for traditional texts, given the simpler publication process. Presentations at both the New Text workshop and the AAAI blog symposium discussed the issue; modelling the history and social context of a text and its author is one example of analysis that can provide partial answers.
Underlying the issue of credibility and authority is the question of who the author is and why. What makes a blogger blog? Why do people devote time and energy to editing wikipedia pages? Understanding the motivations and intentions of authors is not incidental to the task of understanding the texts. Integral to the blog is who and why; integral to the wiki is purpose - no one can even pretend that the texts are analyzable in isolation. While texts remain texts, even with new syntactic patterns and new lexical items, their contextuality is so great as to dominate many other content features.
And this, in fact, is truly new!
With respect to the publication of new texts there is the question of what publication and public are understood to mean. There are frequent recent examples of texts and comments made public in ways that their authors did not quite expect: some authors appear to assume that a private diary, even if kept in a public space on the net, is still private, that comments to a closed mailing list are to be understood as confidential, that material published under a pseudonym should not be traced to its author, and that the rights to a picture posted to a web page remain with its originator. In some cases legal practice conforms to commonly held expectations, in some it does not, and in most cases there are no precedents to consult. We can expect a large number of interesting legal cases and new codification in the near future!
What services can we expect to emerge from the analysis of new text? Already, several information access services use wikipedias to extract facts and relationships for better understanding of other texts. Analysis of public opinion on issues or of consumer attitudes towards products and services on the market has found a rich vein of data in blogs. These new data sources are only the first examples of how a communication channel built by many for other multitudes allows tracking of the global village in an entirely new way. Where producers of consumer goods earlier only had recourse to directed market research or the opinions of consumers who volunteered their time and effort to communicate directly with them, they can now find information from a larger section of their customer base. We can model the behaviour of large swathes of the population based on the writing and reporting on blogs - drawing on the studies presented at the two workshops, to a large extent conforming to expectations of what people like and why! However, to do this with any level of reliability, our processing tools, tuned to newsprint and other well-edited texts, need to address the challenges of text that is variable and multi-lingual, with register swings and formality melanges - not shoddy, but New!
We are currently in a transitional phase, exceedingly interesting both philologically and industrially. Similar phases have been seen before - with the introduction of inexpensive printing processes, publishers put out compilations of private correspondence as one form of written communication assumed to be of public interest. Today, while such compilations are still published, they are but a curiosity in the larger stream of published material. The only certainty we have today is that tomorrow, people will find creative ways of utilising the technology we are introducing today - again, not unpredictable, but New!
These new movements will be discussed in coming research events. Those interested are welcome to contact the author of this text at newtext@sics.se for more information!
Links out:
The EACL Workshop on New Text - Wikis and blogs and other dynamic text sources
http://www.sics.se/jussi/newtext
The AAAI Spring Symposium on Computational Approaches to Analyzing Weblogs
http://www.umbriacom.com/aaai2006_weblog_symposium/
The notes from the workshop are published as part of the EACL proceedings.
The entire proceedings in PDF
| Paper | Authors |
| --- | --- |
| Text Linkage in the Wiki Medium - A Comparative Study | Alexander Mehler |
| Errors in wikis | Ann Copestake |
| Linguistic features of Italian blogs: literary language | Mirko Tavosanis |
| An analysis of Wikipedia digital writing | Antonella Elia |
| Learning to Recognize Blogs: A Preliminary Exploration | Erik Elgersma, Maarten de Rijke |
| Interpreting Genre Evolution on the Web | Marina Santini |
| Novelle, a collaborative open source writing tool software | Federico Gobbo, Michele Chinosi, Massimiliano Pepe |
| A proposal to automatically build and maintain gazetteers for Named Entity Recognition by using Wikipedia | Antonio Toral, Rafael Muñoz |
| Finding Similar Sentences across Multiple Languages in Wikipedia | Sisay Fissaha Adafre, Maarten de Rijke |
| Multilingual interactive experiments with Flickr | Paul D Clough, Julio Gonzales, Jussi Karlgren |