Proteins in text.

Proteinhalt i text

  • Summary
  • Partners
  • Funding
  • The protein name tagger Yapex (TEMPORARILY OUT OF ORDER)
  • The Yapex Corpus
  • Information mailing list
  • Publications

    Summary

    The amount of scientific text produced in the biomedical domain is overwhelming and impossible to survey even for researchers in specialized sub-fields. Yet, the published results in scientific papers form an important basis for further research, as well as for applied biotechnology.

    At the intersection of bioinformatics and computational linguistics, the project "Proteinhalt i text" (protein concentration in text) develops methods for automatic identification of protein names. The results will be used to enhance user interaction in searching and browsing biomedical texts, as well as for further refinement of the information, e.g., by building knowledge bases of protein interactions. This requires knowledge of both the biomedical domain and linguistics, because linguistic variation allows a relation such as SOMETHING-INTERACTS-WITH-SOMETHING to be expressed in many different ways:

    • Electrophoretic mobility shift assays indicate that MS-2beta and MS-2gamma bind to nuclear factors that are induced during U937 differentiation
    • ... synergistic interactions between overlapping binding sites for the serum response factor and ELK-1 proteins ...

    Potential target proteins for the development of new drugs for medical therapy could be found with the help of such knowledge bases.


    Funding

    Partial funding for this project has been provided by VINNOVA, the Swedish Agency for Innovation Systems.


    The protein name tagger - Yapex

    Within the project, a protein name tagger - Yapex - has been developed. The tagger, described in Eriksson et al 2002 and Franzén et al 2002, is a rule-based system that employs lexical and syntactic analysis to decide if a text string is part of a protein name or not. You can test the system here (TEMPORARILY OUT OF ORDER) and find the evaluation results for the current version of the system here.
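
    To give a feel for what lexical and contextual rules of this kind can look like, here is a minimal sketch in Python. It is not the Yapex rule set: the token pattern, the "letters plus digits" heuristic, the context word list, and the function name tag_protein_candidates are all assumptions made up for illustration.

```python
import re

# Hypothetical token pattern: alphanumeric runs, optionally joined by hyphens.
TOKEN = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*")

# Hypothetical lexical rule: a name-like token contains a letter and a digit.
HAS_LETTER_AND_DIGIT = re.compile(r"(?=.*[A-Za-z])(?=.*\d)")

# Hypothetical context rule: a following word that suggests a cell line
# or a process rather than a protein.
NON_PROTEIN_CONTEXT = {"cell", "cells", "differentiation"}


def tag_protein_candidates(sentence):
    """Return tokens that pass the toy lexical rule and the toy context rule."""
    tokens = TOKEN.findall(sentence)
    candidates = []
    for i, token in enumerate(tokens):
        if not HAS_LETTER_AND_DIGIT.match(token):
            continue  # lexical rule failed
        next_token = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        if next_token in NON_PROTEIN_CONTEXT:
            continue  # context rule rejects the candidate
        candidates.append(token)
    return candidates


first = tag_protein_candidates(
    "Electrophoretic mobility shift assays indicate that MS-2beta and "
    "MS-2gamma bind to nuclear factors that are induced during U937 "
    "differentiation"
)
second = tag_protein_candidates(
    "synergistic interactions between overlapping binding sites for the "
    "serum response factor and ELK-1 proteins"
)
print(first)
print(second)
```

    On the two example sentences from the summary, the sketch keeps MS-2beta, MS-2gamma and ELK-1 and rejects U937 because of the word that follows it. It misses the multi-word name "serum response factor" entirely, which hints at why purely lexical rules are not enough and syntactic analysis is needed.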


    The Yapex Corpus

    A description of the data sets used for reference and evaluation can be found here. The reference (training) data that we used when developing the tagger can be found here. The test data used for evaluation can be found here. All three files can be downloaded as one tar file from here.

    Thu May 12 2005: Test collection corrected in accordance with comments by Kevin B. Cohen. Thank you, Kevin! (Two tags were moved inside the PubmedArticle tags, and conventional uses of "<" and ">" were converted to the XML entities &lt; and &gt;, respectively.)
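
    The entity conversion described in this correction can be illustrated with Python's standard library. This is a sketch only, not the script used to correct the collection; the sample sentence is invented and the AbstractText element name is used purely as an illustration.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape

# Hypothetical abstract text with conventional uses of "<" and ">".
raw_text = "Binding was reduced (p < 0.05) when IL-2 > IL-4."

# escape() replaces "&", "<" and ">" with &amp;, &lt; and &gt;.
escaped_text = escape(raw_text)

# The element name is illustrative; the escaped text now parses as XML,
# and the parser turns the entities back into the original characters.
element = ET.fromstring("<AbstractText>" + escaped_text + "</AbstractText>")
round_tripped = element.text

print(escaped_text)
```

    Without the escaping step, an XML parser would read the bare "<" as the start of a tag and reject the file.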


    Information mailing list

    If you want to receive update information about the Yapex protein name recognizer and the data sets used for training and testing of Yapex, please subscribe to the yapexinfo mailing list by sending an e-mail to Majordomo@sics.se with subscribe yapexinfo in the body of the mail. Typical information will be about updates, corrections, and other changes concerning the program or the data sets.

    If you want to communicate with the people involved in the project, please feel free to email yapexmen@sics.se


    Publications

    Franzén et al 2002
    Kristofer Franzén, Gunnar Eriksson, Fredrik Olsson, Lars Asker, Per Lidén and Joakim Cöster. 2002. ``Protein names and how to find them''. To be published in International Journal of Medical Informatics special issue on Natural Language Processing in Biomedical Applications.
    Draft version: [ ps | pdf ]
    Olsson et al 2002
    Fredrik Olsson, Gunnar Eriksson, Kristofer Franzén, Lars Asker and Per Lidén. ``Notions of Correctness when Evaluating Protein Name Taggers''. In Proceedings of the 19th International Conference on Computational Linguistics (COLING 2002), Taipei, Taiwan, 24 August - 1 September.
    [ ps | pdf ]
    Franzén et al 2002
    Kristofer Franzén, Gunnar Eriksson, Fredrik Olsson, Lars Asker and Per Lidén. ``Exploiting Syntax when Detecting Protein Names in Text''. In Proceedings of Workshop on Natural Language Processing in Biomedical Applications. Nicosia, Cyprus. March.
    Final version: [ps | pdf]
    Lidén et al 2002
    Per Lidén, Lars Asker, Gunnar Eriksson, Kristofer Franzén and Fredrik Olsson. ``Protein name tagging for browsing support, active database cross linking, and information retrieval''. In Proceedings of Bioinformatics 2002. Bergen, Norway. April. Poster presentation.
    Abstract: [pdf] Poster: [pdf]
    Eriksson et al 2002
    Gunnar Eriksson, Kristofer Franzén, Fredrik Olsson, Lars Asker and Per Lidén. ``Using Heuristics, Syntax and a Local Dynamic Dictionary for Protein Name Tagging''. In Proceedings of Human Language Technology 2002. San Diego, USA. March. Poster presentation.
    Draft version of extended abstract: [ps | pdf ]