Proteinhalt i text

Summary

The amount of scientific text produced in the biomedical domain is overwhelming and impossible to survey even for researchers in specialized sub-fields. Yet, the published results in scientific papers form an important basis for further research, as well as for applied biotechnology. At the intersection of bioinformatics and computational linguistics, the project "Proteinhalt i text" (protein concentration in text) develops methods for automatic identification of names of proteins. The results will be used to enhance user interaction in searching and browsing biomedical texts, as well as for further refinement of the information, e.g., by building knowledge bases of protein interactions. This requires both knowledge about the biomedical domain and linguistic knowledge; the linguistic variation makes it possible to express, for example, the relation SOMETHING-INTERACTS-WITH-SOMETHING in many different ways:
Potential target proteins for the development of new drugs for medical therapy could be found with the help of such knowledge bases.

Funding

Partial funding for this project has been provided by VINNOVA, the Swedish Agency for Innovation Systems.

The protein name tagger - Yapex

Within the project, a protein name tagger, Yapex, has been developed. The tagger, described in Eriksson et al. 2002 and Franzén et al. 2002, is a rule-based system that employs lexical and syntactic analysis to decide whether a text string is part of a protein name. You can test the system here (TEMPORARILY OUT OF ORDER) and find the evaluation results for the current version of the system here.

The Yapex Corpus

A description of the data sets used for reference and evaluation can be found here. The reference (training) data that we used when developing the tagger can be found here. The test data used for evaluation can be found here. All three files can be downloaded as one tar file from here.

Thu May 12 2005: Test collection corrected in accordance with comments by Kevin B. Cohen. Thank you, Kevin! (Two tags were moved inside the PubmedArticle tags, and conventional uses of "<" and ">" were converted to the XML entities &lt; and &gt;, respectively.)

Information mailing list

If you want to receive update information about the Yapex protein name recognizer and the data sets used for training and testing of Yapex, please subscribe to the yapexinfo mailing list by sending an e-mail to Majordomo@sics.se with "subscribe yapexinfo" in the body of the mail. Typical information will be about updates, corrections, and other changes concerning the program or the data sets.

If you want to communicate with the people involved in the project, please feel free to email yapexmen@sics.se.

Publications