This page was updated 2005-09-15
FetchProt
Partners
SICS - Swedish Institute of Computer Science AB
SICS is the project coordinator, responsible for the development of
Information Extraction and language analysis tools. Contact
franzen
Center for Genomics and Bioinformatics, Karolinska Institutet
CGB is responsible for the EXProt database, the domain knowledge and
the supply of scientific text. Contact
Bjorn.Ursing
MMX is responsible for the system architecture, the system design, and
the system implementation. Contact
patrik.hassel

The FetchProt group as of September 2005. From the left: Patrik Hassel (MMX), Björn Ursing (CGB), Kristofer Franzén (SICS), and Pär Lannerö (MMX). On the photo to the right, our heroic annotator, Daniel Oppenheimer (CGB).

Summary

The project FetchProt intends to provide both academia and industry in the biotechnology and life science areas with a publicly available infrastructure for management and retrieval of knowledge about protein functions.

The goal is to automatically find, assess, and gather information about proteins with experimentally verified functions from scientific texts within the areas of molecular biology and biochemistry, by developing and applying language technology, and to build infrastructure and knowledge bases that make this information accessible and manageable.

Approach

The project will begin by establishing a definition of experimentally verified function. This definition, along with a thorough, data-driven, empirical study of how experimental evidence of protein function is realised in text, will constitute the theoretical base of the project. A close collaboration between SICS and CGB throughout the project will ensure that the domain knowledge of CGB is correctly implemented in the domain-specific language components.

Building on the Yapex protein name tagger (Franzén et al., 2002) and methods from the area of Information Extraction, a system will be built to fill the EXProt database (Ursing et al., 2002) with information about proteins whose functions have been experimentally verified. To guarantee the empirical foundation of the project, extensive work on manual annotation of scientific text will be done. This will result in a key, divided into a reference corpus and a test corpus, with which the system can be trained and evaluated. The development of the text analysis tools will be conducted in an iterative process in which the individual components are continuously evaluated and improved over the course of the project.
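The evaluation of a component against the manually annotated key can be illustrated with a small sketch. The source does not state which measures the project uses; precision, recall, and F1 are assumed here because they are common in Information Extraction. The function name `evaluate`, the `Score` type, and the example annotations are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Score:
    precision: float
    recall: float
    f1: float


def evaluate(predicted: set, key: set) -> Score:
    """Compare system output with the manually annotated key."""
    true_positives = len(predicted & key)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(key) if key else 0.0
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return Score(precision, recall, f1)


# Hypothetical (document id, protein name) pairs; not taken from the corpus.
key = {("doc1", "proteinA"), ("doc1", "proteinB"), ("doc2", "proteinC")}
predicted = {("doc1", "proteinA"), ("doc2", "proteinC"), ("doc2", "proteinD")}

score = evaluate(predicted, key)
print(f"precision={score.precision:.2f} recall={score.recall:.2f} f1={score.f1:.2f}")
```

In an iterative process, a function of this kind would be rerun on the test corpus after each change to a component, so that improvements and regressions become visible.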
The system solution that will be developed in the project will also be specified at an early stage. MMX will base the system architecture and design on principles that correspond to the professional users' expectations and the conventions of the domain. To guarantee scalability, durability, and flexibility, the system will be built around well-defined components connected by standardised communication protocols. The system must be able to fetch information from a heterogeneous collection of information sources, and the user interface must be habitable and intuitive. We will take pains to make the results generalizable, both in text analysis and in system architecture, so that they can be used outside the specific case of the project.

The results from the project will be threefold: the extended EXProt database, the algorithms and methods for text analysis, and the system. The database will be made publicly available.

Funding

Partial funding for this project has been provided by VINNOVA, the Swedish Agency for Innovation Systems.

Documentation

2003 project proposal (in Swedish)
SICS Open House poster presentation 2004-05-07 (in Swedish)
System architecture as of 2004-05-13
FetchProt Corpus Documentation (2005-10-11)

The FetchProt Corpus

Within the FetchProt Project we have compiled a data set to be used for training and testing of our system. The corpus consists of 190 full-text journal articles, of which 140 describe experimental evidence for tyrosine kinase activity in at least one protein. In total, wild types and 85 different mutants of 77 proteins are subject to experimental validation in 613 experiments. The documentation of the corpus is available in PDF and MS Word formats. The full corpus can be downloaded from this directory.
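When a corpus is divided into articles that describe the target evidence and articles that do not, it is useful to confirm that no file appears in both sets. The following is a minimal sketch of such a check, assuming each set is available as a list of file identifiers. The function name `find_overlap` and the identifiers are placeholders for illustration and are not part of the FetchProt distribution.

```python
def find_overlap(relevant_ids, irrelevant_ids):
    """Return the sorted identifiers that occur in both document sets."""
    return sorted(set(relevant_ids) & set(irrelevant_ids))


# Placeholder identifiers for illustration only; not real corpus file names.
relevant = ["EXAMPLE_0001", "EXAMPLE_0002", "EXAMPLE_0003"]
irrelevant = ["EXAMPLE_0003", "EXAMPLE_0004"]

duplicates = find_overlap(relevant, irrelevant)
# Drop the duplicated files from the irrelevant set, keeping the original order.
cleaned_irrelevant = [doc for doc in irrelevant if doc not in duplicates]

print("in both sets:", duplicates)
print("irrelevant set after cleanup:", cleaned_irrelevant)
```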
To receive announcements and information about updates, bugs, and corrections, and to discuss any aspect of the FetchProt Corpus, you may subscribe to the FetchProt Corpus e-mail list by sending an e-mail to Majordomo@sics.se with subscribe fetchprotcorpus in the body.

News

Latest version of Corpus finally uploaded 2007-05-09.

Corpus updates and corrections 2005-10-11
See the changes.txt file for details.

E-mail list set up 2005-09-30
An e-mail list for discussions and information about the FetchProt Corpus was set up today. You can find subscription information under the FetchProt Corpus heading above.

Corpus update 2005-09-26
Two files in the irrelevant set (JBC_278_27896_27902 and JBC_272_1355_1362) were removed today since they were actually relevant and already in the relevant set! We are sorry for the inconvenience.

Job openings

At present, there are no job openings in this project.