The Yapex Collections of MEDLINE abstracts
==========================================

The two collections consist of MEDLINE abstracts obtained in two
different ways:

1) A document set was obtained by posing the query 'protein binding
   [Mesh term] AND interaction AND molecular' to MEDLINE with the
   parameters 'abstract', 'english', 'human', and 'publication date
   1996-2001'. From this set, 99 abstracts were drawn at random to
   form the reference (training) collection. Another, non-overlapping
   set of 48 abstracts was drawn to form part of the test collection.

2) The remaining 53 abstracts of the test collection were randomly
   chosen from the GENIA corpus (cf. Collier et al. 1999. 'The GENIA
   project: corpus-based knowledge acquisition and information
   extraction from genome research papers'. In: Proc. of 9th Conf. of
   EACL. pp. 271-272).

The protein names in all of the abstracts above were annotated by
domain experts connected to the Yapex project.

------------------------------
Reference collection

99 abstracts (cf. above) containing 1745 protein names:

yapex_ref_collection.xml

------------------------------
Test collection

101 abstracts (cf. above) containing 1966 protein names:

yapex_test_collection.xml

Date: Thu May 12 10:53:38 CEST 2005
Test collection corrected in accordance with comments by
Kevin B. Cohen. Thank you, Kevin!
(Two tags were moved inside the PubmedArticle tags; conventional uses
of "<" and ">" were converted to the XML entities &lt; and &gt;,
respectively.)

-------------------------------
Information

This file: README_yapex_text_collection.txt

--------------------------------
The files can be downloaded from:

http://www.sics.se/humle/projects/prothalt/
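
--------------------------------
Checking the counts (illustrative sketch)

The abstract and protein-name counts stated above could be checked
with a short script along the following lines. This is only a sketch
under assumptions: the README confirms that the files contain
PubmedArticle tags, but the root element name (PubmedArticleSet) and
the annotation tag name (ProteinName) used in the made-up sample
below are hypothetical and should be replaced with whatever the
downloaded files actually use.

```python
import xml.etree.ElementTree as ET


def count_tags(xml_text, tag):
    """Count the elements named `tag` anywhere in an XML document."""
    root = ET.fromstring(xml_text)
    # iter() visits the root and every descendant element.
    return sum(1 for _ in root.iter(tag))


# Made-up sample, NOT taken from the Yapex files. "PubmedArticleSet"
# and "ProteinName" are assumed names used only for illustration.
SAMPLE = """<PubmedArticleSet>
  <PubmedArticle>
    <AbstractText>The <ProteinName>p53</ProteinName> protein binds
    <ProteinName>MDM2</ProteinName>.</AbstractText>
  </PubmedArticle>
  <PubmedArticle>
    <AbstractText>Levels were &lt; 5 nM.</AbstractText>
  </PubmedArticle>
</PubmedArticleSet>"""

print(count_tags(SAMPLE, "PubmedArticle"))  # number of abstracts
print(count_tags(SAMPLE, "ProteinName"))    # number of annotations
```

Applied to the text of yapex_ref_collection.xml, counting
PubmedArticle elements should give 99 if each abstract sits in one
such element; for yapex_test_collection.xml it should give 101. The
second sample abstract also shows why "<" has to be written as the
entity &lt; for the file to parse.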