Systematic curation of protein and genetic interaction data for computable biology

Kara Dolinski 1*, Andrew Chatr-aryamontri 2 and Mike Tyers 2*

ANNIVERSARY UPDATE, Open Access

*Correspondence: kara@genomics.princeton.edu; md.tyers@umontreal.ca

Research over the past three decades has revealed that cellular behavior is governed by dynamic, complex networks of interactions among proteins, RNA, DNA, lipids and metabolites [1]. As such, each discrete interaction represents the minimal input unit for computational models of biological system responses. This principle in turn requires that biological interactions be rigorously documented in a readily computable format to enable mathematical models of network function that are able to predict non-obvious behavior. Despite the foundational shifts in our conception of biology, however, we still use the same primary method to disseminate scientific information that was used by Darwin over 150 years ago, namely the free-text, descriptive narrative style of conventional publications. The vast and ever-increasing biomedical literature thus poses a formidable challenge for the annotation of biological data and computational analysis [2].

Our 2006 publication describing the comprehensive curation of protein and genetic interactions for the budding yeast

more than 500,000 interactions across some 30 different model organism species [4]. Comprehensive literature curation has also been completed for the fission yeast Schizosaccharomyces pombe and the thale cress Arabidopsis thaliana. In parallel, curation projects on human interactions in themed areas of biomedical interest have been undertaken. These datasets are disseminated by many different partner databases and meta-resources, including the Saccharomyces Genome Database [5] and other model organism databases, the Gene Ontology Consortium [6], the International Molecular Exchange (IMEx) Consortium [7], and the Pathway Commons initiative [8]. The Proteomics Standards Initiative-Molecular Interactions (PSI-MI) standard has been developed by an international consortium to unify experimental evidence codes for protein interactions across databases [7], and analogous standards are now being developed by BioGRID and its partners for genetic interactions and quantitative phenotypic traits. The BioGRID dataset has been kept current with the literature through archived monthly updates, and has found numerous applications, from the analysis of biological network properties, to predictions of gene function, to the interpretation of genetic interactions, to a standard for automated text-mining approaches [4].
Since our original publication, the rate of generation of molecular interaction data has exploded. Over the past several years, a variety of different HTP approaches have been used to chart protein and genetic interaction networks, which have greatly extended the scope of biological interaction space and motivated many hypothesis-driven studies. In yeast, the protein interaction landscape has grown from about 23,000 non-redundant interactions in 2006 to over 75,000 at present, while the number of genetic interactions measured by synthetic growth effects has increased from about 14,000 non-redundant interactions in 2006 to over 140,000 at present [3,4]. HTP methods for detection of protein and nucleic acid interactions have also enabled the comprehensive inference of regulatory relationships [9]. A host of analogous systematic detection approaches have now begun to chart the extensive networks of interactions and associated protein modifications in human cells [10]. Even in yeast, however, the full extent of the protein and genetic interactomes, and the regulatory relationships that connect the two, remain to be determined. The availability of robust datasets derived from the primary literature and HTP studies has enabled graphical representation and interrogation of global interaction networks, and the prediction of gene and network function [11]. Such tools are essential for deconvolution of the now commonplace but inscrutable interaction 'hairball' (Figure 1), which belies the regulatory logic that is encoded in complex networks [1]. These methods have begun to allow analysis of networks implicated in human disease and the identification of critical nodes as therapeutic targets [10]. This network approach to understanding disease should not only identify new targets for drug discovery but should also predict drug combinations tailored to compensate for specific network mutations [12].
Despite the striking experimental progress that has ushered in the era of the hairball [1], the annotation and computational analysis of interaction datasets is still at a nascent stage. A fundamental issue with expert manual curation is the rate of growth of the primary literature, which manifestly outstrips the rate of curation (Figure 2). To put this problem in perspective, PubMed currently contains over 22,000,000 publication entries (some 12,000,000 of which pertain to human biology), and new publications are accumulating at a rate of approximately two every minute. Although automated text-mining approaches can expedite curation, these approaches are inherently limited by the inadequacies of natural language processing algorithms, and it is clear that much of the literature will remain opaque to computation unless experimental interaction data are explicitly annotated as a part of the publication process. A simple and cost-effective solution would be to mandate the deposition of structured records that rigorously describe experimental evidence and quantitative parameters for biological interactions as an inherent part of the publication process.
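To make the idea of a deposited structured record concrete, the following is a minimal sketch of what one such record might contain. It is illustrative only: the class name `InteractionRecord`, its field names, the allowed values and the placeholder identifiers are assumptions made for this example, and do not reproduce the PSI-MI standard or any BioGRID schema.

```python
import json
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class InteractionRecord:
    """Hypothetical structured record for one reported interaction."""
    interactor_a: str
    interactor_b: str
    organism: str
    interaction_type: str                 # assumed vocabulary: "physical" or "genetic"
    evidence_method: str                  # free text here; a real standard would use controlled terms
    publication_id: str
    throughput: str                       # assumed vocabulary: "low" or "high"
    quantitative_value: Optional[float] = None
    quantitative_unit: Optional[str] = None

    def __post_init__(self) -> None:
        # Reject values outside the small vocabularies assumed for this sketch.
        if self.interaction_type not in {"physical", "genetic"}:
            raise ValueError(f"unknown interaction_type: {self.interaction_type!r}")
        if self.throughput not in {"low", "high"}:
            raise ValueError(f"unknown throughput: {self.throughput!r}")
        # A quantitative value is only meaningful together with its unit.
        if (self.quantitative_value is None) != (self.quantitative_unit is None):
            raise ValueError("quantitative_value and quantitative_unit must be given together")


def to_json(record: InteractionRecord) -> str:
    """Serialize a record to a stable, machine-readable JSON string."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder identifiers, not real genes or publications.
example = InteractionRecord(
    interactor_a="GENE_A",
    interactor_b="GENE_B",
    organism="Saccharomyces cerevisiae",
    interaction_type="physical",
    evidence_method="two-hybrid",
    publication_id="PMID:0000000",
    throughput="low",
)
print(to_json(example))
```

Validating the record at construction time reflects the point made above: a record checked at the moment of deposition needs no later interpretation by text mining.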
A further formidable challenge will be the reconciliation of literature-based interaction data and HTP data, which are often still in discord. The level of detail and reliability of different studies varies greatly, and has led to a call for semi-quantitative metrics to score interaction reliability. The low-affinity protein interactions that often underpin biological network regulation are particularly problematic in this regard, and undoubtedly account for a large fraction of currently uncharted interactions. In addition to focused studies in the literature that often draw on subtle inferences and clever experiments to detect such interactions, the application of new methods, such as protein cross-linking followed by mass spectrometric deconvolution, should help increase the rate of detection of transient regulatory interactions.
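As one illustration of what a semi-quantitative reliability metric could look like, the sketch below combines independent pieces of evidence for the same interaction with a noisy-OR rule. This is a generic scoring scheme chosen for the example, not a metric proposed in this article or used by any database, and the per-evidence weights in `ASSUMED_WEIGHTS` are arbitrary placeholders rather than measured confidence values.

```python
from math import prod
from typing import Iterable, Tuple

# Hypothetical confidence assigned to a single observation, by study throughput.
# These numbers are placeholders for illustration, not calibrated estimates.
ASSUMED_WEIGHTS = {"low": 0.8, "high": 0.3}

# One piece of evidence: (publication identifier, detection method, throughput).
Evidence = Tuple[str, str, str]


def reliability_score(evidence: Iterable[Evidence]) -> float:
    """Noisy-OR combination: 1 minus the product of (1 - weight) over distinct evidence.

    Identical (publication, method, throughput) entries are counted once, so
    re-listing the same observation does not inflate the score.
    """
    distinct = set(evidence)
    return 1.0 - prod(1.0 - ASSUMED_WEIGHTS[throughput] for _, _, throughput in distinct)


# Placeholder evidence for one interaction.
observations = [
    ("PMID:0000001", "two-hybrid", "high"),
    ("PMID:0000002", "co-immunoprecipitation", "low"),
    ("PMID:0000002", "co-immunoprecipitation", "low"),  # duplicate, counted once
]
print(round(reliability_score(observations), 2))
```

Under this rule, an interaction supported by several independent observations scores higher than one seen once, while any single observation remains below certainty. A practical metric would need weights calibrated against benchmark data.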
The biomedical research community is now beginning to leverage the wealth of network data across multiple species to gain a better understanding of human health and disease [9]. As the deluge of genome sequence data associated with cancer and other diseases continues to mount, a cross-species approach that draws on experimentally tractable model species will be a key step toward understanding the function of conserved but poorly annotated human genes. To provide the necessary underlying data for these efforts, interaction and phenotype curation has been expanded by our group and others to capture data across the major model organism species, including bacteria, viruses, yeasts, plants, nematode, fruit fly, zebrafish, mouse, rat, primate, and human [4]. These curation efforts are often focused on central biological processes or diseases. For example, we have recently undertaken exhaustive curation of the extensive interaction networks that control the state of chromatin modification and protein ubiquitination in both yeast and humans [4]. Similarly, the Gene Ontology Consortium has undertaken initiatives to describe specific developmental and disease-associated processes [6]. A related challenge is the elaboration of generic interaction networks, which often lack context specificity, into more realistic dynamic networks that incorporate information on particular cellular contexts, developmental states or disease conditions. This task will require both more detailed annotation and the integration of different data types such as tissue-specific expression and precise phenotypes [13]. To address this, our curation efforts have begun to include the regulation of protein interactions by post-translational modifications, the specific contexts or conditions under which the interaction occurs, and the classification of genetic interactions according to quantitative phenotypes [4].
The utility of the comprehensive yeast interaction dataset that we described in 2006 has grown well beyond our original simple intended application as a benchmark for HTP datasets. BioGRID now houses a vast amount of data from multiple species, and is a general resource for experimental and computational biologists alike. The BioGRID, its partner interaction databases, model organism databases, and public meta-resources will all play a crucial role in biomedical research in the post-genomics era. We close this brief overview by noting that there is an urgent need to develop the equivalent of a unified human model organism database that incorporates protein and genetic interaction data, regulatory data at the DNA, RNA and protein levels, polymorphism and disease-associated sequence variation data, quantitative phenotypic data, and drug-target interaction data. These integrated datasets will eventually set the stage for sophisticated computational models able to predict cellular behavior, disease outcomes and new modes of therapeutic intervention.