OrthoMCL: Identification of Ortholog Groups for Eukaryotic Genomes

Li Li, Christian J. Stoeckert Jr., and David S. Roos¹

Departments of Biology and Genetics, Center for Bioinformatics, and Genomics Institute, University of Pennsylvania, Philadelphia, Pennsylvania 19104, USA

The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution, comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are coherent with groups identified by EGO, but improved recognition of “recent” paralogs permits overlapping EGO groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P. falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite genome.
With the progress of large-scale sequencing efforts, comparative genomic approaches have increasingly been employed to facilitate both evolutionary and functional analyses: Conserved sequences can be used to infer evolutionary history, and to the extent that homology implies conserved biochemical function, this information may be used to facilitate genome annotation. The concepts of orthology and paralogy originated from the field of molecular systematics (Fitch 1970), and have recently been applied to functional characterizations and classifications on the scale of whole-genome comparisons (Tatusov et al. 1997, 2000, 2001; Chervitz et al. 1998; Mushegian et al. 1998; Wheelan et al. 1999; Rubin et al. 2000). Orthologs and paralogs constitute two major types of homologs: The former evolved from a common ancestor by speciation, and the latter are related by duplication events (Fitch 1970, 2000). Although we can assume that paralogs arising from ancient duplication events are likely to have diverged in function (as in the case of α- and β-tubulins), true orthologs (e.g., α-tubulin from yeast and flies) are likely to retain identical function over evolutionary time, making ortholog identification a valuable tool for gene annotation. In comparative genomics, the clustering of orthologous genes provides a framework for integrating information from multiple genomes, highlighting the divergence and conservation of gene families and biological processes. For pathogens such as the human malaria parasite Plasmodium falciparum (Gardner et al. 2002; Kissinger et al. 2002; Bahl et al. 2003), orthologous groupings can facilitate the identification of candidates for drug and/or vaccine development.
The identification of orthologous groups in prokaryotic genomes has permitted cross-referencing of genes from multiple species, facilitating genome annotation, protein family classification, studies on bacterial evolution, and the identification of candidates for antibacterial drug development (Tatusov et al. 1997; Galperin and Koonin 1999; Natale et al. 2000a,b; Forterre 2002). The Clusters of Orthologous Groups (COG) database (http://www.ncbi.nlm.nih.gov/COG/) is constructed based on all-against-all BLAST searches of complete proteomes (Tatusov et al. 2000, 2001). Sequences from distinct genomes that are reciprocal best hits (i.e., the first sequence finds the second sequence as its best hit in the second species, and vice versa) are identified as a pair of orthologs, and “COGs” recognizing relationships among at least three distinct lineages (triangles) have been identified across distant phylogenetic lineages.

Although Saccharomyces cerevisiae is included in the COG database, general application of this approach in the construction of orthologous groups for other eukaryotic genomes has proved problematic (even for complete prokaryotic genomes, extensive manual inspection of COGs is often required to correct false-positives and split mega-clusters). Complications associated with ortholog group construction for eukaryotic genomes include extensive gene duplication and functional redundancy, the multidomain structure of many proteins, and the predominance of incomplete eukaryotic genome sequencing (Doolittle 1995; Henikoff et al. 1997). These challenges demand an approach able to distinguish between “recent” paralogs (i.e., gene duplications occurring subsequent to speciation, such as the multiple α-tubulins found in the human genome), and “ancient” paralogs likely to exhibit different function(s).
Recent paralogs (which are equally related to orthologs in other species) are likely to retain similar function, and should be grouped with true orthologs. It is also important to assess global relationships among orthologs, without being misled by local relationships coming from complicated domain structures or incorrect ortholog assignments. Unfortunately, the computational costs of multiple sequence alignments and phylogenetic tree construction, and the difficulty in interpreting such alignments and trees, preclude a phylogenetic approach for whole-genome comparisons in eukaryotes.

¹Corresponding author. E-mail droos@sas.upenn.edu; fax (215) 746-6697.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/gr.1224503.
Genome Research 13:2178–2189 ©2003 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/03 $5.00; www.genome.org

The INPARANOID algorithm (Remm et al. 2001) exploits a BLAST-based strategy to identify orthologs as reciprocal best hits between two species, while applying additional rules to accommodate paralogs arising from duplication after speciation (a.k.a. in-paralogs). Note that the resulting ortholog groups include paralogs derived by recent gene duplication, as each of these proteins is orthologous to a protein in another species. This algorithm performs well in the identification of ortholog groups compared to a curated set of yeast versus mammalian orthologs defined by phylogenetic methods, providing evidence that the strategy based on reciprocal best hits works well in separating orthologs from “ancient” paralogs. Unfortunately, the rule-based approach used by INPARANOID is limited to pairwise comparisons between two species. EGO (previously named TOGA; Lee et al. 2002) applies a COG-based approach to the TIGR gene indices (Quackenbush et al. 2000, 2001), and this method is applicable to multiple species.
In the absence of a rigorously curated data set of orthologs from multiple eukaryotic species, it is difficult to assess performance, but EGO is easily misled by the functional redundancy of multiple paralogs, and by the absence of true orthologs within incomplete genome data sets.

Motivated by these challenges, we developed OrthoMCL as an alternative approach for automated eukaryotic ortholog group identification. To distinguish functional redundancy from divergence, this method identifies “recent” paralogs to be included in ortholog groups as within-species BLAST hits that are reciprocally better than between-species hits. This approach is similar to INPARANOID, but differs primarily in the requirement that recent paralogs must be more similar to each other than to any sequence from other species. To resolve the many-to-many orthologous relationships inherent in comparisons across multiple genomes, OrthoMCL applies the Markov Cluster algorithm (MCL; Van Dongen 2000; http://micans.org/mcl/), which is based on probability and graph flow theory and allows simultaneous classification of global relationships in a similarity space. MCL simulates random walks on a graph using Markov matrices to determine the transition probabilities among nodes of the graph. The MCL algorithm has previously been exploited for clustering a large set of protein sequences, where it was found to be very fast and reliable in dealing with complicated domain structures (Enright et al. 2002). OrthoMCL generates clusters of proteins where each cluster consists of orthologs or “recent” paralogs from at least two species.
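The random-walk idea behind MCL can be illustrated with a small, dependency-free sketch that alternates expansion (matrix squaring) and inflation (entry-wise powering followed by column renormalization). This is not the `mcl` program from micans.org, nor the code used by OrthoMCL: the function names, the self-loop value, the convergence tolerance, and the toy graph are all assumptions made for illustration.

```python
def normalize_columns(m):
    """Rescale every column to sum to 1 (column-stochastic Markov matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] if sums[j] else 0.0 for j in range(n)]
            for i in range(n)]


def expand(m):
    """Expansion: matrix squaring, i.e. two steps of the random walk."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, inflation):
    """Inflation: entry-wise power followed by column renormalization."""
    return normalize_columns([[x ** inflation for x in row] for row in m])


def mcl(weights, inflation=1.5, self_loop=1.0, max_iter=100, tol=1e-9, eps=1e-6):
    """Cluster a symmetric similarity matrix; returns sorted lists of node indices."""
    n = len(weights)
    # Self-loops are a common MCL convention; the value 1.0 is an assumption.
    m = [[self_loop if i == j else float(weights[i][j]) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(max_iter):
        nxt = inflate(expand(m), inflation)
        delta = max(abs(nxt[i][j] - m[i][j]) for i in range(n) for j in range(n))
        m = nxt
        if delta < tol:
            break
    # Rows that keep weight on their own diagonal act as attractors; the
    # columns with flow into an attractor row form that attractor's cluster.
    clusters = {frozenset(j for j in range(n) if m[i][j] > eps)
                for i in range(n) if m[i][i] > eps}
    return sorted(sorted(c) for c in clusters)


# Toy graph (not from the paper): two triangles joined by one weak edge.
toy = [
    [0, 1, 1, 0,   0, 0],
    [1, 0, 1, 0,   0, 0],
    [1, 1, 0, 0.1, 0, 0],
    [0, 0, 0.1, 0, 1, 1],
    [0, 0, 0, 1,   0, 1],
    [0, 0, 0, 1,   1, 0],
]
print(mcl(toy, inflation=2.0))
```

The dense pure-Python matrices keep the sketch readable; a sparse representation would be needed for proteome-scale graphs.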
We have now employed OrthoMCL to examine the proteomes from seven genomes, including Homo sapiens (human), Drosophila melanogaster (fruit fly), Caenorhabditis elegans (nematode worm), Saccharomyces cerevisiae (yeast), the flowering plant Arabidopsis thaliana, the protozoan malaria parasite Plasmodium falciparum, and the bacterium Escherichia coli; results can be examined online (http://www.cbil.upenn.edu/gene-family). The underlying object-based relational storage model GUS (Genomic Unified Schema; Davidson et al. 2001) also hosts the human–mouse DoTS gene index (http://www.allgenes.org) and the Plasmodium Genome Database PlasmoDB (http://PlasmoDB.org; Kissinger et al. 2002; Bahl et al. 2003), permitting these results to be integrated with various organismal data types to facilitate comprehensive data mining.

RESULTS

Identification of Orthologous Groups by OrthoMCL

The OrthoMCL procedure starts with all-against-all BLASTP comparisons of a set of protein sequences from genomes of interest (Fig. 1). Putative orthologous relationships are identified between pairs of genomes by reciprocal best similarity pairs. For each putative ortholog, probable “recent” paralogs are identified as sequences within the same genome that are (reciprocally) more similar to each other than either is to any sequence from another genome. A P-value cut-off of 1e-5 was chosen for putative orthologs or paralogs, based on empirical studies.

Next, putative orthologous and paralogous relationships are converted into a graph in which the nodes represent protein sequences, and the weighted edges represent their relationships. As shown in Figure 2, weights are initially computed as the average −log10 (P-value) of BLAST results for each pair of sequences.
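The pair-finding rules described above (reciprocal best hits between species, reciprocally better hits within a species, and the initial edge weight) can be sketched in a few lines of Python. This is a minimal illustration and not the OrthoMCL pipeline itself: the input layout (`hits`, `species`), every function name, the toy P-values, and the `floor` used to avoid taking the logarithm of zero are assumptions.

```python
import math
from collections import defaultdict

P_CUTOFF = 1e-5  # cut-off quoted in the text


def best_hits(hits, species):
    """Map (query, target species) to the subject with the smallest P-value."""
    best = {}
    for (q, s), p in hits.items():
        if q == s or p > P_CUTOFF:
            continue
        key = (q, species[s])
        if key not in best or p < hits[(q, best[key])]:
            best[key] = s
    return best


def ortholog_pairs(hits, species):
    """Putative orthologs: reciprocal best hits between two different species."""
    best = best_hits(hits, species)
    pairs = set()
    for (q, target_species), s in best.items():
        if target_species != species[q] and best.get((s, species[q])) == q:
            pairs.add(frozenset((q, s)))
    return pairs


def recent_paralog_pairs(hits, species, orthologs):
    """Same-species pairs reciprocally better than any cross-species hit."""
    cross = defaultdict(lambda: float("inf"))  # best cross-species P per protein
    for (q, s), p in hits.items():
        if species[q] != species[s]:
            cross[q] = min(cross[q], p)
    has_ortholog = {x for pair in orthologs for x in pair}
    pairs = set()
    for (q, s), p in hits.items():
        if q == s or species[q] != species[s] or p > P_CUTOFF:
            continue
        back = hits.get((s, q))
        if back is None or back > P_CUTOFF:
            continue
        if p < cross[q] and back < cross[s] and (q in has_ortholog or s in has_ortholog):
            pairs.add(frozenset((q, s)))
    return pairs


def edge_weight(hits, a, b, floor=1e-300):
    """Average -log10(P) over both directions; `floor` is an assumed guard for P = 0."""
    return (-math.log10(max(hits[(a, b)], floor))
            - math.log10(max(hits[(b, a)], floor))) / 2


# Toy data (hypothetical identifiers and P-values, not from the paper).
species = {"f1": "fly", "f2": "fly", "f3": "fly", "w1": "worm", "w2": "worm"}
hits = {
    ("f1", "w1"): 1e-50, ("w1", "f1"): 1e-48,
    ("f2", "w1"): 1e-30, ("w1", "f2"): 1e-30,
    ("f1", "f2"): 1e-80, ("f2", "f1"): 1e-78,
    ("f3", "w2"): 1e-3,  ("w2", "f3"): 1e-3,   # above the cut-off, ignored
}
orthologs = ortholog_pairs(hits, species)
paralogs = recent_paralog_pairs(hits, species, orthologs)
print(orthologs, paralogs, edge_weight(hits, "f1", "w1"))
```

In the toy data, `f2` is closer to `f1` than either is to the worm protein, so it joins the group as a “recent” paralog even though it is not itself a reciprocal best hit of `w1`.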
Because the high similarity of “recent” paralogs relative to orthologs can bias the clustering process, edge weights are then normalized to reflect the average weight for all ortholog pairs in these two species (or “recent” paralogs when comparing within species). Although more sophisticated weighting schemes can be envisioned, this simple method for adjusting the systematic bias between edges connecting sequences within the same genome and edges connecting sequences from different genomes seems to generate satisfactory results, judging from the comparison with INPARANOID, the EGO database, and EC annotations (see below).

The resulting graph is represented by a symmetric similarity matrix to which the MCL algorithm (Enright et al. 2002) is applied. MCL uses flow simulation and considers all the relationships in the graph globally and simultaneously during clustering, providing a robust method for separating diverged paralogs, distant orthologs mistakenly assigned based on (weak) reciprocal best hits, and sequences with different domain structures. An important parameter in the MCL algorithm is the inflation value, regulating the cluster tightness (granularity); increasing the inflation value increases cluster tightness (see below). Clusters containing sequences from at least two species form the final output of this procedure: clustered groups of orthologs and “recent” paralogs.

Figure 1. Flow chart of the OrthoMCL algorithm for clustering orthologous proteins.

OrthoMCL Performance on a Pairwise Comparison of Worm and Fly Proteomes

In order to evaluate the performance of OrthoMCL on pairwise comparisons between two species, both OrthoMCL and INPARANOID were applied to the complete set of protein predictions for the fly and worm.
Because OrthoMCL uses WU-BLAST results for sequence similarities and INPARANOID uses NCBI-BLAST, INPARANOID was adapted to use the −log10 (P-value) from WU-BLAST as a similarity measure, rather than the NCBI-BLAST bit score (see Methods). INPARANOID identified slightly more orthologous sequences than previously reported (Remm et al. 2001), due to a lower stringency in filtering BLAST results (see Methods). The computational time required for the application of either method is primarily attributable to BLAST analysis and postprocessing of these results. Given processed BLAST results, OrthoMCL required ∼35 min on a Linux i686 computer to cluster the worm and fly protein data sets (including database transactions), whereas INPARANOID required 15 min (note that OrthoMCL is implemented as a pipeline on a relational database whereas INPARANOID operates on flat files). Clusters obtained using the two methods were compared by determining the number of groups that are identical, and those that are coherent, that is, where the sequences in a group generated by one method are a subset of sequences in a group generated by the other (note that identical groups are a subset of coherent groups). As shown in Table 1, from a total of 33,062 proteins (13,288 fly; 19,774 worm), OrthoMCL clustered 10,849 sequences (33% of the total data set) into 4061 groups, whereas INPARANOID clustered 11,357 sequences (34%) into 4135 groups. We found that 10,597 sequences (32% of the total data set) were recognized by both OrthoMCL and INPARANOID. Thus, 98% of the proteins grouped by OrthoMCL were also grouped by INPARANOID, whereas 93% of the proteins grouped by INPARANOID were also grouped by OrthoMCL. In addition, 8629 proteins (81% of the total number grouped by both algorithms) were grouped into 3735 identical groups, representing 92% of the total number of orthologous groups identified by OrthoMCL, and 90% of the INPARANOID groups.
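The definitions of identical and coherent groups used in this comparison can be made concrete with a short sketch. The function name, the returned keys, and the toy groupings are assumptions for illustration; only the two group counts at the end (3735 identical groups out of 4061 and 4135) are taken from the text.

```python
def compare_groupings(groups_a, groups_b):
    """Count identical groups and groups of one method contained in the other."""
    a = [frozenset(g) for g in groups_a]
    b = [frozenset(g) for g in groups_b]
    b_lookup = set(b)
    return {
        # a group is identical when both methods produce exactly the same set
        "identical": sum(1 for g in a if g in b_lookup),
        # a group is coherent when it is a subset of some group from the other
        # method (identical groups therefore count as coherent as well)
        "a_subset_of_b": sum(1 for g in a if any(g <= h for h in b)),
        "b_subset_of_a": sum(1 for h in b if any(h <= g for g in a)),
    }


# Toy groupings with hypothetical identifiers.
method_a = [{"f1", "w1"}, {"f2", "f3", "w2"}, {"f4", "w4"}]
method_b = [{"f1", "w1"}, {"f2", "w2"}, {"f4", "w4", "w5"}, {"f6", "w6"}]
result = compare_groupings(method_a, method_b)
print(result)

# Percentages of identical groups quoted in the text.
print(round(100 * 3735 / 4061), round(100 * 3735 / 4135))
```

The pairwise subset test is quadratic in the number of groups, which is acceptable for a few thousand groups; an index from sequence to group would scale better.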
Overall, 10,229 proteins (97%) formed coherent groups; 3888 OrthoMCL groups (96%) were a subset of an INPARANOID group, and 3912 INPARANOID groups (95%) were a subset of an OrthoMCL group. These results demonstrate that when employed for the comparison of two genomes, OrthoMCL and INPARANOID exhibit very similar performance.

OrthoMCL Performance on a Three-Species Data Set (Yeast, Worm, Fly)

A serious limitation to the general application of INPARANOID for comparative genomics applications is that this algorithm can only be employed to compare two sets of proteins, as noted above. In contrast, OrthoMCL can be applied to all-against-all

Figure 2. Illustration of sequence relationships and similarity matrix construction. Dotted arrows represent “recent” paralogy (duplication subsequent to speciation); solid arrows represent orthology. The upper right half of the matrix contains initial weights calculated as average −log10 (P-value) from pairwise WU-BLASTP similarities. The lower left half contains corrected weights supplied to the MCL algorithm; the edge weight connecting each pair of sequences, w_ij, is divided by W_ij/W, where W represents the average weight among all ortholog (underlined) and “recent” paralog (italicized) pairs, and W_ij represents the average edge weight among all ortholog pairs from species i and j. The net result of this normalization is to correct for systematic differences in comparisons between two species (e.g., differences attributable to nucleotide composition bias), and when i = j, to minimize the impact of “recent” paralogs (duplication within a given species) on the clustering of cross-species orthologs.

Table 1. Comparison of Ortholog Groups Identified by OrthoMCL vs. INPARANOID

| | Total | OrthoMCL [a] | INPARANOID | Grouped by both (∩) [b] | Identical groups | Coherent groups |
|---|---|---|---|---|---|---|
| # Protein sequences | 33,062 | 10,849 (33%) | 11,357 (34%) | 10,597 (98/93%) | 8,629 (81%) [c] | 10,229 (97%) [c] |
| Fly data set | 13,288 | 5,133 (39%) | 5,550 (42%) | 5,006 (98/90%) | 4,058 (81%) | 4,820 (96%) |
| Worm data set | 19,774 | 5,716 (29%) | 5,807 (29%) | 5,591 (98/96%) | 4,571 (82%) | 5,409 (97%) |
| # Groups | | 4,061 | 4,135 | | 3,735 (92/90%) [d] | 3,888/3,912 [e] (96/95%) [d] |

[a] Using inflation index I = 1.5 (see text).
[b] Percentages indicate percent of sequences grouped by either OrthoMCL (left) or INPARANOID (right).
[c] Percent of sequences grouped by both OrthoMCL and INPARANOID.
[d] Percent of OrthoMCL groups (left); percent of INPARANOID groups (right).
[e] OrthoMCL groups entirely contained within INPARANOID groups (left); INPARANOID groups entirely contained within OrthoMCL groups (right).

Table 2. Comparison of Ortholog Groups Identified by OrthoMCL vs. EGO

| | Total | OrthoMCL [a] | EGO [b] | Grouped by both (∩) | Identical groups | EGO extends OrthoMCL [e] | OrthoMCL extends EGO [f] | Coherent groups |
|---|---|---|---|---|---|---|---|---|
| # Protein sequences | 39,420 | 13,851 (35%) | 5,286 (13%) | 4,959 (36/94%) [c] | 2,432 (49%) [d] | 158 (3%) [d] | 3,004 (61%) [d] | 4,716 (95%) [d] |
| Yeast data set | 6,358 | 2,531 (40%) | 923 (15%) | 882 (35/96%) | 452 (51%) | 38 (4%) | 617 (70%) | 827 (94%) |
| Fly data set | 13,288 | 5,409 (41%) | 2,138 (17%) | 2,018 (37/94%) | 987 (49%) | 66 (3%) | 1,174 (58%) | 1,928 (96%) |
| Worm data set | 19,774 | 5,911 (30%) | 2,225 (11%) | 2,059 (35/93%) | 993 (48%) | 54 (3%) | 1,213 (59%) | 1,961 (95%) |
| # Groups | | 4,425 | 3,620 | not applicable | 989 (22/27%) [g] | 70 (2%) [i] | 2,038 (56%) [j] | 1,059/3,027 [h] (24/84%) [g] |
| Yeast, fly not worm: sequences | | 586 (4%) [k] | 816 (17%) [l] | 56 (10/7%) | 40 (71%) | 2 (4%) | 9 (16%) | 51 (92%) |
| Yeast, fly not worm: groups | | 215 (5%) [i] | 440 (12%) [j] | not applicable | 20 (9/5%) | 1 (0.5%) | 5 (1%) | 21/25 (10/6%) |
| Yeast, worm not fly: sequences | | 470 (3%) | 911 (17%) | 62 (13/7%) | 28 (45%) | 0 (0%) | 30 (48%) | 58 (94%) |
| Yeast, worm not fly: groups | | 155 (4%) | 492 (14%) | not applicable | 14 (9/3%) | 0 (0%) | 19 (4%) | 14/33 (9/7%) |
| Fly, worm not yeast: sequences | | 6,337 (46%) | 3,568 (67%) | 1,614 (25/45%) | 1,161 (72%) | 18 (1%) | 390 (24%) | 1,535 (95%) |
| Fly, worm not yeast: groups | | 2,307 (52%) | 1,874 (52%) | not applicable | 571 (25/30%) | 9 (0.4%) | 208 (11%) | 580/779 (25/42%) |
| Fly, worm and yeast: sequences | | 6,458 (47%) | 2,105 (40%) | 1,868 (29/89%) | 1,203 (64%) | 48 (3%) | 654 (35%) | 1,763 (94%) |
| Fly, worm and yeast: groups | | 1,748 (40%) | 814 (22%) | not applicable | 384 (22/47%) | 16 (1%) | 263 (32%) | 400/647 (23/79%) |

[a] Using inflation index I = 2.5 (see text).
[b] See text for description of pruning method used to identify groups containing sequences from at least two of the three species under consideration (yeast, fly, and worm).
[c] Percent of sequences grouped by OrthoMCL (left) or EGO (right).
[d] Percent of sequences grouped by both OrthoMCL and EGO.
[e] Coherent (but not identical) groups extended by EGO (EG
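The weight correction described in the Figure 2 legend, dividing each edge weight w_ij by W_ij/W, can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the `edges` and `species` layout, and the toy weights are hypothetical, and every edge passed in is assumed to be an ortholog or “recent” paralog pair already.

```python
from collections import defaultdict


def normalize_edges(edges, species):
    """Return corrected weights w_ij / (W_ij / W) for every edge."""
    # W: average weight over all ortholog and "recent" paralog pairs
    overall = sum(edges.values()) / len(edges)
    # W_ij: average weight for each species pair (i == j covers paralogs)
    per_pair = defaultdict(list)
    for (a, b), w in edges.items():
        per_pair[frozenset((species[a], species[b]))].append(w)
    mean = {key: sum(ws) / len(ws) for key, ws in per_pair.items()}
    return {
        (a, b): w / (mean[frozenset((species[a], species[b]))] / overall)
        for (a, b), w in edges.items()
    }


# Toy weights (hypothetical): two fly-worm ortholog pairs, one fly paralog pair.
species = {"f1": "fly", "f2": "fly", "f3": "fly", "w1": "worm", "w2": "worm"}
edges = {("f1", "w1"): 50.0, ("f2", "w2"): 30.0, ("f1", "f3"): 100.0}
corrected = normalize_edges(edges, species)
print(corrected)
```

After correction, the mean weight within each species pair equals the overall mean W, which is how the high similarity of within-species paralogs is kept from dominating the clustering.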