OrthoMCL: Identification of Ortholog Groups
for Eukaryotic Genomes
Li Li, Christian J. Stoeckert Jr., and David S. Roos1
Departments of Biology and Genetics, Center for Bioinformatics, and Genomics Institute, University of Pennsylvania,
Philadelphia, Pennsylvania 19104, USA
The identification of orthologous groups is useful for genome annotation, studies on gene/protein evolution,
comparative genomics, and the identification of taxonomically restricted sequences. Methods successfully exploited
for prokaryotic genome analysis have proved difficult to apply to eukaryotes, however, as larger genomes may
contain multiple paralogous genes, and sequence information is often incomplete. OrthoMCL provides a scalable
method for constructing orthologous groups across multiple eukaryotic taxa, using a Markov Cluster algorithm to
group (putative) orthologs and paralogs. This method performs similarly to the INPARANOID algorithm when
applied to two genomes, but can be extended to cluster orthologs from multiple species. OrthoMCL clusters are
coherent with groups identified by EGO, but improved recognition of “recent” paralogs permits overlapping EGO
groups representing the same gene to be merged. Comparison with previously assigned EC annotations suggests a
high degree of reliability, implying utility for automated eukaryotic genome annotation. OrthoMCL has been
applied to the proteome data set from seven publicly available genomes (human, fly, worm, yeast, Arabidopsis, the
malaria parasite Plasmodium falciparum, and Escherichia coli). A Web interface allows queries based on individual genes or
user-defined phylogenetic patterns (http://www.cbil.upenn.edu/gene-family). Analysis of clusters incorporating P.
falciparum genes identifies numerous enzymes that were incompletely annotated in first-pass annotation of the parasite
genome.
With the progress of large-scale sequencing efforts, comparative
genomic approaches have increasingly been employed to facili-
tate both evolutionary and functional analyses: Conserved se-
quences can be used to infer evolutionary history, and to the
extent that homology implies conserved biochemical function,
this information may be used to facilitate genome annotation.
The concepts of orthology and paralogy originated from the field
of molecular systematics (Fitch 1970), and have recently been
applied to functional characterizations and classifications on the
scale of whole-genome comparisons (Tatusov et al. 1997, 2000,
2001; Chervitz et al. 1998; Mushegian et al. 1998; Wheelan et al.
1999; Rubin et al. 2000). Orthologs and paralogs constitute two
major types of homologs: The first evolved from a common an-
cestor by speciation, and the latter are related by duplication
events (Fitch 1970, 2000). Although we can assume that paralogs
arising from ancient duplication events are likely to have diverged
in function (as in the case of α- and β-tubulins), true
orthologs (e.g., α-tubulin from yeast and flies) are likely to retain
identical function over evolutionary time, making ortholog iden-
tification a valuable tool for gene annotation. In comparative
genomics, the clustering of orthologous genes provides a frame-
work for integrating information from multiple genomes, high-
lighting the divergence and conservation of gene families and
biological processes. For pathogens such as the human malaria
parasite Plasmodium falciparum (Gardner et al. 2002; Kissinger
et al. 2002; Bahl et al. 2003), orthologous groupings can facilitate
the identification of candidates for drug and/or vaccine develop-
ment.
The identification of orthologous groups in prokaryotic ge-
nomes has permitted cross-referencing of genes from multiple
species, facilitating genome annotation, protein family classifi-
cation, studies on bacterial evolution, and the identification of
candidates for antibacterial drug development (Tatusov et al.
1997; Galperin and Koonin 1999; Natale et al. 2000a,b; Forterre
2002). The Clusters of Orthologous Groups (COG) database
(http://www.ncbi.nlm.nih.gov/COG/) is constructed based on
all-against-all BLAST searches of complete proteomes (Tatusov et
al. 2000, 2001). Sequences from distinct genomes that are recip-
rocal best hits (i.e., the first sequence finds the second sequence
as its best hit in the second species, and vice versa) are identified
as a pair of orthologs, and “COGs” recognizing relationships
among at least three distinct lineages (triangles) have been iden-
tified across distant phylogenetic lineages.
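The reciprocal-best-hit criterion described above can be sketched in a few lines of Python. This is a minimal illustration, not the COG pipeline itself: the function name `reciprocal_best_hits`, the input layout (a list of `(query, subject, score)` tuples plus a protein-to-species mapping), and the protein identifiers are all assumptions made for the example. Ties are resolved by keeping the first hit seen.

```python
def reciprocal_best_hits(hits, species):
    """Return the set of between-species reciprocal best hit pairs.

    hits:    iterable of (query, subject, score); a higher score means
             greater similarity (hypothetical input layout).
    species: dict mapping protein id -> species name.
    """
    # best[(query, target_species)] = (score, subject)
    best = {}
    for query, subject, score in hits:
        if species[query] == species[subject]:
            continue  # only between-species hits are considered here
        key = (query, species[subject])
        if key not in best or score > best[key][0]:
            best[key] = (score, subject)

    pairs = set()
    for (query, _target), (_score, subject) in best.items():
        # The subject must find the query as its best hit in the query's species.
        back = best.get((subject, species[query]))
        if back is not None and back[1] == query:
            pairs.add(frozenset((query, subject)))
    return pairs


# Toy data with made-up identifiers: y1 is a yeast protein, f1 and f2 are fly proteins.
demo_species = {"y1": "yeast", "f1": "fly", "f2": "fly"}
demo_hits = [
    ("y1", "f1", 80.0),
    ("y1", "f2", 50.0),
    ("f1", "y1", 75.0),
    ("f2", "y1", 60.0),
]
demo_rbh = reciprocal_best_hits(demo_hits, demo_species)
print(demo_rbh)
```

In the toy data, f2 finds y1 as its best yeast hit, but y1 prefers f1, so only the y1/f1 pair is reciprocal.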
Although Saccharomyces cerevisiae is included in the COG
database, general application of this approach in the construc-
tion of orthologous groups for other eukaryotic genomes has
proved problematic (even for complete prokaryotic genomes, ex-
tensive manual inspection of COGs is often required to correct
false-positives and split mega-clusters). Complications associated
with ortholog group construction for eukaryotic genomes in-
clude extensive gene duplication and functional redundancy, the
multidomain structure of many proteins, and the predominance
of incomplete eukaryotic genome sequencing (Doolittle 1995;
Henikoff et al. 1997). These challenges demand an approach able
to distinguish between “recent” paralogs (i.e., gene duplications
occurring subsequent to speciation, such as the multiple α-tubulins
found in the human genome), and “ancient” paralogs likely
to exhibit different function(s). Recent paralogs (which are
equally related to orthologs in other species) are likely to retain
similar function, and should be grouped with true orthologs. It is
also important to assess global relationships among orthologs,
without being misled by local relationships coming from com-
plicated domain structures or incorrect ortholog assignments.
Unfortunately, the computational costs of multiple sequence
alignments and phylogenetic tree construction, and the diffi-
culty in interpreting such alignments and trees, preclude a phy-
logenetic approach for whole-genome comparisons in eukary-
otes.
1Corresponding author.
E-MAIL droos@sas.upenn.edu; FAX (215) 746-6697.
Article and publication are at http://www.genome.org/cgi/doi/10.1101/
gr.1224503.
Methods
2178 Genome Research 13:2178–2189 ©2003 by Cold Spring Harbor Laboratory Press ISSN 1088-9051/03 $5.00; www.genome.org
The INPARANOID algorithm (Remm et al. 2001) exploits a
BLAST-based strategy to identify orthologs as reciprocal best hits
between two species, while applying additional rules to accom-
modate paralogs arising from duplication after speciation (a.k.a.
in-paralogs). Note that the resulting ortholog groups include
paralogs derived by recent gene duplication, as each of these
proteins is orthologous to a protein in another species. This al-
gorithm performs well in the identification of ortholog groups
compared to a curated set of yeast versus mammalian orthologs
defined by phylogenetic methods, providing evidence that the
strategy based on reciprocal best hits works well in separating
orthologs from “ancient” paralogs. Unfortunately, the rule-based
approach used by INPARANOID is limited to pairwise comparisons
between two species. EGO (previously named TOGA; Lee et al. 2002)
applies a COG-based approach to the TIGR gene indices (Quackenbush et al. 2000, 2001),
and this method is applicable to multiple species. In the absence
of a rigorously curated data set of orthologs from multiple eu-
karyotic species, it is difficult to assess performance, but EGO is
easily misled by the functional redundancy of multiple paralogs,
and by the absence of true orthologs within incomplete genome
data sets.
Motivated by these challenges, we developed OrthoMCL as
an alternative approach for automated eukaryotic ortholog group
identification. To distinguish functional redundancy from diver-
gence, this method identifies “recent” paralogs to be included in
ortholog groups as within-species BLAST hits that are reciprocally
better than between-species hits. This approach is similar to
INPARANOID, but differs primarily in the requirement that re-
cent paralogs must be more similar to each other than to any
sequence from other species. To resolve the many-to-many or-
thologous relationships inherent in comparisons across multiple
genomes, OrthoMCL applies the Markov Cluster algorithm
(MCL; Van Dongen 2000; http://micans.org/mcl/), which is
based on probability and graph flow theory and allows simulta-
neous classification of global relationships in a similarity space.
MCL simulates random walks on a graph using Markov matrices
to determine the transition probabilities among nodes of the
graph. The MCL algorithm has previously been exploited for
clustering a large set of protein sequences, where it was found to
be very fast and reliable in dealing with complicated domain
structures (Enright et al. 2002). OrthoMCL generates clusters of
proteins where each cluster consists of orthologs or “recent”
paralogs from at least two species. We have now employed Or-
thoMCL to examine the proteomes from several other genomes,
including Homo sapiens (human), Drosophila melanogaster (fruit
fly), Caenorhabditis elegans (nematode worm), Saccharomyces cerevisiae
(yeast), the flowering plant Arabidopsis thaliana, the protozoan
malaria parasite Plasmodium falciparum, and the bacterium
Escherichia coli; results can be examined online
(http://www.cbil.upenn.edu/gene-family). The underlying object-based
relational storage model GUS (Genomic Unified Schema; David-
son et al. 2001) also hosts the human–mouse DoTS gene index
(http://www.allgenes.org) and the Plasmodium Genome Database
PlasmoDB (http://PlasmoDB.org; Kissinger et al. 2002; Bahl et al.
2003), permitting these results to be integrated with various or-
ganismal data types to facilitate comprehensive data mining.
RESULTS
Identification of Orthologous Groups by OrthoMCL
The OrthoMCL procedure starts with all-against-all BLASTP com-
parisons of a set of protein sequences from genomes of interest
(Fig. 1). Putative orthologous relationships are identified between
pairs of genomes by reciprocal best similarity pairs. For each pu-
tative ortholog, probable “recent” paralogs are identified as se-
quences within the same genome that are (reciprocally) more
similar to each other than either is to any sequence from another
genome. A P-value cut-off of 1e-5 was chosen for putative or-
thologs or paralogs, based on empirical studies.
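The "recent" paralog rule in the preceding paragraph can be illustrated with a short Python sketch. It is a simplification under stated assumptions: the function name `recent_paralogs`, the dictionary of directed P-values, and the protein identifiers are hypothetical, and the sketch tests every within-species pair rather than starting from each putative ortholog as the text describes. Pairs with a P-value at or below the 1e-5 cut-off are accepted, which is also an assumption about how the boundary is treated.

```python
P_CUTOFF = 1e-5  # cut-off quoted in the text


def recent_paralogs(pvalues, species, cutoff=P_CUTOFF):
    """Return within-species pairs that are reciprocally more similar to
    each other than either is to any protein from another species.

    pvalues: dict {(query, subject): P-value}; lower means more similar
             (hypothetical input layout).
    species: dict mapping protein id -> species name.
    """
    # Best (lowest) between-species P-value for every query protein.
    best_cross = {}
    for (query, subject), p in pvalues.items():
        if species[query] != species[subject]:
            best_cross[query] = min(p, best_cross.get(query, 1.0))

    pairs = set()
    for (query, subject), p in pvalues.items():
        if query == subject or species[query] != species[subject]:
            continue
        back = pvalues.get((subject, query))
        if p > cutoff or back is None or back > cutoff:
            continue
        # Both directions must beat the best between-species hit.
        if p < best_cross.get(query, 1.0) and back < best_cross.get(subject, 1.0):
            pairs.add(frozenset((query, subject)))
    return pairs


# Toy data with made-up identifiers: h1, h2, h3 are human; f1 is fly.
demo_species = {"h1": "human", "h2": "human", "h3": "human", "f1": "fly"}
demo_pvalues = {
    ("h1", "h2"): 1e-80, ("h2", "h1"): 1e-78,   # very close within-species pair
    ("h1", "f1"): 1e-40, ("f1", "h1"): 1e-42,
    ("h2", "f1"): 1e-35, ("f1", "h2"): 1e-30,
    ("h1", "h3"): 1e-10, ("h3", "h1"): 1e-12,   # weaker than h3's fly hit
    ("h3", "f1"): 1e-20, ("f1", "h3"): 1e-18,
}
demo_paralogs = recent_paralogs(demo_pvalues, demo_species)
print(demo_paralogs)
```

Here h1 and h2 qualify as "recent" paralogs, whereas h3 does not, because it is more similar to the fly protein than to h1.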
Next, putative orthologous and paralogous relationships are
converted into a graph in which the nodes represent protein
sequences, and the weighted edges represent their relationships.
As shown in Figure 2, weights are initially computed as the average
−log10 (P-value) of BLAST results for each pair of sequences.
Because the high similarity of “recent” paralogs relative
to orthologs can bias the clustering process, edge weights are
then normalized to reflect the average weight for all ortholog
pairs in these two species (or “recent” paralogs when comparing
within species). Although more sophisticated weighting schemes
can be envisioned, this simple method for adjusting the system-
atic bias between edges connecting sequences within the same
genome and edges connecting sequences from different genomes
seems to generate satisfactory results, judging from the compari-
son with INPARANOID, the EGO database, and EC annotations
(see below). The resulting graph is represented by a symmetric
similarity matrix to which the MCL algorithm (Enright et al.
2002) is applied. MCL uses flow simulation and considers all the
relationships in the graph globally and simultaneously during
clustering, providing a robust method for separating diverged
paralogs, distant orthologs mistakenly assigned based on (weak)
reciprocal best hits, and sequences with different domain structures.
An important parameter in the MCL algorithm is the inflation
value, regulating the cluster tightness (granularity); increasing
the inflation value increases cluster tightness (see below).
Clusters containing sequences from at least two species
form the final output of this procedure: clustered groups of orthologs
and “recent” paralogs.
Figure 1 Flow chart of the OrthoMCL algorithm for clustering orthologous
proteins.
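To make the role of the inflation value concrete, the following is a minimal pure-Python sketch of the Markov clustering idea (alternating expansion and inflation of a column-stochastic matrix). It is not the MCL program distributed at micans.org and does not reproduce its pruning or convergence logic; the function names, the fixed iteration count, the added self-loops, and the threshold `eps` are all assumptions chosen for illustration.

```python
def normalize_columns(m):
    """Scale each column to sum to 1 (column-stochastic matrix)."""
    n = len(m)
    sums = [sum(m[i][j] for i in range(n)) for j in range(n)]
    return [[m[i][j] / sums[j] for j in range(n)] for i in range(n)]


def expand(m):
    """Expansion: square the matrix, i.e. take two random-walk steps."""
    n = len(m)
    return [[sum(m[i][k] * m[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def inflate(m, inflation):
    """Inflation: raise entries to a power, then renormalize the columns.
    Larger inflation values favor strong edges and give tighter clusters."""
    return normalize_columns([[x ** inflation for x in row] for row in m])


def mcl(weights, inflation=1.5, iterations=50, self_loop=1.0, eps=1e-6):
    """Cluster a symmetric similarity matrix; returns a set of frozensets
    of node indices. Fixed iteration count instead of a convergence test."""
    n = len(weights)
    m = [[weights[i][j] + (self_loop if i == j else 0.0) for j in range(n)]
         for i in range(n)]
    m = normalize_columns(m)
    for _ in range(iterations):
        m = inflate(expand(m), inflation)
    clusters = set()
    for row in m:
        # Rows that retain flow mark attractors; their columns are the members.
        members = frozenset(j for j, x in enumerate(row) if x > eps)
        if members:
            clusters.add(members)
    return clusters


# Toy graph: two triangles with no edges between them.
demo_matrix = [
    [0, 1, 1, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 1, 0, 1],
    [0, 0, 0, 1, 1, 0],
]
demo_clusters = mcl(demo_matrix, inflation=1.5)
print(sorted(sorted(c) for c in demo_clusters))
```

The toy graph is deliberately trivial (two disconnected triangles), so the two clusters are recovered regardless of the inflation value; on real similarity graphs the inflation value determines how readily weakly connected groups are split.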
OrthoMCL Performance on a Pairwise Comparison
of Worm and Fly Proteomes
In order to evaluate the performance of OrthoMCL on pair-
wise comparisons between two species, both OrthoMCL and
INPARANOID were applied to the complete set of protein pre-
dictions for the fly and worm. Because OrthoMCL uses
WU-BLAST results for sequence similarities and INPARANOID
uses NCBI-BLAST, INPARANOID was adapted to use the −log10
(P-value) from WU-BLAST as a similarity measure, rather than the
NCBI-BLAST bit score (see Methods). INPARANOID identified
slightly more orthologous sequences than previously reported
(Remm et al. 2001), due to a lower stringency in filtering BLAST
results (see Methods). The computational time required for the
application of either method is primarily at-
tributable to BLAST analysis and postpro-
cessing of these results. Given processed
BLAST results, OrthoMCL required ∼35 min
on a Linux i686 computer to cluster
the worm and fly protein data sets (in-
cluding database transactions), whereas
INPARANOID required 15 min (note that
OrthoMCL is implemented as a pipeline on
a relational database whereas INPARANOID
operates on flat files). Clusters obtained us-
ing the two methods were compared by de-
termining the number of groups that are
identical, and those that are coherent—that
is, where the sequences in a group gener-
ated by one method are a subset of se-
quences in a group generated by the other
(note that identical groups are a subset of
coherent groups).
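The definitions of identical and coherent groups given above translate directly into set comparisons. The sketch below is illustrative only; the function name `compare_groupings` and the toy group memberships are assumptions and do not come from the fly and worm data.

```python
def compare_groupings(groups_a, groups_b):
    """Compare two groupings of sequences.

    Returns (identical, a_in_b, b_in_a):
      identical: groups of A that also occur, member for member, in B
      a_in_b:    groups of A that are a subset of some group of B
      b_in_a:    groups of B that are a subset of some group of A
    Identical groups are counted in both subset lists, since identical
    groups are a special case of coherent groups.
    """
    a = [frozenset(g) for g in groups_a]
    b = [frozenset(g) for g in groups_b]
    set_b = set(b)
    identical = [g for g in a if g in set_b]
    a_in_b = [g for g in a if any(g <= h for h in b)]
    b_in_a = [h for h in b if any(h <= g for g in a)]
    return identical, a_in_b, b_in_a


# Toy groupings with made-up fly (f*) and worm (w*) identifiers.
method_a = [{"f1", "w1"}, {"f2", "w2"}, {"f3", "w3", "w4"}]
method_b = [{"f1", "w1"}, {"f2", "w2", "w5"}, {"f3", "w3"}, {"f4", "w6"}]
identical, a_in_b, b_in_a = compare_groupings(method_a, method_b)
print(len(identical), len(a_in_b), len(b_in_a))
```

In this toy case one group is identical, two groups from the first method are contained in a group from the second, and two groups from the second method are contained in a group from the first.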
As shown in Table 1, from a total of
33,062 proteins (13,288 fly; 19,774 worm),
OrthoMCL clustered 10,849 sequences
(33% of the total data set) into 4061 groups,
whereas INPARANOID clustered 11,357
sequences (34%) into 4135 groups. We
found that 10,597 sequences (32% of the
total data set) were recognized by both
OrthoMCL and INPARANOID. Thus, 98%
of the proteins grouped by OrthoMCL
were also grouped by INPARANOID,
whereas 93% of the proteins grouped by
INPARANOID were also grouped by
OrthoMCL. In addition, 8629 proteins
(81% of the total number grouped by both algorithms) were
grouped into 3735 identical groups, representing 92% of the total
number of orthologous groups identified by OrthoMCL, and
90% of the INPARANOID groups. In total, 10,229
proteins (97%) formed coherent groups; 3888 OrthoMCL groups
(96%) were a subset of an INPARANOID group, and 3912
INPARANOID groups (95%) were a subset of an OrthoMCL
group. These results demonstrate that when employed for the
comparison of two genomes, OrthoMCL and INPARANOID ex-
hibit very similar performances.
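The percentages quoted in this section follow from the counts reported in Table 1. As a quick arithmetic check, the snippet below recomputes them; the variable names are ours, and rounding to whole percentages is assumed.

```python
def pct(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


# Counts taken from the text and Table 1.
total_proteins = 33062
orthomcl_seqs, inparanoid_seqs, both_seqs = 10849, 11357, 10597
identical_seqs, coherent_seqs = 8629, 10229
orthomcl_groups, inparanoid_groups = 4061, 4135
identical_groups = 3735
orthomcl_in_inparanoid, inparanoid_in_orthomcl = 3888, 3912

checks = {
    "OrthoMCL share of all proteins": pct(orthomcl_seqs, total_proteins),
    "INPARANOID share of all proteins": pct(inparanoid_seqs, total_proteins),
    "OrthoMCL proteins also in INPARANOID": pct(both_seqs, orthomcl_seqs),
    "INPARANOID proteins also in OrthoMCL": pct(both_seqs, inparanoid_seqs),
    "shared proteins in identical groups": pct(identical_seqs, both_seqs),
    "shared proteins in coherent groups": pct(coherent_seqs, both_seqs),
    "identical groups / OrthoMCL groups": pct(identical_groups, orthomcl_groups),
    "identical groups / INPARANOID groups": pct(identical_groups, inparanoid_groups),
    "OrthoMCL groups within INPARANOID groups": pct(orthomcl_in_inparanoid, orthomcl_groups),
    "INPARANOID groups within OrthoMCL groups": pct(inparanoid_in_orthomcl, inparanoid_groups),
}
for label, value in checks.items():
    print(f"{label}: {value}%")
```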
OrthoMCL Performance on a Three-Species Data Set
(Yeast, Worm, Fly)
A serious limitation to the general application of INPARANOID
for comparative genomics applications is that this algorithm can
only be employed to compare two sets of proteins, as noted
above. In contrast, OrthoMCL can be applied to all-against-all
Figure 2 Illustration of sequence relationships and similarity matrix construction. Dotted arrows
represent “recent” paralogy (duplication subsequent to speciation); solid arrows represent orthology.
The upper right half of the matrix contains initial weights calculated as average −log10
(P-value) from pairwise WU-BLASTP similarities. The lower left half contains corrected weights
supplied to the MCL algorithm; the edge weight connecting each pair of sequences, w_ij, is divided
by W_ij/W, where W represents the average weight among all ortholog (underlined) and “recent”
paralog (italicized) pairs, and W_ij represents the average edge weight among all ortholog pairs from
species i and j. The net result of this normalization is to correct for systematic differences in
comparisons between two species (e.g., differences attributable to nucleotide composition bias),
and when i = j, to minimize the impact of “recent” paralogs (duplication within a given species) on
the clustering of cross-species orthologs.
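The normalization described in the Figure 2 legend can be sketched as follows. This is one reading of the legend, not the authors' code: each edge weight is divided by W_ij/W, where W_ij is taken as the mean weight of all edges joining the same pair of species (or the same species, for "recent" paralogs) and W is the mean over all edges. The function name `normalize_weights`, the input layout, and the toy weights are assumptions.

```python
from collections import defaultdict


def normalize_weights(edges, species):
    """Divide each edge weight by W_ij / W.

    edges:   dict {(protein_a, protein_b): weight}, one entry per pair,
             with weight = average -log10(P-value) (hypothetical layout).
    species: dict mapping protein id -> species name.
    """
    def species_key(a, b):
        # A single-element frozenset stands for a within-species pair.
        return frozenset((species[a], species[b]))

    by_species = defaultdict(list)
    for (a, b), w in edges.items():
        by_species[species_key(a, b)].append(w)

    overall = sum(edges.values()) / len(edges)                       # W
    pair_avg = {k: sum(v) / len(v) for k, v in by_species.items()}   # W_ij

    return {(a, b): w / (pair_avg[species_key(a, b)] / overall)
            for (a, b), w in edges.items()}


# Toy data with made-up identifiers: h1 and h2 are human "recent" paralogs,
# f1 is their fly ortholog.
demo_species = {"h1": "human", "h2": "human", "f1": "fly"}
demo_edges = {("h1", "h2"): 180.0, ("h1", "f1"): 60.0, ("h2", "f1"): 40.0}
demo_normalized = normalize_weights(demo_edges, demo_species)
for pair, weight in demo_normalized.items():
    print(pair, round(weight, 2))
```

In the toy data the within-species edge starts out three times heavier than the strongest cross-species edge; after normalization it is scaled down to the overall mean, so it no longer dominates the cross-species edges.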
Table 1. Comparison of Ortholog Groups Identified by OrthoMCL vs. INPARANOID
                     Total   OrthoMCLa     INPARANOID    Grouped by both (∩)b  Identical groups  Coherent groups
# Protein sequences  33,062  10,849 (33%)  11,357 (34%)  10,597 (98/93%)       8,629 (81%)c      10,229 (97%)c
Fly data set         13,288  5,133 (39%)   5,550 (42%)   5,006 (98/90%)        4,058 (81%)       4,820 (96%)
Worm data set        19,774  5,716 (29%)   5,807 (29%)   5,591 (98/96%)        4,571 (82%)       5,409 (97%)
# Groups                     4,061         4,135                               3,735 (92/90%)d   3,888/3,912e (96/95%)d
a Using inflation index I = 1.5 (see text).
b Percentages indicate percent of sequences grouped by either OrthoMCL (left) or INPARANOID (right).
c Percent of sequences grouped by both OrthoMCL and INPARANOID.
d Percent of OrthoMCL groups (left); percent of INPARANOID groups (right).
e OrthoMCL groups entirely contained within INPARANOID groups (left); INPARANOID groups entirely contained within OrthoMCL groups (right).
Table 2. Comparison of Ortholog Groups Identified by OrthoMCL vs. EGO
                      Total   OrthoMCLa      EGOb          Grouped by both (∩)  Identical groups  EGO extends OrthoMCLe  OrthoMCL extends EGOf  Coherent groups
# Protein sequences   39,420  13,851 (35%)   5,286 (13%)   4,959 (36/94%)c      2,432 (49%)d      158 (3%)d              3,004 (61%)d           4,716 (95%)d
Yeast data set        6,358   2,531 (40%)    923 (15%)     882 (35/96%)         452 (51%)         38 (4%)                617 (70%)              827 (94%)
Fly data set          13,288  5,409 (41%)    2,138 (17%)   2,018 (37/94%)       987 (49%)         66 (3%)                1,174 (58%)            1,928 (96%)
Worm data set         19,774  5,911 (30%)    2,225 (11%)   2,059 (35/93%)       993 (48%)         54 (3%)                1,213 (59%)            1,961 (95%)
# Groups                      4,425          3,620         not applicable       989 (22/27%)g     70 (2%)i               2,038 (56%)j           1,059/3,027h (24/84%)g
Yeast, fly not worm
  sequences                   586 (4%)k      816 (17%)l    56 (10/7%)           40 (71%)          2 (4%)                 9 (16%)                51 (92%)
  groups                      215 (5%)i      440 (12%)j    not applicable       20 (9/5%)         1 (0.5%)               5 (1%)                 21/25 (10/6%)
Yeast, worm not fly
  sequences                   470 (3%)       911 (17%)     62 (13/7%)           28 (45%)          0 (0%)                 30 (48%)               58 (94%)
  groups                      155 (4%)       492 (14%)     not applicable       14 (9/3%)         0 (0%)                 19 (4%)                14/33 (9/7%)
Fly, worm not yeast
  sequences                   6,337 (46%)    3,568 (67%)   1,614 (25/45%)       1,161 (72%)       18 (1%)                390 (24%)              1,535 (95%)
  groups                      2,307 (52%)    1,874 (52%)   not applicable       571 (25/30%)      9 (0.4%)               208 (11%)              580/779 (25/42%)
Fly, worm and yeast
  sequences                   6,458 (47%)    2,105 (40%)   1,868 (29/89%)       1,203 (64%)       48 (3%)                654 (35%)              1,763 (94%)
  groups                      1,748 (40%)    814 (22%)     not applicable       384 (22/47%)      16 (1%)                263 (32%)              400/647 (23/79%)
a Using inflation index I = 2.5 (see text).
b See text for description of pruning method used to identify groups containing sequences from at least two of the three species under consideration (yeast, fly, and worm).
c Percent of sequences grouped by OrthoMCL (left) or EGO (right).
d Percent of sequences grouped by both OrthoMCL and EGO.
e Coherent (but not identical) groups extended by EGO (EG