BMC Biology BioMed Central

The complete plastid genome sequence of the parasitic green alga Helicosporidium sp. is highly reduced and structured Abstract Background: Loss of photosynthesis has occurred independently in several plant and algal lineages, and represents a major metabolic shift with potential consequences for the content and structure of plastid genomes. To investigate such changes, we sequenced the complete plastid genome of the parasitic, non-photosynthetic green alga, Helicosporidium.


Background
Introgression can play a key evolutionary role in plants [1,2], but has until recently been considered rare and controversial in animals [3][4][5]. This is largely because hybrids between animal species are rare, but may also stem from a belief that hybridization and introgression between species is unnatural [1,3,5]. Hybrids are usually rare on a per individual basis, but species undergoing occasional interpecific hybridization are common. Existing surveys suggest that around 10% of animal species, and 25% of plant species hybridize [5]: for instance, 9% of bird species [6] and 11% of European butterfly species [7] are known to hybridize with at least one other species. Even if occasional, natural hybridization can lead to successful introgression, with important consequences in ecology, evolution and conservation [3,5]. However, hybridization in nature does not guarantee that genes will pass between species, because hybrids are typically selected against, and may be completely infertile or inviable. To determine whether hybridization leads to introgression, we must investigate the patterns of distribution of alleles among hybridizing species. Shared alleles in descendent species may have been inherited as pre-existing polymorphisms from their joint ancestors, as well as via recent gene flow since speciation. These two routes to allelic sharing, which both result in genealogical polyphyly at the species level, are hard to distinguish on the basis of genetic data.
Recently, two classes of molecular methods have been used to test for introgression. Both rely on the idea that introgression in some genomic locations will be prohibited by reproductive isolation or divergent natural selection, while at other loci introgressing alleles will establish more freely. In fact, without heterogeneity of divergence across the genome, it will typically be difficult to discriminate recent speciation from recent gene flow. The first method examines genotypic data at multiple low-resolution loci, such as chromosomal morphs, allozymes, microsatellites, amplified fragment length polymorphisms (AFLPs) or single nucleotide polymorphisms (SNPs), for heterogeneity of divergence in allele frequency. Alleles that flow freely will have their frequencies homogenised across species, while alleles whose introgression is blocked by divergent selection will retain strong frequency differences. Thus, heterogeneity in allele frequency differences among loci suggests that on-going gene flow as a likely explanation for similar allele frequencies at some genes in pairs of taxa that hybridise [8][9][10][11][12][13]. A second approach, adopted in this paper, employs DNA sequence data, coupled with a statistical approach based on gene genealogies and coalescence theory, to test whether shared haplotype polymorphisms could have been inherited from a common ancestor or are more likely due to introgression since speciation [14][15][16][17][18].
Heliconius butterflies are a rapidly radiating, tropical group, well known for diverse warning colors and Müllerian mimicry [19][20][21][22]. Around 35% of the species in this group are known to hybridize in nature [5,21]. This study concentrates on Heliconius cydno and H. melpomene, which co-occur throughout Central America and the Northern Andes [23,24]. Although closely related, separating approximately 1.5 million years ago [24], they are not classical 'sibling species' (i.e. species difficult to distinguish using morphology). Nor are they host races, ecological morphs or members of a recent island radiation. Unusually for studies of introgression, we have chosen two species that are partially sympatric, continental taxa differing strongly in morphology, as well as in larval and adult ecology [21,25,26]. These species diverged in colour pattern due to Müllerian mimicry [27]: Heliconius melpomene mimics the geographically widespread red, yellow and black H. erato, while H. cydno mimics the black and white or yellow H. sapho and H. eleuchia [27,28] in the Northern Andes and Central America. The two are not even strictly sister species, since a number of other morphologically distinct species, such as H. heurippa, H. pachinus, H. timareta, and H. tristero, appear more closely related to H. cydno than to H. melpomene [23,[29][30][31][32]. Nonetheless, natural hybrids occur in most areas of range overlap. The frequency of hybridization is rare (less than one in every 1000 individuals [21,27]), but hybrids of both sexes are viable, and, although F1 hybrid females are sterile, in accordance with Haldane's Rule, males are fertile and produce viable backcross progeny with either parental species [33][34][35]. Furthermore, about half the hybrids collected from the wild are backcrosses, identifiable by their colour pattern phenotype [35]. Introgression mediated by fertile hybrids therefore seems likely [5,21].
To test the importance of introgressive gene flow between the two species, we sequenced multiple haplotypes of four unlinked loci using a geographic sampling regime designed to reveal unusual genealogical patterns due to introgression. We chose sympatric populations of H. cydno chioneus and H. melpomene rosina from Panama, which are known from a few specimens to hybridize occasionally in the wild, and an allopatric French Guiana population of H. melpomene melpomene that does not overlap with H. cydno, and whose mtDNA is also known to be divergent relative to other H. melpomene populations [23]. The French Guiana population acts as a control for population divergence and the high variance of coalescence time expected. If introgression were occurring at a particular locus, we would expect sympatric haplotypes to be unusually similar, while at the same time the allopatric population of H. melpomene can be used to confirm that evolutionary divergence occurs normally at the same locus. We sequenced DNA from four unlinked loci: one mitochondrial region (cytochrome oxidase subunits I and II -CoI, CoII; hereafter Co), and intron-spanning regions of a sex-linked locus Tpi (triose-phosphate isomerase, on the Zchromosome), autosomal Mpi (mannose-6-phosphate isomerase), and Ci (cubitus interruptus); for details see Table  1). These loci were selected because it was reasoned that mitochondrial genes and nuclear non-coding regions would diverge rapidly enough to provide high resolution between closely related taxa [36]. The nuclear loci are all known to be on different linkage groups [37], providing a scatter of loci across the genome with which to detect introgression.

Results
The genealogical pattern at each locus Details of the data and models of evolution are given in Table 1. As expected if allelic coalescence has occurred more recently than the split between the two species, the maximum likelihood genealogy for Co is consistent with reciprocal monophyly for H. melpomene and H. cydno, with 2.5-3.0% net divergence between species compared to < 1.3% uncorrected divergence within populations, and 0.4% net divergence between H. melpomene races ( Table  2). There was no evidence for recombination at this locus (Table 1), as expected for mitochondrial sequences [38]. A previous parsimony study of Co found French Guiana (MG) and Panama (MP) haplotypes of H. melpomene to be distinct (as here), and suggested that the French Guiana clade was sister to a clade including Panama H. melpomene and H. cydno [23], making H. melpomene paraphyletic with respect to H. cydno (CP) at this locus. Our results differ in that although melpomene paraphyly has a marginally greater likelihood, mutual monophyly cannot be rejected (SH test: ∆logL = 1.36, P = 0.27). Paraphyly of H. melpomene, as found by Brower [23] using a parsimony analysis, is however strongly rejected by equal-weighted parsimony bootstrapping on our data. The difference between the two analyses is probably due to additional information provided by the greater length of mitochon- drial sequence we used (1573 bp vs. Brower's 942 bp). We therefore show the constrained, mutually monophyletic tree (Fig. 1).
The genealogical pattern for Tpi is somewhat similar in some respects to that for Co, in that mutual monophyly of H. cydno and H. melpomene seems likely, with 1.4-2.5% net divergence between the two species at this locus, compared to 1.3% between races within H. melpomene (Table  2). There are several examples of homoplasious indels in the tree, as expected if recombination had taken place within these haplotypes. There is additional evidence from a statistical test for recombination in both H. melpomene (P = 0.023), and H. cydno (P = 0.004) from Panama. In contrast to Co, Tpi yields an almost reverse maximum likelihood genealogy in which H. melpomene forms a monophyletic group within a paraphyletic Heliconius cydno. But again, the evidence for paraphyly does not stand up: there is less than 50% parsimony bootstrap support for a paraphyletic tree over a monophyly-constrained tree, and likelihood support for paraphyly is also weak (SH test, ∆logL = 6.63, P = 0.22). Therefore, we again represent the genealogy as the most likely mutually monophyletic species tree (Fig. 2 Table 1). The vast majority of indels conform to the estimated genealogy, suggesting recombination is rare (Fig. 3). The maximum likelihood genealogy suggests that many identical or nearly identical sequences are shared between the two species in Panama, both of which populations are strongly differentiated from H. melpomene from French Guiana (Fig. 3). Four very short Mpi sequences, with a 280 bp deletion spanning most of the intron, were also identical across taxa (Fig. 3, haplotype group III). These sequences are placed basally by maximum likelihood because of the lack of phylogenetic information. A tree search constrained to be mutually monophyletic between the two species is strongly rejected (∆logL = 133.24, P < 0.001).
The existence of some very short alleles and high rates of divergence among haplotype groups within each taxon might suggest that divergent duplicate loci are being sequenced. However, a battery of tests confirm that Mpi acts as a single-copy nuclear, coding marker which can be mapped readily in crosses (see methods). Similar results have recently been obtained with the same primers by a number of other workers, showing that the results here are not anomalous [24,39].
The genealogical pattern for Ci is more complex. As with Mpi, net divergence is low between H. cydno and H. melpomene in Panama (0.7%), and rather higher between geographic races of H. melpomene (3.7%) and between H. cydno and French Guiana H. melpomene (2.9%). However, there are no fixed differences between any pair of taxa, and the numbers of shared polymorphisms appear to be roughly equal between all comparisons, both within and between species at this locus (Table 3).
Unlike Mpi, Ci shows many inferred homoplasies, both of single base pairs and of indels: in addition to the apomorphic indels shown, 90 inferred insertion/deletion events involve 27 homoplasious indels, too numerous to be shown in Fig. 4. These homoplasious events are presumably mainly due to recombination (significant in all four separate tests, P = 0.02, P = 0.001, P < 0.001, P < 0.001; see Table 1), which negates the validity of representing the Ci genealogy as a strictly bifurcating tree as in the likelihood analysis ( Fig. 4). Indels are expected to be less prone to reversal or homoplasy than single-base changes. At the Ci locus, seven out of 11 indels not subject to homoplasy or reversal supported groups of French Guiana melpomene melpomene, while of the remaining three, two are autapomorphic, one supports a (cydno+melpomene rosina) haplotype pair. One indel supports a grouping of H. cydno with a single associated H. melpomene sequence that lacks the indel (but which is attached to the base of the clade, and so does not contravene the idea that the indel may be a character found only in H. cydno). These contrasting patterns suggest either multiple ancestral polymorphisms, or introgression of haplotypes and recombination both within and between the two species in Panama, but a lack of gene flow between either of these and H. melpomene melpomene sequences from French Guiana. Several groups of Ci haplotypes from French Guiana H. melpomene melpomene form distinct clades, but overall there is strong evidence against mutually monophyletic separation between cydno and melpomene ( Fig. 4, ∆logL = 130.63, P < 0.001).
Thus the pattern seems similar to that for Mpi, but closely similar or identical haplotypes at this locus are rarer between than within species. Thus, if introgression explains similarities among haplotypes of the sympatric species pair, it may be more ancient than at Mpi.

Bayesian analysis of genealogy and introgression
The strong polyphyly of Mpi and Ci genealogies between the two species suggests that selective introgression at these loci may be the cause. To examine this possibility we employed the Bayesian 'Isolation-Migration' (IM) algorithm of Hey & Nielsen [18]. The IM test deals with pairs of taxa only, so we examined only sympatric H. melpomene rosina and H. cydno chioneus from Panama. We carried out analyses on three modified datasets to conform with the assumptions and limitations of the IM algorithm (i.e. no gap polymorphism, no recombination within loci). Firstly, we obtained a dataset that is as complete as possible obtained by deleting highly indel-laden haplotypes and indel polymorphic regions (the basic 'IM Dataset')for example, the short intronless sequences at Mpi could not be used in this analysis. Two additional datasets were sampled from this basic dataset, by removing apparent recombinant regions or haplotypes ('IM Reduced Dataset 1', and 'IM Reduced Dataset 2' -see methods and Table  1). In the event, all three analyses gave broadly similar results. We calibrate the analyses via neutral substitution rates obtained from an insect mitochondrial DNA clock calibration [32] to obtain approximate per base pair neutral substitution rates (µ) for each locus, times since speciation (t), effective population size measured as a product of mutation rate and total population size (θ = 4 Nµ), and per locus bidirectional introgression rates (m). The absolute values of these parameters are of course of some interest, but are highly dependent on the clock calibration of Brower [32] based on a shorter sequence of mtDNA, which may therefore be somewhat unreliable. What are of more interest here are the relative values for introgression (m) between the species at different loci. The results are shown in Tables 4, 5, 6 and Fig. 5.
Overall, the results suggest that substitution rates (µ) for the three nuclear loci are three to four-fold higher than the rate for the mitochondrial region. This may seem rapid for nuclear genes, but the comparison is between mitochondrial coding sequence and largely intronic nuclear sequence (Tables 4, 5, 6), and the result agrees approximately with a broader comparative study in Heliconius in which third base pairs in codons of the Co mitochondrial region were shown to diverge at approximately the same rates as intronic sequences in two of the nuclear genes studied here (Tpi and Mpi), overall, and apparently even faster at low divergences similar to those found here (1-2% at Co; see Figs. 1a, b in ref. [24]). Only weak information is available from the genealogical data about time of divergence, but speciation is inferred to have taken place more than about a million years ago (Tables 4, 5 [32]. Very little information is available to estimate the ancestral population size (θA), as can be seen from the graphs in Fig. 5.

Maximum likelihood genealogy for CoI and CoII
Of most importance here, bidirectional introgression rates (m) for each locus apparently identify a single outlier, Mpi, which has m at least an order of magnitude higher than at any of the other loci. The lower tails of the posterior probability distributions of m for Co, Tpi, and Ci are never complete even at the lowest values sampled, so we follow Hey & Nielsen [18] in inferring a lack of evidence for introgression at such loci.

Discussion
The Finally, the IM analysis [18] suggests exchange of Mpi haplotypes between the two species in Panama, but no evidence of gene flow at other loci. Instead of being due to gene flow, IM suggests that polyphyly in the genealogies of Ci may be due to ancestral polymorphism inherited from a common ancestor, or possibly due to Ci haplotype introgression in the distant past. In only the most reduced dataset (IM reduced dataset 1) for Ci is there a relatively flat distribution of posterior probability for higher migration levels (amplified 10 × in Fig. 5 for clarity). In view of the stronger clustering of Ci haplotypes by species than at Mpi, a conclusion of little evidence for introgression at Ci seems warranted. The analysis therefore provides evidence for introgression only for Mpi haplotypes, where effective gene flow is about 1.5 × 10 -6 per generation (with a 95% credibility interval of 9 × 10 -7 -4.50 × 10 -5 , Table 5).

Maximum likelihood genealogy for Mpi
Loci embedded in divergently selected genomic regions may be less likely to cross the species boundary than others [15]. Our data are consistent with linkage maps and knowledge of sterility between H. melpomene and H. cydno. Female hybrids between the two species are sterile [34], which should prevent mitochondrial transfer.
(Female hybrids between Panama and French Guiana H. melpomene are also sterile, but in only one direction of the cross [40], so that mitochondrial transfer should still be possible between the geographic races, although here geography is probably the cause of divergence). Tpi may have a similar genealogy to that of CoI/CoII because the gene is syntenic with Z-linked female sterility in crosses [34]. Ci maps to linkage group 18 [37], which also contains an important locus causing fixed colour pattern differences between the two species. Introgression is thus likely to be inhibited in this locus by strong mimetic selection against intermediate or introgressed colour patterns [20,27,28]. In contrast, introgression of autosomal Mpi haplotypes (linkage group 3 [37]) between H. cydno and H. melpomene may occur more readily because it is unlinked to any loci known to be associated with sterility or other divergently selected traits.
The use of a color pattern 'toolbox' of switch genes shared between multiple species has been suggested as a means by which similar, homoplasious colour patterns spread throughout the H. melpomene -silvaniform clade of the genus Heliconius [24,41]. Many of these species hybridize in nature [5].  [30,31,41]. Occasional introgression may thus have allowed wholesale transfer of multilocus, ready-made mimetic adaptations. Our finding of introgressed autosomal haplotypes between two members of this subgroup provides molecular evidence for the possibility such a claim.

Inferring history and species status from limited sequence data
The data presented here also illustrate the difficulties of reconstructing phylogeny of closely related species from DNA sequences. For this study, finished sequence was obtained from 30 individuals, and from 43-45 haplotypes for each nuclear gene. For each individual, cloning stages must be added so that some 10,000 extra bp DNA had to be sequenced per individual, or about 390,000 bp total to obtain reliable sequences for just four loci sequenced in every individual. Of the loci used, two (Co and Tpi) agree with prior morphologically-based understanding of species delimitation, showing no significant deviation from reciprocal monophyly. Two loci strongly reject the same phylogenetic interpretation, of which one (Ci) shows potential evidence for ancient gene flow and/ or abundant ancestral polymorphism, and the other (Mpi) shows clear evidence for ongoing introgression between the species in sympatry and differentiation between geographic races. Still another locus tried (Dopa decarboxylase -Ddc, [see Additional file 1]) was inconclusive, although we hypothesise fixed differences and reciprocal monophyly for this locus too, since H. cydno apparently lacks priming sites in H. melpomene and other related taxa. Previous conclusions that H. cydno is nested within Guiana and Panama melpomene clades [23] appear on current data to be incorrect, and may result from a lack of resolution due to the smaller CoI/CoII fragment used earlier. However, although mutual monophyly is not ruled out using the Co and Tpi fragments studied here, maximum likelihood genealogies of both genes suggest that species paraphyly is as, or more, likely. At the same Although we did not expect such strong differences beween loci a priori, it is clear that there are many other differences in the evolution of the introns sampled at the four nuclear loci. The intron at Mpi is characterised by major splits between highly divergent haplotype groups maintained as polymorphisms (possibly as a result of balancing selection -see [24]), each haplotype group associated with particular indels, and showing little evidence of recombination. Tpi and Ci show many polymorphic indels, and strong evidence for recombination between haplotypes. Introns at Ddc, in contrast, [see Additional file 1] show strong sequence conservation, and are readily alignable even with distantly related species such as H. himera (a close relative of H. erato), while the highly divergent intronic sequences from Tpi or Mpi are impossible to align between erato-group and melpomene-group species of Heliconius [24]. Ddc also has few indels, and shows no evidence for recombination, even in its intronic sequences. A major finding of this study has been to demonstrate how idiosyncratically different genes can evolve. In this respect, our results are concordant with those obtained both for other sympatric, regularly hybridizing insect groups (e.g. Anopheles gambiae sensu lato [13,17] and the Drosophila pseudoobscura group [16,18]) and also in a largely allopatric radiation of three sibling species between which hybridization is unlikely and introgression not observed (Drosophila simulans group [42]). A 'species phylogeny' of closely related taxa such as these at best provides an artificial consensus of multiple conflicting genealogical patterns, rather than a meaningful representation of actual lineage diversification.
This study also highlights the difficulty of delimiting species on the basis of limited sequence information. In our populations, a short mtDNA barcode sequence could be used to assign individuals to clades each having fixed differences, and these clades could then be labelled as belonging either to H. melpomene or to H. cydno on the basis of already established biological, ecological and morphological traits. However, a priori attempts to define µ L = mutation rate per base pair per generation for each locus L. θ sp = combined parameter 4 N e µ Co of species 'sp', or their ancestor 'A', relative to the mitochondrial mutation rate µ Co N e represents the effective population size. t = time of divergence in years. m L = locus-specific migration rates per generation for each locus L. * = unreliable estimate or limit due to flat or incomplete posterior probability distribution sampled. 0 = effectively zero, although the lowest 'bin' does not actually include zero.
species boundaries on the basis of the mtDNA marker will normally fail: we might decide that all strongly supported branches of the mitochondrial tree in Fig. 1 [5,49].
Thus the recent tendency to split geographic forms into species based on fixed mitochondrial differences [49,51] but see [52] is inadequate without investigating a panel of nuclear genes as well. Oddly, this is similar to the criticism leveled by Zink [53] himself against the subspecies erected based on morphological characters: when such markers or characters are used to delimit taxa, the resulting taxon predicts little about the behaviour of the rest of the genome. Yet taxonomic inflation caused by elevation of populations to species level based on mtDNA data is surely a worse problem for taxonomy and conservation than difficulties at the subspecies level [54].
The considerable interlocus variation in mode of evolution and genealogical history demonstrates that large numbers of sequenced loci will be needed in order to investigate and identify phylogeny and species boundaries reliably, and even then will provide only a consensus of genealogies, because a 'true' species phylogeny is not possible when genealogies conflict. In particular, taxon delimitation based on mitochondrial sequences alone will give little insight into the behaviour of nuclear genes. It could be argued that Heliconius is a special case, but very appreciable per species rates of hybridization are found both in Heliconius butterflies and in many bird groups currently undergoing mtDNA-based species delimitation and taxonomic inflation [5]. We argue that morphological, ecological, and behavioural data, coupled with geographical distribution data (particularly sympatry), will remain superior to DNA sequences for species delimitation [52], at least until larger numbers of loci can be readily analysed using a truly 'genealogical genomics' approach.

Conclusion
This work adds to a small but growing body of DNA sequence evidence showing that genetic material may pass regularly between closely related animal species in nature, millions of generations after speciation [16,20]. This work supports the hypothesis that introgressive hybridisation could make a significant contribution to adaptive evolution in Heliconius and in animals generally. If it is generally true that closely related, non-sibling, ecologically distinct animal species are often permeable to introgression, the nature of such species, their ecology, causes of speciation [1,16], phylogeny reconstruction [43], as well as conservation issues concerning hybridizing populations [44] all require re-evaluation. In particular, phylogenetic reconstruction and diagnostic tests for species status employing single genes must be used with great caution [45][46][47][48]. Species boundaries in radiating groups, even in 'normal' continental species such as Heliconius, appear to remain porous long after divergence.

Sampling methods and DNA extraction
Thirty wild butterflies were sampled (28 males and 2 females), 10 each of H. melpomene rosina from Panama, H. melpomene melpomene from French Guiana, and H. cydno chioneus from Panama. Sequences from H. numata were used as outgroups. (Genealogies were also checked using other related Heliconius, as well as H. himera as outgroups; data not shown). Butterflies were collected in the field, and preserved in liquid nitrogen. These samples are stored at -70°C at the Smithsonian Tropical Research Institute in Panama. From each individual, 1/3 of a thorax was ground in liquid nitrogen, and genomic DNA was extracted using the standard phenol-chloroform method [58].

Loci and primers
A region of mtDNA spanning the 3' end of CoI, leucine-tRNA (tRNA-leu), and CoII was selected as a suitable mitochondrial region following work by Beltrán et al. [22,24]. This mitochondrial region has been used in many insect studies, and a shorter region of CoII included within our study was used in pioneering molecular phylogenetic studies of the genus Heliconius [23,59]. Tpi is an important enzyme in carbohydrate metabolism encoded by a sexlinked nuclear gene in most Lepidoptera [60]. The region  Legend as for Table 4 amplified spans a single intron in Heliconius [24], is inherited in a Mendelian manner and is sex-linked in Heliconius [27,34,37,40,61] and very likely many other Lepidoptera (e.g. Ostrinia - [62]; and Bombyx -see GenBank accession AY734490). Mpi also spans a single intron and is encoded by an autosomal gene. The protein product is highly polymorphic in Lepidoptera, including Heliconius [63][64][65][66]. The time since divergence of the two species (t), and the locus-specific bidirectional introgression rates are also shown (m). The three datasets analysed are the basic IM dataset (blue), and IM reduced dataset 1 (pink) and IM reduced dataset 2 (green). Analysis of the basic IM dataset is compromised by recombination within Tpi and Ci, which is assumed not to occur in the IM algorithm. Reduced datasets containing apparently non-recombined segments of the genes were analyzed to overcome this difficulty. IM reduced dataset 1 differs only from IM reduced dataset 2 in that a different, shorter, part of the Ci locus is used; the low sequence information probably explains why there is little information in the former run in the last panel (probabilities for IM dataset 1 are enhanced 10-fold in this panel only, for clarity). The curves show useful parameter estimation , except in the case of ancestral population size (θ A ), the upper tail of the time of divergence (t) and the introgression for Ci for IM reduced dataset 1.
their possible involvement in wing pigmentation genetics of butterflies [61]. Ci is a transcription factor serving to activate the transcription of wingless [70], and involved in wing-patterning in some butterflies [71]; the fragment studied here spanned two introns. Ddc is involved in the melanin pathway in insects, where it catalyses the conversion of dopa to dopamine [67][68][69]; the fragment studied here spanned two introns.
Primers and details of methods for all the loci have been described earlier [24,37,61,72]; further details are given in [Additional file 1].
Because of the many insertions and deletions in the intron-spanning sequences studied here, direct sequencing produced ambiguous base calls in heterozygotes. We therefore separated the two alleles of nuclear loci by cloning prior to sequencing. Products amplified from genomic DNA were run in a low-melting point agarose gel (as for mtDNA) and the bands excised and dissolved in agarase. The products were cloned to obtain the sequence for each allele, using pGEM ® -T Easy Vector System II (Promega). Five or more clones per individual were selected; re-amplified, and again purified on an agarose gel. Positive bands were excised and dissolved using agarase.

Purification, sequencing and allele editing
Templates from all loci were cycle sequenced using primers and methods already published [24,37]. PCR will generate Taq errors of amplification, and these errors can be 'fixed' when extra steps of cloning and PCR are added [73][74][75]. To correct these errors, we sequenced a minimum of five clones per individual. These were aligned and sorted into haplotype classes ('A' and 'B' where shown to be heterozygous), and the consensus sequence was deduced by assuming that single-base Taq error was likely to occur only once. During this procedure, we found one individual (CP569) for which one clone had a recombinant Tpi allele, which matched the A allele for part of its length and then the B allele for the rest, clearly resulting from Taqinduced recombination during the initial PCR stage. In no other case could sequenced clones from each individual be interpreted as belonging to more than two alleles of Mpi, Tpi, Ci per individual, giving further evidence against duplicate loci.

Testing for duplicate loci, pseudogenes and other anomalous sequences
To test for the presence of pseudogenes and other duplicates, we checked our sequences against those already obtained for related species, and for the stop codons expected in pseudogene sequences. In no case did we detect anomalies. At Mpi, there were major divergences in sequence within species, populations and even individuals (as had already been discovered in Heliconius [24]), and as the evidence for introgression was strongest at this locus, it was important to check for the possibility of duplicate loci. We screened for heterozygosity using Temporal Temperature Gradient gel Electrophoresis (TTGE) [76]. Mpi clone were run using 8 µl of double stranded PCR product using the BioRad TTGE 'Dcode' system. Gels contained 8% acrylamide and 1.75 M tris acetate EDTA buffer (TAE), and were run from 46 -53°C at a temperature ramp of 1°C per hour. In no case were more than two alleles observed in any individual. Several broods were also tested to verify Mendelian segregation using RFLP polymorphisms derived from the sequence information, and all autosomal and sex-linked loci behaved as expected for single loci in broods mapped using AFLP markers [34,37]. Haplotypes therefore segregate in the expected Mendelian fashion and were inherited in complete linkage with the Mpi allozyme locus in broods of H. erato and himera [61]. We conclude that Mpi behaves as expected for a single-copy locus, in spite of its high intraspecific variability, which may be related to hybridization and introgression (see below).

Sequence alignment and phylogenetic/genealogical analysis
Chromatograms for all genes were edited, base calls were checked and aligned manually. Complete original alignments are given in [Additional file 2]. All single base polymorphisms occurring in only one or two individuals were rechecked against chromatograms to ensure they were correctly read prior to phylogenetic analysis. All sequences were translated to check for reading-frame errors and stop codons. The new haplotype sequences studied here are deposited under GenBank accessions AF512970-AF512993 (Co), AF516210-AF516255 (Mpi), AF545437-AF545469 (Tpi), AY429261-AY429304 (Ci), and complete alignments are given in [Additional file 2]. Sequences were verified by aligning against Heliconius cDNA sequences, or other Lepidoptera or Drosophila sequences for the same gene. An unusual insertion and deletion in Ci consisting of an approx 270 bp sequence was found in three clones (melpomene rosina MP 545A and MP 545B, and cydno CP 809A). This indel aligned well between these three clones, but was unalignable with any other sequence, including those from other Heliconius. The region was clearly homoplasious with the rest of the genealogy and was deleted prior to analysis. Unalignable sequences are a widespread problem in molecular evolutionary biology [77], and their deletion can lead to a loss of information. However, with the insertion included, the topology was nearly identical, the main difference being the extremely long branch lengths and similar sequences of the three problem haplotypes tended to cause them to group together. Net divergence estimates and shared polymorphism counts were also performed on the data set with this unalignable region excluded. Apart from these sequence fragments, the entire sequences were used to obtain an estimate of genealogy.
Phylogenetic analyses were performed with PAUP* version 4.0b10 [78]. Models of sequence evolution were compared by means of likelihood ratio tests using Model-Test 3.04 [79]. PAUP* was then used to search for the maximum likelihood (ML) tree, based on the best fit model and parameter estimates given by ModelTest (Table 1) and using a heuristic search with tree bisection reconnection (TBR). Confidence in different hypotheses (e.g. constrained to mutual monophyly versus maximum likelihood) was tested using the bootstrapped Shimodaira-Hasegawa likelihood-ratio test (SH test [80]) as implemented by PAUP*.
For comparison, maximum parsimony trees were also obtained using a heuristic search with TBR branch swapping. Confidence in each node was assessed by bootstrapping (10000 replicates also with TBR branch swapping). Because we were most interested in branch support, we did not analyse the data using Bayesian methods, because the very high branch support that this method produces has come under suspicion, and may be due to great sensitivity of Bayesian phylogenetic analysis to the form of the prior distribution assumed [81]. In addition to phylogenetic analysis, the data were analyzed to estimate a range of population genetic parameters. Polymorphism and divergence estimates ( Table 2, 3) were calculated using the SITES program [82,83].
To estimate the importance of introgression, we initially attempted to use the Wakeley-Hey (WH) algorithm [14,83] to test the null hypothesis of equal rates of accumulation of divergence and polymorphisms, but the simulation-based computer implementation (WH) failed to complete. This appears to be a common situation for data of this kind where relatively few loci, highly heterogeneous for levels of divergence, are used to estimate ancestral population sizes [14,17]. In addition, the SITES program and the WH test are not very appropriate because they assume an infinite sites model, under which a single site cannot be substituted twice in the same genealogy. However, polymorphisms with three bases are quite common in all of the rapidly evolving sequences studied here. Another problem was that any sites with missing data or spanned by indels in one or more aligned sequences are ignored. Our intron data often have indels, so that, summed over all individuals, much of the sequence may be spanned by one or more indels leading to a complete loss of information in SITES analyses.
We therefore analysed the data using the Isolation-Migration (IM) method [18]. The program employs a Metropolis-Coupled Monte-Carlo Markov-Chain (MCMCMC) algorithm for Bayesian estimation of genealogical parameters related to mutation and introgression in a single pair of species. The programme has recently been upgraded to deal with bugs which affected the HKY mutation model used here (November 2005, Hey pers. comm.); all analyses were re-run with the new executable files. Like SITES, the program still ignores DNA sites for which any sequence has missing data, and the program also assumes no recombination within each sequenced locus. SITES revealed a certain amount of 'recombinant' sequences, particularly at the Ci locus, so IM could give misleading results using our data. However, SITES uses a four-gamete test for recombination between individual sites, which is valid only under the infinite sites mutation model, and so gives a criterion likely to be much too strict for more realistic, HKY-type models of evolution, since repeated changes at the same site were observed. A better clue to recombination is the overall pattern of multi-site sequences, and genealogical homoplasy of indels (Fig. 2, 3, 4), which may be less likely to recur via mutation than single base pair changes (particularly transitions under HKY and more complex models). We also tested for recombination using a model-neutral test based on a bootstrapped correlation of linkage disequilibrium (R 2 ) with physical distance [38]. Following Hey & Nielsen [18, J. Hey pers. comm. 2005], for the IM analysis, therefore, the data were pared down, firstly to remove any indel information not analysable by IM, while maximizing the sequence data (the basic IM dataset), and then subsampled to remove clearly recombinant regions, by sampling from the 5' region of each gene until a probable recombinant pattern is observed (reduced dataset 1). Because the 5' apparently unrecombined region of the Ci locus was very short, we also used a different subset containing instead the longer 3' unrecombined region of Ci (reduced dataset 2). 'Inheritance scalars' (per locus constant effective population sizes relative to those for an autosomal locus) were set at 0.25 for Co, 0.75 for the sex-linked Tpi, and 1.00 for the other loci. All loci were used in each run to estimate individual species and ancestral population sizes θ = 4 Nµ, along with parameters time of divergence (t), relative mutation/substitution rates rates (µ), and per locus bidirectional gene flow (m). These parameters were calibrated to a molecular clock to obtain parameters per base pair and per generation via Brower's [32] estimate of insect mitochondrial divergence of 2.3% per million years (i.e. a neutral substitution rate, µ, of 1.15% per million years), and with four generations per year assumed in Heliconius. Neutral mutation (substitution rates) and migration rates (m) were allowed to vary between loci. However, to reduce the numbers of parameters, introgression was assumed symmetrical within each locus (using the terminology of IM, m1=m2, i.e. option -j56). After optimizing parameter search limits using initial runs, each of the three datasets were run for at least 30 million steps after burn-in under the HKY model in IM using 5 chains per set, with linear heating increment parameters, h of 0.033, and a discarded burn-in of 200,000 steps. Actual run durations after burn in were 35,729,000 steps for the IM dataset, 49,769,000 steps for IM reduced dataset 1, and 54,460,000 steps for IM reduced dataset 2. The three IM datasets used have been provided in [Additional file 2].