Paucity of chimeric gene-transposable element transcripts in the Drosophila melanogaster genome

Background: Recent analysis of the human and mouse genomes has shown that a substantial proportion of protein coding genes and cis-regulatory elements contain transposable element (TE) sequences, implicating TE domestication as a mechanism for the origin of genetic novelty. To understand the general role of TE domestication in eukaryotic genome evolution, it is important to assess the acquisition of functional TE sequences by host genomes in a variety of different species, and to understand in greater depth the population dynamics of these mutational events.
Results: Using an in silico screen for host genes that contain TE sequences, we identified a set of 63 mature "chimeric" transcripts supported by expressed sequence tag (EST) evidence in the Drosophila melanogaster genome. We found a paucity of chimeric TEs relative to expectations derived from non-chimeric TEs, indicating that the majority (~80%) of TEs that generate chimeric transcripts are deleterious and are not observed in the genome sequence. Using a pooled-PCR strategy to assay the presence of gene-TE chimeras in wild strains, we found that over half of the observed chimeric TE insertions are restricted to the sequenced strain, and ~15% are found at high frequencies in North American D. melanogaster populations. Estimated population frequencies of chimeric TEs did not differ significantly from non-chimeric TEs, suggesting that the distribution of fitness effects for the observed subset of chimeric TEs is indistinguishable from the general set of TEs in the genome sequence.
Conclusion: In contrast to mammalian genomes, we found that fewer than 1% of Drosophila genes produce mRNAs that include bona fide TE sequences. This observation can be explained by the results of our population genomic analysis, which indicates that most potential chimeric TEs in D. melanogaster are deleterious but that a small proportion may contribute to the evolution of novel gene sequences such as nested or intercalated gene structures. Our results highlight the need to establish the fixity of putative cases of TE domestication identified using genome sequences in order to demonstrate their functional importance, and reveal that the contribution of TE domestication to genome evolution may vary drastically among animal taxa.


Background
The origin of genetic novelty is of great interest in evolutionary biology. As mutation is the ultimate source of all genetic variation, understanding the mutational processes that lead to novel genomic features such as new genes, expression patterns or system interactions is paramount. The most commonly invoked mutational source of genetic novelty (after point substitution) is either segmental or whole genome duplication [1,2]. More recently, duplicative transposition (the copying and pasting of particular DNA sequences from one part of the genome to another) has been shown to play an important role in the evolution of new genes (e.g. [3]). Evidence from the human and mouse genomes indicates that, in addition to providing the source of the transpositional machinery, transposable elements (or TEs) [4] can also provide the template DNA for new genes or regulatory sequences [5-11]. However, to understand the general role of TE domestication in eukaryotic genome evolution, it is important to assess the acquisition of functional TE sequences by host genomes in a variety of different species, and to understand in greater depth the population dynamics of these mutational events.
Here we have investigated the incorporation of TEs into mature transcripts in the fruitfly Drosophila melanogaster, a species about which much is known in terms of the sequence and function of genic and intergenic regions. To do so, we searched for potentially domesticated "chimeric" transcripts (i.e. transcripts containing both TE and host gene sequences) backed by experimental support in the form of expressed sequence tag (EST) evidence (cf. [10,11]). The focus of this study is gene-TE associations contained within mRNA transcripts (i.e. within exons or untranslated regions, UTRs), so here we do not consider TEs that are either wholly contained in introns or located in the immediate vicinities of genes. An advantage of our approach is that the gene-TE chimeras identified are supported by experimental evidence rather than just by coordinate overlaps or mere proximity (cf. [12,13]); the approach thus enriches for a subset of TE insertions that may contribute to functional gene evolution in the host.
In addition, we have assessed the presence in wild populations of gene-TE chimeras identified using the genome sequence, to gain insight into the evolutionary forces acting on these mutations in nature. Using a pooled-PCR strategy, we estimated population frequencies for a sample of chimeric TE insertions in North American strains of D. melanogaster. By comparing population frequencies of chimeric TEs to those of non-chimeric TEs of the same family from similar genomic contexts, we evaluated whether chimeric TEs generally segregate either at unusually high frequencies (indicating the action of adaptive selection) or at unusually low frequencies (indicating the action of purifying selection). These results also revealed which of the gene-TE chimeras detected in the genome sequence are likely to be constitutive components of the D. melanogaster transcriptome.
By comparing our set of gene-TE chimeras to the entire set of annotated genes and TEs in the D. melanogaster Release 3 euchromatin, we show that a chimeric TE insertion has a much lower probability than a non-chimeric TE insertion of existing in the sequenced strain. This extreme paucity of chimeric TEs can be explained by the simple fact that TE insertions generating chimeric transcripts are likely to be strongly deleterious for the host. However, we find that the population frequencies of observed chimeric TEs are generally indistinguishable from similarly paired non-chimeric TE insertions, and we find that some chimeric TE insertions can be found at high frequency in North American populations. This pattern indicates that chimeric TE insertions observed in the genome sequence do not differ substantially from non-chimeric TEs in their selective effects, and that the D. melanogaster transcriptome permits a low-level flux of chimeric transcripts that may contribute to the formation of new gene sequences. Finally, we discuss the possibility that chimeric transcripts explain the curious phenomenon of regulated somatic expression of TE transcripts in the developing Drosophila embryo.

Identification of chimeric gene-TE transcripts in the D. melanogaster genome
In order to study the functional integration of TE sequences into host genes, we identified TE insertions present in mature transcripts of the D. melanogaster euchromatic Release 3 genome sequence. We call such transcripts "chimeric" as each of them has one component from a host gene and one from a TE insertion. In addition to using the standard methods in the field for directly finding genes and TEs that share overlapping coordinates or querying annotated transcripts directly for TE sequences [8,10,11], we also sought evidence for chimeric transcripts using a novel three-step process based on expressed sequence tags (ESTs) (see Materials and Methods). This indirect method of identifying gene-TE chimeras was necessary to avoid annotation biases resulting from the fact that "coding exons were not annotated in sequences with homology to transposable elements" [14] in the D. melanogaster genome.
In total, we found 63 protein-coding genes that produce chimeric transcripts supported by EST evidence (Table 1; for more information [see Additional file 1]). These chimeric transcripts involve 63 different TE insertions, but the relationship is not simply one-to-one: in two cases, TE insertions (FBti0019107 and FBti0020178, Table 1) occur in overlapping 3'UTRs of convergently transcribed neighboring genes producing two separate chimeric transcripts each (see Figure 1A); and in one case, three TE insertions are found in a chimeric transcript for a single gene (CG32021) on the 4th chromosome. In addition, we found one noncoding transcript, the αγ-element [15], which is generated by two TE insertions within a larger nest of TEs situated between the Hsp70 Ba and Bbb genes. Our screen appears to have high sensitivity as evidenced by the fact that we identified four of the five exonic TE insertions previously reported in [12] (we found no supporting EST evidence for the fifth gene CG7900); the single exonic jockey insertion in the gene CG6191 reported in [16]; and the chimeric transcript generated by a Doc insertion into the gene CHKov1 (CG10618) reported in [17,18]. We did not identify the Bari-1 insertion in cyp12a4 recently reported in [19], which is supported by EST evidence, since the region of overlap (18 bp) does not pass our length threshold.
We note that six of the 65 chimeric TE insertions identified by BLAST-based methods do not have corresponding TEs in the Release 3.2 annotation. However, unannotated TEs of the correct family can be found in the genome sequence for these chimeric TE insertions (Table 1). This result indicates that an unknown proportion of real TE insertions has not been annotated in the Release 3 genome sequence (see below). To be able to analyze aspects of chimeric TEs in the context of the genome annotation, we excluded these six TE insertions from the "annotated set" of 59 TE insertions, although we do consider them to be bona fide members of the "total set" of 65 potential gene-TE chimeras in the D. melanogaster genome.

Properties of chimeric gene-TE transcripts
Most of the 63 genes generating the total set of chimeric transcripts are of unknown function, but we did identify chimeric transcripts in 23 characterized protein-coding genes including brown (bw), a gene that appears to be a hot-spot for natural TE insertions [20] and is known to carry a viable mutation (bw1) in the sequenced strain [14]. Our in silico screen also identified a chimeric TE insertion in the serine protease-encoding gene Tequila that has recently been shown to impair the transcription of this gene, but with no apparent phenotypic consequences [21]. A general analysis of the molecular function and cellular localization of the total set of genes with chimeric transcripts, however, did not indicate a significant enrichment of any particular Gene Ontology (GO) category (data not shown).
Relative to other non-chimeric TEs inserted in transcribed regions (i.e. intronic TE insertions), the annotated set of TEs present in chimeric transcripts is significantly enriched for LTR insertions (Figure 2A). This observation largely accounts for the fact that the annotated set is also enriched in long TEs (Figure 2B), since LTR insertions tend to be longer than other classes of TE insertion in the genome [14]. Furthermore, chimeric TEs have a greater tendency to be present in high-recombination areas of the genome than non-chimeric, intronic TE insertions (Figure 2C). However, the overabundance of chimeric TEs in regions of high recombination is not caused simply by the fact that chimeric transcripts are preferentially formed by LTR insertions, since high-recombination TE insertions are over-represented among the chimeric non-LTR (i.e. LINE-like, TIR and FB) elements even more strongly than among the chimeric LTRs (data not shown).
TE sequences are found in UTRs in most of the chimeric transcripts they generate: 38 of the 63 TE insertions are found in 3'UTRs, 23 in the 5'UTRs and 4 in coding exons. We note that these numbers total more than 63 because two TE insertions (chimeras 47 and 61 [see Additional File 1]) fall into multiple categories. The higher incidence of TEs in UTRs and specifically in 3'UTRs parallels findings in the human and mouse genomes [10,11]. The increased prevalence of TE insertions in 3'UTRs may be attributed to the increased average length of 3'UTRs (442 bp) relative to 5'UTRs (265 bp) in Drosophila [22] (as has been suggested previously to explain such patterns in the human genome [10]), or to the lower density of functional signals in 3' regions relative to 5' regions of genes. This pattern does not appear to result from biases in the EST libraries, since over 10 times more 5' ESTs were analyzed than 3' ESTs [23].
Surprisingly, the genes involved in chimeric transcripts are not always those nearest to the sites of the corresponding TE insertions. Four chimeric transcripts skip one or more genes between the gene and TE components of the transcript (chimeras 12, 18, 23 and 50; Table 1, Figure 1B and 1C), thereby creating nested or intercalated gene arrangements. The process of gene- or exon-skipping in chimeric transcript formation suggests a novel mutational mechanism to explain the surprisingly large proportion of nested genes in the D. melanogaster genome (many of which bear no hallmark of retroposition) [22,24], as well as the evolution of complex intercalated gene structures that cannot arise via simple mechanisms of gene duplication.

Paucity of TEs in mature transcripts indicates that chimeric TE insertions are generally strongly deleterious
Of the 1,566 valid TEs in the Release 3.2 annotation of the D. melanogaster genome sequence, we estimate that 59 are chimeric TE insertions with some component co-transcribed in an exon, 414 are transcribed but entirely contained within spliced intronic sequences, and 1,093 are entirely contained within intergenic sequences not currently annotated as transcribed. A similar rank order pattern of TE abundance in different functional compartments has been observed in the Arabidopsis thaliana genome [25]. These numbers of TE insertions deviate significantly from their expected proportions based on the genome annotation of the 116.8 Mb Release 3 sequence (p < 1 × 10^-15) (Table 2). This deviation from expectations is the result of two factors: there are fewer TEs in transcribed regions than in intergenic regions (p < 1 × 10^-15) [14], and there is a further reduction in exonic regions relative to intronic regions (p < 1 × 10^-15). The reduction in transcribed regions, however, is not solely caused by under-representation in exonic sequences, since the number of intronic TE insertions is reduced relative to the number in intergenic regions (p < 1 × 10^-15). Together, these results indicate that there is a paucity of chimeric TE insertions in the genome, and that the causes of this paucity go above and beyond the effects of simply being transcribed.
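The comparison of observed and expected TE counts can be illustrated with a simple chi-square goodness-of-fit calculation. This is a minimal sketch, not the test behind Table 2: it assumes that expected counts are proportional to compartment length, uses the exonic (28.2 Mb) and intronic (37.7 Mb) lengths quoted elsewhere in the text, and assumes that the intergenic compartment is the remainder of the 116.8 Mb sequence. The function names are hypothetical.

```python
import math

# Observed TE counts per compartment, as reported in the text.
observed = {"exonic": 59, "intronic": 414, "intergenic": 1093}

# Compartment lengths in Mb. Exonic and intronic lengths are quoted in the text;
# the intergenic length is an ASSUMPTION (remainder of the 116.8 Mb sequence).
genome_mb = 116.8
length_mb = {"exonic": 28.2, "intronic": 37.7}
length_mb["intergenic"] = genome_mb - length_mb["exonic"] - length_mb["intronic"]


def expected_counts(observed, length_mb):
    """Distribute the total TE count in proportion to compartment length."""
    total = sum(observed.values())
    total_len = sum(length_mb.values())
    return {k: total * length_mb[k] / total_len for k in observed}


def chi_square(observed, expected):
    """Pearson chi-square statistic summed over compartments."""
    return sum((observed[k] - expected[k]) ** 2 / expected[k] for k in observed)


expected = expected_counts(observed, length_mb)
stat = chi_square(observed, expected)

# Three categories give 2 degrees of freedom; for df = 2 the chi-square
# survival function has the closed form exp(-x / 2).
p_value = math.exp(-stat / 2)

for name in observed:
    print(f"{name}: observed {observed[name]}, expected {expected[name]:.1f}")
print(f"chi-square = {stat:.1f}, p = {p_value:.3g}")
```

Under these assumptions the exonic and intronic compartments hold fewer TEs than their lengths predict and the intergenic compartment holds more, which is the direction of deviation described above.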
To estimate the extent to which the number of exonic TE insertions is reduced while controlling for the effect of transcription per se on the distribution of TEs, we use the number of intronic TEs and the length of the intronic compartment of the genome to estimate the proportion of unobserved chimeric TE insertions. The total length of intronic regions in the D. melanogaster genome is approximately 37.7 Mb and the total length of exonic regions is 28.2 Mb [22,26]. If the selective pressures on exonic TEs were similar in magnitude to those on intronic TEs we would expect to find approximately 414*(28.2/37.7) = 310 TE insertions in the predicted exonic (coding plus untranslated) regions of the genome. The fact that we detect only 59 chimeric TEs out of an expected 310 (or 19%) indicates that a chimeric TE insertion is much more likely to be highly deleterious to the organism than a non-chimeric TE insertion that is spliced out of a mature transcript. These results are consistent with previous findings in the human genome, that the proportion of TE-derived sequence increases with increasing distance upstream from the start of transcription [10].
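The arithmetic in this paragraph can be reproduced directly. The sketch below uses only the counts and lengths given in the text; the function name is hypothetical.

```python
# Values quoted in the text.
intronic_tes = 414        # TEs entirely within introns
intronic_mb = 37.7        # total intronic length (Mb)
exonic_mb = 28.2          # total exonic length (Mb)
observed_chimeric = 59    # annotated chimeric (exonic) TEs


def expected_exonic_tes(intronic_tes, intronic_mb, exonic_mb):
    """Scale the intronic TE count by the exonic/intronic length ratio."""
    return intronic_tes * exonic_mb / intronic_mb


expected = expected_exonic_tes(intronic_tes, intronic_mb, exonic_mb)
observed_fraction = observed_chimeric / expected
unobserved_fraction = 1 - observed_fraction

print(f"expected exonic TEs: {expected:.0f}")
print(f"observed / expected: {observed_fraction:.0%}")
print(f"implied unobserved:  {unobserved_fraction:.0%}")
```

The implied unobserved fraction of roughly 80% is the complement of the 19% of expected exonic insertions that are actually detected.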
These calculations are based on a comparison of the annotated set of chimeric TE insertions relative to the total set of annotated TE insertions. As noted above, however, our results reveal that an unknown proportion of TEs in the Release 3 sequence were not annotated in [14]. If we assume that the frequency of unannotated TEs in intronic regions is proportional to that of the unannotated TE insertions in our sample (~10%), the expected number of TE insertions in exonic regions would increase to 310*1.10 = 341. Thus, using the total set under this proportionality assumption, the percentage of chimeric TE insertions detected relative to expectation is little changed (65 out of 341, 19%). To the extent that the number of unannotated TE insertions in introns is proportionally higher than in our sample, the percentage of observed chimeric TE insertions decreases even further, strengthening the claim for a paucity of chimeric TE insertions relative to expectation.
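The effect of unannotated TEs on this estimate can be explored with a short sensitivity calculation. This is a sketch under the proportionality assumption stated above; the function name and the 20% comparison value are hypothetical and chosen only for illustration.

```python
def observed_percentage(observed, expected_annotated, unannotated_fraction):
    """Inflate the expected count for unannotated TEs, then return
    (adjusted expectation, observed as a percentage of that expectation)."""
    adjusted = expected_annotated * (1 + unannotated_fraction)
    return adjusted, 100 * observed / adjusted


# Total set of 65 chimeric TEs against the expectation of 310, inflated by ~10%.
adjusted, pct = observed_percentage(65, 310, 0.10)
print(f"adjusted expectation: {adjusted:.0f}, observed: {pct:.0f}%")

# Hypothetical case: a higher unannotated fraction in introns (20%) lowers
# the observed percentage further.
_, pct_higher = observed_percentage(65, 310, 0.20)
print(f"with 20% unannotated: {pct_higher:.0f}%")
```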

Observed chimeric TEs are not under unusual selective pressures
We estimated that ~80% of the TEs that have been inserted into mature genic transcripts are immediately purged from the genome by strong purifying selection, and therefore are not observed in the sequenced strain. What about the remaining ~20% of chimeric TE insertions that we do detect? We can envisage three scenarios to explain the existence of these chimeric TE insertions: 1) they are under strong purifying selection, like the TE insertions we do not observe; 2) they are adaptive, contributing useful sequences to the host genome; or 3) they are neither particularly deleterious nor particularly advantageous in comparison to the observed non-chimeric TE insertions in the genome.
In order to evaluate these possibilities, we surveyed the frequencies of chimeric TE insertions in wild D. melanogaster ...
Figure 1. Examples of gene-TE chimeric transcripts in the D. melanogaster genome.
Figure 2. Properties of chimeric transcripts in the D. melanogaster genome. A) Proportions of 59 chimeric TEs in different element classes, compared to those of the 414 non-chimeric TEs found within intronic regions of genes. The proportion of LTR elements among the chimeric TEs is significantly greater than that of TEs found in introns. B) Proportions of 59 chimeric TEs in different length classes, compared to those of the 414 non-chimeric TEs within intronic regions of genes. The "low", "medium" and "high" length classes are defined according to the 33% and 66% length quantiles for the entire set of genomic TEs (748 and 3818 bp, respectively). The chimeric TEs show a significant enrichment for long TE insertions. C) Distribution of chimeric and non-chimeric TEs found within genes, partitioned by different recombination rates. Both distributions are compared against that of the number of genes found in each section of D. melanogaster euchromatin. See Methods for the definitions of "high", "low" and "zero" recombination. Note that the distribution of non-chimeric TEs deviates from the distribution of genes much more significantly than that of the chimeric TEs. In all three panels, the error bars on the numbers of chimeric insertions were obtained by assuming that the intronic proportion is the "true" probability p. Under a normal approximation, we expect the number of chimeric insertions to have mean np and variance np(1-p), where n is the number of chimeric elements. Based on this model, we constructed a 95% confidence interval around the observed number of chimeric elements that corresponds to the error bars in our figure. Error bars on the numbers of intronic insertions in panel C are based on the corresponding proportions of protein-coding genes.
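The error-bar construction described in the Figure 2 legend can be sketched as follows. The function name is hypothetical, and the example inputs (an intronic proportion of 0.30 and an observed count of 30) are illustrative values only, not data from this study; n = 59 is the number of annotated chimeric TEs.

```python
import math


def normal_interval(observed_count, n, p, z=1.96):
    """95% interval around an observed count, using the binomial standard
    deviation sqrt(n * p * (1 - p)) under a normal approximation."""
    sd = math.sqrt(n * p * (1 - p))
    return observed_count - z * sd, observed_count + z * sd


n_chimeric = 59          # number of annotated chimeric TEs (from the text)
p_intronic = 0.30        # HYPOTHETICAL intronic proportion for one class
observed_in_class = 30   # HYPOTHETICAL observed chimeric count in that class

low, high = normal_interval(observed_in_class, n_chimeric, p_intronic)
print(f"95% interval: {low:.1f} to {high:.1f}")
```

Because the variance is computed from the intronic proportion p rather than from the chimeric sample itself, the width of the interval depends only on n and p, not on the observed count.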
We were able to generate population data for 48 of the 59 annotated chimeric TE insertions. Twenty-seven chimeric TE insertions were found only in the sequenced strain, seven were found in all six-strain pools and 14 had intermediate pool frequencies. These proportions of absent (56%) and polymorphic (44%) chimeric TEs are very similar to a combined, non-random sample of 92 non-chimeric TE insertions with previously reported population frequency data that map to annotated Release 3 TEs: absent (58%) and polymorphic (42%) [12,17,27-29]. The negative effects of intronic TE insertions on transcription do not strongly affect this non-chimeric sample, since similar proportions of absent and polymorphic TE insertions are observed in intronic (60% absent, 40% polymorphic; n = 30) and intergenic (56% absent, 44% polymorphic; n = 62) regions.
To determine whether the chimeric TE insertions are, on aggregate, subject to unusual selective constraints, we compared each of their pool frequencies to those of similar, non-chimeric TE insertions (Table 4). By "similar," we mean that these TE insertions came from the same family as their chimeric counterparts, that they had similar lengths, and were inserted in areas with similar recombination rates (see Methods for details). Since the selective constraint on a TE insertion is expected to increase with its length and the recombination rate of its genomic neighborhood [17,30], we tried to bracket each chimeric TE with a pair of similar non-chimeric family members: one with slightly higher, and one with slightly lower, length and recombination rate (columns 4 and 6 of Table 4, respectively). Our null hypothesis was that the chimeric TE insertions are neither particularly deleterious nor particularly advantageous in comparison with their non-chimeric counterparts. If this null hypothesis is true, we expect the pool frequencies of non-chimeric TE insertions in column 5 of Table 4 to be no higher, and the pool frequencies in column 7 to be no lower, than those of the chimeric TE insertions in column 3.
For the set of 48 TE insertions for which we have population data, we cannot reject the null hypothesis of no difference in pool frequencies between chimeric and non-chimeric TE insertions. Neither the Wilcoxon one-sided test nor the Kruskal-Wallis test rejects the null hypothesis in favor of the alternative that pool frequencies of chimeric TEs are significantly higher than those of their counterparts with greater lengths and recombination rates (p = 0.38 and p = 0.75, respectively; tests performed on the n = 34 TEs in Table 4 that have the appropriate counterparts). This indicates that, in general, the fact that a TE insertion is chimeric does not increase the likelihood that it is at higher population frequency and is therefore potentially adaptive. Similarly, we find no evidence that chimeric TEs in general have pool frequencies lower than those with shorter lengths and lower recombination rates (p = 0.15 for the one-sided Wilcoxon rank sum test, p = 0.30 for the Kruskal-Wallis rank sum test; n = 46). Thus, the fact that an observed TE insertion is chimeric does not increase the likelihood that it is deleterious.
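The logic of the one-sided rank sum comparison can be sketched in a few lines. This is an illustrative implementation that assumes a normal approximation with average ranks for ties and no tie or continuity correction, so its p-values will not exactly match those from a statistical package; the function names are hypothetical and the inputs are toy values, not the pool frequencies in Table 4.

```python
import math


def midranks(values):
    """Return 1-based ranks, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to the end of the current run of tied values.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks


def rank_sum_greater(x, y):
    """One-sided rank sum test of the alternative that x tends to exceed y.
    Returns (U statistic for x, approximate upper-tail p-value)."""
    n1, n2 = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    u = sum(ranks[:n1]) - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z > z) for a standard normal
    return u, p


# Toy inputs: number of pools (out of six) in which each insertion was found.
chimeric = [0, 0, 1, 2, 6]
counterparts = [0, 0, 1, 2, 6]
u_same, p_same = rank_sum_greater(chimeric, counterparts)
print(f"identical samples: U = {u_same}, p = {p_same:.2f}")

u_shift, p_shift = rank_sum_greater([4, 5, 6], [1, 2, 3])
print(f"shifted samples:   U = {u_shift}, p = {p_shift:.3f}")
```

Identical samples give a p-value of 0.5, the expected value when there is no shift in either direction, while a sample that is entirely larger than its comparison gives a small p-value.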
While we do not provide evidence for unusual selection pressures acting on chimeric TE insertions overall, we do find a few exceptions to this general rule when TE insertions are analyzed on an individual basis. As shown in Figure 3, by comparing pool frequencies of chimeric TEs to those of the two types of non-chimeric counterparts, we detect evidence for two exceptional chimeric TE insertions. One, a Doc insertion (FBti0019430), which creates a truncated version of the putative choline transferase gene CHKov1 (CG10618), has a significantly elevated population frequency (chimera 44, Figure 3A) and has been reported previously to be a putatively adaptive TE insertion [17,18]. The second, a pogo (FBti0019206) insertion into the fructose-bisphosphate encoding gene fbp, has a significantly decreased population frequency (chimera 21, Figure 3B) and is likely to be more deleterious than similar non-chimeric pogo insertions.

Discussion
We conducted a thorough search for TE insertions in the mature transcripts of genes in the sequenced D. melanogaster genome. To do so we used three different computational methods, including a novel, indirect EST-based approach (see Materials and Methods). As with all EST-based bioinformatics methods, this new approach to finding gene-TE chimeras is subject to biases in EST library composition. Such an approach was necessitated by annotation biases in the Drosophila genome that would have caused any direct analysis of annotated transcripts to underestimate the number of putative chimeric transcripts in the genome. Despite these conflicting biases, most of the 63 genes generating chimeric transcripts were identified by more than one method [see Additional file 1], although each method revealed unique chimeric TE insertions. Thus, multiple complementary approaches should be used in genome-wide studies of TE domestication to overcome both annotation and methodological biases.
Even using multiple methods for detecting chimeric transcripts, we estimate that only 0.46% of protein coding genes in Drosophila generate chimeric transcripts. Clearly the number of chimeric genes would be expected to increase somewhat with better annotation and/or increased EST coverage. Nevertheless, the number of chimeric transcripts in the Drosophila genome is likely to be more than an order of magnitude less than in the human and mouse genomes, where an estimated 27% and 18% of genes contain TE sequences [11]. These results together also suggest a rank order relationship between the proportion of chimeric genes and the amount of TE DNA in a genome (human, 46.36%; mouse, 38.55%; fly, 5.3%) [31-33]; however, further studies are needed to evaluate the strength and generality of this trend. Even a low number of gene-TE chimeras, such as presently observed in the D. melanogaster genome, may in the long-term contribute to the evolution of new transcripts and help explain unusual aspects of genomic organization structures such as nested or intercalated genes.
The low number of chimeric transcripts observed is not just the result of random effects of sparse TE insertion or the deleterious effects of TEs on transcription in the D. melanogaster genome. In fact, we found far fewer chimeric TE insertions in the genome than expected, relative to the number of non-chimeric TE insertions found in introns. This result indicates that the majority of TE insertions that occur in mature gene transcripts have a much higher probability of being deleterious than non-chimeric, intronic ones. The paucity of chimeric TE insertions in exons relative to introns demonstrates that the deleterious effects of chimeric TE insertions must exceed the cost of simply being transcribed, and probably results from improper translation or disruption of other functions of the mRNA such as localization or stability. Many of these unobserved events may contribute to the genome-wide load of deleterious mutations found in natural populations of D. melanogaster [34,35].
Population frequencies of the chimeric TE insertions observed in the genome sequence of the isogenized y; cn, bw, sp strain on the whole do not differ significantly from those of their non-chimeric counterparts. This does not imply that chimeric TE insertions found in the sequenced strain have no effects on fitness; rather that the distribution of their fitness effects is not substantially different from that of the non-chimeric TE insertions located elsewhere in the genome. At worst the observed chimeric TE insertions may be weakly deleterious and counterselected, in contrast to the unobserved chimeric TE insertions, which are presumed to be strongly deleterious and purged rapidly from the population.
There is, however, some indirect evidence that chimeric TE insertions may in fact be less deleterious on average than non-chimeric TE insertions. If TE insertions are weakly deleterious, we expect a skew towards genomic regions of lower recombination where natural selection is less effective due to increased linkage between alleles of opposing selective effects [36]. This effect can be observed in the distribution of non-chimeric, intronic TE insertions, but is not observed in the distribution of chimeric TE insertions (Figure 2C). Thus, a typical observed chimeric TE insertion may in fact have a smaller negative effect on fitness than a typical non-chimeric TE insertion. This conclusion is supported by a lack of detectable fitness effects in direct experimental challenges on flies carrying the chimeric TE insertion detected in the Tequila (graal) gene [21].
The one TE insertion we did identify as putatively adaptive (chimera 44; Figure 3A) was previously identified in a randomly chosen set of ~60 TEs [17,18]. We conclude that, in a search for adaptive TE insertions, selecting chimeric TE insertions is no better than selecting TEs from the Drosophila genome at random. This is perhaps not surprising, considering our finding that there is nothing unusual about the fitness effects of observed chimeric TE insertions. It is possible, however, that our inability to detect a significant difference in selection pressures resulted from the relatively small sample of both chimeric and control TE insertions studied here. Consideration of a larger number of strain pools will provide us with more statistical power and might show effects of chimerism on TE fitness that were not detected in this study.
Regardless of the forces that may have governed their history, we did identify seven chimeric TE insertions that appear to be at high frequency ...
We sought evidence for the possibility of read-through transcription as an explanation for regulated TE expression by querying the second release of the BDGP in situ database [42,43] for embryonic expression patterns of the TEs and genes involved in chimeric transcripts detected in this study. Remarkably, as shown in Figure 4, we found that the embryonic expression pattern for developmental stages 11-16 of the gene CG12094 is almost identical to the expression pattern determined directly for the 412 element that is involved in the chimeric transcript generated by this gene.
Can read-through transcription from CG12094 explain the pattern of expression of the 412 element?
We believe the answer to this question is no, for the simple reason that the probe used to determine the expression patterns of the 412 element (GM07634) shares no sequences for potential cross-hybridization with the chimeric CG12094 transcript (Figure 4). In addition, the TE insertion in CG12094 is not fixed, whereas the pattern of 412 element expression is similar among different strains (see [44]), suggesting that the presence of the 412 element insertion in CG12094 is not required for the embryonic expression pattern of the 412 element. (In fact, these data taken together are more consistent with the stage 11-16 expression pattern of CG12094 detected by the RE52190 probe being generated by spurious cross-hybridization to transcripts emanating from 412 elements located elsewhere in the genome.) Thus in the case of the 412 element, we conclude that the best candidate gene in the D. melanogaster genome cannot explain somatic TE expression by production of a read-through chimeric transcript. Clearly more data will be necessary to evaluate the generality of this conclusion, but the lack of a role for read-through transcription in this case is generally consistent with the paucity and low population frequencies of the chimeric TE insertions in the D. melanogaster genome.

Conclusion
In contrast to mammalian genomes, we found that fewer than 1% of Drosophila genes produce mRNAs that include bona fide TE sequences, and that the vast majority of potential chimeric TE insertions are likely to be deleterious and therefore unobserved in the genome sequence. Of those chimeric TE insertions that have weak enough negative fitness effects to have been observed in the sequenced D. melanogaster genome, over half are restricted to the sequenced strain and fewer than ~15% are likely to be fixed and therefore contribute to the origin of new gene sequences in the D. melanogaster genome. The relatively low numbers of fixed chimeric TE insertions also argue against read-through transcription as a predominant mechanism for generating patterns of somatic TE transcription in Drosophila embryos. These results also highlight the need to establish the fixity of putative cases of TE domestication identified in other genome sequences in order to demonstrate their functional importance, and indicate that the process of TE domestication may vary drastically among animal taxa.

Methods

In silico screen for chimeric gene-TE transcripts
Chimeric gene-TE transcripts were identified by three independent methods (with the following number codes used in Additional file 1): 1) a genomic coordinate intersection analysis; 2) a TE-to-gene BLAST analysis; and 3) a TE-to-EST-to-gene BLAST analysis. Coordinate overlaps were evaluated using the UCSC D. melanogaster . Finally, the annotated host gene (±5000 bp) was used to query the TE database to ensure that a TE of the appropriate family is present in the genomic region, thereby filtering artifacts generated by EST library construction.
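The coordinate intersection step (method 1) amounts to asking whether an annotated TE interval overlaps an annotated exon interval on the same chromosome arm. The following is a minimal sketch in Python of that idea only, not a reconstruction of the tools actually used. It assumes 0-based, half-open coordinates, and the `Interval` type, the function names and any gene or TE labels passed to it are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Interval(NamedTuple):
    """An annotated feature (assumed 0-based, half-open coordinates)."""
    chrom: str
    start: int  # inclusive
    end: int    # exclusive
    name: str


def overlaps(a: Interval, b: Interval) -> bool:
    """True if the two features share at least one base on the same chromosome arm."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def chimeric_candidates(exons: List[Interval], tes: List[Interval]) -> List[Tuple[str, str]]:
    """Return sorted (exon name, TE name) pairs whose coordinates intersect.

    This is a simple all-against-all comparison, which is adequate for a
    sketch; a genome-scale screen would normally sort or index the intervals.
    """
    return sorted({(e.name, t.name) for e in exons for t in tes if overlaps(e, t)})
```

Pairs returned by such a comparison are only candidates; as described above, BLAST-based checks and the ±5000 bp genomic-context filter are still needed to exclude artifacts.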
Transcripts from heterochromatic regions of the Release 3 genome were excluded from this analysis, as were genes labeled as "pseudogene" or unnamed genes with "existence uncertain" status in FlyBase. We also note that as in [14], we excluded from this analysis the enigmatic We note that some of the isofemale strains above may be heterozygous for a given TE insertion. This would lead to a slight increase in the effective number of strains in any given pool. However, such an increase is unlikely to have an effect on the qualitative nature of our results, as the addition of several strains to a pool generally has no significant effect on the confidence limits of the population frequency of a TE. For instance, in the section on population frequency estimation (below, also [see Additional file 2]), we show the extent to which the population frequency estimate remains the same when we treat 8-strain and 12-strain pools as if they were equivalent to each other.

PCR assays
The presence/absence of TEs in all strain pools was determined using the polymerase chain reaction (PCR). All PCR primers were designed using Primer 3 [55] and were checked with Virtual PCR [56]. All primers have a melting temperature of 63°C (±0.2°C) and were synthesized by

), we give the number of strain pools in which the TE is present (column 3), the same for a similar TE with greater length in an area of higher recombination (columns 4 and 5), and for a similar TE with lower length inserted in an area with lower recombination (columns 6 and 7). For the first type of similar TE insertion, we expect slightly higher selective constraints, and thus slightly lower population frequency. The converse is true for the second type of similar TE insertion.
Frequency of chimeric TE insertions compared with similar non-chimeric counterparts in North American strains of D. melanogaster. A) Pool frequencies of chimeric TEs (blue dots) versus those of their counterparts with lower length and recombination (red circles). Chimeric TE 44 has a significantly greater pool frequency than its counterpart, and was previously found to be adaptive [17,18]. B) Pool frequencies of chimeric TEs (blue dots) versus those of longer counterparts with higher recombination rates (green triangles). Only one chimeric TE (number 21) has a significantly lower frequency than its counterpart. In both panels, the "population frequency" scale on the right-hand side gives maximum-likelihood estimates of the TE frequencies in the population (see Table 3 and the Methods for details).

Estimation of TE population frequencies from pool frequencies
Given that a TE insertion is present in some of the North American strain pools and absent from others (i.e. given its pool frequency), we wished to calculate the likeliest frequency of this insertion in the entire North American population, as well as suitable confidence bounds around such a frequency estimate.
Let $x_1$ (a number between 0 and 2) and $x_2$ (a number between 0 and 4) be the respective numbers of 8-strain and 12-strain pools in which a particular element is present. Let $y$ be the theoretical frequency of this element in the North American D. melanogaster population. The likelihood $L$ of any particular value of $y$ given the observed values of $x_1$ and $x_2$ is proportional to the probability of obtaining such $x_1$ and $x_2$ if $y$ has that value. That is,

$$L(y \mid x_1, x_2) \propto \Pr(x_1 \mid y) \times \Pr(x_2 \mid y) \tag{1}$$

where $\Pr(x_1 \mid y)$ is the probability that $x_1$ out of two 8-strain pools contain the element and $\Pr(x_2 \mid y)$ is the probability that $x_2$ out of four 12-strain pools contain the element, given that its overall frequency in the population is $y$.
The first term on the right hand side of equation (1) is equal to:

$$\Pr(x_1 \mid y) = \binom{2}{x_1} \left[1 - (1-y)^{8}\right]^{x_1} \left[(1-y)^{8}\right]^{2 - x_1} \tag{2}$$

where $(1-y)^{8}$ is the probability that an element is not found in a given 8-strain pool, $1-(1-y)^{8}$ is the probability that it is, and the first term on the right hand side is the appropriate binomial coefficient. Similarly, the second term of equation (1) is equal to:

$$\Pr(x_2 \mid y) = \binom{4}{x_2} \left[1 - (1-y)^{12}\right]^{x_2} \left[(1-y)^{12}\right]^{4 - x_2} \tag{3}$$

Substituting (2) and (3) into (1) and simplifying, we find that

$$L(y \mid x_1, x_2) = k \left[1 - (1-y)^{8}\right]^{x_1} \left[1 - (1-y)^{12}\right]^{x_2} (1-y)^{64 - 8x_1 - 12x_2} \tag{4}$$

where $k$ is an arbitrary multiplicative constant that absorbs the binomial coefficients in (2) and (3), since they are independent of the parameter $y$. In accordance with common practice, we make use of the log-likelihood function $\ln(L)$, which entails an arbitrary additive constant $\ln(k)$.
Additional file 2 provides three examples of the resulting log-likelihood functions. These functions correspond to the three possible combinations of $x_1$ and $x_2$ that yield a total of four pools with detected element presence (i.e. for $(x_1, x_2)$ equal to (0, 4), (1, 3) and (2, 2)). This file demonstrates that, given that the element is present in four out of six pools, estimation of population frequencies is relatively insensitive to the number of pools that contain eight or 12 strains. Therefore, to simplify the analysis, we combined all combinations of $x_1$ and $x_2$ under a common category such that $x_1 + x_2 = 4$.
For each log-likelihood function, the maximum likelihood estimate of the population frequency is the value of $y$ at which the function reaches its maximum (middle column of Table 3). The confidence limits are the values of $y$ at which the function drops to its maximum minus two, as determined by a likelihood ratio test (rightmost column of Table 3). The test statistic is the likelihood ratio of the 0-parameter model, where $y$ is fixed at the value of its maximum likelihood estimate, to the 1-parameter model, where $y$ is allowed to vary. This statistic is distributed as a $\chi^2$ distribution with one degree of freedom. When the difference in log-likelihoods increases above 2, the likelihood ratio increases above $e^2 = 7.39$, where $e$ is the base of the natural logarithm. This value is the 99.3% quantile of the $\chi^2$ distribution (corresponding to p = 0.007, 1 d.f.).
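The estimation procedure described in this section can be sketched numerically. The Python example below is an illustration under stated assumptions rather than the code used for Table 3: it evaluates the log-likelihood (up to the additive constant) on a grid of candidate frequencies, takes the grid maximum as the estimate, and reports the range of frequencies whose log-likelihood lies within two units of the maximum. The function names, the grid resolution and the grid search itself are our own choices.

```python
import math

POOL_SIZES = (8, 12)   # strains per pool, as described in the text
POOL_COUNTS = (2, 4)   # two 8-strain pools and four 12-strain pools


def log_likelihood(y, x1, x2):
    """ln L(y | x1, x2) up to an additive constant, for 0 < y < 1."""
    total = 0.0
    for size, n_pools, x in zip(POOL_SIZES, POOL_COUNTS, (x1, x2)):
        p_absent = (1.0 - y) ** size      # element missing from one pool
        p_present = 1.0 - p_absent        # element found in one pool
        if x > 0:
            total += x * math.log(p_present)
        if n_pools - x > 0:
            total += (n_pools - x) * math.log(p_absent)
    return total


def estimate_frequency(x1, x2, steps=9999, drop=2.0):
    """Grid-search estimate of the population frequency and its limits.

    Returns (mle, lower, upper), where lower and upper bound the grid points
    whose log-likelihood is within `drop` units of the maximum. When the
    element is in no pools or in all pools, the maximum sits at the edge of
    the grid and the estimate is only a boundary value.
    """
    grid = [i / (steps + 1) for i in range(1, steps + 1)]  # open interval (0, 1)
    values = [log_likelihood(y, x1, x2) for y in grid]
    best = max(values)
    mle = grid[values.index(best)]
    inside = [y for y, v in zip(grid, values) if v >= best - drop]
    return mle, min(inside), max(inside)
```

As a sanity check, the case $x_1 = 2$, $x_2 = 0$ has a closed-form maximum at $y = 1 - 0.75^{1/8} \approx 0.035$, which `estimate_frequency(2, 0)` should reproduce to within the grid spacing. Calling the function for (0, 4), (1, 3) and (2, 2) gives one way to compare the three combinations that place the element in four of the six pools.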
These confidence limits were used to set the error bars in Figures 3A and 3B. Note that in situations with more than one possible combination of $x_1$ and $x_2$, the two rightmost columns of Table 3 list values that are averaged over all such combinations.

Estimation of genomic recombination rate in the neighborhood of each TE insertion
We estimated the recombination rate at each TE insertion site using a method previously developed for the D. melanogaster genome [54]. This method combines the known physical and genetic distances between D. melanogaster genes to estimate the recombination rate profile of each chromosome as a second-degree polynomial function. An explanation of the method, and a tool that demonstrates its use, can be found on the world-wide web [57].
In Figure 2C, we classify chromosomal sites where the polynomial functions in [54] drop below zero as areas with "zero" recombination. We find that for the TE insertions in non-zero recombination areas, the median recombination rate is 2.75 cM/Mbp. Accordingly, we classify chromosomal sites with recombination rates above 0 and below 2.75 cM/Mbp as areas with "low" recombination rates. The remaining chromosomal regions are labeled as areas of "high" recombination.
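The three recombination classes can be expressed as a simple rule applied to the value of the polynomial at an insertion site. The Python sketch below is illustrative only: the polynomial coefficients are supplied by the caller and are hypothetical (the fitted values are given in [54], not here), and treating a rate of exactly zero as "zero" recombination is our assumption, since the text does not specify that boundary.

```python
LOW_HIGH_THRESHOLD = 2.75  # cM/Mbp, the median rate quoted in the text


def recombination_rate(position_mbp, coeffs):
    """Evaluate a second-degree polynomial a*x**2 + b*x + c at a position.

    `coeffs` is an (a, b, c) tuple; the values must come from the fitted
    chromosome profile and are not provided by this sketch.
    """
    a, b, c = coeffs
    return a * position_mbp ** 2 + b * position_mbp + c


def classify_recombination(rate_cm_per_mbp):
    """Assign a site to the "zero", "low" or "high" recombination class."""
    if rate_cm_per_mbp <= 0:          # polynomial at or below zero (assumed boundary)
        return "zero"
    if rate_cm_per_mbp < LOW_HIGH_THRESHOLD:
        return "low"
    return "high"
```

For example, `classify_recombination(recombination_rate(pos, coeffs))` would label a TE insertion at position `pos` once the appropriate coefficients for its chromosome arm are known.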