A Unique Ribosome Signature Reveals Bacterial Translation Initiation Sites

While methods for annotation of genes are increasingly reliable the exact identification of the translation initiation site remains a challenging problem. Since the N-termini of proteins often contain regulatory and targeting information developing a robust method for start site identification is crucial. Ribosome profiling reads show distinct patterns of read length distributions around translation initiation sites. These patterns are typically lost in standard ribosome profiling analysis pipelines, when reads from footprints are adjusted to determine the specific codon being translated. Using these unique signatures we build a model capable of predicting translation initiation sites and demonstrate its high accuracy using N-terminal proteomics. Applying this to prokaryotic samples, we re-annotate translation initiation sites and provide evidence of N-terminal truncations and elongations of annotated coding sequences. These re-annotations are supported by the presence of Shine-Dalgarno sequences, structural and sequence based features and N-terminal peptides. Finally, our model identifies 61 novel genes previously undiscovered in the genome.

Identification of translated open reading frames (ORFs) is a critical step towards annotation of genes and the understanding of a genome. In addition to providing functional information via the peptide sequence, regulatory and targeting information are often contained within protein N-termini 1,2 , this makes accurate identification of the beginning of ORFs essential. Whole genome ORF identification in prokaryotes is most commonly performed in silico, using a variety of sequence features, such as GC codon bias and motifs such as the ribosomal binding site or Shine-Dalgarno sequence 3,4,5 in order to differentiate those ORFs that are thought to be functional from those that occur in the genome by chance. While these techniques are able to identify genomic regions containing ORFs with a high accuracy 5 , predicting translation initiation sites (TISs), and thus the exact beginning of a protein coding sequence (CDS), is substantially more challenging. This has led to the development of a number of in silico based TIS identification methods relying on a variety of sequence features [6][7][8][9] , which are typically post processing tools applied after initial ORF annotation in order to re-annotate the often erroneously predicted TIS.
High throughput proteogenomics has the potential to enable identification of protein Ntermini, and by extension TISs, from an entire proteome. In practice however variation in protein expression levels, physical properties, MS-incompatibility and the occurrence of protein modifications limit the number of detectable protein N-termini 10,11 . In prokaryotes N-terminal proteomics typically captures the corresponding peptides of hundreds to the low thousands of genes 11 . For example, a recent study identified N-terminal peptides of 910 of the 4140 (22%) annotated genes in Escherichia coli 12 . Although falling short of providing full genome annotation, such datasets provide an effective means of experimental TIS validation.
Significantly higher coverage of TISs can be achived by using sequencing based technologies. By specifically focusing on ribosome protected fragments, ribosome profiling 13 (ribo-seq) infers which parts of the transcriptome are actively undergoing translation. In this way, ribo-seq has been used to demonstrate translation of many RNAs and regions that were not thought to be associated with ribosomes [14][15][16][17][18][19][20] . Being able to identify translation on a transcriptome-wide scale has obvious application to ORF annotation and a number of methodologies have been developed for prediction of translated ORFs 17,19,[21][22][23] . These methods rely on a number of features, like codon periodicity, read context and read lengths, in order to distinguish footprints indicative of translation 72 from other, non-translating, footprints frequently observed in ribo-seq data. While extensive progress has been made on finding translated regions, delineating their exact boundaries has received less attention. Antibiotic treatment can be used to stall and capture footprints from the initiating ribosome 14,24,25 , but finding a suitable compound has been elusive in prokaryotes with only one dataset to date 26 .
Here we present a generally applicable method that does not depend on specialised chemical treatment, but can be take advantage of such data (Figure 1a). Using N-terminal proteomics we demonstrate its high accuracy and show that it is consistent with other features linked to translation initiation. Applying the model we predict numerous novel initiation sites in Salmonella enterica serovar Typhimurium.

Ribosome profiling enables accurate annotation of translation initiation sites
We trained a random forest model on TISs from the top 50% translated ORFs (see methods), to recognise the patterns in 5' ribo-seq read lengths and sequence contexts in a -20 to +10 nt window around start codons. In addition we encoded information about the start codon position within the ORF and the read abundance upstream and downstream of the start sites. The model was then used to predict TISs from all in-frame cognate and near cognate (one edit distance from ATG) start codons around annotated genes in the S.
Typhimurium genome. Predictions on the two samples were highly accurate with area under curve (AUC) values of 0.9958 and 0.9956 on independent validation sets for the monosome and polysome sample, respectively (parameter importance for the models is summarised in Supplementary Tables S1-2). In total 4610 (monosome) and 4601 (polysome) TISs were predicted in the two sets. From these, we constructed a high confidence set from predictions common to both replicates. In total this set contained 4272 predictions, representing an 86.50% agreement between the replicates. The discrepancies predominantly originate from genes with scarce translation. Of the high confidence TISs,  To further assess the predictions we compared the newly predicted TISs with the previously, potentially erroneously, annotated TIS. A highly significant sequence feature of translation initiation sites is the Shine-Dalgarno (SD) sequence which facilitate translation initiation in prokaryotes 29 . The consensus sequence GGAGG is located approximately 10 nt upstream of the start codon 30  sequences centred 9-10 nt upstream of the start codon ( Figure 4a). Strikingly the annotated TISs, in these same genes where our model has predicted novel sites, show an absence of the SD sequence ( Figure 4a). Since our model evaluates sequence context it is unsurprising that the predictions carry this signature, but the absence of these motifs around previously annotated start codons is notable.
Besides the presence of SD sequences, the guanine-cytosine (GC) content is commonly used to identify CDSs in prokaryotes. The overall GC content of a genome or genomic region is often highly optimised. In coding regions this optimisation can be achieved via synonymous substitutions, predominantly at third codon positions 31  Another significant feature of prokaryotic translation initiation is the absence of intrinsic structure in the region around the start codon enabling easier access for ribosomes to bind 32 . We therefore calculated the average free energy over all predicted sites and compared them to the previous annotation in the same genes. Consistent with GC content patterns, the annotated sites display a lower propensity to form secondary structure upstream of the start codon in elongated ORFs, and downstream of the start codon in truncated ORFs. In the predicted sites these less-structured regions can clearly be observed directly over the start codon, highly indicative of true initiation sites (Figure 4b lower).
Ribosomes translocate along mRNAs three nucleotides at a time, corresponding to one codon and amino acid (aa). Consequently, reads originating from bona fide translated regions also exhibit a three nucleotide periodicity in adjusted read counts, with a bias towards mapping to the first nucleotide in each codon 21 Table S4).

Translation initiation sites are predicted at novel genomic regions
In order to discover potential novel translated ORFs we applied our prediction models to look for TISs in genomic regions outside annotated ORFs. Novel ORFs that were similar in size to known CDSs (> 100aa) and with ribo-seq coverage along a high proportion of the ORF (>75 % coverage, see methods) were considered candidate translated novel ORFs. Of  Table S5).

Tetracycline treated samples improve classifier accuracy
While reads isolated from elongating ribosomes provide sufficient information to predict the majority of translation start sites we set out to explore the full potential of our classifier in combination with publicly available data from initiating ribosomes. A recent study on E. coli 12 demonstrated the use of tetracycline as a translation inhibitor to enrich for footprints from initiating ribosomes in prokaryotes. The tetracycline datasets show the pattern that we expect to see from initiating ribosomes as a range of read lengths starting 28-14nt upstream of the initiation codon (5' data), but ending at the same positions 14-15nt downstream of the initiation codon (3' data). An additional pattern of shorter fragment lengths can also be observed starting 26-18nt upstream, and ending 2nt downstream of the initiation codon (Supplementary Fig. S1.a,b,g,h).
We trained separate classifiers on chloramphenicol (elongating) and tetracycline (initiating) libraries from this dataset, using two replicates for each of the conditions (Supplementary    Table S4).
As expected, models based on initiating reads perform better than models based on elongating ribo-seq reads, suggesting that an optimal strategy for TIS identification would favour the use of the more focused, initiating ribo-seq profiles. However the degree of improvement between the models was relatively small, confirming the suitability of both elongating and initiating ribo-seq libraries for the purposes of TIS and ORF detection.
While the mechanistic or experimental origin of the patterns that our model captures remain unexplored, it is interesting to note the importance the models place (Supplementary Tables S1-2 Whether similar patterns of read length distributions can be observed in eukaryotic riboseq datasets remains to be determined, although the method that we describe in this article is, regardless, fully extendable to eukaryotic datasets. In conclusion, this study demonstrates the utility of ribo-seq fragment length patterns for TIS identification across multiple experimental conditions. These models provide a significant step forward in experimental TIS discovery, facilitating the move towards complete ORF annotation in both presumably well-annotated model organisms, as well as the ever growing list of newly sequenced genomes. Ribosome-protected mRNA footprints with sizes ranging from 26-34 nt were selected and processed as described previously 14 with some minor adjustments as previously described 35 . The resulting ribo-seq cDNA libraries of the monosome and polysome sample were duplexed and sequenced on a NextSeq 500 instrument (Illumina) to yield 75 bp single-end reads.

Ribo-seq data processing
Ribo-seq data were preprocessed with cutadapt 36  Recent publications reporting prokaryotic ribo-seq 26,28,39,40 suggest that read fragments from libraries digested with micrococcal nuclease align more precisely to their 3' rather than 5' ends. Consistent with this, we observe a modest increase in the periodicity of meta profiles of the S. Typhimurium ribo-seq libraries when reads are brought to codon resolution from the 3' end ( Supplementary Fig. S1), however this does not hold true for the E. coli datasets, where the use of 3' poly adenosine adaptors, results in a loss of resolution at the 3' end after read trimming ( Supplementary Fig. S1), making the use of 5' ends preferable. Regardless, the protected read fragment patterns that we use in the input feature vectors for the classifier takes both length and position into consideration. Consequently the classifier is unaffected by this choice. However, to maintain consistency throughout this study read counting for model predictors was performed from the 5' end for all libraries.

Read distributions and heat maps
Ribo-seq read distributions were summarised over all annotated start codons in the S.
Typhimurium and E. coli annotations respectively. 5' read counts were taken from regions 30nt upstream to 60nt downstream of the start codon, 3' read counts were taken from the first nucleotide of the start codon up to 90 nucleotides downstream. All reads with a MAPQ greater than 10, from the upper 90% of genes by total CDS expression were included. Total counts were scaled to a sum of one per individual region, in order to not disproportionately favour profiles from highly expressed genes. Meta plots were then produced to show the proportion of read counts over the window across all genes. 3' and 5' heatmaps were generated from the scaled regions, showing the number of standard deviations from the row (fragment length) mean.

Model implementation
For each candidate TIS a feature vector was defined as each nucleotide in a -20 to +10nt window around the position, the ribo-seq 5' FPKM (fragments per kilobase per million mapped reads) between the current position and the next in-frame downstream stop codon, the count of potential in-frame start sites (codons within one edit distance of ATG) from the nearest in-frame upstream stop codon to the current position, the proportion of 5' ribo-seq reads upstream in a 20 nt window, the proportion of 5' ribo-seq reads downstream in a 20 nt window, the ratio of 5' ribo-seq reads up and downstream and the proportion 5' ribo-seq counts per fragment length for a fixed range of positions in relation to current site (selected from visual inspection of 5' fragment length heatmaps (Supplementary Fig. S4).  HCl (4M f.c.) and subjected to N-terminal COFRADIC analysis as described previously 42 . Free amines were blocked at the protein level making use of an Nhydroxysuccinimide ester of (stable isotopic encoded) acetate (i.e. NHS esters of 13 C2D3 acetate), which allows distinguishing in vivo and in vitro blocked N-terminal peptides 43 . The modified protein sample was digested overnight with sequencing-grade modified trypsin (1/100 (w/w trypsin /substrate)) at 37 °C and subsequent steps of the N-terminal COFRADIC procedure were performed as previously described 42 .   N-term)) and pyroglutamate formation of N-terminal glutamine (both at peptide level). Endoproteinase semi-Arg-C/P (semi Arg-C specificity with Arg-Pro cleavage allowed) was set as enzyme allowing for no missed cleavages. Mass tolerance was set to 10 ppm on the precursor ion and to 0.5 Da on fragment ions. Peptide charge was set to 1+, 2+, 3+ and instrument setting was put to ESI-TRAP. Only peptides that were ranked one, have a minimum amino acid length of seven, scored above the threshold score, set at 95% confidence, and belonged to the category of in vivo or in vitro blocked N-terminal peptides compliant with the rules of initiator methionine (iMet) processing 44 were withheld.

LC-MS/MS analysis
More specifically, iMet processing was considered in the case of iMet-starting N-termini followed by any of the following amino acids; Ala, Cys, Gly, Pro, Ser, Thr, Met or Val and only if the iMet was encoded by ATG or any of the following near-cognate start codons; GTG and TTG (Supplementary Table S13). In contrast to eukaryotic nascent protein Ntermini, the typical lack of significant steady-state levels of N-terminal protein modification (e.g. Nt-acetylation or Nt-formylation), warrant caution to unequivocally assign bacterial protein N-termini as proxies of translation initiation. As such, a high confidence subset of Nt-formylated Met-starting N-termini, was selected (Supplementary Table S14). Amino acid sequences of novel ORF were compared to known proteins in the nonredundant protein database (Update date:2016/12/15) and protein domains (cdd.v.3.15) using BLASTP 47 (version 2.5.1+). Hits with the greatest coverage of query sequence and lowest evalue were selected. Hits were considered highly similar if they shared >95% identify to a protein sequence, over 100% of the novel ORF sequence

DATA AVAILABILITY
The previously published E. coli ribo-seq dataset was downloaded from the NCBI SRA