Open Access

Sequence-based prediction of permissive stretches for internal protein tagging and knockdown

  • Sabine Oesterle1,
  • Tania Michelle Roberts1,
  • Lukas Andreas Widmer1, 2, 3,
  • Harun Mustafa1, 4,
  • Sven Panke1 and
  • Sonja Billerbeck1, 5 (corresponding author)
BMC Biology 2017, 15:100

https://doi.org/10.1186/s12915-017-0440-0

Received: 1 August 2017

Accepted: 11 October 2017

Published: 30 October 2017

Abstract

Background

Internal tagging of proteins by inserting small functional peptides into surface accessible permissive sites has proven to be an indispensable tool for basic and applied science. Permissive sites are typically identified by transposon mutagenesis on a case-by-case basis, limiting scalability and their exploitation as a system-wide protein engineering tool.

Methods

We developed an approach for predicting permissive stretches (PSs) in proteins based on the identification of length-variable regions (regions containing indels) in homologous proteins.

Results

We verify that a protein's primary structure information alone is sufficient to identify PSs. Identified PSs are predicted to be predominantly surface accessible; hence, the position of inserted peptides is likely suitable for diverse applications. We demonstrate the viability of this approach by inserting a Tobacco etch virus protease recognition site (TEV-tag) into several PSs in a wide range of proteins, from small monomeric enzymes (adenylate kinase) to large multi-subunit molecular machines (ATP synthase) and verify their functionality after insertion. We apply this method to engineer conditional protein knockdowns directly in the Escherichia coli chromosome and generate a cell-free platform with enhanced nucleotide stability.

Conclusions

Functional internally tagged proteins can be rationally designed and directly chromosomally implemented. Critical for the successful design of protein knockdowns was the incorporation of surface accessibility and secondary structure predictions, as well as the design of an improved TEV-tag that enables efficient hydrolysis when inserted into the middle of a protein. This versatile and portable approach can likely be adapted for other applications, and broadly adopted. We provide guidelines for the design of internally tagged proteins in order to empower scientists with little or no protein engineering expertise to internally tag their target proteins.

Keywords

Permissive site; Internal protein tagging; TEV protease; Protein knockdowns; Cell-free biotechnology

Background

Small functional peptides have shifted into focus as tools for advanced in vivo imaging and chemical biology. Peptides offer a diversity of functions condensed into few amino acids: they can serve as highly specific binding motifs for small molecules or metals [1–3] and as recognition sequences for proteases [4] or labelling enzymes [5, 6], and they can self-catalyse covalent bonding [7]. Furthermore, they can mimic carbohydrates [8] and serve as inhibitors [9] or as epitopes able to elicit an immune response [10] or modulate innate immunity [11]. Tagging proteins with functional peptides has revolutionised the ease and scale of protein purification, opened novel pathways for vaccine design, and also enabled visualisation and characterisation of biological systems and processes in vivo and in vitro [12]. Although many proteins can be tagged N- or C-terminally, it is frequently necessary or desirable to insert a tag internally at a permissive site that accepts additional amino acids. There are several possible reasons for doing this, in addition to multiple tagging. The termini of a protein may be functionally relevant or buried [13–16], an internal tag might be more resistant to proteolytic degradation than a terminal fusion [15], the inserted peptide may need to be structurally stabilised in order to exhibit its function (e.g. sufficiently rigidified as shown for lanthanide binding tags for nuclear magnetic resonance (NMR) studies [3]), or the specific peptide’s conferred function may require it to be located internally (as is the case for engineering conditional protein knockdowns by protease hydrolysis site insertion) [13, 17–19].

Due to the limited understanding of the precise mechanisms underlying site permissiveness, the full potential of internal protein tagging has largely remained untapped. State-of-the-art approaches are based on transposon mutagenesis [14, 18, 19]. We previously showed the feasibility of this approach by identifying permissive sites in the molecular chaperonin GroEL [13]. However, transposon mutagenesis is laborious, involving several in vitro DNA manipulation and engineering steps. This limits its potential use for high-throughput protein tagging and thus prevents the true exploitation of permissive sites as a proteome-wide engineering approach. Furthermore, tags inserted by transposon mutagenesis contain large (~19 bp) transposase recognition sites flanking the tag sequence, which we have previously observed to impair protein function [13].

In contrast, rational approaches could minimise the number of engineering cycles for designing and implementing desired functions and thus improve scalability. In combination with novel precision genome-editing tools [20, 21], rational design approaches could augment systematic internal protein tagging efforts directly in the genome, e.g. as recently proposed for systematic protein quantification using designed peptide tags for mass spectrometry [22].

Here, we present a general method for predicting permissive stretches (PSs) in proteins. We hypothesise that length-variable regions (regions containing indels) in homologous proteins tolerate insertions. Such regions can be inferred by searching for gaps in a multiple sequence alignment (MSA) of homologous proteins. A similar strategy was employed to identify a permissive site in the glycoprotein of vesicular stomatitis virus [23], the yeast Ser/Thr kinases TOR1 and TOR2 [16], and the zebrafish proteins Tcf21 and Tbx18 [24], but the generality of this approach remained unclear. This procedure only requires primary sequence information, which is available for presumably any protein of interest. We further combine this method — which we call permissive stretch search (PSS) — with secondary structure and surface accessibility measures to establish a workflow that allows us to select permissive sites, which are structurally flexible and located in surface accessible regions (Fig. 1).
Fig. 1

Established workflow for identifying permissive stretches (PSs) in proteins and design of protein knockdowns. The established workflow is exemplified with adenylate kinase (Adk) and requires primary structure information alone. a Gaps in a multiple sequence alignment (MSA) of several (>5) homologous proteins indicate stretches in a protein likely permissive to insertion of additional amino acid residues. b The span of a PS is defined as the gap in the alignment plus its flanking residues. The four identified PSs within Adk are indicated with Roman numerals. c The design of protein knockdowns requires the insertion of a Tobacco etch virus protease recognition site (TEV-tag) into a flexible, surface accessible PS. Relative surface accessibility (RSA) and structural context of a PS can be predicted based on primary structure information. RSA values for each PS within Adk are indicated and were calculated by computing the geometric means of the RSA values of adjacent residue pairs within a given stretch and taking their maximum value. RSA values range from 0 (buried) to 1 (fully exposed). The average maximum geometric mean RSA of a random stretch was determined to be 0.30. For illustration, PSs were mapped onto the surface representation of the crystal structure of Adk (Protein Data Bank (PDB) 1AKE). d The information acquired above guides the identification of a potentially functional, surface exposed, and flexible PS for chromosomal TEV-tag insertion. PSII shows a high RSA, and secondary structure prediction indicates that it stretches across a 6-residue loop. PSIII shows the same RSA as PSII and it stretches across an 18-residue loop. However, PSIII was shown to be functionally relevant ([42] and Fig. 2) and was therefore not chosen for TEV-tag insertion
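The RSA score described in panel c can be illustrated with a minimal sketch. It assumes that per-residue RSA values for a stretch are already available as a list of numbers between 0 and 1; the function name `max_pair_geomean_rsa` and the example values are hypothetical and not taken from the study.

```python
from math import sqrt


def max_pair_geomean_rsa(rsa_values):
    """Return the largest geometric mean of RSA over adjacent residue pairs.

    rsa_values: per-residue RSA values (0 = buried, 1 = fully exposed)
    for one stretch, in sequence order.
    """
    if len(rsa_values) < 2:
        raise ValueError("a stretch needs at least two residues to form a pair")
    # An insertion sits between two neighbouring residues, so each adjacent
    # pair is scored by the geometric mean of its two RSA values.
    return max(sqrt(a * b) for a, b in zip(rsa_values, rsa_values[1:]))


# Made-up RSA values for a four-residue stretch (illustration only).
example_stretch = [0.1, 0.4, 0.9, 0.2]
score = max_pair_geomean_rsa(example_stretch)
```

Using the geometric rather than the arithmetic mean penalises pairs in which one of the two neighbours is buried, which seems to match the intent of scoring an insertion site rather than a single residue.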

We apply this workflow to engineer conditional protein knockdowns for tailoring cell-free production platforms, in which the catalysts for complex biological processes can be recruited from living cells but employed outside the cell. Cell-free biotechnology is a quick and cost-effective method that allows for facilitated supply of non-membrane permeable or toxic substrates, easy monitoring, manipulation, and access to the desired products [25]. Cell-free platforms have shifted into focus for in vitro protein synthesis, for production of fine chemicals or medications [25–28], or for implementation of paper-based biosensors [29].

To prevent laborious and expensive purification schemes, we only disrupt the cellular envelope, leaving a crude lysate as the source of the required catalysts [26]. However, yield-efficient operation is then compromised by the presence of a complex enzymatic background that interferes with the desired reaction by sequestering starting materials, intermediates, and/or co-factors [26].

Here we tackle this problem by employing conditional protein knockdowns: we label enzymes with an internal Tobacco etch virus protease recognition site (TEV-tag) at a PS such that it can be inactivated at the cell-free stage through hydrolysis by a selective orthogonal protease. This allows us to target essential proteins that cannot be genetically inactivated or proteins that cause a growth phenotype in the biomass production strain in case of genetic elimination.

As a proof of principle, we address the rapid degradation of the expensive and universal co-factor adenosine triphosphate (ATP) and its hydrolysis product adenosine diphosphate (ADP), a problem that constrains the productivity of most cell-free production efforts [26, 30]. To avoid the addition of stoichiometric amounts of ATP, these cell-free processes rely on ATP regeneration from ADP. Stabilising the availability of the expensive co-factor ATP as well as its hydrolysis product ADP could allow for a more cost-efficient operation of cell-free production systems in general, especially when combined with existing strategies for ATP generation from glucose and the usage of nucleoside monophosphates (NMPs) as a source for the generation of nucleoside triphosphates (NTPs), which are required for messenger RNA (mRNA) production [31].

Results

Identification of functional PSs within proteins using sequence information alone

We first tested our hypothesis that length-variable regions are permissive to tag insertion with two Escherichia coli proteins for which various sites permissive to five-residue insertions had been experimentally identified by transposon mutagenesis previously: triosephosphate isomerase (TpiA, personal communication with Victor de Lorenzo, Additional file 1: Table S1) and TEM1 β-lactamase (Bla) [32]. For both proteins, sequences of four to six functionally conserved homologs with sequence identities ranging between 23% and 52% were selected and aligned. Input sequences are summarised in Additional file 2: Table S2 and alignments are given in Additional file 3: Figure S1.

The imposed sequence identity range was defined empirically: sequences with high similarity had very few gaps, while sequences with low identity exhibited too much alignment error, and their homology cannot be reliably determined from sequence information alone [33]. In general, an identity range between ~30% and ~70% works best in our view (see Additional file 2: Table S2 for input sequences and identity ranges). Transposon-identified permissive sites for TpiA and Bla were mapped onto the corresponding alignments (Additional file 2: Table S2 and Additional file 3: Figure S1). Remarkably, almost all of the experimentally identified permissive sites within TpiA (13 out of 16) and all within Bla (5 out of 5) mapped either directly to one of the identified gaps (9 out of 13 for TpiA and 4 out of 5 for Bla) or scattered close (within +/– 5 residues) to an identified gap (4 out of 13 for TpiA and 1 out of 5 for Bla). Scattered permissive sites — sites that are shifted from the actual gap in the specific alignment — tended to be at least part of the same secondary structural element as the identified gap. For Bla, semi-permissive sites (arbitrarily defined as conferring ≤ 30% wild-type resistance to ampicillin by the authors of the study) and non-permissive sites (loss of function) had been characterised as well [32]. Only 4 out of 10 semi-permissive sites mapped to one of the identified gaps, but perhaps more importantly, none of the non-permissive sites mapped to a gap.
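Selecting homologs by identity range can be sketched as follows. This is an illustration under stated assumptions, not the procedure used in the study: identity is computed here over alignment columns in which both sequences carry a residue (other definitions of the denominator exist), the sequences are assumed to be pre-aligned with `-` as the gap character, and the function names and toy sequences are hypothetical. The identity window is passed in explicitly rather than hard-coded.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over columns where both aligned sequences have a residue."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    pairs = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != gap and b != gap]
    if not pairs:
        return 0.0
    matches = sum(1 for a, b in pairs if a == b)
    return 100.0 * matches / len(pairs)


def homologs_in_range(target, homologs, low, high):
    """Keep aligned homologs whose identity to the target lies in [low, high] percent."""
    return [h for h in homologs if low <= percent_identity(target, h) <= high]


# Toy aligned sequences (made up for illustration, not from the study).
target = "ACDEF"
candidates = ["ACDEF", "ACDXX", "XXXXX"]
selected = homologs_in_range(target, candidates, low=30, high=70)
```

In this toy case only the middle candidate is kept: the first is too similar to produce informative gaps and the last is too divergent to align reliably.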

Overall, this first analysis suggested that the predicted gapped regions in the alignments were indeed tolerant to insertion of additional amino acids. However, the gap search resulted in the identification of a potentially flexible stretch within a protein rather than a precise site. We therefore call this approach permissive stretch search (PSS) rather than permissive site search.
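The gap search itself can be sketched in a few lines. The sketch assumes an MSA given as equal-length strings with the target protein first and `-` as the gap character, treats any column with a gap in at least one sequence as length-variable, and reports each stretch as the gap plus `flank` target residues on either side (1-based, inclusive). The function name, the `flank` parameter and the any-sequence gap criterion are assumptions made for illustration; terminal gaps are not treated specially and overlapping stretches are not merged.

```python
def permissive_stretches(msa, flank=1, gap="-"):
    """Return (first, last) target residue spans around gapped MSA columns."""
    target = msa[0]
    ncol = len(target)
    if any(len(seq) != ncol for seq in msa):
        raise ValueError("all aligned sequences must have the same length")

    # residues_seen[c] = number of target residues in columns 0..c
    residues_seen = []
    count = 0
    for ch in target:
        if ch != gap:
            count += 1
        residues_seen.append(count)
    total = count

    stretches = []
    col = 0
    while col < ncol:
        if any(seq[col] == gap for seq in msa):
            start = col
            while col < ncol and any(seq[col] == gap for seq in msa):
                col += 1
            end = col - 1
            # Last target residue left of the gap run, first one right of it.
            left = residues_seen[start - 1] if start > 0 else 0
            right = residues_seen[end] + 1
            first = max(1, left - flank + 1)
            last = min(total, right + flank - 1)
            if first <= last:
                stretches.append((first, last))
        else:
            col += 1
    return stretches


# Toy alignments (made up): a deletion in a homolog and an insertion in a homolog.
deletion_case = permissive_stretches(["ABCDEF", "AB--EF"])
insertion_case = permissive_stretches(["AB--EF", "ABCDEF"])
```

When the homolog carries extra residues, the resulting stretch collapses to the two target residues flanking the gap, which is why a PS can be as short as two residues.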

Experimental validation of PSS with three test proteins

To experimentally validate the PSS, we used three test proteins which are either essential or conditionally essential: adenylate kinase (Adk), glycerol-3-phosphate dehydrogenase (GpsA), and the previously discussed triosephosphate isomerase (TpiA). MSAs for all test proteins, the identified PSs, and their numbering can be found in Additional file 4: Figure S2. We selected TpiA as a test protein for PSS — despite the fact that transposon-identified permissive sites for this protein are known — to validate the permissiveness of PSs that had not been transposon-identified (PSIV, Additional file 4: Figure S2), to sample additional sites scattering around transposon-identified PSs (site T130 scattering around PSVI, Additional file 4: Figure S2), and also to explore permissiveness of transposon-identified sites towards longer insertions (site E55 within PSIII and site T153 within PSVI, Additional file 4: Figure S2).

For all proteins, we inserted a TEV-tag (ENLYFQ↓G), or derivatives thereof, and explored insertion positions lying directly within predicted PSs as well as insertion positions scattering closely around them — similar to the distribution of our reference (transposon-identified) permissive sites. Varying the exact sites around the identified PSs is reasonable, as the exact gap position in an MSA can vary depending on the choice of alignment algorithm and the gap open and extension penalties that were employed [34, 35]. We also explored small deletions or duplications of the original sequence to study the extent of permissiveness for specific insert designs. Table 1 summarises the identified PSs for each protein and the final sequences of tagged protein variants. Input sequences and MSAs for all three proteins can be found in Additional file 2: Table S2 and Additional file 4: Figure S2.
Table 1

Overview of internally tagged protein variants

| Protein | Stretch | Span | Insertion site | Original sequence (a) | Sequence after insertion (b) |
|---|---|---|---|---|---|
| Plasmid insertions | | | | | |
| Adk | PSI | I72-R78 | D76 | QEDCRNGFLLD | 1-QEDENLYFQGLLD |
| | PSII | D94-A95 | A93 | PQADAMKE | PQAENLYFQGMKE |
| | | | K97 | AMKEAG | AMKENLYFQGEAG |
| | | | A99 | KEAGIN | 1-KEAENLYFQGMKEAGIN |
| | | | | | 2-KEAENLYFQGDAMKEAGIN |
| | PSIII | V142-G150 | P140 | NPPKVEGKDDVTGE | NPPENLYFQGTGE |
| | PSIV | T191-P201 | A186 | KEAEAGNTK | KEAENLYFQGNTK |
| GpsA | PSI | P55-V57 | C49 | DRCNAAFLPDVPFPD | DRCENLYFQGFPD |
| | | | P60 | PFPDTL | PFPENLYFQGVPFPDTL |
| | PSII | P97-D102 | M99 | PLMRPD | PLMPTTENLYFQGCLGRPD |
| | PSIII | L128-Q131 | I132 | DQIPLA | DQIPTTENLYFQGCLGPLA |
| | PSV | D272-V273 | Q269 | LGQGMD | LGQPTTENLYFQGGTVGMD |
| TpiA | PSIII | E53-I59 | E55 | EAEGSH | 1-EAEGGSGENLYFQ G SGGSGSH |
| | | | | | 2-EAEGCLGESENLYFQGDERKNKGSH |
| | | | | | 3-EAEGCLPTTENLYFQSGTVKNKGSH |
| | PSIV | D67-N69 | N69 | DLNLSG | DLNENLYFQGLSG |
| | PSVI | E133-A156 | T130 | GETEAENEAGKTE | 1-GETENLYFQGGSGKTE |
| | | | | | 2-GETGGSENLYFQGGSGKTE |
| | | | T153 | LKTQGA | LKTDYDIPTTENLYFQSGTVDAGADQGA |
| Chromosomal insertions | | | | | |
| Adk | PSI | I72-R78 | D76 | QEDCRN | 3-QEDENLYFQGESLFKCRN |
| TpiA | PSIV | D67-N69 | L70 | LNLSGA | LNLPPKNENLYFQGESLFKGPSGA |
| AtpA | PSIII | H123-F126 | H123 | LDHDGE | LDHENLYFQGDGE |
| AtpD | PSIII | E101-E105 | E101 | KGEIGE | KGEENLYFQGIGE |
| Extended TEV-tag | | | | | |
| Adk | PSI | I72-R78 | D76 | QEDCRN | 2-QEDENLYFQGCRN |
| | | | | | 4-QEDENLYFQGESLFKGGCRN |
| GpsA | PSI | P55-V57 | D56 | LPDVPS | 1-LPD ENLYFQG VPS |
| | | | | | 2-LPD PPKNENLYFQGESLFKGPVPS |

(a) Residues deleted during the insertion process are shown in boldface. If no residues were deleted, a 6-residue stretch of the protein sequence is shown, and the insert was placed in the middle

(b) Minimal TEV-tag and extended derivatives used for insertion are shown in boldface
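The simplest entries of Table 1 (insertions without deletion or duplication) can be reproduced with a short sketch. The function and constant names are hypothetical; the tag sequences ENLYFQG and ESLFK and the two example windows are taken from the text and table, and positions are counted within the 6-residue window shown in the table rather than in the full protein.

```python
TEV_MINIMAL = "ENLYFQG"   # minimal TEV-tag as given in the text
TEV_FLANK_C = "ESLFK"     # C-terminal flanking residues of the extended tag


def insert_tag(sequence, after_position, tag=TEV_MINIMAL):
    """Insert `tag` after the residue at 1-based `after_position` in `sequence`."""
    if not 0 <= after_position <= len(sequence):
        raise ValueError("insertion position lies outside the sequence")
    return sequence[:after_position] + tag + sequence[after_position:]


# AtpA window LDHDGE, minimal tag after the third residue (H123 in the protein).
atpa_tagged = insert_tag("LDHDGE", 3)

# Adk window QEDCRN, flanked tag after the third residue (D76 in the protein).
adk_tagged = insert_tag("QEDCRN", 3, TEV_MINIMAL + TEV_FLANK_C)
```

Variants that delete or duplicate residues of the original sequence would need the window to be edited before or after the call; that is deliberately left out of this sketch.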

Functionality of all protein variants was assessed in vivo by measuring the ability of a tagged protein variant to sustain wild-type-like growth rates on different carbon sources, a strategy frequently employed to examine functionality of protein variants [14, 36]. Adk is an essential protein required for the biosynthesis of purine ribonucleotides [37] and plays a key role in controlling the rate of cell growth by tuning the availabilities of nucleotide species via inter-conversion [38].

GpsA is also an essential protein, catalysing the first step in the biosynthesis of phospholipids starting from the glycolytic intermediate dihydroxyacetone phosphate (DHAP) [39]. As a central enzymatic activity in glycolysis and gluconeogenesis [40], TpiA is conditionally essential: it is non-essential for growth on rich media (Lysogeny broth (LB) medium), but essential if glucose or glycerol are the only carbon sources (M9 medium). Therefore, differences in the specific growth rates on a glycolytic carbon source (glucose) and a gluconeogenic carbon source (glycerol) should identify impairment of catalytic activity for the various Adk, GpsA, and TpiA variants once the wild-type gene on the chromosome is inactivated.

Protein variants were expressed from their natural promoters on low copy plasmids in strains lacking the corresponding wild-type gene in their chromosome. A wild-type version of each protein expressed from the same genetic context was used as a reference for growth rate comparisons. Note that due to the essential nature of adk and gpsA, the final strains had to be constructed by genetic replacement rather than standard transformation [41]. As gpsA is encoded within an operon, we included the upstream gene secB on the test plasmid to maintain its genetic context and tested TEV-tag carrying variants in strain secB gpsA::kan. Adk variants were tested in strain adk::kan, and TpiA variants were tested in the previously constructed double knockout strain amn::FRT tpiA::FRT [26].

For TpiA and GpsA, all of the tagged protein variants sustained wild-type-like growth rates on minimal media with glucose and glycerol (Fig. 2a and b). For Adk, TEV-tag insertions into two of four PSs (PSI and PSIV, see Additional file 4: Figure S2 for numbering) resulted in protein variants that could sustain wild-type-like growth on glucose and glycerol (Fig. 2c). Insertion into PSIII resulted in a variant with wild-type-like growth on complex LB medium and on minimal M9 medium with glucose but showed a growth defect on minimal medium with glycerol (66 ± 18% of wild-type-specific growth rate). As the literature indicated that PSIII is located in a functionally relevant loop [42], we excluded PSIII from further analysis. The TEV-tag insertion into PSII (specifically after residue A93) resulted in a variant that sustained wild-type-like growth on complex LB medium but caused a growth defect on M9 minimal media with glucose and glycerol (54 ± 14% and 47 ± 18%; mean ± standard deviation (SD) of wild-type-specific growth rate, respectively). This indicated that insertions at site A93 were not fully permissive, despite the lack of reports of the PSII region’s functional relevance. To explore if changing the exact insertion position within PSII could restore wild-type-like growth, we created a small insertion library by polymerase chain reaction (PCR) (Additional file 5: Figure S3). After screening several library members, we identified one variant (tagged after residue A99, with a four-residue duplication of the original sequence) sustaining wild-type-like growth on all carbon sources. Other screened insertion sites (after residue K97) and insertion designs (after residue A99, with a six-residue duplication of the original sequence) resulted in proteins with compromised function (Fig. 2d).
Fig. 2

Functionality of TEV-tagged protein variants in vivo. Upper and middle panels (a-d): functionality of plasmid encoded TEV-tagged protein variants in vivo. Variants were expressed from their natural promoter on low copy plasmids. a TpiA, b GpsA, c Adk, d Adk variants isolated from an insertion library around PSII. Insertion positions and corresponding permissive stretches (PSs) are given for each variant. Functionality was evaluated as the ability of a certain variant to support growth of the corresponding knockout strain on different carbon sources at 37 °C. Experiments were done in biological duplicates ± SD. Lower panel (e and f): functionality of chromosomally encoded TEV-tagged protein variants in vivo. e Indicated strains carrying a TEV-tag on the chromosome were grown in LB medium or M9 glucose with casamino acids at 32 °C, and growth rates were compared to the appropriate parent strain (Ec or Ec*) which was used for chromosomal integration; in the case of TpiAL70 the strain has an additional STOP codon in amn (Ec*) resulting in a translational knockout. f Growth rates on M9 succinate of strains carrying a TEV-tag in the α- (AtpA) and β- (AtpD) subunits of ATP synthase. A functional ATP synthase is essential for growth on the non-fermentable carbon source succinate. Strains having AtpA or AtpD replaced by a kanamycin cassette fail to grow on succinate. Experiments were done in triplicate ± SD

In summary, we examined a total of 11 PSs and 19 insert designs within three proteins and showed that 15 out of 19 designs resulted in functional protein variants. This included one permissive site within TpiA (N69), which had not been discovered by transposon mutagenesis. This suggested that PSS allows for a substantial reduction in effort from screening/selecting from a transposon library of variants to the functional evaluation of a few variants.

Identified stretches are sufficiently permissive for chromosomal protein tagging

Having confirmed that the protein variants are functional in principle, we proceeded to explore the effect of tag insertion when the encoding genes are present only in monocopy, which would correspond to our ultimate objective: designing and implementing internally tagged proteins on the chromosome of E. coli with only minimal testing and re-engineering. We chromosomally inserted TEV-tags into two of our test proteins, Adk and TpiA, using co-selection multiplex automated genome engineering (CoS-MAGE) [43]. We chose to integrate the two cleavable variants AdkD76.3 and TpiAL70 (Fig. 3d and Additional file 6: Figure S4). The constructed strains were designated EcAdkD76.3 and Ec*TpiAL70 (Table 1 and Additional file 7: Table S3). Wild-type-like growth rates on complex medium and minimal medium supplemented with glucose verified full functionality of both variants in vivo (Fig. 2e).
Fig. 3

Cleavability of internally TEV-tagged protein variants. Cleavability of TEV-tagged Adk-6xHis variants was examined by incubating crude lysates derived from strains expressing the indicated Adk variant from a low copy vector under control of the natural Adk promoter in the presence or absence of TEV protease. Samples were separated by SDS PAGE and blotted. Cleavage products were detected with a 6xHis-tag-specific antibody. a AdkD76.1, b AdkA99, c AdkA186. The predicted secondary structure context of each residue is indicated in red. The loop length for variants AdkD76 and A99 was determined based on secondary structure prediction. d Cleavage of flanked TEV-tag variants inserted after residue D76. Purified AdkD76-Strep variants with differing TEV-tag flanking region lengths were incubated in the presence or absence of TEV protease and cleavage products were detected with a Strep-tag-specific antibody

After confirming the wild-type-like function of strains that contain a chromosomal gene for a TEV-tagged protein variant designed by the PSS, we tested the procedure on the α- and β-subunits of the ATP synthase (AtpA and AtpD). ATP synthase is a complex multi-subunit molecular machine and should therefore be a very stringent target for verifying the PSS approach. We had also identified ATP synthase as a major source for unspecific ATP depletion in cell-free extract (CFX) (Additional file 8: Figure S5). We predicted PSs for AtpA and AtpD and inserted the nucleotide sequence for TEV-tags into the chromosomal gene sequence for two identified PSs, specifically after residues H123 (AtpA) and E101 (AtpD) (Table 1 and Additional file 9: Figure S6). The corresponding strains EcAtpAH123 and EcAtpDE101 exhibited 92 ± 6% and 72 ± 12% of wild-type growth rate on the non-fermentable carbon source succinate, indicating the assembly of a functional ATP synthase [44] (Fig. 2f). The strains further exhibited wild-type-like growth on complex medium and minimal medium supplemented with glucose (Fig. 2e).

These results suggested that the PSS could be used for direct chromosomal engineering, as tagged proteins were functional when expressed from a monocopy gene and targets could be chromosomally tagged with minimal testing effort.

Ensuring accessibility of the inserted peptide tag: incorporating surface accessibility and secondary structure context shows that PSS-identified PSs are biased towards being surface accessible

Besides functionality, a further requirement for the usefulness of a permissive site is its accessibility to interaction partners, such as proteases, labelling enzymes, or antibodies. Depending on the partner, PS accessibility may not only be a function of simple surface accessibility but may also be dependent on secondary structure context, e.g. if the partner requires its recognition site to be located in a surface accessible unstructured loop. As we aimed to keep the entire workflow based on primary structure information, we used the freely available NetSurfP structure prediction tool [45] to assess relative surface accessibilities (RSAs) and secondary structures of each stretch.

We first verified that predicted results for RSA and secondary structure of a given protein correlated well with the structural data which were available for five of our test proteins, AtpA, AtpD, TpiA, Bla, and Adk (correlations ranging from 0.69 to 0.8 between sequence- and structure-based predicted RSA and secondary structure; Additional file 10: Table S4). Given this result and the good accuracy reported for NetSurfP’s predictions [45], we considered this tool sufficiently accurate for evaluating protein structure for PS selection in the absence of crystal structures.

Interestingly, when predicting the RSA for the PSs in our test proteins supported by manually mapping the positions of PSs onto available crystal structures, we realised that almost all of them were at least partly surface accessible (Fig. 1, Additional file 10: Table S4, and Additional file 11: Figure S7). Encouraged by this result, we were interested to evaluate if PSS-predicted stretches tended to be surface accessible in general. This would significantly support our goal to identify surface accessible PSs in a variety of proteins to eventually engineer protein knockdowns in a proteome-wide approach.

We therefore automated the PSS and predicted PSs across the functionally annotated part of the E. coli K-12 proteome. The most accessible insertion site (as defined in the Methods section) in predicted stretches displayed significantly higher RSA (0.389, 95% confidence interval (CI) = [0.0661, 0.712]) when compared to the most accessible insertion site in randomised stretches with the same length distribution (0.298, 95% CI = [0.0224, 0.574]) (see Methods). To evaluate the significance of this difference, we generated 1000 bootstrap samples of PS position shuffles to determine how these differences are distributed. We observed very little variability among the samples, with a mean average surface accessibility of 0.298, 95% CI = [0.297, 0.300]. This indicates that the higher average RSA of observed insertion sites is indeed statistically significant (p < 10⁻³, Additional file 12: Figure S8). In the sites located in our test proteins, we also found this significant enrichment with respect to RSA in PSs versus randomly chosen PSs: only 6 of 34 PSs are below, 14 of 34 PSs are within one SD, and 14 of 34 PSs within two SD above the mean RSA of a randomly placed site (Additional file 10: Table S4 and Additional file 11: Figure S7). These results supported that PSs identified by PSS exhibited higher than average RSAs, making it likely that using PSS in order to identify solvent exposed PSs could be generalisable to the proteome level.
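The comparison against randomised stretches can be sketched as a position-shuffle test. This is a simplified illustration and not the authors' pipeline: it assumes per-residue RSA profiles and stretch coordinates are already in hand, scores each stretch by the maximum geometric mean of adjacent RSA pairs, re-places every stretch uniformly at random within its own protein while keeping its length, and reports an empirical one-sided p-value. All names and the toy data are hypothetical.

```python
import random
from math import sqrt


def stretch_score(rsa, start, end):
    """Max geometric mean of adjacent RSA pairs in rsa[start:end] (end exclusive)."""
    window = rsa[start:end]
    return max(sqrt(a * b) for a, b in zip(window, window[1:]))


def mean_score(proteins):
    """Mean stretch score over all (rsa_profile, stretches) entries."""
    scores = [stretch_score(rsa, s, e) for rsa, stretches in proteins for s, e in stretches]
    return sum(scores) / len(scores)


def shuffle_test(proteins, n_shuffles=1000, seed=0):
    """Compare the observed mean score with randomly re-placed stretches.

    proteins: list of (rsa_profile, [(start, end), ...]) with 0-based,
    end-exclusive coordinates and stretches of at least two residues.
    Returns (observed_mean, null_mean, empirical_p_value).
    """
    rng = random.Random(seed)
    observed = mean_score(proteins)
    null_means = []
    for _ in range(n_shuffles):
        shuffled = []
        for rsa, stretches in proteins:
            moved = []
            for s, e in stretches:
                length = e - s
                new_start = rng.randint(0, len(rsa) - length)  # keep stretch length
                moved.append((new_start, new_start + length))
            shuffled.append((rsa, moved))
        null_means.append(mean_score(shuffled))
    exceed = sum(1 for m in null_means if m >= observed)
    # Add-one correction so the empirical p-value is never exactly zero.
    p_value = (exceed + 1) / (n_shuffles + 1)
    return observed, sum(null_means) / n_shuffles, p_value


# Toy protein (made up): a buried core with one exposed four-residue loop.
toy_rsa = [0.1] * 20 + [0.9] * 4 + [0.1] * 20
observed, null_mean, p_value = shuffle_test([(toy_rsa, [(20, 24)])])
```

With 1000 shuffles the smallest attainable p-value under this correction is 1/1001, which is the same order as the bound reported in the text.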

Design of conditional protein knockdowns: testing and improving cleavability of the minimal TEV-tag by adding flanking residues

Our specific interest in chromosomal protein tagging is the engineering of conditional protein knockdowns to enable easy elimination of undesired catalytic activities from a cell-free platform. This requires efficient hydrolysis of the primary peptide backbone by TEV protease and loss of enzymatic activity after cleavage.

When examining the cleavability of all functional Adk TEV-tagged variants by western blotting, we realised that the inserted minimal TEV-tag (i.e. the canonical and widely used sequence ENLYFQG) was generally poorly hydrolysed, and cleavage efficiency seemed to be dependent on specific sequence and secondary structure context. Specifically, the TEV-tag was partly hydrolysed (to different extents) when placed into a loop, as seen for AdkD76.1 and AdkA99.1 (Fig. 3a and b), but not at all when placed into an α-helix, as seen for AdkA186 (Fig. 3c). This finding was confirmed when examining the cleavability of all TEV-tagged variants of TpiA and GpsA (Additional file 13: Figure S9). Therefore, despite ensuring at least theoretically good access to tags, TEV hydrolysis seemed to require the consideration of additional criteria, and we developed a “flanked TEV-tag” that could be efficiently hydrolysed even when placed into a presumably structurally more rigid internal position.

Our flanked TEV-tag extends the minimal cleavage site by residues derived from one of the variable hydrolysis sites in the natural TEV polyprotein (UniProtKB: P04517). We tested variants of different length for improved cleavage and found that a TEV-tag minimal sequence extended by five residues at its C-terminus (ENLYFQ↓G ESLFK) substantially enhanced the cleavage efficiency of variant AdkD76.3 (Fig. 3d). The same strategy was successfully employed to engineer cleavable variants of GpsA and TpiA (Additional file 6: Figure S4).

Stabilising the nucleotide pool in a cell-free platform using conditional protein knockdowns

As the first step, it was necessary to identify the major ADP and ATP sinks present in CFX. A database search for potentially abundant ADP consumers with no specific additional substrate or co-factor requirements yielded Adk as a strong candidate. As for ATP consumers, a database search for ATPases resulted in a set of potential candidates with uncertainty about abundance and activity under cell-free platform operation conditions. Therefore, ATP sinks were identified in a reverse approach by separating cell-free extracts on native PAGE, followed by activity detection and mass-spectrometric identification of corresponding proteins (Additional file 8: Figure S5). One identified major ATP sink was the soluble F1 portion of the membrane-spanning ATP synthase. Although membranes and membrane-bound proteins are removed during CFX preparation, the soluble F1 portion of ATP synthase is known to separate from the membrane-associated F0 part and remains present in CFX preparations. Without coupling to the proton-motive force, F1 hydrolyses ATP unspecifically [46, 47]. We therefore investigated tagging of Adk and two of the subunits of ATP synthase as possible measures for stabilising nucleotides in CFX.

We first tested for ADP stabilisation using the cleavable Adk variant (AdkD76.3; see above). We prepared a CFX from strain EcAdkD76.3 and used high-performance liquid chromatography (HPLC) to determine the stability of externally added ADP (and its metabolites) with or without pre-treatment by TEV protease (Fig. 4a). Fitting the data to an exponential decay model revealed that the half-life of ADP in CFX obtained from strain EcAdkD76.3 increased from ~10 min to greater than 2 h when Adk was inactivated by proteolysis (Fig. 4d). The interruption of inter-conversion of ADP to ATP and adenosine monophosphate (AMP) after proteolysis was verified by monitoring ATP and AMP concentrations (Fig. 4a).
Fig. 4

Stabilisation of the nucleotide pool in CFX using conditional protein knockouts. Time course of nucleotide inter-conversion in CFX with or without pre-treatment by TEV protease. a Adk: ADP was added to a CFX prepared from strain EcAdkD76.3. b AtpA and c AtpD: ATP was added to a CFX prepared from strains EcAtpAH123 and EcAtpDE101. Nucleotide concentrations were quantified at the indicated time points by HPLC in triplicate ± SD. 95% confidence intervals indicate the accuracy of the fits. d Specific half-life times (min) with 95% confidence bounds (min) of ATP or ADP before and after knocking out enzymatic activity by TEV protease cleavage

Subsequently, to test for ATP stabilisation, we assessed the stability of ATP in a CFX prepared from strains EcAtpAH123 and EcAtpDE101 with or without pre-treatment by TEV protease. In the presence of TEV protease, the half-life of ATP increased twofold and threefold for inactivated AtpA or AtpD, respectively (Fig. 4b–d). ATP half-life in a CFX prepared from the wild-type strain Ec was not affected by protease treatment (Additional file 14: Figure S10). We verified by western blotting that AtpAH123 and AtpDE101 were indeed cleaved by TEV protease (Additional file 15: Figure S11).

Discussion

We developed and tested an approach for predicting PSs in proteins to enable the rational design of internally tagged proteins based on primary structure information alone. We validated our approach by harnessing existing literature data on permissive sites in bacterial proteins as well as by experimentally verifying predicted PSs in various E. coli proteins.

Our approach can minimise the number of design, test, and re-engineering cycles, enabling efficient internal protein tagging directly into the genome. We exemplify this by functionally tagging AtpA and AtpD directly on the E. coli chromosome in a single engineering cycle. Both our literature-derived and our own experimental data suggest that PSs are permissive to insertions of various lengths at different positions within a stretch. Additionally, in a proteome-wide analysis, we show that identified PSs are enriched in surface accessible regions, making them suitable candidates for tag interaction.

However, during the engineering of conditional protein knockdowns, we observed that surface accessibility is not the only requirement for efficient hydrolysis of a TEV-tag by the TEV protease. Inefficient hydrolysis of internal TEV-tags has already compromised other engineering efforts [17], and therefore, we wanted to exclusively insert TEV-tags that could be efficiently hydrolysed. Hydrolysis seems strongly dependent on the structural context of the chosen stretch. In fact, cleavage of our tested proteins was only achieved when the TEV-tag was inserted into a structurally flexible loop region and could be sufficiently improved by extending the minimal TEV-tag by additional residues derived from one of the TEV polyprotein cleavage sites. Proteases frequently recognise their substrates in an extended β-strand conformation [48], and we assume that adding flanking residues to the TEV-tag gives the substrate more flexibility to adopt the correct conformation.

To demonstrate the applicability of our approach, we designed conditional protein knockdowns to engineer a cell-free platform with enhanced ADP and ATP stability after TEV cleavage. The single-protein knockdown of Adk could almost completely halt drainage of ADP over a time span relevant for cell-free protein production [49] or biotransformations [26]. In addition, ATP half-life could be stabilised two- to threefold by employing single-protein knockdowns. These results were very encouraging given that our activity mapping showed ATP degradation in CFXs to be complex and several further potential sinks were identified, suggesting that proteolytic elimination of additional enzymes could further enhance stability. Still, it remains to be tested if the herein achieved enhancement of ATP stability in CFX leads to improved performance of a cell-free system, e.g. for small molecule or protein production.

Our protein knockdowns can be chromosomally implemented by well-established and cheap oligo-recombineering, and inactivation depends on a single component (TEV protease), which does not require co-factors. Other imaginable knockdown strategies like induced mRNA decay [50] or terminal degradation tags [51] would require extensive recoding of genes (in case of mRNA decay) for implementation and would rely on cellular machinery and co-factors (ATP) to achieve a protein’s knockdown.

Conclusions

Based on our analysis of existing data, experimental verifications, and proteome-wide predictions, we suggest that this method is of general utility. Correspondingly, we have developed design guidelines consisting of four steps (identification of PSs by searching for gaps in functionally conserved homologous proteins; determination of PS accessibility; determination of loop flexibility; genomic integration) for successfully engineering internally tagged proteins and inserting them into the chromosome (see Design guidelines). Although established in E. coli, we believe that the basic concept of PSS is widely applicable to proteins from different species across kingdoms. We are aware that, in higher organisms, post-translational modifications, splicing, or protein-protein interactions play a more important role and will need to be considered. PSS leaps beyond state-of-the-art methods for permissive site identification and allows for the rapid and parallel design and implementation of engineered proteins. This is essential for systematic protein engineering efforts like the herein presented cell-free platform engineering, where we envision protein knockdown multiplexing on a whole-proteome scale. We also emphasise the simplicity of our approach: harnessing the vast repository of protein sequences contained in sequence databases as the sole input results in straightforward design of internally tagged proteins.

Design guidelines

Based on our analysis, we provide general design guidelines for successful engineering of internally tagged proteins and their chromosomal insertion.

Step 1: Identification. Search for gaps in functionally conserved homologous proteins

For MSA construction and gap identification, we recommend aligning at least four to six functionally conserved protein sequences from different species. The percent identity of the chosen sequences should be sufficiently high with respect to the selected search algorithm to prevent misidentification of homologs — which could potentially introduce incorrect gaps into the MSA [34] — while ensuring that the chosen sequences are dissimilar enough to reduce sampling bias. We observed that permissive sites scattered around identified PSs, which indicates that there is some freedom in insert site selection around PSs. To avoid false positives, we recommend considering, when available, literature information on important functional features of a protein or a protein family, to avoid disrupting components known to be important for function.

Step 2: Accessibility. Find an accessible PS

Although we found predicted PSs to be enriched for surface accessibility, we recommend verifying the accessibility of the location of a stretch. Predicted surface accessibility [44] is a good proxy if structural data are unavailable. For partially buried stretches, a user can simply choose an exposed position within the stretch.

Step 3: Flexibility. Find a stretch within a flexible loop region

Some applications require more stringent criteria than surface accessibility. As shown herein for the TEV-tag, but also known for other tags [3], the structural context of a PS might be relevant to the function of a peptide tag. Thus, we suggest evaluating the secondary structure context by examining available three-dimensional structural data or using secondary structure prediction.

Step 4: Genomic integration

We demonstrate that inserts can be integrated into the chromosome, allowing for genomic tagging of proteins. Although we specifically used MAGE for genomic insertion of TEV-tags, we emphasise that any precision genome-editing tool can be used, since the design and implementation of tagged proteins is de-coupled. Therefore, we recommend carefully choosing the most suitable genome-editing tool for the relevant host.

Methods

Chemicals and enzymes

Restriction enzymes, T4 ligase, and the Gibson assembly kit were obtained from New England Biolabs (Ipswich, MA, USA) and used according to the manufacturer’s instructions. Chemicals were purchased in the highest purity available from Sigma-Aldrich (St. Louis, MO, USA), Fluka (Buchs, Switzerland), or Roth (Lauterbourg, France).

Tryptone, yeast extract, Bacto™ casamino acids, Low salt Difco™ LB Base, Miller (LB Miller), and Difco™ MacConkey agar base were obtained from BD Bioscience (Basel, Switzerland). Low salt LB-Miller medium was used to grow cells for the MAGE experiments, and chloramphenicol at 20 μg mL−1, kanamycin at 50 μg mL−1, or carbenicillin at 50 μg mL−1 was supplied for antibiotic selection. Isopropyl β-D-1-thiogalactopyranoside (IPTG) was added to 0.1 mM and 5-bromo-4-chloro-indolyl-β-D-galactopyranoside (X-gal) to 40 μg mL−1 to LB agar plates for blue/white selection for LacZ functionality. MalK functionality was tested on MacConkey agar (40 g L−1) supplemented with 10 g L−1 d-(+)-maltose monohydrate. Complex LB medium contained 10 g L−1 tryptone, 5 g L−1 yeast extract, and 10 g L−1 NaCl. M9 minimal medium contained 1× M9 salts [52] and was supplemented with 10 mg L−1 thiamine, 2 mg L−1 biotin, and the carbon source as mentioned in the text.

For affinity purification of proteins, Strep-Tactin® purification resins (iba, Göttingen, Germany) or Ni-NTA agarose (Thermo Fisher, Reinach, Switzerland) were used with the buffers recommended by the corresponding supplier.

Desalted oligonucleotides and Sanger sequencing services were obtained from Microsynth (Balgach, Switzerland) and Sigma-Aldrich (St. Louis, MO, USA). MAGE oligos were purchased with four phosphorothioated bases at the 5’ end.

Strains, plasmids, and primers

For lists of strains, plasmids, and primers, see Additional file 10: Table S4, Additional file 16: Table S5, and Additional file 17: Table S6.

Growth rate determination

Determination of initial growth rates of strains carrying plasmids with TEV-tagged protein variants was performed in 5 mL LB or M9 minimal medium supplemented with 0.5% glucose or 1% glycerol and 0.2% casamino acids at 37 °C. Samples were taken every 60 min, transferred to a 96-well plate, and OD600 was determined in a Viktor3 96-well reader from Perkin Elmer (Schwerzenbach, Switzerland). Chromosomally tagged variants were tested at 32 °C (all EcNR1-based strains contain a heat inducible lambda red system and therefore cannot be cultivated at 37 °C) in one of the following media (as specified in the main text): LB medium, M9 minimal medium supplemented with either 20 mM succinate and 0.05% yeast extract or 0.5% glucose and 0.2% casamino acids. Growth rates were measured optically with a BioLector in a 48-well FlowerPlate from m2p-labs GmbH (Baesweiler, Germany).

TEV protease and protease cleavage in CFX

For cleavage, we used two different TEV protease versions. For initial small-scale cleavability determination, we used an N-terminally Strep-tagged TEVopt version that was affinity purified using Strep-tactin resin as indicated by the manufacturer (iba, Göttingen, Germany). TEVopt was codon optimised for E. coli, and carries the following substitutions, which were reported previously to enhance solubility or functionality [51, 53]: N68D, S219V, last six residues deleted. An aliquot of 50 μg of TEVopt was used per milligram of CFX.

For optimisation of cleavage and proteomic switching, we used a TEV protease variant which is expressed as a cleavable C-terminal fusion with maltose binding protein (MBP) and is equipped with a 6xHis-tag at the N-terminus [54]. It was purified by Ni2+-NTA affinity purification followed by dialysis against TEV buffer (10 mM sodium phosphate buffer pH 7.5 with 1 mM dithiothreitol (DTT) and 1 mM ethylenediaminetetraacetic acid (EDTA)). An aliquot of 5 μg of TEV per milligram of CFX was used for proteomic switching experiments and 1:5 (wt/wt) TEV:protein for purified protein samples in TEV buffer. Typically, the first one third of the total amount of TEV protease was added followed by incubation at 30 °C for 2 to 3 h, after which time another third of the TEV was added, again followed by incubation for 3 to 4 h at 30 °C. Afterwards the mix was centrifuged at 20,817 × g and 4 °C for 30 min, and the last third of TEV was added to the supernatant and incubated overnight at 4 °C.

Cleavage analysis of purified Adk variants

Adk variants with a Strep-tag were cloned by Gibson assembly into vector pKTS [55]. For overexpression, E. coli BL21 cells with the corresponding plasmid were grown in LB medium at 37 °C until an OD600 of around 0.6 and induced by 100 ng mL−1 anhydrotetracycline (aTc). Six hours after induction, the cells were harvested and the Adk protein purified on a Strep-Tactin spin column. The purified Adk was TEV treated as mentioned above.

Protein detection

Protein cleavage was analysed by western blot; the cleaved proteins were separated on an SDS gel of appropriate concentration and blotted on a nitrocellulose membrane of pore size 0.4 μm (GE Healthcare, time and voltage depending on protein size). Proteins were detected using a monoclonal mouse 6xHis-tag antibody (Qiagen, Hilden, Germany) for Adk, GpsA, and AtpD variants, a mouse monoclonal Strep-tag antibody (Qiagen, Hilden, Germany) for AtpA variants, or a rabbit polyclonal E-tag antibody (Abcam, Cambridge, UK) for TpiA variants (LabForce AG, Nunningen, Switzerland). A goat anti-mouse IgG-alkaline phosphatase conjugate (Sigma-Aldrich, Buchs, Switzerland) or a goat anti-rabbit IgG-alkaline phosphatase conjugate (Sigma-Aldrich, Buchs, Switzerland) in combination with a chromogenic alkaline phosphatase reagent kit (Thermo Fisher, Reinach, Switzerland) was used for detection. Purified Adk proteins were probed by a mouse anti-Strep monoclonal antibody (2-1507-00; dilution 1:10,000; iba, Goettingen, Germany) followed by detection by IRDye® 800CW goat anti-mouse (925-32210; dilution 1:10,000; LI-COR, Bad Homburg, Germany).

Chromosomal integration

The DNA sequences for inserted peptides were delivered to chromosomal genes by CoS-MAGE. General procedures were carried out as described earlier [43]. More specifically, the oligonucleotides were designed to insert the sequence for the 21-bp TEV-tag or the flanking regions at the gene site corresponding to the permissive site of the protein of interest (see Additional file 16: Table S6 for specific primers). The oligonucleotides targeted the lagging strand of the chromosome and were optimised to have a ∆G of > –12.5 kcal/mol (determined with Mfold [56]). Additionally, oligonucleotides for translational knockouts for tpiA and amn were designed with the online platform MODEST [57]. For CoS-MAGE, 3 mL of E. coli (Ec) was grown in LB medium at 32 °C until an OD600 of around 0.6 was obtained. Cells were then heat-shocked for 15 min at 42 °C to induce the lambda red genes. An aliquot of 2 mL of induced cells was made electrocompetent by washing three times with ice-cold water. Then, 2 μM of each TEV-tag oligo and 0.2 μM of the co-selection oligo were added to the cells and the cells were electroporated (1-mm gap cuvettes, 1.8 kV). Cells were recovered by the addition of 3 mL of fresh LB Miller medium for further MAGE cycles. After two cycles of MAGE, the cells were recovered overnight at 32 °C and spread on selective medium plates, the specific composition of which depended on the selection marker (LB agar plate with the selective antibiotic for bla (ampicillin) or rpsL (streptomycin); MacConkey agar plate with maltose for malK). Clones were analysed by colony PCR (Multiplex PCR Qiagen). For PCR primers see Additional file 16: Table S6. The PCR program was as follows. Step 1: 15 min at 95 °C; step 2: 30 s at 95 °C; step 3: 30 s at 50 °C; step 4: 60 s at 72 °C; repeat steps 2 to 4 30 times; step 5: 10 min at 72 °C and storage at 8 °C. The insertions were verified by Sanger sequencing of PCR products of the genomic regions.
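The 21-bp length of the minimal TEV-tag follows from its seven residues (7 × 3 nucleotides). The following Python sketch illustrates an in-frame insertion at the DNA level. The codon choice for ENLYFQG, the toy open reading frame, and the function names are hypothetical and are not the oligonucleotide sequences used in this study (those are listed in the additional files).

```python
# Hypothetical codon choice encoding ENLYFQG; any synonymous codons would do.
TEV_TAG_DNA = "GAAAACCTGTATTTTCAGGGC"

# Partial codon table covering only the codons used above.
CODON_TO_AA = {
    "GAA": "E", "AAC": "N", "CTG": "L", "TAT": "Y",
    "TTT": "F", "CAG": "Q", "GGC": "G",
}


def translate(dna: str, table: dict) -> str:
    """Translate an in-frame DNA string with the given codon table."""
    return "".join(table[dna[i:i + 3]] for i in range(0, len(dna), 3))


def insert_in_frame(orf: str, after_codon: int, insert: str) -> str:
    """Insert `insert` after the 1-based codon index, keeping the frame."""
    if len(orf) % 3 or len(insert) % 3:
        raise ValueError("ORF and insert lengths must be multiples of three")
    if not 0 <= after_codon <= len(orf) // 3:
        raise ValueError("codon index lies outside the ORF")
    cut = 3 * after_codon
    return orf[:cut] + insert + orf[cut:]


toy_orf = "ATGAAAACCGCG"  # hypothetical four-codon ORF (M K T A)
tagged_orf = insert_in_frame(toy_orf, 2, TEV_TAG_DNA)
```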

Preparation of CFX

Cultures of the appropriate strain were cultivated in LB medium and harvested at an OD600 of around 2.8 by centrifugation. Cell pellets were re-suspended 1:1 (cell wet weight to buffer volume) in 10 mM sodium phosphate buffer (pH 7.5) and disrupted by homogenisation with an EmulsiFlex-C3 (Avestin Europe GmbH, Mannheim, Germany) at a pressure of 1500 bar. Cell debris was pelleted by centrifugation at 40,000 × g and 4 °C for 30 min, and the supernatant was used as a CFX or stored at –80 °C. The protein concentration in the CFX was determined by a standard Bradford assay [58]. Protein concentrations in CFX were around 15–20 mg mL–1 for the different samples (compared to bovine serum albumin as standard). Prior to use, the CFX was centrifuged again (21,130 × g and 4 °C for 30 min) to remove denatured proteins.

ATP and ADP stability assay

In order to determine the stability of ADP and ATP in CFX, 15 mM ATP or 15 mM ADP was incubated in 10 mg mL–1 CFX in 10 mM sodium phosphate buffer (pH 7.5) with additional 1 mM MgCl2, 10 mM KCl, and 1 mM DTT, and three samples were withdrawn at each sampling time (30 μL). Proteins were immediately precipitated by the addition of 30 μL ice-cold isopropanol and subsequent centrifugation at 21,130 × g and 4 °C for 30 min. Samples were 1:1 diluted with double-distilled water (ddH2O) and 2-μL aliquots were analysed by HPLC in an Agilent Series 1200 device equipped with an auto-injector, an Accucore aQ (2.6-μm particle diameter, 150 × 4.6 mm2 column dimensions, Thermo Fisher Scientific, Reinach, Switzerland) column, and a UV monitor set to 254 nm. An isocratic elution was performed with 50 mM potassium phosphate pH 6 at a flow rate of 0.7 mL min–1. Peaks for ATP, ADP, and AMP were identified and quantified by retention times and comparison with authentic standards.

In order to compute the half-life, the concentration data were fitted to an exponential decay model using the Levenberg-Marquardt non-linear least squares algorithm from the MATLAB R2016b (Mathworks, Natick, MA, USA) curve fitting toolbox. We used the decay model
$$ \frac{\left[ ATP\right]}{\mathrm{mM}}=2+b{e}^{-t{c}^{-1}} $$
to fit the initial concentration (2 + b) mM and the ATP half-life ln(2) c (where t, c, and the half-life are in minutes, and b is unitless). The asymptote was set to 2 mM, as the ATP concentration did not drop below this level within 4 h in previous experiments (data not shown). Since no asymptote could be determined experimentally for ADP decay, we used the model
$$ \frac{\left[ ADP\right]}{\mathrm{mM}}=d+b{e}^{-t{c}^{-1}} $$

instead and fitted d as well, where (d + b) mM is the initial concentration. Finally, 95% parameter CIs were plotted for both models.
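The relation between the decay constant and the half-life can be illustrated with a short Python sketch. It assumes the sign convention concentration = asymptote + b·exp(−t/c), under which ln(2)·c is a positive half-life. Note that this is not the fitting procedure used here (Levenberg-Marquardt in MATLAB): the sketch swaps in a simpler log-linear least-squares fit with a fixed, user-supplied asymptote, and it does not compute confidence intervals. The function name and the synthetic data are assumptions.

```python
import math


def fit_half_life(times, conc, asymptote):
    """Fit conc = asymptote + b * exp(-t / c) by linear regression on
    log(conc - asymptote); returns (b, c, half_life)."""
    # Points at or below the asymptote cannot be log-transformed and are skipped.
    pts = [(t, math.log(y - asymptote)) for t, y in zip(times, conc) if y > asymptote]
    if len(pts) < 2:
        raise ValueError("need at least two points above the asymptote")
    n = len(pts)
    mean_t = sum(t for t, _ in pts) / n
    mean_z = sum(z for _, z in pts) / n
    sxx = sum((t - mean_t) ** 2 for t, _ in pts)
    sxz = sum((t - mean_t) * (z - mean_z) for t, z in pts)
    slope = sxz / sxx                 # equals -1 / c
    intercept = mean_z - slope * mean_t
    c = -1.0 / slope
    b = math.exp(intercept)
    return b, c, math.log(2) * c


# Synthetic, noise-free example (not experimental data): b = 13, c = 30 min.
times = [0, 10, 20, 40, 80]
conc = [2 + 13 * math.exp(-t / 30) for t in times]
b, c, half_life = fit_half_life(times, conc, asymptote=2.0)
```

For noisy data, a non-linear fit as described above is preferable, because the log transform distorts the error structure close to the asymptote.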

Manual PSS

Relevant sequences were retrieved from the functionally annotated database Universal Protein Resource Knowledgebase (UniProtKB) [59] using domain enhanced lookup time accelerated BLAST (DELTA-BLAST) [60]. Alignments were performed using the online version of Clustal Omega with default parameters [61]. Input sequences and UniProtKB accession numbers for all MSAs are summarised in Additional file 2: Table S2.

The span of each PS was defined by the two flanking residues of the gap in the underlying alignment, and its surface accessibility assessed by the maximum geometric mean RSA of adjacent residue pairs within the PS.
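As an illustration of how the flanking residues of a gap can be read off an alignment, the following Python sketch walks through one aligned query/homolog pair and reports, for each indel, the 1-based query residues on either side. The function name, the use of `-` as the gap character, and the toy sequences are assumptions; indels at the termini yield the positions 0 or length + 1.

```python
def gap_flanks(query_aln: str, hit_aln: str, gap: str = "-"):
    """Return (s, e) pairs of 1-based query positions flanking each indel."""
    if len(query_aln) != len(hit_aln):
        raise ValueError("aligned sequences must have equal length")
    flanks = []
    pos = 0            # number of query residues consumed so far
    run_start = None   # query residue left of the indel currently being read
    for q, h in zip(query_aln, hit_aln):
        if q == gap and h == gap:
            continue   # column gapped in both rows carries no information
        in_indel = (q == gap) != (h == gap)
        if in_indel and run_start is None:
            run_start = pos
        if q != gap:
            pos += 1
        if not in_indel and run_start is not None:
            flanks.append((run_start, pos))   # pos is the right flank
            run_start = None
    if run_start is not None:                 # indel runs to the C-terminus
        flanks.append((run_start, pos + 1))
    return flanks


# Toy alignments (hypothetical sequences).
insertion_in_homolog = gap_flanks("MKT--AYI", "MKTGGAYI")
deletion_in_homolog = gap_flanks("MKTAYI", "MK--YI")
```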

Correlation of crystal structures and predictions for secondary structure and RSA

Crystal structures were obtained in Protein Data Bank (PDB) format from the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank [62]. Secondary structure and absolute surface areas (ASAs) for each residue were obtained from the DSSP database [63], while relative surface areas, a measure of RSA, were computed from ASAs using the Bio.PDB.DSSP Python module [64]. RSA and secondary structure predictions were performed using the NetSurfP tool [45]. Given predictions, the secondary structure context assigned to each residue was defined as the annotation (helix, strand, or coil) with the highest probability.

Correlation of predicted and crystal structure surface accessibilities was assessed by computing the distance correlation [65] between the per-residue RSA values. To assess the correlation of secondary structures, annotations were mapped onto the set of points
$$ SS=\left\{\left(\begin{array}{c}0\\ {}0\end{array}\right),\left(\begin{array}{c}1\\ {}0\end{array}\right),\left(\begin{array}{c}1/2\\ {}\sqrt{3}/2\end{array}\right)\right\} $$

to ensure that all annotation types are equidistant. Distance correlation was then calculated in this space.
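The embedding and the correlation measure can be sketched in a few lines of Python. The assignment of helix, strand, and coil to the three points is an assumption (by symmetry it does not affect the result), and the estimator shown is the standard sample distance correlation, which may differ in detail from the implementation used for the analysis. Scalar values such as RSA can be passed as one-element tuples.

```python
import math

# Vertices of an equilateral triangle with unit side length.
SS_POINTS = {
    "helix": (0.0, 0.0),
    "strand": (1.0, 0.0),
    "coil": (0.5, math.sqrt(3) / 2),
}


def _centred_distances(points):
    """Double-centred pairwise Euclidean distance matrix."""
    n = len(points)
    d = [[math.dist(p, q) for q in points] for p in points]
    row = [sum(r) / n for r in d]   # equals the column means (d is symmetric)
    grand = sum(row) / n
    return [[d[i][j] - row[i] - row[j] + grand for j in range(n)] for i in range(n)]


def distance_correlation(xs, ys):
    """Sample distance correlation between two equally long lists of points."""
    if len(xs) != len(ys):
        raise ValueError("inputs must have the same length")
    a, b = _centred_distances(xs), _centred_distances(ys)
    n = len(xs)
    pairs = [(i, j) for i in range(n) for j in range(n)]
    dcov2 = sum(a[i][j] * b[i][j] for i, j in pairs) / n ** 2
    dvar_x = sum(a[i][j] ** 2 for i, j in pairs) / n ** 2
    dvar_y = sum(b[i][j] ** 2 for i, j in pairs) / n ** 2
    if dvar_x == 0 or dvar_y == 0:
        return 0.0
    return math.sqrt(max(dcov2, 0.0) / math.sqrt(dvar_x * dvar_y))


# Hypothetical per-residue annotations for a short peptide.
predicted = ["helix", "helix", "coil", "strand", "coil"]
observed = ["helix", "helix", "coil", "strand", "strand"]
dcor = distance_correlation(
    [SS_POINTS[s] for s in predicted], [SS_POINTS[s] for s in observed]
)
```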

Automated proteome querying and homolog retrieval

As a sample set to test if PSs are enriched in surface accessible regions, 4434 proteins in the UniProtKB/Swiss-Prot database [59] were chosen from the E. coli K-12 proteome [66]. RSA and secondary structure were predicted for each residue using the NetSurfP tool [45] and, when available, obtained from crystal structures from the DSSP database [63]. Homologs were retrieved using the DELTA-BLAST tool [60], searching against the version of the UniProtKB/Swiss-Prot database from March 23, 2016 and the conserved domain database from May 27, 2015. With composition-based statistics disabled, all search hits with an E-value below 0.001 were retrieved. The hits were subsequently filtered to discard those whose high-scoring pairs (HSPs) covered less than 80% of the query length or had no inserts. Since the sensitivity of DELTA-BLAST drops significantly below a sequence identity of 30% [60], all hits below this cut-off were also discarded. Of the 4434 proteins tested, 4097 had at least one homolog in the Swiss-Prot database, and 2613 had at least one search hit satisfying the filtration criteria.
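The hit filtration criteria can be summarised as a single predicate. The sketch below uses the thresholds stated above; the `Hit` record and its field names are hypothetical and do not correspond to the output format of any particular BLAST tool.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical summary of one search hit against the query."""
    accession: str
    evalue: float
    percent_identity: float   # sequence identity to the query, in percent
    query_coverage: float     # fraction of the query length covered by HSPs
    n_indels: int             # number of indels found in the HSPs


def passes_filters(hit: Hit,
                   max_evalue: float = 1e-3,
                   min_coverage: float = 0.80,
                   min_identity: float = 30.0) -> bool:
    """Keep hits that are significant, long, similar enough, and contain indels."""
    return (
        hit.evalue < max_evalue
        and hit.query_coverage >= min_coverage
        and hit.percent_identity >= min_identity
        and hit.n_indels > 0
    )


# Made-up hits for illustration only.
hits = [
    Hit("HIT_A", 1e-50, 62.0, 0.95, 2),   # retained
    Hit("HIT_B", 1e-40, 25.0, 0.90, 1),   # identity below 30%
    Hit("HIT_C", 1e-30, 55.0, 0.60, 3),   # coverage below 80%
    Hit("HIT_D", 1e-60, 80.0, 0.99, 0),   # no indels, hence uninformative
]
retained = [h.accession for h in hits if passes_filters(h)]
```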

Detecting PSs from HSPs

Given a query protein and its respective homologs, a PS in the query is defined as a gap in their MSA plus its immediate flanking residues. To avoid the expense of computing an MSA for each query protein for this analysis, indels in the HSPs for the query proteins were used to construct PSs. We justify this approximation by noting that the exact positions and lengths of stretches are dependent on choice of alignment algorithm and gap scoring model. We define a PS of a query protein precisely using its HSPs in the following manner.

Let p denote a query protein in the proteome P and let HSP(p, i) be its ith HSP. We define the gap set of HSP(p, i) as g(p, i) = {[s, e]|indel in HSP(p, i) in (s, e)}, where s and e denote the left and right flanks of the indel, respectively. Using this, the set of PSs of p, PS(p), is defined as:
$$ PS(p)=\left\{\tilde{g}\left|\tilde{g}\ \mathrm{connected}\ \mathrm{component}\ \mathrm{of}\ \underset{i}{\cup }g\left(p,i\right)\right.\right\} $$
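Taking the connected components of a union of closed intervals amounts to merging intervals that overlap or touch. A minimal Python sketch follows, assuming each gap set is given as a list of (s, e) tuples of 1-based query positions; the function name and the example values are illustrative.

```python
def permissive_stretches(gap_sets):
    """Merge the gap intervals of all HSPs into connected components."""
    intervals = sorted(iv for gaps in gap_sets for iv in gaps)
    merged = []
    for s, e in intervals:
        if merged and s <= merged[-1][1]:
            # Overlaps or touches the previous component: extend it.
            merged[-1][1] = max(merged[-1][1], e)
        else:
            merged.append([s, e])
    return [tuple(iv) for iv in merged]


# Gap sets g(p, i) of four hypothetical HSPs of the same query.
gap_sets = [
    [(3, 4), (10, 12)],
    [(4, 6)],
    [(11, 15)],
    [(20, 21)],
]
stretches = permissive_stretches(gap_sets)
```

Here (3, 4) and (4, 6) share position 4 and are therefore merged, whereas (20, 21) remains a separate stretch.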

Testing PSs for enrichment in accessible regions

After detecting PSs in the sample set of proteins, the stretches were tested for significant enrichment in surface accessible regions, i.e. \( {H}_a:{E}_{{\left\{ PS\left({p}_i\right)\right\}}_i}\left[ RSA\right]>E\left[ RSA\right] \), by generating 1000 bootstrap samples in which the PS sites were shuffled. Using the definition of the RSA of the ith residue of the query, RSA(i), the RSA of a PS is defined as \( RSA\left(\left[s,e\right]\right)=\max_{i\in \left\{s,\dots, e-1\right\}}\sqrt{RSA(i)\, RSA\left(i+1\right)} \).

The RSAs of the sites preceding the N-terminus and following the C-terminus are defined to be the RSAs of the first and last residues, respectively.
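The stretch-level RSA defined above, including the padding at the termini, can be written as a short Python function. The function name and the example RSA values are assumptions made for illustration.

```python
import math


def stretch_rsa(rsa, s, e):
    """Maximum geometric mean RSA of adjacent residue pairs within [s, e].

    `rsa` holds per-residue values (residue i at rsa[i - 1]); `s` and `e`
    are 1-based flanks and may be 0 or len(rsa) + 1 for terminal stretches.
    """
    if not s < e:
        raise ValueError("a stretch needs s < e")

    def value(i):
        # Sites beyond the termini take the RSA of the first or last residue.
        i = min(max(i, 1), len(rsa))
        return rsa[i - 1]

    return max(math.sqrt(value(i) * value(i + 1)) for i in range(s, e))


rsa = [0.1, 0.4, 0.9, 0.9, 0.2]         # made-up RSA values
internal = stretch_rsa(rsa, 2, 4)       # pairs (2, 3) and (3, 4)
n_terminal = stretch_rsa(rsa, 0, 1)     # position 0 is padded with residue 1
```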

To generate a sample, each stretch \( \left[{s}_m,{e}_m\right] \) was assigned to a random protein \( {p}_k \) with a probability proportional to the length of \( {p}_k \). Then, for each protein \( {p}_k \), its assigned stretches were distributed uniformly at random within the protein. This process was repeated to generate each bootstrap sample \( {\widehat{PS}}_j(P) \). The statistic \( {\mu}_{{\widehat{PS}}_j(P)}(RSA) \) was then computed for all bootstrap samples to generate the null distribution of mean RSAs. The test statistic \( {\mu}_{PS(P)}(RSA) \) was evaluated for significance against this distribution.
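The shuffling procedure can be sketched in Python as follows. This is an illustration under stated assumptions rather than the analysis code: stretches are represented as (protein, s, e) tuples, a stretch is only assigned to proteins long enough to contain it, and the one-sided p-value uses the common (k + 1)/(n + 1) convention. None of these details are specified in the text, and all names and data are hypothetical.

```python
import math
import random


def stretch_rsa(rsa, s, e):
    """Maximum geometric mean RSA of adjacent pairs, padded at the termini."""
    def clamp(i):
        return min(max(i, 1), len(rsa))
    return max(math.sqrt(rsa[clamp(i) - 1] * rsa[clamp(i + 1) - 1]) for i in range(s, e))


def mean_stretch_rsa(stretches, rsa_by_protein):
    values = [stretch_rsa(rsa_by_protein[k], s, e) for k, s, e in stretches]
    return sum(values) / len(values)


def shuffled_sample(stretches, rsa_by_protein, rng):
    """Reassign each stretch to a length-weighted random protein and position."""
    sample = []
    for _, s, e in stretches:
        span = e - s
        # Assumption: only proteins that can hold the stretch are eligible.
        eligible = [k for k, v in rsa_by_protein.items() if len(v) + 1 >= span]
        weights = [len(rsa_by_protein[k]) for k in eligible]
        k = rng.choices(eligible, weights=weights)[0]
        new_s = rng.randint(0, len(rsa_by_protein[k]) + 1 - span)
        sample.append((k, new_s, new_s + span))
    return sample


def enrichment_p_value(stretches, rsa_by_protein, n_boot=1000, seed=0):
    """One-sided test of whether observed stretches are more exposed than shuffled ones."""
    rng = random.Random(seed)
    observed = mean_stretch_rsa(stretches, rsa_by_protein)
    null = [
        mean_stretch_rsa(shuffled_sample(stretches, rsa_by_protein, rng), rsa_by_protein)
        for _ in range(n_boot)
    ]
    exceed = sum(m >= observed for m in null)
    return observed, (exceed + 1) / (n_boot + 1)


# Toy proteome: protein A has a short exposed patch, protein B is buried throughout.
toy_rsa = {"A": [0.05] * 40 + [0.9] * 4 + [0.05] * 40, "B": [0.05] * 60}
observed, p_value = enrichment_p_value([("A", 41, 43)], toy_rsa)
```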

Abbreviations

Adk: Adenylate kinase
ADP: Adenosine diphosphate
AMP: Adenosine monophosphate
ASA: Absolute surface area
aTc: Anhydrotetracycline
ATP: Adenosine triphosphate
Bla: β-lactamase
CFX: Cell-free extract
CoS-MAGE: Co-selection Multiplex automated genome engineering
DHAP: Dihydroxyacetone phosphate
DNA: Deoxyribonucleic acid
DTT: Dithiothreitol
GpsA: Glycerol-3-phosphate dehydrogenase
HPLC: High-performance liquid chromatography
HSP: High-scoring pair
IPTG: Isopropyl β-D-1-thiogalactopyranoside
LB: Lysogeny broth
MAGE: Multiplex automated genome engineering
MBP: Maltose binding protein
MSA: Multiple sequence alignment
NMP: Nucleoside monophosphate
NMR: Nuclear magnetic resonance
NTP: Nucleoside triphosphate
PCR: Polymerase chain reaction
PDB: Protein Data Bank
PS: Permissive stretch
PSS: Permissive stretch search
RSA: Relative surface accessibility
SD: Standard deviation
Ser: Serine
SS: Secondary structure
TEV: Tobacco etch virus
TEV-tag: TEV protease recognition site
Thr: Threonine
TpiA: Triosephosphate isomerase
WT: Wild type
X-gal: 5-Bromo-4-chloro-indolyl-β-D-galactopyranoside

Declarations

Acknowledgements

We thank George Church for strain EcNR1 (Addgene # 26930), Luzius Pestalozzi for plasmid pEXP3-TEVsol, Belen Calles and Victor de Lorenzo for information on permissive sites within TpiA, Tom Lampart and Christian L. Müller for initial automation of the PSS, and Irene Wüthrich and Gaspar Morgado for helpful comments on the manuscript. LAW and HM would like to thank Jörg Stelling for his support.

Funding

This work was partially supported by a Simons Foundation Junior Fellow award to SB, the EU projects ST-FLOW (#289326) and EuroBioSyn (#12749), and the Swiss National Science Foundation (project PROTSWITCH #310030_143645).

Availability of data and materials

All data generated or analysed during this study are included in this published article and its additional files. The E. coli strains and plasmids encoding for cleavable protein variants described herein were deposited with Addgene.

Authors’ contributions

SB and SP conceptualised the approach for PSS and the conditional protein knockdowns. SB, SOE, TMR, and SP designed the experiments. SB and SOE performed the experimental work. SB and SP supervised the work. Bioinformatics and statistical models and analyses were performed by LAW, HM, and SOE. All authors wrote the manuscript. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Authors’ Affiliations

(1) Department of Biosystems Science and Engineering, ETH Zürich, Basel, Switzerland
(2) Swiss Institute of Bioinformatics, Basel, Switzerland
(3) Life Science Zürich Graduate School in Systems Biology, Zürich, Switzerland
(4) Department of Computer Science, ETH Zürich, Zürich, Switzerland
(5) Present address: Chemistry Department, Columbia University, New York, USA

References

  1. Adams SR, Campbell RE, Gross LA, Martin BR, Walkup GK, Yao Y, Llopis J, Tsien RY. New biarsenical ligands and tetracysteine motifs for protein labeling in vitro and in vivo: synthesis and biological applications. J Am Chem Soc. 2002;124(21):6063–76.View ArticlePubMedGoogle Scholar
  2. Cao H, Xiong Y, Wang T, Chen B, Squier TC, Mayer MU. A red cy3-based biarsenical fluorescent probe targeted to a complementary binding peptide. J Am Chem Soc. 2007;129(28):8672–3.View ArticlePubMedGoogle Scholar
  3. Barthelmes K, Reynolds AM, Peisach E, Jonker HRA, DeNunzio NJ, Allen KN, Imperiali B, Schwalbe H. Engineering encodable lanthanide-binding tags into loop regions of proteins. J Am Chem Soc. 2011;133(4):808–19.View ArticlePubMedPubMed CentralGoogle Scholar
  4. Tozser J, Tropea JE, Cherry S, Bagossi P, Copeland TD, Wlodawer A, Waugh DS. Comparison of the substrate specificity of two potyvirus proteases. FEBS J. 2005;272(2):514–23.View ArticlePubMedGoogle Scholar
  5. Chen I, Howarth M, Lin W, Ting AY. Site-specific labeling of cell surface proteins with biophysical probes using biotin ligase. Nat Methods. 2005;2(2):99–104.View ArticlePubMedGoogle Scholar
  6. Liu DS, Nivon LG, Richter F, Goldman PJ, Deerinck TJ, Yao JZ, Richardson D, Phipps WS, Ye AZ, Ellisman MH, et al. Computational design of a red fluorophore ligase for site-specific protein labeling in living cells. Proc Natl Acad Sci U S A. 2014;111(43):E4551–4559.View ArticlePubMedPubMed CentralGoogle Scholar
  7. Veggiani G, Nakamura T, Brenner MD, Gayet RV, Yan J, Robinson CV, Howarth M. Programmable polyproteams built using twin peptide superglues. Proc Natl Acad Sci U S A. 2016;113(5):1202–7.View ArticlePubMedPubMed CentralGoogle Scholar
  8. Scott JK, Loganathan D, Easley RB, Gong X, Goldstein IJ. A family of concanavalin A-binding peptides from a hexapeptide epitope library. Proc Natl Acad Sci U S A. 1992;89(12):5398–402.View ArticlePubMedPubMed CentralGoogle Scholar
  9. Matsubara T. Potential of peptides as inhibitors and mimotopes: selection of carbohydrate-mimetic peptides from phage display libraries. J Nucleic Acids. 2012;2012:740982.View ArticlePubMedPubMed CentralGoogle Scholar
  10. Slovin SF, Keding SJ, Ragupathi G. Carbohydrate vaccines as immunotherapy for cancer. Immunol Cell Biol. 2005;83(4):418–28.View ArticlePubMedGoogle Scholar
  11. Zasloff M. Antimicrobial peptides of multicellular organisms. Nature. 2002;415(6870):389–95.View ArticlePubMedGoogle Scholar
  12. Lotze J, Reinhardt U, Seitz O, Beck-Sickinger AG. Peptide-tags for site-specific protein labelling in vitro and in vivo. Mol BioSyst. 2016;12(6):1731–45.View ArticlePubMedGoogle Scholar
  13. Billerbeck S, Calles B, Muller CL, de Lorenzo V, Panke S. Towards functional orthogonalisation of protein complexes: individualisation of GroEL monomers leads to distinct quasihomogeneous single rings. Chembiochem. 2013;14(17):2310–21.View ArticlePubMedGoogle Scholar
  14. Zordan RE, Beliveau BJ, Trow JA, Craig NL, Cormack BP. Avoiding the ends: internal epitope tagging of proteins using transposon Tn7. Genetics. 2015;200(1):47–58.View ArticlePubMedPubMed CentralGoogle Scholar
  15. Backstrom M, Lebens M, Schodel F, Holmgren J. Insertion of a HIV-1-neutralizing epitope in a surface-exposed internal region of the cholera toxin B-subunit. Gene. 1994;149(2):211–7.
  16. Sturgill TW, Cohen A, Diefenbacher M, Trautwein M, Martin DE, Hall MN. TOR1 and TOR2 have distinct locations in live cells. Eukaryot Cell. 2008;7(10):1819–30.
  17. Copeland MF, Politz MC, Johnson CB, Markley AL, Pfleger BF. A transcription activator-like effector (TALE) induction system mediated by proteolysis. Nat Chem Biol. 2016.
  18. Calles B, de Lorenzo V. Expanding the boolean logic of the prokaryotic transcription factor XylR by functionalization of permissive sites with a protease-target sequence. ACS Synth Biol. 2013;2(10):594–603.
  19. Reznikoff WS. Tn5 transposition: a molecular tool for studying protein structure-function. Biochem Soc Trans. 2006;34(Pt 2):320–3.
  20. Jiang WY, Bikard D, Cox D, Zhang F, Marraffini LA. RNA-guided editing of bacterial genomes using CRISPR-Cas systems. Nat Biotechnol. 2013;31(3):233–9.
  21. Wang HH, Isaacs FJ, Carr PA, Sun ZZ, Xu G, Forest CR, Church GM. Programming cells by multiplex genome engineering and accelerated evolution. Nature. 2009;460(7257):894–8.
  22. Vandemoortele G, Staes A, Gonnelli G, Samyn N, De Sutter D, Vandermarliere E, Timmerman E, Gevaert K, Martens L, Eyckerman S. An extra dimension in protein tagging by quantifying universal proteotypic peptides using targeted proteomics. Sci Rep. 2016;6:27220.
  23. Schlehuber LD, Rose JK. Prediction and identification of a permissive epitope insertion site in the vesicular stomatitis virus glycoprotein. J Virol. 2004;78(10):5079–87.
  24. Burg L, Zhang K, Bonawitz T, Grajevskaja V, Bellipanni G, Waring R, Balciunas D. Internal epitope tagging informed by relative lack of sequence conservation. Sci Rep. 2016;6:36986.
  25. Hodgman CE, Jewett MC. Cell-free synthetic biology: thinking outside the cell. Metab Eng. 2012;14(3):261–9.
  26. Bujara M, Schumperli M, Billerbeck S, Heinemann M, Panke S. Exploiting cell-free systems: implementation and debugging of a system of biotransformations. Biotechnol Bioeng. 2010;106(3):376–89.
  27. Wang Y, Huang W, Sathitsuksanoh N, Zhu Z, Zhang YH. Biohydrogenation from biomass sugar mediated by in vitro synthetic enzymatic pathways. Chem Biol. 2011;18:372–80.
  28. Pardee K, Slomovic S, Nguyen PQ, Lee JW, Donghia N, Burrill D, Ferrante T, McSorley FR, Furuta Y, Vernet A, et al. Portable, on-demand biomolecular manufacturing. Cell. 2016;167(1):248–59. e212.
  29. Pardee K, Green AA, Ferrante T, Cameron DE, DaleyKeyser A, Yin P, Collins JJ. Paper-based synthetic gene networks. Cell. 2014;159(4):940–54.
  30. Kim HC, Kim TW, Kim DM. Prolonged production of proteins in a cell-free protein synthesis system using polymeric carbohydrates as an energy source. Process Biochem. 2011;46(6):1366–9.
  31. Calhoun KA, Swartz JR. An economical method for cell-free protein synthesis using glucose and nucleoside monophosphates. Biotechnol Progr. 2005;21(4):1146–53.
  32. Hallet B, Sherratt DJ, Hayes F. Pentapeptide scanning mutagenesis: random insertion of a variable five amino acid cassette in a target protein. Nucleic Acids Res. 1997;25(9):1866–7.
  33. Rost B. Twilight zone of protein sequence alignments. Protein Eng. 1999;12(2):85–94.
  34. Saurabh K, Holland BR, Gibb GC, Penny D. Gaps: an elusive source of phylogenetic information. Syst Biol. 2012;61(6):1075–82.
  35. Hyrum Carroll PR, Mark Clement, Quinn Snell. Effects of gap open and gap extension penalties. Biotechnology and Bioinformatics Symposium (BIOT) Provo (UT): Brigham Young University; 2006. p.19.
  36. RossMacdonald P, Sheehan A, Roeder GS, Snyder M. A multipurpose transposon system for analyzing protein production, localization, and function in Saccharomyces cerevisiae. Proc Natl Acad Sci U S A. 1997;94(1):190–5.
  37. Glaser M, Nulty W, Vagelos PR. Role of adenylate kinase in the regulation of macromolecular biosynthesis in a putative mutant of Escherichia coli defective in membrane phospholipid biosynthesis. J Bacteriol. 1975;123(1):128–36.
  38. Esmon BE, Kensil CR, Cheng CH, Glaser M. Genetic analysis of Escherichia coli mutants defective in adenylate kinase and sn-glycerol 3-phosphate acyltransferase. J Bacteriol. 1980;141(1):405–8.
  39. Hsu CC, Fox CF. Induction of the lactose transport system in a lipid-synthesis-defective mutant of Escherichia coli. J Bacteriol. 1970;103(2):410–6.
  40. Anderson A, Cooper RA. Gluconeogenesis in Escherichia coli: the role of triose phosphate isomerase. FEBS Lett. 1969;4(1):19–20.
  41. Billerbeck S, Panke S. A genetic replacement system for selection-based engineering of essential proteins. Microb Cell Fact. 2012;11(1):110.
  42. Muller CW, Schlauderer GJ, Reinstein J, Schulz GE. Adenylate kinase motions during catalysis: an energetic counterweight balancing substrate binding. Structure. 1996;4(2):147–56.
  43. Wang HH, Kim H, Cong L, Jeong J, Bang D, Church GM. Genome-scale promoter engineering by coselection MAGE. Nat Methods. 2012;9(6):591.
  44. Hawthorne CA, Brusilow WS. Complementation of mutants in the Escherichia coli proton-translocating ATPase by cloned DNA from Bacillus megaterium. J Biol Chem. 1986;261(12):5245–8.
  45. Petersen B, Petersen TN, Andersen P, Nielsen M, Lundegaard C. A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct Biol. 2009;9:51.
  46. Dunn SD, Heppel LA. Properties and functions of the subunits of the Escherichia coli coupling factor ATPase. Arch Biochem Biophys. 1981;210(2):421–36.
  47. Koebmann BJ, Westerhoff HV, Snoep JL, Nilsson D, Jensen PR. The glycolytic flux in Escherichia coli is controlled by the demand for ATP. J Bacteriol. 2002;184(14):3909–16.
  48. Tyndall JD, Nall T, Fairlie DP. Proteases universally recognize beta strands in their active sites. Chem Rev. 2005;105(3):973–99.
  49. Kim TW, Kim DM, Choi CY. Rapid production of milligram quantities of proteins in a batch cell-free protein synthesis system. J Biotechnol. 2006;124(2):373–80.
  50. Venturelli OS, Tei M, Bauer S, Chan LJG, Petzold CJ, Arkin AP. Programming mRNA decay to modulate synthetic circuit resource allocation. Nat Commun. 2017;8:15128.
  51. Taxis C, Stier G, Spadaccini R, Knop M. Efficient protein depletion by genetically controlled deprotection of a dormant N-degron. Mol Syst Biol. 2009;5:267.
  52. Shubeita HE, Sambrook JF, Mccormick AM. Molecular cloning and analysis of functional cDNA and genomic clones encoding bovine cellular retinoic acid-binding protein. Proc Natl Acad Sci U S A. 1987;84(16):5645–9.
  53. van den Berg S, Lofdahl PA, Hard T, Berglund H. Improved solubility of TEV protease by directed evolution. J Biotechnol. 2006;121(3):291–8.
  54. Blommel PG, Fox BG. A combined approach to improving large-scale production of tobacco etch virus protease. Protein Expres Purif. 2007;55(1):53–68.
  55. Neuenschwander M, Butz M, Heintz C, Kast P, Hilvert D. A simple selection strategy for evolving highly efficient enzymes. Nat Biotechnol. 2007;25(10):1145–7.
  56. Zuker M. Mfold web server for nucleic acid folding and hybridization prediction. Nucleic Acids Res. 2003;31(13):3406–15.
  57. Bonde MT, Klausen MS, Anderson MV, Wallin AIN, Wang HH, Sommer MOA. MODEST: a web-based design tool for oligonucleotide-mediated genome engineering and recombineering. Nucleic Acids Res. 2014;42(W1):W408–15.
  58. Bradford MM. A rapid and sensitive method for the quantitation of microgram quantities of protein utilizing the principle of protein-dye binding. Anal Biochem. 1976;72:248–54.
  59. Bateman A, Martin MJ, O'Donovan C, Magrane M, Apweiler R, Alpi E, Antunes R, Ar-Ganiska J, Bely B, Bingley M, et al. UniProt: a hub for protein information. Nucleic Acids Res. 2015;43(D1):D204–12.
  60. Boratyn GM, Schaffer AA, Agarwala R, Altschul SF, Lipman DJ, Madden TL. Domain enhanced lookup time accelerated BLAST. Biol Direct. 2012;7:12.
  61. Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li WZ, Lopez R, McWilliam H, Remmert M, Soding J, et al. Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Mol Syst Biol. 2011;7:539.
  62. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42.
  63. Touw WG, Baakman C, Black J, te Beek TAH, Krieger E, Joosten RP, Vriend G. A series of PDB-related databanks for everyday needs. Nucleic Acids Res. 2015;43(D1):D364–8.
  64. Hamelryck T, Manderick B. PDB file parser and structure class implemented in Python. Bioinformatics. 2003;19(17):2308–10.
  65. Szekely GJ, Rizzo ML, Bakirov NK. Measuring and testing dependence by correlation of distances. Ann Stat. 2007;35(6):2769–94.
  66. Blattner FR, Plunkett G, Bloch CA, Perna NT, Burland V, Riley M, ColladoVides J, Glasner JD, Rode CK, Mayhew GF, et al. The complete genome sequence of Escherichia coli K-12. Science. 1997;277(5331):1453.
  67. Queirozclaret C, Meunier JC. Staining technique for phosphatases in polyacrylamide gels. Anal Biochem. 1993;209(2):228–31.

Copyright

© Panke et al. 2017
