Large-scale in silico mutagenesis experiments reveal optimization of genetic code and codon usage for protein mutational robustness

Background: How, and to what extent, evolution acts on DNA and protein sequences to ensure mutational robustness and evolvability is a long-standing open question in molecular evolution. We addressed this issue through the first structurome-scale computational investigation, in which we estimated the change in folding free energy upon all possible single-site mutations introduced in more than 20,000 protein structures, as well as through available experimental stability and fitness data.

Results: At the amino acid level, we found the protein surface to be more robust against random mutations than the core, this difference being stronger for small proteins. Destabilizing mutations are more numerous in the core and neutral mutations on the surface, whereas stabilizing mutations account for about 4% in both regions. At the genetic code level, we observed the smallest destabilization for mutations that are due to substitutions of base III in the codon, followed by base I, bases I+III, base II, and other multiple base substitutions. This ranking is strongly anticorrelated with the codon-anticodon mispairing frequency in the translation process. This suggests that the standard genetic code is optimized to limit the impact of random mutations, but even more so to limit translation errors. At the codon level, both the codon usage and the usage bias appear to optimize mutational robustness and translation accuracy, especially for surface residues.

Conclusion: Our results highlight the non-universality of mutational robustness and its multiscale dependence on protein features, the structure of the genetic code, and the codon usage. Our analyses and approach are strongly supported by available experimental mutagenesis data.
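To make the genetic-code-level terminology concrete, the following minimal sketch enumerates single base substitutions by codon position (bases I, II, III), distinguishes transitions from transversions, and tests whether an amino acid mutation is reachable by a single base substitution (µSBS) or requires multiple substitutions (µMBS). It is an illustration only, not the pipeline used in this work: all function names are hypothetical, and the only external fact it relies on is the standard genetic code (NCBI translation table 1).

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code (NCBI translation table 1); codons ordered TTT, TTC, TTA, ...
# '*' marks the three stop codons.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}
PURINES = {"A", "G"}


def single_base_substitutions(codon):
    """Yield (position, mutant_codon) for the 9 single base substitutions of a codon.

    Positions are 1-based: 1, 2, 3 correspond to bases I, II, III.
    """
    for i, wild_base in enumerate(codon):
        for base in BASES:
            if base != wild_base:
                yield i + 1, codon[:i] + base + codon[i + 1:]


def is_transition(base_a, base_b):
    """True for purine<->purine or pyrimidine<->pyrimidine changes (base_a != base_b assumed)."""
    return (base_a in PURINES) == (base_b in PURINES)


def mutations_by_position(codon):
    """Map each codon position to the set of amino acids (or '*') reached by one substitution."""
    reached = {1: set(), 2: set(), 3: set()}
    for position, mutant in single_base_substitutions(codon):
        reached[position].add(CODON_TABLE[mutant])
    return reached


def reachable_by_sbs(wild_aa, mutant_aa):
    """True if any codon of wild_aa is one base substitution away from a codon of mutant_aa."""
    return any(
        CODON_TABLE[mutant] == mutant_aa
        for codon, aa in CODON_TABLE.items() if aa == wild_aa
        for _, mutant in single_base_substitutions(codon)
    )


def synonymous_counts():
    """Count synonymous sense-to-sense substitutions per codon position over the 61 sense codons."""
    counts = {1: 0, 2: 0, 3: 0}
    for codon, aa in CODON_TABLE.items():
        if aa == "*":
            continue
        for position, mutant in single_base_substitutions(codon):
            if CODON_TABLE[mutant] == aa:
                counts[position] += 1
    return counts
```

For instance, `reachable_by_sbs("W", "F")` is false because the single Trp codon TGG differs from both Phe codons (TTT, TTC) at two positions, so Trp→Phe would count as a µMBS in this sketch. The per-position synonymous counts show why base III substitutions are the mildest on average: most synonymous changes occur at the third position and none at the second.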


Table of Contents
• Table S1. Amino acid mutations in M_PoP due to single nucleobase substitutions (µSBS) and due to multiple nucleobase substitutions that cannot be obtained through single base substitutions (µMBS).
• Table S2. Amino acid mutations in M_Exp due to single nucleobase substitutions (µSBS) and due to multiple nucleobase substitutions that cannot be obtained through single base substitutions (µMBS).
• Table S3. Amino acid mutations in M_PoP due to substitutions of a single nucleobase Gua, Cyt, Ade or Thy.
• Table S4. Amino acid mutations in M_Exp due to substitutions of a single nucleobase Gua, Cyt, Ade or Thy.
• Table S5. Amino acid mutations in M_PoP due to single nucleobase substitutions (µSBS) that correspond to transitions or transversions.
• Table S6. Amino acid mutations in M_Exp due to single nucleobase substitutions (µSBS) that correspond to transitions or transversions.
• Table S7. Amino acid mutations in M_PoP due to single nucleobase substitutions (µSBS) and due to multiple nucleobase substitutions that cannot be obtained through single base substitutions (µMBS).
• Table S8. Amino acid mutations in M_Exp due to single nucleobase substitutions (µSBS) and due to multiple nucleobase substitutions that cannot be obtained through single base substitutions (µMBS).
• Table S9. Difference between ∆∆G for µSBSs in M_PoP reached from synonymous codons (syn) or from the wild-type codon (used), according to whether the position-dependent frequency of translation errors is taken into account (translation) or not (random).
• Figure S2. Influence of the protein length on the mutational robustness for core residues (RSA < 20%).
• Figure S3. Hydrophobic residue content (Val, Ile, Leu, Phe) in the protein core (RSA≤20%) as a function of the protein length.
• Figure S4. Difference between the mean of the experimental ∆∆G values per RSA bin of long proteins (L >200 residues) and short proteins (L ≤200 residues) as a function of RSA.
• Figure S5. ∆∆G (in kcal/mol) distribution for different RSA ranges, and for different types of single and multiple base substitutions (µSBS and µMBS).
• Figure S6. Ratio of stabilizing, destabilizing and neutral µSBSs considering random mutations (that occur with equal frequency at each codon position)
• Figure S8. Average ∆∆G (in kcal/mol) of µSBSs per residue as a function of the position in the sequence of wheat agglutinin isolectin 3 (PDB code 2X52, chain A).
• Figure S9. Schematic picture of the computational pipeline used in this paper.
• Table S10. List of eukaryote organisms in the dataset D, with their number of proteins and average ∆∆G upon all possible point mutations.
• Table S11. List of bacterial organisms in the dataset D, with their number of proteins and average ∆∆G upon all possible point mutations.
• Table S12. List of archaea organisms in the dataset D, with their number of proteins and average ∆∆G upon all possible point mutations.
• Table S13. List of virus organisms in the dataset D, with their number of proteins and average ∆∆G upon all possible point mutations.

Table S1. Amino acid mutations in M_PoP due to single nucleobase substitutions (µSBS) and due to multiple nucleobase substitutions that cannot be obtained through single base substitutions (µMBS). In ∆∆G^(d,s), the synonymous mutations (with ∆∆G = 0) are included in the mean and the degeneracy (the number of different base substitutions yielding the same amino acid mutation) is taken into account. The percentages of mutations refer to ∆∆G, without degeneracy and synonymous mutations.

Table S2. Amino acid mutations in M_Exp due to single nucleobase substitutions (µSBS) and due to multiple nucleobase substitutions that cannot be obtained through single base substitutions (µMBS). When the number of occurrences is too low to yield reliable statistics (< 100), the RSA intervals are combined, and if the number is still too low, no results are given. We did not indicate the percentage of stabilizing, neutral and destabilizing mutations, as the mutations here are non-random.

Table S7. Amino acid mutations in M_PoP due to single nucleobase substitutions (µSBS) and due to multiple nucleobase substitutions that cannot be obtained through single base substitutions (µMBS). In ∆∆G^(d), the degeneracy (the number of different base substitutions yielding the same amino acid mutation) is taken into account. ∆∆G_used and ∆∆G_syn refer to mean ∆∆G values of mutations that are reached from the wild-type codon or a synonymous codon, respectively. σ is the standard deviation of the ∆∆G distribution: $\sigma^2(\Delta\Delta G) = \sigma^2(\Delta\Delta G_{\mathrm{used}}) + \sigma^2(\Delta\Delta G_{\mathrm{syn}})$. The percentages of mutations refer to those starting from the wild-type codon.

Table S8. Amino acid mutations in M_Exp due to single nucleobase substitutions (µSBS) and due to multiple nucleobase substitutions that cannot be obtained through single base substitutions (µMBS). In ∆∆G^(d), the degeneracy (the number of different base substitutions yielding the same amino acid mutation) is taken into account. ∆∆G_used and ∆∆G_syn refer to mean ∆∆G values of mutations that are reached from the wild-type codon or a synonymous codon, respectively. σ is the standard deviation of the ∆∆G distribution: $\sigma^2(\Delta\Delta G) = \sigma^2(\Delta\Delta G_{\mathrm{used}}) + \sigma^2(\Delta\Delta G_{\mathrm{syn}})$. "Number" refers to the number of mutations starting from the wild-type codon; when it is too low to yield reliable statistics (< 100), the RSA intervals are combined.

Table S9. Difference between ∆∆G for µSBSs in M_PoP reached from synonymous codons (syn) or from the wild-type codon (used), according to whether the position-dependent frequency of translation errors is taken into account (translation) or not (random). σ is the standard deviation of the ∆∆G distribution. Here, unlike in Table 8 of the main text, the biased and unbiased codons are not defined in terms of the deviation from codon equiprobability but in terms of the deviation with respect to the expected frequency under the observed base frequencies.
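The averaging conventions described in the captions above can be sketched as follows. This is one possible reading of the definitions, not the authors' code: the function names are hypothetical, the use of the population variance is an assumption (the captions do not say which estimator was used), and the numbers in the usage note are invented purely for illustration.

```python
from math import sqrt
from statistics import pvariance


def degeneracy_weighted_mean(ddg_values, degeneracies, n_synonymous=0):
    """Mean ∆∆G in which each amino acid mutation is counted once per base
    substitution that produces it (its degeneracy).

    With n_synonymous > 0, synonymous substitutions are added to the mean
    with ∆∆G = 0, as described for ∆∆G^(d,s); with n_synonymous = 0 this
    corresponds to the ∆∆G^(d) convention.
    """
    if len(ddg_values) != len(degeneracies):
        raise ValueError("ddg_values and degeneracies must have the same length")
    total_weight = sum(degeneracies) + n_synonymous
    if total_weight == 0:
        raise ValueError("no substitutions to average over")
    weighted_sum = sum(g * d for g, d in zip(ddg_values, degeneracies))
    return weighted_sum / total_weight


def combined_sigma(ddg_used, ddg_syn):
    """Standard deviation obtained by summing the variances of the 'used' and
    'syn' distributions: sigma^2 = sigma^2(used) + sigma^2(syn).

    Assumption: population variances are used.
    """
    return sqrt(pvariance(ddg_used) + pvariance(ddg_syn))
```

With invented values, two amino acid mutations with ∆∆G of 1.0 and 2.0 kcal/mol and degeneracies 1 and 3 give a degeneracy-weighted mean of (1.0 + 3 × 2.0) / 4 = 1.75 kcal/mol; adding three synonymous substitutions lowers it to 7.0 / 7 = 1.0 kcal/mol.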