Table 1. New scripts used in our pipeline^a

From: The taming of an impossible child: a standardized all-in approach to the phylogeny of Hymenoptera using public database sequences

| Step | Number | Script |
| --- | --- | --- |
| Download from GenBank | [I] | proseqco |
| Standardize headers | [a.I], [b.I] | header_standardizer |
| Split sequences to single genes | [b.II] | multiple_sequence_splitter |
| Check strand polarity and sequence similarity | [b.III] | checking_seq |
| Choose longest sequence per species and gene | [a.IV], [b.IV] | choose_longest_seq |
| Translate coding mitochondrial sequences from nucleotides to amino acids | [b.V] | dna2aa |
| Delete groups of orthologs with three or fewer species | [II], [III], [XIII] | small_groups_deleter |
| Delete species with only one sequence | [III], [XIII] | taxon_deleter |
| Backtranslate coding mitochondrial sequences from amino acids to nucleotides | [VI] | aa2dna |
| Mask gappy regions in alignment | [VII] | gap_killer |
| Select maximum clique of overlapping sequences | [IX], [X] | minimum_sequence_overlap |
| Ban compositional heterogeneity | [XI], [XII] | nucleotide_chi |
| Prune genera to best represented species | [XIV] | prune_genera |
| Select largest group of species that overlap in at least one group of orthologs | [XV] | reduce2leading_gene |
| Concatenate alignments | [XVI] | concatenator |

^a Available at http://software.zfmk.de/ and in Additional file 1. All scripts were written in Ruby, except for checking_seq, which was written in Perl. Numerals (column "Number") correspond to those in Figure 1.
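To make a few of the filtering steps in Table 1 concrete, the following is a minimal Python sketch of four of them: choosing the longest sequence per species and gene, deleting groups of orthologs with three or fewer species, deleting species with only one sequence, and concatenating alignments. It is not the authors' code (their scripts are in Ruby and Perl), and it does not reproduce their input or output formats. All function names, the in-memory data layout (gene mapped to species mapped to sequence), and the use of "-" to pad genes missing for a species are assumptions made for illustration only.

```python
from collections import Counter


def choose_longest(records):
    """Keep the longest sequence per (gene, species).

    records: iterable of (species, gene, sequence) tuples (hypothetical layout).
    Returns {gene: {species: sequence}}.
    """
    best = {}
    for species, gene, seq in records:
        per_gene = best.setdefault(gene, {})
        if species not in per_gene or len(seq) > len(per_gene[species]):
            per_gene[species] = seq
    return best


def delete_small_groups(groups, max_small=3):
    """Drop groups of orthologs that contain max_small or fewer species."""
    return {g: seqs for g, seqs in groups.items() if len(seqs) > max_small}


def delete_sparse_taxa(groups, min_seqs=2):
    """Drop species represented by fewer than min_seqs sequences overall."""
    counts = Counter(sp for seqs in groups.values() for sp in seqs)
    return {
        g: {sp: s for sp, s in seqs.items() if counts[sp] >= min_seqs}
        for g, seqs in groups.items()
    }


def concatenate(alignments, missing="-"):
    """Join per-gene alignments into one sequence per species.

    Assumes all sequences within a gene are already aligned to equal length.
    Genes absent for a species are padded with the `missing` character
    (an assumption of this sketch).
    """
    genes = sorted(g for g in alignments if alignments[g])  # skip empty groups
    species = sorted({sp for g in genes for sp in alignments[g]})
    widths = {g: len(next(iter(alignments[g].values()))) for g in genes}
    return {
        sp: "".join(alignments[g].get(sp, missing * widths[g]) for g in genes)
        for sp in species
    }


if __name__ == "__main__":
    # Toy aligned input with invented gene and species labels.
    toy = {"geneA": {"sp1": "AC", "sp2": "A-"}, "geneB": {"sp1": "GGT"}}
    for name, seq in concatenate(toy).items():
        print(name, seq)
```

In the actual pipeline these filters are applied at several numbered stages (see the "Number" column), so the order and repetition of calls in a real run would follow Figure 1 rather than this sketch.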