Table 1 New scripts used in our pipeline^a

From: The taming of an impossible child: a standardized all-in approach to the phylogeny of Hymenoptera using public database sequences

Step | Number | Script
Download from GenBank | [I] | proseqco
Standardize headers | [a.I], [b.I] | header_standardizer
Split sequences to single genes | [b.II] | multiple_sequence_splitter
Check strand polarity and sequence similarity | [b.III] | checking_seq
Choose longest sequence per species and gene | [a.IV], [b.IV] | choose_longest_seq
Translate coding mitochondrial sequences from nucleotides to amino acids | [b.V] | dna2aa
Delete groups of orthologs with three or fewer species | [II], [III], [XIII] | small_groups_deleter
Delete species with only one sequence | [III], [XIII] | taxon_deleter
Backtranslate coding mitochondrial sequences from amino acids to nucleotides | [VI] | aa2dna
Mask gappy regions in alignment | [VII] | gap_killer
Select maximum clique of overlapping sequences | [IX], [X] | minimum_sequence_overlap
Ban compositional heterogeneity | [XI], [XII] | nucleotide_chi
Prune genera to best represented species | [XIV] | prune_genera
Select largest group of species that overlap in at least one group of orthologs | [XV] | reduce2leading_gene
Concatenate alignments | [XVI] | concatenator
^a Available at and in Additional file 1. All scripts were written in Ruby, except for checking_seq, which was written in Perl. Numerals (column "Number") correspond to those in Figure 1.
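
To make two of the filtering steps in the table concrete, the following is a minimal Ruby sketch of "choose longest sequence per species and gene" and "delete groups of orthologs with three or fewer species". It is not the authors' code: the `Record` struct, the function names `choose_longest` and `drop_small_groups`, the `max_small` parameter, and the example sequences are all hypothetical, and the choice to measure length without gap characters is an assumption. The actual scripts (choose_longest_seq, small_groups_deleter) may read different input formats and apply different rules.

```ruby
# Hypothetical record layout for illustration only; the real scripts'
# input format is not described in the table.
Record = Struct.new(:species, :gene, :seq)

# Length of a sequence ignoring gap characters and whitespace
# (assumption: gaps should not count towards "longest").
def ungapped_length(seq)
  seq.gsub(/[-\s]/, "").length
end

# Keep one record per (species, gene) pair: the one with the longest
# ungapped sequence.
def choose_longest(records)
  records
    .group_by { |r| [r.species, r.gene] }
    .map { |_key, group| group.max_by { |r| ungapped_length(r.seq) } }
end

# Drop every gene (group of orthologs) represented by max_small or fewer
# distinct species; the default of 3 mirrors "three or fewer species".
def drop_small_groups(records, max_small: 3)
  records
    .group_by(&:gene)
    .reject { |_gene, group| group.map(&:species).uniq.length <= max_small }
    .values
    .flatten(1)
end

# Made-up example data.
records = [
  Record.new("Apis mellifera", "COI", "ATGGCA---TTA"),
  Record.new("Apis mellifera", "COI", "ATGGCATTAGGC"),
  Record.new("Apis mellifera", "28S", "GGCT"),
  Record.new("Vespa crabro",   "COI", "ATG")
]

longest = choose_longest(records)
longest.each do |r|
  puts [r.species, r.gene, ungapped_length(r.seq)].join("\t")
end
```

With real data, `drop_small_groups(choose_longest(records))` would chain the two filters; the tiny example set above has no gene with more than three species, so the default threshold would remove every group.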