Software availability
FlashFry is written in Scala and bundled as a single stand-alone JAR file that runs on any system with an installed Java virtual machine (JVM). The tool is freely licensed under version 3 of the GPL, and code, documentation, and tutorials are available on the GitHub page: https://aaronmck.github.io/FlashFry/
Database generation times
We timed FlashFry’s database creation on an Amazon Elastic Compute Cloud r4.large instance using an SSD filesystem. FlashFry was limited to one CPU with the ‘taskset -c 0’ command and to 8 GB of memory with the Java “-Xmx8g” argument.
Off-target tool comparisons
Runtime and memory comparisons were run on Amazon’s Elastic Compute Cloud using the r4.large instance type with an Intel Xeon CPU at 2.30 GHz. Each machine instance was set up using a Docker configuration file, which is provided within our GitHub codebase. Each tool was limited to a single processor with the “taskset” Linux command, and compute and memory usage were recorded with the “time -v” command. The full pipeline is available in the GitHub repository, along with the timing results of individual runs. Each combination of guide count and permitted mismatch value (3, 4, and 5) was replicated five times using the human HG38 reference. Guide counts of 1, 10, 100, and 1000 were run for all tools. These random guides were generated using FlashFry’s “random” analysis module with the “--onlyUnidirectional” flag set to ensure a single candidate per sequence. Additionally, runs of 10,000 and 100,000 guides were performed for FlashFry and BWA, but not for Cas-OFFinder and CRISPRseek, for practical reasons. Individual tool configurations are detailed below.
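The design above amounts to a grid of guide counts, mismatch values, and replicates, with every run pinned to one CPU and wrapped in a resource-reporting timer. The following Python sketch enumerates that grid for the guide counts run on all tools. The function names and the placeholder tool command are hypothetical, and “/usr/bin/time” assumes GNU time is installed at that path; the pipeline actually used is the one in the GitHub repository.

```python
from itertools import product

GUIDE_COUNTS = [1, 10, 100, 1000]  # guide counts run for all tools
MISMATCHES = [3, 4, 5]             # permitted mismatch values
REPLICATES = 5                     # replicates per combination


def benchmark_grid(guide_counts=GUIDE_COUNTS, mismatches=MISMATCHES,
                   replicates=REPLICATES):
    """Enumerate every (guide_count, mismatches, replicate) combination."""
    return list(product(guide_counts, mismatches, range(1, replicates + 1)))


def wrap_for_timing(tool_cmd, cpu=0):
    """Pin a command to a single CPU and record its resource usage.

    `tool_cmd` is the argument list of the tool being benchmarked.
    """
    return ["taskset", "-c", str(cpu), "/usr/bin/time", "-v"] + list(tool_cmd)


if __name__ == "__main__":
    runs = benchmark_grid()
    # 4 guide counts x 3 mismatch values x 5 replicates
    print(len(runs))
    # "some_tool" is a placeholder, not a real benchmark command
    print(wrap_for_timing(["some_tool", "input.fasta"]))
```

Under these assumptions each tool is run 60 times for the shared guide counts, before the additional 10,000 and 100,000 guide runs.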
FlashFry
FlashFry version 1.8.1 was run in discovery mode with the Java option “-Xmx8g” for 1–1000 target runs, and “-Xmx15g” for 10,000 and 100,000 target searches. Mismatches were set with the “--maxMismatch” option, and defaults were used for all remaining parameters.
BWA
BWA runtime includes the initial alignment step (aln) and mapping to genomic coordinates (samse). BWA aln was run with parameters taken from Haeussler et al. [8]: aln -o 0 -m 20000000 -n <mismatch_count> -k <mismatch_count> -N -l 20 <humanRef>, using BWA commit tag e624290ad42f6c1deea87332337b08302faece48 from the following repository: https://github.com/lh3/bwa.
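As an illustration, the aln parameters listed above can be assembled into an argument list for a given mismatch count. This is a sketch only: the function name, the reference prefix, and the guide FASTQ path are hypothetical, and the samse step is omitted because its parameters are not given here.

```python
def bwa_aln_command(mismatch_count, human_ref, guides_fastq):
    """Build the `bwa aln` argument list with the parameters quoted above.

    `human_ref` (index prefix) and `guides_fastq` are hypothetical paths.
    """
    n = str(mismatch_count)
    return [
        "bwa", "aln",
        "-o", "0",         # no gap opens
        "-m", "20000000",  # queue size limit
        "-n", n,           # maximum edit distance
        "-k", n,           # maximum edit distance within the seed
        "-N",              # disable iterative search
        "-l", "20",        # seed length
        human_ref,
        guides_fastq,
    ]


if __name__ == "__main__":
    print(" ".join(bwa_aln_command(4, "hg38.fa", "guides.fastq")))
```

Setting both -n and -k to the mismatch count with a seed length of 20 applies the same mismatch limit to the seed as to the full alignment.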
Cas-OFFinder
A custom script was used to convert the random target FASTA file into the Cas-OFFinder input with the appropriate mismatch setting, using the 20(N)NGG search string. This conversion time was not counted towards Cas-OFFinder’s runtime or memory usage. Cas-OFFinder requires a custom Linux kernel driver supporting OpenCL to be installed on the machine, and our Docker instance pre-configures Intel’s OpenCL version 2017_7.0.0.2568_x64. Cas-OFFinder was then run using the CPU “C” option against the input file.
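The conversion script itself is in the GitHub repository; the sketch below only illustrates what such a conversion might look like. It assumes the commonly documented Cas-OFFinder input layout (genome directory on the first line, search pattern on the second, then one query and mismatch limit per line) and assumes each FASTA record begins with the 20-base protospacer. All function names and paths are hypothetical.

```python
def read_fasta_sequences(lines):
    """Return upper-cased sequences from FASTA lines, joining wrapped lines."""
    sequences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if current:
                sequences.append("".join(current))
                current = []
        else:
            current.append(line.upper())
    if current:
        sequences.append("".join(current))
    return sequences


def cas_offinder_input(genome_dir, targets, max_mismatches,
                       protospacer_len=20, pam="NGG"):
    """Format targets as Cas-OFFinder input text (assumed layout, see lead-in)."""
    pattern = "N" * protospacer_len + pam  # 20(N)NGG search string
    rows = [genome_dir, pattern]
    for seq in targets:
        if len(seq) < protospacer_len:
            raise ValueError(f"target shorter than {protospacer_len} bases: {seq}")
        # Keep the protospacer and mask the PAM positions with N
        query = seq[:protospacer_len] + "N" * len(pam)
        rows.append(f"{query} {max_mismatches}")
    return "\n".join(rows) + "\n"


if __name__ == "__main__":
    fasta = [">guide1", "ACGTACGTACGTACGTACGTAGG"]
    print(cas_offinder_input("/path/to/hg38", read_fasta_sequences(fasta), 4), end="")
```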
CRISPRseek
We ran CRISPRseek using a custom R script and the “Rscript” command-line tool (the associated code is available in our GitHub repository). Timing data include loading the relevant libraries and resources, executing the off-target search, and scoring the guides, as it was impossible to separate off-target discovery from scoring.
Cancer gene census calculations
All processing was done on a standard Amazon AWS r4.xlarge compute instance with an Intel Xeon CPU at 2.30 GHz. The cancer gene census (CGC) dataset (version 83) was downloaded from the CGC portal [20]. Intervals were generated using custom Scala scripts that captured the RefSeq exonic sequence of each gene, using the model with the largest number of exons and adding 10 bases upstream and downstream. The corresponding genomic sequences were extracted from the human HG19 genome using Picard’s ExtractSequences (https://broadinstitute.github.io/picard/). Sites were then discovered and scored using FlashFry with a maximum of four mismatches to off-targets, and a maximum of 2000 off-target sequences per candidate. Off-target scoring was run with 30 GB of memory, taking 15,024.70 s. The Hsu et al. off-target scoring scheme [19] and the Moreno-Mateos and Vejnar et al. [10] on-target metric were run against the 426,560 sites, and an aggregate ranking was produced by supplying FlashFry the “rank” scoring option. The rank option produces a rank-ordered assignment for each target based on the median rank of its individual scores, and additionally uses the Schulze method to rank the top 1000 targets [21]. Lastly, the best and second-best hits per individual exon and gene were calculated with Python and R scripts, available in the GitHub repository.
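The median-rank step can be illustrated with a small Python sketch. This is not FlashFry’s implementation: the function names, the example scores, the assumption that higher scores are better for every metric, and the tie-breaking by target name are all hypothetical, and the Schulze refinement of the top targets is omitted.

```python
from statistics import median


def ranks_for_metric(scores, higher_is_better=True):
    """Map each target to its rank (1 = best) under a single metric.

    Ties are broken by target name (an assumption of this sketch).
    """
    sign = -1 if higher_is_better else 1
    ordered = sorted(scores, key=lambda t: (sign * scores[t], t))
    return {target: rank for rank, target in enumerate(ordered, start=1)}


def median_rank_order(metric_scores):
    """Order targets by the median of their per-metric ranks.

    `metric_scores` maps metric name -> {target: score}; every metric
    must score the same set of targets.
    """
    per_metric = [ranks_for_metric(s) for s in metric_scores.values()]
    targets = list(per_metric[0])
    medians = {t: median(r[t] for r in per_metric) for t in targets}
    ordering = sorted(targets, key=lambda t: (medians[t], t))
    return ordering, medians


if __name__ == "__main__":
    # Hypothetical scores for four candidate targets
    example = {
        "on_target": {"A": 0.9, "B": 0.8, "C": 0.7, "D": 0.6},
        "off_target": {"A": 80, "B": 60, "C": 90, "D": 70},
    }
    ordering, medians = median_rank_order(example)
    print(ordering)
    print(medians)
```

With only two metrics the median rank equals the mean of the two ranks, so a target that ranks well on one metric and poorly on the other lands between targets that are consistently good and consistently poor.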