Animals, sampling, and tissue collections
Animals and breeds
Well-characterized breeds were chosen in order to obtain well-documented samples. Holstein is the most widely used breed for dairy cattle. For goats, the Alpine breed is one of the two most commonly used dairy breeds, and for pigs, the Large white breed is widely used as a dam line. For chickens, the White Leghorn breed was chosen as it provides the genetic basis for numerous experimental lines and is widely used for egg production.
Four animals were sampled for each species, two males and two females. They all had a known pedigree. Animals were sampled at an adult stage, so that they were sexually mature and had performance records, obtained in known environmental conditions. Females were either lactating or laying eggs.
All animals were fasted at least 12 h before slaughter. No chemicals were injected before slaughtering, animals were stunned and bled in a licensed slaughter facility at the INRA research center in Nouzilly.
Samples
Liver samples of 0.5 cm3 were taken from the edge of the organ, avoiding proximity with the gallbladder and avoiding blood vessels. Time from slaughter to sampling varied from 5 min for chickens to 30 min for goats and pigs and 45 min for cattle. For the purpose of RNA-seq, samples were immediately snapfrozen in liquid nitrogen, stored in 2-ml cryotubes and temporarily kept in dry ice until final storage at −80 ∘C.
For mammals, whole blood was sampled into EDTA tubes before slaughter; at least one sampling took place well before slaughter (at least 1 month) and another just before slaughter, in order to obtain at least 50 ml of whole blood for separation of lymphocytes (PBMC). PBMC were re-suspended in a medium containing 10% FCS, counted, conditioned with 10% DMSO and stored in liquid nitrogen prior to the sorting of specific cell types: CD3+CD4+ (“CD4”) and CD3+CD8+ (“CD8”).
For chicken, spleen was sampled after exsanguination. Spleen leucocytes were purified by density-gradient separation to remove nucleated erythrocytes contamination and stored in liquid nitrogen prior to CD4+ and CD8+ T cell sorting.
All protocols for liver sampling, PBMC separation, splenocyte purification, and T cell sorting can be found at http://ftp.faang.ebi.ac.uk/ftp/protocols/samples/
Experimental assays and protocols
All assays were performed according to FAANG guidelines and recommendations, available at http://www.faang.org. All detailed protocols used for RNA extraction and libraries production for RNA-seq, ATAC-seq, and Hi-C are available at http://ftp.faang.ebi.ac.uk/ftp/protocols/assays/.
RNA extraction
Cells and tissues were homogenized in TRIzol reagent (Thermo) using an ULTRA-TURRAX (IKA-Werke) and total RNAs were extracted from the aqueous phase. They were then treated with TURBO DNase (Ambion) to remove remaining genomic DNA and then processed to separate long and small RNAs using the mirVana miRNA Isolation kit. Small and long RNA quality was assessed using an Agilent 2100 Bioanalyzer and RNA 6000 nano kits (Agilent) and quantified on a Nanodrop spectrophotometer.
RNA-seq
Stranded mRNA libraries were prepared using the TruSeq Stranded mRNA Sample Prep Kit -V2 (Illumina) on 200 ng to 1 μg of total long RNA with a RNA Integrity Number (RIN) over 8 following the manufacturer’s instructions. Libraries were PCR amplified for 11 cycles and library quality was assessed using the High Sensitivity NGS Fragment Analysis Kit DNF-474 and the Fragment Analyser system (AATI). Libraries were loaded onto a High-seq 3000 (Illumina) to reach a minimum read numbers of 100M paired reads for each library.
Hi-C
In situ Hi-C libraries were made according to [47] with a few modifications. For all species, fresh liver biopsies were dissociated using Accutase, and each resulting cell suspension was filtered using a 70 μm cell strainer. Cells were then fixed with 1% formaldehyde for 10 min at 37 ∘C and fixation was stopped by adding Glycine to a final concentration of 0.125M. After two washes with PBS, cells were pelleted and kept at −80 ∘C for long term storage. Subsequently, cells were thawed on ice and 5 million cells were processed for each Hi-C library. Cell membranes were disrupted using a potter-Elvehjem PTFE pestle and nuclei were then permeabilized using 0.5% SDS with digestion overnight with HindIII endonuclease. HindIII-cut restriction sites were then end-filled in the presence of biotin-dCTP using the Klenow large fragment and were religated overnight at 4 ∘C. Nucleus integrity was checked using DAPI labelling and fluorescence microscopy. Nuclei were then lysed and DNA was precipitated and purified using Agencourt AMPure XP beads (Beckman Coulter) and quantified using the Qubit fluorimetric quantification system (Thermo). Hi-C efficiency was controlled by PCR using specific primers for each species and, if this step was successful, DNA was used for library production. DNA was first treated with T4 DNA polymerase to remove unligated biotinylated ends and sheared by sonication using a M220 Covaris ultra-sonicator with the DNA 550 pb SnapCap microtube program (Program length: 45 s; Picpower 50; DutyF 20; Cycle 200; Temperature 20 ∘C).
Sheared DNA was then size-selected using magnetic beads, and biotinylated fragments were purified using M280 Streptavidin Dynabeads (Thermo) and reagents from the Nextera_Mate_Pair Sample preparation kit (Illumina). Purified biotinylated DNA was then processed using the TrueSeq nano DNA kit (Illumina) following the manufacturer’s instructions. Libraries were amplified for 10 cycles and then purified using Agencourt AMPure XP beads. Library quality was assessed on a Fragment Analyser (AATI) and by endonuclease digestion using NheI endonuclease. Once validated, each library was sequenced on an Illumina Hi-Seq 3000 to reach a minimum number of 150M paired reads per library. Libraries from the cattle samples failed the Quality Control steps (proportion of mapped reads, number of valid interactions) and were not included in the analysis.
ATAC-seq
ATAC-seq libraries were prepared according to [20] with a few modifications. For liver, cells were dissociated from the fresh tissue to obtain a single cell suspension. Cells were counted and 50,000 cells were processed for each assay. Transposition reactions were performed using the Tn5 Transposase and TD reaction buffer from the Nextera DNA library preparation kit (Illumina) for 30 min at 37 ∘C. DNA was then purified using the Qiagen MinElute PCR purification kit. Libraries were first amplified for 5 cycles using custom-synthesized index primers and then a second amplification was performed. The appropriate number of additional PCR cycles was determined using real-time PCR, permitting the cessation of amplification prior to saturation. The additional number of cycles needed was determined by plotting the Rn versus Cycle and then selecting the cycle number corresponding to one-third of the maximum fluorescent intensity. After PCR amplification, libraries were purified using a Qiagen MinElute PCR purification kit followed by an additional clean-up and sizing step using AMPure XP beads (160 μl of bead stock solution was added to 100 μl of DNA in EB buffer) following the manufacturer’s instructions. Library quality was assessed on a BioAnalyser (Agilent) using Agilent High Sensitivity DNA kit (Agilent), and libraries were quantified using a Qubit Fluorometer (Thermo). Considering that the Hi-C protocol was not successful on the liver samples from cattle, ATAC-seq was not attempted on these samples either.
Bioinformatics and data analysis
All software used in this project along with the corresponding versions are listed in Additional file 1: Table S3. The reference gene annotation was obtained from the Ensembl v90 release (pig: Sscrofa11.1, chicken: GalGal5, cattle: UMD3.1, goat: ARS1). Since Capra hircus was not part of the Ensembl release, we used the NCBI CHIR_ARS1 annotation (see Additional file 1: Table S2).
RNA-seq
RNA-seq pipeline
Prior to any processing, all RNA-seq reads were trimmed using cutadapt version 1.8.3. Reads were then mapped twice using STAR v2.5.1.b [22, 23]: first on the genome indexed with the reference gene annotation to quantify expression of reference transcripts, and secondly on the same genome indexed with the newly generated gene annotation (FR-AgENCODE transcripts) (see below and Additional file 1: Figure S1) [64]. The STAR –quantMode TranscriptomeSAM option was used in both cases in order to additionally generate a transcriptome alignment (bam) file. After read mapping and CIGAR-based softclip removal, each sample alignment file (bam file) was processed with Cufflinks 2.2.1 [65, 66] with the max-intron-length (100000) and overlap-radius (5) options, guided by the reference gene annotation (–GTF-guide option) ([64], Additional file 1: Figure S1). All cufflinks models were then merged into a single gene annotation using Cuffmerge 2.2.1 [65, 66] with the –ref-gtf option. The transcript and gene expressions on both the reference and the newly generated gene annotation were quantified as TPM (transcripts per million) using RSEM 1.3.0 [24] on the corresponding transcriptome alignment files ([64], Additional file 1: Figure S1). The newly generated transcripts were then processed with FEELnc version 0.1.0 [36] in order to classify them into “lncRNA”, “mRNA” and “otherRNA” (Additional file 1: Figure S1, Tables S10–11, Figure S9). The newly generated transcripts with a TPM value of at least 0.1 in at least 2 samples were called FR-AgENCODE transcripts and kept as part of the new annotation. The 0.1 threshold was chosen knowing that the expression values of polyadenylated transcripts usually go from 0.01 to 10,000 [35] and that we wanted to simultaneously capture long non coding RNAs that are generally lowly expressed and reduce the risk of calling artefactual transcripts.
PCA based on gene expression
Principal Component Analysis (PCA) was performed using the mixOmicsR package [67] on the RNA-seq sample quantifications of each species. This was done using the expression (TPM) of two different sets of genes: reference genes with TPM 0.1 in at least two samples (Additional file 1: Figure S3) and FR-AgENCODE genes with TPM 0.1 in at least two samples (Additional file 1: Figure S11).
Annotated gene orthologs
We used Ensembl Biomart [68] to define the set of orthologous genes across cattle, chicken and pig. Only “1 to 1” similarity relationships were kept (11,001 genes). Since goat was not part of the Ensembl annotation, goat gene IDs were added to this list using gene name as a correspondence term. The resulting 4-species orthologous set contained 9461 genes (Additional file 3).
RNA-seq sample hierarchical clustering
Based on the expression of the 9461 orthologous genes in the 39 RNA-seq samples from the four species, the sample-by-sample correlation matrix was computed using the Pearson correlation of the log10 gene TPM values (after adding a pseudocount of 10−3). We then represented this sample by sample correlation matrix as a heatmap where the samples were also clustered using a complete linkage hierarchical clustering (Fig. 2).
RNA-seq normalization and differential analysis
To perform the differential analysis of gene expression, we used the expected read counts provided by RSEM [24]. RNA-seq library size normalization factors were calculated using the weighted Trimmed Mean of M-values (TMM) approach of [69] as implemented in the R/Bioconductor package edgeR [29]. The same package was used to fit three different per-gene negative binomial (NB) generalized log-linear models [70].
-
In Model 1, the expression of each gene was explained by a tissue effect; because all three tissues (liver, CD4, CD8) were collected from each animal, an animal effect was also included to account for these repeated measures:
$$\frac{\log \mu_{gi}}{s_{i}} = \beta_{g,\text{tissue}(i)} + \gamma_{g,\text{animal}(i)}, $$
where μgi represents the mean expression of gene g in sample i, si the TMM normalization factor for sample i, tissue(i)∈{liver, CD4, CD8} and animal(i)∈{1,2,3,4} the tissue and animal corresponding to sample i, and βg,tissue(i) and γg,animal(i) the fixed tissue and animal effects, respectively, of gene g in sample i. Hypothesis tests were performed to identify significantly differentially expressed genes among each pair of tissues, e.g.,
$$\mathcal{H}_{0g}:\ \beta_{g,\text{liver}} = \beta_{g,\text{CD4}}. $$
-
Model 2 is identical to the previous model, where gene expression was modeled using both a tissue and an animal effect, with the exception that the CD4 and CD8 tissues were collapsed into a single group. In this model, the only hypothesis of interest is thus between the liver and global CD cell group:
$$\mathcal{H}_{0g}:\ \beta_{g,\text{liver}} = \beta_{g,\text{CD}}. $$
All hypothesis tests were performed using likelihood-ratio tests and were corrected for multiple testing with the Benjamini-Hochberg (FDR, [71]) procedure. Genes with an FDR smaller than 5% and an absolute log-fold change larger than 2 were declared differentially expressed.
GO analysis of differentially expressed genes
For each species, GO term enrichment analysis was performed on the genes found to be over- or under-expressed in liver versus T cells. This analysis was done separately for each species (Additional file 1: Figure S5-S7) and subsequently for genes identified for all species (Additional file 1: Fig. S8), using the three following ontologies: biological process (BP), molecular function (MF) and cell compartment (CC), and using the GOstatR/Bioconductor package [72]) for only those genes with a human ortholog.
FR-AgENCODE transcript positional classification
The FR-AgENCODE transcript models were first classified according to their position with respect to reference transcripts: known the FR-AgENCODE transcript is strictly identical to a reference transcript (same intron chain and same initial and terminal exons) extension the FR-AgENCODE transcript extends a reference transcript (same intron chain but one of its two most extreme exons extends the reference transcript by at least one base pair) alternative the FR-AgENCODE transcript corresponds to a new isoform (or variant) of a reference gene, i.e., the FR-AgENCODE transcript shares at least one intron with a reference transcript but does not belong to the above categories novel the FR-AgENCODE transcript is in none of the above classes
FR-AgENCODE transcript coding classification
The FR-AgENCODE transcript models were also classified according to their coding potential. For this, the FEELnc program (release v0.1.0) was used to discriminate long non-coding RNAs from protein-coding RNAs. FEELnc includes three consecutive modules: FEELnc filter, FEELnc codpot and FEELnc classifier. The first module, FEELnc filter, filters out non-lncRNA transcripts from the assembled models, such as transcripts smaller than 200 nucleotides or those with exons strandedly overlapping exons from the reference annotation. This module was used with default parameters except -b transcript_biotype=protein_coding, pseudogene to remove novel transcripts overlapping protein-coding and pseudogene exons from the reference. The FEELnc codpot module then calculates a coding potential score (CPS) for the remaining transcripts based on several predictors (such as multi k-mer frequencies and ORF coverage) incorporated into a random forest algorithm [73]. In order to increase the robustness of the final set of novel lncRNAs and mRNAs, the options –mode=shuffle and –spethres=0.98,0.98 were set. Finally, the FEELnc classifier classifies the resulting lncRNAs according to their positions and transcriptional orientations with respect to the closest annotated reference transcripts (sense or antisense, genic or intergenic) in a 1Mb window (–maxwindow=1000000).
It is worth noting that between 83 and 2718 lncRNA transcripts were not classified because of their localization on the numerous unassembled contigs in livestock species with no annotated genes.
FR-AgENCODE gene conservation between species
In order to obtain gene orthology relationships, we projected FR-AgENCODE transcripts from the four livestock species to the human GRCh38 genome using the UCSC pslMap program (https://github.com/ENCODE-DCC/kentUtils/tree/master/src/hg/utils/pslMap, v302). More precisely, we used the UCSC hg38To[species.assembly].over.chain.gz chain file for each species (created in-house for goat following UCSC instructions) and retained only the best hit for each transcript (according to the pslMap score). We further required each FR-AgENCODE gene to project to a single human gene that did not strandedly overlap any other projected FR-AgENCODE gene.
Syntenic conservation of lncRNAs
Briefly, a lncRNA was considered as “syntenically” conserved between two species if (1) the lncRNA was located between two orthologous protein-coding genes, (2) the lncRNA was the only one in each species between the two protein-coding genes, and (3) the relative gene order and orientation of the resulting triplet was identical between species. Using these criteria, we found six triplets shared between the four species, 73 triplets shared between cattle, goat, and pig, and 19 triplets shared between cattle, chicken, and pig.
ATAC-seq
ATAC-seq data analysis pipeline
ATAC-seq reads were trimmed with trimgalore 0.4.0 using the –stringency 3, -q 20, –paired and –nextera options (Additional file 1: Table S3). The trimmed reads were then mapped to the genome using bowtie 2 2.3.3.1 with the -X 2000 and -S options [74]. The resulting sam file was then converted to a bam file with samtools 1.3.1, and this bam file was sorted and indexed with samtools 1.3.1 [75]. The reads for which the mate was also mapped and with a MAPQ ≥10 were retained using samtools 1.3.1 (-F 12 and -q 10 options, [75]), and finally only the fragments where both reads had a MAPQ ≥10 and which were on the same chromosome were retained.
Mitochondrial reads were then filtered out, as well as duplicate reads (with picard tools, MarkDuplicates subtool). The highest proportion of filtering was due to the MAPQ 10 and PCR duplicate filters (Additional file 1: Figure S14). The peaks were called using MACS2 2.1.1.20160309 [76] in each tissue separately using all the mapped reads from the given tissue (-t option) and with the –nomodel, -f BAMPE and –keep-dup all options. To get a single set of peaks per species, the tissue peaks were merged using mergeBed version 2.26.0 [77]. These peaks were also quantified in each sample by simply counting the number of mapped reads overlapping the peak.
ATAC-seq peaks were also classified with respect to the reference gene annotation using these eight genomic domains and allowing a peak to be in several genomic domains: exonic the ATAC-seq peak overlaps an annotated exon by at least one bp intronic the ATAC-seq peak is totally included in an annotated intron tss the ATAC-seq peak includes an annotated TSS tss1Kb the ATAC-seq peak overlaps an annotated TSS extended 1 kb both 5’ and 3’ tss5Kb the ATAC-seq peak overlaps an annotated TSS extended 5 kb both 5’ and 3’ tts the ATAC-seq peak includes an annotated TTS tts1Kb the ATAC-seq peak overlaps an annotated TTS extended 1 kb both 5’ and 3’ tts5Kb the ATAC-seq peak overlaps an annotated TTS extended 5 kb both 5’ and 3’ intergenic the ATAC-seq peak does not overlap any gene extended by 5 kb on each side
ATAC-seq differential analysis: normalization and model
Contrary to RNA-seq counts, ATAC-seq counts exhibited trended biases visible in log ratio-mean average (MA) plots between pairwise samples after normalization using the TMM approach, suggesting that an alternative normalization strategy was needed. In particular, trended biases are problematic as they can potentially inflate variance estimates or log fold-changes for some peaks. To address this issue, a fast loess approach [78] implemented in the normOffsets function of the R/Bioconductor package csaw [79] was used to correct differences in log-counts vs log-average counts observed between pairs of samples.
As for RNA-seq, we used two different differential models: Model 1 for tissue pair comparisons, Model 2 for T cell versus liver comparisons (see corresponding “RNA-seq” section for more details).
ATAC-seq peak TFBS density
In order to identify Transcription Factor Binding Sites (TFBS) genome-wide, we used the FIMO [80] software (Additional file 1: Table S3) to look for genomic occurrences of the 519 TFs catalogued and defined in the Vertebrate 2016 JASPAR database [81]. We then intersected these occurrences with the ATAC-seq peaks of each species and computed the TFBS density in differential vs non differential ATAC-seq peaks. Among the predicted TFBSs, those obtained from the CTCF motif were used to profile the resulting density with respect to Topological Associating Domains from Hi-C data (Fig. 7, Additional file 1: Figure S25).
Comparison between ATAC-seq peaks and ChIP-seq histone mark peaks
Pig liver H3K4me3 and H3K27ac ChIP-seq peaks from the Villar et al. study [42] were downloaded from ArrayExpress (experiment number E-MTAB-2633). As these peaks were provided on the 10.2 pig genome assembly, they were first lifted over to the 11.1 pig genome assembly using the UCSC liftover program (https://genome.sph.umich.edu/wiki/LiftOver). About 86.7% (9632 out of 11,114) of the H3K4me3 peaks and 91.8% (31,161 out of 33,930) of the H3K27ac peaks could be lifted over to the 11.1 genome assembly. The median peak size was 1944 bp for H3K4me3 and 2786 bp for H3K27ac, and the peak size distribution was very similar for the initial 10.2 and the lifted over 11.1 peaks. As for genome coverage, the H3K4me3 and H3K27ac peaks covered 0.9% and 4.7% of the 11.1 pig genome, respectively. In comparison, there were 25,885 pig liver ATAC-seq peaks with a median size of 360 bp and covering 0.5% of the pig genome. Consistent with what was expected from the two histone marks, the vast majority (94.9%) of the H3K4me3 peaks (known to be associated to promoter regions) overlapped (bedtools intersect program) with the H3K27ac peaks (known to be associated to both promoter and enhancer regions), and about 30% of the H3K27ac peaks overlapped with the H3K4me3 peaks. Comparing our pig liver ATAC-seq peaks to the histone mark peaks, we found that 27.1% (7012 out of 25,885) and 36.4% (9410 out of 25,885) of our pig liver ATAC-seq peaks overlapped with the H3K4me3 and H3K27ac peaks, respectively. Reciprocally, 70.3% (6773 out of 9632) and 28.3% (8821 out of 31,161) of the H3K4me3 and H3K27ac peaks respectively overlapped with our pig liver ATAC-seq peaks.
To assess if these numbers were higher than expected by chance, we shuffled (bedtools shuffle program) the 25,885 pig liver ATAC-seq peaks 1000 times on the pig genome and recomputed their intersection with the two sets of histone mark peaks (H3K4me3 and H3K27ac). After doing so, we never obtained percentages of H3K4me3 and H3K27ac peaks, respectively, overlapping the shuffled ATAC-seq peaks that were equal or higher than the ones obtained with the real ATAC-seq peaks. This means that indeed, 70.3% and 28.3% of the histone mark peaks overlapping our ATAC-seq peaks are percentages that are significantly higher than expected by chance (p value <10−3).
We also compared the ATAC-seq, H3K4me3 and H3K27ac peak scores (fold enrichment against random Poisson distribution with local lambda for ATAC-seq peaks and fold-enrichment over background for ChIP-seq peaks) of the common peaks versus the other peaks. In doing so, we found that common peaks had significantly higher scores than non common peaks (median 94 versus 32, p value <2.2×10−16 for ATAC-seq peaks, median 57 versus 22, p value <2.2×10−16 for H3K4me3 peaks and median 32 versus 12, p value <2.2×10−16 for H3K27ac peaks, Wilcoxon tests), highlighting a common signal between the two techniques.
Chromatin accessibility conservation across species
In order to investigate the conservation of chromatin accessibility across our 4 livestock species, we used the human GRCh38 genome as a reference. After indexing the softmasked GRCh38 genome (main chromosomes) using lastdb (last version 956, -uMAM4 and -cR11 options, http://last.cbrc.jp/), we used the lastal program followed by the last-split program (-m1 and –no-split options) (last version 956, http://last.cbrc.jp/) to project the 104,985 cattle, 74,805 goat, 119,894 chicken, and 149,333 pig ATAC-seq peaks onto the human genome. In doing so and consistent with the phylogenetic distance between our species and human, we were able to project 72.6% (76,253) cattle, 73.7% (55,113) goat, 12.3% (14,792) chicken, and 80.1% (119,680) pig peaks to the human genome. The percentage of bases of the initial peaks that could be aligned was around 40% for mammals and 14% for chicken. Then, for each peak that could be projected onto the human genome, we retained its best hit (as provided by lastal) and then merged all these best hits (i.e., from the 4 species) on the human genome (using bedtools merge). A total of 215,620 human regions were obtained, from which we kept the 212,021 that came from a maximum of 1 peak from each species. Those 212,021 regions were called human hits.
Based on the 1083 four-species orthologous peaks in the 38 ATAC-seq samples, the sample-by-sample correlation matrix was computed using the Pearson correlation of the log10 normalized ATAC-seq values (after adding a pseudo-count of 10−3 to the values). We then represented this sample-by-sample correlation matrix as a heatmap where the samples were also clustered using a complete linkage hierarchical clustering (Additional file 1: Figure S21). Chicken ATAC-seq samples clustered completely separately from mammal ATAC-seq samples. T cell samples from cattle and goat were also closer to each other than to liver samples.
To shuffle the 104,985 cattle, 74,805 goat, 119,894 chicken, and 149,333 pig ATAC-seq peaks, we used the bedtools shuffle program on their respective genomes and projected these shuffled peaks to the human genome as was done for the real peaks.
We also compared the human hits to the combined set of 519,616 ENCODE human DNAse I peaks from two CD4+, two CD8+ and one “right lobe of liver” samples (experiment accessions ENCSR683QJJ, ENCSR167JFX, ENCSR020LUD, ENCSR316UDN, and ENCSR555QAY from the encode portal https://www.encodeproject.org/, by merging the peaks from the 5 samples into a single set of peaks using bedtools merge). We found that 23.1% (48,893 out of 212,021) of the human hits obtained from the real ATAC-seq peaks overlapped human DNAse I peaks, whereas only 8.5% (21,159 out of 249,943) of the human hits obtained from shuffled ATAC-seq peaks overlapped human DNAse I peaks. This further supports the biological signal present in these data.
Finally we used the phastcons measure of vertebrate sequence conservation obtained from the multiple alignment of 100 vertebrate species genomes including human (hg38.phastCons100way.bw bigwig file from the UCSC web site https://genome.ucsc.edu/). For each human hit, we computed its phastcons score using the bigWigAverageOverBed utility from UCSC (https://github.com/ENCODE-DCC/kentUtils).
Hi-C
Hi-C data analysis pipeline
Our Hi-C analysis pipeline includes HiC-Pro v2.9.0 [82] (Additional file 1: Table S3) for the read cleaning, trimming, mapping (this part is internally delegated to Bowtie 2 v2.3.3.1), matrix construction, and matrix balancing ICE normalization [83]. HiC-Pro parameters: BOWTIE2_GLOBAL_OPTIONS= –very-sensitive -L 30–score-minL,-1,-0.1–end-to-end–reorder, BOWTIE2_LOCAL_OPTIONS =–very-sensitive-L 20–score-minL,-0.6,-0.2–end-to-end–reorder, LIGATION_SITE= AAGCTAGCTT, MIN_INSERT_SIZE = 20, MAX_INSERT_SIZE= 1000, GET_ALL_INTERACTION_CLASSES= 1, GET_PROCESS_SAM = 1, RM_SINGLETON = 1, RM_MULTI = 0, RM_DUP = 1, MAX_ITER = 100,FILTER_LOW_COUNT_PERC = 0.02, FILTER_HIGH_COUNT_PERC = 0, EPS =0.1. TAD finding was performed using two methods: arrowhead from the Juicer tool V1.5.3 [49] at the 10 kb resolution using the matrix balancing normalization (arrowhead-r 10000 -k KR option), and with Armatus V2.1 [55] with default parameters and gamma=0.5 (Additional file 1: Table S3). Graphical visualization of the matrices was produced with the HiTCR/Bioconductor package v1.18.1 [48] (Additional file 1: Table S3). Export to JuiceBox [84] was done through Juicer Tools V1.5.3 (Additional file 1: Table S3). These tools were called through a pipeline implemented in Python. Because of the high number of unassembled scaffolds (e.g., for goat) and/or micro-chromosomes (e.g., for chicken) in our reference genomes, only the longest 25 chromosomes were considered for TAD and A/B compartment calling. For these processes, each chromosome was considered separately.
The Directionality Index (DI) was computed using the original definition introduced by [50] to indicate the upstream vs. downstream interaction bias of each genomic region. Interaction matrices of each chromosome were merged across replicates and the score was computed for each bin of 40 kb. CTCF sites were predicted along the genomes by running FIMO with the JASPAR TFBS catalogue (see “ATAC-seq” section).
A/B compartments were obtained using the method described in [21] as illustrated in Additional file 1: Figure S24: first, ICE-normalized counts, Kij, were corrected for a distance effect with:
$$\widehat{K}_{ij} = \frac{K_{ij} - \overline{K}^{d}}{\sigma^{d}}, $$
in which \(\widehat {K}_{ij}\) is the distance-corrected count for the bins i and j, \(\overline {K}^{d}\) is the average count over all pairs of bins at distance d=d(i,j) and σd is the standard deviation of the counts over all pairs of bins at distance d. Within-chromosome Pearson correlation matrices were then computed for all pairs of bins based on their distance-corrected counts and a PCA was performed on this matrix. The overall process was performed similarly to the method implemented in the R/Bioconductor package HiTC [48]. Boundaries between A and B compartments were identified according to the sign of the first PC (eigenvector). Since PCAs had to be performed on each chromosome separately, the average counts on the diagonal of the normalized matrix were used to identify which PC sign (+/ −) should be assigned to A and B compartments for each chromosome. This allowed a homonegenous assignment across chromosomes to be obtained, without relying on the reference annotation. In line with what was originally observed in humans, where the first PC was the best criterion for separating A from B compartments (apart from a few exceptions like chromosome 14 for instance [21]), we also observed a good agreement between the plaid patterns of the normalized correlation matrices and the sign of the first PC (Additional file 1: Figure S24).
To estimate the robustness of A/B compartment calling, the method was tested on each replicate separately (four animals). Since the HiTC filtering method can discard a few bins in some matrices, resulting in missing A/B labels, the proportion of bins with no conflicting labels across replicates was computed among the bins that had at least two informative replicates (Additional file 1: Figure S26).
Chromatin structure conservation across species
To get insight into chromatin structure conservation across species, similar to what was done with chromatin accessibility data (see above), we projected the 11,711 goat, 6866 chicken, and 14,130 pig 40 kb TAD boundaries to the human GRCh38 genome using lastal followed by last-split (-m1 and –no-split options, last version 956, http://last.cbrc.jp/, using the same indexed GRCh38 softmasked genome as was used for ATAC-seq, see above). For this analysis we considered TADs from Armatus because of the high number of boundaries that were identified by the method. As expected from their length, TAD boundary projections were highly fragmented (median 16, 2, and 19 blocks per projection representing 3%, 0.6%, and 3% of the initial segment, for the best hit of goat, chicken, and pig, respectively). In order to recover conserved segments, we chained the alignments using a python in-house script (program available on demand, used with stranded mode, coverage=0.4, score=3000, and length_cutoff=5000). Doing so, we managed to project 90% of the mammalian and 5% of the chicken TAD boundaries onto the human genome. Similar to what was done for ATAC-seq, for each projected TAD boundary, its best hit (according to the chaining score) was retained. The median length of those best hits represented 95% and 78% of the initial query size for mammals and chicken respectively. Merging these best hits on the human genome (using bedtools merge), we obtained 16,870 human regions with a median length of 44.6 kb (similar to the initial TAD boundary size of 40 kb). Out of those, 16,468 were considered non ambiguous (i.e., not coming from several TAD boundaries from the same species) and were retained for further analyses. As was found for the ATAC-seq peaks, the majority (65.6%) of the hits were single species (similarity level 1), a substantial percentage of them (34%) were 2 species hits (similarity level 2), and seventy one of them (0.4%) were 3 species hits (similarity level 3).
To estimate the structural impact of each TAD boundary, we used the local interaction score as used by [51] and [52] and sometimes referred to as “interaction ratio” or “insulation profile”. Within a sliding window of 500 kb along the genome (step=10 kb), the insulation score ratio is defined as the proportion of read pairs that span across the middle of the window. The score ratio is reported at the middle position of the window and represents the local density of the chromatin contacts around this point. This proportion is expected to be maximal in regions with many local interactions (typically TADs) and minimal over insulators (typically TAD boundaries). Intuitively, a TAD boundary with a low interaction score (which indicates strong insulation properties) has a good capacity to prevent interactions that cross it while a TAD with a relatively high interaction score has a “weak” insulation strength. Here, only valid interactions (“valid pairs”) in in cis (inter-chromosomal contacts were not considered in the ratio) were considered after applying all HiC-Pro QC and filters. Computing a ratio among all read pairs that have both reads within the sliding window reduces the impact of potential biases (read coverage, restriction site density, GC content, etc.). Consequently, the interaction profiles from the 4 replicates along the genome of each species were highly similar (not shown), allowing to merge them in order to assign each TAD boundary a single score per species. For orthologous TAD boundaries, the scores from different species could be used to compute pairwise correlations. Human data were obtained from http://aidenlab.org/data.html [47] for the GM12878 cell line (ENCODE batch 1, HIC048.hic file from https://bcm.app.box.com/v/aidenlab/file/95512487145). The.hic file was parsed by the Juicer tool (“dump” mode with options “observed KR”) to compute the corresponding interaction score as described above. The LiftOver tool was used to convert the genomic positions of the human TAD boundaries (version hg19 vs. hg38) before comparing the interaction scores with livestock species.
The number and proportion of genes (all or only the orthologous ones) in each compartment type was computed using bedtools map (-distinct option on the gene ID field). Orthologous genes were taken from Ensembl as previously described. Under the independence assumption of compartment assignment between species, the expected proportion of orthologous genes with “triple A” (resp. with “triple B”) assignments between species is equal to the product of the observed frequencies for A (resp. for B) compartments in the three species. The observed frequencies of “triple A” and “triple B” assignments in orthologous genes was compared to this expected proportion using a χ2 goodness-of-fit test.
Multi-assay integration
ATAC-seq vs. RNA-seq correlation: intra- and inter-sample analysis
For each ATAC-seq peak that overlapped a promoter region (1 kb upstream of the TSS, as suggested in Fig. 4) its less-normalized read count value (see differential analysis) was associated with the TMM-normalized expression of the corresponding gene from the reference annotation. Intra- and inter-sample correlations were then investigated: within each sample, genes were ranked according to their expressions and the distribution of the corresponding ATAC-seq values was computed for each quartile (Additional file 1: Figure S18). Across samples, the Pearson correlation coefficient was computed for each gene using only the samples for which both the ATAC-seq and the RNA-seq normalized values were available (e.g., n=10 for pig, Additional file 1: Figure S19–20). Similar results were obtained with Spearman correlations (not shown).
Chromatin accessibility and gene expression in A/B compartments
To compute the general chromatin accessibility in A and B compartments, we first computed the average of the normalized read count values across all liver samples for each ATAC-seq peak. For each compartment, the mean value of all contained peaks was then reported and the resulting distributions for all A and B compartments were reported (Fig. 8).
The same approach was used to assess the general expression of genes in A and B compartments, using the average of the normalized expression values from the liver samples. Difference between A and B distributions was tested for statistical significance using a Wilcoxon test.