Multiomic atlas with functional stratification and developmental dynamics of zebrafish cis-regulatory elements – Nature.com
The DANIO-CODE DCC
We established a DCC protocol26, which we populated with zebrafish developmental genomic data, including standardized annotation of metadata of diverse, often inconsistently annotated, published datasets (Fig. 1a), by the DANIO-CODE consortium (https://www.birmingham.ac.uk/generic/danio-code/partners/index.aspx). The DCC is accessible from ZFIN and includes datasets, their underlying samples and sequencing protocols using ZFIN and ENCODE nomenclature (www.danio-code.zfin.org). To identify and analyze the developmental dynamics of genomic features, direct comparison across datasets produced by different laboratories and different protocols is required. To this end, we carried out consistent reprocessing starting from the raw sequencing data (Fig. 1a). Raw sequencing data were collected and reprocessed by standardized pipelines of ENCODE for ChIPseq and ATAC-seq27, FANTOM for CAGE-seq28 and producer pipelines for Hi-C, 4C-seq or other data (Methods). These pipelines are available on GitLab (https://gitlab.com/danio-code). The DCC data include 1,438 published datasets contributed by data producers directly or collected by DANIO-CODE data annotators, together with strategically selected datasets for developmental stages from the public domain. In addition, 366 datasets were generated by consortium members to fill gaps and to aid functional annotation and functional element characterization, including 15 CAGE-seq, 18 ChIPseq, 11 ATAC-seq, 2 Hi-C and 320 4C-seq datasets (Fig. 1b and Extended Data Fig. 1a,b). Breakdown of the datasets according to data types and stages of development is presented in Fig. 1b. The source of data collection is in Extended Data Figure 1c and Supplementary Table 1. Quality checks and data comparability analyses were carried out for datasets within a data type obtained from multiple laboratories, particularly affecting RNA-seq (Supplementary Fig. 1b), ChIPseq (Supplementary Fig. 1df), CAGE-seq (Supplementary Fig. 2) and ATAC-seq (Supplementary Fig. 1c) data. The DCC continues to be periodically updated (Extended Data Fig. 1e) and is openly accessible to the community for downloading data and uploading new datasets (Supplementary Videos 1 and 2).
a, Collection and manual annotation processes of datasets with the DANIO-CODE DCC with highlights of key findings. b, Extent of the open repository for developmental multiomic data for zebrafish with assay type (y axis) and developmental stage (x axis). Data first reported in this study are highlighted with black circles. c, Visualization of temporal dynamics of selected transcriptomic and epigenomic features during development at a developmentally active locus. Coloring of tracks represents developmental series from maternal (blue) to zygotically active stages of embryogenesis (red). Symbols and track colors indicate representative stages (Extended Data Fig. 1d). CNS, central nervous system.
The resulting data and reprocessed multiomic datasets represent a comprehensive annotation of the zebrafish genome during normal embryonic development and are available as a public track hub in the UCSC browser and uploadable to the Ensembl genome browser. Figure 1c provides an example developmentally regulated locus covering selected stages visualized by the Washington University Epigenome browser29. The tracks further include annotation of approximately 140,000 predicted ATAC-seq-supported developmental regulatory elements (PADRE) annotated by ChromHMM categories. The bulk sample-based predictions for regulatory elements are complemented with annotations of cell-type specificity of candidate regulatory elements provided by single-cell ATAC-seq30 (Supplementary Videos 35).
As genome-wide transcriptome analyses3,31,32,33 fail to annotate 5 untranslated regions (UTRs) precisely, we used DANIO-CODE expression data to improve current Ensembl models of developmentally active genes. We utilized 139 developmental RNA-seq samples to identify 31,458 genes comprising 55,596 transcripts (Fig. 2a and Supplementary Table 2), among them 167 novel transcripts of uncertain coding potential (TUCP) and 726 long noncoding (lnc) RNA genes not previously annotated by Ensembl and supported by CAGE signals (Extended Data Fig. 2 and Supplementary Table 3). We mapped 5 transcription start sites (TSSs) from 34 CAGE samples in 16 developmental stages (Fig. 2a). We applied promoter-calling criteria to CAGE data (Methods and Supplementary Fig. 2ac), resulting in 22,500 active promoters per CAGE sample on average, corresponding to 16,303 genes (Supplementary Table 4), and adding 4,070 novel promoters to 18,461 previously annotated Ensembl TSSs (GRCz10). To supplement the promoterome with cis-regulatory sites, we curated 581 regulatory motifs representing 814 zebrafish transcription factor (TFs), and predicted binding sites for these motifs across all promoters (Methods).
a, DANIO-CODE transcript 5 ends supported by CAGE TSS during stages of development. b, Distribution of absolute distance of Ensembl TSSs to CAGE-dominant TSSs in the Prim-5 stage. c, Relationship between guide distance to TSSs and ddCt. Inset: number of dCas guides for all 26 tested genes. d, CAGE-defined TSSs increase the accuracy of promoter identification and support dCas inhibition guide reagent designs. Distance between Ensembl TSSs and CAGE-dominant TSSs (top). Genome view with CRISPR guide position and efficacy, Ensembl and RefSeq transcripts, CAGE and RNA-seq expression (bottom).
a, Heat map shows the dynamics of expression levels of reference and alternative promoters across 16 developmental stages represented as images. Expression levels are scaled in the range of 01 for each row. Reference and alternative transcripts using the same and different coding sequence (CDS) starts are denoted. Transcript pairs without full CDS annotation are denoted as ambiguous. b, Distribution of correlation coefficient of expression levels of promoters across 16 developmental stages. c, Enrichment of KEGG pathways on multipromoter genes. The adjusted P value cut-off is 0.05, denoted by a vertical dashed line. The number of genes in KEGG pathways and those overlapping with multipromoter genes is shown inside the bars, d, MARA motif activity plots of three TF motifs across development. Posterior means and standard deviations (depicted as error bars) are based on analysis of the expression levels of all n=27,781 promoters for each sample. Motif logos are depicted as insets. e, Genome browser view of the actin alpha 1a promoter. From the top: ATAC signal, CAGE signal, a single TSR (black), two Ensembl transcripts (dark red) and TFBSs predicted to regulate this TSR (red) are shown. Color intensities of the TFBSs reflect MARA scores of predicted regulatory role of TFs.
Our above definition of promoters at single-nucleotide resolution may offer important guidance for promoter-targeted gene manipulation. For instance, gene promoter targeting for transcription block may be useful in reverse genetic experiments to avoid mutant RNA-mediated genetic compensation, which may mask mutant phenotypes and hinder dissection of gene function34. We compared Ensembls RNA-seq-based TSS with our CAGE-seq-based TSS and found a substantial discrepancy in position (Fig. 2b and Extended Data Fig. 3a), potentially impacting guide RNA design for CRISPRCas targeting. Multiple dCas guide positions were designed and their impact on expression reduction with increased distance between the guide target and dominant CAGE-defined TSS was tested. Efficiency of dCas inhibition was higher when CAGE dominant, compared to Ensembl, start sites were used (Fig. 2c,d and Supplementary Table 5), demonstrating the importance of accurate TSS detection and the improved accuracy of CAGE over the current Ensembl pipeline in promoter detection.
Using these data we identified 1,293 multipromoter genes (Supplementary Table 6), where 1,176 genes had one reference and one alternative promoter and 117 genes had two or more alternative promoters. Correlation of expression levels of reference and alternative promoter pairs indicated both convergent (cyan in Fig. 3a,b) and divergent (brown) dynamics during development. The expression of reference promoters was on average higher than those of alternative promoters (Extended Data Fig. 3b). Among 978 transcript pairs with full-length coding sequence annotation, 373 (38%) of the alternative promoters affected only the 5 UTR (for example, dag1; Extended Data Fig. 3c), whereas the remaining 605 altered the N-terminal protein sequence (for example, bmp6; Extended Data Fig. 3d). We analyzed mouse CAGE-seq28 data from comparable embryonic stages and annotated 1,779 multipromoter genes (Extended Data Fig. 3e and Supplementary Table 7). About one-third (294; 30%) of identified mouse orthologs of zebrafish multipromoter genes (974; 75%) utilized alternative promoters. Orthologs of multipromoter genes were likely (P=2.7105; Fishers exact test) to be expressed in similar stages and highly likely (P=3.241058; Fishers exact test) to have multiple promoters in mouse. Multipromoter genes were enriched in KEGG signaling pathways in zebrafish (Fig. 3c) and mouse (Supplementary Table 8), suggesting vertebrate conservation of alternative promoters in signal transduction-associated genes.
Precision promoter annotation and expression dynamics allow exploitation of this resource to predict TF activity regulating the promoters. We implemented Motif Activity Response Analysis (MARA)35,36 for zebrafish. MARA models promoter expression dynamics in terms of the annotated TF binding sites, to infer which TFs most substantially drive expression changes during development. Figure 3d shows the inferred activity profiles of three TFs with strong effects on genome-wide expression patterns. While Tead3 targets are upregulated from gastrulation onwards, Tgif1 targets are transiently downregulated and NF-Y targets are downregulated from the sphere stage onwards, consistent with the known activities of these TFs37,38,39,40,41 (Extended Data Fig. 4 and Supplementary Table 9). MARA predicts substantially changing regulatory activities for regulatory motifs and assigns candidate regulator TFs to promoters (Fig. 3e). We have integrated our zebrafish annotations into the ISMARA webserver (ismara.unibas.ch) to allow this activity analysis on any RNA-seq data.
Next, we aimed to generate a comprehensive atlas of zebrafish developmental regulatory elements. We defined reproducible ATAC-seq42 peaks as PADREs in four pre-zygotic genome activation (ZGA) and seven post-ZGA stages, which we further classified on the basis of the presence of four histone marks using ChromHMM43,44 in five post-ZGA stages (Fig. 4a, Supplementary Fig. 3 and Extended Data Fig. 5a).
a, Genome browser screenshot showing ChromHMM classification of PADREs, and respective histone post-translational modification signals used to define them. b, UMAP plot of PADREs at the Prim-5 stage. Each point represents one open chromatin region, colored by functional assignment. c, Occurrence probabilities of chromatin marks for ChromHMM states. The states function was manually assigned using The Roadmap Epigenomic annotations as reference. 1_TssA1, 2_TssA2: active TSS; 3_TssFlank1, 4_TssFlank2, TSS flanking region; 5_EnhA1, active enhancer; 6_EnhFlank, enhancer flanking region; 7_EnhWk1, primed enhancer; 8_Pois, poised elements; 9_PcRep, Polycomb-repressed regions; 10_Quies, quiescent state. dg, UMAP plot showing PADREs overlapping with CAGE promoters (d), CTCF motif (e), eRNA enhancers (f) and transgenically validated enhancers (g). The transgenically validated enhancers are predominantly associated with enhancer-associated chromatin states (Supplementary Table 11). h, UMAP plot showing the mean phastCons score for each PADRE (top right) and overlap with human CNEs (top left). The bottom subpanel shows the distribution of phastCons scores of active enhancers throughout development (left, bars represents interquartile range), as well as the distribution of the phastCons score for PADREs separated by function at the Prim-5 stage. Two-sided Wilcoxon rank sum test was used to calculate P values between promoters and enhancers (P=2.21016) and enhancers and Polycomb-associated elements (P=2.21016). Exons and intergenic regions were added as reference (right) i, Position of cell-type-specific elements on the UMAP plot (top). ATAC, H3K27ac and H3K4me1 signals around the peak summit of cell-type-specific PADREs (bottom).
a, Openness profile of selected SOM classes (4: early; 6: post-ZGA constitutive; 14: late class), and their position density on the UMAP plots of different developmental stages (top). Heat map of signal intensity of ATAC, H3K27ac and H3K4me1 at the Dome and the Prim-5 stages, along with their respective profiles (bottom). b, Position of COPEs, DOPEs and DOPEs marked with H3K27ac in adult tissues on the UMAP plot (left). Profiles of ATAC, H3K27ac and H3K4me1 of COPEs, DOPEs, DOPEs marked in adult tissues, and other constitutive elements throughout development (right).
To examine the developmental dynamics of PADREs, we developed a UMAP-based method (Methods and Extended Data Fig. 6ac) that can identify known functional classes and potentially novel subclasses during development. The UMAP plot of PADREs (Fig. 4b and Extended Data Fig. 6d) separated most ChromHMM-derived functional classes, including promoters from enhancers (Fig. 4c). Near-symmetry around the y axis reflects strand directionality and was most prominent among promoters (Fig. 4d). Two prominent clusters, which stretched upward and downward from the right apex and bear no chromatin marks, are enriched for the CTCF motif with well-positioned flanking nucleosomes45 (Fig. 4e and Supplementary Fig. 4). Enhancer predictions were validated with two independent sets: (1) enhancers with bidirectional enhancer RNA signals46 called from nuclear CAGE; and (2) a manually curated catalog of published enhancers functionally validated in transgenic reporter assays (Supplementary Table 10). Both colocalized with enhancer-classified PADREs on the UMAP (Fig. 4f,g and Extended Data Fig. 5c,d), demonstrating the utility of the method. DNA methylation analysis revealed CG-rich, promoter-associated PADREs persistently hypomethylated across stages, and less CG-dense enhancer-associated PADREs gradually hypermethylated during development before becoming hypomethylated in adult somatic tissue. Dynamically methylated PADREs varied in the onset and degree of hyper/hypomethylation: for example, conserved phylotypic enhancers11 commenced hypomethylation at the Prim-5 stage (Extended Data Fig. 5f).
Next, we assessed the evolutionary conservation of PADREs by overlapping with human conserved noncoding elements (CNEs) and calculating the phastCons score for each PADRE (Fig. 4h, top, and Extended Data Fig. 5b). Early-acting enhancers appear less conserved than those activated later (Fig. 4h, bottom left, and Extended Data Fig. 5e). phastCons scores of enhancers were higher on average than those of promoters (Fig. 4h, bottom right). Poised elements were the most conserved, suggesting that Polycomb-bound enhancers are a specific class critical for differentiation and organogenesis17,47, and contributing to the hourglass model of development48.
To assign cell-type specificity to PADREs we integrated them with Prim-5 single-cell ATAC-seq30 data (Extended Data Fig. 7a). The majority of anatomical annotation overlapped with transgenically confirmed enhancers and PADRE functional annotation (Extended Data Fig. 7b and Supplementary Table 11). UMAP (Fig. 4i, right) revealed remarkable differences between cell types, both within the same tissue and across tissues. PADREs active in neural precursors of the developing central nervous system showed a threefold increase of H3K27ac compared to those active in differentiating neurons, confirming previous observations about heterogeneity of cell-type population and chromatin dynamics in the developing central nervous system49,50. In contrast, PADREs active in muscle cells carried levels of H3K27ac and H3K4me1 comparable to neural precursors, but distinct accessibility profiles (Fig. 4i, bottom).
To understand the temporal dynamics of PADREs, we created a set of consensus PADREs (cPADREs), containing ~140,000 regions open in at least two neighboring stages (Supplementary Fig. 3a). We clustered nonpromoter cPADREs by chromatin accessibility into self-organizing maps (SOMs) (Extended Data Fig. 7c). Figure 5a (top) shows UMAP locations of 3 out of 16 SOM clusters, which demonstrate remarkable developmental chromatin changes, containing cPADREs active early and subsequently decommissioned (class 4), active from ZGA onwards (class 6) and late elements (class 14). Their chromatin profiles around ATAC-seq peaks were different, with only the early elements depleted of H3K27ac at their peak (Fig. 5a, bottom). With distinct chromatin and conservation profiles, early and late elements represent two separate classes of enhancers.
Finally, we explored the dynamics of PADREs without observable chromatin marks at any stage of development. 2,109 such regions were constitutively open throughout development (Supplementary Fig. 5a), which we termed constitutive orphan predicted elements (COPEs). They colocalized with constitutive SOM class 6 and 40% of them contained a CTCF motif (Figs. 5b, top, and 4e). In contrast, another nonmarked open chromatin set (11,044; termed dynamic orphan predicted elements; DOPES) was open only in specific developmental stages (Fig. 5b and Supplementary Fig. 5b). They were depleted of promoters, with only 65 (0.6%) overlapping CAGE promoters (Supplementary Table 12). Using data from ref. 24, we found that 2,513 DOPEs contained active chromatin marks later in adult tissues, but were open to the same extent as active enhancers already in the embryo (Supplementary Fig. 5c). As we are unaware of epigenetically orphaned accessible elements in the development, whose chromatin opening precedes or is uncoupled from enhancer-associated histone mark deposition, this may represent a discovery of a previously unknown subtype of primed enhancers.
To reveal any developmental promoter regulatory principles, we exploited the PADRE chromatin features to functionally classify CAGE-seq-defined active RNA polymerase II (Pol II) promoters. First, we characterized these promoters at Dome and Prim-5 stages on the basis of their chromatin accessibility at nucleosome resolution, revealing eight clusters (Fig. 6a, Supplementary Fig. 6a and Supplementary Table 13). We detected similar clusters in human embryonic stem cells (Supplementary Fig. 6b) indicating conservation of promoter chromatin architecture classes. The classes differed mostly in their upstream configuration, including the width of the nucleosome-free region (NFR), the signal strength of the central NFR and the presence of upstream open regions (Fig. 6a), which followed GC content (Supplementary Fig. 6c). The NFRs only differed in their amplitude between medium constitutive and weak open (Supplementary Fig. 6a), with the latter either reflecting reduced promoter activity or promoters active only in a subset of cells. The NFR variations were characterized by histone marker presence and patterns of upstream opposite strand transcription (for example, upstream offset) with distinct distances between the main TSS and flanking nucleosomes (for example, wide and strong open) and TSS profiles (Supplementary Fig. 6d). These classes showed notable differences in histone modification patterns (Fig. 6b), confirmed by the differing UMAP positions of promoter PADREs (Fig. 6c). Apart from weak open, each class produces antisense transcription (PROMPTs)51,52,53, including double NFR, wide and upstream offset classes, which showed CAGE expression from both the main NFR and another upstream region, with sense transcription being stronger than antisense (Fig. 6a). Notably, the architecture classes remain stable over developmental time (Fig. 6d and Supplementary Fig. 6e), suggesting they represent distinct regulation mechanisms acting on the genes rather than stage-dependent promoter activity states. Wide and strong open classes contained the most conserved promoters (Fig. 6e and Supplementary Fig. 6f), and were enriched in transcription regulator genes (Fig. 6g and Supplementary Fig. 6g). However, promoter classes showed distinct dynamic temporal expression (Fig. 6f) with notable enrichment of the double NFR class for maternally expressed genes, in contrast to the predominantly early and late zygotic weak open and medium zygotic classes, respectively. The promoter classes also showed distinct gene ontology (GO) enrichment categories (Fig. 6g). Overall, our approach offers a promoter architecture classification for zebrafish and indicates functional specialization and vertebrate conservation of promoter classes.
a, Heat map of chromatin accessibility profiles aligned to dominant TSS per promoter at the Prim-5 stage. Nucleosome-free regions (red) are superimposed with nucleosome positioning (blue). Stack height reflects number of promoters. Above each heat map, combined histograms of CAGE expression are shown. Black, forward TSSs; red, reverse orientation TSSs (the scale is amplified 10 in relation to forward transcription). Nucleosome positioning is symbolized above alignments and black arrows indicate transcription direction; size indicates relative strength. Promoter configuration classes are color-coded consistently in all panels (including Supplementary Fig. 6) b, Aggregated H3K4me1, H3K4me3 (MNase-digested), H3K27ac ChIPseq signals for classes as in a are aligned to dominant TSS. c, UMAP profiles of promoter classes at the Prim-5 stage. UMAPs are cropped to highlight promoter PADREs. d, Flow diagram indicates the relationship between promoter configuration class at the Dome stage (left edge, Supplementary Fig. 6) and the Prim-5 stage (right edge). Band width represents the number of promoters. e, Violin plot of phastCons vertebrate conservation distribution of promoters. Each class is aligned to a. f, Classification of promoter expression during development with SOMs. On the top right, 55 diagrams contain violin plots with stage-by-stage expression levels. Blue to red spectrum indicates maternal to zygotic expression dynamics of promoter clusters. Surface areas of gray circles indicate the number of promoters per cluster. Stages of development are symbolized below the SOM array. On the left, mustard: positive and green: negative color spectrum in SOMs indicates the enrichment in promoter overlap between promoter expression classes (SOMs) in each chromatin architecture class a. g, Enriched GO categories for each promoter architecture class.
Key genes regulating development are controlled by numerous long-range enhancers, which often overlap with highly conserved noncoding elements (HCNEs) within genomic regulatory blocks (GRBs)15, which also often contain other bystander genes that do not respond to those enhancers. The extent of GRBs coincides with those of topologically associating domains (TADs) around developmental genes54 (Fig. 7a). We exploited DANIO-CODE annotations to characterize chromatin opening and interaction topology in those poorly understood loci, and their regulatory role in TADs.
a, Schematic representation of GRBs. Basic components of a GRB. GRB enhancers (green) regulating the target genes span the entire length of the GRB (middle). Typical density pattern of conserved noncoding elements in a GRB, most of which overlap enhancers (top). Hi-C contact matrix within a GRB (bottom). b, Chromatin opening profiles through developmental stages along TADs. c, Genome browser view of a GRB TAD showing H3K27ac signals in the Dome and the Prim-5 stages, H3K27ac ensembles (black bar), CAGE promoters (black blocs) and nonpromoter PADREs (blue active in the Dome stage, red active in the Prim-5 stage and purple PADREs active in both stages). A zoomed-in genome browser view of an H3K27ac ensemble (top, left). d, Aggregate contact enrichment centered on ensembles at stages as indicated. e, TAD compartment score distribution. Positive scores represent A compartments, while negative ones represent B compartments. The comparison was done using two-sided two-sample unpaired Wilcoxon test. g, Heat maps of H3K27ac signal across GRB TADs containing ensembles through developmental stages. TADs are ordered by their width in descending order and fixed on the TAD center. h, CAGE expression patterns of selected gene classes separated by SOM, with the highest and lowest ratios in ensemble-associated genes. Bar plot on the right shows the proportion of ensemble-associated genes in each class. BGT and GST classes are marked on the heat map i, Gene expression pattern of GRB target and bystander genes. The left side bar shows an ensemble association for each gene. The right side bar shows the target or bystander assignment for each gene. Genes in TADs with and without ensembles are separated by a green line. BST and GST classes are indicated on the side. j, Graph showing mean expression and standard error of GRB target genes associated and not associated with early H3K27ac ensembles. k, A model describing the influence of an H3K27ac ensemble on expression of GRB target genes. If the H3K27ac ensemble is in contact with the target gene, it can be expressed early on.
We distinguished GRB TADs, characterized by a high density of extreme noncoding conservation, from non-GRB TADs. In the regions corresponding to late (Long-pec) embryo TADs, chromatin started opening at the boundaries as early as the Dome stage and remains open thereafter (Fig. 7b and Extended Data Fig. 8a). GRB TADs showed a strong increase in accessibility across the entire TAD, whereas in non-GRB TADs the increase was mild and occurred later (Fig. 7b). TADs started to form early but formed fully only at later developmental stages55,56 (Extended Data Fig. 8b). We found more promoter-proximal enhancers in early stages and more distal enhancers in late stages, (Extended Data Fig. 8c), in line with similar findings by contact analysis55.
When we estimated the activity of enhancer candidates by H3K27ac in TADs, we observed that such elements in late stages are numerous, short and distributed throughout the entire TAD length. In contrast, many fewer PADREs were active early at Dome stage, and they often occurred in clusters with uninterrupted H3K27ac signal connecting them (Extended Data Fig. 8d,e and Fig. 7c). We detected ~1,600 such clusters57, of which ~1,300 fell in TADs and were enriched in GRB TADs (Extended Data Fig. 8f). These clusters were reminiscent of super-enhancers57,58, although more numerous than 231 reported in mouse57 and 411 in zebrafish56. Given their unusual scale and early appearance before lineage determination (when previously reported super-enhancers appear), we distinguished them from super-enhancers and called them H3K27ac ensembles. We hypothesized that they might be associated with the lack of fully formed TADs in the early stages, when enhancers are used proximally to early active promoters. To test this, we investigated the relationship between the chromatin interactions and activity of H3K27ac ensemble-associated genes during early versus late embryogenesis.
We found that promoters were enriched at the boundaries of H3K27ac ensembles (Extended Data Fig. 8g) and that the ensembles contain most candidate enhancer PADREs detected in early stages (Extended Data Fig. 8h). In contrast, the PADREs active only later in development represented long-range enhancers, distributed across the entire TAD (Extended Data Fig. 8d), and not enriched in ensembles (Extended Data Fig. 8h). Moreover, H3K27ac present along the entire length of the ensemble became restricted to individual peaks associated with PADREs by Prim-5 (Fig. 7c, zoomed-in panel).
Consistent with an H3K27ac ensemble role in early gene regulation, we observed increased Hi-C contacts within them at Dome in both GRB and non-GRB TADs. By Prim-5, strong contacts spread throughout the entire TAD (Fig. 7d and Extended Data Fig. 9a). TADs with H3K27ac ensembles present at Dome belonged to the active A compartment at Prim-5 (Fig. 7e), arguing for a role for H3K27ac ensembles in the timely opening of chromatin in their host TADs. Indeed, in GRB TADs, the H3K27ac mark propagated from H3K27ac ensembles to fill the entire TAD in later stages (Fig. 7f).
To examine how H3K27ac ensembles influence gene expression, we classified promoters within TADs by expression dynamics using SOM (Extended Data Fig. 9b). H3K27ac ensemble-associated promoters mostly sequestered into clusters, with the highest expression in early post-ZGA stages (Fig. 7g). We termed the top two H3K27ac ensemble-associated classes as blastula-gastrula transition (BGT) and gastrula-segmentation transition (GST) on the basis of the peak expression time. The two major gene classes in GRB TADs were ubiquitously expressed (GRB bystanders) and late zygotic expressed (likely GRB target genes). However, in GRB TADs with an ensemble, we observed a BGT gene class, not present in GRB TADs without an ensemble, as well as more genes in the GST class. Both classes were enriched in ensemble-associated genes (Fig. 7h). Moreover, there was a clear trend of earlier expression in H3K27ac ensemble-associated GRB target genes, compared to other GRB target genes (Fig. 7i), suggesting that ensembles participated in the activation of early-acting developmental genes, including those later dependent on long-range regulation. Moreover, if the target gene is not in contact with the H3K27ac ensemble, it can only become expressed once long-range interactions are present (Fig. 7j).
Next, we investigated whether our annotation of noncoding elements could be exploited to predict functionally conserved cis-regulatory elements (CREs) among vertebrates. Existing comparative methods rely on direct alignments between species of interest59,60. However, the large evolutionary distance between fish and mammals limits the power of comparison, due to loss of noncoding sequence similarity. We developed a method to predict functional conservation across large evolutionary distances and genomic scales independent of direct sequence alignment, exploiting the fact that functional elements often maintain collinear syntenic positions, while their spacing scales with genome size, particularly in GRB TADs15,54,61,62. We selected 13 high-quality bridging species reference genomes and using stepped pairwise sequence alignment (Extended Data Fig. 10 and Methods), which allowed us to map coordinates between genomes of varying sizes, identified reference points (multispecies anchors; Fig. 8a) between genomes and enabling identification of syntenic regions through interpolation of relative syntenic positions between anchor points.
a, Cross-species comparison of the irx3/5(a) TAD between zebrafish and mouse and a zoom-in on the locus around irx3(a). Connecting lines represent projections of bin centers from zebrafish to mouse. b, Distribution of distances from the bin centers (n=528,830) to their closest anchors in zebrafish (blue), and from their projections to their closest anchors in mouse (red), using the direct and the multispecies projection approach. c, Epigenetic comparison of the irx3/5(a) TAD. H3K27me3 overlap in mapped regions is indicated as colored bars (yellow, mutually enriched; blue, zebrafish specific; red, mouse specific; Methods). Opacity reflects signal amplitude and is proportional to the maximum H3K27me3 signal in both species. d, H3K27me3 overlap profiles for four selected GRB TADs. TAD boundaries are indicated with square brackets. e, H3K27me3 overlap profiles of all GRB TADs. TADs are ordered by their relative amount of shared signal. Bins are ordered by the amount of shared signal: bins with shared signal appear in the middle, bins with zebrafish- and mouse-specific signals are left and right, respectively. A view of the TADs with their genomic bin order is given in Extended Data Fig. 10d,f. Classification of zebrafish ATAC-seq peaks in the irx3a locus into DC, IC and NC on the basis of overlaps with direct anchors, multispecies anchors and mouse DNase-seq peak projections (Methods). g, Distribution of DNase-seq signal in the mouse genome around the projected regions of the zebrafish ATAC-seq peaks (n=140,633). Asterisks above the bars indicate the effect size category based on Cohens d: very small (not indicated), small (*), medium (**), large (***), very large (****). h, Cross-species comparison of ChromHMM functional states. i, Cumulative distribution of shared motifs in mouse DNase-seq peaks overlapping zebrafish ATAC-seq peaks. j, H3K27ac enrichment (signal 80th percentile) within (n=11,083) and outside (n=93,020) of enhancer ensembles (P<2.21016, Fishers exact test). k, Cross-species comparison of H3K27ac profile around an H3K27ac ensemble neighboring the zebrafish aktip gene.
We then compared zebrafish and mouse GRB TADs, which differ in size approximately twofold (Extended Data Fig. 10a). We defined GRB TADs as the 1,000 TADs with the highest CNE density, split them into 1-kilobase (kb) bins, and mapped the bin centers from zebrafish to mouse. Using our multispecies approach over direct alignment reduced distances from the bin centers to their closest anchor by a factor of 16 in zebrafish and 29 in mouse (Fig. 8b).
We asked whether this method could discover conserved epigenomic subdomains by comparing epigenomic feature distribution across genomes. We used H3K27me3 ChIPseq data from phylotypic stages in zebrafish (Prim-5) and mouse (E10.5; Methods). H3K27me3 coordinates from zebrafish were projected onto the mouse genome, recovering mouse H3K27me3 features in the corresponding region. An example at the irx3a locus (Fig. 8c) shows H3K27me3 enrichment correlates between zebrafish and mouse, even in the absence of direct sequence conservation. On a genome-wide level, H3K27me3 enrichment is substantially more likely to be shared between zebrafish and mouse for both directly alignable and nonalignable genomic regions (Extended Data Fig. 10e), suggesting epigenomic subdomains and functional elements can be conserved in location and span. We see more GRB TADs showing regions of strong similarity in H3K27me3 extent, while others, such as TADs containing her9 or celf5a, show more zebrafish- or mouse-specific signal enrichment, and still others show little enrichment (Fig. 8d,e).
We next looked at conservation of functional elements marked by open chromatin. We classified zebrafish ATAC-seq peaks in the GRB TADs as directly conserved (DC) if they fall in a region of direct sequence alignment with mouse (16,188 elements, 11.5 %), indirectly conserved (IC) if they do not directly align (6,137 elements, 4.4%) but were alignable through bridging species and nonconserved (NC) for all other peaks (for example, irx3a in Fig. 8f). Notably, DC and IC elements shared regulatory features with their matched counterparts in mouse, including DNase hypersensitivity and ChromHMM feature classification, compared to NC elements (Fig. 8g,h). DC and IC regions were also more likely to share TF binding site (TFBS) motifs compared to nonoverlapping, randomly sampled mouse DNase-seq peaks within and across TAD boundaries (cis and trans in Fig. 8i and Supplementary Table 14). These results suggest a similar level of functional conservation of DC and IC elements, even though IC elements lack direct alignability. Next, we tested whether the early developmental H3K27ac ensembles detected in zebrafish embryos (Fig. 7) are conserved in mouse using our anchoring-based approach. As shown in Fig. 8j,k, H3K27ac signal in mouse was substantially enriched in zebrafish ensembles, suggesting these ensembles are evolutionarily conserved epigenetic subdomains in vertebrates. Genes associated with these conserved ensembles are listed in Supplementary Table 15. Our comparative epigenomic approach has maximized the identification of putative functional elements and epigenetic subdomains conserved between zebrafish and mouse, and highlights the utility of the DANIO-CODE annotations for discovery of vertebrate-conserved mechanisms.
Read more from the original source:
Multiomic atlas with functional stratification and developmental dynamics of zebrafish cis-regulatory elements - Nature.com