RENGE infers gene regulatory networks using time-series single-cell RNA-seq data with CRISPR perturbations … – Nature.com
Cell culture
Human iPSC line, OILG-3, was obtained from the Wellcome Sanger Institute and cultured in StemFlex medium (Thermo Fisher) on Vitronectin (Thermo Fisher)-coated culture dishes. Cells were detached using TrypLE (Thermo Fisher) and re-seeded at 4104 cells per well into 6-well plates for routine maintenance. For the first 24h after passaging, cells were treated with 10M Y-27632 (Wako). SpCas9-expressing OILG cells were generated as previously described36.
Selected gRNAs (Supplementary Table1) were cloned into pKLV2-U6gRNA5(BbsI)-PGKpuroBFP-W. Lentivirus was produced individually by transfecting 293FT cells together with lentiviral packaging plasmids, psPAX2 and pMD2.G using LipofectamineLTX37. The resulting viral supernatants were then pooled in an equal volume ratio. OILG-Cas9 (1.56105) cells were transduced with the pooled lentivirus at 89% transduction efficiency and maintained until harvesting without passaging. On days 2, 3, 4, and 5 after transduction, 8104 BFP+ cells were collected using an MA900 cell sorter (Sony), then resuspended at 1106 cells/mL in 0.05% BSA in PBS. These cells were then subjected to 5 scRNA-seq library preparation using a Chromium Next GEM Single Cell 5 Library & Gel Bead Kit following the manufacturers protocol with minor modifications to simultaneously capture guide RNA molecules. Briefly, a spike-in oligo (5-AAGCAGTGGTATCAACGCAGAGTACCAAGTTGATAACGGACTAGCC-3) was added to the reverse transcription reaction. The small DNA fraction isolated after cDNA clean-up was then used to generate a gRNA sequencing library with the primers listed in Supplementary Table2. PCR was performed using 2KAPA Hi-Fi Master Mix with the following program: 95C for 3 min, 12 cycles of 98C for 15 sec and 65C for 10 sec, followed by 72C for 1 min. The resulting gene expression libraries and gRNA libraries were pooled at a molecular ratio of 7:1 and sequenced using NovaSeq with 26 cycles for read 1, 91 cycles for read 2, and 8 cycles for the sample index.
A digital expression matrix with gRNA assignment was obtained using the CRISPR Guide Capture Analysis pipeline of Cell Ranger 5.0.0 (10x Genomics). The generated expression matrix was processed using Seurat (version 4.0.3)38. Single cells were filtered to leave cells with>200 and<10000 expressed genes and<20% reads from mitochondrial genes. The expressions were normalized using the sctransform method of Seurat. Only cells bearing a single gRNA were used for downstream analysis.
We investigated GRNs whose nodes were TFs only. Below, we adopt a 1-origin indexing system for all vectors and matrices. Consider a model that represents the propagation of the KO effect from the KO gene g on the GRN. Let G denote the number of genes included in the GRN. The G-dimensional gene expression vector ({{{{{{{{bf{E}}}}}}}}}_{g,{K}^{{prime} }}^{{prime} }) of a cell including the up to ({K}^{{prime} })-th order regulatory effect from the KO gene g is modeled as follows:
$$begin{array}{r}{{{{{{{{bf{E}}}}}}}}}_{g,{K}^{{prime} }}^{{prime} }=mathop{sum }limits_{{k}^{{prime} }=1}^{{K}^{{prime} }}{left({{{{{{{{bf{M}}}}}}}}}_{g}odot {{{{{{{bf{A}}}}}}}}right)}^{{k}^{{prime} }}{{{{{{{{bf{X}}}}}}}}}_{g}+{{{{{{{{bf{b}}}}}}}}}_{{K}^{{prime} }},end{array}$$
(3)
where Xg is a G-dimensional vector of which gth component is the expression change of gene g due to its KO, and the other components are zero. When the cell is the wild type, i.e. no gene is knocked out (g=0), X0 is a zero vector. ({{{{{{{{bf{b}}}}}}}}}_{{K}^{{prime} }}) is the G-dimensional expression vector corresponding to the wild type. A is a GG matrix and Ai,j(ij) represents the strength of regulation from gene j to i; that is, the change in gene i expression due to a unit amount change in gene j expression. Ai,j(i=j) represents effects such as degradation and self-regulation (Supplementary Note1).denotes an element-wise product. Eq. (3) is an extension of Eq. (1) with a mask matrix Mg representing that the KO gene g is no longer regulated by other genes:
$${{{{{{{{{{bf{M}}}}}}}}}_{g}}}_{i,j}=left{begin{array}{ll}0quad &(i=g)\ 1quad &(i,ne ,g).end{array}right.$$
(4)
Thus, (mathop{sum }nolimits_{{k}^{{prime} } = 1}^{{K}^{{prime} }}{({{{{{{{{bf{M}}}}}}}}}_{g}odot {{{{{{{bf{A}}}}}}}})}^{{k}^{{prime} }}{{{{{{{{bf{X}}}}}}}}}_{g}) represents the expression change from the wild type due to gene KO.
From the scCRISPR analysis, we obtained the G-dimensional gene expression vector Ec,t in cell c sampled at time t and G-dimensional vector Xc,t representing the decrease in expression of the KO gene in the cell (t=1,,T,c=1,,Ct). Here, T is the number of time points, and Ct is the number of cells sampled at time t. Note that here, in contrast to Eq. (2) in the Results section, the subscript of E have been changed from g,t to c,t. The KO gene in cell c sampled at time t is identified by the presence of gRNA and denoted by gc,t. The calculation of Xc,t from gc,t will be explained in a later section.
Suppose we have the gene expression data ({{{{{{{{bf{E}}}}}}}}}_{g,{K}^{{prime} }}^{{prime} }; ({K}^{{prime} }=1,cdots ,,max_{K}^{{prime} })), in which the effects of different maximum orders of ({K}^{{prime} }) regulation appear, we can infer the GRN A by fitting Eq. (3) to the data. However, it is impossible to synchronize the sampling time t of the cells and the time at which the effects appear for up to the ({K}^{{prime} })-th order of regulation from the KO gene. Hence, the maximum order of regulation from the KO gene in the cells at sampling time t is unknown. Thus, RENGE estimates the value from the data. By introducing a term w(t,k,gc,t) representing the strength of the effect of the k-th order of regulation at time t when the gene gc,t is knocked out, we can express Eq. (3) as follows:
$${{{{{{{{bf{E}}}}}}}}}_{c,t}=mathop{sum }limits_{k=1}^{K}w(t,k,{g}_{c,t}){({{{{{{{{bf{M}}}}}}}}}_{c,t}odot {{{{{{{bf{A}}}}}}}})}^{k}{{{{{{{{bf{X}}}}}}}}}_{c,t}+{{{{{{{{bf{b}}}}}}}}}_{t}$$
(5)
$$w(t,k,{g}_{c,t})=frac{1}{1+{exp }^{-({alpha }_{{g}_{c,t}}+beta t-gamma k)}},$$
(6)
where w(t,k,gc,t) is assumed to be monotonically increasing with respect to t and monotonically decreasing with respect to k, thus, as time progresses, the effects of higher-order regulation become more apparent. ({alpha }_{{g}_{c,t}},,beta ,,gamma) are the parameters to be estimated, and 0,0. The parameter ({alpha }_{{g}_{c,t}}) represents the time required for the effect of the KO of gene gc,t to appear and is assumed to differ with each KO gene. is related to a rate constant at which the regulation step progresses with respect to time t, and is a parameter representing the degree of decrease in the effect of higher-order regulation. Mc,t is obtained by replacing the subscripts of the mask matrix in Eq. (4) with the relation g=gc,t. The parameters to estimate are ({{{{{{{bf{A}}}}}}}},,{{{{{{{{bf{b}}}}}}}}}_{t}; (t=1,cdots ,,T),,{alpha }_{{g}_{c,t}} ({g}_{c,t}=1,cdots ,,{G}_{ko}),,beta ,,gamma), where Gko is the number of KO genes.
The parameters are estimated by minimizing the following objective function:
$$L = mathop{sum}limits_{t=1}^T mathop{sum}limits_{c=1}^{C_t} left| {{{{{mathbf{m}}}}}}_{c,t} odot left[{{{{{mathbf{E}}}}}}_{c,t}{-}left{mathop{sum}limits_{k=1}^K w(t,k,,g_{c,t}) ({{{{{mathbf{M}}}}}}_{c,t} odot {{{{{mathbf{A}}}}}} )^k {{{{{mathbf{X}}}}}}_{c,t} + {{{{{mathbf{b}}}}}}_t right}right]right|_{2}^{2} \ + lambda_1 mathop{sum}limits_{i,j=1}^G left|left{{{{{{mathbf{A}}}}}}right}_{i, j}right| + lambda_2 mathop{sum}limits_{k=1}^K mathop{sum}limits_{i, j=1}^G left{{{{{{mathbf{A}}}}}}^kright}_{i,j}^2,$$
(7)
where {A}i,j denotes the i,j element of the matrix A,denotes the element-wise product, and mc,t is the mask vector for cell c at time t:
$${{{{{{{{{{bf{m}}}}}}}}}_{c,t}}}_{i}=left{begin{array}{ll}0quad &(i={g}_{c,t})\ 1quad &(i,ne ,{g}_{c,t})end{array}right..$$
(8)
The first term in Eq. (7) is the squared error between the predictions of the model and the data. mc,t is used to ignore the squared error of KO gene gc,t expression in cell c at time t because mRNA of KO gene gc,t may still be expressed even when the functional protein is lost when using the CRISPR system. The last two terms in Eq. (7) are the L1 and L2 regularization terms of the parameter A, respectively. To suppress the magnitude of each element of not only A but also Ak(k2), an L2 regularization term was added for Ak(k=1,K). Note that the L1 regularization term was only added for A and not for Ak(k2) because A represents a GRN and thus is expected to be sparse, but Ak(k2) is not necessarily sparse. The objective function is minimized using the L-BFGS-B method implemented in scipy.minimize. K,1,2 are hyperparameters that are set to values that minimize cross-validation loss using Bayesian optimization with Optuna39.
One of the RENGE inputs, Xc,t, is a G-dimensional vector representing the decrease in expression of the target gene due to its KO in cell c at time t. Here, we assumed that when the target gene is entirely knocked out, the gene expression is decreased to zero. That is, the decrease in expression equals the average expression in control cells. However, in scCRISPR analysis, the target gene is not necessarily knocked out even in cells where the corresponding gRNA is detected. It is therefore necessary to distinguish between cells in which the transcriptome is affected by the KO and cells in which the KO fails and thus the transcriptome is not affected. RENGE uses the concept of perturbation probability, defined as the probability that gRNA detected in a cell has an effect on the transcriptome. RENGE calculates the perturbation probability pc(c=1,,C) for each cell c in the same way as MIMOSCA13, where C is the total number of cells.
Xc,t is defined as the decreased expression of the KO gene gc,t multiplied by pc:
$${{{{{{{{bf{X}}}}}}}}}_{c,t,i}=left{begin{array}{ll}-{p}_{c}cdot frac{1}{{C}_{t}^{ctrl}}mathop{sum }limits_{j = 1}^{{C}_{t}^{ctrl}}{{{{{{{{bf{E}}}}}}}}}_{j,t,i}^{ctrl}quad &(i={g}_{c,t})\ 0quad &(i,ne ,{g}_{c,t}),end{array}right.$$
(9)
where ({C}_{t}^{ctrl}) is the number of control cells at time t and ({{{{{{{{bf{E}}}}}}}}}_{j,t,i}^{ctrl}) is the expression of gene i in control cell j at time t.
RENGE calculates the p-value for each element of the matrix A, which indicates the strength of regulation, using the bootstrap method as follows. Let the data set be denoted by ({{{{{{{bf{D}}}}}}}}=mathop{bigcup }nolimits_{t = 1}^{4}({{{{{{{{bf{X}}}}}}}}}_{t},{{{{{{{{bf{E}}}}}}}}}_{t})). The bootstrap data set D1,,DN is created by sampling cells with replacement, keeping the number of cells for each KO gene at each time point (N=30 by default). For each Dl(l=1,,N), apply RENGE and estimate Al. Given Al(l=1,,N), calculate the sample variance Var({A}i,j)(i,j=1,,G) of {A}i,j. Assuming the null distribution of {A}i,j is ({{{mathcal{N}}}}(0, Var({{{{{{{{{bf{A}}}}}}}}}}_{i,j}))), RENGE calculates the p-value pi,j of {A}i,j as follows:
$${p}_{i,j}=left{begin{array}{ll}2left(1-{Phi }^{-1}right.({{{{{{{{{bf{A}}}}}}}}}}_{i,j}/Var({{{{{{{{{bf{A}}}}}}}}}}_{i,j}))quad &({{{{{{{{{bf{A}}}}}}}}}}_{i,j},ge, 0)\ 2left({Phi }^{-1}right.({{{{{{{{{bf{A}}}}}}}}}}_{i,j}/Var({{{{{{{{{bf{A}}}}}}}}}}_{i,j}))quad &({{{{{{{{{bf{A}}}}}}}}}}_{i,j}, < ,0),end{array}right.$$
(10)
where is the cumulative distribution function of the standard normal distribution. The q-value is then calculated using the Benjamini-Hochberg procedure to control for multiple hypothesis testing. Since RENGE cannot infer self-regulation, all downstream analyses, including method comparison and network analysis, were performed by excluding self-regulation.
The following existing methods were compared with RENGE: GENIE39, dynGENIE340, BINGO32, MIMOSCA13, and scMAGeCK16. GENIE3 predicts the expression of a gene from that of other genes using a tree-based ensemble. The importance of one gene for the prediction of another indicates the strength of the interaction between the genes. Although it exhibited superior performance in the benchmark of GRN inference from scRNA-seq data11, GENIE3 cannot handle information on KO genes or time series data. In this study, one cell was treated as one sample, and time information was ignored. In each cell, the expression of the target KO gene was set to 0 regardless of its measured mRNA expression.
dynGENIE3 is a modified version of GENIE3 that is appropriate for time-series data; however, it cannot handle KO gene information. In this study, at each time point, the expression of each cell for each KO gene was averaged to produce a time series data set of (number of KO genes +1). In each time-series data set, the expression of the KO gene was set to 0.
BINGO is a method used to infer GRNs from time-series expression data by modeling gene expression dynamics with stochastic differential equations involving nonlinear gene-gene interactions. It can also handle KO information. BINGO takes two types of input data, time-series expression data (as data.ts) and KO gene data (as data.ko). The time-series data was constructed in the same way as for dynGENIE3, and KO gene data was constructed based on gRNA assignment.
MIMOSCA was developed for scCRISPR-screening data, and performs a linear regression of expression data using the gRNA detected in each cell and other information as covariates. This method can handle the index of the time point from which each cell is derived as a covariate, but not the time-series information. In this study, we used MIMOSCA by setting gRNA and the index of timepoint as covariates.
scMAGeCK includes scMAGeCK-LR and scMAGeCK-RRA, both GRN inference methods for the scCRISPR-screening data. scMAGeCK-LR performs linear regression similar to MIMOSCA. scMAGeCK-RRA uses Robust Rank Aggregation (RRA) to detect genes with expression changes in each KO. However, it cannot handle time information, so we applied scMAGeCK by ignoring the time information of each cell.
Recently, SCEPTRE41 and Normalisr42 were shown to improve the inference of associations between perturbations and gene expression in scCRISPR analysis. However, since these methods were developed for the high multiplicity-of-infection (MOI) scCRISPR analysis data, they were not examined in this study, which used low MOI data.
To benchmark the methods, simulated data were generated using dyngen, a GRN-based simulator of scRNA-seq data. A total of 750 GRNs, consisting of 100 genes, were generated by setting num_tfs=100. In detail, 250 GRNs were generated for each of the three backbones (linear, converging, and bifurcating conversing) defined in dyngen. We used the backbones with only one steady state because they are cases similar to the real data of hiPS cells we obtained in this study.
The ground-truth GRNs were used for the simulation by dyngen. Initially, the simulation was run without KO for simtime_from_backbone(backbone) time to obtain a steady state for each backbone. Subsequently, a gene was knocked out, and the simulation was run for 100 steps from the steady state. After the KO, a total of 100 cells were sampled at four time points in regular intervals. The parameter values used in dyngen are presented in Supplementary Table6.
We ran the simulation knocking out each of the 100 genes in each GRN and obtained expression data of 100 genes sampled from 100 cells under 100 KOs. Note that here we performed a single-gene KO multiple times. For each GRN, the expression data subset was constructed by extracting the cells corresponding to the KO genes included in the randomly selected set M of genes. For each backbone, the 250 GRNs were divided into 5 sets, each of which included 50 GRNs. GRNs in each set have a different size M(M=20,40,60,80,100). The ratio of KO genes for each data set is (frac{| M| }{100}). We found that in some GRNs of bifurcating converging backbone, single-gene KO does not cause substantial expression variation, possibly due to the GRN structure (Supplementary Fig.3). The amount of expression variation caused by single-gene KO (MIMOSCA score) was calculated using the GGko matrix calculated by MIMOSCA as follows:
$${{{{{rm{MIMOSCA}}}}}}_{{{{{rm{score}}}}}},=frac{{sum }_{i,j}| {{{{{{{{{boldsymbol{beta }}}}}}}}}}_{i,j}| }{{G}_{ko}}.$$
(11)
Since RENGE assumes that single-gene KO causes a substantial amount of expression variation, we excluded GRNs with MIMOSCA_score<2. Consequently, we used 248 GRNs for linear backbones, 233 GRNs for converging backbones, and 133 GRNs for bifurcating converging backbones, resulting in a total of 614 GRNs. To normalize the count data generated by dyngen and stabilize variance, we applied sctransform of Seurat38. The resulting data were used to infer GRNs by each method. The results for all the 750 GRNs are shown in Supplementary Fig.2.
To evaluate the agreement between the inferred GRN and the ground-truth GRN, we first calculated the agreement of the presence and absence of regulation using the AUPRC ratio, while ignoring the sign of the regulation. AUPRC is a common metric that measures the agreement between the inferred and ground-truth GRNs. The AUPRC ratio is the AUPRC divided by that of a random predictor, and it was averaged for all GRNs and M KO gene sets for each KO gene ratio. The AUPRC ratio for each of the positive and negative regulations was then calculated as follows: for positive regulations the confidence level of regulation was set to 0 if it was negative, and only positive regulations were considered; negative regulation was similarly calculated.
We selected the genes to be included in the GRN of hiPSCs as follows. Let d2 be the coefficient matrix obtained by applying MIMOSCA to the day 2 cell population. ({{{{{{{{{{boldsymbol{beta }}}}}}}}}_{d2}}}_{i,j}) represents the expression variation of gene i when gene j is knocked out. The expression variation score vi of gene i was defined as ({v}_{i}={sum }_{j}| {{{{{{{{{{boldsymbol{beta }}}}}}}}}_{d2}}}_{i,j}|), and the top 80 non-KO genes with large vi were selected. A total of 103 genes with 80 non-KO genes and 23 KO genes constituted the node set for the focal system in this study.
The ChIP-Atlas, a database for ChIP-seq data, was used to validate the GRN inferred from the hiPSC data. ChIP-seq data for 19 genes from human pluripotent stem cells was obtained. We used cell types included in the cell-type class Pluripotent stem cell defined in the ChIP-Atlas that did not contain derived in the cell type name. Note that the data labeled as ChIP-seq data for RUNX1T1 in ChIP-Atlas was excluded because it was actually ChIP-seq data for RUNX1-ETO. The 19 genes with ChIP-seq data consisted of 9 KO genes and 10 non-KO genes (Supplementary Table3). The confidence level for the binding of a TF to DNA is expressed as (-10 times {log }_{10},({{mbox{MACS2}}}; q{{mbox{-value}}})). If the confidence level of the binding of gene j to gene i in the region of TSS10kb was higher than the predetermined ChIP threshold, we assumed that regulation occurred from gene j to gene i. This means that the ground-truth network depends on the ChIP threshold; the higher the ChIP threshold, the more reliable the regulations in the ground-truth network. We calculated the AUPRC ratio for the ground-truth GRNs of various confidence levels changing the ChIP threshold from 0 to the maximum confidence value in the data.
The rank correlation coefficient between the confidence level of each regulation was calculated using each method and the confidence level of the ChIP-seq data ((-10 times {log }_{10},({{mbox{MACS2}}}; q{{mbox{-value}}}))). For RENGE, MIMOSCA, and scMAGeCK, we used (-{log }_{10}(q,{{mbox{-value}}},)) as the confidence level, and for GENIE3, dynGENIE3, and BINGO, we used the output value of each tool itself (confidence values or weights).
We examined the details of the inferred regulations for each method by comparing it with the ground-truth network with the ChIP threshold=300. There were 237 regulations, the same number that was observed in the ground-truth network, that were extracted for the GRNs inferred by each method, in order of confidence score of the regulation. These regulations were classified as follows. Suppose the regulation from gene j to gene i was inferred. If the length k of the shortest path from gene j to gene i in the ground-truth network was 1, it was classified as direct; while if k>1, it was classified as indirect. If there was no path from gene j to gene i, it was classified as no path.
Having inferred the GRN of 103 genes by RENGE, we focused on regulation with FDR<0.01 and calculated the out-degree for each gene which is shown in Fig.5b. Using this GRN, we validated our hypothesis that gene pairs with a similar set of target genes are likely to form a proteincomplex. Using the regulatory coefficient matrix A estimated by RENGE, the regulatory correlation coefficients were calculated for all gene pairs in the network as follows:
$$R={co{r}_{sp}({{{{{{{{bf{A}}}}}}}}}_{:,i},{{{{{{{{bf{A}}}}}}}}}_{:,j})| 1le i,,jle G},$$
(12)
where A:,i denotes the i-th column of A and corsp(x,y) denotes the Spearmansrank correlation coefficient between x and y. If corsp(A:,i,A:,j) is close to 1, gene i and gene j regulate the same genes in the same direction, and if close to -1, they regulate the same genes in the opposite direction.
We compared the regulatory correlation with the protein complex data from the three databases. First, curated complexes were obtained from the CORUM3.0 database. We used all complexes in which at least 66% of their component genes were included in the 103 genes in the GRN15. When a gene pair was included in the same complex, the gene pair was assigned to be in the CORUM complex. Second, protein-protein interaction scores were obtained from the v11.5 of STRING (9606.protein.physical.links.v11.5.txt.gz). The protein-protein interaction scores for gene i and gene j are denoted as PPIi,j. Among the gene pairs in R, those with PPIi,j=0 were assigned STRING score low, and those with the top 10% of PPIi,j among gene pairs with PPIi,j>0 were assigned STRING score high. Third, colocalization scores for the DNA binding of TFs were obtained from the ChIP-Atlas, using data for the cell type class of pluripotent stem cells.
Let ({D}_{S}=mathop{bigcup }nolimits_{t = 1}^{4}({{{{{{{{bf{X}}}}}}}}}_{S,t},{{{{{{{{bf{E}}}}}}}}}_{S,t})) be a data set containing control cells and cells in which genes in the gene set S are knocked out, and O={1,,23} be the indices of the genes knocked out in the hiPSC data. We trained the RENGE model using the dataset DO{j} excluding cells in which the gene j(j=1,,23) was knocked out. The trained RENGE model was then used to predict the expression changes of the other genes when gene j was knocked out. We calculated the Pearson correlation coefficient between the predicted and measured expression changes for the gene j KO using D{j}.
All the underlying statistical details were provided earlier in the Methods section.
Further information on research design is available in theNature Portfolio Reporting Summary linked to this article.