# How to convert enrichment/depletion to frequency for comparing deep sequencing to sequence profile?

I have two datasets, from different sources, that I need to compare.

The first set is deep sequencing results of a directed evolution experiment, where I have the naive library and selected library counts, and have calculated enrichment/depletion (positive and negative values with no upper or lower bound).

The second set is a set of protein sequences for which I calculate amino acid frequencies (positive values from 0-1).

The goal is to calculate a similarity between the two datasets. Typically I have two of the second type of set (protein sequences) and I calculate similarity based on the amino acid frequencies… What's the best way to convert enrichment/depletion to frequency so I can compare?

Example deep sequencing data, for position 77 of the protein:

$$\text{enrichment} = \log_2\left(\frac{F_S}{F_N}\right)$$

where $$F_S$$ is the frequency in the selected library and $$F_N$$ is the frequency in the naïve library.

I came up with a possible solution for frequency equivalent from enrichment ($$F_E$$) but am open to thoughts if it's good or not:

$$F_E = \frac{\displaystyle\frac{F_S}{F_N}}{\displaystyle\sum_\text{amino acid}\frac{F_S}{F_N}}$$
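As a rough illustration of the proposed conversion, the sketch below computes the log2 enrichment and the normalized $$F_E$$ for one position from raw counts. The function names, the pseudocount, and the example counts are all assumptions made for illustration; they do not come from the question. Because $$F_S/F_N = 2^{\text{enrichment}}$$, the same $$F_E$$ can also be obtained directly from the enrichment values, which the second function shows.

```python
import math


def enrichment_and_frequency(selected_counts, naive_counts, pseudocount=0.5):
    """Return (log2 enrichment, F_E) per residue for a single position.

    The pseudocount is an assumption added here so that residues with zero
    naive counts do not cause a division by zero.
    """
    residues = sorted(set(selected_counts) | set(naive_counts))
    sel_total = sum(selected_counts.get(r, 0) + pseudocount for r in residues)
    nai_total = sum(naive_counts.get(r, 0) + pseudocount for r in residues)

    ratios = {}
    for r in residues:
        f_s = (selected_counts.get(r, 0) + pseudocount) / sel_total
        f_n = (naive_counts.get(r, 0) + pseudocount) / nai_total
        ratios[r] = f_s / f_n

    ratio_sum = sum(ratios.values())
    enrichment = {r: math.log2(v) for r, v in ratios.items()}
    f_e = {r: v / ratio_sum for r, v in ratios.items()}
    return enrichment, f_e


def frequency_from_enrichment(enrichment):
    """Recover F_E from log2 enrichment values alone: 2^E / sum(2^E)."""
    powers = {r: 2.0 ** e for r, e in enrichment.items()}
    total = sum(powers.values())
    return {r: p / total for r, p in powers.items()}


# Hypothetical counts for one position (not real data).
selected = {"A": 30, "G": 10}
naive = {"A": 20, "G": 20}
enrichment, f_e = enrichment_and_frequency(selected, naive, pseudocount=0)
```

Note that $$F_E$$ defined this way sums to 1 across amino acids and removes the naïve-library bias, but residues with very low naïve counts can dominate the sum, so some smoothing is probably needed on real data.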

The question is a little unclear in places, but as I understand it you are trying to compare the relative amino acid enrichments in the two datasets.

As far as I know, you could reconstruct the protein sequences from the directed evolution experiment (presumably this is time-series data; please clarify) and build a multiple sequence alignment (MSA) from them. Reconstructing the sequences involves some technical steps that depend on the type of deep sequencing data you have; factors such as read length, protein length and coverage need to be taken into consideration.

You could similarly build an MSA for the second dataset.

Then, using a tool such as Rate4Site (https://www.ncbi.nlm.nih.gov/m/pubmed/12169533/), you can estimate per-site evolutionary rates from each MSA and compare the two datasets by correlating those per-site rates.

If the correlation is high, the enrichments in both datasets are similar, otherwise not.
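The correlation step could be sketched as follows, assuming the per-site scores for both datasets have already been computed and aligned to the same positions. The answer does not say which correlation coefficient to use, so both Pearson and Spearman are shown; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))
```

Spearman is the more forgiving choice when the two datasets are on different scales, since it only compares the ordering of sites.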

## Efficient depletion of ribosomal RNA for RNA sequencing in planarians

The astounding regenerative abilities of planarian flatworms prompt steadily growing interest in examining their molecular foundation. Planarian regeneration was found to require hundreds of genes and is hence a complex process. Thus, RNA interference followed by transcriptome-wide gene expression analysis by RNA-seq is a popular technique to study the impact of any particular planarian gene on regeneration. Typically, the removal of ribosomal RNA (rRNA) is the first step of all RNA-seq library preparation protocols. To date, rRNA removal in planarians was primarily achieved by the enrichment of polyadenylated (poly(A)) transcripts. However, to better reflect transcriptome dynamics and to cover also non-poly(A) transcripts, a procedure for the targeted removal of rRNA in planarians is needed.

### Results

In this study, we describe a workflow for the efficient depletion of rRNA in the planarian model species S. mediterranea. Our protocol is based on subtractive hybridization using organism-specific probes. Importantly, the designed probes also deplete rRNA of other freshwater triclad families, a fact that considerably broadens the applicability of our protocol. We tested our approach on total RNA isolated from stem cells (termed neoblasts) of S. mediterranea and compared ribodepleted libraries with publicly available poly(A)-enriched ones. Overall, mRNA levels after ribodepletion were consistent with poly(A) libraries. However, ribodepleted libraries revealed higher transcript levels for transposable elements and histone mRNAs that remained underrepresented in poly(A) libraries. As neoblasts experience high transposon activity this suggests that ribodepleted libraries better reflect the transcriptional dynamics of planarian stem cells. Furthermore, the presented ribodepletion procedure was successfully expanded to the removal of ribosomal RNA from the gram-negative bacterium Salmonella typhimurium.

### Conclusions

The ribodepletion protocol presented here ensures the efficient rRNA removal from low input total planarian RNA, which can be further processed for RNA-seq applications. Resulting libraries contain less than 2% rRNA. Moreover, for a cost-effective and efficient removal of rRNA prior to sequencing applications our procedure might be adapted to any prokaryotic or eukaryotic species of choice.

## Introduction

DNA double-strand breaks (DSBs) are major DNA lesions that form in a variety of physiological conditions, such as transcription [1,2], meiosis [3] and VDJ recombination [4], and as a consequence of exposure to DNA-damaging agents and replication stress [5]. DSBs can also be induced in a controlled manner at specific sites in the genome using programmable nucleases, such as the CRISPR (clustered regularly interspaced short palindromic repeats)-associated RNA-guided endonucleases, Cas9 and Cpf1, which have greatly advanced genome editing. However, the potentially mutagenic off-target DNA cleavage activity of these nucleases represents an issue of major concern that needs to be thoroughly assessed before these enzymes can be safely used in the clinical setting [6]. Thus, developing methods that can accurately map the genome-wide location of endogenous as well as exogenous DSBs in different systems and conditions is not only essential to advance our understanding of DSB biology, but is also critical for successful translation of programmable nucleases from research tools into clinical applications.

In the past few years, several methods based on next-generation sequencing (NGS) have been developed to assess DSBs at genomic scale, including chromatin immunoprecipitation sequencing [7,8], direct in situ breaks labeling, enrichment on streptavidin and next-generation sequencing (BLESS) [9,10,11], genome-wide, unbiased identification of DSBs enabled by sequencing (GUIDE-seq) [12], in vitro Cas9-digested whole-genome sequencing (Digenome-seq) [13], integrase-defective lentiviral vector (IDLV)-mediated DNA break capture [14], high-throughput, genome-wide, translocation sequencing [15] and more recently End-Seq [16] and DSBCapture [17]. Although all of these methods represent important complementary tools to detect DSBs genome wide (Supplementary Table 1), they also have important drawbacks. For example, chromatin immunoprecipitation sequencing of DSB-sensing or repair proteins such as p53-binding protein 1 or the phosphorylated variant histone H2A.X (γH2A.X) does not label DSBs directly and is unable to identify DNA breakpoints with single-nucleotide resolution. GUIDE-seq, IDLV-mediated DNA break capture and high-throughput, genome-wide, translocation sequencing detect DSBs by quantifying the products of non-homologous end-joining repair, potentially missing DSBs that are repaired through other pathways. Furthermore, in vivo delivery of exogenous oligonucleotides in GUIDE-seq or viral cassettes in IDLV-mediated DNA break capture for evaluating DSBs in primary cells and intact tissues may be challenging. DSBs induced by programmable nucleases, such as CRISPR-associated RNA-guided Cas9 and Cpf1, can be evaluated in vitro using Digenome-seq, but this approach may not be representative of relevant nuclease concentrations and of cellular properties, such as chromatin environment and nuclear architecture, which might influence the frequency of DNA breaking and repair. Lastly, BLESS and the related methods End-Seq [16] and DSBCapture [17] require substantial amounts of input material (typically, in the order of millions of cells), are labour-intensive and are semi-quantitative due to lack of appropriate controls for PCR amplification biases, limiting their applications and scalability. Here we describe a method for breaks labeling in situ and sequencing (BLISS) that compared with other DSB mapping methods is more versatile, sensitive and quantitative. We demonstrate the broad applicability of BLISS for genome-wide detection of both endogenous and exogenous DSBs in low-input samples of cells and tissues, as well as for genome-wide profiling of on- and off-target DSBs introduced by Cas9 and Cpf1 nucleases.

## Results

### CRISPR-mediated knockout of wildtype p53 increases cell proliferation in a subset of cancer cell lines

In order to identify an ideal cell-based model system to profile p53 function we took advantage of publicly available data generated through Project Achilles [32]. Briefly, Project Achilles utilizes genome-scale CRISPR knockout screens to identify genetic dependencies across a large compendium of cancer cell lines. The effect of knocking out each individual gene during a CRISPR screen is reported as a gene-level ‘Enrichment Score’. These scores are calculated based on changes in the relative abundance of cells harboring sgRNAs targeting each respective gene over the course of a screen. Therefore, these ‘Enrichment Scores’ serve as a proxy for the impact of gene knockout on cell proliferation. We profiled p53 ‘Enrichment Scores’ across more than 350 cancer cell lines and found that p53 knockout had no effect on cell proliferation for many of the cell lines screened in Project Achilles. However, we were able to identify a subset of cell lines in which p53 knockout conferred a proliferative advantage (Fig. 1a).

Fig. 1. p53 knockout increases cell proliferation. (a) Distribution of p53 enrichment scores from pooled CRISPR knockout screens in 350 cancer cell lines. (b) p53 enrichment scores in a selected subset of cancer cell lines containing wildtype p53. (c) Western blot analysis of Cas9 expression in 769P cells. (d) Comparison of log2 fold changes (relative to pDNA) for all sgRNAs in CRISPR library between replicates. (e) Visualization of enrichment/depletion for sgRNAs targeting a selected subset of genes (red) compared to all sgRNAs in CRISPR library (black).

To identify molecular features associated with cell lines in which p53 knockout resulted in a proliferative advantage we intersected p53 ‘Enrichment Scores’ with data from the IARC (International Agency for Research on Cancer) TP53 database [33]. The IARC TP53 database is a curated resource for the mutation status of p53, along with several other known tumor suppressors and oncogenes, in human cell lines. Consistent with known p53 biology, we found that the proliferative advantage of p53 knockout was specific to cell lines harboring wildtype p53 (Fig. 1a, Additional file 1: Figure S1, Additional file 6: Table S1). In contrast, p53 knockout in cell lines containing mutations in the p53 gene, loss of p53 expression mutations, or p53 deletions had no significant impact on cell proliferation (Fig. 1a, Additional file 1: Figure S1, Additional file 6: Table S1). Collectively, these results indicate that cell proliferation can be used as a phenotype to screen p53 function in cell lines harboring wildtype copies of the gene.

To select a cell line for screening p53 function we first narrowed the list of cancer cell lines screened through Project Achilles down to those harboring wildtype p53. We then used data from the IARC TP53 database to further restrict this list to cell lines with no documented mutations in other known tumor suppressors or oncogenes (e.g. PTEN, KRAS, BRAF) (Additional file 6: Table S1). In total, we identified 8 cell lines that met our stringent criteria (Fig. 1b). The human renal adenocarcinoma cell line 769P displayed the highest p53 ‘Enrichment Score’ in the Project Achilles data and was selected as a model cell line for all subsequent experiments (Fig. 1b).

### Pooled CRISPR screen identifies p53-regulated genes that influence cell proliferation

To determine if a pooled CRISPR screen would be able to identify downstream targets of p53 that influence cell proliferation we designed a proliferation-based CRISPR screen. We generated a list of 330 genes that have p53 binding sites within 10 kb of their transcription start site and have been predicted to be directly regulated by p53 in a previous study [29]. We constructed a CRISPR library containing 4 sgRNAs targeting each gene in this list as well as 4 sgRNAs targeting p53 (Additional file 7: Table S2). As controls this CRISPR library included 70 sgRNAs targeting intergenic regions of the human genome and 70 sgRNAs with no genomic targets (Additional file 7: Table S2). We refer to this library throughout this report as our gene-targeting CRISPR library.

In order to perform CRISPR knockout (CRISPRko) screens we next generated a 769P-derived cell line expressing Cas9. We stably integrated Cas9 into a population of 769P cells using lentivirus and confirmed Cas9 expression by western blot (Fig. 1c). We then infected the Cas9-expressing 769P cells with our gene-targeting library at a multiplicity of infection (MOI) of 0.5 and a representation of 1000 cells per sgRNA. Library-infected cells were cultured for 21 days, genomic DNA was isolated, and targeted sequencing was performed to evaluate changes in sgRNA abundance relative to the CRISPR library pDNA (Additional file 8: Table S3).
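To give a feel for what an MOI of 0.5 at 1000 cells per sgRNA implies, the following back-of-the-envelope sketch estimates how many cells would need to be exposed to virus. It rests on assumptions not stated in the text: that the library size is the sum of the sgRNA groups described above (the actual count is in Table S2), that the representation refers to infected cells, and that integrations per cell follow a Poisson distribution. The function names are hypothetical.

```python
import math


def fraction_infected(moi):
    """Poisson estimate of the fraction of cells with at least one integration."""
    return 1.0 - math.exp(-moi)


def fraction_single_among_infected(moi):
    """Of the infected cells, the fraction carrying exactly one integration."""
    return (moi * math.exp(-moi)) / fraction_infected(moi)


def cells_to_infect(n_sgrnas, cells_per_sgrna=1000, moi=0.5):
    """Cells to expose to virus so that infected cells reach the target coverage."""
    infected_needed = n_sgrnas * cells_per_sgrna
    return math.ceil(infected_needed / fraction_infected(moi))


# Library size inferred from the description: 330 genes x 4 sgRNAs,
# 4 sgRNAs for p53, 70 intergenic controls and 70 non-targeting controls.
n_gene_library = 330 * 4 + 4 + 70 + 70

cells_needed = cells_to_infect(n_gene_library)
```

Under these assumptions roughly 40% of cells are infected at an MOI of 0.5, and most infected cells carry a single sgRNA, which is the usual reason for choosing a low MOI in pooled screens.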

To calculate changes in sgRNA abundance over the course of the screen we utilized MAGeCK, a computational tool for model-based analysis of pooled CRISPR screens [34]. Analysis with MAGeCK revealed a significant correlation in sgRNA enrichment/depletion across biological replicates indicating that our screening results are highly reproducible (Fig. 1d, Additional file 9: Table S4). Moreover, sgRNAs targeting p53 were among the most enriched in our screen, confirming the validity of our approach (Fig. 1e, Additional file 10: Table S5). In addition to p53 we identified several p53-regulated genes in which knockout resulted in a significant proliferative advantage (Fig. 1e, Additional file 10: Table S5). Interestingly, we also uncovered a subset of p53-regulated genes where knockout led to a proliferative disadvantage (Fig. 1e, Additional file 10: Table S5). These data demonstrate that proliferation-based CRISPR screens can be used to functionally profile downstream events in the p53 pathway.
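The quantity plotted in the figures, a log2 fold change relative to pDNA, can be illustrated with the simplified sketch below. This is not MAGeCK's statistical model; it only shows a depth-normalized fold change with a pseudocount, and the function name, the counts-per-million scaling and the pseudocount value are assumptions made for illustration.

```python
import math


def log2_fold_changes(final_counts, pdna_counts, pseudocount=1.0):
    """Per-sgRNA log2 fold change of the final sample relative to pDNA.

    Counts are scaled to counts per million so that samples sequenced to
    different depths are comparable before taking the ratio.
    """
    guides = sorted(set(final_counts) | set(pdna_counts))
    final_total = sum(final_counts.get(g, 0) for g in guides)
    pdna_total = sum(pdna_counts.get(g, 0) for g in guides)
    if final_total == 0 or pdna_total == 0:
        raise ValueError("both samples need at least one read")

    lfc = {}
    for g in guides:
        final_cpm = final_counts.get(g, 0) / final_total * 1e6
        pdna_cpm = pdna_counts.get(g, 0) / pdna_total * 1e6
        lfc[g] = math.log2((final_cpm + pseudocount) / (pdna_cpm + pseudocount))
    return lfc


# Hypothetical read counts (not data from the study).
final = {"sgTP53_1": 300, "sgCTRL_1": 100}
pdna = {"sgTP53_1": 200, "sgCTRL_1": 200}
lfc = log2_fold_changes(final, pdna, pseudocount=0.0)
```

A positive value means the sgRNA became more abundant during the screen (enrichment), a negative value means it dropped out (depletion).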

### Pooled CRISPR screen identifies p53-bound regulatory elements that influence cell proliferation

Having established that CRISPR screens can be used to profile downstream events in the p53 pathway we next designed a screening approach to identify regulatory elements bound by p53 that mediate its influence on cell proliferation. More specifically, we designed a CRISPR library to target and inhibit the function of p53-bound regulatory elements. We used previously reported p53 ChIP-Seq data to identify p53 binding sites throughout the human genome [29]. We then searched for p53 consensus motifs (CWWG [N]2-12CWWG) located within each p53 ChIP-Seq peak (Fig. 2a). Once found, we designed sgRNAs targeting all PAM-containing sequences located within 16 bp upstream or downstream of the consensus motif. In total, we designed 11,434 sgRNAs targeting 4930 motifs located within 2036 p53 ChIP-Seq peaks (Fig. 2b, c, d, Additional file 11: Table S6). While many p53 motifs could only be targeted by a single sgRNA, the majority of the motifs we identified were targeted by multiple sgRNAs in our CRISPR library (Fig. 2d). Likewise, 83% (1703/2036) of the ChIP-Seq peaks represented in our CRISPR library were targeted by multiple sgRNAs (Fig. 2c). As controls we also included 500 sgRNAs targeting intergenic regions of the human genome and 500 sgRNAs with no genomic targets (Additional file 11: Table S6). We refer to this library throughout this report as our peak-targeting CRISPR library.
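The motif search and PAM scan described above might look roughly like the following sketch. Several details are assumptions: W is read as A or T and N as any base (standard IUPAC codes), the PAM is taken to be NGG (the text does not name the PAM), and the 16 bp window is measured from the edges of the motif. The function names are hypothetical, and the sketch reports PAM positions only rather than designing full sgRNAs.

```python
import re

# CWWG[N]2-12CWWG with W = A/T and N = any base. The lookahead lets the
# search report motifs that start at overlapping positions; for each start
# position only the longest spacer that fits is returned.
P53_MOTIF = re.compile(r"(?=(C[AT][AT]G[ACGT]{2,12}C[AT][AT]G))")


def find_motifs(seq):
    """Return (start, end, sequence) for each motif match, 0-based, end exclusive."""
    seq = seq.upper()
    return [(m.start(1), m.end(1), m.group(1)) for m in P53_MOTIF.finditer(seq)]


def pams_near(seq, start, end, window=16):
    """List (position, strand) of assumed NGG PAMs within `window` bp of a motif.

    On the forward strand the PAM is NGG; on the reverse strand it appears
    as CCN in the forward sequence. `position` is the index of the first
    base of the three-base site in the forward sequence.
    """
    seq = seq.upper()
    lo = max(0, start - window)
    hi = min(len(seq), end + window)
    hits = []
    for i in range(lo, hi - 2):
        triplet = seq[i:i + 3]
        if triplet[1:] == "GG":
            hits.append((i, "+"))
        if triplet[:2] == "CC":
            hits.append((i, "-"))
    return hits


# Hypothetical sequence containing one motif (CATG, 3 bp spacer, CTTG).
example = "TTTT" + "CATG" + "AAA" + "CTTG" + "TTTT"
motifs = find_motifs(example)
```

In practice the scan would run over the sequence under each ChIP-Seq peak, and each PAM hit would then be turned into a protospacer and filtered for off-targets, steps that are outside this sketch.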

Fig. 2. p53-bound regulatory elements influence cell proliferation. (a) p53 binding sites as determined by ChIP-Seq (black) and p53 consensus motifs (grey). (b) Distribution of distances to nearest annotated transcription start site for all sgRNAs in CRISPR library. (c) Distribution of number of sgRNA designs per p53 ChIP-Seq peak. (d) Distribution of number of sgRNA designs per p53 consensus motif. (e) Western blot analysis of dCas9-KRAB expression in 769P cells. (f) Comparison of log2 fold changes (relative to pDNA) for all sgRNAs in CRISPR library between replicates. (g) Volcano plot comparing significance of sgRNA enrichment/depletion and log2 fold change (relative to pDNA) for all sgRNAs in CRISPR library. (h) Visualization of enrichment/depletion for sgRNAs targeting a selected subset of peaks (red) compared to all sgRNAs in CRISPR library (black). (i) Comparison of log2 fold change (relative to pDNA) and distance from nearest annotated TSS for all sgRNAs in CRISPR library.

In order to perform CRISPR interference (CRISPRi) screens we next generated a 769P-derived cell line expressing a nuclease-dead version of Cas9 fused to the KRAB repressive domain (dCas9-KRAB). We stably integrated dCas9-KRAB into a population of 769P cells using lentivirus and confirmed dCas9-KRAB expression by western blot (Fig. 2e). We then infected the dCas9-KRAB-expressing 769P cells with our peak-targeting library at an MOI of 0.5 and a representation of 1000 cells per sgRNA. Library-infected cells were cultured for 21 days, genomic DNA was isolated, and targeted sequencing was performed to evaluate changes in sgRNA abundance relative to the CRISPR library pDNA (Additional file 12: Table S7).

We again used MAGeCK to calculate changes in sgRNA abundance during the screen and observed a moderate correlation in sgRNA enrichment/depletion across biological replicates (Fig. 2f, Additional file 13: Table S8). Among the most enriched sgRNAs in the screen were those targeting a ChIP-Seq peak (Peak 974) located upstream of CDKN1A, a gene that was significantly enriched in screens performed with the gene-targeting CRISPR library (Fig. 2g, Additional file 14: Table S9). Surprisingly, we identified many p53 binding sites in which CRISPRi-mediated repression resulted in a significant proliferative disadvantage (Fig. 2g, h). While some of these p53 binding sites were located proximal to an annotated transcription start site (TSS), most were located more than 10 kb away from the nearest TSS (Fig. 2i). Collectively, these data demonstrate that proliferation-based CRISPRi screens can be used to functionally profile regulatory elements that are bound by p53.

To evaluate the ability of CRISPRko technology to identify functional regulatory elements we performed screens using our peak-targeting CRISPR library in cells expressing Cas9 as opposed to dCas9-KRAB. We infected Cas9-expressing 769P cells with our peak-targeting CRISPR library at an MOI of 0.5 and a representation of 1000 cells per sgRNA, cultured the infected cells for 21 days, isolated genomic DNA, and performed targeted sequencing to evaluate changes in sgRNA abundance relative to the CRISPR library pDNA (Additional file 15: Table S10). Analysis with MAGeCK revealed a moderate correlation in sgRNA enrichment/depletion across biological replicates indicating that our screening results are reproducible (Additional file 2: Figure S2A, Additional file 16: Table S11). Similar to our findings in dCas9-KRAB-expressing 769P cells we identified many p53 binding sites in which CRISPR-mediated knockout resulted in a significant proliferative disadvantage (Additional file 2: Figure S2B, Additional file 2: Figure S2C, Additional file 17: Table S12). Once again, most of these p53 binding sites were located more than 10 kb away from the nearest TSS (Additional file 2: Figure S2D). Interestingly, we observed minimal overlap in the sgRNAs that were significantly enriched/depleted across the CRISPRko and CRISPRi screens. Moreover, the overall concordance of enrichment/depletion for all sgRNAs in the peak-targeting CRISPR library was strikingly low (Additional file 2: Figure S2E). In contrast to our CRISPRi screen results we were unable to associate any p53 binding sites identified in the CRISPRko screen with genes that were significantly enriched/depleted in our gene-targeting CRISPR screen. Based on these data we focused our validation efforts on p53 binding sites identified in our CRISPRi screen.

### Repression of p53-bound regulatory elements impacts cell proliferation

Among the sgRNAs that were most depleted in our peak-targeting CRISPRi screen were those targeting Peak 2319 (Fig. 2h). Peak 2319 is located within the first intron of RAD51C, a gene determined to be essential for cell proliferation in our gene-targeting CRISPRko screen (Fig. 3a, Fig. 1e). Peak 2319 contains three p53 motifs, two of which were targeted by sgRNAs in our peak-targeting CRISPR library (Fig. 3a). We found that sgRNAs targeting both motifs were significantly depleted in our peak-targeting CRISPRi screen (Fig. 3b). We reasoned that the p53 binding sites located within Peak 2319 are components of a downstream regulatory element that modulate RAD51C expression and selected sgRNAs targeting Peak 2319 and RAD51C for experimental validation.

Fig. 3. Functional characterization of p53-bound regulatory elements that influence cell proliferation. (a) Schematic of p53 motifs and sgRNA targets located in Peak 2319 (ChromHMM track legend: red = active promoter; orange = strong enhancer). (b) Log2 fold changes (relative to pDNA) in CRISPR screen for sgRNAs targeting Peak 2319. FDR values were calculated using the Benjamini-Hochberg method. (c) Schematic of p53 motifs and sgRNA targets located in Peak 384 (ChromHMM track legend: yellow = weak/poised enhancer). (d) Log2 fold changes (relative to pDNA) in CRISPR screen for sgRNAs targeting Peak 384. FDR values were calculated using the Benjamini-Hochberg method. (e) Comparison of cellular growth rates following inhibition of Peak 2319 or Peak 384. P-values were calculated using the two-tailed unpaired Student’s t-test with equal variances. **P < 0.01, *P < 0.05.

Also among the most depleted sgRNAs in our peak-targeting CRISPRi screen were those targeting Peak 384 (Fig. 2h). In contrast to the close proximity between Peak 2319 and RAD51C, Peak 384 is located more than 200 kb away from the nearest annotated protein-coding gene (Fig. 3c). Peak 384 contains three p53 motifs, two of which were targeted by sgRNAs in our peak-targeting CRISPR library (Fig. 3c). We identified multiple sgRNAs targeting the first of those motifs that were significantly depleted in our peak-targeting CRISPRi screen (Fig. 3d). We hypothesized that the p53 binding sites within Peak 384 are components of a regulatory element located deep within an intergenic region of the genome and selected sgRNAs targeting this peak for experimental validation.

To experimentally validate that selected p53 binding sites represent functional regulatory elements we evaluated the impact of repressing each individual binding site on cell proliferation. We used lentivirus to stably transduce individual sgRNAs targeting the p53 binding sites of interest into dCas9-KRAB-expressing 769P cells. In addition, we generated stable dCas9-KRAB-expressing cell lines harboring an sgRNA targeting RAD51C, an sgRNA targeting an intergenic region of the genome, or an sgRNA with no genomic target. The resulting 7 cell lines were cultured in parallel for 18 days and population doublings were evaluated at each passage. Cell lines harboring sgRNAs targeting RAD51C, Peak 2319, and Peak 384 underwent significantly fewer population doublings as compared to cell lines containing negative control sgRNAs (Fig. 3e). Furthermore, we observed a significant difference in population doublings between cells harboring the sgRNA targeting the RAD51C TSS (RAD51C) and cells containing an sgRNA targeting the p53 binding site within the first intron of RAD51C (2319.1–1) (Fig. 3e). This observation suggests that sgRNAs targeting the RAD51C TSS and the RAD51C intron influence cell proliferation through distinct mechanisms (direct transcriptional interference of RAD51C and inhibition of regulatory element activity, respectively). We detected a similar impact on cell proliferation for two different sgRNAs targeting Peak 2319 in our validation experiments despite their differing degrees of depletion in our CRISPRi screen (Fig. 3b, e). This observation suggests that many of the modest proliferation phenotypes generated by sgRNAs in our CRISPRi screen may translate to more potent impacts on cell proliferation in focused validation experiments. Altogether, our results confirm that pooled CRISPR screens can be used to identify functional regulatory elements that influence cell proliferation.
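The text does not give the formula used for population doublings. A common convention, assumed here, is log2 of the ratio of cells harvested to cells seeded at each passage, summed over passages. The function name and the example numbers are hypothetical.

```python
import math


def cumulative_population_doublings(passages):
    """Cumulative population doublings after each passage.

    `passages` is a list of (cells_seeded, cells_harvested) pairs, one per
    passage, in chronological order.
    """
    total = 0.0
    cumulative = []
    for seeded, harvested in passages:
        if seeded <= 0 or harvested <= 0:
            raise ValueError("cell counts must be positive")
        total += math.log2(harvested / seeded)
        cumulative.append(total)
    return cumulative


# Hypothetical counts for two passages (not data from the study).
doublings = cumulative_population_doublings([(1e5, 4e5), (1e5, 8e5)])
```

Comparing these cumulative curves between a targeting sgRNA and the negative controls is one way to express the growth differences reported in Fig. 3e.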

In addition to the sgRNAs that were significantly depleted in our CRISPRi screen we identified several sgRNAs that were significantly enriched. For example, multiple sgRNAs targeting Peak 1267 resulted in a significant proliferative advantage in our CRISPRi screen (Fig. 2h). Peak 1267 contains five p53 motifs, two of which were targeted by sgRNAs in our peak-targeting CRISPR library (Additional file 3: Figure S3A). Although Peak 1267 is located within the first intron of TNFRSF10A, knockout of TNFRSF10A had no impact on cell proliferation in our gene-targeting CRISPRko screen (Additional file 3: Figure S3A, Figure S3B). In contrast, we identified multiple sgRNAs targeting the second p53 consensus motif in Peak 1267 that were significantly enriched in our peak-targeting CRISPRi screen (Additional file 3: Figure S3C). Importantly, these results demonstrate that regulatory elements can be functionally dissociated from proximal protein-coding genes.

### Pooled CRISPR screen identifies p53-bound regulatory elements that influence the DNA damage response

To evaluate the ability of a pooled CRISPR screen to identify regulatory elements that influence additional biological processes we next investigated the p53-mediated response to DNA damage. First, we utilized our gene-targeting CRISPR library to ensure that a CRISPR screen would be able to identify protein-coding genes that are required for cell cycle arrest in response to DNA damage. We infected Cas9-expressing 769P cells with our gene-targeting library at an MOI of 0.5 and a representation of 1000 cells per sgRNA. Library-infected cells were cultured in the presence of the DNA damage-inducing agent doxorubicin for 21 days, genomic DNA was isolated, and targeted sequencing was performed to evaluate changes in sgRNA abundance relative to the CRISPR library pDNA (Additional file 8: Table S3). Analysis with MAGeCK revealed a strong correlation in sgRNA enrichment/depletion across biological replicates indicating that our screening results are highly reproducible (Fig. 4a, Additional file 18: Table S13). We identified several sgRNAs that prevented cell cycle arrest in response to DNA damage (Fig. 4a, Additional file 18: Table S13). Among the most enriched sgRNAs were those targeting p53, CDKN1A, and SLC30A1 (Fig. 4b, Fig. 4c, Additional file 19: Table S14). These data demonstrate that a CRISPR screen can be used to identify genes that are required for cell cycle arrest in response to DNA damage.

Fig. 4. p53-bound regulatory elements influence cellular response to DNA damage. (a) Comparison of log2 fold changes (relative to pDNA) for all sgRNAs in gene-targeting CRISPR library between replicates. (b) Log2 fold changes (relative to pDNA) in CRISPR screen for sgRNAs targeting selected subset of genes. FDR values were calculated using the Benjamini-Hochberg method. (c) Visualization of enrichment/depletion for sgRNAs targeting a selected subset of genes (red) compared to all sgRNAs in CRISPR library (black). (d) Comparison of log2 fold changes (relative to pDNA) for all sgRNAs in peak-targeting CRISPR library between replicates. (e) Volcano plot comparing significance of sgRNA enrichment/depletion and log2 fold change (relative to pDNA) for all sgRNAs in CRISPR library. (f) Visualization of enrichment/depletion for sgRNAs targeting a selected subset of peaks (red) compared to all sgRNAs in CRISPR library (black). (g) Comparison of log2 fold change (relative to pDNA) and distance from nearest annotated TSS for all sgRNAs in CRISPR library.

We next used our peak-targeting CRISPRi library to search for regulatory elements involved in the p53-mediated response to DNA damage. We infected dCas9-KRAB-expressing 769P cells with our peak-targeting library at an MOI of 0.5 and a representation of 1000 cells per sgRNA. Library-infected cells were cultured in the presence of doxorubicin for 21 days, genomic DNA was isolated, and targeted sequencing was performed to evaluate changes in sgRNA abundance relative to the CRISPR library pDNA (Additional file 12: Table S7). Analysis with MAGeCK revealed a relatively weak correlation in sgRNA enrichment/depletion across biological replicates (Fig. 4d). This weak correlation likely results from the combination of reduced proliferation in cells treated with doxorubicin and the less potent enrichments/depletions observed in screens performed with our peak-targeting CRISPR library. Despite weak overall correlation in sgRNA enrichment/depletion across replicates we were able to identify several sgRNAs that were significantly enriched in our screen (Fig. 4e, Additional file 20: Table S15). Interestingly, the three peaks that had the most significant impact on cell cycle arrest in response to DNA damage (Peak 974, Peak 975, and Peak 976) are located within a 15 kb window surrounding the CDKN1A transcription start site. Aside from p53, CDKN1A was the most enriched gene in our DNA damage screen performed with the gene-targeting CRISPR library (Fig. 4b, c, f, Additional file 21: Table S16). Although most of the p53 binding sites identified in our screen were located within 10 kb of an annotated TSS, at least one was located more than 250 kb away from the nearest TSS (Fig. 4g). Altogether, these data provide an additional example of a pooled CRISPR screen being used to successfully identify functional regulatory elements.

We again tested the ability of CRISPRko technology to identify functional regulatory elements by performing a DNA damage response screen in cells expressing Cas9 as opposed to dCas9-KRAB. We infected Cas9-expressing 769P cells with our peak-targeting library at an MOI of 0.5 and a representation of 1000 cells per sgRNA. Library-infected cells were cultured in the presence of doxorubicin for 21 days, genomic DNA was isolated, and targeted sequencing was performed to evaluate changes in sgRNA abundance relative to the CRISPR library pDNA (Additional file 15: Table S10). Analysis with MAGeCK revealed a moderate correlation in sgRNA enrichment/depletion across biological replicates (Additional file 4: Figure S4A, Additional file 22: Table S17). While we did identify p53 binding sites in which CRISPR-mediated knockout prevented cell cycle arrest in response to DNA damage, the magnitude of sgRNA enrichment was less significant as compared to the CRISPRi screen (Additional file 4: Figure S4B, Additional file 23: Table S18). Moreover, the sgRNA enrichments were far less pronounced than we observed in the CRISPRi screen (Additional file 4: Figure S4C, Figure S4D). Once again, we observed minimal overlap in the sgRNAs that were significantly enriched/depleted across the CRISPRko and CRISPRi screens of the DNA damage response (Additional file 4: Figure S4E). Furthermore, none of the p53 binding sites that appeared to impact the DNA damage response in the CRISPRko screen were located near genes that were significantly enriched/depleted in our gene-targeting CRISPR screen. Based on these data we focused our validation efforts on p53 binding sites identified in our CRISPRi screen.

### Repression of p53-bound regulatory elements prevents cell cycle arrest in response to DNA damage

Among the sgRNAs that were most enriched in our peak-targeting CRISPRi screen of the DNA damage response were those targeting ChIP-Seq peaks nearest CDKN1A (Fig. 4f). More specifically, Peak 975 overlaps the CDKN1A TSS, Peak 974 is located 10 kb upstream of the CDKN1A TSS, and Peak 976 is located 5 kb downstream of the CDKN1A TSS (Fig. 5a). Peak 975 contains three p53 consensus motifs and multiple sgRNAs targeting the first of those motifs were significantly enriched in our CRISPRi screen (Fig. 5b). Peak 976 contains eight p53 consensus motifs and we identified sgRNAs targeting several of those motifs that were significantly enriched in our CRISPRi screen (Fig. 5c). Lastly, Peak 974 contains four p53 consensus motifs and sgRNAs targeting each of those motifs were significantly enriched in our CRISPRi screen, although the magnitude of enrichment was not as pronounced as with sgRNAs targeting Peak 975 and Peak 976 (Fig. 5d). We hypothesized that the p53 binding sites located within these ChIP-Seq peaks are components of regulatory elements that modulate CDKN1A expression and selected an sgRNA targeting Peak 975 for experimental validation.

Fig. 5 Functional characterization of p53-bound regulatory elements that influence cellular response to DNA damage. a Schematic of p53 motifs and sgRNA targets located in Peaks 974, 975, and 976 (ChromHMM track legend: red = active promoter; orange = strong enhancer; yellow = weak/poised enhancer; dark green = transcriptional transition/elongation; light green = weak transcribed). b–d Log2 fold changes (relative to pDNA) in CRISPR screen for sgRNAs targeting b Peak 975, c Peak 976, and d Peak 974. FDR values were calculated using the Benjamini-Hochberg method. e Schematic of p53 motifs and sgRNA targets located in Peak 685. f Log2 fold changes (relative to pDNA) in CRISPR screen for sgRNAs targeting Peak 685. FDR values were calculated using the Benjamini-Hochberg method. g Cell cycle analysis of DNA damage response following inhibition of Peak 975 or Peak 685. P-values were calculated using the two-tailed unpaired Student’s t-test with equal variances. **P < 0.01
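The caption above states that FDR values were calculated with the Benjamini-Hochberg method. As a reminder of what that adjustment does, here is a small self-contained sketch. The input p-values are invented for illustration and are not taken from the screen.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (FDR) in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical per-sgRNA p-values.
fdr = benjamini_hochberg([0.001, 0.04, 0.03, 0.5])
```

Each p-value is multiplied by the number of tests divided by its rank, and the running minimum ensures that a smaller p-value never receives a larger adjusted value than a bigger one.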

Also among the most enriched sgRNAs in our peak-targeting CRISPRi screen of the DNA damage response was one targeting Peak 685 (Fig. 4e). Peak 685 is located more than 250 kb away from the nearest annotated protein-coding gene and contains two p53 motifs, both of which were targeted by sgRNAs in our peak-targeting CRISPR library (Fig. 5e). We identified one sgRNA targeting the second of those motifs that was significantly enriched in our peak-targeting CRISPRi screen (Fig. 5f). We hypothesized that this p53 binding site is a component of a regulatory element located deep within an intergenic region of the genome and selected an sgRNA targeting Peak 685 for experimental validation.

To experimentally validate that the selected p53 binding sites represent functional regulatory elements we evaluated the impact of repressing individual binding sites on cell cycle arrest in response to DNA damage. We used lentivirus to stably transduce individual sgRNAs targeting p53 binding sites of interest into dCas9-KRAB-expressing 769P cells. In addition, we generated stable dCas9-KRAB-expressing cell lines harboring an sgRNA targeting p53, an sgRNA targeting an intergenic region of the genome, or an sgRNA with no genomic target. The resulting 5 cell lines were cultured in the presence or absence of doxorubicin for 16 h followed by cell cycle analysis (Additional file 5: Figure S5). All of the stable cell lines we generated displayed similar cell cycle profiles in standard culture conditions, with 12–15% of total cells in S-phase (Fig. 5g). In response to doxorubicin treatment cell lines harboring negative control sgRNAs dropped to 0.5% of total cells in S-phase (Fig. 5g). In contrast, 10% of cells harboring sgRNAs targeting p53 remained in S-phase after treatment with doxorubicin (Fig. 5g). Likewise, cells containing sgRNAs targeting Peak 975 and Peak 685 exhibited significantly lower levels of cell cycle arrest with 4.27 and 3.72% of total cells in S-phase, respectively (Fig. 5g). Altogether, these results confirm that pooled CRISPR screens can be used to identify functional regulatory elements that influence the DNA damage response. Moreover, these data further demonstrate that pooled CRISPR screening can be used as a general approach to identify functional regulatory elements that influence diverse biological processes.

### Affiliations

Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA

Timothy Gilpatrick, Isac Lee & Winston Timp

Oxford Nanopore Technologies, Oxford, UK

James E. Graham, Etienne Raimondeau, Rebecca Bowen & Andrew Heron

Department of Oncology, Johns Hopkins School of Medicine, Baltimore, MD, USA

Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, USA

Department of Molecular Biology and Genetics, Department of Medicine, Division of Infectious Disease, Johns Hopkins School of Medicine, Baltimore, MD, USA

### Contributions

T.G. and W.T. constructed the study. T.G. performed the experiments. T.G., I.L. and F.S. analyzed the data. T.G., J.G., E.R., R.B. and A.H. developed the method. S.S. and B.D. provided primary breast tissue and generated the mouse xenografts. T.G. and W.T. wrote the paper.

## Specific computational tools for analyzing DNA methylation sequencing data

The following section describes several tools that have been developed to analyze DNA methylation sequencing data generated using the different experimental protocols presented above. For each experimental technique, we indicate which tool we believed to be the optimal choice for a scientist with limited knowledge in computational data analysis. For the selection and recommendations, we used criteria of performance (from the raw reads processing to differential analysis), graphic output options, and availability of a detailed manual (see Table 1). In addition, we took into account more practical criteria, such as how easy it was to download, install, and execute the particular tool, based on personal experience. The tools are recommended for each experimental protocol, according to the number of criteria that could be fulfilled. The recommendations are discussed in more detail under the “Discussion” section.

### Selected tools for bisulfite sequencing data analysis

Just a handful of tools can perform all or most of the necessary steps in the data analysis. For example, BS Seeker, Bismark, and BSMAP are suitable for bisulfite sequencing read alignment only [37, 40, 42], while GBSA and BSmooth are for specific downstream analyses [51, 60]. BS Seeker performs alignment and methylation calling, but does not calculate methylation ratio or beta scores [42]. On the other hand, Bicycle is able to perform all necessary steps and is relatively universal to different platforms [54], while SMAP is a great example of a convenient pipeline, but suitable only for RRBS data [55].

BSmooth is a tool for WGBS data analysis that performs alignment of the reads, measures methylation levels, and detects DMRs when biological replicates are available [51]. BSmooth takes biological variability into account (not only sampling variability) while searching for DMRs. The algorithm detects regions consisting of several CpGs; thus, biologically significant differentially methylated single CpGs will be missed in the results, which can be a disadvantage in a research setting [51]. Working with the BSmooth algorithm can be challenging for many users, since data must be pre-processed and adapted for the analysis in an R environment. Considering the level of difficulty and the limited capabilities of the tool, it is not recommended for most users (Table 1).

MOABS (Model-based Analysis of Bisulfite Sequencing data) is one of the most powerful command line-based pipelines and is suitable for WGBS, RRBS, and 5hmC data analysis [52]. It is able to perform alignments, methylation calling, identification of DMPs and DMRs, and differential methylation analysis (Table 1). It reports a unique value that combines biological and statistical significance for differential methylation, the credible methylation difference (CDIF) [52]. Since the pipeline does not report a beta score for methylation, it can be difficult to compare results from MOABS with results from other research projects. The MOABS pipeline offers powerful algorithms for data analysis. However, setting up the analysis is complicated, probably too complicated for users who are inexperienced with the command line: organizing the input and output files is cumbersome, and the user must be very familiar with writing definitions and paths. MOABS can be executed by writing a master/configuration script or by using command lines. Using a configuration script is more convenient, but the whole analysis is then performed at once, which can be demanding in terms of computational power and CPU time.

MethPipe is a pipeline similar to MOABS and integrates various tools for methylation data analysis, including alignment, methylation calling, analysis of hypo- and hypermethylated regions, and differential methylation analysis (Table 1). It is also applicable to DNA hydroxymethylation analysis [53, 61]. However, MethPipe is considerably more difficult to use than Bicycle, SMAP, or even MOABS, since it requires even more commands to be written and executed. On the other hand, writing and executing individual commands in the pipeline allows maximum control over the process: it can be run in small steps, with output files named and ordered according to the user’s preferences. Furthermore, MethPipe has extensive documentation with thorough instructions, which is useful to read even without intending to use the pipeline itself, since it describes the basic principles of DNA methylation data analysis [61]. The MethPipe developers have also created and curate a reference methylome database, MethBase, which can be useful for biological comparisons [53], for example by adding tracks of methylomes from different human tissues and cell lines to the UCSC browser and comparing them to one’s own data. Data from MethBase can be downloaded using the UCSC Table Browser or from the MethBase website for individual methylomes, where files contain methylation levels and coverage information for each CpG.

MOABS and MethPipe could be the pipelines of choice for more experienced users. However, because of its high functionality and user-friendly command line, Bicycle is the main pipeline we are suggesting for use by scientists with different backgrounds.

#### Bicycle (recommended for WGBS, targeted BS-seq, and TAB-seq)

Bicycle is a pipeline for computational analysis of bisulfite sequencing data that is more powerful or at least as powerful as MOABS or MethPipe, but undeniably easier to use, which is a great benefit for scientists without advanced computational skills [54]. The pipeline is able to perform all necessary steps—from conversion and indexing of the reference genome to the differential methylation analysis (Table 1). The tool is suitable for both paired-end and single-end reads. Bicycle has several advantages over other pipelines and includes more options than any other bisulfite sequencing analysis pipeline [54]. It can analyze the efficiency of the bisulfite conversion, which is important for correct estimation of methylation levels. Furthermore, it identifies and removes ambiguous reads, which is not included in other pipelines. Removal of clonal reads is also a Bicycle feature that is not often covered in methylation pipelines. No other pipeline has a non-CG to CG context correction option, while Bicycle performs it automatically during methylation analysis.

Methylation analysis of raw sequencing reads, and subsequent differential methylation analysis can be performed with just 4 commands, and 2 additional commands are required only when a reference genome is used for the first time [54].

The 6 steps in the pipeline (Additional file 1):

1. Creating a project. All output files are held in one folder.

2. Creating two in silico bisulfited reference genomes: C-to-T conversion for Watson strand reads and G-to-A conversion for Crick strand reads.

3. Indexing the reference genomes. Steps 2 and 3 need to be executed only the first time a reference genome is used.

4. Aligning the reads to both bisulfited reference genomes.

5. Methylation analysis and methylcytosine calling.

6. Determination of DMPs and DMRs. Differentially methylated positions are always determined, but when regions of interest are specified, only the relevant positions are reported alongside the differentially methylated regions.

Bicycle creates two in silico bisulfited versions of reference genomes: C-to-T conversion is made to accommodate reads from the Watson strand and G-to-A conversion for the reads from the Crick strand [54]. Two versions of references are then indexed. Reads are processed concurrently and mapped to the references executing two separate threads. The mapping command outputs SAM files, which are then automatically converted to BAM files and indexed with SAMtools [54].
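The in silico conversion described above can be illustrated with a few lines of Python. This is only a sketch of the idea of a fully converted reference, not Bicycle's own code. The function name and the example sequence are hypothetical.

```python
def bisulfite_convert_reference(sequence):
    """Return (C-to-T, G-to-A) fully converted copies of a reference sequence."""
    seq = sequence.upper()
    c_to_t = seq.replace("C", "T")  # reference for reads from the Watson strand
    g_to_a = seq.replace("G", "A")  # reference for reads from the Crick strand
    return c_to_t, g_to_a


# Hypothetical reference fragment.
watson_ref, crick_ref = bisulfite_convert_reference("ACGTCCGGA")
```

Mapping against both converted references lets a three-letter aligner place reads regardless of their methylation state. The original, unconverted reference is still needed afterwards to decide which read positions were cytosines.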

Each cytosine is visited and assigned to a methylation context (CG, CHG, or CHH). Methylation level calculation and methylation calling are performed [54]. Various corrections, which can be controlled by options, are performed automatically. For example, if a cytosine is initially assigned to CHG or CHH due to a single-nucleotide polymorphism (SNP), it is re-assigned correctly to CG. In this step, filters can be applied to disregard ambiguous reads, to discard clonal reads while keeping the highest-quality one, and to filter out incorrectly converted reads [54]. During methylation calling, Bicycle estimates the bisulfite conversion error rate at each position in one of three ways: as the percentage of unconverted Cs in an unmethylated control genome (when one is included in the experiment), as the percentage of unconverted barcodes (when barcodes with unmethylated Cs were attached to the reads before bisulfite conversion), or as a specified fixed error rate [54].
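To make the terms in the paragraph above concrete, the sketch below classifies a forward-strand cytosine by context, computes a per-site methylation level, and estimates a conversion error rate from an unmethylated control. All function names and numbers are hypothetical. Bicycle's actual implementation, including its SNP-aware re-assignment and read filters, is not reproduced here.

```python
def cytosine_context(sequence, pos):
    """Classify the forward-strand cytosine at 0-based `pos` as CG, CHG or CHH.

    Returns None when the context cannot be decided (sequence end or N).
    """
    seq = sequence.upper()
    if seq[pos] != "C":
        raise ValueError("position is not a cytosine")
    downstream = seq[pos + 1:pos + 3]
    if downstream[:1] == "G":
        return "CG"
    if len(downstream) < 2 or "N" in downstream:
        return None
    # H stands for A, C or T, so only the second base decides CHG vs CHH.
    return "CHG" if downstream[1] == "G" else "CHH"


def methylation_level(methylated_reads, unmethylated_reads):
    """Fraction of reads supporting methylation at one cytosine."""
    total = methylated_reads + unmethylated_reads
    return methylated_reads / total if total else None


def conversion_error_rate(unconverted_c, total_c):
    """Fraction of cytosines left unconverted in an unmethylated control."""
    return unconverted_c / total_c


reference = "ACGTCAGCTTC"  # hypothetical fragment
contexts = [cytosine_context(reference, p) for p in (1, 4, 7, 10)]
level = methylation_level(8, 2)
error = conversion_error_rate(5, 1000)
```

The error rate matters because every unconverted but unmethylated C looks methylated. A caller would typically require the observed methylation level to exceed what that error rate alone could explain.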

A significant advantage of the Bicycle pipeline is that it can also perform a differential methylation analysis. Both DMPs and DMRs are computed by comparing two groups of samples (control and condition). The statistical analysis is based on the MethylSig algorithm [62].

Bicycle can be adapted for the analysis of 5hmC, identified using the TAB-seq approach. 5hmC would be reported as methylated cytosine during the analysis with the pipeline. Analysis should be available for the oxBS-seq data as well, but then positions that overlap between oxBS-seq and BS-seq of the same sample should be discarded in order to identify 5hmC but leave 5mC modifications behind.

#### SMAP (recommended for RRBS)

SMAP is another example of a bisulfite sequencing data analysis pipeline [55]. It focuses on RRBS data analysis from reference preparation to detection of DMPs, DMRs, SNPs, and allele-specific methylation (ASM). In step 1 of the pipeline, the reference genome is prepared by converting all Cs into Ts for both strands and indexing those strands. The reference is cut into target regions, based on the enzyme that was used in the RRBS protocol. In step 2, reads are trimmed, and in step 3 they are aligned (Additional file 1). One of two alignment algorithms, Bowtie2 or BSMAP, can be chosen and its options set. In step 4, methylation levels are calculated for target regions. DMPs and DMRs are detected in step 5 using Fisher’s exact test when seed number is smaller than 5. Otherwise, a t test or chi-square test is chosen automatically. SNPs and ASM are analyzed in step 6 using Bis-SNP or Bcftools. Heterozygous SNPs are then filtered for ASM event detection. In a final step, results are summarized into a report [55].
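Step 5 above mentions Fisher’s exact test for detecting differential methylation. The following sketch shows how a two-sided Fisher’s exact test can be applied to methylated/unmethylated read counts at a single position in two samples. It uses only the Python standard library and is not SMAP’s code. The read counts are invented.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Here: a, b = methylated, unmethylated reads in sample 1;
          c, d = methylated, unmethylated reads in sample 2.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of x methylated reads in sample 1.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_observed = prob(a)
    # Sum over all tables that are as likely or less likely than the observed one.
    return sum(
        p for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


# Hypothetical CpG: 3/4 reads methylated in control, 1/4 in the condition.
p_small = fisher_exact_two_sided(3, 1, 1, 3)
# Hypothetical CpG: fully methylated vs. fully unmethylated at 10x coverage.
p_extreme = fisher_exact_two_sided(10, 0, 0, 10)
```

With only four reads per sample even a large difference in methylation level is not significant (p of about 0.49), which is one reason coverage filters are usually applied before differential testing.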

### Web-based alternatives to command-line tools

There are several online pipelines for methylation analysis, where users can upload their own data and analyze it using a visual interface rather than a command line. However, online platforms often require frequent maintenance, and a lack of it leads to poor website performance, annoying errors, and crashes. Another important concern is the protection of sensitive human genetic data on the servers or clouds used by the particular platform, since data has to be uploaded to perform the analysis, and such data handling and storage is still a topic of discussion [63,64,65].

Genestack.com is an online platform that offers pipelines for the analysis of various data types, including WGBS (and RRBS) (https://genestack.com) [56]. A 30-day free trial is available in order to try the tools. However, since September 2019 access to the Genestack platform has been restricted and after the free trial period a paid subscription is required. The platform is visually pleasing and visualizes the results from all necessary steps of the methylation data analysis pipeline, which is a big advantage compared with the command line-based tools. Unfortunately, uploading large datasets is inefficient and highly time-consuming. An advantage is that uploads can be resumed later, even if the computer is turned off or the internet connection is interrupted. Furthermore, some of the available public data is already accessible and does not need to be uploaded. However, finding the tools and applying them to the data can be confusing, since they are not well listed in the menu. To make this easier, a task manager is available to track the activity and access the results. In addition, the Genestack website has several thorough tutorials, created especially for WGBS, RNA-Seq, and other omics data.

Mapping to the reference genome is performed using the BSMAP algorithm, and various options such as number of mismatches or the BS data generation protocol can be chosen. Unfortunately, differential methylation is not available in Genestack, which is a significant disadvantage of the platform. Overall, considering the disadvantages of the platform and controversies regarding the treatment of sensitive data, this platform would not be our first choice for data analysis (Table 1).

### MeDIP-seq data processing

The earliest tools developed specifically for MeDIP-seq data analysis were Batman and MEDIPS (which is possibly the most frequently used tool for MeDIP data analysis), but these tools do not perform quality control or mapping of the reads [57]. Therefore, additional tools are required to prepare the data for analysis, which is time-consuming and can be challenging computationally. As a solution, there are several pipelines that combine various tools, including MEDIPS. The most frequently described and recommended pipelines in various publications are MeQA and MeDUSA.

Huang et al. created the MeQA pipeline for “pre-processing, data quality assessment and distribution of sequences reads, and estimation of DNA methylation levels of MeDIP-seq datasets” [57]. To run the pipeline, a configuration file must be prepared, which is then called by a command line. The pipeline consists of two main parts. Part A performs a quality control (summarized in a pdf report with graphs), and an alignment that results in BAM files and alignment quality control. Reference genome and indexes are downloaded automatically from UCSC, which is a great advantage of MeQA. DNA methylation levels are estimated in part B and mapped regions are extracted in BED format. The regions or parts of regions that correspond to promoters, bidirectional promoters, genes, or downstream of genes are identified and CpG enrichment is estimated. Summary of the results is generated.
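The paragraph above mentions that CpG enrichment is estimated, without giving the formula. One common way to express it is to compare the observed/expected CpG ratio of the sequenced regions with that of the reference. The sketch below uses that definition as an assumption. MeQA and MEDIPS may compute their scores differently, and the sequences are made up.

```python
def observed_expected_cpg(sequences):
    """Observed/expected CpG ratio pooled over a list of sequences.

    Uses the classic definition: (#CpG * length) / (#C * #G).
    """
    c = g = cpg = length = 0
    for seq in sequences:
        seq = seq.upper()
        c += seq.count("C")
        g += seq.count("G")
        cpg += seq.count("CG")
        length += len(seq)
    if c == 0 or g == 0:
        return 0.0
    return cpg * length / (c * g)


def cpg_enrichment(read_sequences, reference_sequences):
    """Ratio of the CpG observed/expected value in reads vs. the reference."""
    return observed_expected_cpg(read_sequences) / observed_expected_cpg(reference_sequences)


# Hypothetical immunoprecipitated reads and reference fragment.
reads = ["ACGCGT", "CGCGAA"]
reference = ["ACGTTAGCATTAGCCGTA"]
reads_oe = observed_expected_cpg(reads)
enrichment = cpg_enrichment(reads, reference)
```

A value above 1 indicates that the pulled-down fragments are richer in CpGs than the reference, which is what a successful methyl-DNA immunoprecipitation is expected to show.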

Unfortunately, MeQA does not perform differential methylation analysis (DMR analysis) [57]. In addition, currently the pipeline seems to be unavailable, which prevents us from recommending it.

#### MeDUSA (recommended for MeDIP-seq data analysis)

MeDUSA (Methylated DNA Utility for Sequence Analysis) is a pipeline for MeDIP-seq data analysis that focuses on accurate DMRs detection [43, 58]. It contains several packages to perform a complete analysis of MeDIP-seq data: sequence alignment, quality control, and DMR identification (Table 1) [58]. BWA is used for the alignment, SAMtools for subsequent filtering, and FastQC for quality control metrics. MeDUSA integrates and uses MEDIPS as a tool for methylation analysis. The pipeline is executed by writing a configuration file, which runs the scripts of the pipeline. Template and example configuration files are available to download.

The pipeline consists of four parts. In part 1, the alignment of reads and filtering is performed, using BWA and SAMtools. Some of the alignment parameters are set up in the configuration file, while more can be added by modifying the part 1 script. The part 2 script runs MEDIPS and its quality control and generates WIG tracks for individual strands and both strands combined. The tracks are converted to bigWig format. DMRs are called in part 3 using MEDIPS. In part 4, these DMRs are annotated (Additional file 1). In this step, annotation files are required, and they must be written in GFF file format and organized in the correct directory structure. Annotation files are available together with MeDUSA v2.0, while the newest version 2.1 does not include these files. However, they can easily be copied from one version to another.

### MRE-seq data processing

MRE-seq is not the most popular approach to study DNA methylation, although some datasets are publicly available and have potential to be used. Therefore, developing specific tools and pipelines for this type of data is not common. However, R Bioconductor has a package just for methylation-sensitive restriction enzyme sequencing data, msgbsR [44].

#### msgbsR (recommended for MRE-seq data analysis)

The methylation sensitive genotyping by sequencing R package (msgbsR) contains a collection of functions for MRE-seq data analysis [44, 59]. However, the input must be indexed BAM files, which means that the user must do data pre-processing before using msgbsR. This can be done with Bowtie2 or BWA aligners. msgbsR then identifies and quantifies read counts at methylated sites. Enzyme cut sites are also verified and DNA methylation is assessed based on read coverage [44]. One of the advantages of this package is the differential methylation analysis.

In the pipeline, the input BAM files are read. Then cut sites are extracted and checked. Incorrect cuts are filtered out and a preliminary read count table is generated. msgbsR can plot the results using plotCounts.
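The cut-site check described above can be pictured as follows: reads that start at a position where the enzyme is expected to cut are counted, and the rest are filtered out. This Python sketch is a simplified, strand-agnostic illustration and does not use msgbsR’s R functions. The recognition site CCGG (cut after the first C, as for HpaII and MspI) is an illustrative default, and the reference and read positions are hypothetical.

```python
from collections import Counter


def count_reads_at_cut_sites(reference, read_starts, site="CCGG", cut_offset=1):
    """Count reads starting at expected cut positions; report how many do not.

    `read_starts` are 0-based start coordinates of aligned reads.
    `cut_offset` is the cut position within the recognition site.
    """
    ref = reference.upper()
    # Collect every position where the enzyme is expected to cut.
    cut_positions = set()
    start = ref.find(site)
    while start != -1:
        cut_positions.add(start + cut_offset)
        start = ref.find(site, start + 1)

    counts = Counter()
    rejected = 0
    for pos in read_starts:
        if pos in cut_positions:
            counts[pos] += 1
        else:
            rejected += 1  # read does not begin at a valid cut site
    return counts, rejected


# Hypothetical reference with two CCGG sites and four aligned reads.
counts, rejected = count_reads_at_cut_sites("AACCGGTTTTCCGGAA", [3, 3, 11, 5])
```

The resulting per-site counts correspond to the preliminary read count table mentioned above. A real analysis would also handle strand orientation and sites lost to sequence variation.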

The user should keep in mind that this package requires pre-processing of the raw data and knowledge of the R programming language, so analyzing MRE-seq data means using both R and command-line tools. However, an example script is provided on the website together with a manual [59].


Anders S, Huber W. Differential expression analysis for sequence count data. Genome Biol. 201011:R106.

Quinlan AR, Hall IM. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics. 201026:841–2.

R Development Core Team. R: A language and environment for statistical computing. Vienna, Austria: the R Foundation for Statistical Computing 2011.

Mahony S, Auron PE, Benos PV. DNA familial binding profiles made easy: comparison of various motif alignment and clustering strategies. PLoS Comput Biol. 20073:e61.

## Introduction

Understanding the relationships between protein sequences and their functions is a fundamental objective of protein science. Our ability to map these relationships has improved with advances in technology. Until recently, the ability to decode information from experiments that characterize protein function was limited by the need to clone and/or individually sequence every gene of interest at relatively low throughput. Next-generation sequencing has changed this, and a number of important publications describe techniques that combine phenotypic screening and deep sequencing to investigate how protein sequence influences structure, folding, binding or organism growth/fitness [1–]. Araya and Fowler have written a good review of recent advances [11]. Generally, the experimental approach involves constructing a library of many different mutant variants of a protein of interest. The library is then screened/selected for some property or function. The retained library pool is sequenced, and features of sequences that are observed with high frequency are implicated as important for the relevant property. In this introduction, we discuss applications of this approach to the problem of determining protein interactions with a target.

Interaction systems that have been subjected to a screening-plus-sequencing approach include PDZ domain peptide ligands [4, 5], Pin WW domain peptide ligands [6], influenza haemagglutinin inhibitors [7], LYN kinase interaction partners [8], computationally designed digoxigenin binders [9] and Bcl-2 type receptor/BH3 complexes [10]. Experiments varied in library size (600,000 members) as well as in the type of screening used to detect binding (phage display, yeast display, ribosome display, bacterial two-hybrid). These studies are exciting milestones that dramatically expand the amount of data available to describe protein interactions. Yet, it is important to consider what information the data from various interaction screens contain and how it can be used. A standard approach has been to quantify the enrichment of each sequence or point mutation among library members classified as binders, relative to the unselected library, and to use this as a proxy for affinity. This may be problematic, as it relies on adequate deep sequencing of the starting library and bias-free amplification of sequences throughout screening and sample preparation. In fact, Derda et al. found that the relative abundance of phage-displayed peptides could be significantly skewed if phages were amplified after a selection step [12]. McLaughlin et al. have reported data that support an impressive correlation of enrichment scores with binding affinities [5], but the appropriateness and resolution of new methods for affinity determination are not well established.
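The enrichment score described above can be sketched in a few lines. This is a minimal illustration, not the procedure of any of the cited studies: the function name `log2_enrichment`, the peptide labels and the pseudocount of 0.5 are all assumptions made for the example. It takes raw read counts per variant in the selected and unselected pools, converts them to frequencies, and returns the log2 ratio of selected to unselected frequency.

```python
import math


def log2_enrichment(selected_counts, naive_counts, pseudocount=0.5):
    """Return log2(F_S / F_N) for every variant seen in either pool.

    selected_counts / naive_counts: dicts mapping variant -> read count.
    pseudocount: added to every count so that variants missing from one
    pool still get a finite score (an assumption of this sketch).
    """
    variants = sorted(set(selected_counts) | set(naive_counts))
    n_variants = len(variants)

    # Totals include the pseudocounts so that frequencies still sum to 1.
    total_selected = sum(selected_counts.get(v, 0) for v in variants) + pseudocount * n_variants
    total_naive = sum(naive_counts.get(v, 0) for v in variants) + pseudocount * n_variants

    scores = {}
    for v in variants:
        f_selected = (selected_counts.get(v, 0) + pseudocount) / total_selected
        f_naive = (naive_counts.get(v, 0) + pseudocount) / total_naive
        scores[v] = math.log2(f_selected / f_naive)
    return scores


# Hypothetical read counts for three peptide variants.
naive = {"pep_A": 500, "pep_B": 500, "pep_C": 500}
selected = {"pep_A": 1200, "pep_B": 250, "pep_C": 50}

for variant, score in log2_enrichment(selected, naive).items():
    print(f"{variant}: {score:+.2f}")
```

Positive scores mark variants that became more frequent after selection, negative scores mark depleted ones. The caveats raised in the text still apply: a score like this is only as reliable as the sequencing depth of the starting library and the absence of amplification bias.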

Recently, Kinney et al. pioneered a detailed approach to the screen-and-sequence scheme and applied it to measure protein-DNA interactions [13]. Adopting the expression level of GFP as an indicator of transcription factor binding strength, they employed fluorescence-activated cell sorting (FACS) to sort a bacterial library of 20,000 mutant lacZ-promoters with different activities into pools and decoded these by deep sequencing. A maximum-likelihood computational routine transformed the sequencing data into a position-specific scoring matrix that described the DNA-binding affinity of the transcription factor. In a similar approach, Sharon et al. monitored the affinity of transcription factors for hundreds of mutant yeast promoters that were coupled to YFP and derived a ranking of transcription factor activities [14].

Sharon et al. and Kinney et al. employed multi-bin sorts that increased the resolution of their experiments (i.e. the ability to distinguish between two different dissociation constants or equivalent measures of affinity) and permitted the analysis of frequency distributions rather than enrichment values, which are harder to interpret. However, issues remain to be addressed. First, only the expression of fluorescent protein was monitored in the protein-DNA binding studies, without accounting for variations in transcription factor levels that impact reporter gene expression. Prior work supports the importance of a correction. Liang et al. developed a two-color FACS screen for RNA gene regulatory devices [15]. One fluorescence signal reported the device activity; the other was a measure of basic transcription levels. This setup dramatically increased the resolution of the sorting scheme in comparison to a one-color strategy. Similarly, Dutta et al. gauged the stability of protein mutants by fragment reconstitution and yeast display [16]. They observed the expression and display of a mutant fragment with one fluorescence signal and the binding of a complementary fragment with another signal. Their findings suggested a correlation between the stabilities of the protein mutants and the ratio of the two fluorescence signals. Chao et al. showed qualitatively that a mixture of two yeast-displayed antibodies with very similar affinities for a target can be enriched for the stronger binder by FACS when expression levels are taken into account. Second, Kinney et al. [13] and Sharon et al. [14] considered averages of their detailed experimental information during computational analyses. They calculated position-specific scoring matrices and mean expression values, respectively. Cooperative effects and signal variance may limit the accuracy of models derived with such assumptions.

High-throughput characterization of protein interactions will be most useful if it can deliver accurate estimates of affinity or affinity rankings. For example, such estimates could enable the construction of more accurate predictive models or could guide the refinement of protein designs [7]. We present a protocol that uses a rigorous sorting strategy in combination with downstream computational processing that returns a precise affinity ranking of individual sequences. Taking advantage of yeast-surface display, in which a signal resulting from a peptide binding to a protein can be normalized by the expression level of that peptide, we developed a theoretical framework to derive the expected signals for binders of different affinities. Experimental sorting using FACS, plus library sequencing, yielded coarse-grained signal distributions for 1000 peptide-displaying clones in a single experiment. Computational processing generated a global ranking of peptide affinities, and our theoretical model allowed a detailed statistical analysis of sources of error in the final results. Because existing methods are already capable of discerning strong from weak and non-binders, we have focused on discriminating tight binders within a 500-fold range of affinities (0.1–60 nM). Accurate data in this regime may aid in the design of very strong binders that can be important therapeutic and diagnostic agents [17–]. We conducted our study using a small library of about 1000 yeast-displayed BH3 peptides that bind to Bcl-xL, a key regulator of apoptosis. High-affinity binders of Bcl-xL are of great interest due to their potential for diagnosing or surmounting apoptotic blockades in numerous cancers [20–].

## Contributor Information

NHLBI Trans-Omics for Precision Medicine (TOPMed) Consortium:

### Namiko Abe

11 New York Genome Center, New York, NY USA

### Seth Ament

158 University of Maryland, Baltimore, MD USA

### Peter Anderson

159 University of Washington, Seattle, WA USA

### Pramod Anugu

160 University of Mississippi, Jackson, MS USA

### Deborah Applebaum-Bowden

161 National Institutes of Health, Bethesda, MD USA

### Tim Assimes

162 Stanford University, Stanford, CA USA

### Dimitrios Avramopoulos

27 Johns Hopkins University, Baltimore, MD USA

### Emily Barron-Casella

27 Johns Hopkins University, Baltimore, MD USA

### Terri Beaty

27 Johns Hopkins University, Baltimore, MD USA

### Gerald Beck

23 Cleveland Clinic, Cleveland, OH USA

### Diane Becker

27 Johns Hopkins University, Baltimore, MD USA

### Amber Beitelshees

158 University of Maryland, Baltimore, MD USA

### Takis Benos

163 University of Pittsburgh, Pittsburgh, PA USA

### Marcos Bezerra

164 Fundação de Hematologia e Hemoterapia de Pernambuco–Hemope, Recife, Brazil

### Joshua Bis

159 University of Washington, Seattle, WA USA

### Russell Bowler

165 National Jewish Health, Denver, CO USA

### Ulrich Broeckel

166 Medical College of Wisconsin, Milwaukee, WI USA

### Jai Broome

159 University of Washington, Seattle, WA USA

### Karen Bunting

11 New York Genome Center, New York, NY USA

### Carlos Bustamante

162 Stanford University, Stanford, CA USA

### Erin Buth

159 University of Washington, Seattle, WA USA

### Jonathan Cardwell

125 University of Colorado at Denver, Denver, CO USA

### Vincent Carey

95 Brigham and Women’s Hospital, Boston, MA USA

### Cara Carty

167 Washington State University, Seattle, WA USA

### Richard Casaburi

168 University of California, Los Angeles, Los Angeles, CA USA

### Peter Castaldi

95 Brigham and Women’s Hospital, Boston, MA USA

### Mark Chaffin

169 Broad Institute, Cambridge, MA USA

### Christy Chang

158 University of Maryland, Baltimore, MD USA

### Yi-Cheng Chang

170 National Taiwan University, Taipei, Taiwan

### Sameer Chavan

125 University of Colorado at Denver, Denver, CO USA

### Bo-Juen Chen

11 New York Genome Center, New York, NY USA

### Wei-Min Chen

171 University of Virginia, Charlottesville, VA USA

### Lee-Ming Chuang

170 National Taiwan University, Taipei, Taiwan

### Ren-Hua Chung

172 National Health Research Institute Taiwan, Zhunan Township, Taiwan

### Suzy Comhair

23 Cleveland Clinic, Cleveland, OH USA

### Elaine Cornell

173 University of Vermont, Burlington, VT USA

### Carolyn Crandall

168 University of California, Los Angeles, Los Angeles, CA USA

### James Crapo

165 National Jewish Health, Denver, CO USA

### Jeffrey Curtis

174 University of Michigan, Ann Arbor, MI USA

### Coleen Damcott

158 University of Maryland, Baltimore, MD USA

### Sean David

175 University of Chicago, Chicago, IL USA

### Colleen Davis

159 University of Washington, Seattle, WA USA

### Lisa de las Fuentes

176 Washington University in St Louis, St Louis, MO USA

### Michael DeBaun

177 Vanderbilt University, Nashville, TN USA

### Ranjan Deka

178 University of Cincinnati, Cincinnati, OH USA

### Scott Devine

158 University of Maryland, Baltimore, MD USA

### Qing Duan

179 University of North Carolina, Chapel Hill, NC USA

### Ravi Duggirala

180 University of Texas Rio Grande Valley School of Medicine, Edinburg, TX USA

### Jon Peter Durda

173 University of Vermont, Burlington, VT USA

### Charles Eaton

181 Brown University, Providence, RI USA

### Lynette Ekunwe

160 University of Mississippi, Jackson, MS USA

182 Harvard University, Boston, MA USA

### Serpil Erzurum

23 Cleveland Clinic, Cleveland, OH USA

### Charles Farber

171 University of Virginia, Charlottesville, VA USA

### Matthew Flickinger

174 University of Michigan, Ann Arbor, MI USA

### Myriam Fornage

183 University of Texas Health at Houston, Houston, TX USA

### Chris Frazar

159 University of Washington, Seattle, WA USA

### Mao Fu

158 University of Maryland, Baltimore, MD USA

### Lucinda Fulton

176 Washington University in St Louis, St Louis, MO USA

### Shanshan Gao

125 University of Colorado at Denver, Denver, CO USA

### Yan Gao

160 University of Mississippi, Jackson, MS USA

### Margery Gass

184 Fred Hutchinson Cancer Research Center, Seattle, WA USA

### Bruce Gelb

16 Icahn School of Medicine at Mount Sinai, New York, NY USA

### Xiaoqi Priscilla Geng

174 University of Michigan, Ann Arbor, MI USA

### Mark Geraci

185 Indiana University, Indianapolis, IN USA

### Auyon Ghosh

95 Brigham and Women’s Hospital, Boston, MA USA

### Chris Gignoux

162 Stanford University, Stanford, CA USA

### David Glahn

186 Yale University, New Haven, CT USA

### Da-Wei Gong

158 University of Maryland, Baltimore, MD USA

### Harald Goring

187 University of Texas Rio Grande Valley School of Medicine, San Antonio, TX USA

### Sharon Graw

188 University of Colorado Anschutz Medical Campus, Aurora, CO USA

### Daniel Grine

125 University of Colorado at Denver, Denver, CO USA

### C. Charles Gu

176 Washington University in St Louis, St Louis, MO USA

### Yue Guan

158 University of Maryland, Baltimore, MD USA

### Namrata Gupta

169 Broad Institute, Cambridge, MA USA

### Jeff Haessler

184 Fred Hutchinson Cancer Research Center, Seattle, WA USA

### Nicola L. Hawley

186 Yale University, New Haven, CT USA

### Ben Heavner

159 University of Washington, Seattle, WA USA

### David Herrington

189 Wake Forest Baptist Health, Winston-Salem, NC USA

### Craig Hersh

95 Brigham and Women’s Hospital, Boston, MA USA

### Bertha Hidalgo

21 University of Alabama, Birmingham, AL USA

### James Hixson

183 University of Texas Health at Houston, Houston, TX USA

### Brian Hobbs

95 Brigham and Women’s Hospital, Boston, MA USA

### John Hokanson

125 University of Colorado at Denver, Denver, CO USA

### Elliott Hong

158 University of Maryland, Baltimore, MD USA

### Karin Hoth

190 University of Iowa, Iowa City, IA USA

### Chao Agnes Hsiung

172 National Health Research Institute Taiwan, Zhunan Township, Taiwan

### Yi-Jen Hung

191 Tri-Service General Hospital National Defense Medical Center, Taipei, Taiwan

### Haley Huston

192 Blood Works Northwest, Seattle, WA USA

### Chii Min Hwu

132 Taichung Veterans General Hospital Taiwan, Taichung City, Taiwan

### Rebecca Jackson

193 Ohio State University Wexner Medical Center, Columbus, OH USA

### Deepti Jain

159 University of Washington, Seattle, WA USA

### Min A. Jhun

174 University of Michigan, Ann Arbor, MI USA

### Craig Johnson

159 University of Washington, Seattle, WA USA

### Rich Johnston

194 Emory University, Atlanta, GA USA

### Kimberly Jones

27 Johns Hopkins University, Baltimore, MD USA

### Sekar Kathiresan

169 Broad Institute, Cambridge, MA USA

### Alyna Khan

159 University of Washington, Seattle, WA USA

### Wonji Kim

182 Harvard University, Boston, MA USA

### Greg Kinney

125 University of Colorado at Denver, Denver, CO USA

### Holly Kramer

195 Loyola University, Maywood, IL USA

### Christoph Lange

196 Harvard School of Public Health, Boston, MA USA

### Ethan Lange

125 University of Colorado at Denver, Denver, CO USA

### Leslie Lange

125 University of Colorado at Denver, Denver, CO USA

### Cecelia Laurie

159 University of Washington, Seattle, WA USA

### Meryl LeBoff

95 Brigham and Women’s Hospital, Boston, MA USA

### Jiwon Lee

95 Brigham and Women’s Hospital, Boston, MA USA

### Seunggeun Shawn Lee

174 University of Michigan, Ann Arbor, MI USA

### Wen-Jane Lee

132 Taichung Veterans General Hospital Taiwan, Taichung City, Taiwan

### David Levine

159 University of Washington, Seattle, WA USA

### Joshua Lewis

158 University of Maryland, Baltimore, MD USA

### Xiaohui Li

197 Lundquist Institute, Torrance, CA USA

### Yun Li

179 University of North Carolina, Chapel Hill, NC USA

### Henry Lin

197 Lundquist Institute, Torrance, CA USA

### Honghuang Lin

198 Boston University, Boston, MA USA

### Keng Han Lin

174 University of Michigan, Ann Arbor, MI USA

### Simin Liu

181 Brown University, Providence, RI USA

### Yongmei Liu

136 Duke University, Durham, NC USA

### Yu Liu

199 Stanford University, Palo Alto, CA USA

### James Luo

28 National Heart, Lung, and Blood Institute, National Institutes of Health, Bethesda, MD USA

### Michael Mahaney

200 University of Texas Rio Grande Valley School of Medicine, Brownsville, TX USA

### Barry Make

27 Johns Hopkins University, Baltimore, MD USA

### JoAnn Manson

95 Brigham and Women’s Hospital, Boston, MA USA

### Lauren Margolin

169 Broad Institute, Cambridge, MA USA

### Lisa Martin

201 George Washington University, Washington, DC USA

### Susan Mathai

125 University of Colorado at Denver, Denver, CO USA

### Susanne May

159 University of Washington, Seattle, WA USA

### Patrick McArdle

158 University of Maryland, Baltimore, MD USA

### Merry-Lynn McDonald

21 University of Alabama, Birmingham, AL USA

### Sean McFarland

202 Harvard University, Cambridge, MA USA

### Daniel McGoldrick

159 University of Washington, Seattle, WA USA

### Caitlin McHugh

159 University of Washington, Seattle, WA USA

### Hao Mei

160 University of Mississippi, Jackson, MS USA

### Luisa Mestroni

188 University of Colorado Anschutz Medical Campus, Aurora, CO USA

### Nancy Min

160 University of Mississippi, Jackson, MS USA

### Ryan L. Minster

163 University of Pittsburgh, Pittsburgh, PA USA

### Matt Moll

95 Brigham and Women’s Hospital, Boston, MA USA

### Arden Moscati

16 Icahn School of Medicine at Mount Sinai, New York, NY USA

### Solomon Musani

160 University of Mississippi, Jackson, MS USA

### Stanford Mwasongwe

160 University of Mississippi, Jackson, MS USA

### Josyf C. Mychaleckyj

171 University of Virginia, Charlottesville, VA USA

16 Icahn School of Medicine at Mount Sinai, New York, NY USA

### Rakhi Naik

27 Johns Hopkins University, Baltimore, MD USA

### Take Naseri

203 Ministry of Health, Government of Samoa, Apia, Samoa

### Sergei Nekhai

204 Howard University, Washington, DC USA

### Bonnie Neltner

125 University of Colorado at Denver, Denver, CO USA

### Heather Ochs-Balcom

205 University at Buffalo, Buffalo, NY USA

### David Paik

162 Stanford University, Stanford, CA USA

### James Pankow

206 University of Minnesota, Minneapolis, MN USA

### Afshin Parsa

158 University of Maryland, Baltimore, MD USA

### Juan Manuel Peralta

180 University of Texas Rio Grande Valley School of Medicine, Edinburg, TX USA

### Marco Perez

162 Stanford University, Stanford, CA USA

### James Perry

158 University of Maryland, Baltimore, MD USA

### Ulrike Peters

184 Fred Hutchinson Cancer Research Center, Seattle, WA USA

### Lawrence S. Phillips

194 Emory University, Atlanta, GA USA

### Toni Pollin

158 University of Maryland, Baltimore, MD USA

### Julia Powers Becker

125 University of Colorado at Denver, Denver, CO USA

### Meher Preethi Boorgula

125 University of Colorado at Denver, Denver, CO USA

### Michael Preuss

16 Icahn School of Medicine at Mount Sinai, New York, NY USA

### Dandi Qiao

95 Brigham and Women’s Hospital, Boston, MA USA

### Zhaohui Qin

194 Emory University, Atlanta, GA USA

### Nicholas Rafaels

125 University of Colorado at Denver, Denver, CO USA

### Laura Raffield

179 University of North Carolina, Chapel Hill, NC USA

### Laura Rasmussen-Torvik

207 Northwestern University, Chicago, IL USA

### Aakrosh Ratan

171 University of Virginia, Charlottesville, VA USA

### Robert Reed

158 University of Maryland, Baltimore, MD USA

### Elizabeth Regan

165 National Jewish Health, Denver, CO USA

### Muagututi’a Sefuiva Reupena

208 Lutia I Puava Ae Mapu I Fagalele, Apia, Samoa

### Carolina Roselli

169 Broad Institute, Cambridge, MA USA

### Pamela Russell

125 University of Colorado at Denver, Denver, CO USA

### Sarah Ruuska

192 Blood Works Northwest, Seattle, WA USA

### Kathleen Ryan

158 University of Maryland, Baltimore, MD USA

### Ester Cerdeira Sabino

209 Universidade de Sao Paulo, Sao Paulo, Brazil

### Danish Saleheen

210 Columbia University, New York, NY USA

### Shabnam Salimi

158 University of Maryland, Baltimore, MD USA

### Steven Salzberg

27 Johns Hopkins University, Baltimore, MD USA

### Kevin Sandow

197 Lundquist Institute, Torrance, CA USA

### Vijay G. Sankaran

211 Broad Institute, Harvard University, Boston, MA USA

### Christopher Scheller

174 University of Michigan, Ann Arbor, MI USA

### Ellen Schmidt

174 University of Michigan, Ann Arbor, MI USA

### Karen Schwander

176 Washington University in St Louis, St Louis, MO USA

### Frank Sciurba

163 University of Pittsburgh, Pittsburgh, PA USA

### Christine Seidman

46 Harvard Medical School, Boston, MA USA

### Jonathan Seidman

46 Harvard Medical School, Boston, MA USA

### Stephanie L. Sherman

194 Emory University, Atlanta, GA USA

### Aniket Shetty

125 University of Colorado at Denver, Denver, CO USA

### Wayne Hui-Heng Sheu

132 Taichung Veterans General Hospital Taiwan, Taichung City, Taiwan

### Brian Silver

212 UMass Memorial Medical Center, Worcester, MA USA

### Josh Smith

159 University of Washington, Seattle, WA USA

### Tanja Smith

11 New York Genome Center, New York, NY USA

### Sylvia Smoller

85 Albert Einstein College of Medicine, New York, NY USA

### Beverly Snively

189 Wake Forest Baptist Health, Winston-Salem, NC USA

### Michael Snyder

162 Stanford University, Stanford, CA USA

### Tamar Sofer

95 Brigham and Women’s Hospital, Boston, MA USA

### Garrett Storm

125 University of Colorado at Denver, Denver, CO USA

### Elizabeth Streeten

158 University of Maryland, Baltimore, MD USA

### Yun Ju Sung

176 Washington University in St Louis, St Louis, MO USA

### Jody Sylvia

95 Brigham and Women’s Hospital, Boston, MA USA

159 University of Washington, Seattle, WA USA

### Carole Sztalryd

158 University of Maryland, Baltimore, MD USA

### Hua Tang

162 Stanford University, Stanford, CA USA

### Margaret Taub

27 Johns Hopkins University, Baltimore, MD USA

### Matthew Taylor

125 University of Colorado at Denver, Denver, CO USA

### Simeon Taylor

158 University of Maryland, Baltimore, MD USA

### Machiko Threlkeld

159 University of Washington, Seattle, WA USA

### Lesley Tinker

184 Fred Hutchinson Cancer Research Center, Seattle, WA USA

### David Tirschwell

159 University of Washington, Seattle, WA USA

### Sarah Tishkoff

213 University of Pennsylvania, Philadelphia, PA USA

### Hemant Tiwari

21 University of Alabama, Birmingham, AL USA

### Catherine Tong

159 University of Washington, Seattle, WA USA

### Michael Tsai

206 University of Minnesota, Minneapolis, MN USA

### Dhananjay Vaidya

27 Johns Hopkins University, Baltimore, MD USA

### Peter VandeHaar

174 University of Michigan, Ann Arbor, MI USA

### Tarik Walker

125 University of Colorado at Denver, Denver, CO USA

### Robert Wallace

190 University of Iowa, Iowa City, IA USA

### Avram Walts

125 University of Colorado at Denver, Denver, CO USA

### Fei Fei Wang

159 University of Washington, Seattle, WA USA

### Heming Wang

95 Brigham and Women’s Hospital, Boston, MA USA

### Karol Watson

168 University of California, Los Angeles, Los Angeles, CA USA

### Jennifer Wessel

185 Indiana University, Indianapolis, IN USA

### Kayleen Williams

159 University of Washington, Seattle, WA USA

### L. Keoki Williams

214 Henry Ford Health System, Detroit, MI USA

### Carla Wilson

95 Brigham and Women’s Hospital, Boston, MA USA

### Joseph Wu

162 Stanford University, Stanford, CA USA

### Huichun Xu

158 University of Maryland, Baltimore, MD USA

### Lisa Yanek

27 Johns Hopkins University, Baltimore, MD USA

### Ivana Yang

125 University of Colorado at Denver, Denver, CO USA

### Rongze Yang

158 University of Maryland, Baltimore, MD USA

### Norann Zaghloul

158 University of Maryland, Baltimore, MD USA

### Maryam Zekavat

169 Broad Institute, Cambridge, MA USA

### Snow Xueyan Zhao

165 National Jewish Health, Denver, CO USA

### Wei Zhao

174 University of Michigan, Ann Arbor, MI USA

### Degui Zhi

183 University of Texas Health at Houston, Houston, TX USA

### Xiang Zhou

174 University of Michigan, Ann Arbor, MI USA

### Xiaofeng Zhu

215 Case Western Reserve University, Cleveland, OH USA