17.5: De novo motif discovery - Biology

As discussed at the beginning of this chapter, the core problem in motif finding is to define the criteria for what counts as a valid motif and where motifs are located. Alternatively, one could use ChIP-seq to search for motifs, but this method not only relies on having a known transcription factor of interest, but also requires developing antibodies that recognize said transcription factor, which can be costly and time consuming.

Ideally one would be able to discover motifs de novo, that is, without relying on an already known gene set or transcription factor. While this seems like a difficult problem, it can in fact be accomplished by taking advantage of genome-wide conservation. Because biological functions are usually conserved across species and have distinct evolutionary signatures, one can align sequences from closely related species and search specifically in conserved regions (also known as islands of conservation) in order to increase the rate of finding functional motifs.

Motif discovery using genome-wide conservation

Conservation islands often overlap known motifs, so genome-wide scans through evolutionarily conserved regions can help us discover motifs de novo. However, not all conserved regions are motifs; for instance, nucleotides surrounding motifs may also be conserved even though they are not themselves part of a motif. Distinguishing motifs from background conserved regions can be done by looking for enrichments that select more specifically for k-mers involved in regulatory motifs. For instance, one can find regulatory motifs by searching for conserved sequences enriched in intergenic regions upstream of genes compared to control regions such as coding sequences, since one would expect motifs to be enriched in or around promoters of genes (a minimal enrichment test is sketched below). One can also extend this model to find degenerate motifs: we can look for conservation of smaller, non-degenerate motifs separated by a gap of variable length, as shown in the figure below. We can also extend a motif through a greedy search in order to get closer to the local maximum-likelihood motif. Finally, the evolution of motifs can also reveal which motifs are degenerate; since a particular motif is more likely to be degenerate if it is often replaced by another motif throughout evolution, motif clustering can reveal which k-mers are likely to correspond to the same motif.
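
The enrichment idea above can be prototyped in a few lines. The sketch below is purely illustrative and not a published pipeline: it counts occurrences of a candidate k-mer in conserved intergenic windows versus a coding-sequence control set and scores the difference with a binomial test. The sequence sets and the k-mer are placeholders.

```python
# Minimal sketch: is a candidate k-mer enriched in conserved intergenic
# windows relative to a coding-sequence control set?  Toy data only.
from scipy.stats import binomtest

def count_kmer(seqs, kmer):
    """Total (overlapping) occurrences of kmer across a list of sequences."""
    k = len(kmer)
    return sum(sum(1 for i in range(len(s) - k + 1) if s[i:i + k] == kmer)
               for s in seqs)

def total_positions(seqs, k):
    return sum(max(len(s) - k + 1, 0) for s in seqs)

def intergenic_enrichment(kmer, conserved_intergenic, coding_control):
    """One-sided test: is the k-mer rate higher in conserved intergenic DNA?"""
    k = len(kmer)
    hits = count_kmer(conserved_intergenic, kmer)
    n = total_positions(conserved_intergenic, k)
    # Background rate from the coding-sequence controls; pseudocount avoids zero.
    bg_rate = (count_kmer(coding_control, kmer) + 1) / (total_positions(coding_control, k) + 1)
    return binomtest(hits, n, bg_rate, alternative="greater").pvalue

# Toy example; real input would be alignment-derived conserved windows.
conserved = ["ACGTGACGTGTTTACGTG", "GGACGTGACCTACGTGAA"]
coding    = ["ATGGCCATTGTAATGGGCC", "ATGAAAGGGCCCTTTTAA"]
print(intergenic_enrichment("ACGTG", conserved, coding))
```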

In fact, this strategy has biological relevance. In 2003, Professor Kellis argued that there must be some selective pressure causing a particular sequence to occur in specific places. His Ph.D. thesis on the topic can be found at the following location:

Validation of discovered motifs with functional datasets

These predicted motifs can then be validated with functional datasets. Predicted motifs with at least one of the following features are more likely to be real motifs:

- enrichment in co-regulated genes; one can extend this further to larger gene groups, since motifs have been found to be enriched in genes expressed in specific tissues
- overlap with TF binding experiments
- enrichment in genes from the same complex
- positional biases with respect to the transcription start site (TSS): motifs are enriched near gene TSSs (see the sketch after this list)
- upstream vs. downstream of genes, inter- vs. intra-genic positional biases: motifs are generally depleted in coding sequences
- similarity to known transcription factor motifs: some, but not all, discovered motifs may match known motifs (however, not all motifs are conserved, and known motifs may not be exactly correct)
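
As a concrete example of the positional-bias criterion, the sketch below (illustrative only; the motif positions and the scanned window size are invented) asks whether motif instances cluster near the TSS more than expected under a uniform distribution, using a one-sample Kolmogorov-Smirnov test.

```python
# Illustrative positional-bias check: under no bias, motif positions within a
# scanned upstream window should be roughly uniform; clustering near the TSS
# shows up as a significant Kolmogorov-Smirnov deviation from uniformity.
import numpy as np
from scipy.stats import kstest, uniform

window = 1000  # bp upstream of the TSS that was scanned (assumed)
# Distances of each motif instance from the TSS (toy numbers).
distances = np.array([35, 60, 80, 95, 120, 150, 160, 220, 300, 850])

# Compare the observed distances to Uniform(0, window).
stat, pval = kstest(distances, uniform(loc=0, scale=window).cdf)
print(f"KS statistic={stat:.3f}, p={pval:.4f}")
```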


HOMER

HOMER contains a novel motif discovery algorithm that was designed for regulatory element analysis in genomics applications (DNA only, no protein). It is a differential motif discovery algorithm, which means that it takes two sets of sequences and tries to identify the regulatory elements that are specifically enriched in one set relative to the other. It uses ZOOPS scoring (zero or one occurrence per sequence) coupled with hypergeometric enrichment calculations (or binomial) to determine motif enrichment. HOMER also tries its best to account for sequence bias in the dataset. It was designed with ChIP-Seq and promoter analysis in mind, but can be applied to pretty much any nucleic acid motif-finding problem.

There are several ways to perform motif analysis with HOMER. The links below introduce the various workflows for running motif analysis. In a nutshell, HOMER contains two tools, findMotifs.pl and findMotifsGenome.pl, that manage all the steps for discovering motifs in promoter and genomic regions, respectively. These scripts attempt to make it easy for the user to analyze a list of genes or genomic positions for enriched motifs. However, if you already have the sequence files that you want to analyze (e.g., FASTA files), findMotifs.pl (and homer2) can process these directly.
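
If you drive these scripts from a pipeline, they can be invoked like any other command-line tool. The sketch below mirrors the positional arguments described above (input, genome or promoter set, output directory); the file names, genome label, and output paths are placeholders to adapt to your installation.

```python
# Hedged sketch of calling HOMER from Python.  Paths, the genome label, and
# the gene list are placeholders.
import subprocess

# Motif discovery in genomic regions (e.g., ChIP-seq peaks in BED format).
subprocess.run(
    ["findMotifsGenome.pl", "peaks.bed", "hg38", "motif_output/"],
    check=True,
)

# Motif discovery in promoters of a gene list.
subprocess.run(
    ["findMotifs.pl", "genelist.txt", "human", "promoter_output/"],
    check=True,
)
```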

Regardless of how you invoke HOMER, the same basic steps are executed to discover regulatory elements:

Preprocessing:

1. Extraction of Sequences (findMotifs.pl/findMotifsGenome.pl)

2. Background Selection (findMotifs.pl/findMotifsGenome.pl)

3. GC Normalization (findMotifs.pl/findMotifsGenome.pl)

Sequences in the target and background sets are then binned based on their GC-content (5% intervals). Background sequences are weighted to resemble the GC-content distribution observed in the target sequences. This helps prevent HOMER from simply finding motifs that are GC-rich when analyzing sequences from CpG islands. To perform CpG% normalization instead of GC% (G+C) normalization, use " -cpg ". An example of the GC%-distribution of regions from a ChIP-Seq experiment:
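
A stripped-down version of this GC% weighting is sketched below, assuming the target and background sequences are already in memory; it bins sequences into 5% GC intervals and reweights the background so its weighted GC distribution matches the target's. It illustrates the idea only and does not reproduce HOMER's implementation.

```python
# Illustrative GC% normalization: weight background sequences so that, within
# 5% GC bins, their total weight matches the fraction of target sequences in
# the same bin.  A sketch of the idea, not HOMER's actual code.
from collections import Counter

def gc_bin(seq, width=0.05):
    gc = (seq.count("G") + seq.count("C")) / max(len(seq), 1)
    return int(gc / width)

def gc_weights(targets, backgrounds, width=0.05):
    t_bins = Counter(gc_bin(s, width) for s in targets)
    b_bins = Counter(gc_bin(s, width) for s in backgrounds)
    weights = []
    for s in backgrounds:
        b = gc_bin(s, width)
        t_frac = t_bins[b] / len(targets)       # target mass in this bin
        b_frac = b_bins[b] / len(backgrounds)   # background mass in this bin
        weights.append(t_frac / b_frac if b_frac > 0 else 0.0)
    return weights

# Toy example: background sequences whose GC bin matches the (GC-rich) targets
# are up-weighted, the others get weight zero.
targets = ["GCGCGCAT", "GGCCGCTA", "GCGCATGC"]
backgrounds = ["ATATATAT", "GCGCGCAT", "GGGCCCAT", "ATGCATGC"]
print(gc_weights(targets, backgrounds))
```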


4. Autonormalization (New with v3.0, homer2/findMotifs.pl/findMotifsGenome.pl)

Often the target sequences have an imbalance in sequence content other than GC%. This can be caused by biological phenomena, such as codon bias in exons, or by experimental bias caused by preferential sequencing of A-rich stretches, etc. If these sources of bias are strong enough, HOMER will lock on to them as features that significantly differentiate the target and background sequences. HOMER now offers autonormalization as a technique to remove (or partially remove) imbalances in short oligo sequences (e.g., AA) by assigning weights to background sequences. The procedure attempts to minimize the difference in short oligo frequency (summed over all oligos) between the target and background data sets. It calculates the desired weights for each background sequence to help minimize this error. Due to the complexity of the problem, HOMER uses a simple hill-climbing approach, making small adjustments to background weights one at a time. It also penalizes large changes in background weight to avoid trivial solutions that increase or decrease the weights of outlier sequences to extreme values. The length of the short oligos is controlled by the " -nlen <#> " option.
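
The sketch below captures the spirit of this step under simplifying assumptions (fixed small steps, a crude quadratic penalty, dinucleotides only). It is not HOMER's optimizer, just a toy hill climber that nudges background weights to shrink the gap in short-oligo frequencies.

```python
# Toy hill-climbing autonormalization: adjust background sequence weights so
# that weighted short-oligo (here dinucleotide) frequencies move toward the
# target frequencies, with a penalty on large weight changes.  Sketch only.
from collections import Counter

def oligo_freqs(seqs, weights, k=2):
    counts, total = Counter(), 0.0
    for s, w in zip(seqs, weights):
        for i in range(len(s) - k + 1):
            counts[s[i:i + k]] += w
            total += w
    return {o: c / total for o, c in counts.items()} if total else {}

def error(targets, backgrounds, weights, base_weights, k=2, penalty=0.1):
    t = oligo_freqs(targets, [1.0] * len(targets), k)
    b = oligo_freqs(backgrounds, weights, k)
    diff = sum((t.get(o, 0) - b.get(o, 0)) ** 2 for o in set(t) | set(b))
    drift = sum((w - w0) ** 2 for w, w0 in zip(weights, base_weights))
    return diff + penalty * drift

def autonormalize(targets, backgrounds, k=2, step=0.05, rounds=200):
    base = [1.0] * len(backgrounds)
    weights = list(base)
    best = error(targets, backgrounds, weights, base, k)
    for _ in range(rounds):
        improved = False
        for i in range(len(weights)):
            for delta in (+step, -step):
                trial = list(weights)
                trial[i] = max(trial[i] + delta, 0.0)
                e = error(targets, backgrounds, trial, base, k)
                if e < best:           # keep only improving moves
                    best, weights, improved = e, trial, True
        if not improved:
            break
    return weights

targets = ["ACGTACGT", "ACGGACGT"]
backgrounds = ["AAAAAAAA", "ACGTACGT", "TTTTACGT"]
print(autonormalize(targets, backgrounds))
```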


Discovering Motifs de novo (homer2)

By default, HOMER uses the new homer2 version of the program for motif finding. If you wish to use the old version when running any of the HOMER family of programs, add " -homer1 " to the command line.

5. Parsing input sequences into an Oligo Table

6. Oligo Autonormalization (optional)

If your input sequences are relatively long (e.g., more than 200 bp), you can also apply the autonormalization concept to the Oligo Table. The idea is still to equalize the smaller oligos (i.e., 1, 2, 3 bp) within the larger motif-length oligos (i.e., 10, 12, 14 bp, etc.). This is a little more dangerous, since the total number of motif-length oligos can be very large (i.e., 500k for 10 bp, and much more for longer motifs), meaning there are a lot of weights to "adjust". However, this can help if there is an extreme sequence bias that you are having trouble scrubbing out of the data set (the " -olen <#> " option).

7. Global Search phase

After creating (and possibly normalizing) the Oligo Table, HOMER conducts a global search for enriched "oligos". The basic idea is that if a "motif" is going to be enriched, then the oligos considered part of the motif should also be enriched. First, HOMER screens each possible oligo for enrichment. To increase sensitivity, HOMER then allows mismatches in the oligo when searching for enrichment. To speed up this process, which can be very resource-consuming for longer oligos with a large number of possible mismatches, HOMER will skip oligos when allowing multiple mismatches if they were not promising, for example if they had more background instances than target instances, or if allowing more mismatches resulted in a lower enrichment value. The " -mis <#> " option controls how many mismatches are allowed.
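
A simplified version of this screening step is sketched below; it enumerates the oligos present in the target set, counts target and background sequences containing each oligo with up to a given number of mismatches (ZOOPS-style, at most one count per sequence), and keeps the most promising ones. Real HOMER is far more optimized; this is only an illustration with toy sequences.

```python
# Sketch of the global oligo search: count, ZOOPS-style, how many target and
# background sequences contain each candidate oligo with <= max_mis mismatches,
# then rank oligos by a crude enrichment difference.  Illustration only.
def seq_contains(seq, oligo, max_mis):
    k = len(oligo)
    for i in range(len(seq) - k + 1):
        mis = sum(1 for a, b in zip(seq[i:i + k], oligo) if a != b)
        if mis <= max_mis:
            return True
    return False

def screen_oligos(targets, backgrounds, k=6, max_mis=1, top=5):
    # Candidate oligos: every k-mer observed in the target sequences.
    candidates = {s[i:i + k] for s in targets for i in range(len(s) - k + 1)}
    scored = []
    for oligo in candidates:
        t = sum(seq_contains(s, oligo, max_mis) for s in targets)
        b = sum(seq_contains(s, oligo, max_mis) for s in backgrounds)
        if b >= t:          # skip unpromising oligos
            continue
        scored.append((t / len(targets) - b / len(backgrounds), oligo, t, b))
    return sorted(scored, reverse=True)[:top]

targets = ["TTGAGGAATT", "CCGAGGATGG", "AAGAGGAACC"]
backgrounds = ["TTTTTTTTTT", "CCCCCCCCCC", "ACGTACGTAC", "GGGGAAAATT"]
print(screen_oligos(targets, backgrounds))
```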

Calculating Motif Enrichment:

Motif enrichment is calculated using either the cumulative hypergeometric or cumulative binomial distribution. These two statistics assume that the classification of input sequences (i.e., target vs. background) is independent of the occurrence of motifs within them. The statistics consider the total number of target sequences, the total number of background sequences, and how many of each type contain the motif that is being checked for enrichment. From these numbers we can calculate the probability of observing the given number (or more) of target sequences with the motif by chance, assuming there is no relationship between the target sequences and the motif. The hypergeometric and binomial distributions are similar, except that the hypergeometric assumes sampling without replacement while the binomial assumes sampling with replacement. The motif enrichment problem is more accurately described by the hypergeometric; however, the binomial has advantages. The difference between them is usually minor when there are a large number of sequences and many more background sequences than target sequences. In these cases the binomial is preferred since it is faster to calculate. As a result, it is the default statistic for findMotifsGenome.pl, where the number of sequences is typically higher. However, if you use your own background with a limited number of sequences, it might be a good idea to switch to the hypergeometric (use " -h " to force use of the hypergeometric). findMotifs.pl expects smaller numbers of sequences for promoter analysis and uses the hypergeometric by default.
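
Both statistics are available in scipy, so the calculation described above can be reproduced directly. In the sketch below the counts are invented, and the binomial success probability is estimated from the background rate; P(X >= k) is obtained from the survival function at k - 1.

```python
# Hedged example of the two enrichment statistics described above (toy counts).
from scipy.stats import hypergeom, binom

n_target = 1000        # target sequences
n_background = 40000   # background sequences
k_target = 300         # target sequences containing the motif
k_background = 3000    # background sequences containing the motif

N = n_target + n_background   # total sequences
K = k_target + k_background   # total sequences with the motif

# Hypergeometric: draw n_target sequences without replacement from N, of which
# K carry the motif; probability of >= k_target motif-containing draws.
p_hyper = hypergeom.sf(k_target - 1, N, K, n_target)

# Binomial approximation: each target sequence carries the motif independently
# with probability estimated from the background rate.
p_motif = k_background / n_background
p_binom = binom.sf(k_target - 1, n_target, p_motif)

print(f"hypergeometric p = {p_hyper:.3e}, binomial p = {p_binom:.3e}")
```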

One important note: since HOMER uses an Oligo Table for much of its internal motif enrichment calculations, where it does not explicitly know how many of the original sequences contain the motif, it approximates this number using the total number of observed motif occurrences in the background and target sequences. It assumes the occurrences were distributed uniformly among the target or background sequences with replacement, where some of the sequences are likely to have more than one occurrence. It then uses the expected number of sequences to calculate the enrichment statistic (the final output reflects the actual enrichment based on the original sequences).
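
An occupancy-style approximation of this kind can be written down directly: if m occurrences are dropped uniformly, with replacement, into N sequences, the expected number of distinct sequences hit is N(1 - (1 - 1/N)^m). The snippet below just evaluates that formula; whether HOMER uses exactly this expression internally is an assumption.

```python
# Expected number of distinct sequences containing >= 1 motif occurrence when
# m occurrences are distributed uniformly with replacement over N sequences.
# Illustrates the approximation described above (assumed form).
def expected_sequences_with_motif(m_occurrences, n_sequences):
    return n_sequences * (1.0 - (1.0 - 1.0 / n_sequences) ** m_occurrences)

print(expected_sequences_with_motif(500, 1000))   # about 393.6 of 1000 sequences
```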

8. Matrix Optimization

9. Mask and Repeat

After the first "promising oligo" is optimized into a motif, the sequences bound by the motif are removed from the analysis and the next promising oligo is optimized for the second motif, and so on. This is repeated until the desired number of motifs is found (" -S <#> ", default: 25). This is where there is an important difference between the old (homer) and new (homer2) versions. The old version of homer would simply mask the oligos bound by the motif in the Oligo Table. For example, if the motif was GAGGAW, then GAGGAA and GAGGAT would be removed from the Oligo Table to avoid having the next motif find the same sequences. However, if GAGGAW was enriched in the data, there is a good chance that any 6-mer oligo like nGAGGA or AGGAWn would also be somewhat enriched in the data. This would cause homer to find multiple versions of the same motif and introduce some confusion in the results.

To avoid this problem in the new version of HOMER (homer2), once a motif is optimized, HOMER revisits the original sequences and masks out the oligos making up the instances of the motif, as well as oligos immediately adjacent to the site that overlap it by at least one nucleotide. This provides much cleaner results and allows greater sensitivity when detecting co-enriched motifs. To revert to the old way of motif masking with homer2, specify " -quickMask " at the command line. You can also run the old version with " -homer1 ".

Screening for Enrichment of Known Motifs (homer2):

10. Load Motif Library

11. Screen Each Motif

Motif Analysis Output:

12. Motif Files (homer2, findMotifs.pl, findMotifsGenome.pl)

The true output of HOMER is a set of "*.motif" files, which contain the information necessary to identify future instances of the motifs. They are reported in the output directories from findMotifs.pl and findMotifsGenome.pl. A typical motif file will look something like:

>ASTTCCTCTT 1-ASTTCCTCTT 8.059752 -23791.535714 0 T:17311.0(44.36%),B:2181.5(5.80%),P:1e-10317
0.726 0.002 0.170 0.103
0.002 0.494 0.354 0.151
0.016 0.017 0.014 0.954
0.005 0.006 0.027 0.963
0.002 0.995 0.002 0.002
0.002 0.989 0.008 0.002
0.004 0.311 0.148 0.538
0.002 0.757 0.233 0.009
0.276 0.153 0.030 0.542
0.189 0.214 0.055 0.543

The first row starts with a ">" followed by various information, and the remaining rows are the position-specific probabilities for each nucleotide (A/C/G/T). The header row is actually TAB-delimited and contains the following information (a small parsing example follows the field list below):

  1. ">" + Consensus sequence (not actually used for anything, can be blank) example: >ASTTCCTCTT
  2. Motif name (should be unique if several motifs are in the same file) example: 1-ASTTCCTCTT or NFkB
  3. Log odds detection threshold, used to determine bound vs. unbound sites ( mandatory ) example: 8.059752
  4. log P-value of enrichment, example: -23791.535714
  5. 0 (A place holder for backward compatibility, used to describe "gapped" motifs in old version, turns out it wasn't very useful :)
  6. Occurence Information separated by commas, example: T:17311.0(44.36%),B:2181.5(5.80%),P:1e-10317
    1. T:#(%) - number of target sequences with motif, % of total of total targets
    2. B:#(%) - number of background sequences with motif, % of total background
    3. P:# - final enrichment p-value
    1. Tpos: average position of motif in target sequences (0 = start of sequences)
    2. Tstd: standard deviation of position in target sequences
    3. Bpos: average position of motif in background sequences (0 = start of sequences)
    4. Bstd: standard deviation of position in background sequences
    5. StrandBias: log ratio of + strand occurrences to - strand occurrences.
    6. Multiplicity: The averge number of occurrences per sequence in sequences with 1 or more binding site.
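
Because the format above is plain text, a few lines of Python are enough to read a *.motif file and use its log-odds threshold to call putative sites. The sketch below assumes a single motif per file, a whitespace-delimited header as displayed above, and natural-log odds against a uniform 0.25 background, which may not match HOMER's internal scoring exactly.

```python
# Sketch of reading a HOMER-style *.motif file and scanning a sequence for
# sites whose log-odds score passes the file's detection threshold.
# Assumptions: one motif per file; log odds against a uniform 0.25 background.
import math

def read_motif(path):
    with open(path) as fh:
        header = fh.readline().split()           # ">consensus name threshold ..."
        name, threshold = header[1], float(header[2])
        matrix = [list(map(float, line.split())) for line in fh if line.strip()]
    return name, threshold, matrix               # rows: probabilities for A, C, G, T

def scan(seq, matrix, threshold):
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    width, hits = len(matrix), []
    for i in range(len(seq) - width + 1):
        score = 0.0
        for j, base in enumerate(seq[i:i + width]):
            p = max(matrix[j][index.get(base, 0)], 1e-3)   # floor for zeros / N bases
            score += math.log(p / 0.25)
        if score >= threshold:
            hits.append((i, score))
    return hits

# Usage (the file name is a placeholder):
# name, threshold, matrix = read_motif("homerResults/motif1.motif")
# print(scan("TTACTTCCTCTTTAAGAGGAAGT", matrix, threshold))
```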

    13. De novo motif output (findMotifs.pl/findMotifsGenome.pl/compareMotifs.pl)

    HOMER takes the motifs identified in the de novo motif discovery step and tries to process and present them in a useful manner. An HTML page named homerResults.html is created in the output directory, along with a directory named "homerResults/" that contains all of the images and other support files needed to create the page. These pages are created by running a subprogram called " compareMotifs.pl ".

    Comparison of Motif Matrices:

    Motifs are first checked for redundancy to avoid presenting the same motifs over and over again. This is done by aligning each pair of motifs at each possible offset (and their reverse opposites) and scoring their similarity to determine their best alignment. Starting with HOMER v3.3, matrices are compared using Pearson's correlation coefficient by converting each matrix into a vector of values. Neutral frequencies (0.25) are used where the motif matrices do not overlap.
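
    The Pearson-correlation comparison described above can be prototyped as follows; the sketch aligns two probability matrices at every offset, pads non-overlapping columns with 0.25, flattens each alignment into a vector, and reports the best correlation. Reverse-complement alignments and HOMER's exact scoring details are omitted.

```python
# Sketch of comparing two motif probability matrices with Pearson correlation:
# try every relative offset, pad non-overlapping columns with neutral
# frequencies (0.25), keep the best correlation.  Reverse complements omitted.
import numpy as np

NEUTRAL = np.full(4, 0.25)

def padded(matrix, offset, length):
    """Place `matrix` (shape n x 4) at `offset` inside a neutral frame."""
    frame = np.tile(NEUTRAL, (length, 1))
    frame[offset:offset + len(matrix)] = matrix
    return frame

def best_correlation(m1, m2):
    m1, m2 = np.asarray(m1), np.asarray(m2)
    best = -1.0
    for offset in range(-(len(m2) - 1), len(m1)):
        start = min(0, offset)
        length = max(len(m1), offset + len(m2)) - start
        a = padded(m1, -start, length).ravel()
        b = padded(m2, offset - start, length).ravel()
        best = max(best, np.corrcoef(a, b)[0, 1])
    return best

pwm1 = [[0.7, 0.1, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
pwm2 = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]
print(round(best_correlation(pwm1, pwm2), 3))
```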

    The old comparison was done by comparing the probability matrices using the formula below, which manages the expectations of the calculation by scrambling the nucleotide identities as a control (freq1 and freq2 are the matrices for motif1 and motif2).


    Motifs are next compared against a library of known motifs. For this step, all motifs in JASPAR and the "known" motifs are used for comparison. You can specify a custom motif library using " -mcheck <motif library file> " when using findMotifs[Genome].pl or " -known <motif library file> " when calling compareMotifs.pl directly.

    By default, it looks for the file "/path-to-homer/data/knownTFs/all.motifs" to find the motif to compare with the de novo motifs. If "-rna" is specified, it will load the file "/path-to-homer/data/knownTFs/all.rna.motifs".

    An example of the output HTML is shown below:


    Depending on how the findMotifs[Genome].pl program was executed, the "Known Motif Enrichment Results" and "Gene Ontology Enrichment Results" may or may not link to anything. Motifs are sorted based on p-value, and basic statistics about each motif (present in the motif files) are displayed.

    The final column contains a link to the "motif file", which is important if you want to search for the motif in other sequences.

    In the Best Match/Details column, HOMER will display the known motif which most closely matched with the de novo motif. It is very important that you TAKE THIS ASSIGNMENT WITH A GRAIN OF SALT. Unfortunately, sometimes the best match still isn't any good. Also, it is common that the "known" motif isn't any good to begin with. To investigate the assignment further, click on the "More Information" link which provides a page that looks like this:

    Basic Information: This section contains basic information, including links to the motif file (normal and reverse opposite) and the PDF version of the motif logo.


    This is followed by matches to known motifs. This section shows the alignments between the de novo motif and known motifs. It's important to check whether these alignments look reasonable:


    Clicking on "similar motifs" will show the other de novo motifs found during motif finding that resemble this motif but had a lower enrichment value. The page contains a similar "header" to the "More Information" page, but below it shows the motifs that were considered similar. It is usually a good idea to check this list over: sometimes a distinct motif will be grouped incorrectly in the list because it shares a couple of residues.


    Background

    Discovering and characterizing DNA and protein sequence motifs are fundamental problems in computational biology. Here, we use the term 'motif' to refer to a position-specific probability matrix that describes a short sequence of amino acids or nucleotides that is important to the functioning of the cell. For example, the regulation of transcription requires sequence-specific binding of transcription factors to certain cis-acting motifs, which typically are located upstream of transcriptional start sites [1]. On the other hand, protein sequence motifs might correspond to active sites in enzymes or to binding sites in receptors [2].

    A wide variety of statistical methods have been developed to identify sequence motifs in an unsupervised manner from collections of functionally related sequences [3]. In addition, databases such as JASPAR [4], TRANSFAC [5], and BLOCKS [6] can be used to scan a sequence of interest for known DNA or protein motifs. In this work we develop a statistical method for comparing two DNA or protein motifs with one another. This type of comparison is valuable within the context of motif discovery. For example, imagine that you are given a collection of promoter regions from genes that share similar mRNA expression profiles, and that a motif discovery algorithm identifies a motif within those promoters. Often, the first question you would ask is whether this new motif resembles some previously identified transcription factor binding site motif. To address this question, you need a computer program that will scan a motif database for matches to your new (query) motif. The program must consider all possible relative offsets between the two motifs, and for DNA motifs it must consider reverse complement matches as well. An example alignment between two similar motifs is shown in Figure 1. An alternate use for a motif comparison program would be to identify and then eliminate or merge highly redundant motifs within an existing motif database.

    An aligned pair of similar motifs. The query and target motifs are both derived from JASPAR motif NF-Y, following the simulation protocol described in the text. Tomtom assigns an E value of 3.81 × 10^-10 to this particular match. The figure was created using a version of seqlogo [26], modified to display aligned pairs of logos.

    We are not the first to describe a method for quantifying the similarities between pairs of motifs. Pietrokovski [7] compared protein motifs using a straightforward algorithm based on the Pearson correlation coefficient (PCC). Subsequently, Hughes and coworkers [8] applied a similar method to DNA motifs. Wang and Stormo [9] introduced an alternate motif column comparison function, termed the average log-likelihood ratio (ALLR). More recently, Schones and coworkers [10] introduced two motif similarity functions, one based on the Pearson χ² test and the other on the Fisher-Irwin exact test (FIET). They showed that these two new functions have better discriminative power than the PCC and ALLR similarity functions. In addition, multiple research groups have used Kullback-Leibler divergence (KLD) to compare motifs [11–13], and Choi and coworkers [14] used Euclidean distance (ED) to compare protein profiles. Finally, Sandelin and Wasserman [15] used their own column comparison function (SW) within the context of a dynamic programming alignment approach to compare DNA motifs. This method differs significantly from all other DNA-motif based approaches in the sense that it allows gaps in the motif-motif alignments.

    In this report we focus on ungapped alignments of motifs. We describe a general method for accurately modeling the empirical null distribution of scores from an arbitrary, additive column comparison function. We estimate the null distribution of scores for each column in a 'query' motif using the observed scores of aligning it with each motif column in a database of 'target' motifs. Using a dynamic programming algorithm inspired by earlier work on searching a sequence database with a motif [16–18], we estimate the null distribution of the sum of scores for any range of contiguous columns in the query motif. This makes it possible for the user to determine whether the motif comparison score between the query motif and a particular target motif is statistically significant. Previous methods begin by defining a score between two motif columns, and then they combine these scores either by summing (as we do) [7–9, 14] or by taking the mean [11–13] or geometric mean [10] of the column scores. Our scoring method differs in that it computes the P values of the match scores for the columns of the query motif aligned with a given target motif in all possible ways (without gaps). These 'offset' P values are computed using the cumulative density functions estimated from the target database, as described above. The minimum P value among these offset P values is used to compute the overall P value of the match between the query motif and the target motif, assuming independence of the offset P values. This is called the 'motif' P value. Finally, we apply a Bonferroni correction to the motif P values to derive an E value.
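
    A much-simplified version of this scoring scheme is sketched below. It uses Pearson correlation as the column comparison function, builds a crude empirical null from randomly paired database columns (a stand-in for Tomtom's per-offset dynamic-programming null), takes the minimum offset P value, applies the independence-based combination over offsets, and multiplies by the database size for an E value. It is a conceptual sketch only, not Tomtom's algorithm.

```python
# Conceptual sketch only (NOT Tomtom's algorithm): additive column scores per
# ungapped offset -> empirical offset P values -> minimum offset P value ->
# independent-offset combination -> Bonferroni-style E value.
import numpy as np

rng = np.random.default_rng(0)

def column_score(c1, c2):
    c1, c2 = np.asarray(c1, float), np.asarray(c2, float)
    if c1.std() == 0 or c2.std() == 0:
        return 0.0                      # ignore uninformative (flat) columns
    return np.corrcoef(c1, c2)[0, 1]    # Pearson correlation of two columns

def offset_scores(query, target):
    scores = []
    for off in range(-(len(target) - 1), len(query)):
        lo, hi = max(0, off), min(len(query), off + len(target))
        scores.append(np.mean([column_score(query[i], target[i - off])
                               for i in range(lo, hi)]))
    return scores

def match_evalue(query, target, db_columns, n_targets, n_null=2000):
    query, target = np.asarray(query), np.asarray(target)
    db_columns = np.asarray(db_columns)
    # Crude empirical null: scores of randomly paired database columns.
    i = rng.integers(len(db_columns), size=n_null)
    j = rng.integers(len(db_columns), size=n_null)
    null = np.array([column_score(db_columns[a], db_columns[b])
                     for a, b in zip(i, j)])
    offset_p = [((null >= s).sum() + 1) / (n_null + 1)
                for s in offset_scores(query, target)]
    p_min = min(offset_p)
    p_motif = 1.0 - (1.0 - p_min) ** len(offset_p)  # independent-offset combination
    return p_motif * n_targets                      # Bonferroni-style E value

# Usage sketch: db_columns would hold every column of the target motif
# database; n_targets is the number of motifs in that database.
```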

    This algorithm is implemented in a software tool called Tomtom, which is publicly available as part of the MEME Suite of motif analysis tools [19–21]. Tomtom can compute E values based on any one of seven column comparison functions: PCC, ALLR, PCS, FIET, KLD, ED, or SW. In this work, we demonstrate the accuracy of Tomtom's statistical estimates. We also validate Tomtom's motif retrieval accuracy via a simulation experiment. The results show that, in addition to providing formal semantics for motif similarity scores, Tomtom's P value estimation yields improved rankings relative to ad hoc normalization schemes.


    Results

    RADAR overcomes challenges in modeling MeRIP-seq data and accommodates complex study designs

    Using BAM files as the input, RADAR first divides transcripts (concatenated exons) into 50-bp consecutive bins and quantifies pre-IP and post-IP read counts for each bin (Fig. 1a). Unlike current differential methylation analysis methods [8,9,10,11] that scale to library size as a way of normalization, which can be strongly skewed by highly expressed genes [16] (Additional file 1: Figure S1), RADAR uses the median-of-ratios method [17] implemented in DESeq2 to normalize the INPUT library for the sake of robustness. For the IP library, RADAR normalizes the fold enrichment computed as the IP counts divided by the INPUT counts, which takes both IP efficiency and IP library size variation into account.
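
    For readers unfamiliar with the median-of-ratios idea borrowed from DESeq2, the sketch below computes size factors for a toy count matrix; RADAR's actual implementation lives in its R package, so this Python version is only illustrative.

```python
# Illustrative median-of-ratios normalization (the DESeq2-style idea used for
# INPUT libraries).  counts: genes x samples matrix of read counts.
import numpy as np

def size_factors(counts):
    counts = np.asarray(counts, dtype=float)
    usable = np.all(counts > 0, axis=1)              # genes with no zero counts
    log_counts = np.log(counts[usable])
    log_geo_mean = log_counts.mean(axis=1)           # per-gene reference (log geometric mean)
    ratios = log_counts - log_geo_mean[:, None]      # log(sample / reference)
    return np.exp(np.median(ratios, axis=0))         # one size factor per sample

counts = [[100, 200, 150],
          [ 30,  60,  45],
          [500, 990, 760],
          [ 10,  22,  16]]
print(size_factors(counts))   # roughly proportional to 1 : 2 : 1.5
```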

    Unique features of m6A-seq (MeRIP-seq) data. RADAR divides concatenated exons of a gene into consecutive bins and models the immunoprecipitation (IP)-enriched read counts in such bins. Panel a depicts a pair of read counts in the INPUT and the IP library in the ith bin as c_i and t_i. In the RADAR workflow, the gene-level read count of the INPUT library substitutes for the bin-level read count c_i as the representation of the pre-IP RNA level of the ith bin. Panel b compares the relative variation of gene-level and bin-level (local) read counts for different bin sizes in four m6A-seq datasets, suggesting that unwanted variation can be reduced by using gene-level counts as the estimates of pre-IP RNA levels. Panel c compares the cross-sample mean and variance of regular RNA-seq (pre-IP counts) and m6A-seq (post-IP read counts adjusted for pre-IP RNA level variation) data in four m6A-seq datasets. The fitted curvature of m6A-seq can differ from that of RNA-seq, indicating that m6A-seq may have a different mean-variance relationship from RNA-seq. Biological and experimental confounding factors are often encountered in patient samples. Panel d shows the first two principal components (PCs) of m6A enrichment in each dataset, where the samples are colored by covariates that need to be accounted for. m6A enrichment was represented by IP sample read counts adjusted for pre-IP (INPUT) RNA-level variation. Panel e shows the first two PCs after regressing out known covariates: age in the ovarian cancer dataset and batch in the T2D dataset. After regressing out the covariate, samples are separated by disease condition on the PCA plot.

    After proper normalization across all samples, RADAR then calculates the methylation level for each bin conditioned on its pre-IP RNA expression level for each sample. In contrast to previous methods [8,9,10,11] that use peak-level read counts in the INPUT library as its measurement of pre-IP RNA expression level, we use gene-level read counts as a more robust representation, which is defined as the total number of reads across all bins that span the same gene (Fig. 1a). This choice is motivated by the observation that the median read coverage within each peak is very low—18 reads per peak (7 reads in a 50-bp bin) (Additional file 1: Figure S2) in a typical MeRIP-seq input sample of 20 million (mappable) reads (Additional file 1: Figure S3). Over-dispersion of low counts due to random sampling in the sequencing process can introduce substantial unwanted variation to the estimation of pre-IP RNA level. This can be further exacerbated by the uneven distribution of reads caused by local sequence characteristics such as GC content and mappability. Using gene-level counts as the estimate of pre-IP RNA expression level can mitigate the dispersion by increasing the number of reads (272 reads on average) and simultaneously diminishing the effects of sequence characteristics within a gene (Fig. 1a). By comparing the variance of read counts across replicates at the gene level with that at the bin level, we show that the cross-sample variance is much less at the gene level than at the bin level in all three datasets (Fig. 1b).

    RADAR models the read count distribution using a Poisson random effect model instead of a negative binomial distribution, which is commonly used in RNA-seq analysis [13, 15, 17] as well as in DRME and QNB for MeRIP-seq analysis [9, 10]. Negative binomial distribution-based models assume a quadratic relationship between mean read counts and their variance across all genes. We observe in real m6A-seq datasets that the mean-variance relationship of post-IP counts across genes significantly differs from that of regular RNA-seq counts (i.e., pre-IP counts). The former does not always follow a similar quadratic curvature and can exhibit very different patterns of variability (Fig. 1c, Additional file 1: Figure S4). To overcome these limitations, RADAR applies a more flexible generalized linear model framework (see the "Materials and methods" section) that captures variability through random effects.

    Another important advancement of RADAR, compared to existing MeRIP-seq data analysis tools [8,9,10,11], is the flexibility to incorporate covariates and permit more complex study designs. Phenotypic covariates such as age and gender, as well as experimental covariates such as batch information, are often encountered in epitranscriptomic profiling studies with heterogeneous patient samples. Covariates such as litter and age are common in experimental animal studies. For example, in the ovarian cancer dataset, the age of the tissue donors is partially confounded with the predictor variable, disease status. In the T2D islets dataset, the variance of the first two principal components is confounded with the sequencing batch (Fig. 1d). After regressing out the batch effect, the remaining variance can be better explained by disease status (Fig. 1e). This indicates the importance of controlling for potential confounding factors when performing differential methylation tests. The generalized linear model framework in RADAR allows the inclusion of covariates and offers support for complex study designs.
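
    As a hedged illustration of how covariates enter such a model (this is a plain Poisson GLM, not RADAR's Poisson random-effect model), the snippet below tests a disease effect on IP counts for one bin while adjusting for a batch covariate and offsetting by the pre-IP gene-level counts; all numbers and column names are invented.

```python
# Hedged illustration only: a plain Poisson GLM with a batch covariate and a
# log offset for pre-IP (INPUT) gene-level counts, for a single bin.  RADAR's
# actual model adds a random effect for over-dispersion; data here are toy.
import numpy as np
import pandas as pd
import statsmodels.api as sm

data = pd.DataFrame({
    "ip_count":   [52, 61, 48, 70, 31, 29, 35, 40],                  # post-IP reads in the bin
    "gene_input": [900, 1100, 850, 1200, 950, 880, 1000, 1050],      # pre-IP gene-level counts
    "disease":    [1, 1, 1, 1, 0, 0, 0, 0],                          # disease vs control (toy)
    "batch":      [0, 1, 0, 1, 0, 1, 0, 1],                          # sequencing batch (toy)
})

X = sm.add_constant(data[["disease", "batch"]])
model = sm.GLM(
    data["ip_count"], X,
    family=sm.families.Poisson(),
    offset=np.log(data["gene_input"]),   # condition on the pre-IP RNA level
)
result = model.fit()
print(result.summary().tables[1])        # the 'disease' coefficient is the DM effect
```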

    Comparative benchmarks of different methods using simulated datasets

    To evaluate the performance of RADAR in comparison to current methods, we applied RADAR and other methods for MeRIP-seq differential analysis, including exomePeak, Fisher’s exact test, MeTDiff, and QNB, to simulated datasets. We considered four scenarios: the proposed random effect model with/without covariates and the quad-negative binomial (QNB) model adopted from QNB [9, 10] with/without covariates. For each scenario, we evaluated the sensitivity and false discovery rate (FDR) of different methods using ten simulated copies. We first simulated a dataset of eight samples using the random effect model ("Materials and methods" section, Eq. (1); denoted as the simple case). The INPUT library was directly drawn from the T2D dataset. We simulated IP read counts adjusted for the pre-IP expression level of each bin according to Eq. (1), where μ is equal to the mean log read count in the "control" group of the T2D dataset. The final IP read counts were obtained by rescaling the simulated data by the average IP/INPUT ratio observed in the T2D data. In total, we simulated three datasets of 26,324 sites in which 20% of sites are true positives with effect sizes of 0.5, 0.75, or 1, respectively.

    For DM loci with an effect size of 0.5, RADAR achieved 29.1% sensitivity and 12.0% FDR at an FDR cutoff of 10%. At the same cutoff, exomePeak and Fisher’s test achieved 72.8% sensitivity/52.5% FDR and 72.2% sensitivity/50.5% FDR, respectively. MeTDiff achieved 10.5% sensitivity and 16.2% FDR. QNB, on the contrary, had no power for the small effect size. When the effect size increased, RADAR achieved much higher sensitivity, 77.8% for an effect size of 0.75 and 95.7% for an effect size of 1, while FDRs were well calibrated at 10.4% and 10.1%, respectively. exomePeak and Fisher’s test both achieved 89% and 96% sensitivity for effect sizes of 0.75 and 1, respectively, but at the cost of unsatisfactory FDRs, which were greater than 46%. MeTDiff exhibited well-calibrated FDRs (12.3% and 11.4%) and moderate sensitivities of 50.4% and 81.5% for effect sizes of 0.75 and 1, respectively. QNB only had low power for an effect size of 1 (beta = 1, 13.9% sensitivity and 0.5% FDR). Overall, for the simple case without covariates, RADAR achieved high sensitivity while maintaining a low FDR at varying true effect sizes (Fig. 2a). We then applied the above analysis at varying FDR cutoffs and found that RADAR achieved the highest sensitivity at a fixed level of empirical FDR (Additional file 1: Figure S5A). We note that exomePeak and Fisher’s test achieved high sensitivity at all effect sizes because combining read counts across replicates of the same group helped to gain power. As a tradeoff, failing to account for within-group variability resulted in high FDR. On the contrary, RADAR and MeTDiff exhibited well-calibrated FDR while achieving sensitivity at the same levels as exomePeak for large effect sizes. QNB was overconservative and possessed little power.

    Benchmarking RADAR on two simulation models. We benchmarked RADAR and other alternative methods on simulated data. Using two simulation models, a random effect (RADAR) model and a quad-negative-binomial (QNB) model, we simulated datasets of eight replicates with varying true effect sizes (0.5, 0.75, and 1), with and without covariates. We tested the different methods on the simulated datasets and compared the results at an FDR cutoff of 0.1 with the simulated true sites. We show the sensitivity (fraction of true sites detected by the method at an FDR cutoff of 0.1) and false discovery rate (fraction of detected differential sites that are not true sites) of each method applied to data simulated by the random effect model without covariates (a) and with covariates (b) and by the quad-negative-binomial model without covariates (c) and with covariates (d), respectively. The FDR cutoff used to select DM sites is labeled by a dashed line.

    We next applied the aforementioned methods to the proposed model with a covariate (effect size equal to 2, denoted as the difficult case) (Fig. 2b). At an FDR cutoff of 10%, RADAR achieved 38.4%, 79.7%, and 95.7% sensitivity, with empirical FDRs slightly higher than those in the simple case (18.2%, 14.4%, and 13.7% for effect sizes of 0.5, 0.75, and 1, respectively). MeTDiff, which had similar performance to RADAR in the simple case, lost power in the difficult case due to its inability to account for confounding factors. exomePeak, Fisher’s test, and QNB behaved similarly as in the simple case. The advantage of RADAR over other methods is robust to the choice of FDR cutoff, as shown in Additional file 1: Figure S5B. In summary, RADAR outperformed existing alternatives in both cases.

    Taking the covariate model with a DM effect size of 0.75 as an example, we also checked the distributions of effect size estimates and p values obtained from each method. In all methods, effect sizes were overall correctly estimated, with estimates for "true" sites centered at 0.75 (Additional file 1: Figure S6A) and those for null sites centered at zero (Additional file 1: Figure S6B). However, we note that the distribution of beta estimates is narrower for RADAR, especially in the difficult case, suggesting a more confident estimation. p values of exomePeak and Fisher’s test at null sites were enriched near zero, indicating over-detection of false-positive signals (Additional file 1: Figure S6C). We also observed many large p values obtained by QNB for "true" sites in both cases, and by MeTDiff in the difficult case, which suggested a high false-negative rate (Additional file 1: Figure S6D).

    We then repeated the simulation studies using the QNB model. Instead of setting the variances of the INPUT and IP libraries equal, as presented in the QNB paper, we let the variance of the IP read counts be larger than that of the INPUT. This setting better reflects our observation in the real data, as extra noise can be introduced during the immunoprecipitation process used to generate IP reads (Additional file 1: Figure S4). In the simple case without covariates, RADAR exhibited the lowest empirical FDR (18.9% and 18.5%), despite slightly lower sensitivity compared to other methods (73.5% and 82.3%), when the effect sizes were relatively large (effect sizes of 0.75 and 1). QNB performed better when the effect size was small, with 58.6% sensitivity and 15.6% FDR for an effect size of 0.5 (Fig. 2c). The results were consistent when we evaluated their performance with different FDR cutoffs. Overall, QNB performed slightly better than RADAR with an effect size of 0.5. RADAR achieved similar sensitivity but better-calibrated FDR when effect sizes equaled 0.75 and 1 (Additional file 1: Figure S5C). In the model with covariates, RADAR exhibited the lowest empirical FDR, with 25.8%, 23.0%, and 22.5% at effect sizes of 0.5, 0.75, and 1, respectively, while other methods either failed to detect the signal or had a higher empirical FDR. Specifically, MeTDiff had sensitivity below 0.5% at varying effect sizes, and QNB reached FDRs of 64.1%, 55.8%, and 50.5% for effect sizes of 0.5, 0.75, and 1, respectively, at an FDR cutoff of 10% (Fig. 2d). The advantage of RADAR over alternative methods holds in the difficult case at varying cutoffs (Additional file 1: Figure S5D). In summary, RADAR outperformed other existing methods in most scenarios, particularly when covariates were present.

    Comparative benchmarks of different methods using four real m6A-seq datasets

    Next, we compared the performance of different methods using four real m6A-seq datasets: ovarian cancer (GSE119168), T2D (GSE120024), mouse liver (GSE119490), and mouse brain (GSE113781). To evaluate the sensitivity of different methods, we first checked the distributions of p values obtained from the corresponding DM tests (Fig. 3). In the ovarian cancer, T2D, and mouse liver data, Fisher’s test and exomePeak detected the most signals, as their p values are most dense near zero. In these three datasets, RADAR also returned a desirable shape for the p value histogram, in which p values were enriched near zero while uniformly distributed elsewhere. MeTDiff returned a desired shape only in the ovarian cancer and mouse liver datasets. QNB was overconservative in the ovarian cancer and T2D datasets. All methods failed to return enriched p values near zero for the mouse brain dataset, suggesting there was little or no signal in this dataset. This is consistent with the original publication, in which very few differential peaks were detected in this study [7].

    Sensitivity of benchmarked methods on real m6A-seq data. We benchmarked RADAR and other alternative methods on four m6A-seq datasets with different characteristics. Each panel shows the histogram of p values obtained from DM tests using RADAR, MeTDiff, QNB, Fisher’s exact test, and exomePeak on each dataset, respectively.

    To ensure that well-performing methods achieved high sensitivity while maintaining a low FDR, we further performed permutation analyses to obtain the null distribution of p values for each dataset. Specifically, we shuffled the phenotype labels of samples such that the new labels were not associated with the true ones or any other important confounding factors. We expected the p values from a permutation test to follow a uniform distribution, so p values enriched near zero would be considered false discoveries. For each dataset, we combined test statistics from 15 permuted copies and compared their distribution with that of the original tests (Fig. 4). p values from Fisher’s test and exomePeak were strongly enriched near zero and only slightly lower than those from the original tests. This suggests the strong signals detected by these two methods are likely to be false discoveries, consistent with the conclusion from the simulation analysis. On the contrary, the histograms of p values from RADAR were close to flat in all datasets, indicating that strong signals detected by RADAR were more likely to be true. MeTDiff exhibited well-calibrated p values in the ovarian cancer and T2D data but was enriched for small p values in the mouse liver data, indicating a high FDR. The QNB test returned conservative p value estimates in all datasets. Taking these analyses together, we demonstrated that RADAR outperforms the alternatives by achieving high sensitivity and specificity simultaneously in real datasets.

    Benchmarking false-positive signals using permutation analysis on real m6A-seq data. To assess the empirical FDR of the tests, we permuted the phenotype labels of samples so that the new labels were not associated with the true ones. Each panel shows the histograms of p values obtained from DM tests on 15 permuted copies (blue) and those from the tests on the original dataset (red).

    To better demonstrate that RADAR detects DM sites with better sensitivity and specificity in real data, we show examples of a DM site that is only detected by RADAR, as well as likely false discovery sites identified by exomePeak and Fisher’s test but not by RADAR, in the T2D dataset. We plot the sequence coverage of individual samples for the DM site in the RNF213 gene (Additional file 1: Figure S7A) and show that, despite large variability in control samples, the m6A enrichment of T2D samples is consistently lower at this locus. Conversely, in the bogus DM sites detected by alternative methods (Additional file 1: Figure S7B, C), enrichment differences are mainly driven by one or two outlier samples in one group.

    To further demonstrate the advantage of using gene-level read counts over local read counts to account for RNA expression level, we repeated the above analysis using post-IP counts adjusted by the local read counts of INPUT. We showed that in the T2D dataset, gene-level adjustment not only enabled stronger signal detection, but also lowered FDR as we observed that the permutation analysis using local count adjustment resulted in undesired stronger signals around zero in the p value histogram (Additional file 1: Figure S8). In the ovarian cancer and the mouse liver datasets, local count adjustment achieved higher signal detection but at the cost of a higher FDR. This analysis suggested that using gene-level read counts as the estimates of pre-IP RNA expression levels could effectively reduce FDR and lead to more accurate DM locus detections.

    Owing to the robust representation of pre-IP RNA expression levels using gene-level read counts, RADAR’s performance is more robust to the sequencing depth of INPUT samples. To demonstrate this, we applied RADAR to data created by sub-sampling the read counts of INPUT samples in the T2D dataset so that the sequencing depth was half that of the full dataset (average 17.5 million reads). We compared the DM sites detected in the reduced dataset with the results obtained from the full dataset (Additional file 1: Figure S9A). Using a 10% FDR cutoff, RADAR-detected DM sites in the reduced dataset showed the highest overlap with those in the full dataset. MeTDiff and QNB only had a few overlapping DM sites between the sub-sampled and full datasets. Fisher’s test and exomePeak had slightly fewer overlaps compared to RADAR but had more false discoveries. We further compared the log fold change (logFC) estimates from the reduced and full datasets to check their consistency. As a result, we found that reduced sequencing depth had the least impact on the logFC estimated by RADAR, while the estimates by the other methods were much less reproducible at a shallower sequencing depth (Additional file 1: Figure S9A).

    Unlike earlier pipelines that perform DM tests only on peaks identified from peak calling, RADAR directly tests all filtered bins and reports DM sites. To check whether the DM sites reported by RADAR are consistent with known characteristics of m6A, we performed a de novo motif search on these sites and found that DM sites detected in the ovarian cancer, mouse liver, and T2D datasets are enriched for the known m6A consensus motif (Additional file 1: Figure S10A) [18], suggesting that DM sites reported by RADAR are mostly true. We also examined the topological distribution of these DM sites by metagene analysis (Additional file 1: Figure S10B). The distributions in the ovarian cancer and mouse liver datasets are consistent with the topological distribution of common m6A sites, indicating that the methylation changes that occurred in these two datasets were not spatially biased. Interestingly, DM sites detected in the T2D dataset are strongly enriched at the 5′UTR, suggesting that T2D-related m6A alterations are more likely to occur at the 5′UTR.

    RADAR analyses of m6A-seq data connect phenotype with m6A-modulated molecular mechanisms

    Finally, we investigated whether DM test results obtained from RADAR would lead to better downstream interpretation. In the ovarian cancer dataset, we performed KEGG pathway enrichment analysis on the differentially methylated genes (DMGs) detected by RADAR (Fig. 5a). We found the detected DMGs were enriched with molecular markers related to ovarian cancer dissemination [19, 20]. For instance, we identified key regulators of the PI3K (enrichment p value 7.8 × 10^-5) and MAPK pathways (enrichment p value 1.1 × 10^-4), including hypo-methylated PTEN and hyper-methylated BCL2 (Additional file 1: Figure S11). Other notable DMGs include key markers of ovarian cancer such as MUC16 (CA-125) and PAX8, as well as genes that play key roles in ovarian cancer biology such as CCNE1 and MTHFR. Conversely, DMGs detected by MeTDiff were only enriched in three KEGG pathways (Fig. 5b), most likely due to its inadequate power. We showed through permutation analysis that exomePeak and Fisher’s test results included a significant portion of false positives and could lead to biased downstream interpretations.

    Pathways enriched in differentially methylated genes identified in the ovarian cancer and T2D datasets. We performed KEGG pathway enrichment analysis using ClusterProfiler [37] on DMGs identified in the ovarian cancer dataset by RADAR (a) and MeTDiff (b), respectively. The enrichment maps represent the identified pathways as a network with edges weighted by the ratio of overlapping gene sets.

    In the T2D dataset, DMGs identified by RADAR were enriched in related pathways including insulin signaling pathways, type II diabetes mellitus, mTOR pathways, and AKT pathways (Additional file 1: Table S1), indicating a role that m6A might play in T2D. We further analyzed these DMGs in related pathways and found that the methylome of the insulin/IGF1-AKT-PDX1 signaling pathway was mostly hypo-methylated in T2D islets (Additional file 1: Figure S12). Impairment of this pathway resulting in downregulation of PDX1 has been recognized as a mechanism associated with T2D, where PDX1 is a critical gene regulating β cell identity and cell cycle and promoting insulin secretion [21,22,23,24]. Indeed, a follow-up experiment on a cell line model validated the role of m6A in tuning the cell cycle and insulin secretion in β cells, and an animal model lacking the methyltransferase Mettl14 in β cells recapitulated key T2D phenotypes (results presented in a separate manuscript, [25]). To summarize, RADAR-identified DMGs enabled us to pursue an in-depth analysis of the role that m6A methylation plays in T2D. On the contrary, due to their inability to take sample acquisition batches as covariates, the alternative methods were underpowered to detect DM sites in the T2D dataset and could not lead to any in-depth discovery of m6A biology in T2D islets. These examples suggest that MeRIP-seq followed by RADAR analysis could further advance functional studies of RNA modifications.

    Validation of RADAR-detected DM sites by the SELECT method

    Recently, Xiao et al. developed an elongation- and ligation-based qPCR amplification method (termed SELECT) for single nucleotide-specific detection of m6A [26]. This method relies on a mechanism different from antibody pull-down-based MeRIP-seq to detect m6A, making it a suitable method for validating DM sites discovered by RADAR analysis. We selected six DM sites (Additional file 1: Table S2), including two sites only detected by RADAR and four sites in genes important to β cells, for experimental validation using the SELECT method. Among the six validated sites, the β cell regulator PDX1 and a RADAR-specific DM site showed significant m6A level alteration, with p values of 0.009 and 0.017, respectively (Fig. 6). Three other sites, IGF1R in the insulin/IGF1-AKT-PDX1 signaling pathway, MAFA, another important regulator of β cell function, and a RADAR-specific DM site in CPEB2, showed m6A changes consistent with the RADAR results despite not reaching statistical significance. The sites in the TRIB3 gene were similarly methylated in control and T2D samples as measured by SELECT. Overall, five out of six experimentally validated sites were supported by orthogonal evidence from SELECT, confirming the reliability of RADAR-detected differential methylation sites.

    Experimental validation of RADAR-detected DM sites using the SELECT method. We applied the antibody-independent method SELECT to T2D samples (N = 4). Shown are SELECT results for the six putative DM sites selected for validation. SELECT measures the relative abundance of non-methylated RNA molecules at the target locus, as represented by the elongation and ligation "read through" of oligo probes. Thus, SELECT results ("relative read through") are inversely correlated with m6A level.


    3 BENCHMARK RESULTS

    We performed a benchmark study of GimmeMotifs on 18 TF ChIP-seq datasets. The ROC AUC and MNCP of the best performing motif were calculated and compared with those of the best motif from two other ensemble methods: SCOPE (Carlson et al., 2007) and W-ChIPMotifs (Jin et al., 2009) (Supplementary Tables S1 and S2). The results show that GimmeMotifs consistently produces accurate results (median ROC AUC 0.830). The method also significantly improves on the results of SCOPE (ROC AUC 0.613). The recently developed W-ChIPMotifs shows results comparable to GimmeMotifs (ROC AUC 0.824), although this tool does not cluster similar redundant motifs. In addition, the focus of GimmeMotifs is different. While the web interface of W-ChIPMotifs is very useful for casual use, the command-line tools of GimmeMotifs can be integrated into more sophisticated analysis pipelines.





    Understanding gene regulatory networks has become one of the central research problems in bioinformatics. More than thirty algorithms have been proposed to identify DNA regulatory sites during the past thirty years. However, the prediction accuracy of these algorithms is still quite low. Ensemble algorithms have emerged as an effective strategy in bioinformatics for improving the prediction accuracy by exploiting the synergetic prediction capability of multiple algorithms.

    Results

    We propose a novel clustering-based ensemble algorithm named EMD for de novo motif discovery by combining multiple predictions from multiple runs of one or more base component algorithms. The ensemble approach is applied to the motif discovery problem for the first time. The algorithm was tested on a benchmark dataset generated from E. coli RegulonDB. The EMD algorithm achieved a 22.4% improvement in nucleotide-level prediction accuracy over the best stand-alone component algorithm. The advantage of the EMD algorithm is more significant for shorter input sequences, but most importantly, it always outperforms, or at least stays at the same performance level as, the stand-alone component algorithms even for longer sequences.

    Conclusion

    We propose an ensemble approach for the motif discovery problem that takes advantage of the availability of a large number of motif discovery programs. We have shown that the ensemble approach is an effective strategy for improving both sensitivity and specificity, and thus the accuracy of the prediction. The advantage of the EMD algorithm is its flexibility, in the sense that a new powerful algorithm can easily be added to the system.

    Publication Info

    Published in BMC Bioinformatics, Volume 7, Issue 342, 2006.

    © BMC Bioinformatics 2006, BioMed Central

    Hu, J., Yang, Y. D., & Kihara, D. (2006). EMD: An ensemble algorithm for discovering regulatory motifs in DNA sequences. BMC Bioinformatics, 7(342).


    Computational Biology: Toward Deciphering Gene Regulatory Information in Mammalian Genomes

    Department of Statistics, Harvard University, 1 Oxford Street, Cambridge, Massachusetts 02138, U.S.A.

    Departments of Statistics and Health Research and Policy, Stanford University, 390 Serra Mall, Stanford, California 94305, U.S.A.


    Abstract

    Computational biology is a rapidly evolving area where methodologies from computer science, mathematics, and statistics are applied to address fundamental problems in biology. The study of gene regulatory information is a central problem in current computational biology. This article reviews recent developments of statistical methods related to this field. Starting from microarray gene selection, we examine methods for finding transcription factor binding motifs and cis-regulatory modules in coregulated genes, and methods for utilizing information from cross-species comparisons and ChIP-chip experiments. The ultimate understanding of cis-regulatory logic in mammalian genomes may require the integration of information collected from all these steps.


    Ectopic DNMT3L triggers assembly of a repressive complex for retroviral silencing in somatic cells

    Mammalian genomes are replete with retrotransposable elements, including endogenous retroviruses. DNA methyltransferase 3-like (DNMT3L) is an epigenetic regulator expressed in prospermatogonia, growing oocytes, and embryonic stem (ES) cells. Here, we demonstrate that DNMT3L enhances the interaction of repressive epigenetic modifiers, including histone deacetylase 1 (HDAC1), SET domain, bifurcated 1 (SETDB1), DNA methyltransferase 3A (DNMT3A), and tripartite motif-containing protein 28 (TRIM28 also known as TIF1β and KAP1) in ES cells and orchestrates retroviral silencing activity with TRIM28 through mechanisms including, but not limited to, de novo DNA methylation. Ectopic expression of DNMT3L in somatic cells causes methylation-independent retroviral silencing activity by recruitment of the TRIM28/HDAC1/SETDB1/DNMT3A/DNMT3L complex to newly integrated Moloney murine leukemia virus (Mo-MuLV) proviral DNA. Concurrent with this recruitment, we also observed the accumulation of histone H3 lysine 9 trimethylation (H3K9me3) and heterochromatin protein 1 gamma (HP1γ), as well as reduced H3K9 and H3K27 acetylation at Mo-MuLV proviral sequences. Ectopic expression of DNMT3L in late-passage mouse embryonic fibroblasts (MEFs) recruited cytoplasmically localized HDAC1 to the nucleus. The formation of this epigenetic modifying complex requires interaction of DNMT3L with DNMT3A as well as with histone H3. In fetal testes at embryonic day 17.5, endogenous DNMT3L also enhanced the binding among TRIM28, DNMT3A, SETDB1, and HDAC1. We propose that DNMT3L may be involved in initiating a cascade of repressive epigenetic modifications by assisting in the preparation of a chromatin context that further attracts DNMT3A-DNMT3L binding and installs longer-term DNA methylation marks at newly integrated retroviruses.

    Importance: Almost half of the mammalian genome is composed of endogenous retroviruses and other retrotransposable elements that threaten genomic integrity. These elements are usually subject to epigenetic silencing. We discovered that two epigenetic regulators that lack enzymatic activity, DNA methyltransferase 3-like (DNMT3L) and tripartite motif-containing protein 28 (TRIM28), collaborate with each other to impose retroviral silencing. In addition to modulating de novo DNA methylation, we found that by interacting with TRIM28, DNMT3L can attract various enzymes to form a DNMT3L-induced repressive complex to remove active marks and add repressive marks to histone proteins. Collectively, these results reveal a novel and pivotal function of DNMT3L in shaping the chromatin modifications necessary for retroviral and retrotransposon silencing.

    Copyright © 2014, American Society for Microbiology. All Rights Reserved.

    Figures

    DNMT3L and the ZFP809-TRIM28 pathway are both required for epigenetic silencing of Mo-MuLV…

    DNMT3L- and ZFP809-TRIM28-mediated Mo-MuLV silencing in C57BL/6 background ES cells…

    DNMT3L facilitated the formation of the DNMT3A/SETDB1/HDAC1 protein complex in ES cells…

    DNMT3L-induced retroviral silencing activity depends on PBSpro sequence and functional DNMT3L harboring proper…

    DNMT3L induces retroviral silencing activity in 3T3 cells…

    Mo-MuLV LUC and Mo-MuLV LUC/PBSQ have the same infection titers…

    DNMT3L can recruit epigenetic modifiers to induce repressive histone modifications on Mo-MuLV LTR…

    Ectopic DNMT3L induces the formation of a repressive chromatin modifier complex in DNMT3L-expressing…

    DNMT3L induces HDAC1 translocation to the nucleus in later-passage MEFs…

    DNMT3L facilitates the formation of the protein complex containing DNMT3A, SETDB1, and HDAC1…


    DNA motif discovery using chemical reaction optimization

    DNA motif discovery means finding short, similar sequence elements within a set of nucleotide sequences. It has become an essential task in bioinformatics because of its useful applications, such as compression, summarization, and clustering algorithms. Motif discovery is an NP-hard problem, and exact algorithms cannot solve it in polynomial time. Many optimization algorithms have been proposed to solve this problem; however, none of them has shown its supremacy by overcoming all the obstacles. Chemical Reaction Optimization (CRO) is a population-based metaheuristic algorithm that can easily be fitted to this optimization problem. Here, we propose an algorithm based on the Chemical Reaction Optimization technique to solve the DNA motif discovery problem. The four basic operators of CRO have been redesigned for this problem to search the solution space both locally and globally. Two additional operators (repair functions) have been proposed to improve the quality of the solutions. They are applied to the final solution after the iteration stage of CRO to obtain a better one. Using the flexible mechanism of the elementary operators of CRO along with the additional operators (repair functions), it is possible to determine motifs more precisely. Our proposed method is compared with other traditional algorithms such as the Gibbs sampler, AlignACE (Aligns Nucleic Acid Conserved Elements), MEME (Multiple Expectation Maximization for Motif Elicitation), and ACRI (Ant-Colony-Regulatory-Identification) by testing on real-world datasets. The experimental results show that the proposed algorithm can give better results than other traditional algorithms in quality and in less running time. In addition, statistical tests have been performed to show the superiority of the proposed algorithm over other state-of-the-art methods in this area.




