Abstract


Motivation: Identification of genes that are regulated by a given transcription factor (TF) is of major importance in biology. Most popular in silico methods focusing on the prediction of potential TF target genes are based on the sequence-specific binding TFs affinity. However approaches relying only on the DNA sequence lead to a high number of false positive binding sites that in most cases cannot be experimentally validated. To overcome this limitation, additional information such as the DNA accessibility (open chromatin) or the sequence conservation across different species are needed.

Results: We developed wimtrap, a user-friendly R package that allows organism- and context-specific predictions of TFs target genes, with special emphasis on plants species. The approach involves beforehand a modeling phase that exploits as a reference experimentally validated TFs binding sites. This step enables to obtain by machine learning model that predict TFs binding sites based on context-specific and context-unspecific features defined at the genome scale, complementing the information carried by the DNA-sequence. Models have been built from Arabidopsis thaliana, the plant-model by excellence, extensively studied, about which the necessary data are available. Assessment of the predictive performances of these models showed that they can significantly improve the TFs binding sites (TFBS) predictions, from which the potential targets genes can be inferred. Predictions can be readily obtained in Arabidopsis thaliana, with more than 600 Position Weight Matrices (PWMs) and context-specific data about 20 different organs or growth stages integrated in the package. Depending on the availability of the required data, the models can be transferred to other plants species, such as rice.

Description of the package


Wimtrap is an R package that allows the prediction of Transcription Factors Binding Sites (TFBS) and therefore identification of a set of genes that might be regulated by a transcription factor (TF) of interest. The package, comprising easy-to-use pre-built models for the plant model Arabidopsis thaliana, aims to offer tools to train and apply predictive models that reduce the false TFBS discovery rate obtained from popular pattern-matching methods. Additional genome-wide features, complementary to the information brought by the DNA sequence, are integrated in decision trees models. Such approach makes possible to take into account the dependence of the TFs binding on the chromatin environment (i.e. histones marks, DNA opening, nucleosomes occupancy), which vary across growth stages, organs, tissues and treatments.

The available functions are fast, flexible and easy to use. They are designed to give a framework that enables to get quickly TFBS models and then to not only apply them but also to analyse them, with plotting and summarizing functionalities that can help to study the TF binding, in the context of a given organism in general or in the one of TF families more specifically. The main task that is therefore requested to the user is therefore limited to data mining. It is indeed necessary to input, from source files in commonly used formats, data about, in the studied organism: on one hand, ChIP-seq data, that can be used to validate and invalidate TFBS identified by pattern-matching analysis; on the other hand, genome wide data, that can be exploited in the models and can be, for instance, the location of genomic structures (promoters, coding sequences,…), the phylogenetic sequence conservation and features related to the chromatin state..

Citation


Quentin Rivière, Massimiliano Corso, Matthieu Defrance and Nathalie Verbruggen. wimtrap: Package for integrative organism- and context-specific Transcription Factors binding sites modeling and prediction including pre-built models for Arabidopsis thaliana. R package 2018.

Installation instructions


Prerequisites for Linux OS

The libraries libcurl, libxml and libssl libraries are necessary to install the package.

On a Debian-based machine (Ubuntu,.), copy/paste the following commands into the terminal:

sudo apt-get install libcurl4-openssl-dev libxml2-dev libssl-dev

Installing a recent version of R (>= 3.5.0)

To install the wimtrap package, a 3.5.0 version of R or higher is required. You can download the latest version of R from the official CRAN website.

Installing R packages dependencies

The wimtrap package depends on various other packages that have to be prior installed, among which some belong to the Bioconductor project. These latter are installed with a specific function, different from the standard one.

install.packages(c("curl","RCurl", "XML", "ggplot2", "C50", "partykit", "DescTools", "httpuv"))
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install(c("SummarizedExperiment", "S4Vectors", "BiocParallel", "Biostrings", "GenomicRanges", "IRanges", "rtracklayer", "GenomeInfoDb", "biomartr", "biomaRt"))

Installing wimtrap package

Please download the package to the local folder of your computer “path/to/downloads” and type in the following command line in your shell, modifying “path/to/downloads/wimtrap_1.1.0.tar.gz” by the actual address of the package on your computer:

install.packages("path/to/downloads/wimtrap_1.1.0.tar.gz", type = "source", repos = NULL)

Overview of the available functions


The package aims at offering tools allowing to predict TFs targets by identifying potential TFs binding sites in a given context (i.e. organ, growth stage or treatment) .This requires beforehand a TFBS modeling phase. This step relies on the availability of ChIP-seq data for the studied organism, that allow to discriminate the true from the false TFBS. Candidates binding sites are at first identified by scanning the genome sequence with the Position Weight Matrices (PWMs) of TFs for which ChIP-seq data are available and are then characterized by a set of features:

By machine learning, it is then possible to get models that are trained to characterize the true TFBS compared to the false ones and can therefore refine the set of candidates binding sites output by the pattern-matching analysis.

Once TFBS models have been obtained, they can be analysed using summarizing and printing functionalities, in order to better understand TFs binding, and then be applied to new transcription factors. The source data considered by the models can be updated in such a way that the models can be exploited specifically in different contexts and can even be transferred to different organisms.

In total 8 different functions are available and are associated to the three main workflow steps, that can be summarized as following:

These functions are listed in the table below:

Function name Description
buildTFBSmodel Handily build organism- and context-specific (and possibly TFs family- or TF- specific) binding sites decision trres
buildTFBSmodel_custom Customizable version of the function buildTFBSmodel
checkTFsuitability Verify if the TFs considered to build the a TFBS model are actually suitable for modeling
print.TFBSmodel Summarize a TFBS model: output the contribution of the input features and the decision trees predictive performances
plot.TFBSmodel Display the ROC curve of a TFBS model, the distribution of the input features among the true and false binding sites as well as the binding sites classification decision trees
transposeTFBSmodel Adapt a TFBS model to an another one
TFBSpredict Predict TFBS and related TFs targets using a TFBS model
athalTFBSpredict Predict context-specifically TFBS and related TFs targets in the plant species Arabidopsis thaliana
filterTFBS Filter predicted TFBS and related TFs targets

Download the necessary data to run the examples of the user guide


Additional source data are necessary to run the examples of this user guide.

Please download the compressed directory "Data.zip" in which these data are stored and then unzip it into your working directory. You can replace in the following code "path/to/downloads/Data.zip" by the actual address of the archive on your computer.

Note that the user guide provides "real-world" examples that are aimed to illustrate in the best way the use of the different functions. For the convenience, we advice to test examples of the manual, that can be accessed within an R session by calling the help pages related to the functions exported by wimtrap (type ?wimtrap in the console to navigate through all the help pages). 

library(utils)
unzip(zipfile = "path/to/downloads/Data.zip", exdir = getwd())

Identification of transcription factors targets in A. thaliana using a pre-built model


Arabidopsis thaliana

Arabidopsis thaliana

Binding sites and potential target genes of Arabidopsis thaliana’s Transcription Factors can be context-specifically predicted by taking advantage of aprebuilt model which is integrated in the athalTFBSpredict function.

To be used, this function needs 3 informations:

The PFMs can be automatically retrieved on the basis of the TF names (AGI codes or gene names) thanks to the integration of the PlantTFDB database but they can as well be input from a source file in the jaspar format (See “What motif format JASPAR supports” on the FAQ of Jaspar website).

Example of jaspar file format is for 2 TFs:

>TF1
32   22   31   47   24   8   33
11   0   25   7   24   15   12
32   46   25   35   28   64   12
25   32   19   11   24   13   43
>TF2
0   100   0   0   0
90   0   99   0   2
10   0   1   100   0
0   0   0   0   98

Below “>TF name”, the matrix contains as information the number of times that each nucleotide has been counted along the motif in a set of binding sites (here they are totally 100). The columns are related to the position inside the motif and are separated by blanks while the lines are related to the nucleotides (A, C, G, T). The total number of counts at each position is constant (here, = 100).

20 different contexts are available for Arabidopsis thaliana, that can each be chosen by the user through an ID number. The context-specificity is made possible by considering DNAse-seq data obtained in various contexts that allow to identify the open regions of the DNA (so-called “DHS” for “DNAseI hypersensitive sites”). These data are originating from 3 different sources: PlantDHS, PlantRegMap and Pajoro et al. (2014).

ID code Organ Age Treatment
1 Flower Closed Control
2 Flower Open Control
3 Flower 0 day Control
4 Flower 2 days Control
5 Flower 4 days Control
6 Flower 8 days Control
7 Leaf 7 days Control
8 Root (whole) 7 days Control
9 Root hair 7 days Control
10 Root non hair 7 days Control
11 Root non hair 10 days Control
12 Seed coat 4 days past anthesis Control
13 Seed coat 7 days past anthesis Control
14 Seedling 14 days Control
15 Seedling 7 days Control
16 Seedling 7 days Dark
17 Seedling 7 days Dark 30min light
18 Seedling 7 days Dark 3hr light
19 Seedling 7 days Dark 24hr
20 Seedling 7 days Heat Shock

The pre-built models are each characterized by a degree of sensitivity, evaluated from the training dataset. In classification modeling, the balance between specificity and sensitivity is of main importance. These two measures of models predictive performances can not be optimized in the same time. Depending on the purposes of the research, it is more relevant to promote sensitivity or specificity.

In the present case, there are two classes of binding sites: the true ones (that can be validated by ChIP-seq data) and the false ones (that cannot be validated by ChIP-seq data). Sensitivity and specificity assess the ability of models to make correct predictions:

Before using athalTFBSpredict, you can draw the ROC (ie sensitivity-specificity curve) associated to the models specific to Arabidopsis thaliana:

library(wimtrap)
data("wimtrapModel_athal", package = "wimtrap")
plot(wimtrapModel_athal, type = "ROC")

Refer to the Modeling (training) curve. Around 0.65 of sensitivity, the balance between sensitivity and specificity seems interesting for general purposes.

As an example, to predict the potential targets of PIL5 (AT2G20180) and DOF1 (AT1G51700) in 14 days old  A. thaliana seedlings grown in control context (ID number = 14), the user must type:

PotentialTargets <- athalTFBSpredict(TFnames = c("AT2G20180", "AT1G51700"),
                                     context = 14,
                                     sensitivity = 0.65)

To use Position Frequency Matrices from an another source, encoded in a jaspar file on the computer, the user can type:

PotentialTargets <- athalTFBSpredict(context = 14,
                                     sensitivity = 0.65,
                                     pfm = "Data/PIF1_DOF1.jaspar")

PotentialTargets is a list of three elements:

It is possible to access to the predicted targets of PIF1 (AT2G20180), for example, as it follows:

print(PotentialTargets$PredictedTargets$AT2G20180)

Please refer to the dedicated section to further manipulate these candidates targets using filterTFBS.

Build TFBS models


Quick start

The objective of the modeling consists of refining the candidates binding-sites identified by pattern-matching analysis. Such approach is made possible by the availability of ChIP-seq data related to a given organism: these data allow to determine which candidates are true binding sites (i.e. validated by the ChIP-seq data) and which are false. If each candidate binding site is characterized by a set of relevant features, decision trees can therefore be trained to classify the candidates into the true and the false binding sites.

Logically, it is thus necessary to input:

  • ChIP-seq data related to a given organism;
  • The Pseudo Frequency Matrices (PFMs) of the motifs recognized by the TFs about which ChIP-seq data are considered;
  • The genomic sequence of the organism;
  • Genome-wide data allowing to annotate the candidates binding sites.

The required ChIP-seq data describe the location of ChIP-peaks and are encoded in a BED file in which the name field (the 4th) gives for each peak the name of the TF that binds it.

If, for a given organism, ChIP-seq data about 2 TFs, TF1 and TF2 for example, are considered, such file looks like as following:

1   74  105 TF1
1   1138    1171    TF1
1   1403    1434    TF2
2   90  122 TF2
2   1602    1635    TF2
2   2301    2334    TF1
3   125 176 TF2
3   578 597 TF1
3   2749    2782    TF1

In a jaspar file, the PFMs associated to these 2 TFs are encoded as following:

>TF1
32   22   31   47   24   8   33
11   0   25   7   24   15   12
32   46   25   35   28   64   12
25   32   19   11   24   13   43
>TF2
0   100   0   0   0
90   0   99   0   2
10   0   1   100   0
0   0   0   0   98

IMPORTANT you can notice that the names of the TFs, TF1 and TF2, are matching in the two previous files. This names matching is MANDATORY.

About the predictive features, they can be one of the tree categories:

  • numeric (example: the degree of opening of the DNA in the regions where are located the binding sites)
  • categorical (example: chromatin state in the regions where are located the binding sites)
  • overlapping (example: are the binding sites locating on a promoter?)

Annotations in terms of each feature can be obtained from source date encoded in BED format that describe the genome by regions. The type of feature is determined by the score field (the 5th):

  • the field is a numeric, the feature is numeric;
  • the field is a character, the feature is categorical;
  • the field is empty, the feature is overlapping.

The sources files (in BED format) will look like as following:

  • For a numeric feature:
1   74  192 . 10.5
1   1030    1119 . 9.2
1   2002    2071 . 7.1
2   143 215 . 11.3
2   1566    1644 . 13.4
2   1828    1907 . 8.0
3   111 186 . 10.1
3   433 577 . 6.3
3   2167    2300 . 11.8

Generally, such files give the locations of the peaks for a given signal (as histone methylation) and the score. The entire genome is thus not described. By default, the score will be smoothed on the length of the candidates binding sites and the value outside the peaks will be considered to be equal to 0. It is possible to change these parameters using the buildTFBSmodel_custom function

  • For a categorical feature:
1   1   1030 . A
1   1030    2002  . A
1   2002    2254  . B
2   1   1566 . C
2   1566    1828  . C
2   1828    2977  . B
3   1   186 . A
3   186 577 . C
3   577 2  . A

Usually, the whole genome is described in such files, but this is not mandatory.

  • For an overlapping feature:
1   74  192
1   1030    1119
1   2002    2071
2   143 215
2   1566    1644
2   1828    1907
3   111 186
3   433 577
3   2167    2300

Such files encode the location of regions with a special characteristic: they are promoters or regions of open DNA, for example.

Ideally, the models will consider the following features at the location of the candidates binding sites:

  • Genomic structural annotations (promoter, 5’UTR, coding sequence, intron, 3’UTR, downstream region);
  • Sequence phylogenetic conservation;
  • Condition-specific opening of the DNA (identified by DNAseI-seq or ATAC-seq);
  • Other context-specific features characterizing the chromatin state (histone marks or nucleosomes occupancy).

The buildTFBSmodel aims at easing the process of training the models, allowing to automatically download (from the binomial name of the studied species), according to the last genome release:

  • Location of genomic structures (promoter, 5’UTR, coding sequence, intron, 3’UTR, downstream region), from Biomart. Promoter and downstream regions length are respectively put at 2000bp (upstream the Transcription Start Site) and 1000bp (downstream the Transcript Termination Site);
  • Genome sequence, from Ensembl, RefSeq or GenBank;
  • Location of the Transcription Start Sites (TSS) of protein-coding genes, from biomart. It is essential to annotate the candidates binding sites with a potential target gene (i.e. the one whose the TSS is the closest) and to calculate their distance to the closest TSS.

A model for Arabidopsis thaliana can be built. The following data are integrated:

Optionally, the conserved elements and the open DNA regions can be scored or categorized (see the distinction between overlapping, numerical and categorical features).

Type the following command:

TFBSmodel_Athaliana_S14d <- buildTFBSmodel(organism = "Arabidopsis thaliana",
                                           ChIPpeaks = "Data/ChIPpeaks.bed",
                                           pfm = "Data/pfm_seedlings.jaspar",
                                           dhs = "Data/DHS_S14d.bed",
                                           conserved_elements = "Data/CNS.bed",
                                           proportionTF_Validation = 0.5,
                                           strandAsFeature = FALSE)
  • proportionTF_validation = 0.5 means that 50% of the TFs about which ChIP-seq data are considered (here = 4 TFs) won’t be considered to train the model and will be kept aside for cross-validation purposes of the decision trees. If, due to the scarcity of the data, it is not possible to keep TFs aside, a “training ROC”  can still be obtained. It will be assessed on the basis of the TFs considered for modeling, from candidates binding sites that were not part of the training dataset.
  • strandAsFeature = FALSE states to not consider to take into consideration the orientation of the candidates binding sites towards their closest transcript (i.e. their potential target). It has sense to take into account this feature only if one TF or TFs from a same family (i.e. supposed to exhibit similar properties) are considered.

Custom building

For customizing the model building, use the buildTFBSmodel_custom function.

Set the promoter and downstream region lengths

The promoter (upstream the Transcription Start Site) and the downstream region (downstream the Transcript Termination Site) lengths can be set when automatically downloading the promoters and downstream regions locations from biomart:

TFBSmodel_Athaliana <- buildTFBSmodel_custom(organism = "Arabidopsis thaliana",
                                             ChIPpeaks = "Data/ChIPpeaks.bed",
                                             pfm = "Data/pfm_seedlings.jaspar",
                                             dhs = "Data/DHS_S14d.bed",
                                             conserved_elements = "Data/CNS.bed",
                                             proportionTF_Validation = 0.5,
                                             strandAsFeature = FALSE,
                                             promoterLength = 1000,
                                             downstreamLength = 500)

Set the matching scores p-value threshold

By default, the considered candidates binding sites are the regions of the genome matching the TFs PFMs with a p-value of maximum 10^-3. This p-value threshold can be changed, in order to be more or less stringent in the candidates binding sites selection:

TFBSmodel_Athaliana <- buildTFBSmodel_custom(organism = "Arabidopsis thaliana",
                                             ChIPpeaks = "Data/ChIPpeaks.bed",
                                             pfm = "Data/pfm_seedlings.jaspar",
                                             dhs = "Data/DHS_S14d.bed",
                                             conserved_elements = "Data/CNS.bed",
                                             proportionTF_Validation = 0.5,
                                             strandAsFeature = FALSE,
                                             pvalThreshold = 0.0005)

Turn off automatic downloading

You might want to avoid automatic downloading of the genomic structures locations, genomic sequence and/or protein-coding TSS locations, for the following reasons:

  • To consider a previous assembly;
  • To ignore some genomic structural annotations;
  • To input data obtained from an another source than biomart. This will be necessary if the considered species is not described on this database.
  • To speed up the training, as the downloading step can be relatively long.

The data necessary for genomic structural annotations as well as the genomic sequence can be input from the following source files:

  • from BED3 file for the location of the genomic structures  (promoters, 3'UTR, introns, CDS, 5'UTR, downstream regions) : the 4th field of the BED file has to be empty;
  • from fasta file for the genomic sequence;
Regarding the protein-coding Transcription Start Sites, several informations need to be input: their locations (which corresponds to only 1 bp) as well as the name and the orientation of the related transcripts. It is then necessary to input the data in the following format:
  • in a BED file, where TSS are represented by 1 bp regions and are annotated in the name field (4th field) by the transcript names and in the strand field (5th) by the orientation of the transcripts on the DNA strand. Such a file will look like as following:
1   65  66  Gene1.1 .   +
1   1214    1215    Gene2.1 .   -
1   1627    1628    Gene2.2     +
2   250 251 Gene3.1 .   +
2   619 620 Gene4.1 .   -
2   2460    2461    Gene5.1 .   -
2   3195    3196    Gene5.1 .   -
3   90  91  Gene6.1 .   - 
3   356 357 Gene6.2 .   +
3   1448    1449    Gene6.1 .   +


Therefore, as an example, to ignore the location on introns of the candidates binding sites and to input from source files the genomic sequence and the promoters location, type:


TFBSmodel_Athaliana <- buildTFBSmodel_custom(organism = "Arabidopsis thaliana", ChIPpeaks = "Data/ChIPpeaks.bed", pfm = "Data/pfm_seedlings.jaspar", dhs = "Data/DHS_S14d.bed", conserved_elements = "Data/CNS.bed",
promoter = "Data/Promoter_athal.bed", intron = NULL, genome_sequence = "Data/genome_athal.fa",
 
proportionTF_Validation = 0.5, strandAsFeature = FALSE)

If after having trained a model a first time, you want to train a second time a model (see reasons why by refering to checkTFsuitability) avoiding the time-consuming step of downloading, the following tips are working:

StructuresLocations <- TFBSmodel_Athaliana$`Source data and annotations`$`Structure Ranges`
GenomicSequence <- TFBSmodel_Athaliana$`Source data and annotations`$genome

TFBSmodel_Athaliana <- 
  buildTFBSmodel_custom(organism = "Arabidopsis thaliana",
                         ChIPpeaks = "Data/ChIPpeaks.bed",
                         pfm = "Data/pfm_seedlings.jaspar",
                         dhs = "Data/DHS_S14d.bed",
                         conserved_elements = "Data/CNS.bed",
                         genome_sequence = GenomicSequence,
                         tss = StructuresLocations$closestTSS,
                         promoter = StructuresLocations$Promoter,
                         x5utr = StructuresLocations$X5UTR,
                         cds = StructuresLocations$CDS,
                         intron = StructuresLocations$Intron,
                         x3utr = StructuresLocations$X3UTR,
                         downstream = StructuresLocations$Downstream,
                         proportionTF_Validation = 0.5,
                         strandAsFeature = FALSE)
TF nameContext
Source ChIP-seq dataSource DNAseI-seq dataContext ID
PIFlowers 4 daysHeyndrickx et al., 2014Pajoro et al., 2014F4d
AGFlowers 4 daysJin et al., 2017Pajoro et al., 2014F4d
TOC1 Seedlings 14 daysHeyndrickx et al., 2014Zhang et al., 2012S14d
PRR5  Seedlings 14 daysHeyndrickx et al., 2014Zhang et al., 2012S14d
FLM Seedlings 14 daysHeyndrickx et al., 2014Zhang et al., 2012S14d
BES1 Seedlings 14 daysHeyndrickx et al., 2014Zhang et al., 2012S14d
SOC1 Flowers 0 dayHeyndrickx et al., 2014Pajoro et al., 2014F0d
AP3 Flowers 4 daysHeyndrickx et al., 2014Pajoro et al., 2014F4d
SEP3 Open flowersHeyndrickx et al., 2014Jin et al., 2017Fopen
SMZ Seedlings 7 daysHeyndrickx et al., 2014Jin et al., 2017S7d
PIF4Seedlings 14 daysHeyndrickx et al., 2014Zhang et al., 2012S14d
PIF5Seedlings 14 daysHeyndrickx et al., 2014Zhang et al., 2012S14d
PRR7 Seedlings 14 daysHeyndrickx et al., 2014Zhang et al., 2012S14d
LFY Seedlings 14 daysHeyndrickx et al., 2014Zhang et al., 2012S14d
CCA1 Seedlings 14 daysJin et al., 2017Zhang et al., 2012S14d


A new model for Arabidopsis thaliana can therefore be learned from all these TFs, by typing:

DHSfiles <- c(F4d = "Data/DHS_F4d.bed",
              Fopen = "Data/DHS_Fopen.bed",
              S7d = "Data/DHS_S7d.bed",
              S14d = "Data/DHS_S14d.bed",
              F0d = "Data/DHS_F0d.bed",
              F2d = "Data/DHS_F2d.bed")

GroupTFbyDHS <- data.frame(TFNames = c("PI", "AG",
"TOC1", "PRR5", "FLM", "BES1", "SOC1", "AP3", "SEP3", "SMZ", "PIF4", "PIF5", "PRR7", "LFY", "CCA1"), DHS = c("F4d", "F4d", rep("S14d", 4), "F0d", "F4d", "Fopen", "S7d", rep("S14d", 5)))

The name that is given to each file path in DHSfiles allows to define the dataframe GroupTFbyDHS, whose first column has to be named "TFNames" and the second ones "DHS". IMPORTANT: the names of the TFs input in the column TFNames corresponds to the ones used in the "Data/ChIPpeaks.bed" and "Data/pfm_athal.pfm" following files.

To build the models, type now:

TFBSmodel_Athaliana_FS <- buildTFBSmodel_custom(organism = "Arabidopsis thaliana",
                                                ChIPpeaks = "Data/ChIPpeaks.bed",
                                                pfm = "Data/pfm_athal.jaspar",
                                                dhs = DHSfiles,
groupTFbyDHS = GroupTFbyDHS,
conserved_elements = "Data/CNS.bed", proportionTF_Validation = 0.5, strandAsFeature = FALSE)

Remark: It can happen that a same TF has been studied in different contexts. There is no real support for this case. To consider all the ChIP-seq data associated to one same TF, the trick is to act as they were about different TFs and give nicknames to the TF for each context. These nicknames have to be encoded in the source files describing the ChIP-peaks locations (that have to be annotated with the different nickanames according to the context in which they have been obtained) but also in the jaspar file with the PFMs, which means that the PFM of the TF has to be repeated in this file as much as there are contexts and named according to the nicknames of the TF.

Fine tune integration of numeric features

When considering numeric features, two parameters can be set:

  • smoothing_window: gives the length of the windows centered on the candidates binding sites, along which the signal is averaged to allow the annotation of the candidates binding sites.
  • default_value_when_missing: assigns a default value to regions that are not defined in the source data.

Different genome-wide signals, coming especially from ChIP-seq experiments studying the histone marks or from DNAseI-seq or ATAC-seq studies, have few sense at the bp level and have to be considered in the context of larger regions. The two introduced parameters allow to deal with this aspect but it is not advised to consider BED files in which the signal of each bp is defined, that are heavy. The package is better designed to take as input location of related peaks together with their scores, that can be obtained by pre-processing the data with peak-calling, using a package such as chromstaR for example. In that case, the main effect of setting smoothing_window and default_value_when_missing will be to more or less lower the value associated to candidates binding sites located at the borders of the peaks compared to the ones located at the centers.

As an example, we will consider for A. thaliana, the location of peaks that score the genomic regions that are the most open, according to ATAC-seq experiment achieved in the context of 14 days old seedlings (Zefu et al., 2017). The ATAC-seq assess the sensitivity of genomic regions to the Transposase5 and therefore leads to the identification of Transposase5-hypersensitive regions (TnHS):

TFBSmodel_Athaliana_ATAC <- buildTFBSmodel_custom(organism = "Arabidopsis thaliana",
                                                  ChIPpeaks = "Data/ChIPpeaks.bed",
                                                  pfm = "Data/pfm_seedlings.jaspar",
                                                  dhs = "Data/DHS_S14d.bed",
otherFeatures = c(TnHS = "Data/TnHS_S14d.bed"),
smoothing_window = c(TnHS = 50),
default_value_when_missing = c(TnHS = 0),
 
conserved_elements = "Data/CNS.bed", proportionTF_Validation = 0.5, strandAsFeature = FALSE)

Consider results from an external pattern-matching analysis

The implemented function of pattern-matching is based on the function matchPWM du package Biostrings. The distribution of the scores output by matchPWM is evaluated by scanning sampled sequences representing the genome to infer the scores p-values. While this algorithm is fast, it is however simple, without any background correction (see the description). It could be valuable to exploit the output of more evolved tools, such as the well-established FIMO and RSAT or the R-packages motifcounter and motifmatchr.

The results of a pattern-matching analysis can be input through the matches argument as a list of GRanges objects (defined by the GenomicRanges package).

Each GRanges object of the list refers to the results of pattern-matching for one of the considered PFMs. They describe the location of the candidates binding sites and their matching scores (raw or converted to a p-value) in a metadata column.

As an example, the results of a pattern-matching analysis obtained using motifmatchr will be considred.

The following function can be defined to get the results of the analysis in the correct format, using as input source files encoding the genome sequence in FASTA and the PFMs in jaspar: This functions requires that the motifmatchr and TFBStools packages are installed.

#Install the necessary packages
if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install(c("motifmatchr", "TFBSTools"))

#Define the pattern-matching function
analyzeWithMotifmatchr <- function(pfm, genome_sequence, pvalThreshold){ ##Import the PFMs PFMs <- wimtrap:::readPSFM(pfm) #Format them as PFMatrixList PFMs <- lapply(seq_along(PFMs), function(matrix){ PFM <- PFMatrix(profileMatrix = PFMs[[matrix]], ID = names(PFMs)[matrix]) return(PFM)} ) PFMsList <- PFMatrixList() for (pfm in seq_along(PFMs)){ PFMsList[[pfm]] <- PFMs[[pfm]] names(PFMsList)[pfm] <- PFMs[[pfm]]@ID } ##Import the genome sequence GenomeSequence <- readDNAStringSet(genome_sequence) ##Define GRanges corresponding to each chromosome ChrRanges <- GRanges(seqnames = names(GenomeSequence), ranges = IRanges(start = rep(1, length(GenomeSequence)), end = width(GenomeSequence)), strand = rep("+", length(GenomeSequence))) PFMatches<- motifmatchr::matchMotifs(PFMsList, ChrRanges, GenomeSequence, out = "position", bg = "genome", p.cutoff = pvalThreshold) PFMatches <- as.list(PFMatches) }

To build a model for Arabidopsis thaliana, you can thus type:

#Load the necessary libraries
library(motifmatchr)
library(TFBSTools)
library(Biostrings)
library(IRanges)
library(GenomicRanges)

#Get the results of the pattern-matching PFMatches <- analyzeWithMotifmatchr(pfm =
"Data/pfm_seedlings.jaspar", genome_sequence = "Data/genome_athal.fa", 0.001)

#Print the head of the object containing the results head(PFMatches)

#Build a TFBS model
TFBSmodel_Athaliana <- buildTFBSmodel_custom(organism = "Arabidopsis thaliana", ChIPpeaks = "Data/ChIPpeaks.bed", matches = PFMatches, dhs = "Data/DHS_S14d.bed", conserved_elements = "Data/CNS.bed", proportionTF_Validation = 0.5, strandAsFeature = FALSE)

Analyze the models

Quick overview of the models

Printing a TFBSmodel object allows to get the predictive features (= attributes) considered by the models and a description of each model: their sensitivity and specificity, the number of trials that they carry on to make prediction (as the boost is activated), the number of binding sites (= the number of cases) that were considered to train them, the mean tree size of the decision trees associated to each trial and the attributes usage, that allows to assess which features are the most predictive.

data("wimtrapModel_athal")
wimtrapModel_athal
## TFBSmodel object
## 
##  *    10 attributes:
## 
##  matchLogPval, Promoter, X5UTR, CDS, Intron, X3UTR, Downstream, DistToClosestTSS, DHS, ConservedElement
## 
##  *    33 C5.0 predictive models:
## 
##   sensitivity  specificity iterations  cases  tree size          attributes usage
##  | --------- | ---------- | -------- | ---- | --------- | ------------------------------ |
## 
##  [ 1 ]      1.9 %       99.8 %         3       3484      3.3         100.00% matchLogPval
##                                  100.00% DHS 
##                                   21.61% Promoter 
##                                   20.58% ConservedElement 
## 
##  [ 2 ]      1.9 %       99.9 %         3       3484      2.0         100.00% matchLogPval
##                                  100.00% DHS 
##                                    1.29% ConservedElement 
## 
##  [ 3 ]      1.9 %       99.9 %         3       3484      2.0         100.00% matchLogPval
##                                  100.00% DHS 
##                                    1.49% ConservedElement 
## 
##  [ 4 ]      1.9 %       99.9 %         3       3484      2.0         100.00% matchLogPval
##                                  100.00% DHS 
##                                    1.09% ConservedElement 
## 
##  [ 5 ]      1.9 %       99.7 %         3       3484      4.7         100.00% DHS
##                                   81.54% matchLogPval 
##                                   18.46% CDS 
##                                   18.46% ConservedElement 
##                                    4.99% DistToClosestTSS 
##                                    2.10% X5UTR 
##                                    1.84% Promoter 
## 
##  [ 6 ]      3.8 %       99.6 %         3       3484      2.0         100.00% DHS
##                                   18.66% matchLogPval 
## 
##  [ 7 ]      3.8 %       99.5 %         3       3484      2.0         100.00% DHS
##                                   18.92% matchLogPval 
## 
##  [ 8 ]      3.8 %       99.5 %         3       3484      2.0         100.00% DHS
##                                   18.83% matchLogPval 
## 
##  [ 9 ]      3.8 %       99.6 %         3       3484      2.0         100.00% DHS
##                                   19.20% matchLogPval 
## 
##  [ 10 ]     3.8 %       99.6 %         3       3484      2.7         100.00% matchLogPval
##                                  100.00% DHS 
##                                    2.73% ConservedElement 
## 
##  [ 11 ]     3.8 %       99.5 %         3       3484      2.0         100.00% DHS
##                                   18.54% matchLogPval 
## 
##  [ 12 ]     5.7 %       99.6 %         3       3484      3.7         100.00% DHS
##                                   17.51% matchLogPval 
##                                   17.51% Intron 
##                                    0.75% ConservedElement 
## 
##  [ 13 ]     5.7 %       99.9 %         3       3484      2.0         100.00% DHS
##                                  100.00% ConservedElement 
##                                    9.07% matchLogPval 
## 
##  [ 14 ]     5.7 %       99.9 %         3       3484      2.3         100.00% DHS
##                                   18.11% ConservedElement 
##                                    4.76% matchLogPval 
## 
##  [ 15 ]    11.3 %       99.5 %         3       3484      2.7         100.00% matchLogPval
##                                  100.00% DHS 
##                                    4.16% ConservedElement 
## 
##  [ 16 ]    11.3 %       99.6 %         3       3484      4.0         100.00% DHS
##                                  100.00% ConservedElement 
##                                   17.39% matchLogPval 
##                                   17.39% CDS 
##                                   15.36% X3UTR 
## 
##  [ 17 ]    18.9 %       99.3 %         3       3484      2.3         100.00% DHS
##                                   18.57% ConservedElement 
##                                    5.11% matchLogPval 
## 
##  [ 18 ]    32.1 %       96.3 %         3       3484      2.7         100.00% DHS
##                                  100.00% ConservedElement 
##                                   11.22% matchLogPval 
## 
##  [ 19 ]    32.1 %       96.3 %         3       3484      2.7         100.00% matchLogPval
##                                  100.00% DHS 
##                                   18.54% ConservedElement 
##                                    1.26% X3UTR 
## 
##  [ 20 ]    32.1 %       96.3 %         3       3484      3.0         100.00% DHS
##                                   18.71% ConservedElement 
##                                    5.17% matchLogPval 
## 
##  [ 21 ]    34.0 %       95.9 %         3       3484      3.0         100.00% DHS
##                                   24.45% matchLogPval 
##                                   24.45% ConservedElement 
## 
##  [ 22 ]    34.0 %       96.0 %         3       3484      3.0         100.00% DHS
##                                   22.90% matchLogPval 
##                                   21.61% ConservedElement 
## 
##  [ 23 ]    34.0 %       96.1 %         3       3484      2.7         100.00% DHS
##                                   21.10% matchLogPval 
##                                   21.10% ConservedElement 
## 
##  [ 24 ]    34.0 %       96.0 %         3       3484      2.7         100.00% DHS
##                                   18.74% matchLogPval 
##                                   18.74% ConservedElement 
## 
##  [ 25 ]    50.9 %       94.0 %         3       3484      4.0         100.00% DHS
##                                   26.84% ConservedElement 
##                                   17.82% matchLogPval 
##                                    4.85% X5UTR 
## 
##  [ 26 ]    54.7 %       93.6 %         3       3484      4.3         100.00% DHS
##                                   25.72% matchLogPval 
##                                   25.72% ConservedElement 
##                                   16.76% Intron 
## 
##  [ 27 ]    60.4 %       92.1 %         3       3484      6.3         100.00% DHS
##                                   89.32% matchLogPval 
##                                   30.02% X5UTR 
##                                   30.02% ConservedElement 
##                                   13.17% CDS 
##                                    9.56% X3UTR 
##                                    4.08% Promoter 
##                                    0.89% DistToClosestTSS 
## 
##  [ 28 ]    79.2 %       84.4 %         3       3484      3.0         100.00% DHS
##                                  100.00% ConservedElement 
##                                   66.96% matchLogPval 
##                                    3.13% Promoter 
## 
##  [ 29 ]    79.2 %       84.6 %         3       3484      6.3         100.00% DHS
##                                  100.00% ConservedElement 
##                                   84.76% matchLogPval 
##                                   56.75% Promoter 
##                                   56.75% X5UTR 
##                                   38.29% CDS 
##                                    3.16% Intron 
## 
##  [ 30 ]    81.1 %       80.1 %         3       3484      6.7         100.00% DHS
##                                   56.54% DistToClosestTSS 
##                                   56.54% ConservedElement 
##                                   55.20% Promoter 
##                                   51.32% X5UTR 
##                                   51.21% Intron 
##                                   40.15% CDS 
##                                   33.30% matchLogPval 
## 
##  [ 31 ]    84.9 %       74.7 %         3       3484      7.0         100.00% DHS
##                                   51.92% matchLogPval 
##                                   51.92% ConservedElement 
##                                   51.03% Intron 
##                                   49.11% Promoter 
##                                   43.43% CDS 
##                                   42.59% X5UTR 
##                                    9.99% DistToClosestTSS 
## 
##  [ 32 ]    86.8 %       70.4 %         3       3484      9.7         100.00% DHS
##                                   45.52% matchLogPval 
##                                   45.52% DistToClosestTSS 
##                                   45.52% ConservedElement 
##                                   45.09% CDS 
##                                   27.53% Promoter 
##                                   27.53% X5UTR 
##                                   27.35% Intron 
##                                   16.96% X3UTR 
##                                    1.58% Downstream 
## 
##  [ 33 ]    96.2 %       41.3 %         3       3484      6.7         100.00% DistToClosestTSS
##                                  100.00% DHS 
##                                   39.61% matchLogPval 
##                                   38.46% Intron 
##                                   38.46% ConservedElement 
##                                   31.80% CDS 
##                                   31.75% Promoter

Check the suitability of TFs for modeling

The models aim to predict TFs targets based on their sequence-specific binding to the DNA. This approach implies that TFs that do not recognize a particular motif are not suitable: the pattern-matching can only identify a random sample of the regions bound by such TFs. After getting TFBS models, it is thus valuable to verify that all the considered TFs are adapted to the modeling (and/or if the data are trustable). The sequence-specificity binding of TFs can be assessed by evaluating the predictive performances of the pattern-matching alone, that has to perform better than a random guess. The ROC curve has to be associated to an area under the curve (AUC) superior to 0.5, ie has to be above the straight line of coefficient 1 and the ordinate at origin 1. Poor performances for a TF can occur either due to sequence-unspecific recruitment or due to corrupted ChIP-seq data or PFM.

This can be easily checked from a TFBSmodel using checkTFsuitability:

data("wimtrapModel_athal")
checkTFsuitability(wimtrapModel_athal)

Below, are presented illustrations of ROC curves of suitable and unsuitable TFs:

Assess graphically the models predictive performances

It is possible to compare the performances of the TFBS models to the ones of the pattern-matching to predict TFBS by plotting sensitivity-specificity curves (= ROC curves). Two ROC curves are drawn to assess the models performances: the first one is evaluated with the TFs considered to train the models while the second is obtained from the TFs kept aside for validation purposes.

In addition, the plot is accompanied by the area under the ROC curves (AUC). A better predictive method tend to get a higher AUC.

This plot is especially useful to choose the model to pick for making predictions. A model can be accessed through the sensitivity level obtained from the training TFs. The best sensitivity level depends not only on your purposes (do you need to retrieve as much as possible TRUE TFBS or rather to get a list of predicted TFBS in which you can be more confident?) but also on the performances of the method. Generally the most interesting sensitivity-specificity levels are situated in the bend of the ROC curve.

The the ROC curves associated to the models specific to Arabidopsis thaliana are taken as an example.

data("wimtrapModel_athal")
plot(wimtrapModel_athal, type = "ROC")

The models are clearly performing better than the pattern-matching. There is no drop of the performances between the TFs considered to train the models and the TFs used for validation. The models around 0.75 sensitivity are particularly interesting.

Note that optionally, the performances of the models to predict TFs genes targets (from the prediction of the TFBS) can be plot in comparison with the ones of the pattern-matching if the option targetROC of the function buildTFBSmodel was set to TRUE for you getting the TFBSmodel object. This evaluation is however time-consuming.

Characterize the TFBS according to the considered features

It is interesting to visualize how the candidates binding sites that are actually bound by the TFs, according to ChIP-seq data, are differing from the ones that are not. It allows to evaluate the relevance of the considered predictive features and possibly to better understand TFs binding: for example, with models specific to a family of TFs, it is possible to evaluate whether the TFBS are associating with an histone mark. In addition, such evaluation can also be used as quality control, to verify that the TFBS exhibits some expected characteristics.

data("wimtrapModel_athal")
plot(wimtrapModel_athal, type = "Features")

With the example of Arabidopsis thaliana, the following plots are output:

  • Pattern-matching: distribution (represented as box-plots) of the pattern-matching scores of the candidates binding sites (i.e. the regions corresponding to a TF motif) among the TRUE (i.e. ChIP-positive motifs) and the FALSE (i.e. ChIP-negative motifs) TFBS;
  • Location of the motifs on genomic structures: proportion of the TRUE (ChIP-positive motifs) and of the FALSE (ChIP-negative motifs) TFBS that are locating on each of the considered genomic structure. N.B.: a same candidate binding site can overlap several features;
  • Distance of the motifs to the closest TSS: distribution (represented as density curves) of the distance to the closest transcript start site of the candidates binding among the ChIP-positive and ChIP-negative motifs;
  • Location of the motifs on DNA features: proportion of the ChIP-positive and ChIP-negative motifs that are located on an open region of the DNA (DHS, DNAseI-hypersensitive sites) and on a phylogenetically sequence-conserved region (conserved element).

Note that as the prevalence of TRUE TFBS (ChIP-positive motifs) is low (the order of 1% of all the candidates), the ChIP-negative motifs can be considered as background.

These plots show that the TRUE TFBS are hopefully associated to expected characteristics: they tend to get higher pattern-matching scores, locate preferentially to promoters and 5’UTR and avoid coding sequences, are more represented in the immediate neighborhood of the TSS and are over-represented on DHS and conserved elements. The considered predictive features make sense.

Draw the decision trees

The decision trees that are associated to each model can be drawn. A model can be accessed through its sensitivity level evaluated from the training TFs (see how to plot ROC curves). 3 different trees can be represented for a given model. Indeed, as the modeling involves a boosting with 3 trials, the predictions will correspond to the most represented response output by 3 different tree models, that are plotted separately.

data("wimtrapModel_athal")
plot(wimtrapModel_athal, type = "Model", sensitivity = 0.60)

The candidates binding sites (=motifs) are iteratively split in smaller groups according to their values for the feature that is optimizing the separation of the ChIP-positive from the ChIP-negative motifs (by maximizing the total entropy). The number (n) of candidates binding sites ending at each final node is printed. In addition, the proportion of ChIP-positive and ChIP-negative candidates binding sites in each final node is represented as a stacked, cumulative, bar (in dark grey, the proportion of ChIP-negative motifs and in light grey, the proportion of ChIP-positive motifs). It is an another way to study the models. For example, the precision of the method to predict Arabidopsis thalian TFBS appears to be quite limited despite the great improvements: no more than 20% of candidates binding sites in a node is ChIP-positive.

Transpose TFBS models from a species to an another

The source data necessary to annotate the candidates binding sites can all be updated in a TFBSmodel object using the transposeTFBSmodel. This is especially useful to transfer the TFBS models from an organism to an another one, as the context-specific source data can be directly updated with the TFBSpredict function.

The usage of transposeTFBSmodel is similar to the one of buildTFBSmodel_custom to input the source data.

For example,  the TFBS models learned from Arabidopsis thaliana can be transferred to Oryza sativa subsp. japonica. Indeed, on PlantRegMap, data about open DNA regions in seedlings or callus and about conserved elements can be downloaded. Before inputting these data, the downloaded files have to be manipulated. The location of the conserved elements have to be encoded in a BED instead of GTF file and the score fields, both for the open DNA regions and conserved elements, have to be empty as these features are considered as overlapping in the models. This can be achieved easily, in Excel for example.

If the automatic downloading of the genomic sequence, of the location of structural genomic features and of the Transcription Start Sites (TSS) is activated, type the following, considering the open DNA regions in seedlings:

data("wimtrapModel_athal")
TFBSmodel_rice <- transposeTFBSmodel(wimtrapModel_athal,
                                    organism = "Oryza sativa",
                                    dhs= "Data/DHS_Osj_seedlings_14_days.bed",
                                    conserved_elements = "Data/CE_Osj.bed")

Apply TFBS models

It is easy to exploit TFBS models to predict TFBS and related genes targets from a TFBSmodel, that can be got using the buildTFBSmodel or buildTFBSmodel functions.

It is necessary to: