FIMO TFBSs

FIMO is a popular tool for predicting TFBS in regions. Here are scripts that make calling FIMO easier, as they take over some preprocessing step and also process the output into a matrix of regions x TFs and a bed-file which can be loaded into a genome browser. The preprocessing includes

  • writing the sequence of the regions

  • forcing the regions to be within the genome boundaries

  • adjusting the TF motif meme-file so that the base content of the bed-regions is used as background frequencies.

In the output matrix, overlapping TFBS of the same TF on the same strand are merged and counted as 1. Take care with palindromic motifs, they are still counted on both strands. Only the regions with at least one binding site are written into the matrix. The same collected binding sites are written into the bed file to be displayed in a genome browser. The first script starts with a bed-file, the second with a set of genes to get the TFBS in their promoter regions. Both are called via the command line. If the genome sequence file doesn’t have an index, samtools has to be installed and on the PATH for the function to create it.

CARE: Tested for meme5.4.1, later versions handle sequence names differently.

TFBS in regions

Start with a bed-file and get the matrix of regions x TFs with the TFBS.

usage: FIMO_TFBS_inRegions.py [-h] --bed_file BED_FILE --PWMs PWMS --fasta
                              FASTA --fimo_src FIMO_SRC --out_dir OUT_DIR
                              [--write_sequence WRITE_SEQUENCE]
                              [--thresh THRESH]
                              [--fit_background FIT_BACKGROUND]

Named Arguments

--bed_file

Bed-file with regions for which TFBS should be annotated.

--PWMs

TF motif file in meme format.

--fasta

Genome sequence file in fasta format.

--fimo_src

Path to the Fimo executable. If fimo is on path, just type “fimo”. CARE: Tested for meme5.4.1, later versions handle sequence names differently.

--out_dir

Path to which to write the output to.

--write_sequence

If Fimo should write the sequence matched to the motif in its output file [True, False].

Default: 'False'

--thresh

p-value cutoff for the FIMO output.

Default: '0.0001'

--fit_background

Whether to adjust the motif file to the base content of the bed-regions. Otherwise, the base content in the –PWMs file will be used.

Default: True

import subprocess
import pandas as pd

# We use subprocess here to run the bash command for easier documentation. It's also possible to run it directly
# in the terminal.
bed_file = 'ExampleData/chr21_MockPeaks.bed'
PWMs = 'ExampleData/Jaspar_Hocomoco_Kellis_human_meme.txt'
fasta = 'ExampleData/chr21_MiniMock.fa'  # For the example it's a random sequence from chr21.
fimo_src = "fimo"  # If it's on the PATH, otherwise full path to the executable.
out_dir = 'docs/gallery/FIMO_inRegions'

fimo_cmd = 'python3 src/FIMO_TFBS_inRegions.py --bed_file {} --PWMs {} --fasta {} --fimo_src {} --out_dir {} --write_sequence False'.format(bed_file, PWMs, fasta, fimo_src, out_dir)
print(fimo_cmd)
subprocess.call(fimo_cmd, shell=True)
python3 src/FIMO_TFBS_inRegions.py --bed_file ExampleData/chr21_MockPeaks.bed --PWMs ExampleData/Jaspar_Hocomoco_Kellis_human_meme.txt --fasta ExampleData/chr21_MiniMock.fa --fimo_src fimo --out_dir docs/gallery/FIMO_inRegions --write_sequence False
# The output matrix will be stored in the following path.
fimo_matrix_file = 'docs/gallery/FIMO_inRegions/Out_Fimo_TFBSMatrix.txt.gz'
fimo_matrix = pd.read_table(fimo_matrix_file, sep='\t', header=0, index_col=0)
# With the tiny example most TF have no binding site, so let's pick a few that have some.
print(fimo_matrix[['FOXH1', 'STAT1', 'STAT3']])
               FOXH1  STAT1  STAT3
region                            
chr21:1-100        0      1      1
chr21:200-500      1      1      1
chr21:550-590      0      0      0

TFBS in promoter

Start with a gtf-file and get the TFBS in their promoter regions as gene x TF matrix.

usage: FIMO_TFBS_inPromoter.py [-h] --gtf GTF [--extend EXTEND]
                               [--open_regions OPEN_REGIONS] --PWMs PWMS
                               --fasta FASTA --fimo_src FIMO_SRC --out_dir
                               OUT_DIR [--write_sequence WRITE_SEQUENCE]
                               [--thresh THRESH]
                               [--fit_background FIT_BACKGROUND]

Named Arguments

--gtf

GENCODE gtf-file to take the genes and locations from.

--extend

Number of bp to extend the promoter in each direction.

Default: 200

--open_regions

Optional bed-file to restrict the promoter regions to. That means first the promoter regions are constructed and then only those promoter parts that overlap with open_regions are kept.

Default: False

--PWMs

TF motif file in meme format.

--fasta

Genome sequence file in fasta format.

--fimo_src

Path to the Fimo executable. If fimo is on path, just type “fimo”. CARE: Tested for meme5.4.1, later versions handle sequence names differently.

--out_dir

Path to which to write the output to.

--write_sequence

If Fimo should write the sequence matched to the motif in its output file [True, False].

Default: 'False'

--thresh

p-value cutoff for the FIMO output.

Default: '0.0001'

--fit_background

Whether to adjust the motif file to the base content of the bed-regions. Otherwise, the base content in the –PWMs file will be used.

Default: True

import subprocess
import pandas as pd

# The run is similar to the previous, but we use a gtf-file instead of a bed-file.
gtf_file = 'ExampleData/gencode.v38.annotation_chr21Genes.gtf'  # With tow mock genes covered by the mock fasta.
PWMs = 'ExampleData/Jaspar_Hocomoco_Kellis_human_meme.txt'
fasta = 'ExampleData/chr21_MiniMock.fa'  # For the example it's a random sequence from chr21.
fimo_src = "fimo"  # If it's on the PATH, otherwise full path to the executable.
out_dir = 'docs/gallery/FIMO_inPromoter'

fimo_cmd = 'python3 src/FIMO_TFBS_inPromoter.py --gtf {} --PWMs {} --fasta {} --fimo_src {} --out_dir {} --write_sequence False'.format(gtf_file, PWMs, fasta, fimo_src, out_dir)
subprocess.call(fimo_cmd, shell=True)

fimo_matrix_file = 'docs/gallery/FIMO_inPromoter/Out_Fimo_TFBSMatrix.txt.gz'
fimo_matrix = pd.read_table(fimo_matrix_file, sep='\t', header=0, index_col=0)
# With the tiny example most TF have no binding site, so let's pick a few that have some.
print(fimo_matrix[['FOXH1', 'REST', 'PLAG1']])
               FOXH1  REST  PLAG1
region                           
ENSG0000MOCK1      1     1      0
ENSG0000MOCK2      0     0      1