MEME formatting

Functions for parsing TF motif files in MEME format (https://meme-suite.org/meme/doc/meme-format.html). Although the format is formally defined, files vary in practice, so files other than the ones used in the examples here may not parse as expected.

# Setup block that has to be executed before any of the examples below.
import MEME_Formatting
out_dir = 'docs/gallery/'
meme_file = 'ExampleData/Jaspar_Hocomoco_Kellis_human_meme.txt'
annotation = 'ExampleData/gencode.v38.annotation_chr21Genes.gtf'  # Only chr21 genes, not the full annotation, so fewer hits are expected.
MEME_Formatting.meme_id_map(meme_file, gtf_file, species='human')

Takes a motif file in MEME format and returns, for each motif, the Ensembl gene IDs of the TFs that form it. Note that it assumes a certain syntax for the different motif versions. The species is required to look up names that are missing from the gtf file via the MyGene.info API.

Parameters:
  • meme_file – Motif file in meme format.

  • gtf_file – gtf-file in GENCODE’s format, can be gzipped.

  • species – 'human' or 'mouse'. Other species are not tested.

Returns:

  • tf_ids: Dictionary of {motif: [Ensembl IDs]} with an ID of each constituent monomer.

  • all_tf_names: List of all monomers without any version.

  • misses: Motif names that could not be mapped.

Return type:

tuple
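
The lookup of TF names in the annotation can be pictured with the following minimal sketch. It is not the module's implementation: the helper gene_name_to_id is hypothetical, it only reads the gene rows of a GENCODE-style gtf, and stripping the version from the Ensembl ID is an assumption. The gene name and ID in the toy input are invented for illustration.

```python
import re


def gene_name_to_id(gtf_lines):
    """Hypothetical helper: map gene_name -> Ensembl gene ID without version."""
    name_to_id = {}
    for line in gtf_lines:
        if line.startswith('#'):  # Skip the header comments of the gtf.
            continue
        fields = line.rstrip('\n').split('\t')
        # A gtf row has 9 tab-separated columns; column 3 is the feature type.
        if len(fields) < 9 or fields[2] != 'gene':
            continue
        gene_id = re.search(r'gene_id "([^"]+)"', fields[8])
        gene_name = re.search(r'gene_name "([^"]+)"', fields[8])
        if gene_id and gene_name:
            # Assumption: 'ENSG00000000001.5' is stored as 'ENSG00000000001'.
            name_to_id[gene_name.group(1)] = gene_id.group(1).split('.')[0]
    return name_to_id


# Toy input with an invented gene, not taken from a real annotation.
toy_gtf = [
    '##description: toy GENCODE-style annotation\n',
    'chr21\tHAVANA\tgene\t1\t100\t.\t+\t.\t'
    'gene_id "ENSG00000000001.5"; gene_type "protein_coding"; gene_name "TOY1";\n',
    'chr21\tHAVANA\texon\t1\t50\t.\t+\t.\t'
    'gene_id "ENSG00000000001.5"; gene_name "TOY1";\n',
]
toy_map = gene_name_to_id(toy_gtf)
print(toy_map)
```

Names that are absent from such a map are the ones that would need the MyGene.info lookup described above, or would end up in misses.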

# Get the Ensembl ID for the TFs in our motif meme-file.
tf_ids, all_tf_names, misses = MEME_Formatting.meme_id_map(meme_file=meme_file, gtf_file=annotation, species='human')
print('TBXT', tf_ids['TBXT'])
print('MAX::MYC', tf_ids['MAX::MYC'])
TBXT    ['ENSG00000164458']
MAX::MYC        ['ENSG00000125952', 'ENSG00000136997']
MEME_Formatting.meme_monomer_map(meme_file)

Takes a motif file in MEME format and creates a map from each motif name to a list of its constituent monomers, with any motif versions removed from the list values. E.g. {'FOS(MA1951.1)': ['FOS'], 'FOXJ2::ELF1': ['FOXJ2', 'ELF1']}.

Returns:

  • tf_monomer_map: A dictionary with {motif name: List of TF names}

  • all_monomer_names: List of all monomer names, i.e. the unique names found across all values of tf_monomer_map.

Return type:

tuple
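
How a motif name breaks down into monomers can be sketched as follows. The helper split_motif_name is hypothetical and not part of MEME_Formatting. It assumes that dimers are joined by '::' and that a version is a parenthesised suffix, as in the example names above. Other naming schemes would need different rules.

```python
import re


def split_motif_name(motif):
    """Hypothetical helper: return the monomer names of a motif, versions removed."""
    monomers = []
    for part in motif.split('::'):
        # Drop a trailing parenthesised version such as '(MA1951.1)'.
        monomers.append(re.sub(r'\([^()]*\)$', '', part.strip()))
    return monomers


print(split_motif_name('FOS(MA1951.1)'))  # ['FOS']
print(split_motif_name('FOXJ2::ELF1'))    # ['FOXJ2', 'ELF1']
```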

# Useful when in need of the individual monomers or removal of the motif versions.
tf_monomer_map, all_monomer_names = MEME_Formatting.meme_monomer_map(meme_file=meme_file)
print('BHLHA15(MA0607.2)', tf_monomer_map['BHLHA15(MA0607.2)'])
print('MAX::MYC', tf_monomer_map['MAX::MYC'])
print('all monomers', all_monomer_names[:4])
BHLHA15(MA0607.2)       ['BHLHA15']
MAX::MYC        ['MAX', 'MYC']
all monomers ['NR1I2', 'RELA', 'HSF1', 'ZIM3']
MEME_Formatting.subset_meme(meme_file, motif_names, out_file, include_dimers=True, exact_match=False)

Takes a MEME file and writes a new one containing only the motifs listed in motif_names. Header lines are preserved.

Parameters:
  • meme_file – Motif file in meme format.

  • motif_names – List of motif names to keep for the new file.

  • out_file – Path where to write the new subsetted meme file.

  • include_dimers – If True, also keeps dimers in which one of the motif_names is a monomer.

  • exact_match – If True, motifs must match an entry of motif_names exactly. If False, version suffixes are allowed, e.g. BHLHA15(MA0607.2) matches BHLHA15.
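
One plausible reading of how include_dimers and exact_match interact is sketched below. The function keep_motif is hypothetical and only illustrates the parameter descriptions. The actual selection logic of subset_meme may differ, in particular in how the two flags combine.

```python
import re


def strip_version(name):
    """Drop a trailing parenthesised version, e.g. 'FOS(MA1951.1)' -> 'FOS'."""
    return re.sub(r'\([^()]*\)$', '', name)


def keep_motif(motif, motif_names, include_dimers=True, exact_match=False):
    """Hypothetical decision rule: should this motif go into the subset?"""
    wanted = set(motif_names)
    if motif in wanted:  # A literal match is always kept.
        return True
    monomers = motif.split('::')
    if not exact_match:
        # Assumption: without exact matching, versions are ignored.
        monomers = [strip_version(m) for m in monomers]
        if '::'.join(monomers) in wanted:
            return True
    if include_dimers and len(monomers) > 1:
        # Assumption: a dimer is kept if any of its monomers was requested.
        return any(m in wanted for m in monomers)
    return False


print(keep_motif('BHLHA15(MA0607.2)', ['BHLHA15']))                    # True
print(keep_motif('BHLHA15(MA0607.2)', ['BHLHA15'], exact_match=True))  # False
print(keep_motif('MAX::MYC', ['MAX']))                                 # True
print(keep_motif('MAX::MYC', ['MAX'], include_dimers=False))           # False
```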

# Subset a meme-file, which is useful for example for excluding TFs that are not expressed.
MEME_Formatting.subset_meme(meme_file, motif_names=['MAX::MYC', 'TBXT'], out_file=out_dir+"Subset_meme.txt",
                            include_dimers=True, exact_match=False)
print(open(out_dir+"Subset_meme.txt").read())
MEME version 4

ALPHABET= ACGT

strands: + -

Background letter frequencies (from uniform background):
A 0.25000 C 0.25000 G 0.25000 T 0.25000 

MOTIF MAX::MYC 

letter-probability matrix: alength= 4 w= 11 nsites= 21 E= 0
  0.333333        0.047619        0.428571        0.190476      
  0.714286        0.047619        0.190476        0.047619      
  0.095238        0.428571        0.428571        0.047619      
  0.047619        0.952381        0.000000        0.000000      
  1.000000        0.000000        0.000000        0.000000      
  0.000000        0.952381        0.000000        0.047619      
  0.047619        0.000000        0.952381        0.000000      
  0.000000        0.047619        0.000000        0.952381      
  0.000000        0.000000        1.000000        0.000000      
  0.047619        0.047619        0.857143        0.047619      
  0.142857        0.238095        0.000000        0.619048      

MOTIF TBXT 

letter-probability matrix: alength= 4 w= 16 nsites= 7335 E= 0
  0.007307        0.021922        0.003092        0.967678      
  0.260763        0.555969        0.153712        0.029556      
  0.693440        0.009995        0.257132        0.039433      
  0.000000        0.999620        0.000000        0.000380      
  0.966808        0.000000        0.033192        0.000000      
  0.037374        0.875021        0.019030        0.068575      
  0.248283        0.481736        0.181769        0.088213      
  0.078576        0.076327        0.026733        0.818364      
  0.801734        0.023314        0.094605        0.080347      
  0.090262        0.169687        0.492718        0.247333      
  0.078512        0.015939        0.862338        0.043211      
  0.000000        0.027221        0.001862        0.970917      
  0.001571        0.000000        0.998429        0.000000      
  0.056629        0.264270        0.007753        0.671348      
  0.036126        0.119806        0.683584        0.160484      
  0.967525        0.002784        0.025748        0.003943
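
To sanity-check a subsetted file, the matrices can be read back and each row tested for summing to 1. The reader below is a minimal sketch under stated assumptions. read_meme_matrices is a hypothetical helper, not part of MEME_Formatting, and it only handles the layout shown in the output above: a MOTIF line, a 'letter-probability matrix' line, then whitespace-separated rows. The toy input reuses the first two rows of the MAX::MYC matrix.

```python
def read_meme_matrices(lines):
    """Hypothetical reader: return {motif name: list of probability rows}."""
    matrices = {}
    current = None
    in_matrix = False
    for line in lines:
        stripped = line.strip()
        if stripped.startswith('MOTIF'):
            current = stripped.split()[1]  # The name follows the MOTIF keyword.
            matrices[current] = []
            in_matrix = False
        elif stripped.startswith('letter-probability matrix'):
            in_matrix = current is not None
        elif in_matrix and stripped:
            try:
                row = [float(value) for value in stripped.split()]
            except ValueError:
                in_matrix = False  # A non-numeric line ends the matrix.
                continue
            matrices[current].append(row)
    return matrices


toy_meme = [
    'MEME version 4\n',
    '\n',
    'ALPHABET= ACGT\n',
    '\n',
    'MOTIF MAX::MYC \n',
    '\n',
    'letter-probability matrix: alength= 4 w= 2 nsites= 21 E= 0\n',
    '  0.333333        0.047619        0.428571        0.190476      \n',
    '  0.714286        0.047619        0.190476        0.047619      \n',
]
toy_matrices = read_meme_matrices(toy_meme)
for name, rows in toy_matrices.items():
    # Rows are rounded in the file, so allow a small tolerance.
    rows_ok = all(abs(sum(row) - 1.0) < 1e-4 for row in rows)
    print(name, len(rows), 'rows, all sum to 1:', rows_ok)
```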