GTF processing

Functions based on gtf-files (tested for GENCODE’s format) and the resulting outputs. Centred on the usage of pybedtools, but the output can also be easily written to bed files. To run the example code for this module, we will always start with this block of code:

import GTF_Processing
annotation = 'ExampleData/gencode.v38.annotation_Mini.gtf'
gene_list = ['ENSG00000160294', 'ENSG00000279493', 'ENSG00000279720']
GTF_Processing.gene_window_bed(gtf_file, extend=200, gene_set={}, tss_type='5', dict_only=False, merge=False, open_regions=False)

Based on a gtf file fetches all or the most 5’ TSS for all genes, and returns a BedTool object with windows around the TSS, expanding by ‘extend’ in each direction, resulting in a total window size of 2*’extend’+1. Alternatively gives a dictionary with the TSS, also containing the number of transcripts, gene name and strand. The BedTools intervals will be 0-based, the TSS in the dictionary still 1-based like in the gtf-file. Care: removes the .-suffixes from all gene IDs.

Parameters:
  • gtf_file – gtf-file in GENCODE’s format, can be gzipped.

  • extend – Number of base pairs to extend the TSS in each direction. 200 means a window of size 401.

  • gene_set – Set of Ensembl IDs or gene names or mix of both to limit the output to. If empty, return for all genes in the annotation.

  • tss_type – “5” to get only the 5’ TSS or “all” to get all unique TSS of all transcripts in the gtf-file.

  • dict_only – Returns a dictionary instead of a BedTool’s object.

  • merge – If True, merges all overlapping promoters of the same gene into one row in the BedTool’s object.

  • open_regions – Optional bed file or BedTools’ object, only overlapping parts of promoters will be kept for the BedTool’s object. Can therefore be used to find genes whose promoter overlap a set of peaks, for example to find genes that are accessible.

promoter_regions = GTF_Processing.gene_window_bed(gtf_file=annotation, extend=200, gene_set=gene_list, tss_type='5')
print(promoter_regions)
chr21   5011598 5011999 ENSG00000279493 .       +
chr21   9082900 9083301 ENSG00000279720 .       -
chr21   46286096        46286497        ENSG00000160294 .       -
GTF_Processing.gene_feature_table(gtf_file, gene_set=(), extend=500000)

Based on a gtf-file fetch several features on gene-level into one DataFrame, relying heavily on other gtf-processing functions.

Parameters:
  • gtf_file – gtf-file in GENCODE’s format, can be gzipped.

  • gene_set – Set of Ensembl IDs or gene names or mix of both to limit the output to. If empty, return for all genes in the annotation.

  • extend – How much the gene windows should be extended to each side of the TSS to calculate gene density. 1000 means a window of size 2001.

Returns:

A DataFrame with the features, with the indices being either the gene_set or the gtf-file genes. Gene density refers to the number of other genes within a window of ±extend.

Return type:

pandas DataFrame

# Gets the biotypes as annotated in a gtf-file, and also provides a high-level annotation.
feature_table = GTF_Processing.gene_feature_table(gtf_file=annotation, gene_set=gene_list)
print(feature_table)
                   Gene name  #TSS  #Transcripts  Gene length  Exons length  \
Ensembl ID                                                                    
ENSG00000160294       MCM3AP    10            10        51164          9542   
ENSG00000279493  CH507-9B2.2     1             1         5346           509   
ENSG00000279720     SOWAHCP2     1             1          500           500   

                          Biotype gtf Biotype general  Gene density  
Ensembl ID                                                           
ENSG00000160294        protein_coding  protein-coding             0  
ENSG00000279493        protein_coding  protein-coding             0  
ENSG00000279720  processed_pseudogene      pseudogene             0  
GTF_Processing.match_gene_identifiers(gene_identifiers, gtf_file='', species='human', scopes='symbol,alias,uniprot', fields='ensembl,symbol', ensemblonly=False, return_full=False)

Takes a list of gene identifiers to map to a different type of identifier or multiple identifiers. For Ensembl IDs and symbols first check in a gtf-file, all not found there are queried via the API of https://mygene.info/ (how to cite: https://mygene.info/citation/). See the documentation of mygene for further specifications of the options: https://docs.mygene.info/en/latest/doc/query_service.html. Note, there might be cases where multiple matches are found per gene. For Ensembl IDs the first one is taken. Pay attention to upper case and lower case of symbols. Name mapping is fun.

Parameters:
  • gene_identifiers – List of identifiers, e.g. symbols, Ensembl IDs or Entrez IDs.

  • gtf_file – Gene annotation in gtf-style, can be gzipped. Only needed for symbols and Ensembl IDs.

  • species – Give the name of the species, see below for the available ones.

  • scopes – Where mygene.info will search for the gene symbols.

  • fields – csv-string to which fields the output will be limited to. E.g. ‘ensembl,symbol’. Can also be ‘all’.

  • ensemblonly – If only to return the hits with valid Ensembl gene ids.

  • return_full – If to return the full dictionary as it is received by mygene. Contains more information such as protein or transcript ids which are in nested dictionaries from Ensembl. In this case, doesn’t include any information that was found in the gtf file.

Returns:

  • mapped_identifiers: Dict of {identifier: {field: field ID for each field}} of the mappable identifiers.

  • no_hits: Identifiers which were not mappable via gtf-file nor mygene.info.

Return type:

tuple

# The best task of all, map different identifiers. Let's start with Ensembl IDs to gene symbols, which can often
# be successfully done by a gtf-file alone.
gene_ids = ['ENSG00000160294', 'ENSG00000279493', 'ENSG00000279720']
mapped_ids, missed_ids = GTF_Processing.match_gene_identifiers(gene_ids, gtf_file=annotation, species='human',
                                                               fields="symbol")
print(mapped_ids)
{'ENSG00000279493': {'ensembl': 'ENSG00000279493', 'symbol': 'CH507-9B2.2'}, 'ENSG00000279720': {'ensembl': 'ENSG00000279720', 'symbol': 'SOWAHCP2'}, 'ENSG00000160294': {'ensembl': 'ENSG00000160294', 'symbol': 'MCM3AP'}}
# Next we start from symbols, which is more prone to failing. We can also query for the Entrez ID.
gene_symbols = ['AXL', 'MYO10', 'ATP5SL']
mapped_symbols, missed_symbols = GTF_Processing.match_gene_identifiers(gene_symbols, gtf_file=annotation, species='human',
                                                                       fields="ensembl,entrezgene")
print(mapped_symbols)
{'AXL': {'ensembl': 'ENSG00000167601', 'entrezgene': '558'}, 'MYO10': {'ensembl': 'ENSG00000145555', 'entrezgene': '4651'}, 'ATP5SL': {'ensembl': 'ENSG00000105341', 'entrezgene': '55101'}}
# If we want to lookup Entrez IDs, we have to add entrezgene to the scope in which mygene looks.
gene_entrez = [4651, 558]
mapped_entrez, missed_entrez = GTF_Processing.match_gene_identifiers(gene_entrez, gtf_file=annotation, species='human',
                                                                     scopes='symbol,entrezgene', fields="ensembl")
print(mapped_entrez)
{'4651': {'ensembl': 'ENSG00000145555'}, '558': {'ensembl': 'ENSG00000167601'}}
GTF_Processing.gene_body_bed(gtf_file, gene_set={}, dict_only=False)

From a gtf-file fetches the gene bodies, meaning start-end for the entries labelled as ‘gene’.

Parameters:
  • gtf_file – gtf-file in GENCODE’s format, can be gzipped.

  • gene_set – Set of Ensembl IDs or gene names or mix of both to limit the output to. If empty, return for all genes in the annotation.

  • dict_only – Returns a dict {Ensembl ID: [chr, start, end, Ensembl ID, ‘.’, strand]. Else, returns a bedtool- object with the coordinates per gene.

gene_bodies = GTF_Processing.gene_body_bed(gtf_file=annotation, gene_set=gene_list, dict_only=False)
print(gene_bodies)
chr21   5011799 5017145 ENSG00000279493 .       +
chr21   9082601 9083101 ENSG00000279720 .       -
chr21   46235133        46286297        ENSG00000160294 .       -
GTF_Processing.gene_feature_bed(gtf_file, feature, gene_set={}, dict_only=False, merge=True, length_only=False, keep_strand=False)

On gene-level fetch a specific feature of a gene, e.g. all exons, matching the 3rd column in the gtf-file.

Parameters:
  • gtf_file – gtf-file in GENCODE’s format, can be gzipped.

  • feature – one of gtf’s annotated regions: ‘CDS’, ‘Selenocysteine’, ‘UTR’, ‘exon’, ‘gene’, ‘start_codon’, ‘stop_codon’, ‘transcript’.

  • gene_set – Set of Ensembl IDs or gene names or mix of both to limit the output to. If empty, return for all genes in the annotation.

  • merge – Merge the features per gene.

  • dict_only – Return a dict instead.

  • length_only – Return a dict with {gene: feature_len}, only makes sense with merge=True.

  • keep_strand – Whether the strand should be kept as information.

gene_exons = GTF_Processing.gene_feature_bed(gtf_file=annotation, feature='exon', gene_set=gene_list, dict_only=False,
                                             merge=False, keep_strand=False)
# This is list has 118 entries, so let's only print the first 5.
print(''.join([str(x) for x in gene_exons[:5]]))
chr21   5011799 5011874 ENSG00000279493
chr21   5012548 5012687 ENSG00000279493
chr21   5014386 5014471 ENSG00000279493
chr21   5016935 5017145 ENSG00000279493
chr21   9082601 9083101 ENSG00000279720
GTF_Processing.gene_introns_bed(gtf_file, gene_set={}, dict_only=False, length_only=False)

Gets the introns of annotated genes in a gtf-file, by taking the gene bodies and subtracting exons and UTRs.

Parameters:
  • gtf_file – gtf-file in GENCODE’s format, can be gzipped.

  • gene_set – Set of Ensembl IDs or gene names or mix of both to limit the output to. If empty, return for all genes in the annotation.

  • dict_only – Return a dict instead.

  • length_only – Return a dict with {gene: feature_len}, only makes sense with merge=True.

# Getting the introns is a bit more tricky, as they are not explicitly annotated. We get them by subtracting all
# other annotations from the gene bodies.
gene_introns = GTF_Processing.gene_introns_bed(gtf_file=annotation, gene_set=gene_list)
print(''.join([str(x) for x in gene_introns[:5]]))
chr21   5011874 5012548 ENSG00000279493 .       +
chr21   5012687 5014386 ENSG00000279493 .       +
chr21   5014471 5016935 ENSG00000279493 .       +
chr21   46235426        46236829        ENSG00000160294 .       -
chr21   46236979        46240811        ENSG00000160294 .       -
GTF_Processing.gene_biotypes(gtf_file, gene_set={})

Gives a dict of Ensembl IDs with their biotype {g: {gtf: as noted in the gtf, general: manually summarized}. For the manually summarized biotypes, all gtf-biotypes containing ‘RNA’ or equal ‘ribozyme’ are converted to ‘non-coding RNA’. All biotypes containing ‘pseudogene’ are assigned ‘pseudogene’. ‘TcR gene’ and ‘Ig gene’ are assigned to ‘protein-coding’. Removes Ensembl ID version suffixes.

Parameters:

gene_set – Set of Ensembl IDs or gene names or mix of both to limit the output to. If empty, return for all genes in the annotation.

# Gets the biotypes as annotated in a gtf-file, and also provides a high-level annotation.
biotype_dict = GTF_Processing.gene_biotypes(gtf_file=annotation, gene_set=gene_list)
print(biotype_dict)
{'ENSG00000279493': {'gtf': 'protein_coding', 'general': 'protein-coding'}, 'ENSG00000279720': {'gtf': 'processed_pseudogene', 'general': 'pseudogene'}, 'ENSG00000160294': {'gtf': 'protein_coding', 'general': 'protein-coding'}}
GTF_Processing.bed_to_length_dict(bedobj)

Takes a bedtools object with an identifier (Ensembl ID) in the fourth column, and sums the length of all regions of that gene. Used by other functions.

# Convenient, for example, to get the total exon length.
gene_exons = GTF_Processing.gene_feature_bed(gtf_file=annotation, feature='exon', gene_set=gene_list, dict_only=False,
                                             merge=False, keep_strand=False)
exon_lengths = GTF_Processing.bed_to_length_dict(gene_exons)
print(exon_lengths)
{'ENSG00000279493': 509, 'ENSG00000279720': 500, 'ENSG00000160294': 26635}
GTF_Processing.bed_to_feature_dict(bedobj)

Takes a bedtools object and returns a dictionary with the fourth column as key and the bedtools fields as value. Multiple occurrences of the identifier are appended.

# Admittedly, has quite specific use cases. It can be handy to collect all locations on gene level or on the level of
# any other identifier.
gene_exons = GTF_Processing.gene_feature_bed(gtf_file=annotation, feature='exon', gene_set=gene_list, dict_only=False,
                                             merge=False, keep_strand=False)
exon_dict = GTF_Processing.bed_to_feature_dict(gene_exons)
print(exon_dict['ENSG00000279493'])
[['chr21', '5011799', '5011874', 'ENSG00000279493'], ['chr21', '5012548', '5012687', 'ENSG00000279493'], ['chr21', '5014386', '5014471', 'ENSG00000279493'], ['chr21', '5016935', '5017145', 'ENSG00000279493']]