============ GTF processing ============ Functions based on gtf-files (tested for GENCODE's format) and the resulting outputs. Centred on the usage of `pybedtools `_, but the output can also be easily written to bed files. To run the example code for this module, we will always start with this block of code: .. code-block:: python import GTF_Processing annotation = 'ExampleData/gencode.v38.annotation_Mini.gtf' gene_list = ['ENSG00000160294', 'ENSG00000279493', 'ENSG00000279720'] .. .-------------------------------------------------------------------------------------------------------------------- .. gene_window_bed .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.gene_window_bed .. code-block:: python promoter_regions = GTF_Processing.gene_window_bed(gtf_file=annotation, extend=200, gene_set=gene_list, tss_type='5') print(promoter_regions) .. include:: gallery/src.GTF_Processing.gene_window_bed.txt :literal: .. .-------------------------------------------------------------------------------------------------------------------- .. gene_feature_table .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.gene_feature_table .. code-block:: python # Gets the biotypes as annotated in a gtf-file, and also provides a high-level annotation. feature_table = GTF_Processing.gene_feature_table(gtf_file=annotation, gene_set=gene_list) print(feature_table) .. include:: gallery/src.GTF_Processing.gene_feature_table.txt :literal: .. .-------------------------------------------------------------------------------------------------------------------- .. match_gene_identifiers .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.match_gene_identifiers .. code-block:: python # The best task of all, map different identifiers. Let's start with Ensembl IDs to gene symbols, which can often # be successfully done by a gtf-file alone. gene_ids = ['ENSG00000160294', 'ENSG00000279493', 'ENSG00000279720'] mapped_ids, missed_ids = GTF_Processing.match_gene_identifiers(gene_ids, gtf_file=annotation, species='human', fields="symbol") print(mapped_ids) .. include:: gallery/src.GTF_Processing.match_gene_identifiers.ensembl.txt :literal: .. code-block:: python # Next we start from symbols, which is more prone to failing. We can also query for the Entrez ID. gene_symbols = ['AXL', 'MYO10', 'ATP5SL'] mapped_symbols, missed_symbols = GTF_Processing.match_gene_identifiers(gene_symbols, gtf_file=annotation, species='human', fields="ensembl,entrezgene") print(mapped_symbols) .. include:: gallery/src.GTF_Processing.match_gene_identifiers.symbols.txt :literal: .. code-block:: python # If we want to lookup Entrez IDs, we have to add entrezgene to the scope in which mygene looks. gene_entrez = [4651, 558] mapped_entrez, missed_entrez = GTF_Processing.match_gene_identifiers(gene_entrez, gtf_file=annotation, species='human', scopes='symbol,entrezgene', fields="ensembl") print(mapped_entrez) .. include:: gallery/src.GTF_Processing.match_gene_identifiers.entrez.txt :literal: .. .-------------------------------------------------------------------------------------------------------------------- .. gene_body_bed .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.gene_body_bed .. code-block:: python gene_bodies = GTF_Processing.gene_body_bed(gtf_file=annotation, gene_set=gene_list, dict_only=False) print(gene_bodies) .. include:: gallery/src.GTF_Processing.gene_body_bed.txt :literal: .. .-------------------------------------------------------------------------------------------------------------------- .. gene_feature_bed .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.gene_feature_bed .. code-block:: python gene_exons = GTF_Processing.gene_feature_bed(gtf_file=annotation, feature='exon', gene_set=gene_list, dict_only=False, merge=False, keep_strand=False) # This is list has 118 entries, so let's only print the first 5. print(''.join([str(x) for x in gene_exons[:5]])) .. include:: gallery/src.GTF_Processing.gene_feature_bed.txt :literal: .. .-------------------------------------------------------------------------------------------------------------------- .. gene_introns_bed .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.gene_introns_bed .. code-block:: python # Getting the introns is a bit more tricky, as they are not explicitly annotated. We get them by subtracting all # other annotations from the gene bodies. gene_introns = GTF_Processing.gene_introns_bed(gtf_file=annotation, gene_set=gene_list) print(''.join([str(x) for x in gene_introns[:5]])) .. include:: gallery/src.GTF_Processing.gene_introns_bed.txt :literal: .. .-------------------------------------------------------------------------------------------------------------------- .. gene_biotypes .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.gene_biotypes .. code-block:: python # Gets the biotypes as annotated in a gtf-file, and also provides a high-level annotation. biotype_dict = GTF_Processing.gene_biotypes(gtf_file=annotation, gene_set=gene_list) print(biotype_dict) .. include:: gallery/src.GTF_Processing.gene_biotypes.txt :literal: .. .-------------------------------------------------------------------------------------------------------------------- .. bed_to_length_dict .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.bed_to_length_dict .. code-block:: python # Convenient, for example, to get the total exon length. gene_exons = GTF_Processing.gene_feature_bed(gtf_file=annotation, feature='exon', gene_set=gene_list, dict_only=False, merge=False, keep_strand=False) exon_lengths = GTF_Processing.bed_to_length_dict(gene_exons) print(exon_lengths) .. include:: gallery/src.GTF_Processing.bed_to_length_dict.txt :literal: .. .-------------------------------------------------------------------------------------------------------------------- .. bed_to_feature_dict .. .-------------------------------------------------------------------------------------------------------------------- .. autofunction:: GTF_Processing.bed_to_feature_dict .. code-block:: python # Admittedly, has quite specific use cases. It can be handy to collect all locations on gene level or on the level of # any other identifier. gene_exons = GTF_Processing.gene_feature_bed(gtf_file=annotation, feature='exon', gene_set=gene_list, dict_only=False, merge=False, keep_strand=False) exon_dict = GTF_Processing.bed_to_feature_dict(gene_exons) print(exon_dict['ENSG00000279493']) .. include:: gallery/src.GTF_Processing.bed_to_feature_dict.txt :literal: