BasicPlotter¶

Collection of plotting functions, some quite general, others rather specific. For many examples here we’ll use the penguin dataframe provided by seaborn, because it comes conveniently with the package and because penguins are great.

# Block that has to be executed for all.
import BasicPlotter
import pandas as pd
import seaborn as sns
out_dir = 'docs/gallery/'
penguin_df = sns.load_dataset('penguins')   # Example data from seaborn.
print(penguin_df.head())

  species     island  bill_length_mm  bill_depth_mm  flipper_length_mm  \
Adelie  Torgersen            39.1           18.7              181.0   
Adelie  Torgersen            39.5           17.4              186.0   
Adelie  Torgersen            40.3           18.0              195.0   
Adelie  Torgersen             NaN            NaN                NaN   
Adelie  Torgersen            36.7           19.3              193.0   

   body_mass_g     sex  
     3750.0    Male  
     3800.0  Female  
     3250.0  Female  
        NaN     NaN  
     3450.0  Female  

BasicPlotter.basic_bars(plot_df, x_col, y_col, x_order=None, hue_col=None, hue_order=None, title=None, output_path='', y_label='', x_size=8, y_size=6, rotation=None, palette=None, legend=True, font_s=14, legend_out=False, ylim=None, formats=['pdf'])¶

Plots a basic barplot, allows to select hue levels.

Parameters:

y_col – Can be list, then the df will be transformed long format and var_name set to hue_col.
y_label – to have an appropriate y-axis label.

# Make a basic barplot with the average flipper length per species.
avg_flipper_length = pd.DataFrame(penguin_df.groupby('species')['flipper_length_mm'].mean())
BasicPlotter.basic_bars(avg_flipper_length, x_col='species', y_col='flipper_length_mm', formats=['png'],
                        x_order=['Chinstrap', 'Adelie', 'Gentoo'], title='Example bar plot',
                        output_path=out_dir, y_label='Flipper length [mm]', rotation=None, palette='glasbey_cool')

# Make a second version where we additionally split by sex per species.
avg_flipper_length_sex = pd.DataFrame(penguin_df.groupby(['species', 'sex'])['flipper_length_mm'].mean()).reset_index()
BasicPlotter.basic_bars(avg_flipper_length_sex, x_col='species', y_col='flipper_length_mm', formats=['png'],
                        hue_col='sex', x_order=['Chinstrap', 'Adelie', 'Gentoo'], title='Example bar plot with hue',
                        output_path=out_dir + "SexHue", y_label='Flipper length [mm]', rotation=None,
                        palette='glasbey_cool')

BasicPlotter.stacked_bars(plot_df, x_col, y_cols, y_label='', title=None, output_path='', x_size=8, y_size=6, rotation=None, palette=None, legend=True, fraction=False, numerate=False, sort_stacks=True, legend_out=False, width=0.8, vertical=False, hatches=None, font_s=14, formats=['pdf'])¶

Plots a stacked barplot, with a stack for each y_col.

Parameters:

x_col – Column name to use for splitting on the x-axis, if not an existant column will take the index.
y_cols – List of the columns to use for the stacks.
fraction – If True take all values as fraction of the row sum.
numerate – Whether to add the total number per x-group to the x-label.
sort_stacks – Whether to sort the stack groups by size.
legend_out – False to let seaborn place the legend, otherwise a float by how much it should be shifted in x-direction.
vertical – Whether to swap the layout of the plot from horizontal to vertical.
hatches – If given assumes the colour list is meant for the x-axis.

# For multiple groups per bar, let's look at from which island the penguins came. Plot it once as absolute numbers
# and again as fraction.
species_island = penguin_df.groupby(['species', 'island']).size().reset_index().rename(columns={0: 'count'}).pivot(index='species', columns='island', values='count')
for do_fraction in [True, False]:
    BasicPlotter.stacked_bars(species_island, x_col='species', y_cols=species_island.columns, y_label='count', sort_stacks=False,
                              title='Stacked bars '+('fraction' if do_fraction else 'absolute'), output_path=out_dir, legend_out=1.33,
                              rotation=0, palette='glasbey_cool', fraction=do_fraction, formats=['png'])

$basic_stacked_bars_fraction$

BasicPlotter.basic_pie(plot_df, title='', palette=None, numerate=True, legend_perc=True, output_path='', legend_title='', formats=['pdf'])¶

A pie chart.

Parameters:

plot_df – Either a DataFrame with each category as index or a dictionary with {category: count}.
numerate – Whether to add the summed count across categories to the title.
legend_perc – Whether the legend should write the percentages or absolute numbers.

# An alternative to stacked barplots when focusing on one group are pie charts.
species_island_adelie = pd.DataFrame(penguin_df.groupby(['species', 'island']).size().reset_index().rename(columns={0: 'count'}).pivot(index='species', columns='island', values='count').loc['Adelie'])
BasicPlotter.basic_pie(species_island_adelie, title='Adelie islands', palette='glasbey_cool', numerate=True,
                       output_path=out_dir, legend_title='island', formats=['png'])

BasicPlotter.basic_hist(plot_df, x_col, hue_col=None, hue_order=None, bin_num=None, title=None, output_path='', stat='count', cumulative=False, palette='tab10', binrange=None, xsize=12, ysize=8, colour='#2d63ad', font_s=14, ylabel=None, element='step', alpha=0.3, kde=False, legend_out=False, legend_title=True, fill=True, edgecolour=None, multiple='layer', shrink=1, hlines=[], vlines=[], discrete=False, grid=True, linewidth=None, formats=['pdf'])¶

Plots a basic layered histogram which allows for hue, whose order can be defined as well. If x_col is not a column in the df, it will be assumed that hue_col names all the columns which are supposed to be plotted.

Parameters:

stat – count: show the number of observations in each bin. frequency: show the number of observations divided by the bin width. probability or proportion: normalize such that bar heights sum to 1. percent: normalize such that bar heights sum to 100. density: normalize such that the total area of the histogram equals 1.
element – {“bars”, “step”, “poly”}.
multiple – {“layer”, “dodge”, “stack”, “fill”}
cumulative – If a cumulative distribution should be plotted instead of a histogram.
kde – Whether to plot the kernel density estimate.
discrete – If True, each data point gets their own bar with binwidth=1 and bin_num is ignored.
legend_out – False to let seaborn place the legend, otherwise a float by how much it should be shifted in x-direction.
hlines – Plot horizontal dashed grey lines at all positions listed in hlines.
vlines – Same as hlines but vertical.
shrink – Float to shrink the size of the bar-width to.

# Look at the whole distribution of flipper length by using a histogram and split by species.
BasicPlotter.basic_hist(penguin_df, x_col='flipper_length_mm', hue_col='species', bin_num=20, title='Flipper length per species',
                        output_path=out_dir, stat='percent', palette='glasbey_cool', element='step', alpha=0.7, formats=['png'])

_images/flipper_length_mm_species_Hist.png

BasicPlotter.basic_violin(plot_df, y_col, x_col, x_order=None, hue_col=None, hue_order=None, title=None, output_path='', numerate=False, ylim=None, palette=None, xsize=12, ysize=8, boxplot=False, boxplot_meanonly=False, rotation=None, numerate_break=True, jitter=False, colour='#2d63ad', font_s=14, saturation=0.75, jitter_colour='black', jitter_size=5, vertical_grid=False, legend_title=True, legend=True, grid=True, formats=['pdf'])¶

Plots a basic violin plot which allows for hue, whose order can be defined as well. Optionally plot boxplot, or add jitter points for the individual data points. Use y_col=None and x_col=None for seaborn to interpret the columns as separate plots on the x-asis.

Parameters:

numerate – Whether to add the total number per x-group to the x-label.
numerate_break – Whether to add a line break before writing the size of the x-group.
jitter – Whether to add jitter points for each data point on top.
jitter_colour – Colour for the jitter points.
jitter_size – Dot size for the jitter points.
saturation – Controls the saturation of the colours.
boxplot – Whether to plot boxplots instead of violinplots.
boxplot_meanonly – Remove all lines from the boxplot and show just the mean as horizontal line.

# An alternative is to separate the species along the x-axis and do violin plots.
BasicPlotter.basic_violin(penguin_df, y_col='flipper_length_mm', x_col='species', title='Flipper length per species',
                          output_path=out_dir, numerate=True, palette='glasbey_cool', formats=['png'])
# Alternatively do boxplots instead and also add the individual data points as jitter.
BasicPlotter.basic_violin(penguin_df, y_col='flipper_length_mm', x_col='species', title='Flipper length per species',
                          output_path=out_dir+"BoxplotJitter", numerate=True, palette='glasbey_cool', formats=['png'],
                          boxplot=True, jitter=True)

BasicPlotter.basic_2Dhist(plot_df, columns, hue_col=None, hue_order=None, bin_num=200, title=None, output_path='', xsize=12, ysize=8, palette='tab10', cbar=False, cmap='mako', hlines=[], vlines=[], binrange=None, diagonal=False, grid=True, font_s=14, formats=['pdf'])¶

Plots a basic 2D histogram as heatmap which allows for hue, whose order can be defined as well. Useful when a scatterplot would be just a big blob of dots.

Parameters:

columns – List with 2 entries representing the columns from plot_df for the x- and y-axis.
bin_num – Number of bins to use, the higher the finer the resolution. It behaves it a odd though.
cbar – Whether to plot the colorbar that shows the number of counts.
hlines – Plot horizontal dashed grey lines at all positions listed in hlines.
vlines – Same as hlines but vertical.
binrange – Boundaries for the histogram.
diagonal – Whether to add a line on the diagonal.

# Plot the distribution of two features as 2D histogram. This example has very few points, so a scatter would work better.
BasicPlotter.basic_2Dhist(penguin_df, columns=['flipper_length_mm', 'body_mass_g'], bin_num=20, title='Flipper length vs body mass',
                          output_path=out_dir, cbar=True, formats=['png'])

_images/flipper_length_mm_body_mass_g_None_2DHist.png

BasicPlotter.multi_mod_plot(plot_df, score_cols, colour_col=None, marker_col=None, output_path='', diagonal=False, title=None, colour_order=None, marker_order=None, line_plot=False, alpha=0.7, xsize=8, ysize=6, palette=None, xlim=None, ylim=None, msize=30, vlines=[], hlines=[], add_spear=False, na_colour='black', grid=True, label_dots=None, font_s=14, adjust_labels=True, formats=['pdf'])¶

Scatterplot that compares two scores. For each entry in plot_df plot one dot with [x,y] based on score_col and allows to colour all dots based on colour_col, and if marker_col is selected assigns each class a different marker.

Parameters:

score_cols – List of two column names to use for the x- and y-axis.
line_plot – 2D list of dots which will be connected to a lineplot.
hlines – Plot horizontal dashed grey lines at all positions listed in hlines.
vlines – Same as hlines but vertical.
diagonal – Whether to add a line on the diagonal.
add_spear – CARE not properly tested. Whether to add the spearman correlation coefficient between the score_cols to the title.
na_colour – How to colour dots where the colour_col is NA.
label_dots – A pair of columns [do_label, label_col] with boolean do_label telling which entries should get a text label within the plot, and label_col giving the string of the label.
adjust_labels – Whether to use the adjustText package to try to avoid overlap of text labels.

# This one is a scatterplot with a lot of additional options.
# Start with a scatterplot where we colour the dots by the species, each dot being one penguin.
BasicPlotter.multi_mod_plot(penguin_df, score_cols=['flipper_length_mm', 'body_mass_g'], colour_col='species',
                            output_path=out_dir, title='#1: Flipper length vs body mass', alpha=1, palette='glasbey_cool',
                            msize=25, formats=['png'])
# Next, let's add markers to show the island where the penguin was measured.
BasicPlotter.multi_mod_plot(penguin_df, score_cols=['flipper_length_mm', 'body_mass_g'], colour_col='species',
                            marker_col='island', output_path=out_dir, title='#2 Flipper length vs body mass',
                            alpha=1, palette='glasbey_cool', msize=25, formats=['png'])
# Alternatively to having the colours categorical, we can also add a third continuous feature as colour.
BasicPlotter.multi_mod_plot(penguin_df, score_cols=['flipper_length_mm', 'body_mass_g'], colour_col='bill_length_mm',
                            marker_col='island', output_path=out_dir, title='#3 Flipper length vs body mass',
                            alpha=1, msize=25, formats=['png'])
# In addition, label the five heaviest for which we need an additional boolean column saying
# whether a dot should be labelled. The text for the label could be anything, here we write the body mass itself.
top5_index = penguin_df.sort_values('body_mass_g', ascending=False).index[:5]
penguin_df['add_label'] = [i in top5_index for i in penguin_df.index]
BasicPlotter.multi_mod_plot(penguin_df, score_cols=['flipper_length_mm', 'body_mass_g'], colour_col='bill_length_mm',
                            marker_col='island', label_dots=['add_label', 'body_mass_g'], output_path=out_dir+"doLabel",
                            title='#4 Flipper length vs body mass',
                            alpha=1, msize=25, formats=['png'])

BasicPlotter.basic_venn(input_sets, plot_path, blob_colours=ColoursAndShapes.tol_highcontrast, title='', scaled=True, linestyle='', number_size=11, xsize=5, ysize=5, formats=['pdf'])¶

Based on a dictionary of {key: set} with either two or three entries do the intersection of sets and create a Venn diagram from it.

Parameters:

input_sets – {key: set} for all bubbles that should be plotted.
scaled – If the bubble sizes should be scaled to the set sizes. Choose False if the size difference is too high.
linestyle – Linestyle for the rim of the bubbles.

# Plot the overlap of lists of ingredients (incomplete) as a Venn diagram.
ingredients = {"Cookies": {'butter', 'sugar', 'flour', 'baking powder', 'chocolate'},
               'Apple pie': {'butter', 'sugar', 'flour', 'baking powder', 'apples'},
               'Bread': {'flour', 'yeast', 'oil', 'salt'}}
BasicPlotter.basic_venn(input_sets=ingredients, plot_path=out_dir+"Ingredients", formats=['png'])

BasicPlotter.overlap_heatmap(inter_sets, title='', plot_path='', xsize=12, ysize=8, annot=True, mode='Jaccard', annot_type='Jaccard', font_s=14, matrix_only=False, cmap='mako', formats=['pdf'])¶

Create heatmap of pairwise overlap of a collection of sets. Either make a symmetric heatmap of the Jaccard index or plot the asymmetric fraction of overlap. Combinations are formed by the keys of the dictionary.

Parameters:

mode – ‘Jaccard’ to show the Jaccard index of the intersection, ‘Fraction’ to show the percentage.
annot_type – ‘Jaccard’ to get the Jaccard index written into the cells, ‘Fraction’ for the fraction, and ‘Absolute’ to get the absolute number of shared items.
matrix_only – Whether only the matrix should be given without creating a plot, will return two dfs, one with the shared metric and the other the cell labels. If False, only producing the plot without returning matrices.

# Do the overlap again but now once with the Jaccard index and once as fraction.
ingredients = {"Cookies": {'butter', 'sugar', 'flour', 'baking powder', 'chocolate'},
               'Apple pie': {'butter', 'sugar', 'flour', 'baking powder', 'apples'},
               'Bread': {'flour', 'yeast', 'oil', 'salt'}}
for mode in ['Fraction', 'Jaccard']:
    BasicPlotter.overlap_heatmap(inter_sets=ingredients, title="Ingredients overlap as "+mode, plot_path=out_dir+"Ingredients_"+mode,
                                 xsize=10, ysize=6, mode=mode, annot_type='Jaccard' if mode == 'Jaccard' else 'Absolute', formats='png')

$fraction_map$

BasicPlotter.upset_plotter(inter_sets, y_label='Intersection', max_groups=None, min_degree=0, show_percent=False, sort_categories_by='cardinality', sort_by='cardinality', intersection_plot_elements=None, element_size=None, title_tag='', plot_path='', font_s=14, formats=['pdf'])¶

Based on a dictionary with sets as values creates the intersection and an upsetplot. Uses the upsetplot package (https://upsetplot.readthedocs.io/en/stable/#) based on doi.org/10.1109/TVCG.2014.2346248. The size of the resulting plot isn’t that easy to control with the flags, the package isn’t very user-friendly in that regard.

Parameters:

inter_sets – Dictionary of {key: set}, each key will be plotted as one row.
y_label – Label for the y-axis.
max_groups – Defines the maximum number of intersections plotted, sorted descending by size.
min_degree – Minimum overlap between groups to show.
show_percent – Write the percentage of overlap on top of the bars.
sort_categories_by – How to sort the rows, ‘cardinality’, ‘degree’ or ‘input’.
sort_by – How to order the columns, meaning the groups of intersections, ‘cardinality’ or ‘degree’.
intersection_plot_elements – Roughly the height of the plot.
element_size – Roughly the overall size and margins.

# Use the same sets from the previous function, but now visualize the overlap with an UpsetPlot.
ingredients = {"Cookies": {'butter', 'sugar', 'flour', 'baking powder', 'chocolate'},
               'Apple pie': {'butter', 'sugar', 'flour', 'baking powder', 'apples'},
               'Bread': {'flour', 'yeast', 'oil', 'salt'}}
BasicPlotter.upset_plotter(inter_sets=ingredients, sort_categories_by='input', title_tag='Ingredients overlap',
                           plot_path=out_dir+"Ingredients", intersection_plot_elements=4, formats=['png'])

BasicPlotter.cumulative_plot(plot_df, x_col, hue_col, hue_order=None, output_path='', numerate=False, title=None, xlimit=None, vertical_line=False, add_all=False, table_width=0.3, table_x_pos=1.2, palette=None, font_s=16, grid=True, formats=['pdf'])¶

Plot the cumulative distribution of all sets of grouping names in the plotting df. Adds a table with a Kolmogorov-Smirnov test for each non-redunant pairwise comparison next to the plot. Cells with a p-value ≤ 0.05 will be coloured in red.

Parameters:

plot_df – Pandas DataFrame holding the data.
x_col – Column name in plot_df which should be plotted on the x-axis.
hue_col – Column name in the DataFrame used for different curves.
hue_order – If the groups in hue_col should have a specific order.
numerate – If True, show the number of elements in each hue_col group in parentheses.
vertical_line – To plot a vertical line at the given position, e.g. 0.
add_all – If True, add the whole stack of values in x_col again as separate distribution labelled ‘All’.
table_width – Width of the KS table.
table_x_pos – X-position of the KS table, to avoid it overlapping the plot.

# This is most instructive with diverging data e.g. logFC from RNA-seq. We use the RNA data from a study on the histone
# mark H379me2 (10.1038/s41467-022-35070-2) and a formatted file from their supplements.
rna_file = 'ExampleData/41467_2022_35070_MOESM4_ESM_E16Sub.txt'
rna_table = pd.read_table(rna_file, sep='\t', header=0)
# Add a column that groups genes into rough bins of how much of the gene body is covered.
rna_table['binned H3K79me2 GB Coverage'] = pd.cut(rna_table['H3K79me2 GB Coverage'], bins=2).astype('string')
BasicPlotter.cumulative_plot(rna_table, x_col='logFC', hue_col='binned H3K79me2 GB Coverage', palette='glasbey_cool', xlimit=[-1.5, 2],
                             add_all=True, output_path=out_dir, numerate=True, title=None, vertical_line=0, table_width=0.4, table_x_pos=1.2, formats=['png'])

From the plot we can see, that the genes with more coverage of their gene body are more often downregulated and that they have less strong positive logFC, compared to genes with lower gene body coverage.