EndotypY package

Submodules

EndotypY.endotyper module

class EndotypY.endotyper.Endotyper[source]

Bases: object

Endotyper class for endotyping analysis.

This class provides methods to perform endotyping analysis using a random walk approach. It includes methods for reading networks, preparing the random walk matrix, and performing the endotyping analysis.

Attributes:

network_file (str | Path)

The path to the input network file.

r (float)

The damping factor for the random walk.
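
A minimal usage sketch of how the methods documented below might be combined, assuming the package is installed and importable as EndotypY. The call order is inferred from the method descriptions rather than stated in this reference, `run_endotyping` is a hypothetical helper, and the file paths and Enrichr library name are supplied by the caller.

```python
def run_endotyping(network_path, seeds_path, enrichr_lib):
    """Hypothetical helper that calls the documented Endotyper methods in sequence."""
    # Imported lazily so the sketch can be read and loaded without EndotypY installed.
    from EndotypY.endotyper import Endotyper

    endo = Endotyper()
    endo.import_network(network_path)   # tab-separated edge list (.txt, .tsv or .csv)
    endo.import_seeds(seeds_path)       # one seed gene per line
    endo.prepare_rwr(r=0.8)             # documented default for r
    # Assumed order: neighborhoods first, then their annotation, then clustering.
    endo.define_local_neighborhood(neighbor_percentage=1, scaling=True)
    endo.annotate_local_neighborhood(enrichr_lib, organism="Human", sig_threshold=0.01)
    endo.define_kl_endotypes(distance_metric="hamming", linkage_method="complete", alpha=0.05)
    return endo
```

The keyword values shown are the defaults listed in the signatures below; only the ordering of the calls is an assumption.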

annotate_local_neighborhood(enrichr_lib: str, organism='Human', sig_threshold=0.01, force_download=False)[source]

Get the Gene Ontology (GO) terms for a given gene and its RWR-defined neighbors. This function uses the Enrichr library to perform Gene Set Enrichment Analysis (GSEA) and returns the significant terms for the expanded neighborhood of each gene. The significance threshold is applied to the enrichment p-value.

Parameters:
  • enrichr_lib (str) – The name of the Enrichr library to use for GSEA.

  • organism (str) – The organism for which the GSEA is performed. Default is ‘Human’.

  • sig_threshold (float) – The significance threshold for the GSEA results. Default is 0.01.

  • force_download (bool) – Force re-download of the Enrichr library. Default is False.

define_kl_endotypes(distance_metric: str = 'hamming', linkage_method: str = 'complete', alpha: float = 0.05)[source]

Define endotypes based on KL divergence. This function computes a binary feature matrix from the neighborhood annotations, where rows are genes and columns are enrichment terms. Each entry is 1 if the term is enriched for the gene together with its local neighborhood, and 0 otherwise.

It then performs KL divergence clustering to identify endotypes.

Returns:

The Endotyper object with the endotypes defined.

Return type:

self

define_local_neighborhood(neighbor_percentage=1, scaling=True)[source]

Run RWR starting from every single gene in seed_genes and extract the top % genes from the visiting probabilities around each seed gene.

Parameters:
  • neighbor_percentage (int) – Percentage of top genes to identify.

  • scaling (bool) – Whether to apply scaling to the RWR.

explore_seed_clusters(scaling=True, k=200)[source]

Run the seed clustering process. This function computes the RWR for each seed gene, clusters them based on their neighborhoods, and plots the results.

Parameters:
  • k (int) – Maximum neighborhood size to test. Defaults to 200.

  • scaling (bool) – Whether to apply scaling to the RWR. Defaults to True.

extract_disease_module(seed_cluster_id: int = None, scaling=True, k=200)[source]
import_network(network_file: str)[source]

Imports a network from a file.

Parameters:
  • network_file (str) – Path to the network file. Supported formats are ‘.txt’, ‘.tsv’, or ‘.csv’ files with two tab-separated columns representing edges.

Returns:

The Endotyper object.

Return type:

self

Notes

  • Lines that start with ‘#’ will be ignored.

  • Self-loops are eliminated in the last filtering step.

import_seeds(seeds_file: str)[source]

Imports seeds from a file and sets them as the seeds for the object.

Parameters:

seeds_file (str) – The path to the seeds file.

Returns:

The Endotyper object.

Return type:

self

Notes

  • The seeds file should contain a list of seed genes, one per line.

  • Alternatively, the seeds file may contain tab-separated entries on its first line.

plot_endotype_grid(size_height=500, size_width=500, ncols=2, node_size='degree', path_length=2, layout_seed=2025, enrichr_lib=None, top_terms=5, gsea_plot_type='dotplot')[source]

Plots endotypes in a grid layout using Plotly with optional GSEA visualization.

This function creates an interactive grid visualization of all identified endotypes, combining endotypes from different iterations into a single grid view.

Parameters:
  • size_height (int, optional) – Height of each subplot in pixels. Defaults to 500.

  • size_width (int, optional) – Width of each subplot in pixels. Defaults to 500.

  • ncols (int, optional) – Number of columns in the grid layout. Defaults to 2.

  • node_size (str or int, optional) – Determines the centrality measure for node sizing. Options are ‘betweenness’ or ‘degree’. If integer, used as fixed node size. Defaults to ‘degree’.

  • path_length (int, optional) – Length of shortest paths to consider between endotype genes. Defaults to 2.

  • layout_seed (int, optional) – Seed for the spring layout. Defaults to 2025.

  • enrichr_lib (str, optional) – Name of Enrichr library for GSEA. If None, no GSEA is performed. Defaults to None.

  • top_terms (int, optional) – Number of top enriched terms to display. Defaults to 5.

  • gsea_plot_type (str, optional) – Type of plot for GSEA results (‘dotplot’ or ‘pie’). Defaults to ‘dotplot’.

Returns:

Plotly figure and optionally GSEA enrichment results if enrichr_lib is provided.

plot_endotypes(node_size: list = ['degree', 'betweenness'], layout: str = 'spring', path_length: int = 2)[source]

Plots multiple endotypes on the network. This function iterates through the endotypes dictionary, combining endotypes from different iterations into a single dictionary. It then calls the plot_multiple_endotypes function to visualize these combined endotypes on the network.

Parameters:
  • node_size (list, optional) – Network measures to use for node sizing. Defaults to [‘degree’, ‘betweenness’].

  • layout (str, optional) – The layout algorithm to use for the network plot. Defaults to ‘spring’.

  • path_length (int, optional) – The path length to use for shortest path calculations. Defaults to 2.

plot_endotypes_metagraph(filter_size_endotypes=True, node_size=15)[source]

Builds an endotype metagraph visualization. Individual endotype subgraphs, each positioned via a spring layout, are placed at meta-positions determined by inter-endotype connectivity and then globally scaled. The result is rendered with colored hulls, intra-endotype edges, inter-endotype connections, and highlighted seed genes, using datamapplot and matplotlib.

Parameters:
  • filter_size_endotypes (bool, optional) – Whether to filter endotypes by size, keeping only endotype subgraphs with at least one edge and more than 5 nodes. Defaults to True.

  • node_size (int, optional) – Size of nodes in the plot. Defaults to 15.

prepare_rwr(r=0.8)[source]

Prepares the Random Walk with Restart (RWR) matrix.

This function computes the RWR matrix based on the network and restart probability, using the formula (I-r*M)^-1 where M is the column-wise normalized Markov matrix according to M = A D^{-1}.

To provide the option of scaling the visiting probabilities, a scaling matrix is also created, which is the diagonal matrix of the inverse degree of the nodes in graph G.

Parameters:

r (float, optional) – Damping factor/restart probability. Defaults to 0.8.

Returns:

The Endotyper object with the RWR matrix, scaling matrix, and index-to-Ensembl mapping stored as attributes.

Return type:

self
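
The formula above can be sketched with NumPy as follows. This is an illustration of the stated expressions (I - r*M)^-1 and M = A D^{-1}, not the package's implementation; the function name `rwr_matrices` and the dense adjacency input are assumptions made for the example.

```python
import numpy as np

def rwr_matrices(adjacency, r=0.8):
    """Return (W, S) with W = (I - r*M)^-1, M = A D^-1 and S = diag(1 / degree)."""
    A = np.asarray(adjacency, dtype=float)
    degree = A.sum(axis=0)  # column sums; equal to node degree for an undirected graph
    if np.any(degree == 0):
        raise ValueError("isolated nodes have no defined column in M")
    D_inv = np.diag(1.0 / degree)
    M = A @ D_inv           # column-wise normalized Markov matrix
    W = np.linalg.inv(np.eye(A.shape[0]) - r * M)
    return W, D_inv         # D_inv doubles as the inverse-degree scaling matrix

# Toy path graph a - b - c (made-up example input).
A = [[0, 1, 0],
     [1, 0, 1],
     [0, 1, 0]]
W, S = rwr_matrices(A, r=0.8)
```

Because M is column-stochastic, each column of W sums to 1/(1 - r). Whether the package rescales the columns into probabilities is not stated here.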

EndotypY.expansion module

EndotypY.expansion.calculate_top_genes(G, input_gene_list, rwr_matrix, scaling_matrix, d_idx_ensembl, neighbor_percentage=1, scaling=True)[source]

Calculate the top-ranked genes (the top 1% by default) for each gene in input_gene_list using the Random Walk with Restart (RWR) algorithm.

Parameters:
  • G (nx.Graph object) – NetworkX graph object.

  • input_gene_list (list of str) – List of genes.

  • rwr_matrix (numpy array) – Random walk matrix.

  • scaling_matrix (numpy array) – Diagonal matrix of the inverse degree of the nodes.

  • d_idx_ensembl (dict) – Dictionary mapping indices to gene Ensembl IDs.

  • neighbor_percentage (float) – Percentage of top genes to identify.

  • scaling (bool) – Whether to scale the visiting probabilities.

Returns:

top_genes – Dictionary of the desired top % genes for each gene in input_gene_list.

Return type:

dict
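
The selection step can be illustrated with a short sketch. It assumes the visiting probabilities for one source gene are already available as a plain dictionary. The name `top_percent`, the exclusion of the source gene, and rounding the cutoff up are assumptions for illustration, not documented behavior.

```python
import math

def top_percent(visiting_probs, source_gene, neighbor_percentage=1):
    """Return the highest-probability genes for one source gene, excluding the source."""
    ranked = sorted(
        (gene for gene in visiting_probs if gene != source_gene),
        key=lambda gene: visiting_probs[gene],
        reverse=True,
    )
    # Assumption: round the cutoff up so that small networks still yield a neighbor.
    n_keep = math.ceil(len(visiting_probs) * neighbor_percentage / 100)
    return ranked[:n_keep]

# Made-up visiting probabilities for a walk restarted at gene "A".
probs = {"A": 0.50, "B": 0.20, "C": 0.15, "D": 0.10, "E": 0.05}
neighbors = top_percent(probs, "A", neighbor_percentage=40)
```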

EndotypY.expansion.convert_gs_lib_to_dict(go_terms_dict)[source]

Converts a dictionary of enrichment terms from the format {term: [gene1, gene2, …]} to the format {gene: [term1, term2, …]}.

Parameters:

go_terms_dict (dict or list) – Dictionary where keys are enrichment term IDs and values are lists of gene symbols associated with those terms. Or a list from gseapy that needs to be handled differently.

Returns:

go_terms – Dictionary where keys are gene symbols and values are lists of enrichment term IDs associated with those genes.

Return type:

dict
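
The inversion described above can be sketched in a few lines. This covers only the dictionary input case; the list form returned by gseapy is not handled, and `invert_term_library` is a hypothetical name.

```python
from collections import defaultdict

def invert_term_library(term_to_genes):
    """Turn {term: [gene, ...]} into {gene: [term, ...]}."""
    gene_to_terms = defaultdict(list)
    for term, genes in term_to_genes.items():
        for gene in genes:
            gene_to_terms[gene].append(term)
    return dict(gene_to_terms)

# Made-up library with two terms.
library = {"term_A": ["TP53", "BRCA1"], "term_B": ["TP53"]}
inverted = invert_term_library(library)
```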

EndotypY.expansion.get_GSEA_significant_terms(gene_list, library, sig_threshold, organism='human', background=None)[source]

Perform Gene Set Enrichment Analysis (GSEA) using gseapy’s enrichr implementation and return significant terms.

Parameters:
  • gene_list (list) – List of genes to analyze.

  • library (gene set enrichment library) – Dictionary of term - gene associations (output of gp.get_library() function).

  • background (list) – Background gene list (default: None).

Returns:

significant_terms – List of significant terms from GSEA.

Return type:

list

EndotypY.expansion.get_gene_and_neighborhood_enrichment_terms(gene, top_genes, term_library, sig_threshold)[source]

Get the Gene Ontology (GO) terms for a given gene symbol and its neighbors.

Parameters:
  • gene (str) – Gene symbol (HGNC) of the gene.

  • top_genes (dict) – Dictionary of the desired top % genes for each gene in input_gene_list (output of calculate_top_genes() function).

  • term_library (dict) – Dictionary of term - gene associations (output of gp.get_library() function).

Returns:

all_go_terms – List of unique GO term IDs for the gene and its neighbors.

Return type:

list

EndotypY.expansion.get_module_neighborhood_terms_dict(top_genes, term_library, sig_threshold)[source]

Get the Gene Ontology (GO) terms for all genes in a given disease core module using parallel processing.

Parameters:
  • top_genes (dict) – Dictionary of the desired top % genes for each gene in input_gene_list (output of calculate_top_genes() function).

  • term_library (dict) – Dictionary of term - gene associations (output of gp.get_library() function).

  • n_cores (int, optional) – Number of cores to use for parallel processing (default: mp.cpu_count() - 2).

Returns:

go_terms_dict – Dictionary of unique GO term IDs for each gene in the disease core module.

Return type:

dict

EndotypY.expansion.process_gene(gene, top_genes, term_library, sig_threshold)[source]

Process a single gene to retrieve its GO terms.

Parameters:
  • gene (str) – Gene symbol to process.

  • top_genes (dict) – Dictionary of top genes.

  • term_library (dict) – Dictionary of term - gene associations.

Returns:

A tuple containing the gene symbol and its associated GO terms.

Return type:

tuple

EndotypY.import_export module

EndotypY.import_export.load_seed_set_from_file(seed_file) → set[source]

Reads a seed set from an external file.

  • Lines starting with ‘#’ will be ignored.

Parameters:

seed_file (str | Path)

The path to the input file. It can be provided as:

  • A string (e.g., “data/seeds.txt”)

  • A pathlib.Path object (e.g., Path(“data”, “seeds.txt”))

Returns:

list of int

A list of unique seeds from the file

Raises:

FileNotFoundError

If the specified file does not exist.
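
A minimal sketch of the documented reading rules, using only the standard library. It keeps seeds as strings and splits every line on tabs, so it accepts both one seed per line and tab-separated seeds on a single line. `read_seed_set` is a hypothetical name, not the package function.

```python
import tempfile
from pathlib import Path

def read_seed_set(seed_file):
    """Read unique seeds from a text file, skipping '#' comment lines."""
    path = Path(seed_file)
    if not path.is_file():
        raise FileNotFoundError(f"seed file not found: {path}")
    seeds = set()
    with path.open() as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#"):
                continue
            # Splitting on tabs covers both layouts described for seed files.
            seeds.update(token.strip() for token in line.split("\t") if token.strip())
    return seeds

# Made-up example file mixing both layouts.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp, "seeds.txt")
    demo.write_text("# comment\nTP53\nBRCA1\tEGFR\n")
    seeds = read_seed_set(demo)
```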

EndotypY.import_export.read_network_from_file(network_file: str | Path) → Graph[source]

Reads a network from an external file.

  • The edgelist must be provided as a tab-separated table. The first two columns of the table will be interpreted as an interaction gene1 <==> gene2.

  • Lines that start with ‘#’ will be ignored.

  • The function checks that the input file ends with ‘.txt’ or ‘.tsv’ or ‘.csv’.

  • Self-loops are eliminated in the last filtering step of the function.

Parameters:

network_file (str | Path)

The path to the input file. It can be provided as:

  • A string (e.g., “data/network.txt”)

  • A pathlib.Path object (e.g., Path(“data”, “network.txt”))

Returns:

nx.Graph

A NetworkX graph with nodes and edges from the file.

Raises:

ValueError

If the file format is not ‘.txt’ or ‘.tsv’ or ‘.csv’.

FileNotFoundError

If the specified file does not exist.

Notes:

  • Lines starting with ‘#’ are ignored.

  • Self-loops (edges where node1 == node2) are removed.
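
The parsing rules listed above can be sketched with the standard library alone. To stay dependency-free, this example returns a set of undirected edges rather than a NetworkX graph, and it drops self-loops while reading instead of in a final filtering step. The names `check_network_suffix` and `parse_edge_list` are hypothetical.

```python
import io

ALLOWED_SUFFIXES = (".txt", ".tsv", ".csv")

def check_network_suffix(filename):
    """Raise ValueError unless the file name ends with a supported suffix."""
    if not str(filename).endswith(ALLOWED_SUFFIXES):
        raise ValueError(f"unsupported network file format: {filename}")

def parse_edge_list(lines):
    """Collect undirected edges from tab-separated lines, using the first two columns."""
    edges = set()
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # comment or blank line
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 2:
            continue  # not an interaction
        gene1, gene2 = columns[0].strip(), columns[1].strip()
        if gene1 == gene2:
            continue  # self-loop
        edges.add(frozenset((gene1, gene2)))
    return edges

# Made-up edge list: a comment, a duplicate in reverse order, and a self-loop.
demo = io.StringIO("# toy network\nA\tB\t0.9\nB\tA\nC\tC\nB\tD\n")
edges = parse_edge_list(demo)
```

Storing each edge as a frozenset makes A–B and B–A collapse into one undirected edge, which mirrors the gene1 <==> gene2 interpretation.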

EndotypY.kl_clustering module

EndotypY.kl_clustering.compute_feature_matrix(go_terms_dict)[source]

Create a binary feature matrix from a dictionary of GO terms for each gene.

Parameters:

go_terms_dict (dict) – Dictionary of unique GO term IDs for each gene in the disease core module.

Returns:

feature_matrix – Binary feature matrix where rows correspond to gene IDs and columns represent GO term IDs.

Return type:

pd.DataFrame
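
The construction of the binary matrix can be sketched without pandas. This example returns plain lists instead of a pd.DataFrame and sorts genes and terms alphabetically; both choices are assumptions made to keep the sketch self-contained, and `binary_feature_matrix` is a hypothetical name.

```python
def binary_feature_matrix(terms_by_gene):
    """Return (genes, terms, rows) where rows[i][j] is 1 if gene i has term j."""
    genes = sorted(terms_by_gene)
    terms = sorted({term for gene_terms in terms_by_gene.values() for term in gene_terms})
    rows = []
    for gene in genes:
        present = set(terms_by_gene[gene])
        rows.append([1 if term in present else 0 for term in terms])
    return genes, terms, rows

# Made-up annotations: G3 has no enriched terms and yields an all-zero row.
annotations = {"G1": ["t1", "t2"], "G2": ["t2"], "G3": []}
genes, terms, rows = binary_feature_matrix(annotations)
```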

EndotypY.kl_clustering.kl_clustering_endotypes(data: DataFrame, linkage_method: str = 'complete', distance_metric: str = 'hamming', alpha: float = 0.05) → dict[source]

Performs KL-based hierarchical clustering analysis on the provided dataset.

Parameters:
  • data (pd.DataFrame) – The input binary dataset with samples as rows and features as columns.

  • linkage_method (str) – The linkage method to use for hierarchical clustering.

  • distance_metric (str) – The distance metric to use for calculating pairwise distances.
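
For orientation, the default distance_metric ‘hamming’ on binary rows can be sketched as the fraction of positions in which two rows differ. This shows only the distance, not the clustering procedure, and the helper names are hypothetical.

```python
def hamming_distance(row_a, row_b):
    """Fraction of positions at which two equal-length binary rows differ."""
    if len(row_a) != len(row_b):
        raise ValueError("rows must have the same length")
    if not row_a:
        raise ValueError("rows must not be empty")
    return sum(a != b for a, b in zip(row_a, row_b)) / len(row_a)

def pairwise_hamming(rows):
    """Symmetric matrix of Hamming distances between all rows."""
    return [[hamming_distance(a, b) for b in rows] for a in rows]

# Made-up binary feature rows for three genes.
feature_rows = [[1, 1, 0, 0],
                [1, 0, 0, 1],
                [1, 1, 0, 0]]
distances = pairwise_hamming(feature_rows)
```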

EndotypY.metagraph_visualization module

EndotypY.metagraph_visualization.plot_endotypes_metagraph(G, d_clusters, seed_genes, filter_size_endotypes=True, node_size=15)[source]

Wrapper function to create meta positions, scale and position endotype subgraphs, and visualize the final layout.

Parameters:

G (nx.Graph)

The original full network graph.

d_clusters (dict)

A dictionary where keys are endotype identifiers and values are lists of genes belonging to each endotype.

seed_genes (list)

List of seed genes used for endotype identification.

filter_size_endotypes (bool)

If True, only endotype subgraphs with at least one edge and more than 5 nodes are considered for visualization.

node_size (int)

Size of the nodes in the final visualization.

EndotypY.prepare_rwr module

EndotypY.prepare_rwr.prep_rwr(G, r=0.8)[source]

EndotypY.rwr module

EndotypY.rwr.extract_connected_module(G, seed_genes: list, rwr_results, k: int)[source]
EndotypY.rwr.make_p0(G, seeds, scaling: bool)[source]
EndotypY.rwr.rwr(G, seed_genes, scaling, rwr_matrix, scaling_matrix, d_idx_ensembl)[source]
EndotypY.rwr.rwr_from_individual_genes(G, seed_genes, scaling: bool, rwr_matrix, scaling_matrix, d_idx_ensembl)[source]

Run RWR starting from every single gene in seed_genes and store the results.

EndotypY.seed_clusters module

EndotypY.seed_clusters.run_seed_clustering(G, seed_genes, d_rwr_individuals, k_max=200)[source]

Run the seed clustering process. This function computes the RWR for each seed gene, clusters them based on their neighborhoods, and plots the results.

Parameters:
  • G (nx.Graph) – NetworkX graph representing the connected protein-protein interaction network.

  • seed_genes (list) – List of seed genes to be clustered.

  • scaling (bool) – Whether to apply scaling in the RWR algorithm.

  • rwr_matrix (numpy array) – RWR matrix for the graph.

  • scaling_matrix (numpy array) – Scaling matrix for the graph.

  • d_idx_ensembl (dict) – Dictionary mapping indices to Ensembl IDs.

  • k_max (int) – Maximum neighborhood size to test.

EndotypY.utils module

EndotypY.utils.convert_entrez_to_symbols(gene_ids)[source]
EndotypY.utils.convert_symbols_to_entrez(gene_ids)[source]
EndotypY.utils.download_enrichr_library(enrichr_lib: str, organism='Human', force_download=False)[source]

Downloads and caches Enrichr library locally.

EndotypY.visualization module

EndotypY.visualization.plot_endotype(endotype, G, seed_genes, size_height=14, size_width=14, node_size='betweenness', path_length=2, endotype_color='cornflowerblue', layout_seed=2025, return_plot=True)[source]

Draws a subgraph of the PPI network containing the endotype genes and their shortest paths.

Parameters:

endotype: list

List of endotype genes to be highlighted in the graph.

G: networkx.Graph

The Graph object representing the reference network

seed_genes: list

List of seed genes used for endotype identification.

size_height: int, optional

Height of the figure in inches.

size_width: int, optional

Width of the figure in inches.

node_size: str, optional

Determines the centrality measure for size of the nodes in the graph. Options are ‘betweenness’ or ‘degree’.

path_length: int, optional

The length of the shortest paths to consider between endotype genes. Default is 2.

endotype_color: str, optional

Color used to highlight the endotype genes in the graph. Default is ‘cornflowerblue’.

layout_seed: int, optional

Seed for the spring layout of the graph.

return_plot: bool, optional

If True, the function will not plot the graph but will still return the subgraph.

Returns:

subgraph: networkx.Graph

The subgraph containing the endotype genes and their shortest paths.

EndotypY.visualization.plot_endotype_grid(endotypes, G, seed_genes, size_height=500, size_width=500, ncols=3, node_size='degree', path_length=2, layout_seed=2025, layout='spring', limit_lcc=True, enrichr_lib=None, organism='Human', top_terms=5, force_download=False, gsea_plot_type='dotplot')[source]

Draws multiple endotypes in a grid of subplots using Plotly, with optional GSEA visualization.

Parameters:

endotypes: dict of lists

Dictionary where keys are endotype names and values are lists of endotype genes.

G: networkx.Graph

The Graph object representing the reference network.

seed_genes: list

List of seed genes used for endotype identification.

size_height: int, optional

Height of each subplot in pixels.

size_width: int, optional

Width of each subplot in pixels.

ncols: int, optional

Number of columns in the grid layout.

node_size: str or int, optional

Determines the centrality measure for size of the nodes in the graph. Options are ‘betweenness’ or ‘degree’. If integer, it is used as a fixed node size.

path_length: int, optional

The length of the shortest paths to consider between endotype genes. Default is 2.

layout: str, optional

The layout algorithm to use for positioning nodes. Options are ‘spring’ or ‘kk’.

layout_seed: int, optional

Seed for the spring layout of the graph.

limit_lcc: bool, optional

If True, limits each endotype subgraph to its largest connected component.

enrichr_lib: str, optional

Name of the Enrichr library to use for GSEA. If None, no GSEA is performed.

organism: str, optional

Organism for GSEA. Default is ‘Human’.

top_terms: int, optional

Number of top enriched terms to display in plots. Default is 5.

force_download: bool, optional

Force re-download of Enrichr library. Default is False.

gsea_plot_type: str, optional

Type of plot for GSEA results. Options: dotplot and pie. Default is ‘dotplot’.

Returns:

fig: plotly.graph_objects.Figure

The Plotly figure containing the grid of endotype plots.

enrichment_results: dict, optional

Dictionary of GSEA results for each endotype (only if enrichr_lib is provided).

EndotypY.visualization.plot_multiple_endotypes(endotypes, G, seed_genes, size_height=18, size_width=20, node_size=100, path_length=2, layout_seed=2025, layout='spring', limit_lcc=True)[source]

Draws multiple endotypes in a single plot.

Parameters:

endotypes: dict of lists

Dictionary where keys are endotype names and values are lists of endotype genes.

G: networkx.Graph

The Graph object representing the reference network.

seed_genes: list

List of seed genes used for endotype identification.

size_height: int, optional

Height of the figure in inches.

size_width: int, optional

Width of the figure in inches.

node_size: str or int, optional

Determines the centrality measure for size of the nodes in the graph. Options are ‘betweenness’ or ‘degree’. If integer, it is used as a fixed node size.

path_length: int, optional

The length of the shortest paths to consider between endotype genes. Default is 2.

layout_seed: int, optional

Seed for the spring layout of the graph.

Returns:

combined_subgraph: networkx.Graph

Combined graph containing all endotypes and their shortest paths.

Module contents