EndotypY package
Submodules
EndotypY.endotyper module
- class EndotypY.endotyper.Endotyper[source]
Bases: object
Endotyper class for endotyping analysis.
This class provides methods to perform endotyping analysis using a random walk approach. It includes methods for reading networks, preparing the random walk matrix, and performing the endotyping analysis.
Attributes:
- network_file: str | Path
The path to the input network file.
- r: float
The damping factor for the random walk.
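As a rough orientation, the methods of this class might be combined as in the sketch below. The call order is an assumption inferred from the method descriptions, the argument-free constructor is assumed from the class signature, and 'GO_Biological_Process_2023' is only one example of an Enrichr library name. The import sits inside the function so the snippet can be defined even where the package is not installed.

```python
def run_endotyping(network_path, seeds_path,
                   enrichr_lib="GO_Biological_Process_2023"):
    """Run the documented Endotyper steps in an assumed order."""
    # Imported lazily; assumes the EndotypY package is installed.
    from EndotypY.endotyper import Endotyper

    endo = Endotyper()  # assumption: the constructor takes no arguments
    endo.import_network(network_path)   # tab-separated edge list
    endo.import_seeds(seeds_path)       # one seed gene per line
    endo.prepare_rwr(r=0.8)             # build the RWR and scaling matrices
    endo.define_local_neighborhood(neighbor_percentage=1, scaling=True)
    endo.annotate_local_neighborhood(
        enrichr_lib=enrichr_lib, organism="Human", sig_threshold=0.01
    )
    endo.define_kl_endotypes(
        distance_metric="hamming", linkage_method="complete", alpha=0.05
    )
    return endo
```

A call such as run_endotyping("data/network.tsv", "data/seeds.txt") would use hypothetical paths; the plotting methods can then be called on the returned object.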
- annotate_local_neighborhood(enrichr_lib: str, organism='Human', sig_threshold=0.01, force_download=False)[source]
Get the Gene Ontology (GO) terms for a given gene and its RWR-defined neighbors. This function uses the Enrichr library to perform Gene Set Enrichment Analysis (GSEA) and returns the significant terms for the expanded neighborhood of genes. The significance threshold is applied to the enrichment p-value.
- Parameters:
enrichr_lib (str) – The name of the Enrichr library to use for GSEA.
organism (str) – The organism for which the GSEA is performed. Default is ‘Human’.
sig_threshold (float) – The significance threshold for the GSEA results. Default is 0.01.
force_download (bool) – Force re-download of the Enrichr library. Default is False.
- define_kl_endotypes(distance_metric: str = 'hamming', linkage_method: str = 'complete', alpha: float = 0.05)[source]
Define endotypes based on KL divergence. This function computes a binary feature matrix from the neighborhood annotations: rows are genes, columns are enrichment terms, and each entry is 1 if the term is enriched for the gene plus its local neighborhood, and 0 otherwise.
It then performs KL divergence clustering on this matrix to identify endotypes.
- Returns:
The Endotyper object with the endotypes defined.
- Return type:
self
- define_local_neighborhood(neighbor_percentage=1, scaling=True)[source]
Run RWR starting from every single gene in seed_genes and extract, for each seed gene, the top neighbor_percentage % of genes ranked by visiting probability.
- Parameters:
neighbor_percentage (int) – Percentage of top genes to identify.
scaling (bool) – Whether to apply scaling to the RWR.
- explore_seed_clusters(scaling=True, k=200)[source]
Run the seed clustering process. This function computes the RWR for each seed gene, clusters them based on their neighborhoods, and plots the results.
- Parameters:
k (int) – Maximum neighborhood size to test. Defaults to 200.
scaling (bool) – Whether to apply scaling to the RWR. Defaults to True.
- import_network(network_file: str)[source]
Imports a network from a file.
- Parameters:
network_file (str) – Path to the network file. Supported formats are ‘.txt’, ‘.tsv’, or ‘.csv’ files with two tab-separated columns representing edges.
- Returns:
The Endotyper object.
- Return type:
self
Notes
Lines that start with ‘#’ will be ignored.
Self-loops are eliminated in the last filtering step.
- import_seeds(seeds_file: str)[source]
Imports seeds from a file and sets them as the seeds for the object.
- Parameters:
seeds_file (str) – The path to the seeds file.
- Returns:
The Endotyper object.
- Return type:
self
Notes
The seeds file should contain a list of seed genes, one per line.
Alternatively, the seeds can be given as tab-separated entries on the first line of the file.
- plot_endotype_grid(size_height=500, size_width=500, ncols=2, node_size='degree', path_length=2, layout_seed=2025, enrichr_lib=None, top_terms=5, gsea_plot_type='dotplot')[source]
Plots endotypes in a grid layout using Plotly with optional GSEA visualization.
This function creates an interactive grid visualization of all identified endotypes, combining endotypes from different iterations into a single grid view.
- Parameters:
size_height (int, optional) – Height of each subplot in pixels. Defaults to 500.
size_width (int, optional) – Width of each subplot in pixels. Defaults to 500.
ncols (int, optional) – Number of columns in the grid layout. Defaults to 2.
node_size (str or int, optional) – Determines the centrality measure for node sizing. Options are ‘betweenness’ or ‘degree’. If integer, used as fixed node size. Defaults to ‘degree’.
path_length (int, optional) – Length of shortest paths to consider between endotype genes. Defaults to 2.
layout_seed (int, optional) – Seed for the spring layout. Defaults to 2025.
enrichr_lib (str, optional) – Name of Enrichr library for GSEA. If None, no GSEA is performed. Defaults to None.
top_terms (int, optional) – Number of top enriched terms to display. Defaults to 5.
gsea_plot_type (str, optional) – Type of plot for GSEA results (‘dotplot’ or ‘pie’). Defaults to ‘dotplot’.
- Returns:
Plotly figure and optionally GSEA enrichment results if enrichr_lib is provided.
- plot_endotypes(node_size: list = ['degree', 'betweenness'], layout: str = 'spring', path_length: int = 2)[source]
Plots multiple endotypes on the network. This function iterates through the endotypes dictionary, combining endotypes from different iterations into a single dictionary. It then calls the plot_multiple_endotypes function to visualize these combined endotypes on the network.
- Parameters:
node_size (list, optional) – Network measures to use for node sizing. Defaults to [‘degree’, ‘betweenness’].
layout (str, optional) – The layout algorithm to use for the network plot. Defaults to ‘spring’.
path_length (int, optional) – The path length to use for shortest path calculations. Defaults to 2.
- plot_endotypes_metagraph(filter_size_endotypes=True, node_size=15)[source]
Build an endotype metagraph visualization. Individual endotype subgraphs (positioned via spring layout) are placed at meta-positions determined by inter-endotype connectivity, then globally scaled and rendered with colored hulls, intra-endotype edges, inter-endotype connections, and highlighted seed genes, using datamapplot and matplotlib.
- Parameters:
filter_size_endotypes (bool, optional) – Whether to filter endotypes by size. If True, only endotype subgraphs with at least one edge and more than 5 nodes are kept. Defaults to True.
node_size (int, optional) – Size of nodes in the plot. Defaults to 15.
- prepare_rwr(r=0.8)[source]
Prepares the Random Walk with Restart (RWR) matrix.
This function computes the RWR matrix based on the network and restart probability, using the formula (I-r*M)^-1 where M is the column-wise normalized Markov matrix according to M = A D^{-1}.
To provide the option of scaling the visiting probabilities, a scaling matrix is also created, which is the diagonal matrix of the inverse degree of the nodes in graph G.
- Parameters:
r (float, optional) – Damping factor/restart probability. Defaults to 0.8.
- Returns:
The Endotyper object with the RWR matrix, scaling matrix, and index-to-Ensembl mapping stored as attributes.
- Return type:
self
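The formula in the prepare_rwr description can be illustrated on a toy graph. The sketch below is a pure-Python illustration, not the package's implementation: the helper names are hypothetical, the graph is assumed to be undirected with no isolated nodes, and the expression (I - r*M)^-1 is taken exactly as written in the docstring.

```python
def column_normalize(adjacency):
    """Return M = A D^{-1}: every column of A divided by that node's degree."""
    n = len(adjacency)
    degrees = [sum(adjacency[i][j] for i in range(n)) for j in range(n)]
    return [[adjacency[i][j] / degrees[j] for j in range(n)] for i in range(n)]


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    n = len(matrix)
    aug = [list(map(float, row)) + [float(i == j) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda row: abs(aug[row][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        scale = aug[col][col]
        aug[col] = [value / scale for value in aug[col]]
        for row in range(n):
            if row != col:
                factor = aug[row][col]
                aug[row] = [a - factor * b for a, b in zip(aug[row], aug[col])]
    return [row[n:] for row in aug]


def rwr_matrix(adjacency, r=0.8):
    """Compute (I - r*M)^-1 with M the column-normalized adjacency matrix."""
    n = len(adjacency)
    M = column_normalize(adjacency)
    system = [[float(i == j) - r * M[i][j] for j in range(n)] for i in range(n)]
    return invert(system)


def inverse_degree_matrix(adjacency):
    """Diagonal scaling matrix holding 1/degree for every node."""
    n = len(adjacency)
    return [[1.0 / sum(adjacency[i]) if i == j else 0.0 for j in range(n)]
            for i in range(n)]


# Toy path graph a - b - c (hypothetical data).
adjacency = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
W = rwr_matrix(adjacency, r=0.8)
D_inv = inverse_degree_matrix(adjacency)
```

Because M is column-stochastic, every column of W sums to 1 / (1 - r), which gives a quick sanity check on the result.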
EndotypY.expansion module
- EndotypY.expansion.calculate_top_genes(G, input_gene_list, rwr_matrix, scaling_matrix, d_idx_ensembl, neighbor_percentage=1, scaling=True)[source]
Calculate the top neighbor_percentage % of genes (1% by default) for each gene in input_gene_list using the Random Walk with Restart (RWR) algorithm.
- Parameters:
G (nx.Graph object) – NetworkX graph object.
input_gene_list (list of str) – List of genes.
scaling (bool) – Whether to scale the visiting probabilities.
rwr_matrix (numpy array) – Random walk matrix.
scaling_matrix (numpy array) – Diagonal matrix of the inverse degree of the nodes.
d_ensembl_idx (dict) – Dictionary mapping gene Ensembl IDs to their indices.
d_idx_ensembl (dict) – Dictionary mapping indices to gene Ensembl IDs.
neighbor_percentage (float) – Percentage of top genes to identify.
- Returns:
top_genes – Dictionary of the desired top % genes for each gene in input_gene_list.
- Return type:
dict
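The selection step described above can be sketched without the RWR machinery. The example assumes the visiting probabilities for each seed are already available as a plain dictionary; the function names, the rounding rule (ceiling, at least one gene), and the toy scores are all assumptions for illustration.

```python
import math


def top_percent_genes(visiting_probabilities, neighbor_percentage=1):
    """Return the genes in the top neighbor_percentage % by visiting probability."""
    n_genes = len(visiting_probabilities)
    # Assumed rounding: round up and always keep at least one gene.
    n_top = max(1, math.ceil(n_genes * neighbor_percentage / 100))
    ranked = sorted(visiting_probabilities,
                    key=visiting_probabilities.get, reverse=True)
    return ranked[:n_top]


def top_genes_per_seed(rwr_scores, neighbor_percentage=1):
    """Build {seed: [top genes]} from {seed: {gene: visiting probability}}."""
    return {seed: top_percent_genes(scores, neighbor_percentage)
            for seed, scores in rwr_scores.items()}


# Toy scores for 200 genes around one hypothetical seed.
scores = {f"g{i}": i / 1000 for i in range(200)}
top = top_genes_per_seed({"seedA": scores}, neighbor_percentage=1)
```

With 200 genes and neighbor_percentage=1, two genes are kept per seed.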
- EndotypY.expansion.convert_gs_lib_to_dict(go_terms_dict)[source]
Converts a dictionary of enrichment terms from the format {term: [gene1, gene2, …]} to the format {gene: [term1, term2, …]}.
- Parameters:
go_terms_dict (dict or list) – Dictionary where keys are enrichment term IDs and values are lists of gene symbols associated with those terms. Or a list from gseapy that needs to be handled differently.
- Returns:
go_terms – Dictionary where keys are gene symbols and values are lists of enrichment term IDs associated with those genes.
- Return type:
dict
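The inversion from term-to-genes to gene-to-terms can be sketched as follows. This covers only the dictionary input; the list form returned by gseapy that the docstring mentions is not handled, and the function name is hypothetical.

```python
from collections import defaultdict


def invert_term_library(term_to_genes):
    """Turn {term: [gene, ...]} into {gene: [term, ...]}."""
    gene_to_terms = defaultdict(list)
    for term, genes in term_to_genes.items():
        for gene in genes:
            gene_to_terms[gene].append(term)
    return dict(gene_to_terms)


library = {"GO:1": ["A", "B"], "GO:2": ["B"]}
by_gene = invert_term_library(library)
```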
- EndotypY.expansion.get_GSEA_significant_terms(gene_list, library, sig_threshold, organism='human', background=None)[source]
Perform Gene Set Enrichment Analysis (GSEA) using gseapy’s enrichr implementation and return significant terms.
- Parameters:
gene_list (list) – List of genes to analyze.
library (gene set enrichment library) – Dictionary of term - gene associations (output of gp.get_library() function).
background (list) – Background gene list (default: None).
- Returns:
significant_terms – List of significant terms from GSEA.
- Return type:
list
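Over-representation analysis of this kind is commonly scored with a hypergeometric (one-sided Fisher) test. The sketch below illustrates that idea in pure Python; whether gseapy applies exactly this test, and whether raw or adjusted p-values are compared with sig_threshold, are assumptions here. The function names and toy gene sets are hypothetical.

```python
from math import comb


def hypergeom_pvalue(overlap, term_size, list_size, background_size):
    """P(X >= overlap) when drawing list_size genes from the background."""
    upper = min(term_size, list_size)
    total = comb(background_size, list_size)
    tail = sum(comb(term_size, k) * comb(background_size - term_size, list_size - k)
               for k in range(overlap, upper + 1))
    return tail / total


def significant_terms(gene_list, library, sig_threshold, background):
    """Return the terms whose enrichment p-value is below sig_threshold."""
    background = set(background)
    genes = set(gene_list) & background
    hits = []
    for term, members in library.items():
        members = set(members) & background
        overlap = len(genes & members)
        if overlap == 0:
            continue  # a term with no overlap cannot be enriched
        p_value = hypergeom_pvalue(overlap, len(members), len(genes), len(background))
        if p_value < sig_threshold:
            hits.append(term)
    return hits


# Toy data: 100 background genes and three hypothetical terms.
background = [f"g{i}" for i in range(100)]
library = {
    "A": [f"g{i}" for i in range(5)],               # fully covered by the list
    "B": [f"g{i}" for i in range(50, 100)],         # no overlap
    "C": ["g0"] + [f"g{i}" for i in range(10, 59)]  # one chance overlap
}
hits = significant_terms([f"g{i}" for i in range(5)], library, 0.01, background)
```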
- EndotypY.expansion.get_gene_and_neighborhood_enrichment_terms(gene, top_genes, term_library, sig_threshold)[source]
Get the Gene Ontology (GO) terms for a given gene symbol and its neighbors.
- Parameters:
gene (str) – Gene symbol (HGNC) of the gene.
top_genes (dict) – Dictionary of the desired top % genes for each gene in input_gene_list (output of calculate_top_genes() function).
term_library (dict) – Dictionary of term - gene associations (output of gp.get_library() function).
- Returns:
all_go_terms – List of unique GO term IDs for the gene and its neighbors.
- Return type:
list
- EndotypY.expansion.get_module_neighborhood_terms_dict(top_genes, term_library, sig_threshold)[source]
Get the Gene Ontology (GO) terms for all genes in a given disease core module using parallel processing.
- Parameters:
top_genes (dict) – Dictionary of the desired top % genes for each gene in input_gene_list (output of calculate_top_genes() function).
term_library (dict) – Dictionary of term - gene associations (output of gp.get_library() function).
n_cores (int, optional) – Number of cores to use for parallel processing (default: mp.cpu_count() - 2).
- Returns:
go_terms_dict – Dictionary of unique GO term IDs for each gene in the disease core module.
- Return type:
dict
- EndotypY.expansion.process_gene(gene, top_genes, term_library, sig_threshold)[source]
Process a single gene to retrieve its GO terms.
- Parameters:
gene (str) – Gene symbol to process.
top_genes (dict) – Dictionary of top genes.
term_library (dict) – Dictionary of term - gene associations.
- Returns:
A tuple containing the gene symbol and its associated GO terms.
- Return type:
tuple
EndotypY.import_export module
- EndotypY.import_export.load_seed_set_from_file(seed_file) → set[source]
Reads a seed set from an external file.
Lines starting with ‘#’ will be ignored.
Parameters:
- seed_file: str | Path
The path to the input file. It can be provided as a string (e.g., “data/seeds.txt”) or as a pathlib.Path object (e.g., Path(“data”, “seeds.txt”)).
Returns:
- set
A set of unique seeds from the file.
Raises:
- FileNotFoundError
If the specified file does not exist.
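The parsing rules for seed files can be sketched as below. The function operates on an iterable of lines so it can be tried without a file; its name is hypothetical, and treating every tab-separated token on any line as a seed is an assumption that covers both the one-per-line and the tab-separated layouts.

```python
def parse_seed_lines(lines):
    """Collect unique seeds, skipping blank lines and '#' comments."""
    seeds = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # One seed per line, or several tab-separated seeds on a line.
        seeds.update(token for token in line.split("\t") if token)
    return seeds


one_per_line = parse_seed_lines(["# seeds\n", "TP53\n", "BRCA1\n", "TP53\n"])
tab_separated = parse_seed_lines(["A\tB\tC\n"])
```

To read from disk, the same function can be fed an open file handle, which raises FileNotFoundError when the path does not exist.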
- EndotypY.import_export.read_network_from_file(network_file: str | Path) → Graph[source]
Reads a network from an external file.
The edgelist must be provided as a tab-separated table. The first two columns of the table will be interpreted as an interaction gene1 <==> gene2.
Lines that start with ‘#’ will be ignored.
The function checks that the input file ends with ‘.txt’ or ‘.tsv’ or ‘.csv’.
Self-loops are eliminated in the last filtering step of the function.
Parameters:
- network_file: str | Path
The path to the input file. It can be provided as a string (e.g., “data/network.txt”) or as a pathlib.Path object (e.g., Path(“data”, “network.txt”)).
Returns:
- nx.Graph
A NetworkX graph with nodes and edges from the file.
Raises:
- ValueError
If the file format is not ‘.txt’ or ‘.tsv’ or ‘.csv’.
- FileNotFoundError
If the specified file does not exist.
Notes:
Lines starting with ‘#’ are ignored.
Self-loops (edges where node1 == node2) are removed.
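The reading rules above can be sketched in plain Python. This is not the package's implementation: it returns a set of undirected edges rather than a NetworkX graph, drops self-loops while parsing rather than in a final filtering step, and uses hypothetical function names.

```python
from pathlib import Path

ALLOWED_SUFFIXES = {".txt", ".tsv", ".csv"}


def parse_edge_lines(lines):
    """Parse tab-separated edges; skip '#' comments and self-loops."""
    edges = set()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        if len(fields) < 2:
            continue  # assumption: rows without two columns are skipped
        gene1, gene2 = fields[0], fields[1]  # only the first two columns are used
        if gene1 == gene2:
            continue  # self-loop
        edges.add(frozenset((gene1, gene2)))  # undirected edge
    return edges


def read_edges(network_file):
    """Check the file extension, then parse the edge list."""
    path = Path(network_file)
    if path.suffix not in ALLOWED_SUFFIXES:
        raise ValueError(f"unsupported file format: {path.suffix!r}")
    with path.open() as handle:  # raises FileNotFoundError if missing
        return parse_edge_lines(handle)


edges = parse_edge_lines(
    ["# comment\n", "A\tB\n", "B\tA\n", "C\tC\n", "B\tC\t0.9\n"]
)
```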
EndotypY.kl_clustering module
- EndotypY.kl_clustering.compute_feature_matrix(go_terms_dict)[source]
Create a binary feature matrix from a dictionary of GO terms for each gene.
- Parameters:
go_terms_dict (dict) – Dictionary of unique GO term IDs for each gene in the disease core module.
- Returns:
feature_matrix – Binary feature matrix where rows correspond to gene IDs and columns represent GO term IDs.
- Return type:
pd.DataFrame
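The construction of the binary matrix can be sketched with plain lists instead of a pandas DataFrame. The function name, the sorted row and column order, and the toy input are assumptions for illustration.

```python
def binary_feature_matrix(go_terms_dict):
    """Return (genes, terms, rows) where rows[i][j] is 1 if gene i has term j."""
    genes = sorted(go_terms_dict)
    terms = sorted({term for gene_terms in go_terms_dict.values()
                    for term in gene_terms})
    rows = []
    for gene in genes:
        present = set(go_terms_dict[gene])
        rows.append([1 if term in present else 0 for term in terms])
    return genes, terms, rows


go_terms = {"geneB": ["GO:2"], "geneA": ["GO:1", "GO:2"]}
genes, terms, rows = binary_feature_matrix(go_terms)
```

The three return values map directly onto the index, columns, and values of a DataFrame if one is needed.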
- EndotypY.kl_clustering.kl_clustering_endotypes(data: DataFrame, linkage_method: str = 'complete', distance_metric: str = 'hamming', alpha: float = 0.05) → dict[source]
Perform KL-based hierarchical clustering analysis on the provided dataset.
Parameters:
- data: pd.DataFrame
The input binary dataset with samples as rows and features as columns.
- linkage_method: str
The linkage method to use for hierarchical clustering.
- distance_metric: str
The distance metric to use for calculating pairwise distances.
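For the default distance_metric of 'hamming', the pairwise distance between two binary rows is commonly defined as the fraction of positions in which they differ. The sketch below illustrates only that distance step under this assumption; it does not reproduce the KL-based splitting or the role of alpha, and the function names are hypothetical.

```python
def hamming_distance(u, v):
    """Fraction of positions at which two equal-length vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v)) / len(u)


def pairwise_hamming(rows):
    """Full symmetric distance matrix between all rows."""
    n = len(rows)
    return [[hamming_distance(rows[i], rows[j]) for j in range(n)]
            for i in range(n)]


# Three genes described by four hypothetical enrichment terms.
rows = [[1, 1, 0, 0], [1, 0, 0, 0], [0, 0, 1, 1]]
D = pairwise_hamming(rows)
```

Genes sharing most terms end up close together, which is what the subsequent linkage step groups on.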
EndotypY.metagraph_visualization module
- EndotypY.metagraph_visualization.plot_endotypes_metagraph(G, d_clusters, seed_genes, filter_size_endotypes=True, node_size=15)[source]
Wrapper function to create meta positions, scale and position endotype subgraphs, and visualize the final layout.
Parameters:
- G: nx.Graph
The original full network graph.
- d_clusters: dict
A dictionary where keys are endotype identifiers and values are lists of genes belonging to each endotype.
- seed_genes: list
List of seed genes used for endotype identification.
- filter_size_endotypes: bool
If True, only endotype subgraphs with at least one edge and more than 5 nodes are considered for visualization.
- node_size: int
Size of the nodes in the final visualization.
EndotypY.prepare_rwr module
EndotypY.rwr module
EndotypY.seed_clusters module
- EndotypY.seed_clusters.run_seed_clustering(G, seed_genes, d_rwr_individuals, k_max=200)[source]
Run the seed clustering process. This function computes the RWR for each seed gene, clusters them based on their neighborhoods, and plots the results.
- Parameters:
G – NetworkX graph representing the connected protein-protein interaction network.
seed_genes – List of seed genes to be clustered.
scaling – Scaling matrix for the RWR algorithm.
rwr_matrix – RWR matrix for the graph.
scaling_matrix – Scaling matrix for the graph.
d_idx_ensembl – Dictionary mapping indices to Ensembl IDs.
k_max – Maximum neighborhood size to test.
EndotypY.utils module
EndotypY.visualization module
- EndotypY.visualization.plot_endotype(endotype, G, seed_genes, size_height=14, size_width=14, node_size='betweenness', path_length=2, endotype_color='cornflowerblue', layout_seed=2025, return_plot=True)[source]
Draws a subgraph of the PPI network containing the endotype genes and their shortest paths.
Parameters:
- endotype: list
List of endotype genes to be highlighted in the graph.
- G: networkx.Graph
The Graph object representing the reference network
- seed_genes: list
List of seed genes used for endotype identification.
- size_height: int, optional
Height of the figure in inches.
- size_width: int, optional
Width of the figure in inches.
- node_size: str, optional
Determines the centrality measure for size of the nodes in the graph. Options are ‘betweenness’ or ‘degree’.
- path_length: int, optional
The length of the shortest paths to consider between endotype genes. Default is 2.
- endotype_color: str, optional
Color used to highlight the endotype genes in the graph. Default is ‘cornflowerblue’.
- layout_seed: int, optional
Seed for the spring layout of the graph.
- return_plot: bool, optional
If True, the function will not plot the graph but will still return the subgraph.
Returns:
- subgraph: networkx.Graph
The subgraph containing the endotype genes and their shortest paths.
- EndotypY.visualization.plot_endotype_grid(endotypes, G, seed_genes, size_height=500, size_width=500, ncols=3, node_size='degree', path_length=2, layout_seed=2025, layout='spring', limit_lcc=True, enrichr_lib=None, organism='Human', top_terms=5, force_download=False, gsea_plot_type='dotplot')[source]
Draws multiple endotypes in a grid of subplots using Plotly, with optional GSEA visualization.
Parameters:
- endotypes: dict of lists
Dictionary where keys are endotype names and values are lists of endotype genes.
- G: networkx.Graph
The Graph object representing the reference network.
- seed_genes: list
List of seed genes used for endotype identification.
- size_height: int, optional
Height of each subplot in pixels.
- size_width: int, optional
Width of each subplot in pixels.
- ncols: int, optional
Number of columns in the grid layout.
- node_size: str or int, optional
Determines the centrality measure for size of the nodes in the graph. Options are ‘betweenness’ or ‘degree’. If integer, it is used as a fixed node size.
- path_length: int, optional
The length of the shortest paths to consider between endotype genes. Default is 2.
- layout: str, optional
The layout algorithm to use for positioning nodes. Options are ‘spring’ or ‘kk’.
- layout_seed: int, optional
Seed for the spring layout of the graph.
- limit_lcc: bool, optional
If True, limits each endotype subgraph to its largest connected component.
- enrichr_lib: str, optional
Name of the Enrichr library to use for GSEA. If None, no GSEA is performed.
- organism: str, optional
Organism for GSEA. Default is ‘Human’.
- top_terms: int, optional
Number of top enriched terms to display in plots. Default is 5.
- force_download: bool, optional
Force re-download of Enrichr library. Default is False.
- gsea_plot_type: str, optional
Type of plot for GSEA results. Options: dotplot and pie. Default is ‘dotplot’.
Returns:
- fig: plotly.graph_objects.Figure
The Plotly figure containing the grid of endotype plots.
- enrichment_results: dict, optional
Dictionary of GSEA results for each endotype (only if enrichr_lib is provided).
- EndotypY.visualization.plot_multiple_endotypes(endotypes, G, seed_genes, size_height=18, size_width=20, node_size=100, path_length=2, layout_seed=2025, layout='spring', limit_lcc=True)[source]
Draws multiple endotypes in a single plot.
Parameters:
- endotypes: dict of lists
Dictionary where keys are endotype names and values are lists of endotype genes.
- G: networkx.Graph
The Graph object representing the reference network.
- seed_genes: list
List of seed genes used for endotype identification.
- size_height: int, optional
Height of the figure in inches.
- size_width: int, optional
Width of the figure in inches.
- node_size: str or int, optional
Determines the centrality measure for size of the nodes in the graph. Options are ‘betweenness’ or ‘degree’. If integer, it is used as a fixed node size.
- path_length: int, optional
The length of the shortest paths to consider between endotype genes. Default is 2.
- layout_seed: int, optional
Seed for the spring layout of the graph.
Returns:
- combined_subgraph: networkx.Graph
Combined graph containing all endotypes and their shortest paths.