geneplexus.geneplexus

`load_genes`(input_genes)	Load gene list and convert to Entrez.
`load_negatives`(input_negatives)	Load gene list and convert to Entrez; these genes will be used as negatives.
`cluster_input`([clust_method, ...])	Cluster input gene list.
`fit`([logreg_kwargs, scale, min_num_pos, ...])	Fit the model.
`predict`()	Predict gene scores from fit model.
`make_sim_dfs`()	Compute similarities between the input genes and GO, Monarch and/or Mondo.
`make_small_edgelist`([num_nodes])	Make a subgraph induced by the top predicted genes.
`save_class`([output_dir, save_type, ...])	Save all or parts of the GenePlexus class and results.
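
The methods above are typically chained on one GenePlexus object. The sketch below assumes they are called in roughly the order listed, that the package is installed, and that the data files can be downloaded; `INIT_KWARGS` and `run_geneplexus` are illustrative names, not part of the package. The optional cluster_input step is left out.

```python
# Constructor arguments copied from the documented defaults, except
# auto_download, which is switched on so missing data files are fetched.
INIT_KWARGS = {
    "net_type": "STRING",
    "features": "SixSpeciesN2V",
    "sp_trn": "Human",
    "sp_res": "Human",
    "gsc_trn": "Combined",
    "gsc_res": "Combined",
    "auto_download": True,
}


def run_geneplexus(input_genes, output_dir=None):
    """Hypothetical helper: call the documented methods in listing order."""
    # Imported lazily so this module loads even without the package installed.
    from geneplexus import GenePlexus

    gp = GenePlexus(**INIT_KWARGS)
    gp.load_genes(input_genes)            # convert the input IDs to Entrez
    gp.fit()                              # train with the default settings
    gp.predict()                          # score the genes
    gp.make_sim_dfs()                     # compare to GO / Monarch / Mondo models
    gp.make_small_edgelist(num_nodes=50)  # subgraph of the top predicted genes
    gp.save_class(output_dir=output_dir)  # write the results to disk
    return gp
```

Calling `run_geneplexus(["gene_id_1", "gene_id_2"])` would need real gene identifiers and network access for the download.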

class geneplexus.GenePlexus(file_loc=None, net_type='STRING', features='SixSpeciesN2V', sp_trn='Human', sp_res='Human', gsc_trn='Combined', gsc_res='Combined', input_genes=None, input_negatives=None, auto_download=False, log_level='INFO', log_to_file=False)[source]

The GenePlexus API class.

Initialize the GenePlexus object.

Parameters:

file_loc (str | None) – Location of the data files. If not specified, the default data path ~/.data/geneplexus is used.
net_type (Literal['BioGRID', 'STRING', 'IMP']) – Type of network to use
features (Literal['SixSpeciesN2V']) – Type of features of the network to use
sp_trn (Literal['Human', 'Mouse', 'Fly', 'Worm', 'Zebrafish', 'Yeast']) – The species of the training data
sp_res (Literal['All'] | Literal['Human', 'Mouse', 'Fly', 'Worm', 'Zebrafish', 'Yeast'] | List[Literal['Human', 'Mouse', 'Fly', 'Worm', 'Zebrafish', 'Yeast']]) – The species the results are in; can be a list.
gsc_trn (Literal['GO', 'Monarch', 'Mondo', 'Combined']) – Gene set collection used during training
gsc_res (Literal['GO', 'Monarch', 'Mondo', 'Combined'] | List[Literal['GO', 'Monarch', 'Mondo', 'Combined']]) – Gene set collection(s) used when generating results; can be a list. If a list is given, it needs to be the same length as the number of results species.
input_genes (List[str] | None) – Input gene list, can be mixed type. Can also be set later if not specified at init time by simply calling load_genes().
input_negatives (List[str] | None) – Input list of negative genes, can be mixed type. Can also be set later if not specified at init time by simply calling load_negatives().
auto_download (bool) – Automatically download necessary files if set.
log_level (Literal['CRITICAL', 'ERROR', 'WARNING', 'INFO', 'DEBUG']) – Logging level.
log_to_file (bool) – If True, the log is saved when save_class is run. This also creates a temporary file that can be explicitly deleted with remove_log_file.
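
The length rule for sp_res and gsc_res can be checked before constructing the object. This is a minimal sketch: `pair_results` is a hypothetical helper, the broadcasting of a single gsc_res string to every species is an assumption, and the special value 'All' is not expanded here.

```python
def pair_results(sp_res, gsc_res):
    """Pair each results species with a gene set collection (hypothetical helper)."""
    species = [sp_res] if isinstance(sp_res, str) else list(sp_res)
    if isinstance(gsc_res, str):
        # Assumption: a single collection is applied to every results species.
        collections = [gsc_res] * len(species)
    else:
        collections = list(gsc_res)
    if len(collections) != len(species):
        raise ValueError(
            f"gsc_res has {len(collections)} entries but sp_res has {len(species)}"
        )
    return list(zip(species, collections))


pairs = pair_results(["Human", "Mouse"], ["GO", "Monarch"])
```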

The following class attributes are set when __init__ is run:

GenePlexus._is_custom bool: If the species, network or feature type was supplied by the user.
GenePlexus._file_loc str: File path set for the data.
GenePlexus._features str: Type of network features used.
GenePlexus._sp_trn str: Species used in training.
GenePlexus._sp_res (str, List[str]): Species used in the results.
GenePlexus._gsc_trn str: Gene set collection used in training.
GenePlexus._gsc_res (str, List[str]): Gene set collection(s) used in results.
GenePlexus._net_type str: Type of network used.
GenePlexus.log_level str: The verbosity of the logger.
GenePlexus.log_to_file bool: Whether or not the log is saved to a file.
GenePlexus.auto_download bool: Whether an automatic download of the data was attempted.
GenePlexus.gsc_trn_original str: If internal data checks are run, this can be different from _gsc_trn.
GenePlexus.gsc_res_original (str, List[str]): If internal data checks are run, this can be different from _gsc_res.
GenePlexus.sp_gsc_pairs List[str]: The combination of all sp and gsc used, hyphen separated.
GenePlexus.model_info[ModelName] Class: model_info is a dictionary where each key is a different model and holds the ModelInfo class.
GenePlexus.model_info[ModelName].results[ResultName] Class: results is a dictionary where each key is a different result and holds ModelResults class.
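
Because model_info and results are nested dictionaries, every result can be reached with two loops. The sketch below uses SimpleNamespace stand-ins instead of a real GenePlexus object, and the keys "model_A", "Human-GO" and "Mouse-GO" are hypothetical.

```python
from types import SimpleNamespace

# Stand-in object with the documented nesting: model_info[ModelName].results[ResultName].
gp = SimpleNamespace(
    model_info={
        "model_A": SimpleNamespace(
            results={"Human-GO": SimpleNamespace(), "Mouse-GO": SimpleNamespace()}
        ),
    }
)


def iter_results(gp_obj):
    """Yield (model name, result name, result object) for every stored result."""
    for model_name, info in gp_obj.model_info.items():
        for result_name, result in info.results.items():
            yield model_name, result_name, result


keys = [(model, result) for model, result, _ in iter_results(gp)]
```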

cluster_input(clust_method='louvain', clust_min_size=15, clust_weighted=True, clust_kwargs={'domino_module_threshold': 0.05, 'domino_n_steps': 20, 'domino_res': 1, 'domino_seed': 123, 'domino_slice_thresh': 0.3, 'louvain_max_size': 70, 'louvain_max_tries': 3, 'louvain_res': 1, 'louvain_seed': 123})[source]

Cluster input gene list.

Parameters:

clust_method (Literal['louvain', 'domino']) – Clustering method to use (either louvain or domino).
clust_min_size (int) – Ignore clusters if smaller than this value.
clust_weighted (bool) – Whether or not to use weighted edges when building the clusters
clust_kwargs (Dict[str, Any] | None) – Keyword arguments specific to each clustering method.
louvain_max_size – (clust_kwarg, int) Try to recluster if a cluster is bigger than this value.
louvain_max_tries – (clust_kwarg, int) The number of times to recluster any clusters that are bigger than louvain_max_size. If this cannot be accomplished within louvain_max_tries attempts, the larger clusters are still retained.
louvain_res – (clust_kwarg, float) Resolution parameter in clustering algorithm.
louvain_seed – (clust_kwarg, int) Seed used in clustering. Choose None to have this set randomly.
domino_res – (clust_kwarg, float) resolution used to make initial slices.
domino_slice_thresh – (clust_kwarg, float) threshold used for calling slice significant
domino_n_steps – (clust_kwarg, int) number of steps used in pcst
domino_module_threshold – (clust_kwarg, float) threshold used to consider module significant
domino_seed – (clust_kwarg, int) random seed to be used in clustering algorithm
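
One way to change a few clustering keywords while keeping the rest at their documented defaults is sketched below. Whether cluster_input merges a partial dictionary with its defaults is not stated here, so the sketch always builds the complete dictionary; `DEFAULT_CLUST_KWARGS` and `make_clust_kwargs` are illustrative names.

```python
# Defaults copied from the documented cluster_input signature.
DEFAULT_CLUST_KWARGS = {
    "domino_module_threshold": 0.05,
    "domino_n_steps": 20,
    "domino_res": 1,
    "domino_seed": 123,
    "domino_slice_thresh": 0.3,
    "louvain_max_size": 70,
    "louvain_max_tries": 3,
    "louvain_res": 1,
    "louvain_seed": 123,
}


def make_clust_kwargs(**overrides):
    """Return a full clust_kwargs dict with selected values replaced."""
    unknown = set(overrides) - set(DEFAULT_CLUST_KWARGS)
    if unknown:
        raise KeyError(f"unknown clustering keyword(s): {sorted(unknown)}")
    return {**DEFAULT_CLUST_KWARGS, **overrides}


# Smaller maximum cluster size and a random seed for louvain.
louvain_kwargs = make_clust_kwargs(louvain_max_size=50, louvain_seed=None)
# gp.cluster_input(clust_method="louvain", clust_kwargs=louvain_kwargs)
```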

The following class attributes are set when cluster_input is run:

GenePlexus.clust_method (str): Clustering method used
GenePlexus.clust_min_size (int): Minimum size of clusters allowed
GenePlexus.clust_weighted (bool): Whether or not to use edge weights when generating clusters
GenePlexus.clust_kwargs (dict): Keyword arguments used for each clustering method
GenePlexus.num_genes_lost (int): Number of input_genes not in any cluster
GenePlexus.per_genes_lost (float): Percentage of input_genes not in any cluster
GenePlexus.num_genes_gained (int): Number of genes in clusters not in input_genes
GenePlexus.per_genes_gained (float): Percentage of genes in clusters not in input_genes
GenePlexus.genes_lost_clustered (List[str]): List of input_genes not in any cluster
GenePlexus.genes_gained_clustered (List[str]): List of cluster genes not in input_genes
GenePlexus.model_info[ModelName].model_genes (List[str]): List of genes used as positives for each clusters model
GenePlexus.model_info[ModelName].results[ResultName] (Class): For each clusters model, set up a key in results dicts for ModelResults class
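
The lost and gained statistics above amount to set differences between the input genes and the union of all clusters. The sketch below is not the package's code: `cluster_gene_stats` is a hypothetical helper, and the denominators (number of input genes for per_genes_lost, number of clustered genes for per_genes_gained) are assumptions.

```python
def cluster_gene_stats(input_genes, clusters):
    """Summarize which input genes were lost and which genes were gained."""
    inputs = set(input_genes)
    clustered = set().union(*clusters) if clusters else set()
    lost = sorted(inputs - clustered)    # input genes in no cluster
    gained = sorted(clustered - inputs)  # cluster genes not in the input
    return {
        "num_genes_lost": len(lost),
        "per_genes_lost": 100 * len(lost) / len(inputs) if inputs else 0.0,
        "num_genes_gained": len(gained),
        "per_genes_gained": 100 * len(gained) / len(clustered) if clustered else 0.0,
        "genes_lost_clustered": lost,
        "genes_gained_clustered": gained,
    }


# Toy Entrez-like IDs: gene "4" is lost, gene "9" is gained.
stats = cluster_gene_stats(["1", "2", "3", "4"], [{"1", "2"}, {"3", "9"}])
```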

fit(logreg_kwargs={'C': 1.0, 'max_iter': 10000, 'penalty': 'l2', 'solver': 'lbfgs'}, scale=False, min_num_pos=15, min_num_pos_cv=15, num_folds=3, null_val=None, random_state=0, cross_validate=True)[source]

Fit the model.

Parameters:

logreg_kwargs (Dict[str, Any] | None) – Scikit-learn logistic regression settings (see LogisticRegression).
scale (bool) – Whether to scale the data when doing model training and prediction. It is not recommended to set to True unless using custom data.
min_num_pos (int) – Minimum number of positives required for the model to be trained.
min_num_pos_cv (int) – Minimum number of positives required for performing cross validation evaluation.
num_folds (int) – Number of cross validation folds.
null_val (float | None) – Null values to fill if cross validation was not able to be performed.
random_state (int | None) – Random state for reproducible shuffling stratified cross validation. Set to None for random.
cross_validate (bool) – Whether or not to perform cross validation to evaluate the prediction performance on the gene set. If set to False, then skip cross validation and return null_val as cv scores.
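
How the thresholds above could interact is sketched below. This is an illustration only: `plan_fit` is a hypothetical helper, what fit actually does when there are fewer than min_num_pos positives is not stated here, and filling avgps with one null_val per fold is an assumption.

```python
def plan_fit(num_pos, min_num_pos=15, min_num_pos_cv=15, num_folds=3,
             null_val=None, cross_validate=True):
    """Decide whether to train and whether to cross validate (illustrative)."""
    null_scores = [null_val] * num_folds  # assumed placeholder for skipped CV
    if num_pos < min_num_pos:
        # Too few positives: no model is trained in this sketch.
        return {"train": False, "run_cv": False, "avgps": null_scores}
    run_cv = cross_validate and num_pos >= min_num_pos_cv
    # avgps stays None here when CV will run; real scores would replace it.
    return {"train": True, "run_cv": run_cv, "avgps": None if run_cv else null_scores}


plan = plan_fit(20, cross_validate=False, null_val=-10)
```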

The following class attributes are set when fit is run:

GenePlexus.min_num_pos (int)

Minimum number of positives needed to train a model.

GenePlexus.logreg_kwargs (dict)

Keyword arguments passed to LogisticRegression.

GenePlexus.scale (bool)

Whether or not scaling of the data was done in LogisticRegression.

GenePlexus.min_num_pos_cv (int)

The minimum number of positive genes needed for doing cross validation.

GenePlexus.num_folds (int)

Number of cross validation folds to do

GenePlexus.null_val (None, str, int, float)

Value to fill in for avgps if cross validation couldn’t be performed

GenePlexus.random_state (None, int)

Seed set for doing cross validation

GenePlexus.cross_validate (bool)

Whether or not to perform cross validation

GenePlexus.model_info[ModelName].pos_genes_in_net (1D array of str)

Input gene Entrez IDs that are present in the network.

GenePlexus.model_info[ModelName].genes_not_in_net (1D array of str)

Input gene Entrez IDs that are absent in the network.

GenePlexus.model_info[ModelName].net_genes (1D array of str)

All genes in the network.

GenePlexus.model_info[ModelName].negative_genes (1D array of str)

Negative gene Entrez IDs derived using the input genes and the background gene set collection (gsc_trn).

GenePlexus.model_info[ModelName].neutral_gene_info (Dict of Dicts)

Dictionary saying which genes were set to neutrals because the term annotation matched closely enough to the positive training genes.

{
  "{Term ID}" # ID of the matched term : {
     "Name"  : # returns string of term name
     "Genes" : # returns list of genes annotated to term
     "Task"  : # returns type of GSC the term is from
     }
  "All Neutrals" : # returns list of all genes considered neutral
}

GenePlexus.model_info[ModelName].mdl_weights (1D array of floats)

Trained model parameters.

GenePlexus.model_info[ModelName].clf (LogisticRegression)

The fit classifier from the scikit-learn LogisticRegression class.

GenePlexus.model_info[ModelName].avgps (1D array of floats)

Cross validation results. Performance is measured using log2(auprc/prior).

GenePlexus.model_info[ModelName].std_scale (StandardScaler)

If scaling was performed, the fitted StandardScaler object.

GenePlexus.model_info[ModelName].df_convert_out_for_model (DataFrame)

A table specific to input_genes for each model with the following 4 columns:

Original ID	User supplied Gene ID used to train the model
Entrez ID	Entrez Gene ID
Gene Name	Name Gene ID
In {Network}?	Y or N if the gene was found in the {Network} used to train the model
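
The avgps attribute above reports performance as log2(auprc/prior). The sketch below works through that quantity by hand, assuming auPRC is estimated by average precision and the prior is the fraction of positives; ties in the scores are not treated specially, and `log2_auprc_over_prior` is an illustrative name.

```python
import math


def log2_auprc_over_prior(labels, scores):
    """log2 of average precision divided by the fraction of positives."""
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    num_pos = sum(labels)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    average_precision = precision_sum / num_pos
    prior = num_pos / len(labels)
    return math.log2(average_precision / prior)


# Positives sit at ranks 1 and 3: AP = (1/1 + 2/3) / 2, prior = 2/4.
value = log2_auprc_over_prior([1, 0, 1, 0], [0.9, 0.8, 0.7, 0.1])
```

A value of 0 means the ranking is no better than the prior, and each additional unit doubles the ratio of auPRC to prior.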

Note

If scale is set to True, comparing the user-trained model to the models pre-trained on known gene sets becomes less straightforward, as those models are trained without any scaling.

load_genes(input_genes)[source]

Load gene list and convert to Entrez.

Parameters: input_genes (List[str]) – Input gene list, can be mixed type.

The following class attributes are set when load_genes is run:

GenePlexus.input_genes (List[str])

Input genes converted to uppercase

GenePlexus.df_convert_out (DataFrame)

A table with the following 6 columns:

Original ID	User supplied Gene ID
Entrez ID	Entrez Gene ID
Gene Name	Name Gene ID
In BioGRID?	Y or N if the gene was found in the BioGRID network or not
In IMP?	Y or N if the gene was found in the IMP network or not
In STRING?	Y or N if the gene was found in the STRING network or not

GenePlexus.table_summary (List[Dict[str, int]])

List of network stats summary dictionaries. Each dictionary has the following structure:

{
  "Network" : # returns name of the network
  "NetworkGenes"  : # returns number of genes in the network
  "PositiveGenes" : # returns number of input genes found in the network
}

GenePlexus.convert_ids (List[str])

Converted gene list.

GenePlexus.input_count (int)

Number of input genes that were able to be converted.

See also

Use geneplexus.util.read_gene_list() to load a gene list from a file.
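
The table_summary and input_count attributes can be combined to see how well each network covers the input genes. The rows below are toy values shaped like table_summary, not real counts, and `network_coverage` is a hypothetical helper.

```python
# Toy rows shaped like GenePlexus.table_summary; the counts are made up.
table_summary = [
    {"Network": "BioGRID", "NetworkGenes": 1000, "PositiveGenes": 40},
    {"Network": "IMP", "NetworkGenes": 1200, "PositiveGenes": 45},
    {"Network": "STRING", "NetworkGenes": 1100, "PositiveGenes": 48},
]
input_count = 50  # toy stand-in for GenePlexus.input_count


def network_coverage(summary, num_converted):
    """Fraction of converted input genes found in each network."""
    return {row["Network"]: row["PositiveGenes"] / num_converted for row in summary}


coverage = network_coverage(table_summary, input_count)
best_network = max(coverage, key=coverage.get)  # network covering the most input genes
```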

make_sim_dfs()[source]

Compute similarities between the input genes and GO, Monarch and/or Mondo.

The following class attributes are set when make_sim_dfs is run:

GenePlexus.model_info[ModelName].results[ResultName].df_sim (DataFrame)

A table showing how similar the coefficients of the user trained models are to the coefficients of models trained using genes annotated to gsc_res. The table has the following 7 columns:

Task	Which type of GSC the term is from
ID	Term ID
Name	Term Name
Similarity	Cosine similarity between the coefficients of the two models
Z-score	The z-score of the similarities
P-adjusted	The Bonferroni adjusted p-values from the z-scores
Rank	The rank of the term with one being the term with the highest similarity to the user model
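
The Similarity and Rank columns can be illustrated with plain coefficient vectors. This is a sketch under assumptions: the weight vectors and the term names "term_a" to "term_c" are made up, and `rank_terms` is a hypothetical helper rather than the package's implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two coefficient vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_terms(user_weights, term_weights):
    """Rank terms by similarity to the user model; rank 1 is the most similar."""
    sims = {term: cosine_similarity(user_weights, w) for term, w in term_weights.items()}
    ordered = sorted(sims, key=sims.get, reverse=True)
    return [(rank, term, sims[term]) for rank, term in enumerate(ordered, start=1)]


ranking = rank_terms(
    [1.0, 0.0],
    {"term_a": [2.0, 0.0], "term_b": [0.0, 3.0], "term_c": [1.0, 1.0]},
)
```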

make_small_edgelist(num_nodes=50)[source]

Make a subgraph induced by the top predicted genes.

Parameters: num_nodes (int) – Number of top genes to include.

The following class attributes are set when make_small_edgelist is run:

GenePlexus.num_nodes (int): The number of nodes to include in the edgelist.
GenePlexus.model_info[ModelName].results[ResultName].df_edge (DataFrame): Table of edge list corresponding to the subgraph induced by the top predicted genes (in Entrez gene ID).
GenePlexus.model_info[ModelName].results[ResultName].isolated_genes (List[str]): List of top predicted genes (in Entrez gene ID) that are isolated from other top predicted genes in the network.
GenePlexus.model_info[ModelName].results[ResultName].df_edge_sym (DataFrame): Table of edge list corresponding to the subgraph induced by the top predicted genes (in gene symbol).
GenePlexus.model_info[ModelName].results[ResultName].isolated_genes_sym (List[str]): List of top predicted genes (in gene symbol) that are isolated from other top predicted genes in the network.
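
An induced subgraph keeps only the edges whose two endpoints are both among the top genes; top genes left without any such edge are the isolated ones. The sketch below shows the idea on a toy weighted edge list; `induced_edgelist` is an illustrative helper, not the package's code.

```python
def induced_edgelist(edges, top_genes):
    """Return the edges among top_genes and the top genes left without an edge."""
    keep = set(top_genes)
    sub_edges = [(u, v, w) for u, v, w in edges if u in keep and v in keep]
    connected = {gene for u, v, _ in sub_edges for gene in (u, v)}
    isolated = [gene for gene in top_genes if gene not in connected]
    return sub_edges, isolated


# Toy network: gene "3" is not a top gene, so "4" loses its only edge.
edges = [("1", "2", 0.9), ("2", "3", 0.4), ("3", "4", 0.7)]
sub_edges, isolated = induced_edgelist(edges, ["1", "2", "4"])
```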

predict()[source]

Predict gene scores from fit model.

The following class attributes are set when predict is run:

GenePlexus.model_info[ModelName].results[ResultName].df_probs (DataFrame)

A table with the following 9 columns:

Entrez	Entrez Gene ID
Symbol	Symbol Gene ID
Name	Name Gene ID
Known/Novel	Known if the gene was in the positive set, otherwise Novel
Class-Label	P (positive in training), N (negative during training), U (unused during training)
Probability	The probabilities returned from the logistic regression model
Z-score	The z-score of the model probabilities for all predicted genes
P-adjusted	The Bonferroni adjusted p-values from the z-scores
Rank	The rank of the gene with one being the gene with the highest predicted value

Note

For the Known/Novel and Class-Label columns, if the training species is different than the results species, this information is obtained by looking at the one-to-one orthologs between the species.

Note

Due to the high complexity of the embedding space and the wide variety of positive and negative genes determined for each model, the resulting probabilities may not be well calibrated. However, the resulting rankings are very meaningful as evaluated with log2(auPRC/prior).
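
The Z-score, P-adjusted and Rank columns of df_probs can be reproduced in outline from a list of probabilities. The sketch assumes a population standard deviation, a one-sided upper-tail normal p-value, and scores that are not all equal; the package may compute these differently, and `score_table` is an illustrative name.

```python
import math


def score_table(probabilities):
    """Attach z-score, Bonferroni-adjusted p-value and rank to each probability."""
    n = len(probabilities)
    mean = sum(probabilities) / n
    std = math.sqrt(sum((p - mean) ** 2 for p in probabilities) / n)
    rows = []
    for prob in probabilities:
        z = (prob - mean) / std
        p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of a standard normal
        rows.append({
            "Probability": prob,
            "Z-score": z,
            "P-adjusted": min(1.0, p_value * n),  # Bonferroni: multiply by number of tests
        })
    rows.sort(key=lambda row: row["Probability"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["Rank"] = rank  # rank 1 is the highest probability
    return rows


rows = score_table([0.9, 0.5, 0.1])
```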

remove_log_file()[source]: Remove the temporary log file. Only call this at the end of the script.

save_class(output_dir=None, save_type='all', zip_output=False, overwrite=False)[source]

Save all or parts of the GenePlexus class and results.

Parameters:

output_dir (str | None) – Path to save the files to. If None, will try ~/.data/geneplexus_outputs/results.
save_type (Literal['all', 'results_only']) – Which file saving method to use.
zip_output (bool) – Whether or not to compress all the results into one zip file.
overwrite (bool) – Whether to overwrite data or make a new directory with an incremented index.
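
The overwrite behavior can be pictured as choosing between the requested directory and the next free numbered sibling. The naming scheme below (`results_1`, `results_2`, ...) is an assumption made for illustration, and `next_output_dir` is a hypothetical helper; the package's actual directory names may differ.

```python
import tempfile
from pathlib import Path


def next_output_dir(base, overwrite=False):
    """Return base itself, or the first free '<name>_<index>' sibling."""
    base = Path(base)
    if overwrite or not base.exists():
        return base
    index = 1
    # Assumed naming: append an incrementing index until the path is unused.
    while (candidate := base.with_name(f"{base.name}_{index}")).exists():
        index += 1
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    results = Path(tmp) / "results"
    first = next_output_dir(results)    # nothing exists yet
    first.mkdir()
    second = next_output_dir(results)   # base is taken, so an index is added
    same = next_output_dir(results, overwrite=True)
    names = (first.name, second.name, same.name)
```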