geneplexus.geneplexus

`load_genes`(input_genes)	Load gene list, convert to Entrez, and set up positives/negatives.
`fit_and_predict`([logreg_kwargs, ...])	Fit a model and predict gene scores.
`make_small_edgelist`([num_nodes])	Make a subgraph induced by the top predicted genes.
`make_sim_dfs`()	Compute similarities between the input genes and GO or DisGeNet.

class geneplexus.GenePlexus(file_loc=None, net_type='STRING', features='Embedding', gsc='GO', input_genes=None, auto_download=False, log_level='WARNING')[source]

The GenePlexus API class.

Initialize the GenePlexus object.

Parameters:

file_loc (str | None) – Location of data files. If not specified, the default data path ~/.data/geneplexus is used.
net_type (Literal['BioGRID', 'STRING', 'STRING-EXP', 'GIANT-TN']) – Type of network to use.
features (Literal['Adjacency', 'Embedding', 'Influence']) – Type of features of the network to use.
gsc (Literal['GO', 'DisGeNet']) – Type of gene set collection to use for generating negatives.
input_genes (List[str] | None) – Input gene list, can be mixed type. If not specified at init time, it can be set later by calling load_genes() (default: None).
auto_download (bool) – Automatically download necessary files if set.
log_level (Literal['CRITICAL', 'ERROR', 'WARNING', 'INFO', 'DEBUG']) – Logging level.

_convert_to_entrez()[source]

Convert the loaded genes to Entrez.

GenePlexus.df_convert_out (DataFrame): A table where the first column contains the original gene IDs, the second column contains the corresponding converted Entrez gene IDs. The rest of the columns are indicators of whether a given gene is present in any one of the networks.
GenePlexus.table_summary (List[Dict[str, int]]): List of network stats summary dictionaries. Each dictionary has three keys: Network, NetworkGenes, and PositiveGenes (the number of genes in the intersection of the input genes and the network genes).
GenePlexus.input_count (int): Number of input genes.
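
The shape of table_summary can be illustrated with plain sets. This is a minimal sketch, not the library's implementation: the helper summarize_networks and the toy network contents are hypothetical, and only the three dictionary keys come from the description above.

```python
from typing import Dict, List, Set, Union


def summarize_networks(
    input_entrez: Set[str],
    network_genes: Dict[str, Set[str]],
) -> List[Dict[str, Union[str, int]]]:
    """Build one summary dictionary per network (hypothetical helper)."""
    summary = []
    for name, genes in network_genes.items():
        summary.append(
            {
                "Network": name,
                "NetworkGenes": len(genes),
                # Input genes that are also present in this network.
                "PositiveGenes": len(input_entrez & genes),
            }
        )
    return summary


# Toy data: the Entrez IDs and network contents are made up for illustration.
networks = {
    "STRING": {"1", "2", "3", "4"},
    "BioGRID": {"2", "4", "6"},
}
table_summary = summarize_networks({"2", "3", "9"}, networks)
```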

_get_pos_and_neg_genes()[source]

Set up positive and negative genes given the network.

GenePlexus.pos_genes_in_net (array of str): Array of input gene Entrez IDs that are present in the network.
GenePlexus.genes_not_in_net (array of str): Array of input gene Entrez IDs that are absent in the network.
GenePlexus.net_genes (array of str): Array of network gene Entrez IDs.
GenePlexus.negative_genes (array of str): Array of negative gene Entrez IDs derived using the input genes and the background gene set collection (GSC).
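
The first two attributes amount to splitting the input genes by network membership. The sketch below shows that split with a hypothetical helper, split_by_network, and made-up Entrez IDs. It does not cover how negative_genes is derived, because the selection rule is not spelled out here.

```python
from typing import List, Sequence, Tuple


def split_by_network(
    input_entrez: Sequence[str],
    net_genes: Sequence[str],
) -> Tuple[List[str], List[str]]:
    """Split input genes into those present in and absent from the network."""
    net_set = set(net_genes)
    pos_genes_in_net = [g for g in input_entrez if g in net_set]
    genes_not_in_net = [g for g in input_entrez if g not in net_set]
    return pos_genes_in_net, genes_not_in_net


# Toy example with made-up Entrez IDs.
pos_genes_in_net, genes_not_in_net = split_by_network(
    ["5", "7", "11"], ["1", "5", "11", "20"]
)
```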

_load_genes(input_genes)[source]

Load gene list into the GenePlexus object.

Note

Implicitly converts genes to upper case.

Parameters: input_genes (List[str])
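
The upper-case conversion mentioned in the note can be pictured as follows. This is a sketch of the effect only: normalize_gene_ids is a hypothetical helper, and the mixed-type identifiers are illustrative.

```python
from typing import List


def normalize_gene_ids(input_genes: List[str]) -> List[str]:
    """Upper-case every gene identifier (hypothetical helper)."""
    return [gene.upper() for gene in input_genes]


# Mixed identifier types: symbols, an Entrez ID, and an Ensembl-style ID.
mixed = ["brca1", "Tp53", "672", "ensg00000141510"]
normalized = normalize_gene_ids(mixed)
```

Numeric Entrez IDs pass through unchanged, so only symbol-like and Ensembl-like identifiers are affected.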

alter_validation_df()[source]

Make table about presence of input genes in the network.

df_convert_out_subset positive_genes

check_custom()[source]

Check custom network and gsc options.

The following files are required:

* Data_{features}_{net_type}.npy
* GSC_{gsc}_{net_type}_GoodSets.json
* GSC_{gsc}_{net_type}_universe.txt
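
The file name templates can be expanded for a given configuration to see which files a custom setup would need. In this sketch the helpers required_custom_files and missing_custom_files are hypothetical, and it assumes the last template ends in ".txt". It is not the check that check_custom() itself performs.

```python
from pathlib import Path
from typing import List


def required_custom_files(
    file_loc: str, features: str, net_type: str, gsc: str
) -> List[Path]:
    """Expand the three file name templates for one configuration."""
    names = [
        f"Data_{features}_{net_type}.npy",
        f"GSC_{gsc}_{net_type}_GoodSets.json",
        f"GSC_{gsc}_{net_type}_universe.txt",  # assumed ".txt" extension
    ]
    return [Path(file_loc) / name for name in names]


def missing_custom_files(
    file_loc: str, features: str, net_type: str, gsc: str
) -> List[Path]:
    """Return the required files that do not exist on disk."""
    required = required_custom_files(file_loc, features, net_type, gsc)
    return [path for path in required if not path.exists()]


# "MyNet" and "MyGSC" are placeholder names for a custom network and GSC.
expected = required_custom_files("data", "Embedding", "MyNet", "MyGSC")
```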

dump_config(outdir)[source]

Save parameters configuration to a config file.

Parameters: outdir (str)

property features: Literal['Adjacency', 'Embedding', 'Influence']: Features to use.

property file_loc: str

File location.

Use default data location ~/.data/geneplexus if not set.
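
The fallback behavior can be sketched as below. The helper resolve_file_loc is hypothetical, and expanding "~" to the user's home directory is an assumption about how the default path is interpreted.

```python
import os.path
from typing import Optional

DEFAULT_DATA_PATH = "~/.data/geneplexus"


def resolve_file_loc(file_loc: Optional[str] = None) -> str:
    """Return file_loc, falling back to the default data path if unset."""
    if file_loc is None:
        file_loc = DEFAULT_DATA_PATH
    # Assumption: "~" is expanded to the current user's home directory.
    return os.path.expanduser(file_loc)


default_location = resolve_file_loc()
custom_location = resolve_file_loc("/data/gp")
```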

fit_and_predict(logreg_kwargs=None, min_num_pos=15, num_folds=3, null_val=-10, random_state=0, cross_validate=True)[source]

Fit a model and predict gene scores.

Parameters:

logreg_kwargs (Dict[str, Any] | None) – Scikit-learn logistic regression settings (see LogisticRegression). If not set, then use the default logistic regression settings (l2 penalty, 10,000 max iterations, lbfgs solver).
min_num_pos (int) – Minimum number of positives required for performing cross validation evaluation.
num_folds (int) – Number of cross validation folds.
null_val (float) – Value used to fill the cross validation scores if cross validation could not be performed.
random_state (int | None) – Random state for reproducible shuffling in stratified cross validation. Set to None for random.
cross_validate (bool) – Whether or not to perform cross validation to evaluate the prediction performance on the gene set. If set to False, then skip cross validation and return null_val as cv scores.
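
How min_num_pos, cross_validate, num_folds, and null_val interact can be sketched as follows. Both helpers are hypothetical. Returning one null_val per fold is an assumption about the shape of the placeholder scores, not documented behavior.

```python
from typing import List


def should_cross_validate(
    num_pos: int, min_num_pos: int = 15, cross_validate: bool = True
) -> bool:
    """Decide whether cross validation would be run."""
    return cross_validate and num_pos >= min_num_pos


def placeholder_scores(num_folds: int = 3, null_val: float = -10) -> List[float]:
    """Scores reported when cross validation is skipped (assumed shape)."""
    return [null_val] * num_folds


# With 14 positives and the default threshold of 15, evaluation is skipped.
if should_cross_validate(num_pos=14):
    scores = None  # a real evaluation would go here
else:
    scores = placeholder_scores()
```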

GenePlexus.mdl_weights (array of float): Trained model parameters.
GenePlexus.probs (array of float): Genome-wide gene prediction scores. A high value indicates the relevance of the gene to the input gene list.
GenePlexus.avgps (array of float): Cross validation results. Performance is measured using log2(auprc/prior).
GenePlexus.df_probs (DataFrame): A table with 7 columns: Entrez (the gene Entrez ID), Symbol (the gene Symbol), Name (the gene Name), Probability (the probability of a gene being part of the input gene list), Known/Novel (whether the gene is in the input gene list), Class-Label (positive, negative, or neutral), Rank (rank of relevance of the gene to the input gene list).
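
The log2(auprc/prior) measure can be worked through on a toy ranking. In this sketch auprc is approximated by average precision, which is an assumption about how the area is computed. The helper names are hypothetical and ties between scores are not handled specially.

```python
import math
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Mean of the precision values at the rank of each positive."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        if labels[idx] == 1:
            hits += 1
            precision_sum += hits / rank
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return precision_sum / hits


def log2_auprc_over_prior(labels: Sequence[int], scores: Sequence[float]) -> float:
    """log2 of average precision divided by the fraction of positives."""
    prior = sum(labels) / len(labels)
    return math.log2(average_precision(labels, scores) / prior)


# Perfect ranking with half the genes positive: log2(1.0 / 0.5) = 1.0
perfect = log2_auprc_over_prior([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1])
```

A value of 0 means the ranking is no better than the prior, and each additional unit corresponds to a two-fold improvement over it.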

property gsc: Literal['GO', 'DisGeNet']: Gene set collection.

load_genes(input_genes)[source]

Load gene list, convert to Entrez, and set up positives/negatives.

GenePlexus.input_genes (List[str]): Input gene list.

Parameters: input_genes (List[str]) – Input gene list, can be mixed type.
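
Putting the documented pieces together, a typical session might look like the sketch below. It assumes the package is installed and importable as geneplexus, and that the data files can be downloaded. The wrapper run_geneplexus is hypothetical. Results are read from the attributes listed above rather than from return values.

```python
from typing import List, Optional


def run_geneplexus(input_genes: List[str], file_loc: Optional[str] = None):
    """Load genes, fit the model, and return the prediction table."""
    # Imported here so the sketch can be defined without the package installed.
    from geneplexus import GenePlexus

    gp = GenePlexus(
        file_loc=file_loc,
        net_type="STRING",
        features="Embedding",
        gsc="GO",
        auto_download=True,  # fetch the necessary data files if absent
        log_level="INFO",
    )
    gp.load_genes(input_genes)  # converts to Entrez, sets positives/negatives
    gp.fit_and_predict()  # default logistic regression settings
    return gp.df_probs
```

Calling run_geneplexus with a list of gene identifiers would return the table with the Entrez, Symbol, Name, Probability, Known/Novel, Class-Label, and Rank columns.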