geneplexus.geneplexus
Load gene list, convert to Entrez, and set up positives/negatives.
Fit a model and predict gene scores.
Make a subgraph induced by the top predicted genes.
Compute similarities between the input genes and GO or DisGeNet.
- class geneplexus.GenePlexus(file_loc=None, net_type='STRING', features='Embedding', gsc='GO', input_genes=None, auto_download=False, log_level='WARNING')[source]
The GenePlexus API class.
Initialize the GenePlexus object.
- Parameters:
file_loc (str | None) – Location of data files. If not specified, set to the default data path ~/.data/geneplexus.
net_type (Literal['BioGRID', 'STRING', 'STRING-EXP', 'GIANT-TN']) – Type of network to use.
features (Literal['Adjacency', 'Embedding', 'Influence']) – Type of features of the network to use.
gsc (Literal['GO', 'DisGeNet']) – Type of gene set collection to use for generating negatives.
input_genes (List[str] | None) – Input gene list, can be mixed type. If not specified at init time, it can be set later by calling load_genes() (default: None).
auto_download (bool) – Automatically download necessary files if set.
log_level (Literal['CRITICAL', 'ERROR', 'WARNING', 'INFO', 'DEBUG']) – Logging level.
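A minimal end-to-end sketch of the workflow, using only the constructor arguments, methods, and attributes documented on this page. The wrapper function run_geneplexus is hypothetical and not part of the package. It assumes that geneplexus is installed and that the data files can be downloaded.

```python
def run_geneplexus(input_genes, file_loc=None):
    """Run the documented GenePlexus steps on a list of gene IDs.

    `run_geneplexus` is an illustrative wrapper, not part of the package.
    """
    # Imported lazily so that defining this function does not require the package.
    from geneplexus import GenePlexus

    gp = GenePlexus(
        file_loc=file_loc,   # None falls back to the default data path
        net_type="STRING",
        features="Embedding",
        gsc="GO",
        auto_download=True,  # fetch missing data files
        log_level="INFO",
    )
    gp.load_genes(input_genes)            # convert to Entrez, set up positives/negatives
    gp.fit_and_predict()                  # train the model and score genes
    gp.make_sim_dfs()                     # compare against GO and DisGeNet models
    gp.make_small_edgelist(num_nodes=50)  # subgraph of the top predicted genes
    return gp.df_probs, gp.df_sim_GO, gp.df_sim_Dis, gp.df_edge
```

The results are read from the attributes that each method sets, as listed under the individual methods.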
- _convert_to_entrez()[source]
Convert the loaded genes to Entrez.
GenePlexus.df_convert_out (DataFrame) – A table where the first column contains the original gene IDs and the second column contains the corresponding converted Entrez gene IDs. The remaining columns indicate whether a given gene is present in each network.
GenePlexus.table_summary (List[Dict[str, int]]) – List of network stats summary dictionaries. Each dictionary has three keys: Network, NetworkGenes, and PositiveGenes (the number of input genes that are also network genes).
GenePlexus.input_count (int) – Number of input genes.
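The shape of the summary dictionaries can be illustrated with a small sketch. The function summarize_networks and the toy gene IDs are hypothetical. Only the three key names come from the description above.

```python
def summarize_networks(input_entrez, network_genes):
    """Build one summary dictionary per network.

    `network_genes` maps a network name to the collection of its gene IDs.
    """
    inputs = set(input_entrez)
    summary = []
    for name, genes in network_genes.items():
        genes = set(genes)
        summary.append({
            "Network": name,
            "NetworkGenes": len(genes),            # size of the network
            "PositiveGenes": len(inputs & genes),  # input genes found in it
        })
    return summary


# Toy data, not real network contents.
toy_networks = {"STRING": ["1", "2", "3", "4"], "BioGRID": ["2", "5"]}
table_summary = summarize_networks(["2", "3", "9"], toy_networks)
```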
- _get_pos_and_neg_genes()[source]
Set up positive and negative genes given the network.
GenePlexus.pos_genes_in_net (array of str) – Array of input gene Entrez IDs that are present in the network.
GenePlexus.genes_not_in_net (array of str) – Array of input gene Entrez IDs that are absent in the network.
GenePlexus.net_genes (array of str) – Array of network gene Entrez IDs.
GenePlexus.negative_genes (array of str) – Array of negative gene Entrez IDs derived using the input genes and the background gene set collection (GSC).
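The split of input genes into those present in and absent from the network amounts to a membership test. The sketch below uses a hypothetical helper, split_by_network, with toy IDs. It does not cover negative selection, because the rule for deriving negatives from the GSC is not stated here.

```python
def split_by_network(input_entrez, net_genes):
    """Partition input genes by whether they occur in the network."""
    net = set(net_genes)
    pos_genes_in_net = [g for g in input_entrez if g in net]
    genes_not_in_net = [g for g in input_entrez if g not in net]
    return pos_genes_in_net, genes_not_in_net


# Toy Entrez-like IDs, for illustration only.
pos, missing = split_by_network(["7157", "672", "99999"], ["672", "7157", "1956"])
```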
- _load_genes(input_genes)[source]
Load gene list into the GenePlexus object.
Note
Implicitly converts genes to upper case.
- Parameters:
input_genes (List[str]) –
- alter_validation_df()[source]
Make table about presence of input genes in the network.
df_convert_out_subset
positive_genes
- check_custom()[source]
Check custom network and gsc options.
The following files are required:
* Data_{features}_{net_type}.npy
* GSC_{gsc}_{net_type}_GoodSets.json
* GSC_{gsc}_{net_type}_universe.txt
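A check of this kind can be sketched as follows. The helper missing_custom_files and the network name "Custom" are hypothetical, and the sketch assumes that the last file name ends in _universe.txt. The package's own check may differ.

```python
import os
import tempfile


def missing_custom_files(file_loc, net_type, features, gsc):
    """Return the required file names that are not present in `file_loc`."""
    required = [
        f"Data_{features}_{net_type}.npy",
        f"GSC_{gsc}_{net_type}_GoodSets.json",
        f"GSC_{gsc}_{net_type}_universe.txt",  # assumed ".txt" extension
    ]
    return [
        name for name in required
        if not os.path.isfile(os.path.join(file_loc, name))
    ]


# Demo in a temporary directory that holds only the feature file.
with tempfile.TemporaryDirectory() as tmp:
    open(os.path.join(tmp, "Data_Embedding_Custom.npy"), "w").close()
    missing = missing_custom_files(tmp, "Custom", "Embedding", "GO")
```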
- dump_config(outdir)[source]
Save parameters configuration to a config file.
- Parameters:
outdir (str) –
- property features: Literal['Adjacency', 'Embedding', 'Influence']
Features to use.
- property file_loc: str
File location.
Use default data location ~/.data/geneplexus if not set.
- fit_and_predict(logreg_kwargs=None, min_num_pos=15, num_folds=3, null_val=-10, random_state=0, cross_validate=True)[source]
Fit a model and predict gene scores.
- Parameters:
logreg_kwargs (Dict[str, Any] | None) – Scikit-learn logistic regression settings (see LogisticRegression). If not set, then use the default logistic regression settings (l2 penalty, 10,000 max iterations, lbfgs solver).
min_num_pos (int) – Minimum number of positives required for performing cross validation evaluation.
num_folds (int) – Number of cross validation folds.
null_val (float) – Null value to fill in when cross validation could not be performed.
random_state (int | None) – Random state for reproducible shuffling stratified cross validation. Set to None for random.
cross_validate (bool) – Whether or not to perform cross validation to evaluate the prediction performance on the gene set. If set to False, then skip cross validation and return null_val as cv scores.
GenePlexus.mdl_weights (array of float) – Trained model parameters.
GenePlexus.probs (array of float) – Genome-wide gene prediction scores. A high value indicates the relevance of the gene to the input gene list.
GenePlexus.avgps (array of float) – Cross validation results. Performance is measured using log2(auprc/prior).
GenePlexus.df_probs (DataFrame) – A table with 7 columns: Entrez (the gene Entrez ID), Symbol (the gene Symbol), Name (the gene Name), Probability (the probability of a gene being part of the input gene list), Known/Novel (whether the gene is in the input gene list), Class-Label (positive, negative, or neutral), Rank (rank of relevance of the gene to the input gene list).
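The log2(auprc/prior) metric can be worked through on toy data. The sketch below assumes that auprc is estimated by average precision and that the prior is the fraction of positive labels. Neither detail is confirmed here, and the function name is hypothetical.

```python
import math


def log2_auprc_over_prior(labels, scores):
    """Return log2(average precision / prior) for binary labels (1 = positive)."""
    # Rank examples from highest to lowest score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    num_pos = sum(labels)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this positive
    auprc = precision_sum / num_pos
    prior = num_pos / len(labels)  # expected precision of a random ranking
    return math.log2(auprc / prior)


# Toy labels and scores: positives land at ranks 1 and 3.
labels = [1, 0, 1, 0, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.7, 0.3, 0.2, 0.2, 0.1, 0.05]
value = log2_auprc_over_prior(labels, scores)
```

Under these assumptions a value of 0 corresponds to a ranking no better than random, and each additional unit corresponds to a twofold improvement over the prior.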
- property gsc: Literal['GO', 'DisGeNet']
Geneset collection.
- load_genes(input_genes)[source]
Load gene list, convert to Entrez, and set up positives/negatives.
GenePlexus.input_genes (List[str]) – Input gene list.
- Parameters:
input_genes (List[str]) – Input gene list, can be mixed type.
See also
Use geneplexus.util.read_gene_list() to load a gene list from a file.
- make_sim_dfs()[source]
Compute similarities between the input genes and GO or DisGeNet.
The similarities are computed based on the model trained on the input gene set and models pre-trained on known GO and DisGeNet gene sets.
GenePlexus.df_sim_GO (DataFrame) – A table with 4 columns: ID (the GO term ID), Name (name of the GO term), Similarity (similarity between the input model and a model trained on the GO term gene set), Rank (rank of similarity between the input model and a model trained on the GO term gene set).
GenePlexus.df_sim_Dis (DataFrame) – A table with 4 columns: ID (the DO term ID), Name (name of the DO term), Similarity (similarity between the input model and a model trained on the DO term gene set), Rank (rank of similarity between the input model and a model trained on the DO term gene set).
GenePlexus.weights_GO – Dictionary of pretrained model weights for GO. A key is a GO term, and the value is a dictionary with three keys: Name (name of the GO term), Weights (pretrained model weights), PosGenes (positive genes for this GO term).
GenePlexus.weights_Dis – Dictionary of pretrained model weights for DisGeNet. A key is a DO term, and the value is a dictionary with three keys: Name (name of the DO term), Weights (pretrained model weights), PosGenes (positive genes for this DO term).
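One way to compare a trained weight vector against a dictionary of pretrained weights is sketched below. Cosine similarity is an assumption, since the measure used is not named here. The helper names, term IDs, and weights are toy placeholders. Only the dictionary keys (Name, Weights, PosGenes) and the output columns (ID, Name, Similarity, Rank) follow the descriptions above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def rank_terms(input_weights, pretrained):
    """Score every pretrained term against the input model and rank them."""
    rows = [
        {
            "ID": term_id,
            "Name": entry["Name"],
            "Similarity": cosine_similarity(input_weights, entry["Weights"]),
        }
        for term_id, entry in pretrained.items()
    ]
    rows.sort(key=lambda row: row["Similarity"], reverse=True)
    for rank, row in enumerate(rows, start=1):
        row["Rank"] = rank  # 1 = most similar
    return rows


# Toy pretrained weights, not real GO terms.
toy_weights = {
    "toy:1": {"Name": "toy term A", "Weights": [1.0, 0.0], "PosGenes": ["1"]},
    "toy:2": {"Name": "toy term B", "Weights": [1.0, 1.0], "PosGenes": ["2"]},
}
rows = rank_terms([2.0, 2.0], toy_weights)
```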
- make_small_edgelist(num_nodes=50)[source]
Make a subgraph induced by the top predicted genes.
GenePlexus.df_edge (DataFrame) – Table of edge list corresponding to the subgraph induced by the top predicted genes (in Entrez gene ID).
GenePlexus.isolated_genes (List[str]) – List of top predicted genes (in Entrez gene ID) that are isolated from other top predicted genes in the network.
GenePlexus.df_edge_sym (DataFrame) – Table of edge list corresponding to the subgraph induced by the top predicted genes (in gene symbol).
GenePlexus.isolated_genes_sym (List[str]) – List of top predicted genes (in gene symbol) that are isolated from other top predicted genes in the network.
- Parameters:
num_nodes (int) – Number of top genes to include.
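The notions of an induced subgraph and of isolated genes can be sketched on a toy edge list. The function induced_subgraph and the plain tuple representation are assumptions for illustration. The package stores its result as DataFrames, whose exact columns are not described here.

```python
def induced_subgraph(edges, top_genes):
    """Keep edges whose endpoints are both top genes; report isolated top genes."""
    top = set(top_genes)
    sub_edges = [(u, v) for u, v in edges if u in top and v in top]
    connected = {node for edge in sub_edges for node in edge}
    # A top gene is isolated if no retained edge touches it.
    isolated = [g for g in top_genes if g not in connected]
    return sub_edges, isolated


# Toy network: gene "5" is a top gene, but its only neighbor is not.
edges = [("1", "2"), ("2", "3"), ("3", "4"), ("5", "6")]
sub_edges, isolated = induced_subgraph(edges, ["1", "2", "5"])
```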
- property net_type: Literal['BioGRID', 'STRING', 'STRING-EXP', 'GIANT-TN']
Network to use.