geneplexus.geneplexus

load_genes(input_genes)

Load gene list and convert to Entrez.

load_negatives(input_negatives)

Load gene list and convert to Entrez that will be used as negatives.

fit_and_predict([logreg_kwargs, scale, ...])

Fit a model and predict gene scores.

make_sim_dfs()

Compute similarities between the input genes and GO, Monarch and/or Mondo.

make_small_edgelist([num_nodes])

Make a subgraph induced by the top predicted genes.

class geneplexus.GenePlexus(file_loc=None, net_type='STRING', features='SixSpeciesN2V', sp_trn='Human', sp_res='Human', gsc_trn='Combined', gsc_res='Combined', input_genes=None, input_negatives=None, auto_download=False, log_level='WARNING')

The GenePlexus API class.

Initialize the GenePlexus object.

Parameters:
  • file_loc (str | None) – Location of data files. If not specified, the default data path ~/.data/geneplexus is used.

  • net_type (Literal['BioGRID', 'STRING', 'IMP']) – Type of network to use

  • features (Literal['SixSpeciesN2V']) – Type of features of the network to use

  • sp_trn (Literal['Human', 'Mouse', 'Fly', 'Worm', 'Fish', 'Yeast']) – The species of the training data

  • sp_res (Literal['Human', 'Mouse', 'Fly', 'Worm', 'Fish', 'Yeast']) – The species the results are in

  • gsc_trn (Literal['GO', 'Monarch', 'Mondo', 'Combined']) – Gene set collection used during training

  • gsc_res (Literal['GO', 'Monarch', 'Mondo', 'Combined']) – Gene set collection used when generating results

  • input_genes (List[str] | None) – Input gene list; can be mixed type. If not specified at init time, it can be set later by calling load_genes().

  • input_negatives (List[str] | None) – Input list of negative genes; can be mixed type. If not specified at init time, it can be set later by calling load_negatives().

  • auto_download (bool) – Automatically download necessary files if set.

  • log_level (Literal['CRITICAL', 'ERROR', 'WARNING', 'INFO', 'DEBUG']) – Logging level.
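The constructor arguments above, together with the methods documented on this page, suggest a typical workflow. The following is a minimal sketch rather than an official example: it assumes the geneplexus package is installed and that the data files are either present in file_loc or can be fetched with auto_download=True. The helper name run_geneplexus and the keys of the returned dictionary are hypothetical; only names documented on this page are taken from the library.

```python
def run_geneplexus(input_genes, file_loc=None):
    """Run the documented GenePlexus steps and collect the result tables."""
    # Imported lazily so that defining this helper does not require the package.
    from geneplexus import GenePlexus

    gp = GenePlexus(
        file_loc=file_loc,
        net_type="STRING",
        features="SixSpeciesN2V",
        sp_trn="Human",
        sp_res="Human",
        gsc_trn="Combined",
        gsc_res="Combined",
        auto_download=True,  # assumption: downloading the data files is acceptable
    )
    gp.load_genes(input_genes)            # convert the input genes to Entrez
    gp.fit_and_predict()                  # sets df_probs, among other attributes
    gp.make_sim_dfs()                     # sets df_sim
    gp.make_small_edgelist(num_nodes=50)  # sets df_edge
    # Hypothetical return shape: the documented attributes gathered in one dict.
    return {"df_probs": gp.df_probs, "df_sim": gp.df_sim, "df_edge": gp.df_edge}
```

Input genes could equally be passed to the constructor through input_genes, in which case the explicit load_genes() call is not needed.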

_convert_to_entrez(genes_to_load)

Convert the loaded genes to Entrez and make objects showing exactly what was converted.

Parameters:

genes_to_load (List[str]) –

_get_pos_and_neg_genes(min_num_pos)

Set up positive and negative splits.

The following class attributes are set when this function is run

GenePlexus.pos_genes_in_net (1D array of str)

Input gene Entrez IDs that are present in the network.

GenePlexus.genes_not_in_net (1D array of str)

Input gene Entrez IDs that are absent in the network.

GenePlexus.net_genes (1D array of str)

All genes in the network.

GenePlexus.negative_genes (1D array of str)

Negative gene Entrez IDs derived using the input genes and the background gene set collection (gsc_trn).

GenePlexus.neutral_gene_info (Dict of Dicts)

Dictionary recording which genes were set as neutral because the term they are annotated to matched the positive training genes closely enough.

{
  "{Term ID}" : {  # ID of the matched term
     "Name"  : # returns string of term name
     "Genes" : # returns list of genes annotated to term
     "Task"  : # returns type of GSC the term is from
     }
  "All Neutrals" : # returns list of all genes considered neutral
}
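As an illustration of how this nested structure might be read, the sketch below walks a dictionary of the documented shape and skips the special "All Neutrals" key. The helper summarize_neutrals is hypothetical, and the term ID and gene IDs are placeholders, not real annotations.

```python
def summarize_neutrals(neutral_gene_info):
    """Return (term_id, name, task, n_genes) rows plus the total neutral count."""
    rows = []
    for term_id, entry in neutral_gene_info.items():
        if term_id == "All Neutrals":
            continue  # this key holds a flat gene list, not a per-term dict
        rows.append((term_id, entry["Name"], entry["Task"], len(entry["Genes"])))
    n_neutral = len(neutral_gene_info.get("All Neutrals", []))
    return rows, n_neutral


# Placeholder values for illustration only (not real annotations).
example = {
    "TERM:0000001": {"Name": "example term", "Genes": ["1", "2", "3"], "Task": "GO"},
    "All Neutrals": ["1", "2", "3"],
}
rows, n_neutral = summarize_neutrals(example)
print(rows, n_neutral)
```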
_load_genes(genes_to_load)

Load gene list into the GenePlexus object.

Note

Implicitly converts genes to upper case.

Parameters:

genes_to_load (List[str]) –

alter_validation_df()

Make a table showing which input genes are present in the network used during training.

The following class attributes are set when this function is run

df_convert_out_subset (DataFrame)

A table with the following 4 columns:

  • Original ID – User supplied Gene ID

  • Entrez ID – Entrez Gene ID

  • Gene Name – Name Gene ID

  • In {Network}? – Y or N if the gene was found in the {Network} used to train the model

positive_genes (List[str])

List of genes used as positives when training the model

dump_config(outdir)

Save parameters configuration to a config file, used with CLI.

Parameters:

outdir (str) –

property features: Literal['SixSpeciesN2V']

Features to use.

property file_loc: str

File location. Use default data location ~/.data/geneplexus if not set.

fit_and_predict(logreg_kwargs=None, scale=False, min_num_pos=5, min_num_pos_cv=15, num_folds=3, null_val=None, random_state=0, cross_validate=True)

Fit a model and predict gene scores.

Parameters:
  • logreg_kwargs (Dict[str, Any] | None) – Scikit-learn logistic regression settings (see LogisticRegression). If not set, then use the default logistic regression settings (l2 penalty, 10,000 max iterations, lbfgs solver).

  • scale (bool) – Whether to scale the data when doing model training and prediction. Setting this to True is not recommended unless using custom data.

  • min_num_pos (int) – Minimum number of positives required for the model to be trained.

  • min_num_pos_cv (int) – Minimum number of positives required for performing cross validation evaluation.

  • num_folds (int) – Number of cross validation folds.

  • null_val (float | None) – Null value to fill in if cross validation could not be performed.

  • random_state (int | None) – Random state for reproducible shuffling stratified cross validation. Set to None for random.

  • cross_validate (bool) – Whether or not to perform cross validation to evaluate the prediction performance on the gene set. If set to False, then skip cross validation and return null_val as cv scores.
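How min_num_pos, min_num_pos_cv, cross_validate and null_val interact can be sketched as a small decision function. This is an interpretation of the parameter descriptions above, not the library's code: the helper plan_training and its return keys are hypothetical, and two behaviors are assumptions (no model is trained below min_num_pos, and one null_val is reported per fold when cross validation is skipped).

```python
def plan_training(n_pos, min_num_pos=5, min_num_pos_cv=15, num_folds=3,
                  null_val=None, cross_validate=True):
    """Describe which steps would run for a given number of positive genes."""
    if n_pos < min_num_pos:
        # Assumption: too few positives means no model is trained at all.
        return {"train": False, "run_cv": False, "placeholder_scores": None}
    # Cross validation needs both the flag and enough positives.
    run_cv = cross_validate and n_pos >= min_num_pos_cv
    # Assumption: when cross validation is skipped, one null_val is kept per fold.
    placeholder_scores = None if run_cv else [null_val] * num_folds
    return {"train": True, "run_cv": run_cv, "placeholder_scores": placeholder_scores}


print(plan_training(10))  # enough positives to train, too few to cross validate
```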

The following class attributes are set when this function is run

GenePlexus.mdl_weights (1D array of floats)

Trained model parameters.

GenePlexus.df_probs (DataFrame)

A table with the following 9 columns:

  • Entrez – Entrez Gene ID

  • Symbol – Symbol Gene ID

  • Name – Name Gene ID

  • Known/Novel – Known if the gene was in the positive set, otherwise Novel

  • Class-Label – P (positive in training), N (negative during training), U (unused during training)

  • Probability – The probabilities returned from the logistic regression model

  • Z-score – The z-score of the model probabilities for all predicted genes

  • P-adjusted – The Bonferroni adjusted p-values from the z-scores

  • Rank – The rank of the gene, with one being the gene with the highest predicted value
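To make the relationship between the Probability, Z-score, P-adjusted and Rank columns concrete, here is a minimal sketch using only the standard library. It is not the library's implementation: the helper score_table is hypothetical, and it assumes a population standard deviation for the z-scores and a one-sided (upper tail) normal p-value before the Bonferroni correction.

```python
import math
import statistics


def score_table(probs):
    """Derive z-scores, Bonferroni adjusted p-values and ranks from probabilities."""
    n = len(probs)
    mean = statistics.fmean(probs)
    std = statistics.pstdev(probs)  # assumption: population standard deviation
    # Rank 1 goes to the highest probability; ties keep their input order.
    order = sorted(range(n), key=lambda i: probs[i], reverse=True)
    rank = {idx: r for r, idx in enumerate(order, start=1)}
    rows = []
    for i, p in enumerate(probs):
        z = (p - mean) / std if std > 0 else 0.0
        # Assumption: one-sided upper-tail p-value under a standard normal.
        p_value = 0.5 * math.erfc(z / math.sqrt(2))
        # Bonferroni: multiply by the number of tests and cap at 1.
        p_adjusted = min(1.0, p_value * n)
        rows.append({"Probability": p, "Z-score": z,
                     "P-adjusted": p_adjusted, "Rank": rank[i]})
    return rows


for row in score_table([0.9, 0.5, 0.1, 0.5]):
    print(row)
```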

Note

For the Known/Novel and Class-Label columns, if the training species is different than the results species, this information is obtained by looking at the one-to-one orthologs between the species.

Note

Due to the high complexity of the embedding space and the wide variety of positive and negative genes determined for each model, the resulting probabilities may not be well calibrated. However, the resulting rankings are very meaningful, as evaluated with log2(auPRC/prior).

Note

If scale is set to True, comparing the user trained model to the models pre-trained on known gene sets becomes less straightforward, as those models are trained without any scaling.

GenePlexus.avgps (1D array of floats)

Cross validation results. Performance is measured using log2(auPRC/prior).
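The metric can be illustrated with a small standard-library sketch. It assumes that auPRC is estimated by average precision (as the attribute name avgps suggests) and that the prior is the fraction of positive labels; both are assumptions, and the function names are hypothetical. A value of 0 corresponds to random ranking, and a value of 1 to a twofold improvement over the prior.

```python
import math


def average_precision(labels, scores):
    """Average precision of a ranking; labels are 1 (positive) or 0 (negative)."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    hits = 0
    total = 0.0
    for k, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            total += hits / k  # precision at the position of each positive
    if hits == 0:
        raise ValueError("at least one positive label is required")
    return total / hits


def log2_auprc_over_prior(labels, scores):
    """log2 of average precision divided by the fraction of positives."""
    prior = sum(labels) / len(labels)
    return math.log2(average_precision(labels, scores) / prior)


print(log2_auprc_over_prior([1, 1, 0, 0], [0.9, 0.8, 0.2, 0.1]))
```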

GenePlexus.probs (1D array of floats)

Genome-wide gene prediction scores. A high value indicates the relevance of the gene to the input gene list.

property gsc_res: Literal['GO', 'Monarch', 'Mondo', 'Combined']

Gene set collection used when generating results.

property gsc_trn: Literal['GO', 'Monarch', 'Mondo', 'Combined']

Gene set collection used in training.

load_genes(input_genes)

Load gene list and convert to Entrez.

Parameters:

input_genes (List[str]) – Input gene list, can be mixed type.

The following class attributes are set when this function is run

GenePlexus.input_genes (List[str])

Input genes converted to uppercase

GenePlexus.df_convert_out (DataFrame)

A table with the following 6 columns:

  • Original ID – User supplied Gene ID

  • Entrez ID – Entrez Gene ID

  • Gene Name – Name Gene ID

  • In BioGRID? – Y or N if the gene was found in the BioGRID network or not

  • In IMP? – Y or N if the gene was found in the IMP network or not

  • In STRING? – Y or N if the gene was found in the STRING network or not

GenePlexus.table_summary (List[Dict[str, int]])

List of network stats summary dictionaries. Each dictionary has the following structure:

{
  "Network" : # returns name of the network
  "NetworkGenes"  : # returns number of genes in the network
  "PositiveGenes" : # returns number of input genes found in the network
}
GenePlexus.convert_ids (List[str])

Converted gene list.

GenePlexus.input_count (int)

Number of input genes that were able to be converted.
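One possible use of table_summary and input_count is to compare how well each network covers the converted input genes before choosing net_type. The sketch below is illustrative only: network_coverage is a hypothetical helper, and the counts are placeholders rather than real network statistics.

```python
def network_coverage(table_summary, input_count):
    """Map each network name to the fraction of converted input genes it contains."""
    if input_count <= 0:
        raise ValueError("input_count must be positive")
    return {entry["Network"]: entry["PositiveGenes"] / input_count
            for entry in table_summary}


# Placeholder counts for illustration only.
summary = [
    {"Network": "STRING", "NetworkGenes": 1000, "PositiveGenes": 8},
    {"Network": "BioGRID", "NetworkGenes": 900, "PositiveGenes": 6},
]
coverage = network_coverage(summary, input_count=10)
best = max(coverage, key=coverage.get)  # network covering the most input genes
print(coverage, best)
```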

See also

Use geneplexus.util.read_gene_list() to load a gene list from a file.

load_negatives(input_negatives)

Load gene list and convert to Entrez that will be used as negatives.

Parameters:

input_negatives (List[str]) – Input negative gene list, can be mixed type.

The following class attributes are set when this function is run

GenePlexus.input_negatives (List[str])

Input negatives converted to uppercase

GenePlexus.df_convert_out_negatives (DataFrame)

A table with the following 6 columns:

  • Original ID – User supplied Gene ID

  • Entrez ID – Entrez Gene ID

  • Gene Name – Name Gene ID

  • In BioGRID? – Y or N if the gene was found in the BioGRID network or not

  • In IMP? – Y or N if the gene was found in the IMP network or not

  • In STRING? – Y or N if the gene was found in the STRING network or not

GenePlexus.table_summary_negatives (List[Dict[str, int]])

List of network stats summary dictionaries. Each dictionary has the following structure:

{
  "Network" : # returns name of the network
  "NetworkGenes"  : # returns number of genes in the network
  "PositiveGenes" : # returns number of input genes found in the network
}
GenePlexus.convert_ids_negatives (List[str])

Converted negative gene list.

GenePlexus.input_count_negatives (int)

Number of negative genes that were able to be converted.

See also

Use geneplexus.util.read_gene_list() to load a gene list from a file.

make_sim_dfs()

Compute similarities between the input genes and GO, Monarch and/or Mondo.

The following class attributes are set when this function is run

GenePlexus.df_sim (DataFrame)

A table showing how similar the coefficients of the user trained models are to the coefficients of models trained using genes annotated to gsc_res. The table has the following 7 columns:

  • Task – Which type of GSC the term is from

  • ID – Term ID

  • Name – Term Name

  • Similarity – Cosine similarity between the coefficients of the two models

  • Z-score – The z-score of the similarities

  • P-adjusted – The Bonferroni adjusted p-values from the z-scores

  • Rank – The rank of the term, with one being the term with the highest similarity to the user model

GenePlexus.weights

Dictionary of pretrained model weights for gsc_res.

{
  "{Term ID}" : {  # ID of the GSC term
     "Name"  : # returns string of term name
     "PosGenes" : # returns list of genes annotated to term
     "Task"  : # returns type of GSC the term is from
     "Weights" : # returns list of coefficients from models trained using genes annotated to the term
     }
}
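The Similarity column is described as a cosine similarity between model coefficients, so a dictionary of this shape could in principle be ranked against a user model as sketched below. This is a standard-library illustration, not the library's code: rank_terms and cosine_similarity are hypothetical helpers, and the term IDs and weight vectors are placeholders.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length coefficient vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def rank_terms(user_weights, weights):
    """Return (term_id, name, similarity) rows, most similar term first."""
    rows = [(term_id, entry["Name"], cosine_similarity(user_weights, entry["Weights"]))
            for term_id, entry in weights.items()]
    rows.sort(key=lambda row: row[2], reverse=True)
    return rows


# Placeholder terms and coefficients for illustration only.
pretrained = {
    "TERM:1": {"Name": "term one", "PosGenes": [], "Task": "GO", "Weights": [0.0, 1.0]},
    "TERM:2": {"Name": "term two", "PosGenes": [], "Task": "GO", "Weights": [2.0, 0.0]},
}
ranked = rank_terms([1.0, 0.0], pretrained)
print(ranked)
```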
make_small_edgelist(num_nodes=50)

Make a subgraph induced by the top predicted genes.

Parameters:

num_nodes (int) – Number of top genes to include.

The following class attributes are set when this function is run

GenePlexus.df_edge (DataFrame)

Table of edge list corresponding to the subgraph induced by the top predicted genes (in Entrez gene ID).

GenePlexus.isolated_genes (List[str])

List of top predicted genes (in Entrez gene ID) that are isolated from other top predicted genes in the network.

GenePlexus.df_edge_sym (DataFrame)

Table of edge list corresponding to the subgraph induced by the top predicted genes (in gene symbol).

GenePlexus.isolated_genes_sym (List[str])

List of top predicted genes (in gene symbol) that are isolated from other top predicted genes in the network.
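The notions of an induced subgraph and of isolated genes used above can be sketched in a few lines. This is a generic illustration under stated assumptions, not the library's implementation: induced_subgraph is a hypothetical helper, edges are assumed to be plain (gene, gene) pairs without weights, and self-loops are ignored.

```python
def induced_subgraph(edges, top_genes):
    """Keep edges whose endpoints are both top genes; list the genes left without edges."""
    top = set(top_genes)
    # An edge belongs to the induced subgraph only if both endpoints are selected.
    kept = [(u, v) for u, v in edges if u in top and v in top and u != v]
    connected = {gene for edge in kept for gene in edge}
    # Isolated genes are selected genes with no retained edge, in input order.
    isolated = [gene for gene in top_genes if gene not in connected]
    return kept, isolated


# Placeholder gene IDs for illustration only.
edges = [("1", "2"), ("2", "3"), ("3", "4")]
kept, isolated = induced_subgraph(edges, ["1", "2", "4"])
print(kept, isolated)
```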

property net_type: Literal['BioGRID', 'STRING', 'IMP']

Network to use.

property sp_res: Literal['Human', 'Mouse', 'Fly', 'Worm', 'Fish', 'Yeast']

Results species.

property sp_trn: Literal['Human', 'Mouse', 'Fly', 'Worm', 'Fish', 'Yeast']

Training species.