Welcome to MethylNet’s documentation!¶

https://github.com/Christensen-Lab-Dartmouth/MethylNet

See README.md in Github repository for install directions and for example scripts for running the pipeline (not all datasets may be available on GEO at this time).

There is both an API and CLI available for use. Examples for CLI usage can be found in ./example_scripts.

datasets.py¶

Contains datatypes core to loading MethylationArrays into Torch tensors.

class methylnet.datasets.MethylationDataSet(methylation_array, transform, outcome_col='', categorical=False, categorical_encoder=False)[source]¶

Pytorch Dataset that contains instances of methylation array samples to be loaded.

Parameters:	methylation_array : MethylationArray Methylation Array input. transform : Transformer Transforms data into torch tensor. outcome_col : str Pheno column(s) to train on. categorical : bool Whether predicting categorical/classification. categorical_encoder : encoder Encoder used to binarize categorical column.
Attributes:	samples : np.array Samples of MethylationArray features : np.array List CpGs. encoder : Encoder used to binarize categorical column. new_shape : Shape of torch tensors. length : Number CpGs. methylation_array outcome_col transform

Methods

to_methyl_array(self) Convert torch dataset back into methylation array, useful because turning into torch dataset can cause the original MethylationArray beta matrix to turn into numpy array, when needs turn back into pandas dataframe.

to_methyl_array(self)[source]¶

Convert torch dataset back into methylation array, useful because turning into torch dataset can cause the original MethylationArray beta matrix to turn into numpy array, when needs turn back into pandas dataframe.

Returns:	MethylationArray

class methylnet.datasets.MethylationGenerationDataSet(methylation_array, transform, outcome_col='', categorical=False, categorical_encoder=False)[source]¶

MethylationArray Torch Dataset that contains instances of methylation array samples to be loaded, specifically targetted for prediction.

Parameters:	methylation_array : MethylationArray Methylation Array input. transform : Transformer Transforms data into torch tensor. outcome_col : str Pheno column(s) to train on. categorical : bool Whether predicting categorical/classification. categorical_encoder : encoder Encoder used to binarize categorical column.
Attributes:	samples : np.array Samples of MethylationArray features : np.array List CpGs. encoder : Encoder used to binarize categorical column. new_shape : Shape of torch tensors. length : Number CpGs. methylation_array outcome_col transform

Methods

to_methyl_array(self) Convert torch dataset back into methylation array, useful because turning into torch dataset can cause the original MethylationArray beta matrix to turn into numpy array, when needs turn back into pandas dataframe.

class methylnet.datasets.MethylationPredictionDataSet(methylation_array, transform, outcome_col='', categorical=False, categorical_encoder=False)[source]¶

MethylationArray Torch Dataset that contains instances of methylation array samples to be loaded, specifically targetted for prediction.

Parameters:	methylation_array : MethylationArray Methylation Array input. transform : Transformer Transforms data into torch tensor. outcome_col : str Pheno column(s) to train on. categorical : bool Whether predicting categorical/classification. categorical_encoder : encoder Encoder used to binarize categorical column.
Attributes:	samples : np.array Samples of MethylationArray features : np.array List CpGs. encoder : Encoder used to binarize categorical column. new_shape : Shape of torch tensors. length : Number CpGs. methylation_array outcome_col transform

Methods

to_methyl_array(self) Convert torch dataset back into methylation array, useful because turning into torch dataset can cause the original MethylationArray beta matrix to turn into numpy array, when needs turn back into pandas dataframe.

class methylnet.datasets.RawBetaArrayDataSet(beta_array, transform)[source]¶

Torch Dataset just for beta matrix in numpy format.

Parameters:	beta_array : numpy.array Beta value matrix in numpy form. transform : Transforms data to torch tensor.
Attributes:	length : Number CpGs. beta_array transform

class methylnet.datasets.Transformer(convolutional=False, cpg_per_row=30000, l=None)[source]¶

Push methyl array sample to pytorch tensor, possibly turning into image.

Parameters:	convolutional : bool Whether running convolutions on torch dataset. cpg_per_row : int Number of CpGs per row of image if convolutional. l : tuple Attributes that describe image.
Attributes:	shape : tuple Shape of methylation array image for sample convolutional cpg_per_row l

Methods

generate(self) Generate function for transform.

generate(self)[source]¶

Generate function for transform.

Returns:	function

methylnet.datasets.cat()¶

Concatenates the given sequence of seq tensors in the given dimension. All tensors must either have the same shape (except in the concatenating dimension) or be empty.

torch.cat() can be seen as an inverse operation for torch.split() and torch.chunk().

torch.cat() can be best understood via examples.

Args:

tensors (sequence of Tensors): any python sequence of tensors of the same type.: Non-empty tensors provided must have the same shape, except in the cat dimension.

dim (int, optional): the dimension over which the tensors are concatenated out (Tensor, optional): the output tensor

Example:

>>> x = torch.randn(2, 3)
>>> x
tensor([[ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497]])
>>> torch.cat((x, x, x), 0)
tensor([[ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497],
        [ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497],
        [ 0.6580, -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497]])
>>> torch.cat((x, x, x), 1)
tensor([[ 0.6580, -1.0969, -0.4614,  0.6580, -1.0969, -0.4614,  0.6580,
         -1.0969, -0.4614],
        [-0.1034, -0.5790,  0.1497, -0.1034, -0.5790,  0.1497, -0.1034,
         -0.5790,  0.1497]])

methylnet.datasets.get_methylation_dataset(methylation_array, outcome_col, convolutional=False, cpg_per_row=1200, predict=False, categorical=False, categorical_encoder=False, generate=False)[source]¶

Turn methylation array into pytorch dataset.

Parameters:

methylation_array : MethylationArray: Input MethylationArray.
outcome_col : str: Pheno column to train on.
convolutional : bool: Whether running CNN on methylation data
cpg_per_row : int: If convolutional, number of cpgs per image row.
predict : bool: Running prediction algorithm vs VAE.
categorical : bool: Whether training on categorical vs continuous variables.
categorical_encoder :: Scikit learn encoder.

Returns:

Pytorch Dataset

methylnet.datasets.stack()¶

Concatenates sequence of tensors along a new dimension.

All tensors need to be of the same size.

Arguments:

seq (sequence of Tensors): sequence of tensors to concatenate dim (int): dimension to insert. Has to be between 0 and the number

of dimensions of concatenated tensors (inclusive)

out (Tensor, optional): the output tensor

hyperparameter_scans.py¶

Run randomized grid search to find ideal model hyperparameters, with possible deployments to batch system for scalability.

methylnet.hyperparameter_scans.coarse_scan(hyperparameter_input_csv, hyperparameter_output_log, generate_input, job_chunk_size, stratify_column, reset_all, torque, gpu, gpu_node, nohup, mlp=False, custom_jobs=[], model_complexity_factor=0.9, set_beta=-1.0, n_jobs=4, categorical=True, add_softmax=False, additional_command='', cuda=True, new_grid={})[source]¶

Perform randomized hyperparameter grid search

Parameters:

hyperparameter_input_csv : type: CSV file containing hyperparameter inputs.
hyperparameter_output_log : type: CSV file containing prior runs.
generate_input : type: Generate hyperparameter input csv.
job_chunk_size : type: Number of jobs to be launched at same time.
stratify_column : type: Performing classification?
reset_all : type: Rerun all jobs previously scanned.
torque : type: Run jobs using torque.
gpu : type: What GPU to use, set to -1 to be agnostic to GPU selection.
gpu_node : type: What GPU to use, set to -1 to be agnostic to GPU selection, for torque submission.
nohup : type: Launch jobs using nohup.
mlp : type: If running prediction job (classification/regression) after VAE.
custom_jobs : type: Supply custom job parameters to be run.
model_complexity_factor : type: Degree of neural network model complexity for hyperparameter search. Search for less wide networks with a lower complexity value, bounded between 0 and infinity.
set_beta : type: Don’t hyperparameter scan over beta (KL divergence weight), and set it to value.
n_jobs : type: Number of jobs to generate.
categorical : type: Classification task?
add_softmax : type: Add softmax layer at end of neural network.
cuda : type: Whether to use GPU?

methylnet.hyperparameter_scans.find_top_jobs(hyperparameter_input_csv, hyperparameter_output_log, n_top_jobs, crossover_p=0, val_loss_column='min_val_loss')[source]¶

Finds top performing jobs from hyper parameter scan to rerun and cross-over parameters.

Parameters:	hyperparameter_input_csv : str CSV file containing hyperparameter inputs. hyperparameter_output_log : str CSV file containing prior runs. n_top_jobs : int Number of top jobs to select crossover_p : float Rate of cross over of reused hyperparameters val_loss_column : str Loss column used to select top jobs
Returns:	list List of list of new parameters for jobs to run.

methylnet.hyperparameter_scans.generate_topology(topology_grid, probability_decay_factor=0.9)[source]¶

Generates list denoting neural network topology, list of hidden layer sizes.

Parameters:	topology_grid : list List of different hidden layer sizes (number neurons) to choose from. probability_decay_factor : float Degree of neural network model complexity for hyperparameter search. Search for less wide networks with a lower complexity value, bounded between 0 and infinity.
Returns:	list List of hidden layer sizes.

methylnet.hyperparameter_scans.replace_grid(old_grid, new_grid, topology_grid)[source]¶: Replaces old hyperparameter search grid with one supplied by YAML.

torque_jobs.py¶

Wraps and runs your commands through torque.

methylnet.torque_jobs.assemble_replace_dict(command, use_gpu, additions, queue, time, ngpu)[source]¶

Create dictionary to update BASH submission script for torque.

Parameters:	command : type Command to executer through torque. use_gpu : type GPUs needed? additions : type Additional commands to add (eg. module loads). queue : type Queue to place job in. time : type How many hours to run job for. ngpu : type Number of GPU to use.
Returns:	Dict Dictionary used to update Torque Script.

methylnet.torque_jobs.assemble_run_torque(command, use_gpu, additions, queue, time, ngpu, additional_options='')[source]¶

Runs torque job after passing commands to setup bash file.

Parameters:	command : type Command to executer through torque. use_gpu : type GPUs needed? additions : type Additional commands to add (eg. module loads). queue : type Queue to place job in. time : type How many hours to run job for. ngpu : type Number of GPU to use. additional_options : type Additional options to pass to Torque scheduler.
Returns:	job Custom job name.

methylnet.torque_jobs.run_torque_job_(replace_dict, additional_options='')[source]¶

Run torque job after creating submission script.

Parameters:	replace_dict : type Dictionary used to replace information in bash script to run torque job. additional_options : type Additional options to pass scheduler.
Returns:	str Custom torque job name.

interpretation_classes.py¶

Contains core classes and functions to extracting explanations for the predictions at single samples, and then interrogates important CpGs for biological plausibility.

class methylnet.interpretation_classes.BioInterpreter(dict_top_cpgs)[source]¶

Interrogate CpGs found to be important from SHAP through enrichment and overlap tests.

Parameters:	dict_top_cpgs : type Dictionary of topCpGs output from ShapleyExplorer object
Attributes:	top_cpgs : type Dictionary of top cpgs to be interrogated.

Methods

`get_nearby_cpg_shapleys`(self, all_cpgs, max_gap)	Query for CpGs that are within a max_gap of the topCpGs in the genome, helps find cpgs that could be highly correlated and thus also important.
`gometh`(self[, collection, allcpgs, …])	Run GO, KEGG, GSEA analyses or return nearby genes.
`return_overlap_score`(self[, set_cpgs, …])	Perform overlap test, overlapping top CpGs with IDOL CpGs, age related cpgs, etc.
`run_lola`(self[, all_cpgs, lola_db, cores, …])	Run LOLA enrichment test.

get_nearby_cpg_shapleys(self, all_cpgs, max_gap)[source]¶

Query for CpGs that are within a max_gap of the topCpGs in the genome, helps find cpgs that could be highly correlated and thus also important.

Parameters:	all_cpgs : type List of all cpgs in methylationarray. max_gap : type Max radius to search for nearby CpGs.
Returns:	Dict Results for each class/indiv’s top CpGs, in the form of nearby CpGs to each of these CpGs sets.

gometh(self, collection='GO', allcpgs=[], length_output=20, gsea_analyses=[], gsea_pickle='')[source]¶

Run GO, KEGG, GSEA analyses or return nearby genes.

Parameters:

collection : type: GO, KEGG, GSEA, GENE overlaps?
allcpgs : type: All CpGs defines CpG universe, empty if use all CpGs, smaller group of CpGs may yield more significant results.
length_output : type: How many lines should the output be, maximum number of results to print.
gsea_analyses : type: GSEA analyses/collections to target.
gsea_pickle : type: Location of gsea pickle containing gene sets, may need to run download_help_data to acquire.

Returns:

Dict: Dictionary containing results from each test run on all of the keys in top_cpgs.

return_overlap_score(self, set_cpgs='IDOL', platform='450k', all_cpgs=[], output_csv='output_bio_intersect.csv', extract_library=False)[source]¶

Perform overlap test, overlapping top CpGs with IDOL CpGs, age related cpgs, etc.

Parameters:	set_cpgs : type Set of reference cpgs. platform : type 450K or 850K all_cpgs : type All CpGs to build a universe. output_csv : type Output CSV for results of overlaps between classes, how much doo they share these CpGs overlapped. extract_library : type Can extract a list of the CpGs that ended up beinng overlapped with.
Returns:	List Optional output of overlapped library of CpGs.

run_lola(self, all_cpgs=[], lola_db='', cores=8, collections=[], depletion=False)[source]¶

Run LOLA enrichment test.

Parameters:	all_cpgs : type CpG universe for LOLA. lola_db : type Location of LOLA database, can be downloaded using methylnet-interpret. Set to extended or core, and collections correspond to these. cores : type Number of cores to use. collections : type LOLA collections to run, leave empty to run all. depletion : type Set true to look for depleted regions over enriched.
Returns:	type Description of returned object.

class methylnet.interpretation_classes.CpGExplainer(prediction_function=None, cuda=False)[source]¶

Produces SHAPley explanation scores for individual predictions, approximating a complex model with a basic linear one, one coefficient per CpG per individual.

Parameters:	prediction_function : type Model or function that makes predictions on the data, can be sklearn .predict method or pytorch forward pass, etc… cuda : type Run on GPUs?
Attributes:	explainer : type Type of SHAPley explanation method; kernel (uses LIME model agnostic explainer), deep (uses DeepLift) or Gradient (integrated gradients and backprop for explanations)? prediction_function cuda

Methods

`build_explainer`(self, train_methyl_array[, …])	Builds SHAP explainer using background samples.
`classifier_assign_scores_to_shap_data`(self, …)	Assigns the SHAP scores to a SHAPley data object, populating nested dictionaries (containing class-individual information) with SHAPley information and “top CpGs”.
`feature_select`(self, methyl_array, …)	Perform feature selection based on the best overall SHAP scores across all samples.
`from_explainer`(method, cuda)	Load custom SHAPley explainer
`regressor_assign_scores_to_shap_data`(self, …)	Assigns the SHAP scores to a SHAPley data object, populating nested dictionaries (containing multioutput and single output regression-individual information) with SHAPley information and “top CpGs”.
`return_shapley_predictions`(self, …[, encoder])	Method in development or may be deprecated.
`return_shapley_scores`(self, …[, …])	Generate explanations for individual predictions, in the form of SHAPley scores for supplied test data.

build_explainer(self, train_methyl_array, method='kernel', batch_size=100)[source]¶

Builds SHAP explainer using background samples.

Parameters:	train_methyl_array : type Train Methylation Array from which to populate background samples to make estimated explanations. method : type SHAP explanation method? batch_size : type Break up prediction explanation creation into smaller batch sizes for lower memory consumption.

classifier_assign_scores_to_shap_data(self, test_methyl_array, n_top_features, interest_col='disease', prediction_classes=None)[source]¶

Assigns the SHAP scores to a SHAPley data object, populating nested dictionaries (containing class-individual information) with SHAPley information and “top CpGs”.

Parameters:	test_methyl_array : type Testing MethylationArray. n_top_features : type Number of top SHAP scores to use for a subdict called “top CpGs” interest_col : type Column in pheno sheet from which class names reside. prediction_classes : type User supplied prediction classes.

feature_select(self, methyl_array, n_top_features)[source]¶

Perform feature selection based on the best overall SHAP scores across all samples.

Parameters:	methyl_array : type MethylationArray to run feature selection on. n_top_features : type Number of CpGs to select.
Returns:	MethylationArray Subsetted by top overall CpGs, overall positive contributions to prediction. May need to update.

classmethod from_explainer(method, cuda)[source]¶: Load custom SHAPley explainer

regressor_assign_scores_to_shap_data(self, test_methyl_array, n_top_features, cell_names=[])[source]¶

Assigns the SHAP scores to a SHAPley data object, populating nested dictionaries (containing multioutput and single output regression-individual information) with SHAPley information and “top CpGs”.

Parameters:	test_methyl_array : type Testing MethylationArray. n_top_features : type Number of top SHAP scores to use for a subdict called “top CpGs” cell_names : type If multi-output regression, create separate regression classes to access for all individuals.

return_shapley_predictions(self, test_methyl_array, sample_name, interest_col, encoder=None)[source]¶: Method in development or may be deprecated.

return_shapley_scores(self, test_methyl_array, n_samples, n_outputs, shap_sample_batch_size=None, top_outputs=None)[source]¶

Generate explanations for individual predictions, in the form of SHAPley scores for supplied test data. One SHAP score per individual prediction per CpG, and more if multiclass/output. For multiclass/multivariate outcomes, scores are generated for each outcome class.

Parameters:

test_methyl_array : type: Testing MethylationArray.
n_samples : type: Number of SHAP score samples to produce. More SHAP samples provides convergence to correct score.
n_outputs : type: Number of outcome classes.
shap_sample_batch_size : type: If not None, break up SHAP score sampling into batches of this size to be averaged.
top_outputs : type: For deep explainer, limit number of output classes due to memory consumption.

Returns:

type: Description of returned object.

class methylnet.interpretation_classes.DistanceMatrixCompute(methyl_array, pheno_col)[source]¶

From any embeddings, calculate pairwise distances between classes that label embeddings.

Parameters:	methyl_array : MethylationArray Methylation array storing beta and pheno data. pheno_col : str Column name from which to extract sample names from and group by class.
Attributes:	col : pd.DataFrame Column of pheno array. classes : np.array Unique classes of pheno column. embeddings : pd.DataFrame Embeddings of beta values.

Methods

`calculate_p_values`(self)	Compute pairwise p-values between different clusters using manova.
`compute_distances`(self[, metric, trim])	Compute distance matrix between classes by average distances between points of classes.
`return_distances`(self)	Return the distance matrix
`return_p_values`(self)	Return the distance matrix

calculate_p_values(self)[source]¶: Compute pairwise p-values between different clusters using manova.

compute_distances(self, metric='cosine', trim=0.0)[source]¶

Compute distance matrix between classes by average distances between points of classes.

Parameters:	metric : str Scikit-learn distance metric.

return_distances(self)[source]¶

Return the distance matrix

Returns:	pd.DataFrame Distance matrix between classes.

return_p_values(self)[source]¶

Return the distance matrix

Returns:	pd.DataFrame MANOVA values between classes.

class methylnet.interpretation_classes.PlotCircos[source]¶

Plot Circos Diagram using ggbio (regular circos software may be better)

Attributes:	generate_base_plot : type Function to generate base plot add_plot_data : type Function to add more plot data generate_final_plot : type Function to plot data using Circos

Methods

plot_cpgs(self, top_cpgs[, output_dir]) Plot top CpGs location in genome.

plot_cpgs(self, top_cpgs, output_dir='./')[source]¶

Plot top CpGs location in genome.

Parameters:	top_cpgs : type Input dataframe of top CpG name and SHAP scores. Future: Plot SHAP scores in locations? output_dir : type Where to output plots.

class methylnet.interpretation_classes.ShapleyData[source]¶

Store SHAP results that stores feature importances of CpGs on varying degrees granularity.

Attributes:	top_cpgs : type Quick accessible CpGs that have the n highest SHAP scores. shapley_values : type Storing all SHAPley values for CpGs for varying degrees granularity. For classification problems, this only includes SHAP scores particular to the actual class.

Methods

`add_class`(self, class_name, shap_df, cpgs, …)	Store SHAP scores for particular class to Shapley_Data class.
`add_global_importance`(self, …)	Add overall feature importances, globally across all samples.
`from_pickle`(input_pkl[, from_dict])	Load SHAPley data from pickle.
`to_pickle`(self, output_pkl[, output_dict])	Export Shapley data to pickle.

add_class(self, class_name, shap_df, cpgs, n_top_cpgs, add_top_negative=False, add_top_abs=False)[source]¶

Store SHAP scores for particular class to Shapley_Data class. Save feature importances on granular level, explanations for individual and aggregate predictions.

Parameters:	class_name : type Particular class name to be stored in dictionary structure. shap_df : type SHAP data to pull from. cpgs : type All CpGs. n_top_cpgs : type Number of top CpGs for saving n top CpGs. add_top_negative : type Only consider top negative scores.

add_global_importance(self, global_importance_shaps, cpgs, n_top_cpgs)[source]¶

Add overall feature importances, globally across all samples.

Parameters:	global_importance_shaps : type Overall SHAP values to be saved, aggregated across all samples. cpgs : type All CpGs. n_top_cpgs : type Number of top CpGs to rank and save as well.

classmethod from_pickle(input_pkl, from_dict=False)[source]¶

Load SHAPley data from pickle.

Parameters:	input_pkl : type Input pickle.

to_pickle(self, output_pkl, output_dict=False)[source]¶

Export Shapley data to pickle.

Parameters:	output_pkl : type Output file to save SHAPley data.

class methylnet.interpretation_classes.ShapleyDataExplorer(shapley_data)[source]¶

Datatype used to explore saved ShapleyData.

Parameters:	shapley_data : type ShapleyData instance to be explored.
Attributes:	indiv2class : type Maps individuals to their classes used for quick look-up. shapley_data

Methods

`add_abs_value_classes`(self)	WIP.
`add_bin_continuous_classes`(self)	Bin continuous outcome variable and extract top positive and negative CpGs for each new class.
`extract_class`(self, class_name[, …])	Extract the top cpgs from a class
`extract_individual`(self, individual[, …])	Extract the top cpgs from an individual
`extract_methylation_array`(self, methyl_arr)	Subset MethylationArray Beta values by some top SHAP CpGs from classes or overall.
`jaccard_similarity_top_cpgs`(self, class_names)	Calculate Jaccard Similarity matrix between classes and individuals within those classes based on how they share sets of CpGs.
`limit_number_top_cpgs`(self, n_top_cpgs)	Reduce the number of top CpGs.
`list_classes`(self)	List classes in ShapleyData object.
`list_individuals`(self[, return_list])	List the individuals in the ShapleyData object.
`regenerate_individual_shap_values`(self, …)	Use original SHAP scores to make nested dictionary of top CpGs based on shapley score, can do this for ABS SHAP or Negative SHAP scores as well.
`return_binned_shapley_data`(self, …[, …])	Converts existing shap data based on continuous variable predictions into categorical variable.
`return_cpg_sets`(self)	Return top sets of CpGs from classes and individuals, and the complementary set.
`return_global_importance_cpgs`(self)	Return overall globally important CpGs.
`return_shapley_data_by_methylation_status`(…)	Return dictionary containing two SHAPley datasets, each split by low/high levels of methylation.
`return_top_cpgs`(self[, classes, …])	Given list of classes and individuals, export a dictionary containing data frames of top CpGs and their SHAP scores.
`view_methylation`(self, individual, methyl_arr)	Use MethylationArray to output top SHAP values for individual with methylation data appended to output data frame.

make_shap_scores_abs

add_abs_value_classes(self)[source]¶: WIP.

add_bin_continuous_classes(self)[source]¶: Bin continuous outcome variable and extract top positive and negative CpGs for each new class.

extract_class(self, class_name, class_intersect=False, get_shap_values=False)[source]¶

Extract the top cpgs from a class

Parameters:	class_name : type Class to extract from? class_intersect : type Bool to extract from aggregation of SHAP values from individuals of current class, should have been done already.
Returns:	DataFrame Cpgs and SHAP Values

extract_individual(self, individual, get_shap_values=False)[source]¶

Extract the top cpgs from an individual

Parameters:	individual : type Individual to extract from?
Returns:	Tuple Class name of individual, DataFrame of Cpgs and SHAP Values

extract_methylation_array(self, methyl_arr, classes_only=True, global_vals=False, n_extract=1000, class_name='')[source]¶

Subset MethylationArray Beta values by some top SHAP CpGs from classes or overall.

Parameters:

methyl_arr : type: MethylationArray.
classes_only : type: Setting this to False will include the overall top CpGs per class and the top CpGs for individuals in that class.
global_vals : type: Use top CpGs overall, across the entire dataset.
n_extract : type: Number of top CpGs to subset.
class_name : type: Which class to subset top CpGs from? Blank to use union of all top CpGs.

Returns:

MethylationArray: Stores methylation beta and pheno data, reduced here by queried CpGs.

jaccard_similarity_top_cpgs(self, class_names, individuals=False, overall=False, cooccurrence=False)[source]¶

Calculate Jaccard Similarity matrix between classes and individuals within those classes based on how they share sets of CpGs.

Parameters:	class_names : type Classes to include. individuals : type Individuals to include. overall : type Whether to use overall class top CpGs versus aggregate. cooccurrence : type Output cooccurence instead of jaccard.
Returns:	pd.DataFrame Similarity matrix.

limit_number_top_cpgs(self, n_top_cpgs)[source]¶

Reduce the number of top CpGs.

Parameters:	n_top_cpgs : type Number top CpGs to retain.
Returns:	ShapleyData Returns shapley data with fewer top CpGs.

list_classes(self)[source]¶

List classes in ShapleyData object.

Returns:	List List of classes

list_individuals(self, return_list=False)[source]¶

List the individuals in the ShapleyData object.

Parameters:	return_list : type Return list of individual names rather than a dictionary of class:individual key-value pairs.
Returns:	List or Dictionary Class: Individual dictionary or individuals are elements of list

regenerate_individual_shap_values(self, n_top_cpgs, abs_val=False, neg_val=False)[source]¶

Use original SHAP scores to make nested dictionary of top CpGs based on shapley score, can do this for ABS SHAP or Negative SHAP scores as well.

Parameters:	n_top_cpgs : type Description of parameter n_top_cpgs. abs_val : type Description of parameter abs_val. neg_val : type Description of parameter neg_val.
Returns:	type Description of returned object.

return_binned_shapley_data(self, original_class_name, outcome_col, add_top_negative=False)[source]¶

Converts existing shap data based on continuous variable predictions into categorical variable.

Parameters:	original_class_name : type Regression results were split into ClassName_pos and ClassName_neg, what is the ClassName? outcome_col : type Feed in from pheno sheet one the column to bin samples on. add_top_negative : type Looking to include negative SHAPs?
Returns:	ShapleyData With new class labels, built from regression results.

return_cpg_sets(self)[source]¶

Return top sets of CpGs from classes and individuals, and the complementary set. Returns dictionary of sets of CpGs and the union of all the other sets minus the current set.

Returns:	cpg_sets Dictionary of individuals and classes; CpGs contained in set for particular individual/class cpg_exclusion_sets Dictionary of individuals and classes; CpGs not contained in set for particular individual/class

return_global_importance_cpgs(self)[source]¶

Return overall globally important CpGs.

Returns:	list

return_shapley_data_by_methylation_status(self, methyl_array, threshold)[source]¶

Return dictionary containing two SHAPley datasets, each split by low/high levels of methylation. Todo: Define this using median methylation value vs 0.5.

Parameters:	methyl_array : type MethylationArray instance.
Returns:	dictionary Contains shapley data by methylation status.

return_top_cpgs(self, classes=[], individuals=[], class_intersect=False, cpg_exclusion_sets=None, cpg_sets=None)[source]¶

Given list of classes and individuals, export a dictionary containing data frames of top CpGs and their SHAP scores.

Parameters:

classes : type: Higher level classes to extract CpGs from, list of classes to extract top CpGs from.
individuals : type: Individual samples to extract top CpGs from.
class_intersect : type: Whether the top CpGs should be chosen by aggregating the remaining individual scores.
cpg_exclusion_sets : type: Dict of sets of CpGs, where these ones contain CpGs not particular to particular class.
cpg_sets : type: Contains top CpGs found for each class.

Returns:

Dict: Top CpGs accessed and returned for further processing.

view_methylation(self, individual, methyl_arr)[source]¶

Use MethylationArray to output top SHAP values for individual with methylation data appended to output data frame.

Parameters:	individual : type Individual to query from ShapleyData methyl_arr : type Input MethylationArray
Returns:	class_name individual top_shap_df Top Shapley DataFrame with top cpgs for individual, methylation data appended.

methylnet.interpretation_classes.cooccurrence_fn(list1, list2)[source]¶

Cooccurence of elements between two lists.

Parameters:	list1 list2
Returns:	float Cooccurence between elements in list.

methylnet.interpretation_classes.jaccard_similarity(list1, list2)[source]¶

Jaccard score between two lists.

Parameters:	list1 list2
Returns:	float Jaccard similarity between elements in list.

methylnet.interpretation_classes.main_prediction_function(n_workers, batch_size, model, cuda)[source]¶

Combine dataloader and prediction function into final prediction function that takes in tensor data, for Kernel Explanations.

Parameters:	n_workers : type Number of workers. batch_size : type Batch size for input data. model : type Model/predidction function. cuda : type Running on GPUs?
Returns:	function Final prediction function.

methylnet.interpretation_classes.plot_lola_output_(lola_csv, plot_output_dir, description_col, cell_types)[source]¶

Plots any LOLA output in the form of a forest plot.

Parameters:	lola_csv : type CSV containing lola results. plot_output_dir : type Plot output directory. description_col : type Column that will label the points on the plot. cell_types : type Column containing cell-types, colors plots and are rows to be compared.

methylnet.interpretation_classes.return_dataloader_construct(n_workers, batch_size)[source]¶

Decorator to build dataloader compatible with KernelExplainer for SHAP.

Parameters:	n_workers : type Number CPU. batch_size : type Batch size to load data into pytorch.
Returns:	DataLoader Pytorch dataloader.

methylnet.interpretation_classes.return_predict_function(model, cuda)[source]¶

Decorator to build the supplied prediction function, important for kernel explanations.

Parameters:	model : type Prediction function with predict method. cuda : type Run using cuda.
Returns:	function Predict function.

methylnet.interpretation_classes.return_shap_values(test_arr, explainer, method, n_samples, additional_opts)[source]¶

Return SHAP values, sampled a number of times, used in CpGExplainer class.

Parameters:	test_arr : type Testing MethylationArray. explainer : type SHAP explainer object. method : type Method of explaining. n_samples : type Number of samples to estimate SHAP scores. additional_opts : type Additional options to be passed into non-kernel/gradient methods.
Returns:	np.array/list Shapley scores.

methylnet.interpretation_classes.to_tensor(arr)[source]¶: Turn np.array into tensor.

models.py¶

Contains core PyTorch Models for running VAE and VAE-MLP.

class methylnet.models.AutoEncoder(autoencoder_model, n_epochs, loss_fn, optimizer, cuda=True, kl_warm_up=0, beta=1.0, scheduler_opts={})[source]¶

Wraps Pytorch VAE module into Scikit-learn like interface for ease of training, validation and testing.

Parameters:

autoencoder_model : type: Pytorch VAE Model to supply.
n_epochs : type: Number of epochs to train for.
loss_fn : type: Pytorch loss function for reconstruction error.
optimizer : type: Pytorch Optimizer.
cuda : type: GPU?
kl_warm_up : type: Number of epochs until fully utilizing KLLoss, begin saving models here.
beta : type: Weighting for KLLoss.
scheduler_opts : type: Options to feed learning rate scheduler, which modulates learning rate of optimizer.

Attributes:

model : type: Pytorch VAE model.
scheduler : type: Learning rate scheduler object.
vae_animation_fname : type: Save VAE embeddings evolving over epochs to this file name. Defunct for now.
loss_plt_fname : type: Where to save loss curves. This has been superceded by plot_training_curves in methylnet-visualize command.
plot_interval : type: How often to plot data; defunct.
embed_interval : type: How often to embed; defunct.
validation_set : type: MethylationArray DataLoader, produced from Pytorch MethylationDataset of Validation MethylationArray.
n_epochs
loss_fn
optimizer
cuda
kl_warm_up
beta

Methods

add_validation_set(self, validation_data) Add validation data in the form of Validation DataLoader.

fit(self, train_data) Fit VAE model to training data, best model returned with lowest validation loss over epochs.

fit_transform(self, train_data) Fit VAE model and transform Methylation Array using VAE model.

transform(self, train_data)

Parameters:

add_validation_set(self, validation_data)[source]¶

Add validation data in the form of Validation DataLoader. Adding this will use validation data for early termination / generalization of model to unseen data.

Parameters:	validation_data : type Pytorch DataLoader housing validation MethylationDataset.

fit(self, train_data)[source]¶

Fit VAE model to training data, best model returned with lowest validation loss over epochs.

Parameters:	train_data : DataLoader Training DataLoader that is loading MethylationDataset in batches.
Returns:	self Autoencoder object with updated VAE model.

fit_transform(self, train_data)[source]¶

Fit VAE model and transform Methylation Array using VAE model.

Parameters:	train_data : type Pytorch DataLoader housing training MethylationDataset.
Returns:	np.array Latent Embeddings. np.array Sample names from MethylationArray np.array Outcomes from column of methylarray.

transform(self, train_data)[source]¶

Parameters:	train_data : type Pytorch DataLoader housing training MethylationDataset.
Returns:	np.array Latent Embeddings. np.array Sample names from MethylationArray np.array Outcomes from column of methylarray.

class methylnet.models.MLPFinetuneVAE(mlp_model, n_epochs=None, loss_fn=None, optimizer_vae=None, optimizer_mlp=None, cuda=True, categorical=False, scheduler_opts={}, output_latent=True, train_decoder=False)[source]¶

Wraps VAE_MLP pytorch module into scikit-learn interface with fit, predict and fit_predict methods for ease-of-use model training/evaluation.

Parameters:

mlp_model : type: VAE_MLP model.
n_epochs : type: Number epochs train for.
loss_fn : type: Loss function, pytorch, CrossEntropy, BCE, MSE depending on outcome.
optimizer_vae : type: Optimizer for VAE layers for finetuning original pretrained network.
optimizer_mlp : type: Optimizer for new appended MLP layers.
cuda : type: GPU?
categorical : type: Classification or regression outcome?
scheduler_opts : type: Options for learning rate scheduler, modulates learning rates for VAE and MLP.
output_latent : type: Whether to output latent embeddings during evaluation.
train_decoder : type: Retrain decoder to adjust for finetuning of VAE?

Attributes:

model : type: VAE_MLP.
scheduler_vae : type: Learning rate modulator for VAE optimizer.
scheduler_mlp : type: Learning rate modulator for MLP optimizer.
loss_plt_fname : type: File where to plot loss over time; defunct.
embed_interval : type: How often to return embeddings; defunct.
validation_set : type: Validation set used for hyperparameter tuning and early stopping criteria for generalization.
return_latent : type: Return embedding during evaluation?
n_epochs
loss_fn
optimizer_vae
optimizer_mlp
cuda
categorical
output_latent
train_decoder

Methods

`add_validation_set`(self, validation_data)	Add validation data to reduce overfitting.
`fit`(self, train_data)	Fit MLP to training data to make predictions.
`predict`(self, test_data)	Short summary.

add_validation_set(self, validation_data)[source]¶

Add validation data to reduce overfitting.

Parameters:	validation_data : type Validation Dataloader MethylationDataset.

fit(self, train_data)[source]¶

Fit MLP to training data to make predictions.

Parameters:	train_data : type DataLoader with Training MethylationDataset.
Returns:	self MLPFinetuneVAE with updated parameters.

predict(self, test_data)[source]¶

Short summary.

Parameters:	test_data : type Test DataLoader MethylationDataset.
Returns:	np.array Predictions np.array Ground truth np.array Latent Embeddings np.array Sample names.

class methylnet.models.TybaltTitusVAE(n_input, n_latent, hidden_layer_encoder_topology=[100, 100, 100], cuda=False)[source]¶

Pytorch NN Module housing VAE with fully connected layers and customizable topology.

Parameters:

n_input : type: Number of input CpGs.
n_latent : type: Size of latent embeddings.
hidden_layer_encoder_topology : type: List, length of list contains number of hidden layers for encoder, and each element is number of neurons, mirrored for decoder.
cuda : type: GPU?

Attributes:

cuda_on : type: GPU?
pre_latent_topology : type: Hidden layer topology for encoder.
post_latent_topology : type: Mirrored hidden layer topology for decoder.
encoder_layers : list: Encoder pytorch layers.
encoder : type: Encoder layers wrapped into pytorch module.
z_mean : type: Linear layer from last encoder layer to mean layer.
z_var : type: Linear layer from last encoder layer to var layer.
z_develop : type: Linear layer connecting sampled latent embedding to first layer decoder.
decoder_layers : type: Decoder layers wrapped into pytorch module.
output_layer : type: Linear layer connecting last decoder layer to output layer, which is same size as input..
decoder : type: Wraps decoder_layers and output_layers into Sequential module.
n_input
n_latent

Methods

`__call__`(self, \input, \\*kwargs)	Call self as a function.
`add_module`(self, name, module)	Adds a child module to the current module.
`apply`(self, fn)	Applies `fn` recursively to every submodule (as returned by `.children()`) as well as self.
`buffers`(self[, recurse])	Returns an iterator over module buffers.
`children`(self)	Returns an iterator over immediate children modules.
`cpu`(self)	Moves all model parameters and buffers to the CPU.
`cuda`(self[, device])	Moves all model parameters and buffers to the GPU.
`decode`(self, z)	Decode latent embeddings back into reconstructed input.
`double`(self)	Casts all floating point parameters and buffers to `double` datatype.
`encode`(self, x)	Encode input into latent representation.
`eval`(self)	Sets the module in evaluation mode.
`extra_repr`(self)	Set the extra representation of the module
`float`(self)	Casts all floating point parameters and buffers to float datatype.
`forward`(self, x)	Return reconstructed output, mean and variance of embeddings.
`forward_predict`(self, x)	Forward pass from input to reconstructed input.
`get_latent_z`(self, x)	Encode X into reparameterized latent representation.
`half`(self)	Casts all floating point parameters and buffers to `half` datatype.
`load_state_dict`(self, state_dict[, strict])	Copies parameters and buffers from `state_dict` into this module and its descendants.
`modules`(self)	Returns an iterator over all modules in the network.
`named_buffers`(self[, prefix, recurse])	Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
`named_children`(self)	Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
`named_modules`(self[, memo, prefix])	Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
`named_parameters`(self[, prefix, recurse])	Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
`parameters`(self[, recurse])	Returns an iterator over module parameters.
`register_backward_hook`(self, hook)	Registers a backward hook on the module.
`register_buffer`(self, name, tensor)	Adds a persistent buffer to the module.
`register_forward_hook`(self, hook)	Registers a forward hook on the module.
`register_forward_pre_hook`(self, hook)	Registers a forward pre-hook on the module.
`register_parameter`(self, name, param)	Adds a parameter to the module.
`sample_z`(self, mean, logvar)	Sample latent embeddings, reparameterize by adding noise to embedding.
`state_dict`(self[, destination, prefix, …])	Returns a dictionary containing a whole state of the module.
`to`(self, \args, \\*kwargs)	Moves and/or casts the parameters and buffers.
`train`(self[, mode])	Sets the module in training mode.
`type`(self, dst_type)	Casts all parameters and buffers to `dst_type`.
`zero_grad`(self)	Sets gradients of all model parameters to zero.

share_memory

decode(self, z)[source]¶

Decode latent embeddings back into reconstructed input.

Parameters:	z : type Reparameterized latent embedding.
Returns:	torch.tensor Reconstructed input.

encode(self, x)[source]¶

Encode input into latent representation.

Parameters:	x : type Input methylation data.
Returns:	torch.tensor Learned mean vector of embeddings. torch.tensor Learned variance of learned mean embeddings.

forward(self, x)[source]¶: Return reconstructed output, mean and variance of embeddings.

forward_predict(self, x)[source]¶: Forward pass from input to reconstructed input.

get_latent_z(self, x)[source]¶

Encode X into reparameterized latent representation.

Parameters:	x : type Input methylation data.
Returns:	torch.tensor Latent embeddings.

sample_z(self, mean, logvar)[source]¶

Sample latent embeddings, reparameterize by adding noise to embedding.

Parameters:	mean : type Learned mean vector of embeddings. logvar : type Learned variance of learned mean embeddings.
Returns:	torch.tensor Mean + noise, reparameterization trick.

class methylnet.models.VAE_MLP(vae_model, n_output, categorical=False, hidden_layer_topology=[100, 100, 100], dropout_p=0.2, add_softmax=False)[source]¶

VAE_MLP, pytorch module used to both finetune VAE embeddings and simultaneously train downstream MLP layers for classification/regression tasks.

Parameters:

vae_model : type: VAE pytorch model for methylation data.
n_output : type: Number of outputs at end of model.
categorical : type: Classification or regression problem?
hidden_layer_topology : type: Hidden Layer topology, list of size number of hidden layers for MLP and each element contains number of neurons per layer.
dropout_p : type: Apply dropout regularization to reduce overfitting.
add_softmax : type: Softmax the output before evaluation.

Attributes:

vae : type: Pytorch VAE module.
topology : type: List with hidden layer topology of MLP.
mlp_layers : type: All MLP layers (# layers and neurons per layer)
output_layer : type: nn.Linear connecting last MLP layer and output nodes.
mlp : type: nn.Sequential wraps all layers into sequential ordered pytorch module.
output_z : type: Whether to output latent embeddings.
n_output
categorical
add_softmax
dropout_p

Methods

`__call__`(self, \input, \\*kwargs)	Call self as a function.
`add_module`(self, name, module)	Adds a child module to the current module.
`apply`(self, fn)	Applies `fn` recursively to every submodule (as returned by `.children()`) as well as self.
`buffers`(self[, recurse])	Returns an iterator over module buffers.
`children`(self)	Returns an iterator over immediate children modules.
`cpu`(self)	Moves all model parameters and buffers to the CPU.
`cuda`(self[, device])	Moves all model parameters and buffers to the GPU.
`decode`(self, z)	Run VAE decoder on embeddings.
`double`(self)	Casts all floating point parameters and buffers to `double` datatype.
`eval`(self)	Sets the module in evaluation mode.
`extra_repr`(self)	Set the extra representation of the module
`float`(self)	Casts all floating point parameters and buffers to float datatype.
`forward`(self, x)	Pass data in to return predictions and embeddings.
`forward_embed`(self, x)	Return predictions, latent embeddings and reconstructed input.
`forward_predict`(self, x)	Make predictions, based on output_z, either output predictions or output embeddings.
`half`(self)	Casts all floating point parameters and buffers to `half` datatype.
`load_state_dict`(self, state_dict[, strict])	Copies parameters and buffers from `state_dict` into this module and its descendants.
`modules`(self)	Returns an iterator over all modules in the network.
`named_buffers`(self[, prefix, recurse])	Returns an iterator over module buffers, yielding both the name of the buffer as well as the buffer itself.
`named_children`(self)	Returns an iterator over immediate children modules, yielding both the name of the module as well as the module itself.
`named_modules`(self[, memo, prefix])	Returns an iterator over all modules in the network, yielding both the name of the module as well as the module itself.
`named_parameters`(self[, prefix, recurse])	Returns an iterator over module parameters, yielding both the name of the parameter as well as the parameter itself.
`parameters`(self[, recurse])	Returns an iterator over module parameters.
`register_backward_hook`(self, hook)	Registers a backward hook on the module.
`register_buffer`(self, name, tensor)	Adds a persistent buffer to the module.
`register_forward_hook`(self, hook)	Registers a forward hook on the module.
`register_forward_pre_hook`(self, hook)	Registers a forward pre-hook on the module.
`register_parameter`(self, name, param)	Adds a parameter to the module.
`state_dict`(self[, destination, prefix, …])	Returns a dictionary containing a whole state of the module.
`to`(self, \args, \\*kwargs)	Moves and/or casts the parameters and buffers.
`toggle_latent_z`(self)	Toggle whether to output latent embeddings during forward pass.
`train`(self[, mode])	Sets the module in training mode.
`type`(self, dst_type)	Casts all parameters and buffers to `dst_type`.
`zero_grad`(self)	Sets gradients of all model parameters to zero.

share_memory

decode(self, z)[source]¶

Run VAE decoder on embeddings.

Parameters:	z : type Embeddings.
Returns:	torch.tensor Reconstructed Input.

forward(self, x)[source]¶

Pass data in to return predictions and embeddings.

Parameters:	x : type Input data.
Returns:	torch.tensor Predictions torch.tensor Embeddings

forward_embed(self, x)[source]¶

Return predictions, latent embeddings and reconstructed input.

Parameters:	x : type Input data
Returns:	torch.tensor Predictions torch.tensor Embeddings torch.tensor Reconstructed input.

forward_predict(self, x)[source]¶

Make predictions, based on output_z, either output predictions or output embeddings.

Parameters:	x : type Input Data.
Returns:	torch.tensor Predictions or embeddings.

toggle_latent_z(self)[source]¶: Toggle whether to output latent embeddings during forward pass.

methylnet.models.project_vae(model, loader, cuda=True)[source]¶

Return Latent Embeddings of any data supplied to it.

Parameters:	model : type VAE Pytorch Model. loader : type Loads data one batch at a time. cuda : type GPU?
Returns:	np.array Latent Embeddings. np.array Sample names from MethylationArray np.array Outcomes from column of methylarray.

methylnet.models.test_mlp(model, loader, categorical, cuda=True, output_latent=True)[source]¶

Evaluate MLP on testing set, output predictions.

Parameters:	model : type VAE_MLP model. loader : type DataLoader with MethylationDataSet categorical : type Categorical or continuous predictions. cuda : type GPU? output_latent : type Output latent embeddings in addition to predictions?
Returns:	np.array Predictions np.array Ground truth np.array Latent Embeddings np.array Sample names.

methylnet.models.train_decoder_(model, x, z)[source]¶

Run if retraining decoder to adjust for adjusted latent embeddings during finetuning of embedding layers for VAE_MLP.

Parameters:	model : type VAE_MLP model. x : type Input methylation data. z : type Latent Embeddings
Returns:	nn.Module VAE_MLP module with updated decoder parameters. float Reconstruction loss over all batches.

methylnet.models.train_mlp(model, loader, loss_func, optimizer_vae, optimizer_mlp, cuda=True, categorical=False, train_decoder=False)[source]¶

Train Multi-layer perceptron appended to latent embeddings of VAE via transfer learning. Do this for one iteration.

Parameters:

model : type: VAE_MLP model.
loader : type: DataLoader with MethylationDataset.
loss_func : type: Loss function (BCE, CrossEntropy, MSE).
optimizer_vae : type: Optimizer for pytorch VAE.
optimizer_mlp : type: Optimizer for outcome MLP layers.
cuda : type: GPU?
categorical : type: Predicting categorical or continuous outcomes.
train_decoder : type: Retrain decoder during training loop to adjust for fine-tuned embeddings.

Returns:

nn.Module: Training VAE_MLP model with updated parameters.
float: Training loss over all batches

methylnet.models.train_vae(model, loader, loss_func, optimizer, cuda=True, epoch=0, kl_warm_up=0, beta=1.0)[source]¶

Function for parameter update during VAE training for one iteration.

Parameters:

model : type: VAE torch model
loader : type: Data loader, generator that calls batches of data.
loss_func : type: Loss function for reconstruction error, nn.BCELoss or MSELoss
optimizer : type: SGD or Adam pytorch optimizer.
cuda : type: GPU?
epoch : type: Epoch of training, passed in from outer loop.
kl_warm_up : type: How many epochs until model is fully utilizes KL Loss.
beta : type: Weight given to KL Loss.

Returns:

nn.Module: Pytorch VAE model
float: Total Training Loss across all batches
float: Total Training reconstruction loss across all batches
float: Total KL Loss across all batches

methylnet.models.vae_loss(output, input, mean, logvar, loss_func, epoch, kl_warm_up=0, beta=1.0)[source]¶

Function to calculate VAE Loss, Reconstruction Loss + Beta KLLoss.

Parameters:

output : torch.tensor: Reconstructed output from autoencoder.
input : torch.tensor: Original input data.
mean : type: Learned mean tensor for each sample point.
logvar : type: Variation around that mean sample point, learned from reparameterization.
loss_func : type: Loss function for reconstruction loss, MSE or BCE.
epoch : type: Epoch of training.
kl_warm_up : type: Number of epochs until fully utilizing KLLoss, begin saving models here.
beta : type: Weighting for KLLoss.

Returns:

torch.tensor: Total loss
torch.tensor: Recon loss
torch.tensor: KL loss

methylnet.models.val_decoder_(model, x, z)[source]¶

Validation Loss over decoder.

Parameters:	model : type VAE_MLP model. x : type Input methylation data. z : type Latent Embeddings
Returns:	float Reconstruction loss over all batches.

methylnet.models.val_mlp(model, loader, loss_func, cuda=True, categorical=False, train_decoder=False)[source]¶

Find validation loss of VAE_MLP over one Epoch.

Parameters:	model : type VAE_MLP model. loader : type DataLoader with MethylationDataset. loss_func : type Loss function (BCE, CrossEntropy, MSE). cuda : type GPU? categorical : type Predicting categorical or continuous outcomes. train_decoder : type Retrain decoder during training loop to adjust for fine-tuned embeddings.
Returns:	nn.Module VAE_MLP model. float Validation loss over all batches

methylnet.models.val_vae(model, loader, loss_func, optimizer, cuda=True, epoch=0, kl_warm_up=0, beta=1.0)[source]¶

Function for validation loss computation during VAE training for one epoch.

Parameters:

model : type: VAE torch model
loader : type: Validation Data loader, generator that calls batches of data.
loss_func : type: Loss function for reconstruction error, nn.BCELoss or MSELoss
optimizer : type: SGD or Adam pytorch optimizer.
cuda : type: GPU?
epoch : type: Epoch of training, passed in from outer loop.
kl_warm_up : type: How many epochs until model is fully utilizes KL Loss.
beta : type: Weight given to KL Loss.

Returns:

nn.Module: Pytorch VAE model
float: Total Validation Loss across all batches
float: Total Validation reconstruction loss across all batches
float: Total Validation KL Loss across all batches

plotter.py¶

Plotting mechanisms for training that are now defuct.

class methylnet.plotter.Plot(title, xlab='vae1', ylab='vae2', data=[])[source]¶

Stores plotting information; defunct, superceded, see methylnet-visualize.

Methods

return_img

class methylnet.plotter.PlotTransformer(data, color_list)[source]¶

Plotting Transformer to help with plotting embeddings changing over epochs; defunct.

Methods

transform

class methylnet.plotter.Plotter(plots, animation=True)[source]¶

Plot embeddings and training curve from Plot objects; defunct, superceded, see methylnet-visualize.

Methods

animate
write_plots

schedulers.py¶

Learning rate schedulers that help enable better and more generalizable models.

class methylnet.schedulers.CosineAnnealingWithRestartsLR(optimizer, T_max, eta_min=0, last_epoch=-1, T_mult=1.0, alpha_decay=1.0)[source]¶

Borrowed from: https://github.com/mpyrozhok/adamwr/blob/master/cyclic_scheduler.py Needs to be updated to reflect newest changes. From original docstring: Set the learning rate of each parameter group using a cosine annealing schedule, where \(\eta_{max}\) is set to the initial lr and \(T_{cur}\) is the number of epochs since the last restart in SGDR:

\[\eta_t = \eta_{min} + \frac{1}{2}(\eta_{max} - \eta_{min})(1 + \cos(\frac{T_{cur}}{T_{max}}\pi))\]

When last_epoch=-1, sets initial lr as lr. It has been proposed in

SGDR: Stochastic Gradient Descent with Warm Restarts. This implements the cosine annealing part of SGDR, the restarts and number of iterations multiplier.

Args:

optimizer (Optimizer): Wrapped optimizer. T_max (int): Maximum number of iterations. T_mult (float): Multiply T_max by this number after each restart. Default: 1. eta_min (float): Minimum learning rate. Default: 0. last_epoch (int): The index of last epoch. Default: -1.

Attributes:	step_n

Methods

`load_state_dict`(self, state_dict)	Loads the schedulers state.
`state_dict`(self)	Returns the state of the scheduler as a `dict`.

cosine
get_lr
restart
step

class methylnet.schedulers.Scheduler(optimizer=None, opts={'T_max': 10, 'T_mult': 2, 'eta_min': 5e-08, 'lr_scheduler_decay': 0.5, 'scheduler': 'null'})[source]¶

Scheduler class that modulates learning rate of torch optimizers over epochs.

Parameters:	optimizer : type torch.Optimizer object opts : type Options of setting the learning rate scheduler, see default.
Attributes:	schedulers : type Different types of schedulers to choose from. scheduler_step_fn : type How scheduler updates learning rate. initial_lr : type Initial set learning rate. scheduler_choice : type What scheduler type was chosen. scheduler : type Scheduler object chosen that will more directly update optimizer LR.

Methods

`get_lr`(self)	Return current learning rate.
`step`(self)	Update optimizer learning rate

get_lr(self)[source]¶

Return current learning rate.

Returns:	float Current learning rate.

step(self)[source]¶: Update optimizer learning rate

methylnet-embed¶

methylnet-embed [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

launch_hyperparameter_scan¶

Launch randomized grid search of neural network hyperparameters.

methylnet-embed launch_hyperparameter_scan [OPTIONS]

Options

-hcsv, --hyperparameter_input_csv <hyperparameter_input_csv>¶: CSV file containing hyperparameter inputs. [default: embeddings/embed_hyperparameters_scan_input.csv]

-hl, --hyperparameter_output_log <hyperparameter_output_log>¶: CSV file containing prior runs. [default: embeddings/embed_hyperparameters_log.csv]

-g, --generate_input¶: Generate hyperparameter input csv.

-c, --job_chunk_size <job_chunk_size>¶: If not series, chunk up and run these number of commands at once..

-sc, --stratify_column <stratify_column>¶: Column to stratify samples on. [default: disease]

-r, --reset_all¶: Run all jobs again.

-t, --torque¶: Submit jobs on torque.

-gpu, --gpu <gpu>¶: If torque submit, which gpu to use. [default: -1]

-gn, --gpu_node <gpu_node>¶: If torque submit, which gpu node to use. [default: -1]

-nh, --nohup¶: Nohup launch jobs.

-mc, --model_complexity_factor <model_complexity_factor>¶: Degree of neural network model complexity for hyperparameter search. Search for less wide and less deep networks with a lower complexity value, bounded between 0 and infinity. [default: 1.0]

-b, --set_beta <set_beta>¶: Set beta value, bounded between 0 and infinity. Set to -1 [default: 1.0]

-j, --n_jobs <n_jobs>¶: Number of jobs to generate.

-n, --n_jobs_relaunch <n_jobs_relaunch>¶: Relaunch n top jobs from previous run. [default: 0]

-cp, --crossover_p <crossover_p>¶: Rate of crossover between hyperparameters. [default: 0.0]

-v, --val_loss_column <val_loss_column>¶: Validation loss column.

-a, --additional_command <additional_command>¶: Additional command to input for torque run.

-cu, --cuda¶: Use GPUs.

-grid, --hyperparameter_yaml <hyperparameter_yaml>¶: YAML file with custom subset of hyperparameter grid.

perform_embedding¶

Perform variational autoencoding on methylation dataset.

methylnet-embed perform_embedding [OPTIONS]

Options

-i, --train_pkl <train_pkl>¶: Input database for beta and phenotype data. [default: ./train_val_test_sets/train_methyl_array.pkl]

-o, --output_dir <output_dir>¶: Output directory for embeddings. [default: ./embeddings/]

-c, --cuda¶: Use GPUs.

-n, --n_latent <n_latent>¶: Number of latent dimensions. [default: 64]

-lr, --learning_rate <learning_rate>¶: Learning rate. [default: 0.001]

-wd, --weight_decay <weight_decay>¶: Weight decay of adam optimizer. [default: 0.0001]

-e, --n_epochs <n_epochs>¶: Number of epochs to train over. [default: 50]

-hlt, --hidden_layer_encoder_topology <hidden_layer_encoder_topology>¶: Topology of hidden layers, comma delimited, leave empty for one layer encoder, eg. 100,100 is example of 5-hidden layer topology. [default: ]

-kl, --kl_warm_up <kl_warm_up>¶: Number of epochs before introducing kl_loss. [default: 0]

-b, --beta <beta>¶: Weighting of kl divergence. [default: 1.0]

-s, --scheduler <scheduler>¶: Type of learning rate scheduler. [default: null]

-d, --decay <decay>¶: Learning rate scheduler decay for exp selection. [default: 0.5]

-t, --t_max <t_max>¶: Number of epochs before cosine learning rate restart. [default: 10]

-eta, --eta_min <eta_min>¶: Minimum cosine LR. [default: 1e-06]

-m, --t_mult <t_mult>¶: Multiply current restart period times this number given number of restarts. [default: 2.0]

-bce, --bce_loss¶: Use bce loss instead of MSE.

-bs, --batch_size <batch_size>¶: Batch size. [default: 50]

-vp, --val_pkl <val_pkl>¶: Validation Set Methylation Array Location. [default: ./train_val_test_sets/val_methyl_array.pkl]

-w, --n_workers <n_workers>¶: Number of workers. [default: 9]

-conv, --convolutional¶: Use convolutional VAE.

-hs, --height_kernel_sizes <height_kernel_sizes>¶: Heights of convolutional kernels.

-ws, --width_kernel_sizes <width_kernel_sizes>¶: Widths of convolutional kernels.

-v, --add_validation_set¶: Evaluate validation set.

-l, --loss_reduction <loss_reduction>¶: Type of reduction on loss function. [default: sum]

-hl, --hyperparameter_log <hyperparameter_log>¶: CSV file containing prior runs. [default: embeddings/embed_hyperparameters_log.csv]

-sc, --stratify_column <stratify_column>¶: Column to stratify samples on. [default: disease]

-j, --job_name <job_name>¶: Embedding job name. [default: embed_job]

methylnet-predict¶

methylnet-predict [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

classification_report¶

Generate classification report that gives results from classification tasks.

methylnet-predict classification_report [OPTIONS]

Options

-r, --results_pickle <results_pickle>¶: Results from training, validation, and testing. [default: predictions/results.p]

-o, --output_dir <output_dir>¶: Output directory. [default: results/]

-e, --categorical_encoder <categorical_encoder>¶: One hot encoder if categorical model. If path exists, then return top positive controbutions per samples of that class. Encoded values must be of sample class as interest_col. [default: ./predictions/one_hot_encoder.p]

-a, --average_mechanism <average_mechanism>¶: Output directory. [default: micro]

launch_hyperparameter_scan¶

Run randomized hyperparameter scan of neural network hyperparameters.

methylnet-predict launch_hyperparameter_scan [OPTIONS]

Options

-hcsv, --hyperparameter_input_csv <hyperparameter_input_csv>¶: CSV file containing hyperparameter inputs. [default: predictions/predict_hyperparameters_scan_input.csv]

-hl, --hyperparameter_output_log <hyperparameter_output_log>¶: CSV file containing prior runs. [default: predictions/predict_hyperparameters_log.csv]

-g, --generate_input¶: Generate hyperparameter input csv.

-c, --job_chunk_size <job_chunk_size>¶: If not series, chunk up and run these number of commands at once..

-ic, --interest_cols <interest_cols>¶: Column to stratify samples on.

-cat, --categorical¶: Whether to run categorical analysis or not.

-r, --reset_all¶: Run all jobs again.

-t, --torque¶: Submit jobs on torque.

-gpu, --gpu <gpu>¶: If torque submit, which gpu to use. [default: -1]

-gn, --gpu_node <gpu_node>¶: If torque submit, which gpu node to use. [default: -1]

-nh, --nohup¶: Nohup launch jobs.

-n, --n_jobs_relaunch <n_jobs_relaunch>¶: Relaunch n top jobs from previous run. [default: 0]

-cp, --crossover_p <crossover_p>¶: Rate of crossover between hyperparameters. [default: 0.0]

-mc, --model_complexity_factor <model_complexity_factor>¶: Degree of neural network model complexity for hyperparameter search. Search for less wide networks with a lower complexity value, bounded between 0 and infinity. [default: 1.0]

-j, --n_jobs <n_jobs>¶: Number of jobs to generate.

-v, --val_loss_column <val_loss_column>¶: Validation loss column.

-sft, --add_softmax¶: Add softmax for predicting probability distributions.

-a, --additional_command <additional_command>¶: Additional command to input for torque run.

-cu, --cuda¶: Use GPUs.

-grid, --hyperparameter_yaml <hyperparameter_yaml>¶: YAML file with custom subset of hyperparameter grid.

make_new_predictions¶

Run prediction model again to further assess outcome. Only evaluate prediction model.

methylnet-predict make_new_predictions [OPTIONS]

Options

-tp, --test_pkl <test_pkl>¶: Test database for beta and phenotype data. [default: ./train_val_test_sets/test_methyl_array.pkl]

-m, --model_pickle <model_pickle>¶: Pytorch model containing forward_predict method. [default: ./predictions/output_model.p]

-bs, --batch_size <batch_size>¶: Batch size. [default: 50]

-w, --n_workers <n_workers>¶: Number of workers. [default: 9]

-ic, --interest_cols <interest_cols>¶: Specify columns looking to make predictions on. [default: disease]

-cat, --categorical¶: Multi-class prediction. [default: False]

-c, --cuda¶: Use GPUs.

-e, --categorical_encoder <categorical_encoder>¶: One hot encoder if categorical model. If path exists, then return top positive controbutions per samples of that class. Encoded values must be of sample class as interest_col. [default: ./predictions/one_hot_encoder.p]

-o, --output_dir <output_dir>¶: Output directory for predictions. [default: ./new_predictions/]

make_prediction¶

Train prediction model by fine-tuning VAE and appending/training MLP to make classification/regression predictions on MethylationArrays.

methylnet-predict make_prediction [OPTIONS]

Options

-i, --train_pkl <train_pkl>¶: Input database for beta and phenotype data. [default: ./train_val_test_sets/train_methyl_array.pkl]

-tp, --test_pkl <test_pkl>¶: Test database for beta and phenotype data. [default: ./train_val_test_sets/test_methyl_array.pkl]

-vae, --input_vae_pkl <input_vae_pkl>¶: Trained VAE. [default: ./embeddings/output_model.p]

-o, --output_dir <output_dir>¶: Output directory for predictions. [default: ./predictions/]

-c, --cuda¶: Use GPUs.

-ic, --interest_cols <interest_cols>¶: Specify columns looking to make predictions on. [default: disease]

-cat, --categorical¶: Multi-class prediction. [default: False]

-do, --disease_only¶: Only look at disease, or text before subtype_delimiter.

-hlt, --hidden_layer_topology <hidden_layer_topology>¶: Topology of hidden layers, comma delimited, leave empty for one layer encoder, eg. 100,100 is example of 5-hidden layer topology. [default: ]

-lr_vae, --learning_rate_vae <learning_rate_vae>¶: Learning rate VAE. [default: 1e-05]

-lr_mlp, --learning_rate_mlp <learning_rate_mlp>¶: Learning rate MLP. [default: 0.001]

-wd, --weight_decay <weight_decay>¶: Weight decay of adam optimizer. [default: 0.0001]

-dp, --dropout_p <dropout_p>¶: Dropout Percentage. [default: 0.2]

-e, --n_epochs <n_epochs>¶: Number of epochs to train over. [default: 50]

-s, --scheduler <scheduler>¶: Type of learning rate scheduler. [default: null]

-d, --decay <decay>¶: Learning rate scheduler decay for exp selection. [default: 0.5]

-t, --t_max <t_max>¶: Number of epochs before cosine learning rate restart. [default: 10]

-eta, --eta_min <eta_min>¶: Minimum cosine LR. [default: 1e-06]

-m, --t_mult <t_mult>¶: Multiply current restart period times this number given number of restarts. [default: 2.0]

-bs, --batch_size <batch_size>¶: Batch size. [default: 50]

-vp, --val_pkl <val_pkl>¶: Validation Set Methylation Array Location. [default: ./train_val_test_sets/val_methyl_array.pkl]

-w, --n_workers <n_workers>¶: Number of workers. [default: 9]

-v, --add_validation_set¶: Evaluate validation set.

-l, --loss_reduction <loss_reduction>¶: Type of reduction on loss function. [default: sum]

-hl, --hyperparameter_log <hyperparameter_log>¶: CSV file containing prior runs. [default: predictions/predict_hyperparameters_log.csv]

-j, --job_name <job_name>¶: Embedding job name. [default: predict_job]

-sft, --add_softmax¶: Add softmax for predicting probability distributions. Experimental.

regression_report¶

Generate regression report that gives concise results from regression tasks.

methylnet-predict regression_report [OPTIONS]

Options

-r, --results_pickle <results_pickle>¶: Results from training, validation, and testing. [default: predictions/results.p]

-o, --output_dir <output_dir>¶: Output directory. [default: results/]

methylnet-interpret¶

methylnet-interpret [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

bin_regression_shaps¶

Take aggregate individual scores from regression results, that normally are not categorized, and aggregate across new category created from continuous data.

methylnet-interpret bin_regression_shaps [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-t, --test_pkl <test_pkl>¶: Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]

-c, --col <col>¶: Column to turn into bins. [default: age]

-n, --n_bins <n_bins>¶: Number of bins. [default: 10]

-ot, --output_test_pkl <output_test_pkl>¶: Binned shap pickle for further testing. [default: ./train_val_test_sets/test_methyl_array_shap_binned.pkl]

-os, --output_shap_pkl <output_shap_pkl>¶: Pickle containing top CpGs, binned phenotype. [default: ./interpretations/shapley_explanations/shapley_binned.p]

extract_methylation_array¶

Subset and write methylation array using top SHAP cpgs.

methylnet-interpret extract_methylation_array [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-co, --classes_only¶: Only take top CpGs from each class. [default: False]

-o, --output_dir <output_dir>¶: Output directory for methylation array. [default: ./interpretations/shapley_explanations/top_cpgs_extracted_methylarr/]

-t, --test_pkl <test_pkl>¶: Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]

-c, --col <col>¶: Column to color for output csvs. [default: ]

-g, --global_vals¶: Only take top CpGs globally. [default: False]

-n, --n_extract <n_extract>¶: Number cpgs to extract. [default: 1000]

grab_lola_db_cache¶

Download core and extended LOLA databases for enrichment tests.

methylnet-interpret grab_lola_db_cache [OPTIONS]

Options

-o, --output_dir <output_dir>¶: Output directory for lola dbs. [default: ./lola_db/]

interpret_biology¶

Interrogate CpGs with high SHAPley scores for individuals, classes, or overall, for enrichments, genes, GO, KEGG, LOLA, overlap with popular cpg sets, GSEA, find nearby cpgs to top.

methylnet-interpret interpret_biology [OPTIONS]

Options

-a, --all_cpgs_pickle <all_cpgs_pickle>¶: List of all cpgs used in shapley analysis. [default: ./interpretations/shapley_explanations/all_cpgs.p]

-s, --shapley_data_list <shapley_data_list>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-o, --output_dir <output_dir>¶: Output directory for interpretations. [default: ./interpretations/biological_explanations/]

-w, --analysis <analysis>¶: Choose biological analysis. [default: GO]

-n, --n_workers <n_workers>¶: Number workers. [default: 8]

-l, --lola_db <lola_db>¶: LOLA region db. [default: ./lola_db/core/nm/t1/resources/regions/LOLACore/hg19/]

-i, --individuals <individuals>¶: Individuals to evaluate. [default: ]

-c, --classes <classes>¶: Classes to evaluate. [default: ]

-m, --max_gap <max_gap>¶: Genomic distance to search for nearby CpGs than found top cpgs shapleys. [default: 1000]

-ci, --class_intersect¶: Compute class shapleys by intersection of individuals. [default: False]

-lo, --length_output <length_output>¶: Number enriched terms to print. [default: 20]

-ss, --set_subtraction¶: Only consider CpGs relevant to particular class. [default: False]

-int, --intersection¶: CpGs common to all classes. [default: False]

-col, --collections <collections>¶: Lola collections. [default: ]

-gsea, --gsea_analyses <gsea_analyses>¶: Gene set enrichment analysis to choose from, if chosen will override other analysis options. http://software.broadinstitute.org/gsea/msigdb/collections.jsp [default: ]

-gsp, --gsea_pickle <gsea_pickle>¶: Gene set enrichment analysis to choose from. [default: ./data/gsea_collections.p]

-d, --depletion¶: Run depletion LOLA analysis instead of enrichment. [default: False]

-ov, --overlap_test¶: Run overlap test with IDOL library instead of all other tests. [default: False]

-cgs, --cpg_set <cpg_set>¶: Test for clock or IDOL enrichments. [default: IDOL]

-g, --top_global¶: Look at global top cpgs, overwrites classes and individuals. [default: False]

-ex, --extract_library¶: Extract a library of cpgs to subset future array for inspection or prediction tests. [default: False]

interpret_embedding_classes¶

Compare average distance between classes in embedding space.

methylnet-interpret interpret_embedding_classes [OPTIONS]

Options

-i, --embedding_methyl_array_pkl <embedding_methyl_array_pkl>¶: Use ./predictions/vae_mlp_methyl_arr.pkl or ./embeddings/vae_methyl_arr.pkl for vae interpretations. [default: ./embeddings/vae_methyl_arr.pkl]

-c, --pheno_col <pheno_col>¶: Column to separate on. [default: disease]

-m, --metric <metric>¶: Distance metric to compare classes. [default: cosine]

-t, --trim <trim>¶: Trim outlier distances. Number 0-0.5. [default: 0.0]

-o, --output_csv <output_csv>¶: Distances between classes. [default: ./results/class_embedding_differences.csv]

-op, --output_pval_csv <output_pval_csv>¶: If specify a CSV file, outputs pairwise manova tests between embeddings for clusters. [default: ]

list_classes¶

List classes/multioutput regression cell-types that have SHAPley data to be interrogated.

methylnet-interpret list_classes [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

list_individuals¶

List individuals that have ShapleyData. Not all made the cut from the test dataset.

methylnet-interpret list_individuals [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

order_results_by_col¶

Order results that produces some CSV by phenotype column, alphabetical ordering to show maybe that results group together. Plot using pymethyl-visualize.

methylnet-interpret order_results_by_col [OPTIONS]

Options

-i, --input_csv <input_csv>¶: Output directory for cpg jaccard_stats. [default: ./interpretations/shapley_explanations/top_cpgs_jaccard/all_jaccard.csv]

-t, --test_pkl <test_pkl>¶: Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]

-c, --col <col>¶: Column to sort on. [default: disease]

-o, --output_csv <output_csv>¶: Output directory for cpg jaccard_stats. [default: ./interpretations/shapley_explanations/top_cpgs_jaccard/all_jaccard_sorted.csv]

-sym, --symmetric¶: Is symmetric? [default: False]

plot_all_lola_outputs¶

Iterate through all LOLA csv results and plot forest plots of all enrichment scores.

methylnet-interpret plot_all_lola_outputs [OPTIONS]

Options

-l, --head_lola_dir <head_lola_dir>¶: Location of lola output csvs. [default: interpretations/biological_explanations/]

-o, --plot_output_dir <plot_output_dir>¶: Output directory for interpretations. [default: ./interpretations/biological_explanations/lola_plots/]

-d, --description_col <description_col>¶: Description column for categorical variables [default: description]

-c, --cell_types <cell_types>¶: Cell types. [default: ]

-ov, --overwrite¶: Overwrite existing plots. [default: False]

plot_lola_output¶

Plot LOLA results via forest plot for one csv file.

methylnet-interpret plot_lola_output [OPTIONS]

Options

-l, --lola_csv <lola_csv>¶: Location of lola output csv. [default: ]

-o, --plot_output_dir <plot_output_dir>¶: Output directory for interpretations. [default: ./interpretations/biological_explanations/lola_plots/]

-d, --description_col <description_col>¶: Description column for categorical variables [default: description]

-c, --cell_types <cell_types>¶: Cell types. [default: ]

plot_top_cpgs¶

Plot the top cpgs locations in the genome using circos.

methylnet-interpret plot_top_cpgs [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-o, --output_dir <output_dir>¶: Output directory for output plots. [default: ./interpretations/output_plots/]

-i, --individuals <individuals>¶: Individuals to evaluate. [default: ]

-c, --classes <classes>¶: Classes to evaluate. [default: ]

produce_shapley_data¶

Explanations (coefficients for CpGs) for every individual prediction. Produce SHAPley scores for each CpG for each individualized prediction, and then aggregate them across coarser classes. Store CpGs with top SHAPley scores as well for quick access. Store in Shapley data object.

methylnet-interpret produce_shapley_data [OPTIONS]

Options

-i, --train_pkl <train_pkl>¶: Input database for beta and phenotype data. Use ./predictions/vae_mlp_methyl_arr.pkl or ./embeddings/vae_methyl_arr.pkl for vae interpretations. [default: ./train_val_test_sets/train_methyl_array.pkl]

-v, --val_pkl <val_pkl>¶: Val database for beta and phenotype data. Use ./predictions/vae_mlp_methyl_arr.pkl or ./embeddings/vae_methyl_arr.pkl for vae interpretations. [default: ./train_val_test_sets/val_methyl_array.pkl]

-t, --test_pkl <test_pkl>¶: Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]

-m, --model_pickle <model_pickle>¶: Pytorch model containing forward_predict method. [default: ./predictions/output_model.p]

-w, --n_workers <n_workers>¶: Number of workers. [default: 9]

-bs, --batch_size <batch_size>¶: Batch size. [default: 512]

-c, --cuda¶: Use GPUs.

-ns, --n_samples <n_samples>¶: Number of samples for SHAP output. [default: 200]

-nf, --n_top_features <n_top_features>¶: Top features to select for shap outputs. If feature selection, instead use this number to choose number features that make up top global score. [default: 20]

-o, --output_dir <output_dir>¶: Output directory for interpretations. [default: ./interpretations/shapley_explanations/]

-mth, --method <method>¶: Explainer type. [default: kernel]

-ssbs, --shap_sample_batch_size <shap_sample_batch_size>¶: Break up shapley computations into batches. Set to 0 for only 1 batch. [default: 0]

-r, --n_random_representative <n_random_representative>¶: Number of representative samples to choose for background selection. Is this number for regression models, or this number per class for classification problems. [default: 0]

-col, --interest_col <interest_col>¶: Column of interest for sample selection for explainer training. [default: disease]

-rt, --n_random_representative_test <n_random_representative_test>¶: Number of representative samples to choose for test set. Is this number for regression models, or this number per class for classification problems. [default: 0]

-e, --categorical_encoder <categorical_encoder>¶: One hot encoder if categorical model. If path exists, then return top positive controbutions per samples of that class. Encoded values must be of sample class as interest_col. [default: ./predictions/one_hot_encoder.p]

-cc, --cross_class¶: Find importance of features for prediction across classes.

-fs, --feature_selection¶: Perform feature selection using top global SHAP scores. [default: False]

-top, --top_outputs <top_outputs>¶: Get shapley values for fewer outputs if feature selection. [default: 0]

-vae, --vae_interpret¶: Use model to get shapley values for VAE latent dimensions, only works if proper datasets are set. [default: False]

-cl, --pred_class <pred_class>¶: Prediction class top cpgs. [default: ]

-r, --results_csv <results_csv>¶: Remove all misclassifications. [default: ./predictions/results.csv]

-ind, --individual <individual>¶: One individual top cpgs. [default: ]

-rc, --residual_cutoff <residual_cutoff>¶: Activate for regression interpretations. Standard deviation in residuals before removing. [default: 0.0]

-cn, --cell_names <cell_names>¶: Multioutput names for multi-output regression. [default: ]

-dsd, --dump_shap_data¶: Dump raw SHAP data to numpy file, warning, may be large file. [default: False]

reduce_top_cpgs¶

Reduce set of top cpgs.

methylnet-interpret reduce_top_cpgs [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-nf, --n_top_features <n_top_features>¶: Top features to select for shap outputs. [default: 500]

-o, --output_pkl <output_pkl>¶: Pickle containing top CpGs, reduced number. [default: ./interpretations/shapley_explanations/shapley_reduced_data.p]

regenerate_top_cpgs¶

Increase size of Top CpGs using the original SHAP scores.

methylnet-interpret regenerate_top_cpgs [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-nf, --n_top_features <n_top_features>¶: Top features to select for shap outputs. [default: 500]

-o, --output_pkl <output_pkl>¶: Pickle containing top CpGs, reduced number. [default: ./interpretations/shapley_explanations/shapley_reduced_data.p]

-a, --abs_val¶: Top CpGs found using absolute value.

-n, --neg_val¶: Return top CpGs that are making negative contributions.

return_shap_values¶

Return matrix of shapley values per class, with option to, classes/individuals are columns, CpGs are rows, option to plot multiple histograms/density plots.

methylnet-interpret return_shap_values [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-o, --output_dir <output_dir>¶: Output directory for output plots. [default: ./interpretations/shap_outputs/]

-i, --individuals <individuals>¶: Individuals to evaluate. [default: ]

-c, --classes <classes>¶: Classes to evaluate. [default: ]

-hist, --output_histogram¶: Whether to output a histogram for each class/individual of their SHAP scores.

-abs, --absolute¶: Use sums of absolute values in making computations.

-log, --log_transform¶: Log transform final results for plotting.

shapley_jaccard¶

Plot Shapley Jaccard Similarity Matrix to demonstrate sharing of Top CpGs between classes/groupings of individuals.

methylnet-interpret shapley_jaccard [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-c, --class_names <class_names>¶: Class names. [default: ]

-o, --output_dir <output_dir>¶: Output directory for cpg jaccard_stats. [default: ./interpretations/shapley_explanations/top_cpgs_jaccard/]

-ov, --overall¶: Output overall similarity. [default: False]

-i, --include_individuals¶: Output individuals. [default: False]

-co, --cooccurrence¶: Output cooccurrence instead jaccard. [default: False]

-opt, --optimize_n_cpgs¶: Search for number of top CpGs to use. [default: False]

split_hyper_hypo_methylation¶

Split SHAPleyData object by methylation type (needs to change; but low methylation is 0.5 beta value and below) and output to file.

methylnet-interpret split_hyper_hypo_methylation [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-o, --output_dir <output_dir>¶: Output directory for hypo/hyper shap data. [default: ./interpretations/shapley_explanations/shapley_data_by_methylation/]

-t, --test_pkl <test_pkl>¶: Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]

-thr, --threshold <threshold>¶: Pickle containing testing set. [default: original]

view_methylation_top_cpgs¶

Write the Top CpGs for each class/individuals to file along with methylation information.

methylnet-interpret view_methylation_top_cpgs [OPTIONS]

Options

-s, --shapley_data <shapley_data>¶: Pickle containing top CpGs. [default: ./interpretations/shapley_explanations/shapley_data.p]

-i, --individuals <individuals>¶: Individuals. [default: ]

-o, --output_dir <output_dir>¶: Output directory for cpg methylation shap. [default: ./interpretations/shapley_explanations/top_cpgs_methylation/]

-t, --test_pkl <test_pkl>¶: Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]

methylnet-visualize¶

methylnet-visualize [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

plot_roc_curve¶

Plot ROC Curves from classification tasks; requires classification_report be run first.

methylnet-visualize plot_roc_curve [OPTIONS]

Options

-r, --roc_curve_csv <roc_curve_csv>¶: Weighted ROC Curve. [default: results/Weighted_ROC.csv]

-o, --outputfilename <outputfilename>¶: Output image. [default: results/roc_curve.png]

plot_training_curve¶

Plot training curves as output from either the VAE training or VAE MLP.

methylnet-visualize plot_training_curve [OPTIONS]

Options

-t, --training_curve_file <training_curve_file>¶: Training and validation loss and learning curves. [default: predictions/training_val_curve.p]

-o, --outputfilename <outputfilename>¶: Output image. [default: results/training_curve.png]

-vae, --vae_train¶: Plot VAE Training curves.

-thr, --threshold <threshold>¶: Loss values get rid of greater than this.

methylnet-torque¶

methylnet-torque [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

run_torque_job¶

Run torque job.

methylnet-torque run_torque_job [OPTIONS]

Options

-c, --command <command>¶: Command to execute through torque. [default: ]

-gpu, --use_gpu¶: Specify whether to use GPUs. [default: False]

-a, --additions <additions>¶: Additional commands to add, one liner for now. [default: ]

-q, --queue <queue>¶: Queue for torque submission, gpuq also a good one if using GPUs. [default: default]

-t, --time <time>¶: Walltime in hours for job. [default: 1]

-n, --ngpu <ngpu>¶: Number of gpus to request. [default: 0]

methylnet-test¶

methylnet-test [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

test_pipeline¶

methylnet-test test_pipeline [OPTIONS]

Welcome to MethylNet’s documentation!¶

datasets.py¶

hyperparameter_scans.py¶

torque_jobs.py¶

interpretation_classes.py¶

models.py¶

plotter.py¶

schedulers.py¶

methylnet-embed¶

launch_hyperparameter_scan¶

perform_embedding¶

methylnet-predict¶

classification_report¶

launch_hyperparameter_scan¶

make_new_predictions¶

make_prediction¶

regression_report¶

methylnet-interpret¶

bin_regression_shaps¶

extract_methylation_array¶

grab_lola_db_cache¶

interpret_biology¶

interpret_embedding_classes¶

list_classes¶

list_individuals¶

order_results_by_col¶

plot_all_lola_outputs¶

plot_lola_output¶

plot_top_cpgs¶

produce_shapley_data¶

reduce_top_cpgs¶

regenerate_top_cpgs¶

return_shap_values¶

shapley_jaccard¶

split_hyper_hypo_methylation¶

view_methylation_top_cpgs¶

methylnet-visualize¶

plot_roc_curve¶

plot_training_curve¶

methylnet-torque¶

run_torque_job¶

methylnet-test¶

test_pipeline¶

Example Usage¶

Indices and tables¶