Welcome to PyMethylProcess’s documentation!

https://github.com/Christensen-Lab-Dartmouth/PyMethylProcess

To get started, install pymethylprocess using Docker (joshualevy44/pymethylprocess) or pip (pymethylprocess), then run pymethyl-install_r_dependencies.

There is both an API and CLI available for use. Examples for CLI usage can be found in ./example_scripts.
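As a rough orientation for the CLI, the sketch below assembles one possible sequence of calls from the subcommands and flags documented further down this page. It only prints the command lines and does not run them. The GEO accession, the clinical CSV filename, the value "geo" for --source_type, and the choice and order of steps are illustrative assumptions, not a prescribed workflow.

```python
import shlex

GEO_ACCESSION = "GSE00000"  # hypothetical placeholder accession
CLINICAL_CSV = "./geo_idats/clinical_info.csv"  # hypothetical filename of the downloaded clinical sheet

# Each entry is one CLI call; flags come from the option lists on this page.
commands = [
    ["pymethyl-preprocess", "download_geo", "-g", GEO_ACCESSION, "-o", "./geo_idats/"],
    ["pymethyl-preprocess", "create_sample_sheet", "-s", "geo",  # "geo" value is assumed
     "-i", "./geo_idats/", "-is", CLINICAL_CSV, "-os", "./geo_idats/minfiSheet.csv"],
    ["pymethyl-preprocess", "preprocess_pipeline", "-i", "./geo_idats/", "-n", "6",
     "-o", "./preprocess_outputs/methyl_array.pkl"],
    ["pymethyl-preprocess", "imputation_pipeline",
     "-i", "./preprocess_outputs/methyl_array.pkl",
     "-o", "./imputed_outputs/methyl_array.pkl"],
    ["pymethyl-preprocess", "feature_select",
     "-i", "./imputed_outputs/methyl_array.pkl",
     "-o", "./final_preprocessed/methyl_array.pkl", "-n", "300000"],
]

# Quote each argument safely and print one shell-ready line per step.
command_lines = [shlex.join(cmd) for cmd in commands]
for line in command_lines:
    print(line)
```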

[Figures: workflow diagrams for Download, Format, Preprocess, Visualize, and TrainTestSplit]

PreProcessDataTypes.py

Contains datatypes core to downloading IDATs, preprocessing IDATs and samplesheets.

class pymethylprocess.PreProcessDataTypes.PreProcessIDAT(idat_dir, minfi=None, enmix=None, base=None, meffil=None)[source]

Class that will preprocess IDATs using R pipelines.

idat_dir
Location of idats or samplesheet csv.
minfi
Rpy2 importr minfi library; if None (default), it is loaded through rpy2.
enmix
Rpy2 importr enmix library; if None (default), it is loaded through rpy2.
base
Rpy2 importr base library; if None (default), it is loaded through rpy2.
meffil
Rpy2 importr meffil library; if None (default), it is loaded through rpy2.
export_csv(output_dir)[source]

Export pheno and beta dataframes to CSVs

output_dir
Where to store csvs.
export_pickle(output_pickle, disease='')[source]

Export pheno and beta dataframes to pickle, stored in python dict that can be loaded into MethylationArray

output_pickle
Where to store MethylationArray.
disease
Custom naming scheme for data.
export_sql(output_db, disease='')[source]

Export pheno and beta dataframes to SQL

output_db
Where to store data, sqlite db.
disease
Custom naming scheme for data.
extract_manifest()[source]

Get manifest from RGSet.

extract_pheno_data(methylset=False)[source]

Extract pheno data from MSet or RGSet, minfi.

methylset
If MSet has been created, set to True, else extract from original RGSet.
filter_beta()[source]

After creating beta, filter out outliers.

get_beta()[source]

Get beta value matrix from minfi after finding RSet.

get_meth()[source]

Get methylation intensity matrix from MSet

get_unmeth()[source]

Get unmethylated intensity matrix from MSet

load_idats()[source]

For minfi pipeline, load IDATs from specified idat_dir.

move_jpg()[source]

Move jpeg files from current working directory to the idat directory.

output_pheno_beta(meffil=False)[source]

Get pheno and beta dataframe objects stored as attributes for input to MethylationArray object.

meffil
True if ran meffil pipeline.
plot_original_qc(output_dir)[source]

Plot QC results from ENmix pipeline and possible minfi. Still experimental.

output_dir
Where to store plots.
plot_qc_metrics(output_dir)[source]

Plot QC results from ENmix pipeline and possible minfi. Still experimental.

output_dir
Where to store plots.
preprocessENmix(n_cores=6)[source]

Run ENmix preprocessing pipeline.

n_cores
Number of CPUs to use.
preprocessMeffil(n_cores=6, n_pcs=4, qc_report_fname='qc/report.html', normalization_report_fname='norm/report.html', pc_plot_fname='qc/pc_plot.pdf', useCache=True, qc_only=True, qc_parameters={'p.beadnum.cpgs': 0.1, 'p.beadnum.samples': 0.1, 'p.detection.cpgs': 0.1, 'p.detection.samples': 0.1}, rm_sex=False)[source]

Run meffil preprocessing pipeline with functional normalization.

n_cores
Number of CPUs to use.
n_pcs
Number of principal components to use for functional normalization, set to -1 to autoselect via kneedle algorithm.
qc_report_fname
HTML filename to store QC report.
normalization_report_fname
HTML filename to store normalization report
pc_plot_fname
PDF file to store principal components plot.
useCache
Use saved QC objects instead of running through QC again.
qc_only
Perform QC, then save and quit before normalization.
qc_parameters
Python dictionary with parameters for qc.
rm_sex
Remove non-autosomal cpgs?
preprocessNoob()[source]

Run minfi preprocessing with Noob normalization

preprocessRAW()[source]

Run minfi preprocessing with RAW normalization

preprocess_enmix_pipeline(n_cores=6, pipeline='enmix', noob=False, qc_only=False, use_cache=False)[source]

Run complete ENmix or minfi preprocessing pipeline.

n_cores
Number CPUs.
pipeline
Run enmix or minfi
noob
Noob norm or RAW if minfi running.
qc_only
Save and quit after only running QC?
use_cache
Load preexisting RGSet instead of running QC again.
return_beta()[source]

Return minfi RSet after having created MSet.

to_methyl_array(disease='')[source]

Convert results from preprocessing into MethylationArray, and directly return MethylationArray object.

disease
Custom naming scheme for data.
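A minimal sketch of how the PreProcessIDAT methods listed above might be chained from Python. It uses only methods and keyword arguments documented here; the order of calls (pipeline, then output_pheno_beta, then export_pickle) is an assumption, and the function name run_enmix_preprocessing is hypothetical. Running it requires a working R and rpy2 setup, so the import is deferred into the function body.

```python
def run_enmix_preprocessing(idat_dir, output_pkl, n_cores=6):
    """Preprocess IDATs with the ENmix pipeline and save the result as a pickle."""
    # Imported lazily so this function can be defined without R/rpy2 configured.
    from pymethylprocess.PreProcessDataTypes import PreProcessIDAT

    preprocesser = PreProcessIDAT(idat_dir)
    # Run the complete ENmix pipeline; pipeline="minfi" with noob=True
    # would select minfi with Noob normalization instead.
    preprocesser.preprocess_enmix_pipeline(n_cores=n_cores, pipeline="enmix")
    # Assumption: pheno and beta dataframes are collected before exporting.
    preprocesser.output_pheno_beta(meffil=False)
    preprocesser.export_pickle(output_pkl)
    return preprocesser

# Example call (not executed here):
# run_enmix_preprocessing("./tcga_idats/", "./preprocess_outputs/methyl_array.pkl")
```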
class pymethylprocess.PreProcessDataTypes.PreProcessPhenoData(pheno_sheet, idat_dir, header_line=0)[source]

Class that will manipulate phenotype samplesheet before preprocessing of IDATs.

pheno_sheet
Location of clinical info csv.
idat_dir
Location of idats
header_line
Where to start reading clinical csv
concat(other_formatted_sheet)[source]

Concat multiple PreProcessPhenoData objects, concat their dataframes to accept more than one samplesheet/dataset.

other_formatted_sheet
Other PreProcessPhenoData to concat.
export(output_sheet_name)[source]

Export pheno data to csv after done with manipulation.

output_sheet_name
Output csv name.
format_custom(basename_col, disease_class_column, include_columns={})[source]

Custom format clinical sheet if user supplied idats.

basename_col
Column name of sample names.
disease_class_column
Disease column of clinical info csv.
include_columns
Dictionary specifying other columns to include, and new names to assign them to.
format_geo(disease_class_column='methylation class:ch1', include_columns={})[source]

Format clinical sheets if downloaded geo idats.

disease_class_column
Disease column of clinical info csv.
include_columns
Dictionary specifying other columns to include, and new names to assign them to.
format_tcga(mapping_file='idat_filename_case.txt')[source]

Format clinical sheets if downloaded tcga idats.

mapping_file
Maps uuids to proper tcga sample names, should be downloaded with tcga clinical information.
get_categorical_distribution(key, disease_only=False, subtype_delimiter=', ')[source]

Print categorical distribution, counts for each unique value in phenotype column.

key
Phenotype Column.
disease_only
Whether to split phenotype column entries by delimiter.
subtype_delimiter
Subtype delimiter to split on.
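The effect of disease_only and subtype_delimiter can be illustrated with a small standalone sketch. This is not the library's implementation; it assumes the phenotype column is a plain list of strings and that disease_only keeps the text before the first delimiter.

```python
from collections import Counter

def categorical_distribution(values, disease_only=False, subtype_delimiter=","):
    """Count each unique value, optionally keeping only the text before the delimiter."""
    if disease_only:
        values = [v.split(subtype_delimiter)[0] for v in values]
    return Counter(values)

diseases = ["S1,s2", "S1,s3", "S2,s1", "S1,s2"]  # toy phenotype column
full_counts = categorical_distribution(diseases)
superclass_counts = categorical_distribution(diseases, disease_only=True)
print(dict(full_counts))
print(dict(superclass_counts))
```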
merge(other_formatted_sheet, use_second_sheet_disease=True, no_disease_merge=False)[source]

Merge multiple PreProcessPhenoData objects, merge their dataframes to accept more than one samplesheet/dataset or add more pheno info.

other_formatted_sheet
Other PreProcessPhenoData to merge.
use_second_sheet_disease
Change disease column to that of second sheet instead of first.
no_disease_merge
Keep both disease columns from both sheets.
remove_diseases(exclude_disease_list, low_count, disease_only, subtype_delimiter)[source]

Remove samples with certain diseases from disease column.

exclude_disease_list
List containing diseases to remove.
low_count
Remove samples that have fewer than x disease occurrences in column.
disease_only
Whether to split phenotype column entries by delimiter.
subtype_delimiter
Subtype delimiter to split on.
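A rough standalone sketch of the filtering described for remove_diseases, assuming samples are (name, disease) pairs and that occurrence counts are taken before the exclusion list is applied. Both are assumptions for illustration, not the library's actual logic.

```python
from collections import Counter

def remove_diseases(samples, exclude_disease_list, low_count=0):
    """Drop samples whose disease is excluded or occurs fewer than low_count times."""
    counts = Counter(disease for _, disease in samples)
    return [
        (name, disease)
        for name, disease in samples
        if disease not in exclude_disease_list and counts[disease] >= low_count
    ]

samples = [("a", "glioma"), ("b", "glioma"), ("c", "control"), ("d", "rare")]
kept = remove_diseases(samples, exclude_disease_list=["control"], low_count=2)
print(kept)
```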
split_key(key, subtype_delimiter)[source]

Split pheno column by key, with subtype delimiter, e.g. entry S1,s2 -> S1 with delimiter “,”.

key
Pheno column name.
subtype_delimiter
Subtype delimiter to split on.
class pymethylprocess.PreProcessDataTypes.TCGADownloader[source]

Downloads TCGA and GEO IDAT and clinical data

download_clinical(output_dir)[source]

Download TCGA Clinical Data.

output_dir
Where to output clinical data csv.
download_geo(query, output_dir)[source]

Download GEO IDATs.

query
GEO accession number to query, must be 450k/850k.
output_dir
Output directory to store idats and clinical information csv
download_tcga(output_dir)[source]

Download TCGA IDATs.

output_dir
Where to output idat files.

MethylationDataTypes.py

Contains datatypes core to storing beta and phenotype methylation data, and imputation.

class pymethylprocess.MethylationDataTypes.ImputerObject(solver, method, opts={})[source]

Class that stores and accesses different types of imputers. Construct sklearn-like imputer given certain input arguments.

solver
Library for imputation, eg. sklearn, fancyimpute.
method
Imputation method in library, named.
opts
Additional options to assign to imputer.
return_imputer()[source]

Return initialized sklearn-like imputer.

class pymethylprocess.MethylationDataTypes.MethylationArray(pheno_df, beta_df, name='')[source]

Stores beta and phenotype information and performs various operations. Initialize MethylationArray object by inputting dataframe of phenotypes and dataframe of beta values with samples as index.

pheno_df
Phenotype dataframe (samples x covariates)
beta_df
Beta Values Dataframe (samples x cpgs)
bin_column(col, n_bins)[source]

Turn continuous variable/covariate into categorical bins. Returns name of new column and updates phenotype matrix to reflect this change.

col
Continuous column of phenotype array to bin.
n_bins
Number of bins to create.
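One way to picture bin_column is equal-width binning of a continuous covariate. The sketch below assumes equal-width bins, a list-of-dicts phenotype table, and a hypothetical naming scheme for the new column; the library may bin and name differently.

```python
def bin_column(pheno, col, n_bins):
    """Add an equal-width binned copy of a continuous column; return the new column name."""
    values = [row[col] for row in pheno]
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    new_col = f"{col}_binned"  # hypothetical naming scheme
    for row in pheno:
        # Clamp so the maximum value lands in the last bin; constant columns get bin 0.
        idx = min(int((row[col] - lo) / width), n_bins - 1) if width else 0
        row[new_col] = idx
    return new_col

pheno = [{"age": 20}, {"age": 35}, {"age": 50}, {"age": 80}]
new_col = bin_column(pheno, "age", n_bins=3)
bins = [row[new_col] for row in pheno]
print(new_col, bins)
```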
categorical_breakdown(key)[source]

Print categorical distribution, counts for each unique value in phenotype column.

key
Phenotype Column.
feature_select(n_top_cpgs, feature_selection_method='mad', metric='correlation', nn=10)[source]

Perform unsupervised feature selection on MethylationArray.

n_top_cpgs
Number of CpGs to retain.
feature_selection_method
Method to perform selection.
metric
If considering structural feature selection like SPEC, use this distance metric.
nn
Number of nearest neighbors.
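For the default 'mad' method, the idea of keeping the CpGs with the highest mean absolute deviation can be sketched in plain Python. The data layout (a dict from CpG name to beta values across samples) and the CpG names are toy assumptions; the library operates on dataframes.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    center = mean(values)
    return mean(abs(v - center) for v in values)

def select_top_cpgs(beta_by_cpg, n_top_cpgs):
    """Rank CpGs by mean absolute deviation and keep the n_top_cpgs most variable."""
    ranked = sorted(
        beta_by_cpg,
        key=lambda cpg: mean_absolute_deviation(beta_by_cpg[cpg]),
        reverse=True,
    )
    return ranked[:n_top_cpgs]

beta_by_cpg = {
    "cg01": [0.10, 0.90, 0.50],  # highly variable
    "cg02": [0.50, 0.50, 0.50],  # constant
    "cg03": [0.40, 0.60, 0.50],  # mildly variable
}
top = select_top_cpgs(beta_by_cpg, n_top_cpgs=2)
print(top)
```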
classmethod from_pickle(input_pickle)[source]

Load MethylationArray stored in pickle.

Usage: MethylationArray.from_pickle([input_pickle])

input_pickle
Stored MethylationArray pickle.
groupby(key)[source]

Groupby for Methylation Array. Returns generator of methylation arrays grouped by key.

key
Phenotype column to group by.
impute(imputer)[source]

Perform imputation on NaN beta values. Input imputer returned from ImputerObject.

imputer
Type of imputer object, in sklearn type interface.
merge_preprocess_sheet(preprocess_sample_df)[source]

Feed in another phenotype dataframe that will be merged with existing phenotype array.

preprocess_sample_df
New phenotype dataframe.
overwrite_pheno_data(preprocess_sample_df)[source]

Feed in another phenotype dataframe that will overwrite overlapping keys of existing phenotype array.

preprocess_sample_df
New phenotype dataframe.
remove_missingness(cpg_threshold=None, sample_threshold=None)[source]

Remove samples and CpGs with a certain level of missingness.

cpg_threshold
If more than fraction of Samples for this CpG are missing, remove cpg.
sample_threshold
If more than fraction of CpGs for this sample are missing, remove sample.
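The two thresholds can be illustrated with a small sketch on a nested dict (sample -> CpG -> value, with None for missing). Filtering CpGs before samples is an assumption here, as is the data layout; the library works on dataframes.

```python
def remove_missingness(beta, cpg_threshold=None, sample_threshold=None):
    """beta maps sample -> {cpg: value or None}. Returns a filtered copy."""
    samples = list(beta)
    cpgs = list(next(iter(beta.values()))) if beta else []
    if cpg_threshold is not None and samples:
        # Drop CpGs missing in more than cpg_threshold of the samples.
        cpgs = [
            c for c in cpgs
            if sum(beta[s][c] is None for s in samples) / len(samples) <= cpg_threshold
        ]
    if sample_threshold is not None and cpgs:
        # Drop samples missing more than sample_threshold of the remaining CpGs.
        samples = [
            s for s in samples
            if sum(beta[s][c] is None for c in cpgs) / len(cpgs) <= sample_threshold
        ]
    return {s: {c: beta[s][c] for c in cpgs} for s in samples}

beta = {
    "s1": {"cg01": 0.1, "cg02": None, "cg03": 0.3},
    "s2": {"cg01": 0.2, "cg02": None, "cg03": None},
    "s3": {"cg01": 0.3, "cg02": 0.5, "cg03": 0.7},
}
filtered = remove_missingness(beta, cpg_threshold=0.5, sample_threshold=0.25)
print(filtered)
```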
remove_na_samples(outcome_cols)[source]

Remove samples of MethylationArray who have missing values in phenotype column.

outcome_cols
Phenotype columns, if any rows contain missing values, samples are removed.
remove_whitespace(key)[source]

Remove whitespaces from phenotype column.

key
Phenotype column.
return_cpgs()[source]

Return list of cpgs of MethylationArray

return_idx()[source]

Return sample names of MethylationArray.

return_raw_beta_array()[source]

Return numpy array of methylation beta values.

return_shape()[source]

Return dimensionality and number of samples of beta matrix.

split_by_subtype(disease_only, subtype_delimiter)[source]

Split MethylationArray into generator of MethylationArrays by phenotype column. Much akin to groupby. Only splits from disease column.

disease_only
Consider disease superclass.
subtype_delimiter
How to break up disease column if using disease_only.
split_key(key, subtype_delimiter)[source]

Manipulate an entire phenotype column, splitting each element up by some delimiter.

key
Phenotype column.
subtype_delimiter
How to break up strings in columns. S1,s2 -> S1 for instance.
split_train_test(train_p=0.8, stratified=True, disease_only=False, key='disease', subtype_delimiter=', ', val_p=0.0)[source]

Split MethylationArray into training and test sets, with option to stratify by categorical covariate.

train_p
Fraction of methylation array to use as training set.
stratified
Whether to stratify by categorical variable.
disease_only
Consider disease superclass by some delimiter. For instance if disease is S1,s2, superclass would be S1.
key
Column to stratify on.
subtype_delimiter
How to split disease column into super/subclass.
val_p
If set greater than 0, will create additional validation set, fraction of which is broken off from training set.
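How train_p, val_p, and stratification interact can be sketched as follows. This is a standalone illustration with an assumed input (a dict from sample name to class label) and an assumed rounding rule, not the library's code; the seed parameter is hypothetical and only makes the toy example repeatable.

```python
import random
from collections import defaultdict

def split_train_test(labels, train_p=0.8, val_p=0.0, seed=42):
    """labels maps sample -> class. Returns (train, test, val) lists of sample names."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for sample, label in labels.items():
        by_class[label].append(sample)
    train, test, val = [], [], []
    for members in by_class.values():
        rng.shuffle(members)
        n_train = int(round(train_p * len(members)))
        class_train, class_test = members[:n_train], members[n_train:]
        # The validation set is broken off from the training portion.
        n_val = int(round(val_p * len(class_train)))
        val.extend(class_train[:n_val])
        train.extend(class_train[n_val:])
        test.extend(class_test)
    return train, test, val

labels = {f"s{i}": ("case" if i < 10 else "control") for i in range(20)}
train, test, val = split_train_test(labels, train_p=0.8, val_p=0.25)
print(len(train), len(test), len(val))
```

Splitting within each class keeps the class proportions the same in all three sets, which is the point of stratifying.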
subsample(key='disease', n_samples=None, frac=None, categorical=False)[source]

Subsample MethylationArray, make the set randomly smaller.

key
If stratifying, use this column of pheno array.
n_samples
Number of samples to consider overall, or per stratum.
frac
Alternative to n_samples, where x frac of array or stratum is considered.
categorical
Whether to stratify by column.
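A rough sketch of the subsampling options, assuming a dict from sample name to class label and that n_samples takes precedence over frac when both are given. The seed parameter is hypothetical and only makes the toy example repeatable.

```python
import random
from collections import defaultdict

def subsample(labels, n_samples=None, frac=None, categorical=False, seed=0):
    """labels maps sample name -> class label. Returns the sampled names."""
    if n_samples is None and frac is None:
        raise ValueError("provide n_samples or frac")
    rng = random.Random(seed)

    def draw(names):
        # Draw either a fixed number or a fraction of the given names.
        k = n_samples if n_samples is not None else int(round(frac * len(names)))
        return rng.sample(names, min(k, len(names)))

    if not categorical:
        return draw(list(labels))
    # Stratified case: draw separately within each class.
    strata = defaultdict(list)
    for name, label in labels.items():
        strata[label].append(name)
    return [name for members in strata.values() for name in draw(members)]

labels = {f"s{i}": ("case" if i < 6 else "control") for i in range(10)}
overall = subsample(labels, frac=0.5)
per_class = subsample(labels, n_samples=2, categorical=True)
print(overall, per_class)
```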
subset_cpgs(cpgs)[source]

Subset beta matrix by list of CpGs.

cpgs
CpGs to subset by.
subset_index(index)[source]

Subset MethylationArray by samples.

index
Sample names to subset by.
write_csvs(output_dir)[source]

Write phenotype data and beta values to csvs.

output_dir
Directory to output csv files.
write_db(conn, disease='')[source]

Store phenotype data and beta values in SQL database.

conn
SQLite connection.
disease
Create new tables in db that are related to disease state by this name.
write_pickle(output_pickle, disease='')[source]

Store phenotype data and beta values in pickle file. Is default file format for storing MethylationArray objects.

output_pickle
Pickle file to store MethylationArray data.
class pymethylprocess.MethylationDataTypes.MethylationArrays(list_methylation_arrays)[source]

A list of methylation arrays, with memory-efficient methods that operate on them. Initialize with a list of methylation arrays; the list can optionally be empty or contain a single element.

list_methylation_arrays
List of methylation arrays.
combine(array_generator=None)[source]

Combine the list of methylation arrays into one array via concatenation of beta matrices and phenotype arrays.

array_generator
Generator of additional methylation arrays for computational memory minimization.
impute(imputer)[source]

Impute all methylation arrays.

imputer
Type of imputation, sklearn-like.
write_dbs(conn)[source]

Write list of methylation arrays to SQL database. Recommend naming MethylationArray.

conn
SQL connection.
write_pkls(pkl)[source]

Write list of methylation arrays to single pickle. Recommend naming each MethylationArray.

pkl
Pickle file to write to.
pymethylprocess.MethylationDataTypes.extract_pheno_beta_df_from_folder(folder)[source]

Return phenotype and beta dataframes from specified folder with csv.

folder
Input folder.
pymethylprocess.MethylationDataTypes.extract_pheno_beta_df_from_pickle_dict(input_dict, disease='')[source]

Return phenotype and beta dataframes from specified dictionary storing MethylationArray python dictionary.

input_dict
Python dictionary storing pheno/beta information.
pymethylprocess.MethylationDataTypes.extract_pheno_beta_df_from_sql(conn, disease='')[source]

Return phenotype and beta dataframes from SQL tables storing MethylationArray info.

conn
SQL connection.

meffil_functions.py

Contains a few R functions that interact with meffil and minfi.

pymethylprocess.meffil_functions.est_cell_counts_IDOL(rgset, library)[source]

Given RGSet object, estimate cell counts for 450k/850k using reference approach via IDOL library.

rgset
RGSet object stored in python via rpy2
library
What type of CpG library to use.
pymethylprocess.meffil_functions.est_cell_counts_meffil(qc_list, cell_type_reference)[source]

Given QCObject list R object, estimate cell counts using reference approach via meffil.

qc_list
R list containing qc objects.
cell_type_reference
Reference blood/tissue set.
pymethylprocess.meffil_functions.est_cell_counts_minfi(rgset)[source]

Given RGSet object, estimate cell counts using reference approach via minfi.

rgset
RGSet object stored in python via rpy2
pymethylprocess.meffil_functions.load_detection_p_values_beadnum(qc_list, n_cores)[source]

Return list of detection p-value matrix and bead number matrix.

qc_list
R list containing qc objects.
n_cores
Number of cores to use in computation.
pymethylprocess.meffil_functions.r_autosomal_cpgs(array_type='450k')[source]

Return list of autosomal cpg probes per platform.

array_type
450k/850k array?
pymethylprocess.meffil_functions.r_snp_cpgs(array_type='450k')[source]

Return list of SNP cpg probes per platform.

array_type
450k/850k array?
pymethylprocess.meffil_functions.remove_sex(beta, array_type='450k')[source]

Remove non-autosomal cpgs from beta matrix.

array_type
450k/850k array?
pymethylprocess.meffil_functions.set_missing(beta, pval_beadnum, detection_val=1e-06)[source]

Set missing beta values to NA, taking into account detection values and bead number thresholds.

pval_beadnum
Detection pvalues and number of beads per cpg/samples
detection_val
Detection p-value threshold used to decide whether a site is set to missing.
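The masking idea behind set_missing can be sketched in plain Python. The rule used here (mask when the detection p-value exceeds detection_val or when fewer than 3 beads were observed) is an assumption based on the descriptions on this page; the actual function is implemented in R and may differ. The min_beads parameter is hypothetical.

```python
def set_missing(beta, detection_p, bead_counts, detection_val=1e-6, min_beads=3):
    """All inputs are equally shaped nested lists (CpGs x samples); masked cells become None."""
    return [
        [
            None if (p > detection_val or n < min_beads) else b
            for b, p, n in zip(beta_row, p_row, bead_row)
        ]
        for beta_row, p_row, bead_row in zip(beta, detection_p, bead_counts)
    ]

beta = [[0.8, 0.2], [0.5, 0.6]]
detection_p = [[1e-9, 0.05], [1e-8, 1e-10]]  # second cell fails the p-value check
bead_counts = [[12, 10], [2, 7]]             # third cell has too few beads
masked = set_missing(beta, detection_p, bead_counts)
print(masked)
```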

general_machine_learning.py

Contains a machine learning class to perform scikit-learn like operations, along with held-out hyperparameter grid search.

class pymethylprocess.general_machine_learning.MachineLearning(model, options, grid={}, labelencode=False, n_eval=0)[source]

Machine learning class to run sklearn-like pipeline on MethylationArray data. Initialize object with scikit-learn model, and optionally supply a hyperparameter search grid.

model
Scikit-learn-like model, classification, regression, dimensionality reduction, clustering etc.
options
Options to supply model in form of dictionary.
grid
Alternatively, supply search grid to search for best hyperparameters.
labelencode
T/F encode string labels.
n_eval
Number of evaluations for randomized grid search, if set to 0, perform exhaustive grid search
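The difference between exhaustive and randomized search controlled by n_eval can be pictured with a short sketch. It only enumerates candidate settings and does not fit any model; the function name, the seed parameter, and the example grid are hypothetical.

```python
import itertools
import random

def candidate_settings(grid, n_eval=0, seed=0):
    """Expand a hyperparameter grid; sample n_eval settings if n_eval > 0, else return all."""
    keys = sorted(grid)
    all_settings = [
        dict(zip(keys, combo))
        for combo in itertools.product(*(grid[k] for k in keys))
    ]
    if n_eval <= 0 or n_eval >= len(all_settings):
        return all_settings  # exhaustive search
    return random.Random(seed).sample(all_settings, n_eval)  # randomized search

grid = {"C": [0.1, 1.0, 10.0], "kernel": ["linear", "rbf"]}
exhaustive = candidate_settings(grid)
randomized = candidate_settings(grid, n_eval=3)
print(len(exhaustive), len(randomized))
```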
assign_results_to_pheno_col(methyl_array, new_col, output_pkl)[source]

Assign results to new phenotype column.

methyl_array
MethylationArray.
new_col
New column name.
output_pkl
Output pickle to dump MethylationArray to.
fit(train_methyl_array, val_methyl_array=None, outcome_cols=None)[source]

Fit data to model.

train_methyl_array
Training MethylationArray.
val_methyl_array
Validation MethylationArray. Can set to None.
outcome_cols
Phenotype column(s) to train on; can be multiple. Set to None if not needed.
fit_predict(train_methyl_array, outcome_cols=None)[source]

Fit and predict training data.

train_methyl_array
Training MethylationArray.
outcome_cols
Phenotype column(s) to train on; can be multiple. Set to None if not needed.
fit_transform(train_methyl_array, outcome_cols=None)[source]

Fit and transform to training data.

train_methyl_array
Training MethylationArray.
outcome_cols
Phenotype column(s) to train on; can be multiple. Set to None if not needed.
predict(test_methyl_array)[source]

Make new predictions on test methylation array.

test_methyl_array
Testing MethylationArray.
return_outcome_metric(methyl_array, outcome_cols, metric, run_bootstrap=False)[source]

Supply metric to evaluate results.

methyl_array
MethylationArray to evaluate.
outcome_cols
Outcome phenotype columns.
metric
Sklearn evaluation metric.
run_bootstrap
Make 95% CI from 1k bootstraps.
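A minimal sketch of a 95% bootstrap confidence interval for a metric, assuming a percentile bootstrap over resampled (truth, prediction) pairs. The library's exact procedure is not documented here, so this is an illustration of the general technique; the accuracy helper and seed parameter are hypothetical.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_bootstrap=1000, seed=0):
    """Percentile bootstrap: resample pairs with replacement and take the 2.5/97.5 percentiles."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstrap):
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int(0.025 * n_bootstrap)]
    upper = scores[int(0.975 * n_bootstrap) - 1]
    return lower, upper

y_true = [0, 1] * 20
y_pred = list(y_true)
for i in range(8):  # flip 8 of 40 predictions so accuracy is 0.8
    y_pred[i] = 1 - y_pred[i]

point = accuracy(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(point, lower, upper)
```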
store_results(output_pkl, results_dict={})[source]

Store results in pickle file.

output_pkl
Output pickle to dump results to.
results_dict
Supply own results dict to be dumped.
transform(test_methyl_array)[source]

Transform test methylation array.

test_methyl_array
Testing MethylationArray.
transform_results_to_beta(methyl_array, output_pkl)[source]

Transform beta matrix into reduced beta matrix and store.

methyl_array
MethylationArray.
output_pkl
Output pickle to dump MethylationArray to.

pymethyl-install

pymethyl-install [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

change_gcc_path

Change GCC and G++ paths if you don’t have version 7.2.0. [Experimental]

pymethyl-install change_gcc_path [OPTIONS]

install_bioconductor

Installs bioconductor.

pymethyl-install install_bioconductor [OPTIONS]

install_custom

Installs bioconductor packages.

pymethyl-install install_custom [OPTIONS]

Options

-p, --package <package>

Custom packages. [default: ENmix]

-m, --manager

Use BiocManager (recommended).

install_meffil

Installs meffil (update!).

pymethyl-install install_meffil [OPTIONS]

install_minfi_others

Installs minfi and other dependencies.

pymethyl-install install_minfi_others [OPTIONS]

install_r_packages

Installs r packages.

pymethyl-install install_r_packages [OPTIONS]

Options

-p, --package <package>

Custom packages. [default: ]

install_some_deps

Installs bioconductor, minfi, enmix, tcga biolinks, and meffil.

pymethyl-install install_some_deps [OPTIONS]

pymethyl-visualize

pymethyl-visualize [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

plot_cell_type_results

Plot csv containing cell type results into side by side boxplots.

pymethyl-visualize plot_cell_type_results [OPTIONS]

Options

-i, --input_csv <input_csv>

Input csv. [default: cell_type_estimates.csv]

-o, --outfilename <outfilename>

Output png. [default: visualizations/cell_type_results.png]

-cols, --plot_cols <plot_cols>

Plot columns. [default: Gran, CD4T, CD8T, Bcell, Mono, NK, gMDSC]

-fs, --font_scale <font_scale>

Font scaling [default: 1.0]

plot_heatmap

Plot heatmap from CSV file.

pymethyl-visualize plot_heatmap [OPTIONS]

Options

-i, --input_csv <input_csv>

Input csv. [default: ]

-o, --outfilename <outfilename>

Output png. [default: output.png]

-idx, --index_col <index_col>

Index load dataframe [default: 0]

-fs, --font_scale <font_scale>

Font scaling [default: 1.0]

-min, --min_val <min_val>

Min heat val [default: 0.0]

-max, --max_val <max_val>

Max heat val, if -1, defaults to None [default: 1.0]

-a, --annot

Annotate heatmap [default: False]

-n, --norm

Normalize matrix data [default: False]

-c, --cluster

Cluster matrix data [default: False]

-m, --matrix_type <matrix_type>

Type of matrix supplied [default: none]

-x, --xticks

Show x ticks [default: False]

-y, --yticks

Show y ticks [default: False]

-t, --transpose

Transpose matrix data [default: False]

-col, --color_column <color_column>

Color column. [default: color]

transform_plot

Dimensionality reduce VAE or original beta values using UMAP and plot using plotly.

pymethyl-visualize transform_plot [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-c, --column_of_interest <column_of_interest>

Column extract from phenotype data. [default: disease]

-o, --output_file <output_file>

Output visualization. [default: ./visualization.html]

-nn, --n_neighbors <n_neighbors>

Number of neighbors UMAP. [default: 5]

-a, --axes_off

Whether to turn axes on or off.

-s, --supervised

Supervise umap embedding.

-d, --min_dist <min_dist>

UMAP min distance. [default: 0.1]

-m, --metric <metric>

Reduction metric. [default: euclidean]

-cc, --case_control_override

Add controls from case_control column and override current disease for classification tasks. [default: False]

pymethyl-preprocess

pymethyl-preprocess [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

batch_deploy_preprocess

Deploy multiple preprocessing jobs in series or parallel.

pymethyl-preprocess batch_deploy_preprocess [OPTIONS]

Options

-n, --n_cores <n_cores>

Number cores to use for preprocessing. [default: 6]

-i, --subtype_output_dir <subtype_output_dir>

Output subtypes pheno csv. [default: ./preprocess_outputs/]

-m, --meffil

Preprocess using meffil.

-t, --torque

Job submission torque.

-r, --run

Actually run local job or just print out command.

-s, --series

Run commands in series.

-p, --pc_qc_parameters_csv <pc_qc_parameters_csv>

For meffil, qc parameters and pcs for final qc and functional normalization. [default: ./preprocess_outputs/pc_qc_parameters.csv]

-u, --use_cache

If this is selected, loads qc results rather than running qc again. Only works for meffil selection.

-qc, --qc_only

Only perform QC for meffil pipeline, caches results into rds file for loading again, only works if use_cache is false.

-c, --chunk_size <chunk_size>

If not series, chunk up and run this number of commands at once. -1 means all commands at once.
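The chunking behaviour described for --chunk_size can be sketched as follows. This is an illustration of the idea only; treating any value below 1 as "all at once" is an assumption, and the command strings are placeholders.

```python
def chunk_commands(commands, chunk_size=-1):
    """Group commands into batches; a chunk_size below 1 puts everything in one batch."""
    if not commands:
        return []
    if chunk_size < 1 or chunk_size >= len(commands):
        return [list(commands)]
    return [commands[i:i + chunk_size] for i in range(0, len(commands), chunk_size)]

commands = [f"job_{i}" for i in range(5)]  # placeholder command strings
batches = chunk_commands(commands, chunk_size=2)
all_at_once = chunk_commands(commands)
print(batches)
```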

combine_methylation_arrays

If split MethylationArrays by subtype for either preprocessing or imputation, can use to recombine data for downstream step.

pymethyl-preprocess combine_methylation_arrays [OPTIONS]

Options

-i, --input_pkls <input_pkls>

Input pickles for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]

-d, --optional_input_pkl_dir <optional_input_pkl_dir>

Auto grab input pkls. [default: ]

-o, --output_pkl <output_pkl>

Output database for beta and phenotype data. [default: ./combined_outputs/methyl_array.pkl]

-e, --exclude <exclude>

If -d selected, these diseases will be excluded from study. [default: ]

concat_sample_sheets

Concat two sample files for more fields for minfi+ input, adds more samples.

pymethyl-preprocess concat_sample_sheets [OPTIONS]

Options

-s1, --sample_sheet1 <sample_sheet1>

Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info1.csv]

-s2, --sample_sheet2 <sample_sheet2>

Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info2.csv]

-os, --output_sample_sheet <output_sample_sheet>

CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

create_sample_sheet

Create sample sheet for input to minfi, meffil, or enmix.

pymethyl-preprocess create_sample_sheet [OPTIONS]

Options

-is, --input_sample_sheet <input_sample_sheet>

Clinical information downloaded from tcga/geo/custom. [default: ./tcga_idats/clinical_info.csv]

-s, --source_type <source_type>

Source type of data. [default: tcga]

-i, --idat_dir <idat_dir>

Idat directory. [default: ./tcga_idats/]

-os, --output_sample_sheet <output_sample_sheet>

CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

-m, --mapping_file <mapping_file>

Mapping file from uuid to TCGA barcode. Downloaded using download_tcga. [default: ./idat_filename_case.txt]

-l, --header_line <header_line>

Line to begin reading csv/xlsx. [default: 0]

-d, --disease_class_column <disease_class_column>

Disease classification column, for custom and geo datasets. [default: methylation class:ch1]

-b, --basename_col <basename_col>

Basename classification column, for custom datasets. [default: Sentrix ID (.idat)]

-c, --include_columns_file <include_columns_file>

Custom columns file containing columns to keep, separated by \n. Add a tab for each line if you wish to rename columns: original_name \t new_column_name [default: ]
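The include-columns file format (one column per line, with an optional tab followed by a new name) can be parsed along these lines. The parser and the column names are hypothetical and only illustrate the file layout; the resulting mapping has the shape of the include_columns dictionary described for format_custom and format_geo.

```python
def parse_include_columns(text):
    """Map original column names to new names; lines without a tab keep their name."""
    mapping = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        original, _, new_name = line.partition("\t")
        mapping[original.strip()] = new_name.strip() or original.strip()
    return mapping

# Toy file contents: two renamed columns and one kept as-is.
text = "age\tAge\nsex\n\nsurvival months\tOS\n"
include_columns = parse_include_columns(text)
print(include_columns)
```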

download_clinical

Download all TCGA 450k clinical info.

pymethyl-preprocess download_clinical [OPTIONS]

Options

-o, --output_dir <output_dir>

Output directory for exported idats. [default: ./tcga_idats/]

download_geo

Download geo methylation study idats and clinical info.

pymethyl-preprocess download_geo [OPTIONS]

Options

-g, --geo_query <geo_query>

GEO study to query. [default: ]

-o, --output_dir <output_dir>

Output directory for exported idats. [default: ./geo_idats/]

download_tcga

Download all tcga 450k data.

pymethyl-preprocess download_tcga [OPTIONS]

Options

-o, --output_dir <output_dir>

Output directory for exported idats. [default: ./tcga_idats/]

feature_select

Filter CpGs by taking x top CpGs with highest mean absolute deviation scores or via spectral feature selection.

pymethyl-preprocess feature_select [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./imputed_outputs/methyl_array.pkl]

-o, --output_pkl <output_pkl>

Output database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-n, --n_top_cpgs <n_top_cpgs>

Number cpgs to include with highest variance across population. [default: 300000]

-f, --feature_selection_method <feature_selection_method>
-mm, --metric <metric>
-nn, --n_neighbors <n_neighbors>

Number neighbors for feature selection, default enacts rbf kernel. [default: 0]

-m, --mad_top_cpgs <mad_top_cpgs>

Number cpgs to apply mad filtering first before more sophisticated feature selection. If 0 or primary feature selection is mad, no mad pre-filtering. [default: 0]

get_categorical_distribution

Get categorical distribution of columns of sample sheet.

pymethyl-preprocess get_categorical_distribution [OPTIONS]

Options

-is, --formatted_sample_sheet <formatted_sample_sheet>

Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/minfiSheet.csv]

-k, --key <key>

Column of csv to print statistics for. [default: disease]

-d, --disease_only

Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>

Delimiter for disease extraction. [default: ,]

imputation_pipeline

Imputation of subtype or no subtype using various imputation methods.

pymethyl-preprocess imputation_pipeline [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./combined_outputs/methyl_array.pkl]

-ss, --split_by_subtype

Imputes CpGs by subtype before combining again.

-m, --method <method>

Method of imputation. [default: KNN]

-s, --solver <solver>

Imputation library. [default: fancyimpute]

-k, --n_neighbors <n_neighbors>

Number neighbors for imputation if using KNN. [default: 5]

-r, --orientation <orientation>

Impute CpGs or samples. [default: Samples]

-o, --output_pkl <output_pkl>

Output database for beta and phenotype data. [default: ./imputed_outputs/methyl_array.pkl]

-n, --n_top_cpgs <n_top_cpgs>

Number cpgs to include with highest variance across population. Greater than 0 allows for mad filtering during imputation to skip mad step. [default: 0]

-f, --feature_selection_method <feature_selection_method>
-mm, --metric <metric>
-nfs, --n_neighbors_fs <n_neighbors_fs>

Number neighbors for feature selection, default enacts rbf kernel. [default: 0]

-d, --disease_only

Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>

Delimiter for disease extraction. [default: ,]

-st, --sample_threshold <sample_threshold>

Value between 0 and 1 for NaN removal. If a sample has sample_threshold proportion of cpgs missing, then remove the sample. Set to -1 to not remove samples. [default: -1.0]

-ct, --cpg_threshold <cpg_threshold>

Value between 0 and 1 for NaN removal. If a cpg has cpg_threshold proportion of samples missing, then remove the cpg. Set to -1 to not remove cpgs. [default: -1.0]

meffil_encode

Reformat file for meffil input.

pymethyl-preprocess meffil_encode [OPTIONS]

Options

-is, --input_sample_sheet <input_sample_sheet>

CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

-os, --output_sample_sheet <output_sample_sheet>

CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

merge_sample_sheets

Merge two sample files for more fields for minfi+ input.

pymethyl-preprocess merge_sample_sheets [OPTIONS]

Options

-s1, --sample_sheet1 <sample_sheet1>

Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info1.csv]

-s2, --sample_sheet2 <sample_sheet2>

Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info2.csv]

-os, --output_sample_sheet <output_sample_sheet>

CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

-d, --second_sheet_disease

Use second sheet’s disease column.

-nd, --no_disease_merge

Don’t merge disease columns.

na_report

Print proportion of missing values throughout dataset.

pymethyl-preprocess na_report [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]

-o, --output_dir <output_dir>

Output directory for the NA report. [default: ./na_report/]

-r, --head_directory

Treat the -i option as a directory and search it for multiple input pickles.

preprocess_pipeline

Perform preprocessing of IDATs using minfi, enmix, or meffil.

pymethyl-preprocess preprocess_pipeline [OPTIONS]

Options

-i, --idat_dir <idat_dir>

IDAT directory for one sample sheet; alternatively, the path to your phenotype sample sheet. [default: ./tcga_idats/]

-n, --n_cores <n_cores>

Number of cores to use for preprocessing. [default: 6]

-o, --output_pkl <output_pkl>

Output database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]

-m, --meffil

Preprocess using meffil.

-pc, --n_pcs <n_pcs>

For meffil, number of principal components for functional normalization. If set to -1, then PCs are selected using elbow method. [default: -1]

-p, --pipeline <pipeline>

If not meffil, preprocess using minfi or enmix. [default: enmix]

-noob, --noob_norm

Run noob normalization if minfi is selected.

-u, --use_cache

If selected, load cached QC results instead of rerunning QC, and update them with the new QC parameters. Only works when meffil is selected; minfi and enmix just load the RGSet.

-qc, --qc_only

Only perform QC for the meffil pipeline and cache the results in an RDS file for later loading; only works if use_cache is false. Minfi and enmix just save the RGSet before preprocessing.

-bns, --p_beadnum_samples <p_beadnum_samples>

From meffil documentation, “fraction of probes that failed the threshold of 3 beads”. [default: 0.05]

-pds, --p_detection_samples <p_detection_samples>

From meffil documentation, “fraction of probes that failed a detection.pvalue threshold of 0.01”. [default: 0.05]

-bnc, --p_beadnum_cpgs <p_beadnum_cpgs>

From meffil documentation, “fraction of samples that failed the threshold of 3 beads”. [default: 0.05]

-pdc, --p_detection_cpgs <p_detection_cpgs>

From meffil documentation, “fraction of samples that failed a detection.pvalue threshold of 0.01”. [default: 0.05]

-sc, --sex_cutoff <sex_cutoff>

From meffil documentation, “difference of total median intensity for Y chromosome probes and X chromosome probes”. [default: -2]

-sd, --sex_sd <sex_sd>

From meffil documentation, “sex detection outliers if outside this range”. [default: 5]
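As an illustration of how the options above combine, the sketch below assembles two preprocess_pipeline invocations as argument lists. Only flags listed above are used, and the paths are the documented defaults; passing minfi as the value of -p is inferred from that option's description. The commands are printed rather than executed, since running them requires PyMethylProcess and its R dependencies.

```python
import shlex

# meffil run: -m switches the pipeline to meffil.
meffil_cmd = [
    "pymethyl-preprocess", "preprocess_pipeline",
    "-i", "./tcga_idats/",
    "-o", "./preprocess_outputs/methyl_array.pkl",
    "-n", "6",
    "-m",
]

# minfi run: -p selects the pipeline when meffil is not used,
# and -noob requests noob normalization for minfi.
minfi_cmd = [
    "pymethyl-preprocess", "preprocess_pipeline",
    "-i", "./tcga_idats/",
    "-o", "./preprocess_outputs/methyl_array.pkl",
    "-p", "minfi",
    "-noob",
]

meffil_line = shlex.join(meffil_cmd)
minfi_line = shlex.join(minfi_cmd)
print(meffil_line)
print(minfi_line)

# To execute one of them:
# import subprocess; subprocess.run(meffil_cmd, check=True)
```

Building the command as a list avoids shell-quoting problems if a sample sheet path contains spaces.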

remove_diseases

Exclude diseases from study by count number or exclusion list.

pymethyl-preprocess remove_diseases [OPTIONS]

Options

-is, --formatted_sample_sheet <formatted_sample_sheet>

Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info.csv]

-e, --exclude_disease_list <exclude_disease_list>

List of conditions to exclude, from disease column, comma delimited. [default: ]

-os, --output_sheet_name <output_sheet_name>

CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

-l, --low_count <low_count>

Remove diseases whose sample count is below this value; by default this filter is not used. [default: 0]

-d, --disease_only

Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>

Delimiter for disease extraction. [default: ,]
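The interaction between the exclusion list, low_count, and disease_only can be sketched as follows. This is an illustrative reading of the options, not the package's code: the sample sheet is assumed to be a list of dicts with a disease field, the helper names are hypothetical, and applying the exclusion list to the label after disease_only truncation is an assumption.

```python
from collections import Counter


def disease_label(value, disease_only=False, subtype_delimiter=","):
    """Return the text before the delimiter when disease_only is set, else the full value."""
    return value.split(subtype_delimiter)[0] if disease_only else value


def remove_diseases(rows, exclude=(), low_count=0,
                    disease_only=False, subtype_delimiter=","):
    """Drop rows whose disease is excluded or occurs fewer than `low_count` times."""
    labelled = [(disease_label(r["disease"], disease_only, subtype_delimiter), r)
                for r in rows]
    counts = Counter(label for label, _ in labelled)
    return [r for label, r in labelled
            if label not in exclude and counts[label] >= low_count]


# Toy sample sheet; Basename and disease values are made up for illustration.
rows = [
    {"Basename": "s1", "disease": "BRCA,LumA"},
    {"Basename": "s2", "disease": "BRCA,Basal"},
    {"Basename": "s3", "disease": "LUAD,NA"},
    {"Basename": "s4", "disease": "GBM,NA"},
    {"Basename": "s5", "disease": "GBM,NA"},
]
kept = remove_diseases(rows, exclude={"GBM"}, low_count=2, disease_only=True)
print([r["Basename"] for r in kept])
```

Here GBM is dropped by the exclusion list and LUAD by the count filter, leaving the two BRCA samples.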

split_preprocess_input_by_subtype

Split preprocess input samplesheet by disease subtype.

pymethyl-preprocess split_preprocess_input_by_subtype [OPTIONS]

Options

-i, --idat_csv <idat_csv>

IDAT CSV for one sample sheet; alternatively, your phenotype sample sheet. [default: ./tcga_idats/minfiSheet.csv]

-d, --disease_only

Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>

Delimiter for disease extraction. [default: ,]

-o, --subtype_output_dir <subtype_output_dir>

Output directory for the subtype pheno CSVs. [default: ./preprocess_outputs/]

pymethyl-utils

pymethyl-utils [OPTIONS] COMMAND [ARGS]...

Options

--version

Show the version and exit.

backup_pkl

Copy methylarray pickle to new location to backup.

pymethyl-utils backup_pkl [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-o, --output_pkl <output_pkl>

Output database for beta and phenotype data. [default: ./backup/methyl_array.pkl]

bin_column

Convert continuous phenotype column into categorical by binning.

pymethyl-utils bin_column [OPTIONS]

Options

-t, --test_pkl <test_pkl>

Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]

-c, --col <col>

Column to turn into bins. [default: age]

-n, --n_bins <n_bins>

Number of bins. [default: 10]

-ot, --output_test_pkl <output_test_pkl>

Binned shap pickle for further testing. [default: ./train_val_test_sets/test_methyl_array_shap_binned.pkl]
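A minimal sketch of turning a continuous column into categories follows. Equal-width bins are an assumption; the option text does not say whether the command uses equal-width or quantile bins, and bin_column here is a standalone illustration rather than the CLI's implementation.

```python
def bin_column(values, n_bins=10):
    """Assign each value to one of `n_bins` equal-width bins, as 0-based indices."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls in a single bin.
        return [0 for _ in values]
    # The maximum value would compute to index n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 35, 50, 65, 80]
bins = bin_column(ages, n_bins=3)
print(bins)
```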

concat_csv

Concatenate two csv files together.

pymethyl-utils concat_csv [OPTIONS]

Options

-i1, --input_csv <input_csv>

Beta csv. [default: ./beta1.csv]

-i2, --input_csv2 <input_csv2>

Beta/other csv 2. [default: ./cell_estimates.csv]

-o, --output_csv <output_csv>

Output csv. [default: ./beta.concat.csv]

-a, --axis <axis>

Axis to merge on. Columns are 0, rows are 1. [default: 1]

counts

Return categorical breakdown of phenotype column.

pymethyl-utils counts [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-k, --key <key>

Key to split on. [default: disease]

create_external_validation_set

Create external validation set containing same CpGs as training set.

pymethyl-utils create_external_validation_set [OPTIONS]

Options

-t, --train_pkl <train_pkl>

Input methyl array. [default: ./train_val_test_sets/train_methyl_array.pkl]

-q, --query_pkl <query_pkl>

Query methylation array whose CpGs are added or removed to match the training set. [default: ./final_preprocessed/methyl_array.pkl]

-o, --output_pkl <output_pkl>

Output methyl array external validation. [default: ./external_validation/methyl_array.pkl]

-c, --cpg_replace_method <cpg_replace_method>

What to do for missing CpGs. [default: mid]
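The idea of matching a query array to the training CpGs can be sketched as below. The assumption, not confirmed by the option text, is that the mid replacement method fills missing CpGs with 0.5, the midpoint of the beta-value range. The dict-based data layout, the CpG identifiers, and align_to_training_cpgs are hypothetical.

```python
def align_to_training_cpgs(train_cpgs, query_beta, fill_value=0.5):
    """Return query samples restricted to `train_cpgs`, in training order.

    `query_beta` maps sample name -> {cpg_id: beta}. CpGs absent from a query
    sample receive `fill_value`; CpGs not in the training set are dropped.
    """
    return {sample: [betas.get(cpg, fill_value) for cpg in train_cpgs]
            for sample, betas in query_beta.items()}


train_cpgs = ["cg01", "cg02", "cg03"]
query = {"q1": {"cg01": 0.2, "cg03": 0.9, "cg99": 0.4}}
aligned = align_to_training_cpgs(train_cpgs, query)
print(aligned)
```

The query sample ends up with exactly the training set's columns: cg02 is filled in and cg99 is discarded.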

feature_select_train_val_test

Filter CpGs by taking x top CpGs with highest mean absolute deviation scores or via spectral feature selection.

pymethyl-utils feature_select_train_val_test [OPTIONS]

Options

-i, --input_pkl_dir <input_pkl_dir>

Input database for beta and phenotype data. [default: ./train_val_test_sets/]

-o, --output_dir <output_dir>

Output database for beta and phenotype data. [default: ./train_val_test_sets_fs/]

-n, --n_top_cpgs <n_top_cpgs>

Number of CpGs with the highest variance across the population to include. [default: 300000]

-f, --feature_selection_method <feature_selection_method>
-mm, --metric <metric>
-nn, --n_neighbors <n_neighbors>

Number of neighbors for feature selection; the default (0) uses an RBF kernel. [default: 0]

-m, --mad_top_cpgs <mad_top_cpgs>

Number of CpGs to keep with MAD filtering before applying more sophisticated feature selection. If 0, or if the primary feature selection method is MAD, no MAD pre-filtering is performed. [default: 0]
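The mean-absolute-deviation filter mentioned above can be sketched in a few lines. This is an illustration under assumptions: beta values are held in a dict of CpG id to per-sample values, the helper names are hypothetical, and ranking on the training set and then applying the same CpGs to the validation set is an assumed workflow rather than documented behavior.

```python
from statistics import mean


def mean_absolute_deviation(values):
    """Mean of absolute deviations from the mean."""
    m = mean(values)
    return mean(abs(v - m) for v in values)


def top_mad_cpgs(beta_by_cpg, n_top_cpgs):
    """Return the ids of the `n_top_cpgs` CpGs with the highest MAD, highest first."""
    scores = {cpg: mean_absolute_deviation(vals) for cpg, vals in beta_by_cpg.items()}
    return sorted(scores, key=scores.get, reverse=True)[:n_top_cpgs]


train = {
    "cg01": [0.10, 0.90, 0.10, 0.90],  # highly variable
    "cg02": [0.50, 0.50, 0.50, 0.50],  # constant
    "cg03": [0.40, 0.60, 0.40, 0.60],  # mildly variable
}
val = {"cg01": [0.2, 0.8], "cg02": [0.5, 0.5], "cg03": [0.45, 0.55]}

selected = top_mad_cpgs(train, n_top_cpgs=2)
# Apply the training-set selection to the validation set so both share features.
val_fs = {cpg: val[cpg] for cpg in selected}
print(selected)
```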

fix_key

Format certain column of phenotype array in MethylationArray.

pymethyl-utils fix_key [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-k, --key <key>

Key to split on. [default: disease]

-d, --disease_only

Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>

Delimiter for disease extraction. [default: ,]

-o, --output_pkl <output_pkl>

Output database for beta and phenotype data. [default: ./fixed_preprocessed/methyl_array.pkl]

modify_pheno_data

Use another spreadsheet to add more descriptive data to methylarray.

pymethyl-utils modify_pheno_data [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-is, --input_formatted_sample_sheet <input_formatted_sample_sheet>

Information passed through function create_sample_sheet, has Basename and disease fields. [default: ./tcga_idats/minfi_sheet.csv]

-o, --output_pkl <output_pkl>

Output database for beta and phenotype data. [default: ./modified_processed/methyl_array.pkl]

move_jpg

Move preprocessing jpegs to preprocessing output directory.

pymethyl-utils move_jpg [OPTIONS]

Options

-i, --input_dir <input_dir>

Directory containing jpg. [default: ./]

-o, --output_dir <output_dir>

Output directory for images. [default: ./preprocess_output_images/]

overwrite_pheno_data

Use another spreadsheet to overwrite the phenotype data in methylarray.

pymethyl-utils overwrite_pheno_data [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-is, --input_formatted_sample_sheet <input_formatted_sample_sheet>

Information passed through function create_sample_sheet, has Basename and disease fields. [default: ./tcga_idats/minfi_sheet.csv]

-o, --output_pkl <output_pkl>

Output database for beta and phenotype data. [default: ./modified_processed/methyl_array.pkl]

-c, --index_col <index_col>

Index col when reading csv. [default: 0]

pkl_to_csv

Output methylarray pickle to csv.

pymethyl-utils pkl_to_csv [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-o, --output_dir <output_dir>

Output directory for beta and phenotype CSVs. [default: ./final_preprocessed/]

-c, --col <col>

Column to color. [default: ]

print_number_sex_cpgs

Print number of non-autosomal CpGs.

pymethyl-utils print_number_sex_cpgs [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-a, --array_type <array_type>

Array Type. [default: 450k]
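Conceptually, counting or removing non-autosomal CpGs needs a probe-to-chromosome annotation for the chosen array type. The sketch below assumes such a mapping is available as a plain dict; the probe ids, the chromosome labels, and split_by_autosomal are hypothetical, and the package's actual annotation source is not described here.

```python
def split_by_autosomal(cpgs, chromosome_of, sex_chromosomes=("chrX", "chrY")):
    """Partition CpG ids into (autosomal, non_autosomal) using an annotation dict."""
    autosomal = [c for c in cpgs if chromosome_of[c] not in sex_chromosomes]
    non_autosomal = [c for c in cpgs if chromosome_of[c] in sex_chromosomes]
    return autosomal, non_autosomal


# Made-up annotation for illustration only.
chromosome_of = {"cg01": "chr1", "cg02": "chrX", "cg03": "chr7", "cg04": "chrY"}
autosomal, non_autosomal = split_by_autosomal(list(chromosome_of), chromosome_of)
print(len(non_autosomal), "non-autosomal CpGs")
```

The count corresponds to what print_number_sex_cpgs reports, and the autosomal list to what remove_sex would retain.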

print_shape

Print dimensions of beta matrix.

pymethyl-utils print_shape [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

ref_estimate_cell_counts

Reference based cell type estimates.

pymethyl-utils ref_estimate_cell_counts [OPTIONS]

Options

-ro, --input_r_object_dir <input_r_object_dir>

Input directory containing qc data. [default: ./preprocess_outputs/]

-a, --algorithm <algorithm>

Algorithm to run cell type. [default: meffil]

-ref, --reference <reference>

Cell Type Reference. [default: cord blood gse68456]

-l, --library <library>

IDOL Library. [default: IDOLOptimizedCpGs450klegacy]

-o, --output_csv <output_csv>

Output cell type estimates. [default: ./added_cell_counts/cell_type_estimates.csv]

remove_sex

Remove non-autosomal CpGs.

pymethyl-utils remove_sex [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]

-o, --output_pkl <output_pkl>

Output methyl array autosomal. [default: ./autosomal/methyl_array.pkl]

-a, --array_type <array_type>

Array Type. [default: 450k]

remove_snps

Remove SNPs from methylation array.

pymethyl-utils remove_snps [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./autosomal/methyl_array.pkl]

-o, --output_pkl <output_pkl>

Output methyl array with SNPs removed. [default: ./no_snp/methyl_array.pkl]

-a, --array_type <array_type>

Array Type. [default: 450k]
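The documented defaults of remove_sex and remove_snps line up so that one command's output is the next command's input. A small sketch of that chain, using only the flags and default paths listed above (option_value is a hypothetical helper for reading a flag's value out of an argument list):

```python
import shlex

steps = [
    ["pymethyl-utils", "remove_sex",
     "-i", "./preprocess_outputs/methyl_array.pkl",
     "-o", "./autosomal/methyl_array.pkl",
     "-a", "450k"],
    ["pymethyl-utils", "remove_snps",
     "-i", "./autosomal/methyl_array.pkl",
     "-o", "./no_snp/methyl_array.pkl",
     "-a", "450k"],
]


def option_value(cmd, flag):
    """Return the argument that follows `flag` in an argument list."""
    return cmd[cmd.index(flag) + 1]


# Each step reads what the previous step wrote.
chained = option_value(steps[0], "-o") == option_value(steps[1], "-i")

for step in steps:
    print(shlex.join(step))
```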

set_part_array_background

Set subset of CpGs from beta matrix to background values.

pymethyl-utils set_part_array_background [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]

-c, --cpg_pkl <cpg_pkl>

Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]

-o, --output_pkl <output_pkl>

Output methyl array. [default: ./removal/methyl_array.pkl]

stratify

Split methylation array by key and store.

pymethyl-utils stratify [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-k, --key <key>

Key to split on. [default: disease]

-o, --output_dir <output_dir>

Output directory for stratified arrays. [default: ./stratified/]
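Splitting by key amounts to grouping samples on one phenotype column and carrying the matching beta rows along. A minimal sketch with plain dicts; the data layout and the stratify helper are hypothetical stand-ins for the MethylationArray object, and writing each group to the output directory is left out.

```python
from collections import defaultdict


def stratify(pheno, beta, key="disease"):
    """Group samples by `pheno[sample][key]`; return {group: (pheno_subset, beta_subset)}."""
    members = defaultdict(list)
    for sample, row in pheno.items():
        members[row[key]].append(sample)
    return {group: ({s: pheno[s] for s in samples}, {s: beta[s] for s in samples})
            for group, samples in members.items()}


pheno = {
    "s1": {"disease": "BRCA"},
    "s2": {"disease": "LUAD"},
    "s3": {"disease": "BRCA"},
}
beta = {"s1": [0.1, 0.9], "s2": [0.4, 0.6], "s3": [0.2, 0.8]}

groups = stratify(pheno, beta)
print({group: sorted(p) for group, (p, _) in groups.items()})
```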

subset_array

Only retain certain number of CpGs from methylation array.

pymethyl-utils subset_array [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]

-c, --cpg_pkl <cpg_pkl>

Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]

-o, --output_pkl <output_pkl>

Output methyl array. [default: ./subset/methyl_array.pkl]

train_test_val_split

Split methylation array into train, test, val.

pymethyl-utils train_test_val_split [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-o, --output_dir <output_dir>

Output directory for training, testing, and validation sets. [default: ./train_val_test_sets/]

-tp, --train_percent <train_percent>

Fraction of data used for training. [default: 0.8]

-vp, --val_percent <val_percent>

Percent of training data that comprises validation set. [default: 0.1]

-cat, --categorical

Multi-class prediction. [default: False]

-do, --disease_only

Only look at disease, or text before subtype_delimiter.

-k, --key <key>

Key to split on. [default: disease]

-sd, --subtype_delimiter <subtype_delimiter>

Delimiter for disease extraction. [default: ,]
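Because val_percent is described as a share of the training data rather than of the whole dataset, the default 0.8 / 0.1 settings give roughly a 72 / 8 / 20 split. The sketch below works through that arithmetic; the shuffling, the seed parameter, and the rounding are assumptions, and the stratification by key that a categorical split implies is omitted.

```python
import random


def train_test_val_split(samples, train_percent=0.8, val_percent=0.1, seed=42):
    """Return (train, val, test); validation is carved out of the training portion."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # seed is a hypothetical parameter
    n_train_val = round(len(shuffled) * train_percent)
    n_val = round(n_train_val * val_percent)
    train_val, test = shuffled[:n_train_val], shuffled[n_train_val:]
    return train_val[n_val:], train_val[:n_val], test


samples = [f"sample_{i}" for i in range(100)]
train, val, test = train_test_val_split(samples)
print(len(train), len(val), len(test))
```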

write_cpgs

Write CpGs in methylation array to file.

pymethyl-utils write_cpgs [OPTIONS]

Options

-i, --input_pkl <input_pkl>

Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]

-c, --cpg_pkl <cpg_pkl>

Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]
