Welcome to PyMethylProcess’s documentation!¶

https://github.com/Christensen-Lab-Dartmouth/PyMethylProcess

To get started, install pymethylprocess using Docker (joshualevy44/pymethylprocess) or pip (pymethylprocess) and run pymethyl-install_r_dependencies.

There is both an API and CLI available for use. Examples for CLI usage can be found in ./example_scripts.

PreProcessDataTypes.py¶

Contains datatypes core to downloading IDATs, preprocessing IDATs and samplesheets.

class pymethylprocess.PreProcessDataTypes.PreProcessIDAT(idat_dir, minfi=None, enmix=None, base=None, meffil=None)[source]¶

Class that will preprocess IDATs using R pipelines.

idat_dir: Location of idats or samplesheet csv.
minfi: Rpy2 importr minfi library, default to None will load through rpy2
enmix: Rpy2 importr enmix library, default to None will load through rpy2
base: Rpy2 importr base library, default to None will load through rpy2
meffil: Rpy2 importr meffil library, default to None will load through rpy2

export_csv(output_dir)[source]¶

Export pheno and beta dataframes to CSVs

output_dir: Where to store csvs.

export_pickle(output_pickle, disease='')[source]¶

Export pheno and beta dataframes to pickle, stored in python dict that can be loaded into MethylationArray

output_pickle: Where to store MethylationArray.
disease: Custom naming scheme for data.

export_sql(output_db, disease='')[source]¶

Export pheno and beta dataframes to SQL

output_db: Where to store data, sqlite db.
disease: Custom naming scheme for data.

extract_manifest()[source]¶: Get manifest from RGSet.

extract_pheno_data(methylset=False)[source]¶

Extract pheno data from MSet or RGSet, minfi.

methylset: If MSet has been created, set to True, else extract from original RGSet.

filter_beta()[source]¶: After creating beta, filter out outliers.

get_beta()[source]¶: Get beta value matrix from minfi after finding RSet.

get_meth()[source]¶: Get methylation intensity matrix from MSet

get_unmeth()[source]¶: Get unmethylated intensity matrix from MSet

load_idats()[source]¶: For minfi pipeline, load IDATs from specified idat_dir.

move_jpg()[source]¶: Move jpeg files from current working directory to the idat directory.

output_pheno_beta(meffil=False)[source]¶

Get pheno and beta dataframe objects stored as attributes for input to MethylationArray object.

meffil: True if ran meffil pipeline.

plot_original_qc(output_dir)[source]¶

Plot QC results from ENmix pipeline and possibly minfi. Still experimental.

output_dir: Where to store plots.

plot_qc_metrics(output_dir)[source]¶

Plot QC results from ENmix pipeline and possibly minfi. Still experimental.

output_dir: Where to store plots.

preprocessENmix(n_cores=6)[source]¶

Run ENmix preprocessing pipeline.

n_cores: Number of CPUs to use.

preprocessMeffil(n_cores=6, n_pcs=4, qc_report_fname='qc/report.html', normalization_report_fname='norm/report.html', pc_plot_fname='qc/pc_plot.pdf', useCache=True, qc_only=True, qc_parameters={'p.beadnum.cpgs': 0.1, 'p.beadnum.samples': 0.1, 'p.detection.cpgs': 0.1, 'p.detection.samples': 0.1}, rm_sex=False)[source]¶

Run meffil preprocessing pipeline with functional normalization.

n_cores: Number of CPUs to use.
n_pcs: Number of principal components to use for functional normalization, set to -1 to autoselect via kneedle algorithm.
qc_report_fname: HTML filename to store QC report.
normalization_report_fname: HTML filename to store normalization report
pc_plot_fname: PDF file to store principal components plot.
useCache: Use saved QC objects instead of running through QC again.
qc_only: Perform QC, then save and quit before normalization.
qc_parameters: Python dictionary with parameters for qc.
rm_sex: Remove non-autosomal cpgs?

preprocessNoob()[source]¶: Run minfi preprocessing with Noob normalization

preprocessRAW()[source]¶: Run minfi preprocessing with RAW normalization

preprocess_enmix_pipeline(n_cores=6, pipeline='enmix', noob=False, qc_only=False, use_cache=False)[source]¶

Run complete ENmix or minfi preprocessing pipeline.

n_cores: Number CPUs.
pipeline: Run enmix or minfi
noob: Noob norm or RAW if minfi running.
qc_only: Save and quit after only running QC?
use_cache: Load preexisting RGSet instead of running QC again.

return_beta()[source]¶: Return minfi RSet after having created MSet.

to_methyl_array(disease='')[source]¶

Convert results from preprocessing into MethylationArray, and directly return MethylationArray object.

disease: Custom naming scheme for data.

class pymethylprocess.PreProcessDataTypes.PreProcessPhenoData(pheno_sheet, idat_dir, header_line=0)[source]¶

Class that will manipulate the phenotype samplesheet before preprocessing of IDATs.

pheno_sheet: Location of clinical info csv.
idat_dir: Location of idats
header_line: Where to start reading clinical csv

concat(other_formatted_sheet)[source]¶

Concatenate multiple PreProcessPhenoData objects by concatenating their dataframes, to accept more than one samplesheet/dataset.

other_formatted_sheet: Other PreProcessPhenoData to concat.

export(output_sheet_name)[source]¶

Export pheno data to csv after done with manipulation.

output_sheet_name: Output csv name.

format_custom(basename_col, disease_class_column, include_columns={})[source]¶

Custom format clinical sheet if user supplied idats.

basename_col: Column name of sample names.
disease_class_column: Disease column of clinical info csv.
include_columns: Dictionary specifying other columns to include, and new names to assign them to.

format_geo(disease_class_column='methylation class:ch1', include_columns={})[source]¶

Format clinical sheets if downloaded geo idats.

disease_class_column: Disease column of clinical info csv.
include_columns: Dictionary specifying other columns to include, and new names to assign them to.

format_tcga(mapping_file='idat_filename_case.txt')[source]¶

Format clinical sheets if downloaded tcga idats.

mapping_file: Maps uuids to proper tcga sample names, should be downloaded with tcga clinical information.

get_categorical_distribution(key, disease_only=False, subtype_delimiter=', ')[source]¶

Print categorical distribution, counts for each unique value in phenotype column.

key: Phenotype Column.
disease_only: Whether to split phenotype column entries by delimiter.
subtype_delimiter: Subtype delimiter to split on.
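
As a rough illustration of what a categorical distribution with an optional delimiter split looks like, here is a minimal standard-library sketch. The function name and the sample values are hypothetical, and this is not the library's implementation; it assumes that disease_only keeps only the text before the first delimiter.

```python
from collections import Counter


def categorical_distribution(values, disease_only=False, subtype_delimiter=","):
    """Count occurrences of each unique value in a phenotype column."""
    if disease_only:
        # Keep only the text before the first delimiter, e.g. "S1,s2" -> "S1".
        values = [v.split(subtype_delimiter)[0] for v in values]
    return Counter(values)


# Hypothetical phenotype column with "disease,subtype" entries.
column = ["S1,s2", "S1,s3", "S2,s1", "S1,s2"]
full_counts = categorical_distribution(column)
disease_counts = categorical_distribution(column, disease_only=True)
print(dict(full_counts))
print(dict(disease_counts))
```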

merge(other_formatted_sheet, use_second_sheet_disease=True, no_disease_merge=False)[source]¶

Merge multiple PreProcessPhenoData objects by merging their dataframes, to accept more than one samplesheet/dataset or add more pheno info.

other_formatted_sheet: Other PreProcessPhenoData to merge.
use_second_sheet_disease: Change disease column to that of second sheet instead of first.
no_disease_merge: Keep both disease columns from both sheets.

remove_diseases(exclude_disease_list, low_count, disease_only, subtype_delimiter)[source]¶

Remove samples with certain diseases from disease column.

exclude_disease_list: List containing diseases to remove.
low_count: Remove samples that have fewer than x disease occurrences in column.
disease_only: Whether to split phenotype column entries by delimiter.
subtype_delimiter: Subtype delimiter to split on.

split_key(key, subtype_delimiter)[source]¶

Split pheno column by key, with subtype delimiter, e.g. entry S1,s2 -> S1 with delimiter “,”.

key: Pheno column name.
subtype_delimiter: Subtype delimiter to split on.

class pymethylprocess.PreProcessDataTypes.TCGADownloader[source]¶

Downloads TCGA and GEO IDAT and clinical data

download_clinical(output_dir)[source]¶

Download TCGA Clinical Data.

output_dir: Where to output clinical data csv.

download_geo(query, output_dir)[source]¶

Download GEO IDATs.

query: GEO accession number to query, must be 450k/850k.
output_dir: Output directory to store idats and clinical information csv

download_tcga(output_dir)[source]¶

Download TCGA IDATs.

output_dir: Where to output idat files.

MethylationDataTypes.py¶

Contains datatypes core to storing beta and phenotype methylation data, and imputation.

class pymethylprocess.MethylationDataTypes.ImputerObject(solver, method, opts={})[source]¶

Class that stores and accesses different types of imputers. Construct sklearn-like imputer given certain input arguments.

solver: Library for imputation, eg. sklearn, fancyimpute.
method: Imputation method in library, named.
opts: Additional options to assign to imputer.

return_imputer()[source]¶: Return initialized sklearn-like imputer.

class pymethylprocess.MethylationDataTypes.MethylationArray(pheno_df, beta_df, name='')[source]¶

Stores beta and phenotype information and performs various operations. Initialize MethylationArray object by inputting dataframe of phenotypes and dataframe of beta values with samples as index.

pheno_df: Phenotype dataframe (samples x covariates)
beta_df: Beta Values Dataframe (samples x cpgs)

bin_column(col, n_bins)[source]¶

Turn continuous variable/covariate into categorical bins. Returns name of new column and updates phenotype matrix to reflect this change.

col: Continuous column of phenotype array to bin.
n_bins: Number of bins to create.
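
The binning strategy is not stated here. As a sketch only, assuming equal-width bins (the library may instead use quantiles or another scheme), turning a continuous covariate into n_bins categories could look like this. The function name and the ages are hypothetical.

```python
def bin_values(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0-indexed)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls in the first bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


# Hypothetical continuous phenotype column.
ages = [21, 35, 48, 50, 62, 79]
binned = bin_values(ages, 3)
print(binned)  # [0, 0, 1, 1, 2, 2]
```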

categorical_breakdown(key)[source]¶

Print categorical distribution, counts for each unique value in phenotype column.

key: Phenotype Column.

feature_select(n_top_cpgs, feature_selection_method='mad', metric='correlation', nn=10)[source]¶

Perform unsupervised feature selection on MethylationArray.

n_top_cpgs: Number of CpGs to retain.
feature_selection_method: Method to perform selection.
metric: If considering structural feature selection like SPEC, use this distance metric.
nn: Number of nearest neighbors.
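
To make the default 'mad' method concrete, the following sketch ranks CpGs by mean absolute deviation across samples and keeps the top n. It uses only the standard library and hypothetical CpG names; it illustrates the idea rather than reproducing the library's code.

```python
from statistics import mean


def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    center = mean(values)
    return mean(abs(v - center) for v in values)


def top_cpgs_by_mad(beta, n_top_cpgs):
    """beta maps CpG name -> list of beta values across samples."""
    ranked = sorted(
        beta, key=lambda cpg: mean_absolute_deviation(beta[cpg]), reverse=True
    )
    return ranked[:n_top_cpgs]


# Hypothetical beta values for three CpGs over four samples.
beta = {
    "cg_a": [0.10, 0.90, 0.15, 0.85],
    "cg_b": [0.50, 0.52, 0.49, 0.51],
    "cg_c": [0.30, 0.60, 0.35, 0.65],
}
selected = top_cpgs_by_mad(beta, 2)
print(selected)  # ['cg_a', 'cg_c']
```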

classmethod from_pickle(input_pickle)[source]¶

Load MethylationArray stored in pickle.

Usage: MethylationArray.from_pickle([input_pickle])

input_pickle: Stored MethylationArray pickle.

groupby(key)[source]¶

Groupby for Methylation Array. Returns generator of methylation arrays grouped by key.

key: Phenotype column to group by.

impute(imputer)[source]¶

Perform imputation on NaN beta values. Input imputer returned from ImputerObject.

imputer: Type of imputer object, in sklearn type interface.

merge_preprocess_sheet(preprocess_sample_df)[source]¶

Feed in another phenotype dataframe that will be merged with existing phenotype array.

preprocess_sample_df: New phenotype dataframe.

overwrite_pheno_data(preprocess_sample_df)[source]¶

Feed in another phenotype dataframe that will overwrite overlapping keys of existing phenotype array.

preprocess_sample_df: New phenotype dataframe.

remove_missingness(cpg_threshold=None, sample_threshold=None)[source]¶

Remove samples and CpGs with a certain level of missingness.

cpg_threshold: If more than this fraction of samples are missing for a CpG, remove the CpG.
sample_threshold: If more than this fraction of CpGs are missing for a sample, remove the sample.
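
The thresholding logic can be sketched as follows. This is a standard-library illustration with hypothetical names and data; it assumes CpGs are filtered before samples, which the description above does not specify.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    if not values:
        return 0.0
    return sum(v is None for v in values) / len(values)


def remove_missingness(rows, cpgs, cpg_threshold=None, sample_threshold=None):
    """rows maps sample name -> list of beta values aligned with cpgs."""
    if cpg_threshold is not None:
        # Keep CpG columns whose missing fraction does not exceed the threshold.
        keep = [
            j for j in range(len(cpgs))
            if missing_fraction([r[j] for r in rows.values()]) <= cpg_threshold
        ]
        cpgs = [cpgs[j] for j in keep]
        rows = {s: [r[j] for j in keep] for s, r in rows.items()}
    if sample_threshold is not None:
        # Keep samples whose missing fraction does not exceed the threshold.
        rows = {s: r for s, r in rows.items()
                if missing_fraction(r) <= sample_threshold}
    return rows, cpgs


# Hypothetical beta matrix; None marks a missing value.
cpgs = ["cg1", "cg2", "cg3"]
rows = {
    "s1": [0.1, None, 0.3],
    "s2": [0.2, None, None],
    "s3": [0.4, 0.5, 0.6],
}
kept_rows, kept_cpgs = remove_missingness(
    rows, cpgs, cpg_threshold=0.5, sample_threshold=0.4
)
print(kept_cpgs, sorted(kept_rows))
```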

remove_na_samples(outcome_cols)[source]¶

Remove samples of MethylationArray that have missing values in phenotype column.

outcome_cols: Phenotype columns, if any rows contain missing values, samples are removed.

remove_whitespace(key)[source]¶

Remove whitespaces from phenotype column.

key: Phenotype column.

return_cpgs()[source]¶: Return list of cpgs of MethylationArray

return_idx()[source]¶: Return sample names of MethylationArray.

return_raw_beta_array()[source]¶: Return numpy array of methylation beta values.

return_shape()[source]¶: Return dimensionality and number of samples of beta matrix.

split_by_subtype(disease_only, subtype_delimiter)[source]¶

Split MethylationArray into generator of MethylationArrays by phenotype column. Much akin to groupby. Only splits from disease column.

disease_only: Consider disease superclass.
subtype_delimiter: How to break up disease column if using disease_only.

split_key(key, subtype_delimiter)[source]¶

Manipulate an entire phenotype column, splitting each element up by some delimiter.

key: Phenotype column.
subtype_delimiter: How to break up strings in columns. S1,s2 -> S1 for instance.

split_train_test(train_p=0.8, stratified=True, disease_only=False, key='disease', subtype_delimiter=', ', val_p=0.0)[source]¶

Split MethylationArray into training and test sets, with option to stratify by categorical covariate.

train_p: Fraction of methylation array to use as training set.
stratified: Whether to stratify by categorical variable.
disease_only: Consider disease superclass by some delimiter. For instance if disease is S1,s2, superclass would be S1.
key: Column to stratify on.
subtype_delimiter: How to split disease column into super/subclass.
val_p: If set greater than 0, will create additional validation set, fraction of which is broken off from training set.
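
For intuition about the stratified option, here is a minimal sketch of a stratified train/test split using only the standard library. The function name, seed handling, and sample labels are assumptions made for illustration and do not describe how split_train_test is implemented.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_p=0.8, seed=42):
    """labels maps sample name -> category. Returns (train, test) sample lists."""
    by_class = defaultdict(list)
    for sample, label in labels.items():
        by_class[label].append(sample)
    rng = random.Random(seed)
    train, test = [], []
    for label in sorted(by_class):
        samples = sorted(by_class[label])
        rng.shuffle(samples)
        # Split each class separately so class proportions are preserved.
        n_train = round(train_p * len(samples))
        train.extend(samples[:n_train])
        test.extend(samples[n_train:])
    return train, test


# Hypothetical cohort: 10 tumor and 10 normal samples.
labels = {f"sample_{i}": ("tumor" if i < 10 else "normal") for i in range(20)}
train, test = stratified_split(labels, train_p=0.8)
print(len(train), len(test))  # 16 4
```

Because each class is split on its own, the test set here contains two tumor and two normal samples regardless of the shuffle.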

subsample(key='disease', n_samples=None, frac=None, categorical=False)[source]¶

Subsample MethylationArray, make the set randomly smaller.

key: If stratifying, use this column of pheno array.
n_samples: Number of samples to consider overall, or per stratum.
frac: Alternative to n_samples, where x frac of array or stratum is considered.
categorical: Whether to stratify by column.

subset_cpgs(cpgs)[source]¶

Subset beta matrix by list of CpGs.

cpgs: CpGs to subset by.

subset_index(index)[source]¶

Subset MethylationArray by samples.

index: Sample names to subset by.

write_csvs(output_dir)[source]¶

Write phenotype data and beta values to csvs.

output_dir: Directory to output csv files.

write_db(conn, disease='')[source]¶

Store phenotype data and beta values in SQL database.

conn: SQLite connection.
disease: Create new tables in db that are related to disease state by this name.

write_pickle(output_pickle, disease='')[source]¶

Store phenotype data and beta values in pickle file. Is default file format for storing MethylationArray objects.

output_pickle: Pickle file to store MethylationArray data.

class pymethylprocess.MethylationDataTypes.MethylationArrays(list_methylation_arrays)[source]¶

A list of methylation arrays, with methods that operate on these arrays in a memory-efficient manner. Initialize with list of methylation arrays. Can optionally leave list empty or with one element.

list_methylation_arrays: List of methylation arrays.

combine(array_generator=None)[source]¶

Combine the list of methylation arrays into one array via concatenation of beta matrices and phenotype arrays.

array_generator: Generator of additional methylation arrays for computational memory minimization.

impute(imputer)[source]¶

Impute all methylation arrays.

imputer: Type of imputation, sklearn-like.

write_dbs(conn)[source]¶

Write list of methylation arrays to SQL database. Recommend naming each MethylationArray.

conn: SQL connection.

write_pkls(pkl)[source]¶

Write list of methylation arrays to single pickle. Recommend naming each MethylationArray.

pkl: Pickle file to write to.

pymethylprocess.MethylationDataTypes.extract_pheno_beta_df_from_folder(folder)[source]¶

Return phenotype and beta dataframes from specified folder with csv.

folder: Input folder.

pymethylprocess.MethylationDataTypes.extract_pheno_beta_df_from_pickle_dict(input_dict, disease='')[source]¶

Return phenotype and beta dataframes from specified dictionary storing MethylationArray python dictionary.

input_dict: Python dictionary storing pheno/beta information.

pymethylprocess.MethylationDataTypes.extract_pheno_beta_df_from_sql(conn, disease='')[source]¶

Return phenotype and beta dataframes from SQL tables storing MethylationArray info.

conn: SQL connection.

meffil_functions.py¶

Contains a few R functions that interact with meffil and minfi.

pymethylprocess.meffil_functions.est_cell_counts_IDOL(rgset, library)[source]¶

Given RGSet object, estimate cell counts for 450k/850k using reference approach via IDOL library.

rgset: RGSet object stored in python via rpy2
library: What type of CpG library to use.

pymethylprocess.meffil_functions.est_cell_counts_meffil(qc_list, cell_type_reference)[source]¶

Given QCObject list R object, estimate cell counts using reference approach via meffil.

qc_list: R list containing qc objects.
cell_type_reference: Reference blood/tissue set.

pymethylprocess.meffil_functions.est_cell_counts_minfi(rgset)[source]¶

Given RGSet object, estimate cell counts using reference approach via minfi.

rgset: RGSet object stored in python via rpy2

pymethylprocess.meffil_functions.load_detection_p_values_beadnum(qc_list, n_cores)[source]¶

Return list of detection p-value matrix and bead number matrix.

qc_list: R list containing qc objects.
n_cores: Number of cores to use in computation.

pymethylprocess.meffil_functions.r_autosomal_cpgs(array_type='450k')[source]¶

Return list of autosomal cpg probes per platform.

array_type: 450k/850k array?

pymethylprocess.meffil_functions.r_snp_cpgs(array_type='450k')[source]¶

Return list of SNP cpg probes per platform.

array_type: 450k/850k array?

pymethylprocess.meffil_functions.remove_sex(beta, array_type='450k')[source]¶

Remove non-autosomal cpgs from beta matrix.

array_type: 450k/850k array?

pymethylprocess.meffil_functions.set_missing(beta, pval_beadnum, detection_val=1e-06)[source]¶

Set missing beta values to NA, taking into account detection values and bead number thresholds.

pval_beadnum: Detection pvalues and number of beads per cpg/samples
detection_val: Detection p-value threshold used to set a site to missing.
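
A simplified picture of detection-based masking is sketched below. It assumes that a beta value is set to missing when its detection p-value exceeds detection_val, which is the usual convention but is not confirmed by the text above. It ignores the bead number criterion entirely, and the function name is hypothetical.

```python
def mask_by_detection(beta, detection_p, detection_val=1e-6):
    """Replace beta values with None where the detection p-value is too high.

    beta and detection_p are equally shaped lists of rows.
    """
    return [
        [b if p <= detection_val else None for b, p in zip(b_row, p_row)]
        for b_row, p_row in zip(beta, detection_p)
    ]


# Hypothetical 2 x 2 beta matrix with matching detection p-values.
beta = [[0.2, 0.8], [0.5, 0.4]]
detection_p = [[1e-8, 0.01], [1e-7, 1e-6]]
masked = mask_by_detection(beta, detection_p)
print(masked)  # [[0.2, None], [0.5, 0.4]]
```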

general_machine_learning.py¶

Contains a machine learning class to perform scikit-learn like operations, along with held-out hyperparameter grid search.

class pymethylprocess.general_machine_learning.MachineLearning(model, options, grid={}, labelencode=False, n_eval=0)[source]¶

Machine learning class to run sklearn-like pipeline on MethylationArray data. Initialize object with scikit-learn model, and optionally supply a hyperparameter search grid.

model: Scikit-learn-like model, classification, regression, dimensionality reduction, clustering etc.
options: Options to supply model in form of dictionary.
grid: Alternatively, supply search grid to search for best hyperparameters.
labelencode: T/F encode string labels.
n_eval: Number of evaluations for randomized grid search, if set to 0, perform exhaustive grid search

assign_results_to_pheno_col(methyl_array, new_col, output_pkl)[source]¶

Assign results to new phenotype column.

methyl_array: MethylationArray.
new_col: New column name.
output_pkl: Output pickle to dump MethylationArray to.

fit(train_methyl_array, val_methyl_array=None, outcome_cols=None)[source]¶

Fit data to model.

train_methyl_array: Training MethylationArray.
val_methyl_array: Validation MethylationArray. Can set to None.
outcome_cols: Phenotype column(s) to train on; can be multiple. Set to None if not needed.

fit_predict(train_methyl_array, outcome_cols=None)[source]¶

Fit and predict training data.

train_methyl_array: Training MethylationArray.
outcome_cols: Phenotype column(s) to train on; can be multiple. Set to None if not needed.

fit_transform(train_methyl_array, outcome_cols=None)[source]¶

Fit and transform to training data.

train_methyl_array: Training MethylationArray.
outcome_cols: Phenotype column(s) to train on; can be multiple. Set to None if not needed.

predict(test_methyl_array)[source]¶

Make new predictions on test methylation array.

test_methyl_array: Testing MethylationArray.

return_outcome_metric(methyl_array, outcome_cols, metric, run_bootstrap=False)[source]¶

Supply metric to evaluate results.

methyl_array: MethylationArray to evaluate.
outcome_cols: Outcome phenotype columns.
metric: Sklearn evaluation metric.
run_bootstrap: Make 95% CI from 1k bootstraps.
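
The run_bootstrap option mentions a 95% CI from 1k bootstraps. One common way to build such an interval is the percentile bootstrap, sketched here with the standard library. Whether the library uses the percentile method is an assumption, and the function names and data are hypothetical.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_bootstrap=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a metric."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstrap):
        # Resample sample indices with replacement and re-evaluate the metric.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_bootstrap)]
    upper = scores[int((1 - alpha / 2) * n_bootstrap) - 1]
    return lower, upper


# Hypothetical labels and predictions.
y_true = [0, 1, 1, 0, 1, 0, 1, 1, 0, 0]
y_pred = [0, 1, 0, 0, 1, 0, 1, 1, 1, 0]
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(accuracy(y_true, y_pred), lower, upper)
```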

store_results(output_pkl, results_dict={})[source]¶

Store results in pickle file.

output_pkl: Output pickle to dump results to.
results_dict: Supply own results dict to be dumped.

transform(test_methyl_array)[source]¶

Transform test methylation array.

test_methyl_array: Testing MethylationArray.

transform_results_to_beta(methyl_array, output_pkl)[source]¶

Transform beta matrix into reduced beta matrix and store.

methyl_array: MethylationArray.
output_pkl: Output pickle to dump MethylationArray to.

pymethyl-install¶

pymethyl-install [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

change_gcc_path¶

Change GCC and G++ paths if you don’t have version 7.2.0. [Experimental]

pymethyl-install change_gcc_path [OPTIONS]

install_bioconductor¶

Installs bioconductor.

pymethyl-install install_bioconductor [OPTIONS]

install_custom¶

Installs bioconductor packages.

pymethyl-install install_custom [OPTIONS]

Options

-p, --package <package>¶: Custom packages. [default: ENmix]

-m, --manager¶: Use BiocManager (recommended).

install_meffil¶

Installs meffil (update!).

pymethyl-install install_meffil [OPTIONS]

install_minfi_others¶

Installs minfi and other dependencies.

pymethyl-install install_minfi_others [OPTIONS]

install_r_packages¶

Installs r packages.

pymethyl-install install_r_packages [OPTIONS]

Options

-p, --package <package>¶: Custom packages. [default: ]

install_some_deps¶

Installs bioconductor, minfi, enmix, tcga biolinks, and meffil.

pymethyl-install install_some_deps [OPTIONS]

install_tcga_biolinks¶

Installs tcga biolinks.

pymethyl-install install_tcga_biolinks [OPTIONS]

pymethyl-visualize¶

pymethyl-visualize [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

plot_cell_type_results¶

Plot csv containing cell type results into side by side boxplots.

pymethyl-visualize plot_cell_type_results [OPTIONS]

Options

-i, --input_csv <input_csv>¶: Input csv. [default: cell_type_estimates.csv]

-o, --outfilename <outfilename>¶: Output png. [default: visualizations/cell_type_results.png]

-cols, --plot_cols <plot_cols>¶: Plot columns. [default: Gran, CD4T, CD8T, Bcell, Mono, NK, gMDSC]

-fs, --font_scale <font_scale>¶: Font scaling [default: 1.0]

plot_heatmap¶

Plot heatmap from CSV file.

pymethyl-visualize plot_heatmap [OPTIONS]

Options

-i, --input_csv <input_csv>¶: Input csv. [default: ]

-o, --outfilename <outfilename>¶: Output png. [default: output.png]

-idx, --index_col <index_col>¶: Index load dataframe [default: 0]

-fs, --font_scale <font_scale>¶: Font scaling [default: 1.0]

-min, --min_val <min_val>¶: Min heat val [default: 0.0]

-max, --max_val <max_val>¶: Max heat val, if -1, defaults to None [default: 1.0]

-a, --annot¶: Annotate heatmap [default: False]

-n, --norm¶: Normalize matrix data [default: False]

-c, --cluster¶: Cluster matrix data [default: False]

-m, --matrix_type <matrix_type>¶: Type of matrix supplied [default: none]

-x, --xticks¶: Show x ticks [default: False]

-y, --yticks¶: Show y ticks [default: False]

-t, --transpose¶: Transpose matrix data [default: False]

-col, --color_column <color_column>¶: Color column. [default: color]

transform_plot¶

Dimensionality reduce VAE or original beta values using UMAP and plot using plotly.

pymethyl-visualize transform_plot [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-c, --column_of_interest <column_of_interest>¶: Column extract from phenotype data. [default: disease]

-o, --output_file <output_file>¶: Output visualization. [default: ./visualization.html]

-nn, --n_neighbors <n_neighbors>¶: Number of neighbors UMAP. [default: 5]

-a, --axes_off¶: Whether to turn axes on or off.

-s, --supervised¶: Supervise umap embedding.

-d, --min_dist <min_dist>¶: UMAP min distance. [default: 0.1]

-m, --metric <metric>¶: Reduction metric. [default: euclidean]

-cc, --case_control_override¶: Add controls from case_control column and override current disease for classification tasks. [default: False]

pymethyl-preprocess¶

pymethyl-preprocess [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

batch_deploy_preprocess¶

Deploy multiple preprocessing jobs in series or parallel.

pymethyl-preprocess batch_deploy_preprocess [OPTIONS]

Options

-n, --n_cores <n_cores>¶: Number cores to use for preprocessing. [default: 6]

-i, --subtype_output_dir <subtype_output_dir>¶: Output subtypes pheno csv. [default: ./preprocess_outputs/]

-m, --meffil¶: Preprocess using meffil.

-t, --torque¶: Job submission torque.

-r, --run¶: Actually run local job or just print out command.

-s, --series¶: Run commands in series.

-p, --pc_qc_parameters_csv <pc_qc_parameters_csv>¶: For meffil, qc parameters and pcs for final qc and functional normalization. [default: ./preprocess_outputs/pc_qc_parameters.csv]

-u, --use_cache¶: If this is selected, loads qc results rather than running qc again. Only works for meffil selection.

-qc, --qc_only¶: Only perform QC for meffil pipeline, caches results into rds file for loading again, only works if use_cache is false.

-c, --chunk_size <chunk_size>¶: If not series, chunk up and run this number of commands at once. -1 means all commands at once.
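
The chunking behaviour described for --chunk_size can be pictured with the short sketch below. The function name and the job strings are hypothetical and the CLI's internals may differ; the sketch only shows how a list of commands divides into batches, with -1 meaning a single batch.

```python
def chunk_commands(commands, chunk_size):
    """Split a list of commands into batches that run at the same time."""
    if chunk_size == -1:
        # -1 means run every command in one batch.
        return [list(commands)]
    if chunk_size < 1:
        raise ValueError("chunk_size must be -1 or a positive integer")
    return [commands[i:i + chunk_size]
            for i in range(0, len(commands), chunk_size)]


# Hypothetical job names standing in for preprocessing commands.
jobs = ["job_1", "job_2", "job_3", "job_4", "job_5"]
print(chunk_commands(jobs, 2))   # [['job_1', 'job_2'], ['job_3', 'job_4'], ['job_5']]
print(chunk_commands(jobs, -1))  # [['job_1', 'job_2', 'job_3', 'job_4', 'job_5']]
```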

combine_methylation_arrays¶

If split MethylationArrays by subtype for either preprocessing or imputation, can use to recombine data for downstream step.

pymethyl-preprocess combine_methylation_arrays [OPTIONS]

Options

-i, --input_pkls <input_pkls>¶: Input pickles for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]

-d, --optional_input_pkl_dir <optional_input_pkl_dir>¶: Auto grab input pkls. [default: ]

-o, --output_pkl <output_pkl>¶: Output database for beta and phenotype data. [default: ./combined_outputs/methyl_array.pkl]

-e, --exclude <exclude>¶: If -d selected, these diseases will be excluded from study. [default: ]

concat_sample_sheets¶

Concat two sample files for more fields for minfi+ input, adds more samples.

pymethyl-preprocess concat_sample_sheets [OPTIONS]

Options

-s1, --sample_sheet1 <sample_sheet1>¶: Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info1.csv]

-s2, --sample_sheet2 <sample_sheet2>¶: Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info2.csv]

-os, --output_sample_sheet <output_sample_sheet>¶: CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

create_sample_sheet¶

Create sample sheet for input to minfi, meffil, or enmix.

pymethyl-preprocess create_sample_sheet [OPTIONS]

Options

-is, --input_sample_sheet <input_sample_sheet>¶: Clinical information downloaded from tcga/geo/custom. [default: ./tcga_idats/clinical_info.csv]

-s, --source_type <source_type>¶: Source type of data. [default: tcga]

-i, --idat_dir <idat_dir>¶: Idat directory. [default: ./tcga_idats/]

-os, --output_sample_sheet <output_sample_sheet>¶: CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

-m, --mapping_file <mapping_file>¶: Mapping file from uuid to TCGA barcode. Downloaded using download_tcga. [default: ./idat_filename_case.txt]

-l, --header_line <header_line>¶: Line to begin reading csv/xlsx. [default: 0]

-d, --disease_class_column <disease_class_column>¶: Disease classification column, for custom and geo datasets. [default: methylation class:ch1]

-b, --basename_col <basename_col>¶: Basename classification column, for custom datasets. [default: Sentrix ID (.idat)]

-c, --include_columns_file <include_columns_file>¶: Custom columns file containing columns to keep, separated by \n. Add a tab for each line if you wish to rename columns: original_name \t new_column_name [default: ]
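
To show the shape of the include-columns file, the sketch below parses such a file into a dictionary like the include_columns argument of format_custom and format_geo. It assumes one column per line with an optional tab followed by the new name; the parser and the column names are hypothetical and are not part of the package.

```python
def parse_include_columns(text):
    """Map each original column name to its new name (or itself if not renamed)."""
    columns = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Text before the first tab is the original name; text after is the new name.
        original, _, new = line.partition("\t")
        original, new = original.strip(), new.strip()
        columns[original] = new or original
    return columns


# Hypothetical file contents: two renamed columns and one kept as-is.
example = "age\tAge\nsex\ngleason score\tGleason\n"
parsed = parse_include_columns(example)
print(parsed)  # {'age': 'Age', 'sex': 'sex', 'gleason score': 'Gleason'}
```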

download_clinical¶

Download all TCGA 450k clinical info.

pymethyl-preprocess download_clinical [OPTIONS]

Options

-o, --output_dir <output_dir>¶: Output directory for exported idats. [default: ./tcga_idats/]

download_geo¶

Download geo methylation study idats and clinical info.

pymethyl-preprocess download_geo [OPTIONS]

Options

-g, --geo_query <geo_query>¶: GEO study to query. [default: ]

-o, --output_dir <output_dir>¶: Output directory for exported idats. [default: ./geo_idats/]

download_tcga¶

Download all tcga 450k data.

pymethyl-preprocess download_tcga [OPTIONS]

Options

-o, --output_dir <output_dir>¶: Output directory for exported idats. [default: ./tcga_idats/]

feature_select¶

Filter CpGs by taking x top CpGs with highest mean absolute deviation scores or via spectral feature selection.

pymethyl-preprocess feature_select [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./imputed_outputs/methyl_array.pkl]

-o, --output_pkl <output_pkl>¶: Output database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-n, --n_top_cpgs <n_top_cpgs>¶: Number cpgs to include with highest variance across population. [default: 300000]

-f, --feature_selection_method <feature_selection_method>¶

-mm, --metric <metric>¶

-nn, --n_neighbors <n_neighbors>¶: Number neighbors for feature selection, default enacts rbf kernel. [default: 0]

-m, --mad_top_cpgs <mad_top_cpgs>¶: Number cpgs to apply mad filtering first before more sophisticated feature selection. If 0 or primary feature selection is mad, no mad pre-filtering. [default: 0]

get_categorical_distribution¶

Get categorical distribution of columns of sample sheet.

pymethyl-preprocess get_categorical_distribution [OPTIONS]

Options

-is, --formatted_sample_sheet <formatted_sample_sheet>¶: Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/minfiSheet.csv]

-k, --key <key>¶: Column of csv to print statistics for. [default: disease]

-d, --disease_only¶: Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>¶: Delimiter for disease extraction. [default: ,]

imputation_pipeline¶

Imputation of subtype or no subtype using various imputation methods.

pymethyl-preprocess imputation_pipeline [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./combined_outputs/methyl_array.pkl]

-ss, --split_by_subtype¶: Imputes CpGs by subtype before combining again.

-m, --method <method>¶: Method of imputation. [default: KNN]

-s, --solver <solver>¶: Imputation library. [default: fancyimpute]

-k, --n_neighbors <n_neighbors>¶: Number neighbors for imputation if using KNN. [default: 5]

-r, --orientation <orientation>¶: Impute CpGs or samples. [default: Samples]

-o, --output_pkl <output_pkl>¶: Output database for beta and phenotype data. [default: ./imputed_outputs/methyl_array.pkl]

-n, --n_top_cpgs <n_top_cpgs>¶: Number cpgs to include with highest variance across population. Greater than 0 allows for mad filtering during imputation to skip mad step. [default: 0]

-f, --feature_selection_method <feature_selection_method>¶

-mm, --metric <metric>¶

-nfs, --n_neighbors_fs <n_neighbors_fs>¶: Number neighbors for feature selection, default enacts rbf kernel. [default: 0]

-d, --disease_only¶: Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>¶: Delimiter for disease extraction. [default: ,]

-st, --sample_threshold <sample_threshold>¶: Value between 0 and 1 for NaN removal. If a sample has sample_threshold proportion of cpgs missing, then remove sample. Set to -1 to not remove samples. [default: -1.0]

-ct, --cpg_threshold <cpg_threshold>¶: Value between 0 and 1 for NaN removal. If a cpg has cpg_threshold proportion of samples missing, then remove cpg. Set to -1 to not remove cpgs. [default: -1.0]

meffil_encode¶

Reformat file for meffil input.

pymethyl-preprocess meffil_encode [OPTIONS]

Options

-is, --input_sample_sheet <input_sample_sheet>¶: CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

-os, --output_sample_sheet <output_sample_sheet>¶: CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

merge_sample_sheets¶

Merge two sample files for more fields for minfi+ input.

pymethyl-preprocess merge_sample_sheets [OPTIONS]

Options

-s1, --sample_sheet1 <sample_sheet1>¶: Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info1.csv]

-s2, --sample_sheet2 <sample_sheet2>¶: Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info2.csv]

-os, --output_sample_sheet <output_sample_sheet>¶: CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

-d, --second_sheet_disease¶: Use second sheet’s disease column.

-nd, --no_disease_merge¶: Don’t merge disease columns.

na_report¶

Print proportion of missing values throughout dataset.

pymethyl-preprocess na_report [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]

-o, --output_dir <output_dir>¶: Output directory for na report. [default: ./na_report/]

-r, --head_directory¶: -i option becomes directory, and searches there for multiple input pickles.

preprocess_pipeline¶

Perform preprocessing of idats using enmix or meffil.

pymethyl-preprocess preprocess_pipeline [OPTIONS]

Options

-i, --idat_dir <idat_dir>¶: Idat dir for one sample sheet, alternatively can be your phenotype sample sheet. [default: ./tcga_idats/]

-n, --n_cores <n_cores>¶: Number of cores to use for preprocessing. [default: 6]

-o, --output_pkl <output_pkl>¶: Output database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]

-m, --meffil¶: Preprocess using meffil.

-pc, --n_pcs <n_pcs>¶: For meffil, number of principal components for functional normalization. If set to -1, then PCs are selected using elbow method. [default: -1]

-p, --pipeline <pipeline>¶: If not meffil, preprocess using minfi or enmix. [default: enmix]

-noob, --noob_norm¶: Run noob normalization if minfi is selected.

-u, --use_cache¶: If selected, load cached qc results rather than running qc again, and update them with the new qc parameters. Only works when meffil is selected; minfi and enmix just load the RGSet.

-qc, --qc_only¶: Only perform QC for the meffil pipeline and cache the results in an rds file for later loading. Only works if use_cache is false. Minfi and enmix just save the RGSet before preprocessing.

-bns, --p_beadnum_samples <p_beadnum_samples>¶: From meffil documentation, “fraction of probes that failed the threshold of 3 beads”. [default: 0.05]

-pds, --p_detection_samples <p_detection_samples>¶: From meffil documentation, “fraction of probes that failed a detection.pvalue threshold of 0.01”. [default: 0.05]

-bnc, --p_beadnum_cpgs <p_beadnum_cpgs>¶: From meffil documentation, “fraction of samples that failed the threshold of 3 beads”. [default: 0.05]

-pdc, --p_detection_cpgs <p_detection_cpgs>¶: From meffil documentation, “fraction of samples that failed a detection.pvalue threshold of 0.01”. [default: 0.05]

-sc, --sex_cutoff <sex_cutoff>¶: From meffil documentation, “difference of total median intensity for Y chromosome probes and X chromosome probes”. [default: -2]

-sd, --sex_sd <sex_sd>¶: From meffil documentation, “sex detection outliers if outside this range”. [default: 5]

remove_diseases¶

Exclude diseases from study by count number or exclusion list.

pymethyl-preprocess remove_diseases [OPTIONS]

Options

-is, --formatted_sample_sheet <formatted_sample_sheet>¶: Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info.csv]

-e, --exclude_disease_list <exclude_disease_list>¶: List of conditions to exclude, from disease column, comma delimited. [default: ]

-os, --output_sheet_name <output_sheet_name>¶: CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]

-l, --low_count <low_count>¶: Remove diseases whose count is below this value; not used by default. [default: 0]

-d, --disease_only¶: Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>¶: Delimiter for disease extraction. [default: ,]
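
The exclusion list and the low-count cutoff can be pictured as two successive filters on the sample sheet. The following is a hedged pandas sketch, assuming a sheet with a `disease` column; `filter_diseases` and the toy labels are hypothetical and do not reproduce the command's implementation.

```python
import pandas as pd


def filter_diseases(sheet, exclude=(), low_count=0):
    """Drop excluded diseases, then diseases with fewer than low_count rows."""
    # Step 1: remove every row whose disease is on the exclusion list.
    sheet = sheet[~sheet["disease"].isin(list(exclude))]
    # Step 2: optionally remove diseases that are too rare.
    if low_count > 0:
        counts = sheet["disease"].value_counts()
        keep = counts[counts >= low_count].index
        sheet = sheet[sheet["disease"].isin(keep)]
    return sheet


# Toy sample sheet (made-up labels).
sheet = pd.DataFrame({"disease": ["A", "A", "A", "B", "C", "C"]})
kept = filter_diseases(sheet, exclude=["C"], low_count=2)
```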

split_preprocess_input_by_subtype¶

Split preprocess input samplesheet by disease subtype.

pymethyl-preprocess split_preprocess_input_by_subtype [OPTIONS]

Options

-i, --idat_csv <idat_csv>¶: Idat csv for one sample sheet, alternatively can be your phenotype sample sheet. [default: ./tcga_idats/minfiSheet.csv]

-d, --disease_only¶: Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>¶: Delimiter for disease extraction. [default: ,]

-o, --subtype_output_dir <subtype_output_dir>¶: Output subtypes pheno csv. [default: ./preprocess_outputs/]

pymethyl-utils¶

pymethyl-utils [OPTIONS] COMMAND [ARGS]...

Options

--version¶: Show the version and exit.

backup_pkl¶

Copy methylarray pickle to new location to backup.

pymethyl-utils backup_pkl [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-o, --output_pkl <output_pkl>¶: Output database for beta and phenotype data. [default: ./backup/methyl_array.pkl]

bin_column¶

Convert continuous phenotype column into categorical by binning.

pymethyl-utils bin_column [OPTIONS]

Options

-t, --test_pkl <test_pkl>¶: Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]

-c, --col <col>¶: Column to turn into bins. [default: age]

-n, --n_bins <n_bins>¶: Number of bins. [default: 10]

-ot, --output_test_pkl <output_test_pkl>¶: Binned shap pickle for further testing. [default: ./train_val_test_sets/test_methyl_array_shap_binned.pkl]
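
Binning a continuous phenotype column can be sketched with `pandas.cut`. This assumes equal-width bins, which the option descriptions above do not confirm (quantile bins would be another reasonable choice); the `age` values and the `age_binned` column name are made up for illustration.

```python
import pandas as pd

# Toy phenotype table (hypothetical values).
pheno = pd.DataFrame(
    {"age": [21, 35, 48, 52, 67, 80]},
    index=["s1", "s2", "s3", "s4", "s5", "s6"],
)

n_bins = 3
# pd.cut splits the observed range into n_bins equal-width intervals
# and labels each sample with the interval it falls into.
pheno["age_binned"] = pd.cut(pheno["age"], bins=n_bins)
```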

concat_csv¶

Concatenate two csv files together.

pymethyl-utils concat_csv [OPTIONS]

Options

-i1, --input_csv <input_csv>¶: Beta csv. [default: ./beta1.csv]

-i2, --input_csv2 <input_csv2>¶: Beta/other csv 2. [default: ./cell_estimates.csv]

-o, --output_csv <output_csv>¶: Output csv. [default: ./beta.concat.csv]

-a, --axis <axis>¶: Axis to merge on. Columns are 0, rows are 1. [default: 1]

counts¶

Return categorical breakdown of phenotype column.

pymethyl-utils counts [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-k, --key <key>¶: Key to split on. [default: disease]

create_external_validation_set¶

Create external validation set containing same CpGs as training set.

pymethyl-utils create_external_validation_set [OPTIONS]

Options

-t, --train_pkl <train_pkl>¶: Input methyl array. [default: ./train_val_test_sets/train_methyl_array.pkl]

-q, --query_pkl <query_pkl>¶: Input methylation array to add/subtract cpgs to. [default: ./final_preprocessed/methyl_array.pkl]

-o, --output_pkl <output_pkl>¶: Output methyl array external validation. [default: ./external_validation/methyl_array.pkl]

-c, --cpg_replace_method <cpg_replace_method>¶: What to do for missing CpGs. [default: mid]
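
Aligning a query array to the training CpGs amounts to reindexing the columns of its beta matrix. The sketch below assumes samples-by-CpGs DataFrames and fills CpGs absent from the query with a placeholder of 0.5; what `mid` actually does is not specified here, so the fill value is an assumption, and `match_cpgs` is a hypothetical helper.

```python
import pandas as pd


def match_cpgs(query_beta, train_cpgs, fill_value=0.5):
    """Return query_beta restricted to, and ordered like, train_cpgs.

    CpGs missing from the query are added and filled with fill_value
    (a placeholder assumption, not the documented behaviour of 'mid').
    """
    return query_beta.reindex(columns=list(train_cpgs), fill_value=fill_value)


# Toy data: the query lacks cg03 and carries an extra CpG, cg99.
train_cpgs = ["cg01", "cg02", "cg03"]
query_beta = pd.DataFrame(
    {"cg01": [0.1, 0.2], "cg02": [0.8, 0.9], "cg99": [0.4, 0.6]},
    index=["q1", "q2"],
)
external = match_cpgs(query_beta, train_cpgs)
```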

feature_select_train_val_test¶

Filter CpGs by taking x top CpGs with highest mean absolute deviation scores or via spectral feature selection.

pymethyl-utils feature_select_train_val_test [OPTIONS]

Options

-i, --input_pkl_dir <input_pkl_dir>¶: Input database for beta and phenotype data. [default: ./train_val_test_sets/]

-o, --output_dir <output_dir>¶: Output database for beta and phenotype data. [default: ./train_val_test_sets_fs/]

-n, --n_top_cpgs <n_top_cpgs>¶: Number of cpgs to include with highest variance across population. [default: 300000]

-f, --feature_selection_method <feature_selection_method>¶

-mm, --metric <metric>¶

-nn, --n_neighbors <n_neighbors>¶: Number of neighbors for feature selection; the default of 0 uses an rbf kernel. [default: 0]

-m, --mad_top_cpgs <mad_top_cpgs>¶: Number of cpgs to keep with mad filtering before applying more sophisticated feature selection. If 0, or if the primary feature selection method is mad, no mad pre-filtering is applied. [default: 0]
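
Selecting the top CpGs by mean absolute deviation can be sketched as follows, assuming a samples-by-CpGs beta matrix. `top_mad_cpgs` is a hypothetical helper written for illustration; it is not the function the command calls.

```python
import pandas as pd


def top_mad_cpgs(beta, n_top_cpgs):
    """Return the names of the n_top_cpgs CpGs with the highest MAD."""
    # Mean absolute deviation of each CpG around its own mean.
    mad = (beta - beta.mean(axis=0)).abs().mean(axis=0)
    return list(mad.nlargest(n_top_cpgs).index)


# Toy beta matrix (made-up values): one flat, one mildly and one
# strongly varying CpG.
beta = pd.DataFrame(
    {
        "cg_flat": [0.5, 0.5, 0.5, 0.5],
        "cg_mid": [0.4, 0.6, 0.4, 0.6],
        "cg_var": [0.1, 0.9, 0.1, 0.9],
    }
)
selected = top_mad_cpgs(beta, n_top_cpgs=2)
reduced = beta[selected]
```

A common practice is to choose the CpGs on the training set only and then apply the same column subset to the validation and test sets; whether the command does so is not stated in this section.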

fix_key¶

Format certain column of phenotype array in MethylationArray.

pymethyl-utils fix_key [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-k, --key <key>¶: Key to split on. [default: disease]

-d, --disease_only¶: Only look at disease, or text before subtype_delimiter.

-sd, --subtype_delimiter <subtype_delimiter>¶: Delimiter for disease extraction. [default: ,]

-o, --output_pkl <output_pkl>¶: Output database for beta and phenotype data. [default: ./fixed_preprocessed/methyl_array.pkl]
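
The disease_only and subtype_delimiter options recur across several commands. Read literally, they keep only the text before the first delimiter in each label. A minimal sketch of that reading, with a hypothetical helper name and made-up labels:

```python
def disease_only(label, subtype_delimiter=","):
    """Return the text before the first subtype_delimiter in label."""
    return label.split(subtype_delimiter)[0]


# Hypothetical labels of the form "<disease><delimiter><subtype>".
labels = ["BRCA,LumA", "BRCA,Basal", "GBM"]
diseases = [disease_only(label) for label in labels]
```

Labels without the delimiter pass through unchanged, since `str.split` returns the whole string as its first element.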

modify_pheno_data¶

Use another spreadsheet to add more descriptive data to methylarray.

pymethyl-utils modify_pheno_data [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-is, --input_formatted_sample_sheet <input_formatted_sample_sheet>¶: Information passed through function create_sample_sheet, has Basename and disease fields. [default: ./tcga_idats/minfi_sheet.csv]

-o, --output_pkl <output_pkl>¶: Output database for beta and phenotype data. [default: ./modified_processed/methyl_array.pkl]

move_jpg¶

Move preprocessing jpegs to preprocessing output directory.

pymethyl-utils move_jpg [OPTIONS]

Options

-i, --input_dir <input_dir>¶: Directory containing jpg. [default: ./]

-o, --output_dir <output_dir>¶: Output directory for images. [default: ./preprocess_output_images/]

overwrite_pheno_data¶

Use another spreadsheet to add more descriptive data to methylarray.

pymethyl-utils overwrite_pheno_data [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-is, --input_formatted_sample_sheet <input_formatted_sample_sheet>¶: Information passed through function create_sample_sheet, has Basename and disease fields. [default: ./tcga_idats/minfi_sheet.csv]

-o, --output_pkl <output_pkl>¶: Output database for beta and phenotype data. [default: ./modified_processed/methyl_array.pkl]

-c, --index_col <index_col>¶: Index col when reading csv. [default: 0]

pkl_to_csv¶

Output methylarray pickle to csv.

pymethyl-utils pkl_to_csv [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-o, --output_dir <output_dir>¶: Output directory for beta and phenotype data. [default: ./final_preprocessed/]

-c, --col <col>¶: Column to color. [default: ]

print_number_sex_cpgs¶

Print number of non-autosomal CpGs.

pymethyl-utils print_number_sex_cpgs [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-a, --array_type <array_type>¶: Array Type. [default: 450k]

print_shape¶

Print dimensions of beta matrix.

pymethyl-utils print_shape [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

ref_estimate_cell_counts¶

Reference based cell type estimates.

pymethyl-utils ref_estimate_cell_counts [OPTIONS]

Options

-ro, --input_r_object_dir <input_r_object_dir>¶: Input directory containing qc data. [default: ./preprocess_outputs/]

-a, --algorithm <algorithm>¶: Algorithm to run cell type. [default: meffil]

-ref, --reference <reference>¶: Cell Type Reference. [default: cord blood gse68456]

-l, --library <library>¶: IDOL Library. [default: IDOLOptimizedCpGs450klegacy]

-o, --output_csv <output_csv>¶: Output cell type estimates. [default: ./added_cell_counts/cell_type_estimates.csv]

remove_sex¶

Remove non-autosomal CpGs.

pymethyl-utils remove_sex [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]

-o, --output_pkl <output_pkl>¶: Output methyl array autosomal. [default: ./autosomal/methyl_array.pkl]

-a, --array_type <array_type>¶: Array Type. [default: 450k]

remove_snps¶

Remove SNPs from methylation array.

pymethyl-utils remove_snps [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./autosomal/methyl_array.pkl]

-o, --output_pkl <output_pkl>¶: Output methyl array with SNPs removed. [default: ./no_snp/methyl_array.pkl]

-a, --array_type <array_type>¶: Array Type. [default: 450k]

set_part_array_background¶

Set subset of CpGs from beta matrix to background values.

pymethyl-utils set_part_array_background [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]

-c, --cpg_pkl <cpg_pkl>¶: Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]

-o, --output_pkl <output_pkl>¶: Output methyl array. [default: ./removal/methyl_array.pkl]

stratify¶

Split methylation array by key and store.

pymethyl-utils stratify [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-k, --key <key>¶: Key to split on. [default: disease]

-o, --output_dir <output_dir>¶: Output directory for stratified. [default: ./stratified/]
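
Splitting by key can be sketched as a group-by on the phenotype table followed by selecting the matching rows of the beta matrix. This assumes that both tables share a sample index; `stratify_by_key` is a hypothetical helper, and writing each piece to the output directory is left out.

```python
import pandas as pd


def stratify_by_key(pheno, beta, key="disease"):
    """Map each value of pheno[key] to its (pheno, beta) subset."""
    return {
        label: (group, beta.loc[group.index])
        for label, group in pheno.groupby(key)
    }


# Toy data (made-up labels and beta values).
pheno = pd.DataFrame(
    {"disease": ["A", "B", "A", "B"]}, index=["s1", "s2", "s3", "s4"]
)
beta = pd.DataFrame(
    {"cg01": [0.1, 0.2, 0.3, 0.4], "cg02": [0.9, 0.8, 0.7, 0.6]},
    index=["s1", "s2", "s3", "s4"],
)
strata = stratify_by_key(pheno, beta)
```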

subset_array¶

Only retain certain number of CpGs from methylation array.

pymethyl-utils subset_array [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]

-c, --cpg_pkl <cpg_pkl>¶: Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]

-o, --output_pkl <output_pkl>¶: Output methyl array. [default: ./subset/methyl_array.pkl]

train_test_val_split¶

Split methylation array into train, test, val.

pymethyl-utils train_test_val_split [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]

-o, --output_dir <output_dir>¶: Output directory for training, testing, and validation sets. [default: ./train_val_test_sets/]

-tp, --train_percent <train_percent>¶: Fraction of data used for training. [default: 0.8]

-vp, --val_percent <val_percent>¶: Percent of training data that comprises validation set. [default: 0.1]

-cat, --categorical¶: Multi-class prediction. [default: False]

-do, --disease_only¶: Only look at disease, or text before subtype_delimiter.

-k, --key <key>¶: Key to split on. [default: disease]

-sd, --subtype_delimiter <subtype_delimiter>¶: Delimiter for disease extraction. [default: ,]
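
Because val_percent is taken from the training portion rather than from the whole dataset, the defaults give a 72/8/20 split of 100 samples. The standard-library sketch below illustrates that arithmetic on sample IDs only. It shuffles at random and does not stratify by key, which the real command may do; the function name, the rounding, and the seed are assumptions.

```python
import random


def train_val_test_split(sample_ids, train_percent=0.8, val_percent=0.1, seed=42):
    """Split sample IDs into train, validation and test lists."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    # Samples assigned to training + validation together.
    n_train_total = round(len(ids) * train_percent)
    # Validation is carved out of the training portion.
    n_val = round(n_train_total * val_percent)
    val = ids[:n_val]
    train = ids[n_val:n_train_total]
    test = ids[n_train_total:]
    return train, val, test


sample_ids = [f"sample_{i}" for i in range(100)]
train, val, test = train_val_test_split(sample_ids)
print(len(train), len(val), len(test))  # 72 8 20
```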

write_cpgs¶

Write CpGs in methylation array to file.

pymethyl-utils write_cpgs [OPTIONS]

Options

-i, --input_pkl <input_pkl>¶: Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]

-c, --cpg_pkl <cpg_pkl>¶: Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]
