Welcome to PyMethylProcess’s documentation!¶
https://github.com/Christensen-Lab-Dartmouth/PyMethylProcess
To get started, install pymethylprocess with Docker (joshualevy44/pymethylprocess) or pip (pymethylprocess), then run pymethyl-install_r_dependencies.
Both an API and a CLI are available. Examples of CLI usage can be found in ./example_scripts.
PreProcessDataTypes.py¶
Contains datatypes core to downloading IDATs, preprocessing IDATs and samplesheets.
-
class
pymethylprocess.PreProcessDataTypes.
PreProcessIDAT
(idat_dir, minfi=None, enmix=None, base=None, meffil=None)[source]¶ Class that will preprocess IDATs using R pipelines.
- idat_dir
- Location of idats or samplesheet csv.
- minfi
- Rpy2 importr minfi library, default to None will load through rpy2
- enmix
- Rpy2 importr enmix library, default to None will load through rpy2
- base
- Rpy2 importr base library, default to None will load through rpy2
- meffil
- Rpy2 importr meffil library, default to None will load through rpy2
-
export_csv
(output_dir)[source]¶ Export pheno and beta dataframes to CSVs
- output_dir
- Where to store csvs.
-
export_pickle
(output_pickle, disease='')[source]¶ Export pheno and beta dataframes to pickle, stored in python dict that can be loaded into MethylationArray
- output_pickle
- Where to store MethylationArray.
- disease
- Custom naming scheme for data.
-
export_sql
(output_db, disease='')[source]¶ Export pheno and beta dataframes to SQL
- output_db
- Where to store data, sqlite db.
- disease
- Custom naming scheme for data.
-
extract_pheno_data
(methylset=False)[source]¶ Extract pheno data from MSet or RGSet, minfi.
- methylset
- If MSet has been created, set to True, else extract from original RGSet.
-
output_pheno_beta
(meffil=False)[source]¶ Get pheno and beta dataframe objects stored as attributes for input to MethylationArray object.
- meffil
- True if ran meffil pipeline.
-
plot_original_qc
(output_dir)[source]¶ Plot QC results from ENmix pipeline and possibly minfi. Still experimental.
- output_dir
- Where to store plots.
-
plot_qc_metrics
(output_dir)[source]¶ Plot QC results from ENmix pipeline and possibly minfi. Still experimental.
- output_dir
- Where to store plots.
-
preprocessENmix
(n_cores=6)[source]¶ Run ENmix preprocessing pipeline.
- n_cores
- Number of CPUs to use.
-
preprocessMeffil
(n_cores=6, n_pcs=4, qc_report_fname='qc/report.html', normalization_report_fname='norm/report.html', pc_plot_fname='qc/pc_plot.pdf', useCache=True, qc_only=True, qc_parameters={'p.beadnum.cpgs': 0.1, 'p.beadnum.samples': 0.1, 'p.detection.cpgs': 0.1, 'p.detection.samples': 0.1}, rm_sex=False)[source]¶ Run meffil preprocessing pipeline with functional normalization.
- n_cores
- Number of CPUs to use.
- n_pcs
- Number of principal components to use for functional normalization, set to -1 to autoselect via kneedle algorithm.
- qc_report_fname
- HTML filename to store QC report.
- normalization_report_fname
- HTML filename to store normalization report
- pc_plot_fname
- PDF file to store principal components plot.
- useCache
- Use saved QC objects instead of running through QC again.
- qc_only
- Perform QC, then save and quit before normalization.
- qc_parameters
- Python dictionary with parameters for qc.
- rm_sex
- Remove non-autosomal cpgs?
-
preprocess_enmix_pipeline
(n_cores=6, pipeline='enmix', noob=False, qc_only=False, use_cache=False)[source]¶ Run complete ENmix or minfi preprocessing pipeline.
- n_cores
- Number CPUs.
- pipeline
- Run enmix or minfi
- noob
- If running minfi, use noob normalization (True) or raw preprocessing (False).
- qc_only
- Save and quit after only running QC?
- use_cache
- Load preexisting RGSet instead of running QC again.
-
class
pymethylprocess.PreProcessDataTypes.
PreProcessPhenoData
(pheno_sheet, idat_dir, header_line=0)[source]¶ Class that will manipulate phenotype samplesheet before preprocessing of IDATs.
- pheno_sheet
- Location of clinical info csv.
- idat_dir
- Location of idats
- header_line
- Where to start reading clinical csv
-
concat
(other_formatted_sheet)[source]¶ Concat multiple PreProcessPhenoData objects, concatenating their dataframes to accept more than one samplesheet/dataset.
- other_formatted_sheet
- Other PreProcessPhenoData to concat.
-
export
(output_sheet_name)[source]¶ Export pheno data to csv after done with manipulation.
- output_sheet_name
- Output csv name.
-
format_custom
(basename_col, disease_class_column, include_columns={})[source]¶ Custom format clinical sheet if user supplied idats.
- basename_col
- Column name of sample names.
- disease_class_column
- Disease column of clinical info csv.
- include_columns
- Dictionary specifying other columns to include, and new names to assign them to.
-
format_geo
(disease_class_column='methylation class:ch1', include_columns={})[source]¶ Format clinical sheets if downloaded geo idats.
- disease_class_column
- Disease column of clinical info csv.
- include_columns
- Dictionary specifying other columns to include, and new names to assign them to.
-
format_tcga
(mapping_file='idat_filename_case.txt')[source]¶ Format clinical sheets if downloaded tcga idats.
- mapping_file
- Maps uuids to proper tcga sample names, should be downloaded with tcga clinical information.
-
get_categorical_distribution
(key, disease_only=False, subtype_delimiter=', ')[source]¶ Print categorical distribution, counts for each unique value in phenotype column.
- key
- Phenotype Column.
- disease_only
- Whether to split phenotype column entries by delimiter.
- subtype_delimiter
- Subtype delimiter to split on.
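As a rough illustration of what such a distribution looks like, the sketch below counts unique values in a phenotype column using only the standard library. It is not PyMethylProcess code: the `categorical_distribution` helper and the sample values are hypothetical, and it assumes that `disease_only` keeps only the text before `subtype_delimiter`.

```python
from collections import Counter


def categorical_distribution(values, disease_only=False, subtype_delimiter=","):
    """Count occurrences of each unique value (hypothetical helper)."""
    if disease_only:
        # Keep only the text before the delimiter, e.g. "S1,s2" -> "S1".
        values = [v.split(subtype_delimiter)[0] for v in values]
    return Counter(values)


# Toy phenotype column: disease superclass and subtype joined by a comma.
diseases = ["S1,s2", "S1,s3", "S2,s1", "S1,s2"]
full_counts = categorical_distribution(diseases)
superclass_counts = categorical_distribution(diseases, disease_only=True)
print(dict(full_counts))
print(dict(superclass_counts))
```

Comparing the two counters shows how collapsing subtypes changes the class balance before any samples are removed.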
-
merge
(other_formatted_sheet, use_second_sheet_disease=True, no_disease_merge=False)[source]¶ Merge multiple PreProcessPhenoData objects, merging their dataframes to accept more than one samplesheet/dataset or add more pheno info.
- other_formatted_sheet
- Other PreProcessPhenoData to merge.
- use_second_sheet_disease
- Change disease column to that of second sheet instead of first.
- no_disease_merge
- Keep both disease columns from both sheets.
-
remove_diseases
(exclude_disease_list, low_count, disease_only, subtype_delimiter)[source]¶ Remove samples with certain diseases from disease column.
- exclude_disease_list
- List containing diseases to remove.
- low_count
- Remove samples whose disease has fewer than x occurrences in column.
- disease_only
- Whether to split phenotype column entries by delimiter.
- subtype_delimiter
- Subtype delimiter to split on.
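The two removal criteria can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the library's implementation: the `samples_to_keep` helper is hypothetical, and it assumes a disease is dropped when it appears in the exclusion list or occurs fewer than `low_count` times.

```python
from collections import Counter


def samples_to_keep(diseases, exclude_disease_list=(), low_count=0):
    """Return indices of samples that survive both filters (hypothetical helper)."""
    counts = Counter(diseases)
    excluded = set(exclude_disease_list)
    return [
        i for i, disease in enumerate(diseases)
        # Drop explicitly excluded diseases and diseases that are too rare.
        if disease not in excluded and counts[disease] >= low_count
    ]


diseases = ["A", "A", "A", "B", "C", "C"]
kept = samples_to_keep(diseases, exclude_disease_list=["C"], low_count=2)
print([diseases[i] for i in kept])
```

Here "C" is removed by name and "B" is removed because it occurs only once.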
-
class
pymethylprocess.PreProcessDataTypes.
TCGADownloader
[source]¶ Downloads TCGA and GEO IDATs and clinical data.
-
download_clinical
(output_dir)[source]¶ Download TCGA Clinical Data.
- output_dir
- Where to output clinical data csv.
-
MethylationDataTypes.py¶
Contains datatypes core to storing beta and phenotype methylation data, and imputation.
-
class
pymethylprocess.MethylationDataTypes.
ImputerObject
(solver, method, opts={})[source]¶ Class that stores and accesses different types of imputers. Construct sklearn-like imputer given certain input arguments.
- solver
- Library for imputation, e.g. sklearn, fancyimpute.
- method
- Imputation method in library, named.
- opts
- Additional options to assign to imputer.
-
class
pymethylprocess.MethylationDataTypes.
MethylationArray
(pheno_df, beta_df, name='')[source]¶ Stores beta and phenotype information and performs various operations. Initialize MethylationArray object by inputting dataframe of phenotypes and dataframe of beta values with samples as index.
- pheno_df
- Phenotype dataframe (samples x covariates)
- beta_df
- Beta Values Dataframe (samples x cpgs)
-
bin_column
(col, n_bins)[source]¶ Turn continuous variable/covariate into categorical bins. Returns name of new column and updates phenotype matrix to reflect this change.
- col
- Continuous column of phenotype array to bin.
- n_bins
- Number of bins to create.
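To make the idea of binning concrete, here is a small standard-library sketch that assigns continuous values to equal-width bins. The documentation does not state how bin edges are chosen, so equal-width binning is an assumption, and `bin_values` is a hypothetical helper rather than part of the package.

```python
def bin_values(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values identical: everything falls into the first bin.
        return [0 for _ in values]
    # The maximum value is clamped into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 35, 50, 65, 80]
age_bins = bin_values(ages, n_bins=3)
print(age_bins)
```

The resulting bin labels can then be used as a categorical covariate, for example when stratifying a train/test split.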
-
categorical_breakdown
(key)[source]¶ Print categorical distribution, counts for each unique value in phenotype column.
- key
- Phenotype Column.
-
feature_select
(n_top_cpgs, feature_selection_method='mad', metric='correlation', nn=10)[source]¶ Perform unsupervised feature selection on MethylationArray.
- n_top_cpgs
- Number of CpGs to retain.
- feature_selection_method
- Method to perform selection.
- metric
- If considering structural feature selection like SPEC, use this distance metric.
- nn
- Number of nearest neighbors.
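The default `'mad'` method ranks CpGs by how much their beta values vary across samples. The sketch below illustrates that idea with the standard library, assuming "mad" means mean absolute deviation (as in the feature_select command description); the function names and the toy dict layout are hypothetical and not taken from the package.

```python
from statistics import mean


def mean_absolute_deviation(column):
    """Average absolute distance of each value from the column mean."""
    center = mean(column)
    return mean(abs(x - center) for x in column)


def top_cpgs_by_mad(beta, n_top_cpgs):
    """beta maps CpG name -> beta values across samples (toy layout)."""
    scores = {cpg: mean_absolute_deviation(vals) for cpg, vals in beta.items()}
    # Highest-variability CpGs first.
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:n_top_cpgs]


beta = {
    "cg01": [0.10, 0.90, 0.10, 0.90],
    "cg02": [0.50, 0.50, 0.50, 0.50],
    "cg03": [0.40, 0.60, 0.40, 0.60],
}
selected = top_cpgs_by_mad(beta, n_top_cpgs=2)
print(selected)
```

The constant CpG (`cg02`) carries no variation and is the first to be dropped.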
-
classmethod
from_pickle
(input_pickle)[source]¶ Load MethylationArray stored in pickle.
Usage: MethylationArray.from_pickle([input_pickle])
- input_pickle
- Stored MethylationArray pickle.
-
groupby
(key)[source]¶ Groupby for Methylation Array. Returns generator of methylation arrays grouped by key.
- key
- Phenotype column to group by.
-
impute
(imputer)[source]¶ Perform imputation on NaN beta values. Input imputer returned from ImputerObject.
- imputer
- Type of imputer object, in sklearn type interface.
-
merge_preprocess_sheet
(preprocess_sample_df)[source]¶ Feed in another phenotype dataframe that will be merged with existing phenotype array.
- preprocess_sample_df
- New phenotype dataframe.
-
overwrite_pheno_data
(preprocess_sample_df)[source]¶ Feed in another phenotype dataframe that will overwrite overlapping keys of existing phenotype array.
- preprocess_sample_df
- New phenotype dataframe.
-
remove_missingness
(cpg_threshold=None, sample_threshold=None)[source]¶ Remove samples and CpGs with a certain level of missingness.
- cpg_threshold
- If more than this fraction of samples is missing for a CpG, remove the CpG.
- sample_threshold
- If more than this fraction of CpGs is missing for a sample, remove the sample.
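The thresholding logic can be illustrated on a small samples x CpGs matrix. This is a sketch only: the helper names are hypothetical, `None` stands in for NaN, and filtering CpGs before samples is an assumption about ordering that the documentation does not specify.

```python
def missing_fraction(values):
    """Fraction of entries that are missing (None)."""
    return sum(v is None for v in values) / len(values)


def remove_missingness(rows, cpg_threshold=None, sample_threshold=None):
    """rows: per-sample lists (samples x CpGs); None marks a missing beta value."""
    if cpg_threshold is not None:
        n_cpgs = len(rows[0])
        # Keep CpG columns whose missing fraction does not exceed the threshold.
        keep_cols = [
            j for j in range(n_cpgs)
            if missing_fraction([row[j] for row in rows]) <= cpg_threshold
        ]
        rows = [[row[j] for j in keep_cols] for row in rows]
    if sample_threshold is not None:
        # Then keep samples whose missing fraction does not exceed the threshold.
        rows = [row for row in rows if missing_fraction(row) <= sample_threshold]
    return rows


rows = [
    [0.1, None, 0.3],
    [0.2, None, None],
    [0.4, 0.5, 0.6],
]
filtered = remove_missingness(rows, cpg_threshold=0.5, sample_threshold=0.4)
print(filtered)
```

In this toy matrix the second CpG is dropped first (two of three samples missing), after which the second sample is dropped because half of its remaining values are missing.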
-
remove_na_samples
(outcome_cols)[source]¶ Remove samples of MethylationArray who have missing values in phenotype column.
- outcome_cols
- Phenotype columns, if any rows contain missing values, samples are removed.
-
split_by_subtype
(disease_only, subtype_delimiter)[source]¶ Split MethylationArray into generator of MethylationArrays by phenotype column. Much akin to groupby. Only splits from disease column.
- disease_only
- Consider disease superclass.
- subtype_delimiter
- How to break up disease column if using disease_only.
-
split_key
(key, subtype_delimiter)[source]¶ Manipulate an entire phenotype column, splitting each element up by some delimiter.
- key
- Phenotype column.
- subtype_delimiter
- Delimiter used to break up strings in the column; for instance, S1,s2 -> S1.
-
split_train_test
(train_p=0.8, stratified=True, disease_only=False, key='disease', subtype_delimiter=', ', val_p=0.0)[source]¶ Split MethylationArray into training and test sets, with option to stratify by categorical covariate.
- train_p
- Fraction of methylation array to use as training set.
- stratified
- Whether to stratify by categorical variable.
- disease_only
- Consider disease superclass by some delimiter. For instance if disease is S1,s2, superclass would be S1.
- key
- Column to stratify on.
- subtype_delimiter
- How to split disease column into super/subclass.
- val_p
- If set greater than 0, will create additional validation set, fraction of which is broken off from training set.
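A stratified split keeps the class proportions roughly equal in each partition. The following sketch shows one way to do this with the standard library; it is an illustration under assumptions, not the method's actual implementation, and the `stratified_split` helper and its `seed` parameter are hypothetical.

```python
import random
from collections import defaultdict


def stratified_split(labels, train_p=0.8, seed=0):
    """Return (train_idx, test_idx), splitting each label group separately."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for i, label in enumerate(labels):
        groups[label].append(i)
    train_idx, test_idx = [], []
    for indices in groups.values():
        # Shuffle within the stratum, then cut at the training fraction.
        rng.shuffle(indices)
        cut = round(len(indices) * train_p)
        train_idx.extend(indices[:cut])
        test_idx.extend(indices[cut:])
    return sorted(train_idx), sorted(test_idx)


labels = ["A"] * 10 + ["B"] * 5
train_idx, test_idx = stratified_split(labels, train_p=0.8)
print(len(train_idx), len(test_idx))
```

A validation set, as controlled by `val_p`, could be carved out by applying the same function again to the training indices.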
-
subsample
(key='disease', n_samples=None, frac=None, categorical=False)[source]¶ Subsample MethylationArray, make the set randomly smaller.
- key
- If stratifying, use this column of pheno array.
- n_samples
- Number of samples to consider overall, or per stratum.
- frac
- Alternative to n_samples, where x frac of array or stratum is considered.
- categorical
- Whether to stratify by column.
-
subset_cpgs
(cpgs)[source]¶ Subset beta matrix by list of CpGs.
- cpgs
- CpGs to subset by.
-
write_csvs
(output_dir)[source]¶ Write phenotype data and beta values to csvs.
- output_dir
- Directory to output csv files.
-
class
pymethylprocess.MethylationDataTypes.
MethylationArrays
(list_methylation_arrays)[source]¶ A list of methylation arrays, with memory-efficient methods that operate on these arrays. Initialize with list of methylation arrays. Can optionally leave list empty or with one element.
- list_methylation_arrays
- List of methylation arrays.
-
combine
(array_generator=None)[source]¶ Combine the list of methylation arrays into one array via concatenation of beta matrices and phenotype arrays.
- array_generator
- Generator of additional methylation arrays for computational memory minimization.
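The benefit of passing a generator is that additional arrays can be loaded one at a time instead of being held in memory together. The sketch below demonstrates the pattern with toy dict-based "arrays"; the `combine` and `load_arrays` functions here are hypothetical stand-ins, not the package's classes.

```python
from itertools import chain


def combine(arrays, array_generator=None):
    """Concatenate toy arrays, each a dict with 'pheno' and 'beta' row lists."""
    combined = {"pheno": [], "beta": []}
    extra = array_generator if array_generator is not None else ()
    # chain() pulls arrays from the generator one at a time,
    # so they never all need to be in memory together.
    for array in chain(arrays, extra):
        combined["pheno"].extend(array["pheno"])
        combined["beta"].extend(array["beta"])
    return combined


def load_arrays(n):
    """Hypothetical loader that yields one toy array at a time."""
    for i in range(n):
        yield {"pheno": [f"sample_{i}"], "beta": [[0.1 * i, 0.2 * i]]}


first = {"pheno": ["sample_a"], "beta": [[0.5, 0.6]]}
combined = combine([first], array_generator=load_arrays(2))
print(combined["pheno"])
```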
-
pymethylprocess.MethylationDataTypes.
extract_pheno_beta_df_from_folder
(folder)[source]¶ Return phenotype and beta dataframes from specified folder with csv.
- folder
- Input folder.
meffil_functions.py¶
Contains a few R functions that interact with meffil and minfi.
-
pymethylprocess.meffil_functions.
est_cell_counts_IDOL
(rgset, library)[source]¶ Given RGSet object, estimate cell counts for 450k/850k using reference approach via IDOL library.
- rgset
- RGSet object stored in python via rpy2
- library
- What type of CpG library to use.
-
pymethylprocess.meffil_functions.
est_cell_counts_meffil
(qc_list, cell_type_reference)[source]¶ Given QCObject list R object, estimate cell counts using reference approach via meffil.
- qc_list
- R list containing qc objects.
- cell_type_reference
- Reference blood/tissue set.
-
pymethylprocess.meffil_functions.
est_cell_counts_minfi
(rgset)[source]¶ Given RGSet object, estimate cell counts using reference approach via minfi.
- rgset
- RGSet object stored in python via rpy2
-
pymethylprocess.meffil_functions.
load_detection_p_values_beadnum
(qc_list, n_cores)[source]¶ Return list of detection p-value matrix and bead number matrix.
- qc_list
- R list containing qc objects.
- n_cores
- Number of cores to use in computation.
-
pymethylprocess.meffil_functions.
r_autosomal_cpgs
(array_type='450k')[source]¶ Return list of autosomal cpg probes per platform.
- array_type
- 450k/850k array?
-
pymethylprocess.meffil_functions.
r_snp_cpgs
(array_type='450k')[source]¶ Return list of SNP cpg probes per platform.
- array_type
- 450k/850k array?
-
pymethylprocess.meffil_functions.
remove_sex
(beta, array_type='450k')[source]¶ Remove non-autosomal cpgs from beta matrix.
- array_type
- 450k/850k array?
-
pymethylprocess.meffil_functions.
set_missing
(beta, pval_beadnum, detection_val=1e-06)[source]¶ Set missing beta values to NA, taking into account detection values and bead number thresholds.
- pval_beadnum
- Detection pvalues and number of beads per cpg/samples
- detection_val
- Detection p-value threshold used to set a site to missing.
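The masking step can be pictured as follows. This is a simplified Python sketch, not the R-backed function itself: it assumes a value is set to missing when its detection p-value exceeds `detection_val` or its bead count is below a hypothetical `min_beads` of 3 (the bead threshold quoted from the meffil documentation elsewhere in this reference), and it takes the two matrices as separate arguments for clarity.

```python
def mask_unreliable(beta, pvals, beadnum, detection_val=1e-06, min_beads=3):
    """Return a copy of beta with unreliable entries replaced by None."""
    masked = []
    for beta_row, p_row, bead_row in zip(beta, pvals, beadnum):
        masked.append([
            # Mask on a poor detection p-value or too few beads.
            None if (p > detection_val or n < min_beads) else b
            for b, p, n in zip(beta_row, p_row, bead_row)
        ])
    return masked


beta = [[0.2, 0.8], [0.4, 0.6]]
pvals = [[1e-10, 0.05], [1e-8, 1e-9]]
beadnum = [[12, 10], [2, 9]]
masked = mask_unreliable(beta, pvals, beadnum)
print(masked)
```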
general_machine_learning.py¶
Contains a machine learning class to perform scikit-learn like operations, along with held-out hyperparameter grid search.
-
class
pymethylprocess.general_machine_learning.
MachineLearning
(model, options, grid={}, labelencode=False, n_eval=0)[source]¶ Machine learning class to run sklearn-like pipeline on MethylationArray data. Initialize object with scikit-learn model, and optionally supply a hyperparameter search grid.
- model
- Scikit-learn-like model, classification, regression, dimensionality reduction, clustering etc.
- options
- Options to supply model in form of dictionary.
- grid
- Alternatively, supply search grid to search for best hyperparameters.
- labelencode
- T/F encode string labels.
- n_eval
- Number of evaluations for randomized grid search; if set to 0, perform exhaustive grid search.
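The difference between exhaustive and randomized search comes down to which hyperparameter combinations get evaluated. A minimal standard-library sketch of that selection step is shown below; `candidate_settings`, its `seed` parameter, and the example grid are hypothetical, and the class itself may enumerate candidates differently.

```python
import itertools
import random


def candidate_settings(grid, n_eval=0, seed=0):
    """List hyperparameter combinations: all of them, or a random subset."""
    keys = sorted(grid)
    combos = [
        dict(zip(keys, values))
        for values in itertools.product(*(grid[k] for k in keys))
    ]
    if n_eval <= 0 or n_eval >= len(combos):
        # Exhaustive search: evaluate every combination.
        return combos
    # Randomized search: evaluate n_eval combinations chosen without replacement.
    return random.Random(seed).sample(combos, n_eval)


grid = {"n_estimators": [10, 100], "max_depth": [2, 4, 8]}
exhaustive = candidate_settings(grid)
randomized = candidate_settings(grid, n_eval=3)
print(len(exhaustive), len(randomized))
```

Each candidate would then be fit on the training array and scored on the held-out validation array to pick the best setting.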
-
assign_results_to_pheno_col
(methyl_array, new_col, output_pkl)[source]¶ Assign results to new phenotype column.
- methyl_array
- MethylationArray.
- new_col
- New column name.
- output_pkl
- Output pickle to dump MethylationArray to.
-
fit
(train_methyl_array, val_methyl_array=None, outcome_cols=None)[source]¶ Fit data to model.
- train_methyl_array
- Training MethylationArray.
- val_methyl_array
- Validation MethylationArray. Can set to None.
- outcome_cols
- Phenotype column(s) to train on; can be multiple. Set to None if not needed.
-
fit_predict
(train_methyl_array, outcome_cols=None)[source]¶ Fit and predict training data.
- train_methyl_array
- Training MethylationArray.
- outcome_cols
- Phenotype column(s) to train on; can be multiple. Set to None if not needed.
-
fit_transform
(train_methyl_array, outcome_cols=None)[source]¶ Fit and transform to training data.
- train_methyl_array
- Training MethylationArray.
- outcome_cols
- Phenotype column(s) to train on; can be multiple. Set to None if not needed.
-
predict
(test_methyl_array)[source]¶ Make new predictions on test methylation array.
- test_methyl_array
- Testing MethylationArray.
-
return_outcome_metric
(methyl_array, outcome_cols, metric, run_bootstrap=False)[source]¶ Supply metric to evaluate results.
- methyl_array
- MethylationArray to evaluate.
- outcome_cols
- Outcome phenotype columns.
- metric
- Sklearn evaluation metric.
- run_bootstrap
- Make 95% CI from 1k bootstraps.
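For readers unfamiliar with bootstrapped confidence intervals, the sketch below resamples (true, predicted) pairs with replacement 1,000 times and takes the 2.5th and 97.5th percentiles of the metric. The percentile method is an assumption about how the interval is formed, and `accuracy` and `bootstrap_ci` are hypothetical helpers written with the standard library only.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, metric, n_bootstraps=1000, seed=0):
    """Percentile 95% CI of metric over resampled (true, predicted) pairs."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstraps):
        # Resample sample indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[round(0.025 * n_bootstraps)]
    upper = scores[round(0.975 * n_bootstraps) - 1]
    return lower, upper


y_true = [0, 1] * 10
y_pred = list(y_true)
for i in (0, 5, 10, 15):
    y_pred[i] = 1 - y_pred[i]  # introduce four errors

point = accuracy(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(point, lower, upper)
```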
-
store_results
(output_pkl, results_dict={})[source]¶ Store results in pickle file.
- output_pkl
- Output pickle to dump results to.
- results_dict
- Supply own results dict to be dumped.
pymethyl-install¶
pymethyl-install [OPTIONS] COMMAND [ARGS]...
Options
-
--version
¶
Show the version and exit.
change_gcc_path¶
Change GCC and G++ paths if you don’t have version 7.2.0. [Experimental]
pymethyl-install change_gcc_path [OPTIONS]
install_custom¶
Installs bioconductor packages.
pymethyl-install install_custom [OPTIONS]
Options
-
-p
,
--package
<package>
¶ Custom packages. [default: ENmix]
-
-m
,
--manager
¶
Use BiocManager (recommended).
install_minfi_others¶
Installs minfi and other dependencies.
pymethyl-install install_minfi_others [OPTIONS]
install_r_packages¶
Installs r packages.
pymethyl-install install_r_packages [OPTIONS]
Options
-
-p
,
--package
<package>
¶ Custom packages. [default: ]
install_some_deps¶
Installs bioconductor, minfi, enmix, tcga biolinks, and meffil.
pymethyl-install install_some_deps [OPTIONS]
pymethyl-visualize¶
pymethyl-visualize [OPTIONS] COMMAND [ARGS]...
Options
-
--version
¶
Show the version and exit.
plot_cell_type_results¶
Plot csv containing cell type results into side by side boxplots.
pymethyl-visualize plot_cell_type_results [OPTIONS]
Options
-
-i
,
--input_csv
<input_csv>
¶ Input csv. [default: cell_type_estimates.csv]
-
-o
,
--outfilename
<outfilename>
¶ Output png. [default: visualizations/cell_type_results.png]
-
-cols
,
--plot_cols
<plot_cols>
¶ Plot columns. [default: Gran, CD4T, CD8T, Bcell, Mono, NK, gMDSC]
-
-fs
,
--font_scale
<font_scale>
¶ Font scaling [default: 1.0]
plot_heatmap¶
Plot heatmap from CSV file.
pymethyl-visualize plot_heatmap [OPTIONS]
Options
-
-i
,
--input_csv
<input_csv>
¶ Input csv. [default: ]
-
-o
,
--outfilename
<outfilename>
¶ Output png. [default: output.png]
-
-idx
,
--index_col
<index_col>
¶ Index load dataframe [default: 0]
-
-fs
,
--font_scale
<font_scale>
¶ Font scaling [default: 1.0]
-
-min
,
--min_val
<min_val>
¶ Min heat val [default: 0.0]
-
-max
,
--max_val
<max_val>
¶ Max heat val, if -1, defaults to None [default: 1.0]
-
-a
,
--annot
¶
Annotate heatmap [default: False]
-
-n
,
--norm
¶
Normalize matrix data [default: False]
-
-c
,
--cluster
¶
Cluster matrix data [default: False]
-
-m
,
--matrix_type
<matrix_type>
¶ Type of matrix supplied [default: none]
-
-x
,
--xticks
¶
Show x ticks [default: False]
-
-y
,
--yticks
¶
Show y ticks [default: False]
-
-t
,
--transpose
¶
Transpose matrix data [default: False]
-
-col
,
--color_column
<color_column>
¶ Color column. [default: color]
transform_plot¶
Dimensionality reduce VAE or original beta values using UMAP and plot using plotly.
pymethyl-visualize transform_plot [OPTIONS]
Options
-
-i
,
--input_pkl
<input_pkl>
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-
-c
,
--column_of_interest
<column_of_interest>
¶ Column extract from phenotype data. [default: disease]
-
-o
,
--output_file
<output_file>
¶ Output visualization. [default: ./visualization.html]
-
-nn
,
--n_neighbors
<n_neighbors>
¶ Number of neighbors UMAP. [default: 5]
-
-a
,
--axes_off
¶
Whether to turn axes on or off.
-
-s
,
--supervised
¶
Supervise umap embedding.
-
-d
,
--min_dist
<min_dist>
¶ UMAP min distance. [default: 0.1]
-
-m
,
--metric
<metric>
¶ Reduction metric. [default: euclidean]
-
-cc
,
--case_control_override
¶
Add controls from case_control column and override current disease for classification tasks. [default: False]
pymethyl-preprocess¶
pymethyl-preprocess [OPTIONS] COMMAND [ARGS]...
Options
-
--version
¶
Show the version and exit.
batch_deploy_preprocess¶
Deploy multiple preprocessing jobs in series or parallel.
pymethyl-preprocess batch_deploy_preprocess [OPTIONS]
Options
-
-n
,
--n_cores
<n_cores>
¶ Number cores to use for preprocessing. [default: 6]
-
-i
,
--subtype_output_dir
<subtype_output_dir>
¶ Output subtypes pheno csv. [default: ./preprocess_outputs/]
-
-m
,
--meffil
¶
Preprocess using meffil.
-
-t
,
--torque
¶
Job submission torque.
-
-r
,
--run
¶
Actually run local job or just print out command.
-
-s
,
--series
¶
Run commands in series.
-
-p
,
--pc_qc_parameters_csv
<pc_qc_parameters_csv>
¶ For meffil, qc parameters and pcs for final qc and functional normalization. [default: ./preprocess_outputs/pc_qc_parameters.csv]
-
-u
,
--use_cache
¶
If this is selected, loads qc results rather than running qc again. Only works for meffil selection.
-
-qc
,
--qc_only
¶
Only perform QC for meffil pipeline, caches results into rds file for loading again, only works if use_cache is false.
-
-c
,
--chunk_size
<chunk_size>
¶ If not series, chunk up and run this number of commands at once. -1 means all commands at once.
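The chunking behaviour described for `--chunk_size` can be sketched as below. The `chunk_commands` helper and the placeholder job names are hypothetical; the sketch only illustrates how a list of commands might be grouped into batches that run together.

```python
def chunk_commands(commands, chunk_size=-1):
    """Group commands into batches; -1 puts every command in one batch."""
    if chunk_size == -1:
        return [list(commands)]
    return [
        list(commands[i:i + chunk_size])
        for i in range(0, len(commands), chunk_size)
    ]


commands = ["job_1", "job_2", "job_3", "job_4", "job_5"]
batches = chunk_commands(commands, chunk_size=2)
all_at_once = chunk_commands(commands)
print(batches)
```

Each batch would be launched in parallel, with the next batch starting once the previous one finishes.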
combine_methylation_arrays¶
If MethylationArrays were split by subtype for either preprocessing or imputation, use this to recombine the data for the downstream step.
pymethyl-preprocess combine_methylation_arrays [OPTIONS]
Options
-
-i
,
--input_pkls
<input_pkls>
¶ Input pickles for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]
-
-d
,
--optional_input_pkl_dir
<optional_input_pkl_dir>
¶ Auto grab input pkls. [default: ]
-
-o
,
--output_pkl
<output_pkl>
¶ Output database for beta and phenotype data. [default: ./combined_outputs/methyl_array.pkl]
-
-e
,
--exclude
<exclude>
¶ If -d selected, these diseases will be excluded from study. [default: ]
concat_sample_sheets¶
Concat two sample files for more fields for minfi+ input, adds more samples.
pymethyl-preprocess concat_sample_sheets [OPTIONS]
Options
-
-s1
,
--sample_sheet1
<sample_sheet1>
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info1.csv]
-
-s2
,
--sample_sheet2
<sample_sheet2>
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info2.csv]
-
-os
,
--output_sample_sheet
<output_sample_sheet>
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
create_sample_sheet¶
Create sample sheet for input to minfi, meffil, or enmix.
pymethyl-preprocess create_sample_sheet [OPTIONS]
Options
-
-is
,
--input_sample_sheet
<input_sample_sheet>
¶ Clinical information downloaded from tcga/geo/custom. [default: ./tcga_idats/clinical_info.csv]
-
-s
,
--source_type
<source_type>
¶ Source type of data. [default: tcga]
-
-i
,
--idat_dir
<idat_dir>
¶ Idat directory. [default: ./tcga_idats/]
-
-os
,
--output_sample_sheet
<output_sample_sheet>
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
-
-m
,
--mapping_file
<mapping_file>
¶ Mapping file from uuid to TCGA barcode. Downloaded using download_tcga. [default: ./idat_filename_case.txt]
-
-l
,
--header_line
<header_line>
¶ Line to begin reading csv/xlsx. [default: 0]
-
-d
,
--disease_class_column
<disease_class_column>
¶ Disease classification column, for custom and geo datasets. [default: methylation class:ch1]
-
-b
,
--basename_col
<basename_col>
¶ Basename classification column, for custom datasets. [default: Sentrix ID (.idat)]
-
-c
,
--include_columns_file
<include_columns_file>
¶ Custom columns file containing columns to keep, separated by \n. Add a tab for each line if you wish to rename columns: original_name \t new_column_name [default: ]
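The include-columns file maps to the `include_columns` dictionary accepted by the API. A minimal parsing sketch follows, assuming one column per line with an optional tab-separated new name; `parse_include_columns` is a hypothetical helper, and treating a line without a tab as "keep the original name" is an assumption.

```python
def parse_include_columns(text):
    """Map original column names to new names from newline/tab-separated text."""
    include_columns = {}
    for line in text.split("\n"):
        if not line.strip():
            continue  # skip blank lines
        original, _, new_name = line.partition("\t")
        # Without a tab there is no rename, so reuse the original name.
        include_columns[original.strip()] = new_name.strip() or original.strip()
    return include_columns


text = "age at diagnosis\tage\nsex\n"
include_columns = parse_include_columns(text)
print(include_columns)
```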
download_clinical¶
Download all TCGA 450k clinical info.
pymethyl-preprocess download_clinical [OPTIONS]
Options
-
-o
,
--output_dir
<output_dir>
¶ Output directory for exported idats. [default: ./tcga_idats/]
download_geo¶
Download geo methylation study idats and clinical info.
pymethyl-preprocess download_geo [OPTIONS]
Options
-
-g
,
--geo_query
<geo_query>
¶ GEO study to query. [default: ]
-
-o
,
--output_dir
<output_dir>
¶ Output directory for exported idats. [default: ./geo_idats/]
download_tcga¶
Download all tcga 450k data.
pymethyl-preprocess download_tcga [OPTIONS]
Options
-
-o
,
--output_dir
<output_dir>
¶ Output directory for exported idats. [default: ./tcga_idats/]
feature_select¶
Filter CpGs by taking x top CpGs with highest mean absolute deviation scores or via spectral feature selection.
pymethyl-preprocess feature_select [OPTIONS]
Options
-
-i
,
--input_pkl
<input_pkl>
¶ Input database for beta and phenotype data. [default: ./imputed_outputs/methyl_array.pkl]
-
-o
,
--output_pkl
<output_pkl>
¶ Output database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-
-n
,
--n_top_cpgs
<n_top_cpgs>
¶ Number cpgs to include with highest variance across population. [default: 300000]
-
-f
,
--feature_selection_method
<feature_selection_method>
¶
-
-mm
,
--metric
<metric>
¶
-
-nn
,
--n_neighbors
<n_neighbors>
¶ Number neighbors for feature selection, default enacts rbf kernel. [default: 0]
-
-m
,
--mad_top_cpgs
<mad_top_cpgs>
¶ Number cpgs to apply mad filtering first before more sophisticated feature selection. If 0 or primary feature selection is mad, no mad pre-filtering. [default: 0]
get_categorical_distribution¶
Get categorical distribution of columns of sample sheet.
pymethyl-preprocess get_categorical_distribution [OPTIONS]
Options
-
-is
,
--formatted_sample_sheet
<formatted_sample_sheet>
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/minfiSheet.csv]
-
-k
,
--key
<key>
¶ Column of csv to print statistics for. [default: disease]
-
-d
,
--disease_only
¶
Only look at disease, or text before subtype_delimiter.
-
-sd
,
--subtype_delimiter
<subtype_delimiter>
¶ Delimiter for disease extraction. [default: ,]
imputation_pipeline¶
Imputation of subtype or no subtype using various imputation methods.
pymethyl-preprocess imputation_pipeline [OPTIONS]
Options
-
-i
,
--input_pkl
<input_pkl>
¶ Input database for beta and phenotype data. [default: ./combined_outputs/methyl_array.pkl]
-
-ss
,
--split_by_subtype
¶
Imputes CpGs by subtype before combining again.
-
-m
,
--method
<method>
¶ Method of imputation. [default: KNN]
-
-s
,
--solver
<solver>
¶ Imputation library. [default: fancyimpute]
-
-k
,
--n_neighbors
<n_neighbors>
¶ Number neighbors for imputation if using KNN. [default: 5]
-
-r
,
--orientation
<orientation>
¶ Impute CpGs or samples. [default: Samples]
-
-o
,
--output_pkl
<output_pkl>
¶ Output database for beta and phenotype data. [default: ./imputed_outputs/methyl_array.pkl]
-
-n
,
--n_top_cpgs
<n_top_cpgs>
¶ Number cpgs to include with highest variance across population. Greater than 0 allows for mad filtering during imputation to skip mad step. [default: 0]
-
-f
,
--feature_selection_method
<feature_selection_method>
¶
-
-mm
,
--metric
<metric>
¶
-
-nfs
,
--n_neighbors_fs
<n_neighbors_fs>
¶ Number neighbors for feature selection, default enacts rbf kernel. [default: 0]
-
-d
,
--disease_only
¶
Only look at disease, or text before subtype_delimiter.
-
-sd
,
--subtype_delimiter
<subtype_delimiter>
¶ Delimiter for disease extraction. [default: ,]
-
-st
,
--sample_threshold
<sample_threshold>
¶ Value between 0 and 1 for NaN removal. If a sample has sample_threshold proportion of cpgs missing, then remove the sample. Set to -1 to not remove samples. [default: -1.0]
-
-ct
,
--cpg_threshold
<cpg_threshold>
¶ Value between 0 and 1 for NaN removal. If a cpg has cpg_threshold proportion of samples missing, then remove the cpg. Set to -1 to not remove cpgs. [default: -1.0]
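To give a feel for what KNN imputation does, the sketch below fills each missing value with the mean of that CpG in the nearest samples, where nearness is measured over the CpGs both samples have observed. It is a simplified standard-library illustration, not fancyimpute's or sklearn's algorithm; `knn_impute` is a hypothetical helper and `None` stands in for NaN.

```python
import math


def knn_impute(rows, n_neighbors=5):
    """Fill None entries with the mean of that column in the nearest rows."""
    def distance(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return math.inf
        # Root-mean-square difference over the jointly observed columns.
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    imputed = [list(row) for row in rows]
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            if value is not None:
                continue
            # Candidate neighbours: other rows that have column j observed.
            candidates = sorted(
                (distance(row, other), other[j])
                for k, other in enumerate(rows)
                if k != i and other[j] is not None
            )
            nearest = [v for _, v in candidates[:n_neighbors]]
            if nearest:
                imputed[i][j] = sum(nearest) / len(nearest)
    return imputed


rows = [
    [0.10, 0.20, None],
    [0.10, 0.25, 0.30],
    [0.90, 0.80, 0.70],
    [0.12, 0.22, 0.34],
]
imputed = knn_impute(rows, n_neighbors=2)
print(imputed[0])
```

The dissimilar third sample is ignored, so the missing value is filled from the two samples that resemble the first one.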
meffil_encode¶
Reformat file for meffil input.
pymethyl-preprocess meffil_encode [OPTIONS]
Options
-
-is
,
--input_sample_sheet
<input_sample_sheet>
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
-
-os
,
--output_sample_sheet
<output_sample_sheet>
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
merge_sample_sheets¶
Merge two sample files for more fields for minfi+ input.
pymethyl-preprocess merge_sample_sheets [OPTIONS]
Options
-
-s1
,
--sample_sheet1
<sample_sheet1>
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info1.csv]
-
-s2
,
--sample_sheet2
<sample_sheet2>
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info2.csv]
-
-os
,
--output_sample_sheet
<output_sample_sheet>
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
-
-d
,
--second_sheet_disease
¶
Use second sheet’s disease column.
-
-nd
,
--no_disease_merge
¶
Don’t merge disease columns.
na_report¶
Print proportion of missing values throughout dataset.
pymethyl-preprocess na_report [OPTIONS]
Options
-
-i
,
--input_pkl
<input_pkl>
¶ Input database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]
-
-o
,
--output_dir
<output_dir>
¶ Output database for na report. [default: ./na_report/]
-
-r
,
--head_directory
¶
-i option becomes directory, and searches there for multiple input pickles.
preprocess_pipeline¶
Perform preprocessing of idats using enmix or meffil.
pymethyl-preprocess preprocess_pipeline [OPTIONS]
Options
-
-i
,
--idat_dir
<idat_dir>
¶ Idat dir for one sample sheet, alternatively can be your phenotype sample sheet. [default: ./tcga_idats/]
-
-n
,
--n_cores
<n_cores>
¶ Number cores to use for preprocessing. [default: 6]
-
-o
,
--output_pkl
<output_pkl>
¶ Output database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]
-
-m
,
--meffil
¶
Preprocess using meffil.
-
-pc
,
--n_pcs
<n_pcs>
¶ For meffil, number of principal components for functional normalization. If set to -1, then PCs are selected using elbow method. [default: -1]
-
-p
,
--pipeline
<pipeline>
¶ If not meffil, preprocess using minfi or enmix. [default: enmix]
-
-noob
,
--noob_norm
¶
Run noob normalization if minfi selected.
-
-u
,
--use_cache
¶
If this is selected, loads qc results rather than running qc again and updates them with new qc parameters. Only works for meffil selection. Minfi and enmix just load the RGSet.
-
-qc
,
--qc_only
¶
Only perform QC for meffil pipeline, caches results into rds file for loading again, only works if use_cache is false. Minfi and enmix just saves the RGSet before preprocessing.
-
-bns
,
--p_beadnum_samples
<p_beadnum_samples>
¶ From meffil documentation, “fraction of probes that failed the threshold of 3 beads”. [default: 0.05]
-
-pds
,
--p_detection_samples
<p_detection_samples>
¶ From meffil documentation, “fraction of probes that failed a detection.pvalue threshold of 0.01”. [default: 0.05]
-
-bnc
,
--p_beadnum_cpgs
<p_beadnum_cpgs>
¶ From meffil documentation, “fraction of samples that failed the threshold of 3 beads”. [default: 0.05]
-
-pdc
,
--p_detection_cpgs
<p_detection_cpgs>
¶ From meffil documentation, “fraction of samples that failed a detection.pvalue threshold of 0.01”. [default: 0.05]
-
-sc
,
--sex_cutoff
<sex_cutoff>
¶ From meffil documentation, “difference of total median intensity for Y chromosome probes and X chromosome probes”. [default: -2]
-
-sd
,
--sex_sd
<sex_sd>
¶ From meffil documentation, “sex detection outliers if outside this range”. [default: 5]
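When scripting the pipeline from Python, it can help to assemble the command as an argument list. The sketch below builds a `preprocess_pipeline` invocation from the options documented above; `preprocess_command` is a hypothetical helper, and the block only constructs and prints the command rather than running it.

```python
import shlex


def preprocess_command(idat_dir="./tcga_idats/", n_cores=6,
                       output_pkl="./preprocess_outputs/methyl_array.pkl",
                       meffil=False, n_pcs=-1):
    """Build the argument list for pymethyl-preprocess preprocess_pipeline."""
    args = ["pymethyl-preprocess", "preprocess_pipeline",
            "-i", idat_dir, "-n", str(n_cores), "-o", output_pkl]
    if meffil:
        # -m selects meffil; -pc sets the number of principal components.
        args += ["-m", "-pc", str(n_pcs)]
    return args


command = preprocess_command(meffil=True, n_pcs=4)
command_line = shlex.join(command)
print(command_line)
```

The list form can be handed to `subprocess.run(command, check=True)` once the package and its R dependencies are installed.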
remove_diseases
Exclude diseases from study by count number or exclusion list.
pymethyl-preprocess remove_diseases [OPTIONS]
Options
-is, --formatted_sample_sheet <formatted_sample_sheet>
    Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info.csv]
-e, --exclude_disease_list <exclude_disease_list>
    List of conditions to exclude, from disease column, comma delimited. [default: ]
-os, --output_sheet_name <output_sheet_name>
    CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
-l, --low_count <low_count>
    Remove diseases if they are below a certain count; by default this is not used. [default: 0]
-d, --disease_only
    Only look at disease, or text before subtype_delimiter.
-sd, --subtype_delimiter <subtype_delimiter>
    Delimiter for disease extraction. [default: ,]
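The interaction between the exclusion list, the low-count cutoff, and the disease_only / subtype_delimiter pair can be sketched in plain Python as follows. This is an illustrative reading of the option descriptions, not the tool's code; the helper names and the row layout (dictionaries with Basename and disease fields) are assumptions.

```python
from collections import Counter


def base_disease(label, delimiter=","):
    """Return the text before the first delimiter, stripped of whitespace."""
    return label.split(delimiter)[0].strip()


def filter_rows(rows, exclude=(), low_count=0, disease_only=False, delimiter=","):
    """Drop rows whose disease is excluded or occurs fewer than low_count times."""
    labels = [
        base_disease(row["disease"], delimiter) if disease_only else row["disease"]
        for row in rows
    ]
    counts = Counter(labels)
    return [
        row
        for row, label in zip(rows, labels)
        if label not in exclude and counts[label] >= low_count
    ]


rows = [
    {"Basename": "s1", "disease": "glioma,astrocytoma"},
    {"Basename": "s2", "disease": "glioma,oligodendroglioma"},
    {"Basename": "s3", "disease": "normal"},
    {"Basename": "s4", "disease": "lymphoma"},
]
kept = filter_rows(rows, exclude={"normal"}, low_count=2, disease_only=True)
print([row["Basename"] for row in kept])
```

With low_count left at 0, the count filter keeps every row, which matches the "by default this is not used" behaviour described above.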
split_preprocess_input_by_subtype
Split preprocess input samplesheet by disease subtype.
pymethyl-preprocess split_preprocess_input_by_subtype [OPTIONS]
Options
-i, --idat_csv <idat_csv>
    Idat csv for one sample sheet, alternatively can be your phenotype sample sheet. [default: ./tcga_idats/minfiSheet.csv]
-d, --disease_only
    Only look at disease, or text before subtype_delimiter.
-sd, --subtype_delimiter <subtype_delimiter>
    Delimiter for disease extraction. [default: ,]
-o, --subtype_output_dir <subtype_output_dir>
    Output subtypes pheno csv. [default: ./preprocess_outputs/]
pymethyl-utils
pymethyl-utils [OPTIONS] COMMAND [ARGS]...
Options
--version
    Show the version and exit.
backup_pkl
Copy methylarray pickle to new location to backup.
pymethyl-utils backup_pkl [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-o, --output_pkl <output_pkl>
    Output database for beta and phenotype data. [default: ./backup/methyl_array.pkl]
bin_column
Convert continuous phenotype column into categorical by binning.
pymethyl-utils bin_column [OPTIONS]
Options
-t, --test_pkl <test_pkl>
    Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]
-c, --col <col>
    Column to turn into bins. [default: age]
-n, --n_bins <n_bins>
    Number of bins. [default: 10]
-ot, --output_test_pkl <output_test_pkl>
    Binned shap pickle for further testing. [default: ./train_val_test_sets/test_methyl_array_shap_binned.pkl]
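To make "binning a continuous column" concrete, here is a minimal equal-width binning sketch in plain Python. The equal-width scheme and the function name are assumptions for illustration; the command itself may bin differently (for example by quantiles or with labelled intervals).

```python
def bin_values(values, n_bins=10):
    """Assign each value to one of n_bins equal-width bins, labelled 0..n_bins-1."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]


ages = [20, 35, 50, 65, 80]
print(bin_values(ages, n_bins=3))
```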
concat_csv
Concatenate two csv files together.
pymethyl-utils concat_csv [OPTIONS]
Options
-i1, --input_csv <input_csv>
    Beta csv. [default: ./beta1.csv]
-i2, --input_csv2 <input_csv2>
    Beta/other csv 2. [default: ./cell_estimates.csv]
-o, --output_csv <output_csv>
    Output csv. [default: ./beta.concat.csv]
-a, --axis <axis>
    Axis to merge on. Columns are 0, rows are 1. [default: 1]
counts
Return categorical breakdown of phenotype column.
pymethyl-utils counts [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-k, --key <key>
    Key to split on. [default: disease]
create_external_validation_set
Create external validation set containing same CpGs as training set.
pymethyl-utils create_external_validation_set [OPTIONS]
Options
-t, --train_pkl <train_pkl>
    Input methyl array. [default: ./train_val_test_sets/train_methyl_array.pkl]
-q, --query_pkl <query_pkl>
    Input methylation array to add/subtract cpgs to. [default: ./final_preprocessed/methyl_array.pkl]
-o, --output_pkl <output_pkl>
    Output methyl array external validation. [default: ./external_validation/methyl_array.pkl]
-c, --cpg_replace_method <cpg_replace_method>
    What to do for missing CpGs. [default: mid]
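Conceptually, building an external validation set means giving the query array exactly the training set's CpGs: extra CpGs are dropped and missing ones are filled in. The sketch below shows that alignment for a single sample in plain Python. The fill value of 0.5 is only an assumption standing in for whatever the "mid" replace method resolves to, and the function name and CpG identifiers are hypothetical.

```python
def align_to_training(query_betas, train_cpgs, fill_value=0.5):
    """Reorder one sample's beta values to the training CpG order.

    CpGs absent from the query receive fill_value (an assumed placeholder);
    CpGs that are not in train_cpgs are dropped.
    """
    return [query_betas.get(cpg, fill_value) for cpg in train_cpgs]


train_cpgs = ["cg001", "cg002", "cg003"]
query = {"cg003": 0.9, "cg001": 0.1, "cg999": 0.7}
print(align_to_training(query, train_cpgs))
```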
feature_select_train_val_test
Filter CpGs by taking x top CpGs with highest mean absolute deviation scores or via spectral feature selection.
pymethyl-utils feature_select_train_val_test [OPTIONS]
Options
-i, --input_pkl_dir <input_pkl_dir>
    Input database for beta and phenotype data. [default: ./train_val_test_sets/]
-o, --output_dir <output_dir>
    Output database for beta and phenotype data. [default: ./train_val_test_sets_fs/]
-n, --n_top_cpgs <n_top_cpgs>
    Number of CpGs to include with highest variance across population. [default: 300000]
-f, --feature_selection_method <feature_selection_method>
-mm, --metric <metric>
-nn, --n_neighbors <n_neighbors>
    Number of neighbors for feature selection; the default enacts an rbf kernel. [default: 0]
-m, --mad_top_cpgs <mad_top_cpgs>
    Number of CpGs to keep in a MAD pre-filtering step applied before more sophisticated feature selection. If 0, or if the primary feature selection is mad, no MAD pre-filtering is applied. [default: 0]
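The MAD-based selection can be pictured as ranking CpGs by their mean absolute deviation across samples and keeping the top n. The following is a small standard-library sketch of that idea, not the package's implementation; the function names and the dictionary layout (CpG name mapped to a list of beta values) are assumptions.

```python
from statistics import fmean


def mean_abs_deviation(values):
    """Mean absolute deviation of the values around their mean."""
    center = fmean(values)
    return fmean([abs(v - center) for v in values])


def top_cpgs_by_mad(betas, n_top):
    """Return the n_top CpG names with the largest mean absolute deviation.

    betas maps a CpG name to its beta values across samples.
    """
    ranked = sorted(
        betas, key=lambda cpg: mean_abs_deviation(betas[cpg]), reverse=True
    )
    return ranked[:n_top]


betas = {
    "cg_flat": [0.50, 0.50, 0.50, 0.50],
    "cg_mild": [0.40, 0.60, 0.40, 0.60],
    "cg_wide": [0.10, 0.90, 0.10, 0.90],
}
print(top_cpgs_by_mad(betas, n_top=2))
```

A pre-filtering step such as the one controlled by --mad_top_cpgs would apply the same ranking with a larger n before handing the reduced set to another selection method.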
fix_key
Format certain column of phenotype array in MethylationArray.
pymethyl-utils fix_key [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-k, --key <key>
    Key to split on. [default: disease]
-d, --disease_only
    Only look at disease, or text before subtype_delimiter.
-sd, --subtype_delimiter <subtype_delimiter>
    Delimiter for disease extraction. [default: ,]
-o, --output_pkl <output_pkl>
    Output database for beta and phenotype data. [default: ./fixed_preprocessed/methyl_array.pkl]
modify_pheno_data
Use another spreadsheet to add more descriptive data to methylarray.
pymethyl-utils modify_pheno_data [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-is, --input_formatted_sample_sheet <input_formatted_sample_sheet>
    Information passed through function create_sample_sheet, has Basename and disease fields. [default: ./tcga_idats/minfi_sheet.csv]
-o, --output_pkl <output_pkl>
    Output database for beta and phenotype data. [default: ./modified_processed/methyl_array.pkl]
move_jpg
Move preprocessing jpegs to preprocessing output directory.
pymethyl-utils move_jpg [OPTIONS]
Options
-i, --input_dir <input_dir>
    Directory containing jpg. [default: ./]
-o, --output_dir <output_dir>
    Output directory for images. [default: ./preprocess_output_images/]
overwrite_pheno_data
Use another spreadsheet to add more descriptive data to methylarray.
pymethyl-utils overwrite_pheno_data [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-is, --input_formatted_sample_sheet <input_formatted_sample_sheet>
    Information passed through function create_sample_sheet, has Basename and disease fields. [default: ./tcga_idats/minfi_sheet.csv]
-o, --output_pkl <output_pkl>
    Output database for beta and phenotype data. [default: ./modified_processed/methyl_array.pkl]
-c, --index_col <index_col>
    Index col when reading csv. [default: 0]
pkl_to_csv
Output methylarray pickle to csv.
pymethyl-utils pkl_to_csv [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-o, --output_dir <output_dir>
    Output directory for beta and phenotype data. [default: ./final_preprocessed/]
-c, --col <col>
    Column to color. [default: ]
print_number_sex_cpgs
Print number of non-autosomal CpGs.
pymethyl-utils print_number_sex_cpgs [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-a, --array_type <array_type>
    Array Type. [default: 450k]
print_shape
Print dimensions of beta matrix.
pymethyl-utils print_shape [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
ref_estimate_cell_counts
Reference based cell type estimates.
pymethyl-utils ref_estimate_cell_counts [OPTIONS]
Options
-ro, --input_r_object_dir <input_r_object_dir>
    Input directory containing qc data. [default: ./preprocess_outputs/]
-a, --algorithm <algorithm>
    Algorithm to run cell type. [default: meffil]
-ref, --reference <reference>
    Cell Type Reference. [default: cord blood gse68456]
-l, --library <library>
    IDOL Library. [default: IDOLOptimizedCpGs450klegacy]
-o, --output_csv <output_csv>
    Output cell type estimates. [default: ./added_cell_counts/cell_type_estimates.csv]
remove_sex
Remove non-autosomal CpGs.
pymethyl-utils remove_sex [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]
-o, --output_pkl <output_pkl>
    Output methyl array autosomal. [default: ./autosomal/methyl_array.pkl]
-a, --array_type <array_type>
    Array Type. [default: 450k]
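Removing non-autosomal CpGs amounts to looking up each probe's chromosome in an array annotation and keeping only probes that are not on X or Y. The sketch below shows that lookup with a hand-written annotation dictionary. The probe names, chromosome labels, and function name are hypothetical, and a real annotation for the selected array type would come from the array manifest rather than from a literal like this.

```python
def split_by_chromosome(cpg_chromosomes, sex_chromosomes=("chrX", "chrY")):
    """Split CpG names into (autosomal, non_autosomal) lists."""
    autosomal, non_autosomal = [], []
    for cpg, chrom in cpg_chromosomes.items():
        if chrom in sex_chromosomes:
            non_autosomal.append(cpg)
        else:
            autosomal.append(cpg)
    return autosomal, non_autosomal


annotation = {"cg001": "chr1", "cg002": "chrX", "cg003": "chr7", "cg004": "chrY"}
autosomal, non_autosomal = split_by_chromosome(annotation)
print(len(non_autosomal), autosomal)
```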
remove_snps
Remove SNPs from methylation array.
pymethyl-utils remove_snps [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./autosomal/methyl_array.pkl]
-o, --output_pkl <output_pkl>
    Output methyl array autosomal. [default: ./no_snp/methyl_array.pkl]
-a, --array_type <array_type>
    Array Type. [default: 450k]
set_part_array_background
Set subset of CpGs from beta matrix to background values.
pymethyl-utils set_part_array_background [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]
-c, --cpg_pkl <cpg_pkl>
    Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]
-o, --output_pkl <output_pkl>
    Output methyl array external validation. [default: ./removal/methyl_array.pkl]
stratify
Split methylation array by key and store.
pymethyl-utils stratify [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-k, --key <key>
    Key to split on. [default: disease]
-o, --output_dir <output_dir>
    Output directory for stratified. [default: ./stratified/]
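Splitting by key is essentially a group-by on one phenotype column. A minimal sketch over a list of phenotype rows follows. The row layout and function name are assumptions, and the real command additionally writes each group out as its own array.

```python
from collections import defaultdict


def stratify(pheno_rows, key="disease"):
    """Group phenotype rows by the value found under key."""
    groups = defaultdict(list)
    for row in pheno_rows:
        groups[row[key]].append(row)
    return dict(groups)


pheno = [
    {"Basename": "s1", "disease": "glioma"},
    {"Basename": "s2", "disease": "normal"},
    {"Basename": "s3", "disease": "glioma"},
]
groups = stratify(pheno)
print({name: len(rows) for name, rows in groups.items()})
```

The per-group sizes printed here are the same breakdown that a counts-style summary of the key column would report.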
subset_array
Only retain certain number of CpGs from methylation array.
pymethyl-utils subset_array [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]
-c, --cpg_pkl <cpg_pkl>
    Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]
-o, --output_pkl <output_pkl>
    Output methyl array external validation. [default: ./subset/methyl_array.pkl]
train_test_val_split
Split methylation array into train, test, val.
pymethyl-utils train_test_val_split [OPTIONS]
Options
-i, --input_pkl <input_pkl>
    Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
-o, --output_dir <output_dir>
    Output directory for training, testing, and validation sets. [default: ./train_val_test_sets/]
-tp, --train_percent <train_percent>
    Percent data training on. [default: 0.8]
-vp, --val_percent <val_percent>
    Percent of training data that comprises validation set. [default: 0.1]
-cat, --categorical
    Multi-class prediction. [default: False]
-do, --disease_only
    Only look at disease, or text before subtype_delimiter.
-k, --key <key>
    Key to split on. [default: disease]
-sd, --subtype_delimiter <subtype_delimiter>
    Delimiter for disease extraction. [default: ,]
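Because val_percent is described as a fraction of the training data rather than of the whole array, the default values work out to a 72/8/20 split. The sketch below shows that arithmetic with a simple shuffled split in plain Python. It ignores the categorical/key options (no stratification), and the function name and seed parameter are assumptions for illustration.

```python
import random


def train_val_test_split(samples, train_percent=0.8, val_percent=0.1, seed=42):
    """Shuffle samples and split them into (train, val, test) lists.

    val_percent is applied to the training portion, not to the full data.
    """
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    n_train_total = int(round(len(shuffled) * train_percent))
    n_val = int(round(n_train_total * val_percent))
    train_total, test = shuffled[:n_train_total], shuffled[n_train_total:]
    val, train = train_total[:n_val], train_total[n_val:]
    return train, val, test


train, val, test = train_val_test_split(range(100))
print(len(train), len(val), len(test))
```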