Welcome to PyMethylProcess’s documentation!
To get started, install pymethylprocess using Docker (joshualevy44/pymethylprocess) or pip (pymethylprocess), then run pymethyl-install_r_dependencies.
Both an API and a CLI are available. Examples of CLI usage can be found in ./example_scripts.

Contains datatypes core to downloading IDATs, preprocessing IDATs and samplesheets.
(idat_dir, minfi=None, enmix=None, base=None, meffil=None)[source]¶ Class that will preprocess IDATs using R pipelines.
- idat_dir
- Location of idats or samplesheet csv.
- minfi
- Rpy2 importr minfi library; if None (default), it is loaded through rpy2.
- enmix
- Rpy2 importr enmix library; if None (default), it is loaded through rpy2.
- base
- Rpy2 importr base library; if None (default), it is loaded through rpy2.
- meffil
- Rpy2 importr meffil library; if None (default), it is loaded through rpy2.
(output_dir)[source]¶ Export pheno and beta dataframes to CSVs
- output_dir
- Where to store csvs.
(output_pickle, disease='')[source]¶ Export pheno and beta dataframes to pickle, stored in python dict that can be loaded into MethylationArray
- output_pickle
- Where to store MethylationArray.
- disease
- Custom naming scheme for data.
(output_db, disease='')[source]¶ Export pheno and beta dataframes to SQL
- output_db
- Where to store data, sqlite db.
- disease
- Custom naming scheme for data.
(methylset=False)[source]¶ Extract pheno data from MSet or RGSet, minfi.
- methylset
- If MSet has been created, set to True, else extract from original RGSet.
(meffil=False)[source]¶ Get pheno and beta dataframe objects stored as attributes for input to MethylationArray object.
- meffil
- True if ran meffil pipeline.
(output_dir)[source]¶ Plot QC results from ENmix pipeline and possible minfi. Still experimental.
- output_dir
- Where to store plots.
(output_dir)[source]¶ Plot QC results from ENmix pipeline and possible minfi. Still experimental.
- output_dir
- Where to store plots.
(n_cores=6)[source]¶ Run ENmix preprocessing pipeline.
- n_cores
- Number of CPUs to use.
(n_cores=6, n_pcs=4, qc_report_fname='qc/report.html', normalization_report_fname='norm/report.html', pc_plot_fname='qc/pc_plot.pdf', useCache=True, qc_only=True, qc_parameters={'p.beadnum.cpgs': 0.1, 'p.beadnum.samples': 0.1, 'p.detection.cpgs': 0.1, 'p.detection.samples': 0.1}, rm_sex=False)[source]¶ Run meffil preprocessing pipeline with functional normalization.
- n_cores
- Number of CPUs to use.
- n_pcs
- Number of principal components to use for functional normalization, set to -1 to autoselect via kneedle algorithm.
- qc_report_fname
- HTML filename to store QC report.
- normalization_report_fname
- HTML filename to store normalization report
- pc_plot_fname
- PDF file to store principal components plot.
- useCache
- Use saved QC objects instead of running through QC again.
- qc_only
- Perform QC, then save and quit before normalization.
- qc_parameters
- Python dictionary with parameters for qc.
- rm_sex
- Remove non-autosomal cpgs?
(n_cores=6, pipeline='enmix', noob=False, qc_only=False, use_cache=False)[source]¶ Run complete ENmix or minfi preprocessing pipeline.
- n_cores
- Number of CPUs to use.
- pipeline
- Run enmix or minfi
- noob
- If running minfi, use noob normalization (True) or raw preprocessing (False).
- qc_only
- Save and quit after only running QC?
- use_cache
- Load preexisting RGSet instead of running QC again.
(pheno_sheet, idat_dir, header_line=0)[source]¶ Class that will manipulate phenotype samplesheet before preprocessing of IDATs.
- pheno_sheet
- Location of clinical info csv.
- idat_dir
- Location of idats
- header_line
- Where to start reading clinical csv
(other_formatted_sheet)[source]¶ Concat multiple PreProcessPhenoData objects, concat their dataframes to accept more than one samplesheet/dataset.
- other_formatted_sheet
- Other PreProcessPhenoData to concat.
(output_sheet_name)[source]¶ Export pheno data to csv after done with manipulation.
- output_sheet_name
- Output csv name.
(basename_col, disease_class_column, include_columns={})[source]¶ Custom format clinical sheet if user supplied idats.
- basename_col
- Column name of sample names.
- disease_class_column
- Disease column of clinical info csv.
- include_columns
- Dictionary specifying other columns to include, and new names to assign them to.
(disease_class_column='methylation class:ch1', include_columns={})[source]¶ Format clinical sheets if downloaded geo idats.
- disease_class_column
- Disease column of clinical info csv.
- include_columns
- Dictionary specifying other columns to include, and new names to assign them to.
(mapping_file='idat_filename_case.txt')[source]¶ Format clinical sheets if downloaded tcga idats.
- mapping_file
- Maps uuids to proper tcga sample names, should be downloaded with tcga clinical information.
(key, disease_only=False, subtype_delimiter=', ')[source]¶ Print categorical distribution, counts for each unique value in phenotype column.
- key
- Phenotype Column.
- disease_only
- Whether to split phenotype column entries by delimiter.
- subtype_delimiter
- Subtype delimiter to split on.
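As a rough illustration of what a categorical distribution with `disease_only` and `subtype_delimiter` amounts to, here is a minimal standard-library sketch. The function name `categorical_distribution` and the example labels are hypothetical, and this is not the library's implementation; it only assumes that the text before the delimiter is treated as the disease superclass.

```python
from collections import Counter

def categorical_distribution(values, disease_only=False, subtype_delimiter=","):
    """Count occurrences of each unique value in a phenotype column."""
    if disease_only:
        # Keep only the text before the first delimiter, e.g. "S1,s2" -> "S1".
        values = [v.split(subtype_delimiter)[0].strip() for v in values]
    return Counter(values)

# Hypothetical phenotype column with "disease,subtype" entries.
column = ["GBM,classic", "GBM,mesenchymal", "LGG,astro", "GBM,classic"]
full_counts = categorical_distribution(column)
disease_counts = categorical_distribution(column, disease_only=True)
print(full_counts)
print(disease_counts)
```

With `disease_only=True` the subtypes collapse into their parent disease, which is usually what is wanted when checking class balance before preprocessing.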
(other_formatted_sheet, use_second_sheet_disease=True, no_disease_merge=False)[source]¶ Merge multiple PreProcessPhenoData objects, merge their dataframes to accept more than one samplesheet/dataset or add more pheno info.
- other_formatted_sheet
- Other PreProcessPhenoData to merge.
- use_second_sheet_disease
- Change disease column to that of second sheet instead of first.
- no_disease_merge
- Keep both disease columns from both sheets.
(exclude_disease_list, low_count, disease_only, subtype_delimiter)[source]¶ Remove samples with certain diseases from disease column.
- exclude_disease_list
- List containing diseases to remove.
- low_count
- Remove samples whose disease occurs fewer than x times in the column.
- disease_only
- Whether to split phenotype column entries by delimiter.
- subtype_delimiter
- Subtype delimiter to split on.
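The filtering described above can be sketched with plain Python lists. This is an illustration under stated assumptions, not the library's code: the helper name `remove_diseases`, the (sample, disease) pair layout, and the choice to count occurrences after applying the exclusion list are all hypothetical.

```python
from collections import Counter

def remove_diseases(samples, exclude_disease_list=(), low_count=0):
    """Drop samples whose disease is excluded or occurs fewer than low_count times.

    `samples` is a list of (sample_name, disease) pairs standing in for the
    rows of a samplesheet.
    """
    excluded = set(exclude_disease_list)
    kept = [(name, d) for name, d in samples if d not in excluded]
    # Count how often each remaining disease occurs.
    counts = Counter(d for _, d in kept)
    return [(name, d) for name, d in kept if counts[d] >= low_count]

samples = [("s1", "GBM"), ("s2", "GBM"), ("s3", "LGG"), ("s4", "control")]
filtered = remove_diseases(samples, exclude_disease_list=["control"], low_count=2)
print(filtered)
```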
[source]¶ Downloads TCGA and GEO IDAT and clinical data
(output_dir)[source]¶ Download TCGA Clinical Data.
- output_dir
- Where to output clinical data csv.
Contains datatypes core to storing beta and phenotype methylation data, and imputation.
(solver, method, opts={})[source]¶ Class that stores and accesses different types of imputers. Construct sklearn-like imputer given certain input arguments.
- solver
- Library for imputation, e.g. sklearn, fancyimpute.
- method
- Imputation method in library, named.
- opts
- Additional options to assign to imputer.
(pheno_df, beta_df, name='')[source]¶ Stores beta and phenotype information and performs various operations. Initialize MethylationArray object by inputting dataframe of phenotypes and dataframe of beta values with samples as index.
- pheno_df
- Phenotype dataframe (samples x covariates)
- beta_df
- Beta Values Dataframe (samples x cpgs)
(col, n_bins)[source]¶ Turn continuous variable/covariate into categorical bins. Returns name of new column and updates phenotype matrix to reflect this change.
- col
- Continuous column of phenotype array to bin.
- n_bins
- Number of bins to create.
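To make the binning idea concrete, the sketch below turns a continuous covariate into bin indices. It assumes equal-width bins, which may differ from the binning strategy the library actually uses, and the helper name `bin_column` is hypothetical.

```python
def bin_column(values, n_bins):
    """Assign each value to one of n_bins equal-width bins (0 .. n_bins - 1)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        # All values are identical, so everything falls into a single bin.
        return [0 for _ in values]
    # The maximum value would land in bin n_bins, so clamp it into the last bin.
    return [min(int((v - lo) / width), n_bins - 1) for v in values]

ages = [21, 35, 48, 52, 67, 80]
age_bins = bin_column(ages, n_bins=3)
print(age_bins)
```

The resulting indices could then be stored as a new categorical phenotype column, for example to stratify a train/test split by age group.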
(key)[source]¶ Print categorical distribution, counts for each unique value in phenotype column.
- key
- Phenotype Column.
(n_top_cpgs, feature_selection_method='mad', metric='correlation', nn=10)[source]¶ Perform unsupervised feature selection on MethylationArray.
- n_top_cpgs
- Number of CpGs to retain.
- feature_selection_method
- Method to perform selection.
- metric
- If considering structural feature selection like SPEC, use this distance metric.
- nn
- Number of nearest neighbors.
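For the default `'mad'` method, the idea is to rank CpGs by how much their beta values vary across samples and keep the top ones. A minimal sketch, assuming "mad" means mean absolute deviation as described in the CLI help for feature_select; the function names and CpG identifiers below are hypothetical and this is not the library's implementation.

```python
from statistics import mean

def mean_absolute_deviation(values):
    """Average absolute distance of each value from the mean."""
    m = mean(values)
    return mean(abs(v - m) for v in values)

def select_top_cpgs(beta, n_top_cpgs):
    """beta maps CpG name -> list of beta values across samples."""
    ranked = sorted(beta, key=lambda cpg: mean_absolute_deviation(beta[cpg]),
                    reverse=True)
    return ranked[:n_top_cpgs]

beta = {
    "cg0001": [0.10, 0.90, 0.15, 0.85],  # highly variable
    "cg0002": [0.50, 0.52, 0.49, 0.51],  # nearly constant
    "cg0003": [0.30, 0.60, 0.35, 0.65],  # moderately variable
}
top = select_top_cpgs(beta, 2)
print(top)
```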
(input_pickle)[source]¶ Load MethylationArray stored in pickle.
Usage: MethylationArray.from_pickle([input_pickle])
- input_pickle
- Stored MethylationArray pickle.
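The export method above describes the pickle as a Python dict holding the phenotype and beta data. The round trip can be sketched with the standard library alone; the key names `"pheno"` and `"beta"` and the nested-dict stand-ins for dataframes are assumptions made for illustration, so check an actual exported pickle before relying on them.

```python
import os
import pickle
import tempfile

# Plain dicts stand in for the phenotype and beta dataframes.
pheno = {"sample1": {"disease": "GBM"}, "sample2": {"disease": "LGG"}}
beta = {"sample1": {"cg0001": 0.12}, "sample2": {"cg0001": 0.87}}

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "methyl_array.pkl")
    # Store both tables in one dict (assumed layout).
    with open(path, "wb") as fh:
        pickle.dump({"pheno": pheno, "beta": beta}, fh)
    # Load the dict back, as a loader such as from_pickle would need to do.
    with open(path, "rb") as fh:
        loaded = pickle.load(fh)

print(sorted(loaded))
```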
(key)[source]¶ Groupby for Methylation Array. Returns generator of methylation arrays grouped by key.
- key
- Phenotype column to group by.
(imputer)[source]¶ Perform imputation on NaN beta values. Input imputer returned from ImputerObject.
- imputer
- Type of imputer object, in sklearn type interface.
(preprocess_sample_df)[source]¶ Feed in another phenotype dataframe that will be merged with existing phenotype array.
- preprocess_sample_df
- New phenotype dataframe.
(preprocess_sample_df)[source]¶ Feed in another phenotype dataframe that will overwrite overlapping keys of existing phenotype array.
- preprocess_sample_df
- New phenotype dataframe.
(cpg_threshold=None, sample_threshold=None)[source]¶ Remove samples and CpGs with a certain level of missingness.
- cpg_threshold
- If more than this fraction of samples is missing for a CpG, remove the CpG.
- sample_threshold
- If more than this fraction of CpGs is missing for a sample, remove the sample.
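The two thresholds can be illustrated with a small nested dict in place of the beta dataframe. This is a sketch under assumptions: CpGs are filtered first and samples second (the actual order is not stated here), `None` marks a missing value, and the helper name `remove_missingness` is hypothetical.

```python
def remove_missingness(beta, cpg_threshold=None, sample_threshold=None):
    """beta maps sample -> {cpg: value or None}; None marks a missing value."""
    samples = list(beta)
    cpgs = list(next(iter(beta.values())))
    if cpg_threshold is not None:
        # Keep CpGs whose fraction of missing samples does not exceed the threshold.
        cpgs = [c for c in cpgs
                if sum(beta[s][c] is None for s in samples) / len(samples)
                <= cpg_threshold]
    if sample_threshold is not None and cpgs:
        # Keep samples whose fraction of missing CpGs does not exceed the threshold.
        samples = [s for s in samples
                   if sum(beta[s][c] is None for c in cpgs) / len(cpgs)
                   <= sample_threshold]
    return {s: {c: beta[s][c] for c in cpgs} for s in samples}

beta = {
    "s1": {"cg1": 0.2, "cg2": None, "cg3": 0.5},
    "s2": {"cg1": 0.3, "cg2": None, "cg3": None},
    "s3": {"cg1": 0.4, "cg2": 0.9, "cg3": 0.6},
}
filtered = remove_missingness(beta, cpg_threshold=0.5, sample_threshold=0.4)
print(filtered)
```

Here cg2 is missing in two of three samples and is dropped, after which s2 is missing half of the remaining CpGs and is dropped as well.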
(outcome_cols)[source]¶ Remove samples of MethylationArray who have missing values in phenotype column.
- outcome_cols
- Phenotype columns, if any rows contain missing values, samples are removed.
(disease_only, subtype_delimiter)[source]¶ Split MethylationArray into generator of MethylationArrays by phenotype column. Much akin to groupby. Only splits from disease column.
- disease_only
- Consider disease superclass.
- subtype_delimiter
- How to break up disease column if using disease_only.
(key, subtype_delimiter)[source]¶ Manipulate an entire phenotype column, splitting each element up by some delimiter.
- key
- Phenotype column.
- subtype_delimiter
- How to break up strings in columns. S1,s2 -> S1 for instance.
(train_p=0.8, stratified=True, disease_only=False, key='disease', subtype_delimiter=', ', val_p=0.0)[source]¶ Split MethylationArray into training and test sets, with option to stratify by categorical covariate.
- train_p
- Fraction of methylation array to use as training set.
- stratified
- Whether to stratify by categorical variable.
- disease_only
- Consider disease superclass by some delimiter. For instance if disease is S1,s2, superclass would be S1.
- key
- Column to stratify on.
- subtype_delimiter
- How to split disease column into super/subclass.
- val_p
- If set greater than 0, will create additional validation set, fraction of which is broken off from training set.
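How `train_p`, `val_p`, and `stratified` interact can be sketched as follows. The function `split_train_test`, the rounding rule, and the fixed seed are illustrative assumptions rather than the library's behavior; the one detail taken from the description above is that the validation fraction is carved out of the training set.

```python
import random
from collections import defaultdict

def split_train_test(labels, train_p=0.8, val_p=0.0, stratified=True, seed=42):
    """labels maps sample -> class label; returns (train, val, test) sample lists."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for sample, label in labels.items():
        # Without stratification every sample goes into one shared group.
        groups[label if stratified else "all"].append(sample)
    train, val, test = [], [], []
    for members in groups.values():
        rng.shuffle(members)
        n_train = int(round(train_p * len(members)))
        group_train, group_test = members[:n_train], members[n_train:]
        # The validation set is broken off from the training portion.
        n_val = int(round(val_p * len(group_train)))
        val.extend(group_train[:n_val])
        train.extend(group_train[n_val:])
        test.extend(group_test)
    return train, val, test

# Hypothetical cohort: 10 GBM samples and 10 LGG samples.
labels = {f"s{i}": ("GBM" if i < 10 else "LGG") for i in range(20)}
train, val, test = split_train_test(labels, train_p=0.8, val_p=0.25)
print(len(train), len(val), len(test))
```

Because the split is done per class, each of the three sets keeps the 50/50 class balance of the full cohort.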
(key='disease', n_samples=None, frac=None, categorical=False)[source]¶ Subsample MethylationArray, make the set randomly smaller.
- key
- If stratifying, use this column of pheno array.
- n_samples
- Number of samples to consider overall, or per stratum.
- frac
- Alternative to n_samples, where x frac of array or stratum is considered.
- categorical
- Whether to stratify by column.
(cpgs)[source]¶ Subset beta matrix by list of CpGs.
- cpgs
- CpGs to subset by.
(output_dir)[source]¶ Write phenotype data and beta values to csvs.
- output_dir
- Directory to output csv files.
(list_methylation_arrays)[source]¶ A list of methylation arrays, with methods that operate on these arrays in a memory-efficient way. Initialize with a list of methylation arrays; the list can optionally be empty or contain a single element.
- list_methylation_arrays
- List of methylation arrays.
(array_generator=None)[source]¶ Combine the list of methylation arrays into one array via concatenation of beta matrices and phenotype arrays.
- array_generator
- Generator of additional methylation arrays for computational memory minimization.
(folder)[source]¶ Return phenotype and beta dataframes from specified folder with csv.
- folder
- Input folder.
Contains a few R functions that interact with meffil and minfi.
(rgset, library)[source]¶ Given RGSet object, estimate cell counts for 450k/850k using reference approach via IDOL library.
- rgset
- RGSet object stored in python via rpy2
- library
- What type of CpG library to use.
(qc_list, cell_type_reference)[source]¶ Given QCObject list R object, estimate cell counts using reference approach via meffil.
- qc_list
- R list containing qc objects.
- cell_type_reference
- Reference blood/tissue set.
(rgset)[source]¶ Given RGSet object, estimate cell counts using reference approach via minfi.
- rgset
- RGSet object stored in python via rpy2
(qc_list, n_cores)[source]¶ Return list of detection p-value matrix and bead number matrix.
- qc_list
- R list containing qc objects.
- n_cores
- Number of cores to use in computation.
(array_type='450k')[source]¶ Return list of autosomal cpg probes per platform.
- array_type
- 450k/850k array?
(array_type='450k')[source]¶ Return list of SNP cpg probes per platform.
- array_type
- 450k/850k array?
(beta, array_type='450k')[source]¶ Remove non-autosomal cpgs from beta matrix.
- array_type
- 450k/850k array?
(beta, pval_beadnum, detection_val=1e-06)[source]¶ Set missing beta values to NA, taking into account detection values and bead number thresholds.
- pval_beadnum
- Detection pvalues and number of beads per cpg/samples
- detection_val
- Detection p-value threshold used to set a site to missing.
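The masking step can be pictured as below. This is a sketch under assumptions, not the R routine itself: a value is treated as failed when its detection p-value exceeds `detection_val` or its bead count falls below `min_beads`, and the `min_beads=3` default is a hypothetical parameter borrowed from the "threshold of 3 beads" wording quoted from the meffil documentation in the CLI options.

```python
def set_missing(beta, detection_p, bead_num, detection_val=1e-6, min_beads=3):
    """All inputs map sample -> {cpg: value}. Returns beta with failures set to None."""
    masked = {}
    for sample, row in beta.items():
        masked[sample] = {
            cpg: (None
                  if detection_p[sample][cpg] > detection_val
                  or bead_num[sample][cpg] < min_beads
                  else value)
            for cpg, value in row.items()
        }
    return masked

beta = {"s1": {"cg1": 0.2, "cg2": 0.8, "cg3": 0.5}}
detection_p = {"s1": {"cg1": 1e-10, "cg2": 0.05, "cg3": 1e-9}}
bead_num = {"s1": {"cg1": 12, "cg2": 10, "cg3": 2}}
masked = set_missing(beta, detection_p, bead_num)
print(masked)
```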
Contains a machine learning class to perform scikit-learn like operations, along with held-out hyperparameter grid search.
(model, options, grid={}, labelencode=False, n_eval=0)[source]¶ Machine learning class to run sklearn-like pipeline on MethylationArray data. Initialize object with scikit-learn model, and optionally supply a hyperparameter search grid.
- model
- Scikit-learn-like model, classification, regression, dimensionality reduction, clustering etc.
- options
- Options to supply model in form of dictionary.
- grid
- Alternatively, supply search grid to search for best hyperparameters.
- labelencode
- Whether to encode string labels (True/False).
- n_eval
- Number of evaluations for randomized grid search, if set to 0, perform exhaustive grid search
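The difference between the two search modes selected by `n_eval` can be sketched without any machine learning library. The helper `candidate_settings` and the example grid are hypothetical; in scikit-learn the analogous tools are GridSearchCV and RandomizedSearchCV, though whether this class uses them internally is not stated here.

```python
import itertools
import random

def candidate_settings(grid, n_eval=0, seed=0):
    """Expand a hyperparameter grid; sample n_eval settings if n_eval > 0."""
    keys = sorted(grid)
    # Every combination of the listed values, one dict per combination.
    combos = [dict(zip(keys, values))
              for values in itertools.product(*(grid[k] for k in keys))]
    if n_eval > 0:
        # Randomized search: evaluate only a random subset of the combinations.
        rng = random.Random(seed)
        return rng.sample(combos, min(n_eval, len(combos)))
    # n_eval == 0: exhaustive search over the full grid.
    return combos

grid = {"n_neighbors": [3, 5, 7], "weights": ["uniform", "distance"]}
exhaustive = candidate_settings(grid)
randomized = candidate_settings(grid, n_eval=4)
print(len(exhaustive), len(randomized))
```

Each candidate would then be fit on the training array and scored on the held-out validation array, keeping the best-scoring setting.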
(methyl_array, new_col, output_pkl)[source]¶ Assign results to new phenotype column.
- methyl_array
- MethylationArray.
- new_col
- New column name.
- output_pkl
- Output pickle to dump MethylationArray to.
(train_methyl_array, val_methyl_array=None, outcome_cols=None)[source]¶ Fit data to model.
- train_methyl_array
- Training MethylationArray.
- val_methyl_array
- Validation MethylationArray. Can set to None.
- outcome_cols
- Phenotype column(s) to train on; can be multiple. Set to None if not needed.
(train_methyl_array, outcome_cols=None)[source]¶ Fit and predict training data.
- train_methyl_array
- Training MethylationArray.
- outcome_cols
- Phenotype column(s) to train on; can be multiple. Set to None if not needed.
(train_methyl_array, outcome_cols=None)[source]¶ Fit and transform to training data.
- train_methyl_array
- Training MethylationArray.
- outcome_cols
- Phenotype column(s) to train on; can be multiple. Set to None if not needed.
(test_methyl_array)[source]¶ Make new predictions on test methylation array.
- test_methyl_array
- Testing MethylationArray.
(methyl_array, outcome_cols, metric, run_bootstrap=False)[source]¶ Supply metric to evaluate results.
- methyl_array
- MethylationArray to evaluate.
- outcome_cols
- Outcome phenotype columns.
- metric
- Sklearn evaluation metric.
- run_bootstrap
- Make 95% CI from 1k bootstraps.
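A 95% confidence interval from 1,000 bootstraps can be computed along these lines. The sketch assumes a percentile bootstrap over resampled (truth, prediction) pairs, which may not match the method used internally; `accuracy` and `bootstrap_ci` are hypothetical helpers, and the labels are made up.

```python
import random

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def bootstrap_ci(y_true, y_pred, metric, n_bootstraps=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for metric(y_true, y_pred)."""
    rng = random.Random(seed)
    n = len(y_true)
    scores = []
    for _ in range(n_bootstraps):
        # Resample sample indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        scores.append(metric([y_true[i] for i in idx], [y_pred[i] for i in idx]))
    scores.sort()
    lower = scores[int((alpha / 2) * n_bootstraps)]
    upper = scores[int((1 - alpha / 2) * n_bootstraps) - 1]
    return lower, upper

y_true = ["GBM", "LGG", "GBM", "GBM", "LGG", "LGG", "GBM", "LGG", "GBM", "LGG"]
y_pred = ["GBM", "LGG", "GBM", "LGG", "LGG", "LGG", "GBM", "GBM", "GBM", "LGG"]
point = accuracy(y_true, y_pred)
lower, upper = bootstrap_ci(y_true, y_pred, accuracy)
print(point, lower, upper)
```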
(output_pkl, results_dict={})[source]¶ Store results in pickle file.
- output_pkl
- Output pickle to dump results to.
- results_dict
- Supply own results dict to be dumped.
pymethyl-install [OPTIONS] COMMAND [ARGS]...
Show the version and exit.
Change GCC and G++ paths if don’t have version 7.2.0. [Experimental]
pymethyl-install change_gcc_path [OPTIONS]
Installs bioconductor packages.
pymethyl-install install_custom [OPTIONS]
¶ Custom packages. [default: ENmix]
Use BiocManager (recommended).
Installs minfi and other dependencies.
pymethyl-install install_minfi_others [OPTIONS]
Installs r packages.
pymethyl-install install_r_packages [OPTIONS]
¶ Custom packages. [default: ]
Installs bioconductor, minfi, enmix, tcga biolinks, and meffil.
pymethyl-install install_some_deps [OPTIONS]
pymethyl-visualize [OPTIONS] COMMAND [ARGS]...
Show the version and exit.
Plot csv containing cell type results into side by side boxplots.
pymethyl-visualize plot_cell_type_results [OPTIONS]
¶ Input csv. [default: cell_type_estimates.csv]
¶ Output png. [default: visualizations/cell_type_results.png]
¶ Plot columns. [default: Gran, CD4T, CD8T, Bcell, Mono, NK, gMDSC]
¶ Font scaling [default: 1.0]
Plot heatmap from CSV file.
pymethyl-visualize plot_heatmap [OPTIONS]
¶ Input csv. [default: ]
¶ Output png. [default: output.png]
¶ Index load dataframe [default: 0]
¶ Font scaling [default: 1.0]
¶ Min heat val [default: 0.0]
¶ Max heat val, if -1, defaults to None [default: 1.0]
Annotate heatmap [default: False]
Normalize matrix data [default: False]
Cluster matrix data [default: False]
¶ Type of matrix supplied [default: none]
Show x ticks [default: False]
Show y ticks [default: False]
Transpose matrix data [default: False]
¶ Color column. [default: color]
Dimensionality reduce VAE or original beta values using UMAP and plot using plotly.
pymethyl-visualize transform_plot [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Column extract from phenotype data. [default: disease]
¶ Output visualization. [default: ./visualization.html]
¶ Number of neighbors UMAP. [default: 5]
Whether to turn axes on or off.
Supervise umap embedding.
¶ UMAP min distance. [default: 0.1]
¶ Reduction metric. [default: euclidean]
Add controls from case_control column and override current disease for classification tasks. [default: False]
pymethyl-preprocess [OPTIONS] COMMAND [ARGS]...
Show the version and exit.
Deploy multiple preprocessing jobs in series or parallel.
pymethyl-preprocess batch_deploy_preprocess [OPTIONS]
¶ Number cores to use for preprocessing. [default: 6]
¶ Output subtypes pheno csv. [default: ./preprocess_outputs/]
Preprocess using meffil.
Job submission torque.
Actually run local job or just print out command.
Run commands in series.
¶ For meffil, qc parameters and pcs for final qc and functional normalization. [default: ./preprocess_outputs/pc_qc_parameters.csv]
If this is selected, loads qc results rather than running qc again. Only works for meffil selection.
Only perform QC for meffil pipeline, caches results into rds file for loading again, only works if use_cache is false.
¶ If not series, chunk up and run this number of commands at once. -1 means all commands at once.
If split MethylationArrays by subtype for either preprocessing or imputation, can use to recombine data for downstream step.
pymethyl-preprocess combine_methylation_arrays [OPTIONS]
¶ Input pickles for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]
¶ Auto grab input pkls. [default: ]
¶ Output database for beta and phenotype data. [default: ./combined_outputs/methyl_array.pkl]
¶ If -d selected, these diseases will be excluded from study. [default: ]
Concat two sample files for more fields for minfi+ input, adds more samples.
pymethyl-preprocess concat_sample_sheets [OPTIONS]
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info1.csv]
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info2.csv]
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
Create sample sheet for input to minfi, meffil, or enmix.
pymethyl-preprocess create_sample_sheet [OPTIONS]
¶ Clinical information downloaded from tcga/geo/custom. [default: ./tcga_idats/clinical_info.csv]
¶ Source type of data. [default: tcga]
¶ Idat directory. [default: ./tcga_idats/]
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
¶ Mapping file from uuid to TCGA barcode. Downloaded using download_tcga. [default: ./idat_filename_case.txt]
¶ Line to begin reading csv/xlsx. [default: 0]
¶ Disease classification column, for custom and geo datasets. [default: methylation class:ch1]
¶ Basename classification column, for custom datasets. [default: Sentrix ID (.idat)]
¶ Custom columns file containing columns to keep, separated by \n. Add a tab for each line if you wish to rename columns: original_name \t new_column_name [default: ]
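A custom columns file of this kind maps naturally onto the include_columns dictionary described in the API section. The parser below is a sketch that assumes one column per line with an optional tab-separated new name; the function name `parse_include_columns` and the column names are hypothetical.

```python
def parse_include_columns(text):
    """Map each original column name to its new name (or to itself if no rename)."""
    include_columns = {}
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Text before the first tab is the original name, text after is the new name.
        original, _, new = line.partition("\t")
        include_columns[original.strip()] = new.strip() or original.strip()
    return include_columns

# Three columns to keep; two of them are renamed.
text = "age at diagnosis\tage\nsex\ntumor grade\tgrade\n"
include_columns = parse_include_columns(text)
print(include_columns)
```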
Download all TCGA 450k clinical info.
pymethyl-preprocess download_clinical [OPTIONS]
¶ Output directory for exported idats. [default: ./tcga_idats/]
Download geo methylation study idats and clinical info.
pymethyl-preprocess download_geo [OPTIONS]
¶ GEO study to query. [default: ]
¶ Output directory for exported idats. [default: ./geo_idats/]
Download all tcga 450k data.
pymethyl-preprocess download_tcga [OPTIONS]
¶ Output directory for exported idats. [default: ./tcga_idats/]
Filter CpGs by taking x top CpGs with highest mean absolute deviation scores or via spectral feature selection.
pymethyl-preprocess feature_select [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./imputed_outputs/methyl_array.pkl]
¶ Output database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Number cpgs to include with highest variance across population. [default: 300000]
¶ Number neighbors for feature selection, default enacts rbf kernel. [default: 0]
¶ Number cpgs to apply mad filtering first before more sophisticated feature selection. If 0 or primary feature selection is mad, no mad pre-filtering. [default: 0]
Get categorical distribution of columns of sample sheet.
pymethyl-preprocess get_categorical_distribution [OPTIONS]
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/minfiSheet.csv]
¶ Column of csv to print statistics for. [default: disease]
Only look at disease, or text before subtype_delimiter.
¶ Delimiter for disease extraction. [default: ,]
Imputation of subtype or no subtype using various imputation methods.
pymethyl-preprocess imputation_pipeline [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./combined_outputs/methyl_array.pkl]
Imputes CpGs by subtype before combining again.
¶ Method of imputation. [default: KNN]
¶ Imputation library. [default: fancyimpute]
¶ Number neighbors for imputation if using KNN. [default: 5]
¶ Impute CpGs or samples. [default: Samples]
¶ Output database for beta and phenotype data. [default: ./imputed_outputs/methyl_array.pkl]
¶ Number cpgs to include with highest variance across population. Greater than 0 allows for mad filtering during imputation to skip mad step. [default: 0]
¶ Number neighbors for feature selection, default enacts rbf kernel. [default: 0]
Only look at disease, or text before subtype_delimiter.
¶ Delimiter for disease extraction. [default: ,]
¶ Value between 0 and 1 for NaN removal. If a sample has more than sample_threshold proportion of CpGs missing, the sample is removed. Set to -1 to not remove samples. [default: -1.0]
¶ Value between 0 and 1 for NaN removal. If a CpG has more than cpg_threshold proportion of samples missing, the CpG is removed. Set to -1 to not remove CpGs. [default: -1.0]
Reformat file for meffil input.
pymethyl-preprocess meffil_encode [OPTIONS]
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
Merge two sample files for more fields for minfi+ input.
pymethyl-preprocess merge_sample_sheets [OPTIONS]
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info1.csv]
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info2.csv]
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
Use second sheet’s disease column.
Don’t merge disease columns.
Print proportion of missing values throughout dataset.
pymethyl-preprocess na_report [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]
¶ Output database for na report. [default: ./na_report/]
-i option becomes directory, and searches there for multiple input pickles.
Perform preprocessing of idats using enmix or meffil.
pymethyl-preprocess preprocess_pipeline [OPTIONS]
¶ Idat dir for one sample sheet, alternatively can be your phenotype sample sheet. [default: ./tcga_idats/]
¶ Number cores to use for preprocessing. [default: 6]
¶ Output database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]
Preprocess using meffil.
¶ For meffil, number of principal components for functional normalization. If set to -1, then PCs are selected using elbow method. [default: -1]
¶ If not meffil, preprocess using minfi or enmix. [default: enmix]
Run noob normalization if minfi selected.
If this is selected, loads qc results rather than running qc again and updates them with new qc parameters. Only works for meffil selection. Minfi and enmix just load the RGSet.
Only perform QC for meffil pipeline, caching results into an rds file for loading again; only works if use_cache is false. Minfi and enmix just save the RGSet before preprocessing.
¶ From meffil documentation, “fraction of probes that failed the threshold of 3 beads”. [default: 0.05]
¶ From meffil documentation, “fraction of probes that failed a detection.pvalue threshold of 0.01”. [default: 0.05]
¶ From meffil documentation, “fraction of samples that failed the threshold of 3 beads”. [default: 0.05]
¶ From meffil documentation, “fraction of samples that failed a detection.pvalue threshold of 0.01”. [default: 0.05]
¶ From meffil documentation, “difference of total median intensity for Y chromosome probes and X chromosome probes”. [default: -2]
¶ From meffil documentation, “sex detection outliers if outside this range”. [default: 5]
Exclude diseases from study by count number or exclusion list.
pymethyl-preprocess remove_diseases [OPTIONS]
¶ Clinical information downloaded from tcga/geo/custom, formatted using create_sample_sheet. [default: ./tcga_idats/clinical_info.csv]
¶ List of conditions to exclude, from disease column, comma delimited. [default: ]
¶ CSV for minfi input. [default: ./tcga_idats/minfiSheet.csv]
¶ Remove diseases if they are below a certain count, default this is not used. [default: 0]
Only look at disease, or text before subtype_delimiter.
¶ Delimiter for disease extraction. [default: ,]
Split preprocess input samplesheet by disease subtype.
pymethyl-preprocess split_preprocess_input_by_subtype [OPTIONS]
¶ Idat csv for one sample sheet, alternatively can be your phenotype sample sheet. [default: ./tcga_idats/minfiSheet.csv]
Only look at disease, or text before subtype_delimiter.
¶ Delimiter for disease extraction. [default: ,]
¶ Output subtypes pheno csv. [default: ./preprocess_outputs/]
pymethyl-utils [OPTIONS] COMMAND [ARGS]...
Show the version and exit.
Copy methylarray pickle to new location to backup.
pymethyl-utils backup_pkl [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Output database for beta and phenotype data. [default: ./backup/methyl_array.pkl]
Convert continuous phenotype column into categorical by binning.
pymethyl-utils bin_column [OPTIONS]
¶ Pickle containing testing set. [default: ./train_val_test_sets/test_methyl_array.pkl]
¶ Column to turn into bins. [default: age]
¶ Number of bins. [default: 10]
¶ Binned shap pickle for further testing. [default: ./train_val_test_sets/test_methyl_array_shap_binned.pkl]
Concatenate two csv files together.
pymethyl-utils concat_csv [OPTIONS]
¶ Beta csv. [default: ./beta1.csv]
¶ Beta/other csv 2. [default: ./cell_estimates.csv]
¶ Output csv. [default: ./beta.concat.csv]
¶ Axis to merge on. Columns are 0, rows are 1. [default: 1]
Return categorical breakdown of phenotype column.
pymethyl-utils counts [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Key to split on. [default: disease]
Create external validation set containing same CpGs as training set.
pymethyl-utils create_external_validation_set [OPTIONS]
¶ Input methyl array. [default: ./train_val_test_sets/train_methyl_array.pkl]
¶ Input methylation array to add/subtract cpgs to. [default: ./final_preprocessed/methyl_array.pkl]
¶ Output methyl array external validation. [default: ./external_validation/methyl_array.pkl]
¶ What to do for missing CpGs. [default: mid]
Filter CpGs by taking x top CpGs with highest mean absolute deviation scores or via spectral feature selection.
pymethyl-utils feature_select_train_val_test [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./train_val_test_sets/]
¶ Output database for beta and phenotype data. [default: ./train_val_test_sets_fs/]
¶ Number cpgs to include with highest variance across population. [default: 300000]
¶ Number neighbors for feature selection, default enacts rbf kernel. [default: 0]
¶ Number cpgs to apply mad filtering first before more sophisticated feature selection. If 0 or primary feature selection is mad, no mad pre-filtering. [default: 0]
Format certain column of phenotype array in MethylationArray.
pymethyl-utils fix_key [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Key to split on. [default: disease]
Only look at disease, or text before subtype_delimiter.
¶ Delimiter for disease extraction. [default: ,]
¶ Input database for beta and phenotype data. [default: ./fixed_preprocessed/methyl_array.pkl]
Use another spreadsheet to add more descriptive data to methylarray.
pymethyl-utils modify_pheno_data [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Information passed through function create_sample_sheet, has Basename and disease fields. [default: ./tcga_idats/minfi_sheet.csv]
¶ Output database for beta and phenotype data. [default: ./modified_processed/methyl_array.pkl]
Move preprocessing jpegs to preprocessing output directory.
pymethyl-utils move_jpg [OPTIONS]
¶ Directory containing jpg. [default: ./]
¶ Output directory for images. [default: ./preprocess_output_images/]
Use another spreadsheet to add more descriptive data to methylarray.
pymethyl-utils overwrite_pheno_data [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Information passed through function create_sample_sheet, has Basename and disease fields. [default: ./tcga_idats/minfi_sheet.csv]
¶ Output database for beta and phenotype data. [default: ./modified_processed/methyl_array.pkl]
¶ Index col when reading csv. [default: 0]
Output methylarray pickle to csv.
pymethyl-utils pkl_to_csv [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/]
¶ Column to color. [default: ]
Print number of non-autosomal CpGs.
pymethyl-utils print_number_sex_cpgs [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Array Type. [default: 450k]
Print dimensions of beta matrix.
pymethyl-utils print_shape [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
Reference based cell type estimates.
pymethyl-utils ref_estimate_cell_counts [OPTIONS]
¶ Input directory containing qc data. [default: ./preprocess_outputs/]
¶ Algorithm to run cell type. [default: meffil]
¶ Cell Type Reference. [default: cord blood gse68456]
¶ IDOL Library. [default: IDOLOptimizedCpGs450klegacy]
¶ Output cell type estimates. [default: ./added_cell_counts/cell_type_estimates.csv]
Remove non-autosomal CpGs.
pymethyl-utils remove_sex [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./preprocess_outputs/methyl_array.pkl]
¶ Output methyl array autosomal. [default: ./autosomal/methyl_array.pkl]
¶ Array Type. [default: 450k]
Remove SNPs from methylation array.
pymethyl-utils remove_snps [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./autosomal/methyl_array.pkl]
¶ Output methyl array autosomal. [default: ./no_snp/methyl_array.pkl]
¶ Array Type. [default: 450k]
Set subset of CpGs from beta matrix to background values.
pymethyl-utils set_part_array_background [OPTIONS]
¶ Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]
¶ Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]
¶ Output methyl array external validation. [default: ./removal/methyl_array.pkl]
Split methylation array by key and store.
pymethyl-utils stratify [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Key to split on. [default: disease]
¶ Output directory for stratified. [default: ./stratified/]
Only retain certain number of CpGs from methylation array.
pymethyl-utils subset_array [OPTIONS]
¶ Input methyl array. [default: ./final_preprocessed/methyl_array.pkl]
¶ Pickled numpy array for subsetting. [default: ./subset_cpgs.pkl]
¶ Output methyl array external validation. [default: ./subset/methyl_array.pkl]
Split methylation array into train, test, val.
pymethyl-utils train_test_val_split [OPTIONS]
¶ Input database for beta and phenotype data. [default: ./final_preprocessed/methyl_array.pkl]
¶ Output directory for training, testing, and validation sets. [default: ./train_val_test_sets/]
¶ Percent data training on. [default: 0.8]
¶ Percent of training data that comprises validation set. [default: 0.1]
Multi-class prediction. [default: False]
Only look at disease, or text before subtype_delimiter.
¶ Key to split on. [default: disease]
¶ Delimiter for disease extraction. [default: ,]