API Reference
Quick Function Finder
“I want to…”
Get started with test data:
Download test files:
download_test_files()
Load genetic data:
PLINK binary (.bed/.bim/.fam):
load_plink_data()
PLINK2 (.pgen/.pvar/.psam):
load_pgen_data()
VCF:
load_vcf_data()
BGEN:
load_bgen_data()
Phenotype:
prepare_phenotype_data()
Validate and check data quality:
Validate genotype encoding:
validate_genotype_df()
Fix encoding issues:
validate_and_fix_encoding()
Validate phenotype data:
validate_phenotype_df()
Validate and align datasets:
validate_and_align_data()
Quality control:
Comprehensive QC:
filter_genotype_data()
Filter by MAF:
filter_variants_by_maf()
Filter by missingness:
filter_variants_by_missing()
Filter by HWE:
filter_variants_by_hwe()
Calculate HWE p-values:
calculate_hwe_pvalues()
Filter samples:
filter_samples_by_call_rate()
Check case/control balance:
check_case_control_balance()
Control for population structure:
Calculate GRM:
calculate_grm_gcta()
Load GRM:
load_grm_gcta()
Calculate PCs (basic):
calculate_pca_sklearn()
Calculate PCs (with PLINK2):
calculate_pca_plink()
Calculate PCs (for related samples):
calculate_pca_pcair()
Add PCs to phenotype:
attach_pcs_to_phenotype()
Get PC covariate names:
get_pc_covariate_list()
Find related samples:
identify_related_samples()
Remove related samples:
filter_related_samples()
Prepare data for analysis:
Split train/test:
stratified_train_test_split()
Impute missing covariates:
impute_covariates()
Run EDGE analysis:
Initialize EDGE:
EDGEAnalysis
Calculate alpha:
EDGEAnalysis.calculate_alpha()
Apply alpha:
EDGEAnalysis.apply_alpha()
Full workflow:
EDGEAnalysis.run_full_analysis()
Cross-validation:
cross_validated_edge_analysis()
Check failed SNPs:
EDGEAnalysis.get_skipped_snps()
Compare with standard GWAS:
Run additive model:
standard_gwas()/additive_gwas()
Visualize results:
Manhattan plot:
manhattan_plot()
QQ plot:
qq_plot()
Alpha distribution:
plot_alpha_distribution()
Save and format results:
Save results:
save_results()
Load alpha values:
load_alpha_values()
Format for publication:
format_gwas_output_for_locuszoom()
Save for LocusZoom:
save_for_locuszoom()
Validate LocusZoom format:
validate_locuszoom_format()
Create summary report:
create_summary_report()
Core Analysis
- class EDGEAnalysis(outcome_type='binary', outcome_transform=None, ols_method='bfgs', n_jobs=-1, max_iter=1000, verbose=True)
Main class for EDGE GWAS analysis.
- Parameters:
outcome_type (str) – Type of outcome - ‘binary’ for logistic regression or ‘continuous’ for linear regression
outcome_transform (str, optional) – Transformation for continuous outcomes. Options: None, ‘log’, ‘log10’, ‘inverse_normal’, ‘rank_inverse_normal’
ols_method (str) – Optimization method for OLS regression. Options: ‘newton’, ‘bfgs’, ‘lbfgs’, ‘nm’, ‘cg’, ‘ncg’, ‘powell’, ‘basinhopping’
n_jobs (int) – Number of parallel jobs (-1 uses all available cores)
max_iter (int) – Maximum iterations for model convergence
verbose (bool) – Print progress information
- calculate_alpha(genotype_data, phenotype_df, outcome, covariates, variant_info=None, grm_matrix=None, grm_sample_ids=None, mean_centered=False, use_fast_approximation=True)
Calculate EDGE alpha values from training data.
- Parameters:
genotype_data (pd.DataFrame) – Genotype data with samples as index and variants as columns (0/1/2 encoding)
phenotype_df (pd.DataFrame) – Phenotype data with sample IDs as index
outcome (str) – Name of outcome variable in phenotype_df
covariates (list) – List of covariate names in phenotype_df
variant_info (pd.DataFrame, optional) – Optional variant information with variant_id as index
grm_matrix (np.ndarray, optional) – Optional GRM matrix from GCTA (for population structure control)
grm_sample_ids (pd.DataFrame, optional) – DataFrame with FID, IID, and sample_id corresponding to GRM rows
mean_centered (bool) – If True, use mean-centered codominant model without intercept
use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models
- Returns:
DataFrame with alpha values for each variant
- Return type:
pd.DataFrame
- apply_alpha(genotype_data, phenotype_df, outcome, covariates, alpha_values=None, grm_matrix=None, grm_sample_ids=None, variant_info=None, use_fast_approximation=True)
Apply EDGE alpha values to test data and perform GWAS.
- Parameters:
genotype_data (pd.DataFrame) – Genotype data with samples as index and variants as columns (0/1/2 encoding)
phenotype_df (pd.DataFrame) – Phenotype data with sample IDs as index
outcome (str) – Name of outcome variable in phenotype_df
covariates (list) – List of covariate names in phenotype_df
alpha_values (pd.DataFrame, optional) – DataFrame with alpha values (from calculate_alpha). If None, uses self.alpha_values
grm_matrix (np.ndarray, optional) – Optional GRM matrix from GCTA
grm_sample_ids (pd.DataFrame, optional) – DataFrame with FID, IID, and sample_id corresponding to GRM rows
variant_info (pd.DataFrame, optional) – Optional variant information DataFrame
use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models
- Returns:
DataFrame with GWAS results
- Return type:
pd.DataFrame
- run_full_analysis(train_genotype, train_phenotype, test_genotype, test_phenotype, outcome, covariates, variant_info=None, grm_matrix=None, grm_sample_ids=None, mean_centered=False, use_fast_approximation=True, output_prefix=None)
Run complete EDGE analysis: calculate alpha on training data, apply alpha on test data.
- Parameters:
train_genotype (pd.DataFrame) – Training genotype data
train_phenotype (pd.DataFrame) – Training phenotype data
test_genotype (pd.DataFrame) – Test genotype data
test_phenotype (pd.DataFrame) – Test phenotype data
outcome (str) – Name of outcome variable
covariates (list) – List of covariate names
variant_info (pd.DataFrame, optional) – Optional variant information DataFrame
grm_matrix (np.ndarray, optional) – Optional GRM matrix from GCTA
grm_sample_ids (pd.DataFrame, optional) – Optional sample IDs for GRM
mean_centered (bool) – If True, use mean-centered model without intercept
use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models
output_prefix (str, optional) – Optional prefix for output files
- Returns:
Tuple of (alpha_df, gwas_df)
- Return type:
tuple
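The two-stage design described for run_full_analysis() (alpha estimated on training data, then applied to test data) can be sketched as the helper below. This is an illustration, not the library's implementation: `two_stage_edge` is a hypothetical name, and `edge` is assumed to be an EDGEAnalysis instance or any object exposing calculate_alpha() and apply_alpha() with the signatures documented above.

```python
def two_stage_edge(edge, train_geno, train_pheno, test_geno, test_pheno,
                   outcome, covariates):
    """Estimate alpha on the training split, then test on the held-out split."""
    # Stage 1: per-variant alpha values are estimated from training data only.
    alpha_df = edge.calculate_alpha(train_geno, train_pheno, outcome, covariates)

    # Stage 2: the same alpha values are reused for the association test
    # on samples that played no part in estimating them.
    gwas_df = edge.apply_alpha(test_geno, test_pheno, outcome, covariates,
                               alpha_values=alpha_df)
    return alpha_df, gwas_df
```

Keeping the two stages on separate samples is the reason the workflow asks for a train/test split before any alpha is calculated.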
Data Loading
- load_plink_data(bed_file, bim_file, fam_file, minor_allele_as_alt=True, verbose=True)
Load PLINK binary format data (.bed/.bim/.fam).
- Parameters:
- Returns:
Tuple of (genotype_df, variant_info_df)
- Return type:
tuple
- load_pgen_data(pgen_file, pvar_file, psam_file, minor_allele_as_alt=True, verbose=True)
Load PLINK 2 binary format data (.pgen/.pvar/.psam).
- Parameters:
- Returns:
Tuple of (genotype_df, variant_info_df)
- Return type:
tuple
Note
Requires pgenlib package:
pip install pgenlib
- load_vcf_data(vcf_file, dosage=True, minor_allele_as_alt=True, verbose=True)
Load VCF format data.
- Parameters:
- Returns:
Tuple of (genotype_df, variant_info_df)
- Return type:
tuple
Note
Requires cyvcf2 package:
pip install cyvcf2
- load_bgen_data(bgen_file, sample_file=None, minor_allele_as_alt=True, verbose=True)
Load BGEN format data.
- Parameters:
- Returns:
Tuple of (genotype_df, variant_info_df) - genotypes are dosages
- Return type:
tuple
Note
Requires bgen_reader package:
pip install bgen-reader
- prepare_phenotype_data(phenotype_file, outcome_col, covariate_cols, sample_id_col='IID', sep='\\t', log_transform_outcome=False)
Load and prepare phenotype data.
- Parameters:
phenotype_file (str) – Path to phenotype file
outcome_col (str) – Name of outcome column
covariate_cols (list) – List of covariate column names
sample_id_col (str) – Name of sample ID column (will become index)
sep (str) – File separator
log_transform_outcome (bool) – Apply log10(x+1) transformation to outcome
- Returns:
DataFrame with sample IDs as index, outcome and covariates as columns
- Return type:
pd.DataFrame
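The log10(x+1) transformation named for log_transform_outcome can be written out as below. This is a plain-Python sketch of the arithmetic only; `log10_plus_one` is a hypothetical helper, and how the library handles negative or missing outcomes is not stated in this reference.

```python
import math


def log10_plus_one(values):
    """Return log10(x + 1) for each value; the +1 keeps zeros finite."""
    transformed = []
    for x in values:
        if x <= -1:
            # log10 is undefined for non-positive arguments.
            raise ValueError(f"cannot transform value {x!r}")
        transformed.append(math.log10(x + 1))
    return transformed
```

For example, outcomes of 0, 9 and 99 map to approximately 0, 1 and 2.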
- download_test_files(output_dir='tests', version='v0.1.2', overwrite=False, verbose=True)
Download test files from GitHub repository.
Data Validation
- validate_genotype_df(genotype_df, variant_info_df=None, name='genotype_df', check_encoding=True, verbose=True, return_details=False)
Validate genotype DataFrame format and encoding.
- Parameters:
genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
variant_info_df (pd.DataFrame, optional) – Optional variant information DataFrame
name (str) – Name for error messages
check_encoding (bool) – If True, validate encoding (requires variant_info_df)
verbose (bool) – Print validation results
return_details (bool) – If True, return (passed, report_df)
- Returns:
None (raises errors if invalid) OR bool (validation passed) OR Tuple[bool, pd.DataFrame] (if return_details=True)
- Return type:
- validate_and_fix_encoding(genotype_df, variant_info_df, verbose=True)
Validate and automatically fix genotype encoding.
- validate_phenotype_df(phenotype_df, outcome_col, covariate_cols, name='phenotype_df')
Validate phenotype DataFrame format.
- Parameters:
- Raises:
TypeError – If not a pandas DataFrame
ValueError – If required columns are missing or DataFrame is invalid
- validate_and_align_data(genotype_df, phenotype_df, outcome_col=None, covariate_cols=None, geno_id_col=None, pheno_id_col=None, keep_only_common=True, verbose=True)
Validate and align genotype and phenotype data by sample IDs.
- Parameters:
genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome_col (str, optional) – Name of outcome column (optional, for validation)
covariate_cols (list, optional) – List of covariate columns (optional, for validation)
geno_id_col (str, optional) – Column name for sample IDs in genotype_df (None = use index)
pheno_id_col (str, optional) – Column name for sample IDs in phenotype_df (None = use index)
keep_only_common (bool) – If True, keep only samples present in both datasets
verbose (bool) – Print validation information
- Returns:
Tuple of (aligned_genotype_df, aligned_phenotype_df)
- Return type:
tuple
- Raises:
ValueError – If no common samples found or if keep_only_common=False and samples don’t match
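Alignment with keep_only_common=True amounts to restricting both datasets to their shared sample IDs, in the same order. The sketch below shows that logic with plain dictionaries keyed by sample ID instead of DataFrames; `align_by_sample_id` is a hypothetical name, and keeping the genotype order is an assumption rather than documented behaviour.

```python
def align_by_sample_id(genotypes, phenotypes):
    """Keep only samples present in both mappings, in genotype order.

    genotypes / phenotypes: dicts mapping sample ID -> per-sample record.
    """
    common = [sample_id for sample_id in genotypes if sample_id in phenotypes]
    if not common:
        # Mirrors the documented ValueError when nothing overlaps.
        raise ValueError("no common samples found")
    aligned_geno = {sample_id: genotypes[sample_id] for sample_id in common}
    aligned_pheno = {sample_id: phenotypes[sample_id] for sample_id in common}
    return aligned_geno, aligned_pheno
```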
Quality Control
- filter_genotype_data(genotype_df, phenotype_df=None, min_maf=None, max_missing_per_variant=None, min_call_rate_per_sample=None, verbose=True)
Comprehensive genotype data filtering with multiple QC criteria.
- Parameters:
genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
phenotype_df (pd.DataFrame, optional) – Optional phenotype DataFrame (required if filtering samples)
min_maf (float, optional) – Minimum minor allele frequency (e.g., 0.01 for 1%). If None, no MAF filtering
max_missing_per_variant (float, optional) – Maximum missing rate per variant (e.g., 0.1 for 10%). If None, no filtering
min_call_rate_per_sample (float, optional) – Minimum call rate per sample (e.g., 0.95 for 95%). If None, no filtering
verbose (bool) – Print filtering information
- Returns:
filtered_genotype_df OR (filtered_genotype_df, filtered_phenotype_df)
- Return type:
pd.DataFrame or tuple
- filter_variants_by_maf(genotype_df, min_maf=0.01, verbose=True)
Filter variants by minor allele frequency.
- filter_variants_by_missing(genotype_df, max_missing=0.1, verbose=True)
Filter variants by missing genotype rate.
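The two per-variant statistics behind these filters can be computed as in the sketch below, which assumes 0/1/2 alternate-allele counts with None marking a missing call. The helper names are hypothetical, and treating the thresholds as inclusive is an assumption, not something this reference specifies.

```python
def variant_qc_stats(genotypes):
    """Return (maf, missing_rate) for one variant (non-empty genotype list)."""
    called = [g for g in genotypes if g is not None]
    missing_rate = 1 - len(called) / len(genotypes)
    if not called:
        return float("nan"), missing_rate
    # Each called sample carries two alleles.
    alt_freq = sum(called) / (2 * len(called))
    # The minor allele is whichever allele is rarer.
    return min(alt_freq, 1 - alt_freq), missing_rate


def passes_qc(genotypes, min_maf=0.01, max_missing=0.1):
    """True if the variant meets both the MAF and the missingness threshold."""
    maf, missing_rate = variant_qc_stats(genotypes)
    return maf >= min_maf and missing_rate <= max_missing
```

A variant with no called genotypes has an undefined MAF (NaN) and therefore fails the check.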
- filter_samples_by_call_rate(genotype_df, phenotype_df, min_call_rate=0.95, verbose=True)
Filter samples by genotype call rate.
- Parameters:
- Returns:
Tuple of (filtered_genotype_df, filtered_phenotype_df)
- Return type:
tuple
- calculate_hwe_pvalues(genotype_df, verbose=True)
Calculate Hardy-Weinberg Equilibrium p-values for each variant.
- Parameters:
genotype_df (pd.DataFrame) – Genotype DataFrame
verbose (bool) – Print calculation information
- Returns:
Series of HWE p-values for each variant
- Return type:
pd.Series
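As an illustration of what an HWE p-value measures, the sketch below applies the one-degree-of-freedom chi-square goodness-of-fit test to genotype counts. Whether calculate_hwe_pvalues() uses this test or an exact test is not stated here, so treat the method and the name `hwe_chi2_pvalue` as assumptions.

```python
import math


def hwe_chi2_pvalue(n_aa, n_ab, n_bb):
    """Chi-square (1 df) HWE p-value from counts of the three genotypes."""
    n = n_aa + n_ab + n_bb
    if n == 0:
        return float("nan")
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        # Monomorphic variant: no deviation can be tested.
        return 1.0
    observed = (n_aa, n_ab, n_bb)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 degree of freedom.
    return math.erfc(math.sqrt(chi2 / 2))
```

Counts that match Hardy-Weinberg proportions give a p-value of 1, while a sample with no heterozygotes at a common variant gives a p-value far below the default hwe_threshold of 1e-6.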
- filter_variants_by_hwe(genotype_df, hwe_threshold=1e-6, verbose=True)
Filter variants by Hardy-Weinberg Equilibrium p-value.
- check_case_control_balance(phenotype_df, outcome_col, verbose=True)
Check case/control balance in binary outcome.
Population Structure Control
- calculate_grm_gcta(plink_prefix, output_prefix=None, maf_threshold=0.01, method='grm', max_threads=1, verbose=True)
Calculate genetic relationship matrix (GRM) using GCTA.
- Parameters:
plink_prefix (str) – Prefix for PLINK binary files (.bed/.bim/.fam)
output_prefix (str, optional) – Prefix for output GRM files (default: temp directory)
maf_threshold (float) – MAF threshold for variant filtering
method (str) – GRM calculation method (‘grm’ for full, ‘grm-sparse’ for sparse)
max_threads (int) – Maximum number of threads to use
verbose (bool) – Print progress information
- Returns:
Path to output GRM prefix
- Return type:
Note
Requires GCTA to be installed and available in PATH. Download from: https://yanglab.westlake.edu.cn/software/gcta/
- load_grm_gcta(grm_prefix, verbose=True)
Load GRM calculated by GCTA.
- Parameters:
- Returns:
Tuple of (grm_matrix, sample_ids_df)
- Return type:
tuple
- Raises:
FileNotFoundError – If GRM files are not found
- calculate_pca_sklearn(genotype_df, n_pcs=10, verbose=True)
Calculate principal components using scikit-learn (basic PCA without relatedness correction).
- Parameters:
- Returns:
DataFrame with ‘IID’ as index and PC1, PC2, …, PCn columns
- Return type:
pd.DataFrame
Note
This is a basic PCA without correction for relatedness. For more robust PCA accounting for relatedness, use calculate_pca_plink().
- calculate_pca_plink(file_prefix, n_pcs=10, file_format='bfile', output_prefix=None, maf_threshold=0.01, ld_window=50, ld_step=5, ld_r2=0.2, approx=False, verbose=True)
Calculate principal components using PLINK2.
- Parameters:
file_prefix (str) – Prefix for input files
n_pcs (int) – Number of principal components to calculate
file_format (str) – Input file format (‘bfile’, ‘pfile’, ‘vcf’, ‘bgen’)
output_prefix (str, optional) – Prefix for output files (default: temp directory)
maf_threshold (float, optional) – MAF threshold for variant filtering (None to skip)
ld_window (int, optional) – Window size for LD pruning in variant count (None to skip)
ld_step (int, optional) – Step size for LD pruning in variant count (None to skip)
ld_r2 (float, optional) – R² threshold for LD pruning (None to skip)
approx (bool) – Use approximate PCA for large cohorts
verbose (bool) – Print progress information
- Returns:
DataFrame with IID as index and PC1, PC2, …, PCn columns
- Return type:
pd.DataFrame
- calculate_pca_pcair(plink_prefix, n_pcs=10, kinship_matrix=None, divergence_matrix=None, output_prefix=None, kin_threshold=0.0884, div_threshold=-0.0884, maf_threshold=0.01, verbose=True)
Calculate PC-AiR (Principal Components - Analysis in Related samples).
- Parameters:
plink_prefix (str) – Prefix for PLINK binary files
n_pcs (int) – Number of principal components to calculate
kinship_matrix (str, optional) – Path to kinship matrix (if None, calculates using GCTA)
divergence_matrix (str, optional) – Path to divergence matrix (optional)
output_prefix (str, optional) – Prefix for output files (default: temp directory)
kin_threshold (float) – Kinship threshold for defining relatives
div_threshold (float) – Divergence threshold
maf_threshold (float) – MAF threshold for variant filtering
verbose (bool) – Print progress information
- Returns:
DataFrame with IID as index and PC1, PC2, …, PCn columns
- Return type:
pd.DataFrame
Note
Requires R with GENESIS, SNPRelate, and gdsfmt packages installed.
- attach_pcs_to_phenotype(phenotype_df, pca_df, n_pcs=10, pc_prefix='PC', sample_id_col=None, drop_na=False, verbose=True)
Attach principal components to phenotype DataFrame.
- Parameters:
phenotype_df (pd.DataFrame) – Phenotype DataFrame (IID as index or column)
pca_df (pd.DataFrame) – PCA DataFrame with IID as index and PC columns
n_pcs (int) – Number of PCs to attach (will use PC1 to PCn)
pc_prefix (str) – Prefix for PC column names
sample_id_col (str, optional) – Column name in phenotype_df to use for matching. If None, uses index
drop_na (bool) – If True, remove samples with missing PCs after merging
verbose (bool) – Print information about merging
- Returns:
Phenotype DataFrame with PC columns added
- Return type:
pd.DataFrame
- Raises:
ValueError – If requested PCs are not available in pca_df
- get_pc_covariate_list(n_pcs, pc_prefix='PC')
Generate list of PC covariate names for use in EDGE analysis.
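Given the PC1 to PCn naming used by attach_pcs_to_phenotype(), the generated list presumably looks like the output of the sketch below. `pc_covariate_names` is a hypothetical stand-in written for illustration, not the library function itself.

```python
def pc_covariate_names(n_pcs, pc_prefix="PC"):
    """Build ['PC1', ..., 'PCn'] for use as covariate names."""
    return [f"{pc_prefix}{i}" for i in range(1, n_pcs + 1)]


# Typical use: append the PC names to the other covariates.
covariates = ["age", "sex"] + pc_covariate_names(3)
```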
- identify_related_samples(...)
Identify pairs of related samples based on GRM threshold.
- Parameters:
- Returns:
DataFrame with columns IID1, IID2, kinship (sorted by kinship descending)
- Return type:
pd.DataFrame
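Finding related pairs from a GRM is a scan of the off-diagonal entries. The sketch below uses nested lists in place of a NumPy array and returns dictionaries with the documented IID1, IID2 and kinship fields; the function name and the strict "greater than" comparison are assumptions.

```python
def related_pairs(grm, sample_ids, threshold):
    """List sample pairs whose GRM entry exceeds the threshold."""
    pairs = []
    n = len(sample_ids)
    for i in range(n):
        # Only the upper triangle is scanned, so each pair is reported once.
        for j in range(i + 1, n):
            if grm[i][j] > threshold:
                pairs.append({"IID1": sample_ids[i],
                              "IID2": sample_ids[j],
                              "kinship": grm[i][j]})
    # Most closely related pairs first.
    pairs.sort(key=lambda pair: pair["kinship"], reverse=True)
    return pairs
```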
- filter_related_samples(...)
Filter out related samples to create an unrelated subset.
- Parameters:
phenotype_df (pd.DataFrame) – Phenotype DataFrame
grm_matrix (np.ndarray) – n_samples x n_samples GRM matrix
sample_ids (pd.DataFrame) – DataFrame with sample IDs (from load_grm_gcta)
threshold (float) – Relatedness threshold
method (str) – Method for selecting unrelated samples (‘greedy’ or ‘random’)
sample_id_col (str, optional) – Column name in phenotype_df for sample IDs. If None, uses index
verbose (bool) – Print filtering information
- Returns:
Filtered phenotype DataFrame with unrelated samples only
- Return type:
pd.DataFrame
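One common greedy strategy for building an unrelated subset is to repeatedly drop the sample with the most remaining relatives until no related pair is left. The sketch below implements that strategy on nested lists; whether the library's 'greedy' method works exactly this way is an assumption, and `greedy_unrelated` is a hypothetical name.

```python
def greedy_unrelated(grm, sample_ids, threshold):
    """Return sample IDs left after greedily removing highly related samples."""
    n = len(sample_ids)
    relatives = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if grm[i][j] > threshold:
                relatives[i].add(j)
                relatives[j].add(i)

    kept = set(range(n))
    while kept:
        # Sample with the largest number of relatives still in the kept set.
        worst = max(kept, key=lambda i: len(relatives[i] & kept))
        if not relatives[worst] & kept:
            break  # no related pairs remain
        kept.remove(worst)
    return [sample_ids[i] for i in sorted(kept)]
```

Removing the most connected sample first tends to retain more samples than dropping one member of each pair arbitrarily.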
Data Preparation
- stratified_train_test_split(genotype_df, phenotype_df, outcome_col, test_size=0.5, random_state=42, is_binary=True, geno_id_col=None, pheno_id_col=None)
Split data into training and test sets with stratification.
- Parameters:
genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome_col (str) – Name of outcome column for stratification
test_size (float) – Proportion of samples in test set
random_state (int) – Random seed for reproducibility
is_binary (bool) – Whether outcome is binary (enables stratification)
geno_id_col (str, optional) – Column name in genotype_df for sample IDs. If None, uses index
pheno_id_col (str, optional) – Column name in phenotype_df for sample IDs. If None, uses index
- Returns:
Tuple of (train_geno, test_geno, train_pheno, test_pheno)
- Return type:
tuple
- Raises:
ValueError – If no common samples found or stratification fails
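Stratification means the split is made within each outcome class, so the case/control ratio is the same in both halves. The sketch below shows the idea with the standard library only; it is not the library's implementation, and `stratified_split` with its dictionary input is a hypothetical interface.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_size=0.5, random_state=42):
    """Split sample IDs into (train, test), preserving class proportions.

    labels: dict mapping sample ID -> class label (e.g. 0 = control, 1 = case).
    """
    rng = random.Random(random_state)
    by_class = defaultdict(list)
    for sample_id, label in labels.items():
        by_class[label].append(sample_id)

    train, test = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])  # fixed order makes the seed reproducible
        rng.shuffle(ids)
        n_test = round(len(ids) * test_size)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test
```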
- impute_covariates(phenotype_df, covariate_cols, method='median', drop_na=False, verbose=True)
Impute missing values in covariates.
- Parameters:
phenotype_df (pd.DataFrame) – Phenotype DataFrame with covariates
covariate_cols (list) – List of covariate column names to impute
method (str) – Imputation method - ‘drop’, ‘mean’, ‘median’, ‘mode’, ‘knn’, ‘missforest’, ‘mice’
drop_na (bool) – If True, drop rows with missing outcome after imputation
verbose (bool) – Print imputation information
- Returns:
DataFrame with imputed covariates
- Return type:
pd.DataFrame
Note
For ‘missforest’ and ‘mice’, install:
pip install missingpy
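The default 'median' method replaces each missing covariate value with the median of the observed values in that column. A minimal sketch of that rule, using a list of dictionaries in place of a DataFrame and None for missing values (both assumptions made so the example runs without dependencies):

```python
from statistics import median


def impute_median(rows, covariate_cols):
    """Return copies of rows with missing covariates set to the column median."""
    imputed = [dict(row) for row in rows]  # leave the input untouched
    for col in covariate_cols:
        observed = [row[col] for row in rows if row[col] is not None]
        if not observed:
            continue  # nothing to estimate a median from
        fill_value = median(observed)
        for row in imputed:
            if row[col] is None:
                row[col] = fill_value
    return imputed
```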
Standard GWAS
- standard_gwas(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary')
Perform standard additive GWAS for comparison with EDGE.
- Parameters:
- Returns:
DataFrame with variant_id, coef, pval, std_err
- Return type:
pd.DataFrame
- additive_gwas(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary')
Alias for standard_gwas(). Perform standard additive GWAS.
- Parameters:
- Returns:
DataFrame with variant_id, coef, pval, std_err
- Return type:
pd.DataFrame
- cross_validated_edge_analysis(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary', n_folds=5, n_jobs=8, random_state=42)
Perform k-fold cross-validation for EDGE analysis.
- Parameters:
genotype_df (pd.DataFrame) – Genotype DataFrame
phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome (str) – Name of outcome column
covariates (list) – List of covariate column names
outcome_type (str) – ‘binary’ or ‘continuous’
n_folds (int) – Number of cross-validation folds
n_jobs (int) – Number of parallel jobs for EDGE analysis
random_state (int) – Random seed for reproducibility
- Returns:
Tuple of (avg_alpha, meta_gwas_df, combined_alpha, combined_gwas)
- Return type:
tuple
Visualization
- manhattan_plot(gwas_df, output='manhattan.png', title='EDGE GWAS Manhattan Plot', sig_threshold=5e-8, figsize=(14, 6), colors=None)
Create Manhattan plot from EDGE GWAS results.
- Parameters:
gwas_df (pd.DataFrame or list) – DataFrame or list of DataFrames with columns ‘chrom’, ‘pos’, ‘pval’
output (str) – Output filename for the plot
title (str) – Plot title
sig_threshold (float) – Genome-wide significance threshold
figsize (tuple) – Figure size as (width, height)
colors (list, optional) – List of two colors for alternating chromosomes
- qq_plot(gwas_df, output='qq_plot.png', title='EDGE GWAS QQ Plot', figsize=(8, 8))
Create QQ plot from EDGE GWAS results and calculate genomic inflation factor.
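The genomic inflation factor is conventionally defined as the median of the observed one-degree-of-freedom chi-square statistics divided by the median expected under the null (about 0.455). The sketch below computes it from p-values with the standard library; that qq_plot() uses exactly this definition is an assumption, and `genomic_inflation_factor` is a hypothetical name.

```python
from statistics import NormalDist, median


def genomic_inflation_factor(pvalues):
    """Lambda GC: median observed chi-square (1 df) over its null median."""
    std_normal = NormalDist()
    # A two-sided p-value maps back to chi2 = z**2 with z = inv_cdf(p / 2).
    chi2 = [std_normal.inv_cdf(p / 2) ** 2 for p in pvalues if 0 < p <= 1]
    # Null median of a chi-square(1) variable, i.e. the value at p = 0.5.
    expected_median = std_normal.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median
```

Values close to 1 indicate well-calibrated test statistics; values well above 1 suggest inflation, for example from uncorrected population structure.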
- plot_alpha_distribution(alpha_df, output='alpha_distribution.png', bins=50, figsize=(10, 6), xlim=None)
Plot distribution of alpha values.
Input/Output
- save_results(gwas_df, alpha_df=None, output_prefix='edge_gwas', save_alpha=True)
Save EDGE GWAS results to files.
- load_alpha_values(alpha_file)
Load pre-calculated alpha values.
- Parameters:
alpha_file (str) – Path to alpha values file
- Returns:
DataFrame with alpha values
- Return type:
pd.DataFrame
- format_gwas_output_for_locuszoom(gwas_df, include_alpha=True, sort_by='pval', format_for_locuszoom=False)
Format GWAS output for publication/reporting or LocusZoom upload.
- Parameters:
- Returns:
Formatted DataFrame
- Return type:
pd.DataFrame
Note
For LocusZoom format, the output will be tab-delimited with columns: chrom, pos, ref, alt, pval, beta, se, eaf, and optionally alpha_value. The file should be sorted by chrom and pos, compressed with bgzip, and indexed with tabix for optimal LocusZoom performance.
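The sorting and tab-delimited layout described in the note can be sketched as follows. The column names come from the note above; the helper names, the input as a list of dictionaries, and the ordering of non-numeric chromosomes after numeric ones are assumptions for illustration.

```python
import csv
import io

COLUMNS = ["chrom", "pos", "ref", "alt", "pval", "beta", "se", "eaf"]


def chrom_sort_key(chrom):
    """Order numeric chromosomes numerically, then others alphabetically."""
    label = str(chrom)
    if label.startswith("chr"):
        label = label[3:]
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)


def to_locuszoom_tsv(rows):
    """Return rows as tab-delimited text, sorted by chromosome and position."""
    ordered = sorted(rows, key=lambda r: (chrom_sort_key(r["chrom"]), int(r["pos"])))
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, delimiter="\t",
                            lineterminator="\n", extrasaction="ignore")
    writer.writeheader()
    writer.writerows(ordered)
    return buffer.getvalue()
```

The resulting text can be written to a .tsv file before compression and indexing.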
- save_for_locuszoom(gwas_df, output_file, include_alpha=True, compress=True)
Save GWAS results in LocusZoom-compatible format.
- Parameters:
Note
For best performance with LocusZoom:
Compress with bgzip:
bgzip output_file.tsv
Index with tabix:
tabix -s 1 -b 2 -e 2 output_file.tsv.gz
- validate_locuszoom_format(gwas_df)
Validate that GWAS results meet LocusZoom format requirements.
- Parameters:
gwas_df (pd.DataFrame) – GWAS results DataFrame
- Returns:
Dictionary with validation results (valid, errors, warnings, info)
- Return type:
dict
- create_summary_report(gwas_df, alpha_df=None, significance_threshold=5e-8, output_file=None)
Create a summary report of EDGE GWAS analysis.
Function Index
- Core Analysis (see core_analysis)
- Data Loading (see data_loading)
- Data Validation (see data_validation)
- Quality Control (see quality_control)
- Population Structure Control (see population_structure)
- Data Preparation (see data_preparation)
- Standard GWAS (see standard_gwas)
- Visualization (see Visualization Guide)
- Input/Output (see input_output)
See Also
Documentation:
Documentation Home - Home
Installation Guide - Installation instructions and requirements
Quick Start Guide - Getting started guide with simple examples
Statistical Model - Statistical methods and mathematical background
Example Workflows - Example analyses and case studies
Visualization Guide - Plotting and visualization guide
API Reference - Complete API documentation
Troubleshooting Guide - Troubleshooting guide and common issues
Frequently Asked Questions (FAQ) - Frequently asked questions
Citation - How to cite EDGE in publications
Changelog - Version history and release notes
Advanced Topics for Further Updates - Planned features and roadmap
—
Last updated: 2026-02-10 for edge-gwas v0.1.2
For questions or issues, visit: https://github.com/nicenzhou/edge-gwas/issues