API Reference
Complete API documentation for edge-gwas package.
Core Module
EDGEAnalysis Class
Main class for EDGE GWAS analysis.
from edge_gwas import EDGEAnalysis
# Binary outcome
edge = EDGEAnalysis(outcome_type='binary')
# Continuous outcome with transformation
edge = EDGEAnalysis(
outcome_type='continuous',
outcome_transform='rank_inverse_normal'
)
Parameters:
outcome_type: ‘binary’ or ‘continuous’outcome_transform: ‘log’, ‘log10’, ‘inverse_normal’, ‘rank_inverse_normal’n_jobs: Number of parallel jobs (-1 = all cores)outcome_transform(str, optional): NEW in v0.1.1 - Transformation for continuous outcomesNone: No transformation (default)'log': Natural log transformation'log10': Log base 10 transformation'inverse_normal': Inverse normal transformation (parametric)'rank_inverse_normal': Rank-based inverse normal transformation (robust to outliers)
ols_method(str, optional): NEW in v0.1.1 - Optimization method for OLS regression (continuous outcomes only)'bfgs': Broyden-Fletcher-Goldfarb-Shannon algorithm (default)Best for: General purpose, balanced speed and accuracy
Convergence: Fast, uses approximate Hessian
Memory: Moderate (O(p²) where p = number of parameters)
Recommended: Most analyses
'newton': Newton-Raphson algorithmBest for: High precision requirements
Convergence: Very fast when close to optimum (quadratic)
Memory: High (requires exact Hessian computation)
Recommended: Small datasets, need high accuracy
'lbfgs': Limited-memory BFGSBest for: Large datasets (N > 100,000)
Convergence: Slightly slower than BFGS
Memory: Low (O(mp) where m ≈ 10)
Recommended: Biobank-scale data, memory-constrained systems
'nm': Nelder-Mead simplex algorithmBest for: Robustness, convergence issues with gradient methods
Convergence: Slow, derivative-free
Memory: Low
Recommended: Troubleshooting convergence failures
'cg': Conjugate GradientBest for: Sparse problems, many covariates
Convergence: Moderate
Memory: Low
Recommended: Large covariate sets
'ncg': Newton Conjugate GradientBest for: Large-scale problems with >100 covariates
Convergence: Good, uses Hessian-vector products
Memory: Moderate
Recommended: High-dimensional analyses
'powell': Powell’s conjugate direction methodBest for: Derivative-free optimization
Convergence: Slow, robust
Memory: Low
Recommended: Non-smooth objective functions (rare in GWAS)
'basinhopping': Basin-hopping global optimizationBest for: Multiple local minima (very rare in GWAS)
Convergence: Very slow, explores parameter space thoroughly
Memory: Moderate
Recommended: Only when suspecting multiple solutions
Note: Optimization method only affects continuous outcomes with linear regression. Binary outcomes always use BFGS for logistic regression.
n_jobs(int): Number of CPU cores for parallel processing-1: Use all available cores1: Single-threaded (no parallelization)n: Use n cores
max_iter(int): Maximum iterations for model convergenceDefault: 1000 (usually sufficient for BFGS, Newton)
Increase to 2000-5000 for Nelder-Mead or if convergence warnings occur
Reduce to 500 if most variants converge quickly
verbose(bool): Print progress messages
Methods
calculate_alpha()
Calculate encoding parameters from training data.
alpha_df = edge.calculate_alpha(
genotype_data,
phenotype_df,
outcome,
covariates,
variant_info=None,
grm_matrix=None, # NEW in v0.1.1
grm_sample_ids=None, # NEW in v0.1.1
mean_centered=False # NEW in v0.1.1
)
Parameters:
genotype_data(pandas.DataFrame): Genotype matrix (samples × variants), values {0, 1, 2}phenotype_df(pandas.DataFrame): Phenotype data with outcome and covariatesoutcome(str): Name of outcome columncovariates(list): List of covariate column namesvariant_info(pandas.DataFrame, optional): Variant information (variant_id, chr, pos, ref_allele, alt_allele)grm_matrix(numpy.ndarray, optional): NEW in v0.1.1 - GRM matrix for population structure controlgrm_sample_ids(pandas.DataFrame, optional): NEW in v0.1.1 - Sample IDs corresponding to GRMmean_centered(bool, optional): NEW in v0.1.1 - If True, fit codominant regression without intercept (default: False)
Returns:
pandas.DataFrame: Alpha values with columns: variant_id, alpha_value, ref_allele, alt_allele, eaf, coef_het, coef_hom, std_err_het, std_err_hom, pval_het, pval_hom, n_samples
Examples:
# Basic usage
alpha_df = edge.calculate_alpha(
genotype_data=train_geno,
phenotype_df=train_pheno,
outcome='trait',
covariates=['age', 'sex']
)
# With GRM for population structure control
from edge_gwas.utils import load_grm_gcta
grm_matrix, grm_ids = load_grm_gcta('grm_prefix')
alpha_df = edge.calculate_alpha(
genotype_data=train_geno,
phenotype_df=train_pheno,
outcome='trait',
covariates=['age', 'sex'],
grm_matrix=grm_matrix,
grm_sample_ids=grm_ids
)
# Mean-centered model (Y - E[Y], no intercept)
alpha_df = edge.calculate_alpha(
genotype_data=train_geno,
phenotype_df=train_pheno,
outcome='trait',
covariates=['age', 'sex'],
mean_centered=True
)
Notes:
Model:
Y ~ β_het × I(geno=1) + β_hom × I(geno=2) + covariatesAlpha:
α = β_het / β_homWhen
mean_centered=True: no intercept term in the regression model
apply_alpha()
Apply alpha values to test data for GWAS.
gwas_df = edge.apply_alpha(
genotype_data,
phenotype_df,
outcome,
covariates,
alpha_values,
grm_matrix=None, # NEW in v0.1.1
grm_sample_ids=None # NEW in v0.1.1
)
Parameters:
genotype_data(pandas.DataFrame): Test genotype matrixphenotype_df(pandas.DataFrame): Test phenotype dataoutcome(str): Name of outcome columncovariates(list): List of covariate column namesalpha_values(pandas.DataFrame): Alpha values from traininggrm_matrix(numpy.ndarray, optional): NEW in v0.1.1 - GRM matrix for mixed modelgrm_sample_ids(pandas.DataFrame, optional): NEW in v0.1.1 - Sample IDs for GRM
Returns:
pandas.DataFrame: GWAS results with columns:variant_id: SNP identifierchr: Chromosomepos: Positionpval: P-valuecoef: Effect coefficientstd_err: Standard errorstat: Test statisticalpha_value: Applied alpha
run_full_analysis()
Complete two-stage EDGE analysis.
alpha_df, gwas_df = edge.run_full_analysis(
train_genotype,
train_phenotype,
test_genotype,
test_phenotype,
outcome,
covariates,
variant_info=None,
grm_matrix=None, # NEW in v0.1.1
grm_sample_ids=None, # NEW in v0.1.1
mean_centered=False, # NEW in v0.1.1
output_prefix=None
)
Parameters:
train_genotype/test_genotype(pandas.DataFrame): Training/test genotype datatrain_phenotype/test_phenotype(pandas.DataFrame): Training/test phenotype dataoutcome(str): Outcome column namecovariates(list): Covariate column namesvariant_info(pandas.DataFrame, optional): Variant informationgrm_matrix(numpy.ndarray, optional): NEW in v0.1.1 - GRM for population structure controlgrm_sample_ids(pandas.DataFrame, optional): NEW in v0.1.1 - Sample IDs for GRMmean_centered(bool, optional): NEW in v0.1.1 - If True, fit codominant regression without intercept (default: False)output_prefix(str, optional): Prefix for output files
Returns:
tuple: (alpha_df, gwas_df)
Examples:
# Basic usage
alpha_df, gwas_df = edge.run_full_analysis(
train_geno, train_pheno,
test_geno, test_pheno,
outcome='trait',
covariates=['age', 'sex'],
output_prefix='results/edge'
)
# With GRM and mean-centered model
from edge_gwas.utils import load_grm_gcta
grm_matrix, grm_ids = load_grm_gcta('grm_prefix')
alpha_df, gwas_df = edge.run_full_analysis(
train_geno, train_pheno,
test_geno, test_pheno,
outcome='trait',
covariates=['age', 'sex'],
grm_matrix=grm_matrix,
grm_sample_ids=grm_ids,
mean_centered=True,
output_prefix='results/edge'
)
get_skipped_snps()
Get list of SNPs that failed convergence.
skipped = edge.get_skipped_snps()
Returns:
list: List of variant IDs that were skipped
Utilities Module
Calculate genetic relationship matrix (GRM) using GCTA.
from edge_gwas.utils import calculate_grm_gcta
grm_prefix = calculate_grm_gcta(
plink_prefix='genotypes',
output_prefix='output/grm',
maf_threshold=0.01,
method='grm',
max_threads=8,
verbose=True
)
Parameters:
plink_prefix(str): Prefix for PLINK binary files (.bed/.bim/.fam)output_prefix(str, optional): Prefix for output GRM files (default: temp directory)maf_threshold(float): MAF threshold for variant filtering (default: 0.01)method(str): GRM calculation method'grm': Full GRM (default)'grm-sparse': Sparse GRM (faster for large cohorts)
max_threads(int): Maximum number of threads to use (default: 1)verbose(bool): Print progress information
Returns:
str: Path to output GRM prefix
Output files:
prefix.grm.bin: GRM values (lower triangle, binary format)prefix.grm.N.bin: Number of SNPs used for each pairprefix.grm.id: Sample IDs (FID and IID)
Note:
Requires GCTA to be installed. Install with:
edge-gwas-install-tools
Load GRM calculated by GCTA.
from edge_gwas.utils import load_grm_gcta
grm_matrix, sample_ids = load_grm_gcta(
grm_prefix='output/grm',
verbose=True
)
Parameters:
grm_prefix(str): Prefix for GRM files (without .grm.bin extension)verbose(bool): Print loading information
Returns:
tuple: (grm_matrix, sample_ids_df)grm_matrix(numpy.ndarray): n_samples × n_samples symmetric GRM matrixsample_ids_df(pandas.DataFrame): DataFrame with FID and IID columns
Example:
# Calculate and load GRM
grm_prefix = calculate_grm_gcta('genotypes', output_prefix='grm/output')
grm_matrix, grm_ids = load_grm_gcta('grm/output')
print(f"GRM shape: {grm_matrix.shape}")
print(f"Mean diagonal: {np.diag(grm_matrix).mean():.3f}")
Identify pairs of related samples based on GRM threshold.
from edge_gwas.utils import identify_related_samples
related_pairs = identify_related_samples(
grm_matrix,
sample_ids,
threshold=0.0884, # ~3rd degree relatives
verbose=True
)
Parameters:
grm_matrix(numpy.ndarray): GRM matrixsample_ids(pandas.DataFrame): Sample IDs from load_grm_gcta()threshold(float): Relatedness threshold0.354: 1st degree (parent-offspring, full siblings)0.177: 2nd degree (half-siblings, grandparent-grandchild)0.0884: 3rd degree (first cousins) - default
verbose(bool): Print summary statistics
Returns:
pandas.DataFrame: Related pairs with columns:IID1: First sample IDIID2: Second sample IDkinship: Kinship coefficient
Example:
grm_matrix, sample_ids = load_grm_gcta('grm_prefix')
# Find 2nd degree or closer relatives
related = identify_related_samples(grm_matrix, sample_ids, threshold=0.177)
print(f"Found {len(related)} related pairs")
Filter out related samples to create an unrelated subset.
from edge_gwas.utils import filter_related_samples
unrelated_pheno = filter_related_samples(
phenotype_df,
grm_matrix,
sample_ids,
threshold=0.0884,
method='greedy',
sample_id_col=None,
verbose=True
)
Parameters:
phenotype_df(pandas.DataFrame): Phenotype DataFramegrm_matrix(numpy.ndarray): GRM matrixsample_ids(pandas.DataFrame): Sample IDs from load_grm_gcta()threshold(float): Relatedness thresholdmethod(str): Method for selecting unrelated samples'greedy': Iteratively remove sample with most relatives (default)'random': Randomly remove one from each related pair
sample_id_col(str, optional): Column name in phenotype_df for sample IDs (if None, uses index)verbose(bool): Print filtering information
Returns:
pandas.DataFrame: Filtered phenotype DataFrame with unrelated samples only
Example:
# Remove related samples for QC
unrelated_pheno = filter_related_samples(
phenotype_df, grm_matrix, sample_ids,
threshold=0.0884, method='greedy'
)
print(f"Unrelated samples: {len(unrelated_pheno)}")
UPDATED in v0.1.1 - Calculate principal components using PLINK2 with support for multiple input formats.
from edge_gwas.utils import calculate_pca_plink
pca_df = calculate_pca_plink(
file_prefix='genotypes',
n_pcs=10,
file_format='bfile',
output_prefix=None,
maf_threshold=0.01,
ld_window=50,
ld_step=5,
ld_r2=0.2,
approx=False,
verbose=True
)
Parameters:
file_prefix(str): Prefix for input files (renamed from plink_prefix)n_pcs(int): Number of principal components to calculate (default: 10)file_format(str): NEW in v0.1.1 - Input file format (default: ‘bfile’). Options:'bfile': PLINK1 binary (.bed/.bim/.fam)'pfile': PLINK2 binary (.pgen/.pvar/.psam)'vcf': VCF file (.vcf or .vcf.gz)'bgen': BGEN file (.bgen)
output_prefix(str, optional): Prefix for output files (default: temp directory)maf_threshold(float, optional): UPDATED - MAF threshold for variant filtering (default: 0.01, None to skip)ld_window(int, optional): UPDATED - Window size for LD pruning in variant count (default: 50, None to skip LD pruning)ld_step(int, optional): UPDATED - Step size for LD pruning (default: 5, None to skip LD pruning)ld_r2(float, optional): UPDATED - R² threshold for LD pruning (default: 0.2, None to skip LD pruning)approx(bool): Use approximate PCA for large cohorts (default: False)verbose(bool): Print progress information (default: True)
Returns:
pandas.DataFrame: PCA results with IID as index and PC1, PC2, …, PCn as columns
Note:
Requires PLINK2 to be installed. Install with:
edge-gwas-install-tools
Changes from v0.1.0:
Added
file_formatparameter for multiple input formatsRenamed
plink_prefixtofile_prefixMAF and LD parameters now optional (set to None to skip)
Removed
approx_samplesparameter (PLINK2 handles automatically)Fixed output parsing to skip header lines
Example:
# PLINK1 binary format (default, backward compatible)
pca_df = calculate_pca_plink('genotypes', n_pcs=10)
# PLINK2 binary format
pca_df = calculate_pca_plink(
'genotypes',
n_pcs=10,
file_format='pfile'
)
# VCF file
pca_df = calculate_pca_plink(
'mydata.vcf.gz',
n_pcs=10,
file_format='vcf'
)
# Skip LD pruning for small datasets
pca_df = calculate_pca_plink(
'genotypes',
n_pcs=10,
ld_window=None,
ld_step=None,
ld_r2=None
)
Calculate PC-AiR (Principal Components - Analysis in Related samples).
from edge_gwas.utils import calculate_pca_pcair
pca_df = calculate_pca_pcair(
plink_prefix='genotypes',
n_pcs=10,
kinship_matrix=None,
divergence_matrix=None,
output_prefix=None,
kin_threshold=0.0884,
div_threshold=-0.0884,
maf_threshold=0.01,
verbose=True
)
Parameters:
plink_prefix(str): Prefix for PLINK binary files (.bed/.bim/.fam)n_pcs(int): Number of principal components to calculate (default: 10)kinship_matrix(str, optional): Path to kinship matrix file (GCTA GRM format prefix). If None, will compute using calculate_grm_gcta()divergence_matrix(str, optional): Path to divergence matrix file (optional)output_prefix(str, optional): Prefix for output files (default: temp directory)kin_threshold(float): Kinship threshold for defining relatedness (default: 0.0884, ~ 3rd degree relatives)div_threshold(float): Divergence threshold (default: -0.0884)maf_threshold(float): MAF threshold for GRM calculation if kinship_matrix is None (default: 0.01)verbose(bool): Print progress information (default: True)
Returns:
pandas.DataFrame: PCA results with IID as index and PC1, PC2, …, PCn as columns
Note:
Requires R with GENESIS package installed:
edge-gwas-install-tools
Or install manually in R:
if (!requireNamespace("BiocManager", quietly = TRUE))
install.packages("BiocManager")
BiocManager::install("GENESIS")
BiocManager::install("SNPRelate")
BiocManager::install("gdsfmt")
Reference:
Conomos et al. (2015) Genetic Epidemiology 39(4):276-293
Example:
# PC-AiR with automatic GRM calculation
pca_df = calculate_pca_pcair('genotypes', n_pcs=10)
# PC-AiR with pre-computed GRM
pca_df = calculate_pca_pcair(
'genotypes',
n_pcs=10,
kinship_matrix='grm_prefix'
)
# Custom kinship threshold (2nd degree relatives)
pca_df = calculate_pca_pcair(
'genotypes',
n_pcs=10,
kin_threshold=0.177
)
# Use PC-AiR results as covariates
pheno = attach_pcs_to_phenotype(pheno, pca_df, n_pcs=10)
covariates = ['age', 'sex'] + get_pc_covariate_list(10)
Calculate principal components using scikit-learn (basic PCA without relatedness correction).
from edge_gwas.utils import calculate_pca_sklearn
pca_df = calculate_pca_sklearn(
genotype_df,
n_pcs=10,
verbose=True
)
Parameters:
genotype_df(pandas.DataFrame): Genotype matrix (samples × variants)n_pcs(int): Number of principal components to calculate (default: 10)verbose(bool): Print progress information
Returns:
pandas.DataFrame: PCA results with IID as index and PC1, PC2, …, PCn as columns
Note:
This is a basic PCA without correction for relatedness. For robust PCA, use calculate_pca_plink() or calculate_pca_pcair().
Example:
# Quick PCA for exploratory analysis
pca_df = calculate_pca_sklearn(genotype_df, n_pcs=10)
# Check variance explained
print("PCA complete")
UPDATED in v0.1.1 - Attach principal components to phenotype DataFrame with option to drop missing samples.
from edge_gwas.utils import attach_pcs_to_phenotype
pheno_with_pcs = attach_pcs_to_phenotype(
phenotype_df,
pca_df,
n_pcs=10,
pc_prefix='PC',
sample_id_col=None,
drop_na=False,
verbose=True
)
Parameters:
phenotype_df(pandas.DataFrame): Phenotype DataFramepca_df(pandas.DataFrame): PCA DataFrame with IID as indexn_pcs(int): Number of PCs to attach (default: 10)pc_prefix(str): Prefix for PC column names (default: ‘PC’)sample_id_col(str, optional): Column name in phenotype_df for sample IDs (None = use index)drop_na(bool): NEW in v0.1.1 - If True, remove samples with missing PCs after merging (default: False)verbose(bool): Print information about merging (default: True)
Returns:
pandas.DataFrame: Phenotype DataFrame with PC columns added
Note:
The function automatically handles type mismatches between sample IDs (e.g., strings vs integers).
Changes from v0.1.0:
Added
drop_naparameter to remove samples without PCsImproved sample ID matching with automatic type conversion
Fixed duplicate column issues during merge
Example:
# Calculate PCA
pca_df = calculate_pca_plink('genotypes', n_pcs=10)
# Attach PCs, keep samples with missing PCs (default)
pheno = attach_pcs_to_phenotype(pheno, pca_df, n_pcs=10)
# Attach PCs and remove samples without PCs
pheno = attach_pcs_to_phenotype(
pheno,
pca_df,
n_pcs=10,
drop_na=True
)
# When phenotype has IID as column
pheno = attach_pcs_to_phenotype(
pheno,
pca_df,
n_pcs=10,
sample_id_col='IID',
drop_na=True
)
# Complete workflow with PCs as covariates
pca_df = calculate_pca_plink('genotypes', n_pcs=10)
pheno = attach_pcs_to_phenotype(pheno, pca_df, n_pcs=10, drop_na=True)
# Generate PC covariate list
from edge_gwas.utils import get_pc_covariate_list
covariates = ['age', 'sex'] + get_pc_covariate_list(10)
# Run EDGE analysis with PCs
alpha_df, gwas_df = edge.run_full_analysis(
train_g, train_p, test_g, test_p,
outcome='disease',
covariates=covariates
)
Generate list of PC covariate names for use in EDGE analysis.
from edge_gwas.utils import get_pc_covariate_list
pc_list = get_pc_covariate_list(n_pcs=10, pc_prefix='PC')
Parameters:
n_pcs(int): Number of PCspc_prefix(str): Prefix for PC column names (default: ‘PC’)
Returns:
list: List of PC column names [‘PC1’, ‘PC2’, …, ‘PCn’]
Example:
# Generate PC covariate list
pc_list = get_pc_covariate_list(10)
# Returns: ['PC1', 'PC2', 'PC3', ..., 'PC10']
# Use in EDGE analysis
covariates = ['age', 'sex'] + get_pc_covariate_list(10)
alpha_df, gwas_df = edge.run_full_analysis(
train_g, train_p, test_g, test_p,
outcome='disease',
covariates=covariates
)
load_plink_data()
Load PLINK binary format data (.bed/.bim/.fam).
from edge_gwas.utils import load_plink_data
geno, info = load_plink_data('data.bed', 'data.bim', 'data.fam')
Parameters:
bed_file(str): Path to .bed filebim_file(str): Path to .bim filefam_file(str): Path to .fam fileminor_allele_as_alt(bool): Ensure minor allele is ALT (default: True)verbose(bool): Print loading information
Returns: tuple of (genotype_df, variant_info_df)
load_pgen_data()
Load PLINK 2 binary format data (.pgen/.pvar/.psam).
geno, info = load_pgen_data('data.pgen', 'data.pvar', 'data.psam')
Requires: pip install pgenlib
load_vcf_data()
Load VCF format data (.vcf or .vcf.gz).
geno, info = load_vcf_data('data.vcf.gz', dosage=True)
Parameters:
dosage(bool): Use dosages (DS field) vs hard calls (GT field)minor_allele_as_alt(bool): Ensure minor allele is ALT
Requires: pip install cyvcf2
load_bgen_data()
Load BGEN format data.
geno, info = load_bgen_data('data.bgen', sample_file='data.sample')
Requires: pip install bgen-reader
prepare_phenotype_data()
Load and prepare phenotype data.
pheno = prepare_phenotype_data(
'pheno.txt',
outcome_col='disease',
covariate_cols=['age', 'sex', 'pc1', 'pc2'],
sample_id_col='IID',
sep='\t'
)
Parameters:
outcome_col: Name of outcome columncovariate_cols: List of covariate columnssample_id_col: Sample ID column (becomes index)sep: File separator
Split data into training and test sets with stratification and flexible sample ID matching.
from edge_gwas.utils import stratified_train_test_split
train_g, test_g, train_p, test_p = stratified_train_test_split(
genotype_df,
phenotype_df,
outcome_col='disease',
test_size=0.5,
random_state=42,
is_binary=True,
geno_id_col=None,
pheno_id_col=None
)
Parameters:
genotype_df(pandas.DataFrame): Genotype dataphenotype_df(pandas.DataFrame): Phenotype dataoutcome_col(str): Outcome column for stratificationtest_size(float): Proportion for test set (0-1)random_state(int): Random seed for reproducibilityis_binary(bool): Whether to stratify by outcomegeno_id_col(str, optional): NEW in v0.1.1 - Column or index name for sample IDs in genotype_df. If None, uses the DataFrame index (default: None)pheno_id_col(str, optional): NEW in v0.1.1 - Column or index name for sample IDs in phenotype_df. If None, uses the DataFrame index (default: None)
Returns:
tuple: (train_geno, test_geno, train_pheno, test_pheno)
Examples:
# Example 1: Basic usage with matching indices (recommended)
pheno = pheno.set_index('IID') # Set IID as index
train_g, test_g, train_p, test_p = stratified_train_test_split(
geno, pheno, 'disease', test_size=0.5
)
# Example 2: Specify different ID columns for genotype and phenotype
train_g, test_g, train_p, test_p = stratified_train_test_split(
geno, pheno, 'disease',
geno_id_col='sample_id', # Use 'sample_id' column/index from genotype
pheno_id_col='IID', # Use 'IID' column from phenotype
test_size=0.5
)
# Example 3: Genotype uses index, phenotype uses column
train_g, test_g, train_p, test_p = stratified_train_test_split(
geno, pheno, 'disease',
geno_id_col=None, # Use genotype index
pheno_id_col='IID', # Use phenotype 'IID' column
test_size=0.3,
random_state=123
)
Note:
The function automatically handles type mismatches between sample IDs (e.g., strings vs integers) by converting both to strings for comparison, then mapping back to original types for proper indexing. This ensures samples are matched correctly even if genotype IDs are strings (‘11’) and phenotype IDs are integers (11).
Tip:
For simplest usage, ensure both DataFrames have matching sample IDs as their index:
# Prepare data with matching indices
pheno = pheno.set_index('IID')
# Then use without specifying ID columns
train_g, test_g, train_p, test_p = stratified_train_test_split(
geno, pheno, 'disease'
)
filter_genotype_data()
Comprehensive genotype QC filtering (all filters optional).
# MAF filter only
geno_qc = filter_genotype_data(geno, min_maf=0.01)
# MAF + missing rate
geno_qc = filter_genotype_data(
geno,
min_maf=0.01,
max_missing_per_variant=0.1
)
# All filters (returns genotype + phenotype)
geno_qc, pheno_qc = filter_genotype_data(
geno, pheno,
min_maf=0.01,
max_missing_per_variant=0.05,
min_call_rate_per_sample=0.95
)
Parameters:
min_maf: Minimum MAF (e.g., 0.01 = 1%)max_missing_per_variant: Max missing rate per variant (e.g., 0.1 = 10%)min_call_rate_per_sample: Min call rate per sample (e.g., 0.95 = 95%)
Filter order: MAF → Variant missing → Sample call rate
Filter variants by minor allele frequency.
from edge_gwas.utils import filter_variants_by_maf
filtered_geno = filter_variants_by_maf(
genotype_df,
min_maf=0.01,
verbose=True
)
Parameters:
genotype_df(pandas.DataFrame): Genotype data (works with both hard calls and dosages)min_maf(float): Minimum MAF thresholdverbose(bool): Print filtering information
Returns:
pandas.DataFrame: Filtered genotype data
Filter variants by missingness rate.
from edge_gwas.utils import filter_variants_by_missing
filtered_geno = filter_variants_by_missing(
genotype_df,
max_missing=0.05,
verbose=True
)
Parameters:
genotype_df(pandas.DataFrame): Genotype datamax_missing(float): Maximum missingness rate (0-1)verbose(bool): Print filtering information
Returns:
pandas.DataFrame: Filtered genotype data
NEW in v0.1.1 - Filter samples by genotype call rate.
from edge_gwas.utils import filter_samples_by_call_rate
filtered_geno, filtered_pheno = filter_samples_by_call_rate(
genotype_df,
phenotype_df,
min_call_rate=0.95,
verbose=True
)
Parameters:
genotype_df(pandas.DataFrame): Genotype dataphenotype_df(pandas.DataFrame): Phenotype datamin_call_rate(float): Minimum call rate (proportion of non-missing genotypes)verbose(bool): Print filtering information
Returns:
tuple: (filtered_genotype_df, filtered_phenotype_df)
Example:
# Remove samples with >5% missing genotypes
geno, pheno = filter_samples_by_call_rate(geno, pheno, min_call_rate=0.95)
NEW in v0.1.1 - Filter variants by Hardy-Weinberg Equilibrium p-value.
from edge_gwas.utils import filter_variants_by_hwe
filtered_geno = filter_variants_by_hwe(
genotype_df,
hwe_threshold=1e-6,
verbose=True
)
Parameters:
genotype_df(pandas.DataFrame): Genotype DataFrame (samples × variants)hwe_threshold(float): Minimum HWE p-value threshold (default: 1e-6)verbose(bool): Print filtering information (default: True)
Returns:
pandas.DataFrame: Filtered genotype DataFrame
Note:
Variants with HWE p-value < threshold are removed
Variants with insufficient data for HWE test (NaN) are kept
Common thresholds: 1e-6 (stringent), 1e-4 (moderate), 1e-3 (lenient)
Example:
# Filter variants deviating from HWE
geno = filter_variants_by_hwe(geno, hwe_threshold=1e-6)
# More lenient threshold
geno = filter_variants_by_hwe(geno, hwe_threshold=1e-4)
# Complete QC workflow
from edge_gwas.utils import (
filter_variants_by_maf,
filter_variants_by_missing,
filter_variants_by_hwe
)
# Apply multiple filters
geno = filter_variants_by_maf(geno, min_maf=0.01)
geno = filter_variants_by_missing(geno, max_missing=0.05)
geno = filter_variants_by_hwe(geno, hwe_threshold=1e-6)
print(f"Variants after QC: {geno.shape[1]}")
NEW in v0.1.1 - Check case/control balance in binary outcome.
from edge_gwas.utils import check_case_control_balance
balance = check_case_control_balance(
phenotype_df,
outcome_col='disease',
verbose=True
)
Parameters:
phenotype_df(pandas.DataFrame): Phenotype dataoutcome_col(str): Name of outcome columnverbose(bool): Print balance information
Returns:
dict: Dictionary with keys:case_count(int): Number of casescontrol_count(int): Number of controlsratio(float): Case/control ratio
Example:
balance = check_case_control_balance(pheno, 'disease')
print(f"Cases: {balance['case_count']}, Controls: {balance['control_count']}")
print(f"Ratio: {balance['ratio']:.2f}")
NEW in v0.1.1 - Calculate Hardy-Weinberg Equilibrium p-values for each variant.
from edge_gwas.utils import calculate_hwe_pvalues
hwe_pvals = calculate_hwe_pvalues(
genotype_df,
verbose=True
)
Parameters:
genotype_df(pandas.DataFrame): Genotype DataFrame (samples × variants)verbose(bool): Print calculation information (default: True)
Returns:
pandas.Series: HWE p-values for each variant
Note:
Only uses hard calls (0, 1, 2) for HWE calculation
Requires minimum 10 samples per variant
Uses chi-square test with minimum expected count of 5
Variants with insufficient data return NaN
Example:
# Calculate HWE p-values
hwe_pvals = calculate_hwe_pvalues(genotype_df)
# Find variants deviating from HWE
hwe_violations = hwe_pvals[hwe_pvals < 1e-6]
print(f"Variants violating HWE (p < 1e-6): {len(hwe_violations)}")
# Summary statistics
print(f"Mean HWE p-value: {hwe_pvals.mean():.4f}")
print(f"Median HWE p-value: {hwe_pvals.median():.4f}")
Impute missing values in covariates using various methods.
from edge_gwas.utils import impute_covariates
pheno_imputed = impute_covariates(
phenotype_df,
covariate_cols=['age', 'bmi', 'PC1', 'PC2'],
method='median',
drop_na=False,
verbose=True
)
Parameters:
phenotype_df(pandas.DataFrame): Phenotype DataFrame with covariatescovariate_cols(list): List of covariate column names to imputemethod(str): Imputation method (default: ‘median’). Options:'drop': Remove all rows with any missing values'mean': Replace with mean (numeric only)'median': Replace with median (numeric only)'mode': Replace with mode (works for categorical)'knn': K-Nearest Neighbors imputation (k=5)'missforest': MissForest algorithm (requires missingpy)'mice': Multiple Imputation by Chained Equations (requires missingpy)
drop_na(bool): If True, drop rows with any remaining missing values after imputation (default: False)verbose(bool): Print imputation information (default: True)
Returns:
pandas.DataFrame: DataFrame with imputed covariates
Note:
For ‘missforest’ and ‘mice’, install missingpy:
pip install missingpy
Example:
# Simple median imputation
pheno = impute_covariates(
pheno,
covariate_cols=['age', 'bmi'],
method='median'
)
# KNN imputation for multiple covariates
pheno = impute_covariates(
pheno,
covariate_cols=['age', 'bmi', 'PC1', 'PC2', 'PC3'],
method='knn'
)
# Drop rows with missing values
pheno = impute_covariates(
pheno,
covariate_cols=['age', 'sex'],
method='drop'
)
validate_and_align_data()
Validate and align genotype and phenotype data by sample IDs.
geno_aligned, pheno_aligned = validate_and_align_data(
geno, pheno,
outcome_col='disease',
covariate_cols=['age', 'sex']
)
Parameters:
keep_only_common(bool): Keep only overlapping samples (default: True)outcome_col: Outcome column to validatecovariate_cols: Covariate columns to validate
validate_genotype_df()
Validate genotype DataFrame format and encoding.
# Basic validation
validate_genotype_df(geno)
# With encoding check
is_valid = validate_genotype_df(geno, variant_info, check_encoding=True)
# Detailed report
is_valid, report = validate_genotype_df(
geno, variant_info,
check_encoding=True,
return_details=True
)
Parameters:
genotype_df: Genotype DataFrame (samples × variants)variant_info_df: Optional variant info for encoding validationcheck_encoding(bool): Validate minor allele as ALTreturn_details(bool): Return detailed report
validate_and_fix_encoding()
Validate and automatically fix genotype encoding.
geno_fixed, info_fixed, report = validate_and_fix_encoding(geno, variant_info)
# Check what was fixed
print(report[report['was_fixed'] == True])
Returns: (fixed_genotype_df, fixed_variant_info_df, report_df)
Calculate genomic inflation factor (λ).
from edge_gwas.utils import calculate_genomic_inflation
lambda_gc = calculate_genomic_inflation(pvalues)
Parameters:
pvalues(pandas.Series or array): P-values from GWAS
Returns:
float: Genomic inflation factor (λ)
Merge GWAS results with alpha values.
from edge_gwas.utils import merge_alpha_with_gwas
merged_df = merge_alpha_with_gwas(gwas_df, alpha_df)
Parameters:
gwas_df(pandas.DataFrame): GWAS resultsalpha_df(pandas.DataFrame): Alpha values
Returns:
pandas.DataFrame: Merged results with both GWAS and alpha information
NEW in v0.1.1 - Perform standard additive GWAS for comparison with EDGE.
from edge_gwas.utils import additive_gwas
results_df = additive_gwas(
genotype_df,
phenotype_df,
outcome='disease',
covariates=['age', 'sex', 'PC1', 'PC2'],
outcome_type='binary'
)
Parameters:
genotype_df(pandas.DataFrame): Genotype dataphenotype_df(pandas.DataFrame): Phenotype dataoutcome(str): Name of outcome columncovariates(list): List of covariate column namesoutcome_type(str):'binary'for logistic regression,'continuous'for linear regression
Returns:
pandas.DataFrame: Results with columns:variant_id: SNP identifiercoef: Effect coefficientstd_err: Standard errorpval: P-value
Example:
# Binary outcome (logistic regression)
additive_results = additive_gwas(
geno, pheno, 'disease',
['age', 'sex'],
outcome_type='binary'
)
# Continuous outcome (linear regression)
additive_results = additive_gwas(
geno, pheno, 'BMI',
['age', 'sex'],
outcome_type='continuous'
)
NEW in v0.1.1 - Perform k-fold cross-validation for EDGE analysis.
from edge_gwas.utils import cross_validated_edge_analysis
avg_alpha, meta_gwas, all_alpha, all_gwas = cross_validated_edge_analysis(
genotype_df,
phenotype_df,
outcome='disease',
covariates=['age', 'sex', 'PC1', 'PC2'],
outcome_type='binary',
n_folds=5,
n_jobs=8,
random_state=42
)
Parameters:
genotype_df(pandas.DataFrame): Genotype dataphenotype_df(pandas.DataFrame): Phenotype dataoutcome(str): Name of outcome columncovariates(list): List of covariate column namesoutcome_type(str):'binary'or'continuous'n_folds(int): Number of cross-validation foldsn_jobs(int): Number of parallel jobs for EDGE analysisrandom_state(int): Random seed for reproducibility
Returns:
tuple: (avg_alpha, meta_gwas_df, combined_alpha, combined_gwas)avg_alpha: Averaged alpha values across foldsmeta_gwas_df: Meta-analyzed GWAS results (Fisher’s method)combined_alpha: All alpha values from all foldscombined_gwas: All GWAS results from all folds
Example:
# Run 5-fold cross-validation
avg_alpha, meta_gwas, all_alpha, all_gwas = cross_validated_edge_analysis(
genotype_df=geno,
phenotype_df=pheno,
outcome='disease',
covariates=['age', 'sex'] + [f'PC{i}' for i in range(1, 11)],
outcome_type='binary',
n_folds=5
)
# Check alpha stability
print(f"Mean alpha std: {avg_alpha['alpha_std'].mean():.3f}")
# Find unstable variants
unstable = avg_alpha[avg_alpha['alpha_std'] > 0.3]
print(f"Unstable variants: {len(unstable)}")
Visualization Module
Create Manhattan plot of GWAS results.
from edge_gwas.visualize import manhattan_plot
manhattan_plot(
gwas_df,
output='manhattan.png',
title='EDGE GWAS Manhattan Plot',
sig_threshold=5e-8,
suggestive_threshold=1e-5,
figsize=(14, 6),
colors=['#1f77b4', '#ff7f0e']
)
Parameters:
gwas_df(pandas.DataFrame): GWAS results with columns: chr, pos, pvaloutput(str): Output filenametitle(str): Plot titlesig_threshold(float): Genome-wide significance thresholdsuggestive_threshold(float): Suggestive significance thresholdfigsize(tuple): Figure size (width, height) in inchescolors(list): Colors for alternating chromosomes
Returns:
None (saves plot to file)
Create QQ plot and calculate genomic inflation factor.
from edge_gwas.visualize import qq_plot
lambda_gc = qq_plot(
gwas_df,
output='qq_plot.png',
title='QQ Plot',
figsize=(8, 8)
)
Parameters:
gwas_df(pandas.DataFrame): GWAS results with pval columnoutput(str): Output filenametitle(str): Plot titlefigsize(tuple): Figure size (width, height) in inches
Returns:
float: Genomic inflation factor (λ)
Plot distribution of alpha values.
from edge_gwas.visualize import plot_alpha_distribution
plot_alpha_distribution(
alpha_df,
output='alpha_distribution.png',
bins=50,
figsize=(10, 6),
title='Alpha Value Distribution'
)
Parameters:
alpha_df(pandas.DataFrame): Alpha values with alpha_value columnoutput(str): Output filenamebins(int): Number of histogram binsfigsize(tuple): Figure size (width, height)title(str): Plot title
Returns:
None (saves plot to file)
I/O Handlers Module
NEW in v0.1.1 - Download test data files from GitHub repository for tutorials and testing.
from edge_gwas import download_test_files
results = download_test_files(
output_dir='tests',
version='v0.1.1',
overwrite=False,
verbose=True
)
Parameters:
output_dir(str): Directory to save test files (default: ‘tests’)version(str): GitHub release version tag to download from (default: ‘v0.1.1’)overwrite(bool): If True, overwrite existing files (default: False)verbose(bool): Print download progress (default: True)
Returns:
dict: Dictionary with download results containing:downloaded(list): List of successfully downloaded filesskipped(list): List of skipped (already existing) filesfailed(list): List of failed downloads
Downloaded Files:
test.bed: PLINK binary genotype file (3,925 samples × 1,000 variants)test.bim: PLINK variant information filetest.fam: PLINK sample information filetest.phen: Phenotype file with disease status and age
Examples:
# Example 1: Basic usage - download to default 'tests' directory
from edge_gwas import download_test_files
download_test_files()
# Files saved to: tests/test.bed, tests/test.bim, tests/test.fam, tests/test.phen
# Example 2: Download to custom directory
download_test_files(output_dir='data/examples')
# Example 3: Force re-download (overwrite existing files)
results = download_test_files(overwrite=True)
print(f"Downloaded: {results['downloaded']}")
print(f"Skipped: {results['skipped']}")
# Example 4: Download specific version
download_test_files(version='main') # Download from main branch
# Example 5: Verify download results
results = download_test_files()
if results['failed']:
print(f"Warning: Failed to download {results['failed']}")
else:
print("✓ All test files ready!")
Complete Tutorial Workflow:
from edge_gwas import (
download_test_files,
load_plink_data,
EDGEAnalysis
)
import pandas as pd
# Step 1: Download test data
download_test_files()
# Step 2: Load test data
geno, info = load_plink_data('tests/test.bed', 'tests/test.bim', 'tests/test.fam')
pheno = pd.read_csv('tests/test.phen', sep=' ')
# Step 3: Prepare data
pheno = pheno.set_index('IID')
# Step 4: Split and analyze
from edge_gwas.utils import stratified_train_test_split
train_g, test_g, train_p, test_p = stratified_train_test_split(
geno, pheno, 'disease', test_size=0.5
)
# Step 5: Run EDGE analysis
edge = EDGEAnalysis(outcome_type='binary')
alpha_df, gwas_df = edge.run_full_analysis(
train_g, train_p, test_g, test_p,
outcome_col='disease',
covariate_cols=['age']
)
print(f"✓ Analysis complete! Found {len(gwas_df)} variants")
File Verification:
import os
# Download files
results = download_test_files()
# Verify file sizes and existence
test_files = {
'tests/test.bed': 'Genotype data (binary)',
'tests/test.bim': 'Variant information',
'tests/test.fam': 'Sample information',
'tests/test.phen': 'Phenotype data'
}
for filepath, description in test_files.items():
if os.path.exists(filepath):
size_kb = os.path.getsize(filepath) / 1024
print(f"✓ {filepath}")
print(f" {description} - {size_kb:.1f} KB")
else:
print(f"✗ {filepath} - MISSING!")
Re-downloading Files:
# Method 1: Use overwrite parameter
download_test_files(overwrite=True)
# Method 2: Delete directory first
import shutil
if os.path.exists('tests'):
shutil.rmtree('tests')
download_test_files()
# Method 3: Download to new location
download_test_files(output_dir='tests_fresh')
Troubleshooting:
If download fails:
# Check internet connection and retry
results = download_test_files(overwrite=True, verbose=True)
# If still failing, download manually:
# 1. Visit: https://github.com/nicenzhou/edge-gwas/tree/v0.1.1/tests
# 2. Download files: test.bed, test.bim, test.fam, test.phen
# 3. Place in 'tests' directory
Note:
Requires internet connection to download from GitHub
Files are downloaded from the specified GitHub release version
Total download size: ~1-2 MB
Test data contains simulated genotypes for demonstration purposes only
See Also:
Quick Start Guide - Quick start tutorial using test data
Tutorials - Detailed tutorials with test data examples
load_plink_data()- Load the downloaded PLINK filesprepare_phenotype_data()- Load the downloaded phenotype file
Save GWAS and alpha results to files.
from edge_gwas.io_handlers import save_results
output_files = save_results(
gwas_df,
alpha_df=None,
output_prefix='edge_gwas',
save_alpha=True
)
Parameters:
gwas_df(pandas.DataFrame): GWAS resultsalpha_df(pandas.DataFrame, optional): Alpha valuesoutput_prefix(str): Prefix for output filessave_alpha(bool): Whether to save alpha values
Returns:
dict: Dictionary with keys ‘gwas’ and ‘alpha’ containing file paths
Load pre-calculated alpha values from file.
from edge_gwas.io_handlers import load_alpha_values
alpha_df = load_alpha_values('alpha_values.txt')
Parameters:
filename(str): Path to alpha values file
Returns:
pandas.DataFrame: Alpha values
Generate text summary of GWAS results.
from edge_gwas.io_handlers import create_summary_report
report = create_summary_report(
gwas_df,
alpha_df=None,
significance_threshold=5e-8,
output_file='summary.txt'
)
Parameters:
gwas_df(pandas.DataFrame): GWAS resultsalpha_df(pandas.DataFrame, optional): Alpha valuessignificance_threshold(float): P-value threshold for significanceoutput_file(str, optional): Output file path
Returns:
str: Summary report text
Data Structures
Expected format:
Index: Sample IDs (as strings)
Columns: Variant IDs
Values: Genotype encoding (0, 1, 2, or NaN for missing)
0 = Reference homozygote (REF/REF)
1 = Heterozygote (REF/ALT)
2 = Alternate homozygote (ALT/ALT)
NaN = Missing
For dosages: continuous values 0-2
Example:
# rs123 rs456 rs789
# Sample1 0 1 2
# Sample2 1 2 0
# Sample3 2 0 1
Expected format:
Index: Sample IDs (as strings, matching genotype data)
Columns: Outcome and covariate columns
Example:
# disease age sex PC1 PC2
# Sample1 1 45 0 0.1 -0.2
# Sample2 0 38 1 -0.3 0.4
# Sample3 1 52 0 0.2 0.1
Expected format:
Columns: variant_id, chr, pos, ref, alt, etc.
Example:
# variant_id chr pos ref alt
# 0 rs123 1 1234567 A G
# 1 rs456 1 2345678 C T
# 2 rs789 2 3456789 G A
Output format from calculate_alpha():
# variant_id alpha_value eaf coef_het coef_hom
# 0 rs123 0.45 0.235 0.23 0.51
# 1 rs456 0.52 0.412 0.31 0.60
# 2 rs789 0.38 0.189 0.19 0.50
Output format from apply_alpha():
# variant_id chr pos pval coef std_err stat alpha_value
# 0 rs123 1 1234567 1.23e-08 0.0451 0.0089 5.07 0.45
# 1 rs456 1 2345678 3.45e-06 0.0312 0.0067 4.66 0.52
# 2 rs789 2 3456789 5.67e-10 0.0623 0.0095 6.56 0.38
Output format from PCA functions:
# PC1 PC2 PC3 PC4 PC5
# IID
# Sample1 0.12 -0.23 0.45 -0.12 0.34
# Sample2 -0.34 0.56 -0.12 0.23 -0.45
# Sample3 0.23 -0.12 0.34 -0.45 0.12
Output format from load_grm_gcta():
# FID IID
# 0 FAM001 Sample1
# 1 FAM001 Sample2
# 2 FAM002 Sample3
Constants and Defaults
# Significance thresholds
GWAS_THRESHOLD = 5e-8 # Genome-wide significance
SUGGESTIVE_THRESHOLD = 1e-5 # Suggestive threshold
# Quality control defaults
DEFAULT_MAF = 0.01 # Minimum MAF
DEFAULT_MISSING = 0.05 # Maximum missingness
DEFAULT_HWE = 1e-6 # HWE p-value threshold
DEFAULT_CALL_RATE = 0.95 # Minimum sample call rate
# Analysis defaults
DEFAULT_TEST_SIZE = 0.5 # Train/test split ratio
DEFAULT_N_JOBS = -1 # Use all CPU cores
DEFAULT_MAX_ITER = 1000 # Maximum iterations
# GRM defaults (NEW in v0.1.1)
DEFAULT_GRM_MAF = 0.01 # MAF for GRM calculation
DEFAULT_KIN_THRESHOLD = 0.0884 # Kinship threshold (3rd degree)
# PCA defaults (NEW in v0.1.1)
DEFAULT_N_PCS = 10 # Number of PCs
DEFAULT_LD_R2 = 0.2 # LD pruning threshold
DEFAULT_LD_WINDOW = 50 # LD window size (kb)
Exceptions
class EdgeGWASError(Exception):
"""Base exception for edge-gwas."""
pass
class DataFormatError(EdgeGWASError):
"""Raised when input data format is invalid."""
pass
class ConvergenceError(EdgeGWASError):
"""Raised when model fails to converge."""
pass
class MissingDataError(EdgeGWASError):
"""Raised when required data is missing."""
pass
class GRMError(EdgeGWASError):
"""Raised when GRM-related operations fail."""
pass
class ToolNotFoundError(EdgeGWASError):
"""Raised when external tool (PLINK2, GCTA, R) is not found."""
pass
Version Information
import edge_gwas
# Get version string
print(edge_gwas.__version__) # '0.1.1'
# Get author information
print(edge_gwas.__author__) # 'Your Name'
# Get license
print(edge_gwas.__license__) # 'GPL-3.0'
Command-Line Tools (NEW in v0.1.1)
Interactive installer for external tools (PLINK2, GCTA, R packages).
# Install all tools interactively
edge-gwas-install-tools
# The installer will:
# 1. Detect your operating system and architecture
# 2. Download and install PLINK2
# 3. Download and install GCTA
# 4. Install R packages (GENESIS, SNPRelate, gdsfmt)
# 5. Configure PATH if needed
Supported platforms:
Linux (x86_64)
macOS (Intel x86_64 and Apple Silicon ARM64)
Features:
Automatic architecture detection
SSL-safe downloads with retry logic
Verification of installed tools
PATH configuration guidance
Verify that external tools are properly installed.
# Check all tools
edge-gwas-check-tools
# Output example:
# ======================================================================
# EDGE-GWAS External Tools Check
# ======================================================================
#
# Python Packages:
# ----------------------------------------------------------------------
# ✓ numpy: Installed
# ✓ pandas: Installed
# ✓ scipy: Installed
# ✓ statsmodels: Installed
# ✓ sklearn: Installed
# ✓ matplotlib: Installed
# ✓ pandas_plink: Installed
#
# External Tools:
# ----------------------------------------------------------------------
# ✓ PLINK2: PLINK v2.00a3.7 64-bit Intel (24 Jan 2024)
# ✓ GCTA: GCTA 1.95.0
#
# R and Packages:
# ----------------------------------------------------------------------
# ✓ R: R version 4.3.0
# ✓ R package GENESIS: Installed
# ✓ R package SNPRelate: Installed
# ✓ R package gdsfmt: Installed
#
# ======================================================================
# ✓ All tools and packages are properly installed!
# ======================================================================
Function Reference Summary
Class: EDGEAnalysis
Method |
Purpose |
Key Input |
Output |
|---|---|---|---|
|
Calculate encoding parameters from training data |
Genotype, phenotype, covariates, GRM (optional) |
Alpha DataFrame |
|
Apply alpha values to test data for GWAS |
Genotype, phenotype, alpha values, GRM (optional) |
GWAS results DataFrame |
|
Complete two-stage workflow |
Train/test genotype & phenotype, GRM (optional) |
(alpha_df, gwas_df) |
|
Get list of SNPs that failed convergence |
None |
List of variant IDs |
Data Loading:
Function |
Purpose |
Requirements |
|---|---|---|
|
Load PLINK binary files (.bed/.bim/.fam) |
pandas-plink |
|
Load PLINK 2 files (.pgen/.pvar/.psam) |
pgenlib |
|
Load VCF/VCF.GZ files |
cyvcf2 |
|
Load BGEN files |
bgen-reader |
|
Load and prepare phenotype data |
- |
Data Quality Control:
Function |
Purpose |
Output |
|---|---|---|
|
Comprehensive QC (MAF, missing, call rate) |
Filtered data |
|
Apply standard GWAS filters (MAF≥0.01, missing≤0.05, call rate≥0.95) |
Filtered data |
|
Validate format and encoding |
bool or report |
|
Auto-fix minor allele encoding |
Fixed data + report |
|
Align genotype and phenotype by sample IDs |
Aligned data |
|
Print diagnostic information |
Console output |
Population Structure Control:
Function |
Purpose |
Requirements |
|---|---|---|
|
Calculate GRM using GCTA |
GCTA |
|
Load GRM from GCTA files |
- |
|
Calculate PCs using PLINK2 |
PLINK2 |
|
Merge PCs with phenotype |
- |
|
Find related sample pairs |
- |
|
Remove related samples |
- |
Data Processing:
Function |
Purpose |
|---|---|
|
Split data into train/test sets with stratification |
|
Impute missing covariate values |
Statistical Functions:
Function |
Purpose |
|---|---|
|
Calculate genomic inflation factor (λ) |
|
Standard additive GWAS for comparison |
|
K-fold cross-validation for EDGE |
Complete Workflow Example
from edge_gwas import EDGEAnalysis
from edge_gwas.utils import (
load_plink_data,
prepare_phenotype_data,
apply_standard_qc,
validate_and_align_data,
calculate_grm_gcta,
load_grm_gcta,
calculate_pca_plink,
attach_pcs_to_phenotype,
stratified_train_test_split
)
from edge_gwas.visualize import manhattan_plot, qq_plot
# 1. Load data
geno, info = load_plink_data('data.bed', 'data.bim', 'data.fam')
pheno = prepare_phenotype_data(
'pheno.txt',
outcome_col='disease',
covariate_cols=['age', 'sex']
)
# 2. QC and alignment
geno_qc, pheno_qc = apply_standard_qc(geno, pheno)
geno_qc, pheno_qc = validate_and_align_data(geno_qc, pheno_qc)
# 3. Population structure
grm_prefix = calculate_grm_gcta('data', output_prefix='grm')
grm_matrix, grm_ids = load_grm_gcta('grm')
pca = calculate_pca_plink('data', n_pcs=10)
pheno_qc = attach_pcs_to_phenotype(pheno_qc, pca, n_pcs=10)
# 4. Split data
train_g, test_g, train_p, test_p = stratified_train_test_split(
geno_qc, pheno_qc, outcome_col='disease', test_size=0.5
)
# 5. EDGE analysis
edge = EDGEAnalysis(outcome_type='binary')
covariates = ['age', 'sex'] + [f'PC{i}' for i in range(1, 11)]
alpha_df, gwas_df = edge.run_full_analysis(
train_g, train_p, test_g, test_p,
outcome='disease',
covariates=covariates,
grm_matrix=grm_matrix,
grm_sample_ids=grm_ids,
output_prefix='results/edge'
)
# 6. Visualize
manhattan_plot(gwas_df, output='manhattan.png')
qq_plot(gwas_df, output='qq.png')
Migration Guide (v0.1.0 → v0.1.1)
Key Changes:
New QC function - Use
filter_genotype_data()for comprehensive filtering:# New unified approach geno_qc, pheno_qc = filter_genotype_data( geno, pheno, min_maf=0.01, max_missing_per_variant=0.05, min_call_rate_per_sample=0.95 )
Data validation - New encoding validation and auto-fix:
# Validate and fix minor allele encoding geno_fixed, info_fixed, report = validate_and_fix_encoding(geno, info)
LocusZoom output - GWAS results now include LocusZoom-compatible columns:
gwas_df = edge.apply_alpha( test_g, test_p, outcome='disease', covariates=covariates, variant_info=info, split_variant_id=True, variant_id_pattern='chr:pos:ref:alt' ) # Save for LocusZoom edge.save_for_locuszoom(gwas_df, 'locuszoom_input.txt')
GRM support - Add population structure control:
grm_matrix, grm_ids = load_grm_gcta('grm_prefix') alpha_df = edge.calculate_alpha( ..., grm_matrix=grm_matrix, grm_sample_ids=grm_ids )
Function Index
By Category:
Core Analysis:
- EDGEAnalysis (class) → calculate_alpha(), apply_alpha(), run_full_analysis()
Data Loading:
- load_plink_data(), load_pgen_data(), load_vcf_data(), load_bgen_data(), prepare_phenotype_data()
Quality Control:
- filter_genotype_data(), apply_standard_qc(), validate_genotype_df(), validate_and_fix_encoding(), validate_and_align_data(), diagnose_genotype_data()
Population Structure:
- calculate_grm_gcta(), load_grm_gcta(), calculate_pca_plink(), attach_pcs_to_phenotype(), identify_related_samples(), filter_related_samples()
Statistical:
- calculate_genomic_inflation(), additive_gwas(), cross_validated_edge_analysis()
Visualization:
- manhattan_plot(), qq_plot(), plot_alpha_distribution()
Quick Function Finder
“I want to…”
Load genetic data:
PLINK binary (.bed/.bim/.fam):
load_plink_data()PLINK2 (.pgen/.pvar/.psam):
load_pgen_data()BGEN:
load_bgen_data()VCF:
load_vcf_data()
Control for population structure:
Calculate GRM:
calculate_grm_gcta()Load GRM:
load_grm_gcta()Calculate PCs (unrelated):
calculate_pca_plink()Calculate PCs (related):
calculate_pca_pcair()Add PCs to phenotype:
attach_pcs_to_phenotype()
Quality control:
Filter by MAF:
filter_variants_by_maf()Filter by missingness:
filter_variants_by_missing()Filter by HWE:
filter_variants_by_hwe()Filter samples:
filter_samples_by_call_rate()Check case/control balance:
check_case_control_balance()Find related samples:
identify_related_samples()Remove related samples:
filter_related_samples()
Run EDGE analysis:
Calculate alpha:
calculate_alpha()Apply alpha:
apply_alpha()Full workflow:
run_full_analysis()Cross-validation:
cross_validated_edge_analysis()
Compare with standard GWAS:
Run additive model:
additive_gwas()
Visualize results:
Manhattan plot:
manhattan_plot()QQ plot:
qq_plot()Alpha distribution:
plot_alpha_distribution()
Save/load results:
Save results:
save_results()Load alpha values:
load_alpha_values()Create summary:
create_summary_report()
See Also
Documentation:
Documentation Home - Home
Installation Guide - Installation instructions and requirements
Quick Start Guide - Getting started guide with simple examples
User Guide - Comprehensive user guide and tutorials
API Reference - Complete API documentation
Example Workflows - Example analyses and case studies
Visualization Guide - Plotting and visualization guide
Statistical Model - Statistical methods and mathematical background
Troubleshooting Guide - Troubleshooting guide and common issues
Frequently Asked Questions (FAQ) - Frequently asked questions
Citation - How to cite EDGE in publications
Changelog - Version history and release notes
Advanced Topics for Further Updates - Planned features and roadmap
—
Last updated: 2025-12-25 for edge-gwas v0.1.1
For questions or issues, visit: https://github.com/nicenzhou/edge-gwas/issues