User Guide
This guide covers essential aspects of using edge-gwas v0.1.1.
Overview
edge-gwas performs GWAS analysis with flexible genetic encoding to detect nonadditive SNP effects.
Key Features:
Two-stage analysis framework (training/test split)
Flexible encoding for nonadditive effects detection
Population structure control (PCA, GRM)
Multiple file format support (PLINK, PGEN, BGEN, VCF)
Outcome transformations for continuous traits
Built-in visualization and QC tools
EDGE Methodology
EDGE (Efficient Detection of Genetic Effects) uses a two-stage approach:
Training: Calculate encoding parameters (α) for each variant
Testing: Apply α values to perform GWAS
Alpha (α) Interpretation:
α ≈ 0: Recessive (only homozygotes affected)
α ≈ 0.5: Additive (heterozygotes intermediate)
α ≈ 1: Dominant (heterozygotes = homozygotes)
This flexible encoding detects effects that standard additive models miss.
Complete Workflow
1. Load Data
from edge_gwas.utils import load_plink_data, prepare_phenotype_data
# Load genotypes (multiple formats supported)
geno, info = load_plink_data('data.bed', 'data.bim', 'data.fam')
# Load phenotypes
pheno = prepare_phenotype_data(
'pheno.txt',
outcome_col='disease',
covariate_cols=['age', 'sex']
)
2. Quality Control
from edge_gwas.utils import (
filter_variants_by_maf,
filter_variants_by_hwe,
filter_samples_by_call_rate
)
# Apply QC filters
geno = filter_variants_by_maf(geno, min_maf=0.01)
geno = filter_variants_by_hwe(geno, hwe_threshold=1e-6)
geno, pheno = filter_samples_by_call_rate(geno, pheno, min_call_rate=0.95)
Recommended Thresholds:
MAF > 1%
Missingness < 5%
HWE p > 1e-6
Sample call rate > 95%
3. Population Structure Control (NEW in v0.1.1)
Option A: PCA (for unrelated samples):
from edge_gwas.utils import calculate_pca_plink, attach_pcs_to_phenotype
pca_df = calculate_pca_plink('data', n_pcs=10)
pheno = attach_pcs_to_phenotype(pheno, pca_df, n_pcs=10)
Option B: GRM (for related samples):
from edge_gwas.utils import calculate_grm_gcta, load_grm_gcta
grm_prefix = calculate_grm_gcta('data')
grm_matrix, grm_ids = load_grm_gcta(grm_prefix)
4. Split Data
from edge_gwas.utils import stratified_train_test_split
train_g, test_g, train_p, test_p = stratified_train_test_split(
geno, pheno, 'disease', test_size=0.5, random_state=42
)
5. Run EDGE Analysis
Basic Analysis:
from edge_gwas import EDGEAnalysis
from edge_gwas.utils import get_pc_covariate_list
edge = EDGEAnalysis(outcome_type='binary', n_jobs=-1)
covariates = ['age', 'sex'] + get_pc_covariate_list(10)
alpha_df, gwas_df = edge.run_full_analysis(
train_g, train_p, test_g, test_p,
outcome='disease',
covariates=covariates
)
With GRM (NEW in v0.1.1):
alpha_df, gwas_df = edge.run_full_analysis(
train_g, train_p, test_g, test_p,
outcome='disease',
covariates=covariates,
grm_matrix=grm_matrix,
grm_sample_ids=grm_ids
)
Continuous Outcomes with Transformation (NEW in v0.1.1):
edge = EDGEAnalysis(
outcome_type='continuous',
outcome_transform='rank_inverse_normal'
)
alpha_df, gwas_df = edge.run_full_analysis(...)
6. Interpret Results
# Significant hits
sig = gwas_df[gwas_df['pval'] < 5e-8]
print(f"Significant variants: {len(sig)}")
# Check genomic inflation
from edge_gwas.utils import calculate_genomic_inflation
lambda_gc = calculate_genomic_inflation(gwas_df['pval'])
print(f"Lambda GC: {lambda_gc:.3f}") # Should be 0.95-1.05
# Alpha distribution
print(f"Mean alpha: {alpha_df['alpha_value'].mean():.3f}")
Significance Thresholds:
p < 5×10⁻⁸: Genome-wide significant
p < 1×10⁻⁵: Suggestive
λ = 0.95-1.05: Good control for population structure
7. Visualize
from edge_gwas.visualize import manhattan_plot, qq_plot
manhattan_plot(gwas_df, 'manhattan.png')
lambda_gc = qq_plot(gwas_df, 'qq.png')
File Formats
Supported Genotype Formats:
PLINK:
.bed/.bim/.fam→load_plink_data()PLINK2:
.pgen/.pvar/.psam→load_pgen_data()BGEN:
.bgen→load_bgen_data()VCF:
.vcf/.vcf.gz→load_vcf_data()
Phenotype Format:
Tab-delimited text with columns:
Sample ID (IID)
Outcome variable
Covariates
Example:
IID disease age sex
SAMPLE1 1 45 0
SAMPLE2 0 38 1
SAMPLE3 1 52 0
Best Practices
Quality Control
Variant QC: MAF > 0.01, missingness < 5%, HWE p > 1e-6
Sample QC: Call rate > 95%, check for relatedness
Population Structure: Always include ≥10 PCs as covariates
Genomic Inflation: λ should be 0.95-1.05
Covariates
Always include:
Age, sex, genotyping batch
10-20 principal components
For biobank data:
Study center, assessment date
20+ PCs if diverse ancestry
Data Splitting
50/50 split recommended for balanced power
Stratify binary outcomes to maintain case/control balance
Same random seed for reproducibility
Performance
Parallel processing: Set
n_jobs=-1to use all coresLarge datasets: Process chromosomes separately
Memory: Use binary formats (PLINK, PGEN, BGEN)
Advanced Features (v0.1.1)
Cross-Validation
from edge_gwas.utils import cross_validated_edge_analysis
avg_alpha, meta_gwas, all_alpha, all_gwas = cross_validated_edge_analysis(
geno, pheno, 'disease', covariates,
outcome_type='binary', n_folds=5
)
Compare with Additive GWAS
from edge_gwas.utils import additive_gwas
add_results = additive_gwas(test_g, test_p, 'disease', covariates, 'binary')
# Compare p-values
comparison = gwas_df.merge(add_results, on='variant_id', suffixes=('_edge', '_add'))
Population Structure Options
Scenario |
Method |
Command |
|---|---|---|
Unrelated samples |
Standard PCA |
|
Related samples |
PC-AiR |
|
Biobank data |
GRM + PC-AiR |
|
Outcome Transformations
For non-normal continuous traits:
Transformation |
When to Use |
|---|---|
|
Right-skewed data (e.g., gene expression) |
|
Right-skewed with large range |
|
Parametric normalization |
|
Robust to outliers (recommended) |
Troubleshooting
Common Issues
High genomic inflation (λ > 1.05)
Add more PCs (try 15-20 instead of 10)
Use GRM for related samples
Check for batch effects
“No common samples found”
Ensure all IDs are strings:
df.index = df.index.astype(str)Check ID column names match
Memory errors
Process chromosomes separately
Reduce
n_jobsUse sparse GRM:
method='grm-sparse'
Convergence warnings
Increase
max_iterCheck for rare variants (filter MAF > 0.05)
View skipped SNPs:
edge.get_skipped_snps()
Quick Reference
Import commonly used functions:
from edge_gwas import EDGEAnalysis
from edge_gwas.utils import (
load_plink_data, prepare_phenotype_data,
filter_variants_by_maf, filter_variants_by_hwe,
calculate_pca_plink, attach_pcs_to_phenotype,
get_pc_covariate_list, stratified_train_test_split,
calculate_grm_gcta, load_grm_gcta
)
from edge_gwas.visualize import manhattan_plot, qq_plot
Typical workflow:
# 1. Load → 2. QC → 3. PCA → 4. Split → 5. EDGE → 6. Visualize
geno, info = load_plink_data(...)
geno = filter_variants_by_maf(geno, 0.01)
pca_df = calculate_pca_plink('data', 10)
train_g, test_g, train_p, test_p = stratified_train_test_split(...)
alpha_df, gwas_df = edge.run_full_analysis(...)
manhattan_plot(gwas_df, 'results.png')
See Also
Documentation:
Documentation Home - Home
Installation Guide - Installation instructions and requirements
Quick Start Guide - Getting started guide with simple examples
User Guide - Comprehensive user guide and tutorials
API Reference - Complete API documentation
Example Workflows - Example analyses and case studies
Visualization Guide - Plotting and visualization guide
Statistical Model - Statistical methods and mathematical background
Troubleshooting Guide - Troubleshooting guide and common issues
Frequently Asked Questions (FAQ) - Frequently asked questions
Citation - How to cite EDGE in publications
Changelog - Version history and release notes
Advanced Topics for Further Updates - Planned features and roadmap
—
Last updated: 2025-12-25 for edge-gwas v0.1.1
For questions or issues, visit: https://github.com/nicenzhou/edge-gwas/issues