Statistical Model
EDGE (Elastic Data-Driven Encoding) employs a flexible encoding approach to detect nonadditive genetic effects that are often missed by traditional additive GWAS models.
Overview
EDGE uses a two-stage analysis framework:
Training stage: Calculate encoding parameters (α) from training data
Test stage: Apply α values to test data for association testing
This approach allows detection of various inheritance patterns on a continuous scale, from recessive to dominant to over-dominant effects.
Key Innovation: Unlike standard GWAS that assumes additive inheritance (α = 0.5), EDGE estimates α directly from the data, enabling detection of nonadditive effects.
Regression Model
Stage 1: Codominant Model (Training Data)
EDGE separately estimates effects for heterozygous and homozygous alternate genotypes:
Equation 1: Codominant Regression Model
Where:
\(Y\) = phenotype/outcome
\(SNP_{Het}\) = indicator for heterozygous genotype (0 or 1)
\(SNP_{HA}\) = indicator for homozygous alternate genotype (0 or 1)
\(COV_i\) = covariates (age, sex, principal components, etc.)
\(\beta_0\) = intercept
\(\beta_{Het}\) = effect size for heterozygous genotype
\(\beta_{HA}\) = effect size for homozygous alternate genotype
\(\beta_{cov_i}\) = effect sizes for covariates
Genotype Encoding for Equation 1:
True Genotype |
\(SNP_{Het}\) |
\(SNP_{HA}\) |
Interpretation |
|---|---|---|---|
0/0 (Ref/Ref) |
0 |
0 |
Reference homozygote |
0/1 (Ref/Alt) |
1 |
0 |
Heterozygote |
1/1 (Alt/Alt) |
0 |
1 |
Alternate homozygote |
Encoding Parameter
Equation 2: Alpha Calculation
Where:
\(\alpha\) = encoding parameter representing the ratio of heterozygous to homozygous alternate effects
\(\beta_{Het}\) = effect size for heterozygous genotype (from Equation 1)
\(\beta_{HA}\) = effect size for homozygous alternate genotype (from Equation 1)
Special Cases:
If \(|\beta_{HA}| < 10^{-10}\), α is undefined (SNP skipped)
If model fails to converge, SNP is skipped
Stage 2: EDGE-Encoded Model (Test Data)
Equation 3: EDGE-Encoded Genotypes
Equation 4: Association Testing Model
Where:
\(G_{EDGE}\) = EDGE-encoded genotype (from Equation 3)
\(\beta\) = effect coefficient being tested
Null hypothesis: \(H_0: \beta = 0\)
Interpretation of Alpha
The α parameter indicates the inheritance pattern:
Alpha Range |
Inheritance Model |
Biological Interpretation |
|---|---|---|
\(\alpha < 0\) |
Under-recessive |
Heterozygotes have opposite effect direction |
\(\alpha \approx 0\) |
Recessive |
Heterozygotes similar to reference homozygotes |
\(0 < \alpha < 0.3\) |
Partially recessive |
Heterozygotes closer to reference |
\(\alpha \approx 0.5\) |
Additive |
Heterozygotes have half the effect (standard GWAS) |
\(0.7 < \alpha < 1.3\) |
Partially dominant |
Heterozygotes closer to alternate homozygotes |
\(\alpha \approx 1\) |
Dominant |
Heterozygotes similar to alternate homozygotes |
\(\alpha > 1.3\) |
Over-dominant |
Heterozygotes have stronger effect than both homozygotes |
Example Interpretations:
\(\alpha = 0\): Recessive disease allele (e.g., cystic fibrosis)
\(\alpha = 0.5\): Additive effect (standard GWAS assumption)
\(\alpha = 1\): Dominant disease allele (e.g., Huntington’s)
\(\alpha = 2\): Heterozygote advantage (e.g., sickle cell trait)
Two-Stage Analysis
Why Two Stages?
Prevents Overfitting:
Using same data to estimate α and test association inflates false positives
Two-stage approach provides unbiased p-values
Independent test set ensures valid statistical inference
Balances Power and Bias:
50/50 split recommended for optimal balance
Larger training set: more stable α estimates
Larger test set: more powerful association testing
Stage 1: Calculate Alpha (Training Data)
For each SNP in the training dataset:
Fit codominant model (Equation 1) using:
Logistic regression for binary outcomes (BFGS optimization)
Linear regression for continuous outcomes (configurable optimization method)
Extract coefficients \(\beta_{Het}\) and \(\beta_{HA}\)
Calculate \(\alpha = \beta_{Het} / \beta_{HA}\)
Store α value for this SNP
Check convergence:
If model converges: save α
If model fails: skip SNP and record in log
Statistical Software:
Uses
statsmodelspackage in PythonOptimization algorithms (see Optimization Methods section)
Maximum iterations: configurable (default: 1000)
Stage 2: Apply Alpha (Test Data)
For each SNP in the test dataset:
Retrieve α value calculated from training data
Encode genotypes using Equation 3:
0/0 → 0
0/1 → α
1/1 → 1
Fit association model (Equation 4)
Test \(H_0: \beta = 0\) using Wald test
Extract p-value, effect size, standard error
Output:
Variant ID
Chromosome and position
P-value
Effect coefficient (\(\beta\))
Standard error
Test statistic
Applied α value
Outcome Types
Binary Outcomes (Logistic Regression)
For case-control studies (e.g., disease status):
Training Stage:
Test Stage:
Where \(P(Y=1)\) is the probability of being a case.
Effect Interpretation:
\(\beta > 0\): Risk allele (increases disease odds)
\(\beta < 0\): Protective allele (decreases disease odds)
Odds ratio: \(OR = e^\beta\)
Optimization:
Uses BFGS algorithm by default
Suitable for non-convex optimization problems
Handles perfect separation robustly
Quantitative Outcomes (Linear Regression)
For continuous traits (e.g., BMI, blood pressure):
Training Stage:
Test Stage:
Where \(\epsilon \sim N(0, \sigma^2)\) is the error term.
Effect Interpretation:
\(\beta\) = change in trait per unit change in \(G_{EDGE}\)
Units depend on trait (e.g., kg for weight, mmHg for blood pressure)
Optimization:
Configurable optimization method (default: BFGS)
See Optimization Methods section for available algorithms
Optimization Methods
For continuous outcomes (linear regression), users can select from multiple optimization algorithms:
Available Algorithms
Method |
Description |
Best Use Case |
|---|---|---|
|
Broyden-Fletcher-Goldfarb-Shannon (default) |
General purpose, good balance of speed and accuracy |
|
Newton-Raphson |
Fast convergence when close to optimum |
|
Limited-memory BFGS |
Large datasets, memory-constrained systems |
|
Nelder-Mead |
Derivative-free, robust to noise |
|
Conjugate Gradient |
Large-scale problems, sparse matrices |
|
Newton Conjugate Gradient |
Large problems requiring second-order information |
|
Powell’s method |
Derivative-free, quadratic convergence |
|
Basin-hopping |
Global optimization, multiple local minima |
Algorithm Details
BFGS (Default):
Quasi-Newton method using approximation of Hessian matrix
Good balance between computational cost and convergence speed
Pros: Fast, memory-efficient, handles ill-conditioned problems
Cons: May struggle with very large datasets
Recommended for: Most GWAS applications
Newton-Raphson:
Uses exact Hessian matrix (second derivatives)
Pros: Quadratic convergence near optimum
Cons: Expensive to compute Hessian, requires good starting values
Recommended for: Small datasets, when high precision needed
L-BFGS:
Limited-memory version of BFGS
Pros: Handles large-scale problems, low memory footprint
Cons: Slightly slower convergence than BFGS
Recommended for: Biobank-scale data (>100,000 samples)
Nelder-Mead:
Simplex-based, derivative-free method
Pros: Robust, doesn’t require gradients
Cons: Slower convergence, not suitable for high dimensions
Recommended for: Debugging, noisy data
Conjugate Gradient:
Iterative method using conjugate directions
Pros: Memory-efficient, good for sparse problems
Cons: Slower than Newton methods
Recommended for: Very large covariate sets
Newton Conjugate Gradient:
Combines Newton method with CG for Hessian inversion
Pros: Handles large-scale problems efficiently
Cons: Requires gradient and Hessian-vector products
Recommended for: Problems with >1000 covariates
Powell’s Method:
Derivative-free, uses conjugate directions
Pros: Robust, no gradient calculation
Cons: Can be slow, may not converge
Recommended for: Non-smooth objective functions
Basin-hopping:
Global optimization combining local search with random jumps
Pros: Finds global optimum, robust to starting values
Cons: Very slow, many function evaluations
Recommended for: Suspected multiple local minima (rare in GWAS)
Usage Example
from edge_gwas import EDGEAnalysis
# Default: BFGS optimization
edge = EDGEAnalysis(
outcome_type='continuous',
ols_method='bfgs' # Default, can be omitted
)
# Use L-BFGS for large dataset
edge_large = EDGEAnalysis(
outcome_type='continuous',
ols_method='lbfgs'
)
# Use Newton for high precision
edge_precise = EDGEAnalysis(
outcome_type='continuous',
ols_method='newton',
max_iter=2000 # May need more iterations
)
# Calculate alpha values
alpha_df = edge.calculate_alpha(
train_genotype,
train_phenotype,
outcome='bmi',
covariates=['age', 'sex', 'pc1', 'pc2']
)
Choosing an Optimization Method
Decision Tree:
Start with default (BFGS)
Works well for >95% of cases
Good balance of speed and accuracy
If convergence issues occur:
Try
'newton'for better precisionTry
'nm'or'powell'for robustnessIncrease
max_iterparameter
For very large datasets (N > 100,000):
Use
'lbfgs'to reduce memory usageConsider
'cg'if memory is still an issue
For many covariates (P > 100):
Use
'ncg'for efficiencyConsider
'cg'as alternative
If suspect multiple solutions:
Use
'basinhopping'(very slow!)Check results for biological plausibility
Performance Comparison (approximate):
Method |
Speed |
Memory |
Accuracy |
Robustness |
|---|---|---|---|---|
BFGS |
Fast |
Medium |
High |
High |
Newton |
Very Fast |
High |
Very High |
Medium |
L-BFGS |
Fast |
Low |
High |
High |
Nelder-Mead |
Slow |
Low |
Medium |
Very High |
CG |
Medium |
Low |
Medium |
Medium |
NCG |
Medium |
Medium |
High |
Medium |
Powell |
Slow |
Low |
Medium |
High |
Basin-hopping |
Very Slow |
Medium |
Very High |
Very High |
Convergence Diagnostics
When a model fails to converge, the variant is automatically skipped and logged. You can:
Check skipped variants:
skipped = edge.get_skipped_snps() print(f"Number of skipped SNPs: {len(skipped)}")
Try different optimization method:
edge_robust = EDGEAnalysis( outcome_type='continuous', ols_method='nm', # More robust max_iter=5000 # More iterations )
Examine specific failing variants:
Check MAF (may be too low)
Check for perfect collinearity with covariates
Verify no data quality issues
Warning Signs:
>10% of SNPs fail to converge → Check data quality
Consistent failures for common variants → Try different method
Only rare variants fail → Expected, may increase MAF threshold
Outcome Transformations
For continuous outcomes, EDGE supports several transformations to improve normality and stabilize variance.
Available Transformations
Transformation |
Mathematical Form |
Use Case |
|---|---|---|
None (default) |
\(Y' = Y\) |
Normally distributed outcomes |
|
\(Y' = \ln(Y)\) |
Right-skewed, positive values only |
|
\(Y' = \log_{10}(Y)\) |
Right-skewed, prefer base-10 scale |
|
Parametric INT |
Approximately normal, reduces outliers |
|
Rank-based INT (RINT) |
Heavy-tailed, robust to outliers |
Log Transformation
Natural Log:
Log Base 10:
Requirements:
All values must be positive (\(Y > 0\))
Raises error if any \(Y \leq 0\)
Effect on interpretation:
\(\beta\) represents proportional change
\(e^\beta\) is multiplicative effect (for natural log)
Common for: income, biomarkers, gene expression
Example:
edge = EDGEAnalysis(
outcome_type='continuous',
outcome_transform='log'
)
# For outcome = triglycerides (right-skewed)
alpha_df = edge.calculate_alpha(
train_genotype,
train_phenotype,
outcome='triglycerides', # Must be > 0
covariates=['age', 'sex', 'bmi']
)
Inverse Normal Transformation (INT)
Parametric INT:
Where \(\Phi^{-1}\) is the inverse normal CDF.
Process:
Standardize Y using sample mean and SD
Calculate ranks
Convert ranks to quantiles
Apply inverse normal transformation
Properties:
Assumes underlying normal distribution
Less robust to outliers than RINT
Preserves relative ordering
Example:
edge = EDGEAnalysis(
outcome_type='continuous',
outcome_transform='inverse_normal'
)
Rank-Based Inverse Normal Transformation (RINT)
RINT with Blom’s Formula:
Alternative formulas:
Van der Waerden: \(\frac{rank(Y)}{n + 1}\)
Blom (default): \(\frac{rank(Y) - 3/8}{n + 1/4}\)
Tukey: \(\frac{rank(Y) - 1/3}{n + 1/3}\)
Advantages:
Extremely robust to outliers
No assumptions about original distribution
Handles ties appropriately (average rank method)
Forces exact normal distribution
Disadvantages:
Loses information about original scale
Interpretation in normalized units only
May reduce power for truly normal traits
When to use:
Heavy-tailed distributions
Presence of outliers
Unknown original distribution
Standard practice in many GWAS studies
Example:
edge = EDGEAnalysis(
outcome_type='continuous',
outcome_transform='rank_inverse_normal'
)
# For outcome with outliers (e.g., C-reactive protein)
alpha_df = edge.calculate_alpha(
train_genotype,
train_phenotype,
outcome='crp', # Often has outliers
covariates=['age', 'sex', 'bmi', 'smoking']
)
Choosing a Transformation
Decision Guide:
Check outcome distribution:
import matplotlib.pyplot as plt plt.hist(phenotype_df['outcome'], bins=50) plt.title('Outcome Distribution') plt.show()
Apply transformation based on distribution:
Normal: No transformation needed
Right-skewed (e.g., income, biomarkers):
'log'or'log10'Heavy-tailed with outliers:
'rank_inverse_normal'Moderate deviation from normal:
'inverse_normal'
Verify transformation effectiveness:
from scipy import stats # Original _, p_original = stats.shapiro(y_original) # After transformation _, p_transformed = stats.shapiro(y_transformed) print(f"Normality p-value: {p_original:.4f} → {p_transformed:.4f}")
Practical Recommendations:
Trait Type |
Recommended Transformation |
|---|---|
Anthropometric (height, weight) |
None or |
Lipids (cholesterol, triglycerides) |
|
Inflammatory markers (CRP, IL-6) |
|
Blood pressure |
None or |
Glucose, insulin |
|
Gene expression |
|
Counts (cell counts) |
|
Unknown distribution |
|
Effect Size Interpretation After Transformation
Log transformation:
\(\beta\) = log-fold change per EDGE unit
Original scale: multiply by \(e^\beta\) (natural log) or \(10^\beta\) (log10)
Inverse normal transformations:
\(\beta\) in standard deviation units
Not directly interpretable on original scale
Focus on p-values and relative rankings
Example interpretation:
# Log-transformed triglycerides
# beta = 0.05, alpha = 0.3
# Interpretation:
# - Each EDGE-encoded allele associated with exp(0.05) = 1.05x
# - 5% increase in triglycerides
# - Partially recessive (alpha = 0.3)
Reporting transformed results:
Always state transformation method used
Report effects in transformed units
Optionally back-transform for key findings
Note limitations of interpretation
Transformation Diagnostics
Check for invalid values:
# After transformation
print(f"NaN values: {y_transformed.isna().sum()}")
print(f"Inf values: {np.isinf(y_transformed).sum()}")
# Summary statistics
print(y_transformed.describe())
Common issues:
Log of negative/zero: Add small constant or use log1p
Perfect ties in RINT: Rare, but can occur
Extreme outliers: May still influence log transformation
Solutions:
# For log transformation with zeros
y_transformed = np.log1p(y) # log(1 + y)
# For negative values, shift to positive
y_shifted = y - y.min() + 1
y_transformed = np.log(y_shifted)
# Winsorize extreme outliers before transformation
from scipy.stats import mstats
y_winsorized = mstats.winsorize(y, limits=[0.01, 0.01])
Statistical Testing
Hypothesis Testing
Null Hypothesis:
Alternative Hypothesis:
Test Statistic:
Wald test statistic:
Where:
\(\hat{\beta}\) = estimated coefficient
\(SE(\hat{\beta})\) = standard error of the coefficient
P-value Calculation
Two-tailed test:
Where \(z_{obs}\) is the observed test statistic.
For logistic regression:
Test statistic follows standard normal distribution asymptotically
P-value calculated from normal distribution
For linear regression:
Can use t-distribution with \(n - p - 1\) degrees of freedom
Large samples: t-distribution ≈ normal distribution
Multiple Testing Correction
Genome-wide Significance:
This threshold corresponds to Bonferroni correction for ~1 million independent tests.
Suggestive Threshold:
Used for exploratory analysis or hypothesis generation.
Other Methods:
Bonferroni correction: \(\alpha_{corrected} = \frac{0.05}{n_{tests}}\)
False Discovery Rate (FDR): Controls proportion of false discoveries
Genomic control (λ): Adjusts for population stratification
Genomic Inflation Factor
Interpretation:
\(\lambda \approx 1.0\): No inflation (good)
\(\lambda > 1.05\): Possible population stratification or cryptic relatedness
\(\lambda < 0.95\): Possible over-correction
Remedies for inflation:
Add more principal components as covariates
Check for batch effects
Verify sample quality control
Consider mixed models for related individuals
Quality Control Considerations
Minor Allele Frequency (MAF)
Recommendation: MAF > 0.01 (1%)
Rationale:
Rare variants have unstable α estimates
Insufficient heterozygotes for reliable \(\beta_{Het}\) estimation
Higher chance of convergence failure
For rare variant analysis:
Use gene-based or region-based methods
Increase MAF threshold to 0.05 for small samples (<5,000)
Consider burden tests or SKAT
Hardy-Weinberg Equilibrium (HWE)
Recommendation: HWE p > 1 × 10⁻⁶ in controls
Calculation:
Where:
\(O_{Het}\) = observed heterozygote count
\(E_{Het} = 2 \times n \times p \times (1-p)\)
\(n\) = sample size
\(p\) = allele frequency
Deviation indicates:
Genotyping errors
Population stratification
Selection pressure
Copy number variation
Missingness
Per-variant recommendation: < 5% missing
Per-sample recommendation: < 5% missing
Impact on EDGE:
Missing genotypes excluded from model fitting
Reduces effective sample size
May bias α estimates if missingness is non-random
Population Structure
Recommendation: Include 10+ principal components (PCs) as covariates
Why PCs matter:
Control for population stratification
Reduce confounding
Lower genomic inflation
Improve power by reducing variance
How many PCs:
Start with 10 PCs
Check \(\lambda\); if > 1.05, add more (up to 20)
Diminishing returns beyond 20 PCs
Too many PCs can reduce power
Sample Size Considerations
Minimum Sample Size
For binary outcomes:
Minimum: 500 cases + 500 controls
Recommended: 5,000+ cases and controls
Power depends on effect size, MAF, and α
For continuous outcomes:
Minimum: 1,000 samples
Recommended: 10,000+ samples
Power Calculations
Power to detect association depends on:
Sample size (larger = more power)
Effect size (larger = more power)
MAF (intermediate MAF = most power)
α value (closer to 0.5 = less advantage over additive)
Significance threshold (more stringent = less power)
EDGE has most advantage when:
True α ≠ 0.5 (nonadditive inheritance)
Sufficient heterozygotes (MAF 0.05-0.50)
Large sample size (N > 5,000)
True effect size is moderate to large
Comparison with Other Methods
EDGE vs. Standard Additive GWAS
Standard Additive Model:
Where \(G_{additive}\) is the allele count (0, 1, or 2).
Key Differences:
Aspect |
Additive GWAS |
EDGE |
|---|---|---|
Genetic model |
Fixed (α = 0.5) |
Data-driven (α estimated) |
Inheritance patterns |
Additive only |
All patterns (recessive to dominant) |
Power for additive effects |
Optimal |
Comparable |
Power for nonadditive effects |
Low |
High |
Computational cost |
Lower |
Moderate (2-stage) |
Interpretation |
Simple |
More informative (α values) |
Optimization |
Standard OLS/GLM |
Configurable algorithms |
Outcome transformations |
User must handle |
Built-in support |
EDGE vs. Genotypic Model
Genotypic Model:
Tests both \(\beta_{Het}\) and \(\beta_{HA}\) simultaneously using 2 df test.
Advantages of EDGE:
1 df test (more powerful with correct α)
Provides interpretable α parameter
Better for replication (apply α to new data)
Optimized computational methods
Disadvantages of EDGE:
Requires data splitting
Less powerful if α is unstable
EDGE vs. Cochran-Armitage Trend Test
Trend Test with weights:
Relationship to EDGE:
EDGE α is analogous to weight w
Standard additive: w = 0.5
Recessive: w = 0
Dominant: w = 1
EDGE advantages:
α estimated from data, not pre-specified
Two-stage design prevents overfitting
Flexible optimization for stability
Built-in outcome transformations
Practical Considerations
When to Use EDGE
EDGE is most beneficial when:
Suspected nonadditive inheritance (family studies suggest recessive/dominant)
Previous additive GWAS found few/no associations
Biological mechanism suggests specific inheritance pattern
Follow-up of candidate genes with known inheritance
Large sample size available (>5,000 samples)
Outcome is non-normal (use transformations)
Need interpretable inheritance patterns
Additive GWAS may be sufficient when:
Initial exploratory analysis
Small sample size (<1,000 samples)
Known additive effects from literature
Computational resources limited
Outcome is normally distributed
Recommended Workflow
Step 1: Prepare data and select parameters
from edge_gwas import EDGEAnalysis
from edge_gwas.utils import additive_gwas
# For continuous outcome with right-skewed distribution
edge = EDGEAnalysis(
outcome_type='continuous',
outcome_transform='log', # or 'rank_inverse_normal'
ols_method='bfgs', # default, suitable for most cases
max_iter=1000,
verbose=True
)
Step 2: Run both EDGE and additive GWAS
# EDGE analysis
alpha_df, edge_results = edge.run_full_analysis(
train_g, train_p, test_g, test_p,
outcome='triglycerides',
covariates=['age', 'sex', 'bmi', 'pc1', 'pc2', 'pc3'],
output_prefix='results/edge_triglycerides'
)
# Additive GWAS for comparison
additive_results = additive_gwas(
test_g, test_p,
outcome='triglycerides',
covariates=['age', 'sex', 'bmi', 'pc1', 'pc2', 'pc3'],
outcome_type='continuous',
outcome_transform='log'
)
Step 3: Compare results
# Identify SNPs significant in EDGE but not additive
edge_sig = edge_results[edge_results['pval'] < 5e-8]
add_sig = additive_results[additive_results['pval'] < 5e-8]
edge_only = edge_sig[~edge_sig['variant_id'].isin(add_sig['variant_id'])]
print(f"SNPs significant only in EDGE: {len(edge_only)}")
# Examine alpha values for significant SNPs
print("\nAlpha distribution for EDGE-specific findings:")
print(edge_only['alpha_value'].describe())
Step 4: Validate findings
Replicate in independent cohort
Check α consistency across cohorts
Perform functional follow-up
Check biological plausibility
Step 5: Handle convergence issues
# Check skipped SNPs
skipped = edge.get_skipped_snps()
print(f"Skipped {len(skipped)} SNPs")
# If many SNPs fail, try different method
if len(skipped) > 0.1 * len(test_g.columns):
edge_robust = EDGEAnalysis(
outcome_type='continuous',
outcome_transform='log',
ols_method='nm', # More robust
max_iter=5000
)
# Re-run analysis
alpha_df2, edge_results2 = edge_robust.run_full_analysis(
train_g, train_p, test_g, test_p,
outcome='triglycerides',
covariates=['age', 'sex', 'bmi', 'pc1', 'pc2', 'pc3']
)
Computational Complexity
Time Complexity:
Training stage: \(O(n_{train} \times m \times p^2 \times k)\)
Test stage: \(O(n_{test} \times m \times p^2)\)
Where:
\(n\) = sample size
\(m\) = number of variants
\(p\) = number of covariates
\(k\) = iterations for convergence (method-dependent)
Space Complexity:
\(O(n \times m)\) for genotype data
Additional \(O(m)\) for α values storage
Optimization Strategies:
Process chromosomes separately
Use parallel processing (multiple cores)
Filter variants before analysis
Use sparse matrix representations for rare variants
Choose efficient optimization method (BFGS or L-BFGS)
Method-specific performance:
Method |
Relative Speed |
Notes |
|---|---|---|
BFGS |
1.0x (baseline) |
Good default choice |
Newton |
0.8x (faster) |
When close to optimum |
L-BFGS |
1.1x |
Slightly slower but memory-efficient |
Nelder-Mead |
3-5x (slower) |
Robust but slow |
CG |
1.5x |
Good for sparse problems |
NCG |
1.3x |
Efficient for large covariate sets |
Powell |
4-6x (slower) |
Derivative-free |
Basin-hopping |
50-100x (much slower) |
Global optimization, rarely needed |
Implementation Details
Software Implementation
edge-gwas v0.1.1 uses:
statsmodels: Logistic and linear regression
scikit-learn: Data splitting, preprocessing
pandas: Data management
numpy/scipy: Numerical computations
joblib: Parallel processing
Numerical Stability:
Check for \(|\beta_{HA}| > 10^{-10}\) before computing α
Multiple optimization algorithms for robustness
Employ convergence tolerance of 1e-6 (configurable)
Maximum iterations: configurable (default: 1000)
Automatic handling of invalid transformation values
Convergence Issues
SNPs may fail to converge due to:
Rare variants: Too few observations for stable estimates
Perfect separation: In logistic regression when outcome completely separated by genotype
Multicollinearity: Highly correlated covariates
Numerical instability: Extreme coefficient values
Poor starting values: Especially with Newton method
Transformation issues: Invalid values after transformation
Solutions:
Increase
max_iterparameterTry different optimization method:
edge = EDGEAnalysis( outcome_type='continuous', ols_method='nm', # More robust max_iter=5000 )
Filter rare variants (MAF > 0.01)
Check covariate correlations
Use outcome transformation
Use
get_skipped_snps()to identify problematic variants
Systematic convergence failures:
# If >10% of SNPs fail
skipped = edge.get_skipped_snps()
if len(skipped) / total_snps > 0.1:
# Possible data quality issues
print("Warning: High failure rate. Check:")
print("1. MAF distribution")
print("2. Covariate multicollinearity")
print("3. Outcome distribution")
print("4. Sample size per genotype group")
Extensions and Future Directions
Planned Features (v0.2.0+)
Gene-based testing: Aggregate SNPs within genes
Mixed models: Account for relatedness and population structure
X chromosome: Proper handling of X-linked inheritance
Rare variant analysis: Optimized methods for MAF < 0.01
Meta-analysis tools: Combine α estimates across studies
GPU acceleration: Faster computation for biobank-scale data
Additional transformations: Box-Cox, Yeo-Johnson
Adaptive method selection: Automatically choose best optimization method
Interactive optimization: Real-time convergence monitoring
Batch effect correction: Built-in adjustments for technical covariates
Research Directions
Optimal splitting ratio: Beyond 50/50 for different scenarios
Cross-validation: Stability of α across folds
Bayesian EDGE: Prior distributions on α
Multi-trait EDGE: Joint analysis of related phenotypes
Gene-environment interaction: EDGE with interaction terms
Optimization method benchmarking: Systematic comparison across data types
Transformation selection: Automated selection of optimal transformation
Non-linear relationships: Spline-based EDGE encoding
Limitations
Current Limitations
Data splitting required: Reduces effective sample size
Assumes independence: Not designed for related individuals (mixed models coming)
Single SNP analysis: Does not model joint effects
Autosomal variants only: X/Y chromosomes require special handling
European-ancestry optimized: May need adjustment for diverse populations
Linear transformations: Non-linear relationships not fully captured
Computational cost: Higher than standard GWAS (but manageable)
Statistical Assumptions
EDGE assumes:
Independent samples: No cryptic relatedness (unless using mixed models)
No genotyping errors: QC filters applied
Hardy-Weinberg Equilibrium: In controls for binary outcomes
Correct model specification: Covariates adequately control confounding
Large sample size: Asymptotic properties of tests hold
Proper transformation: Normality after transformation (for linear regression)
Convergence: Optimization reaches true optimum
Validation Studies
Simulation Studies
EDGE has been validated through simulations showing:
Type I error control: Maintains nominal false positive rate (5%)
Increased power: 10-50% power gain for nonadditive effects
Robustness: Stable performance across MAF ranges
α recovery: Accurately estimates true inheritance pattern
Transformation effectiveness: Proper error control with non-normal outcomes
Optimization stability: Consistent results across optimization methods
Real Data Applications
EDGE has been applied to:
Cardiometabolic traits: BMI, diabetes, blood pressure, lipids
UK Biobank data: >500,000 individuals
Novel discoveries: Identified nonadditive effects missed by additive GWAS
Diverse outcomes: Binary and continuous with various transformations
See Zhou et al. (2023) for detailed results.
Method Comparison Studies
Benchmarking shows:
BFGS vs Newton: BFGS more robust, Newton slightly faster when converging
Transformation impact: RINT provides best Type I error control for heavy-tailed traits
Power curves: EDGE superior for α < 0.3 or α > 0.7
Replication rates: Higher for EDGE-discovered loci with extreme α values
Mathematical Proofs
Unbiased p-values (Sketch)
Theorem: Two-stage EDGE produces valid p-values under the null hypothesis.
Proof sketch:
α estimated on training data \(D_{train}\)
Association tested on independent test data \(D_{test}\)
Under \(H_0\): \(\beta = 0\), test statistic in \(D_{test}\) follows null distribution
Independence of \(D_{train}\) and \(D_{test}\) ensures no inflation
Transformation applied identically to both datasets maintains independence
Therefore, Type I error rate is controlled at nominal level
Consistency of α (Sketch)
Theorem: \(\hat{\alpha} \xrightarrow{p} \alpha_{true}\) as \(n \rightarrow \infty\)
Conditions:
\(|\beta_{HA}| > c > 0\) for some constant c
Standard regularity conditions for MLE
Convergence of optimization algorithm to true optimum
Implication: With sufficient training data and proper optimization, α estimates converge to true value.
Convergence of Optimization Methods
Theorem: Under standard regularity conditions, quasi-Newton methods (BFGS, L-BFGS) achieve superlinear convergence.
Rate: \(\|\theta_{k+1} - \theta^*\| \leq C \|\theta_k - \theta^*\|^{1+\epsilon}\)
Where:
\(\theta^*\) = true parameter value
\(\epsilon > 0\) for superlinear convergence
\(C\) = positive constant
Glossary of Terms
- Alpha (α)
Encoding parameter representing the ratio of heterozygous to homozygous effects
- Additive model
Genetic model assuming heterozygote effect is exactly half of homozygous effect (α = 0.5)
- BFGS
Broyden-Fletcher-Goldfarb-Shannon algorithm for optimization (default for EDGE)
- Codominant model
Model that separately estimates effects for heterozygotes and homozygotes
- EDGE encoding
Genotype encoding using estimated α values
- Genomic inflation (λ)
Measure of systematic p-value inflation due to population structure
- HWE
Hardy-Weinberg Equilibrium
- INT
Inverse Normal Transformation (parametric)
- L-BFGS
Limited-memory BFGS, memory-efficient variant for large datasets
- MAF
Minor Allele Frequency
- Nelder-Mead
Derivative-free simplex optimization method
- Newton-Raphson
Second-order optimization using exact Hessian matrix
- Over-dominant
Inheritance pattern where heterozygotes have stronger effect than either homozygote (α > 1)
- Quasi-Newton methods
Optimization methods using approximation of Hessian (e.g., BFGS, L-BFGS)
- Recessive
Inheritance pattern where heterozygotes behave like reference homozygotes (α ≈ 0)
- RINT
Rank-based Inverse Normal Transformation, robust to outliers
- Wald test
Statistical test using ratio of estimate to standard error
- Two-stage analysis
Analysis approach separating parameter estimation (stage 1) from hypothesis testing (stage 2)
Best Practices Summary
Data Preparation
Quality Control:
MAF > 0.01 (or 0.05 for small samples)
HWE p > 1×10⁻⁶ in controls
Genotype call rate > 95%
Sample call rate > 95%
Remove related individuals (kinship > 0.0884)
Covariates:
Include age, sex, assessment center (if applicable)
Add 10-20 principal components
Check for multicollinearity (VIF < 10)
Consider batch effects
Outcome Preparation:
Check distribution (histogram, Q-Q plot)
Remove extreme outliers (>5 SD) before transformation
Choose appropriate transformation:
Normal → None
Right-skewed →
'log'or'rank_inverse_normal'Heavy-tailed →
'rank_inverse_normal'Unknown →
'rank_inverse_normal'(safest)
Verify transformation effectiveness
Parameter Selection
Optimization Method:
Default:
ols_method='bfgs'(suitable for >95% of cases)Large datasets (N>100k):
ols_method='lbfgs'Convergence issues:
ols_method='nm'orols_method='newton'Many covariates (P>100):
ols_method='ncg'
Maximum Iterations:
Default:
max_iter=1000(usually sufficient)Increase to 2000-5000 if convergence issues
Monitor skipped SNPs
Data Splitting:
50/50 split recommended (balanced power and stability)
Larger training set (60/40) for unstable traits
Larger test set (40/60) for well-behaved traits
Result Interpretation
Genome-wide significant hits (p < 5×10⁻⁸):
Report α value for each hit
Classify inheritance pattern
Compare with additive GWAS results
Check for biological plausibility
Alpha interpretation:
def classify_inheritance(alpha): if alpha < 0: return "Under-recessive (heterozygote disadvantage)" elif alpha < 0.3: return "Recessive to partially recessive" elif 0.45 <= alpha <= 0.55: return "Additive" elif 0.7 < alpha < 1.3: return "Partially dominant to dominant" elif alpha >= 1.3: return "Over-dominant (heterozygote advantage)" else: return "Intermediate" sig_hits = gwas_results[gwas_results['pval'] < 5e-8] sig_hits['inheritance'] = sig_hits['alpha_value'].apply(classify_inheritance) print(sig_hits[['variant_id', 'alpha_value', 'inheritance', 'pval']])
Quality checks:
λ should be ≈1.0 (acceptable: 0.95-1.05)
Q-Q plot should show no systematic deviation
Check consistency of α values for nearby SNPs (LD)
Appendix A: Optimization Method Details
BFGS Algorithm:
The Broyden-Fletcher-Goldfarb-Shannon algorithm approximates the Hessian matrix using gradient information:
where:
\(s_k = x_{k+1} - x_k\) (step)
\(y_k = \nabla f(x_{k+1}) - \nabla f(x_k)\) (gradient change)
\(H_k\) is the approximate Hessian at iteration k
Convergence criteria:
Default tolerance: \(\epsilon = 10^{-6}\)
L-BFGS Algorithm:
Limited-memory BFGS stores only m recent vector pairs (typically m=10):
Memory requirement: \(O(m \cdot n)\) instead of \(O(n^2)\) for full BFGS
Trade-off: Slightly slower convergence but much lower memory usage
Newton-Raphson:
Uses exact Hessian matrix:
where \(H(x_k)\) is the true Hessian matrix.
Convergence rate: Quadratic (fastest when close to optimum)
Cost: Expensive Hessian computation at each iteration
Nelder-Mead Simplex:
Derivative-free method using simplex of n+1 points:
Operations: reflection, expansion, contraction, shrinkage
Pros: Robust, no gradient needed Cons: Slow, may not converge for high dimensions
See Also
Documentation:
Documentation Home - Home
Installation Guide - Installation instructions and requirements
Quick Start Guide - Getting started guide with simple examples
User Guide - Comprehensive user guide and tutorials
API Reference - Complete API documentation
Example Workflows - Example analyses and case studies
Visualization Guide - Plotting and visualization guide
Statistical Model - Statistical methods and mathematical background
Troubleshooting Guide - Troubleshooting guide and common issues
Frequently Asked Questions (FAQ) - Frequently asked questions
Citation - How to cite EDGE in publications
Changelog - Version history and release notes
Advanced Topics for Further Updates - Planned features and roadmap
—
Last updated: 2025-12-25 for edge-gwas v0.1.1
For questions or issues, visit: https://github.com/nicenzhou/edge-gwas/issues