API Reference

Complete API documentation for edge-gwas package.

Core Module

EDGEAnalysis Class

Main class for EDGE GWAS analysis.

from edge_gwas import EDGEAnalysis

# Binary outcome
edge = EDGEAnalysis(outcome_type='binary')

# Continuous outcome with transformation
edge = EDGEAnalysis(
    outcome_type='continuous',
    outcome_transform='rank_inverse_normal'
)

Parameters:

  • outcome_type: ‘binary’ or ‘continuous’

  • outcome_transform: ‘log’, ‘log10’, ‘inverse_normal’, ‘rank_inverse_normal’

  • n_jobs: Number of parallel jobs (-1 = all cores)

  • outcome_transform (str, optional): NEW in v0.1.1 - Transformation for continuous outcomes

    • None: No transformation (default)

    • 'log': Natural log transformation

    • 'log10': Log base 10 transformation

    • 'inverse_normal': Inverse normal transformation (parametric)

    • 'rank_inverse_normal': Rank-based inverse normal transformation (robust to outliers)

  • ols_method (str, optional): NEW in v0.1.1 - Optimization method for OLS regression (continuous outcomes only)

    • 'bfgs': Broyden-Fletcher-Goldfarb-Shannon algorithm (default)

      • Best for: General purpose, balanced speed and accuracy

      • Convergence: Fast, uses approximate Hessian

      • Memory: Moderate (O(p²) where p = number of parameters)

      • Recommended: Most analyses

    • 'newton': Newton-Raphson algorithm

      • Best for: High precision requirements

      • Convergence: Very fast when close to optimum (quadratic)

      • Memory: High (requires exact Hessian computation)

      • Recommended: Small datasets, need high accuracy

    • 'lbfgs': Limited-memory BFGS

      • Best for: Large datasets (N > 100,000)

      • Convergence: Slightly slower than BFGS

      • Memory: Low (O(mp) where m ≈ 10)

      • Recommended: Biobank-scale data, memory-constrained systems

    • 'nm': Nelder-Mead simplex algorithm

      • Best for: Robustness, convergence issues with gradient methods

      • Convergence: Slow, derivative-free

      • Memory: Low

      • Recommended: Troubleshooting convergence failures

    • 'cg': Conjugate Gradient

      • Best for: Sparse problems, many covariates

      • Convergence: Moderate

      • Memory: Low

      • Recommended: Large covariate sets

    • 'ncg': Newton Conjugate Gradient

      • Best for: Large-scale problems with >100 covariates

      • Convergence: Good, uses Hessian-vector products

      • Memory: Moderate

      • Recommended: High-dimensional analyses

    • 'powell': Powell’s conjugate direction method

      • Best for: Derivative-free optimization

      • Convergence: Slow, robust

      • Memory: Low

      • Recommended: Non-smooth objective functions (rare in GWAS)

    • 'basinhopping': Basin-hopping global optimization

      • Best for: Multiple local minima (very rare in GWAS)

      • Convergence: Very slow, explores parameter space thoroughly

      • Memory: Moderate

      • Recommended: Only when suspecting multiple solutions

    Note: Optimization method only affects continuous outcomes with linear regression. Binary outcomes always use BFGS for logistic regression.

  • n_jobs (int): Number of CPU cores for parallel processing

    • -1: Use all available cores

    • 1: Single-threaded (no parallelization)

    • n: Use n cores

  • max_iter (int): Maximum iterations for model convergence

    • Default: 1000 (usually sufficient for BFGS, Newton)

    • Increase to 2000-5000 for Nelder-Mead or if convergence warnings occur

    • Reduce to 500 if most variants converge quickly

  • verbose (bool): Print progress messages

Methods

calculate_alpha()

Calculate encoding parameters from training data.

alpha_df = edge.calculate_alpha(
    genotype_data,
    phenotype_df,
    outcome,
    covariates,
    variant_info=None,
    grm_matrix=None,        # NEW in v0.1.1
    grm_sample_ids=None,    # NEW in v0.1.1
    mean_centered=False     # NEW in v0.1.1
)

Parameters:

  • genotype_data (pandas.DataFrame): Genotype matrix (samples × variants), values {0, 1, 2}

  • phenotype_df (pandas.DataFrame): Phenotype data with outcome and covariates

  • outcome (str): Name of outcome column

  • covariates (list): List of covariate column names

  • variant_info (pandas.DataFrame, optional): Variant information (variant_id, chr, pos, ref_allele, alt_allele)

  • grm_matrix (numpy.ndarray, optional): NEW in v0.1.1 - GRM matrix for population structure control

  • grm_sample_ids (pandas.DataFrame, optional): NEW in v0.1.1 - Sample IDs corresponding to GRM

  • mean_centered (bool, optional): NEW in v0.1.1 - If True, fit codominant regression without intercept (default: False)

Returns:

  • pandas.DataFrame: Alpha values with columns: variant_id, alpha_value, ref_allele, alt_allele, eaf, coef_het, coef_hom, std_err_het, std_err_hom, pval_het, pval_hom, n_samples

Examples:

# Basic usage
alpha_df = edge.calculate_alpha(
    genotype_data=train_geno,
    phenotype_df=train_pheno,
    outcome='trait',
    covariates=['age', 'sex']
)

# With GRM for population structure control
from edge_gwas.utils import load_grm_gcta
grm_matrix, grm_ids = load_grm_gcta('grm_prefix')

alpha_df = edge.calculate_alpha(
    genotype_data=train_geno,
    phenotype_df=train_pheno,
    outcome='trait',
    covariates=['age', 'sex'],
    grm_matrix=grm_matrix,
    grm_sample_ids=grm_ids
)

# Mean-centered model (Y - E[Y], no intercept)
alpha_df = edge.calculate_alpha(
    genotype_data=train_geno,
    phenotype_df=train_pheno,
    outcome='trait',
    covariates=['age', 'sex'],
    mean_centered=True
)

Notes:

  • Model: Y ~ β_het × I(geno=1) + β_hom × I(geno=2) + covariates

  • Alpha: α = β_het / β_hom

  • When mean_centered=True: no intercept term in the regression model

apply_alpha()

Apply alpha values to test data for GWAS.

gwas_df = edge.apply_alpha(
    genotype_data,
    phenotype_df,
    outcome,
    covariates,
    alpha_values,
    grm_matrix=None,        # NEW in v0.1.1
    grm_sample_ids=None     # NEW in v0.1.1
)

Parameters:

  • genotype_data (pandas.DataFrame): Test genotype matrix

  • phenotype_df (pandas.DataFrame): Test phenotype data

  • outcome (str): Name of outcome column

  • covariates (list): List of covariate column names

  • alpha_values (pandas.DataFrame): Alpha values from training

  • grm_matrix (numpy.ndarray, optional): NEW in v0.1.1 - GRM matrix for mixed model

  • grm_sample_ids (pandas.DataFrame, optional): NEW in v0.1.1 - Sample IDs for GRM

Returns:

  • pandas.DataFrame: GWAS results with columns:

    • variant_id: SNP identifier

    • chr: Chromosome

    • pos: Position

    • pval: P-value

    • coef: Effect coefficient

    • std_err: Standard error

    • stat: Test statistic

    • alpha_value: Applied alpha

run_full_analysis()

Complete two-stage EDGE analysis.

alpha_df, gwas_df = edge.run_full_analysis(
    train_genotype,
    train_phenotype,
    test_genotype,
    test_phenotype,
    outcome,
    covariates,
    variant_info=None,
    grm_matrix=None,        # NEW in v0.1.1
    grm_sample_ids=None,    # NEW in v0.1.1
    mean_centered=False,    # NEW in v0.1.1
    output_prefix=None
)

Parameters:

  • train_genotype/test_genotype (pandas.DataFrame): Training/test genotype data

  • train_phenotype/test_phenotype (pandas.DataFrame): Training/test phenotype data

  • outcome (str): Outcome column name

  • covariates (list): Covariate column names

  • variant_info (pandas.DataFrame, optional): Variant information

  • grm_matrix (numpy.ndarray, optional): NEW in v0.1.1 - GRM for population structure control

  • grm_sample_ids (pandas.DataFrame, optional): NEW in v0.1.1 - Sample IDs for GRM

  • mean_centered (bool, optional): NEW in v0.1.1 - If True, fit codominant regression without intercept (default: False)

  • output_prefix (str, optional): Prefix for output files

Returns:

  • tuple: (alpha_df, gwas_df)

Examples:

# Basic usage
alpha_df, gwas_df = edge.run_full_analysis(
    train_geno, train_pheno,
    test_geno, test_pheno,
    outcome='trait',
    covariates=['age', 'sex'],
    output_prefix='results/edge'
)

# With GRM and mean-centered model
from edge_gwas.utils import load_grm_gcta
grm_matrix, grm_ids = load_grm_gcta('grm_prefix')

alpha_df, gwas_df = edge.run_full_analysis(
    train_geno, train_pheno,
    test_geno, test_pheno,
    outcome='trait',
    covariates=['age', 'sex'],
    grm_matrix=grm_matrix,
    grm_sample_ids=grm_ids,
    mean_centered=True,
    output_prefix='results/edge'
)

get_skipped_snps()

Get list of SNPs that failed convergence.

skipped = edge.get_skipped_snps()

Returns:

  • list: List of variant IDs that were skipped

Utilities Module

Calculate genetic relationship matrix (GRM) using GCTA.

from edge_gwas.utils import calculate_grm_gcta

grm_prefix = calculate_grm_gcta(
    plink_prefix='genotypes',
    output_prefix='output/grm',
    maf_threshold=0.01,
    method='grm',
    max_threads=8,
    verbose=True
)

Parameters:

  • plink_prefix (str): Prefix for PLINK binary files (.bed/.bim/.fam)

  • output_prefix (str, optional): Prefix for output GRM files (default: temp directory)

  • maf_threshold (float): MAF threshold for variant filtering (default: 0.01)

  • method (str): GRM calculation method

    • 'grm': Full GRM (default)

    • 'grm-sparse': Sparse GRM (faster for large cohorts)

  • max_threads (int): Maximum number of threads to use (default: 1)

  • verbose (bool): Print progress information

Returns:

  • str: Path to output GRM prefix

Output files:

  • prefix.grm.bin: GRM values (lower triangle, binary format)

  • prefix.grm.N.bin: Number of SNPs used for each pair

  • prefix.grm.id: Sample IDs (FID and IID)

Note:

Requires GCTA to be installed. Install with:

edge-gwas-install-tools

Load GRM calculated by GCTA.

from edge_gwas.utils import load_grm_gcta

grm_matrix, sample_ids = load_grm_gcta(
    grm_prefix='output/grm',
    verbose=True
)

Parameters:

  • grm_prefix (str): Prefix for GRM files (without .grm.bin extension)

  • verbose (bool): Print loading information

Returns:

  • tuple: (grm_matrix, sample_ids_df)

    • grm_matrix (numpy.ndarray): n_samples × n_samples symmetric GRM matrix

    • sample_ids_df (pandas.DataFrame): DataFrame with FID and IID columns

Example:

# Calculate and load GRM
grm_prefix = calculate_grm_gcta('genotypes', output_prefix='grm/output')
grm_matrix, grm_ids = load_grm_gcta('grm/output')

print(f"GRM shape: {grm_matrix.shape}")
print(f"Mean diagonal: {np.diag(grm_matrix).mean():.3f}")

Identify pairs of related samples based on GRM threshold.

from edge_gwas.utils import identify_related_samples

related_pairs = identify_related_samples(
    grm_matrix,
    sample_ids,
    threshold=0.0884,  # ~3rd degree relatives
    verbose=True
)

Parameters:

  • grm_matrix (numpy.ndarray): GRM matrix

  • sample_ids (pandas.DataFrame): Sample IDs from load_grm_gcta()

  • threshold (float): Relatedness threshold

    • 0.354: 1st degree (parent-offspring, full siblings)

    • 0.177: 2nd degree (half-siblings, grandparent-grandchild)

    • 0.0884: 3rd degree (first cousins) - default

  • verbose (bool): Print summary statistics

Returns:

  • pandas.DataFrame: Related pairs with columns:

    • IID1: First sample ID

    • IID2: Second sample ID

    • kinship: Kinship coefficient

Example:

grm_matrix, sample_ids = load_grm_gcta('grm_prefix')

# Find 2nd degree or closer relatives
related = identify_related_samples(grm_matrix, sample_ids, threshold=0.177)
print(f"Found {len(related)} related pairs")

Filter out related samples to create an unrelated subset.

from edge_gwas.utils import filter_related_samples

unrelated_pheno = filter_related_samples(
    phenotype_df,
    grm_matrix,
    sample_ids,
    threshold=0.0884,
    method='greedy',
    sample_id_col=None,
    verbose=True
)

Parameters:

  • phenotype_df (pandas.DataFrame): Phenotype DataFrame

  • grm_matrix (numpy.ndarray): GRM matrix

  • sample_ids (pandas.DataFrame): Sample IDs from load_grm_gcta()

  • threshold (float): Relatedness threshold

  • method (str): Method for selecting unrelated samples

    • 'greedy': Iteratively remove sample with most relatives (default)

    • 'random': Randomly remove one from each related pair

  • sample_id_col (str, optional): Column name in phenotype_df for sample IDs (if None, uses index)

  • verbose (bool): Print filtering information

Returns:

  • pandas.DataFrame: Filtered phenotype DataFrame with unrelated samples only

Example:

# Remove related samples for QC
unrelated_pheno = filter_related_samples(
    phenotype_df, grm_matrix, sample_ids,
    threshold=0.0884, method='greedy'
)
print(f"Unrelated samples: {len(unrelated_pheno)}")

UPDATED in v0.1.1 - Calculate principal components using PLINK2 with support for multiple input formats.

from edge_gwas.utils import calculate_pca_plink

pca_df = calculate_pca_plink(
    file_prefix='genotypes',
    n_pcs=10,
    file_format='bfile',
    output_prefix=None,
    maf_threshold=0.01,
    ld_window=50,
    ld_step=5,
    ld_r2=0.2,
    approx=False,
    verbose=True
)

Parameters:

  • file_prefix (str): Prefix for input files (renamed from plink_prefix)

  • n_pcs (int): Number of principal components to calculate (default: 10)

  • file_format (str): NEW in v0.1.1 - Input file format (default: ‘bfile’). Options:

    • 'bfile': PLINK1 binary (.bed/.bim/.fam)

    • 'pfile': PLINK2 binary (.pgen/.pvar/.psam)

    • 'vcf': VCF file (.vcf or .vcf.gz)

    • 'bgen': BGEN file (.bgen)

  • output_prefix (str, optional): Prefix for output files (default: temp directory)

  • maf_threshold (float, optional): UPDATED - MAF threshold for variant filtering (default: 0.01, None to skip)

  • ld_window (int, optional): UPDATED - Window size for LD pruning in variant count (default: 50, None to skip LD pruning)

  • ld_step (int, optional): UPDATED - Step size for LD pruning (default: 5, None to skip LD pruning)

  • ld_r2 (float, optional): UPDATED - R² threshold for LD pruning (default: 0.2, None to skip LD pruning)

  • approx (bool): Use approximate PCA for large cohorts (default: False)

  • verbose (bool): Print progress information (default: True)

Returns:

  • pandas.DataFrame: PCA results with IID as index and PC1, PC2, …, PCn as columns

Note:

Requires PLINK2 to be installed. Install with:

edge-gwas-install-tools

Changes from v0.1.0:

  • Added file_format parameter for multiple input formats

  • Renamed plink_prefix to file_prefix

  • MAF and LD parameters now optional (set to None to skip)

  • Removed approx_samples parameter (PLINK2 handles automatically)

  • Fixed output parsing to skip header lines

Example:

# PLINK1 binary format (default, backward compatible)
pca_df = calculate_pca_plink('genotypes', n_pcs=10)

# PLINK2 binary format
pca_df = calculate_pca_plink(
    'genotypes',
    n_pcs=10,
    file_format='pfile'
)

# VCF file
pca_df = calculate_pca_plink(
    'mydata.vcf.gz',
    n_pcs=10,
    file_format='vcf'
)

# Skip LD pruning for small datasets
pca_df = calculate_pca_plink(
    'genotypes',
    n_pcs=10,
    ld_window=None,
    ld_step=None,
    ld_r2=None
)

Calculate PC-AiR (Principal Components - Analysis in Related samples).

from edge_gwas.utils import calculate_pca_pcair

pca_df = calculate_pca_pcair(
    plink_prefix='genotypes',
    n_pcs=10,
    kinship_matrix=None,
    divergence_matrix=None,
    output_prefix=None,
    kin_threshold=0.0884,
    div_threshold=-0.0884,
    maf_threshold=0.01,
    verbose=True
)

Parameters:

  • plink_prefix (str): Prefix for PLINK binary files (.bed/.bim/.fam)

  • n_pcs (int): Number of principal components to calculate (default: 10)

  • kinship_matrix (str, optional): Path to kinship matrix file (GCTA GRM format prefix). If None, will compute using calculate_grm_gcta()

  • divergence_matrix (str, optional): Path to divergence matrix file (optional)

  • output_prefix (str, optional): Prefix for output files (default: temp directory)

  • kin_threshold (float): Kinship threshold for defining relatedness (default: 0.0884, ~ 3rd degree relatives)

  • div_threshold (float): Divergence threshold (default: -0.0884)

  • maf_threshold (float): MAF threshold for GRM calculation if kinship_matrix is None (default: 0.01)

  • verbose (bool): Print progress information (default: True)

Returns:

  • pandas.DataFrame: PCA results with IID as index and PC1, PC2, …, PCn as columns

Note:

Requires R with GENESIS package installed:

edge-gwas-install-tools

Or install manually in R:

if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("GENESIS")
BiocManager::install("SNPRelate")
BiocManager::install("gdsfmt")

Reference:

Conomos et al. (2015) Genetic Epidemiology 39(4):276-293

Example:

# PC-AiR with automatic GRM calculation
pca_df = calculate_pca_pcair('genotypes', n_pcs=10)

# PC-AiR with pre-computed GRM
pca_df = calculate_pca_pcair(
    'genotypes',
    n_pcs=10,
    kinship_matrix='grm_prefix'
)

# Custom kinship threshold (2nd degree relatives)
pca_df = calculate_pca_pcair(
    'genotypes',
    n_pcs=10,
    kin_threshold=0.177
)

# Use PC-AiR results as covariates
pheno = attach_pcs_to_phenotype(pheno, pca_df, n_pcs=10)
covariates = ['age', 'sex'] + get_pc_covariate_list(10)

Calculate principal components using scikit-learn (basic PCA without relatedness correction).

from edge_gwas.utils import calculate_pca_sklearn

pca_df = calculate_pca_sklearn(
    genotype_df,
    n_pcs=10,
    verbose=True
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype matrix (samples × variants)

  • n_pcs (int): Number of principal components to calculate (default: 10)

  • verbose (bool): Print progress information

Returns:

  • pandas.DataFrame: PCA results with IID as index and PC1, PC2, …, PCn as columns

Note:

This is a basic PCA without correction for relatedness. For robust PCA, use calculate_pca_plink() or calculate_pca_pcair().

Example:

# Quick PCA for exploratory analysis
pca_df = calculate_pca_sklearn(genotype_df, n_pcs=10)

# Check variance explained
print("PCA complete")

UPDATED in v0.1.1 - Attach principal components to phenotype DataFrame with option to drop missing samples.

from edge_gwas.utils import attach_pcs_to_phenotype

pheno_with_pcs = attach_pcs_to_phenotype(
    phenotype_df,
    pca_df,
    n_pcs=10,
    pc_prefix='PC',
    sample_id_col=None,
    drop_na=False,
    verbose=True
)

Parameters:

  • phenotype_df (pandas.DataFrame): Phenotype DataFrame

  • pca_df (pandas.DataFrame): PCA DataFrame with IID as index

  • n_pcs (int): Number of PCs to attach (default: 10)

  • pc_prefix (str): Prefix for PC column names (default: ‘PC’)

  • sample_id_col (str, optional): Column name in phenotype_df for sample IDs (None = use index)

  • drop_na (bool): NEW in v0.1.1 - If True, remove samples with missing PCs after merging (default: False)

  • verbose (bool): Print information about merging (default: True)

Returns:

  • pandas.DataFrame: Phenotype DataFrame with PC columns added

Note:

The function automatically handles type mismatches between sample IDs (e.g., strings vs integers).

Changes from v0.1.0:

  • Added drop_na parameter to remove samples without PCs

  • Improved sample ID matching with automatic type conversion

  • Fixed duplicate column issues during merge

Example:

# Calculate PCA
pca_df = calculate_pca_plink('genotypes', n_pcs=10)

# Attach PCs, keep samples with missing PCs (default)
pheno = attach_pcs_to_phenotype(pheno, pca_df, n_pcs=10)

# Attach PCs and remove samples without PCs
pheno = attach_pcs_to_phenotype(
    pheno,
    pca_df,
    n_pcs=10,
    drop_na=True
)

# When phenotype has IID as column
pheno = attach_pcs_to_phenotype(
    pheno,
    pca_df,
    n_pcs=10,
    sample_id_col='IID',
    drop_na=True
)

# Complete workflow with PCs as covariates
pca_df = calculate_pca_plink('genotypes', n_pcs=10)
pheno = attach_pcs_to_phenotype(pheno, pca_df, n_pcs=10, drop_na=True)

# Generate PC covariate list
from edge_gwas.utils import get_pc_covariate_list
covariates = ['age', 'sex'] + get_pc_covariate_list(10)

# Run EDGE analysis with PCs
alpha_df, gwas_df = edge.run_full_analysis(
    train_g, train_p, test_g, test_p,
    outcome='disease',
    covariates=covariates
)

Generate list of PC covariate names for use in EDGE analysis.

from edge_gwas.utils import get_pc_covariate_list

pc_list = get_pc_covariate_list(n_pcs=10, pc_prefix='PC')

Parameters:

  • n_pcs (int): Number of PCs

  • pc_prefix (str): Prefix for PC column names (default: ‘PC’)

Returns:

  • list: List of PC column names [‘PC1’, ‘PC2’, …, ‘PCn’]

Example:

# Generate PC covariate list
pc_list = get_pc_covariate_list(10)
# Returns: ['PC1', 'PC2', 'PC3', ..., 'PC10']

# Use in EDGE analysis
covariates = ['age', 'sex'] + get_pc_covariate_list(10)
alpha_df, gwas_df = edge.run_full_analysis(
    train_g, train_p, test_g, test_p,
    outcome='disease',
    covariates=covariates
)

load_pgen_data()

Load PLINK 2 binary format data (.pgen/.pvar/.psam).

geno, info = load_pgen_data('data.pgen', 'data.pvar', 'data.psam')

Requires: pip install pgenlib

load_vcf_data()

Load VCF format data (.vcf or .vcf.gz).

geno, info = load_vcf_data('data.vcf.gz', dosage=True)

Parameters:

  • dosage (bool): Use dosages (DS field) vs hard calls (GT field)

  • minor_allele_as_alt (bool): Ensure minor allele is ALT

Requires: pip install cyvcf2

load_bgen_data()

Load BGEN format data.

geno, info = load_bgen_data('data.bgen', sample_file='data.sample')

Requires: pip install bgen-reader

prepare_phenotype_data()

Load and prepare phenotype data.

pheno = prepare_phenotype_data(
    'pheno.txt',
    outcome_col='disease',
    covariate_cols=['age', 'sex', 'pc1', 'pc2'],
    sample_id_col='IID',
    sep='\t'
)

Parameters:

  • outcome_col: Name of outcome column

  • covariate_cols: List of covariate columns

  • sample_id_col: Sample ID column (becomes index)

  • sep: File separator

Split data into training and test sets with stratification and flexible sample ID matching.

from edge_gwas.utils import stratified_train_test_split

train_g, test_g, train_p, test_p = stratified_train_test_split(
    genotype_df,
    phenotype_df,
    outcome_col='disease',
    test_size=0.5,
    random_state=42,
    is_binary=True,
    geno_id_col=None,
    pheno_id_col=None
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype data

  • phenotype_df (pandas.DataFrame): Phenotype data

  • outcome_col (str): Outcome column for stratification

  • test_size (float): Proportion for test set (0-1)

  • random_state (int): Random seed for reproducibility

  • is_binary (bool): Whether to stratify by outcome

  • geno_id_col (str, optional): NEW in v0.1.1 - Column or index name for sample IDs in genotype_df. If None, uses the DataFrame index (default: None)

  • pheno_id_col (str, optional): NEW in v0.1.1 - Column or index name for sample IDs in phenotype_df. If None, uses the DataFrame index (default: None)

Returns:

  • tuple: (train_geno, test_geno, train_pheno, test_pheno)

Examples:

# Example 1: Basic usage with matching indices (recommended)
pheno = pheno.set_index('IID')  # Set IID as index
train_g, test_g, train_p, test_p = stratified_train_test_split(
    geno, pheno, 'disease', test_size=0.5
)

# Example 2: Specify different ID columns for genotype and phenotype
train_g, test_g, train_p, test_p = stratified_train_test_split(
    geno, pheno, 'disease',
    geno_id_col='sample_id',  # Use 'sample_id' column/index from genotype
    pheno_id_col='IID',        # Use 'IID' column from phenotype
    test_size=0.5
)

# Example 3: Genotype uses index, phenotype uses column
train_g, test_g, train_p, test_p = stratified_train_test_split(
    geno, pheno, 'disease',
    geno_id_col=None,          # Use genotype index
    pheno_id_col='IID',        # Use phenotype 'IID' column
    test_size=0.3,
    random_state=123
)

Note:

The function automatically handles type mismatches between sample IDs (e.g., strings vs integers) by converting both to strings for comparison, then mapping back to original types for proper indexing. This ensures samples are matched correctly even if genotype IDs are strings (‘11’) and phenotype IDs are integers (11).

Tip:

For simplest usage, ensure both DataFrames have matching sample IDs as their index:

# Prepare data with matching indices
pheno = pheno.set_index('IID')

# Then use without specifying ID columns
train_g, test_g, train_p, test_p = stratified_train_test_split(
    geno, pheno, 'disease'
)

filter_genotype_data()

Comprehensive genotype QC filtering (all filters optional).

# MAF filter only
geno_qc = filter_genotype_data(geno, min_maf=0.01)

# MAF + missing rate
geno_qc = filter_genotype_data(
    geno,
    min_maf=0.01,
    max_missing_per_variant=0.1
)

# All filters (returns genotype + phenotype)
geno_qc, pheno_qc = filter_genotype_data(
    geno, pheno,
    min_maf=0.01,
    max_missing_per_variant=0.05,
    min_call_rate_per_sample=0.95
)

Parameters:

  • min_maf: Minimum MAF (e.g., 0.01 = 1%)

  • max_missing_per_variant: Max missing rate per variant (e.g., 0.1 = 10%)

  • min_call_rate_per_sample: Min call rate per sample (e.g., 0.95 = 95%)

Filter order: MAF → Variant missing → Sample call rate

Filter variants by minor allele frequency.

from edge_gwas.utils import filter_variants_by_maf

filtered_geno = filter_variants_by_maf(
    genotype_df,
    min_maf=0.01,
    verbose=True
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype data (works with both hard calls and dosages)

  • min_maf (float): Minimum MAF threshold

  • verbose (bool): Print filtering information

Returns:

  • pandas.DataFrame: Filtered genotype data

Filter variants by missingness rate.

from edge_gwas.utils import filter_variants_by_missing

filtered_geno = filter_variants_by_missing(
    genotype_df,
    max_missing=0.05,
    verbose=True
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype data

  • max_missing (float): Maximum missingness rate (0-1)

  • verbose (bool): Print filtering information

Returns:

  • pandas.DataFrame: Filtered genotype data

NEW in v0.1.1 - Filter samples by genotype call rate.

from edge_gwas.utils import filter_samples_by_call_rate

filtered_geno, filtered_pheno = filter_samples_by_call_rate(
    genotype_df,
    phenotype_df,
    min_call_rate=0.95,
    verbose=True
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype data

  • phenotype_df (pandas.DataFrame): Phenotype data

  • min_call_rate (float): Minimum call rate (proportion of non-missing genotypes)

  • verbose (bool): Print filtering information

Returns:

  • tuple: (filtered_genotype_df, filtered_phenotype_df)

Example:

# Remove samples with >5% missing genotypes
geno, pheno = filter_samples_by_call_rate(geno, pheno, min_call_rate=0.95)

NEW in v0.1.1 - Filter variants by Hardy-Weinberg Equilibrium p-value.

from edge_gwas.utils import filter_variants_by_hwe

filtered_geno = filter_variants_by_hwe(
    genotype_df,
    hwe_threshold=1e-6,
    verbose=True
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype DataFrame (samples × variants)

  • hwe_threshold (float): Minimum HWE p-value threshold (default: 1e-6)

  • verbose (bool): Print filtering information (default: True)

Returns:

  • pandas.DataFrame: Filtered genotype DataFrame

Note:

  • Variants with HWE p-value < threshold are removed

  • Variants with insufficient data for HWE test (NaN) are kept

  • Common thresholds: 1e-6 (stringent), 1e-4 (moderate), 1e-3 (lenient)

Example:

# Filter variants deviating from HWE
geno = filter_variants_by_hwe(geno, hwe_threshold=1e-6)

# More lenient threshold
geno = filter_variants_by_hwe(geno, hwe_threshold=1e-4)

# Complete QC workflow
from edge_gwas.utils import (
    filter_variants_by_maf,
    filter_variants_by_missing,
    filter_variants_by_hwe
)

# Apply multiple filters
geno = filter_variants_by_maf(geno, min_maf=0.01)
geno = filter_variants_by_missing(geno, max_missing=0.05)
geno = filter_variants_by_hwe(geno, hwe_threshold=1e-6)

print(f"Variants after QC: {geno.shape[1]}")

NEW in v0.1.1 - Check case/control balance in binary outcome.

from edge_gwas.utils import check_case_control_balance

balance = check_case_control_balance(
    phenotype_df,
    outcome_col='disease',
    verbose=True
)

Parameters:

  • phenotype_df (pandas.DataFrame): Phenotype data

  • outcome_col (str): Name of outcome column

  • verbose (bool): Print balance information

Returns:

  • dict: Dictionary with keys:

    • case_count (int): Number of cases

    • control_count (int): Number of controls

    • ratio (float): Case/control ratio

Example:

balance = check_case_control_balance(pheno, 'disease')
print(f"Cases: {balance['case_count']}, Controls: {balance['control_count']}")
print(f"Ratio: {balance['ratio']:.2f}")

NEW in v0.1.1 - Calculate Hardy-Weinberg Equilibrium p-values for each variant.

from edge_gwas.utils import calculate_hwe_pvalues

hwe_pvals = calculate_hwe_pvalues(
    genotype_df,
    verbose=True
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype DataFrame (samples × variants)

  • verbose (bool): Print calculation information (default: True)

Returns:

  • pandas.Series: HWE p-values for each variant

Note:

  • Only uses hard calls (0, 1, 2) for HWE calculation

  • Requires minimum 10 samples per variant

  • Uses chi-square test with minimum expected count of 5

  • Variants with insufficient data return NaN

Example:

# Calculate HWE p-values
hwe_pvals = calculate_hwe_pvalues(genotype_df)

# Find variants deviating from HWE
hwe_violations = hwe_pvals[hwe_pvals < 1e-6]
print(f"Variants violating HWE (p < 1e-6): {len(hwe_violations)}")

# Summary statistics
print(f"Mean HWE p-value: {hwe_pvals.mean():.4f}")
print(f"Median HWE p-value: {hwe_pvals.median():.4f}")

Impute missing values in covariates using various methods.

from edge_gwas.utils import impute_covariates

pheno_imputed = impute_covariates(
    phenotype_df,
    covariate_cols=['age', 'bmi', 'PC1', 'PC2'],
    method='median',
    drop_na=False,
    verbose=True
)

Parameters:

  • phenotype_df (pandas.DataFrame): Phenotype DataFrame with covariates

  • covariate_cols (list): List of covariate column names to impute

  • method (str): Imputation method (default: ‘median’). Options:

    • 'drop': Remove all rows with any missing values

    • 'mean': Replace with mean (numeric only)

    • 'median': Replace with median (numeric only)

    • 'mode': Replace with mode (works for categorical)

    • 'knn': K-Nearest Neighbors imputation (k=5)

    • 'missforest': MissForest algorithm (requires missingpy)

    • 'mice': Multiple Imputation by Chained Equations (requires missingpy)

  • drop_na (bool): If True, drop rows with any remaining missing values after imputation (default: False)

  • verbose (bool): Print imputation information (default: True)

Returns:

  • pandas.DataFrame: DataFrame with imputed covariates

Note:

For ‘missforest’ and ‘mice’, install missingpy:

pip install missingpy

Example:

# Simple median imputation
pheno = impute_covariates(
    pheno,
    covariate_cols=['age', 'bmi'],
    method='median'
)

# KNN imputation for multiple covariates
pheno = impute_covariates(
    pheno,
    covariate_cols=['age', 'bmi', 'PC1', 'PC2', 'PC3'],
    method='knn'
)

# Drop rows with missing values
pheno = impute_covariates(
    pheno,
    covariate_cols=['age', 'sex'],
    method='drop'
)

validate_and_align_data()

Validate and align genotype and phenotype data by sample IDs.

geno_aligned, pheno_aligned = validate_and_align_data(
    geno, pheno,
    outcome_col='disease',
    covariate_cols=['age', 'sex']
)

Parameters:

  • keep_only_common (bool): Keep only overlapping samples (default: True)

  • outcome_col: Outcome column to validate

  • covariate_cols: Covariate columns to validate

validate_genotype_df()

Validate genotype DataFrame format and encoding.

# Basic validation
validate_genotype_df(geno)

# With encoding check
is_valid = validate_genotype_df(geno, variant_info, check_encoding=True)

# Detailed report
is_valid, report = validate_genotype_df(
    geno, variant_info,
    check_encoding=True,
    return_details=True
)

Parameters:

  • genotype_df: Genotype DataFrame (samples × variants)

  • variant_info_df: Optional variant info for encoding validation

  • check_encoding (bool): Validate minor allele as ALT

  • return_details (bool): Return detailed report

validate_and_fix_encoding()

Validate and automatically fix genotype encoding.

geno_fixed, info_fixed, report = validate_and_fix_encoding(geno, variant_info)

# Check what was fixed
print(report[report['was_fixed'] == True])

Returns: (fixed_genotype_df, fixed_variant_info_df, report_df)

Calculate genomic inflation factor (λ).

from edge_gwas.utils import calculate_genomic_inflation

lambda_gc = calculate_genomic_inflation(pvalues)

Parameters:

  • pvalues (pandas.Series or array): P-values from GWAS

Returns:

  • float: Genomic inflation factor (λ)

Merge GWAS results with alpha values.

from edge_gwas.utils import merge_alpha_with_gwas

merged_df = merge_alpha_with_gwas(gwas_df, alpha_df)

Parameters:

  • gwas_df (pandas.DataFrame): GWAS results

  • alpha_df (pandas.DataFrame): Alpha values

Returns:

  • pandas.DataFrame: Merged results with both GWAS and alpha information

NEW in v0.1.1 - Perform standard additive GWAS for comparison with EDGE.

from edge_gwas.utils import additive_gwas

results_df = additive_gwas(
    genotype_df,
    phenotype_df,
    outcome='disease',
    covariates=['age', 'sex', 'PC1', 'PC2'],
    outcome_type='binary'
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype data

  • phenotype_df (pandas.DataFrame): Phenotype data

  • outcome (str): Name of outcome column

  • covariates (list): List of covariate column names

  • outcome_type (str): 'binary' for logistic regression, 'continuous' for linear regression

Returns:

  • pandas.DataFrame: Results with columns:

    • variant_id: SNP identifier

    • coef: Effect coefficient

    • std_err: Standard error

    • pval: P-value

Example:

# Binary outcome (logistic regression)
additive_results = additive_gwas(
    geno, pheno, 'disease',
    ['age', 'sex'],
    outcome_type='binary'
)

# Continuous outcome (linear regression)
additive_results = additive_gwas(
    geno, pheno, 'BMI',
    ['age', 'sex'],
    outcome_type='continuous'
)

NEW in v0.1.1 - Perform k-fold cross-validation for EDGE analysis.

from edge_gwas.utils import cross_validated_edge_analysis

avg_alpha, meta_gwas, all_alpha, all_gwas = cross_validated_edge_analysis(
    genotype_df,
    phenotype_df,
    outcome='disease',
    covariates=['age', 'sex', 'PC1', 'PC2'],
    outcome_type='binary',
    n_folds=5,
    n_jobs=8,
    random_state=42
)

Parameters:

  • genotype_df (pandas.DataFrame): Genotype data

  • phenotype_df (pandas.DataFrame): Phenotype data

  • outcome (str): Name of outcome column

  • covariates (list): List of covariate column names

  • outcome_type (str): 'binary' or 'continuous'

  • n_folds (int): Number of cross-validation folds

  • n_jobs (int): Number of parallel jobs for EDGE analysis

  • random_state (int): Random seed for reproducibility

Returns:

  • tuple: (avg_alpha, meta_gwas_df, combined_alpha, combined_gwas)

    • avg_alpha: Averaged alpha values across folds

    • meta_gwas_df: Meta-analyzed GWAS results (Fisher’s method)

    • combined_alpha: All alpha values from all folds

    • combined_gwas: All GWAS results from all folds

Example:

# Run 5-fold cross-validation
avg_alpha, meta_gwas, all_alpha, all_gwas = cross_validated_edge_analysis(
    genotype_df=geno,
    phenotype_df=pheno,
    outcome='disease',
    covariates=['age', 'sex'] + [f'PC{i}' for i in range(1, 11)],
    outcome_type='binary',
    n_folds=5
)

# Check alpha stability
print(f"Mean alpha std: {avg_alpha['alpha_std'].mean():.3f}")

# Find unstable variants
unstable = avg_alpha[avg_alpha['alpha_std'] > 0.3]
print(f"Unstable variants: {len(unstable)}")

Visualization Module

Create Manhattan plot of GWAS results.

from edge_gwas.visualize import manhattan_plot

manhattan_plot(
    gwas_df,
    output='manhattan.png',
    title='EDGE GWAS Manhattan Plot',
    sig_threshold=5e-8,
    suggestive_threshold=1e-5,
    figsize=(14, 6),
    colors=['#1f77b4', '#ff7f0e']
)

Parameters:

  • gwas_df (pandas.DataFrame): GWAS results with columns: chr, pos, pval

  • output (str): Output filename

  • title (str): Plot title

  • sig_threshold (float): Genome-wide significance threshold

  • suggestive_threshold (float): Suggestive significance threshold

  • figsize (tuple): Figure size (width, height) in inches

  • colors (list): Colors for alternating chromosomes

Returns:

  • None (saves plot to file)

Create QQ plot and calculate genomic inflation factor.

from edge_gwas.visualize import qq_plot

lambda_gc = qq_plot(
    gwas_df,
    output='qq_plot.png',
    title='QQ Plot',
    figsize=(8, 8)
)

Parameters:

  • gwas_df (pandas.DataFrame): GWAS results with pval column

  • output (str): Output filename

  • title (str): Plot title

  • figsize (tuple): Figure size (width, height) in inches

Returns:

  • float: Genomic inflation factor (λ)

Plot distribution of alpha values.

from edge_gwas.visualize import plot_alpha_distribution

plot_alpha_distribution(
    alpha_df,
    output='alpha_distribution.png',
    bins=50,
    figsize=(10, 6),
    title='Alpha Value Distribution'
)

Parameters:

  • alpha_df (pandas.DataFrame): Alpha values with alpha_value column

  • output (str): Output filename

  • bins (int): Number of histogram bins

  • figsize (tuple): Figure size (width, height)

  • title (str): Plot title

Returns:

  • None (saves plot to file)

I/O Handlers Module

NEW in v0.1.1 - Download test data files from GitHub repository for tutorials and testing.

from edge_gwas import download_test_files

results = download_test_files(
    output_dir='tests',
    version='v0.1.1',
    overwrite=False,
    verbose=True
)

Parameters:

  • output_dir (str): Directory to save test files (default: ‘tests’)

  • version (str): GitHub release version tag to download from (default: ‘v0.1.1’)

  • overwrite (bool): If True, overwrite existing files (default: False)

  • verbose (bool): Print download progress (default: True)

Returns:

  • dict: Dictionary with download results containing:

    • downloaded (list): List of successfully downloaded files

    • skipped (list): List of skipped (already existing) files

    • failed (list): List of failed downloads

Downloaded Files:

  • test.bed: PLINK binary genotype file (3,925 samples × 1,000 variants)

  • test.bim: PLINK variant information file

  • test.fam: PLINK sample information file

  • test.phen: Phenotype file with disease status and age

Examples:

# Example 1: Basic usage - download to default 'tests' directory
from edge_gwas import download_test_files

download_test_files()
# Files saved to: tests/test.bed, tests/test.bim, tests/test.fam, tests/test.phen

# Example 2: Download to custom directory
download_test_files(output_dir='data/examples')

# Example 3: Force re-download (overwrite existing files)
results = download_test_files(overwrite=True)
print(f"Downloaded: {results['downloaded']}")
print(f"Skipped: {results['skipped']}")

# Example 4: Download specific version
download_test_files(version='main')  # Download from main branch

# Example 5: Verify download results
results = download_test_files()
if results['failed']:
    print(f"Warning: Failed to download {results['failed']}")
else:
    print("✓ All test files ready!")

Complete Tutorial Workflow:

from edge_gwas import (
    download_test_files,
    load_plink_data,
    EDGEAnalysis
)
import pandas as pd

# Step 1: Download test data
download_test_files()

# Step 2: Load test data
geno, info = load_plink_data('tests/test.bed', 'tests/test.bim', 'tests/test.fam')
pheno = pd.read_csv('tests/test.phen', sep=' ')

# Step 3: Prepare data
pheno = pheno.set_index('IID')

# Step 4: Split and analyze
from edge_gwas.utils import stratified_train_test_split

train_g, test_g, train_p, test_p = stratified_train_test_split(
    geno, pheno, 'disease', test_size=0.5
)

# Step 5: Run EDGE analysis
edge = EDGEAnalysis(outcome_type='binary')
alpha_df, gwas_df = edge.run_full_analysis(
    train_g, train_p, test_g, test_p,
    outcome_col='disease',
    covariate_cols=['age']
)

print(f"✓ Analysis complete! Found {len(gwas_df)} variants")

File Verification:

import os

# Download files
results = download_test_files()

# Verify file sizes and existence
test_files = {
    'tests/test.bed': 'Genotype data (binary)',
    'tests/test.bim': 'Variant information',
    'tests/test.fam': 'Sample information',
    'tests/test.phen': 'Phenotype data'
}

for filepath, description in test_files.items():
    if os.path.exists(filepath):
        size_kb = os.path.getsize(filepath) / 1024
        print(f"✓ {filepath}")
        print(f"  {description} - {size_kb:.1f} KB")
    else:
        print(f"✗ {filepath} - MISSING!")

Re-downloading Files:

# Method 1: Use overwrite parameter
download_test_files(overwrite=True)

# Method 2: Delete directory first
import shutil
if os.path.exists('tests'):
    shutil.rmtree('tests')
download_test_files()

# Method 3: Download to new location
download_test_files(output_dir='tests_fresh')

Troubleshooting:

If download fails:

# Check internet connection and retry
results = download_test_files(overwrite=True, verbose=True)

# If still failing, download manually:
# 1. Visit: https://github.com/nicenzhou/edge-gwas/tree/v0.1.1/tests
# 2. Download files: test.bed, test.bim, test.fam, test.phen
# 3. Place in 'tests' directory

Note:

  • Requires internet connection to download from GitHub

  • Files are downloaded from the specified GitHub release version

  • Total download size: ~1-2 MB

  • Test data contains simulated genotypes for demonstration purposes only

See Also:

  • Quick Start Guide - Quick start tutorial using test data

  • Tutorials - Detailed tutorials with test data examples

  • load_plink_data() - Load the downloaded PLINK files

  • prepare_phenotype_data() - Load the downloaded phenotype file

Save GWAS and alpha results to files.

from edge_gwas.io_handlers import save_results

output_files = save_results(
    gwas_df,
    alpha_df=None,
    output_prefix='edge_gwas',
    save_alpha=True
)

Parameters:

  • gwas_df (pandas.DataFrame): GWAS results

  • alpha_df (pandas.DataFrame, optional): Alpha values

  • output_prefix (str): Prefix for output files

  • save_alpha (bool): Whether to save alpha values

Returns:

  • dict: Dictionary with keys ‘gwas’ and ‘alpha’ containing file paths

Load pre-calculated alpha values from file.

from edge_gwas.io_handlers import load_alpha_values

alpha_df = load_alpha_values('alpha_values.txt')

Parameters:

  • filename (str): Path to alpha values file

Returns:

  • pandas.DataFrame: Alpha values

Generate text summary of GWAS results.

from edge_gwas.io_handlers import create_summary_report

report = create_summary_report(
    gwas_df,
    alpha_df=None,
    significance_threshold=5e-8,
    output_file='summary.txt'
)

Parameters:

  • gwas_df (pandas.DataFrame): GWAS results

  • alpha_df (pandas.DataFrame, optional): Alpha values

  • significance_threshold (float): P-value threshold for significance

  • output_file (str, optional): Output file path

Returns:

  • str: Summary report text

Data Structures

Expected format:

  • Index: Sample IDs (as strings)

  • Columns: Variant IDs

  • Values: Genotype encoding (0, 1, 2, or NaN for missing)

    • 0 = Reference homozygote (REF/REF)

    • 1 = Heterozygote (REF/ALT)

    • 2 = Alternate homozygote (ALT/ALT)

    • NaN = Missing

    • For dosages: continuous values 0-2

Example:

#              rs123  rs456  rs789
# Sample1        0      1      2
# Sample2        1      2      0
# Sample3        2      0      1

Expected format:

  • Index: Sample IDs (as strings, matching genotype data)

  • Columns: Outcome and covariate columns

Example:

#          disease  age  sex  PC1   PC2
# Sample1        1   45    0  0.1  -0.2
# Sample2        0   38    1 -0.3   0.4
# Sample3        1   52    0  0.2   0.1

Expected format:

  • Columns: variant_id, chr, pos, ref, alt, etc.

Example:

#    variant_id  chr      pos ref alt
# 0       rs123    1  1234567   A   G
# 1       rs456    1  2345678   C   T
# 2       rs789    2  3456789   G   A

Output format from calculate_alpha():

#    variant_id  alpha_value    eaf  coef_het  coef_hom
# 0       rs123         0.45  0.235      0.23      0.51
# 1       rs456         0.52  0.412      0.31      0.60
# 2       rs789         0.38  0.189      0.19      0.50

Output format from apply_alpha():

#    variant_id  chr      pos       pval    coef  std_err    stat  alpha_value
# 0       rs123    1  1234567  1.23e-08  0.0451   0.0089    5.07         0.45
# 1       rs456    1  2345678  3.45e-06  0.0312   0.0067    4.66         0.52
# 2       rs789    2  3456789  5.67e-10  0.0623   0.0095    6.56         0.38

Output format from PCA functions:

#             PC1    PC2    PC3    PC4    PC5
# IID
# Sample1   0.12  -0.23   0.45  -0.12   0.34
# Sample2  -0.34   0.56  -0.12   0.23  -0.45
# Sample3   0.23  -0.12   0.34  -0.45   0.12

Output format from load_grm_gcta():

#      FID      IID
# 0  FAM001  Sample1
# 1  FAM001  Sample2
# 2  FAM002  Sample3

Constants and Defaults

# Significance thresholds
GWAS_THRESHOLD = 5e-8          # Genome-wide significance
SUGGESTIVE_THRESHOLD = 1e-5    # Suggestive threshold

# Quality control defaults
DEFAULT_MAF = 0.01             # Minimum MAF
DEFAULT_MISSING = 0.05         # Maximum missingness
DEFAULT_HWE = 1e-6             # HWE p-value threshold
DEFAULT_CALL_RATE = 0.95       # Minimum sample call rate

# Analysis defaults
DEFAULT_TEST_SIZE = 0.5        # Train/test split ratio
DEFAULT_N_JOBS = -1            # Use all CPU cores
DEFAULT_MAX_ITER = 1000        # Maximum iterations

# GRM defaults (NEW in v0.1.1)
DEFAULT_GRM_MAF = 0.01         # MAF for GRM calculation
DEFAULT_KIN_THRESHOLD = 0.0884 # Kinship threshold (3rd degree)

# PCA defaults (NEW in v0.1.1)
DEFAULT_N_PCS = 10             # Number of PCs
DEFAULT_LD_R2 = 0.2            # LD pruning threshold
DEFAULT_LD_WINDOW = 50         # LD window size (kb)

Exceptions

class EdgeGWASError(Exception):
    """Base exception for edge-gwas."""
    pass

class DataFormatError(EdgeGWASError):
    """Raised when input data format is invalid."""
    pass

class ConvergenceError(EdgeGWASError):
    """Raised when model fails to converge."""
    pass

class MissingDataError(EdgeGWASError):
    """Raised when required data is missing."""
    pass

class GRMError(EdgeGWASError):
    """Raised when GRM-related operations fail."""
    pass

class ToolNotFoundError(EdgeGWASError):
    """Raised when external tool (PLINK2, GCTA, R) is not found."""
    pass

Version Information

import edge_gwas

# Get version string
print(edge_gwas.__version__)  # '0.1.1'

# Get author information
print(edge_gwas.__author__)   # 'Your Name'

# Get license
print(edge_gwas.__license__)  # 'GPL-3.0'

Command-Line Tools (NEW in v0.1.1)

Interactive installer for external tools (PLINK2, GCTA, R packages).

# Install all tools interactively
edge-gwas-install-tools

# The installer will:
# 1. Detect your operating system and architecture
# 2. Download and install PLINK2
# 3. Download and install GCTA
# 4. Install R packages (GENESIS, SNPRelate, gdsfmt)
# 5. Configure PATH if needed

Supported platforms:

  • Linux (x86_64)

  • macOS (Intel x86_64 and Apple Silicon ARM64)

Features:

  • Automatic architecture detection

  • SSL-safe downloads with retry logic

  • Verification of installed tools

  • PATH configuration guidance

Verify that external tools are properly installed.

# Check all tools
edge-gwas-check-tools

# Output example:
# ======================================================================
# EDGE-GWAS External Tools Check
# ======================================================================
#
# Python Packages:
# ----------------------------------------------------------------------
# ✓ numpy: Installed
# ✓ pandas: Installed
# ✓ scipy: Installed
# ✓ statsmodels: Installed
# ✓ sklearn: Installed
# ✓ matplotlib: Installed
# ✓ pandas_plink: Installed
#
# External Tools:
# ----------------------------------------------------------------------
# ✓ PLINK2: PLINK v2.00a3.7 64-bit Intel (24 Jan 2024)
# ✓ GCTA: GCTA 1.95.0
#
# R and Packages:
# ----------------------------------------------------------------------
# ✓ R: R version 4.3.0
# ✓ R package GENESIS: Installed
# ✓ R package SNPRelate: Installed
# ✓ R package gdsfmt: Installed
#
# ======================================================================
# ✓ All tools and packages are properly installed!
# ======================================================================

Function Reference Summary

Class: EDGEAnalysis

Method

Purpose

Key Input

Output

calculate_alpha()

Calculate encoding parameters from training data

Genotype, phenotype, covariates, GRM (optional)

Alpha DataFrame

apply_alpha()

Apply alpha values to test data for GWAS

Genotype, phenotype, alpha values, GRM (optional)

GWAS results DataFrame

run_full_analysis()

Complete two-stage workflow

Train/test genotype & phenotype, GRM (optional)

(alpha_df, gwas_df)

get_skipped_snps()

Get list of SNPs that failed convergence

None

List of variant IDs

Data Loading:

Function

Purpose

Requirements

load_plink_data()

Load PLINK binary files (.bed/.bim/.fam)

pandas-plink

load_pgen_data()

Load PLINK 2 files (.pgen/.pvar/.psam)

pgenlib

load_vcf_data()

Load VCF/VCF.GZ files

cyvcf2

load_bgen_data()

Load BGEN files

bgen-reader

prepare_phenotype_data()

Load and prepare phenotype data

-

Data Quality Control:

Function

Purpose

Output

filter_genotype_data()

Comprehensive QC (MAF, missing, call rate)

Filtered data

apply_standard_qc()

Apply standard GWAS filters (MAF≥0.01, missing≤0.05, call rate≥0.95)

Filtered data

validate_genotype_df()

Validate format and encoding

bool or report

validate_and_fix_encoding()

Auto-fix minor allele encoding

Fixed data + report

validate_and_align_data()

Align genotype and phenotype by sample IDs

Aligned data

diagnose_genotype_data()

Print diagnostic information

Console output

Population Structure Control:

Function

Purpose

Requirements

calculate_grm_gcta()

Calculate GRM using GCTA

GCTA

load_grm_gcta()

Load GRM from GCTA files

-

calculate_pca_plink()

Calculate PCs using PLINK2

PLINK2

attach_pcs_to_phenotype()

Merge PCs with phenotype

-

identify_related_samples()

Find related sample pairs

-

filter_related_samples()

Remove related samples

-

Data Processing:

Function

Purpose

stratified_train_test_split()

Split data into train/test sets with stratification

impute_covariates()

Impute missing covariate values

Statistical Functions:

Function

Purpose

calculate_genomic_inflation()

Calculate genomic inflation factor (λ)

additive_gwas()

Standard additive GWAS for comparison

cross_validated_edge_analysis()

K-fold cross-validation for EDGE

Complete Workflow Example

from edge_gwas import EDGEAnalysis
from edge_gwas.utils import (
    load_plink_data,
    prepare_phenotype_data,
    apply_standard_qc,
    validate_and_align_data,
    calculate_grm_gcta,
    load_grm_gcta,
    calculate_pca_plink,
    attach_pcs_to_phenotype,
    stratified_train_test_split
)
from edge_gwas.visualize import manhattan_plot, qq_plot

# 1. Load data
geno, info = load_plink_data('data.bed', 'data.bim', 'data.fam')
pheno = prepare_phenotype_data(
    'pheno.txt',
    outcome_col='disease',
    covariate_cols=['age', 'sex']
)

# 2. QC and alignment
geno_qc, pheno_qc = apply_standard_qc(geno, pheno)
geno_qc, pheno_qc = validate_and_align_data(geno_qc, pheno_qc)

# 3. Population structure
grm_prefix = calculate_grm_gcta('data', output_prefix='grm')
grm_matrix, grm_ids = load_grm_gcta('grm')
pca = calculate_pca_plink('data', n_pcs=10)
pheno_qc = attach_pcs_to_phenotype(pheno_qc, pca, n_pcs=10)

# 4. Split data
train_g, test_g, train_p, test_p = stratified_train_test_split(
    geno_qc, pheno_qc, outcome_col='disease', test_size=0.5
)

# 5. EDGE analysis
edge = EDGEAnalysis(outcome_type='binary')
covariates = ['age', 'sex'] + [f'PC{i}' for i in range(1, 11)]

alpha_df, gwas_df = edge.run_full_analysis(
    train_g, train_p, test_g, test_p,
    outcome='disease',
    covariates=covariates,
    grm_matrix=grm_matrix,
    grm_sample_ids=grm_ids,
    output_prefix='results/edge'
)

# 6. Visualize
manhattan_plot(gwas_df, output='manhattan.png')
qq_plot(gwas_df, output='qq.png')

Migration Guide (v0.1.0 → v0.1.1)

Key Changes:

  1. New QC function - Use filter_genotype_data() for comprehensive filtering:

    # New unified approach
    geno_qc, pheno_qc = filter_genotype_data(
        geno, pheno,
        min_maf=0.01,
        max_missing_per_variant=0.05,
        min_call_rate_per_sample=0.95
    )
    
  2. Data validation - New encoding validation and auto-fix:

    # Validate and fix minor allele encoding
    geno_fixed, info_fixed, report = validate_and_fix_encoding(geno, info)
    
  3. LocusZoom output - GWAS results now include LocusZoom-compatible columns:

    gwas_df = edge.apply_alpha(
        test_g, test_p,
        outcome='disease',
        covariates=covariates,
        variant_info=info,
        split_variant_id=True,
        variant_id_pattern='chr:pos:ref:alt'
    )
    
    # Save for LocusZoom
    edge.save_for_locuszoom(gwas_df, 'locuszoom_input.txt')
    
  4. GRM support - Add population structure control:

    grm_matrix, grm_ids = load_grm_gcta('grm_prefix')
    alpha_df = edge.calculate_alpha(
        ...,
        grm_matrix=grm_matrix,
        grm_sample_ids=grm_ids
    )
    

Function Index

By Category:

Core Analysis: - EDGEAnalysis (class) → calculate_alpha(), apply_alpha(), run_full_analysis()

Data Loading: - load_plink_data(), load_pgen_data(), load_vcf_data(), load_bgen_data(), prepare_phenotype_data()

Quality Control: - filter_genotype_data(), apply_standard_qc(), validate_genotype_df(), validate_and_fix_encoding(), validate_and_align_data(), diagnose_genotype_data()

Population Structure: - calculate_grm_gcta(), load_grm_gcta(), calculate_pca_plink(), attach_pcs_to_phenotype(), identify_related_samples(), filter_related_samples()

Statistical: - calculate_genomic_inflation(), additive_gwas(), cross_validated_edge_analysis()

Visualization: - manhattan_plot(), qq_plot(), plot_alpha_distribution()

Quick Function Finder

“I want to…”

  • Load genetic data:

    • PLINK binary (.bed/.bim/.fam): load_plink_data()

    • PLINK2 (.pgen/.pvar/.psam): load_pgen_data()

    • BGEN: load_bgen_data()

    • VCF: load_vcf_data()

  • Control for population structure:

    • Calculate GRM: calculate_grm_gcta()

    • Load GRM: load_grm_gcta()

    • Calculate PCs (unrelated): calculate_pca_plink()

    • Calculate PCs (related): calculate_pca_pcair()

    • Add PCs to phenotype: attach_pcs_to_phenotype()

  • Quality control:

    • Filter by MAF: filter_variants_by_maf()

    • Filter by missingness: filter_variants_by_missing()

    • Filter by HWE: filter_variants_by_hwe()

    • Filter samples: filter_samples_by_call_rate()

    • Check case/control balance: check_case_control_balance()

    • Find related samples: identify_related_samples()

    • Remove related samples: filter_related_samples()

  • Run EDGE analysis:

    • Calculate alpha: calculate_alpha()

    • Apply alpha: apply_alpha()

    • Full workflow: run_full_analysis()

    • Cross-validation: cross_validated_edge_analysis()

  • Compare with standard GWAS:

    • Run additive model: additive_gwas()

  • Visualize results:

    • Manhattan plot: manhattan_plot()

    • QQ plot: qq_plot()

    • Alpha distribution: plot_alpha_distribution()

  • Save/load results:

    • Save results: save_results()

    • Load alpha values: load_alpha_values()

    • Create summary: create_summary_report()

See Also

Documentation:

Last updated: 2025-12-25 for edge-gwas v0.1.1

For questions or issues, visit: https://github.com/nicenzhou/edge-gwas/issues