API Reference

Quick Function Finder

“I want to…”

Get started with test data:
- Download test files: download_test_files()
Load genetic data:
- PLINK binary (.bed/.bim/.fam): load_plink_data()
- PLINK2 (.pgen/.pvar/.psam): load_pgen_data()
- VCF: load_vcf_data()
- BGEN: load_bgen_data()
- Phenotype: prepare_phenotype_data()
Validate and check data quality:
- Validate genotype encoding: validate_genotype_df()
- Fix encoding issues: validate_and_fix_encoding()
- Validate phenotype data: validate_phenotype_df()
- Validate and align datasets: validate_and_align_data()
Quality control:
- Comprehensive QC: filter_genotype_data()
- Filter by MAF: filter_variants_by_maf()
- Filter by missingness: filter_variants_by_missing()
- Filter by HWE: filter_variants_by_hwe()
- Calculate HWE p-values: calculate_hwe_pvalues()
- Filter samples: filter_samples_by_call_rate()
- Check case/control balance: check_case_control_balance()
Control for population structure:
- Calculate GRM: calculate_grm_gcta()
- Load GRM: load_grm_gcta()
- Calculate PCs (basic): calculate_pca_sklearn()
- Calculate PCs (with PLINK2): calculate_pca_plink()
- Calculate PCs (for related samples): calculate_pca_pcair()
- Add PCs to phenotype: attach_pcs_to_phenotype()
- Get PC covariate names: get_pc_covariate_list()
- Find related samples: identify_related_samples()
- Remove related samples: filter_related_samples()
Prepare data for analysis:
- Split train/test: stratified_train_test_split()
- Impute missing covariates: impute_covariates()
Run EDGE analysis:
- Initialize EDGE: EDGEAnalysis
- Calculate alpha: EDGEAnalysis.calculate_alpha()
- Apply alpha: EDGEAnalysis.apply_alpha()
- Full workflow: EDGEAnalysis.run_full_analysis()
- Cross-validation: cross_validated_edge_analysis()
- Check failed SNPs: EDGEAnalysis.get_skipped_snps()
Compare with standard GWAS:
- Run additive model: standard_gwas() / additive_gwas()
Visualize results:
- Manhattan plot: manhattan_plot()
- QQ plot: qq_plot()
- Alpha distribution: plot_alpha_distribution()
Save and format results:
- Save results: save_results()
- Load alpha values: load_alpha_values()
- Format for publication: format_gwas_output_for_locuszoom()
- Save for LocusZoom: save_for_locuszoom()
- Validate LocusZoom format: validate_locuszoom_format()
- Create summary report: create_summary_report()

Core Analysis

class EDGEAnalysis(outcome_type='binary', outcome_transform=None, ols_method='bfgs', n_jobs=-1, max_iter=1000, verbose=True)

Main class for EDGE GWAS analysis.

Parameters:

outcome_type (str) – Type of outcome - ‘binary’ for logistic regression or ‘continuous’ for linear regression
outcome_transform (str, optional) – Transformation for continuous outcomes. Options: None, ‘log’, ‘log10’, ‘inverse_normal’, ‘rank_inverse_normal’
ols_method (str) – Optimization method for OLS regression. Options: ‘newton’, ‘bfgs’, ‘lbfgs’, ‘nm’, ‘cg’, ‘ncg’, ‘powell’, ‘basinhopping’
n_jobs (int) – Number of parallel jobs (-1 uses all available cores)
max_iter (int) – Maximum iterations for model convergence
verbose (bool) – Print progress information

calculate_alpha(genotype_data, phenotype_df, outcome, covariates, variant_info=None, grm_matrix=None, grm_sample_ids=None, mean_centered=False, use_fast_approximation=True)

Calculate EDGE alpha values from training data.

Parameters:

genotype_data (pd.DataFrame) – Genotype data with samples as index and variants as columns (0/1/2 encoding)
phenotype_df (pd.DataFrame) – Phenotype data with sample IDs as index
outcome (str) – Name of outcome variable in phenotype_df
covariates (list) – List of covariate names in phenotype_df
variant_info (pd.DataFrame, optional) – Variant information with variant_id as index
grm_matrix (np.ndarray, optional) – GRM matrix from GCTA (for population structure control)
grm_sample_ids (pd.DataFrame, optional) – DataFrame with FID, IID, and sample_id corresponding to GRM rows
mean_centered (bool) – If True, use mean-centered codominant model without intercept
use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models

Returns:

DataFrame with alpha values for each variant

Return type:

pd.DataFrame

apply_alpha(genotype_data, phenotype_df, outcome, covariates, alpha_values=None, grm_matrix=None, grm_sample_ids=None, variant_info=None, use_fast_approximation=True)

Apply EDGE alpha values to test data and perform GWAS.

Parameters:

genotype_data (pd.DataFrame) – Genotype data with samples as index and variants as columns (0/1/2 encoding)
phenotype_df (pd.DataFrame) – Phenotype data with sample IDs as index
outcome (str) – Name of outcome variable in phenotype_df
covariates (list) – List of covariate names in phenotype_df
alpha_values (pd.DataFrame, optional) – DataFrame with alpha values (from calculate_alpha). If None, uses self.alpha_values
grm_matrix (np.ndarray, optional) – GRM matrix from GCTA
grm_sample_ids (pd.DataFrame, optional) – DataFrame with FID, IID, and sample_id corresponding to GRM rows
variant_info (pd.DataFrame, optional) – Variant information DataFrame
use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models

Returns:

DataFrame with GWAS results

Return type:

pd.DataFrame
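This reference does not spell out what applying alpha does to the genotypes. The sketch below assumes the encoding described in the EDGE method paper, where reference homozygotes stay at 0, heterozygotes are coded as alpha, and alternate homozygotes are coded as 1. The function name edge_encode and the list-based input are hypothetical and are not part of the library.

```python
def edge_encode(calls, alpha):
    """Map 0/1/2 genotype calls to 0 / alpha / 1 (assumed EDGE encoding).

    Missing calls (None) are passed through unchanged.
    """
    mapping = {0: 0.0, 1: float(alpha), 2: 1.0}
    return [None if g is None else mapping[g] for g in calls]


# alpha = 0.5 reproduces a rescaled additive coding (0, 0.5, 1).
encoded_additive = edge_encode([0, 1, 2], alpha=0.5)

# A small alpha places heterozygotes close to the reference homozygotes.
encoded_small_alpha = edge_encode([0, 1, 2, None, 1], alpha=0.1)
```

Under this assumption, an alpha near 0.5 behaves like the additive model, which is why comparing against standard_gwas() is informative.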

run_full_analysis(train_genotype, train_phenotype, test_genotype, test_phenotype, outcome, covariates, variant_info=None, grm_matrix=None, grm_sample_ids=None, mean_centered=False, use_fast_approximation=True, output_prefix=None)

Run complete EDGE analysis: calculate alpha on training data, apply alpha on test data.

Parameters:

train_genotype (pd.DataFrame) – Training genotype data
train_phenotype (pd.DataFrame) – Training phenotype data
test_genotype (pd.DataFrame) – Test genotype data
test_phenotype (pd.DataFrame) – Test phenotype data
outcome (str) – Name of outcome variable
covariates (list) – List of covariate names
variant_info (pd.DataFrame, optional) – Variant information DataFrame
grm_matrix (np.ndarray, optional) – GRM matrix from GCTA
grm_sample_ids (pd.DataFrame, optional) – Sample IDs for GRM
mean_centered (bool) – If True, use mean-centered model without intercept
use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models
output_prefix (str, optional) – Prefix for output files

Returns:

Tuple of (alpha_df, gwas_df)

Return type:

tuple

get_skipped_snps()

Get list of SNPs that were skipped due to convergence issues.

Returns:

List of skipped SNP IDs

Return type:

list

Data Loading

load_plink_data(bed_file, bim_file, fam_file, minor_allele_as_alt=True, verbose=True)

Load PLINK binary format data (.bed/.bim/.fam).

Parameters:

bed_file (str) – Path to .bed file
bim_file (str) – Path to .bim file
fam_file (str) – Path to .fam file
minor_allele_as_alt (bool) – If True, ensure minor allele is coded as ALT (2)
verbose (bool) – Print loading information

Returns:

Tuple of (genotype_df, variant_info_df)

Return type:

tuple

load_pgen_data(pgen_file, pvar_file, psam_file, minor_allele_as_alt=True, verbose=True)

Load PLINK 2 binary format data (.pgen/.pvar/.psam).

Parameters:

pgen_file (str) – Path to .pgen file
pvar_file (str) – Path to .pvar file
psam_file (str) – Path to .psam file
minor_allele_as_alt (bool) – If True, ensure minor allele is coded as ALT (2)
verbose (bool) – Print loading information

Returns:

Tuple of (genotype_df, variant_info_df)

Return type:

tuple

Note

Requires pgenlib package: pip install pgenlib

load_vcf_data(vcf_file, dosage=True, minor_allele_as_alt=True, verbose=True)

Load VCF format data.

Parameters:

vcf_file (str) – Path to .vcf or .vcf.gz file
dosage (bool) – If True, use dosages (DS field); if False, use hard calls (GT field)
minor_allele_as_alt (bool) – If True, ensure minor allele is coded as ALT (2)
verbose (bool) – Print loading information

Returns:

Tuple of (genotype_df, variant_info_df)

Return type:

tuple

Note

Requires cyvcf2 package: pip install cyvcf2

load_bgen_data(bgen_file, sample_file=None, minor_allele_as_alt=True, verbose=True)

Load BGEN format data.

Parameters:

bgen_file (str) – Path to .bgen file
sample_file (str, optional) – Path to .sample file (may be omitted when sample IDs are embedded in the BGEN file)
minor_allele_as_alt (bool) – If True, ensure minor allele is coded as ALT (2)
verbose (bool) – Print loading information

Returns:

Tuple of (genotype_df, variant_info_df) - genotypes are dosages

Return type:

tuple

Note

Requires bgen_reader package: pip install bgen-reader

prepare_phenotype_data(phenotype_file, outcome_col, covariate_cols, sample_id_col='IID', sep='\t', log_transform_outcome=False)

Load and prepare phenotype data.

Parameters:

phenotype_file (str) – Path to phenotype file
outcome_col (str) – Name of outcome column
covariate_cols (list) – List of covariate column names
sample_id_col (str) – Name of sample ID column (will become index)
sep (str) – File separator (default: tab)
log_transform_outcome (bool) – Apply log10(x+1) transformation to outcome

Returns:

DataFrame with sample IDs as index, outcome and covariates as columns

Return type:

pd.DataFrame

download_test_files(output_dir='tests', version='v0.1.2', overwrite=False, verbose=True)

Download test files from GitHub repository.

Parameters:

output_dir (str) – Directory to save test files
version (str) – GitHub release version tag
overwrite (bool) – If True, overwrite existing files
verbose (bool) – Print download progress

Returns:

Dictionary with download results (downloaded, skipped, failed)

Return type:

dict

Data Validation

validate_genotype_df(genotype_df, variant_info_df=None, name='genotype_df', check_encoding=True, verbose=True, return_details=False)

Validate genotype DataFrame format and encoding.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
variant_info_df (pd.DataFrame, optional) – Variant information DataFrame
name (str) – Name for error messages
check_encoding (bool) – If True, validate encoding (requires variant_info_df)
verbose (bool) – Print validation results
return_details (bool) – If True, return (passed, report_df)

Returns:

None (raises errors if invalid) OR bool (validation passed) OR Tuple[bool, pd.DataFrame] (if return_details=True)

Return type:

None, bool, or tuple

validate_and_fix_encoding(genotype_df, variant_info_df, verbose=True)

Validate and automatically fix genotype encoding.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame
variant_info_df (pd.DataFrame) – Variant info DataFrame
verbose (bool) – Print progress

Returns:

Tuple of (fixed_genotype_df, fixed_variant_info_df, report_df)

Return type:

tuple

validate_phenotype_df(phenotype_df, outcome_col, covariate_cols, name='phenotype_df')

Validate phenotype DataFrame format.

Parameters:

phenotype_df (pd.DataFrame) – Phenotype DataFrame to validate
outcome_col (str) – Name of outcome column
covariate_cols (list) – List of covariate column names
name (str) – Name of the DataFrame for error messages

Raises:

TypeError – If not a pandas DataFrame
ValueError – If required columns are missing or DataFrame is invalid

validate_and_align_data(genotype_df, phenotype_df, outcome_col=None, covariate_cols=None, geno_id_col=None, pheno_id_col=None, keep_only_common=True, verbose=True)

Validate and align genotype and phenotype data by sample IDs.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome_col (str, optional) – Name of outcome column (used for validation)
covariate_cols (list, optional) – List of covariate columns (used for validation)
geno_id_col (str, optional) – Column name for sample IDs in genotype_df (None = use index)
pheno_id_col (str, optional) – Column name for sample IDs in phenotype_df (None = use index)
keep_only_common (bool) – If True, keep only samples present in both datasets
verbose (bool) – Print validation information

Returns:

Tuple of (aligned_genotype_df, aligned_phenotype_df)

Return type:

tuple

Raises:

ValueError – If no common samples found or if keep_only_common=False and samples don’t match
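The alignment step described above comes down to intersecting two sets of sample IDs and keeping both tables in the same order. A minimal plain-Python sketch follows. It uses dictionaries keyed by sample ID instead of DataFrames, and align_by_sample is a hypothetical name, not the library function.

```python
def align_by_sample(genotypes, phenotypes):
    """Restrict two {sample_id: record} mappings to their shared sample IDs.

    The genotype ordering is kept for both outputs, so rows line up.
    """
    common = [s for s in genotypes if s in phenotypes]
    if not common:
        raise ValueError("no common samples between genotype and phenotype data")
    aligned_geno = {s: genotypes[s] for s in common}
    aligned_pheno = {s: phenotypes[s] for s in common}
    return aligned_geno, aligned_pheno


genotypes = {"s1": [0, 1], "s2": [2, 2], "s3": [1, 0]}
phenotypes = {"s3": {"disease": 1}, "s1": {"disease": 0}, "s9": {"disease": 0}}
aligned_geno, aligned_pheno = align_by_sample(genotypes, phenotypes)
```

Here s2 has no phenotype and s9 has no genotype, so only s1 and s3 survive, in genotype order.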

Quality Control

filter_genotype_data(genotype_df, phenotype_df=None, min_maf=None, max_missing_per_variant=None, min_call_rate_per_sample=None, verbose=True)

Comprehensive genotype data filtering with multiple QC criteria.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
phenotype_df (pd.DataFrame, optional) – Phenotype DataFrame (required if filtering samples)
min_maf (float, optional) – Minimum minor allele frequency (e.g., 0.01 for 1%). If None, no MAF filtering
max_missing_per_variant (float, optional) – Maximum missing rate per variant (e.g., 0.1 for 10%). If None, no filtering
min_call_rate_per_sample (float, optional) – Minimum call rate per sample (e.g., 0.95 for 95%). If None, no filtering
verbose (bool) – Print filtering information

Returns:

filtered_genotype_df OR (filtered_genotype_df, filtered_phenotype_df)

Return type:

pd.DataFrame or tuple

filter_variants_by_maf(genotype_df, min_maf=0.01, verbose=True)

Filter variants by minor allele frequency.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame (works with both hard calls and dosages)
min_maf (float) – Minimum minor allele frequency
verbose (bool) – Print filtering information

Returns:

Filtered genotype DataFrame

Return type:

pd.DataFrame
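As an illustration of what a MAF filter computes, the sketch below derives the minor allele frequency from 0/1/2 alt-allele counts and keeps variants at or above the cutoff. It is a plain-Python approximation under stated assumptions: the helper names are hypothetical, missing calls are represented as None, and whether the library uses >= or > at the boundary is not documented here.

```python
def minor_allele_frequency(calls):
    """MAF for one variant from alt-allele counts or dosages; None = missing."""
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    alt_freq = sum(observed) / (2 * len(observed))  # two alleles per sample
    return min(alt_freq, 1 - alt_freq)


def keep_by_maf(genotypes, min_maf=0.01):
    """Keep variants with MAF >= min_maf; genotypes maps variant ID -> calls."""
    return {v: c for v, c in genotypes.items()
            if minor_allele_frequency(c) >= min_maf}


genotypes = {
    "rs1": [0, 1, 2, 1, None],  # alt frequency 4/8
    "rs2": [0, 0, 0, 0, 0],     # monomorphic
    "rs3": [2, 2, 2, 1, 2],     # alt frequency 9/10
}
kept = keep_by_maf(genotypes, min_maf=0.05)
```

Because the frequency is taken as the smaller of the alt and ref frequencies, the result does not depend on which allele is coded as ALT.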

filter_variants_by_missing(genotype_df, max_missing=0.1, verbose=True)

Filter variants by missing genotype rate.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame
max_missing (float) – Maximum proportion of missing genotypes allowed (0-1)
verbose (bool) – Print filtering information

Returns:

Filtered genotype DataFrame

Return type:

pd.DataFrame

filter_samples_by_call_rate(genotype_df, phenotype_df, min_call_rate=0.95, verbose=True)

Filter samples by genotype call rate.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame (samples as index)
phenotype_df (pd.DataFrame) – Phenotype DataFrame (sample IDs as index)
min_call_rate (float) – Minimum call rate (proportion of non-missing genotypes, 0-1)
verbose (bool) – Print filtering information

Returns:

Tuple of (filtered_genotype_df, filtered_phenotype_df)

Return type:

tuple
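Sample-level filtering has to drop the same samples from both tables, which is why this function takes and returns a genotype/phenotype pair. A small sketch of that idea follows, using dictionaries keyed by sample ID. The names call_rate and keep_samples are hypothetical.

```python
def call_rate(calls):
    """Proportion of non-missing genotypes for one sample (None = missing)."""
    return sum(g is not None for g in calls) / len(calls)


def keep_samples(genotypes_by_sample, phenotypes, min_call_rate=0.95):
    """Drop low-call-rate samples from both the genotype and phenotype data."""
    kept = [s for s, calls in genotypes_by_sample.items()
            if call_rate(calls) >= min_call_rate]
    geno = {s: genotypes_by_sample[s] for s in kept}
    pheno = {s: phenotypes[s] for s in kept if s in phenotypes}
    return geno, pheno


genotypes_by_sample = {
    "s1": [0, 1, 2, 1],        # call rate 1.0
    "s2": [None, None, 1, 0],  # call rate 0.5
    "s3": [2, None, 0, 0],     # call rate 0.75
}
phenotypes = {"s1": 1, "s2": 0, "s3": 0}
geno_kept, pheno_kept = keep_samples(genotypes_by_sample, phenotypes, min_call_rate=0.75)
```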

calculate_hwe_pvalues(genotype_df, verbose=True)

Calculate Hardy-Weinberg Equilibrium p-values for each variant.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame
verbose (bool) – Print calculation information

Returns:

Series of HWE p-values for each variant

Return type:

pd.Series
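The reference does not say which HWE test is used. One common choice is the 1-degree-of-freedom chi-square goodness-of-fit test on genotype counts, sketched below with the standard library only. Treat it as an illustration of the quantity being computed, since the library may use an exact test instead. The function name is hypothetical.

```python
import math


def hwe_chi2_pvalue(n_ref_hom, n_het, n_alt_hom):
    """Chi-square (1 df) Hardy-Weinberg test from genotype counts."""
    n = n_ref_hom + n_het + n_alt_hom
    if n == 0:
        return float("nan")
    p = (2 * n_ref_hom + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    if p == 0 or q == 0:
        return 1.0  # monomorphic variant: no departure can be tested
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_ref_hom, n_het, n_alt_hom)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square with 1 df is erfc(sqrt(x / 2)).
    return math.erfc(math.sqrt(chi2 / 2))


p_balanced = hwe_chi2_pvalue(25, 50, 25)  # counts exactly at HWE proportions
p_skewed = hwe_chi2_pvalue(50, 0, 50)     # no heterozygotes at all
```

A variant like the second one would fall far below the default hwe_threshold of 1e-6 used by filter_variants_by_hwe().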

filter_variants_by_hwe(genotype_df, hwe_threshold=1e-6, verbose=True)

Filter variants by Hardy-Weinberg Equilibrium p-value.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame
hwe_threshold (float) – Minimum HWE p-value; variants with a p-value below this threshold are removed
verbose (bool) – Print filtering information

Returns:

Filtered genotype DataFrame

Return type:

pd.DataFrame

check_case_control_balance(phenotype_df, outcome_col, verbose=True)

Check case/control balance in binary outcome.

Parameters:

phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome_col (str) – Name of outcome column
verbose (bool) – Print balance information

Returns:

Dictionary with case_count, control_count, and ratio

Return type:

dict
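A sketch of the summary this function describes, assuming the outcome is coded 0 for controls and 1 for cases and that the ratio is cases divided by controls. Both are assumptions, as the reference does not define them. The function name is hypothetical.

```python
def case_control_balance(outcomes):
    """Count cases (1) and controls (0) in a list; other values are ignored."""
    cases = sum(1 for y in outcomes if y == 1)
    controls = sum(1 for y in outcomes if y == 0)
    ratio = cases / controls if controls else float("inf")
    return {"case_count": cases, "control_count": controls, "ratio": ratio}


balance = case_control_balance([1, 0, 0, 0, 1, None, 0])
```

A strongly unbalanced ratio is worth knowing before a stratified split, because small case counts per fold can make logistic models fail to converge.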

Population Structure Control

calculate_grm_gcta(plink_prefix, output_prefix=None, maf_threshold=0.01, method='grm', max_threads=1, verbose=True)

Calculate genetic relationship matrix (GRM) using GCTA.

Parameters:

plink_prefix (str) – Prefix for PLINK binary files (.bed/.bim/.fam)
output_prefix (str, optional) – Prefix for output GRM files (default: temp directory)
maf_threshold (float) – MAF threshold for variant filtering
method (str) – GRM calculation method (‘grm’ for full, ‘grm-sparse’ for sparse)
max_threads (int) – Maximum number of threads to use
verbose (bool) – Print progress information

Returns:

Path to output GRM prefix

Return type:

str

Note

Requires GCTA to be installed and available in PATH. Download from: https://yanglab.westlake.edu.cn/software/gcta/

load_grm_gcta(grm_prefix, verbose=True)

Load GRM calculated by GCTA.

Parameters:

grm_prefix (str) – Prefix for GRM files (without .grm.bin extension)
verbose (bool) – Print loading information

Returns:

Tuple of (grm_matrix, sample_ids_df)

Return type:

tuple

Raises:

FileNotFoundError – If GRM files are not found

calculate_pca_sklearn(genotype_df, n_pcs=10, verbose=True)

Calculate principal components using scikit-learn (basic PCA without relatedness correction).

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
n_pcs (int) – Number of principal components to calculate
verbose (bool) – Print progress information

Returns:

DataFrame with ‘IID’ as index and PC1, PC2, …, PCn columns

Return type:

pd.DataFrame

Note

This is a basic PCA without correction for relatedness. For PCA that accounts for related samples, use calculate_pca_pcair().

calculate_pca_plink(file_prefix, n_pcs=10, file_format='bfile', output_prefix=None, maf_threshold=0.01, ld_window=50, ld_step=5, ld_r2=0.2, approx=False, verbose=True)

Calculate principal components using PLINK2.

Parameters:

file_prefix (str) – Prefix for input files
n_pcs (int) – Number of principal components to calculate
file_format (str) – Input file format (‘bfile’, ‘pfile’, ‘vcf’, ‘bgen’)
output_prefix (str, optional) – Prefix for output files (default: temp directory)
maf_threshold (float, optional) – MAF threshold for variant filtering (None to skip)
ld_window (int, optional) – Window size for LD pruning in variant count (None to skip)
ld_step (int, optional) – Step size for LD pruning in variant count (None to skip)
ld_r2 (float, optional) – R² threshold for LD pruning (None to skip)
approx (bool) – Use approximate PCA for large cohorts
verbose (bool) – Print progress information

Returns:

DataFrame with IID as index and PC1, PC2, …, PCn columns

Return type:

pd.DataFrame

calculate_pca_pcair(plink_prefix, n_pcs=10, kinship_matrix=None, divergence_matrix=None, output_prefix=None, kin_threshold=0.0884, div_threshold=-0.0884, maf_threshold=0.01, verbose=True)

Calculate PC-AiR (Principal Components - Analysis in Related samples).

Parameters:

plink_prefix (str) – Prefix for PLINK binary files
n_pcs (int) – Number of principal components to calculate
kinship_matrix (str, optional) – Path to kinship matrix (if None, calculates using GCTA)
divergence_matrix (str, optional) – Path to divergence matrix
output_prefix (str, optional) – Prefix for output files (default: temp directory)
kin_threshold (float) – Kinship threshold for defining relatives
div_threshold (float) – Divergence threshold
maf_threshold (float) – MAF threshold for variant filtering
verbose (bool) – Print progress information

Returns:

DataFrame with IID as index and PC1, PC2, …, PCn columns

Return type:

pd.DataFrame

Note

Requires R with GENESIS, SNPRelate, and gdsfmt packages installed.

attach_pcs_to_phenotype(phenotype_df, pca_df, n_pcs=10, pc_prefix='PC', sample_id_col=None, drop_na=False, verbose=True)

Attach principal components to phenotype DataFrame.

Parameters:

phenotype_df (pd.DataFrame) – Phenotype DataFrame (IID as index or column)
pca_df (pd.DataFrame) – PCA DataFrame with IID as index and PC columns
n_pcs (int) – Number of PCs to attach (will use PC1 to PCn)
pc_prefix (str) – Prefix for PC column names
sample_id_col (str, optional) – Column name in phenotype_df to use for matching. If None, uses index
drop_na (bool) – If True, remove samples with missing PCs after merging
verbose (bool) – Print information about merging

Returns:

Phenotype DataFrame with PC columns added

Return type:

pd.DataFrame

Raises:

ValueError – If requested PCs are not available in pca_df

get_pc_covariate_list(n_pcs, pc_prefix='PC')

Generate list of PC covariate names for use in EDGE analysis.

Parameters:

n_pcs (int) – Number of PCs
pc_prefix (str) – Prefix for PC column names

Returns:

List of PC column names [‘PC1’, ‘PC2’, …, ‘PCn’]

Return type:

list
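The naming scheme described above can be sketched in one line of Python. The helper pc_names is a hypothetical stand-in, not the library function, and the "age" and "sex" covariates are illustrative.

```python
def pc_names(n_pcs, pc_prefix="PC"):
    """Return ['PC1', ..., 'PCn'] for the given prefix."""
    return [f"{pc_prefix}{i}" for i in range(1, n_pcs + 1)]


# Typical use: extend a base covariate list with the first three PCs.
covariates = ["age", "sex"] + pc_names(3)
```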

identify_related_samples(grm_matrix, sample_ids, threshold=0.0884, verbose=True)

Identify pairs of related samples based on GRM threshold.

Parameters:

grm_matrix (np.ndarray) – n_samples x n_samples GRM matrix
sample_ids (pd.DataFrame) – DataFrame with sample IDs (from load_grm_gcta)
threshold (float) – Relatedness threshold. In the KING kinship convention, the usual lower bounds are 0.354 (duplicates/monozygotic twins), 0.177 (1st degree), 0.0884 (2nd degree), and 0.0442 (3rd degree)
verbose (bool) – Print summary statistics

Returns:

DataFrame with columns IID1, IID2, kinship (sorted by kinship descending)

Return type:

pd.DataFrame
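Conceptually, finding related pairs means scanning the upper triangle of the GRM for entries above the threshold. The sketch below does this on a nested list. The function name and the tuple output are hypothetical, and the library returns a DataFrame instead.

```python
def related_pairs(grm, sample_ids, threshold=0.0884):
    """Return (id1, id2, value) for off-diagonal entries above threshold.

    Pairs are sorted by relatedness, largest first.
    """
    pairs = []
    for i in range(len(sample_ids)):
        for j in range(i + 1, len(sample_ids)):  # upper triangle only
            if grm[i][j] > threshold:
                pairs.append((sample_ids[i], sample_ids[j], grm[i][j]))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


grm = [
    [1.00, 0.50, 0.01, 0.00],
    [0.50, 1.00, 0.12, 0.02],
    [0.01, 0.12, 1.00, 0.00],
    [0.00, 0.02, 0.00, 1.00],
]
ids = ["s1", "s2", "s3", "s4"]
pairs = related_pairs(grm, ids)
```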

filter_related_samples(phenotype_df, grm_matrix, sample_ids, threshold=0.0884, method='greedy', sample_id_col=None, verbose=True)

Filter out related samples to create an unrelated subset.

Parameters:

phenotype_df (pd.DataFrame) – Phenotype DataFrame
grm_matrix (np.ndarray) – n_samples x n_samples GRM matrix
sample_ids (pd.DataFrame) – DataFrame with sample IDs (from load_grm_gcta)
threshold (float) – Relatedness threshold
method (str) – Method for selecting unrelated samples (‘greedy’ or ‘random’)
sample_id_col (str, optional) – Column name in phenotype_df for sample IDs. If None, uses index
verbose (bool) – Print filtering information

Returns:

Filtered phenotype DataFrame with unrelated samples only

Return type:

pd.DataFrame
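The reference does not define the 'greedy' method. One common greedy strategy, shown here as an assumption, repeatedly removes the sample that still has the most relatives until no related pair remains. The function name and list-based inputs are hypothetical.

```python
def greedy_unrelated(grm, sample_ids, threshold=0.0884):
    """Drop the sample with the most remaining relatives until none are related."""
    n = len(sample_ids)
    relatives = {
        i: {j for j in range(n) if j != i and grm[i][j] > threshold}
        for i in range(n)
    }
    removed = set()
    while True:
        # Relatives still present, counted for each sample still present.
        counts = {i: len(rel - removed)
                  for i, rel in relatives.items() if i not in removed}
        if not counts:
            break
        worst = max(counts, key=counts.get)
        if counts[worst] == 0:
            break  # no related pairs left
        removed.add(worst)
    return [sample_ids[i] for i in range(n) if i not in removed]


grm = [
    [1.00, 0.50, 0.01, 0.00],
    [0.50, 1.00, 0.12, 0.02],
    [0.01, 0.12, 1.00, 0.00],
    [0.00, 0.02, 0.00, 1.00],
]
unrelated = greedy_unrelated(grm, ["s1", "s2", "s3", "s4"])
```

In this example s2 is related to both s1 and s3, so removing s2 alone resolves every related pair and keeps three of the four samples.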

Data Preparation

stratified_train_test_split(genotype_df, phenotype_df, outcome_col, test_size=0.5, random_state=42, is_binary=True, geno_id_col=None, pheno_id_col=None)

Split data into training and test sets with stratification.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)
phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome_col (str) – Name of outcome column for stratification
test_size (float) – Proportion of samples in test set
random_state (int) – Random seed for reproducibility
is_binary (bool) – Whether outcome is binary (enables stratification)
geno_id_col (str, optional) – Column name in genotype_df for sample IDs. If None, uses index
pheno_id_col (str, optional) – Column name in phenotype_df for sample IDs. If None, uses index

Returns:

Tuple of (train_geno, test_geno, train_pheno, test_pheno)

Return type:

tuple

Raises:

ValueError – If no common samples found or stratification fails
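Stratification means splitting each outcome class separately so that train and test keep roughly the same case/control proportion. A minimal sketch with the standard library follows. The function name, the dictionary input, and the rounding rule are assumptions for illustration.

```python
import random


def stratified_split(outcomes, test_size=0.5, random_state=42):
    """Split sample IDs into (train_ids, test_ids), class by class.

    outcomes maps sample ID -> class label.
    """
    rng = random.Random(random_state)
    by_class = {}
    for sample, label in outcomes.items():
        by_class.setdefault(label, []).append(sample)
    train, test = [], []
    for label in sorted(by_class):
        members = sorted(by_class[label])  # fixed order before shuffling
        rng.shuffle(members)
        n_test = round(len(members) * test_size)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


outcomes = {f"s{i}": (1 if i < 4 else 0) for i in range(12)}  # 4 cases, 8 controls
train_ids, test_ids = stratified_split(outcomes, test_size=0.5)
```

With a fixed random_state the same split is produced on every run, which matters when alpha values from the training half are later applied to the test half.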

impute_covariates(phenotype_df, covariate_cols, method='median', drop_na=False, verbose=True)

Impute missing values in covariates.

Parameters:

phenotype_df (pd.DataFrame) – Phenotype DataFrame with covariates
covariate_cols (list) – List of covariate column names to impute
method (str) – Imputation method - ‘drop’, ‘mean’, ‘median’, ‘mode’, ‘knn’, ‘missforest’, ‘mice’
drop_na (bool) – If True, drop rows with missing outcome after imputation
verbose (bool) – Print imputation information

Returns:

DataFrame with imputed covariates

Return type:

pd.DataFrame

Note

For ‘missforest’ and ‘mice’, install: pip install missingpy
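To make the default 'median' method concrete, the sketch below fills missing covariate values with the per-column median of the observed values. It works on a list of dictionaries with None for missing values. The function name is hypothetical, and the more elaborate methods ('knn', 'missforest', 'mice') are not covered.

```python
import statistics


def impute_median(rows, covariate_cols):
    """Return new rows with None in covariate_cols replaced by the column median."""
    medians = {}
    for col in covariate_cols:
        observed = [r[col] for r in rows if r[col] is not None]
        medians[col] = statistics.median(observed)
    return [
        {k: (medians[k] if k in medians and v is None else v) for k, v in r.items()}
        for r in rows
    ]


rows = [
    {"age": 40, "bmi": 22.0},
    {"age": None, "bmi": 30.0},
    {"age": 60, "bmi": None},
    {"age": 50, "bmi": 26.0},
]
imputed = impute_median(rows, ["age", "bmi"])
```

The input rows are left untouched, and a column with no observed values raises statistics.StatisticsError in this sketch.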

Standard GWAS

standard_gwas(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary')

Perform standard additive GWAS for comparison with EDGE.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame
phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome (str) – Name of outcome column
covariates (list) – List of covariate column names
outcome_type (str) – ‘binary’ for logistic regression, ‘continuous’ for linear regression

Returns:

DataFrame with variant_id, coef, pval, std_err

Return type:

pd.DataFrame

additive_gwas(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary')

Alias for standard_gwas(). Perform standard additive GWAS.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame
phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome (str) – Name of outcome column
covariates (list) – List of covariate column names
outcome_type (str) – ‘binary’ for logistic regression, ‘continuous’ for linear regression

Returns:

DataFrame with variant_id, coef, pval, std_err

Return type:

pd.DataFrame

cross_validated_edge_analysis(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary', n_folds=5, n_jobs=8, random_state=42)

Perform k-fold cross-validation for EDGE analysis.

Parameters:

genotype_df (pd.DataFrame) – Genotype DataFrame
phenotype_df (pd.DataFrame) – Phenotype DataFrame
outcome (str) – Name of outcome column
covariates (list) – List of covariate column names
outcome_type (str) – ‘binary’ or ‘continuous’
n_folds (int) – Number of cross-validation folds
n_jobs (int) – Number of parallel jobs for EDGE analysis
random_state (int) – Random seed for reproducibility

Returns:

Tuple of (avg_alpha, meta_gwas_df, combined_alpha, combined_gwas)

Return type:

tuple
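For orientation, k-fold cross-validation partitions the samples into n_folds disjoint test sets and uses the remaining samples of each fold for training. In this setting alpha would be estimated on the training part and applied to the held-out part. The sketch below only shows the index bookkeeping. The function name is hypothetical, and whether the library stratifies folds by outcome is not stated in this reference.

```python
import random


def kfold_indices(n_samples, n_folds=5, random_state=42):
    """Return a list of (train_indices, test_indices) pairs, one per fold."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]  # disjoint test sets
    all_indices = set(indices)
    return [(sorted(all_indices - set(fold)), sorted(fold)) for fold in folds]


splits = kfold_indices(10, n_folds=5)
```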

Visualization

manhattan_plot(gwas_df, output='manhattan.png', title='EDGE GWAS Manhattan Plot', sig_threshold=5e-8, figsize=(14, 6), colors=None)

Create Manhattan plot from EDGE GWAS results.

Parameters:

gwas_df (pd.DataFrame or list) – DataFrame or list of DataFrames with columns ‘chrom’, ‘pos’, ‘pval’
output (str) – Output filename for the plot
title (str) – Plot title
sig_threshold (float) – Genome-wide significance threshold
figsize (tuple) – Figure size as (width, height)
colors (list, optional) – List of two colors for alternating chromosomes

qq_plot(gwas_df, output='qq_plot.png', title='EDGE GWAS QQ Plot', figsize=(8, 8))

Create QQ plot from EDGE GWAS results and calculate genomic inflation factor.

Parameters:

gwas_df (pd.DataFrame or list) – DataFrame or list of DataFrames with column ‘pval’
output (str) – Output filename for the plot
title (str) – Plot title
figsize (tuple) – Figure size as (width, height)

Returns:

Genomic inflation factor (lambda_gc)

Return type:

float
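The genomic inflation factor is conventionally the median of the observed 1-df chi-square statistics divided by the median expected under the null (about 0.455). The sketch below computes it from p-values with the standard library. It illustrates the usual definition and is not taken from the library's source. The function name is hypothetical.

```python
from statistics import NormalDist, median


def genomic_inflation(pvalues):
    """lambda_GC: median observed chi-square (1 df) over its null median."""
    norm = NormalDist()
    # A two-sided p-value p corresponds to z = inv_cdf(p / 2), chi-square = z**2.
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in pvalues if 0 < p <= 1]
    expected_median = norm.inv_cdf(0.25) ** 2  # about 0.4549
    return median(chi2) / expected_median


# Evenly spaced p-values mimic a well-calibrated null distribution.
null_pvalues = [(i + 0.5) / 1000 for i in range(1000)]
lambda_gc = genomic_inflation(null_pvalues)

# Systematically smaller p-values give an inflated value.
lambda_inflated = genomic_inflation([p ** 2 for p in null_pvalues])
```

Values close to 1 suggest well-calibrated test statistics, while values well above 1 often point to uncorrected population structure or relatedness.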

plot_alpha_distribution(alpha_df, output='alpha_distribution.png', bins=50, figsize=(10, 6), xlim=None)

Plot distribution of alpha values.

Parameters:

alpha_df (pd.DataFrame) – DataFrame with ‘alpha_value’ column
output (str) – Output filename
bins (int) – Number of histogram bins
figsize (tuple) – Figure size as (width, height)
xlim (tuple, optional) – Tuple (min, max) for x-axis limits. If None, uses full range

Input/Output

save_results(gwas_df, alpha_df=None, output_prefix='edge_gwas', save_alpha=True)

Save EDGE GWAS results to files.

Parameters:

gwas_df (pd.DataFrame) – GWAS results DataFrame
alpha_df (pd.DataFrame, optional) – Alpha values DataFrame
output_prefix (str) – Prefix for output files
save_alpha (bool) – Whether to save alpha values

Returns:

Dictionary with output file paths

Return type:

dict

load_alpha_values(alpha_file)

Load pre-calculated alpha values.

Parameters:

alpha_file (str) – Path to alpha values file

Returns:

DataFrame with alpha values

Return type:

pd.DataFrame

format_gwas_output_for_locuszoom(gwas_df, include_alpha=True, sort_by='pval', format_for_locuszoom=False)

Format GWAS output for publication/reporting or LocusZoom upload.

Parameters:

gwas_df (pd.DataFrame) – GWAS results DataFrame
include_alpha (bool) – Include alpha-related columns
sort_by (str) – Column to sort by
format_for_locuszoom (bool) – If True, format for LocusZoom upload with required columns

Returns:

Formatted DataFrame

Return type:

pd.DataFrame

Note

For LocusZoom format, the output will be tab-delimited with columns: chrom, pos, ref, alt, pval, beta, se, eaf, and optionally alpha_value. The file should be sorted by chrom and pos, compressed with bgzip, and indexed with tabix for optimal LocusZoom performance.

save_for_locuszoom(gwas_df, output_file, include_alpha=True, compress=True)

Save GWAS results in LocusZoom-compatible format.

Parameters:

gwas_df (pd.DataFrame) – GWAS results DataFrame
output_file (str) – Output file path (will add .gz if compress=True)
include_alpha (bool) – Include alpha_value column
compress (bool) – If True, compress with gzip

Note

For best performance with LocusZoom:

- Compress with bgzip: bgzip output_file.tsv
- Index with tabix: tabix -s 1 -b 2 -e 2 output_file.tsv.gz

validate_locuszoom_format(gwas_df)

Validate that GWAS results meet LocusZoom format requirements.

Parameters:

gwas_df (pd.DataFrame) – GWAS results DataFrame

Returns:

Dictionary with validation results (valid, errors, warnings, info)

Return type:

dict
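A rough sketch of what such a validation might check, returning a dictionary with the same four keys. It treats chrom, pos, ref, alt, and pval (taken from the column list in the format_gwas_output_for_locuszoom() note) as mandatory, which is an assumption. The function name and the specific checks are hypothetical and may differ from the library's.

```python
REQUIRED_COLUMNS = ("chrom", "pos", "ref", "alt", "pval")  # assumed minimum set


def check_locuszoom_rows(rows):
    """Validate a list of row dicts; returns valid/errors/warnings/info."""
    errors, warnings = [], []
    columns = set(rows[0]) if rows else set()
    missing = [c for c in REQUIRED_COLUMNS if c not in columns]
    if missing:
        errors.append("missing columns: " + ", ".join(missing))
    else:
        if any(not 0 <= r["pval"] <= 1 for r in rows):
            errors.append("pval outside [0, 1]")
        # Assumes chrom values are mutually comparable (e.g. all integers).
        keys = [(r["chrom"], r["pos"]) for r in rows]
        if keys != sorted(keys):
            warnings.append("rows are not sorted by chrom and pos")
    return {
        "valid": not errors,
        "errors": errors,
        "warnings": warnings,
        "info": {"n_variants": len(rows)},
    }


rows_ok = [
    {"chrom": 1, "pos": 100, "ref": "A", "alt": "G", "pval": 0.5},
    {"chrom": 1, "pos": 200, "ref": "C", "alt": "T", "pval": 1e-9},
]
report = check_locuszoom_rows(rows_ok)
```

Sorting problems are reported as warnings here and not as errors, since an unsorted file can still be fixed by sorting before bgzip and tabix are run.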