API Reference

Core Analysis

class EDGEAnalysis(outcome_type='binary', outcome_transform=None, ols_method='bfgs', n_jobs=-1, max_iter=1000, verbose=True)

Main class for EDGE GWAS analysis.

Parameters:
  • outcome_type (str) – Type of outcome - ‘binary’ for logistic regression or ‘continuous’ for linear regression

  • outcome_transform (str, optional) – Transformation for continuous outcomes. Options: None, ‘log’, ‘log10’, ‘inverse_normal’, ‘rank_inverse_normal’

  • ols_method (str) – Optimization method for OLS regression. Options: ‘newton’, ‘bfgs’, ‘lbfgs’, ‘nm’, ‘cg’, ‘ncg’, ‘powell’, ‘basinhopping’

  • n_jobs (int) – Number of parallel jobs (-1 uses all available cores)

  • max_iter (int) – Maximum iterations for model convergence

  • verbose (bool) – Print progress information

calculate_alpha(genotype_data, phenotype_df, outcome, covariates, variant_info=None, grm_matrix=None, grm_sample_ids=None, mean_centered=False, use_fast_approximation=True)

Calculate EDGE alpha values from training data.

Parameters:
  • genotype_data (pd.DataFrame) – Genotype data with samples as index and variants as columns (0/1/2 encoding)

  • phenotype_df (pd.DataFrame) – Phenotype data with sample IDs as index

  • outcome (str) – Name of outcome variable in phenotype_df

  • covariates (list) – List of covariate names in phenotype_df

  • variant_info (pd.DataFrame, optional) – Optional variant information with variant_id as index

  • grm_matrix (np.ndarray, optional) – Genetic relationship matrix (GRM) from GCTA, used to control for population structure

  • grm_sample_ids (pd.DataFrame, optional) – DataFrame with FID, IID, and sample_id corresponding to GRM rows

  • mean_centered (bool) – If True, use mean-centered codominant model without intercept

  • use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models

Returns:

DataFrame with alpha values for each variant

Return type:

pd.DataFrame
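
This reference does not define alpha. The sketch below assumes the definition used in the published EDGE method: alpha is the ratio of the heterozygous coefficient to the homozygous-alternate coefficient from a codominant model, and genotypes 0/1/2 are then recoded as 0/alpha/1. The function names `edge_alpha` and `edge_encode` are hypothetical and are not part of the edge-gwas API.

```python
def edge_alpha(beta_het, beta_hom_alt):
    """Ratio of heterozygous to homozygous-alternate effect (assumed EDGE definition)."""
    if beta_hom_alt == 0:
        raise ValueError("beta_hom_alt must be non-zero")
    return beta_het / beta_hom_alt


def edge_encode(genotypes, alpha):
    """Recode 0/1/2 genotype calls as 0/alpha/1."""
    recode = {0: 0.0, 1: alpha, 2: 1.0}
    return [recode[g] for g in genotypes]


# Illustrative coefficients, not real results
alpha = edge_alpha(beta_het=0.375, beta_hom_alt=0.5)
encoded = edge_encode([0, 1, 2, 1], alpha)
print(alpha, encoded)
```

Under this assumption, alpha near 0.5 corresponds to an additive effect, alpha near 1 to a dominant effect, and alpha near 0 to a recessive effect.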

apply_alpha(genotype_data, phenotype_df, outcome, covariates, alpha_values=None, grm_matrix=None, grm_sample_ids=None, variant_info=None, use_fast_approximation=True)

Apply EDGE alpha values to test data and perform GWAS.

Parameters:
  • genotype_data (pd.DataFrame) – Genotype data with samples as index and variants as columns (0/1/2 encoding)

  • phenotype_df (pd.DataFrame) – Phenotype data with sample IDs as index

  • outcome (str) – Name of outcome variable in phenotype_df

  • covariates (list) – List of covariate names in phenotype_df

  • alpha_values (pd.DataFrame, optional) – DataFrame with alpha values (from calculate_alpha). If None, uses self.alpha_values

  • grm_matrix (np.ndarray, optional) – Optional GRM matrix from GCTA

  • grm_sample_ids (pd.DataFrame, optional) – DataFrame with FID, IID, and sample_id corresponding to GRM rows

  • variant_info (pd.DataFrame, optional) – Optional variant information DataFrame

  • use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models

Returns:

DataFrame with GWAS results

Return type:

pd.DataFrame

run_full_analysis(train_genotype, train_phenotype, test_genotype, test_phenotype, outcome, covariates, variant_info=None, grm_matrix=None, grm_sample_ids=None, mean_centered=False, use_fast_approximation=True, output_prefix=None)

Run complete EDGE analysis: calculate alpha on training data, apply alpha on test data.

Parameters:
  • train_genotype (pd.DataFrame) – Training genotype data

  • train_phenotype (pd.DataFrame) – Training phenotype data

  • test_genotype (pd.DataFrame) – Test genotype data

  • test_phenotype (pd.DataFrame) – Test phenotype data

  • outcome (str) – Name of outcome variable

  • covariates (list) – List of covariate names

  • variant_info (pd.DataFrame, optional) – Optional variant information DataFrame

  • grm_matrix (np.ndarray, optional) – Optional GRM matrix from GCTA

  • grm_sample_ids (pd.DataFrame, optional) – Optional sample IDs for GRM

  • mean_centered (bool) – If True, use mean-centered model without intercept

  • use_fast_approximation (bool) – If True, use faster approximation for GRM-based binary models

  • output_prefix (str, optional) – Optional prefix for output files

Returns:

Tuple of (alpha_df, gwas_df)

Return type:

tuple

get_skipped_snps()

Get list of SNPs that were skipped due to convergence issues.

Returns:

List of skipped SNP IDs

Return type:

list

Data Loading

Load PLINK binary format data (.bed/.bim/.fam).

Parameters:
  • bed_file (str) – Path to .bed file

  • bim_file (str) – Path to .bim file

  • fam_file (str) – Path to .fam file

  • minor_allele_as_alt (bool) – If True, ensure minor allele is coded as ALT (2)

  • verbose (bool) – Print loading information

Returns:

Tuple of (genotype_df, variant_info_df)

Return type:

tuple

load_pgen_data(pgen_file, pvar_file, psam_file, minor_allele_as_alt=True, verbose=True)

Load PLINK 2 binary format data (.pgen/.pvar/.psam).

Parameters:
  • pgen_file (str) – Path to .pgen file

  • pvar_file (str) – Path to .pvar file

  • psam_file (str) – Path to .psam file

  • minor_allele_as_alt (bool) – If True, ensure minor allele is coded as ALT (2)

  • verbose (bool) – Print loading information

Returns:

Tuple of (genotype_df, variant_info_df)

Return type:

tuple

Note

Requires pgenlib package: pip install pgenlib

load_vcf_data(vcf_file, dosage=True, minor_allele_as_alt=True, verbose=True)

Load VCF format data.

Parameters:
  • vcf_file (str) – Path to .vcf or .vcf.gz file

  • dosage (bool) – If True, use dosages (DS field); if False, use hard calls (GT field)

  • minor_allele_as_alt (bool) – If True, ensure minor allele is coded as ALT (2)

  • verbose (bool) – Print loading information

Returns:

Tuple of (genotype_df, variant_info_df)

Return type:

tuple

Note

Requires cyvcf2 package: pip install cyvcf2

load_bgen_data(bgen_file, sample_file=None, minor_allele_as_alt=True, verbose=True)

Load BGEN format data.

Parameters:
  • bgen_file (str) – Path to .bgen file

  • sample_file (str, optional) – Path to .sample file (optional, can be embedded in BGEN)

  • minor_allele_as_alt (bool) – If True, ensure minor allele is coded as ALT (2)

  • verbose (bool) – Print loading information

Returns:

Tuple of (genotype_df, variant_info_df) - genotypes are dosages

Return type:

tuple

Note

Requires bgen_reader package: pip install bgen-reader

prepare_phenotype_data(phenotype_file, outcome_col, covariate_cols, sample_id_col='IID', sep='\t', log_transform_outcome=False)

Load and prepare phenotype data.

Parameters:
  • phenotype_file (str) – Path to phenotype file

  • outcome_col (str) – Name of outcome column

  • covariate_cols (list) – List of covariate column names

  • sample_id_col (str) – Name of sample ID column (will become index)

  • sep (str) – File separator

  • log_transform_outcome (bool) – Apply log10(x+1) transformation to outcome

Returns:

DataFrame with sample IDs as index, outcome and covariates as columns

Return type:

pd.DataFrame
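
To illustrate what the parameters above describe, here is a minimal standard-library sketch that reads a tab-separated phenotype table, keys it by the sample ID column, and applies the log10(x+1) transformation. `read_phenotypes` is a hypothetical helper that returns plain dictionaries rather than a DataFrame; it is not the library's implementation.

```python
import csv
import io
import math


def read_phenotypes(handle, outcome_col, covariate_cols,
                    sample_id_col="IID", sep="\t", log_transform_outcome=False):
    """Return {sample_id: {column: value}} for the outcome and covariates."""
    rows = {}
    for record in csv.DictReader(handle, delimiter=sep):
        outcome = float(record[outcome_col])
        if log_transform_outcome:
            outcome = math.log10(outcome + 1)  # log10(x+1), as documented above
        row = {outcome_col: outcome}
        for name in covariate_cols:
            row[name] = float(record[name])
        rows[record[sample_id_col]] = row
    return rows


# Made-up two-sample file held in memory
example = io.StringIO("IID\tbmi\tage\nS1\t9\t40\nS2\t99\t55\n")
phenotypes = read_phenotypes(example, "bmi", ["age"], log_transform_outcome=True)
print(phenotypes)
```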

download_test_files(output_dir='tests', version='v0.1.2', overwrite=False, verbose=True)

Download test files from GitHub repository.

Parameters:
  • output_dir (str) – Directory to save test files

  • version (str) – GitHub release version tag

  • overwrite (bool) – If True, overwrite existing files

  • verbose (bool) – Print download progress

Returns:

Dictionary with download results (downloaded, skipped, failed)

Return type:

dict

Data Validation

validate_genotype_df(genotype_df, variant_info_df=None, name='genotype_df', check_encoding=True, verbose=True, return_details=False)

Validate genotype DataFrame format and encoding.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)

  • variant_info_df (pd.DataFrame, optional) – Optional variant information DataFrame

  • name (str) – Name for error messages

  • check_encoding (bool) – If True, validate encoding (requires variant_info_df)

  • verbose (bool) – Print validation results

  • return_details (bool) – If True, return (passed, report_df)

Returns:

None (raises errors if invalid) OR bool (validation passed) OR Tuple[bool, pd.DataFrame] (if return_details=True)

Return type:

None, bool, or tuple

validate_and_fix_encoding(genotype_df, variant_info_df, verbose=True)

Validate and automatically fix genotype encoding.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame

  • variant_info_df (pd.DataFrame) – Variant info DataFrame

  • verbose (bool) – Print progress

Returns:

Tuple of (fixed_genotype_df, fixed_variant_info_df, report_df)

Return type:

tuple

validate_phenotype_df(phenotype_df, outcome_col, covariate_cols, name='phenotype_df')

Validate phenotype DataFrame format.

Parameters:
  • phenotype_df (pd.DataFrame) – Phenotype DataFrame to validate

  • outcome_col (str) – Name of outcome column

  • covariate_cols (list) – List of covariate column names

  • name (str) – Name of the DataFrame for error messages

Raises:
  • TypeError – If not a pandas DataFrame

  • ValueError – If required columns are missing or DataFrame is invalid

validate_and_align_data(genotype_df, phenotype_df, outcome_col=None, covariate_cols=None, geno_id_col=None, pheno_id_col=None, keep_only_common=True, verbose=True)

Validate and align genotype and phenotype data by sample IDs.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)

  • phenotype_df (pd.DataFrame) – Phenotype DataFrame

  • outcome_col (str, optional) – Name of outcome column (optional, for validation)

  • covariate_cols (list, optional) – List of covariate columns (optional, for validation)

  • geno_id_col (str, optional) – Column name for sample IDs in genotype_df (None = use index)

  • pheno_id_col (str, optional) – Column name for sample IDs in phenotype_df (None = use index)

  • keep_only_common (bool) – If True, keep only samples present in both datasets

  • verbose (bool) – Print validation information

Returns:

Tuple of (aligned_genotype_df, aligned_phenotype_df)

Return type:

tuple

Raises:

ValueError – If no common samples found or if keep_only_common=False and samples don’t match
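
The alignment step can be pictured as an index intersection. The following pandas sketch assumes both DataFrames already use sample IDs as their index; `align_by_sample_id` is a hypothetical stand-in and does not reproduce the library's validation or its `keep_only_common=False` behavior.

```python
import pandas as pd


def align_by_sample_id(genotype_df, phenotype_df):
    """Keep only samples present in both frames, in the same order."""
    common = genotype_df.index.intersection(phenotype_df.index)
    if common.empty:
        raise ValueError("No common samples between genotype and phenotype data")
    return genotype_df.loc[common], phenotype_df.loc[common]


# Made-up data: S2 lacks a phenotype, S4 lacks genotypes
genotypes = pd.DataFrame({"rs1": [0, 1, 2], "rs2": [2, 1, 0]}, index=["S1", "S2", "S3"])
phenotypes = pd.DataFrame({"disease": [1, 0, 1]}, index=["S3", "S1", "S4"])
aligned_geno, aligned_pheno = align_by_sample_id(genotypes, phenotypes)
print(aligned_geno.index.tolist())
```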

Quality Control

filter_genotype_data(genotype_df, phenotype_df=None, min_maf=None, max_missing_per_variant=None, min_call_rate_per_sample=None, verbose=True)

Comprehensive genotype data filtering with multiple QC criteria.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)

  • phenotype_df (pd.DataFrame, optional) – Optional phenotype DataFrame (required if filtering samples)

  • min_maf (float, optional) – Minimum minor allele frequency (e.g., 0.01 for 1%). If None, no MAF filtering

  • max_missing_per_variant (float, optional) – Maximum missing rate per variant (e.g., 0.1 for 10%). If None, no filtering

  • min_call_rate_per_sample (float, optional) – Minimum call rate per sample (e.g., 0.95 for 95%). If None, no filtering

  • verbose (bool) – Print filtering information

Returns:

filtered_genotype_df OR (filtered_genotype_df, filtered_phenotype_df)

Return type:

pd.DataFrame or tuple

filter_variants_by_maf(genotype_df, min_maf=0.01, verbose=True)

Filter variants by minor allele frequency.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame (works with both hard calls and dosages)

  • min_maf (float) – Minimum minor allele frequency

  • verbose (bool) – Print filtering information

Returns:

Filtered genotype DataFrame

Return type:

pd.DataFrame
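
For readers unfamiliar with the calculation, a MAF filter on 0/1/2 data can be sketched as follows. This assumes diploid genotypes counted as copies of the ALT allele and ignores missing values; the helper names are hypothetical and the library may compute MAF differently.

```python
import pandas as pd


def minor_allele_frequency(genotype_df):
    """Per-variant MAF from ALT-allele counts or dosages (samples x variants)."""
    alt_freq = genotype_df.mean(axis=0, skipna=True) / 2
    # The minor allele is whichever allele has frequency <= 0.5
    return alt_freq.where(alt_freq <= 0.5, 1 - alt_freq)


def keep_common_variants(genotype_df, min_maf=0.01):
    maf = minor_allele_frequency(genotype_df)
    return genotype_df.loc[:, maf >= min_maf]


# Made-up data: rs2 is monomorphic, rs3 has one missing call
genotypes = pd.DataFrame(
    {"rs1": [0, 1, 2, 1], "rs2": [0, 0, 0, 0], "rs3": [2, 2, 1, float("nan")]},
    index=["S1", "S2", "S3", "S4"],
)
maf = minor_allele_frequency(genotypes)
filtered = keep_common_variants(genotypes, min_maf=0.01)
print(filtered.columns.tolist())
```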

filter_variants_by_missing(genotype_df, max_missing=0.1, verbose=True)

Filter variants by missing genotype rate.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame

  • max_missing (float) – Maximum proportion of missing genotypes allowed (0-1)

  • verbose (bool) – Print filtering information

Returns:

Filtered genotype DataFrame

Return type:

pd.DataFrame

filter_samples_by_call_rate(genotype_df, phenotype_df, min_call_rate=0.95, verbose=True)

Filter samples by genotype call rate.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame (samples as index)

  • phenotype_df (pd.DataFrame) – Phenotype DataFrame (sample IDs as index)

  • min_call_rate (float) – Minimum call rate (proportion of non-missing genotypes, 0-1)

  • verbose (bool) – Print filtering information

Returns:

Tuple of (filtered_genotype_df, filtered_phenotype_df)

Return type:

tuple
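
The two missingness filters above rest on simple proportions: the missing rate is taken per variant (column) and the call rate per sample (row). The pandas sketch below shows both on made-up data, assuming missing genotypes are stored as NaN; the thresholds are illustrative and do not match the documented defaults.

```python
import pandas as pd

nan = float("nan")
genotypes = pd.DataFrame(
    {"rs1": [0, 1, nan, 2], "rs2": [nan, 1, nan, 0], "rs3": [1, 1, nan, 0]},
    index=["S1", "S2", "S3", "S4"],
)

# Proportion of missing genotypes per variant (column-wise)
missing_per_variant = genotypes.isna().mean(axis=0)
# Proportion of non-missing genotypes per sample (row-wise)
call_rate_per_sample = 1 - genotypes.isna().mean(axis=1)

kept_variants = missing_per_variant[missing_per_variant <= 0.25].index
kept_samples = call_rate_per_sample[call_rate_per_sample >= 0.6].index
filtered = genotypes.loc[kept_samples, kept_variants]
print(filtered)
```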

calculate_hwe_pvalues(genotype_df, verbose=True)

Calculate Hardy-Weinberg Equilibrium p-values for each variant.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame

  • verbose (bool) – Print calculation information

Returns:

Series of HWE p-values for each variant

Return type:

pd.Series
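
The reference does not say which HWE test is used. As one common possibility, the sketch below implements the 1-degree-of-freedom chi-square goodness-of-fit test from genotype counts; an exact test is another frequent choice, so treat this as an illustration rather than the library's method. `hwe_chi_square_pvalue` is a hypothetical name.

```python
import math


def hwe_chi_square_pvalue(n_hom_ref, n_het, n_hom_alt):
    """Chi-square (1 df) test of Hardy-Weinberg proportions."""
    n = n_hom_ref + n_het + n_hom_alt
    if n == 0:
        return float("nan")
    p = (2 * n_hom_ref + n_het) / (2 * n)  # reference allele frequency
    q = 1 - p
    observed = (n_hom_ref, n_het, n_hom_alt)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    if min(expected) == 0:
        return 1.0  # monomorphic variant: nothing to test (a convention chosen here)
    stat = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    # Survival function of a chi-square variable with 1 df
    return math.erfc(math.sqrt(stat / 2))


in_equilibrium = hwe_chi_square_pvalue(25, 50, 25)
no_heterozygotes = hwe_chi_square_pvalue(50, 0, 50)
print(in_equilibrium, no_heterozygotes)
```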

filter_variants_by_hwe(genotype_df, hwe_threshold=1e-6, verbose=True)

Filter variants by Hardy-Weinberg Equilibrium p-value.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame

  • hwe_threshold (float) – Minimum HWE p-value; variants with a p-value below this threshold are removed

  • verbose (bool) – Print filtering information

Returns:

Filtered genotype DataFrame

Return type:

pd.DataFrame

check_case_control_balance(phenotype_df, outcome_col, verbose=True)

Check case/control balance in binary outcome.

Parameters:
  • phenotype_df (pd.DataFrame) – Phenotype DataFrame

  • outcome_col (str) – Name of outcome column

  • verbose (bool) – Print balance information

Returns:

Dictionary with case_count, control_count, and ratio

Return type:

dict

Population Structure Control

calculate_grm_gcta(plink_prefix, output_prefix=None, maf_threshold=0.01, method='grm', max_threads=1, verbose=True)

Calculate genetic relationship matrix (GRM) using GCTA.

Parameters:
  • plink_prefix (str) – Prefix for PLINK binary files (.bed/.bim/.fam)

  • output_prefix (str, optional) – Prefix for output GRM files (default: temp directory)

  • maf_threshold (float) – MAF threshold for variant filtering

  • method (str) – GRM calculation method (‘grm’ for full, ‘grm-sparse’ for sparse)

  • max_threads (int) – Maximum number of threads to use

  • verbose (bool) – Print progress information

Returns:

Path to output GRM prefix

Return type:

str

Note

Requires GCTA to be installed and available in PATH. Download from: https://yanglab.westlake.edu.cn/software/gcta/

load_grm_gcta(grm_prefix, verbose=True)

Load GRM calculated by GCTA.

Parameters:
  • grm_prefix (str) – Prefix for GRM files (without .grm.bin extension)

  • verbose (bool) – Print loading information

Returns:

Tuple of (grm_matrix, sample_ids_df)

Return type:

tuple

Raises:

FileNotFoundError – If GRM files are not found
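
For orientation, GCTA's binary GRM output stores the lower triangle of the matrix (diagonal included) as 4-byte floats in `<prefix>.grm.bin`, with one FID/IID pair per line in `<prefix>.grm.id`. The standard-library sketch below reads such a pair of files into a full symmetric matrix. `read_gcta_grm` is a hypothetical helper, little-endian byte order is assumed, and the demo files are written on the fly with made-up values.

```python
import os
import struct
import tempfile


def read_gcta_grm(grm_prefix):
    """Read <prefix>.grm.id and <prefix>.grm.bin into (matrix, ids)."""
    with open(grm_prefix + ".grm.id") as handle:
        ids = [tuple(line.split()[:2]) for line in handle if line.strip()]
    n = len(ids)
    n_values = n * (n + 1) // 2  # lower triangle including the diagonal
    with open(grm_prefix + ".grm.bin", "rb") as handle:
        values = struct.unpack(f"<{n_values}f", handle.read(4 * n_values))
    matrix = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1):
            matrix[i][j] = matrix[j][i] = values[k]
            k += 1
    return matrix, ids


# Round-trip demo with a made-up 3-sample GRM
lower_triangle = [1.0, 0.25, 1.0, 0.0, 0.5, 1.0]
with tempfile.TemporaryDirectory() as tmp:
    prefix = os.path.join(tmp, "demo")
    with open(prefix + ".grm.id", "w") as handle:
        handle.write("F1\tS1\nF2\tS2\nF3\tS3\n")
    with open(prefix + ".grm.bin", "wb") as handle:
        handle.write(struct.pack("<6f", *lower_triangle))
    grm, grm_ids = read_gcta_grm(prefix)
print(grm, grm_ids)
```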

calculate_pca_sklearn(genotype_df, n_pcs=10, verbose=True)

Calculate principal components using scikit-learn (basic PCA without relatedness correction).

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)

  • n_pcs (int) – Number of principal components to calculate

  • verbose (bool) – Print progress information

Returns:

DataFrame with ‘IID’ as index and PC1, PC2, …, PCn columns

Return type:

pd.DataFrame

Note

This is a basic PCA without correction for relatedness. For more robust PCA accounting for relatedness, use calculate_pca_plink().

Calculate principal components using PLINK2.

Parameters:
  • file_prefix (str) – Prefix for input files

  • n_pcs (int) – Number of principal components to calculate

  • file_format (str) – Input file format (‘bfile’, ‘pfile’, ‘vcf’, ‘bgen’)

  • output_prefix (str, optional) – Prefix for output files (default: temp directory)

  • maf_threshold (float, optional) – MAF threshold for variant filtering (None to skip)

  • ld_window (int, optional) – Window size for LD pruning in variant count (None to skip)

  • ld_step (int, optional) – Step size for LD pruning in variant count (None to skip)

  • ld_r2 (float, optional) – R² threshold for LD pruning (None to skip)

  • approx (bool) – Use approximate PCA for large cohorts

  • verbose (bool) – Print progress information

Returns:

DataFrame with IID as index and PC1, PC2, …, PCn columns

Return type:

pd.DataFrame

calculate_pca_pcair(plink_prefix, n_pcs=10, kinship_matrix=None, divergence_matrix=None, output_prefix=None, kin_threshold=0.0884, div_threshold=-0.0884, maf_threshold=0.01, verbose=True)

Calculate PC-AiR (Principal Components - Analysis in Related samples).

Parameters:
  • plink_prefix (str) – Prefix for PLINK binary files

  • n_pcs (int) – Number of principal components to calculate

  • kinship_matrix (str, optional) – Path to kinship matrix (if None, calculates using GCTA)

  • divergence_matrix (str, optional) – Path to divergence matrix (optional)

  • output_prefix (str, optional) – Prefix for output files (default: temp directory)

  • kin_threshold (float) – Kinship threshold for defining relatives

  • div_threshold (float) – Divergence threshold

  • maf_threshold (float) – MAF threshold for variant filtering

  • verbose (bool) – Print progress information

Returns:

DataFrame with IID as index and PC1, PC2, …, PCn columns

Return type:

pd.DataFrame

Note

Requires R with GENESIS, SNPRelate, and gdsfmt packages installed.

attach_pcs_to_phenotype(phenotype_df, pca_df, n_pcs=10, pc_prefix='PC', sample_id_col=None, drop_na=False, verbose=True)

Attach principal components to phenotype DataFrame.

Parameters:
  • phenotype_df (pd.DataFrame) – Phenotype DataFrame (IID as index or column)

  • pca_df (pd.DataFrame) – PCA DataFrame with IID as index and PC columns

  • n_pcs (int) – Number of PCs to attach (will use PC1 to PCn)

  • pc_prefix (str) – Prefix for PC column names

  • sample_id_col (str, optional) – Column name in phenotype_df to use for matching. If None, uses index

  • drop_na (bool) – If True, remove samples with missing PCs after merging

  • verbose (bool) – Print information about merging

Returns:

Phenotype DataFrame with PC columns added

Return type:

pd.DataFrame

Raises:

ValueError – If requested PCs are not available in pca_df

get_pc_covariate_list(n_pcs, pc_prefix='PC')

Generate list of PC covariate names for use in EDGE analysis.

Parameters:
  • n_pcs (int) – Number of PCs

  • pc_prefix (str) – Prefix for PC column names

Returns:

List of PC column names [‘PC1’, ‘PC2’, …, ‘PCn’]

Return type:

list
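
The two PC helpers above amount to generating column names and joining on the sample index. A minimal pandas sketch of that idea follows; `pc_names` and `attach_pcs` are hypothetical stand-ins that mirror the documented parameters, not the library's code.

```python
import pandas as pd


def pc_names(n_pcs, pc_prefix="PC"):
    """Column names PC1..PCn."""
    return [f"{pc_prefix}{i}" for i in range(1, n_pcs + 1)]


def attach_pcs(phenotype_df, pca_df, n_pcs, pc_prefix="PC", drop_na=False):
    wanted = pc_names(n_pcs, pc_prefix)
    missing = [name for name in wanted if name not in pca_df.columns]
    if missing:
        raise ValueError(f"PCs not available in pca_df: {missing}")
    # Left join keeps every phenotype row; samples without PCs get NaN
    merged = phenotype_df.join(pca_df[wanted], how="left")
    return merged.dropna(subset=wanted) if drop_na else merged


# Made-up data: S3 has no PCs
phenotypes = pd.DataFrame({"age": [40, 55, 63]}, index=["S1", "S2", "S3"])
pcs = pd.DataFrame({"PC1": [0.1, -0.2], "PC2": [0.0, 0.3]}, index=["S1", "S2"])

merged = attach_pcs(phenotypes, pcs, n_pcs=2)
complete = attach_pcs(phenotypes, pcs, n_pcs=2, drop_na=True)
covariates = ["age"] + pc_names(2)
print(covariates)
```

The resulting `covariates` list has the shape expected by the `covariates` argument of the analysis functions.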

Identify pairs of related samples based on GRM threshold.

Parameters:
  • grm_matrix (np.ndarray) – n_samples x n_samples GRM matrix

  • sample_ids (pd.DataFrame) – DataFrame with sample IDs (from load_grm_gcta)

  • threshold (float) – Relatedness threshold. Common values: 0.354 (1st degree), 0.177 (2nd degree), 0.0884 (3rd degree)

  • verbose (bool) – Print summary statistics

Returns:

DataFrame with columns IID1, IID2, kinship (sorted by kinship descending)

Return type:

pd.DataFrame
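
Conceptually, finding related pairs means scanning the upper triangle of the GRM for entries above the threshold. The standard-library sketch below uses a nested list as the matrix and a plain list of IDs; `related_pairs` is a hypothetical name, and the strict `>` comparison is an assumption.

```python
def related_pairs(grm, sample_ids, threshold):
    """Return (id1, id2, kinship) tuples above threshold, highest kinship first."""
    pairs = []
    for i in range(len(sample_ids)):
        for j in range(i + 1, len(sample_ids)):  # upper triangle only
            if grm[i][j] > threshold:
                pairs.append((sample_ids[i], sample_ids[j], grm[i][j]))
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Made-up 3-sample GRM
grm = [
    [1.00, 0.20, 0.01],
    [0.20, 1.00, 0.40],
    [0.01, 0.40, 1.00],
]
pairs = related_pairs(grm, ["S1", "S2", "S3"], threshold=0.177)
print(pairs)
```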

Filter out related samples to create an unrelated subset.

Parameters:
  • phenotype_df (pd.DataFrame) – Phenotype DataFrame

  • grm_matrix (np.ndarray) – n_samples x n_samples GRM matrix

  • sample_ids (pd.DataFrame) – DataFrame with sample IDs (from load_grm_gcta)

  • threshold (float) – Relatedness threshold

  • method (str) – Method for selecting unrelated samples (‘greedy’ or ‘random’)

  • sample_id_col (str, optional) – Column name in phenotype_df for sample IDs. If None, uses index

  • verbose (bool) – Print filtering information

Returns:

Filtered phenotype DataFrame with unrelated samples only

Return type:

pd.DataFrame
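
The reference names a 'greedy' method without describing it. One plausible reading, sketched here, is to repeatedly drop the sample with the most remaining relatives until no related pair is left. `greedy_unrelated` is a hypothetical helper that works on a list of related ID pairs; the library's actual selection rule may differ.

```python
def greedy_unrelated(sample_ids, related_pairs):
    """Drop the most-connected sample until no related pair remains."""
    relatives = {sample: set() for sample in sample_ids}
    for first, second in related_pairs:
        relatives[first].add(second)
        relatives[second].add(first)

    removed = set()
    while relatives:
        worst = max(relatives, key=lambda sample: len(relatives[sample]))
        if not relatives[worst]:
            break  # nobody left has a relative in the set
        removed.add(worst)
        for other in relatives.pop(worst):
            relatives[other].discard(worst)
    return [sample for sample in sample_ids if sample not in removed]


# A is related to both B and C, so removing A alone resolves everything
unrelated = greedy_unrelated(["A", "B", "C", "D"], [("A", "B"), ("A", "C")])
print(unrelated)
```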

Data Preparation

stratified_train_test_split(genotype_df, phenotype_df, outcome_col, test_size=0.5, random_state=42, is_binary=True, geno_id_col=None, pheno_id_col=None)

Split data into training and test sets with stratification.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame (samples x variants)

  • phenotype_df (pd.DataFrame) – Phenotype DataFrame

  • outcome_col (str) – Name of outcome column for stratification

  • test_size (float) – Proportion of samples in test set

  • random_state (int) – Random seed for reproducibility

  • is_binary (bool) – Whether outcome is binary (enables stratification)

  • geno_id_col (str, optional) – Column name in genotype_df for sample IDs. If None, uses index

  • pheno_id_col (str, optional) – Column name in phenotype_df for sample IDs. If None, uses index

Returns:

Tuple of (train_geno, test_geno, train_pheno, test_pheno)

Return type:

tuple

Raises:

ValueError – If no common samples found or stratification fails
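
Stratification here means splitting cases and controls separately so that both sets keep roughly the same case fraction. The standard-library sketch below shows the idea on a dictionary of sample IDs and binary labels; `stratified_split` is a hypothetical helper and makes no claim about how the library performs the split.

```python
import random
from collections import defaultdict


def stratified_split(outcomes, test_size=0.5, random_state=42):
    """Split sample IDs into (train, test), preserving class proportions."""
    by_class = defaultdict(list)
    for sample_id, label in outcomes.items():
        by_class[label].append(sample_id)

    rng = random.Random(random_state)  # seeded for reproducibility
    train, test = [], []
    for label in sorted(by_class):
        ids = sorted(by_class[label])
        rng.shuffle(ids)
        n_test = round(len(ids) * test_size)
        test.extend(ids[:n_test])
        train.extend(ids[n_test:])
    return train, test


# Made-up cohort: 8 controls (0) and 4 cases (1)
outcomes = {f"S{i}": (1 if i <= 4 else 0) for i in range(1, 13)}
train_ids, test_ids = stratified_split(outcomes, test_size=0.5)
print(len(train_ids), len(test_ids))
```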

impute_covariates(phenotype_df, covariate_cols, method='median', drop_na=False, verbose=True)

Impute missing values in covariates.

Parameters:
  • phenotype_df (pd.DataFrame) – Phenotype DataFrame with covariates

  • covariate_cols (list) – List of covariate column names to impute

  • method (str) – Imputation method - ‘drop’, ‘mean’, ‘median’, ‘mode’, ‘knn’, ‘missforest’, ‘mice’

  • drop_na (bool) – If True, drop rows with missing outcome after imputation

  • verbose (bool) – Print imputation information

Returns:

DataFrame with imputed covariates

Return type:

pd.DataFrame

Note

For ‘missforest’ and ‘mice’, install: pip install missingpy
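
As a small illustration of the default 'median' method, the sketch below fills missing entries of a single covariate with the median of the observed values. It uses `None` for missing values and a hypothetical `impute_median` helper; the library operates on DataFrame columns.

```python
import statistics


def impute_median(values):
    """Replace None with the median of the observed values."""
    observed = [value for value in values if value is not None]
    fill = statistics.median(observed)
    return [fill if value is None else value for value in values]


ages = [40, None, 60, 50, None]
imputed = impute_median(ages)
print(imputed)
```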

Standard GWAS

standard_gwas(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary')

Perform standard additive GWAS for comparison with EDGE.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame

  • phenotype_df (pd.DataFrame) – Phenotype DataFrame

  • outcome (str) – Name of outcome column

  • covariates (list) – List of covariate column names

  • outcome_type (str) – ‘binary’ for logistic regression, ‘continuous’ for linear regression

Returns:

DataFrame with variant_id, coef, pval, std_err

Return type:

pd.DataFrame

additive_gwas(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary')

Alias for standard_gwas(). Perform standard additive GWAS.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame

  • phenotype_df (pd.DataFrame) – Phenotype DataFrame

  • outcome (str) – Name of outcome column

  • covariates (list) – List of covariate column names

  • outcome_type (str) – ‘binary’ for logistic regression, ‘continuous’ for linear regression

Returns:

DataFrame with variant_id, coef, pval, std_err

Return type:

pd.DataFrame

cross_validated_edge_analysis(genotype_df, phenotype_df, outcome, covariates, outcome_type='binary', n_folds=5, n_jobs=8, random_state=42)

Perform k-fold cross-validation for EDGE analysis.

Parameters:
  • genotype_df (pd.DataFrame) – Genotype DataFrame

  • phenotype_df (pd.DataFrame) – Phenotype DataFrame

  • outcome (str) – Name of outcome column

  • covariates (list) – List of covariate column names

  • outcome_type (str) – ‘binary’ or ‘continuous’

  • n_folds (int) – Number of cross-validation folds

  • n_jobs (int) – Number of parallel jobs for EDGE analysis

  • random_state (int) – Random seed for reproducibility

Returns:

Tuple of (avg_alpha, meta_gwas_df, combined_alpha, combined_gwas)

Return type:

tuple

Visualization

manhattan_plot(gwas_df, output='manhattan.png', title='EDGE GWAS Manhattan Plot', sig_threshold=5e-8, figsize=(14, 6), colors=None)

Create Manhattan plot from EDGE GWAS results.

Parameters:
  • gwas_df (pd.DataFrame or list) – DataFrame or list of DataFrames with columns ‘chrom’, ‘pos’, ‘pval’

  • output (str) – Output filename for the plot

  • title (str) – Plot title

  • sig_threshold (float) – Genome-wide significance threshold

  • figsize (tuple) – Figure size as (width, height)

  • colors (list, optional) – List of two colors for alternating chromosomes

qq_plot(gwas_df, output='qq_plot.png', title='EDGE GWAS QQ Plot', figsize=(8, 8))

Create QQ plot from EDGE GWAS results and calculate genomic inflation factor.

Parameters:
  • gwas_df (pd.DataFrame or list) – DataFrame or list of DataFrames with column ‘pval’

  • output (str) – Output filename for the plot

  • title (str) – Plot title

  • figsize (tuple) – Figure size as (width, height)

Returns:

Genomic inflation factor (lambda_gc)

Return type:

float
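
The genomic inflation factor is conventionally the median observed chi-square statistic (1 df) divided by its expected median under the null, about 0.455. The standard-library sketch below computes it from p-values under that conventional definition; whether `qq_plot` uses exactly this formula is an assumption, and `genomic_inflation` is a hypothetical name.

```python
from statistics import NormalDist, median


def genomic_inflation(pvalues):
    """lambda_gc = median(chi-square statistics) / expected null median."""
    normal = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square statistic via z = inv_cdf(p / 2)
    chi2 = [normal.inv_cdf(p / 2) ** 2 for p in pvalues]
    expected_median = normal.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2) / expected_median


# Perfectly uniform p-values should show no inflation
n = 999
uniform_pvalues = [(i - 0.5) / n for i in range(1, n + 1)]
lambda_gc = genomic_inflation(uniform_pvalues)

# Systematically smaller p-values inflate the statistic
inflated = genomic_inflation([p / 2 for p in uniform_pvalues])
print(lambda_gc, inflated)
```

Values well above 1 are commonly read as a sign of confounding such as population structure.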

plot_alpha_distribution(alpha_df, output='alpha_distribution.png', bins=50, figsize=(10, 6), xlim=None)

Plot distribution of alpha values.

Parameters:
  • alpha_df (pd.DataFrame) – DataFrame with ‘alpha_value’ column

  • output (str) – Output filename

  • bins (int) – Number of histogram bins

  • figsize (tuple) – Figure size as (width, height)

  • xlim (tuple, optional) – Optional tuple (min, max) for x-axis limits. If None, uses full range

Input/Output

save_results(gwas_df, alpha_df=None, output_prefix='edge_gwas', save_alpha=True)

Save EDGE GWAS results to files.

Parameters:
  • gwas_df (pd.DataFrame) – GWAS results DataFrame

  • alpha_df (pd.DataFrame, optional) – Alpha values DataFrame

  • output_prefix (str) – Prefix for output files

  • save_alpha (bool) – Whether to save alpha values

Returns:

Dictionary with output file paths

Return type:

dict

load_alpha_values(alpha_file)

Load pre-calculated alpha values.

Parameters:
  • alpha_file (str) – Path to alpha values file

Returns:

DataFrame with alpha values

Return type:

pd.DataFrame

format_gwas_output_for_locuszoom(gwas_df, include_alpha=True, sort_by='pval', format_for_locuszoom=False)

Format GWAS output for publication/reporting or LocusZoom upload.

Parameters:
  • gwas_df (pd.DataFrame) – GWAS results DataFrame

  • include_alpha (bool) – Include alpha-related columns

  • sort_by (str) – Column to sort by

  • format_for_locuszoom (bool) – If True, format for LocusZoom upload with required columns

Returns:

Formatted DataFrame

Return type:

pd.DataFrame

Note

For LocusZoom format, the output will be tab-delimited with columns: chrom, pos, ref, alt, pval, beta, se, eaf, and optionally alpha_value. The file should be sorted by chrom and pos, compressed with bgzip, and indexed with tabix for optimal LocusZoom performance.

save_for_locuszoom(gwas_df, output_file, include_alpha=True, compress=True)

Save GWAS results in LocusZoom-compatible format.

Parameters:
  • gwas_df (pd.DataFrame) – GWAS results DataFrame

  • output_file (str) – Output file path (will add .gz if compress=True)

  • include_alpha (bool) – Include alpha_value column

  • compress (bool) – If True, compress with gzip

Note

For best performance with LocusZoom:

  1. Compress with bgzip: bgzip output_file.tsv

  2. Index with tabix: tabix -s 1 -b 2 -e 2 output_file.tsv.gz
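
To make the expected file layout concrete, the sketch below sorts rows by chromosome and position and writes a gzip-compressed, tab-delimited file with the column names listed for `format_gwas_output_for_locuszoom`. `write_locuszoom_tsv` is a hypothetical helper; it assumes numeric chromosome labels, and it uses plain gzip, which tabix cannot index (that step needs bgzip, as noted above).

```python
import csv
import gzip
import os
import tempfile

COLUMNS = ["chrom", "pos", "ref", "alt", "pval", "beta", "se", "eaf"]


def write_locuszoom_tsv(rows, output_file):
    """Write rows (dicts) sorted by chrom and pos to a gzipped TSV."""
    ordered = sorted(rows, key=lambda row: (int(row["chrom"]), int(row["pos"])))
    with gzip.open(output_file, "wt", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=COLUMNS, delimiter="\t",
                                lineterminator="\n", extrasaction="ignore")
        writer.writeheader()
        writer.writerows(ordered)
    return output_file


# Made-up results, deliberately out of order
rows = [
    {"chrom": 2, "pos": 50, "ref": "A", "alt": "G",
     "pval": 0.5, "beta": 0.01, "se": 0.02, "eaf": 0.3},
    {"chrom": 1, "pos": 100, "ref": "C", "alt": "T",
     "pval": 1e-9, "beta": 0.2, "se": 0.03, "eaf": 0.1},
]
with tempfile.TemporaryDirectory() as tmp:
    path = write_locuszoom_tsv(rows, os.path.join(tmp, "demo.tsv.gz"))
    with gzip.open(path, "rt") as handle:
        lines = handle.read().splitlines()
print(lines)
```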

validate_locuszoom_format(gwas_df)

Validate that GWAS results meet LocusZoom format requirements.

Parameters:
  • gwas_df (pd.DataFrame) – GWAS results DataFrame

Returns:

Dictionary with validation results (valid, errors, warnings, info)

Return type:

dict

create_summary_report(gwas_df, alpha_df=None, significance_threshold=5e-8, output_file=None)

Create a summary report of EDGE GWAS analysis.

Parameters:
  • gwas_df (pd.DataFrame) – GWAS results DataFrame

  • alpha_df (pd.DataFrame, optional) – Alpha values DataFrame

  • significance_threshold (float) – P-value threshold for significance

  • output_file (str, optional) – Optional file to save report

Returns:

Summary report as string

Return type:

str


Function Index

Core Analysis (see core_analysis)
Data Loading (see data_loading)
Data Validation (see data_validation)
Quality Control (see quality_control)
Population Structure Control (see population_structure)
Data Preparation (see data_preparation)
Standard GWAS (see standard_gwas)
Visualization (see Visualization Guide)
Input/Output (see input_output)

Last updated: 2026-02-10 for edge-gwas v0.1.2

For questions or issues, visit: https://github.com/nicenzhou/edge-gwas/issues