Implementing Stratified LDSC for Accurate Heritability Partitioning

Summary

The objective is to move beyond global LD Score Regression (LDSC) to understand how specific functional categories of the genome contribute to heritability. Standard LDSC yields a single genome-wide heritability estimate (from the regression slope) plus an intercept that captures confounding such as population stratification. Stratified LDSC (s-LDSC) goes further and partitions that heritability across annotations. It quantifies how strongly heritability is enriched within specific functional genomic categories (e.g., promoters, enhancers, or histone marks) relative to the share of SNPs those categories contain.

Root Cause

The “problem” isn’t a system failure, but a methodological limitation in standard LDSC. Standard LDSC fits one slope across all SNPs, which implicitly assumes that every SNP contributes the same expected amount of heritability. That assumption hides the heterogeneity of biological signal across the genome.

The necessity for s-LDSC arises from:

Signal Dilution: Global LDSC averages out the high-density signals found in functional regions with the noise in non-functional regions.
Biological Granularity: Understanding if a trait is driven by coding variants versus regulatory elements requires computing a separate LD score for each functional category and fitting them jointly.
Overlap Complexity: Biological features (like H3K4me3 and promoters) are not mutually exclusive; s-LDSC is required to handle the mathematical overlap between these genomic categories.
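For reference, the regression behind these three points can be written compactly. This is a sketch of the stratified LD score regression model as it is usually stated; the symbols are introduced here and are not defined elsewhere in this write-up: $N$ is the GWAS sample size, $\chi_j^2$ the association statistic of SNP $j$, $\tau_C$ the per-SNP heritability coefficient of category $C$, $\ell(j, C)$ the LD score of SNP $j$ with respect to category $C$, $r_{jk}$ the correlation between SNPs $j$ and $k$, and $a$ a term that absorbs confounding.

```latex
\mathbb{E}\left[\chi_j^2\right] = N \sum_{C} \tau_C \, \ell(j, C) + N a + 1,
\qquad
\ell(j, C) = \sum_{k \in C} r_{jk}^2
```

Because a SNP may belong to several categories, each $\tau_C$ is estimated jointly with the others, which is how the model handles overlapping annotations.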

Why This Happens in Real Systems

In large-scale genomic studies, we deal with non-independent variables. In a production bioinformatics pipeline, biological annotations are highly correlated.

Feature Co-occurrence: A SNP in a promoter is often also in a DNase hypersensitive site.
Non-Orthogonal Partitions: Unlike simple classification, genomic partitions are overlapping sets.
Model Complexity: As you add more categories (a panel of 14 functional categories, for example), the risk of overfitting and multicollinearity increases significantly.
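The overlap described above can be made concrete with a small sketch. The annotation matrix and category names below are made-up toy values, not real annotation data; the point is only to show how co-occurrence counts reveal that categories are not a disjoint partition.

```python
import numpy as np

# Toy binary annotation matrix: rows are SNPs, columns are categories.
# Category names and values are illustrative assumptions, not real data.
categories = ["promoter", "dhs", "coding"]
annot = np.array([
    [1, 1, 0],
    [1, 1, 0],
    [1, 0, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [0, 0, 0],
    [0, 0, 0],
])

# overlap[i, j] = number of SNPs that belong to both category i and category j.
overlap = annot.T @ annot

# frac[i, j] = fraction of category i's SNPs that also fall in category j.
frac = overlap / np.diag(overlap)[:, None]

# A true partition would assign every SNP to exactly one category.
memberships = annot.sum(axis=1)
is_partition = bool(np.all(memberships == 1))

print(overlap)
print(np.round(frac, 2))
print("disjoint partition:", is_partition)
```

In this toy panel two of the three promoter SNPs are also in a DNase hypersensitive site, so the two columns carry partly redundant information.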

Real-World Impact

Failing to partition heritability with a joint model of all categories leads to several scientific and technical risks:

False Interpretations: Attributing heritability to a broad category when the signal is actually driven by a specific, highly correlated sub-feature.
Misguided Biological Hypotheses: In drug discovery, missing the fact that a trait is driven by super-enhancers rather than coding regions can lead to millions of dollars wasted on the wrong therapeutic targets.
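Inaccurate Enrichment Scores: If correlated partitions are tested one at a time instead of jointly, signal from one category leaks into its neighbors, so the enrichment estimates and their $Z$-scores are biased and cannot be interpreted at face value.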
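To make "enrichment" concrete, the sketch below uses the common definition: the proportion of heritability explained by a category divided by the proportion of SNPs in that category. The function name `enrichment`, the toy annotation matrix, and the coefficient values are assumptions made for illustration; real coefficients would come from the joint regression.

```python
import numpy as np

def enrichment(annot, tau):
    """Return (prop_h2, prop_snps, enrichment) per category.

    annot: (M, C) binary matrix, annot[j, c] = 1 if SNP j is in category c.
           The first column is assumed to be a base category containing all SNPs.
    tau:   (C,) per-SNP heritability coefficients from a joint regression.
    """
    annot = np.asarray(annot, dtype=float)
    tau = np.asarray(tau, dtype=float)
    per_snp_h2 = annot @ tau            # each SNP sums the tau of its categories
    h2_total = per_snp_h2.sum()
    h2_cat = annot.T @ per_snp_h2       # heritability carried by SNPs in each category
    prop_h2 = h2_cat / h2_total
    prop_snps = annot.sum(axis=0) / annot.shape[0]
    return prop_h2, prop_snps, prop_h2 / prop_snps

# Toy example: 10 SNPs, a base category and an "enhancer" category (2 SNPs).
annot = np.zeros((10, 2))
annot[:, 0] = 1          # base: every SNP
annot[:2, 1] = 1         # enhancer: first two SNPs
tau = [0.001, 0.004]     # made-up coefficients

prop_h2, prop_snps, enr = enrichment(annot, tau)
print(np.round(prop_h2, 3), prop_snps, np.round(enr, 3))
```

Here the enhancer category holds 20% of the SNPs but about 56% of the heritability, so its enrichment is roughly 2.8, while the base category has an enrichment of 1 by construction.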

Example or Code

To perform this, you must first compute partitioned LD scores from a reference panel (like 1000 Genomes) and your annotation files before running the regression.

# Step 1: Compute partitioned LD scores for your functional categories (one chromosome shown)
# This requires reference-panel genotypes in PLINK format and a matching .annot file
# All file names and prefixes below are placeholders; substitute your own paths
python ldsc.py --l2 \
     --bfile ref_panel.22 \
     --ld-wind-cm 1 \
     --annot my_partitions.22.annot.gz \
     --print-snps regression_snps.txt \
     --out my_partitions.22

# Step 2: Run stratified LDSC using the partitioned LD scores
# This uses the per-chromosome '.l2.ldscore.gz' files generated in Step 1
python ldsc.py --h2 trait.sumstats.gz \
     --ref-ld-chr my_partitions. \
     --w-ld-chr weights. \
     --overlap-annot \
     --frqfile-chr ref_panel_frq. \
     --out sldsc_results
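To show what the first step produces, here is a minimal sketch of a partitioned LD score: for each SNP, the sum of squared correlations with the SNPs in each category. The LD matrix and annotations are made-up toy values. The actual tool estimates correlations from reference genotypes within a window and uses a bias-corrected estimate of the squared correlation, which this sketch does not attempt.

```python
import numpy as np

# Toy LD (correlation) matrix for 4 SNPs; values are made up for illustration.
r = np.array([
    [1.0, 0.8, 0.1, 0.0],
    [0.8, 1.0, 0.2, 0.0],
    [0.1, 0.2, 1.0, 0.5],
    [0.0, 0.0, 0.5, 1.0],
])

# Binary annotations: column 0 is a base category (all SNPs),
# column 1 is a hypothetical "enhancer" category (SNPs 0 and 3).
annot = np.array([
    [1, 1],
    [1, 0],
    [1, 0],
    [1, 1],
])

# ld_scores[j, c] = sum over SNPs k in category c of r[j, k] ** 2
ld_scores = (r ** 2) @ annot

print(np.round(ld_scores, 2))
```

SNP 1 is not itself an enhancer SNP, yet it receives an enhancer LD score of 0.64 because it is in strong LD with SNP 0. This is why the regression works on LD scores rather than on raw annotation membership.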

How Senior Engineers Fix It

A senior engineer treats this as a statistical modeling problem rather than just running a script. To ensure accuracy:

Pre-processing Validation: We compute LD scores from a reference panel whose ancestry matches the GWAS sample (for example, a European reference panel for a European-ancestry GWAS). Mismatched LD patterns bias the LD scores and, through them, every downstream estimate.
Handling Multi-collinearity: We compute the correlation matrix of the annotations to check if our categories are too similar. If two categories (e.g., H3K27ac and Super-enhancers) are correlated at $r > 0.9$, we may collapse them to maintain model stability.
Multiple Testing Correction: We apply rigorous Bonferroni or FDR corrections because testing 14+ categories significantly increases the probability of Type I errors.
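Resampling-Based Standard Errors: We estimate the uncertainty of the enrichment scores by resampling contiguous blocks of SNPs rather than individual SNPs, because SNPs in LD are not independent. LDSC itself reports block-jackknife standard errors; a block bootstrap is a reasonable cross-check. Either way, the goal is to ensure the results aren’t driven by a single outlier region.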
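The collinearity check and the multiple testing correction above can be sketched as follows. The helper names (`correlated_pairs`, `bonferroni`, `benjamini_hochberg`), the toy annotations, and the p-values are all illustrative assumptions; in practice the p-values would come from the s-LDSC results table.

```python
import numpy as np

def correlated_pairs(annot, names, threshold=0.9):
    """Return (name_i, name_j, r) for annotation pairs with |r| above threshold."""
    corr = np.corrcoef(np.asarray(annot, dtype=float), rowvar=False)
    pairs = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) > threshold:
                pairs.append((names[i], names[j], float(corr[i, j])))
    return pairs

def bonferroni(pvals, alpha=0.05):
    """Reject where p < alpha / number of tests."""
    pvals = np.asarray(pvals, dtype=float)
    return pvals < alpha / pvals.size

def benjamini_hochberg(pvals, alpha=0.05):
    """Reject the k smallest p-values, k = largest rank with p_(k) <= alpha * k / m."""
    pvals = np.asarray(pvals, dtype=float)
    m = pvals.size
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, m + 1) / m
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])
        reject[order[: k + 1]] = True
    return reject

# Toy annotations for 20 SNPs (made-up values).
names = ["H3K27ac", "SuperEnhancer", "Coding"]
annot = np.zeros((20, 3))
annot[:10, 0] = 1    # H3K27ac: SNPs 0-9
annot[:11, 1] = 1    # SuperEnhancer: SNPs 0-10, nearly identical to H3K27ac
annot[15:, 2] = 1    # Coding: SNPs 15-19

pairs = correlated_pairs(annot, names)
pvals = [0.001, 0.004, 0.03, 0.2]   # made-up enrichment p-values
print(pairs)
print(bonferroni(pvals), benjamini_hochberg(pvals))
```

With these toy inputs only the H3K27ac and SuperEnhancer pair exceeds the threshold, and the FDR procedure rejects one more hypothesis than Bonferroni, which illustrates the usual trade-off between the two corrections.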

Why Juniors Miss It

Junior researchers often treat s-LDSC as a “black box” tool, leading to common pitfalls:

Ignoring the Intercept: They focus solely on the enrichment $Z$-score and forget to check the intercept, which signals inflation due to cryptic relatedness or population structure.
Input Mismatch: They attempt to use functional annotations whose SNPs do not match the SNP set used in the LD score calculation, so annotation values end up assigned to the wrong variants or dropped.
Complexity Overload: They try to test too many overlapping categories at once without checking for correlation between the partitions, leading to unstable estimates.
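Confusion of Terms: They often confuse three different quantities: the intercept (which reflects confounding, not polygenicity), the per-category coefficient (the slope for that category's LD score, conditional on all other categories), and the enrichment (the share of heritability divided by the share of SNPs). Mixing these up leads to incorrect biological conclusions.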
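The distinction between intercept and per-category coefficients can be seen in a toy regression. The sketch below simulates association statistics from made-up LD scores and recovers both quantities with ordinary least squares. All numbers are assumptions chosen for illustration, the noise is far smaller than in real data, and the actual tool uses a weighted regression rather than plain least squares.

```python
import numpy as np

rng = np.random.default_rng(0)
n_snps, n_gwas = 5000, 50_000

# Two category-specific LD score columns (toy values, not real LD scores).
ld = rng.uniform(1.0, 200.0, size=(n_snps, 2))
tau_true = np.array([2e-7, 8e-7])   # per-SNP heritability coefficients
intercept_true = 1.08               # values above 1 suggest confounding

# Expected chi-square: sample size times LD-score-weighted coefficients, plus intercept.
chi2 = n_gwas * (ld @ tau_true) + intercept_true
chi2 = chi2 + rng.normal(0.0, 0.1, size=n_snps)   # small toy noise

# Regress chi-square on the LD scores with an explicit intercept column.
X = np.column_stack([np.ones(n_snps), ld])
coef, *_ = np.linalg.lstsq(X, chi2, rcond=None)
intercept_hat = coef[0]
tau_hat = coef[1:] / n_gwas   # slopes equal N * tau, so divide by N

print("intercept:", round(float(intercept_hat), 3))
print("tau:", tau_hat)
```

The intercept estimate tracks the confounding term and says nothing about which category matters, while the slopes carry the category-specific signal.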