DK-012 — From Phenome to Biomarkers

🧬 From DNA to Diagnosis: Phenomics, Genomics, and Machine Learning for Biomarker Discovery

“The genome is the hardware. The phenome is the software running on it. And machine learning is the debugger.”

When most people hear “AI in medicine,” they imagine a robot doctor scanning X-rays.

But the real frontier is quieter and far more profound: teaching machines to read the molecular language of life itself — encoded in your DNA, your metabolites, your proteins, your vital signs.

This guide takes you from the very beginning — what is DNA? what is a phenome? — all the way to production-grade machine learning pipelines used in cutting-edge research. Whether you are a biology student learning to code, or a data scientist learning biology, this is your complete map.

The Biological Foundation: DNA, Genes, and the Central Dogma
What Is a Phenome? From Traits to Omics
The Omics Universe: Genomics, Transcriptomics, Metabolomics, and More
Why This Is a Machine Learning Problem
Part A: High-Dimensional Tabular Data — Feature Selection and Tabular ML
Part B: Making Models Trustworthy — Imbalanced Data and Interpretability
Part C: Time-Series, Biosignals, and Survival Analysis
Putting It All Together: A Research Pipeline

🔬 Chapter 1: The Biological Foundation

1.1 What Is DNA?

DNA stands for Deoxyribonucleic Acid. It is the molecule that encodes every instruction your body has ever received — from “build a liver” to “produce insulin when blood sugar rises.”

Think of DNA like this:

DNA is the world's most compact hard drive.
In every single cell of your body (~37 trillion cells),
there is a complete copy of your DNA — 3 billion letters long —
coiled into a space smaller than 6 micrometers.

If you stretched your DNA end to end, it would reach the Sun and back 300 times.

The Alphabet of Life

DNA is written in an alphabet of only four letters, called nucleotide bases:

Letter	Base	Pairs With
A	Adenine	T
T	Thymine	A
C	Cytosine	G
G	Guanine	C

These four letters, arranged in specific sequences, encode all the information needed to build and run a human body.

The Double Helix

DNA exists as a double helix — two complementary strands twisted together like a spiral staircase. The famous Watson-Crick structure (1953) revealed that the two strands are antiparallel and held together by hydrogen bonds between complementary base pairs (A-T and G-C).

5' — A — T — G — C — A — G — T — 3'  (Strand 1)
     |   |   |   |   |   |   |
3' — T — A — C — G — T — C — A — 5'  (Strand 2)

This complementarity is what allows DNA to be copied faithfully during cell division — an engineering feat of breathtaking precision.

1.2 Genes: Functional Units of DNA

Your ~3 billion base pairs are organized into ~20,000–25,000 genes. A gene is a specific sequence of DNA that encodes a functional product — usually a protein.

But here is a crucial insight that surprises many students:

Only about 1.5% of the human genome encodes proteins.

The rest was once dismissively called “junk DNA.” We now know much of it:

Regulates when and how genes are expressed
Encodes regulatory RNA molecules
Maintains chromosome structure
Has roles we are still discovering

Gene Expression: From DNA to Protein

The pathway from gene to functional protein follows what biologists call The Central Dogma:

DNA → (Transcription) → RNA → (Translation) → Protein → Function

Step 1: Transcription
The DNA is “read” and a complementary strand of messenger RNA (mRNA) is synthesized. This happens in the cell nucleus.

Step 2: RNA Processing
The pre-mRNA is processed — introns (non-coding regions) are spliced out, exons (coding regions) are joined. This is alternative splicing, which allows one gene to produce multiple protein variants.

Step 3: Translation
The mRNA travels to the ribosome, where transfer RNAs (tRNAs) read the message in triplets (codons) and link the corresponding amino acids together to form a protein chain.

Step 4: Protein Folding and Function
The protein chain folds into its 3D structure (often with help from chaperone proteins) and carries out its biological function — as an enzyme, structural component, signaling molecule, or receptor.

1.3 Genetic Variation: Why We Are All Different

If all humans share ~99.9% of their DNA, what accounts for the incredible diversity in disease risk, drug response, and physical traits?

The answer lies in the 0.1% — and how that 0.1% is distributed across the genome.

Single Nucleotide Polymorphisms (SNPs)

The most common type of genetic variation is the SNP (pronounced “snip”): a position in the genome where a single nucleotide differs between individuals.

Population reference:  ...ACGT[A]CCGT...
Individual 1:          ...ACGT[A]CCGT...  (Reference allele: A)
Individual 2:          ...ACGT[G]CCGT...  (Alternate allele: G)

There are roughly 10 million common SNPs in the human genome. Each can be:

Benign — a normal variation with no functional consequence
Functional — alters protein structure or gene regulation
Pathogenic — directly causes or strongly predisposes to disease

Genotyping Arrays and GWAS

Modern genotyping arrays (chips like the Illumina GSA) can measure 500,000 to 2 million SNPs simultaneously from a saliva sample. This data forms the basis of Genome-Wide Association Studies (GWAS) — which scan millions of SNPs across thousands of individuals to identify variants associated with disease.

A classic GWAS result is a Manhattan plot — named for its resemblance to a city skyline — where peaks indicate genomic regions significantly associated with a trait.

🌐 Chapter 2: What Is a Phenome?

2.1 From Phenotype to Phenome

A phenotype is any observable characteristic of an organism: height, blood pressure, fasting glucose, the color of your eyes, whether you have type 2 diabetes.

The phenome is the complete collection of all phenotypes of an individual — across all conditions, time points, and scales of measurement.

If the genome asks “what could happen?” the phenome asks “what actually is happening?”

Genome     → Blueprint (potential)
Phenome    → Reality (expression)

The Phenome Is Vast and Multidimensional

Consider a single patient in a hospital system:

Clinical measurements: blood pressure, BMI, heart rate, spirometry values
Laboratory values: CBC, metabolic panel, lipid panel, HbA1c, thyroid hormones
Imaging features: CT lung density, retinal vessel diameter, cardiac wall thickness
Behavioral phenotypes: sleep patterns, physical activity, dietary intake
Electronic Health Record (EHR) codes: ICD-10 diagnoses, procedure codes, medication histories

A comprehensive phenome dataset for one patient might contain hundreds to thousands of variables measured over years. Multiply that by 100,000 patients, and you begin to understand why this is fundamentally a big data problem.

2.2 PheWAS: Phenome-Wide Association Studies

Just as GWAS scans the genome for variants associated with one disease, PheWAS (Phenome-Wide Association Study) scans the phenome — asking:

“Given this one genetic variant (or drug, or exposure), what are ALL its associations across the entire landscape of clinical phenotypes?”

This “reverse GWAS” approach has revealed surprising pleiotropy — the same genetic variant influencing seemingly unrelated conditions — and has been instrumental in drug repurposing.

2.3 The Phenome Datasets You Will Use

In this course, we work with two phenome datasets:

Phenome Dataset 1 (Phenome-1)

Source: Population cohort (e.g., UK Biobank-style)
Variables: ~500 phenotypic variables per participant
Sample size: ~10,000 individuals
Use case: Feature selection, biomarker discovery

Phenome Dataset 2 (Phenome-2)

Source: Clinical biobank with linked EHR
Variables: ~2,000 variables including lab values, medications, ICD codes
Special feature: Longitudinal — multiple timepoints per patient
Use case: Survival analysis, risk stratification

🧪 Chapter 3: The Omics Universe

3.1 Layers of Biological Information

Modern biology has fractured into a collection of “omics” disciplines, each studying a different layer of molecular information:

DNA ─────────────── Genomics
  ↓
RNA ─────────────── Transcriptomics
  ↓
Protein ─────────── Proteomics
  ↓
Metabolite ──────── Metabolomics
  ↓
Microbiome ──────── Metagenomics
  ↓
Environment ─────── Exposomics

Each layer adds a new dimension to understanding human health and disease. Together, they constitute a multi-omics picture.

3.2 Metabolomics: The Chemical Fingerprint of Life

Metabolomics measures the small molecules (metabolites) present in a biological sample — blood, urine, cerebrospinal fluid, or tissue.

Metabolites are the end products of metabolism: the sugars, amino acids, lipids, nucleotides, and their countless derivatives that your cells produce and consume every moment.

Why Metabolomics Is Powerful

Unlike genomics (which is static — your DNA doesn’t change), the metabolome is dynamic. It reflects:

What you ate last night
Whether you exercised this morning
Your current disease state
How your medications are being processed
Your microbiome’s activity

Example: Type 2 Diabetes Metabolic Signature

Elevated:
  • Branched-chain amino acids (leucine, isoleucine, valine)
  • Triglycerides and ceramides
  • Gluconeogenic intermediates

Depleted:
  • Bile acids (altered by gut dysbiosis)
  • Omega-3 fatty acids
  • Certain acylcarnitines

These patterns can precede clinical diagnosis by years — which is why metabolomics is so powerful for early biomarker discovery.

Measurement Technologies

Nuclear Magnetic Resonance (NMR) Spectroscopy

Measures ~100–200 metabolites
Highly reproducible and quantitative
Great for targeted metabolite classes (lipoproteins, amino acids)

Mass Spectrometry (MS)

Untargeted: can detect 1,000–10,000 metabolite features
Targeted: quantifies specific metabolites with high accuracy
Combined with chromatography (LC-MS, GC-MS) for separation

The Data Problem

A typical untargeted metabolomics study produces a matrix of:

Rows: samples (patients or time points)
Columns: metabolite features (m/z values or named metabolites)
Values: peak intensities

Sample × Feature matrix:
         Metabolite_1  Metabolite_2  ...  Metabolite_8000
Sample_1     1234.5        678.2    ...      99.1
Sample_2      890.1       1456.8    ...     234.7
...
Sample_500   2345.6        123.4    ...     567.3

This high-dimensional, noisy, collinear matrix is a perfect candidate for ML — and a nightmare for traditional statistical methods.

3.3 DNA Signal Processing: From Raw Reads to Features

Modern DNA sequencing (Next-Generation Sequencing, NGS) produces raw reads — short fragments of DNA sequence. Converting these to usable features involves:

1. Base Calling: Converting electrical signals to nucleotide letters (A, T, G, C)

2. Quality Control: Filtering low-quality reads, trimming adapters

3. Alignment: Mapping reads to the reference human genome (hg38)

4. Variant Calling: Identifying SNPs, insertions, deletions (indels), and structural variants

5. Annotation: Mapping variants to genes, regulatory regions, known disease databases

6. Feature Engineering: Converting variant calls to numerical features (e.g., dosage encoding: 0, 1, 2 for the number of alternate alleles at each SNP)

The final product is a genotype matrix: patients × SNPs, with values 0, 1, or 2.

📊 Chapter 4: Why This Is a Machine Learning Problem

4.1 The Dimensionality Challenge

Here is the fundamental challenge of omics data:

Typical study:
  Samples (n):    500 – 100,000 people
  Features (p):   500 – 10,000,000 variables (SNPs, metabolites, genes)

Classical statistics assumes n >> p.
In omics, we often have p >> n.

When features outnumber samples, classical regression overfits catastrophically. You can fit any pattern — including pure noise — with enough features.

This is the curse of dimensionality, and solving it is the central challenge of computational biology.

4.2 The Signal-to-Noise Challenge

Biological data is extraordinarily noisy:

Batch effects: samples processed on different days produce systematic biases
Biological variability: even identical twins differ in their metabolome
Technical variability: measurement error from instruments
Missing data: not every test is run for every patient

Real biological signals — the SNP that genuinely predicts heart disease, the metabolite that rises before liver failure — are often tiny effects buried in noise.

4.3 The Interpretability Challenge

In medicine, a prediction model is only useful if clinicians can understand and trust it.

A black-box model that says “this patient has an 87% chance of cardiovascular event” is less useful than one that says “this patient’s elevated branched-chain amino acids and reduced bile acids, combined with their APOE ε4 genotype, place them in a high-risk cluster.”

Interpretability is not a luxury in clinical ML. It is a requirement.

🔮 Chapter 5: Part A — High-Dimensional Tabular Data

5.1 Feature Selection: The LASSO Approach

The Problem: Too Many Variables

Imagine you have 500 patients and 5,000 metabolite measurements. You want to predict who will develop type 2 diabetes in the next 5 years.

If you naively run a logistic regression with all 5,000 features:

The model will overfit perfectly to your training data
It will fail completely on new patients
You will have no idea which metabolites actually matter

You need feature selection: the process of identifying the small subset of features that genuinely predict the outcome.

The Main Model: LASSO (L1 Regularization)

LASSO stands for Least Absolute Shrinkage and Selection Operator. It was proposed by Robert Tibshirani in 1996 and remains one of the most important tools in computational biology.

The Intuition

Regular linear regression minimizes:

Loss = Σ(y_i - ŷ_i)²

LASSO adds a penalty term:

LASSO Loss = Σ(y_i - ŷ_i)² + λ × Σ|β_j|
             ─────────────────  ─────────────────
             Fit to data           Penalty for
                                   large coefficients

The key insight: the L1 penalty produces sparse solutions. As λ increases, LASSO drives more and more coefficients exactly to zero — effectively performing automatic feature selection.

Why L1 Produces Sparsity (The Geometric Intuition)

L1 constraint region: A diamond (corners on axes)
L2 constraint region: A circle (smooth everywhere)

The optimal solution often touches the L1 diamond at a corner —
where one coefficient is zero.

This does NOT happen with L2 (Ridge) regression,
whose circular constraint region rarely touches the axes.

The Regularization Path

The most powerful way to use LASSO is to solve it for many values of λ simultaneously:

import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

# Standardize features (CRITICAL for LASSO — coefficients must be comparable)
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# Cross-validated LASSO — finds optimal λ automatically
lasso_cv = LassoCV(
    alphas=np.logspace(-4, 1, 100),  # λ values from 0.0001 to 10
    cv=10,                            # 10-fold cross-validation
    max_iter=10000,
    random_state=42
)
lasso_cv.fit(X_scaled, y)

# Extract selected features
selected_mask = lasso_cv.coef_ != 0
selected_features = feature_names[selected_mask]
print(f"Selected {selected_mask.sum()} features from {X.shape[1]} total")

Reading the Regularization Path Plot

A regularization path plot shows how each coefficient changes as λ varies:

Coefficient
value
    ^
0.5 |       ──────────────── Feature A
    |      /
0.0 |─────/──────────────── (most features stay at 0)
    |    /
-0.3|   ──────────────────── Feature B
    +──────────────────────────────────> log(λ)
    large λ                    small λ
    (few features)         (all features included)

Reading this plot tells you:

Which features enter the model first (at large λ) are the most robustly predictive
The λ at the elbow (selected by cross-validation) gives your final feature set
Features that enter only at very small λ are likely noise

Choosing λ: The Cross-Validation Criterion

Two common CV criteria:

lambda.min: The λ that minimizes cross-validation error — gives maximum predictive accuracy

lambda.1se: The largest λ whose CV error is within 1 standard error of the minimum — gives sparser, more conservative model

For biomarker discovery, prefer lambda.1se: you want the smallest set of features that still predicts well.

Visualizing with PCA and UMAP

After LASSO selects features, use dimensionality reduction to visualize patient clustering:

from sklearn.decomposition import PCA
import umap
import matplotlib.pyplot as plt

X_selected = X_scaled[:, selected_mask]

# PCA — linear, interpretable
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_selected)

# UMAP — nonlinear, reveals clusters
reducer = umap.UMAP(n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_selected)

fig, axes = plt.subplots(1, 2, figsize=(14, 6))
for ax, embedding, title in zip(
    axes,
    [X_pca, X_umap],
    ["PCA of LASSO-selected features", "UMAP of LASSO-selected features"]
):
    scatter = ax.scatter(
        embedding[:, 0], embedding[:, 1],
        c=y, cmap='RdYlBu', alpha=0.7, s=30
    )
    ax.set_title(title, fontsize=13)
    plt.colorbar(scatter, ax=ax, label='Outcome')
plt.tight_layout()

PCA is linear and interpretable — the axes represent linear combinations of features with maximum variance. Great for detecting batch effects and outliers.

UMAP is nonlinear and reveals local structure — clusters in UMAP space often correspond to meaningful disease subtypes. Great for hypothesis generation.

Notebook Comparison: Elastic Net, Boruta, mRMR

Elastic Net

Elastic Net combines L1 and L2 penalties:

Elastic Net Loss = Σ(y_i - ŷ_i)² + λ₁ × Σ|β_j| + λ₂ × Σβ_j²
                                     ─── L1 ────   ─── L2 ───

When to use over LASSO: When features are highly correlated (e.g., metabolites from the same pathway). LASSO arbitrarily picks one from a correlated group; Elastic Net tends to include all of them at reduced coefficients.

Boruta

Boruta is a wrapper method built on Random Forest:

from boruta import BorutaPy
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(n_jobs=-1, class_weight='balanced', max_depth=5)
boruta = BorutaPy(rf, n_estimators='auto', verbose=2, random_state=42)
boruta.fit(X_scaled, y)

# Three categories of features:
confirmed = feature_names[boruta.support_]          # Definitely relevant
tentative = feature_names[boruta.support_weak_]     # Possibly relevant
rejected = feature_names[~boruta.support_ & ~boruta.support_weak_]  # Noise

How it works: Creates “shadow features” — shuffled copies of each real feature. A feature is accepted only if it consistently outperforms its own shuffled version across many random forest runs. This controls family-wise error rate, making it statistically rigorous.

mRMR (Minimum Redundancy Maximum Relevance)

mRMR selects features that are:

Maximally relevant to the outcome (high mutual information with y)
Minimally redundant with each other (low mutual information among features)

from mrmr import mrmr_classif

# Select top 50 features
selected_features = mrmr_classif(X=df_features, y=df_target, K=50)

When to use: When you want a diverse panel of biomarkers from different biological pathways — not five variants of the same measurement.

Comparison Table

Method	Statistical Basis	Handles Correlation	Speed	Interpretability
LASSO	Convex optimization	Poorly (picks one)	Fast	Coefficients
Elastic Net	Convex optimization	Well	Fast	Coefficients
Boruta	Permutation testing	Reasonably	Slow	Importance
mRMR	Information theory	By design	Medium	Mutual info

5.2 Tabular ML: XGBoost

Why Tree-Based Methods Dominate Omics

Before explaining XGBoost, it helps to understand why gradient boosting machines have dominated tabular ML competitions and clinical research alike.

Biological tabular data has several properties that make it hard for neural networks:

Mixed feature types: continuous measurements mixed with binary indicators and ordinal categories
Missing values: clinical data routinely has 10–40% missingness
Non-linear interactions: the effect of metabolite A might depend on the level of gene B
Small-to-medium datasets: hundreds or thousands of samples, not millions

Neural networks excel at finding patterns in millions of identical-format examples (images, text). Decision trees excel at heterogeneous tabular data with complex feature interactions. And XGBoost combines hundreds of trees into an ensemble that is nearly unbeatable on this type of data.

The Main Model: XGBoost (eXtreme Gradient Boosting)

From Decision Tree to Ensemble: Building the Intuition

A single decision tree makes predictions by asking a series of yes/no questions:

Is fasting_glucose > 100 mg/dL?
    YES → Is BMI > 30?
                YES → Predict: High risk (score: 0.82)
                NO  → Predict: Medium risk (score: 0.61)
    NO  → Is age > 50?
                YES → Predict: Low-medium risk (score: 0.34)
                NO  → Predict: Low risk (score: 0.12)

One tree is brittle — a slight change in training data produces a very different tree.

Random Forest trains hundreds of trees on bootstrap samples and averages their predictions — reducing variance but using each tree independently.

Gradient Boosting takes a fundamentally different approach: trees are built sequentially, each one correcting the mistakes of all previous trees.

The Gradient Boosting Algorithm (Intuitive Version)

Start: Predict the same value for all patients (e.g., mean of y)
       Residuals = Actual y - Predicted y  ← these are your "mistakes"

Iteration 1:
  Train Tree_1 to predict the residuals
  Update predictions: ŷ ← ŷ + η × Tree_1(x)
  Compute new residuals

Iteration 2:
  Train Tree_2 on the new residuals
  Update predictions: ŷ ← ŷ + η × Tree_1(x) + η × Tree_2(x)
  ...

Continue for T iterations (trees)
Final prediction: F(x) = η × Σᵢ Treeᵢ(x)

η (eta) is the learning rate — a small number (0.01–0.3) that prevents overfitting by shrinking each tree’s contribution.

What Makes XGBoost “eXtreme”

XGBoost added several innovations over standard gradient boosting:

Regularization: Both L1 (on leaf weights) and L2 (on sum of squared leaf weights) regularization built into the tree-building objective.

Second-Order Optimization: Instead of fitting residuals (first-order gradient), XGBoost uses both gradient and Hessian (second-order Taylor expansion) for more precise tree construction.

Column Subsampling: Like Random Forest, XGBoost samples features for each tree — further reducing overfitting.

Sparse Awareness: XGBoost handles missing values natively by learning the best default direction for missing data at each split.

Parallelization: Tree construction is parallelized across CPU cores — hence “eXtreme.”

Training XGBoost on Phenome Data

import xgboost as xgb
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_auc_score, average_precision_score
import numpy as np

# Prepare DMatrix — XGBoost's optimized data format
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=feature_names)
dval   = xgb.DMatrix(X_val,   label=y_val,   feature_names=feature_names)

# Hyperparameters — the most important ones
params = {
    # Task
    'objective': 'binary:logistic',     # Binary classification
    'eval_metric': 'auc',
    'seed': 42,

    # Tree structure
    'max_depth': 6,          # How deep each tree can grow (3–8 typical)
    'min_child_weight': 5,   # Minimum samples in leaf (fights overfitting)
    'gamma': 0.1,            # Minimum gain for a split to happen

    # Regularization
    'reg_alpha': 0.1,        # L1 on leaf weights
    'reg_lambda': 1.0,       # L2 on leaf weights

    # Sampling
    'subsample': 0.8,        # Use 80% of training rows per tree
    'colsample_bytree': 0.8, # Use 80% of features per tree

    # Boosting
    'learning_rate': 0.05,   # Small η → need more trees, but more stable
    'n_estimators': 5000,    # Maximum trees (early stopping will cut this)
}

# Train with early stopping
evals_result = {}
model = xgb.train(
    params,
    dtrain,
    num_boost_round=5000,
    evals=[(dtrain, 'train'), (dval, 'val')],
    early_stopping_rounds=50,      # Stop if val AUC doesn't improve for 50 rounds
    evals_result=evals_result,
    verbose_eval=100
)

# Predict
y_pred_prob = model.predict(xgb.DMatrix(X_test, feature_names=feature_names))
print(f"Test AUC: {roc_auc_score(y_test, y_pred_prob):.4f}")
print(f"Best iteration: {model.best_iteration}")

Feature Importance in XGBoost

XGBoost provides three built-in feature importance metrics:

weight: How many times a feature is used to split across all trees
gain: Average improvement in loss brought by splits using this feature
cover: Average number of samples covered by splits using this feature

For biomarker discovery, gain is the most meaningful — it tells you which features actually improve predictions.

# Plot feature importance
importance = model.get_score(importance_type='gain')
importance_sorted = sorted(importance.items(), key=lambda x: x[1], reverse=True)

features_top20 = [x[0] for x in importance_sorted[:20]]
values_top20 = [x[1] for x in importance_sorted[:20]]

plt.figure(figsize=(10, 8))
plt.barh(features_top20[::-1], values_top20[::-1])
plt.xlabel('Average Gain')
plt.title('Top 20 Features by XGBoost Gain')
plt.tight_layout()

Early Stopping: Why It Matters

Without early stopping, XGBoost with 5,000 trees will memorize the training data. The validation curve reveals the “sweet spot”:

Loss
  ^
  |\ Training loss
  | \──────────────────────── (keeps decreasing)
  |  \  
  |   \  Validation loss
  |    \──────────────\─────── (starts increasing = overfitting)
  |     \              \
  |      \              Best model ← early_stopping_rounds finds this
  +─────────────────────────────────────────────────────> Iteration

Hyperparameter Tuning: A Practical Guide

The tuning priority order (most impactful first):

n_estimators + learning_rate — Start with 0.1, use early stopping
max_depth — Typical range: 3–8. Lower for regularization, higher for complex interactions
min_child_weight — Critical for imbalanced data. Higher values reduce leaf specificity
subsample + colsample_bytree — Typical range: 0.6–1.0
gamma — Minimum gain required to make a split (acts as tree pruning)
reg_alpha + reg_lambda — Fine-tune regularization last

# Bayesian optimization with Optuna
import optuna

def objective(trial):
    params = {
        'objective': 'binary:logistic',
        'eval_metric': 'auc',
        'max_depth': trial.suggest_int('max_depth', 3, 8),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 20),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-4, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-4, 10.0, log=True),
        'gamma': trial.suggest_float('gamma', 0, 5),
    }
    
    # Cross-validation
    cv_results = xgb.cv(
        params, dtrain,
        num_boost_round=500,
        nfold=5,
        early_stopping_rounds=30,
        metrics='auc',
        seed=42
    )
    return cv_results['test-auc-mean'].max()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100)
print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

Notebook Comparison: LightGBM, CatBoost, TabNet

LightGBM (Microsoft, 2017)

Uses Leaf-wise tree growth instead of level-wise — finds the split that reduces loss most, regardless of depth
Often 10–20× faster than XGBoost on large datasets
GOSS (Gradient-based One-Side Sampling): focuses more gradient computation on hard examples
Excellent for large genomic datasets (100K+ samples, millions of SNPs)

CatBoost (Yandex, 2018)

Designed for categorical features — uses ordered target statistics instead of one-hot encoding
Symmetric trees: each level of the tree uses the same split, which aids regularization and fast inference
Often requires less tuning than XGBoost or LightGBM
Best choice when your phenome data contains many categorical variables (ICD codes, medication classes)

TabNet (Google, 2019) and FT-Transformer

Neural approaches to tabular data
TabNet uses sequential attention to select features step-by-step
FT-Transformer uses transformer attention over feature embeddings
When deep learning beats GBM on tabular: data has >100K samples, many categorical features with high cardinality, or when transfer learning from pretrained models is possible

Rule of thumb:
  n < 10K:    XGBoost / LightGBM almost always wins
  n = 10–100K: XGBoost / LightGBM likely wins, but TabNet worth trying
  n > 100K:   LightGBM fastest; TabNet/FT-Transformer competitive
  Many categoricals: CatBoost often best

⚖️ Chapter 6: Part B — Making Models Trustworthy

6.1 Imbalanced Classification: Type 1 Diabetes (~2,000 cases)

The Problem: Rare Disease in a Sea of Healthy Controls

Type 1 Diabetes (T1D) is an autoimmune disease affecting approximately 0.5–1% of the general population. In a dataset of 200,000 individuals, you might have only 2,000 T1D cases and 198,000 controls.

Class balance:
  T1D (positive class):  2,000 patients   → 1%
  Controls (negative):  198,000 patients  → 99%

Class imbalance ratio: 1:99

This creates a catastrophic trap for naive models:

# A model that ALWAYS predicts "healthy" achieves:
accuracy = 198000 / 200000 = 99.0%

# But it identifies ZERO T1D patients!
recall = 0 / 2000 = 0.0%
precision = undefined (0 positive predictions)

Accuracy is a completely useless metric for rare disease classification.

The correct metrics for imbalanced clinical problems:

AUROC (Area Under ROC Curve): measures overall discrimination across all thresholds
AUPRC (Area Under Precision-Recall Curve): more sensitive to rare class performance
Sensitivity/Recall: “Of all T1D patients, what fraction did we catch?”
Specificity: “Of all healthy people, what fraction did we correctly label?”
F1 Score: Harmonic mean of precision and recall

The Main Approach: Class-Weighted Loss

Core idea: When computing the loss function, give each T1D case 99× more weight than each control. This forces the model to pay equal total attention to both classes.

from sklearn.utils.class_weight import compute_class_weight
import numpy as np

# Compute balanced class weights automatically
classes = np.unique(y_train)
weights = compute_class_weight('balanced', classes=classes, y=y_train)
class_weight_dict = {0: weights[0], 1: weights[1]}
# → {0: 0.505, 1: 50.0}  (T1D cases get 99× more weight)

# Apply to XGBoost
scale_pos_weight = (y_train == 0).sum() / (y_train == 1).sum()
# = 198,000 / 2,000 = 99

xgb_model = xgb.XGBClassifier(
    scale_pos_weight=scale_pos_weight,  # XGBoost's way of handling imbalance
    n_estimators=500,
    max_depth=5,
    learning_rate=0.05,
    random_state=42
)
xgb_model.fit(X_train, y_train)

Why class weighting is preferred over resampling:

Does not distort the data distribution — you are not creating fake patients or throwing away real ones
Mathematically equivalent to resampling when done correctly, but more stable
Preserves all information in the original dataset
Transparent: the weighting is explicit and documented

Notebook Comparison: SMOTE, Focal Loss, CTGAN

SMOTE (Synthetic Minority Over-sampling Technique)

SMOTE creates synthetic T1D cases by interpolating between existing ones:

For each minority sample x:
  1. Find k nearest neighbors in feature space
  2. Randomly select one neighbor x_neighbor
  3. Create synthetic sample: x_new = x + rand(0,1) × (x_neighbor - x)

Advantages: Increases minority class size, reducing overfitting to few real examples

Dangers in clinical/omics research:

Synthetic patients are biologically meaningless — you cannot linearly interpolate between two real patients’ metabolome profiles
Can introduce artifacts that models learn instead of real biological signals
ADASYN (Adaptive Synthetic Sampling) is a smarter variant — generates more synthetic samples in harder-to-classify regions

from imblearn.over_sampling import SMOTE, ADASYN

smote = SMOTE(sampling_strategy=0.1, k_neighbors=5, random_state=42)
X_resampled, y_resampled = smote.fit_resample(X_train, y_train)

Focal Loss

Focal Loss (Lin et al., 2017, originally for object detection) modifies cross-entropy to focus training on hard examples:

Standard cross-entropy: CE(p, y) = -y·log(p) - (1-y)·log(1-p)

Focal Loss: FL(p, y) = -(1 - p_t)^γ · log(p_t)

Where:
  p_t = p if y=1 else (1-p)
  γ (gamma) = focusing parameter (typically 2)

When a sample is correctly classified with high confidence (p_t ≈ 1), the factor (1-p_t)^γ ≈ 0, meaning the loss is nearly zero — easy examples contribute little to training.

When a sample is misclassified or uncertain (p_t ≈ 0.5), the factor is ≈ 0.25, amplifying those examples’ contribution.

CTGAN (Conditional Tabular GAN)

CTGAN is a generative adversarial network designed specifically for tabular data. It can generate statistically realistic synthetic T1D patients by learning the joint distribution of all features conditioned on the T1D label.

from ctgan import CTGAN

# Train on real T1D cases only
X_t1d = X_train[y_train == 1]
df_t1d = pd.DataFrame(X_t1d, columns=feature_names)

ctgan = CTGAN(epochs=300, verbose=True)
ctgan.fit(df_t1d)

# Generate 10,000 synthetic T1D patients
synthetic_t1d = ctgan.sample(10000)

Important caveat: Synthetic patients from CTGAN are useful for augmenting training data but should never be mixed with real patients in validation or test sets.

6.2 Interpretability: SHAP Values

The Black Box Problem in Medicine

You train an XGBoost model. It achieves AUROC 0.91. You show it to a cardiologist.

She asks: “Which features drove this patient’s high-risk prediction?”

You say: “The model uses 847 features and 500 trees. I can’t tell you.”

She says: “Then I cannot use this in clinical practice.”

This is the interpretability crisis in clinical ML. And SHAP solves it.

The Main Model: SHAP (SHapley Additive exPlanations)

SHAP, developed by Scott Lundberg and Su-In Lee (2017), provides a theoretically principled framework for explaining any ML model’s predictions.

The Game Theory Foundation

SHAP values are inspired by Shapley values from cooperative game theory (Lloyd Shapley, 1953 — Nobel Prize 2012).

The analogy:

A team of workers (features) collaborates to produce a profit (model prediction)
How much credit does each worker deserve?
Shapley’s answer: calculate each worker’s marginal contribution averaged over all possible coalitions

Translated to ML:

Coalition = subset of features provided to the model
Contribution = how much a feature’s presence changes the prediction

For feature j, the SHAP value is:

φ_j = Σ_{S ⊆ F\{j}}  |S|!(|F|-|S|-1)!/|F|!  ×  [f(S∪{j}) - f(S)]
       all subsets                               ─────────────────────
       not containing j                          marginal contribution
                                                 of adding feature j

This guarantees four desirable properties:

Efficiency: SHAP values sum to the prediction minus baseline
Symmetry: Features with equal impact get equal SHAP values
Dummy: Features that never matter get SHAP = 0
Additivity: SHAP values from different models can be combined

TreeSHAP: Efficient for Tree Models

Computing exact Shapley values naively requires exponentially many model evaluations. For tree models, Lundberg et al. devised TreeSHAP — an algorithm that computes exact SHAP values in polynomial time.

import shap

# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)

# Global: What features matter most overall?
shap.summary_plot(
    shap_values, X_test,
    feature_names=feature_names,
    plot_type='dot',    # Shows both importance and direction
    max_display=20
)

Reading the SHAP summary plot:

Y-axis: Features ordered by mean |SHAP value| (most important at top)
X-axis: SHAP value (negative = pushes toward class 0, positive = pushes toward class 1)
Color: Feature value (red = high, blue = low)

Example reading: "High BCAA level (red) → positive SHAP (pushes toward T1D risk)"

Individual Patient Explanation: Waterfall Plot

# Explain a single patient
patient_idx = 42

shap.waterfall_plot(
    shap.Explanation(
        values=shap_values[patient_idx],
        base_values=explainer.expected_value,
        data=X_test[patient_idx],
        feature_names=feature_names
    )
)

This produces a waterfall chart showing how each feature pushed the prediction away from the baseline (average prediction):

Baseline (average): 0.23
+ Leucine = 284 μM:        +0.18  →  0.41
+ Age = 45:                +0.09  →  0.50
+ HOMA-IR = 3.2:           +0.07  →  0.57
- HDL = 1.8 mmol/L:       -0.05  →  0.52
+ Isoleucine = 198 μM:     +0.12  →  0.64
...
Final prediction:           0.78  (High T1D risk)

This is clinically actionable: the doctor can see exactly which measurements drove the risk score and discuss them with the patient.

SHAP for Biomarker Discovery: Feature Clustering

SHAP values can be used for unsupervised biomarker clustering — grouping features by their pattern of effects across patients:

# Cluster features by SHAP correlation
shap_df = pd.DataFrame(shap_values, columns=feature_names)
feature_correlation = shap_df.corr()

# Features with similar SHAP patterns likely represent the same biological pathway
import seaborn as sns
sns.clustermap(
    feature_correlation,
    method='ward',
    cmap='RdBu_r',
    center=0,
    figsize=(16, 16)
)

This reveals biological pathway clusters — metabolites from BCAA metabolism cluster together, lipid metabolism features cluster together — providing pathway-level insights even when individual feature significance is noisy.

📡 Chapter 7: Part C — Time-Series, Biosignals, and Survival

7.1 Biosignal Time-Series: Ventilator Waveforms

Mechanical Ventilation: A Life-or-Death Signal

In the ICU, mechanical ventilators breathe for patients who cannot breathe on their own. They generate continuous waveforms:

Pressure waveform: airway pressure over time (cmH₂O)
Flow waveform: airflow into/out of lungs (L/min)
Volume waveform: cumulative air volume per breath (mL)

These waveforms are sampled at 100 Hz — 100 measurements per second — producing enormous streams of continuous data.

Why ML matters here: Subtle abnormalities in waveform shape — patient-ventilator asynchrony, dynamic hyperinflation, flow starvation — indicate distress, lung injury, or need for ventilator adjustment. Many of these are visually subtle and easy for even experienced clinicians to miss.

Data Preprocessing: Windowing

Before modeling, continuous waveforms must be segmented into fixed-length windows:

import numpy as np

def create_windows(signal, window_size=500, stride=50, fs=100):
    """
    signal:      1D array — ventilator waveform
    window_size: samples per window (500 @ 100Hz = 5 seconds)
    stride:      step between consecutive windows (50 = 90% overlap)
    fs:          sampling frequency (Hz)
    """
    windows = []
    start_times = []
    
    for i in range(0, len(signal) - window_size, stride):
        window = signal[i:i + window_size]
        windows.append(window)
        start_times.append(i / fs)  # in seconds
    
    return np.array(windows), np.array(start_times)

# Each window: [n_samples, window_size]
# → Feed to model as [batch, channels, length]

Window size choice: A tradeoff between:

Short windows: Faster response, but less temporal context (may miss multi-breath patterns)
Long windows: More context, but higher computational cost and may span multiple conditions

The Main Model: 1D-CNN

Convolutional Neural Networks were designed for images — detecting local patterns (edges, textures) that appear anywhere in the image. 1D-CNNs apply the same idea to sequential signals.

The Core Operation: 1D Convolution

A 1D convolution slides a small kernel (learnable filter) across the signal, computing the inner product at each position:

Signal:  [1.2, 2.3, 3.1, 2.8, 1.9, 2.1, 3.4, 2.7, ...]
Kernel:  [-1, 0, +1]  ← edge detector

Position 0: (-1)(1.2) + (0)(2.3) + (+1)(3.1) = 1.9  ← rising edge
Position 1: (-1)(2.3) + (0)(3.1) + (+1)(2.8) = 0.5
Position 2: (-1)(3.1) + (0)(2.8) + (+1)(1.9) = -1.2 ← falling edge
...

A bank of kernels learns to detect different temporal patterns: peaks, troughs, gradual rises, sharp drops.

1D-CNN Architecture for Ventilator Waveforms

import torch
import torch.nn as nn

class VentilatorCNN(nn.Module):
    def __init__(self, n_classes=2, window_size=500):
        super().__init__()
        
        # Convolutional blocks — extract local patterns
        self.conv_blocks = nn.Sequential(
            # Block 1: detect fine-grained features (short-term)
            nn.Conv1d(in_channels=3,   # pressure, flow, volume
                      out_channels=32,
                      kernel_size=5,   # 5 samples = 50ms at 100Hz
                      padding=2),
            nn.BatchNorm1d(32),
            nn.ReLU(),
            nn.MaxPool1d(kernel_size=2, stride=2),  # Downsample 2×
            
            # Block 2: detect medium-term patterns (individual breath features)
            nn.Conv1d(32, 64, kernel_size=7, padding=3),
            nn.BatchNorm1d(64),
            nn.ReLU(),
            nn.MaxPool1d(2, 2),
            
            # Block 3: detect long-term patterns (multi-breath trends)
            nn.Conv1d(64, 128, kernel_size=9, padding=4),
            nn.BatchNorm1d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool1d(16),   # Compress to 16 time steps
        )
        
        # Classification head
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(128 * 16, 256),
            nn.ReLU(),
            nn.Dropout(0.5),
            nn.Linear(256, n_classes)
        )
    
    def forward(self, x):
        # x shape: [batch, channels, time]
        features = self.conv_blocks(x)
        return self.classifier(features)

model = VentilatorCNN(n_classes=2)
print(f"Parameters: {sum(p.numel() for p in model.parameters()):,}")

Anomaly Detection: Isolation Forest

Beyond classification, we want to detect abnormal waveforms automatically — catching dangerous patterns without needing labeled examples of every possible anomaly.

Isolation Forest is an elegant unsupervised anomaly detection algorithm:

Core insight: Anomalous points are rare and different — they are easy to isolate by random partitioning.

Build an isolation tree:
  1. Randomly select a feature
  2. Randomly select a split value between feature min and max
  3. Recursively partition until each point is isolated

Anomaly score: Points requiring FEW splits to isolate are anomalies
               Points requiring MANY splits are normal (surrounded by neighbors)

from sklearn.ensemble import IsolationForest
import numpy as np

# Extract features from windows (e.g., CNN embeddings, or handcrafted stats)
def extract_waveform_features(windows):
    """Extract statistical features from each waveform window"""
    features = []
    for w in windows:
        features.append([
            np.mean(w),
            np.std(w),
            np.max(w) - np.min(w),      # Peak-to-peak amplitude
            np.sum(np.diff(np.sign(np.diff(w))) != 0),  # Zero crossings
            np.percentile(w, 10),
            np.percentile(w, 90),
        ])
    return np.array(features)

X_features = extract_waveform_features(windows)

# Fit Isolation Forest
iforest = IsolationForest(
    n_estimators=200,
    contamination=0.05,   # Expect ~5% anomalies
    random_state=42
)
iforest.fit(X_features)

# Score all windows (more negative = more anomalous)
anomaly_scores = iforest.score_samples(X_features)
anomaly_labels = iforest.predict(X_features)  # -1 = anomaly, +1 = normal

Notebook Comparison: LSTM, TCN, Wavelet Features

LSTM (Long Short-Term Memory) LSTMs model sequential dependencies with gates that control information flow:

Forget gate: What past information to discard
Input gate: What new information to store
Output gate: What to output at this step

LSTMs excel at long-range dependencies — patterns that span many time steps. They are computationally expensive and can be tricky to train. For waveform classification, 1D-CNN often outperforms LSTM while being faster.

TCN (Temporal Convolutional Network) TCNs use dilated causal convolutions to capture long-range patterns without the gating complexity of LSTMs:

Standard Conv: kernel sees adjacent time steps
Dilated Conv (dilation=4): kernel sees every 4th time step
Dilated Conv (dilation=16): kernel sees every 16th time step

Stacking dilations: each layer doubles the receptive field
→ Exponentially large context with linear parameters

TCNs are generally faster than LSTMs and easier to train (no vanishing gradient problem).

Wavelet and FFT Features for Tabular Models Before deep learning, biosignal features were handcrafted:

import scipy.signal as signal
import pywt

# FFT features: frequency domain
def fft_features(window, fs=100):
    freqs = np.fft.rfftfreq(len(window), 1/fs)
    fft_mag = np.abs(np.fft.rfft(window))
    return {
        'dominant_freq': freqs[np.argmax(fft_mag)],
        'spectral_entropy': -np.sum(fft_mag * np.log(fft_mag + 1e-9)),
        'power_0_2Hz': np.sum(fft_mag[(freqs < 2)]),
        'power_2_5Hz': np.sum(fft_mag[(freqs >= 2) & (freqs < 5)]),
    }

# Wavelet features: time-frequency decomposition
def wavelet_features(window, wavelet='db4', level=4):
    coeffs = pywt.wavedec(window, wavelet, level=level)
    features = {}
    for i, c in enumerate(coeffs):
        features[f'wavelet_L{i}_mean'] = np.mean(np.abs(c))
        features[f'wavelet_L{i}_energy'] = np.sum(c**2)
        features[f'wavelet_L{i}_std'] = np.std(c)
    return features

These features can be fed into XGBoost/LightGBM when deep learning models are impractical.

7.2 Survival Analysis: Hypertension Risk Stratification (130K patients)

Why Standard Classification Fails for Survival

Imagine you want to predict cardiovascular events in 130,000 hypertensive patients.

You have 10 years of follow-up. Some patients had events (heart attack, stroke). Others are still event-free at the last observation. Some dropped out of the study after 3 years.

The censoring problem:

A patient who had no event after 5 years of follow-up is NOT the same as a patient who had no event after 10 years of follow-up
A patient who was lost to follow-up at year 3 provides valuable information up to year 3, but you do not know their outcome afterward

Standard binary classification throws away this time information — treating “event at year 1” and “event at year 9” as equivalent, and discarding censored patients entirely.

Survival analysis handles this correctly.

Key Concepts in Survival Analysis

Survival Function S(t):
The probability that an individual survives (does not experience the event) past time t.
S(t) is always between 0 and 1, starts at S(0) = 1, and is non-increasing.

Hazard Function h(t):
The instantaneous rate of the event occurring at time t, given survival to time t.
Think of it as: “How dangerous is this moment, if you made it this far?”

Cumulative Hazard H(t):
The total accumulated risk from time 0 to time t.

The Main Model: Cox Proportional Hazards (Cox-PH)

The Cox model (David Cox, 1972) is the workhorse of clinical survival analysis. Its elegance lies in modeling the hazard as a product of a baseline hazard and a feature-dependent term:

h(t | x) = h₀(t) × exp(β₁x₁ + β₂x₂ + ... + βₚxₚ)

Where:
  h₀(t)  = baseline hazard (unspecified — "semi-parametric")
  β       = coefficients we estimate
  x       = patient features

The "proportional hazards" assumption:
  The ratio of hazards between two patients is CONSTANT over time
  (Feature effects don't change as time progresses)

The Hazard Ratio: Clinical Interpretation

The key output of Cox regression is the hazard ratio (HR):

HR for feature j = exp(βⱼ)

Interpretation:
  HR = 2.5 → patients with one unit higher feature j have
              2.5× the instantaneous risk of the event
              
  HR = 0.7 → patients with one unit higher feature j have
              30% LOWER risk (protective effect)
              
  HR = 1.0 → no effect

Clinical example:

Cox model for cardiovascular event in hypertensives:

Feature                      HR      95% CI          p-value
─────────────────────────────────────────────────────────────
Systolic BP (per 10 mmHg)   1.28   [1.19, 1.38]    <0.001
Age (per 10 years)          1.67   [1.55, 1.80]    <0.001
Diabetes (yes vs no)        2.14   [1.89, 2.42]    <0.001
Smoking (current vs never)  1.73   [1.54, 1.95]    <0.001
ACE inhibitor use           0.81   [0.73, 0.90]    <0.001

Reading: Diabetic patients have 2.14× higher CV event rate
         compared to non-diabetics, at any given time point,
         holding other factors constant.

Implementing Cox-PH

from lifelines import CoxPHFitter
from lifelines.statistics import logrank_test

# Data format for survival analysis
# T = time to event (or censoring)
# E = event indicator (1 = event occurred, 0 = censored)

df_survival = pd.DataFrame({
    'T': time_to_event,    # In months
    'E': event_occurred,   # 1 = cardiovascular event, 0 = censored
    'systolic_bp': systolic_bp,
    'age': age,
    'diabetes': diabetes,
    'smoking': smoking,
    'acei_use': acei_use,
})

# Fit Cox model
cph = CoxPHFitter(penalizer=0.1)  # L2 regularization
cph.fit(df_survival, duration_col='T', event_col='E')

# Summary table
cph.print_summary()

# Check proportional hazards assumption
cph.check_assumptions(df_survival, p_value_threshold=0.05)

# Predict survival curve for a new patient
new_patient = pd.DataFrame({
    'systolic_bp': [155],
    'age': [62],
    'diabetes': [1],
    'smoking': [0],
    'acei_use': [1],
})
survival_func = cph.predict_survival_function(new_patient)

plt.figure(figsize=(10, 6))
survival_func.plot()
plt.xlabel('Time (months)')
plt.ylabel('Survival probability')
plt.title('Predicted survival curve for high-risk patient')
plt.ylim(0, 1)

Kaplan-Meier Curves: Visualizing Survival

Before fitting any model, always explore your data with Kaplan-Meier (KM) curves — non-parametric estimates of the survival function for each group:

from lifelines import KaplanMeierFitter

fig, ax = plt.subplots(figsize=(10, 6))

for diabetic in [0, 1]:
    mask = df_survival['diabetes'] == diabetic
    kmf = KaplanMeierFitter()
    kmf.fit(
        df_survival.loc[mask, 'T'],
        df_survival.loc[mask, 'E'],
        label='Diabetic' if diabetic else 'Non-diabetic'
    )
    kmf.plot_survival_function(ax=ax, ci_show=True)

# Log-rank test
results = logrank_test(
    df_survival.loc[df_survival['diabetes']==1, 'T'],
    df_survival.loc[df_survival['diabetes']==0, 'T'],
    event_observed_A=df_survival.loc[df_survival['diabetes']==1, 'E'],
    event_observed_B=df_survival.loc[df_survival['diabetes']==0, 'E']
)
ax.set_title(f'Cardiovascular Event-Free Survival\nLog-rank p = {results.p_value:.4f}')

Notebook Comparison: Random Survival Forest, DeepSurv

Random Survival Forest (RSF)

RSF extends Random Forest to survival data using the log-rank splitting criterion — at each node, it finds the split that maximizes the difference in survival distributions between left and right children.

from sksurv.ensemble import RandomSurvivalForest
import numpy as np

# sksurv expects structured array for the outcome
y_survival = np.array(
    [(bool(e), t) for e, t in zip(event_occurred, time_to_event)],
    dtype=[('event', bool), ('time', float)]
)

rsf = RandomSurvivalForest(
    n_estimators=300,
    min_samples_split=10,
    min_samples_leaf=15,
    max_features='sqrt',
    n_jobs=-1,
    random_state=42
)
rsf.fit(X_train, y_survival_train)

# Concordance index (C-index): main metric for survival models
c_index = rsf.score(X_test, y_survival_test)
print(f"C-index: {c_index:.4f}")

C-index (Concordance Index): For a randomly selected pair of patients where one had the event earlier, what fraction of the time does the model correctly rank the earlier-event patient as higher risk? C-index = 0.5 is random; 1.0 is perfect.

Advantages over Cox: No proportional hazards assumption; captures nonlinear interactions and complex feature effects; often outperforms Cox on complex clinical data.

DeepSurv

DeepSurv (Katzman et al., 2018) is a neural network that generalizes the Cox model:

# DeepSurv architecture
# Instead of linear combination: h(t|x) = h₀(t) × exp(β·x)
# We use: h(t|x) = h₀(t) × exp(NN(x))
# The neural network learns nonlinear transformations of features

import torch
import torch.nn as nn

class DeepSurv(nn.Module):
    def __init__(self, n_features, hidden_dims=[256, 128, 64]):
        super().__init__()
        
        layers = []
        prev_dim = n_features
        for dim in hidden_dims:
            layers.extend([
                nn.Linear(prev_dim, dim),
                nn.BatchNorm1d(dim),
                nn.ReLU(),
                nn.Dropout(0.3),
            ])
            prev_dim = dim
        
        layers.append(nn.Linear(prev_dim, 1))  # Output: log risk score
        self.network = nn.Sequential(*layers)
    
    def forward(self, x):
        return self.network(x).squeeze()
    
    def partial_log_likelihood(self, risk_scores, events, times):
        """Cox partial likelihood loss"""
        # Sort by descending time
        sort_idx = torch.argsort(times, descending=True)
        risk_scores = risk_scores[sort_idx]
        events = events[sort_idx]
        
        # For each event, compute log of ratio: event risk / sum of all at-risk risks
        log_cumsum_exp = torch.logcumsumexp(risk_scores, dim=0)
        loss = -(risk_scores - log_cumsum_exp)[events.bool()].mean()
        return loss

🏗️ Chapter 8: Putting It All Together — A Research Pipeline

8.1 The Complete Workflow

Raw Data Sources
  ├── Biobank Phenome Dataset (tabular)
  ├── EHR / EMR Records (clinical codes, labs, meds)
  ├── Omics Assays (metabolomics, genomics)
  └── ICU Biosignals (ventilator waveforms)
           │
           ▼
   Data Engineering
  ├── Quality Control & Normalization
  ├── Batch Effect Correction
  ├── Missing Value Imputation
  └── Feature Engineering
           │
           ▼
   Feature Selection (LASSO, Boruta, mRMR)
  └── Reduced, stable biomarker set
           │
           ▼
   Model Training
  ├── XGBoost / LightGBM (tabular)
  ├── 1D-CNN / LSTM / TCN (biosignal)
  └── Cox-PH / RSF / DeepSurv (survival)
           │
           ▼
   Validation
  ├── Internal (cross-validation, holdout)
  ├── External (independent cohort)
  └── Clinical calibration (Brier score, calibration curves)
           │
           ▼
   Interpretation
  ├── SHAP global (biomarker importance)
  ├── SHAP local (individual patient explanation)
  └── Pathway enrichment analysis
           │
           ▼
   Clinical Translation
  ├── Risk score calculator
  ├── Decision support tool
  └── Biomarker panel validation (wet lab)

8.2 The Evaluation Framework

Classification Metrics

from sklearn.metrics import (
    roc_auc_score, average_precision_score,
    classification_report, brier_score_loss,
    RocCurveDisplay, PrecisionRecallDisplay
)

def comprehensive_evaluation(y_true, y_pred_prob, threshold=0.5):
    y_pred = (y_pred_prob >= threshold).astype(int)
    
    print("=" * 50)
    print("CLASSIFICATION REPORT")
    print("=" * 50)
    print(classification_report(y_true, y_pred, target_names=['Control', 'Case']))
    
    print(f"\nAUROC:  {roc_auc_score(y_true, y_pred_prob):.4f}")
    print(f"AUPRC:  {average_precision_score(y_true, y_pred_prob):.4f}")
    print(f"Brier:  {brier_score_loss(y_true, y_pred_prob):.4f}")
    
    # Net Reclassification Improvement, Integrated Discrimination Improvement
    # (For comparison against a clinical baseline model)

Calibration: Does the Model Know What It Doesn’t Know?

A model that predicts “80% probability” should be right 80% of the time. Models that are systematically over- or under-confident mislead clinical decision-making.

from sklearn.calibration import calibration_curve, CalibratedClassifierCV

# Plot calibration curve
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_true, y_pred_prob, n_bins=10
)

plt.figure(figsize=(8, 6))
plt.plot(mean_predicted_value, fraction_of_positives, 's-', label='XGBoost')
plt.plot([0, 1], [0, 1], 'k--', label='Perfect calibration')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction of positives')
plt.title('Calibration Curve')
plt.legend()

8.3 Common Pitfalls in Clinical ML Research

Data Leakage

The most dangerous mistake in clinical ML: information from the future leaks into your training features.

Common sources:

Temporal leakage: Using lab values that were only measured because the patient was already diagnosed
Processing leakage: Fitting normalization parameters on the entire dataset before splitting
Duplicate leakage: The same patient appears in both training and test sets

# WRONG: Scales entire dataset before splitting
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)  # ← LEAKAGE
X_train, X_test = train_test_split(X_scaled)

# CORRECT: Fit scaler only on training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)   # Fit on train only
X_test_scaled  = scaler.transform(X_test)         # Apply to test

P-Hacking and Feature Mining

With 5,000 metabolites and α = 0.05, you expect 250 false positives by chance alone. Without correction:

Use Bonferroni correction (strict) or Benjamini-Hochberg FDR correction (less strict, better powered)
Replicate in an independent cohort before reporting

The Validation Hierarchy

Not all validation is equal:

Level 1: Internal CV on development dataset          (necessary, not sufficient)
Level 2: Holdout test set (same institution)         (good, but optimistic)
Level 3: External validation (different institution)  (required for publication)
Level 4: Prospective clinical validation              (required for clinical use)
Level 5: Randomized trial comparing ML-guided vs     (gold standard)
          standard care

Most published ML models never progress beyond Level 2. The ones that transform clinical practice reach Level 4–5.

🌟 Conclusion: The Future of Omics Machine Learning

We stand at an inflection point in biological discovery. The tools now exist to:

Read the complete molecular state of a human being with unprecedented depth
Identify signals in data that would be invisible to any single human analyst
Explain predictions in terms clinicians and patients can understand
Learn from millions of patient experiences to personalize medicine for individual people

The methods in this guide — LASSO, XGBoost, SHAP, Cox-PH, 1D-CNN — are not temporary tools. They are the vocabulary of a new discipline that sits at the intersection of molecular biology, clinical medicine, and machine learning.

What makes this field special is that the stakes are real. The biomarker you discover might be the one that leads to a diagnostic test. The risk model you build might be the one that changes screening guidelines. The waveform anomaly detector you deploy might be the one that saves a life in the ICU at 3 AM.

The genome encoded the instructions. The phenome is the living expression of them. Machine learning is how we finally learn to read both — and act on what we find.

📚 Further Reading

Foundations

Hartl & Clark — Principles of Population Genetics (population genetics)
Alberts et al. — Molecular Biology of the Cell (cellular biology)
Strimmer & von Haeseler — Statistical Genomics (statistical foundations)

Machine Learning

Hastie, Tibshirani & Friedman — The Elements of Statistical Learning (LASSO, boosting, survival)
Molnar — Interpretable Machine Learning (free online, SHAP chapter excellent)
Chen & Guestrin (2016) — XGBoost: A Scalable Tree Boosting System

Clinical Applications

Obermeyer & Emanuel (2016) — “Predicting the Future — Big Data, Machine Learning, and Clinical Medicine” (NEJM)
Lundberg et al. (2020) — “From local explanations to global understanding with explainable AI for trees” (Nature Machine Intelligence)
Katzman et al. (2018) — “DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network”

Omics & Metabolomics

Wishart et al. — Human Metabolome Database (hmdb.ca)
Wishart (2016) — “Emerging applications of metabolomics in drug discovery and precision medicine” (Nature Reviews Drug Discovery)

This blog post was written for students and researchers entering the field of computational biomedicine. All code examples are illustrative and should be adapted to your specific dataset and research question.

Last updated on 2026