Introduction to MCMC

Why Do We Need MCMC?

In short: because many distributions are known only in unnormalized form, which makes traditional sampling and numerical-integration methods ineffective.

Goal: Sample from a complex distribution $\pi(x)$ (often only known as “unnormalized” $\tilde\pi(x)\propto \pi(x)$), or compute expectation/marginal:

$$ \mathbb{E}_\pi[f(X)] \;=\; \int f(x)\,\pi(x)\,dx. $$

Difficulties in Practice:

  • High Dimensionality: as the dimension grows, the cost of grid-based numerical integration explodes exponentially;
  • Unknown Normalization Constant: In Bayesian context, the denominator $p(y)=\int p(y\mid \theta)p(\theta)\,d\theta$ in posterior $\pi(\theta\mid y)\propto p(y\mid \theta)p(\theta)$ is often intractable;
  • Multi-modality / Strong Correlation: the variance of Rejection Sampling or Importance Sampling blows up, or the importance weights degenerate.

Core Idea of Monte Carlo: Given samples $x^{(1)},\dots,x^{(T)}\sim \pi$, we can use the sample mean

$$ \frac{1}{T}\sum_{t=1}^T f\!\big(x^{(t)}\big) $$

to approximate $\mathbb{E}_\pi[f(X)]$. The problem is: how do we obtain samples from $\pi$ in the first place? This is exactly what MCMC solves: without requiring the normalization constant, and needing only the ability to evaluate $\tilde\pi(x)$ (or its log), we can construct a stochastic process that “stays at $\pi$ in the long run” and use it to draw samples.
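
A minimal sketch of the plain Monte Carlo estimator, using a distribution we can already sample from directly (a standard normal, for which $\mathbb{E}[X^2]=1$), just to show the mechanics; later, MCMC will supply the samples when direct sampling is impossible:

import numpy as np

rng = np.random.default_rng(0)

# Plain Monte Carlo: average f(x) over i.i.d. samples from pi
# Here pi = N(0, 1) and f(x) = x^2, so the true expectation is 1
T = 100_000
samples = rng.normal(size=T)
estimate = np.mean(samples**2)

print(f"Monte Carlo estimate of E[X^2]: {estimate:.4f}  (true value: 1)")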

Comparison with Other Approaches (Intuition):

  • Variational Inference (VI): Fast, scalable, but approximates with a “tractable family”, introducing approximation bias;
  • SMC/Particle Methods: Suitable for sequential problems, but design and resampling/annealing are complex;
  • MCMC: Asymptotically Unbiased (can approximate $\pi$ arbitrarily well if run long enough), but samples are correlated, computation can be expensive, and it requires diagnostics/tuning.

Example

Suppose we want to sample from the following distribution:

$$ \pi(x) \propto e^{-x^4}, \quad x \in \mathbb{R}. $$
  • This is a unimodal distribution with a flat top and very fast-decaying (lighter-than-Gaussian) tails.
  • The normalization constant $Z=\int e^{-x^4}\,dx$ has no elementary closed form, so in practice we treat it as unknown.
  • Want to compute expectation $\mathbb{E}[X^2]$.

👉 Problem:

  • Direct analytic integration is intractable, and grid-based quadrature does not scale to higher-dimensional versions of the problem.
  • Rejection Sampling needs a “suitable envelope function”; even in this one-dimensional case choosing a tight envelope takes some care, and in higher dimensions the acceptance rate typically collapses.

👉 Intuitive Conclusion: This is where MCMC shines: as long as we can compute the unnormalized density $\tilde\pi(x)=e^{-x^4}$, we can design a Markov chain whose long-run distribution is the corresponding normalized $\pi$.

# Plot e^{-x^4}
import numpy as np
import matplotlib.pyplot as plt

# Domain range (-5 to 5 is enough to see main shape)
x = np.linspace(-5, 5, 4001) 
y = np.exp(-x**4)

plt.figure(figsize=(7, 4))
plt.plot(x, y, lw=2)
plt.title(r"Curve of $e^{-x^4}$")
plt.xlabel("x")
plt.ylabel(r"$e^{-x^4}$")
plt.grid(True, ls="--", alpha=0.4)
plt.tight_layout()
plt.show()

# Normalize e^{-x^4} to make it a Probability Density Function (PDF)
Z = np.trapezoid(y, x)                    # Numerical integration (np.trapezoid needs NumPy >= 2.0; use np.trapz on older versions)
pdf = y / Z

plt.figure(figsize=(7, 4))
plt.plot(x, pdf, lw=2, label="Normalized $e^{-x^4}$ (PDF)")
plt.title(r"Normalized PDF from $e^{-x^4}$")
plt.xlabel("x")
plt.ylabel("Density")
plt.grid(True, ls="--", alpha=0.4)
plt.legend()
plt.tight_layout()
plt.show()

[Figure: curve of $e^{-x^4}$]

[Figure: the normalized PDF obtained from $e^{-x^4}$]
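
Since the problem is one-dimensional, we can also get a numerical reference value for $\mathbb{E}[X^2]$ directly from the grid (this continues the code above, reusing x and pdf); any sampling-based estimate should agree with it:

# Numerical reference for E[X^2] = \int x^2 pi(x) dx via quadrature on the grid
# (only feasible because this toy problem is one-dimensional)
EX2_ref = np.trapezoid(x**2 * pdf, x)
print(f"Reference value: E[X^2] ≈ {EX2_ref:.4f}")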

From Markov Chain to Sampling

In short: construct the “right” Markov chain so that its stationary distribution is the target distribution; the long-run distribution of the trajectory ≈ the target distribution.

Turning Sampling into “Building a Chain”

Let the state space be $\mathsf{X}$, and suppose we want samples from the target distribution $\pi$. We do not sample from $\pi$ directly; instead, we construct a transition kernel $P(x,A)=\Pr(X_{t+1}\in A\mid X_t=x)$ and make $\pi$ its Stationary Distribution (invariant):

$$ \pi(A) \;=\; \int_{\mathsf{X}} \pi(dx)\,P(x,A),\quad \forall A. $$

Intuition: if you pick a starting point at random according to $\pi$ and then take one step according to $P$, the distribution remains unchanged. Such a “distribution-preserving” random walk can keep walking forever while its marginal distribution stays at $\pi$.
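
For a finite state space, the invariance condition reduces to the linear-algebra statement $\pi P = \pi$ (with $\pi$ written as a row vector). A minimal sanity check, using a small 2-state chain chosen purely for illustration:

import numpy as np

# A small 2-state transition matrix (rows sum to 1), chosen only for illustration
P = np.array([[0.7, 0.3],
              [0.6, 0.4]])

# Stationary distribution: left eigenvector of P for eigenvalue 1, normalized to sum to 1
eigvals, eigvecs = np.linalg.eig(P.T)
pi = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
pi = pi / pi.sum()

print("pi     =", pi)
print("pi @ P =", pi @ P)  # invariance: this should equal pi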

Ensuring “Reachable, Aperiodic, Forgetting Start”

“Stationary” alone is not enough; we also need the chain to converge to $\pi$. Common sufficient conditions:

  • Irreducible: from any starting point, every region with positive probability mass can be reached in finitely many steps with positive probability;
  • Aperiodic: the chain is not locked into fixed cycles;
  • A reasonable recurrence condition (Harris recurrence and similar technical conditions).

With these, classic results tell us that regardless of the initial distribution, as $t\to\infty$ the distribution $P^t(x_0,\cdot)$ converges to $\pi$ in Total Variation distance:

$$ \big\|P^t(x_0,\cdot)-\pi\big\|_{\mathrm{TV}}\to 0. $$

So we discard the early samples (burn-in); the subsequent trajectory is approximately distributed according to $\pi$.

Why “Unnormalized is Fine”

Many MCMC constructions need only an unnormalized density $\tilde\pi(x)\propto \pi(x)$, because they use only ratios $\tilde\pi(y)/\tilde\pi(x)$, in which the unknown constant $Z$ cancels. Why this suffices comes down to Detailed Balance / Reversibility (see the next section): enforcing “symmetric flow” with these ratios is enough to make $\pi$ the stationary distribution. Not needing $Z$ is the key advantage of MCMC.
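
A minimal illustration with the earlier target $\tilde\pi(x)=e^{-x^4}$: the (log-)ratio between any two points never involves $Z$, so it can be evaluated even though $Z$ is unknown. The two evaluation points below are arbitrary.

import numpy as np

def log_tilde_pi(x):
    # Log of the unnormalized density e^{-x^4}; the unknown constant Z never appears
    return -x**4

x_cur, x_prop = 0.5, 1.2
log_ratio = log_tilde_pi(x_prop) - log_tilde_pi(x_cur)  # = log[pi(x_prop)/pi(x_cur)], Z cancels
print("density ratio pi(x_prop)/pi(x_cur) =", np.exp(log_ratio))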

Example

To avoid introducing any specific algorithm yet, let’s use the most familiar distribution, the Uniform Distribution, to illustrate.

Suppose target distribution is

$$ \pi(x) = \text{Uniform}\{1,2,3\}. $$

We design a Markov chain:

  • Walk among 3 states: 1, 2, 3;
  • At each position, jump to another position with equal probability;
  • E.g., at 1, jump to 2 with prob 0.5, to 3 with prob 0.5.

👉 The transition matrix P:

$$ P=\begin{bmatrix} 0 & 0.5 & 0.5 \\ 0.5 & 0 & 0.5 \\ 0.5 & 0.5 & 0 \end{bmatrix}. $$

Let’s run it and see how its distribution evolves.

import numpy as np
import matplotlib.pyplot as plt

# Transition Matrix P
P = np.array([
    [0.0, 0.5, 0.5],
    [0.5, 0.0, 0.5],
    [0.5, 0.5, 0.0]
])

# Initial distribution: All in state 1
dist = np.array([1.0, 0.0, 0.0])
history = [dist]

# Evolve 20 steps
for t in range(20):
    dist = dist @ P
    history.append(dist)

history = np.array(history)

# Theoretical stationary distribution (Uniform)
pi = np.array([1/3, 1/3, 1/3])

# Plot
plt.figure(figsize=(6,4))
for i in range(3):
    plt.plot(history[:,i], label=f"State {i+1}")
plt.axhline(pi[0], color="k", linestyle="--", label="Stationary dist.")
plt.xlabel("Step")
plt.ylabel("Probability")
plt.title("Markov chain approaching uniform distribution")
plt.legend()
plt.show()

[Figure: evolution of the state distribution toward the uniform stationary distribution]

We can see:

  • Initially the distribution is concentrated entirely in state 1;
  • as the number of steps increases, the three curves gradually approach $1/3$;
  • it finally converges to the stationary distribution (Uniform).

👉 Intuition: by letting the Markov chain “walk”, we can obtain samples from $\pi$ through its long-run occupation proportions, even though we never sample from $\pi$ directly.
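
We can also make the convergence statement quantitative for this example by computing the Total Variation distance $\|P^t(x_0,\cdot)-\pi\|_{\mathrm{TV}}=\tfrac12\sum_i |P^t(x_0,i)-\pi_i|$ at every step; a short continuation of the code above (reusing history and pi):

# Total variation distance to the stationary distribution at each step
tv = 0.5 * np.abs(history - pi).sum(axis=1)

plt.figure(figsize=(6, 4))
plt.semilogy(tv, marker="o")
plt.xlabel("Step")
plt.ylabel("TV distance to stationary dist.")
plt.title("Geometric decay of the TV distance")
plt.grid(True, ls="--", alpha=0.4)
plt.show()

The straight line on the log scale reflects geometric convergence: each step shrinks the distance by the same factor, here governed by the chain’s second-largest eigenvalue modulus.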

Theory and Intuition

Why it works, why it converges

In short,

  • A stationary distribution exists and satisfies detailed balance → Correctness.
  • The convergence speed (Mixing time) determines sample quality.
  • Smaller auto-correlation means a larger ESS and a “more efficient” chain.

Detailed Balance (Reversibility) and Stationarity

If there exists $\pi$ such that

$$ \pi(dx)\,P(x,dy) \;=\; \pi(dy)\,P(y,dx) \quad (\text{Symmetric Flow}) $$

then the chain is called reversible with respect to $\pi$, and this implies that $\pi$ is a stationary distribution.

Intuition: starting from $\pi$, the joint distribution of a forward step is the same as that of a “backward step”; there is no net probability flow, so the steady state is left undisturbed.

Many MCMC algorithms (MH, Gibbs, HMC, etc.) explicitly or implicitly construct this reversibility/invariance.
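
For the 3-state uniform example above, detailed balance can be checked directly: with $\pi_i=1/3$ for every state and the symmetric transition matrix $P$, $\pi_i P_{ij}=\pi_j P_{ji}$ holds entry by entry. A minimal check:

import numpy as np

P = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])
pi = np.array([1/3, 1/3, 1/3])

flow = pi[:, None] * P  # flow[i, j] = pi_i * P_ij
print("Detailed balance (flow matrix symmetric):", np.allclose(flow, flow.T))
print("Stationarity (pi @ P == pi):            ", np.allclose(pi @ P, pi))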

Convergence: Ergodic Theorem, LLN, CLT

When the chain is Ergodic (irreducible, aperiodic, and properly recurrent), we have:

  • Ergodic Theorem / Markov Chain Law of Large Numbers

    $$ \frac{1}{T}\sum_{t=1}^T f(X_t) \;\xrightarrow{a.s.}\; \mathbb{E}_\pi[f(X)]. $$

    This guarantees that the trajectory mean is a consistent estimator of the expectation.

  • Central Limit Theorem (CLT) (under conditions such as geometric ergodicity)

    $$ \sqrt{T}\Big(\bar f_T-\mathbb{E}_\pi[f]\Big)\ \Rightarrow\ \mathcal N\!\Big(0,\ \sigma_f^2\Big), $$

    where

    $$ \sigma_f^2 \;=\; \mathrm{Var}_\pi(f)\Big(1+2\sum_{k=1}^\infty \rho_k\Big), $$

    $\rho_k$ is the auto-correlation at lag $k$. Define Integrated Autocorrelation Time (IACT)

    $$ \tau_{\text{int}} \;=\; 1+2\sum_{k\ge1}\rho_k, $$

    Then the Effective Sample Size is $\mathrm{ESS}\approx T/\tau_{\text{int}}$. Intuition: stronger correlation (slower decay of $\rho_k$) means each sample carries less information, so the ESS is smaller.

Mixing Time, Spectral Gap, and Geometric Convergence

  • Mixing Time characterizes approach speed of $P^t$ to $\pi$, commonly defined by TV distance:

    $$ \tau(\varepsilon)\;=\;\min\{t:\ \sup_{x_0}\|P^t(x_0,\cdot)-\pi\|_{\mathrm{TV}}\le \varepsilon\}. $$
  • For finite reversible chains, the convergence speed is closely related to the Spectral Gap $\gamma=1-\lambda_\star$, where $\lambda_\star$ is the second-largest eigenvalue modulus: $\tau(\varepsilon)$ typically scales as $\frac{1}{\gamma}\log(1/\varepsilon)$. Intuition: large $\gamma$ ⇒ “weak memory, fast forgetting” ⇒ faster mixing (a small numerical check follows this list).

  • Conductance / Bottleneck characterizes how hard it is to cross between regions, and is related to $\gamma$ via the Cheeger inequality. Intuition: a multi-modal distribution with deep “valleys” between the peaks ⇒ small conductance ⇒ slow mixing (the chain easily gets “stuck in a mode”).
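
A minimal numerical illustration of the spectral gap, using the 3-state uniform chain from the previous example and the “sticky” 2-state chain that appears in the example later in this section: the smaller the gap, the slower the mixing.

import numpy as np

def spectral_gap(P):
    # Spectral gap 1 - lambda_star, where lambda_star is the second-largest eigenvalue modulus
    mods = np.sort(np.abs(np.linalg.eigvals(P)))[::-1]
    return 1.0 - mods[1]

P_fast = np.array([[0.0, 0.5, 0.5],
                   [0.5, 0.0, 0.5],
                   [0.5, 0.5, 0.0]])  # 3-state uniform chain
P_slow = np.array([[0.95, 0.05],
                   [0.05, 0.95]])     # sticky 2-state chain (see the example below)

print("gap (3-state uniform chain):", spectral_gap(P_fast))  # 1 - 0.5 = 0.5
print("gap (sticky 2-state chain): ", spectral_gap(P_slow))  # 1 - 0.9 = 0.1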

Burn-in, Thinning, Diagnostics (From Theory to Practice)

  • Burn-in: early samples that are “not yet close to $\pi$” bias the estimate, so discarding an initial segment is usually safer.

  • Thinning: subsampling saves storage and reduces correlation between the kept samples, but from a variance-minimization perspective it is not strictly necessary; it is often better to keep all samples and use the IACT/ESS for a correct variance estimate.

  • Diagnostics (Intuition-driven):

    • Trace: Wandering freely without obvious drift?
    • ACF/ESS: Does correlation decay fast enough?
    • Multi-chain R-hat: Do chains from multiple starting points mix together? (A minimal sketch follows this list.)
    • Multi-modal: Any pattern of “stuck in mode” (long stay + sudden jump)?
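
As a concrete illustration of the multi-chain idea, here is a minimal sketch of a basic (non-split) Gelman-Rubin-style $\hat R$; modern practice uses refined variants (split chains, rank normalization), and the simulate_chain helper and the reuse of the sticky 2-state chain below are purely illustrative assumptions:

import numpy as np

def r_hat(chains):
    # Basic Gelman-Rubin R-hat for an (m, n) array holding m chains of length n
    m, n = chains.shape
    chain_means = chains.mean(axis=1)
    W = chains.var(axis=1, ddof=1).mean()   # within-chain variance
    B = n * chain_means.var(ddof=1)         # between-chain variance
    var_plus = (n - 1) / n * W + B / n      # pooled variance estimate
    return np.sqrt(var_plus / W)

def simulate_chain(P, x0, T, rng):
    # Simulate a finite-state Markov chain with transition matrix P, starting from x0
    x = np.empty(T, dtype=int)
    x[0] = x0
    for t in range(1, T):
        x[t] = rng.choice(len(P), p=P[x[t - 1]])
    return x

P_slow = np.array([[0.95, 0.05],
                   [0.05, 0.95]])
rng = np.random.default_rng(0)

# Four chains started from different states; R-hat close to 1 suggests they have mixed
chains = np.array([simulate_chain(P_slow, x0, 2000, rng) for x0 in [0, 1, 0, 1]], dtype=float)
print("R-hat:", r_hat(chains))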

Why MCMC can “Walk and Stay in High Density Areas”

View $\pi$ as a Boltzmann distribution: $\pi(x)\propto e^{-U(x)}$, where $U(x)=-\log \tilde\pi(x)$ plays the role of an “energy landscape”. MCMC is like random thermal motion in this landscape:

  • It prefers to stay in Low Energy (High Density) regions;
  • but, through mechanisms such as perturbation, momentum, or accept-reject steps, it still has a chance to cross energy barriers and explore other regions;
  • as long as the mechanism keeps $\pi$ invariant and the chain is ergodic, the long-run statistics will be correct.

Example

Let’s build a “Slow Mixing” chain:

  • States $\{0,1\}$,
  • From 0, stay with prob 0.95, jump to 1 with prob 0.05;
  • From 1, same logic.

👉 Although its stationary distribution is still the uniform $[0.5,0.5]$, the autocorrelation is very strong.

import numpy as np
import matplotlib.pyplot as plt

# Construct 2-state chain (Sticky)
P_slow = np.array([
    [0.95, 0.05],
    [0.05, 0.95]
])

# Simulate trajectory
T = 5000
x = np.zeros(T, dtype=int)
for t in range(1, T):
    x[t] = np.random.choice([0,1], p=P_slow[x[t-1]])

# Compute Auto-Correlation Function (ACF)
def autocorr(x, lag):
    n = len(x)
    x_mean = np.mean(x)
    num = np.sum((x[:n-lag]-x_mean)*(x[lag:]-x_mean)) # Auto-covariance
    den = np.sum((x-x_mean)**2) # Variance
    return num/den

lags = np.arange(50)
acfs = [autocorr(x, lag) for lag in lags]

# Plot
plt.figure(figsize=(6,4))
plt.bar(lags, acfs)
plt.xlabel("Lag")
plt.ylabel("Autocorrelation")
plt.title("ACF of slow-mixing 2-state chain")
plt.show()

[Figure: autocorrelation function of the slow-mixing 2-state chain]

Here you can see:

  • Auto-Correlation Function (ACF) decays very slowly;
  • This means adjacent samples are highly dependent, so the Effective Sample Size (ESS) is far smaller than the total number of steps.

👉 Intuitive Summary:

  • A Markov chain will converge to its stationary distribution (provided the conditions are met);
  • but the “Mixing Speed” varies: some chains equilibrate within a few steps, others drag on for a long time;
  • in MCMC, a slow-mixing chain is inefficient: its samples carry little information (quantified in the sketch below).
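
To quantify this, we can turn the ACF into estimates of the integrated autocorrelation time and the ESS. For this 2-state chain the theoretical lag-$k$ autocorrelation is $0.9^k$, so $\tau_{\text{int}}=19$ and $\mathrm{ESS}\approx T/19$. A minimal sketch, reusing the trajectory x and the autocorr function defined above (the simple cutoff rule is just one common heuristic):

# Estimate the integrated autocorrelation time (IACT) and ESS from the simulated chain
max_lag = 200
rho = np.array([autocorr(x, lag) for lag in range(1, max_lag + 1)])

# Truncate the sum at the first non-positive autocorrelation (a simple cutoff heuristic)
cutoff = np.argmax(rho <= 0) if np.any(rho <= 0) else max_lag
tau_int = 1 + 2 * rho[:cutoff].sum()
ess = len(x) / tau_int

print(f"Estimated IACT: {tau_int:.1f}  (theoretical value: 19)")
print(f"Estimated ESS:  {ess:.0f}  out of {len(x)} samples")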

Critical Indicators in Practice (No Algorithms Yet)

  1. Only need $\tilde\pi(x)$: as long as we can compute the log-density up to an additive constant, MCMC is applicable.

  2. Three things must be balanced:

    • Invariance ($\pi$ is stationary): without it, the long-run answer is wrong;
    • Reachability (Irreducible/Aperiodic): without it, parts of the space are never explored;
    • Fast Mixing (large spectral gap/conductance, small IACT): otherwise the ESS is too low and the cost too high.
  3. Error Assessment: Use IACT/ESS + Markov Chain CLT for MC variance and confidence intervals.

  4. Multi-modal Warning: multi-modality plus high-dimensional correlation often leads to Metastability; algorithms or strategies better at “climbing hills and crossing valleys” (Tempering, Gradient methods, etc.) are needed to improve the conductance.

Summary

  • Purpose: Estimate expectation/sample from complex $\pi$; No normalization constant needed.

  • Method: Build a chain $P$ such that $\pi$ is invariant (often using Detailed Balance).

  • Condition: Irreducible + Aperiodic + Proper Recurrence ⇒ $P^t\to\pi$.

  • Assessment:

    • Convergence (after burn-in) + Mixing (IACT/ESS, Spectral gap, Conductance)
    • Diagnostics: Trace, ACF, R-hat, Multi-chain consistency
  • Error: $\mathrm{ESS}\approx T/\tau_{\text{int}}$, CLT gives confidence intervals.

  • Difficulty: Multi-modal/High-dim correlation ⇒ Slow mixing; Need better “moving strategies”.

