11: Dynamic Programming

Dynamic programming (DP) solves optimization problems by decomposing them into overlapping subproblems and storing intermediate results to avoid redundant computation. It is the standard approach for sequence alignment, shortest paths, and many combinatorial optimization problems that arise in ML pipelines and systems.


When to Use Dynamic Programming

A problem is amenable to DP when it has:

  1. Optimal substructure. The optimal solution to the problem contains optimal solutions to subproblems. If the shortest path from A to C goes through B, then the sub-path from A to B must itself be shortest.

  2. Overlapping subproblems. The same subproblems are solved multiple times in a naive recursive approach. Without memoization, the recursive Fibonacci algorithm computes F(n-2) twice, F(n-3) three times, etc., leading to exponential time.

DP trades space for time: store subproblem solutions in a table to avoid recomputation.


Implementation Strategies

Top-Down (Memoization)

Write the recursive solution, then cache results:

def fib(n, memo=None):
    # Avoid a mutable default argument: a shared dict would silently
    # persist the cache across unrelated calls.
    if memo is None:
        memo = {}
    if n <= 1: return n
    if n not in memo:
        memo[n] = fib(n-1, memo) + fib(n-2, memo)
    return memo[n]

Time: O(n), Space: O(n). The recursion tree is pruned: each subproblem is solved once.

Bottom-Up (Tabulation)

Fill a table iteratively from the base cases:

def fib(n):
    # Guard the base case: for n = 0 the table has length 1,
    # and dp[1] = 1 would raise an IndexError.
    if n == 0:
        return 0
    dp = [0] * (n + 1)
    dp[1] = 1
    for i in range(2, n + 1):
        dp[i] = dp[i-1] + dp[i-2]
    return dp[n]

Same time and space complexity, but avoids recursion overhead and stack depth limits. Bottom-up is generally preferred for production code.

Space optimization. When the recurrence only depends on a fixed number of previous entries, reduce space by keeping only those entries. Fibonacci needs only the last 2 values: O(1) space.
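For Fibonacci, the space-optimized version amounts to keeping a rolling pair of values (a minimal sketch):

```python
def fib(n):
    # Only the last two values are ever needed, so keep a rolling pair
    # instead of the full table: O(1) space.
    prev, curr = 0, 1
    for _ in range(n):
        prev, curr = curr, prev + curr
    return prev
```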


Classic Problems

Longest Common Subsequence (LCS)

Given sequences X = x_1 ... x_m and Y = y_1 ... y_n, find the length of their longest common subsequence.

Recurrence:

\text{LCS}(i, j) = \begin{cases} 0 & \text{if } i = 0 \text{ or } j = 0 \\ \text{LCS}(i-1, j-1) + 1 & \text{if } x_i = y_j \\ \max(\text{LCS}(i-1, j), \text{LCS}(i, j-1)) & \text{otherwise} \end{cases}

Time: O(mn), Space: O(mn) (reducible to O(min(m, n))).
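The recurrence translates directly into a tabulated implementation (a sketch; the function name is ours):

```python
def lcs_length(x, y):
    m, n = len(x), len(y)
    # dp[i][j] = LCS length of the prefixes x[:i] and y[:j];
    # row 0 and column 0 encode the empty-prefix base cases.
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]
```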

ML connection. LCS and edit distance are used in NLP for string similarity, text alignment, and evaluation metrics (ROUGE-L uses LCS for summarization evaluation).

Edit Distance (Levenshtein Distance)

Minimum number of insertions, deletions, and substitutions to transform string X into string Y.

d(i, j) = \begin{cases} i & \text{if } j = 0 \\ j & \text{if } i = 0 \\ d(i-1, j-1) & \text{if } x_i = y_j \\ 1 + \min(d(i-1,j), d(i,j-1), d(i-1,j-1)) & \text{otherwise} \end{cases}

Time: O(mn). Used in spell checking, DNA sequence alignment, and fuzzy string matching.
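The same table-filling pattern gives edit distance; only the base cases and the cost term change (a sketch, with our naming):

```python
def edit_distance(x, y):
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    # Base cases: transforming to/from the empty string costs
    # one deletion or insertion per character.
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if x[i - 1] == y[j - 1]:
                dp[i][j] = dp[i - 1][j - 1]
            else:
                # deletion, insertion, substitution
                dp[i][j] = 1 + min(dp[i - 1][j], dp[i][j - 1], dp[i - 1][j - 1])
    return dp[m][n]
```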

0/1 Knapsack

Given n items with weights w_i and values v_i, and a capacity W, maximize total value without exceeding capacity.

\text{OPT}(i, w) = \max(\text{OPT}(i-1, w), \; v_i + \text{OPT}(i-1, w - w_i))

where the second option (taking item i) is available only when w_i ≤ w.

Time: O(nW). This is pseudo-polynomial: polynomial in n and W but exponential in the number of bits needed to represent W. The knapsack problem is NP-hard, so no truly polynomial algorithm exists (unless P = NP).
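A space-optimized sketch that keeps only one row of the table (O(W) space); iterating capacities downward ensures each item is used at most once:

```python
def knapsack(weights, values, capacity):
    # dp[w] = best value achievable with capacity w using the items seen so far.
    dp = [0] * (capacity + 1)
    for wi, vi in zip(weights, values):
        # Downward iteration: dp[w - wi] still refers to the previous item's
        # row, so no item can be taken twice.
        for w in range(capacity, wi - 1, -1):
            dp[w] = max(dp[w], vi + dp[w - wi])
    return dp[capacity]
```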

Matrix Chain Multiplication

Given matrices A_1, ..., A_n with dimensions p_0 × p_1, ..., p_{n-1} × p_n, find the parenthesization that minimizes the total number of scalar multiplications.

m(i, j) = \min_{i \leq k < j} \{m(i, k) + m(k+1, j) + p_{i-1} p_k p_j\}

Time: O(n^3), Space: O(n^2). The optimal substructure: the best way to multiply A_i ... A_j must use an optimal split point k and optimal parenthesizations of both halves.
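Interval DP fills the table by increasing chain length, since m(i, j) depends only on strictly shorter intervals (a sketch; the function name is ours):

```python
def matrix_chain(p):
    # p has length n+1; matrix A_i has dimensions p[i-1] x p[i] (1-indexed).
    n = len(p) - 1
    # m[i][j] = minimum scalar multiplications to compute A_i ... A_j.
    m = [[0] * (n + 1) for _ in range(n + 1)]
    for length in range(2, n + 1):          # chain length
        for i in range(1, n - length + 2):
            j = i + length - 1
            m[i][j] = min(m[i][k] + m[k + 1][j] + p[i - 1] * p[k] * p[j]
                          for k in range(i, j))
    return m[1][n]
```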


Shortest Paths

Single-Source: Bellman-Ford

For a graph with V vertices and E edges (possibly negative weights):

d[v] = \min_{(u,v) \in E} \{d[u] + w(u,v)\}

Iterate V - 1 times over all edges. Time: O(VE). Detects negative cycles (if any d[v] decreases on the V-th iteration).
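A minimal sketch of the relaxation loop, with the extra V-th pass for cycle detection (edge-list representation is our assumption):

```python
def bellman_ford(num_vertices, edges, source):
    # edges: list of (u, v, weight) triples; vertices numbered 0..num_vertices-1.
    INF = float("inf")
    d = [INF] * num_vertices
    d[source] = 0
    # V - 1 rounds of relaxing every edge.
    for _ in range(num_vertices - 1):
        for u, v, w in edges:
            if d[u] + w < d[v]:
                d[v] = d[u] + w
    # One extra pass: any further improvement implies a reachable negative cycle.
    for u, v, w in edges:
        if d[u] + w < d[v]:
            raise ValueError("negative cycle detected")
    return d
```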

All-Pairs: Floyd-Warshall

d^{(k)}[i][j] = \min(d^{(k-1)}[i][j], \; d^{(k-1)}[i][k] + d^{(k-1)}[k][j])

Time: O(V^3), Space: O(V^2). Considers all intermediate vertices k = 1, ..., V.
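Because iteration k only ever improves entries, the superscript can be dropped and the update done in place, giving the familiar triple loop (a sketch; adjacency-matrix input is our assumption):

```python
def floyd_warshall(dist):
    # dist: V x V matrix; dist[i][j] = direct edge weight,
    # float("inf") if no edge, 0 on the diagonal.
    V = len(dist)
    d = [row[:] for row in dist]  # copy so the input is not mutated
    for k in range(V):            # allow k as an intermediate vertex
        for i in range(V):
            for j in range(V):
                if d[i][k] + d[k][j] < d[i][j]:
                    d[i][j] = d[i][k] + d[k][j]
    return d
```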


DP in Machine Learning

| Algorithm | DP Component | Application |
| --- | --- | --- |
| Viterbi | Most probable state sequence in HMMs | Speech recognition, POS tagging |
| CTC decoding | Best alignment between input and output | Speech-to-text |
| Beam search | Approximate sequence decoding | Text generation, machine translation |
| Needleman-Wunsch | Global sequence alignment | Bioinformatics, token alignment |
| Value iteration | Bellman equation for MDPs | Reinforcement learning |

The Viterbi algorithm finds the most likely hidden state sequence in a Hidden Markov Model. It is a DP over the trellis of states and time steps:

\delta_t(j) = \max_i [\delta_{t-1}(i) \cdot a_{ij}] \cdot b_j(o_t)

where \delta_t(j) is the probability of the most likely path ending in state j at time t, a_{ij} is the transition probability, and b_j(o_t) is the emission probability. Time: O(T \cdot S^2) where T is the sequence length and S is the number of states.
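A dictionary-based sketch of the trellis recursion with backpointers (parameter names and layout are our assumptions; production code would work in log space to avoid underflow on long sequences):

```python
def viterbi(obs, states, start_p, trans_p, emit_p):
    # delta[j] = probability of the most likely path ending in state j so far.
    delta = {j: start_p[j] * emit_p[j][obs[0]] for j in states}
    backptr = []
    for o in obs[1:]:
        prev, delta, bp = delta, {}, {}
        for j in states:
            # Best predecessor state i for landing in j at this step.
            best_i = max(states, key=lambda i: prev[i] * trans_p[i][j])
            bp[j] = best_i
            delta[j] = prev[best_i] * trans_p[best_i][j] * emit_p[j][o]
        backptr.append(bp)
    # Recover the state sequence by following backpointers from the best end state.
    state = max(states, key=lambda j: delta[j])
    path = [state]
    for bp in reversed(backptr):
        state = bp[state]
        path.append(state)
    return path[::-1]
```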

Value iteration in reinforcement learning applies the Bellman optimality equation:

V^*(s) = \max_a \left[R(s, a) + \gamma \sum_{s'} P(s' \mid s, a) V^*(s')\right]

This is DP over states: each iteration improves the value estimate by propagating information one step backward from successor states.
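A minimal sketch of synchronous value iteration on a tabular MDP (the data layout for P and R is our assumption):

```python
def value_iteration(states, actions, P, R, gamma=0.9, tol=1e-6):
    # P[s][a]: list of (next_state, prob) pairs; R[s][a]: immediate reward.
    V = {s: 0.0 for s in states}
    while True:
        # One Bellman backup per state: propagate values from successors.
        V_new = {
            s: max(R[s][a] + gamma * sum(p * V[s2] for s2, p in P[s][a])
                   for a in actions)
            for s in states
        }
        if max(abs(V_new[s] - V[s]) for s in states) < tol:
            return V_new
        V = V_new
```

The contraction property of the Bellman operator guarantees this loop converges geometrically (at rate gamma) to the unique fixed point V*.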


Complexity Analysis

| Problem | Time | Space | Type |
| --- | --- | --- | --- |
| Fibonacci | O(n) | O(1) | 1D, linear |
| LCS / Edit distance | O(mn) | O(min(m,n)) | 2D, quadratic |
| Knapsack | O(nW) | O(W) | Pseudo-polynomial |
| Matrix chain | O(n^3) | O(n^2) | Interval DP |
| Floyd-Warshall | O(V^3) | O(V^2) | All-pairs |
| Viterbi | O(TS^2) | O(TS) | Trellis DP |

Summary

Dynamic programming reduces exponential brute-force search to polynomial time by exploiting overlapping subproblems and optimal substructure. The key steps: define the state space, write the recurrence relation, determine the base cases, and choose top-down (memoization) or bottom-up (tabulation) implementation. In ML, DP appears in sequence decoding (Viterbi, CTC, beam search), reinforcement learning (value iteration), and evaluation metrics (ROUGE-L).