In previous post, I wrote about Levenshtein distance which is a good distance metric for sequences with different length. This post is about Longest common sub sequence (LCS)
The longest common subsequence (LCS) problem is the problem of finding the longest subsequence common to all sequences in a set of sequences (often just two sequences).
The thing about LCS is that the common subsequence is not consecutive positions within the sequences.
The first step is calculating the length matrix for each element in both sequences. With each matching elements, the length increases. In case of non-matching elements, the value max(C[i][j - 1], C[i - 1][j])
is used to repeat the max of last 2 lengths.
for i in range(m+1):
for j in range(n+1):
print(f"debug {i}:{a[i - 1]} {j}:{b[j - 1]}")
if i == 0 or j == 0:
C[i][j] = 0
elif (a[i - 1] == b[j - 1]):
C[i][j] = C[i - 1][j - 1] + 1;
else:
C[i][j] = max(C[i][j - 1], C[i - 1][j])
To calculate the LCS, we start with the end of sequences and backtrace in length distance to either to first col and row(row0 or col0).
lcs = ""
(i,j) = (m,n)
while i > 0 and j > 0:
if a[i-1] == b[j-1]:
print(f"match {a[i-1]}:{i} {b[j-1]}:{j}")
lcs += a[i-1]
i -= 1
j -= 1
elif C[i-1][j] > C[i][j-1]:
print(f"C {a[i-1]}:{i}:{C[i-1][j]} {b[j-1]}:{j}:{C[i][j-1]}")
i -= 1
else:
print(f"else {a[i-1]}:{i} {b[j-1]}:{j}")
j -= 1
For completeness, The whole code is found here
def lcs_simple(a, b):
m = len(a)
n = len(b)
# Calculate lcs length
C = []
for i in range(m + 1):
C.append([0] * (n + 1))
for i in range(m+1):
for j in range(n+1):
print(f"debug {i}:{a[i - 1]} {j}:{b[j - 1]}")
if i == 0 or j == 0:
C[i][j] = 0
elif (a[i - 1] == b[j - 1]):
C[i][j] = C[i - 1][j - 1] + 1;
else:
C[i][j] = max(C[i][j - 1], C[i - 1][j]);
# get the max lcs
lcs = ""
(i,j) = (m,n)
while i > 0 and j > 0:
if a[i-1] == b[j-1]:
print(f"match {a[i-1]}:{i} {b[j-1]}:{j}")
lcs += a[i-1]
i -= 1
j -= 1
elif C[i-1][j] > C[i][j-1]:
print(f"C {a[i-1]}:{i}:{C[i-1][j]} {b[j-1]}:{j}:{C[i][j-1]}")
i -= 1
else:
print(f"else {a[i-1]}:{i} {b[j-1]}:{j}")
j -= 1
lcs = lcs[::-1]
return C[m][n], lcs, C
A = "ABCDGH"
B = "AEDFHR"
C = lcs_simple(A, B)
print(A, " ", B)
print(C)