This post breaks down the code in link, which uses Levenshtein distance and affinity propagation to cluster a list of strings.

In a previous post, I wrote a simple Levenshtein implementation to calculate the similarity between strings, and it turns out to be a good metric for clustering. Basically, affinity propagation and a distance-based similarity matrix are a match made in heaven: the algorithm accepts a precomputed similarity matrix directly, so no feature vectors are needed.

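distance.levenshtein is called on every pair of words in the word list. The distances are negated because affinity propagation expects similarities, where a larger value means the two strings are more alike.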

lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
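To make clear what goes into that matrix, here is a minimal pure-Python sketch of the same computation. The helper names `levenshtein` and `similarity_matrix` are my own for illustration and are not part of the `distance` package; the sketch assumes the plain, unnormalized edit distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance between a and b (insertions, deletions, substitutions)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution or match
            ))
        previous = current
    return previous[-1]


def similarity_matrix(words):
    """Negated pairwise distances: 0 on the diagonal, more negative = less alike."""
    return [[-levenshtein(w1, w2) for w1 in words] for w2 in words]


words = ["YOUR", "WORDS", "HERE", "HE", "FOOOO", "fo"]
matrix = similarity_matrix(words)
```

The result is symmetric with zeros on the diagonal, since every word is at distance 0 from itself.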

AffinityPropagation from sklearn.cluster is then fitted on this precomputed matrix:

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

The full example:

import distance
import numpy as np
from sklearn.cluster import AffinityPropagation


words = "YOUR WORDS HERE HE FOOOO fo".split(" ")
words = np.asarray(words)
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])

print(lev_similarity)

affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)

print(f"affprop.labels_: {affprop.labels_}")
print(f"affprop.cluster_centers_indices_: {affprop.cluster_centers_indices_}")
# Print each cluster as "exemplar: members"
for cluster_id in np.unique(affprop.labels_):
    # The exemplar is the word the algorithm chose to represent this cluster
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_ == cluster_id)])
    cluster_str = ", ".join(cluster)
    print(f" - *{exemplar}:* {cluster_str}")
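
The final loop can also be written as a small reusable helper that works on plain lists. This is only a sketch: `group_by_label` is a hypothetical name, and the labels and exemplar indices below are made up to show the shape of the data, not taken from a real run.

```python
from collections import defaultdict


def group_by_label(words, labels, center_indices):
    """Map each exemplar word to the sorted, de-duplicated members of its cluster."""
    members = defaultdict(set)
    for word, label in zip(words, labels):
        members[label].add(word)
    return {
        words[center_indices[label]]: sorted(cluster)
        for label, cluster in members.items()
    }


# Hypothetical labels and exemplar indices, made up for illustration only;
# they are not the output of an actual AffinityPropagation run.
words = ["YOUR", "WORDS", "HERE", "HE", "FOOOO", "fo"]
labels = [0, 0, 1, 1, 2, 2]
center_indices = [0, 2, 4]

clusters = group_by_label(words, labels, center_indices)
for exemplar, cluster in clusters.items():
    print(" - *%s:* %s" % (exemplar, ", ".join(cluster)))
```

With a fitted model, the same helper could be called with `affprop.labels_` and `affprop.cluster_centers_indices_` in place of the made-up lists.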