This post breaks down the linked code, which uses Levenshtein distance and affinity propagation to cluster a list of strings.
In a previous post, I wrote a simple Levenshtein implementation to calculate the similarity between strings, and it turns out to be a good metric for clustering. Affinity propagation can work directly from a precomputed similarity matrix, so it pairs naturally with a matrix of pairwise Levenshtein distances.
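As a reminder of what the metric measures, here is a minimal sketch of Levenshtein distance using the standard dynamic-programming recurrence. It is not the implementation from the previous post or from the distance package; the function name and layout are my own assumptions, kept only for illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or keep) the character
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
print(levenshtein("HE", "HERE"))         # 2
```

The comparison is case-sensitive, so "fo" and "FOOOO" share no characters as far as this metric is concerned.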
distance.levenshtein is called on every pair of words in the word list, and the distances are negated so that larger values mean more similar strings.
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
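To make the sign flip concrete, here is a small sketch with three of the words. The distances are worked out by hand rather than computed with distance.levenshtein, and the variable names are only illustrative.

```python
import numpy as np

words = ["HE", "HERE", "fo"]

# Hand-computed Levenshtein distances between the three words above.
dist = np.array([
    [0, 2, 2],
    [2, 0, 4],
    [2, 4, 0],
])

# Negating turns "small distance" into "large similarity".
sim = -dist

# Ignore the diagonal (each word compared with itself) and pick,
# for every word, the other word with the highest similarity.
masked = sim.astype(float)
np.fill_diagonal(masked, -np.inf)
nearest = masked.argmax(axis=1)

for word, j in zip(words, nearest):
    print(word, "->", words[j])
```

The matrix is symmetric, and its diagonal is 0, which is the highest similarity any entry can have.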
The similarity matrix is then passed to AffinityPropagation from sklearn.cluster, with affinity="precomputed" so that the matrix is used as-is:
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
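The number of clusters is not set directly. It is influenced by the preference parameter, which scikit-learn sets to the median of the input similarities when it is left unset. The sketch below tries two preference values on a small hand-made similarity matrix. The matrix values are invented for illustration, and random_state is an assumption that requires a reasonably recent scikit-learn version.

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Made-up similarities: items 0 and 1 are close, items 2 and 3 are close.
sim = -np.array([
    [0, 1, 5, 5],
    [1, 0, 5, 5],
    [5, 5, 0, 1],
    [5, 5, 1, 0],
], dtype=float)

# None falls back to the median similarity; a lower preference
# makes points less likely to be chosen as exemplars.
for pref in (None, sim.min()):
    model = AffinityPropagation(
        affinity="precomputed", damping=0.5, preference=pref, random_state=0
    )
    model.fit(sim)
    n_clusters = len(model.cluster_centers_indices_)
    print(f"preference={pref}: {n_clusters} clusters, labels={model.labels_}")
```

If the clustering of your word list looks too fine or too coarse, preference is probably the first parameter to experiment with.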
The full example
import distance
import numpy as np
from sklearn.cluster import AffinityPropagation
from matplotlib import pyplot as plt
words = "YOUR WORDS HERE HE FOOOO fo".split(" ")
words = np.asarray(words)
lev_similarity = -1*np.array([[distance.levenshtein(w1,w2) for w1 in words] for w2 in words])
print(lev_similarity)
affprop = AffinityPropagation(affinity="precomputed", damping=0.5)
affprop.fit(lev_similarity)
print(f"affprop.labels_: {affprop.labels_}")
print(f"affprop.cluster_centers_indices_: {affprop.cluster_centers_indices_}")
for cluster_id in np.unique(affprop.labels_):
    exemplar = words[affprop.cluster_centers_indices_[cluster_id]]
    cluster = np.unique(words[np.nonzero(affprop.labels_==cluster_id)])
    cluster_str = ", ".join(cluster)
    print(" - *%s:* %s" % (exemplar, cluster_str))
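If you want the clusters as data rather than printed text, the final loop can be sketched as a small helper that returns a dictionary keyed by exemplar. The name clusters_to_dict is hypothetical, and the labels in the usage example are made up, not real output of the fit above. Skipping negative labels is a precaution, on the assumption that some scikit-learn versions report -1 when the fit does not converge.

```python
from collections import defaultdict


def clusters_to_dict(words, labels, center_indices):
    """Map each exemplar word to the list of words in its cluster."""
    groups = defaultdict(list)
    for word, label in zip(words, labels):
        if label < 0:
            # Assumed marker for "no cluster assigned"; skip it.
            continue
        exemplar = words[center_indices[label]]
        groups[exemplar].append(word)
    return dict(groups)


# Illustrative inputs only, shaped like labels_ and cluster_centers_indices_.
demo_words = ["HE", "HERE", "fo", "FOOOO"]
demo_labels = [0, 0, 1, 1]
demo_centers = [0, 2]

print(clusters_to_dict(demo_words, demo_labels, demo_centers))
```

With the real model you would pass words, affprop.labels_ and affprop.cluster_centers_indices_ instead of the demo values.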