Huffman code is one of the lossless compression algos. The idea is using a coding where more frequent symbols have shorter encoding and less frequent symbols with longer encoding(variable length encoding). A variant of Huffman is used for JPEG standard, so apparently it’s a big deal.
The simple algorithm on wiki is described as follows:
Create a leaf node for each symbol and add it to the priority queue. While there is more than one node in the queue: Remove the two nodes of highest priority (lowest probability) from the queue Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities. Add the new node to the queue. The remaining node is the root node and the tree is complete.
The algorithm Link to heading
First, we need to get the frequency of symbols. This is quick and (very) dirty way.
TEXT = "A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED"
def calc_symbols_freq(txt):
symbols = []
txt = sorted(list(txt))
start = 0
end = 0
i = 0
txt += ['$']
while (i < len(txt)):
if i == len(txt) - 1:
break
if txt[i] != txt[i+1]:
end = i
symbols.append((txt[i], end-start+1))
start = end + 1
i += 1
return symbols
The algorithm mentions using priority queue to keep the queue sorted. Well, Life is too short. So, I used a normal list and kept sorting it (I am probably wanted by the performance police by now)
class Node:
def __init__(self, sym, w, left=None, right=None, parent=None):
self.sym = sym
self.w = w
self.parent = parent
self.left = left
self.right = right
def __str__(self):
return f"{self.sym}: {self.w}"
class Tree():
def __init__(self, syms):
self.root = None
q = []
for sym in syms:
q.append(Node(sym[0], sym[1]))
q = sorted(q, key=lambda x: x.w, reverse=True)
while ( len(q) > 1):
a = q.pop()
b = q.pop()
node = Node(a.sym + b.sym , a.w + b.w, a, b)
a.parent = node
b.parent = node
q.append(node)
q = sorted(q, key=lambda x: x.w, reverse=True)
# the remaining node is the root
assert(len(q) == 1)
self.root = q.pop()
# traverse all paths to leaf nodes
def traverse(n, path):
if(n.left):
traverse(n.left, path+"0")
else:
print(n, " ", path)
if(n.right):
traverse(n.right, path+"1")
traverse(self.root, "")
def encode(self, txt):
return None
syms = calc_symbols_freq(TEXT)
tree = Tree(syms)
And the output matches wiki results. hurrah!
_: 10 00
D: 10 01
A: 11 10
E: 7 110
C: 2 1110
B: 6 1111