Huffman code is one of the lossless compression algos. The idea is using a coding where more frequent symbols have shorter encoding and less frequent symbols with longer encoding(variable length encoding). A variant of Huffman is used for JPEG standard, so apparently it’s a big deal.

The simple algorithm on wiki is described as follows:

Create a leaf node for each symbol and add it to the priority queue. While there is more than one node in the queue: Remove the two nodes of highest priority (lowest probability) from the queue Create a new internal node with these two nodes as children and with probability equal to the sum of the two nodes' probabilities. Add the new node to the queue. The remaining node is the root node and the tree is complete.

The algorithm Link to heading

First, we need to get the frequency of symbols. This is quick and (very) dirty way.

TEXT = "A_DEAD_DAD_CEDED_A_BAD_BABE_A_BEADED_ABACA_BED"

def calc_symbols_freq(txt):
    symbols = []

    txt = sorted(list(txt))

    start = 0
    end = 0
    i = 0
    txt += ['$']
    while (i < len(txt)):
        if i == len(txt) - 1:
            break
        if txt[i] != txt[i+1]:
            end = i
            symbols.append((txt[i], end-start+1))
            start = end + 1
        i += 1
    return symbols

The algorithm mentions using priority queue to keep the queue sorted. Well, Life is too short. So, I used a normal list and kept sorting it (I am probably wanted by the performance police by now)

class Node:
    def __init__(self, sym, w, left=None, right=None, parent=None):
        self.sym    = sym
        self.w      = w
        self.parent = parent
        self.left   = left
        self.right  = right
    def __str__(self):
        return f"{self.sym}: {self.w}"

class Tree():

    def __init__(self, syms):
        self.root = None
        q = []

        for sym in syms:
            q.append(Node(sym[0], sym[1]))

        q = sorted(q, key=lambda x: x.w, reverse=True)

        while ( len(q) > 1):
            a = q.pop()
            b = q.pop()

            node = Node(a.sym + b.sym , a.w + b.w, a, b)
            a.parent = node
            b.parent = node

            q.append(node)
            q = sorted(q, key=lambda x: x.w, reverse=True)


        # the remaining node is the root
        assert(len(q) == 1)
        self.root = q.pop()

        # traverse all paths to leaf nodes
        def traverse(n, path):
            if(n.left):
                traverse(n.left, path+"0")
            else:
                print(n, " ", path)
            if(n.right):
                traverse(n.right, path+"1")

        traverse(self.root, "")

    def encode(self, txt):
        return None

syms = calc_symbols_freq(TEXT)
tree = Tree(syms)

And the output matches wiki results. hurrah!

_: 10   00
D: 10   01
A: 11   10
E: 7   110
C: 2   1110
B: 6   1111