There’s a very popular children’s game in the states of former Yugoslavia. It’s called Kaladont (or Kalodont) Named after a brand of tooth paste popular in Yugoslavia. Named after a brand of tooth paste popular in Yugoslavia. . You can play with as many players as possible, but the gist of the game is this: one player says a word. The next player then has to say a word that begins with the last two letters of the previous word. Words that have been said cannot be repeated, and this goes on until a player cannot come up with a new word.
The game is called Kaladont because the word (which is to this day a synonym for toothpaste in the southern Balkans) kaladont is a game-ender: there is no word in Bosnian/Croatian/Serbian that begins with nt, so once a player says the word kaladont, the next player in line has effectively lost the game.
Here’s a practical example:
I’ve played this game as a kid. A lot.
You can imagine my surprise when, as an adult, I was scrolling the web and stumbled upon a challenge based on this game: A Croatian company has set up a simple website The company is OptimoRoute (I’m in no way affiliated with them), and the challenge is available on https://shaolin.dev/. The company is OptimoRoute (I’m in no way affiliated with them), and the challenge is available on https://shaolin.dev/. with some rules and a dictionary of around $160k$ words. Their challenge:
Given this dictionary of words, what is the longest game of Kaladont you can play?
If you’ve even accidentally passed by a Discrete Mathematics class, you’ll immediately put two and two together: this is a graph problem! And boy, do I love graphs.
Challenge accepted!
From games to graphs
If we’re trying to play for as long as possible, that means using up as many words from the dictionary as we can. And if the game is just a graph, then that means finding the longest path in said graph.
The dataset contains around $160k$ words - so that’s gonna be our graph.
Dear God, it’s in NP
Let’s be naive for a second, and represent each word in the dictionary as a node in our graph. Then, nodes A and B are connected by an edge from A to B if the last two letters of word A are the same as the first two letters of the word B. What we end up with is a graph of around $160k$ nodes, connected by the rules of Kaladont.
Graphs are weird - seemingly very similar problems can vastly vary in how difficult they are to solve. For example, finding the shortest path in a directed graph? Child’s play. But, finding the longest path? You’ll be able to brute-force it for smaller graphs, sure. But ours has $160k$ words (nodes) - no efficient algorithm that delivers an exact solution exists. This is why the problem of the longest path in a graph is NP-hard.
Note: Without going into a tangent about complexity theory, NP-hard basically means that you could easily verify the solution to the problem, but actually coming up with a solution would take up a ridiculous amount of time. For solving NP problems, we’re usually happy with approximating a solution that is good enough, using metaheuristics such as genetic algorithms and simulated annealing.Well, pack it up boys, we’re done here. It’s NP, can’t be solved.
… is what I would say if there wasn’t a key insight here - a simple trick which we can use to turn this problem from something in NP to something in P - after that, we can whack this challenge with the old Python-shaped hammer.
One weird trick
I’m gonna spill the beans right away: If we consider our graph, and invert it in a way that the bigrams become the nodes, and the words become the edges, then we simplify this problem by a lot. For our example from before, that would look something like this:
Doesn’t seem like much of a change. We even have one extra node for our example from before! How does this help? Well, for starters, we have vastly reduced the number of nodes in total: Instead of $160k$ nodes (for $160k$ words), now we have $26^2 = 676$ (all possible two-letter combinations from a–z
Note that the dataset has omitted South Slavic letters such as č and š.
Note that the dataset has omitted South Slavic letters such as č and š.
) nodes, with edges between two nodes representing all words which connect the two bigrams.
We went from a graph with many nodes, to a multigraph with few nodes.
You might say that this still does not really help, because now instead of many nodes, we have many edges (and arguably an even more complex data structure, since now we are dealing with a multigraph). But, remember that graphs are weird:
Finding the longest path (going through most nodes) is stupidly hard, but going through most edges is actually solvable in polynomial time (meaning it’s easy!).
Now that we’re in P territory, we can find a nice solution to this problem.
Sinks, connected components, and a guy named Hierholzer
Tackling this problem will require some planning, but all the tools we need to find the longest game of Kaladont are problems that are well-understood in graph theory. I’m gonna lay down the steps we’ll need to do, and then explain them one by one:
- Load data and build bigram graph
- Find largest weakly connected component (WCC)
- Balance the graph for an Eulerian trail
- Apply Hierholzer’s algorithm
- Validate solution
I’ll skip data loading, since it is trivial (one txt file) and jump straight into WCCs.
Connected Components
The easiest way to understand connected components (or as they are also called, components) in graph theory is to imagine the easiest example of a complex graph: the network of roads in a city.
A component of a graph is a set of nodes that are part of the graph (a subgraph), but that is not connected to any other set of nodes (subgraphs). If you now imagine the graph as the road network of San Francisco, the roads on Alcatraz island would be a component of the graph - they are part of the larger graph (SF), but they are not connected to the rest of the city - there exists no road between Alcatraz and the city.
Here’s a very brutal abstraction of San Francisco: The top purple node is Alcatraz island - it is part of SF, but not connected to SF proper - it is a component of the graph.
Why do components matter to us? There’s actually a special definition, called a weakly connected component (WCC) and that is what we are after: A WCC is a set of nodes in the graph, where any node can be reached from any other node, if we ignore edge direction. Finding the largest WCC in our graph of bigrams is a huge help, because words in different WCCs cannot be part of the same chain If they were in the same chain, they’d be part of the same WCC as well. If they were in the same chain, they’d be part of the same WCC as well. - it’s a property of Weakly Connected Components.
Finding the largest WCC can be done using the networkx library in Python - it’s one simple line of code
nx.weakly_connected_components()
Alternatively, one could use the Union-Find algorithm in order to compute WCCs, and then just discard all but the largest one. It works with two operations, union and find. The first operation unites two groups, while the second operation finds the group a bigram belongs to, by returning the representative bigram of that group. Implementation is straightforward:
parent: dict[str, str] = {}
def find(x: str) -> str:
parent.setdefault(x, x)
while parent[x] != x:
parent[x] = parent[parent[x]] # path halving
x = parent[x]
return x
def union(x: str, y: str) -> None:
parent[find(x)] = find(y)
# go through all words, form unions of bigrams
for word in words:
union(word[:2], word[-2:])
After this, we simply apply find to each bigram again, and count how many bigrams share a parent/group (we make a histogram, effectively). The largest of these groups is also the largest WCC. What we end up with is:
| Original Dataset | Largest WCC |
|---|---|
| $159 836$ words | $159 829$ words |
The 7 discarded words are: kg, mg, vb, wtf, www, xu, xxx - all of them are either abbreviations for physical units, or Internet slang. These words do not connect to any other word in the Croatian language by the rules of Kaladont.
We’ve got our largest WCC (even tho we have eliminated only a few words) - now we have to set the stage for actually being able to find an Eulerian trail in it. We do this with graph balancing.
Note: An Eulerian trail of a graph is a trail that passes through every edge of the graph exactly once. If the trail starts and ends in the same node, then it’s called an Eulerian circuit.Graph Balancing
If we want to find the Eulerian trail, we’ll need to make sure the graph follows certain rules:
- At most one node has
balance = +1- this is the start of the trail. - At most one node has
balance = -1- this is the end of the trail. - All other nodes have
balance = 0 - The graph is weakly connected (we ensured this by working only with the largest WCC)
Where balance is defined as
balance(node) = out_degree(node) - in_degree(node)
Croatian (and other South Slavic languages) are extremely unbalanced: there are certain bigrams which are common word endings (like ic), but there are almost no words starting with this bigram. The opposite is true as well: words that begin with the bigram pr are extremely common, but words ending with it are not. In other words, our graph contains a couple of large sinks and sources.
We’ll need to balance those out. We cannot just insert new edges to make the balance zero (that would be like inventing new words which are not in the dataset), so we will have to remove edges from the graph (discard words) until the graph is balanced.
Min-Cost Flow
The trick is to remove as few edges as possible to balance the graph, because we want to keep as many words as possible. This is a min-cost flow problem, which networkx easily cracks.
Simply put, for each bigram pair, we’ll add a directed edge between them. The edge will have a capacity assigned to it, with each unit of capacity representing removing one word from the group of words that connect these two bigrams (the solver needs these special edges). Then it’s just a matter of removing either incoming (if balance is negative) or outgoing (if balance is positive) words.
I’ve applied two more strategies at this step, namely:
- Min-cost flow runs over multiple
(start, end)candidate pairs. Each pair produces a different linear program (LP) instance, which leads to a different optimal flow solution and a different balanced subgraph. For start, I’ve picked the top imbalanced source nodes, while for end, we go with the top imbalanced sink nodes. - There’s a special bridge removal cost assigned to bridge edges, so the solver avoids removing them unless no alternative routing exists.
The code for this is a bit too long for a blog post, but in essence it looks like this:
- Find top
Nsource and sink nodes (N=3for me) - For every
(source, sink)pair- Build flow graph
- Apply
networkxmin-cost flow - Remove words according to results of min-cost flow
- If new solution is better than current best, keep it
- Return new balanced graph
Something strange happens after running graph balancing - $26571$ words remain after it. From our initial ${\sim}160k$, we are down to just ${\sim}26k$. At first, this might seem like a bug, however given the morphology of South Slavic languages, and the contents of this dataset, it makes perfect sense.
Think of each bigram as a junction in a network of pipes. Water enters a junction when the previous word in the chain ends in that bigram. In the same sense, water leaves a junction when the next word in the chain starts with that bigram. Now, following the rules of Kaladont, if we enter the junction pr, we also have to exit it via a word that begins with pr.
Now, let’s count how many words in the dataset end with pr. It’s two words. Yup. knepr and npr, the second one being an abbreviation for naprimjer. That means that the junction pr can be entered at most three times
Two entries, plus one free exit as the start node.
Two entries, plus one free exit as the start node.
, meaning at most three words that begin with pr can appear in the entire chain. Let us now count how many words in the dataset begin with pr - the answer is $13426$. And we can use three of these. The other $13423$ are stuck and of no use to us. There exists no such chain that contains more than three words starting with pr, no matter how clever our approach is.
Here’s a visual representation of the dataset. Notice the huge sink and source nodes:
Now extrapolate this to other bigrams, and the vast majority of our initial dataset gets removed.
| Bigram | Words starting with it | Words ending with it | Maximum usable | Words lost |
|---|---|---|---|---|
pr |
$13426$ | $2$ | $3$ | ${\sim}13423$ |
po |
$4664$ | $44$ | $45$ | ${\sim}4619$ |
ic |
$13$ | $14202$ | $13$ | ${\sim}14189$ |
in |
$1119$ | $11079$ | $1119$ | ${\sim}9960$ |
So, the result is a graph of $26571$ edges where an Eulerian trail from source to sink is guaranteed to exist. Now it’s just a matter of finding it.
Luckily, a German mathematician can help us out here.
Hierholzer’s Algorithm
Finding the Eulerian trail is actually pretty simple, thanks to Hierholzer’s Algorithm Named after the mathematician Carl Hierholzer Named after the mathematician Carl Hierholzer . Here’s the approach:
- Start at our designated start node, and greedily follow edges until we get stuck (by reaching a dead end). If not all edges have been used up until that point, jump back and start a new sub-path from there, and splice it into the main path at that junction. The key detail for us is to keep tracking edge labels (words), not just nodes (bigrams).
- The algorithm is stack-based, meaning we maintain two stacks, one for nodes visited on the current trail, and one for the words that were used to arrive at the nodes. When we hit a dead end, we pop from both stacks. The popped word gets put into a result list. When we reverse the result at the end, we get our list of words.
Here’s a Python implementation of this elegant algorithm:
def hierholzer(adj: dict[str, deque], start: str) -> list[str]:
node_stack: list[str] = [start]
edge_stack: list[str | None] = [None]
trail: list[str] = []
while node_stack:
v = node_stack[-1]
if adj[v]:
next_v, word = adj[v].popleft()
node_stack.append(next_v)
edge_stack.append(word)
else:
node_stack.pop()
word = edge_stack.pop()
if word is not None:
trail.append(word)
trail.reverse()
return trail
Solution
So, after running all of this, we finally arrive at a solution:
A game of Kaladont where a whopping $26549$ words are used.That ain’t so bad, we’ve used up almost all words we had available after balancing!
I won’t be posting the entire chain here, since it is huge. That said, here is the beginning and end of the chain:
['pretu',
'tu',
'tumpa',
'pa',
'pampa',
...,
'vsetecka',
'kaznjiv',
'ivcevski',
'kirs',
'rstic']
Can it be better?
The obvious question here is - theoretically, can this chain be longer? Well, it’s certainly possible, but probably not by much. And the reasoning is simple:
Our balancing step here is the wild-card. Min-cost flow has multiple equally optimal solutions, and networkx just picks one at random, instead of exploring the full theoretical solution space. One way to solve this could be to use Integer Linear Programming (ILP), which might yield better results. The min-cost flow solves degree balance and minimum removals in a truly optimal way, but it is not aware of connectivity. ILP solves all three of these, albeit NP-hard (which is fine for a graph of such few nodes as ours).
Running ILP (for example through gurobi
Again, I am in no way affiliated with them.
Again, I am in no way affiliated with them.
) could find an even better path - still, if such a path exists, it would only be marginally better.
What I wanted to do with this post is to show off how a simple child’s game maps to different fun graph problems (all in P!), so I won’t be doing any ILP now.
Maybe you can take a stab at it :)
Why do we even need this?
Engineering is basically optimization: You want to save time, money, resources, you name it. The problem of the longest path is, at first glance, something that is entirely against the core philosophy of engineering - would I not prefer to get from A to B in the fastest, cheapest way possible? And yes, that’s true for many problems. However, there are concrete applications in industry - I’d do this post a disservice if I didn’t mention them, so here are some of my favorites:
- Logistics & Route Planning: The classic example from my college days is the Chinese Postman Problem - a postman needs to go through every street in a city, at least once. The Eulerian trail is the optimal solution. You can imagine that the same idea applies to waste management, road inspection, etc.
- DNA sequencing: Assembling a genome requires reconstructing the full DNA sequence from millions of fragments (which, just like our Kaladont model, overlap with one another). These fragments are edges in a directed graph Called the De Bruijn graph - I knew my Bioinformatics elective would pay off one day! Called the De Bruijn graph - I knew my Bioinformatics elective would pay off one day! , and finding the original sequence is basically finding the Eulerian trail.
- Diameter signalling route testing: This one I’ve picked up from the telco folks which I currently work with - signalling networks are complex mesh topologies, and testing every link requires covering every edge. If we want to get full testing coverage, while at the same time minimizing the number of messages we need to send, then we are basically solving the same problem as this blog post.
Final words
Graphs are everywhere around us, and a lot of problems can be mapped to unwrangling some kind of graph. Even a simple word game can teach us many fundamental things in graph theory, and isn’t that just neat?
And where does the longest game of Kaladont end? At rstic, a word ending in ic, the language’s biggest sink. As it turns out, all roads really do lead to ic.