You read the title and got pissed off, right? You told yourself:
who the hell does this guy think he is, fixing my code?
Not at all. Easy there. I do not care about your source code at all. Compile it, interpret it, push it, patch it: do whatever you want. This article is not about your code, it is about codes. The ones that do not live in a repository. It is about transmissions, noisy channels, and bits that leave as kings and arrive as idiots. It is about the fact that, at some point, engineering gave up: reality is messy, it introduces errors everywhere, and the only sensible way to deal with that is not to avoid errors, but to correct them. This article is about error-correcting codes, the ones that let you search for "biopark" without getting... well, never mind. In short, these are the mechanisms behind all our digital conversations, which we use constantly without even noticing. The unsung heroes of information theory. The ones that make sure information arrives intact or, at worst, can be recovered. All right, the scope is clear. In this article we will do three things:
- see why redundancy is necessary
- analyze two classic techniques (repetition and parity check)
- formalize everything with the concept of minimum distance
Let us begin...
Damn Open Space
You know those open-space offices? The whole "we are a modern company" thing, when really they just do not want to spend money on partitions. Cheap as hell. As if I cared about everyone else's calls. Anyway, back to it.
You know the scene, right? There is one guy talking to another, the one who is permanently on a call, the one shouting on the phone, and then there is the guy firing plastic darts with a Nerf 2.0 (that is me, and I am not joking, best purchase of 2025). In short, total chaos. Well: communication channels are the same. There are thousands of distractions: interference from other channels, equipment degradation, electrical impulses, and everything else you can imagine. And when a bit gets distracted, it starts as \(0\) and arrives as \(1\) (or vice versa). All this is called noise, and it is the Moriarty of communications: invisible criminal, always present.
Just like holding a conversation in an open space gets harder as noise increases, data transmission gets harder as the communication channel gets noisier. To be understood, you either raise your voice (maybe with some swearing), or you are forced to repeat yourself. Since bits cannot swear (poor things), we only have the second option. And that option has a precise name: redundancy. We add information to the transmission so the receiver can reconstruct the message correctly, even when the channel does everything it can to ruin it.
Finally, A Partition
Let us start this journey by showing why redundancy is necessary. Intuitively, it is easy to grasp. My grandma cannot hear well, so I arm myself with patience, repeat the same sentence three times, and the third one (maybe, otherwise I give up) works. But let us look at it in a less geriatric and more mathematical way, and consider this diagram:
Figure 1. Noisy Transmission
Suppose we have a communication channel, maybe two cups connected by a string, where the message corruption probability is \(p=0.1\), that is \(10\%\) (which sounds low but is huge: imagine one wrong letter every \(10\) in a text). Suppose Alice wants to send Bob a message, and that message is a very expressive letter \(C\). There is a \(90\%\) probability that Bob receives \(C\) correctly, but we cannot be sure. So Alice has a great idea: she transmits the same message multiple times. She sends \(CCC\). Let us see what happens: the probability of receiving one message correctly is \(1-p=0.9\), so the probability of receiving all three \(C\)s correctly, since events are independent, is:
\[0.9 \times 0.9 \times 0.9 = 0.9^3 = 0.729\]
Now suppose a transmission error happens. Bob knows he received the same message three times, so he uses majority voting, thinking: if I receive two \(C\)s and one \(B\), since correct transmission is more likely than wrong transmission, that \(B\) was probably meant to be a \(C\). Smart move by Bob. So he is confident that the intended message is \(C\), and he is right. Let us see why. Assuming exactly one transmission error occurred and \(C\) became \(B\), the received message must be one of: \(CCB\), \(CBC\), \(BCC\). Let us compute the probability of each message with exactly one error:
\[P(CCB) = P(CBC) = P(BCC) = 0.9 \times 0.9 \times 0.1 = 0.081\]
So the probability of receiving one of those erroneous messages is:
\[3 \times 0.081 = 0.243\]
At this point we know three things:
- The probability of receiving the error-free message is \(0.729\).
- The probability of receiving a message with exactly one error is \(0.243\).
- When exactly one error occurs, Bob can still reconstruct the message correctly.
This means the message is received and decoded correctly in four different cases. So we can write the probability of correct decoding as:
\[0.729 + 0.243 = 0.972\]
Consequently, the probability of incorrect decoding is:
\[1 - 0.972 = 0.028\]
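These numbers are easy to check by hand, but if you prefer to see them computed, here is a minimal Python sketch of the same reasoning (purely illustrative, not from any transmission library):

```python
# Probability that the 3-fold repetition code with majority voting
# decodes correctly, assuming each symbol is corrupted independently
# with probability 0.1 (so it arrives intact with probability 0.9).
p = 0.9  # per-symbol success probability

p_all_correct = p ** 3              # all three copies arrive intact
p_one_error = 3 * p ** 2 * (1 - p)  # exactly one corrupted copy, 3 positions

p_decode_ok = p_all_correct + p_one_error
print(round(p_all_correct, 3))    # 0.729
print(round(p_one_error, 3))      # 0.243
print(round(p_decode_ok, 3))      # 0.972
print(round(1 - p_decode_ok, 3))  # 0.028
```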
Pretty good. We have seen how adding redundancy to transmissions reduces error probability from \(10\%\) to \(2.8\%\). Nice result, right? In real transmissions, we send binary sequences, not letters. Suppose devices can send only \(4\) symbols: \(A,\, B,\, C,\, D\), encoded as:
- \(A=00\)
- \(B=01\)
- \(C=10\)
- \(D=11\)
Sending one of these symbols with the redundancy scheme above means transmitting:
- \(000000\) to send letter \(A\)
- \(010101\) to send letter \(B\)
- \(101010\) to send letter \(C\)
- \(111111\) to send letter \(D\)
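The encode/decode round trip for this little four-symbol alphabet can be sketched like so. The `encode` and `decode` helpers are hypothetical names of mine, and decoding applies majority voting independently to each of the two bit positions:

```python
symbols = {"A": "00", "B": "01", "C": "10", "D": "11"}

def encode(letter: str) -> str:
    """Repeat the 2-bit pattern three times: C -> 101010."""
    return symbols[letter] * 3

def decode(word: str) -> str:
    """Majority-vote each of the two bit positions, then look up the symbol."""
    bits = ""
    for i in range(2):
        copies = word[i::2]  # the three repetitions of bit i
        bits += "1" if copies.count("1") > 1 else "0"
    return {v: k for k, v in symbols.items()}[bits]

print(encode("C"))       # 101010
print(decode("101010"))  # C
print(decode("101011"))  # one flipped bit: still decoded as C
```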
The decoder waits for the full codeword (in this case six bits) before decoding. To close this section: if you were wondering, the general formula for single-bit error probability is:
\[p_e = \sum_{k=\lceil n/2\rceil}^{n} \binom{n}{k}\, p^k \,(1-p)^{n-k}\]
Formula 1. Error Probability
where:
- \(p_e\) is the probability that decoding is wrong
- \(n\) is the number of repetitions
- \(p\) is the probability that a bit is flipped
- \(k\) is the number of wrong bits
- \(\binom{n}{k}\) is the number of different ways to have \(k\) errors
- \(p^k\) is the probability that those \(k\) bits are wrong
- \((1-p)^{\,n-k}\) is the probability that the remaining \(n-k\) are correct
- \(\lceil n/2\rceil\) is the majority threshold: as soon as \(k\) reaches it, decoding fails
- \(\sum\) adds all events that lead to error
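The formula translates almost word for word into Python, using `math.comb` for the binomial coefficient (an illustrative sketch of the formula, nothing more):

```python
from math import ceil, comb

def repetition_error_prob(n: int, p: float) -> float:
    """Probability that majority decoding of an n-fold repetition code
    fails, when each bit flips independently with probability p."""
    threshold = ceil(n / 2)  # majority is lost once k reaches this
    return sum(comb(n, k) * p**k * (1 - p)**(n - k)
               for k in range(threshold, n + 1))

print(round(repetition_error_prob(3, 0.1), 3))  # 0.028, as computed above
print(round(repetition_error_prob(5, 0.1), 4))  # more repetitions, lower error
```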
The formula gives the error probability of one information bit encoded by an \(n\)-length repetition code. From this, the probability that a message is received correctly is:
\[p_c = 1 - p_e\]
If we want the probability that a message of arbitrary length is decoded correctly, we write:
\[P = (1 - p_e)^{L}\]
where \(L\) is the number of bits in the message. For example, if the message is one of:
\[00, \quad 01, \quad 10, \quad 11\]
as seen above, then \(L=2\).
Kind Of An Expensive Partition
In the previous section we saw a simple redundancy example that allows us not only to detect an error, but also to correct it. But at what cost? We had to send \(2\) bits and we sent \(6\). That is pretty expensive: message length is tripled. So the question is: do we really need error correction? Maybe in some situations it is better to retransmit than to make the message heavier, so just detecting errors is enough. In these cases, we can still use our friend redundancy, but in a much more economical way.
Enter the parity check. It is a very simple mechanism to determine whether a transmission error occurred, by adding only one extra bit. Let us see an example. Suppose we need to send a \(7\)-bit message (just as an example, length can be arbitrary): \(0110010\). What we do is add one bit, called the parity bit, used to make the number of \(1\)s in the message even. From this definition, we can write parity as:
\[p = s \bmod 2\]
Formula 2. Parity Bit Definition
where \(s\) is the number of bits equal to \(1\). So:
- if \(s\) is even, \(p=0\)
- if \(s\) is odd, \(p=1\)
If this is not obvious, it means you did not read this article, where I also talk about modular arithmetic. In the previous message there are three bits equal to \(1\), which is odd, so \(p=1\). The transmitted message is therefore: \(0110010\color{green}{1}\).
If one error happens, or more generally an odd number of errors, parity changes, so we can detect that a transmission error occurred. Consider this message: \(1110010\). Here \(s=4\) so \(p=0\), and the transmitted message is: \(11100100\). Suppose one transmission error occurs and a bit changes from \(0\) to \(1\):
\[1110\color{red}{1}100\]
We see the last bit is \(0\), so the number of \(1\)s should be even, but now it is odd, because corruption changed it from four to five. So there is an error, and even if we do not know where, who cares, we ask for retransmission. Now let us see what happens with two errors:
\[111\color{red}{11}100\]
The last bit is still \(0\), but with two errors we now have six \(1\)s. Parity did not change, so we do not detect the error.
Parity check is therefore a code that uses redundancy to detect errors, but not to correct them. It is cheap, fast, and deeply limited.
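The whole mechanism fits in a couple of lines. Here is an illustrative Python sketch (the function names are mine, not from any standard library):

```python
def parity_bit(bits: str) -> str:
    """Parity bit that makes the total number of 1s even: s mod 2."""
    return str(bits.count("1") % 2)

def parity_ok(received: str) -> bool:
    """True when the received word (message + parity bit) has an even number of 1s."""
    return received.count("1") % 2 == 0

msg = "0110010"                 # s = 3, odd, so the parity bit is 1
sent = msg + parity_bit(msg)
print(sent)                     # 01100101

print(parity_ok("11100100"))    # True:  1110010 plus its parity bit 0
print(parity_ok("11101100"))    # False: one flipped bit changes parity
print(parity_ok("11111100"))    # True:  two flips cancel out, undetected
```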
A Better Partition
We have seen that parity check is cheap, fast, and limited. It tells us something went wrong, but has no clue where. A bit unsatisfying, right? It looks like an impasse. If we repeat bits, cost increases; if we do not repeat bits, we do not know where to correct. Luckily, there is middle ground. That middle ground is not adding more information, but organizing it better. The idea is simple: instead of treating the message as a flat bit sequence, we arrange it as a matrix. Not because we have a table fetish, but because a matrix lets us check the message in two independent directions: rows and columns. Then we add parity bits, not just one, but one for every row and every column. Redundancy does not explode; it gets distributed. The result is that a single error breaks one specific row and one specific column, pinpointing the error coordinates.

Enough words, let us do an example. Suppose we need to send this \(20\)-bit message: \(10011011001100101011\). Using raw repetition would be brutal, so we place bits into a \(4 \times 5\) matrix:
\[\begin{array}{ccccc} 1&0&0&1&1\\ 0&1&1&0&0\\ 1&1&0&0&1\\ 0&1&0&1&1 \end{array}\]
Compute parity bits on rows and columns and we get:
\[\begin{array}{ccccc|c} 1&0&0&1&1&1\\ 0&1&1&0&0&0\\ 1&1&0&0&1&1\\ 0&1&0&1&1&1\\ \hline 0&1&1&0&1&1 \end{array}\]
Now we have a \(5 \times 6\) matrix. On top of \(20\) bits, we added \(10\) redundancy bits (not \(40\)). Suppose the receiver gets this matrix:
\[\begin{array}{ccccc|c} 1&0&0&1&1&1\\ 0&1&1&0&0&0\\ 1&1&0&\color{red}{1}&1&1\\ 0&1&0&1&1&1\\ \hline 0&1&1&0&1&1 \end{array}\]
By checking parity bits on the last row and last column, we see the third row has an odd number of \(1\)s, and the same happens in the fourth column. So the error occurred at coordinates \((3, \, 4)\). Not only did we detect the error, we can also correct it. Now suppose we receive this matrix:
\[\begin{array}{ccccc|c} 1&0&0&1&1&1\\ 0&\color{red}{0}&\color{red}{0}&0&0&0\\ 1&1&0&0&1&1\\ 0&1&0&1&1&1\\ \hline 0&1&1&0&1&1 \end{array}\]
In this case, two errors occurred, specifically in columns two and three of row two. The row parity bit remains correct, while column two and three parity bits are wrong. So we know an error occurred, but we cannot identify one position. This method is smarter but has limits. It can detect that something went wrong even with multiple errors, but it can correct only one.
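The single-error case can be sketched in a few lines of Python. This is an illustrative sketch of the row/column scheme above (it omits the corner parity bit for brevity, and all names are mine):

```python
# Arrange the 20 message bits in a 4x5 matrix, compute one parity bit
# per row and per column, then use the broken row/column pair to
# locate a single transmission error.
msg = "10011011001100101011"
rows, cols = 4, 5
matrix = [[int(b) for b in msg[r * cols:(r + 1) * cols]] for r in range(rows)]

row_parity = [sum(r) % 2 for r in matrix]
col_parity = [sum(matrix[r][c] for r in range(rows)) % 2 for c in range(cols)]

# Simulate one transmission error at row 3, column 4 (1-indexed).
matrix[2][3] ^= 1

bad_rows = [r for r in range(rows) if sum(matrix[r]) % 2 != row_parity[r]]
bad_cols = [c for c in range(cols)
            if sum(matrix[r][c] for r in range(rows)) % 2 != col_parity[c]]
print(bad_rows, bad_cols)  # [2] [3] -> coordinates (3, 4) in 1-indexed form

# One broken row and one broken column pinpoint the error; flip it back.
matrix[bad_rows[0]][bad_cols[0]] ^= 1
```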
Let Us Straighten This Out
Now an obvious question appears: is there a law connecting the number of added bits to our ability to correct errors? In other words, can we quantify how robust a code is against noise? Start with a deliberately trivial case. Suppose the only valid messages are
\[000000 \quad \text{and} \quad 111111\]
If the receiver gets
\[000\color{red}{1}00\]
there is little doubt: this message is much closer to \(000000\) than to \(111111\), so the original message is easy to infer. But what if the receiver gets
\[000\color{red}{111}\]
then what? We can no longer tell which one was the original. The problem is not the error itself, it is the ambiguity: the received message is compatible with two different valid messages. What we are doing, even if we are not naming it yet, is counting how many bits differ between messages: we are measuring Hamming distance. In our example we have only two transmittable messages, and they differ in all six bits, so the code has minimum distance \(d_{min}=6\). Minimum distance is the smallest number of bit changes needed to transform one valid message into another valid message. When a message is received, we apply nearest-neighbor decoding, that is, we choose the closest valid message in terms of differing bits. As long as the received message is closer to exactly one codeword, decoding is possible. When it lands exactly in the middle, every decision becomes arbitrary. Let us make this explicit with numbers:
With one error, the smallest Hamming distance between \(000100\) and the valid codewords is \(1\), reached by \(000000\) (against \(5\) from \(111111\)), so we know the correct codeword is \(000000\).
With two errors, for example \(00\color{red}{11}00\), the minimum Hamming distance is \(2\) (against \(4\)): same conclusion as above.
With three errors, as in \(000\color{red}{111}\), both codewords sit at distance \(3\): there is no unique nearest codeword, so we have ambiguity. From here comes a first important consequence. We say a code can detect (not correct) up to \(s\) errors, if by changing at most \(s\) bits it is not possible to transform one valid codeword into another. This happens when
\[d_{min} \ge s + 1\]
Back to our example. The inequality says we can detect up to \(s=5\) errors, because
\[6 \ge 5 + 1\]
is true, and flipping up to \(s=5\) bits of a valid codeword always yields a word that is not itself valid. If instead \(s=6\), we flip all bits, moving from valid codeword \(000000\) to the other valid one \(111111\), and we can no longer tell whether an error happened.
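Nearest-neighbor decoding over this two-codeword toy code is short enough to write out. A minimal Python sketch (helper names are mine):

```python
def hamming(a: str, b: str) -> int:
    """Number of positions in which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))

codewords = ["000000", "111111"]  # a code with minimum distance 6

def nearest(received: str):
    """Nearest-neighbor decoding: return the closest codeword,
    or None when the received word is equidistant (ambiguous)."""
    dists = sorted((hamming(received, c), c) for c in codewords)
    if len(dists) > 1 and dists[0][0] == dists[1][0]:
        return None  # tie: any decision would be arbitrary
    return dists[0][1]

print(nearest("000100"))  # 000000 (one error)
print(nearest("001100"))  # 000000 (two errors)
print(nearest("000111"))  # None   (three errors: exactly halfway)
```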
We found a condition that links error detection to minimum distance, but correction is a different story. It is not enough to know something went wrong, we must also know which way to go back. To avoid ambiguity, the received message must remain closer to exactly one valid codeword. That is possible only while the number of errors is strictly less than half the minimum distance. The maximum number of correctable errors is therefore
\[t = \left\lfloor \frac{d_{min} - 1}{2} \right\rfloor\]
In our example, with \(d_{min} = 6\), we get:
\[t = \left\lfloor \frac{6 - 1}{2} \right\rfloor = 2\]
This means that up to two errors, the received message always stays closer to one codeword, matching what we saw before.
To summarize this jungle of numbers and letters: the minimum distance of a code, \(d_{min}\), governs robustness against noise and gives us two conditions:
\[d_{min} \ge 2t + 1\]
Formula 3. Correction Condition
and
\[d_{min} \ge s + 1\]
Formula 4. Detection Condition
When the correction condition holds, the code can correct up to \(t\) errors; when the detection condition holds, it can detect up to \(s\) errors.
Let us take one small step back and apply the laws we just found.
Case 1
Take this example. The defined code was:
- \(000000\) to send letter \(A\)
- \(010101\) to send letter \(B\)
- \(101010\) to send letter \(C\)
- \(111111\) to send letter \(D\)
so minimum distance (for example from \(A\) to \(B\)) is \(d_{min}=3\) (and more generally, \(d_{min}\) is equal to the number of repetitions). From the detection condition, this means we can detect up to \(s=2\) errors, and from the correction condition we can correct \(t=1\) error, which matches what we saw.
Case 2
Take the examples seen here. In the one-dimensional case, we had messages of length \(8\) with parity bit. Minimum distance is \(d_{min}=2\). If that was intuitive before, here is why. Consider message \(00000000\). If one bit gets corrupted, we get \(\color{red}{1}0000000\), which is not valid because parity does not match, so it is not in the set of codewords. With two errors, we could get \(\color{red}{1}000000\color{red}{1}\), which is a valid message at distance \(2\). Applying the same conditions gives \(s=1\) and \(t=0\), so we detect one error but correct none.
For the two-dimensional case, to move from one valid codeword to another, I must break at least two rows and two columns, so \(d_{min}=4\). From this we get detection up to \(s=3\) errors, and correction up to \(t=1\) error.
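Both conditions reduce to two one-line functions. A small Python sketch covering all the cases seen so far (repetition with \(d_{min}=3\), one-dimensional parity with \(d_{min}=2\), two-dimensional parity with \(d_{min}=4\), and the toy code with \(d_{min}=6\)):

```python
def max_detectable(d_min: int) -> int:
    """Largest s satisfying the detection condition d_min >= s + 1."""
    return d_min - 1

def max_correctable(d_min: int) -> int:
    """Largest t satisfying the correction condition d_min >= 2t + 1."""
    return (d_min - 1) // 2

for d_min in (3, 2, 4, 6):
    print(d_min, max_detectable(d_min), max_correctable(d_min))
# 3 -> detect 2, correct 1   (repetition code)
# 2 -> detect 1, correct 0   (1D parity)
# 4 -> detect 3, correct 1   (2D parity)
# 6 -> detect 5, correct 2   (toy code 000000/111111)
```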
There is a remark to make on these last two examples. In the two-dimensional example (and a similar argument applies in one dimension), if the errors happened to fall one per column, we could detect all of them, even more than the computed value of \(s\). This does not break the condition: the condition is a guarantee. Detecting more than \(s\) errors is never guaranteed; it happens only in special cases, such as one error per column.
Conclusions
We have reached the end of this first article. I will tell you: there is still a long road ahead, and here we only scratched the surface. With simple but concrete examples, we saw what error-correcting codes are. We understood why redundancy is necessary, and how better organization of information can let us not only detect errors, but also correct them, often with surprisingly limited overhead. In the next articles we will dive deeper with more structured and powerful examples: Hamming code, Reed-Solomon, and friends. There we will see how these keep the everyday digital world standing. If you want to tinker a bit, I put the examples shown here in this repo, so you can get your hands dirty without reinventing the wheel.
See you next time.