Hamming distance

The Hamming distance is a metric for comparing two strings of equal length, defined as the number of positions at which the corresponding symbols differ. Equivalently, it is the minimum number of substitutions required to change one string into the other; insertions and deletions are not permitted.

Mathematical definition
For two strings $x = x_1x_2\ldots x_n$ and $y = y_1y_2\ldots y_n$ of length $n$, the Hamming distance $d_H(x, y)$ is

$$ d_H(x, y) = \sum_{i=1}^{n} \mathbf{1}(x_i \neq y_i), $$

where $\mathbf{1}(\cdot)$ is the indicator function, yielding 1 when its argument is true and 0 otherwise.
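
As a minimal sketch, the definition translates directly into a sum of per-position comparisons. The function name `hamming_distance` and the example strings are illustrative choices, not part of the definition.

```python
def hamming_distance(x: str, y: str) -> int:
    """Count the positions at which x and y differ."""
    if len(x) != len(y):
        # The distance is defined only for strings of equal length.
        raise ValueError("strings must have equal length")
    # Each comparison a != b plays the role of the indicator 1(x_i != y_i);
    # Python treats True as 1 and False as 0 when summing.
    return sum(a != b for a, b in zip(x, y))


print(hamming_distance("karolin", "kathrin"))  # 3
print(hamming_distance("1011101", "1001001"))  # 2
```

Raising an error on unequal lengths, rather than truncating to the shorter string, keeps the function faithful to the equal-length requirement.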

Historical background
The concept is named after the American mathematician and information theorist Richard W. Hamming, who introduced it in his 1950 paper on error-detecting and error-correcting codes. Hamming developed the distance as a tool for analyzing the error‑correcting capability of binary codes, leading to the creation of Hamming codes.

Properties

  • Non‑negativity and identity: $d_H(x, y) \ge 0$, with $d_H(x, y) = 0$ if and only if $x = y$.
  • Symmetry: $d_H(x, y) = d_H(y, x)$.
  • Triangle inequality: For any three strings $x, y, z$ of equal length, $d_H(x, z) \le d_H(x, y) + d_H(y, z)$.

These properties confirm that the Hamming distance is a valid metric on the space of fixed‑length strings.
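
One way to see why the triangle inequality holds is to check it position by position. If $x_i \neq z_i$, then $y_i$ cannot equal both $x_i$ and $z_i$, so at least one of $x_i \neq y_i$ and $y_i \neq z_i$ is true. In terms of the indicator function used in the definition,

$$ \mathbf{1}(x_i \neq z_i) \le \mathbf{1}(x_i \neq y_i) + \mathbf{1}(y_i \neq z_i) \quad \text{for each } i, $$

and summing over $i = 1, \ldots, n$ gives $d_H(x, z) \le d_H(x, y) + d_H(y, z)$.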

Applications

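  • Error detection and correction: In digital communications, the minimum Hamming distance between codewords determines how many errors can be reliably detected or corrected. A code with minimum distance $d$ can detect up to $d - 1$ errors and correct up to $\lfloor (d - 1)/2 \rfloor$ errors per codeword.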
  • Coding theory: Used to evaluate and design block codes, such as Hamming codes, Reed–Solomon codes, and BCH codes.
  • Cryptography: Assists in assessing the avalanche effect of cryptographic functions and in designing hash functions.
  • Bioinformatics: Employed to measure dissimilarity between DNA, RNA, or protein sequences of equal length.
  • Information retrieval and machine learning: Utilized as a similarity measure for binary feature vectors, particularly in nearest‑neighbor classification and clustering.
  • Data storage: Applied in fault‑tolerant memory systems to detect and correct bit flips.
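
As an illustration of the error detection and correction item, the following sketch computes the minimum distance of a small block code and derives its detection and correction limits from it. The helper names and the two example codes (a 3-bit repetition code and a 3-bit even-parity code) are assumptions chosen for brevity.

```python
from itertools import combinations


def hamming(x: str, y: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(a != b for a, b in zip(x, y))


def minimum_distance(code: list[str]) -> int:
    """Smallest pairwise distance; assumes at least two equal-length codewords."""
    return min(hamming(u, v) for u, v in combinations(code, 2))


def capabilities(code: list[str]) -> tuple[int, int, int]:
    """Return (d, detectable errors, correctable errors) for the code."""
    d = minimum_distance(code)
    # Standard bounds: detect up to d - 1 errors, correct up to floor((d - 1) / 2).
    return d, d - 1, (d - 1) // 2


repetition = ["000", "111"]
even_parity = ["000", "011", "101", "110"]

print(capabilities(repetition))   # (3, 2, 1)
print(capabilities(even_parity))  # (2, 1, 0)
```

The pairwise search is quadratic in the number of codewords, which is fine for small examples but not for large codes.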

Variants and related metrics

  • Generalized Hamming distance: Extends the concept to non‑binary alphabets by counting mismatched symbols regardless of their specific values.
  • Weighted Hamming distance: Assigns different costs to mismatches at different positions, useful when certain bits are more critical.
  • Levenshtein distance: A related metric that also accounts for insertions and deletions, applicable to strings of differing lengths.
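
The weighted variant can be sketched as follows, assuming one non-negative cost per position. The function name and the example weights (binary place values, so that a flipped high-order bit costs more) are hypothetical choices for illustration.

```python
def weighted_hamming_distance(x: str, y: str, weights: list[float]) -> float:
    """Sum the per-position weights over the positions where x and y differ."""
    if not (len(x) == len(y) == len(weights)):
        raise ValueError("x, y, and weights must have equal length")
    return sum(w for a, b, w in zip(x, y, weights) if a != b)


# Positions 0 and 3 differ, so the cost is 8 + 1.
print(weighted_hamming_distance("1010", "0011", [8, 4, 2, 1]))  # 9
# With unit weights the result reduces to the ordinary Hamming distance.
print(weighted_hamming_distance("1010", "0011", [1, 1, 1, 1]))  # 2
```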

Computational aspects
The Hamming distance can be computed in linear time $O(n)$ by a single pass over the strings. For binary strings, it is efficiently calculated using bitwise XOR followed by a population count (count of set bits).
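
A minimal sketch of the XOR and population count approach, assuming the binary strings are stored as non-negative Python integers (the function name is illustrative):

```python
def hamming_distance_bits(a: int, b: int) -> int:
    """Hamming distance between the binary representations of a and b."""
    if a < 0 or b < 0:
        raise ValueError("inputs must be non-negative integers")
    # XOR sets a bit exactly where a and b differ;
    # counting the set bits gives the distance.
    return bin(a ^ b).count("1")


print(hamming_distance_bits(0b1011101, 0b1001001))  # 2
```

On Python 3.10 and later, `(a ^ b).bit_count()` performs the same count without building a string. Leading zeros do not affect the result, so both integers are implicitly treated as having the same width.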

References

  1. Hamming, R. W. (1950). "Error Detecting and Error Correcting Codes". Bell System Technical Journal, 29(2), 147–160.
  2. MacWilliams, F. J.; Sloane, N. J. A. (1977). The Theory of Error‑Correcting Codes. North-Holland.
  3. Cover, T. M.; Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley.