Definition
Energy distance is a statistical distance that quantifies the dissimilarity between two probability distributions. Formally, for random vectors $X$ and $Y$ in $\mathbb{R}^{d}$ with respective distributions $F$ and $G$, the energy distance $D(F,G)$ is defined as
$$ D(F,G)=2\mathbb{E}|X-Y|-\mathbb{E}|X-X'|-\mathbb{E}|Y-Y'|, $$
where $X'$ is an independent copy of $X$, $Y'$ is an independent copy of $Y$, $|\cdot|$ denotes the Euclidean norm, and the expectations are taken with respect to the joint distributions of the indicated pairs.
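The definition can be estimated directly from samples by replacing each expectation with the corresponding average of pairwise Euclidean distances (a plug-in, or V-statistic, estimate). The following NumPy-based sketch is illustrative, not a particular library's implementation:

```python
import numpy as np

def energy_distance(x, y):
    """Plug-in (V-statistic) estimate of the energy distance D(F, G).

    x and y are arrays of shape (n, d) and (m, d); 1-D inputs are
    treated as univariate samples. Illustrative sketch only.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    if y.ndim == 1:
        y = y[:, None]
    # Replace each expectation with the mean of pairwise Euclidean distances.
    d_xy = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    d_xx = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    d_yy = np.linalg.norm(y[:, None, :] - y[None, :, :], axis=-1)
    return 2.0 * d_xy.mean() - d_xx.mean() - d_yy.mean()
```

The estimate is zero when both samples are identical, positive when they differ, and symmetric in its arguments, mirroring the population properties.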
Overview
Energy distance belongs to the family of energy statistics, a class of statistical tools introduced in the early 2000s that draw an analogy between statistical dependence and physical potential energy. The measure is non‑negative, symmetric, and equals zero if and only if the two distributions are identical (under mild moment conditions); its square root is a metric on the space of probability distributions. It underlies non‑parametric hypothesis tests such as the energy test for equality of distributions (a two‑sample test) and is closely related to distance covariance, which assesses statistical independence between random vectors.
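The two-sample energy test mentioned above is typically carried out as a permutation test: the statistic is the sample energy distance, and its null distribution is approximated by recomputing the statistic over random relabelings of the pooled sample. The sketch below assumes 2-D NumPy arrays of observations; the function names are illustrative:

```python
import numpy as np

def energy_stat(x, y):
    # Plug-in energy distance; x and y are (n, d) and (m, d) arrays.
    d_xy = np.linalg.norm(x[:, None] - y[None, :], axis=-1)
    d_xx = np.linalg.norm(x[:, None] - x[None, :], axis=-1)
    d_yy = np.linalg.norm(y[:, None] - y[None, :], axis=-1)
    return 2.0 * d_xy.mean() - d_xx.mean() - d_yy.mean()

def energy_test(x, y, n_perm=999, seed=0):
    """Permutation p-value for H0: both samples share one distribution."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([x, y])
    n = len(x)
    observed = energy_stat(x, y)
    exceed = 1  # count the observed statistic itself
    for _ in range(n_perm):
        idx = rng.permutation(len(pooled))
        if energy_stat(pooled[idx[:n]], pooled[idx[n:]]) >= observed:
            exceed += 1
    return exceed / (n_perm + 1)
```

A large observed statistic relative to the permutation distribution yields a small p-value, indicating the two samples are unlikely to come from the same distribution.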
Etymology/Origin
The term “energy” reflects the conceptual link to Newtonian potential energy: the expected Euclidean distance between points drawn from different distributions plays a role analogous to the potential energy of a system of particles. The formal definition and properties of energy distance were presented by Gábor J. Székely and Maria L. Rizzo in “Energy statistics: A class of statistics based on distances,” and subsequent publications refined its theoretical basis and demonstrated applications in multivariate analysis.
Characteristics
| Characteristic | Description |
|---|---|
| Metric Property | Non‑negative, symmetric, and zero exactly when the distributions coincide; the square root of the energy distance additionally satisfies the triangle inequality, making it a metric on distributions with finite first moments. |
| Dependence on Euclidean Norm | Uses the Euclidean distance; alternative norms yield related but distinct measures. |
| Robustness | Non‑parametric; does not assume specific distributional forms and is applicable to high‑dimensional data. |
| Computational Aspects | Can be estimated from samples via $U$-statistics; computational cost is $O(n^{2})$ for naïve implementation, with fast approximations available. |
| Relation to Other Measures | Equals twice the squared maximum mean discrepancy (MMD) computed with the distance‑induced kernel $k(x,y)=\tfrac{1}{2}(|x|+|y|-|x-y|)$; closely linked to distance covariance, which can be viewed as an energy distance between the joint distribution and the product of the marginals. |
| Applications | Two‑sample testing, clustering, goodness‑of‑fit assessment, feature selection, and measuring distributional drift in machine learning. |
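The MMD relation in the table can be checked numerically: with the distance‑induced kernel $k(a,b)=\tfrac{1}{2}(|a|+|b|-|a-b|)$, the plug-in energy distance equals exactly twice the plug-in squared MMD. A minimal sketch (illustrative function names, NumPy assumed):

```python
import numpy as np

def mean_dist(a, b):
    # Mean pairwise Euclidean distance between rows of a and b.
    return np.linalg.norm(a[:, None] - b[None, :], axis=-1).mean()

def energy_v(x, y):
    # Plug-in (V-statistic) energy distance.
    return 2.0 * mean_dist(x, y) - mean_dist(x, x) - mean_dist(y, y)

def mmd2_v(x, y):
    # Plug-in squared MMD with the distance-induced kernel
    # k(a, b) = (|a| + |b| - |a - b|) / 2.
    def kmean(a, b):
        na = np.linalg.norm(a, axis=-1)
        nb = np.linalg.norm(b, axis=-1)
        d = np.linalg.norm(a[:, None] - b[None, :], axis=-1)
        return 0.5 * (na[:, None] + nb[None, :] - d).mean()
    return kmean(x, x) + kmean(y, y) - 2.0 * kmean(x, y)
```

Expanding the kernel means term by term shows the identity $D = 2\,\mathrm{MMD}^2$ holds exactly for the plug-in estimates, not just in expectation.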
Related Topics
- Energy Statistics – broader class of statistics based on inter‑point distances.
- Distance Covariance and Distance Correlation – measures of dependence derived from similar principles.
- Maximum Mean Discrepancy (MMD) – kernel‑based distance between distributions with connections to energy distance.
- U‑Statistics – framework for unbiased estimation of functions of sample moments, used in computing energy distance.
- Non‑Parametric Two‑Sample Tests – includes the energy test, Kolmogorov–Smirnov test, and others.
References (selected):
- Székely, G. J., & Rizzo, M. L. (2013). “Energy Statistics: A Class of Statistics Based on Distances.” Journal of Statistical Planning and Inference, 143(8), 1249–1272.
- Székely, G. J., Rizzo, M. L., & Bakirov, N. K. (2007). “Measuring and Testing Dependence by Correlation of Distances.” The Annals of Statistics, 35(6), 2769–2794.