Surrogate data are artificially generated data sets designed to replicate selected statistical properties of observed empirical data while deliberately lacking others. They are commonly employed in the analysis of time series and other stochastic processes to test hypotheses, assess model adequacy, and augment limited observational records.
Definition and Scope
Surrogate data, sometimes called analogous data, typically refer to time series produced using well-defined linear models (e.g., autoregressive–moving-average (ARMA) processes) that preserve features of the original series such as its autocorrelation structure, amplitude distribution, or power spectrum. The concept also encompasses synthetic data generated by statistical procedures, or transformed from alternative sources, to serve as proxies for unavailable or incomplete measurements.
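As an illustration of a surrogate produced by a linear model, the following is a minimal sketch that fits a purely autoregressive model (a special case of ARMA) with the Yule-Walker equations and then simulates a new series from the fitted model. The function names `fit_ar_yule_walker` and `simulate_ar`, the model order, and the synthetic "observed" series are assumptions made for this example only.

```python
import numpy as np

def fit_ar_yule_walker(x, order):
    """Estimate AR(order) coefficients and innovation variance (Yule-Walker)."""
    x = np.asarray(x, dtype=float)
    centered = x - x.mean()
    n = len(centered)
    # Biased sample autocovariances at lags 0..order
    acov = np.array([np.dot(centered[:n - k], centered[k:]) / n
                     for k in range(order + 1)])
    # Toeplitz autocovariance matrix R[i, j] = acov[|i - j|]
    lags = np.abs(np.subtract.outer(np.arange(order), np.arange(order)))
    coeffs = np.linalg.solve(acov[lags], acov[1:])
    noise_var = acov[0] - np.dot(coeffs, acov[1:])
    return coeffs, noise_var

def simulate_ar(coeffs, noise_var, n, rng, mean=0.0, burn_in=500):
    """Simulate n samples of a Gaussian AR process, discarding a burn-in."""
    order = len(coeffs)
    total = n + burn_in
    noise = rng.normal(0.0, np.sqrt(noise_var), size=total)
    y = np.zeros(total)
    for t in range(total):
        past = y[max(0, t - order):t][::-1]  # most recent value first
        y[t] = np.dot(coeffs[:len(past)], past) + noise[t]
    return y[burn_in:] + mean

rng = np.random.default_rng(0)
# Stand-in for an empirical record (assumption: an AR(1) series with phi = 0.7)
observed = simulate_ar(np.array([0.7]), 1.0, 4000, rng)

coeffs, noise_var = fit_ar_yule_walker(observed, order=1)
surrogate = simulate_ar(coeffs, noise_var, len(observed), rng,
                        mean=observed.mean())
```

The surrogate shares the fitted linear correlation structure with the observed series but is otherwise an independent realization, so it reproduces the autocorrelation only up to estimation and sampling error.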
Methodological Basis
The principal methodological framework is surrogate data testing, a statistical proof‑by‑contradiction technique akin to permutation tests and parametric bootstrapping. The procedure involves:
- Formulating a null hypothesis that the observed data arise from a specified linear stochastic process (often a stationary Gaussian linear model).
- Generating an ensemble of surrogate data sets that satisfy the constraints of the null hypothesis using Monte‑Carlo methods.
- Computing a discriminating statistic (e.g., a nonlinear invariant such as the correlation dimension or the largest Lyapunov exponent) for both the original series and each surrogate.
- Comparing the original statistic to the surrogate distribution; a significant deviation leads to rejection of the null hypothesis, indicating non‑linearity or other complex structure in the original data.
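The steps above can be sketched as follows. This is a minimal illustration, not a reference implementation: it assumes phase-randomized surrogates as the null ensemble, a simple time-reversal asymmetry statistic in place of a costlier nonlinear invariant, and a two-sided rank-based p-value. The names `phase_randomized`, `time_reversal_asymmetry`, and `surrogate_test` are hypothetical, and the logistic-map series is only a stand-in for observed data.

```python
import numpy as np

def phase_randomized(x, rng):
    """Surrogate with the same periodogram as x but randomized Fourier phases."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    phases[0] = 0.0            # leave the zero-frequency (mean) term unchanged
    if n % 2 == 0:
        phases[-1] = 0.0       # Nyquist term must stay real for even n
    return np.fft.irfft(spectrum * np.exp(1j * phases), n=n)

def time_reversal_asymmetry(x, lag=1):
    """Third moment of lagged differences; near zero for time-reversible series."""
    return np.mean((x[lag:] - x[:-lag]) ** 3)

def surrogate_test(x, statistic, n_surrogates=99, seed=0):
    """Return the observed statistic, the surrogate statistics, and a p-value."""
    rng = np.random.default_rng(seed)
    observed = statistic(x)
    null_stats = np.array([statistic(phase_randomized(x, rng))
                           for _ in range(n_surrogates)])
    # Two-sided comparison: how many surrogates deviate from the surrogate
    # mean at least as much as the observed value does?
    center = null_stats.mean()
    n_extreme = np.sum(np.abs(null_stats - center) >= abs(observed - center))
    p_value = (n_extreme + 1) / (n_surrogates + 1)
    return observed, null_stats, p_value

# Stand-in data (assumption): a logistic-map series, which is nonlinear.
x = np.empty(1024)
x[0] = 0.3
for t in range(1, len(x)):
    x[t] = 3.9 * x[t - 1] * (1.0 - x[t - 1])

observed_stat, null_stats, p_value = surrogate_test(x, time_reversal_asymmetry)
print(f"observed = {observed_stat:.4g}, p = {p_value:.3f}")
```

With 99 surrogates the smallest attainable p-value is 0.01, so the ensemble size sets the resolution of the test.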
Generation Techniques
Surrogate data are produced by two broad families of algorithms:
- Typical realizations: Surrogates are generated as outputs of a fitted model that reproduces the target statistical properties.
- Constrained realizations: Surrogates are derived directly from the original data through transformations that preserve chosen characteristics.
Common constrained-realization methods include:
| Algorithm | Key Features | Preserved Properties |
|---|---|---|
| Algorithm 0 (Random Shuffle, RS) | Random permutation of the original series | Amplitude distribution; destroys temporal correlation |
| Algorithm 1 (Random Phases, RP / FT) | Randomizes Fourier phases while retaining magnitude spectrum | Autocorrelation / periodogram |
| Algorithm 2 (Amplitude Adjusted Fourier Transform, AAFT) | Gaussianizes data by rank ordering, applies RP, then re-maps to the original amplitudes | Amplitude distribution (exact); linear correlation (approximate) |
| Iterative AAFT (IAAFT) | Alternately enforces the original power spectrum and the original amplitude distribution until the surrogate stops changing | Amplitude distribution (exact when the iteration ends on the amplitude step); power spectrum matched more closely than with AAFT |
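The four algorithms in the table can be sketched compactly with NumPy. This is an illustrative sketch under stated assumptions rather than a definitive implementation: the function names are hypothetical, the IAAFT loop runs for a fixed number of iterations instead of testing for convergence, and the demonstration series (an exponentiated AR(1) process) is invented for the example.

```python
import numpy as np

def random_shuffle(x, rng):
    """Algorithm 0: permute the samples, keeping only the amplitude distribution."""
    return rng.permutation(x)

def random_phases(x, rng):
    """Algorithm 1: randomize Fourier phases, keeping the magnitude spectrum."""
    n = len(x)
    spectrum = np.fft.rfft(x)
    phases = rng.uniform(0.0, 2.0 * np.pi, size=spectrum.shape)
    phases[0] = 0.0                      # keep the mean term unchanged
    if n % 2 == 0:
        phases[-1] = 0.0                 # Nyquist term stays real for even n
    return np.fft.irfft(spectrum * np.exp(1j * phases), n=n)

def rank_order(values, template):
    """Reorder `values` so that their ranks follow those of `template`."""
    ranks = np.argsort(np.argsort(template))
    return np.sort(values)[ranks]

def aaft(x, rng):
    """Algorithm 2: Gaussianize by rank, randomize phases, re-map amplitudes."""
    gaussian = rank_order(rng.standard_normal(len(x)), x)
    shuffled = random_phases(gaussian, rng)
    return rank_order(x, shuffled)

def iaaft(x, rng, n_iter=100):
    """Iterative AAFT with a fixed iteration count (assumption: no stop test)."""
    n = len(x)
    target_amplitudes = np.abs(np.fft.rfft(x))
    s = rng.permutation(x)
    for _ in range(n_iter):
        # Step 1: impose the original magnitude spectrum, keep current phases
        phases = np.angle(np.fft.rfft(s))
        s = np.fft.irfft(target_amplitudes * np.exp(1j * phases), n=n)
        # Step 2: impose the original amplitude distribution by rank ordering
        s = rank_order(x, s)
    return s

# Demonstration series (assumption): AR(1) noise passed through exp()
rng = np.random.default_rng(0)
ar = np.zeros(512)
for t in range(1, len(ar)):
    ar[t] = 0.8 * ar[t - 1] + rng.standard_normal()
x = np.exp(0.5 * ar)

surrogates = {
    "RS": random_shuffle(x, rng),
    "RP": random_phases(x, rng),
    "AAFT": aaft(x, rng),
    "IAAFT": iaaft(x, rng),
}
```

Because this `iaaft` sketch ends on the rank-ordering step, its output is an exact permutation of the original values and matches the spectrum only approximately; ending on the spectral step instead would reverse that trade-off.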
Additional approaches use wavelet transforms, optimization procedures, or tailored resampling schemes to address non‑stationary or multivariate data.
Applications
- Testing for Non‑linearity: Detecting deterministic chaos or higher‑order dependencies in physical, biological, and economic time series.
- Model Validation: Assessing whether a proposed stochastic model captures essential dynamics of observed data.
- Forecasting Enhancement: Pooling surrogate series from similar processes to improve predictive accuracy.
- Environmental and Ecological Modeling: Substituting surrogate data when direct measurements are unavailable, e.g., estimating population trends of wildlife or biodiversity metrics.
- Synthetic Data Generation: Creating data for algorithm development, machine‑learning training, or privacy‑preserving data sharing.
Relation to Other Concepts
Surrogate data are closely related to synthetic data, bootstrapping, jackknife resampling, and data augmentation techniques. While bootstrapping resamples the observed data with replacement, surrogate generation often imposes additional constraints so that the surrogates mimic specific statistical structures under a predefined null hypothesis.
References
- Theiler, J., et al. “Testing for nonlinearity in time series: The method of surrogate data.” Physica D, 1992.
- Prichard, D., and Theiler, J. “Generating surrogate data for time series with several simultaneously measured variables.” Physical Review Letters, 1994.
- Kaefer, P. E. “Transforming Analogous Time Series Data to Improve Natural Gas Demand Forecast Accuracy,” M.Sc. thesis, Marquette University, 2015.
- Hernández‑Camacho, C. J., et al. “The Use of Surrogate Data in Demographic Population Viability Analysis: A Case Study of California Sea Lions.” PLOS ONE, 2015.