Investigating synthetic medical time-series resemblance
Abstract
Access to private medical data is restricted by privacy laws, which hinders research and real-world use. Synthetic data generation offers a viable solution: it produces data with high utility and privacy protection without releasing the real data. Healthcare records are often longitudinal and are affected by covariates such as age, gender, and ethnicity. As a result, synthetic healthcare data generation falls within the domain of time-series modeling and requires time-series-based measures to assess how closely synthetic data resemble real data. Covariate plots allow a qualitative assessment of time-series resemblance but lack an empirical quantitative measure, so their interpretation is biased toward the viewer's perspective. In this paper, we describe four time-series metrics for quantitatively evaluating the resemblance between real and synthetic time series, using public and private datasets from previously published healthcare research studies. We apply the metrics to covariate plots of synthetic datasets and compare the results with those of baseline synthetic datasets. We conclude that the metrics effectively capture the time-series resemblance between real and synthetic datasets. The results highlight varying degrees of resemblance across covariate subgroups and multivariate time series.