The UCR Time Series Archive

Hoang Anh Dau; Anthony Bagnall; Kaveh Kamgar; Chin-Chia Michael Yeh; Yan Zhu; Shaghayegh Gharghabi; Chotirat Ann Ratanamahatana; Eamonn Keogh

doi:10.1109/JAS.2019.1911747

Volume 6 Issue 6

Nov. 2019

IEEE/CAA Journal of Automatica Sinica


Hoang Anh Dau, Anthony Bagnall, Kaveh Kamgar, Chin-Chia Michael Yeh, Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana and Eamonn Keogh, "The UCR Time Series Archive," IEEE/CAA J. Autom. Sinica, vol. 6, no. 6, pp. 1293-1305, Nov. 2019. doi: 10.1109/JAS.2019.1911747


Author Bio:
Hoang Anh Dau received the Ph.D. degree in computer science from the University of California, Riverside in 2019. Her research interests include machine learning and data mining. She is particularly interested in time series data analysis for solving real-world problems.

Anthony Bagnall is a Full Professor of computer science at the University of East Anglia, UK. His research interests span data mining and machine learning, with a focus on time series mining and ensemble techniques. He specialises in designing and evaluating algorithms for time series classification.

Kaveh Kamgar received the M.S. degree in computer science from the University of California, Riverside, where he currently works as a Research Assistant. His research interest is data mining.

Chin-Chia Michael Yeh is currently a Staff Research Scientist at Visa Research. He received the Ph.D. degree in computer science from the University of California, Riverside. His research interests include time series analysis, data mining, machine learning, and information retrieval.

Yan Zhu is currently a Software Engineer at Google Research. She received the Ph.D. degree in computer science from the University of California, Riverside in 2018. Her research interests include machine learning and data mining.

Shaghayegh Gharghabi is currently a Ph.D. candidate in computer science at the University of California, Riverside. Her research interests include data mining and machine learning, with a focus on time series data segmentation and clustering.

Chotirat Ann Ratanamahatana is an Associate Professor of computer engineering at Chulalongkorn University, Bangkok, Thailand. Her research interests include data mining, machine learning, and artificial intelligence, with a focus on time series data mining techniques.

Eamonn Keogh is a Full Professor of computer science at the University of California, Riverside. His research areas include data mining, machine learning, and information retrieval, specializing in techniques for solving similarity and indexing problems in time series data sets.
Corresponding author: H. A. Dau, K. Kamgar, C-C. M. Yeh, Y. Zhu, S. Gharghabi, and E. Keogh are with the Department of Computer Science and Engineering, University of California, Riverside, CA 92521 USA (e-mail: hdau001@ucr.edu; kkamg001@ucr.edu; myeh003@ucr.edu; yzhu015@ucr.edu; sghar003@ucr.edu; eamonn@cs.ucr.edu)
¹Why would someone use the archive and not acknowledge it? Carelessness probably explains the majority of such omissions. In addition, for several years (approximately 2006 to 2011), access to the archive was conditional on informally pledging to test on all data sets to avoid cherry picking (see Section IV). Some authors who made this pledge then went on to test on only a limited subset, and may have chosen not to cite the archive to avoid drawing attention to their failure to live up to their implied pledge.
²These works should not be confused with papers that suggest using a wavelet representation to perform dimensionality reduction to allow more efficient indexing of time series.
Received Date: 2019-04-06
Revised Date: 2019-07-08
Accepted Date: 2019-10-09

Available Online: 2019-10-14

Abstract

The UCR time series archive, introduced in 2002, has become an important resource in the time series data mining community, with at least one thousand published papers making use of at least one data set from the archive. The original incarnation of the archive had sixteen data sets, but since that time it has gone through periodic expansions. The last expansion took place in the summer of 2015, when the archive grew from 45 to 85 data sets. This paper introduces and focuses on the new expansion from 85 to 128 data sets. Beyond expanding this valuable resource, this paper offers pragmatic advice to anyone who may wish to evaluate a new algorithm on the archive. Finally, this paper makes a novel and yet actionable claim: of the hundreds of papers that show an improvement over the standard baseline (1-nearest neighbor classification), a fraction might be mis-attributing the reasons for their improvement. Moreover, the improvements claimed by these papers might have been achievable with a much simpler modification, requiring just a few lines of code.
Keywords: data mining, time series classification, UCR time series archive
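
For readers unfamiliar with the baseline the abstract refers to, the following is a minimal sketch of 1-nearest neighbor classification. It assumes Euclidean distance and equal-length series; the function names and the toy data are hypothetical and are not taken from the archive or from the paper.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length sequences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def one_nn_predict(train_X, train_y, query):
    """Return the label of the training series closest to `query`."""
    best_index = min(
        range(len(train_X)),
        key=lambda i: euclidean(train_X[i], query),
    )
    return train_y[best_index]


def error_rate(train_X, train_y, test_X, test_y):
    """Fraction of test series whose predicted label is wrong."""
    wrong = sum(
        one_nn_predict(train_X, train_y, q) != y
        for q, y in zip(test_X, test_y)
    )
    return wrong / len(test_X)


# Toy data (hypothetical): one training example per class.
train_X = [[0.0, 1.0, 0.0, -1.0], [1.0, 1.0, -1.0, -1.0]]
train_y = ["sine", "square"]
test_X = [[0.1, 0.9, -0.1, -0.8], [0.9, 1.1, -1.0, -0.9]]
test_y = ["sine", "square"]

print(error_rate(train_X, train_y, test_X, test_y))
```

Swapping `euclidean` for another distance function changes the baseline without touching the rest of the sketch, which is why small modifications to this classifier can be tested in a few lines.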

Highlights

- Introduces a significant expansion of the UCR Time Series Archive, the standard benchmark for time series classification for the last two decades.
- Offers advice and "pitfalls-to-avoid" for researchers working in time series classification.
- Offers concrete demonstrations of the dangers of "cherry picking", a common problem in the literature that makes comparisons between rival methods difficult and can give the illusion of progress.
