Abstract
Machine learning is an active research field, valued for its ability to discover hidden knowledge and patterns in large datasets. In this paper, we propose a heuristic algorithm that can solve classification-only problems, clustering-only problems, or combined classification and clustering problems. It builds models that can evolve to a new class/cluster configuration as new data arrive, without a retraining process. The algorithm combines supervised and unsupervised learning principles to construct classes and clusters incrementally, following the main guidelines of two classical methods: distance-based classification (KNN) and prototype-based clustering (K-means). It accepts both labeled and unlabeled samples as inputs in order to create new groups (classes or clusters) or to merge or reconfigure existing ones. The creation of new groups follows three sequential steps: (i) locate the provisional group for an input sample using K-means; (ii) using 1NN, locate the sample nearest to the input sample, considering only the samples in the provisional group; and (iii) merge or reconfigure existing groups following specific guidelines. We evaluated the proposal on several classification and clustering benchmarks and compared the results with those of classical algorithms. In addition, we created artificial datasets with labeled and unlabeled samples to show the ability of the algorithm to solve classification and clustering combined in the hybrid context. As a result, the algorithm is able to create clusters and classes simultaneously when required. Finally, a real case study of fault diagnosis in rotating machinery is presented, in which the algorithm is used to discover new groups that might represent patterns in unknown data.
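Steps (i) and (ii) can be illustrated with a minimal Python sketch. It is not the authors' implementation: it assumes that "locating the provisional group using K-means" means assigning the sample to the nearest group centroid under Euclidean distance, and that every group holds at least one sample. The function names `provisional_group` and `nearest_in_group` and the toy data are hypothetical. Step (iii) is omitted because the abstract does not state the merge and reconfiguration guidelines.

```python
import numpy as np


def provisional_group(x, centroids):
    """Step (i): K-means-style assignment (assumed interpretation).

    Returns the index of the centroid closest to x.
    """
    distances = np.linalg.norm(centroids - x, axis=1)
    return int(np.argmin(distances))


def nearest_in_group(x, samples, group_ids, group):
    """Step (ii): 1NN search restricted to the provisional group.

    Returns the index (into `samples`) of the nearest member and its distance.
    Assumes the group contains at least one sample.
    """
    member_idx = np.flatnonzero(group_ids == group)
    distances = np.linalg.norm(samples[member_idx] - x, axis=1)
    best = int(np.argmin(distances))
    return int(member_idx[best]), float(distances[best])


# Hypothetical toy data: four stored samples in two existing groups.
samples = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [6.0, 5.0]])
group_ids = np.array([0, 0, 1, 1])

# Each group's prototype is taken to be the mean of its members.
centroids = np.array([samples[group_ids == g].mean(axis=0) for g in (0, 1)])

# New incoming sample.
x = np.array([5.2, 4.9])
group = provisional_group(x, centroids)
neighbor, distance = nearest_in_group(x, samples, group_ids, group)
print(group, neighbor, round(distance, 4))
```

Restricting the 1NN search to the provisional group keeps the neighbor lookup proportional to the size of one group rather than the whole sample set. How the neighbor and its distance drive the decision to create, merge, or reconfigure groups is defined in the paper, not here.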
Acknowledgements
The authors gratefully acknowledge M.Sc. Fannia Pacheco for her contribution to this work. The results and guidelines proposed in her Master's thesis (https://bit.ly/2EQtOgm) have sustained the development of this approach. This work was sponsored by the Prometeo Project of the Secretariat for Higher Education, Science, Technology and Innovation (SENESCYT) of the Republic of Ecuador. The experiment for the real case application was carried out at the Vibration Lab of the GIDTEC research group at the Universidad Politécnica Salesiana, Cuenca, Ecuador.
About this article
Cite this article
Cerrada, M., Aguilar, J., Altamiranda, J. et al. A hybrid heuristic algorithm for evolving models in simultaneous scenarios of classification and clustering. Knowl Inf Syst 61, 755–798 (2019). https://doi.org/10.1007/s10115-019-01336-3
DOI: https://doi.org/10.1007/s10115-019-01336-3