Abstract

Nowadays, the rapid growth of large-scale data makes its processing more challenging. To handle this, cluster computing systems are needed that can parallelize computations over large volumes of data across a set of nodes. Among these is Apache Spark, a reliable framework for iterative machine learning processes such as clustering. Spark stores and processes data in-memory over the nodes of a cluster, which makes it fast and fault-tolerant. On the other hand, because of its simplicity, K-means is still considered a significant approach by researchers. The large volume of data, however, increases the number of iterations and thus the overall computational complexity. Further, good initial centroids play an important part in boosting the performance of K-means, especially with large data. This thesis proposes a hybrid approach to handle the above challenges by reducing the iterations of the K-means algorithm, using a cutting-off method for the latest iterations and initializing centers through Scalable K-means++ on the Apache Spark framework. The proposed hybrid approach is called Fast Scalable Spark K-means (FSS.K-means). Two standard datasets are used to conduct our experiments and to compare our work with other implementations of K-means in terms of the number of iterations and the execution time. The first is part of the YouTube Multiview Video Games datasets (vision_misc), and the second is the SUSY dataset; both are available in the UCI Machine Learning Repository. The experimental results show that our proposed hybrid approach speeds up the clustering process by over 46% of the time taken by the standard K-means while maintaining about 96% accuracy.
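The two ideas named in the abstract, careful center initialization and cutting off the last, low-gain iterations, can be illustrated with a minimal single-machine sketch. This is only an assumption-laden illustration, not the thesis's FSS.K-means implementation: it uses classic k-means++ seeding in place of the distributed Scalable K-means++ (k-means||), and a centroid-shift tolerance as a stand-in for the cutting-off method. On Spark itself, one would typically use `pyspark.ml.clustering.KMeans` with `initMode="k-means||"` and the `maxIter`/`tol` parameters.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    # k-means++ seeding: pick the first center uniformly at random,
    # then pick each next center with probability proportional to its
    # squared distance from the nearest already-chosen center.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

def kmeans(points, k, max_iter=100, tol=1e-4, seed=0):
    # Lloyd's algorithm on 2-D points with k-means++ seeding and an
    # early cut-off: stop once no centroid moves more than `tol`
    # (a hypothetical stand-in for the thesis's cutting-off method).
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for it in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: recompute each center as its cluster's mean.
        new_centers = []
        for j, cl in enumerate(clusters):
            if cl:
                new_centers.append((sum(p[0] for p in cl) / len(cl),
                                    sum(p[1] for p in cl) / len(cl)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        shift = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # cut off the remaining low-gain iterations
            break
    return centers, it + 1
```

With well-separated data and good seeding, the algorithm converges in very few iterations, which is exactly the effect that good initial centroids have on the iteration count discussed above.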