Abstract

Nowadays, the rapid growth of large-scale data makes its processing more challenging. To handle this, cluster computing systems are needed that can parallelize computations over large volumes of data across a set of nodes. Among these is Apache Spark, a reliable framework for iterative machine learning processes such as clustering. Spark stores and processes data in-memory over the nodes of a cluster, which makes it fast and fault-tolerant. On the other hand, because of its simplicity, K-means is still considered a significant approach by researchers. The large volume of data, however, increases the number of iterations and thus the overall computational complexity. Further, good initial centroids play an important part in boosting the performance of K-means, especially with large data. This thesis proposes a hybrid approach to handle the above challenges by reducing the iterations of the K-means algorithm, using a cutting-off method for the latest iterations and initializing centers through Scalable K-means++ on the Apache Spark framework. The proposed hybrid approach is called Fast Scalable Spark K-means (FSS.K-means). Two standard datasets are used to conduct our experiments and to compare our work with other implementations of K-means in terms of the number of iterations and the execution time. The first is part of the YouTube Multiview Video Games datasets (vision_misc), and the second is the SUSY dataset; both are available in the UCI Machine Learning Repository. The experimental results show that our proposed hybrid approach speeds up the clustering process by over 46% of the time taken by the standard K-means while maintaining about 96% accuracy.
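The two ideas named in the abstract, careful center initialization and cutting off the last, low-gain iterations, can be illustrated with a minimal single-machine sketch. This is only an assumption-laden illustration, not the thesis's FSS.K-means implementation: it uses classic k-means++ seeding in place of the distributed Scalable K-means++ (k-means||), and a centroid-shift tolerance as a stand-in for the cutting-off method. On Spark itself, one would typically use `pyspark.ml.clustering.KMeans` with `initMode="k-means||"` and the `maxIter`/`tol` parameters.

```python
import math
import random

def kmeans_pp_init(points, k, rng):
    # k-means++ seeding: pick the first center uniformly at random,
    # then pick each next center with probability proportional to its
    # squared distance from the nearest already-chosen center.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centers)
              for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

def kmeans(points, k, max_iter=100, tol=1e-4, seed=0):
    # Lloyd's algorithm on 2-D points with k-means++ seeding and an
    # early cut-off: stop once no centroid moves more than `tol`
    # (a hypothetical stand-in for the thesis's cutting-off method).
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for it in range(max_iter):
        # Assignment step: attach each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda i: (p[0] - centers[i][0]) ** 2
                                  + (p[1] - centers[i][1]) ** 2)
            clusters[j].append(p)
        # Update step: recompute each center as its cluster's mean.
        new_centers = []
        for j, cl in enumerate(clusters):
            if cl:
                new_centers.append((sum(p[0] for p in cl) / len(cl),
                                    sum(p[1] for p in cl) / len(cl)))
            else:
                new_centers.append(centers[j])  # keep an empty cluster's center
        shift = max(math.dist(c, n) for c, n in zip(centers, new_centers))
        centers = new_centers
        if shift < tol:  # cut off the remaining low-gain iterations
            break
    return centers, it + 1
```

With well-separated data and good seeding, the algorithm converges in very few iterations, which is exactly the effect that good initial centroids have on the iteration count discussed above.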