Author: Ayyad, Sarah Mohammed./ Title: A new strategy to enhance knowledge discovery from big data using data mining techniques /

Title

A new strategy to enhance knowledge discovery from big data using data mining techniques /

Author

Ayyad, Sarah Mohammed.

Preparation committee

Researcher / Sarah Mohammed Ahmed Ayyad

Supervisor / Ahmed Ibrahim Saleh

Supervisor / Labib Mohammed Labib

Subject

Gene Expression Microarray. Cancer Classification. Feature selection. Data Mining.

Publication date

2018.

Number of pages

online resource (86 pages)

Language

English

Degree

Master's

Specialization

Systems and Control Engineering

Date of approval

1/1/2018

Place of approval

Mansoura University - Faculty of Engineering - Automatic Control

Index

Only 14 pages are available for public view

Abstract

In an era of increasing volume, velocity, variety, and complexity of datasets, and with the advent of big data, classical machine learning techniques are generally not well suited to delivering precise decision systems. Analogous to big data, the term high dimensionality has been coined to denote a number of features so large that existing machine learning techniques become insufficient. This big data scenario presents both opportunities and challenges to machine learning researchers. Gene expression microarray classification is a crucial research field because it is employed in cancer prediction and diagnosis systems, and cancer is noted as the most common invasive disease. Gene expression data are high in dimensionality (i.e., hundreds or thousands of features), so accurate and effective classification of such samples is challenging. Machine learning techniques have been widely used to build robust and precise classification models for gene expression data.

In this thesis, a new classification strategy is proposed that employs data mining techniques. The strategy consists of four stages: (I) gene expression dataset, (II) data preprocessing, (III) gene (feature) selection, and (IV) sample classification. It introduces two new contributions, one in the feature selection stage and one in the classification stage: Distributed Feature Selection (DFS) and Modified K-Nearest Neighbors (MKNN). DFS detects the genes most likely to be cancer-related in a distributed manner, which helps classify the samples effectively. Initially, the large set of features under consideration is subdivided and distributed among several processors. Then, a new filter selection method based on a fuzzy inference system is applied to each subset of the dataset. Finally, all the resulting features are ranked, and a wrapper-based selection method is applied. MKNN, in turn, is a new classification technique for gene expression data based on KNN, proposed in two variants: Smallest Modified K-Nearest Neighbors (SMKNN) and Largest Modified K-Nearest Neighbors (LMKNN). The modification is intended to enhance the performance of KNN; the key idea is to employ robust neighbors in the training data by using a weighting strategy.

Experimental results have shown the effectiveness of the proposed feature selection and classification techniques. Different experiments were performed to compare the performance of the new classification technique with and without feature selection. The results showed the significant role of feature selection in gene expression classification: with feature selection, classification performance can be significantly boosted while using only a small number of features.
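
The general shape of such a pipeline (partition the features, score each partition with a filter, merge and rank the scores, then classify with a weighted nearest-neighbor vote) might be sketched roughly as below. This is only an illustration under loud assumptions, not the thesis's method: the abstract does not give the fuzzy inference filter, the wrapper stage, or the SMKNN/LMKNN weighting rules, so the sketch substitutes a simple class-mean-gap score for the filter, omits the wrapper stage, and uses plain inverse-distance weighting for the vote. All function names, the binary 0/1 labels, and the toy data are hypothetical.

```python
import math
from statistics import mean, pstdev


def split_features(n_features, n_parts):
    """Partition feature indices into at most n_parts contiguous chunks."""
    size = math.ceil(n_features / n_parts)
    return [list(range(i, min(i + size, n_features)))
            for i in range(0, n_features, size)]


def filter_score(values, labels):
    """Placeholder filter: gap between the two class means over the overall spread.

    Assumption: this stands in for the fuzzy-inference filter, which the
    abstract does not specify. Labels are assumed to be 0 or 1.
    """
    a = [v for v, y in zip(values, labels) if y == 0]
    b = [v for v, y in zip(values, labels) if y == 1]
    spread = pstdev(values) or 1.0  # avoid dividing by zero for constant features
    return abs(mean(a) - mean(b)) / spread


def select_features(X, y, n_parts, top_k):
    """Score each chunk independently, then merge, rank, and keep the top_k indices."""
    scored = []
    for chunk in split_features(len(X[0]), n_parts):
        # Each chunk is independent, so it could be handed to a separate worker.
        for j in chunk:
            scored.append((filter_score([row[j] for row in X], y), j))
    scored.sort(reverse=True)
    return sorted(j for _, j in scored[:top_k])


def weighted_knn(X_train, y_train, x, k, features):
    """Inverse-distance-weighted k-NN vote using only the selected features.

    Assumption: generic distance weighting, not the SMKNN/LMKNN rules.
    """
    dists = []
    for row, label in zip(X_train, y_train):
        d = math.sqrt(sum((row[j] - x[j]) ** 2 for j in features))
        dists.append((d, label))
    dists.sort(key=lambda t: t[0])
    votes = {}
    for d, label in dists[:k]:
        votes[label] = votes.get(label, 0.0) + 1.0 / (d + 1e-9)
    return max(votes, key=votes.get)


# Toy data: 4 samples, 4 "genes"; gene 2 separates the classes most clearly.
X = [[1.0, 5.0, 0.1, 3.0],
     [1.1, 4.0, 0.2, 3.1],
     [0.9, 5.1, 0.9, 2.9],
     [1.0, 4.1, 1.0, 3.0]]
y = [0, 0, 1, 1]

selected = select_features(X, y, n_parts=2, top_k=1)
prediction = weighted_knn(X, y, [1.0, 4.5, 0.95, 3.0], k=3, features=selected)
```

In this toy run the filter keeps only the gene whose values differ most between the two classes, and the query sample is then classified from that single feature, which mirrors the abstract's point that a small number of well-chosen features can carry the classification.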