Abstract

In dangerous diseases such as diabetes, in which blood glucose levels are too high, machine learning models have been used to classify the patient's state. During data collection, errors may occur due to human mistakes, device faults, or noise in the transmission process. Correct treatment of missing data and outliers preserves the dataset size and improves model performance. Research on diabetes also shows that several dangerous complications can appear and need to be classified early. Most of the datasets describing these complications consist of images; their huge size demands a large amount of resources and long training times, which motivates solutions such as transfer learning and distributing the learning process with Apache Spark.

In this thesis, we developed three algorithms to handle missing values and outliers in numerical datasets. The main idea of the first algorithm is to divide the dataset into its classes, or to cluster it using k-means++, then compute the mean (or standard deviation) of each part, and finally replace the missing values and outliers with the mean of the corresponding part. The second algorithm handles missing data within a complete healthcare system for diabetics; in addition, two new algorithms are developed to handle the crucial problem of data missed from MIoT wearable sensors. The proposed work integrates Random Forest, the overall mean, the class mean, the interquartile range (IQR), and Deep Learning to produce a clean, complete dataset that can enhance the performance of any machine learning model. Moreover, an outlier-repair technique is proposed that first detects the class of each record and then repairs it with Deep Learning (DL). From this work, several conclusions were drawn: Fuzzy C-Means clustering is a good choice for dividing the dataset into small parts.
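The imputation step of the first algorithm can be sketched as follows. This is a minimal illustration of class-wise mean imputation; the function name `class_mean_impute` and the toy values are ours, and a hand-labelled class split stands in for the k-means++ clustering described above.

```python
import math

def class_mean_impute(rows, labels):
    """Replace missing (NaN) feature values with the mean of the
    non-missing values of the same feature within the same class."""
    n_features = len(rows[0])
    # Per-class, per-feature mean over the observed (non-NaN) values.
    means = {}
    for label in set(labels):
        class_rows = [r for r, l in zip(rows, labels) if l == label]
        means[label] = []
        for j in range(n_features):
            observed = [r[j] for r in class_rows if not math.isnan(r[j])]
            means[label].append(sum(observed) / len(observed))
    # Fill each NaN with the corresponding class/feature mean.
    return [
        [means[l][j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r, l in zip(rows, labels)
    ]

# Toy dataset: one feature, two classes, one gap per class.
rows = [[1.0], [3.0], [float("nan")], [10.0], [float("nan")]]
labels = [0, 0, 0, 1, 1]
print(class_mean_impute(rows, labels))
# The class-0 gap becomes 2.0 (mean of 1 and 3); the class-1 gap becomes 10.0
```

In the thesis the "parts" are either the true class labels or k-means++ clusters; either way, the fill value is computed only from rows in the same part, which keeps each class's distribution intact.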
When dealing with missing data, outlier values distort the imputation calculations, so the outliers are skipped by using Isolation Forest. The proposed imputation and outlier-handling algorithms are tested on the Pima Indian diabetes dataset, which contains 2768 patients divided into 952 diabetic cases and 1816 controls. After this work is finished, the best missing-data imputation algorithm is combined with the best classification algorithm, and the learned model is deployed as a portable website using the Flask framework. A doctor or a patient can easily use the website by entering the patient's values and pressing the classify button; Flask sends the values to the learned model and displays the output on the page. The full code can be accessed at https://github.com/elhossiny/diabetes-flask

Transfer learning saves time and increases model performance. In some cases a single decision-maker is not sufficient, so ensemble learning is a good architecture for important deep learning applications. In this part, we integrate transfer learning with ensemble learning to produce a new learning model with high performance that can be used in medical applications. The proposed model was evaluated on three large medical datasets, and the results show that it outperforms existing models. Finally, because the dataset used by this recent model is very large, we employ an Apache Spark cluster consisting of varying numbers of nodes to speed up the learning process. In healthcare applications, time is a very critical parameter, and as the results show, distributed learning on Apache Spark saves considerable time. The final model accuracy with the two steps of imputation and outlier repair is 97.41 %, with a 99.71 % Area Under the Curve (AUC). The retina classification model based on enhanced ensemble transfer learning achieves 99.69 % accuracy.
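The IQR-based outlier handling mentioned above can be sketched as follows. This is a minimal single-feature illustration under our own assumptions: the function names and sample values are hypothetical, and replacing an outlier with the mean of the in-range values stands in for the class-aware Deep Learning repair described in the thesis.

```python
def iqr_bounds(values):
    """Tukey fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier."""
    s = sorted(values)

    def quartile(q):
        # Linear interpolation between the closest ranks.
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def repair_outliers(values):
    """Replace out-of-range values with the mean of the in-range ones."""
    lo, hi = iqr_bounds(values)
    inliers = [v for v in values if lo <= v <= hi]
    mean = sum(inliers) / len(inliers)
    return [v if lo <= v <= hi else mean for v in values]

glucose = [95.0, 100.0, 105.0, 110.0, 500.0]  # 500 is an obvious sensor error
print(repair_outliers(glucose))
# → [95.0, 100.0, 105.0, 110.0, 102.5]
```

The same fences are what make "skipping the outliers" matter: computing the fill value only from the inliers prevents a single corrupted reading from dragging the imputation mean.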
In comparison with methods from the literature, the achieved results demonstrate the validity and effectiveness of the proposed work.