Abstract

In dangerous diseases such as diabetes, in which blood glucose levels are too high, machine learning models have been used to classify the patient's state. During data collection, errors may occur due to human mistakes, device faults, or noise in the transmission process. Correct treatment of missing data and outliers preserves the dataset size and improves model performance. Research on diabetes also shows that several dangerous complications can appear and need to be classified early. Most of the datasets describing these complications consist of images; their huge size demands a large amount of resources and long training times, which motivates solutions such as transfer learning and distributing the learning process with Apache Spark.

In this thesis, we developed three algorithms to handle missing values and outliers in numerical datasets. The main idea of the first algorithm is to divide the dataset into its classes, or to cluster it using k-means++, then compute the mean (or standard deviation) of each part, and finally replace the missing values and outliers with the mean of the corresponding part. The second algorithm handles missing data within a complete healthcare system for diabetics; in addition, two new algorithms are developed to handle the crucial problem of data missed from MIoT wearable sensors. The proposed work integrates Random Forest, the overall mean, the class mean, the interquartile range (IQR), and Deep Learning to produce a clean, complete dataset that can enhance the performance of any machine learning model. Moreover, an outlier-repair technique is proposed that first detects the class of each record and then repairs it with Deep Learning (DL). From this work, several conclusions were drawn: Fuzzy C-Means clustering is a good choice for dividing the dataset into small parts.
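The imputation step of the first algorithm can be sketched as follows. This is a minimal illustration of class-wise mean imputation; the function name `class_mean_impute` and the toy values are ours, and a hand-labelled class split stands in for the k-means++ clustering described above.

```python
import math

def class_mean_impute(rows, labels):
    """Replace missing (NaN) feature values with the mean of the
    non-missing values of the same feature within the same class."""
    n_features = len(rows[0])
    # Per-class, per-feature mean over the observed (non-NaN) values.
    means = {}
    for label in set(labels):
        class_rows = [r for r, l in zip(rows, labels) if l == label]
        means[label] = []
        for j in range(n_features):
            observed = [r[j] for r in class_rows if not math.isnan(r[j])]
            means[label].append(sum(observed) / len(observed))
    # Fill each NaN with the corresponding class/feature mean.
    return [
        [means[l][j] if math.isnan(v) else v for j, v in enumerate(r)]
        for r, l in zip(rows, labels)
    ]

# Toy dataset: one feature, two classes, one gap per class.
rows = [[1.0], [3.0], [float("nan")], [10.0], [float("nan")]]
labels = [0, 0, 0, 1, 1]
print(class_mean_impute(rows, labels))
# The class-0 gap becomes 2.0 (mean of 1 and 3); the class-1 gap becomes 10.0
```

In the thesis the "parts" are either the true class labels or k-means++ clusters; either way, the fill value is computed only from rows in the same part, which keeps each class's distribution intact.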
When dealing with missing data, outlier values distort the imputation calculations, so the outliers are skipped by using Isolation Forest. The proposed imputation and outlier-handling algorithms are tested on the Pima Indian diabetes dataset, which contains 2768 patients divided into 952 diabetic cases and 1816 controls. After this work is finished, the best missing-data imputation algorithm is combined with the best classification algorithm, and the learned model is deployed as a portable website using the Flask framework. A doctor or a patient can easily use the website by entering the patient's values and pressing the classify button; Flask sends the values to the learned model and displays the output on the page. The full code can be accessed at https://github.com/elhossiny/diabetes-flask

Transfer learning saves time and increases model performance. In some cases a single decision-maker is not sufficient, so ensemble learning is a good architecture for important deep learning applications. In this part, we integrate transfer learning with ensemble learning to produce a new learning model with high performance that can be used in medical applications. The proposed model was evaluated on three large medical datasets, and the results show that it outperforms existing models. Finally, because the dataset used by this recent model is very large, we employ an Apache Spark cluster consisting of varying numbers of nodes to speed up the learning process. In healthcare applications, time is a very critical parameter, and as the results show, distributed learning on Apache Spark saves considerable time. The final model accuracy with the two steps of imputation and outlier repair is 97.41 %, with a 99.71 % Area Under the Curve (AUC). The retina classification model based on enhanced ensemble transfer learning achieves 99.69 % accuracy.
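The IQR-based outlier handling mentioned above can be sketched as follows. This is a minimal single-feature illustration under our own assumptions: the function names and sample values are hypothetical, and replacing an outlier with the mean of the in-range values stands in for the class-aware Deep Learning repair described in the thesis.

```python
def iqr_bounds(values):
    """Tukey fences: anything outside [Q1 - 1.5*IQR, Q3 + 1.5*IQR] is an outlier."""
    s = sorted(values)

    def quartile(q):
        # Linear interpolation between the closest ranks.
        pos = q * (len(s) - 1)
        lo, hi = int(pos), min(int(pos) + 1, len(s) - 1)
        return s[lo] + (s[hi] - s[lo]) * (pos - lo)

    q1, q3 = quartile(0.25), quartile(0.75)
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def repair_outliers(values):
    """Replace out-of-range values with the mean of the in-range ones."""
    lo, hi = iqr_bounds(values)
    inliers = [v for v in values if lo <= v <= hi]
    mean = sum(inliers) / len(inliers)
    return [v if lo <= v <= hi else mean for v in values]

glucose = [95.0, 100.0, 105.0, 110.0, 500.0]  # 500 is an obvious sensor error
print(repair_outliers(glucose))
# → [95.0, 100.0, 105.0, 110.0, 102.5]
```

The same fences are what make "skipping the outliers" matter: computing the fill value only from the inliers prevents a single corrupted reading from dragging the imputation mean.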
In comparison with methods from the literature, the achieved results demonstrate the validity and effectiveness of the proposed work.