Abstract
With the evolution of social media platforms, the Internet has become a common source of news about current events. Twitter in particular has become one of the most popular platforms on which public users share news. The platform is growing rapidly, especially among young people, who may be influenced by information from anonymous sources. Predicting the credibility of news on Twitter has therefore become a necessity, especially during emergencies. In this thesis, we propose four models for credibility prediction and test them on two datasets in two languages (English and Arabic). The first model relies on an extensive set of content-related and source-related features. Five classifiers are trained on different feature sets to determine whether content features alone or source features alone are good indicators of credibility. The best performance is achieved with a combined set of content and source features and Random Forests as the classifier. The second model focuses on textual features and uses word-based N-gram analysis. The experiments examine two feature representations (TF and TF-IDF) and different word N-gram ranges. The best results are achieved with a combination of unigrams and bigrams, 30,000 extracted TF-IDF features, and Linear Support Vector Machines as the classifier. The third model relies on semantic features extracted with the Skip-Thoughts algorithm. Finally, a hybrid model that concatenates the feature vectors of the previous three models is proposed; it shows significant improvements over each of them. The proposed models are evaluated using five machine learning classifiers and 10-fold cross-validation. The best results are achieved using Linear Support Vector Machines, with 85.3% accuracy, 89.2% precision, 91.6% recall, and 90.4% F-measure.
Moreover, the evaluation shows that the proposed hybrid model outperforms two existing models from the literature on the same dataset.
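Two of the quantities mentioned in the abstract can be illustrated with a short, dependency-free Python sketch. It is not the thesis implementation: the helper names `word_ngrams` and `f_measure`, the whitespace tokenization, and the sample tweet are assumptions made for illustration. The first helper lists the unigram and bigram terms that a word N-gram model would count before TF or TF-IDF weighting; the second computes the F-measure as the harmonic mean of precision and recall, which is consistent with the reported 89.2% precision, 91.6% recall, and 90.4% F-measure.

```python
from typing import List


def word_ngrams(tokens: List[str], min_n: int = 1, max_n: int = 2) -> List[str]:
    """Return all word n-grams with min_n <= n <= max_n, shortest first.

    Hypothetical helper: the thesis does not specify its tokenizer or
    n-gram extraction code, so this only illustrates the idea.
    """
    grams = []
    for n in range(min_n, max_n + 1):
        # Slide a window of n words across the token list.
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (the F1 score)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up example text, split on whitespace (an assumption).
    tweet = "breaking news from cairo".split()
    print(word_ngrams(tweet, 1, 2))

    # Precision and recall as reported in the abstract.
    print(round(100 * f_measure(0.892, 0.916), 1))
```

In a full pipeline, the resulting term counts would be weighted (TF or TF-IDF), truncated to a fixed vocabulary size, and passed to a classifier; those steps are omitted here because the abstract does not give their exact configuration beyond the feature count and classifier name.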