Sentiment Analysis of Social Media Data for Airline Brand Reputation Management Using Machine Learning Techniques in Python

Abstract

Social media platforms, particularly Twitter, have become vital channels where airline companies encounter vast volumes of customer feedback daily. This abundance of user-generated content presents significant opportunities for sentiment analysis applications. Previous studies have demonstrated the potential of machine learning and natural language processing approaches to identify meaningful patterns within customer opinions, although several obstacles persist, including the detection of sarcastic content, handling uneven data distributions, and interpreting ambiguous expressions. This research aimed to determine the feasibility of using Twitter conversations about airlines to accurately classify customer sentiment through computational methods. Our analysis utilized a corpus of 14,640 manually annotated tweets targeting six major U.S. airline carriers, with each message categorized into positive, negative, or neutral sentiment classes. Following text preprocessing procedures and feature extraction using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization, we developed and assessed three distinct classification algorithms: Logistic Regression, Random Forest, and XGBoost models. Our experimental results revealed that XGBoost achieved superior classification accuracy compared to the other approaches, although certain misclassification patterns emerged, particularly in distinguishing between neutral and positive sentiment expressions. This study applies machine learning methods to determine the sentiment that customers express in airline-related tweets. It uses the Twitter US Airline Sentiment Dataset, which contains 14,640 rows covering six U.S. carriers. The data are cleaned and preprocessed through tokenization, stopword removal, and lemmatization, and features are engineered using TF-IDF vectorization.
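The preprocessing and TF-IDF steps named in this abstract can be illustrated with a minimal, standard-library-only sketch. The stopword list, the example tweets, the function names `tokenize` and `tfidf`, and the plain tf × log(N/df) weighting are all assumptions made for illustration. The abstract does not state the exact weighting variant or tooling the authors used, and lemmatization is omitted here for brevity.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption; the study's list is not given).
STOPWORDS = {"the", "a", "an", "is", "was", "my", "to", "and", "for", "on", "in"}


def tokenize(text):
    """Lowercase, drop URLs and @mentions, keep alphabetic tokens, remove stopwords."""
    text = re.sub(r"https?://\S+|@\w+", " ", text.lower())
    return [t for t in re.findall(r"[a-z']+", text) if t not in STOPWORDS]


def tfidf(docs):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens) or 1  # guard against empty documents
        vectors.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return vectors


# Made-up example tweets, not taken from the dataset.
tweets = [
    "@airline my flight was delayed again",
    "Thanks @airline for the great flight",
    "@airline lost my bag, terrible service",
]
vectors = tfidf(tweets)
```

In this toy corpus, "flight" appears in two of the three tweets and therefore receives a lower weight than "delayed", which appears in only one. That is the behavior TF-IDF is meant to provide: terms that are common across documents are down-weighted relative to distinctive ones.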
Categorical features, namely the airline and complaint-reason labels, are converted into formats suitable for computational modeling. Combining data preprocessing, feature engineering, and model learning improved automated sentiment classification. The findings make it possible to track customer feedback, improve service quality, and support decision-making in the airline industry. Sentiment is classified as negative, neutral, or positive and predicted using three machine learning algorithms: Logistic Regression, Random Forest, and XGBoost. The dataset is partitioned into training and testing subsets using a 70:30 split that preserves class proportions. Model performance is evaluated using accuracy, precision, recall, F1-score, confusion matrices, and ROC curves. The results showed that XGBoost outperformed the other models, with gradient boosting handling text-based patterns well, and they underline the importance of data preprocessing. Topic modeling of negative tweets revealed the main reasons for dissatisfaction, such as delays and customer service problems, offering insights for airline management. Misclassified tweets frequently involved sarcasm, mixed sentiment expressions, and informal language.
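The 70:30 split that keeps class proportions, and the per-class precision, recall, and F1 measures mentioned above, can be sketched in plain Python as follows. The helper names `stratified_split` and `per_class_metrics`, the random seed, and the toy labels and predictions are hypothetical and purely illustrative. They do not reproduce the study's data, code, or results.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx), sampling the test share within each class."""
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


def per_class_metrics(y_true, y_pred):
    """Compute precision, recall, and F1 for every class from true and predicted labels."""
    pairs = list(zip(y_true, y_pred))
    metrics = {}
    for cls in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == cls and p == cls for t, p in pairs)
        fp = sum(t != cls and p == cls for t, p in pairs)
        fn = sum(t == cls and p != cls for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        metrics[cls] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics


# Toy, imbalanced label set (made up for illustration).
labels = ["negative"] * 60 + ["neutral"] * 20 + ["positive"] * 20
train_idx, test_idx = stratified_split(labels)

# Toy predictions showing a neutral/positive confusion.
y_true = ["negative", "negative", "neutral", "positive", "positive"]
y_pred = ["negative", "negative", "positive", "positive", "neutral"]
scores = per_class_metrics(y_true, y_pred)
```

Splitting within each class, rather than over the whole dataset at once, keeps minority classes represented in the test set at about the same rate as in the training set. This matters when one sentiment class dominates. Reporting metrics per class makes confusions between two classes visible where a single accuracy figure would hide them.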
