Using Machine Learning-Based Approaches for the Detection and Classification of Human Papillomavirus Vaccine Misinformation: Infodemiology Study of Reddit Discussions

The University of Texas Health Science Center at Houston (Du, Preston, Sun, Shegog, Savas, Amith, Tao); Texas Children's Hospital (Cunningham, Boom); Baylor College of Medicine (Boom)
"The accurate and timely understanding of vaccine misinformation on social media can assist vaccine promotion campaigns to prevent such information from misleading the vulnerable public."
The rapid growth of social media has enabled inaccurate or false vaccine information to spread quickly, creating obstacles for vaccine promotion. Mitigating medical and public health misinformation on social media is important; however, the sheer volume of posts makes efficient, accurate identification challenging. This study develops and evaluates an automated protocol for identifying and classifying human papillomavirus (HPV) vaccine misinformation on social media using machine learning (ML)-based methods. ML uses algorithms and statistical models that learn to perform tasks from data rather than from explicitly programmed rules; deep learning (DL) is a subset of ML algorithms based on deep neural networks.
This study focuses on the social media platform Reddit, whose users are primarily under the age of 35 years (as are HPV vaccine recipients). The researchers compiled Reddit posts from 2007 to 2017 that contained keywords related to HPV vaccination and manually labeled a random subset (2,200/28,121, 7.82%) for misinformation. The purpose of this step was to build a gold standard corpus (i.e., Reddit posts with their expert-assigned labels) for training and evaluating the automated ML algorithms. The researchers then compared how well 5 ML-based algorithms identified vaccine misinformation: 3 conventional algorithms (a support vector machine, logistic regression [LR], and extremely randomised trees) and 2 DL algorithms (a convolutional neural network [CNN] and a recurrent neural network [RNN]). Topic modeling was applied to identify the major categories associated with HPV vaccine misinformation.
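The keyword filtering and random sampling described above can be illustrated with a minimal Python sketch. It is not the study's code: the keyword list, the function names, and the example posts are all hypothetical, and the summary does not say which search terms or sampling procedure the researchers actually used.

```python
import random

# Hypothetical keyword list; the study's actual search terms are not given here.
HPV_KEYWORDS = ("hpv vaccine", "gardasil", "cervarix")


def mentions_hpv_vaccine(text, keywords=HPV_KEYWORDS):
    """Return True if the post text contains any keyword (case-insensitive)."""
    lowered = text.lower()
    return any(keyword in lowered for keyword in keywords)


def sample_for_annotation(posts, sample_size, seed=0):
    """Draw a reproducible random subset of posts for manual labeling."""
    rng = random.Random(seed)  # fixed seed so the same subset can be re-drawn
    return rng.sample(posts, min(sample_size, len(posts)))


# Invented example posts, for illustration only.
posts = [
    "Got my second Gardasil shot today",
    "Best pizza in Houston?",
    "Is the HPV vaccine safe for teens?",
    "Cervarix vs Gardasil, any difference?",
]

matching = [post for post in posts if mentions_hpv_vaccine(post)]
to_label = sample_for_annotation(matching, sample_size=2)
```

In this sketch, `to_label` plays the role of the subset that human annotators would label to form the gold standard corpus.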
HPV-vaccine-related discussions increased over the study period, in both the number of posts and the number of unique users. Overall, 6 major topics related to HPV vaccine misinformation were identified, including vaccine death and serious reactions and aluminum-containing adjuvants. The largest share of the misinformation content identified on Reddit concerned general vaccine adverse effects (2,672/7,207, 37.07%), followed by vaccine conspiracy theories (1,072/7,207, 14.87%).
The results of the study's network analysis of the 6 identified vaccine misinformation topics demonstrate the strength of the connectedness of each topic. "Although general concerns about the safety of the vaccine emerged as the main source of hesitancy regarding HPV vaccination, the network analysis indicates that the other prominent topics identified, such as the presence of conspiracy theories, may also be rooted in fears about the side effects of the vaccine. Mere exposure to beliefs that the government and pharmaceutical companies gain or profit from mass vaccination through deception or at the consumers’ expense, has strong negative effects on attitudes about the safety and effectiveness of vaccines, consequently affecting choices about whether to vaccinate."
The researchers note that the Reddit posts identified in this study did not seem to be connected to any organised movements; rather, they came from individual users advocating their personal views. "A potential method to combat these misinformed messages once identified is to counter them with an organized campaign, composed of factual, evidence-based messages, that does not acknowledge disinformation."
Among the 3 traditional ML algorithms used to identify vaccine misinformation in the Reddit posts, the LR algorithm performed best. However, both DL algorithms (CNN and RNN) outperformed the traditional ML algorithms - with the CNN model doing slightly better in the identification of misinformation.
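A comparison of this kind rests on scoring each model's predictions against the gold standard labels. The sketch below shows one common way to do that, using precision, recall, and F1 for the misinformation class. The choice of metric is an assumption, since this summary does not state which measures the study reported, and the labels, predictions, and model names are invented for illustration.

```python
def precision_recall_f1(gold, predicted, positive=1):
    """Compute precision, recall, and F1 for the positive (misinformation) class."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denominator = precision + recall
    f1 = 2 * precision * recall / denominator if denominator else 0.0
    return precision, recall, f1


# Invented gold labels (1 = misinformation) and predictions from two
# hypothetical models; these are not results from the study.
gold = [1, 0, 1, 1, 0, 0, 1, 0]
predictions = {
    "model_a": [1, 0, 0, 1, 0, 1, 1, 0],
    "model_b": [1, 0, 1, 1, 0, 1, 1, 0],
}

scores = {name: precision_recall_f1(gold, pred) for name, pred in predictions.items()}
best = max(scores, key=lambda name: scores[name][2])  # rank models by F1
```

Ranking by F1 is only one option; a monitoring system that cannot afford to miss misinformation might weight recall more heavily.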
In conclusion: "ML-based approaches are effective in the identification and classification of HPV vaccine misinformation on Reddit and may be generalizable to other social media platforms. ML-based methods may provide the capacity and utility to meet the challenge involved in intelligent automated monitoring and classification of public health misinformation on social media platforms."
Journal of Medical Internet Research 2021 (Aug 05); 23(8):e26478. Image credit: Jorge Franganillo via Flickr - (CC BY 2.0)











































