
The Usefulness of Social Media for Understanding Adverse Drug Reactions

The rapid expansion and ubiquity of social media platforms have generated vast amounts of user-generated content, offering unprecedented opportunities for the exploration of various research questions. In the realm of pharmacovigilance, investigating the degree to which adverse medication reactions are shared on social media beyond official reporting channels is of particular interest. This study aims to identify both similarities and differences in reported adverse drug reactions (ADRs) between social media and the FDA's Adverse Event Reporting System (FAERS), to determine whether observed differences are meaningful for healthcare providers, pharmaceutical companies, and public health researchers, and to ascertain the reliability of social media as a data source for pharmacovigilance.

To accomplish this, we will conduct a comparative analysis of ADR reporting rates for the most prescribed medications in the United States, comparing data extracted from social media platforms with data from FAERS. Our measurable objectives include collecting ADR data from both sources, developing ADR extraction methods for social media, comparing social media ADR rates to FAERS baselines, and identifying similarities as well as significant differences between them. Due to practical constraints, our analysis is limited to a small set of medications and must account for Reddit-specific characteristics, such as the platform's younger demographic and longer comments, as well as differences in data collection methods and potential social media biases.

This website presents our results for an audience of pharmaceutical, bioinformatics, and public health researchers, using visualizations, dashboards, and statistics to narrate the social media data mining process and its uses for pharmacovigilance.

Data Sources

The initial data sources for this study are the FDA's Adverse Event Reporting System (FAERS) and Reddit, as they provide rich and diverse perspectives on adverse drug reactions (ADRs). FAERS is the standard source of information on ADRs reported to the FDA, combining voluntary reports from healthcare providers and patients with mandatory reports from pharmaceutical companies. Reddit, on the other hand, serves as a platform where communities can discuss various diseases and their treatments, offering detailed reports of side effects and potentially unreported ADRs. Utilizing the respective APIs for each data source allows for targeted searches of specific medications.

To efficiently extract Reddit data, we employed the Python Reddit API Wrapper (PRAW) and PushshiftAPI for targeted searches of comments mentioning specific medications. Challenges in Reddit data extraction include the length and grammar of comments, as well as the lack of context, which may hinder accurate interpretation and understanding of the full scope of ADRs. To address these challenges and ensure secure, scalable data storage and processing, we implemented a data management strategy using Jupyter and Google Cloud.
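
For readers who want to reproduce the extraction step, a minimal sketch of the PRAW + Pushshift query is shown below, assuming the psaw PushshiftAPI wrapper; the credentials, helper function, and limits are placeholders rather than our exact configuration:

```python
# Sketch only: pull Reddit comments that mention a medication.
import praw
from psaw import PushshiftAPI

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",        # placeholder credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="adr-study/0.1",
)
api = PushshiftAPI(reddit)  # resolve Pushshift hits through PRAW for full comment data

def fetch_comments(medication, limit=1000):
    """Return (id, body, created_utc) tuples for comments mentioning a medication."""
    results = api.search_comments(q=medication, limit=limit)
    return [(c.id, c.body, c.created_utc) for c in results]

lipitor_comments = fetch_comments("lipitor", limit=500)
```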

Exploratory Data Analysis

Sentiment analysis is a valuable tool for identifying the underlying tone and sentiment of user-generated content, which we thought could enable us to uncover potential ADRs that may not be evident through traditional data mining techniques. After using NLTK's SentimentIntensityAnalyzer on Reddit comment content, we were able to determine polarity scores and assign sentiment labels. Following this, we created visualizations using Seaborn count plots and box plots (pictured below are results for Lipitor). We then attempted basic topic modeling of negative Reddit comments using Latent Dirichlet Allocation (LDA). We were not able to gain many useful insights from this data that would answer our question about which specific ADRs were more common. However, we were still curious about different ways that sentiment analysis could be done on our data.

Sentiment Analysis Distribution
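
A minimal sketch of this sentiment-labeling step, using VADER's conventional ±0.05 compound-score thresholds; the example comments and column names are illustrative:

```python
# Sketch only: score comments with VADER and map compound scores to labels.
import nltk
import pandas as pd
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def label_sentiment(text, pos=0.05, neg=-0.05):
    """Map VADER's compound polarity score to a coarse sentiment label."""
    score = sia.polarity_scores(text)["compound"]
    if score >= pos:
        return "positive"
    if score <= neg:
        return "negative"
    return "neutral"

df = pd.DataFrame({"body": ["This drug helped me a lot", "The side effects were awful"]})
df["sentiment"] = df["body"].apply(label_sentiment)
```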

To further explore the data, we used the Wordcloud library to visualize the most frequently used words per medication, coloring each word based on sentiment analysis performed with the TextBlob library (pictured right are the results for Lexapro). These visualizations only produced generic, unsurprising results: words such as “better” and “good” were positive, while words like “bad” and “less” were negative, so we knew we needed to dig deeper than these surface-level interpretations.

Wordcloud
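
A minimal sketch of how a word cloud can be recolored by TextBlob polarity; the polarity thresholds, colors, and sample text are illustrative rather than our exact settings:

```python
# Sketch only: generate a word cloud and recolor words by TextBlob polarity.
from textblob import TextBlob
from wordcloud import WordCloud

comments_text = "sleep better now bad headaches feel good overall less anxious"

def sentiment_color(word, **kwargs):
    """Color words green (positive), red (negative), or grey (neutral)."""
    polarity = TextBlob(word).sentiment.polarity
    if polarity > 0.1:
        return "green"
    if polarity < -0.1:
        return "red"
    return "grey"

wc = WordCloud(width=800, height=400, background_color="white").generate(comments_text)
wc.recolor(color_func=sentiment_color)
wc.to_file("lexapro_wordcloud.png")
```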

Subsequently, we compared word frequencies across different sentiments, as illustrated in the adjacent image, by calculating the frequency disparities between positive and negative Reddit comments. This analysis revealed emerging trends in the data, such as expressions like "feel like I'm much better" and potential adverse drug reactions (ADRs), including "zaps" or "panic attacks." However, upon further examination, we recognized the inherent ambiguity in determining whether mentions of "panic attacks" indicate the medication caused or alleviated the condition or even if the term is used in the context of an ADR. This complexity necessitated exploring more sophisticated approaches to analyze the data effectively.

Word Frequency Comparison
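
A minimal sketch of the frequency-disparity calculation; tokenization is simplified and the comment lists are illustrative stand-ins for the sentiment-labeled groups:

```python
# Sketch only: compare word frequencies between positive and negative comments.
from collections import Counter

positive_comments = ["feel like I'm much better since starting it"]
negative_comments = ["the zaps and panic attacks have been awful"]

def token_counts(comments):
    tokens = []
    for text in comments:
        tokens.extend(t.lower().strip(".,!?") for t in text.split())
    return Counter(tokens)

pos_counts = token_counts(positive_comments)
neg_counts = token_counts(negative_comments)

# A positive disparity means the word appears more often in negative comments.
disparity = {w: neg_counts[w] - pos_counts[w] for w in set(pos_counts) | set(neg_counts)}
top_negative_leaning = sorted(disparity.items(), key=lambda kv: kv[1], reverse=True)[:20]
```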

Pivot to Deep Learning

Recognizing the complexity of our problem and with the guidance of our mentor, we decided to explore neural networks and deep learning methods. This required a transformation in our approach, which involved narrowing our scope from multiple medications to a single medication for building a robust model. Subsequently, we aimed to develop a pipeline that could be applied to other drugs.

spaCy NLP Library and Compatibility with Prodigy

The spaCy library is a powerful and widely-used natural language processing (NLP) library designed for efficient and high-performance text processing tasks. It offers pre-trained models that are developed on large and diverse datasets, ensuring versatility and applicability across various domains. One of the key strengths of spaCy is its fine-tunability, which enables us to achieve higher accuracy and better performance by adapting the pre-trained models to specific tasks or domains.

Prodigy, an annotation tool developed by the creators of spaCy, is designed to streamline and optimize the annotation process. Prodigy offers the ability to suggest new annotations based on patterns defined by the user, significantly reducing the time and effort required for manual annotation. Furthermore, the tool supports multiple users, allowing us to asynchronously annotate documents and collaborate more effectively.

Another advantage of Prodigy is its ability to be cloud hosted, in our case using Google Cloud Platform (GCP), which enabled us to scale our model training and deployment to handle large volumes of data. This integration ensures that our NLP pipeline can be easily adapted to meet the demands of various projects and workloads.

Inter-rater Reliability

We redefined our key performance indicators to include checks of the inter-rater reliability (IRR) between annotator pairs. IRR is essential for maintaining consistency and accuracy in annotation tasks, minimizing subjective bias and variability among annotators, and ultimately enhancing the quality of training data and overall model performance. Cohen's kappa, a statistical measure that quantifies the level of agreement between two annotators while accounting for chance agreement, served as the key metric. Kappa values range from -1 to 1, with 1 indicating perfect agreement and values above 0.6 generally being considered acceptable.

Inter-rater Reliability Scale
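
For reference, Cohen's kappa is defined as κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. A minimal sketch of the check using scikit-learn; the label vectors are illustrative:

```python
# Sketch only: compute Cohen's kappa for a pair of annotators.
from sklearn.metrics import cohen_kappa_score

# Per-comment labels from two annotators, e.g. the number of ADR spans found.
annotator_a = [1, 0, 1, 2, 0, 1, 1]
annotator_b = [1, 0, 0, 2, 0, 1, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")
```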

Prodigy Annotation Styles

Named Entity Recognition (NER)

  • Proper nouns and self-contained expressions like person names or products
  • Single-token-based tags; better with clear token boundaries

Span Categorization (SpanCat)

  • Multi-token-based tags; better with complex or overlapping entities
  • Can be used to annotate relations between entities

Text Classification (TextCat)

  • Binary or multi-class classification of text
  • Can be used to annotate text with a single or multiple labels

NER Example
SpanCat Example
TextCat Example

Choosing a Medication

We opted for Ocrevus (ocrelizumab) as the initial drug of our study, owing to its recent FDA approval as a "first-in-class" multiple sclerosis (MS) medication. We also decided to utilize the Medical Dictionary for Regulatory Activities (MedDRA) terminology of side effects to serve as patterns for the Prodigy tool, with the aim of enhancing our annotation process.
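
A minimal sketch of how side-effect terms can be turned into a Prodigy-compatible patterns file; the term list, file names, and recipe invocation are illustrative:

```python
# Sketch only: write MedDRA-style side-effect terms as Prodigy match patterns.
import json

meddra_terms = ["fatigue", "panic attack", "injection site reaction"]  # illustrative subset

with open("adr_patterns.jsonl", "w") as f:
    for term in meddra_terms:
        # One spaCy token-match dict per lowercased word in the term.
        pattern = [{"lower": tok} for tok in term.lower().split()]
        f.write(json.dumps({"label": "ADR", "pattern": pattern}) + "\n")

# The patterns file can then be passed to a Prodigy recipe, for example:
#   prodigy ner.manual adr_dataset blank:en comments.jsonl \
#       --label DRUG,ADR --patterns adr_patterns.jsonl
```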

Initial Annotation Attempts

The named entity recognition (NER) approach to annotation involved labeling drugs as well as their associated adverse side effects within Reddit posts. Our first attempt was unsuccessful due to ambiguous annotation rules. Consequently, we transitioned to the span categorization approach, hypothesizing that our initial challenges could be attributed to the NER approach.

The span categorization (SpanCat) approach to annotation involved labeling drugs, their associated adverse side effects, and the ADR phrases directly connecting a drug to its undesired side effect within Reddit posts. For annotation, we worked in pairs: Aidan with Jackson, and Zach with Taylor. Cohen's kappa was computed for each pair to check whether the annotators agreed sufficiently well on the number of ADRs in each Reddit post. The results were as follows:

  • Zach and Taylor's Cohen's kappa: 0.46
  • Aidan and Jackson's Cohen's kappa: 0.34

These kappas fell short of the acceptability threshold mentioned previously, but we made modifications (explained in subsequent sections) that greatly improved the IRR results.

Annotation Updates and Model Testing

During our testing phase, we realized that Ocrevus, despite being popular on Reddit, was not an ideal medication for our study due to its relatively low number of adverse drug reactions (ADRs). To provide the model with sufficient ADR data for training, we switched to studying Humira, a tumor necrosis factor (TNF) blocker known for reducing inflammation, having observed while annotating Ocrevus comments that many individuals mentioned and complained about it.

Annotation Updates

To expedite the annotation process, we discussed annotation guidelines and updated our patterns file by incorporating resources such as the ADR Lexicon V 1.1 from HLP Cedars-Sinai, SIDER, CHV, COSTART, and DIEGO_Lab for ADRs, and Drugs@FDA Data Files for drug names and active ingredients. Integrating these resources into our annotation process enabled us to increase annotation speed and improve the quality of our training data, thereby enhancing the overall performance of our model.

IRR Improvements

SpanCat (change from the previous round):

  • Taylor and Zach: 0.46 → 0.55
  • Aidan and Jackson: 0.34 → 0.56

TextCat:

  • Taylor and Zach: 0.74
  • Aidan and Jackson: 0.58

Sense2Vec

Another tool from the creators of spaCy and Prodigy that had the potential to assist in our labeling of drugs and ADRs was Sense2Vec. Sense2Vec is a method for word sense disambiguation in neural word embeddings that uses supervised NLP labels instead of unsupervised clustering; it can disambiguate different parts of speech, sentiment, named entities, and syntactic roles of words, and the original paper demonstrates qualitative examples of the disambiguated embeddings. Ultimately, we did not add new side-effect and drug vectors because, beyond very general words and phrases, the results were less than satisfactory.
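
A minimal sketch of the kind of similarity lookup we experimented with, assuming the pretrained Reddit vectors distributed with sense2vec are available locally; the path and query are placeholders:

```python
# Sketch only: query pretrained Sense2Vec vectors for similar senses.
from sense2vec import Sense2Vec

s2v = Sense2Vec().from_disk("./s2v_reddit_2015_md")  # placeholder path to pretrained vectors

query = "headache|NOUN"
if query in s2v:
    # Returns (key, score) pairs of the most similar sense-tagged phrases.
    print(s2v.most_similar(query, n=5))
```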

GPT-2

We also tested a GPT-2 model with a classifier for ADR identification. We found that it was "learning" the general format of the ChatGPT-generated comments and began labeling nearly any shorter comment as containing an ADR. Some were correctly classified, but overall the model's results were underwhelming.

GPT-2 Training

Our final pipeline's structure began to emerge during this testing phase, but we knew we needed to revamp and improve our models, as well as find a way not only to label the drugs and ADRs but also to link them together.

Remedying Class Imbalances with OpenAI's gpt-3.5-turbo API

  • Useful for creating examples that do not occur very often naturally in the data.
  • Able to generate perspectives similar to provided examples that may not be present in the original dataset.
  • Potential to counter biased data and recognize anomalies.
  • Reduces the manual annotation workload through effective prompt engineering (a minimal sketch follows this list).
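
A minimal sketch of how such synthetic comments can be generated, using the legacy openai ChatCompletion interface; the prompt wording, temperature, and counts are illustrative:

```python
# Sketch only: generate synthetic ADR-positive comments with gpt-3.5-turbo.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

PROMPT = (
    "Write a short, informal Reddit-style comment in which the author "
    "describes an adverse reaction they attribute to Humira."
)

def generate_synthetic_comments(n=10):
    comments = []
    for _ in range(n):
        resp = openai.ChatCompletion.create(
            model="gpt-3.5-turbo",
            messages=[{"role": "user", "content": PROMPT}],
            temperature=0.9,  # higher temperature for more varied wording
        )
        comments.append(resp["choices"][0]["message"]["content"])
    return comments
```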

Pipeline Development

Our primary goal was to develop a reproducible pipeline that enables the analysis of not only Reddit comments but also extends to other social media platforms, such as Twitter and Facebook. This pipeline aims to be accessible and usable by academic, corporate, governmental, and public entities, allowing users to input any medication or adverse drug reaction of interest.

Stage 1: RoBERTa Text Classification

We employed the RoBERTa text classification model, a robustly optimized BERT pretraining approach built on the classic BERT model with longer and more focused pretraining and hyperparameter optimization. Utilizing the pre-trained RoBERTa base model from the HuggingFace transformers library, we developed custom PyTorch datasets and dataloaders for text preprocessing, which involved removing punctuation and links.

The comments were split into lists of strings and passed into the RoBERTa tokenizer item by item with overlap. The PyTorch classification head on top of the RoBERTa base model takes the pooled output and performs classification (ADR or no ADR) with dropout layers to prevent overfitting. We trained the model using 5 epochs with validation cycles, CrossEntropyLoss with class weights, the AdamW optimizer, and a linear learning rate scheduler.
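
A minimal sketch of this setup; the dropout rate, class weights, learning rate, and step counts are illustrative rather than our tuned values:

```python
# Sketch only: RoBERTa base with a dropout + linear classification head,
# class-weighted loss, AdamW, and a linear learning-rate schedule.
import torch
from torch import nn
from transformers import RobertaModel, RobertaTokenizerFast, get_linear_schedule_with_warmup

class ADRClassifier(nn.Module):
    def __init__(self, dropout=0.3, num_labels=2):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.roberta.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        out = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        pooled = out.pooler_output  # pooled representation of the sequence
        return self.classifier(self.dropout(pooled))

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")
model = ADRClassifier()

# Class weights offset the scarcity of ADR-positive comments.
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 3.0]))
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=5 * 100  # 5 epochs x ~100 batches
)
```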

To augment text classification training data, we employed OpenAI's GPT-3.5-turbo model for generating and classifying comments, addressing class imbalances and diversifying the training data to counteract biases and improve annotation efficiency.

Stage 2: Flair NER

For Named Entity Recognition (NER), we used Flair's SequenceTagger with stacked embeddings, including GloVe and Flair embeddings, providing different embeddings for the same word depending on its contextual use.

We incorporated word dropout and locked dropout to prevent overfitting, as well as a bidirectional Long Short-Term Memory (biLSTM) network to capture context from both directions of the input sequence.

We trained the model for 25 epochs with validation cycles, using the ViterbiLoss function.
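
A minimal sketch of this stage using Flair's standard training API; the corpus layout, hidden size, and dropout values are illustrative:

```python
# Sketch only: Flair SequenceTagger with stacked GloVe + Flair embeddings.
from flair.datasets import ColumnCorpus
from flair.embeddings import FlairEmbeddings, StackedEmbeddings, WordEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# BIO-formatted annotation files: one token per line, tag in the second column.
corpus = ColumnCorpus(
    "data/", {0: "text", 1: "ner"},
    train_file="train.txt", dev_file="dev.txt", test_file="test.txt",
)
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),
    FlairEmbeddings("news-forward"),
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_rnn=True,          # biLSTM over the stacked embeddings
    word_dropout=0.05,
    locked_dropout=0.5,
)

trainer = ModelTrainer(tagger, corpus)
trainer.train("models/adr-ner", learning_rate=0.1, mini_batch_size=32, max_epochs=25)
```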

Stage 3: spaCy Dependency Parser

Finally, we utilized a pre-trained spaCy dependency parser to extract and pair drug-ADR entities based on the output of the SequenceTagger. This parser links drugs and ADRs in the input text by finding the shortest dependency path between them, using the en_core_web_md model. The extracted and paired drug-ADR entities were saved in a CSV file for further analysis.
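
A minimal sketch of the shortest-dependency-path idea; the example sentence is illustrative, and in the pipeline the drug and ADR tokens come from the SequenceTagger output rather than string matching:

```python
# Sketch only: link a drug token and an ADR token via the shortest path
# in the sentence's dependency graph.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_md")

def dependency_path_length(doc, tok_a, tok_b):
    """Shortest path length between two tokens in the dependency graph."""
    graph = nx.Graph()
    for token in doc:
        for child in token.children:
            graph.add_edge(token.i, child.i)
    return nx.shortest_path_length(graph, source=tok_a.i, target=tok_b.i)

doc = nlp("Started Humira last month and the injection site pain has been rough.")
drug = next(t for t in doc if t.text == "Humira")  # in the pipeline, from the NER stage
adr = next(t for t in doc if t.text == "pain")     # in the pipeline, from the NER stage
print(dependency_path_length(doc, drug, adr))
```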

Pipeline Summary

This pipeline offers a robust and flexible approach to extracting ADR phrases in relation to medications in text, making it an optimal method for our project's objectives. By combining advanced text classification, NER, and dependency parsing techniques, our pipeline can efficiently identify and extract relevant information from social media data for pharmacovigilance purposes.

Pipeline Evaluation and Optimization

During the final annotation phase, Taylor and Zach maintained a satisfactory Cohen's Kappa for text classification, while Jackson and Aidan initially achieved poor results. To address this issue, the team created a document outlining rules for both annotation methods. Subsequently, Jackson and Aidan significantly improved their Cohen's Kappa scores, achieving better consistency and accuracy in their annotations. The inter-rater reliability improvements were as follows:

  • Text Classification (change from the previous round):
    • Taylor and Zach: 0.74 → 0.73
  • Named Entity Recognition:
    • Aidan and Jackson: 0.38 → 0.58

RoBERTa Text Classification

To evaluate the RoBERTa model, we split the data into training, validation, and test sets. The model was trained using a custom PyTorch data loader and the RoBERTa tokenizer, which is a byte pair encoding (BPE) tokenizer. We ran the model for five epochs with a validation cycle after each epoch and calculated the training and validation accuracy and loss metrics at each step to observe convergence.

In this first iteration, we tried the RoBERTa classification model with standard parameters; it produced decent results for predicting texts without ADRs but performed poorly at recognizing texts with ADRs.

RoBERTa Training

In the following classification reports, we see the results of adding a long-text modifier to the RoBERTa model (left) and of the same modified RoBERTa with GPT-generated comments added to the training data (right). The long-text modifier segments each comment into a series of strings of roughly 150-200 tokens before they are passed to the tokenizer. Because the text is fed into the RoBERTa tokenizer item by item, truncation at the model's maximum sequence length of 512 tokens is avoided, and the entire comment can be tokenized and encoded without omission.

RoBERTa Training Classification Reports (left: long-text modifier; right: with GPT-generated comments)
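
A minimal sketch of the long-text splitting described above; word-level chunking approximates the 150-200-token segments, and the chunk size and overlap are illustrative:

```python
# Sketch only: split long comments into overlapping chunks before tokenizing.
from transformers import RobertaTokenizerFast

tokenizer = RobertaTokenizerFast.from_pretrained("roberta-base")

def chunk_comment(text, chunk_size=200, overlap=50):
    """Split a long comment into overlapping word-level chunks."""
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        start += chunk_size - overlap
    return chunks

long_comment = "word " * 600  # stand-in for an unusually long Reddit comment
encoded = [
    tokenizer(chunk, truncation=True, max_length=512, return_tensors="pt")
    for chunk in chunk_comment(long_comment)
]
```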

It is worth noting that the proportion of comments exceeding 200 tokens is relatively small, so this technique applies mainly to exceptionally lengthy comments. The observed improvement in performance is therefore likely attributable to factors unrelated to the long-text modification: the GPT-generated comments balanced the class distribution, providing more positive-class examples for RoBERTa's fine-tuning, which ultimately drove the improvement observed in the model.

RoBERTa Training

We also performed hyperparameter optimization with the Optuna library in an attempt to improve model performance. Using Google Colab, we ran 10 trials optimizing learning rate, batch size, maximum sequence length, and number of epochs, with the objective of maximizing validation accuracy. Our best trial turned out to be the first one, with 0.87 validation accuracy.
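
A minimal sketch of the Optuna search; the search ranges are illustrative and the training helper is hypothetical:

```python
# Sketch only: Optuna study over the hyperparameters named above.
import optuna

def objective(trial):
    lr = trial.suggest_float("learning_rate", 1e-6, 5e-5, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    max_length = trial.suggest_categorical("max_length", [128, 256, 512])
    epochs = trial.suggest_int("epochs", 2, 6)
    # train_and_validate is a hypothetical helper wrapping the RoBERTa
    # training loop and returning validation accuracy.
    return train_and_validate(lr, batch_size, max_length, epochs)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_params, study.best_value)
```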

NER Visualization

Flair NER

As described in the pipeline section, the Flair SequenceTagger stacks GloVe and Flair embeddings, providing different embeddings for the same word depending on its contextual use, and uses word dropout, locked dropout, and a biLSTM layer. The model was trained with validation cycles using the ViterbiLoss function.

Flair Training Metrics (Relaxed Annotations)

In these classification reports (left: GloVe embeddings only; right: GloVe and Flair embeddings), we see strong results, with accuracies of 0.73 and 0.78, respectively. However, we needed a better annotation strategy between the two raters and training metrics beyond a classification report to understand whether our model was fitting correctly.

NER Visualization

Flair Training Metrics (Strict Annotations)

With regard to the changes needed for our annotations, one of the major questions we had to ask ourselves was whether to annotate all ADR-like words as ADRs wherever they occur, or only when the sentence uses the word in the context of an ADR. To test this, we ran a round of annotations using the latter approach, which we called the strict method: a word was marked as an ADR only if the sentence mentioned it as an ADR.

Flair Training

The problem with the strict labeling style is that, by being so selective, we could not provide enough ADR examples in the training data: most words that could be ADRs were marked as negative examples, which teaches the model not to label those words at all. As a result, the model was underfit and unable to learn from the training dataset, and the accuracy, precision, recall, and F1-score all plateaued with no improvement.

Flair Training Metrics (Relaxed + Strict Annotations)

To resolve some of these issues, we combined the two annotated datasets, half strict and half relaxed, hoping to provide the model with a greater variety of examples in the training data.

Flair Training

As you can see, our model was now able to fit the training data and showed improvement in all other metrics, giving us confidence in its ability to label our data effectively.

spaCy Dependency Parser

For our dependency parser implementation, we developed an interactive network graph that visually represents the relationships between the drug and ADR entities extracted by the NER model. The graph is constructed by generating a node for each unique drug and ADR entity and forming edges based on the dependency relations identified by the spaCy dependency parser. It lets users explore the connections between drugs and their associated ADRs, providing an intuitive way to assess the quality of the entity extraction and dependency parsing. Users can click on individual nodes to view the text containing the extracted pairing, offering insight into the context in which the entities were identified and the accuracy of the dependency parsing process.

Dependency Parser Visualization
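
A minimal sketch of one way to build such a graph from the pipeline's CSV output; pyvis is an assumed choice here, and the column names and example rows are illustrative:

```python
# Sketch only: build an interactive drug-ADR graph from the extracted pairs.
import pandas as pd
from pyvis.network import Network

pairs = pd.DataFrame({
    "drug": ["Humira", "Humira"],
    "adr": ["injection site pain", "fatigue"],
    "text": ["source comment one", "source comment two"],
})

net = Network(height="600px", width="100%")
for _, row in pairs.iterrows():
    net.add_node(row["drug"], label=row["drug"], color="#4f81bd")
    # Attach the source comment so it can be inspected from the ADR node.
    net.add_node(row["adr"], label=row["adr"], color="#c0504d", title=row["text"])
    net.add_edge(row["drug"], row["adr"])
net.save_graph("drug_adr_graph.html")
```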

This evaluation method offers several advantages for understanding the performance of the NER and dependency parsing models, such as facilitating a more intuitive understanding of the underlying structure of the extracted entities and their associations, and allowing users to explore specific instances of drug-ADR pairings to assess their validity based on the surrounding context. Furthermore, the integration of NER results and dependency parser output enables users to evaluate the overall effectiveness of the pipeline in identifying and connecting relevant entities.

Docker Containers for Data Storage and Reproducibility

By providing a consistent and reproducible environment across various platforms and systems, Docker streamlines the deployment process, minimizing the risk of errors or conflicts. This is particularly beneficial when employing multiple deep learning methods, as it ensures seamless integration and execution. Moreover, Docker facilitates collaboration and sharing of code and configurations through container images and registries, further enhancing efficiency. Its ability to enable easy scaling and management of resources is particularly useful for large-scale data processing, making it an ideal solution for complex NLP projects.

GitHub Repository and Actions Workflows

GitHub Actions is a feature that allows you to automate your software development workflows in your GitHub repository. You can create workflows that run on GitHub-hosted CPU runners or on your own self-hosted GPU runners, depending on your needs.

One of the initial workflows we set up builds a Docker image from a selected GitHub branch and pushes it under a specific Docker tag. Another workflow accepts a medication name and a Docker tag, runs the pipeline in that container on a self-hosted runner, and deploys the results to Shiny Apps.

Challenges and Workflow Adjustments

We faced some challenges with this approach to processing user requests. Our main bottleneck was limited computing power: the pipeline ran on one group member's personal GPU, which can only handle so much workload. Running the pipeline for each user request on that GPU was time-consuming, and we wanted the tool itself to be quick and easy to use. We also encountered issues with the Pushshift API used to pull Reddit comments, which sometimes returns incomplete or outdated data. We have to refresh our config file regularly to ensure we are using the latest data, which causes occasional hiccups when running the pipeline; resetting the config file about once a day keeps things working otherwise.

Nautilus HyperCluster

Our project utilized the Nautilus HyperCluster, a valuable resource provided by the University of Missouri and the National Research Platform, which offers extensive remote storage, CPU, and GPU capabilities for our data processing tasks.

Initially, our workflow consisted of creating a single pod to execute a pipeline for an individual medication and deploying a Shiny App to display the results. However, this approach proved time-consuming for users when analyzing medications sequentially. To address this issue, we developed an enhanced workflow incorporating Medicrawl.sh, a custom script that generates a pod capable of extracting 1000 comments for each medication listed in a ConfigMap file. These comments are subsequently appended to CSV files stored in a Persistent Volume Claim (PVC).

To further streamline our deployment process, we integrated GitHub automation into our workflow. By initiating the "Deploy Shinyapp" workflow, we establish a connection to the self-hosted runner, which in turn executes the pod on Nautilus, retrieves the CSVs from the PVC, and facilitates the construction and deployment of the Shiny App. This refined approach has significantly enhanced the efficiency and effectiveness of our data processing and analysis tasks.

Conclusion

Does our work answer our research questions?

  • Our project showed that social media can be considered a beneficial data source for pharmacovigilance provided that a significant amount of data can be gathered and analyzed.
  • There are many similarities between the ADRs identified on Reddit and those officially reported to FAERS for most drugs. There are noticeable differences, however; for example, multiple Reddit posts claimed that Tylenol causes autism, yet autism was not a top ADR for Tylenol in the FAERS data.
  • In what ways are any observed differences meaningful for healthcare providers, pharmaceutical companies, and public health researchers?
    • Healthcare providers want to know what ADRs people are discussing with each other on social media so they can provide perspective to patients about prescriptions being considered for treatment
    • While ADRs are not officially verified in either data set (Reddit or FAERS), pharmaceutical companies would like to be aware of their drugs' possible ADRs regardless
    • Increased amounts of data related to ADRs could help inform the research of the FDA, EMA, and other such entities

Tying it all together

  • Our approach can provide cost-effective, efficient and rapid identification of safety concerns for medications out on the market
  • The pipeline we created is robust and modular, and was constructed out of mostly open-source components available to everyday consumers
  • Overall, we demonstrated social media’s great potential as a source of more rapid and current reports of adverse drug reactions. We also showed that there are numerous potential adverse drug reactions for commonly prescribed medications that are heavily discussed on social media but are not officially reported to regulators