Named Entity Recognition (NER)
- Best suited to proper nouns and self-contained expressions such as person names or products
- Uses token-based tags, so it works best when entities have clear token boundaries
The rapid expansion and ubiquity of social media platforms have generated vast amounts of user-generated content, offering unprecedented opportunities for the exploration of various research questions. In the realm of pharmacovigilance, investigating the degree to which adverse medication reactions are shared on social media beyond official reporting channels is of particular interest. This study aims to identify both similarities and differences in reported adverse drug reactions (ADRs) between social media and the FDA's Adverse Event Reporting System (FAERS), to determine whether observed differences are meaningful for healthcare providers, pharmaceutical companies, and public health researchers, and to ascertain the reliability of social media as a data source for pharmacovigilance.
To accomplish this, we will conduct a comparative analysis of the reporting rates of ADRs for the most prescribed medications in the United States, comparing data extracted from social media platforms with that from FAERS. Our measurable objectives include collecting ADR data from both sources, developing ADR extraction methods for social media, comparing social media ADR rates to FAERS baselines, and identifying similarities as well as significant differences between them. Due to constraints, our analysis will be limited to a specific number of medications and will consider Reddit-specific limitations, such as the platform's younger demographic and longer comments. Additionally, we will account for differences in data collection methods and potential social media biases.
We seek an audience of pharmaceutical, bioinformatics, and public health researchers to consider and explore our results through this comprehensive website featuring visualizations, dashboards, and statistics, offering a narrative of the social media data mining process and its uses for pharmacovigilance.
The initial data sources for this study are the FDA's Adverse Event Reporting System (FAERS) and Reddit, as they provide rich and diverse perspectives on adverse drug reactions (ADRs). FAERS is the standard source of information on ADRs reported to the FDA, with voluntary reporting from healthcare providers, patients, and mandatory reporting from pharmaceutical companies. Reddit, on the other hand, serves as a platform where communities can discuss various diseases and their treatments, offering detailed reports of side effects and potentially unreported ADRs. Utilizing the respective APIs for each data source allows for targeted searches of specific medications.
To efficiently extract Reddit data, we employed the Python Reddit API Wrapper (PRAW) and PushshiftAPI for targeted searches of comments mentioning specific medications. Challenges in Reddit data extraction include the length and grammar of comments, as well as the lack of context, which may hinder accurate interpretation and understanding of the full scope of ADRs. To address these challenges and ensure secure, scalable data storage and processing, we implemented a data management strategy using Jupyter and Google Cloud.
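As a rough illustration of this extraction step, the sketch below uses PRAW to pull comments that mention a medication; the subreddit, search term, and credentials are placeholders rather than our actual configuration.

```python
# Minimal PRAW sketch: collect comments mentioning a medication.
# Subreddit, query, and credentials below are illustrative placeholders.
import praw

reddit = praw.Reddit(
    client_id="YOUR_CLIENT_ID",          # hypothetical credentials
    client_secret="YOUR_CLIENT_SECRET",
    user_agent="adr-study script",
)

rows = []
for submission in reddit.subreddit("MultipleSclerosis").search("Ocrevus", limit=100):
    submission.comments.replace_more(limit=0)      # flatten "load more comments" stubs
    for comment in submission.comments.list():
        if "ocrevus" in comment.body.lower():
            rows.append({"id": comment.id,
                         "created_utc": comment.created_utc,
                         "body": comment.body})
```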
Sentiment analysis is a valuable tool for identifying the underlying tone and sentiment of user-generated content, which we thought could enable us to uncover potential ADRs that may not be evident through traditional data mining techniques. After using NLTK's SentimentIntensityAnalyzer on Reddit comment content, we were able to determine polarity scores and assign sentiment labels. Following this, we created visualizations using Seaborn count plots and box plots (pictured below are results for Lipitor). We then attempted basic topic modeling of negative Reddit comments using Latent Dirichlet Allocation (LDA). We were not able to gain many useful insights from this data that would answer our question about which specific ADRs were more common. However, we were still curious about different ways that sentiment analysis could be done on our data.
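A minimal sketch of that labeling step is below; the compound-score cutoffs are the conventional VADER thresholds, not values taken from our notebooks.

```python
# Label Reddit comments with VADER polarity scores from NLTK.
import nltk
from nltk.sentiment import SentimentIntensityAnalyzer

nltk.download("vader_lexicon", quiet=True)
sia = SentimentIntensityAnalyzer()

def label_sentiment(text: str) -> str:
    compound = sia.polarity_scores(text)["compound"]   # score in [-1, 1]
    if compound >= 0.05:
        return "positive"
    if compound <= -0.05:
        return "negative"
    return "neutral"

print(label_sentiment("Lipitor gave me awful muscle pain"))   # -> "negative"
```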
To further explore the data, we used the Wordcloud library to visualize the most frequently used words per medication, coloring the words by sentiment computed with the TextBlob library (pictured right are the results for Lexapro). These visualizations yielded only generic results: words such as “better” and “good” showed up as positive and words like “bad” and “less” as negative, so we knew we needed to dig deeper than these surface-level interpretations.
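Roughly, the coloring works as sketched below; the comment list is a placeholder, and the polarity-to-color mapping is an assumption about how such a word cloud can be recolored.

```python
# Word cloud recolored by TextBlob polarity (green = positive, red = negative).
from textblob import TextBlob
from wordcloud import WordCloud

comments = ["Lexapro made me feel so much better", "the brain zaps were bad"]  # placeholder data

def sentiment_color(word, **kwargs):
    polarity = TextBlob(word).sentiment.polarity
    if polarity > 0:
        return "green"
    if polarity < 0:
        return "red"
    return "grey"

wc = WordCloud(background_color="white").generate(" ".join(comments))
wc.recolor(color_func=sentiment_color).to_file("lexapro_wordcloud.png")
```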
Subsequently, we compared word frequencies across different sentiments, as illustrated in the adjacent image, by calculating the frequency disparities between positive and negative Reddit comments. This analysis revealed emerging trends in the data, such as expressions like "feel like I'm much better" and potential adverse drug reactions (ADRs), including "zaps" or "panic attacks." However, upon further examination, we recognized the inherent ambiguity in determining whether mentions of "panic attacks" indicate the medication caused or alleviated the condition or even if the term is used in the context of an ADR. This complexity necessitated exploring more sophisticated approaches to analyze the data effectively.
Recognizing the complexity of our problem and with the guidance of our mentor, we decided to explore neural networks and deep learning methods. This required a transformation in our approach, which involved narrowing our scope from multiple medications to a single medication for building a robust model. Subsequently, we aimed to develop a pipeline that could be applied to other drugs.
The spaCy library is a powerful and widely-used natural language processing (NLP) library designed for efficient and high-performance text processing tasks. It offers pre-trained models that are developed on large and diverse datasets, ensuring versatility and applicability across various domains. One of the key strengths of spaCy is its fine-tunability, which enables us to achieve higher accuracy and better performance by adapting the pre-trained models to specific tasks or domains.
Prodigy, an annotation tool developed by the creators of spaCy, is designed to streamline and optimize the annotation process. Prodigy offers the ability to suggest new annotations based on patterns defined by the user, significantly reducing the time and effort required for manual annotation. Furthermore, the tool supports multiple users, allowing us to asynchronously annotate documents and collaborate more effectively.
Another advantage of Prodigy is its ability to be cloud hosted, in our case using Google Cloud Platform (GCP), which enabled us to scale our model training and deployment to handle large volumes of data. This integration ensures that our NLP pipeline can be easily adapted to meet the demands of various projects and workloads.
We redefined our key performance indicators to include checks of the inter-rater reliability (IRR) between annotator pairs. IRR is essential for maintaining consistency and accuracy in annotation tasks, minimizing subjective bias and variability among annotators, and ultimately enhancing the quality of training data and overall model performance. Cohen's kappa, a statistical measure that quantifies the level of agreement between two annotators while accounting for chance agreement, served as the key metric. Kappa values range from -1 to 1, with 1 indicating perfect agreement and values above 0.6 generally being considered acceptable.
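The check itself is simple; the sketch below uses scikit-learn's implementation with made-up per-post judgments from a hypothetical annotator pair.

```python
# Cohen's kappa between two annotators' per-post judgments (toy data).
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0]   # e.g., "post contains an ADR" labels
annotator_b = [1, 0, 1, 0, 0, 1, 0, 1]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")      # we treated values above ~0.6 as acceptable
```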
We opted for Ocrevus (ocrelizumab) as the initial drug of our study, owing to its recent FDA approval as a "first-in-class" multiple sclerosis (MS) medication. We also decided to utilize the Medical Dictionary for Regulatory Activities (MedDRA) terminology of side effects to serve as patterns for the Prodigy tool, with the aim of enhancing our annotation process.
The named entity recognition (NER) approach to annotation involved labeling drugs as well as their associated adverse side effects within Reddit posts. Our first attempt was unsuccessful due to ambiguous annotation rules. Consequently, we transitioned to the span categorization approach, hypothesizing that our initial challenges could be attributed to the NER approach.
The span categorization (SpanCat) approach to annotation of Reddit posts involved labeling drugs, their associated adverse side effects, and ADR phrases directly connecting a drug to its undesired side effect. For our annotations, we paired up: Aidan was paired with Jackson, while Zach was paired with Taylor. Cohen’s kappa was computed for each pair of annotators to check whether each pair agreed sufficiently on the number of ADRs in each Reddit post. The results were as follows:
During our testing phase, we realized that Ocrevus, despite being popular on Reddit, was not an ideal medication for our study due to its relatively low number of adverse drug reactions (ADRs). To provide the model with sufficient ADR data for training, we switched to studying Humira—a tumor necrosis factor (TNF) blocker known for reducing inflammation—observing that many individuals mentioned and complained about it while annotating Ocrevus comments.
To expedite the annotation process, we discussed annotation guidelines and updated our patterns file by incorporating resources such as the ADR Lexicon V 1.1 from HLP Cedars-Sinai, SIDER, CHV, COSTART, and DIEGO_Lab for ADRs, and Drugs@FDA Data Files for drug names and active ingredients. Integrating these resources into our annotation process enabled us to increase annotation speed and improve the quality of our training data, thereby enhancing the overall performance of our model.
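As a hedged sketch, a Prodigy patterns file can be assembled from such term lists along the lines shown below; the handful of terms here are stand-ins for the MedDRA, ADR Lexicon, and Drugs@FDA entries we actually loaded.

```python
# Build a patterns.jsonl file for Prodigy from ADR and drug term lists.
import json

adr_terms = ["nausea", "brain zaps", "injection site reaction"]   # placeholder terms
drug_terms = ["Humira", "adalimumab"]

with open("patterns.jsonl", "w") as f:
    for term in adr_terms:
        pattern = [{"lower": tok.lower()} for tok in term.split()]
        f.write(json.dumps({"label": "ADR", "pattern": pattern}) + "\n")
    for term in drug_terms:
        f.write(json.dumps({"label": "DRUG", "pattern": [{"lower": term.lower()}]}) + "\n")
```

A file like this can then be supplied to an annotation recipe such as spans.manual through its --patterns option so that matching spans are pre-highlighted for the annotator.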
SpanCat (from previous SpanCat):
TextCat:
Another tool from the creators of spaCy and Prodigy that had the potential to assist in our labeling of drugs and ADRs was Sense2Vec. Sense2Vec is a method for word sense disambiguation in neural word embeddings that utilizes supervised NLP labels instead of unsupervised clustering. It can disambiguate different parts of speech, sentiment, named entities, and syntactic roles of words, and demonstrates subjective and qualitative examples of disambiguated embeddings. However, we were unable to make productive use of new side-effect and drug vectors: beyond very general words or phrases, the results were less than satisfactory.
We also tested a GPT-2 model with a classifier for ADR identification. We found that it was "learning" the general format of the ChatGPT-generated comments and began labeling any shorter comment as containing an ADR. Some were correctly classified, but overall, the model's results were underwhelming.
Our final pipeline's structure began to emerge during this testing phase, but we knew we needed to revamp and improve our models, as well as find a way not only to label the drugs and ADRs but also to link them together.
Our primary goal was to develop a reproducible pipeline that enables the analysis of not only Reddit comments but also extends to other social media platforms, such as Twitter and Facebook. This pipeline aims to be accessible and usable by academic, corporate, governmental, and public entities, allowing users to input any medication or adverse drug reaction of interest.
We employed the RoBERTa text classification model, a robustly optimized BERT pretraining approach built on the classic BERT model with longer and more focused pretraining and hyperparameter optimization. Utilizing the pre-trained RoBERTa base model from the HuggingFace transformers library, we developed custom PyTorch datasets and dataloaders for text preprocessing, which involved removing punctuation and links.
The comments were split into lists of strings and passed into the RoBERTa tokenizer item by item with overlap. The PyTorch classification head on top of the RoBERTa base model takes the pooled output and performs classification (ADR or no ADR) with dropout layers to prevent overfitting. We trained the model using 5 epochs with validation cycles, CrossEntropyLoss with class weights, the AdamW optimizer, and a linear learning rate scheduler.
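A condensed sketch of that setup is below, using the HuggingFace transformers and PyTorch APIs; the dropout rate, class weights, and step counts are illustrative placeholders, and the dataset and dataloader code is omitted.

```python
# RoBERTa base with a small classification head (ADR / no ADR).
import torch
import torch.nn as nn
from transformers import RobertaModel, RobertaTokenizer, get_linear_schedule_with_warmup

class ADRClassifier(nn.Module):
    def __init__(self, dropout: float = 0.3, num_labels: int = 2):
        super().__init__()
        self.roberta = RobertaModel.from_pretrained("roberta-base")
        self.dropout = nn.Dropout(dropout)
        self.classifier = nn.Linear(self.roberta.config.hidden_size, num_labels)

    def forward(self, input_ids, attention_mask):
        outputs = self.roberta(input_ids=input_ids, attention_mask=attention_mask)
        pooled = self.dropout(outputs.pooler_output)    # pooled representation of the sequence
        return self.classifier(pooled)

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = ADRClassifier()
criterion = nn.CrossEntropyLoss(weight=torch.tensor([1.0, 3.0]))   # placeholder class weights
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
scheduler = get_linear_schedule_with_warmup(
    optimizer, num_warmup_steps=0, num_training_steps=5 * 100)      # 5 epochs x ~100 batches
```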
To augment text classification training data, we employed OpenAI's GPT-3.5-turbo model for generating and classifying comments, addressing class imbalances and diversifying the training data to counteract biases and improve annotation efficiency.
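A rough sketch of that augmentation step, using the chat-completions interface of the older openai client, is below; the prompt, temperature, and counts are assumptions.

```python
# Generate synthetic ADR-positive comments with GPT-3.5-turbo (older openai client).
import openai

openai.api_key = "YOUR_API_KEY"   # hypothetical credential

def generate_adr_comment(drug: str) -> str:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{
            "role": "user",
            "content": f"Write a short, realistic Reddit-style comment describing "
                       f"an adverse reaction to {drug}.",
        }],
        temperature=0.9,
    )
    return response["choices"][0]["message"]["content"]

synthetic_positives = [generate_adr_comment("Humira") for _ in range(5)]
```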
For Named Entity Recognition (NER), we used Flair's SequenceTagger with stacked embeddings, including GloVe and Flair embeddings, providing different embeddings for the same word depending on its contextual use.
We incorporated word dropout and locked dropout to prevent overfitting, as well as a bidirectional Long Short-Term Memory (biLSTM) RNN to maintain short-term memory throughout the input sequence processing.
We trained the model using 25 epochs with validation cycles with the ViterbiLoss function.
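A sketch of that training setup with the Flair API is shown below; the corpus files, column layout, and most hyperparameters are placeholders rather than our exact values.

```python
# Flair SequenceTagger with stacked GloVe + Flair embeddings for drug/ADR tagging.
from flair.datasets import ColumnCorpus
from flair.embeddings import WordEmbeddings, FlairEmbeddings, StackedEmbeddings
from flair.models import SequenceTagger
from flair.trainers import ModelTrainer

# CoNLL-style files with token and NER columns (file names are placeholders).
corpus = ColumnCorpus("data/", {0: "text", 1: "ner"},
                      train_file="train.txt", dev_file="dev.txt", test_file="test.txt")
tag_dictionary = corpus.make_label_dictionary(label_type="ner")

embeddings = StackedEmbeddings([
    WordEmbeddings("glove"),            # static GloVe vectors
    FlairEmbeddings("news-forward"),    # contextual character-level embeddings
    FlairEmbeddings("news-backward"),
])

tagger = SequenceTagger(
    hidden_size=256,
    embeddings=embeddings,
    tag_dictionary=tag_dictionary,
    tag_type="ner",
    use_rnn=True,           # biLSTM over the stacked embeddings
    use_crf=True,           # CRF layer, trained with the Viterbi loss
    word_dropout=0.05,
    locked_dropout=0.5,
)

trainer = ModelTrainer(tagger, corpus)
trainer.train("models/adr-ner", learning_rate=0.1, mini_batch_size=32, max_epochs=25)
```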
Finally, we utilized a pre-trained spaCy dependency parser to extract and pair drug-ADR entities based on the output of the SequenceTagger. This parser links drugs and ADRs in the input text by finding the shortest dependency path between them, using the en_core_web_md model. The extracted and paired drug-ADR entities were saved in a CSV file for further analysis.
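The pairing logic can be sketched as follows; the sentence and the two entity mentions are hard-coded stand-ins for the SequenceTagger output, and networkx is used here to compute the shortest path over the dependency tree.

```python
# Link a drug mention to an ADR mention via the shortest dependency path.
import networkx as nx
import spacy

nlp = spacy.load("en_core_web_md")
doc = nlp("I started Humira last month and the injection site pain was awful.")

# Build an undirected graph over the dependency tree (token index -> token index).
graph = nx.Graph()
for token in doc:
    for child in token.children:
        graph.add_edge(token.i, child.i)

drug_idx = next(tok.i for tok in doc if tok.text == "Humira")   # stand-ins for NER output
adr_idx = next(tok.i for tok in doc if tok.text == "pain")

path = nx.shortest_path(graph, source=drug_idx, target=adr_idx)
print("Humira -> pain dependency distance:", len(path) - 1)
```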
This pipeline offers a robust and flexible approach to extracting ADR phrases in relation to medications in text, making it an optimal method for our project's objectives. By combining advanced text classification, NER, and dependency parsing techniques, our pipeline can efficiently identify and extract relevant information from social media data for pharmacovigilance purposes.
During the final annotation phase, Taylor and Zach maintained a satisfactory Cohen's Kappa for text classification, while Jackson and Aidan initially achieved poor results. To address this issue, the team created a document outlining rules for both annotation methods. Subsequently, Jackson and Aidan significantly improved their Cohen's Kappa scores, achieving better consistency and accuracy in their annotations. The inter-rater reliability improvements were as follows:
To evaluate the RoBERTa model, we split the data into training, validation, and test sets. The model was trained using a custom PyTorch data loader and the RoBERTa tokenizer, which is a byte pair encoding (BPE) tokenizer. We ran the model for five epochs with a validation cycle after each epoch and calculated the training and validation accuracy and loss metrics at each step to observe convergence.
In this first iteration, we tried the RoBERTa classification model with standard parameters; it achieved decent results predicting texts without ADRs but performed poorly at recognizing texts with ADRs.
In the following classification reports, we see the results of adding a long-text modifier to the RoBERTa model (left), as well as the same modified RoBERTa with GPT-generated comments added to the training data (right). The long-text modification segments each comment into a series of strings of roughly 150-200 tokens before tokenization, and these segments are fed into the RoBERTa tokenizer item by item to avoid the truncation that would otherwise occur at the model's maximum sequence length of 512 tokens. Consequently, the entire comment can be tokenized and encoded without omission.
It is worth noting that the proportion of comments exceeding 200 tokens is relatively small, so this technique applies mainly to exceptionally long comments, and the observed improvement in performance is likely attributable to factors unrelated to the long-text handling. The incorporation of GPT-generated comments balanced the class distribution, providing more positive-class examples for RoBERTa's fine-tuning, which ultimately contributed to the improved performance of the model.
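As a small illustration of the splitting step, the helper below chunks a comment into overlapping word windows; the chunk size and overlap are illustrative, not our exact settings.

```python
# Split a long comment into overlapping chunks so nothing is truncated
# at RoBERTa's 512-token maximum sequence length.
def chunk_comment(text: str, chunk_size: int = 175, overlap: int = 25) -> list[str]:
    words = text.split()
    chunks, start = [], 0
    while start < len(words):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break
        start += chunk_size - overlap    # step forward, keeping some overlap for context
    return chunks

print(len(chunk_comment("word " * 400)))   # a 400-word comment -> 3 overlapping chunks
```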
We also performed hyperparameter optimization using the Optuna library in an attempt to improve model performance. Using Google Colab, we ran 10 trial cycles optimizing the learning rate, batch size, maximum sequence length, and number of epochs, with the objective of maximizing validation accuracy. Our best trial turned out to be the first one, with 0.87 validation accuracy.
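The search itself looks roughly like the sketch below; train_and_validate is a hypothetical stand-in for our actual training loop, and the parameter ranges are illustrative.

```python
# Optuna search over learning rate, batch size, max length, and epochs.
import optuna

def train_and_validate(lr, batch_size, max_length, epochs) -> float:
    # Placeholder: the real objective runs a RoBERTa train/validation cycle
    # and returns validation accuracy.
    return 0.5

def objective(trial: optuna.Trial) -> float:
    lr = trial.suggest_float("learning_rate", 1e-6, 1e-4, log=True)
    batch_size = trial.suggest_categorical("batch_size", [8, 16, 32])
    max_length = trial.suggest_categorical("max_length", [128, 256, 512])
    epochs = trial.suggest_int("epochs", 2, 6)
    return train_and_validate(lr, batch_size, max_length, epochs)

study = optuna.create_study(direction="maximize")
study.optimize(objective, n_trials=10)
print(study.best_trial.params, study.best_value)
```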
Turning to evaluation of the NER model: as described above, it is Flair's SequenceTagger with stacked GloVe and Flair embeddings, word dropout and locked dropout to prevent overfitting, and a biLSTM to carry context through the input sequence, trained with validation cycles using the ViterbiLoss function.
In these classification reports (left: GloVe embeddings only; right: GloVe and Flair embeddings together), we see promising results, with accuracies of 0.73 and 0.78, respectively. However, we needed a better annotation strategy between the two raters and training metrics beyond a classification report to understand whether our model was fitting correctly.
With regard to the changes needed for our annotations, a major question was whether we should annotate all ADR-like words as ADRs wherever they occur, or only when a sentence uses the word in the context of an ADR. To test this, we ran a round of annotations using the latter approach, which we called the strict method: a word was marked as an ADR only if the sentence mentioned it as an ADR.
The problem with the strict labeling style is that, by being so selective, we did not provide enough ADR examples in the training data: the majority of words that could be ADRs were marked as negative, which taught the model not to label those words at all. As a result, the model underfit and was unable to learn from the training dataset, and accuracy, precision, recall, and F1-score all plateaued with no improvement.
To resolve some of these issues, we combined the two annotated datasets, half strict and half relaxed, hoping to give the model a greater variety of examples in the training data.
As you can see, our model was able to fit the training data and showed improvement across all other metrics, leaving us confident in its ability to label our data effectively.
For our dependency parser implementation, we have developed an interactive network graph that visually represents the relationships between drug and ADR entities extracted using the NER model. This graph, constructed by generating nodes for each unique drug and ADR entity and forming edges based on the dependency relations identified by the spaCy dependency parser, allows users to explore the connections between drugs and their associated ADRs, thereby providing an intuitive way to assess the quality of the entity extraction and dependency parsing processes. The interactive visualization enables users to click on individual nodes and view the corresponding text containing the extracted pairing, offering valuable insights into the context in which the entities were identified and the accuracy of the dependency parsing process.
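One way to build such a graph is sketched below; pyvis here is an assumption about the visualization layer, and the drug-ADR pairs are illustrative stand-ins for the parser output.

```python
# Interactive drug-ADR network: nodes for entities, edges for extracted pairings.
import networkx as nx
from pyvis.network import Network

pairs = [("Humira", "injection site pain"), ("Humira", "fatigue"),
         ("Lexapro", "brain zaps")]            # placeholder extracted pairs

g = nx.Graph()
for drug, adr in pairs:
    g.add_node(drug, color="#4f81bd", title=f"Drug: {drug}")
    g.add_node(adr, color="#c0504d", title=f"ADR: {adr}")
    g.add_edge(drug, adr)

net = Network(height="600px", width="100%")
net.from_nx(g)
net.write_html("drug_adr_network.html")   # open in a browser and click nodes to inspect
```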
This evaluation method offers several advantages for understanding the performance of the NER and dependency parsing models, such as facilitating a more intuitive understanding of the underlying structure of the extracted entities and their associations, and allowing users to explore specific instances of drug-ADR pairings to assess their validity based on the surrounding context. Furthermore, the integration of NER results and dependency parser output enables users to evaluate the overall effectiveness of the pipeline in identifying and connecting relevant entities.
By providing a consistent and reproducible environment across various platforms and systems, Docker streamlines the deployment process, minimizing the risk of errors or conflicts. This is particularly beneficial when employing multiple deep learning methods, as it ensures seamless integration and execution. Moreover, Docker facilitates collaboration and sharing of code and configurations through container images and registries, further enhancing efficiency. Its ability to enable easy scaling and management of resources is particularly useful for large-scale data processing, making it an ideal solution for complex NLP projects.
GitHub Actions is a feature that allows you to automate your software development workflows in your GitHub repository. You can create workflows that run on GitHub-hosted CPU runners or on your own self-hosted GPU runners, depending on your needs.
One of the initial workflows we set up pushes a selected GitHub branch to a specific Docker tag. Another workflow takes a medication name and a Docker tag as inputs, runs the pipeline in that tagged image on a self-hosted runner, and deploys the results to Shiny Apps.
We faced some challenges with this approach to processing user requests. Our main bottleneck was limited computing power: we relied on one group member's personal GPU, which could only handle so much workload, so running a pipeline for each user request was time-consuming, and we wanted the tool itself to be quick and easy to use. We also encountered issues with the Pushshift API used to pull Reddit comments, which sometimes returned incomplete or outdated data. We had to refresh our config file regularly, usually once a day, to ensure we were using the latest data; this caused occasional hiccups when running the pipeline, but it otherwise worked.
Our project utilized the Nautilus HyperCluster, a valuable resource provided by the University of Missouri and the National Research Platform, which offers extensive remote storage, CPU, and GPU capabilities for our data processing tasks.
Initially, our workflow consisted of creating a single pod to execute a pipeline for an individual medication and deploying a Shiny App to display the results. However, this approach proved time-consuming for users when analyzing medications sequentially. To address this issue, we developed an enhanced workflow incorporating Medicrawl.sh, a custom script that generates a pod capable of extracting 1000 comments for each medication listed in a ConfigMap file. These comments are subsequently appended to CSV files stored in a Persistent Volume Claim (PVC).
To further streamline our deployment process, we integrated GitHub automation into our workflow. By initiating the "Deploy Shinyapp" workflow, we establish a connection to the self-hosted runner, which in turn executes the pod on Nautilus, retrieves the CSVs from the PVC, and facilitates the construction and deployment of the Shiny App. This refined approach has significantly enhanced the efficiency and effectiveness of our data processing and analysis tasks.