diff --git a/.ruby-version b/.ruby-version new file mode 100644 index 0000000..ef538c2 --- /dev/null +++ b/.ruby-version @@ -0,0 +1 @@ +3.1.2 diff --git a/README.md b/README.md index 791857e..e823a48 100644 --- a/README.md +++ b/README.md @@ -1,8 +1,4 @@ -# CS 329S (Winter 2021) Final Project Reports - -We recommend that you write your report in Google Docs then migrate it Markdown. I find the migration fairly straightforward, and you can also use the [Docs to Markdown add-on](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) to automatically convert Docs to Markdown. - -Once you've had your post in Markdown, create a pull request to add your post to the course's website. +# Template Residência em TIC BRISA - UnB FGA 1. Fork this repository. 2. Clone it to your local machine. @@ -16,6 +12,3 @@ Once you've had your post in Markdown, create a pull request to add your post to You might need to install Jekyll. If you're note familiar with Jekyll, you can find [Jekyll's installation instructions here](https://docs.github.com/en/github/working-with-github-pages/testing-your-github-pages-site-locally-with-jekyll). - -Let us know if you have any question! - diff --git a/_config.yml b/_config.yml index 26363a3..8c4c633 100755 --- a/_config.yml +++ b/_config.yml @@ -1,25 +1,18 @@ -title: CS 329S Winter 2021 Reports +title: Residência BRISA description: > # this means to ignore newlines until "baseurl:" - Final project reports for the Stanford's course CS 329S, Winter 2021 + Projeto Residência em TIC da BRISA - UnB FGA permalink: ':title/' -baseurl: "/reports" # the subpath of your site, e.g. /blog -url: "https://stanford-cs329s.github.io/" # the base hostname & protocol for your site, e.g. http://example.com -site-twitter: chipro # if your site has a twitter account, enter it here +baseurl: "/" # the subpath of your site, e.g. /blog +url: "http://guilhermedfs.github.io" # the base hostname & protocol for your site, e.g. http://example.com # Author Settings -author: CS 329S -author-img: stanfordlogo.png -about-author: Machine Learning Systems Design -social-twitter: chipro -social-github: chiphuyen -social-linkedin: chiphuyen -social-email: chip@huyenchip.com - -# Disqus -discus-identifier: cs329s-winter2021-reports - -# Tracker -analytics: UA-103070243-6 # Google Analytics +author: +author-img: lappis.png +about-author: Residência BRISA +social-twitter: +social-github: +social-linkedin: +social-email: caguiar@unb.br # Build Settings markdown: kramdown diff --git a/_includes/head.html b/_includes/head.html index 0f5ef71..2467022 100755 --- a/_includes/head.html +++ b/_includes/head.html @@ -85,4 +85,5 @@ + diff --git a/_layouts/main.html b/_layouts/main.html index 7e80a2b..73c8f9c 100644 --- a/_layouts/main.html +++ b/_layouts/main.html @@ -10,7 +10,7 @@ {{site.author}}
{{site.author}}
-

{{site.about-author}}

+

{{site.about-author}}

@@ -26,23 +26,8 @@ diff --git a/_posts/2021-03-17-covidbot-report.markdown b/_posts/2021-03-17-covidbot-report.markdown deleted file mode 100644 index 4f3439d..0000000 --- a/_posts/2021-03-17-covidbot-report.markdown +++ /dev/null @@ -1,155 +0,0 @@ ---- -layout: post -title: CovidBot Project Report -date: 2021-03-17 13:32:20 +0700 -description: Building a CovidBot for Personal and General Covid-Related Question-Answering -img: # Add image post (optional) -fig-caption: # Add figcaption (optional) -tags: [Covid] -comments: false ---- - -### The Team -- Andra Fehmiu -- Bernard Lange - -## Problem Definition - -One year of the global coronavirus pandemic has led to 550,574 deaths and 30,293,016 cases in the United States so far. It has forced the government and authorities to implement various restrictions and recommendations to hinder its spread until a long-lasting solution is determined. In this constantly evolving political and news landscape, it has been challenging for people all over the world, including those in the U.S., to remain informed about Covid-related matters (i.e. Covid-19 symptoms, recommended actions and guidelines, nearest test and vaccination centers, active restrictions and rules, etc). Medical call centers and providers have also been overloaded with volumes of individuals seeking reliable answers to their Covid19-related questions and/or seeking guidance with their Covid-19 symptoms.[^1] - -To tackle the challenges that have arisen due to these unusual circumstances, we have decided to build CovidBot, a Covid-19 Chatbot. CovidBot provides easy access to the most up-to-date Covid-19 news and information to individuals in the U.S and, as a result, eases the burden of medical providers. CovidBot enables us to standardize the answers to the most prevalent Covid-19-related questions, such as _What are Covid-19 symptoms?_ and _How does Covid spread?_, based on the information provided by WHO and CDC and provide them instantaneously and simultaneously to thousands of users seeking assistance. We have also added capabilities for handling user-specific queries that require personalized responses (i.e. When is my next test? When did I test positive?, etc.). Thus, CovidBot is able to answer both general, frequently-asked questions about COVID-19 and user-specific questions. - -Having come across multiple articles such as the one by Harvard Business Review about hospitals using AI to battle Covid-19, it was apparent to us that there is a clear need for a CovidBot that could also be easily integrated and used by hospital and medical centers around the U.S. While searching for available open-source code to build chatbots for Covid-19, we realized that the existing Covid question-answering models and chatbots were either limited in their capabilities and/or were not accessible. For example, Deepset’s Covid question-answering model API [2] and UI were taken offline in June 2020[^2]. Covid question-answering model deployed by students at Korea University[^3] [3] provides out of date Covid-related news and information. When we asked _“What vaccines are available?”,_ we were given an answer containing a scholarly article from 2016 about the different types of vaccines in general (see Figure 1) as opposed to our Chatbot’s QA model, which is able to provide an accurate and up-to-date answer to this question by listing the Pfizer and Moderna vaccines (see Figure 4). 
In addition, none of the Covid chatbots we came across have implemented the capabilities necessary to address user-specific queries and provide personalized responses. -
-
- -
-Figure 1. CovidAsk's response to the user query “What vaccines are available?” -
- -The bot can be used to find the most up-to-date Covid-related information at the time of writing, can provide answers to personal or general questions, and can be easily integrated with various popular social platforms that people use to communicate (e.g. Slack, Facebook Messenger, etc.). The implementation behind the CovidBot is available at [https://github.com/BenQLange/CovidBot](https://github.com/BenQLange/CovidBot). - -## System Design - -Our general framework is visualized in Figure 2 and comprises the following modules. _Natural Language Modelling module_ handles the queries and generates responses. The datasets used for training and any general information used or stored during inference is encapsulated in the _Knowledge Base_. Data-driven models and all the datasets are described in further detail in the Machine Learning Component section. All personal data, e.g. user-bot interaction history, personal information, and analytics, is stored in the _Internal Data Storage_. Finally, _Dialog Deployment Engine module_ enables interaction with our bot via popular messaging platforms such as Facebook Messenger and Slack. The deployment framework used is Google’s Dialogflow. We have decided to use it for building our conversational interface due to its straightforward integration with our ML system via the webhook service and various popular messaging platforms (e.g. Slack, Facebook Messenger, Skype etc.). This in turn, makes the deployment of the chatbot easier and makes CovidBot easy to use for our end-users. - -
-
- -
-Figure 2. Overview of the CovidBot Architecture -
- -
-
- -
-Figure 3. General CovidBot Framework -
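To make the webhook integration described above more concrete, the sketch below shows the general shape of a Dialogflow fulfillment endpoint that forwards a user's query to the response-generation models. It is a minimal illustration rather than code from the CovidBot repository: the route, handler names, and `generate_response` stub are hypothetical, and it assumes the Dialogflow ES request format, which carries the user's utterance in `queryResult.queryText`.

```python
# Hypothetical fulfillment webhook: Dialogflow posts the query JSON here, and
# the reply is read back from the "fulfillmentText" field of the response.
from flask import Flask, request, jsonify

app = Flask(__name__)

def generate_response(query: str) -> str:
    """Stub for the ML backend: route personal queries to GPT-3 and general
    Covid questions to DialoGPT/RoBERTa, as described above."""
    return f"CovidBot answer for: {query}"

@app.route("/webhook", methods=["POST"])
def webhook():
    payload = request.get_json(force=True)
    query = payload.get("queryResult", {}).get("queryText", "")
    return jsonify({"fulfillmentText": generate_response(query)})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

Because Dialogflow handles the messaging-platform side (Slack, Facebook Messenger, Skype), a single endpoint of this shape can serve every channel without modification.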
- -## Machine Learning Component - -Our general CovidBot framework is visualized in Figure 3. The CovidBot is powered by multiple ML models running simultaneously. Depending on the type of the query, whether it relates to personal or general COVID-19 information, different models are responsible for response generation. - -We build an intent classifier using the GPT-3 model thanks to the OpenAI Beta access. GPT-3 uses the Transformer framework and has 175B trainable parameters, 96 layers, 96 heads in each layer, each head with a dimension of 128. To successfully perform intent classification, GPT-3 requires only a few examples of correct classification. Depending on the intent of the user’s query, we either use a GPT-3 model to generate and extract personalized (user-specific) response or a Covid General QA model that uses either DialoGPT, RoBERTa, or GPT-3 to generate a response. If the query is personal, GPT-3 extracts the type of the information provided, e.g. I have a test on the 2nd of April, and stores it locally. If it is a question, e.g. _When was my last negative test?_, it loads locally stored information based on which GPT-3 generates the answer. - -The answers to general COVID-19 questions are generated by the DialoGPT by default. However, we have also built in an additional capability to pick RoBERTa, or GPT-3. Although the GPT-3 model is a powerful text generation model, we can not fine-tune the model to our tasks and we have a limited number of input tokens. This limits the amount of knowledge about COVID-19 which is provided to the model making it inadequate for our task. For this reason, we build 2 additional models, namely RoBERTa and DialoGPT, that do not have these limitations. - -RoBERTa [5] is a retrained BERT model that builds on BERT’s language masking strategy and removes its next-sentence pretraining objective.[^4] We use the RoBERTa model fine-tuned on a SQuAD-style CORD-19[^5] dataset provided by Deepset, which is publicly available on HuggingFace[^6]. After testing the model performance and inspecting the Covid QA dataset, we observe that a lot of the annotated examples contain non-Covid content, which is reflected in the poor performance of the Covid QA model. Due to this, we fine-tuned the RoBERTa model again using our custom dataset containing Covid-related FAQ pages from the CDC and WHO websites. Although far from ideal, the RoBERTa model results after this iteration were more reasonable, indicating the importance of a larger and higher quality dataset in providing more robust answers. Another important observation made is that even with GPU acceleration, the RoBERTa Covid QA model is slow and would not be suitable for production as is. Thus, to reduce the model throughput, we implemented a retrieval-based RoBERTA model where the retriever scans through documents in the training set and returns only a small number of them that are most relevant to the query. The retrieval methods considered are: TF-IDF, Elastic Search, and DPR (all implemented using the Haystack package). However, even with the retrieval methods implemented, the model is still slower than other models and requires further optimization to be deployed in production. - -DialoGPT model is based on the GPT-2 transformer [6] which uses masked multi-head self-attention layers on the web collected data [7]. It is a small 12-layer architecture and uses byte pair encoding tokenizer [8]. The model was accessed through HuggingFace. 
It applies a similar training regime as OpenAI’s GPT-2 where conversation generation is framed as language modelling task with all previous dialog turns concatenated and ended with the end-of-text token. - -To fine-tune the pre-trained DialoGPT and RoBERTa models, we build scraper functions that collect data from the CDC and WHO FAQ pages. Our custom Covid QA dataset has 516 examples of Covid-related questions and answers and both models’ performance improves noticeably after fine-tuning them with this dataset. - -## System Evaluation - -In order to evaluate the performance of our Covidbot system, we integrated each of the 3 response generation models into the messenger platform using Dialogflow and simulated multiple user-bot interactions per session. We validated the performance of our system by testing it using different types of queries; these queries include: semantically different queries, queries with different intents (personal vs. general) as well as queries that are both implicitly and explicitly related to Covid (e.g. ”implicit” queries are “What is quarantine?”, “Are vaccines available?” vs. “explicit” queries are “What are Covid-19 symptoms?”, “What is Covid-19?”). - -We also evaluated the latency and throughput of our system in generating responses for queries with different complexity levels and also in generating responses when multiple users are using it simultaneously. - -We also asked our peers to interact with the CovidBot and give us feedback based on the bot’s responses to their queries, and they were all satisfied with the performance of our bot. They thought the answers CovidBot gave were reasonable and the only remark they made was that the bot’s responses occasionally contained incomplete sentences, which is a limitation we are aware of and will work on improving for the next iteration. - -If we had more users testing the system and we had an environment that resembles more the real-time production environment then we would also analyze some user-experience metrics (i.e. the average number of questions asked, the total number of sessions that are simultaneously active), as well as bot-quality metrics (i.e. the most frequent responses given, percentage of fallback responses where the chatbot did not know the answer to a question). We would also integrate an optional CovidBot rating feature that uses "thumbs up/down" buttons in order to allow users to rate their experience using the system at the end of each session. - -## Application Demonstration - - -![alt_text](../assets/img/demo1.png "image_tooltip") - -![alt_text](../assets/img/demo2.png "image_tooltip") - -![alt_text](../assets/img/demo3.png "image_tooltip") -*Figure 4. CovidBot Demonstration for Personal and General Covid Question-Answering* - -In terms of the core interface decisions, we chose to build a chatbot through a messenger platform as a channel. We use Dialogflow, Google’s conversational interface designer, because it allows us to seamlessly integrate our ChatBot with different, popular messenger platforms and other types of deployment methods, such as Google Assistant. - -As can be seen in Figure 4, the latest version of our CovidBot is deployed on Slack and provides a visual interface that can appear on both desktop and mobile. This allows users to easily access the CovidBot without having to open their web browser and makes their user experience smoother. We assume a good amount of users are familiar with similar interactions to the ones they have with our CovidBot. 
The bot is initialized by asking the user about the model they want to use for response generation, giving them the freedom to pick and explore the models on their own. By default or if a user inputs an invalid model name, we use the DialoGPT model. After initializing a response generating model, we begin by asking our CovidBot more general Covid questions, such as: “What are the symptoms of Covid-19?”, “Are there vaccines available?”. For all questions, we receive satisfactory and up-to-date responses as shown in Figure 4. When the CovidBot identifies a personal statement, e.g. “I have a test on the 22nd of April”, it will store it locally and reply “Noted down!”. Based on the locally stored information, the bot is capable of answering personal questions, such as “When is my next test?”. - -Given that there is already a significant amount of Covid-related news and information on the web, we believe that deploying CovidBot is essential in this ever-changing Covid-19 landscape which can (and does) become overwhelming at times for a lot of people. - -As part of this project, we have built an AI-driven bot because text generation is a difficult task especially in this context where the term “Covid” has multiple synonyms. So, given the gravity of the Covid-19 pandemic and the need for spreading accurate Covid-related information, it is highly important to build a model that is able to recognize, analyze, and understand a wide variety of responses and derive meaning from implications without relying solely on syntactic commands. - -## Reflection - -We believe that we have achieved all our major objectives with the CovidBot framework. All models trained on the dataset scraped from the WHO and CDC websites outperformed our expectations both in terms of information accuracy, and inference time. They are also efficient enough to enable regular updates/re-trainings on a daily basis as more information becomes available. Model deployment with Google’s Dialogflow to Slack was also surprisingly easy making the bot easy to share. One of the issues which should be addressed is our reliance on GPT-3 provided by the OpenAI API Beta Access to perform intent classification and personal queries handling. However, we think that training both intent classification and personal response generation shouldn’t be more challenging than the general response generation achieved with DialoGPT and RoBERTa. - -We would like to thank CS329s course staff for advice during the development of the CovidBot and for the access to the OpenAI API Beta Access. - -## Broader Impacts - -The intended uses of the CovidBot include getting the most up-to-date Covid-related news and receiving personal reminders about Covid-related matters (i.e. testing dates etc). However, we do not intend to have the CovidBot substitute doctors, which is why it is highly important for us to ensure that users understand that they should not be using the bot to seek for serious medical advice as it could have significant health consequences for the users. We have attempted to mitigate harms associated with this unintended use of the system by carefully picking the examples included in our custom Covid QA dataset, which come from trusted health organizations and agencies that also take precautions when answering FAQs in their website in order to prevent the same unintended uses as ours. 
As a concrete example, there is a publicly available dataset that includes examples of Covid-related conversations between patients and doctors, but we decided to not include it in our model fine-tuning step in order to mitigate the harms associated with having our bot respond like a doctor. - -In the future, we could perform analysis of the type of queries being inputted into our system and see if we can detect a pattern in how users interact with the bot. We could also implement features that are easy to notice (i.e. a disclaimer below the query bar) in order to remind users of the intended use cases of our CovidBot. - -## Contributions - -Andra worked on data collection and preprocessing, the RoBERTa models, and integration of models for chatbot deployment using Dialogflow. - -Bernard worked on the DialoGPT models, GPT-3 integration and CovidBot system design. - -## References - -[1] Wittbold, K., Carroll, C., Iansiti, M., Zhang, H. and Landman, A., 2021. _How Hospitals Are Using AI to Battle Covid-19_. [online] Harvard Business Review. Available at: <https://hbr.org/2020/04/how-hospitals-are-using-ai-to-battle-covid-19> [Accessed 19 March 2021]. - -[2] Möller, T., Reina, A., Jayakumar, R. and Pietsch, M., 2020, July. COVID-QA: A Question Answering Dataset for COVID-19. In _Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020_. - -[3] Lee, J., Yi, S.S., Jeong, M., Sung, M., Yoon, W., Choi, Y., Ko, M. and Kang, J., 2020. Answering questions on covid-19 in real-time. _arXiv preprint arXiv:2006.15830_. - -[4] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., 2020. Language models are few-shot learners. _arXiv preprint arXiv:2005.14165_. - -[5] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. Roberta: A robustly optimized bert pretraining approach. _arXiv preprint arXiv:1907.11692_. - -[6] Zhang, Y., Sun, S., Galley, M., Chen, Y.C., Brockett, C., Gao, X., Gao, J., Liu, J. and Dolan, B., 2019. Dialogpt: Large-scale generative pre-training for conversational response generation. _arXiv preprint arXiv:1911.00536_. - -[7] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I., 2019. Language models are unsupervised multitask learners. _OpenAI blog_, _1_(8), p.9. - -[8] Sennrich, R., Haddow, B. and Birch, A., 2015. Neural machine translation of rare words with subword units. _arXiv preprint arXiv:1508.07909_. 
- - - -## Notes - -[^1]: - https://hbr.org/2020/04/how-hospitals-are-using-ai-to-battle-covid-19 - -[^2]: - [https://github.com/deepset-ai/COVID-QA](https://github.com/deepset-ai/COVID-QA) - -[^3]: - [https://github.com/dmis-lab/covidAsk](https://github.com/dmis-lab/covidAsk) - -[^4]: - https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/ - -[^5]: - https://allenai.org/data/cord-19 - -[^6]: - https://huggingface.co/deepset/roberta-base-squad2-covid diff --git a/_posts/2021-03-18-Fact-Checking-Tool-for-Public-Health-Claims.markdown b/_posts/2021-03-18-Fact-Checking-Tool-for-Public-Health-Claims.markdown deleted file mode 100644 index 036c8af..0000000 --- a/_posts/2021-03-18-Fact-Checking-Tool-for-Public-Health-Claims.markdown +++ /dev/null @@ -1,184 +0,0 @@ ---- -layout: post -title: Fact-Checking Tool for Public Health Claims -date: 2021-03-12 13:32:20 +0700 -description: Your report description -img: # Add image post (optional) -fig-caption: # Add figcaption (optional) -tags: [Fake-News, NLP] -comments: true ---- - -### The Team -- Alex Gui -- Vivek Lam -- Sathya Chitturi - -## Problem Definition - -Due to the nature and popularity of social networking sites, misinformation can propagate rapidly leading to widespread dissemination of misleading and even harmful information. A plethora of misinformation can make it hard for the public to understand what claims hold merit and which are baseless. The process of researching and validating claims can be time-consuming and difficult, leading to many users reading articles and never validating them. To tackle this issue, we made an easy-to-use tool that will help automate fact checking of various claims focusing on the area of public health. Based on the text the user puts into the search box, our system will generate a prediction that classifies the claim as one of True, False, Mixed or Unproven. Additionally, we develop a model which matches sentences in a news article against common claims that exist in a training set of fact-checking data. Much of the prior inspiration for this work can be found in [Kotonya et al](https://arxiv.org/abs/2010.09926) where the authors generated the dataset used in this project and developed a method to evaluate the veracity of claims and corresponding explanations. With this in mind we tried to address veracity prediction and explainability in our analysis of news articles. - -## System Design - -Our system design used the following steps: 1) Development of ML models and integration with Streamlit 2) Packaging the application into a Docker container 3) Deployment of the application using Google App Engine. - -
-
- -
-
- -1. In order to allow users to have an interactive experience, we designed a web application using Streamlit for fake news detection and claim evaluation. We chose Streamlit for three primary reasons: amenability to rapid prototyping, ready integration with existing ML pipelines and clean user interface. Crucial to the interface design was allowing the users a number of different ways to interact with the platform. Here we allowed the users to either choose to enter text into text boxes directly or enter a URL from which the text could be automatically scraped using the [Beautiful Soup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/) python library. Therefore, using this design pipeline we were able to quickly get a ML-powered web-application working on a local host! -2. To begin the process of converting our locally hosted website to a searchable website, we used Docker containers. Docker is a tool that can help easily package a local project into an environment known as a container that can be run on another machine. For this project, our Docker container hosted the machine learning models, relevant training and testing data, the Streamlit application file (app.py) as well as a file named “requirements.txt” which contained a list of names of packages needed to run the application. -3. With our application packaged, we deployed our Docker container on Google App Engine using the Google Cloud SDK. Essentially, this created a CPU (or sets of CPUs) in the cloud to host the web-app and models. We opted for an auto-scaling option which means that the number of CPUs automatically scale with the number of requests. For example, many CPU cores will be used in periods of high traffic and few CPU cores will be used in periods of low traffic. Here, it is worth noting that we considered other choices for where to host the model including Amazon AWS and Heroku. We opted for Google App Engine over Heroku since we needed more than 500MB of storage; furthermore, we preferred App Engine to AWS in order to take advantage of the $300 free GCP credit! - -## Machine Learning Component - -We build our model to achieve two tasks: veracity prediction and relevant fact-check recommendation. The veracity prediction model is a classifier that takes in a text input and predicts it to be one of true, false, mixed and unproven with the corresponding probabilities. The model is trained on PUBHEALTH, an open source dataset of fact-checked health claims. The dataset contains 11.8k health claims, with the original texts and expert-annotated ground-truth labels and explanations. More details about the dataset can be found [here](https://huggingface.co/datasets/health_fact). - -We first trained a baseline LSTM (Long Short Term Memory Network), a recurrent neural network that’s widely used in text classification tasks. We fit the tokenizer and classification model from scratch using tensorflow and Keras. We trained the model for 3 epochs using an embedding dimension size of 32. With a very simple architecture, we were able to get decent results on the held-out test set (see Table 1). - -In the next iterations, we improved the baseline model leveraging state-of-the-art language model DistilBERT with the huggingface API. Compared to the LSTM which learns one-directional sequentially, BERT makes use of Transformer which encodes the entire sequence at once, and is thus able to learn word embeddings with a deeper context. 
We used a lightweight pretrained model DistilBERT (a distilled version of BERT, more details can be found [here](https://arxiv.org/abs/1910.01108) and fine-tuned it on the same training dataset. We trained the model for 5 epochs using warm up steps of 500 and weight-decay of 0.02. All prediction metrics were improved on the test set. The DistilBERT model takes 5x longer time to train, however at inference step, both models are fast in generating online predictions. - -![](../assets/img/ImagesVideos_FakeNews/bert_model.png) - -The supervised-learning approach is extremely efficient and has good precision to capture signals of misinformation. However, the end-to-end neural network is black-box in nature and the prediction is never perfect, it is very unsatisfying for users to only receive a prediction without knowing how models arrive at certain decisions. Additionally, users don’t gain new information or become better-informed from reading a single classifier result, which defeats the overall purpose of the application. Therefore, we implemented a relevant claim recommendation feature to promote explainability and trust. Based on the user input, our app would search for claims in the training data that are similar to the input sentences. This provides two additional benefits: 1) users will have proxy knowledge of what kind of signals our classifier learned 2) users can read relevant health claims that are fact-checked by reliable sources to better understand the subject-matter. - -For implementation, we encode the data on a sentence level using [Sentence-BERT](https://arxiv.org/abs/1908.10084). The top recommendations are generated by looking for the nearest neighbors in the embedding space. For each sentence in the user input, we look for most similar claims in the training dataset using cosine similarity. We returned the trigger sentence and most relevant claims with similarity scores above 0.8. - -![](../assets/img/ImagesVideos_FakeNews/embedding_model.png) - -## System Evaluation - -We conducted offline evaluation on the Pubhealth held-out test set (n = 1235). The first table showed the overall accuracy of the two models. Since our task is multi-label classification, we are interested in the performance per each class, particularly in how discriminative our model is in flagging false information. - -| Model | Accuracy (Overall) | F1 (False) | -| --------------------- | ------------------ | ---------- | -| LSTM | 0.667 | 0.635 | -| DistilBERT Fine Tuned | **0.685** | **0.674** | - -Table 1: Overall accuracy of the two models - -| Label | F1 | Precision | Recall | -| ------------------ | ----- | --------- | ------ | -| False (n = 388 ) | 0.674 | 0.683 | 0.665 | -| Mixed (n = 201) | 0.4 | 0.365 | 0.443 | -| True (n = 599 ) | 0.829 | 0.846 | 0.813 | -| Unproven (n = 47 ) | 0.286 | 0.324 | 0.255 | - -Table 2: F1, Precision and Recall per class of Fine-tuned BERT model - -Our overall accuracy isn’t amazing, but it is consistent with the results we see in the original pubhealth paper. There are several explanations: -1. Multi-label classification is inherently challenging, and since our class sizes are imbalanced, we sometimes suffer from poor performance in the minority classes. -2. Text embeddings themselves don’t provide rich enough signals to verify the veracity of content. Our model might be able to pick up certain writing styles and keywords, but they lack the power to predict things that are outside of what experts have fact-checked. -3. 
It is very hard to predict “mixed” and “unproven” (Table 2). - -However looking at the breakdown performance per class, we observe that the model did particularly well in predicting true information, meaning that most verified articles aren’t flagged as false or otherwise. This is good because it is equally damaging for the model to misclassify truthful information, and thus make users trust our platform less. It also means that if we count mixed and unproven as “potentially containing false information”, our classifier actually achieved good accuracy on a binary label prediction task (>80%). - -### Some interesting examples - -In addition to system-level evaluation, we provide some interesting instances where the model did particularly well and poorly. - -#### Case 1 (Success, DistilBERT): False information, Model prediction: mixture, p = 0.975 - -*“The notion that the cancer industry isn’t truly looking for a ‘cure’ may seem crazy to many, but the proof seems to be in the numbers. As noted by Your News Wire, if any of the existing low-cost, natural and alternative cancer treatments were ever to be approved, then the healthcare industry’s cornerstone revenue producer would vanish within months. Sadly, it doesn’t appear that big pharma would ever want that to happen. The industry seems to be what is keeping us from a cure. Lets think about how big a business cancer has become over the years. In the 1940’s, before all of the technology and innovation that we have today, just one out of every 16 people was stricken with cancer; by the 70’s, that ratio fell to 1 in 10. Today, one in two males are at risk of developing some form of cancer, and for women that ratio is one in three.”* - -This is an example of a very successful prediction. The above article leveraged correct data to draw false conclusions. For example, that cancer rate has increased is true information that was included in the training database, but the writing itself is misleading. The model did a good job of predicting mixture. - -#### Case 2 (Failure, DistilBERT): False information, Model prediction: true, p = 0.993 - -*“WUHAN, China, December 23, 2020 (LifeSiteNews) – A study of almost 10 million people in Wuhan, China, found that asymptomatic spread of COVID-19 did not occur at all, thus undermining the need for lockdowns, which are built on the premise of the virus being unwittingly spread by infectious, asymptomatic people. Published in November in the scientific journal Nature Communications, the paper was compiled by 19 scientists, mainly from the Huazhong University of Science and Technology in Wuhan, but also from scientific institutions across China as well as in the U.K. and Australia. It focused on the residents of Wuhan, ground zero for COVID-19, where 9,899,828 people took part in a screening program between May 14 and June 1, which provided clear results as to the possibility of any asymptomatic transmission of the virus.”* - -This is a case of the model failing completely. We suspect that this is because the article is written very appropriately, and quoted prestigious scientific journals, which all made the claim look legitimate. Given that there is no exact similar claim matched in the training data, the model tends to classify it as true. - -### Slice analysis - -We performed an analysis of the LSTM model performance on various testing dataset slices. 
Our rationale for doing these experiments was that the LSTM likely makes a number of predictions based on writing style or similar semantics rather than the correct content. Thus, it is very possible that an article written in a “non-standard” style but containing true information might be predicted to be False. Our slices, which included word count, percentage of punctuation, average sentence length, and date published, were intended to be style features that might help us learn more about our model’s biases. - -Here we would like to highlight an example of the difficulty in interpreting the results of a slice-based analysis for a multi-class problem. In this example, we slice the dataset by word count and create two datasets corresponding to whether the articles contain more than or less than 500 words. We found that the accuracy for the shorter articles was 0.77 while the accuracy for the longer articles was 0.60. Although this seems like a large difference in performance, there are some hidden subtleties that are worth considering further. In Table 3, we show the per-class performance for both splits as well as the number of samples in each split. Here, it is clear that the class distributions of the two datasets are quite different, making a fair comparison challenging. For instance, it is likely that we do well on the lower-split dataset because it contains a large fraction of True articles, which is the class best predicted by our model. - -

| Labels | Lower Split Accuracy | Lower Split Nsamples | Upper Split Accuracy | Upper Split Nsamples |
| -------- | -------------------- | -------------------- | -------------------- | -------------------- |
| False | 0.7320 | 97 | 0.7526 | 291 |
| Mixture | 0.2041 | 49 | 0.2368 | 152 |
| True | 0.8810 | 311 | 0.7118 | 288 |
| Unproven | 0.3636 | 11 | 0.0588 | 34 |

Table 3: Slice analysis on word count - -### Similarity Matching - -To evaluate the quality of similarity matching, one proxy is to look at the cosine similarity score of the recommended claims. Since we only returned those with similarity scores of more than 0.8, the matching results should be close to each other in the embedding space. However, it is less straightforward to evaluate the embedding quality. For the scope of this project, we did not conduct a systematic evaluation of the semantic similarities of the recommended claims. But we did observe empirically that the recommended claims are semantically relevant to the input article, although they don’t always provide a correction to false information. We provide one example in our app demonstration section. - -## Application Demonstration - -To serve users, we opted to create a web application for deployment. We settled on this choice as it enabled a highly interactive and user-friendly interface. In particular, it is easy to access the website URL via either a phone or a laptop. - -![](../assets/img/ImagesVideos_FakeNews/main_screen.png) - -There are three core tabs in our Streamlit web application: Fake News Prediction, Similarity Matching, and Testing. - -### Fake News Prediction Tab - -The Fake News Prediction tab allows the user to make predictions as to whether a news article contains false claims (“False”), true claims (“True”), claims of unknown veracity (“Unknown”), or claims which have both true and false elements (“Mixed”). 
Below, we show an example prediction on text from the following article: [Asymptomatic transmission of COVID-19 didn’t occur at all, study of 10 million finds](https://www.lifesitenews.com/news/asymptomatic-transmission-of-covid-19-didnt-occur-at-all-study-of-10-million-finds). Here, our LSTM model correctly identifies that this article contains false claims! - - -
- -
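For readers curious about what happens behind this tab, the following is a minimal sketch of the veracity-prediction step, assuming a DistilBERT checkpoint fine-tuned on PUBHEALTH has been saved locally; the `./distilbert-pubhealth` path and the label order are illustrative assumptions, not the exact artifacts used in our app.

```python
# Minimal sketch of four-way veracity prediction with a fine-tuned DistilBERT.
# The checkpoint path and label order below are assumptions for illustration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

LABELS = ["false", "mixture", "true", "unproven"]  # assumed label order

tokenizer = AutoTokenizer.from_pretrained("./distilbert-pubhealth")
model = AutoModelForSequenceClassification.from_pretrained("./distilbert-pubhealth")
model.eval()

def predict_veracity(article_text: str):
    # DistilBERT accepts at most 512 tokens, so long articles are truncated.
    inputs = tokenizer(article_text, truncation=True, max_length=512,
                       return_tensors="pt")
    with torch.no_grad():
        probs = torch.softmax(model(**inputs).logits, dim=-1).squeeze(0)
    best = int(probs.argmax())
    return LABELS[best], float(probs[best])

label, confidence = predict_veracity("A study of almost 10 million people in Wuhan ...")
print(label, round(confidence, 3))
```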
- -### Similarity Matching Tab - -The Similarity Matching Tab compares sentences in a user's input article to fact-checked claims in the PUBHEALTH fact-check dataset. Again, we allow users to enter either a URL or text. The following video demonstrates the web app's usage when provided with the URL corresponding to the article: [Study: Covid-19 deadlier than flu](https://www.pharmacytimes.com/view/study-covid-19-deadlier-than-flu). Here, it is clear that the model identifies some relevant claims made in the article, including the number of deaths from Covid-19 as well as comparisons between Covid-19 and the flu. - -
- -
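As a rough sketch of the matching logic behind this tab, article sentences are embedded with Sentence-BERT and compared against the stored claim embeddings with cosine similarity, keeping matches above the 0.8 threshold described earlier. The model name and the two example claims below are placeholders; in the app, the claims come from the PUBHEALTH training set.

```python
# Sketch of sentence-to-claim matching with Sentence-BERT; the model name and
# the two example claims are placeholders for illustration only.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

train_claims = [
    "COVID-19 is deadlier than the seasonal flu.",
    "Vitamin C cures the coronavirus.",
]
claim_embeddings = model.encode(train_claims, convert_to_tensor=True)

def match_claims(article_sentences, threshold=0.8):
    """Return (sentence, claim, score) triples whose cosine similarity clears the threshold."""
    sentence_embeddings = model.encode(article_sentences, convert_to_tensor=True)
    scores = util.cos_sim(sentence_embeddings, claim_embeddings)
    matches = []
    for i, sentence in enumerate(article_sentences):
        for j, claim in enumerate(train_claims):
            if float(scores[i][j]) >= threshold:
                matches.append((sentence, claim, float(scores[i][j])))
    return matches
```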
- -### Testing Tab - -Finally, our “Testing” tab allows users to see the impact of choosing different PUBHEALTH testing slices on the performance of the baseline LSTM model. For this tab, we allow the user to select the break point for the split. For instance, for the word count slicing type, if we select 200, this means that we create two datasets: one with only articles shorter than 200 words and another with only articles longer than 200 words. Check out the video below for a demo of slicing the dataset on the punctuation condition! - -
- -
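A minimal sketch of the word-count slicing logic is shown below, assuming the test set and the LSTM's predictions live in a pandas DataFrame; the `text`, `label`, and `pred` column names are assumptions for illustration.

```python
# Sketch of the word-count slice used by the Testing tab: split the test set at
# a user-chosen break point and report per-class accuracy for each side.
import pandas as pd

def per_class_accuracy(frame: pd.DataFrame) -> pd.Series:
    return frame.groupby("label").apply(lambda g: float((g["pred"] == g["label"]).mean()))

def slice_by_word_count(test_df: pd.DataFrame, break_point: int = 200):
    word_counts = test_df["text"].str.split().str.len()
    lower = test_df[word_counts < break_point]   # shorter articles
    upper = test_df[word_counts >= break_point]  # longer articles
    return per_class_accuracy(lower), per_class_accuracy(upper)

# Example usage: reproduce the 500-word split reported in Table 3 above.
# lower_acc, upper_acc = slice_by_word_count(test_df, break_point=500)
```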
- -## Reflection - -Overall, our project was quite successful as a proof of concept for the design of a larger ML driven fact-checking platform. We succeeded in developing two models (LSTM and DistilBERT) that can reasonably detect fake news on a wide range of user articles. We achieved promising results on a held-out testing set and found that our model was relatively stable to some common dataset slices. Furthermore, for some inputs, our Sentence-BERT was able to detect claims in the article which were similar to those contained within our training set. We were also able to allocate work and seamlessly integrate among our three team members. Although all members contributed significantly to each part of the project, Alex focused on the model training and validation while Vivek and Satya focused on the UI and deployment. Despite the successes of this project, there are several things that either don’t work or need improvement. - -One major area for improvement is on the sentence claim matcher. Currently when certain articles make claims that are outside of the distribution of the training dataset there will be no relevant training claims that can be matched to. This is inherently due to the lack of training data needed for these types of applications. To address this issue it would be useful to periodically scrape fact-checked claims and evidence from websites such as snopes to keep the database up to date and expanding. Additionally, we could incorporate user feedback in our database after being reviewed by us or an external fact-checking group. - -Another issue is that we have two separate features, one where the veracity of an article is predicted based primarily on style (LSTM and DistilBERT models), and one where we attempt to extract the content by matching with fact checked claims. An ideal model would be able to combine style and content. Additionally, the claims that we can match sentences to are limited by the data in our training set. - -Another improvement we could make pertains to the testing tab. Currently we output the per-class accuracy, but we could additionally output several figures such as histograms and confusion matrices. Better visualization will help users understand quickly how the models perform on different slices. - -## Broader Impacts - -Fake news poses a tremendous risk to the general public. With the high barrier required to fact check certain claims and articles we hope that this project will start to alleviate some of this burden from casual internet users and help people better decide what information they can trust. Although this is the intended use case of our project, we recognize that there is potential harm that can arise from the machine learning models predicting the wrong veracity for some articles. One can easily imagine that if our model predicts that an article has true information, but it is actually fake news this would only cause the user to further believe in the article. To try to mitigate this type of issue, we used the sentence claim matching algorithm where article sentences can be matched to fact-checked claims. If this approach is done correctly the user will in theory have access to training claims that are similar to those in the article and the label associated with the training claims. In addition, we chose to include a tab which showed how our model performed on different slices of the testing data. 
We believe showing this type of data to users could be a very useful tool for harm mitigation as it allows the users to more fully assess potential biases in the models. At the end of the day because these models are all imperfect we include a disclaimer that these predictions are not a substitute for professional fact-checking. - -## Contributions - -All members contributed significantly to each part of the project. Alex focused more on model training and development. Vivek and Sathya focused more on UI and deployment. We gratefully acknowledge helpful discussions and feedback from Chip Huyen and Xi Yan throughout the project! In addition, special thanks to Navya Lam and Callista Wells for helping find errors and bugs in our UI. - -## References - -We referred the the following models to guide ML model development: - -- Sanh, V., Debut, L., Chaumond, J. and Wolf, T., 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108. -- Reimers, N. and Gurevych, I., 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084. - -We used data from the following dataset: - -- Kotonya, N. and Toni, F., 2020. Explainable automated fact-checking for public health claims. arXiv preprint arXiv:2010.09926. - -We found the following tutorials and code very helpful for model deployment via Streamlit/Docker/App Engine. - -- Jesse E. Agbe (2021), GitHub repository, https://github.com/Jcharis/Streamlit_DataScience_Apps -- Daniel Bourke (2021), GitHub repository https://github.com/mrdbourke/cs329s-ml-deployment-tutorial - -Core technologies used: - -- Tensorflow, Pytorch, Keras, Streamlit, Docker, Google App Engine \ No newline at end of file diff --git a/_posts/2021-03-18-context-graph-generator.markdown b/_posts/2021-03-18-context-graph-generator.markdown deleted file mode 100644 index 11df4a1..0000000 --- a/_posts/2021-03-18-context-graph-generator.markdown +++ /dev/null @@ -1,170 +0,0 @@ ---- -layout: post -title: Building a Context Graph Generator -date: 2021-03-18 20:32:20 +0700 -description: We developed a context graph generator capable of providing visual (graphical) summaries of input text, highlighting salient concepts and their connections. -img: # Add image post (optional) -fig-caption: # Add figcaption (optional) -tags: [Graph-ML, NLP, BERT] -comments: true ---- - -### The Team -- Manan Shah -- Lauren Zhu -- Ella Hofmann-Coyle -- Blake Pagon - -## Problem Definition - - -Get this—55% of users read online articles for less than 15 seconds. The general problem of understanding large spans of content is painstaking, with no efficient solution. - -Current attempts rely on abstractive text summarization, which simply shortens text to its most relevant sentences and often obscures core components that make writing individual and meaningful. Other methods offer a smart search on articles, a sort of smart command+F. But that forces the user to not only look at only small bits and pieces of content at a time, but also know what they’re looking for ahead of time. - -What if, rather than engaging in this lengthy process of understanding content manually, people can leverage a tool that generates analytical, concept graphs over text? These graphs would provide the first-ever standardized, structured, and interpretable medium to understand and draw connections from large spans of content. 
We created such an interface where users are able to visualize and interact with the key concepts over swaths of text in a meaningful and unique way. Users can even compare and contrast concept graphs between multiple articles with intersecting concepts, to further their analysis and understanding. - -So with users spending less time on individual websites, we make it easier for them to absorb the key components of the content they want in an interactive graphical representation. Specifically, we enable readers to: -- Grasp the main concepts presented in an article -- Understand the connections between topics -- Draw insights across multiple articles fluidly and intuitively - - -## System Design - -### Graph Generation - - - -With the goal of making information digestion faster, we designed a concept graph generator that includes relevant information on a text’s main concepts and visually shows how topics relate to one another. To achieve this, our graph outputs nodes that represent article concepts, and edges that represent links between the concepts. - -We use Streamlit for the front end interface and a custom version of the Streamlit graphing libraries to display our graphs. The resulting graphs are interactive—users can move the graph and individual nodes as well as zoom in and out freely, or click a node to receive a digest of the topic in textual context. - -Additionally, we provide users with a threshold slider that allows users to decide how many nodes/connections they want their graph to provide. This customization doubles as an optimization for the shape and density of the graph. How this works is that connections between nodes are determined by a similarity score between the nodes (via cosine similarity on the word embeddings). A connection is drawn between two topics if the score is above the threshold from the slider. This means that as the slider moves further to the left, the lower threshold makes the graph generate more nodes, and the resulting graph would be more dense. - -### Working with Multiple Graphs - -Beyond generating graphs from a single text source, users can combine graphs they have previously generated to see how concepts from several articles interrelate. In the below example, we see how two related articles interact when graphed together. Here we have one from the Bitcoin Wikipedia page and the other from the Decentralized Finance page. We can distill from a quick glance that both articles discuss networking, bitcoin, privacy, blockchain and currency concepts (as indicated by green nodes), but diverge slightly as the Bitcoin article focuses on the system specification of Bitcoin and the Decentralized Finance article talks more about impacts of Bitcoin on markets. The multi-graph option allows users to not only assess the contents of several articles all at once with a single glance, but also reveals insights on larger concepts through visualizing the interconnections of the two sources. A user could use this tool to obtain a holistic view on any area of research they want to delve into. - - - -### Visual Embedding Generation - - - -An additional feature our system provides is a tool to plot topics in 2D and 3D space to provide a new way of representing topic relations. Even better, we use the Poltly library to make these plots interactive! The embedding tools simply take the embedding that corresponds to each topic node in our graph and projects it into 2D space. 
Topic clustering indicates high similarity or strong relationships between those topics, and large distances between topics indicates dissimilarity. The same logic applies to the 3D representation; we give users the ability to upgrade their 2D plots to 3D, if they're feeling especially adventurous. - -### Deployment and Caching - -We deployed our app on the Google Cloud Platform (GCP) via Docker. In particular, we sent a cloud built docker image to Google Cloud, and set up a powerful VM that launched the app from that Docker image. For any local updates to the application, redeploying was quite simple, requiring us to rebuild the image using Google Cloud Build and point the VM to the updated image. - -To speed up performance of our app, we cache graphs globally. Let’s say you are trying to graph an article about Taylor Swift’s incredible Folklore album, but another user had recently generated the same graph. Our caching system ensures that the cached graph would be quickly served instead of being re-generated, doing so by utilizing Streamlit’s global application cache. Our initial caching implementation resulted in User A’s generated and named graphs appearing in a User B’s application. To fix this, we updated each user’s session state individually instead of using one global state over all users, therefore preventing User A’s queries from interfering with User B’s experience. - -### Our Backend: Machine Learning for Concept Extraction and Graph Generation - - -Our concept graph generation pipeline is displayed in Figure 1. Users are allowed to provide either custom input (arbitrarily-formatted text) or a web URL, which we parse and extract relevant textual information from. We next generate concepts from that text using numerous concept extraction techniques, including TF-IDF and PMI-based ranking over extracted n-grams: unigrams, bigrams, and trigrams. The resulting combined topics are culled to the most relevant ones, and subsequently contextualized by sentences that contain the topics. Finally, each topic is embedded according to its relevant context, and these final embeddings are used to compute (cosine) similarities. We then define edges among topics with a high enough similarity and present these outputs as a graph visualization. Our primary machine intelligence pipelines are introduced in (1) our TF-IDF concept extraction of relevant topics from the input text and (2) our generation of BERT embeddings of each topic using the contextual information of the topic within the input text. - - - -Pipeline Illustration: A diagram of our text-to-graph pipeline, which uses machine intelligence models to extract concepts from an arbitrary input span of text. - -Our concept extraction pipeline started with the most frequent unigrams and bigrams present in the input text, but we soon realized that doing so populated our graph with meaningless words that had little to do with the article and instead represented common terms and phrases broadly used in the English language. Although taking stopwords into account and further ranking bigrams by their pointwise mutual information partially resolved this issue, we were unable to consistently obtain concepts that accurately represented the input. We properly resolved this issue by pre-processing a large Wikipedia dataset consisting of 6 million examples to extract “inverse document frequencies'' for common unigrams, bigrams, and trigrams. 
We then rank each topic according to its term frequency-inverse document frequency (TF-IDF) ratio, representing the uniqueness of the term to the given article compared to the overall frequency of the term in a representative sample of English text. TF-IDF let us properly select topics that were unique to the input documents, significantly improving our graph quality. - -To embed extracted topics, we initially used pre-trained embeddings from GloVe and word2vec. Both of these algorithms embed words using neural networks trained on context windows that place similar words close to each other in the embedding space. A limitation with these representations is that they fail to consider larger surrounding context when making predictions. This was particularly problematic for our use-case, as using pre-trained context-independent word embeddings would yield identical graphs for a set of concepts. And when we asked users to evaluate the quality of the generated graphs, the primary feedback was that the graph represented abstract connections between concepts as opposed to being drawn from the text itself. Taking this into account, we knew that the graphs we wanted should be both meaningful and specific to their input articles. - -In order to resolve this issue and generate contextually-relevant graphs, we introduced a BERT embedding model that embeds each concept along with its surrounding context, producing an embedding for each concept that was influenced by the article it was present in. Our BERT model is pre-trained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers). We used embeddings from the final layer of the BERT model—averaged across all WordPiece-split tokens describing the input concept—to create our final 1024-dimensional embeddings for each concept. We implemented caching mechanisms to ensure that identical queries would have their associated embeddings and adjacency matrices cached for future use. This improves the efficiency of the overall process and even guarantees graph generation completes in under 30 seconds for user inputs of reasonable length (it’s usually faster than that). - -## System Evaluation - -Since we are working with unstructured data and unsupervised learning, we had to be a little more creative in how we evaluated our model’s performance. To start, we created a few metrics to gather for generated graphs that would help us better quantify the performance of our system. The metrics include: - -- The time to generate a novel, uncached graph -- The number of nodes and edges generated, along with average node degree -- Ratings of pop-up context digest quality -- An graph-level label of whether it is informative overall -- The number of generated topics that provide no insight -- The number of topics that are substrings of another topic - -When designing metrics to track, our main goal was to capture the utility of our app to users. The runtime of graph generation is paramount, as users can easily grow impatient with wait times that are too long. The number of nodes shows how many topics we present to the user, the number of edges indicates how many connections our tool is able to find, and the average degree captures the synergy between those two. The pop-up context digests can either provide useful or irrelevant additional information. Having a general sense for the overall quality of information in graphs is important to note. 
Nodes generated based on irrelevant topics waste users’ time, so we want to minimize that. Similarly, nodes with topics that are substrings of other topics in the graph are also unwanted, as they indicate redundant information in our graph. - -With our metrics defined, we began generating graphs and annotating them by hand. We found that the average graph takes 20.4 seconds to generate and has 13.27 nodes and 13.55 edges, leading to an average node degree of 1.02. Overall, we are happy with the graph generation time that we measured — 20 seconds is a reasonable expectation for our users, especially considering we are not currently using a GPU. On average, we found that the graphs were informative 68% of the time. The times that they were not were caused either by too high of a threshold or poor topic generation. In particular, we noticed that performance was poor on articles that covered many different areas of a topic, such as articles discussing matchup predictions for March Madness. While the overarching theme of college basketball was the main focus of those articles, they discussed many different teams, which led the model to have a tough time parsing out common threads, such as the importance of an efficient offense and lockdown defense on good teams, throughout the article. - -Our default graph generation uses a threshold of 0.65 for the cosine similarity between topics to form an edge between them. For reference, we also tested our graph generation with thresholds of 0.6 and 0.7 for the edge cosine similarity and found that they yielded an average node degree of 1.71 and 0.81, respectively. An average node degree of 1.71 is too high and floods the user with many frivolous connections between topics. An average node degree of 0.81, on the other hand, doesn’t show enough of the connections that actually exist between topics. Therefore, a threshold of 0.65, with an average node degree of 1.02, provides a nice balance between topics presented and the connections between them. - -As for the errors we were scanning for, we found that on average, 12.33% of nodes in every graph were topics that added nothing and 17.81% of nodes were simply substrings of another topic in the graph. Therefore, about 69.86% of the nodes that we present to users are actually relevant. This tells us that users on our site may spend some time sifting through irrelevant topics, which we hope to improve in the future. We additionally rated the quality of the contextual information displayed in each node’s pop-up digest window, and found that (on a scale of 0-1) our ratings averaged 0.71. This was largely caused by lack of sufficient filtering applied to the sentences displayed. Filtering and curation heuristics for these digests is another potential area of growth. - -## Application Demonstration - -Interested users should visit our application at [this URL](http://104.199.124.26/), where they are presented with a clean, simple, and intuitive interface that allows them to either (a) input custom text to graph, (b) input a web URL to graph, or (c) generate a combined graph from two of their previously saved graphs. Upon entering custom text or a web URL, users are shown a progress bar estimating the time of graph generation (or an instant graph if the query has been cached from previous uses of the website). Generated graphs are interactive, allowing users to click on nodes to see the context in which they appear in the input document. 
We also present other modes of visualization, including 2D and 3D PCA-based embeddings of the concepts, which provide a different perspective on the relationships between concepts. Users can also save graphs locally (to their browser cache) and subsequently combine them to link concepts together across numerous distinct documents. - -Our team chose to use a web interface as it proved to be the most intuitive and straightforward way for users to provide custom input and interact with the produced graphs. We implemented our own customizations to the Streamlit default graphing library (in [this fork](https://github.com/mananshah99/streamlit-agraph)) to enable enhanced interactivity, and we employed Streamlit to ensure a seamless development process between our Python backend and the frontend interface. - -Watch us demo our platform and [give it a try](http://104.199.124.26/)! - - - -## Reflection - -### What worked well? -We had a rewarding and exciting experience this quarter building out this application. From day one, we were all sold on the context graph idea and committed a lot of time and energy to it. We are so happy with the outcome that we want to continue working on it next quarter. We will soon reach out to some of the judges that were present at the demo but didn't make it to our room. - -While nontrivial, building out the application was a smooth process for several reasons: sound technical decisions, our use of Streamlit, and great camaraderie. Let's break these down. - -Our topic retrieval process is quite simple: we take the highest-frequency n-grams (n = 1, 2, 3), weighted by PMI scores. TF-IDF was a good addition to topic filtering (the graphs were more robust as a result), but because it was slower, we exposed it to the user as a checkbox option. Sentence/context retrieval required carefully designed regular expressions, but proved to work very efficiently once properly implemented. We then had to shape the topics and contexts correctly and, after passing them through BERT, compute cosine similarities. For displaying graphs, we utilized a Streamlit component called streamlit-agraph. While it had all the basic functionality we needed, there were things we wanted to add on top of it (e.g. clicking on nodes to display context), which required forking the repo and making custom changes on our end. - -Due to the nature of our project, it was feasible to build out the MVP on Streamlit and iterate with small performance improvements and new features. This made individual contributions easy to line up with Git issues and to execute on different branches. It also helped that we already had great camaraderie, as we all met in Stanford's study abroad program in Florence in 2019. - -### What didn't work as well? -To be honest, nothing crazy here. We had some obscure bugs from BERT embeddings that would occur rarely but at random, as well as graph generation bugs if inputs were too small. We got around them with try/catch blocks, but could have looked into them with a little more attention. - -### If we had unlimited time & unlimited resources... -Among the four of us, we made our best guesses as to what the best features would be for our application. Of course, if time permitted, we could conduct serious user research about what people are looking for, and we could build exactly that. But apart from that, there are concrete action items moving forward, discussed below.
- -We wanted to create a tutorial or some guides, either on the website itself or in another easily accessible location. This may no longer be necessary because we can point to the tutorial provided in this blog post (see Application Demonstration). We saw in many cases that, without any context about what our application does, users may not know what our app is for or how to get the most out of it. - -On top of this, future work includes adding a better URL (e.g. contextgraph.io), building a Chrome extension, building more fluid topic digests in our pop-ups, and submitting a pull request to the streamlit-agraph component with our added functionality—in theory we could then deploy this for free via Streamlit. - -## Broader Impacts -**Context Graph Generator Impacts**: - -- **Summarization**: Our ***intuitive interface*** combined with ***robust graph generation*** enables users to understand large bodies of text at a glance. - -- **Textual Insights**: The extensive features we offer, from multi-graphing to TF-IDF topic generation to per-node context summarization, enable users to ***generate analysis and insights*** for their inquiries on the fly. - -Our aim in creating this tool is to help individuals obtain the information they need with ease, so they are empowered to achieve their goals at work or in their personal lives. Whether the user has to synthesize large amounts of information for their business or simply seeks to stay informed on a busy schedule, our tool is here to help! - -When considering the ethical implications of such a tool, it becomes apparent that while a context graph largely impacts users positively, it is important to consider how it could become a weapon of misinformation. When a user provides text for the graph generator to analyze, we do not perform fact checking of the provided text. We believe this is reasonable considering that our platform is an analysis tool. Additionally, because we operate only within our site and graphs are not shareable, there is no possibility of a generated graph object being shared to inform others (one could take a screenshot of the graph; however, most detailed information is embedded in the nodes' pop-ups). If we were to make graphs shareable or integrate our tool into other platforms, we would run the risk of becoming a vector for misinformation, since users could share graphs built from unverified text that others then digest quickly. As we continue to work on our platform, we will keep this scenario top of mind and work to find ways to prevent such an outcome. - - -## Contributions -- Blake: Worked on generating PCA projection plots from embeddings, saving graphs, graph combination, and Streamlit UI. -- Ella: Worked on graph topic generation (primarily TF-IDF & data processing), reducing skew in embeddings of overlapping topics, and Streamlit UI. -- Lauren: Worked on graph topic generation, GCP deployment, and Streamlit UI. -- Manan: Worked on graph topic generation, embedding and overall graph generation, streamlit-agraph customization for node popup context digests, and Streamlit UI. - -## References - - -Our system was built using [Streamlit](https://streamlit.io), Plotly, and [HuggingFace's BERT model](https://huggingface.co/transformers/model_doc/bert.html). To deploy our system, we used Docker and GCP. - -We utilized the [Tensorflow Wikipedia English Dataset](https://www.tensorflow.org/datasets/catalog/wikipedia) for IDF preprocessing as well.
\ No newline at end of file diff --git a/_posts/2021-03-18-stylify.markdown b/_posts/2021-03-18-stylify.markdown deleted file mode 100644 index 3b3c306..0000000 --- a/_posts/2021-03-18-stylify.markdown +++ /dev/null @@ -1,157 +0,0 @@ ---- -layout: post -title: Stylify -date: 2021-03-18 20:13:20 -0700 -description: Stylify -img: # Add image post (optional) -fig-caption: # Add figcaption (optional) -tags: [Edge-ML] -comments: true ---- - -### The Team -- Daniel Tan -- Noah Jacobson -- Jiatu Liu - -## Problem Definition - -Where photography is limited by its reliance on reality, art transcends the confines of the physical world. At the same time, art is a highly technical skill, relying on fine motor control and knowledge of materials to put one’s mental image to paper. Professional artists must study for years and hone their technical skills through repeated projects. As a result, while everyone has the capacity for creativity, far fewer have the ability to materialize their imagination on the canvas. - -We seek to bridge this gap with Stylify, our application for neural style transfer. With our app, anyone can create original art by combining the characteristics of multiple existing images. Everybody has an innate aesthetic sense; a latent capacity for art. Our lofty ambition is to allow the human mind to fully embrace its creativity by giving form to abstract thought. - -Implementation-wise, our application is backed by a state-of-the-art neural style transfer model. Whereas previous style transfer models are trained to learn a finite number of styles, our model is able to extract the style of arbitrary images, providing true creative freedom to end users. Our model is also extremely efficient. Where previous style transfer models arrive at the output by iterated minimization, requiring many forward passes, our model requires only a single forward pass through the architecture, saving time and compute resources. The fully-convolutional architecture also means that images of any size can be stylified. - -## System Design - -Our system consists of 4 main components: - -1. A ReactJS frontend, which provides the interface for end users to interact with our application -2. An API endpoint deployed using Sagemaker. It provides access to the style transfer model, which accepts batches of content and style images and returns an image with the content of the former and the style of the latter -3. A Python middleman server, which performs preprocessing on the inputs sent by the frontend, and queries the model through the endpoint. In order to support video input, the server unpacks videos into component frames and makes a batch query to the endpoint. -4. An S3 bucket which is used for temporary file storage as well as logging of metrics - -Our system is summarized below. - -
- -The frontend is a web application, enabling access by a wide variety of devices such as smartphones, tablets, and computers. It is implemented using ReactJS + TypeScript, which is the modern industry standard for frontend web development. Additionally, Material-UI is used as it provides access to many off-the-shelf UI components. The frontend is deployed and managed through AWS Amplify, which transparently hosts the frontend server on AWS. It also automatically deploys the frontend server when it detects code changes in the repo. - -The endpoint is deployed using AWS Sagemaker. Compared to manually spinning up a virtual machine to deploy our model, Sagemaker is superior as it automatically handles increasing scale by spawning new instances to perform inference as necessary. Behind the scenes, it also takes care of many minor implementation details such as automatically restarting instances when they fail. - -The server is implemented in Python. Python was chosen because of its wide support for efficient image processing (OpenCV) as well as support for an efficient asynchronous server (FastAPI). The availability of off-the-shelf components minimized development time spent on the server. The server code makes heavy use of asynchronous directives (AsyncIO), as synchronous code would lead to large amounts of idle CPU time spent waiting for network communications. The server is deployed on AWS ECS with the Fargate launch type. - -While images are passed between the frontend and server as a stream of raw bytes, the server sends requests to the endpoint in the form of a list of URLs to the content and style images, and the endpoint replies with a list of output URLs. All URLs point to an S3 bucket where images are stored temporarily before being destroyed. We store images in S3 buckets rather than sending entire images to the endpoint because the endpoint is not able to handle requests with images in them as those requests would be too long. Using S3 uploads, we are able to process hundreds of images in a single request. Furthermore, fast access to S3 improves the speed of our application as the content and style images can be uploaded to S3 before prediction time and pulled from S3 rapidly. - -Lastly, the server maintains a running average of latency for the last 64 queries to the API endpoint and logs this data in an external S3 bucket to facilitate manual inspection. - -## Machine Learning Component - -Our architecture takes as input a single _content image_, a single _style image_, and performs a single forward pass over the inputs, producing an _output image_ with the content of the former and the style of the latter. The architecture is summarized below. - -
- -## Architecture - -At a high level, both the style and the content images are passed through a fully-convolutional VGG19 _encoder_ to extract high-level _embeddings_. Each embedding can be thought of as a 3D grid or _feature map_ with multiple layers. Each vertical slice of the grid forms a 2D grid of numbers _(a channel)_ that represents high-level information about the image, such as whether it contains a face. - -Stylification happens through autoencoding and adaptive instance normalization. After being forward-passed through the VGG encoder, the embeddings of the content image are transformed linearly so that _the mean and standard deviation of each channel match those of the style embedding_. Concretely, we have:
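Written out, this is the adaptive instance normalization (AdaIN) operation from [1], restated here for reference, where $x$ is the content embedding, $y$ is the style embedding, and $\mu$ and $\sigma$ are the per-channel mean and standard deviation:

$$ \mathrm{AdaIN}(x, y) = \sigma(y)\,\frac{x - \mu(x)}{\sigma(x)} + \mu(y) $$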
- -The result of this is an _output embedding_ which captures both the spatial structure (content) of the content image and the texture (style) of the style image. This embedding is then converted back into a visual image through a fully-deconvolutional _decoder_, producing the stylified output. - -## Model Training - -The architecture is trained to perform style transfer by optimizing a combination of a _content loss_ L_c and a _style loss_ L_s. - -The content loss measures the extent to which the output image preserves the content of the content image, while the style loss measures the extent to which the style of the output image, as represented by its Gram matrix, matches the style of the style image. By optimizing both of these objectives, our network is able to achieve strong results quickly. - -## Model Development - -We did not do any training or fine-tuning ourselves, finding it sufficient to use a pretrained model implemented in PyTorch. The model was trained using MSCOCO for content images and a Wikiart-derived dataset for style images. Further implementation details can be found in the training code, which is open-source. - -We spent a significant portion of time searching for the fastest neural style transfer implementation in the world. The first implementations of NST required backpropagation to be performed for several iterations for every content and style image uploaded to the network. This was incredibly time-consuming and made NST impractical for public use. Later implementations of NST focused on increasing the speed of the algorithm by developing pretrained networks for specific style images. Networks were developed that required only one forward pass for NST to be performed, but these forward passes were only capable of transferring one particular style onto a content image. Now, there exist algorithms like the one above that can perform style transfer between an arbitrary content and style image in a single forward pass. These networks use state-of-the-art advancements such as autoencoders and GANs. Generally speaking, training different networks for different functions (i.e. encoding style, transferring content) leads to the development of highly capable networks that can "learn" from an input in a single forward pass. - -## System Evaluation - -Because of the qualitative nature of the domain, it is difficult to evaluate the quality of model output in any objectively meaningful sense. As a result, we have chosen to eschew model evaluation in favour of evaluating the application latency. - -We measure latency by making 15 requests to our endpoint under various settings and recording the time taken to receive a response; we report the mean and standard deviation for each configuration. We also include a comparison to a similar publicly available endpoint [here](https://deepai.org/machine-learning-model/fast-style-transfer). Much like our model, that endpoint supports arbitrary user-supplied content and style images. Results are shown in the graph below. - -![Results](../assets/img/stylify/results.png) - -## Application Demonstration -Link: [https://youtu.be/71R0xiD_TmI](https://youtu.be/71R0xiD_TmI) - -Given that the app is meant for users to upload and stylify their own images, a GUI makes much more sense than an API, since the majority of users are not going to know how to upload images through an API. - -To use the app: - -1. Go to [https://www.stylify.net/](https://www.stylify.net/).
Behold the slick and modern UI, and click on "Edit your photo now", which will take you to a new screen. -2. Click on "Add image" and drop the image or video into the dropzone popup. Click "Submit". -3. Click on "Add filter", and then either click on the first item in the menu popup to upload your own filter, or choose a predefined filter from the rest of the menu. -4. The image will be automatically stylified once the filter is submitted. The "Download" button will be enabled after processing finishes. Click on it to download the stylified image. -5. Click on "Do again" if you want to stylify another image. - -## Reflection - -### What Worked -1. CI/CD is great for projects like this. All of our deployment pipelines are automated via AWS Amplify and GitHub Actions, which means all we need to do to get our changes into production is merge the code into the `main` branch. -2. Logging is a must. Your local environment is going to be different from your production environment, and without logging it's almost impossible to figure out why something is working locally but not on the cloud. -3. FastAPI provided a simple, efficient asynchronous server, which helped minimize development time. Python was a good choice for the server due to the abundance of image processing libraries as well as the AWS SDK. - -### What Didn't Work So Well -1. We struggled to integrate our pretrained models with SageMaker. The first pretrained model we got functioning was not built in a TensorFlow version compatible with SageMaker, and we had to change libraries entirely. -2. We also spent many hours trying to deploy a pretrained PyTorch model through SageMaker's Jupyter Notebook instances, but we were not able to successfully deploy the model until we changed approaches and uploaded the model through the AWS console. -3. The S3 upload ended up adding more latency than we expected. After some internal testing, we found that the model inference time is less than 1s, but our application can take up to 15s to process a video, largely due to multiple S3 uploads and downloads. - -### Future Improvements -1. Process more frames from video inputs. Due to modeling and response time constraints, we are only able to process 8 frames of a video input. We could potentially make video processing asynchronous if the input exceeds a certain size, and notify the client to come back and download the stylified video after it has been processed. -2. UI-wise, allow users to pick a different filter after the image has been styled, without having to go through the entire flow again. -3. Instead of displaying just the name, also display a small thumbnail of what each style image looks like. -4. Build a "premium" version of the app that allows users to upload more/larger photos and videos in order to make up for the larger compute cost. -5. Train our own smaller NST model to increase our speed. -6. Parallelize video input processing across several GPU instances. - -## Broader Impacts - -Our application enables the world to transform videos using NST for the first time. This creative outlet increases the autonomy of the general non-technical population in a way that has not been possible before. Our team is not sure of all the potential implications of this technological shift, but we imagine that the general public will be able to use this technology to create all sorts of abstract creations.
- -Neural style transfer creates incredibly compelling and beautiful images; there is potential for a whole new area of art to develop based on the application of algorithms to raw inputs, and we view Stylify as an essential tool for these next-generation artists. Up until the invention of Stylify, the public has not been able to utilize the full potential of NST, as transforming a content image would be too tedious and take too long. Now, the public will truly be able to unleash the power of the NST algorithm, and the implications could be world-changing; we anticipate that our tool could provide lasting value to a nontrivial fraction of artists and ordinary people. Photographers could give their pictures a sense of the surreal; Instagram users could create custom filters for bespoke images; budding artists could visualize art drawn in the characteristic style of the historical greats. - -It is possible that our application might be used for forgery, seeking to pass off a stylified image as original work by a historical artist. However, considering that technology exists to detect deep fakes, we are confident that state-of-the-art detection algorithms will be able to classify our images as artificial. Furthermore, despite the efficacy of our model, most humans will be able to see that NST images are not originals. - -## References - -[1] X. Huang and S. Belongie. "Arbitrary Style Transfer in Real-time with Adaptive Instance Normalization," in ICCV, 2017. - -[2] Pretrained PyTorch style transfer model, adapted from: [https://github.com/naoto0804/pytorch-AdaIN](https://github.com/naoto0804/pytorch-AdaIN) - -[3] MSCOCO dataset, available at [https://cocodataset.org/#overview](https://cocodataset.org/#overview) - -[4] Wikiart, a compendium of artistic images used for styling: [https://www.kaggle.com/c/painter-by-numbers/data](https://www.kaggle.com/c/painter-by-numbers/data) - -[5] FastAPI: [https://fastapi.tiangolo.com/](https://fastapi.tiangolo.com/) - -[6] React: [https://reactjs.org/](https://reactjs.org/) - -[7] Material UI: [https://material-ui.com/](https://material-ui.com/) diff --git a/_posts/2021-03-18-tender-recipe-recommendations.markdown deleted file mode 100644 index b712bd1..0000000 --- a/_posts/2021-03-18-tender-recipe-recommendations.markdown +++ /dev/null @@ -1,175 +0,0 @@ ---- -layout: post -title: Tender Matching People to Recipes -date: 2021-03-18 13:32:20 +0700 -description: Your report description -img: # Add image post (optional) -fig-caption: # Add figcaption (optional) -tags: [RecSys] -comments: true --- - -### The Team -- Justin Xu: [justinx@stanford.edu](mailto:justinx@stanford.edu) -- Makena Low: [makenal@stanford.edu](mailto:makenal@stanford.edu) -- Joshua Dong: [kindled@stanford.edu](mailto:kindled@stanford.edu) - -GitHub Repo: [https://github.com/justinxu421/recipe_rex](https://github.com/justinxu421/recipe_rex) - - -# Problem Definition - -You're at the grocery store, looking through the shelves of copious options, unsure of what you'd actually enjoy cooking or eating. You have a sense of what you're craving, but a Google search for your latest craving can take an hour and many tabs to surface the recipes that hit it. Your cravings also keep changing. New Year's resolutions for healthy food turn into longing for quick meals, which cascade into a phase of comfort food. Discovering new food takes a lot of effort. - - -## Introducing Tender - -Our app is like Tinder.
But instead of matching people to people, it matches people to recipes, with a similar focus on simplicity. When you first open the app, you're shown the bios of 4 recipes: their profile photos, names, and who they are. After 10 rounds of choices, our app recommends the best matches to your craving from over 2000 recipes curated from blogs covering many traditional Asian recipes and fusion concepts. - -As you choose, we guess your preferences for meats and starches and show them to you in graphs at the bottom of the screen. This helps narrow down our search for the recipes most similar to the ones you chose. And if you're in the mood for desserts or sides instead of a main dish, you can choose to explore those options in the sidebar. - -In the next sections, we'll describe how we evaluated our app, how we designed the algorithms under the hood, related work, and reflections on our work. - -## Related Work - -Recommendations are a hot topic today, powering apps like Spotify and TikTok, which rely on quickly changing trends and massive amounts of input data and user information to build out their algorithms. As a result, many of these algorithms didn't seem to satisfy the restrictions of our use case, as they relied on repeat user experiences and massive amounts of data. - -However, inspiration for the algorithm we utilized for this project was partially taken from the original 2010 YouTube recommendation paper [1]. That paper described a two-step recommendation algorithm in which the first step generates candidates based on a user's recent activity and the second step ranks them based on user preferences, which seemed like a general, interpretable approach for our problem. - -## Video Demo - -![alt_text](../finalvideo.gif "gif") - -## System Evaluation - -To validate the quality of our recommendations, we compare them against a random sample of recipes, and the user has to choose which they prefer the most. The percentage of their choices that fall on our recommendations is our validation score. If it's 50%, we're doing no better than random, giving us a natural baseline.
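As a minimal sketch (illustrative names, not our actual code), the validation score is just the fraction of head-to-head choices in which the user picks one of our recommended recipes over a randomly sampled one:

```python
def validation_score(user_choices):
    """user_choices: list of booleans, True when the user picked the
    recommended recipe over the random one in that round."""
    if not user_choices:
        return 0.0
    return sum(user_choices) / len(user_choices)

# A score near 0.5 means we are doing no better than random;
# anything meaningfully higher suggests the recommendations match the craving.
print(validation_score([True, False, True, True]))  # 0.75
```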
- -To perform a slice-based analysis of our recommendations on different types of cravings, we designed 8 cravings for soupy and stir-fry main dishes. We then asked 10 young adults in their early 20s to choose a craving and use the app. We intentionally left the cravings up to interpretation and gave users the freedom to choose recipes.
- -Across our 54 user tests, our recommendation system achieved a **validation score of 68%, beating our baseline by 18%**. We scored **67%** across soupy dishes and **71%** across stir-fry dishes. Looking at the breakdown of scores, we find performance on vegetable soup and vegetarian stir-fry to be close to average, a surprise since these are the least represented dishes from our recipe websites. - -![alt_text](../assets/img/image6.png "image_tooltip") - - -# System Design - -Our modeling relied on libraries like FastText [2], scikit-learn [3], Pandas [4], and NumPy [5]. We chose Streamlit [6] for the front end and deployment of our app to keep our codebase in Python and for faster iteration. Below is a simple diagram of our algorithms. We’ll step through each part in the next sections. - -
- -## Feature Engineering our Recipe Embeddings - -Using [recipe-scraper](https://github.com/hhursev/recipe-scrapers/) [7], we found the following features for all our recipes. - -
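As a rough sketch of this step (assuming the library's `scrape_me` helper; the exact fields we keep and the availability of nutrition data vary by site), pulling the raw features for a single recipe looks roughly like this:

```python
from recipe_scrapers import scrape_me

def scrape_recipe(url):
    """Pull the raw fields we use for one recipe page (illustrative only)."""
    scraper = scrape_me(url)
    return {
        "title": scraper.title(),
        "ingredients": scraper.ingredients(),    # list of ingredient strings
        "instructions": scraper.instructions(),
        "total_time": scraper.total_time(),      # minutes
        "nutrients": scraper.nutrients(),        # e.g. fat, protein, carbs, sodium, sugar
    }

# Example with a hypothetical URL:
# features = scrape_recipe("https://example-recipe-blog.com/soy-garlic-chicken")
```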
- -We curated recipes for main dishes, desserts, and sides, with recipe counts of 1737, 362, and 221, respectively. For each recipe, we created a joint embedding of the nutrition and ingredients. - -Our nutrition embedding gave each recipe a binary label for exceeding the 75th percentile of fat, protein, carbohydrate, sodium, or sugar content across recipes. - -Our ingredients embedding was created in a few steps. First, unigrams and bigrams were extracted from every list of ingredients. Bigrams captured terms like "soy sauce" and "chicken breast". Among the 989 n-grams that occurred over 20 times, 359 ingredient n-grams were manually labeled and kept. Each of these n-grams was mapped to a 300-dimensional embedding using a pretrained FastText language model. FastText [2] forms word embeddings by averaging subword embeddings, which allows it to generalize to unseen vocabulary in our ingredient n-grams, unlike Word2Vec [11]. To create a sentence embedding from all the ingredients of a recipe, we took an inverse-frequency-weighted average of the individual ingredient embeddings based on the smooth inverse frequency method introduced by Arora et al. [8]. Compared to using SentenceBERT [12], this better takes advantage of our domain-specific ingredient frequencies. - -To create a joint embedding with a balanced influence from the nutrition and ingredients, we projected our ingredient embeddings into the subspace formed by their first 5 principal components, which explained 49% of the variance. Extending to 10 principal components would have explained an additional 12% of the variance. - -To evaluate the semantic understanding represented by our principal components, we examined how cuisine clusters along the first two principal components. The plot below on the left includes Chinese cuisine, our dominant class, in bright red. To better visualize the clustering of our minority classes, we show the same graph excluding Chinese cuisine on the right. Without explicitly including cuisine in our embedding, we find that it keeps similar cuisines close to each other while also capturing intra-cuisine variance. This supports our hypothesis that our embedding incorporates semantic understanding. - -![alt_text](../assets/img/image8.png "image_tooltip") - -Our final joint embedding was a 5-dimensional ingredient embedding stacked on a 5-dimensional nutrition embedding. - -## Designing our Recommendation Algorithms - -The main driver of our recommendation engine was a k-nearest-neighbor [13] recommendation system. Given that our dataset was relatively small, at around 2000 recipes, a nearest-neighbor approach using cosine distance seemed to make the most sense for finding similar recipes. - -To make it a true machine learning application, our app needed to learn user preferences as it proceeded! To do this, we generated a coarser, more interpretable labeling system for every recipe to capture some taste preferences a user might have coming in. The two main categories we selected were **meat** and **starch**. These two categories were chosen because users may have dietary restrictions. The labels were generated through keyword matching on titles and ingredient lists. - -![alt_text](../assets/img/image9.png "image_tooltip") - -We see that these categories are well distributed across our ingredient embedding space. - -![alt_text](../assets/img/image10.png "image_tooltip") - -Given these taste labels, we can then restructure our search problem as a multi-armed bandit problem [10].
The goal of the algorithm is to design a sampling procedure over the arms that estimates, with high accuracy, the expected payout of each arm. In this problem, our "arms" are the individual's taste preferences, and the payout is the probability, given all choices so far, that the individual will select a particular preference. Since our hypothesis is that users come into our app with a particular taste preference in mind, they will likely select recipes matching that preference.
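As a rough sketch of this formulation (illustrative names only, using the upper-confidence-bound rule discussed below), each taste preference is an arm whose payout we estimate from the user's picks:

```python
import math

class PreferenceBandit:
    """Toy multi-armed bandit over taste preferences (e.g. meats, starches)."""

    def __init__(self, preferences):
        self.counts = {p: 0 for p in preferences}   # times each arm was tried
        self.wins = {p: 0 for p in preferences}     # times the user picked it

    def update(self, preference, chosen):
        self.counts[preference] += 1
        self.wins[preference] += int(chosen)

    def select_arm(self):
        """Upper-confidence-bound rule: try unexplored arms first, then
        favor arms whose estimated payout plus uncertainty is highest."""
        total = sum(self.counts.values())
        best, best_ucb = None, -1.0
        for p, n in self.counts.items():
            if n == 0:
                return p  # unexplored arm: maximal uncertainty
            mean = self.wins[p] / n
            ucb = mean + math.sqrt(2 * math.log(total) / n)
            if ucb > best_ucb:
                best, best_ucb = p, ucb
        return best
```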
- -One solution to the bandit problem with optimal regret bounds (roughly, the amortized long-term deviation from the optimal strategy) is the UCB (Upper Confidence Bound) [9] algorithm, which selects the arm with the highest upper bound of the confidence interval on its payout. - -This approach balances exploration and exploitation: in the beginning, the algorithm selects taste preferences that have not been tried yet (due to their high variance), but as the user proceeds, it starts to recommend more recipes matching the user's preferences (the exploitation phase), as it gains higher confidence in the value of those preferences. - - -# Reflections - -Overall, the project experience was very positive. Our team's general dynamic and workflow worked well. Sometimes it was a little difficult to divide up work, as some next steps were conditional on the previous part, so it was hard to parallelize, especially in a remote situation. On the other hand, since this project was a full application, we were able to divide the workflow into the general "front end" vs. "back end" aspects we had to handle, switching off who was working on what at different stages, and we were still able to build team camaraderie by pair programming too. - -The tech stack that we decided on in the beginning also worked out well, since everything was able to be in Python and easily integrated together. We had to pivot a couple of times in terms of what we were designing, especially moving away from black-box approaches and toward interpretable methods and UX considerations. - -If given more time, we would try to incorporate more features to create a richer embedding space, in combination with more recipes in our database, to generate more personalized recommendations. On the engineering side, we'd also try to fully deploy our app, incorporating database storage and user memory in order to preserve information across multiple uses of our app. This would also enable many more machine learning features, including collecting labeled data as we log user interactions and gathering information for personalization. Streamlit's public deployment also wasn't able to handle multiple users using the app at the same time because of shared state space. We'd probably want to migrate our tech stack to something more robust, as well as provide more flexibility in terms of the UX design. - -We were not super ambitious about the technology we used, so we'd also like to incorporate some of the concepts we learned in class, like online learning and edge computing, and set up a general DevOps workflow (maybe if we turn our app into a startup)! - - -# Broader Impact - -We see an app like this lowering the activation energy for young adults in a hurry to plan out meals they'll enjoy cooking and eating. Instead of many searches and open tabs to gather together a few options that satisfy one's craving, they can come to an app like ours. - -One audience we have had a challenge serving is people with dietary restrictions. For example, an early version of our app had a difficult time distinguishing red meat from non-red meat. Using filters learned by our UCB algorithm and sourcing more recipes that are kosher and vegetarian has helped. Our app could unintentionally exclude guests with dietary restrictions from the tables of users who come to use it often.
- -We attempted to combat this problem by being mindful of selecting recipes from a variety of sources and cuisines, including many vegetarian dishes + a variety of meats / taste profiles. However, naturally, our dataset is still heavily Chinese/Korean/Japanese skewed due to the popularity of East Asian cuisine. - - -# References - - - -1. Davidson, James & Liebald, Benjamin & Liu, Junning & Nandy, Palash & Vleet, Taylor & Gargi, Ullas & Gupta, Sujoy & He, Yu & Lambert, Michel & Livingston, Blake & Sampath, Dasarathi. (2010). [The YouTube video recommendation system](https://dl.acm.org/doi/10.1145/1864708.1864770) -2. P. Bojanowski*, E. Grave*, A. Joulin, T. Mikolov, [Enriching Word Vectors with Subword Information](https://arxiv.org/abs/1607.04606) -3. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. [Scikit-learn: Machine Learning in Python](https://scikit-learn.org/), Journal of Machine Learning Research, **12**, 2825-2830 (2011) [6.](#smartreference=gfugt2howdq8) -4. Wes McKinney. [Data Structures for Statistical Computing in Python](http://conference.scipy.org/proceedings/scipy2010/pdfs/mckinney.pdf), Proceedings of the 9th Python in Science Conference, 51-56 (2010) -5. Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke & Travis E. Oliphant. [Array programming with NumPy](https://www.nature.com/articles/s41586-020-2649-2), Nature -6. Ashish Shukla, Charly Wargnier, Christian Klose, [Fanilo Andrianasolo](https://discuss.streamlit.io/u/andfanilo/summary), [Jesse Agbemabiase](https://discuss.streamlit.io/u/Jesse_JCharis/summary), [Johannes Rieke](https://discuss.streamlit.io/u/jrieke/summary), [José Manuel Nápoles](https://discuss.streamlit.io/u/napoles3d/summary), [Tyler Richards](https://discuss.streamlit.io/u/Tyler/summary) [Streamlit](https://streamlit.io/) -7. recipe-scraper [https://github.com/hhursev/recipe-scrapers/](https://github.com/hhursev/recipe-scrapers/) -8. Sanjeev Arora and Yingyu Liang and Tengyu Ma. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICML 2017 -9. [Sébastien Bubeck](https://dblp.uni-trier.de/pid/35/4292.html), [Nicolò Cesa-Bianchi](https://dblp.uni-trier.de/pid/c/NicoloCesaBianchi.html): [Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problem](https://arxiv.org/abs/1204.5721)s. [CoRR abs/1204.5721](https://dblp.uni-trier.de/db/journals/corr/corr1204.html#abs-1204-5721) (2012) -10. Auer, Peter, et al. "[The nonstochastic multiarmed bandit problem](https://epubs.siam.org/doi/pdf/10.1137/S0097539701398375)." _SIAM journal on computing_ 32.1 (2002): 48-77. -11. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." _arXiv preprint arXiv:1301.3781_ (2013). -12. Reimers, Nils, and Iryna Gurevych. "Sentence-bert: Sentence embeddings using siamese bert-networks." _arXiv preprint arXiv:1908.10084_ (2019). -13. Cunningham, Padraig, and Sarah Jane Delany. "k-Nearest Neighbour Classifiers--." 
_arXiv preprint arXiv:2004.04523_ (2020). \ No newline at end of file diff --git a/_posts/2021-03-18-virufy-cloud.markdown b/_posts/2021-03-18-virufy-cloud.markdown deleted file mode 100644 index d164447..0000000 --- a/_posts/2021-03-18-virufy-cloud.markdown +++ /dev/null @@ -1,311 +0,0 @@ ---- -layout: post -title: Virufy Asymptomatic COVID-19 Detection - Cloud Solution -authors: Taiwo Alabi, Alex Li, Chloe He, Ishan Shah -date: 2021-03-18 13:32:20 +0700 -description: CS 329S Final Project -img: -fig-caption: # Add figcaption (optional) -tags: [Cloud-ML] -comments: true ---- - -### The Team -- Taiwo Alabi -- Alex Li -- Chloe He -- Ishan Shah - -## I. Problem Definition - -By March 2021, the SARS-CoV-2 virus has infected nearly 120 million people worldwide and claimed more than 2.65 million lives [1]. Moreover, a large percentage of cases were never diagnosed because of hospital overflows and asymptomatic carriers. According to a recent study published in the JAMA Network, an estimated 59% of all COVID-19 transmissions may be attributed to people without symptoms, including 35% who unknowingly spread the virus before showing symptoms and 24% who never experience symptoms [2]. Therefore, in order to prevent further spread of the virus, it’s crucial to have screening tools that are not only fast, accurate, and scalable, but also accessible and affordable to the general population. However, such tools do not currently exist. - -Since spring 2020, AI researchers have started exploring the use of machine learning algorithms to detect COVID in a cough. Researchers at MIT and the University of Oklahoma believe that the asymptomatic cases might not be “truly asymptomatic,” and that using signal processing and machine learning methods, we may be able to extract subtle features in cough sounds which are indistinguishable to the human ear [3]. In the past year, there have been a number of related projects around the world: AI4Covid-19 at the University of Oklahoma, Cough against COVID-19 at Wadhwani AI, Opensigma at MIT, Saama AI research, among others. - -However, existing cough prediction projects have varying performances and often require high-quality data because the models were trained on audio samples that were recorded in clinical settings and appropriately preprocessed [4]. Some models do not aim only at COVID detection but at all respiratory conditions, which makes it harder to balance between different performance metrics and therefore unsuitable to the needs of minimizing false negatives for the purpose of COVID prevention. These challenges motivated our project, as we hope to build a cloud-computing system better suited for detecting COVID in various types of cough samples and a prescreening tool that is easily accessible, free for all, and produces nearly instantaneous results. - -## II. System design - -Virufy cloud needed an online prediction machine learning system with a low-latency inference capability. We also needed to comply with HIPAA privacy rules with regards to health data collection and sharing. - -Hence the machine learning system that we designed is hosted in the cloud on a beefy EC2-t3-instance with GPU acceleration. An elastic IP address was assigned to the EC2 instance and the main app was served at port 8080. A DNS name rapahelalabi.com was used to redirect all traffic to the elastic IP address through the open port. - -To comply with HIPAA privacy rules, we decided not to provide the option for users to enter personal information. 
This ensured complete anonymization of the entire process, since the user's data, a waveform .wav file, is run through the inference engine and subsequently not stored anywhere in the pipeline.
- -The data flow diagram for the system is shown above, with the DNS forwarding traffic to the EC2 instance. The EC2 instance runs three processes to reduce latency: - -1. It converts the waveform (.wav file) to a mel-frequency spectrogram and mel-frequency cepstral coefficients (MFCCs). -2. It incorporates a pre-trained XGBoost model from COUGHVID to help validate that there is an actual cough sound in the waveform file. -3. It uses the inference model to estimate the probability that the mel-frequency spectrogram and MFCCs contain COVID sound biomarkers. - -These three processes run asynchronously, and the current latency is ~2 seconds from uploading a cough sound to getting a positive or negative result. - -The system also has an automated model deployment script that can automate deployment to an Ubuntu deep learning AMI image with only one line of code. The automated script makes deployment much easier by taking care of all dependencies and co-dependencies. In addition, we also have an automated model validation script that can evaluate the performance of many models and report their specificity and sensitivity for COVID-19 using a customized dataset that is also downloaded onto the EC2 instance and kept in the repo. - -We needed a t3 instance with GPU acceleration because the core of our inference engine uses a convolutional neural network that can be accelerated with a GPU. We also decided to separate the inference step from the pre-processing and input data validation steps to ensure modularity and error tracking. - -The machine learning system we built also has an error-tracking log file on the server that can be used to debug the system when necessary. By incorporating error logging, automated model evaluation and validation, automated model deployment, and model inference, we have built and demonstrated a well-rounded system that can serve users from around the world at low latency. In addition, the model evaluation allows for continuous integration and deployment (CI/CD), since it allows uploading many models and evaluating them in the cloud, enabling an almost seamless switch from one inference algorithm to another. - -The system currently faces a couple of flaws in production. The first is susceptibility to attacks: the URL to our EC2 instance is public and we made the port open to the entire world. Although this made it easy to deploy and serve the model, it also exposes us to DoS attacks. - -The second is that the system is currently not horizontally scalable. To enable horizontal scaling using a load balancer on AWS, we would need to integrate AWS Elastic Beanstalk. - -## III. Machine Learning - -We started out with the hypothesis that cough sounds from COVID-19-positive carriers can be differentiated from cough sounds from unaffected people. We pre-processed cough recordings from two open-source COVID-19-related datasets, Coswara [5] and COUGHVID [6]. Extracted features include the recording waveform, age, and gender. We took all positive samples and randomly selected subsets of negative samples from the datasets to compensate for class imbalance. We also tried taking all samples and assigning different class weight combinations, but this approach did not perform as well. - -Mel-frequency cepstral coefficients (MFCCs) and mel-frequency spectrograms [7] have been used to extract audio signatures from cough recordings.
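As a rough illustration of this kind of feature extraction (a sketch using the librosa library; the exact parameters in our pipeline may differ):

```python
import librosa
import numpy as np

def extract_features(wav_path, n_mfcc=39, n_mels=64):
    """Extract MFCCs and a log-scaled mel-spectrogram from a cough recording."""
    y, sr = librosa.load(wav_path, sr=None)

    # 39 MFCC coefficients, averaged over time, for the dense branch.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    mfcc_features = mfcc.mean(axis=1)

    # Log-scaled mel-spectrogram (n_mels x time frames) for the convolutional branch.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels)
    mel_db = librosa.power_to_db(mel, ref=np.max)

    return mfcc_features, mel_db
```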
Our main approach is to build two branches of the modelling pipeline that handle these engineered features separately and are then merged for a single binary classification task. - -The MFCCs yield 39 numerical coefficients, for which we built a two-layer dense model. The spectrograms are in image format (64x64x3), for which an ImageNet-style transfer learning approach can be applied. We tried numerous models pre-trained on ImageNet, including ResNet50, ResNet101, InceptionV2, and DenseNet121; ResNet50 performed the best. The output of the pre-trained base model is passed to a global average pooling layer, a dense layer, and a dropout layer. We merged the outputs of the two-layer dense model for the MFCCs and the convolutional neural network, and passed the merged output through two further layers with a shrinking number of nodes. The final output is a single node with a sigmoid activation function. - -Alternatively, we tried automatic neural architecture search using AutoKeras to systematically test other architectures. However, we did not achieve the same level of test set performance as with the hand-built architecture described in the previous paragraph. - -The dataset was randomly shuffled and split into 75% training, 15% validation and 15% test set. During training, we grid-searched different optimizers and determined that Adam works best. After training, we found the best cut-off for binary classification using Youden's J statistic (sensitivity + specificity - 1) [7]. - -## IV. System evaluation -
| Gender | # samples | Accuracy | Weighted F1 | Sensitivity | Specificity |
| --- | --- | --- | --- | --- | --- |
| Female | 864 | 0.7049 | 0.73 | 0.93 | 0.63 |
| Male | 1,968 | 0.6951 | 0.74 | 0.91 | 0.65 |

| Age group | # samples | Accuracy | Weighted F1 | Sensitivity | Specificity |
| --- | --- | --- | --- | --- | --- |
| <= 20 | 288 | 0.7222 | 0.74 | 0.93 | 0.64 |
| 21-40 | 1,680 | 0.7065 | 0.75 | 0.96 | 0.66 |
| 41-60 | 576 | 0.6867 | 0.73 | 0.88 | 0.65 |
| > 60 | 48 | 0.7292 | 0.65 | 1 | 0.67 |
Table 1: Slice-based analysis results using an optimized cutoff of 0.012245 across a) gender groups and b) age groups. - -We obtained the cutoff based on AUC analysis on the test set. At this threshold, the test set performance achieves 79.71% sensitivity and 49.20% specificity. The FDA guidance on COVID-19 testing solutions explicitly mentions that sensitivity and specificity are the main metrics [8]. We believe that sensitivity is the most important metric of success for a screening tool, as it measures how many actual COVID-positives were captured by the model. Achieving approximately 80% sensitivity shows that we can correctly identify those carrying the COVID-19 virus with substantial success. Admittedly, we did not achieve great specificity: 49% specificity means that there is roughly one false positive for every positive prediction. However, from a public health perspective, we think it is far more costly for a restaurant to admit infected customers than to send more people to PCR tests than necessary. - -We performed slice-based analysis across different age and gender groups in order to evaluate the performance of our model and address model weaknesses. We ran inference on the entire dataset. Using an optimized cutoff of 0.012245, we found that the model achieved almost exactly the same accuracies and F1 scores among the male and female populations. Individuals between the ages of 21 and 40 make up most of the population from which the cough samples were crowdsourced; despite large differences in the number of samples across age groups, the model was as accurate in the 21-40 age group as in the >60 age group. These results demonstrate that the cough signatures are generalizable across different gender and age groups, and that the model is not biased towards any gender or age group. - -Separately, we also created an automated system evaluation that can provide analysis of multiple models, as well as their inference latencies, right within the production environment. These scripts are written in both Python and Bash. Using the automated script, we were able to capture a 2x increase in latency when moving from a tri-layer CNN architecture to a ResNet50 architecture. However, what we gave up in latency we more than made up for in the specificity and sensitivity of the algorithm: the ResNet50 model achieves a sensitivity score of 0.79 and a specificity score of 0.9 in production, while the tri-layer CNN architecture has a sensitivity score of 0.6 and a specificity score of 0.59. This analysis was performed using a separate holdout dataset that was manually culled and curated for sound veracity and clarity. - -In addition, we evaluated the performance of the pretrained XGBoost cough validation classifier, which has some drawbacks. Specifically, the algorithm tends to misclassify audio recordings that have low-pitch or quiet coughs as non-cough files. - -## V. Application demonstration - -We chose to make this a web application running on AWS and routed to Taiwo's URL (raphaelalabi.com) to keep it simple for users and give them their result within a couple of clicks. The feature set of the web app is fairly simple: it consists of uploading a .wav file in the browser and clicking the "Process" button. The app can reject the input due to a wrong file format, audio outside the length boundaries (0.5-30 seconds), or the audio not being detected as a cough, and it outputs the appropriate error message (image below).
- -If the input is accepted, you will reach either a "positive" or a "negative" landing page after a few seconds, based on a fixed threshold determined by model evaluation; in both cases the page returns the probability that you are an asymptomatic COVID carrier along with general guidelines. We decided to have only two landing pages, as we did not feel confident setting more thresholds based on the limited evidence from our model evaluation. - -### Instructions and Images: - -1. Navigate to raphaelalabi.com -2. Upload a .wav file from your local directories and click "Process": - -![alt_text](../assets/img/virufy_cloud/image2.png "image_tooltip") - -That's it! The possible errors mentioned above display a message like this and ask you to re-upload: - -![alt_text](../assets/img/virufy_cloud/image3.png "image_tooltip") -![alt_text](../assets/img/virufy_cloud/image4.png "image_tooltip") - -If the model successfully processed the data, you will get one of the following landing pages specifying a "positive" or "negative" result with disclaimers and guidance: - -![alt_text](../assets/img/virufy_cloud/image5.png "image_tooltip") - -## VI. Reflection - -We believe that the infrastructure we built our system on worked well given the team members' varying skill sets. AWS was a great fit because its deep learning EC2 instances come preloaded with Anaconda and the other Linux tooling required for deployment of our application, cutting out the time-consuming step of installing them and properly configuring their paths. Furthermore, Taiwo, who has more experience with the platform, deployed the app through his root account and created IAM accounts so the rest of us could easily access the same resources. - -Another success was keeping the code concise by properly compartmentalizing it. Essentially, we pull the .wav file through a simple API call, run it through a preprocessing function to featurize it and verify that it is a cough, then run inference through the model, which is loaded from an HDF5 file and trained separately from the system's codebase. This setup allowed us to iterate on our system and use Git with fewer roadblocks. - -In general, our team communicated effectively over Slack and had an effective division of labor, as we consistently listed the remaining action items and assigned them. However, we could have met on Zoom more and learned about each other's components in more depth, as we spent a fair bit of time in the chat playing catch-up. - -The most obvious drawback of our current system is the need for the file to be .wav, which often requires a user to manually convert the audio on a third-party website. Given a little more time, we would probably have solidified the functionality of recording within the application and/or accepted and internally converted audio files of other formats. - -A more subtle yet significant limitation comes from the data utilized to train and evaluate our model. The coughs could come from symptomatic carriers, not just asymptomatic ones, diluting our metrics. After manually listening to positive waveforms from Coswara, we could not tell whether some were asymptomatic forced coughs or naturally occurring coughs from patients. We realized we cannot solve the challenge of differentiating the two because we do not have ground truth or curated datasets for both categories. - -With more time and resources, the first critical component to improve on would be model performance, prioritizing sensitivity.
We could only train a few architectures and hyperparameter combinations on a limited dataset, so we would want to expand on that with more research and compute power. Also, we would learn the ins and outs of audio data and its different features to expand the pre-processing code, and, relatedly, implement segmentation methods to reduce noise in the input. The second would be to improve the general user experience. For example, if a positive result is returned, we should return a basic analysis of the waveform explaining the model's "decision", and possibly route the user to a PCR test based on their current location. - -We are operating under the umbrella of the larger Virufy non-profit, and hope portions of our work can be adapted into their codebase. Some of our team members are thinking about continuing to work on Virufy, and hope to see it succeed with the continued development of new features and more accurate models. - -## VII. Broader Impacts - -This application is intended as a potentially fast and accurate COVID detection tool that uses an unforced cough waveform from an individual. We could see it used as a screening tool in airports, hospitals and other health institutions, care-taker homes, etc. The algorithm will come in handy in places where fast COVID-19 screening keeps regular traffic flowing while still checking that those entering the institution are not asymptomatic carriers. - -A potential harm associated with using this machine learning system is that a person with common viral pneumonia or a bad case of the flu could also be labelled by the algorithm as an asymptomatic carrier. The algorithm has not been properly calibrated on users who have the flu, pneumonia, or other respiratory conditions but do not have COVID. Our belief is that such individuals may also carry the vocal bio-marker for COVID that the model has learned, and thus be classified as COVID-positive. - -Lastly, our system is designed as a prescreening tool and not as a comprehensive test that would replace regular PCR or rapid testing procedures. We intended to make this as clear as possible by providing warnings and reminders on our web UI. Moreover, because the test is not 100% accurate and we do expect to see some false negatives after deployment (even though we try to minimize this as much as possible), we heavily emphasize the need to continue to follow public health guidelines and quarantine procedures on our results page. For individuals who receive positive predictions, we prompt them to get a more reliable test (such as PCR) as soon as possible. - -## VIII. Contributions - -**Taiwo** - -* Wrote the front-end interface in bootstrapped HTML/JavaScript wired to the Python backend. -* Wrote the deployment scripts. -* Wrote the automated testing scripts. -* Wrote the general framework of the API for pre-processing and inference. -* Engineered the use of AWS (EC2) and an Elastic IP address for the on-cloud prediction. -* Worked on data pre-processing for COUGHVID and re-wrote the initial baseline algorithm that gave the team a first look at model performance. -* Worked on the initial model with multi-band CNN and DNN. - -**Chloe** - -* Deployed the model on EC2 and set up serving on AWS and routing to a custom domain (through Namecheap). -* Designed the web UI (front-end). -* Prepared the workshop presentation and final presentation. - -**Alex** - -* Conducted deep and detailed analysis of model training and development.
The output model was used in final presentation. -* Defined the cut-off threshold for the ResNet machine learning model, -* Did slice based analysis to evaluate model performance on different age and gender -* Did initial exploratory work with Sagemaker and GCP AI platform in terms of model hosting. -* Made MVP demo slides and presentation video - -**Ishan** - -* Integrated the cough validation XGBoost model into our codebase and verified its compatibility with our existing system on the EC2 instance -* Added functionalities like checking the length of the input sound -* Made UI modifications needed for final product -* Prepared appropriate examples and conducted MVP and final demos - -## GitHub Repo URL: - -The URL to the github repo with all the code is: [https://github.com/taiworaph/covid_cough](https://github.com/taiworaph/covid_cough) - - -## References - -[1] COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). (n.d.). Retrieved March 17, 2021, from https://coronavirus.jhu.edu/map.html - -[2] Johansson MA, Quandelacy TM, Kada S, et al. SARS-CoV-2 Transmission From People Without COVID-19 Symptoms. JAMA Netw Open. 2021;4(1):e2035057. doi:10.1001/jamanetworkopen.2020.35057 - -[3] Scudellari, M. (2020, November 4). AI Recognizes COVID-19 in the Sound of a Cough. Retrieved March 17, 2021, from https://spectrum.ieee.org/the-human-os/artificial-intelligence/medical-ai/ai-recognizes-covid-19-in-the-sound-of-a-cough - -[4] Fakhry, Ahmed, et al. "Virufy: A Multi-Branch Deep Learning Network for Automated Detection of COVID-19." arXiv preprint arXiv:2103.01806 (2021). - -[5] Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, and Sriram Ganapathy. Coswara – A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis. arXiv:2005.10548 [cs, eess], August 2020. URL http://arxiv.org/abs/2005. 10548. arXiv: 2005.10548. - -[6] Lara Orlandic, Tomas Teijeiro, and David Atienza. The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms. arXiv:2009.11644 [cs, eess], September 2020. URL http: //arxiv.org/abs/2009.11644. arXiv: 2009.11644. - -[7] Brownlee, J. (2021, January 04). A gentle introduction to threshold-moving for imbalanced classification. Retrieved March 17, 2021, from https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/ - -[8] Center for Devices and Radiological Health. (n.d.). EUA Authorized Serology Test Performance. 
Retrieved March 17, 2021, from [https://www.fda.gov/medical-devices/coronavirus-disease-2019-covid-19-emergency-use-authorizations-medical-devices/eua-authorized-serology-test-performance](https://www.fda.gov/medical-devices/coronavirus-disease-2019-covid-19-emergency-use-authorizations-medical-devices/eua-authorized-serology-test-performance) - diff --git a/_posts/2021-03-18-virufy-on-device-detection-for-covid-19.markdown b/_posts/2021-03-18-virufy-on-device-detection-for-covid-19.markdown deleted file mode 100644 index f92a326..0000000 --- a/_posts/2021-03-18-virufy-on-device-detection-for-covid-19.markdown +++ /dev/null @@ -1,211 +0,0 @@ ---- -layout: post -title: Virufy on-Device Detection for COVID-19 -authors: Taiwo Alabi, Alex Li, Chloe He, Ishan Shah -date: 2021-03-19 13:32:20 +0700 -description: CS 329S Final Project -img: -fig-caption: # Add figcaption (optional) -tags: [Edge-ML] -comments: true ---- - -### The Team -- [Solomon Kim](https://www.linkedin.com/in/solomon-kim-7199011a6/) -- [Vivian Chen](https://www.linkedin.com/in/vivianschen/) -- [Daniel Tan](https://www.linkedin.com/in/daniel-tan-a0672b163/) -- [Amil Khanzada](http://www.amilkhanzada.com/) - - -# **Problem Description** - -COVID-19 testing is inadequate, especially in developing countries. Testing is scarce, requires trained nurses with costly equipment, and is expensive, limiting how many people can obtain their results. Also, many people in developing countries cannot risk taking tests because results are not anonymous, and a positive result may mean a loss of day-to-day work income and starvation for their families, which further allows COVID-19 to spread. - -Numerous attempts have been made to solve this problem with partial success, including contact tracing apps which have not been widely adopted often due to privacy concerns. Pharmaceutical companies have also fast-tracked development of vaccines, but they still will not be widely available in developing countries for some time. - -To combat these problems, we propose a free smartphone app to detect COVID-19 from cough recordings through machine learning analysis of audio signals, which would allow for mass-scale testing and could effectively stop the spread of the virus. - -We decided to use offline edge prediction for our app for several reasons. Especially in developing countries, Internet connectivity / latency is limited and people often face censoring. Data privacy regulations such as GDPR are now commonplace and on-device prediction will allow for diagnoses without personal information or health data crossing borders. Because our app will potentially serve billions of predictions daily, edge prediction is also more cost-effective, as maintaining and scaling cloud infrastructure to serve all of these predictions will be costly and difficult to maintain. - - -# System Design - -In designing our system and pipeline, we first and foremost kept in mind that this pipeline would be running offline on edge devices in developing countries, including outdated phones with weak CPUs. We aimed for a pipeline that could efficiently process data, run a simple model, and return a prediction within a minute. To do this, we simplified our model, sacrificing some “expressiveness” in exchange for reduced complexity, but also through straightforward preprocessing of data. - -For the frontend, we decided on a web app because it can be used in the browser, which is operating-system-agnostic; in comparison, apps may only run on certain operating systems. 
Our frontend is written in [ReactJS + TypeScript](https://www.typescriptlang.org/docs/handbook/react.html), which is the industry standard for modern web design. It employs responsive web design principles to be compatible with a wide range of screen sizes and aspect ratios present on different devices. Internally, the frontend calls a [TensorFlow.js](https://www.tensorflow.org/js) (TFJS) model for inference. - - -
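
The report does not include the export step itself, but the Python-side conversion of a trained Keras model into the TFJS LayersModel format that the front end loads (described further under Machine Learning Component below) might look roughly like the following sketch; the file names and the trained checkpoint are illustrative assumptions rather than the project's actual artifacts.

```python
# Sketch (assumption): exporting a trained Keras cough/noise classifier to the
# TFJS LayersModel format consumed by the web front end.
import tensorflow as tf
import tensorflowjs as tfjs

# Hypothetical checkpoint standing in for the real trained model.
model = tf.keras.models.load_model("cough_classifier.h5")

# Writes model.json (architecture) plus one or more .bin weight shards.
tfjs.converters.save_keras_model(model, "tfjs_model")
```

On the browser side, the front end would then fetch the generated `model.json` (for example via TFJS's `tf.loadLayersModel`) before running inference on-device.
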
-
- -
-
-
-
-We chose to use the [TensorFlow.js](https://www.tensorflow.org/js) (TFJS) framework because it is supported for use with web browsers. The TFJS [Speech Command](https://github.com/tensorflow/tfjs-models/tree/master/speech-commands) library provides a JavaScript implementation of the Fourier Transform (Browser FFT) to allow straightforward preprocessing of the raw audio files. We trained a vanilla TensorFlow model on background noise examples provided by the sample TFJS Speech Commands code, along with a dataset of thousands of coughs labeled with COVID-19 test results, so that our model could distinguish coughs from background noise. We then converted this trained model into the TFJS LayersModel format (with the model architecture as a JSON and weights in .bin files), so that we could integrate it into the front end JavaScript code for browser inference on-device.
-
-Our system’s basic pipeline is as follows:
-
-1. User opens our app
-2. The TFJS models are downloaded from S3 onto the user’s device
-3. Microphone detects noise from user
-4. The Speech Commands library continuously preprocesses the audio by creating audio spectrograms
-5. The spectrograms are run through the model
-6. Only if the audio snippet is classified as a cough does the user receive a prediction of whether they are COVID positive or negative
-
-It is worth noting that model files are downloaded and loaded into memory only when the user first opens the app. After this, no Internet access is required and the system is able to make predictions offline.
-
-
-# Machine Learning Component
-
-The model that powers our application is based on the publicly available TensorFlow.js [Speech Commands](https://github.com/tensorflow/tfjs-models/tree/master/speech-commands) module. Our model is intended to be used with the WebAudio API supported by all major browsers and expects, as input, audio data preprocessed with the browser Fast Fourier Transform (FFT) used in WebAudio’s GetFloatFrequencyData. The result of this FFT is a spectrogram which represents a sound wave as a linear combination of single-frequency components. The spectrogram, which can be thought of as a 2D image, is passed through a convolutional architecture to obtain logits which can be used in multiclass prediction. Specifically, this model has 13 layers with four pairs of Conv2D to MaxPooling layers, two dropout layers, a flatten layer, and two dense layers.
-
-
-![alt_text](../assets/img/virufy-device-detection/virufy-image-2.png "image_tooltip")
-
-
-Because training from scratch is expensive, we started with a model trained using the Speech Commands dataset [1], trained to recognize 20 common words such as the numbers “one” to “ten”, the four directions “left”, “right”, “up”, “down”, and basic commands like “stop” and “go”. We performed transfer learning on this model by removing the prediction layer and initializing a new one with the correct number of prediction classes. Afterwards, we fine-tuned the weights on the open source [COUGHVID dataset](https://zenodo.org/record/4498364#.YFQG6a9Kh3g), which provides over 20,000 crowdsourced cough recordings covering a wide range of characteristics including gender, geographic location, age, and COVID status.
-
-To ensure that data is preprocessed in the same way during training and testing, we use a custom preprocessing TensorFlow model which is trained to emulate the browser FFT that is performed by WebAudio, producing the same spectrogram as output. 
This browser FFT emulating model is provided and maintained by TensorFlow Speech Commands. Creating our own training pipeline allowed us to select our model architecture based on existing ongoing research efforts and fine-tune our hyperparameters.
-
-
-# System Evaluation
-
-Offline evaluation was done on our model as a quick way to ensure it was working correctly. This meant setting aside 30% of our data as test data. To monitor offline testing, we used [Weights and Biases](https://wandb.ai/site). As shown below, 50 epochs were sufficient to achieve convergence in training and validation accuracies, with corresponding decreasing losses. Here is an example of what we logged:
-
-![alt_text](../assets/img/virufy-device-detection/virufy-image-3.png "image_tooltip")
-
-As demonstrated by the graphs as well as the chart, the “loss” (the loss calculated on our training set) was 0.1717, while the “val_loss” (the loss calculated on the testing set) was 0.09781. Similarly, the “acc” (the accuracy on the training set) was 0.93298, while the “val_acc” (the accuracy on the testing set) was 0.96875. Additionally, we evaluated the model before and after TFJS conversion and found that the accuracy as well as the loss on both the training and testing sets were the same. This was important because we were initially concerned that the conversion process would degrade the quality of our model, and we were delightfully surprised that this did not occur.
-
-The remainder of our evaluation was done through real-world testing. Although the gold standard would be large-scale, randomized clinical trials with data collected from a variety of demographic groups and recording devices, we did not have the time and resources to do that within the constraints of the class. Instead, we did informal evaluations on our own team members and friends in Argentina and Brazil.
-
-Anecdotally, the prediction was highly accurate on our group members, who were primarily Asian and all healthy. This remained true across a variety of devices such as smartphones, laptops, and tablets.
-
-The collection of external results was complicated by ethical considerations and lack of access to PCR tests to provide ground-truth labels. Nonetheless, we will note here two cases in Brazil. One individual had recovered from a previous COVID-19 diagnosis; the model predicted that he had COVID-19. The other individual had COVID-19 but was predicted to be healthy. This illustrates the inherent challenge of translating models from development to production; model accuracy might be highly degraded due to distribution shift between the training and inference data.
-
-
-# Application Demonstration
-
-In the beginning stages of the design process, prior to this course, the Virufy product designer determined the appropriate target audience by conducting user interviews. She selected potential interviewee candidates based on certain demographic criteria such as being a citizen of selected Latin American countries or being tech-savvy and owning a cell phone.
-
-After gathering target audience candidates from six Latin American countries as well as the U.S. and Pakistan, user interviews were conducted. The results from the interviews were then synthesized to create user personas. These personas helped her produce empathetic and user-centered designs throughout the whole design process. 
- -![alt_text](../assets/img/virufy-device-detection/virufy-image-4.png "image_tooltip") - - -Once initial ideation and designs were completed, the designer conducted a series of prototype user tests in which the user was observed as they walked themselves through the app mockup. The data from each user test was then synthesized to design new and improved iterations. After numerous user tests and iterations and evolving, the designer created a mockup of the demonstration application. - -Over the past month, we worked with the Virufy designer to adapt the design to our specific user needs given our novel contribution towards edge prediction. Through discussions with hospitals and normal users, alongside the technical limitations of TensorFlow.js, we finalized on our below design in which the user could click the microphone to trigger our model execution. We made the instructions simple and easy to follow, so users could record their cough and immediately get their prediction with our edge model which performed very fast (under 200ms on our laptops). - -
- -
- - -# Reflections - -Throughout the course of this 2-month project, we explored many areas technically, some of which were fruitful, and others of which were dead ends. - - - -1. **Google’s Teachable Machine** - - At the start of our project, we used the MFCC and mel-spectrogram audio features in our models based on state-of-the-art research, but ran into issues as the same preprocessing code was not supported on-device with TFJS. We reached out to [Pete Warden](https://petewarden.com/), an expert of [TinyML](https://www.tinyml.org/) on Google’s TensorFlow team, who pointed us to Teachable Machine, a web-based tool to create models, which uses TensorFlow.js to train models and generates code to integrate into JavaScript front ends. Although very simple and lightweight, we soon discovered [Teachable Machine](https://teachablemachine.withgoogle.com/) was not a feasible long-term solution for us, as it required manual recording and upload of training audio files, while also not providing us the flexibility to configure model architecture as we hoped to do. This ultimately forced us to train our own custom model. - -2. **Speech Commands Library** - - TensorFlow’s [Speech Commands](https://www.tensorflow.org/tutorials/audio/simple_audio) library provided a simple API to access a variety of important features like segmenting the continuous audio stream into one-second snippets and performing FFT feature extraction to obtain spectrograms. The availability of pre-existing training pipelines as well as example applications using Speech Commands provided a strong foundation for us to adapt our own pipeline and frontend application. - -3. **Team Dynamics** - - We compartmentalized responsibilities such that individual members were largely in charge of separate components of the system. Frequent communication via Slack was key to ensure that we all had a sense of the bigger picture. - - -Overall, we learned over the quarter how to integrate frontend and backend codebases to build a production machine learning system, while utilizing APIs and libraries to expedite the process. Our knowledge also broadened as we considered the unique challenges of developing models for CPU-bound edge devices in the audio analysis domain. - -Continuing beyond this course, we would like to explore the following areas: - - - -1. **Model Performance** - - State-of-the-art [research papers](https://virufy.org/paper) suggest that accuracies as high as 98% are possible for larger neural networks. We would like to tune our tiny edge models to perform at similar accuracies. - -2. **Dataset Diversity** - - Our model development was limited by the lack of access to large-scale, demographically diverse, and accurately labelled datasets. For next steps, we hope to remedy this by leveraging the [Coswara dataset](https://github.com/iiscleap/Coswara-Data), along with the larger datasets Virufy is collecting globally. - -3. **Microphone Calibration** - - We didn’t take into account the distribution shift between training and inference due to differences of microphone hardware specifications between edge devices. - -4. **Audio Compression** - - The audio samples we trained on were of similar audio formats and frequencies. Exploring the effect of audio compression codecs such as mp3 on model performance may lead to interesting insights. - -5. **Expansion to More Diseases** - - COVID-19 is not the only disease that affects patient cough sounds. 
We believe our model can be enhanced to distinguish between various other respiratory illnesses such as the common cold and flu, as well as asthma and pneumonia, through use of a multi-class classifier.
-
-6. **Embedded Hardware**
-
-    An interesting area to explore is further shrinking our model to fit onto specialized embedded devices with microphones. Such devices could be cheaply produced and shipped globally to provide COVID detection without smartphones.
-
-
-
-# Broader Impacts
-
-Our app is intended to be used by people in developing countries who need an anonymous solution for testing anytime, or by anyone in a community at risk of COVID-19. However, we have identified some unintended uses of our app.
-
-Because we intend to share our technology freely and because the algorithm runs on-device, competitors will easily be able to take our algorithm and create copies of our app, and may even block access to our app and sell theirs for profit. To prevent this, we will open source our technology under terms requiring attribution to Virufy and prohibiting charging users for the use of the algorithm.
-
-Another risk is that people may begin to ignore medical advice, believe only in the algorithm, and use the results in place of an actual diagnostic test. This is very risky because if the algorithm mispredicts, we may be held liable. The spread of COVID-19 may increase if COVID-19 positive people feel confident socializing on the strength of false negative results. To mitigate this, we intend to add disclaimers that our app is a pre-screening tool that should be used only in conjunction with medical providers’ input. Additionally, we will work closely with public health authorities to clinically validate our algorithm and ensure it is safe for use. 
-
- -
- -
- -
-
- - -People may also start testing the algorithm with irrelevant recordings of random noises such as talking. To address this, we have equipped our algorithm with a cough detection pre-check layer to prevent any non-cough noises from being classified. - -Finally, people especially in poorer contexts may share the same smartphones with several users, which can increase the likelihood of spreading COVID-19. Thus, our instructions clearly state that users must disinfect their device and keep 20 feet away from others while recording. - - -# Code - -Our TensorFlow JavaScript audio preprocessing and model prediction code can be found here: [https://github.com/dtch1997/virufy-tm-cough-app](https://github.com/dtch1997/virufy-tm-cough-app) - -Our finalized progressive web application code can be found here: [https://github.com/virufy/demo/tree/edge-xoor](https://github.com/virufy/demo/tree/edge-xoor) - - -# References - -We’re extremely grateful to [Pete Warden](https://petewarden.com/), [Jason Mayes](http://www.jasonmayes.com/), and [Tiezhen Wang](https://www.linkedin.com/in/tiezhen/) from Google’s TensorFlow.js team for their kind guidance on TinyML concepts and usage of the [speech_commands library](https://www.tensorflow.org/datasets/catalog/speech_commands), both in class lecture and during the few weeks of our development. - -[Jonatan Jaskilioff](https://www.linkedin.com/in/jonatan-jaskilioff-77075340/) and the team at [XOOR](https://xoor.io/) were very gracious to lend their support and guidance in integrating our JavaScript code into the [progressive web app](https://virufy.org/demo) they had built pro bono for Virufy. - -We are also indebted to the broader [Virufy](http://virufy.org/) team for guiding us on the real-world applicability and challenges of our edge device prediction project. We leveraged their deep insights from their members distributed across 20 developing countries in formulating our problem statement. Additionally, we built on top of the open source [demo app](https://virufy.org/demo) that they had built prior based on intentions for real-life usage, along with their prior [research findings](https://virufy.org/paper) and [open source code](https://github.com/virufy/covid) for our model training. - -In preparing our final report, we are grateful to [Colleen Wang](https://www.linkedin.com/in/colleen-wang-59a091205/) for her kind support in editing the content of our post, Virufy lead UX designer [Maisie Mora](https://www.linkedin.com/in/maisiemora/) for helping explain the design process in the application demonstration section, and [Saad Aslam](https://www.linkedin.com/in/saslam23/) for his kind support in converting our blog post to a nicely formatted HTML page. - -Finally, we cannot forget the great lessons and close guidance from Professor [Chip Huyen](https://huyenchip.com/) and TA [Michael Cooper](https://michaeljohncooper.com/) who helped us open our eyes to production machine learning and formulate our problem to be attainable within the short 2 month course quarter. 
- -[1] Tensorflow Speech Commands dataset, [https://arxiv.org/pdf/1804.03209.pdf](https://arxiv.org/pdf/1804.03209.pdf) - -[2] Teachable Machine, [https://teachablemachine.withgoogle.com/](https://teachablemachine.withgoogle.com/) - -[3] Virufy: A Multi-Branch Deep Learning Network for Automated Detection of COVID-19 [https://arxiv.org/ftp/arxiv/papers/2103/2103.01806.pdf](https://arxiv.org/ftp/arxiv/papers/2103/2103.01806.pdf) diff --git a/_posts/2021-03-20-dashcam-data-valuation.markdown b/_posts/2021-03-20-dashcam-data-valuation.markdown deleted file mode 100644 index 8ec00ad..0000000 --- a/_posts/2021-03-20-dashcam-data-valuation.markdown +++ /dev/null @@ -1,159 +0,0 @@ ---- -layout: post -title: An active data valuation system for dashcam data crowdsourcing -date: 2021-03-20 00:44:20 +0100 -description: Online dashcam data valuation system -img: # Add image post (optional) -fig-caption: # Add figcaption (optional) -tags: [Edge-ML] -comments: true ---- -### App link - -[https://cs329s.aimeup.com](https://cs329s.aimeup.com) - -### The Team -- Soheil Hor, [soheilh@stanford.edu](mailto:soheilh@stanford.edu) -- Sebastian Hurubaru, [hurubaru@stanford.edu](mailto:sebastian.hurubaru@stanford.edu) - -## Problem definition - -Data diversity is one of the main practical issues that limits the ability of machine learning algorithms to generalise well to unseen test cases in industrial settings. In scenarios like data-driven perception in autonomous cars, this issue translates to acquiring a diverse train and test set of different roads and traffic scenarios. On the other hand the increased availability and reduction in cost of HD cameras has resulted in drivers opting to install in-expensive cameras (dash-cams) on their cars, creating a potential for a virtually infinite source of diverse training data for autonomous driving applications. This data availability is ideal from a machine learning engineer’s point of view but the costs in data transfer, storage, clean up and labeling limit the success of such uncontrolled data-crowd-sourcing approaches. More importantly, the data-owners might prefer not to send all of their data to the cloud because of privacy concerns. We propose a local unsupervised dataset evaluation system that can prioritize the samples needed for training of a centralized model without the need for uploading every sample to the cloud and therefore eliminate the costs of data transfer and labeling directly at the source. - -## System design - -
-
- -
-Block diagram of the proposed system -
-
-
-As explained above, our goal is to optimise the training set corresponding to an ML model by distributed data valuation. One of the well-known approaches to this problem is to prioritise samples based on their corresponding model uncertainty. Our proposed approach uses a local “loss-predictor network” to quantify the value of each sample at each client before it is transmitted to a central server. The proposed system consists of two main modules: the centralized server and the local data source clients. Please see Figure 1 for more details.
-
-
-
-### Module 1: The Centralized server
-
-The goal of the server module is to:
-
-1. Gather data from different data sources (clients)
-2. Retrain and update the backbone model based on the updated training set (for labeled data)
-3. Train the loss prediction module based on the updated backbone model
-4. Transmit the weights of the updated loss prediction module to each client
-
-### Module 2: Local data source clients
-
-The goal of each client is to:
-
-1. Estimate the backbone model’s loss for each local sample using the local loss-prediction model
-2. Select the most valuable (valid) samples based on the predicted loss
-3. Transmit the selected samples to the centralized module
-
-In order to make the system available to users, we chose AWS as our cloud platform. Once a user decides to upload data, it is stored on AWS S3. To deal with the concurrency issues that arise when multiple users share data with us, we created a scheduler using AWS CloudWatch that triggers the online learning at a specified time interval. The centralized server, which does the online learning, was implemented as a Lambda function configured with a Docker image runtime. By using the scheduler and allowing only one instance of online training at any time, all newly available data is processed once and the new model is made available to all clients at the same time. While the model is being retrained, we wanted to prevent users from evaluating and uploading data, since they could otherwise still be rewarded for data that becomes worthless after training. To achieve this, we used AWS IoT to push the training stats and progress from the online learning Lambda function to all running clients, which then decide when to make the platform available to users again.
-
-As client data privacy was our main concern, users must be able to evaluate pictures without sending any data to us. Therefore, at the end of each run, the online learning component generates a browser-ready model and uploads it to AWS S3 with an incremented version number. Whenever a client wants to evaluate some data, it checks whether a new version is available and always fetches the newest model. All of this was done using TensorFlow JS.
-
-To protect the data from the outside world and allow unauthenticated users to access resources only through the web app, we employed AWS IAM, and with CloudFormation configuration files we could set up the full security layer automatically.
-
-To create all of the infrastructure automatically from code changes, and to maintain both a test and a production environment, we employed an infrastructure-as-code approach. For this we used AWS Amplify and AWS SAM, allowing us to leverage AWS CloudFormation services.
-
-## Machine learning component
-
- -
-Block diagram of the ML component -
-
-
-Our approach to the on-the-edge data valuation problem is based on recent advances in the development of loss-predictor models[]. In simple terms, a loss predictor model is a model that tries to estimate another model’s loss as a function of its inputs. We use the model’s loss as a measure of model uncertainty that can be calculated without access to ground-truth labels, enabling each sample to be evaluated directly at capture time.
-
-For the backbone model we converted a pre-trained YOLOv3 [1] directly from the original Darknet implementation. We then evaluated the converted model on a publicly available dashcam dataset (the BDD100K dataset at https://bdd-data.berkeley.edu/).
-
-For the loss predictor model we decided to go with a small CNN that can be implemented directly in the browser (Tiny-VGG). We trained the Tiny-VGG model on the classification loss obtained by running the backbone model on unseen data.
-
-The implemented system has two interconnected training loops:
-
-First, the “offline” training loop, which requires labeled data and trains the loss predictor model to be a better predictor of the backbone model’s loss. Since our system did not include a labeling scheme, we ran this loop only once (using a labeled subset of the BDD100K dataset) and then used the learned weights as the starting point for the second training loop (online learning).
-
-For the online learning loop we start with the weights extracted from the offline training phase and then retrain the loss-predictor model whenever the centralized unlabeled dataset is updated. The challenge here is how to retrain the loss predictor model on these samples without having access to the labels. We approached this problem by observing that the backbone model’s loss on these samples will be zero once they are labeled and added to the backbone model’s training set. Based on this assumption, we decided to use the new samples, with a target loss of zero, as an online-learning alternative to the larger offline learning loop.
-
-## System evaluation
-
-One of our main challenges was to map a measure like the loss of a model to a quantitative and fair value in dollars. For this task we first did an empirical analysis of the distribution of the classification loss values of the backbone model. Figure 3 shows the empirical distribution of losses for the YOLOv3 model. We used this empirical probability distribution to calculate how likely observing each sample is compared to a randomized data capture approach with uniform probability of observing each sample. We defined the value of each sample as follows:
-
-![alt_text](../assets/img/formula_full.png "Sample value formula")
-
-In which ![alt_text](../assets/img/formula_part1.png "empirical probability of each loss") is the empirical probability of each loss as shown in Figure 3, ![alt_text](../assets/img/formula_part2.png "empirical probability of each loss would be observed")
-
-is the probability that each loss would be observed if the loss distribution were uniform (10% for the 10-bin histogram shown in Figure 3), and BSV is the “Base Sample Value” chosen by the system designer. Based on our initial research, the value that companies like Kerb and lvl5 have assigned to dashcam videos is around $3 per hour of video recordings, which roughly translates to 0.1 cents per frame assuming a 1 fps key-frame extraction rule. 
However, since in our system the samples are assumed to be much more diverse than a single video, and we require manual selection of the samples by the user, we assumed a 10-cent base sample value for each frame.
-
-We observed one caveat of this method in practice: because even the smallest losses have a non-zero value (the probability of observing any loss is non-zero), already-sold samples could be monetized again if the loss-predictor model does not give exactly zero loss on its training set (which can be the case in online learning). We dealt with this problem by adding a “dead-zone” to our valuation heuristic so that samples with losses below a specific threshold have zero value (in our latest implementation we found a threshold of 0.27 to work well with our data). 
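
The exact valuation formula lives in the figure above, so the sketch below is only one plausible reading of it: bin the historical losses into a 10-bin empirical histogram, compare the observed loss bin against the uniform 10% baseline, scale by the base sample value, and apply the 0.27 dead-zone. The ratio used for scaling, and the helper's name, are assumptions rather than the exact implemented formula.

```python
import numpy as np

def sample_value(predicted_loss, historical_losses, bsv=0.10,
                 n_bins=10, dead_zone=0.27):
    """Illustrative valuation heuristic (the real formula is in the figure)."""
    if predicted_loss < dead_zone:
        # Dead-zone: samples the model has effectively already learned are worthless.
        return 0.0
    counts, edges = np.histogram(historical_losses, bins=n_bins)
    p_emp = counts / counts.sum()          # empirical probability of each loss bin
    p_uniform = 1.0 / n_bins               # 10% per bin for the 10-bin histogram
    bin_idx = int(np.clip(np.digitize(predicted_loss, edges) - 1, 0, n_bins - 1))
    # Assumption: rarer (more surprising) losses are worth more than the base value.
    return bsv * p_uniform / max(p_emp[bin_idx], 1e-6)
```

Under this reading, a loss drawn from a bin that is exactly as common as the uniform baseline is worth the 10-cent base value, losses from under-represented bins are worth more, and anything below the dead-zone is worth nothing.
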
-
- -
-Empirical expected probability of classification loss values of the backbone model -
-
-
-## Application demonstration
-
-We made our application available online so that all users can access it. We have two links available: [https://cs329s.aimeup.com](https://cs329s.aimeup.com) for the production environment and [https://cs329s-test.aimeup.com](https://cs329s-test.aimeup.com) for the test environment. After choosing the production environment, we click the browse data button, load some on-topic pictures, and hit the Run button:
-
-![alt_text](../assets/img/app_demo1.png "Application Demo 1")
-
-The model generates scores which are mapped to a fair value in U.S. dollars. All this data can be exported to Excel/PDF using the buttons available in the spreadsheet toolbar. Search is also possible, so any picture can be referenced by name to avoid scrolling when working with a large number of pictures.
-
-After selecting one picture and uploading it, the online learning is activated and the functionality on all clients is disabled during this time, with real-time training progress displayed, as can be seen in the screenshot below:
-
-![alt_text](../assets/img/app_demo2.png "Application Demo 2")
-
-To assess what is going on in the backend, we built a monitor page that can be opened by pressing the “Open Monitor” button. From that moment on, all the backend resources push notifications to it. After uploading the picture and during the online training we can see the following:
-
-![alt_text](../assets/img/app_demo3.png "Application Demo 3")
-
-After running the new model on the same pictures, the fair value of the uploaded pictures goes down to 0, meaning that the model has learned the features they contain.
-
-![alt_text](../assets/img/app_demo4.png "Application Demo 4")
-
-## Reflection
-
-The first challenge we encountered was how to fetch a model from a secure site, where each file is accessed over a secured private link, and run it in the browser. TensorFlow JS unfortunately does not support this kind of operation, so we had to implement it ourselves.
-
-One major setback in our project was our third teammate suddenly dropping the course, which we could have seen coming from his being unresponsive in the first couple of weeks of the quarter.
-
-Another major challenge was dealing with model instability while retraining the loss predictor model in our online training loop. Our decision to also include the original training set to “refresh” the training helped a lot.
-
-One issue we did not count on was that debugging an online learning system requires very detailed logging and version control to follow the dynamic performance of the model. We ended up implementing a basic logging system, but it was still very hard to predict how the model would behave after a few retraining sessions.
-
-Infrastructure as code is a powerful tool that does more than one would expect, but it can lead to unexplainable behavior. Two examples that gave us some headaches:
-
-* One cannot rely on data in the temporary folder inside a Lambda function container persisting between calls.
-* AWS S3 still delivered cached data to us despite calling the API with caching disabled. Just deleting the files and uploading them again helped!
-
-Given unlimited time and resources, we would incorporate a labeling block into the system and close the loop on active data capture and labeling by retraining the backbone model on the centralized training set. 
- -## Broader Impacts - -Since our valuation system is fully automated and does not have access to labels for the input data it could be manipulated in many different ways. For instance, one could monetize several copies of the same image (or maybe slightly different versions of one image) and leverage the fact that the loss predictor model can not be trained separately for each individual image. Or because the values are assigned to samples based on how unexpected each sample is, out of context samples can be easily monetized if the users intend to trick the system. The way that we have dealt with this issue is by first, limiting number of uploads that a user can do to an upload attempt every 5 minutes, and we also train the loss-predictor model between different uploads in order to reduce the loss values corresponding to all of the uploaded samples at each iteration. As a result, the users will be able to monetize unrelated or repeated images only once. - -Detecting repeated or unrelated images can be pretty straightforward using irregularity detection methods like one-class SVM but we have not currently implemented such a method. - -## References - -[1] Redmon, Joseph, and Ali Farhadi. "Yolov3: An incremental improvement." arXiv preprint arXiv:1804.02767 (2018). - - diff --git a/_posts/2021-03-31-ml-production-system-for-covid-detection.md b/_posts/2021-03-31-ml-production-system-for-covid-detection.md deleted file mode 100644 index 698852e..0000000 --- a/_posts/2021-03-31-ml-production-system-for-covid-detection.md +++ /dev/null @@ -1,409 +0,0 @@ ---- -layout: post -title: ML Production System For Detecting Covid-19 From Coughs -date: 2021-03-21 18:35:00 +0900 -description: Your report description -img: # Add (optional) -fig-caption: # Add figcaption (optional) -tags: [Covid, Embeddings, GCP, Cloud-ML] -comments: true ---- - -### Application Link -Covid-19 Evaluation App - QR Code - -[Covid Risk Evaluation - Hosted on GCP Cloud Run](https://covid-risk-evaluation-fynom42syq-uc.a.run.app/) - -### GitHub Link -[CS 329S - Covid Risk Evaluation Repository](https://github.com/LukasHaas/cs329s-covid-prediction) - -### The Team -- [Lukas Haas](https://www.linkedin.com/in/lukas-haas/) -- [Dilara Soylu](https://www.linkedin.com/in/dilarasoylu/) -- [John Spencer](https://www.linkedin.com/in/johnspe/) - - -## **Problem Definition** - -Since the start of the COVID-19 pandemic, widespread testing has become a significant bottleneck in the efforts to monitor, model and prevent the spread of the disease. Obtaining accurate information about a person’s disease status is critical in order to isolate infected individuals and decrease the reproduction number of the virus. Unfortunately, we see four major issues with current testing regimes; first, oropharyngeal swab tests are invasive, expensive, and time consuming; second, the time required to receive test results is significant, ranging anywhere from 30 minutes for rapid swab tests to three days for PCR tests in a lab at the time of this writing; third, contamination risk is high when individuals travel to testing sites to obtain their tests, and last but not least, tests need to be administered by trained clinicians, severely limiting throughput. - -In order to address current issues with testing, we developed a machine learning system to instantly test for and collect data on COVID-19 using the cough sounds recorded on the users’ personal devices. 
The World Health Organization (WHO) has [reported](https://www.who.int/docs/default-source/searo/myanmar/documents/coronavirus-disease-factsheet-3.pdf?sfvrsn=471f4cf_0) that 5 out of 6 COVID-19 patients exhibit mild symptoms, most commonly a “dry cough” producing a unique audio signature in the lungs. Cough sound analyses have proven to be effective for diagnosing other respiratory diseases such as [pertussis](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0162128), [pneumonia, and asthma](https://ieeexplore.ieee.org/document/7311223). - -[Our system ](https://covid-risk-evaluation-fynom42syq-uc.a.run.app/)is deployed on **Google Cloud Platform (GCP)** and achieves a **ROC-AUC score of 71%**. - -## **System Design** - -There are three different user groups for our system: - -1. **Regular Users** - - The first group involves the regular users who would like to get pre-screened for COVID-19 to inform themselves. The goal of this group is to provide them a rough idea about whether the cough-like symptoms they are experiencing could be related to COVID-19. If the cough test results signal that COVID-19 is a high possibility, the users are encouraged to seek out medical help and isolate. - -2. **Medical Practitioners** - - The second group involves medical practitioners who would like to employ a test that is both faster and cheaper before they try out more expensive, and time consuming tests. Based on the results showing the likelihood of COVID, test takers are advised to take more rigorous tests. - -3. **Community Administrators** - - The third group involves the community admins who would like to ensure the safety of their community by employing a cheaper pre-screening test for COVID. - -All of our users access our app through the web using their mobile devices. For all of our users, a key consideration we kept in mind while designing our system was **interpretability**; we made sure to include both personalized and general explanations of how our model makes predictions. This not only informs our users but also encourages them to give back by donating their cough data for research. Interpretability is especially important for the medical practitioners using our app: even if our model fails to make the correct prediction, the methodologies and their interpretations that are being displayed can help the medical practitioners make more informed decisions. - -
-
- -
-Figure 1. System Architecture on Google Cloud Platform -
-
- -In order for our application to be effective in serving the needs of our target users, we needed to build various interconnected components which are shown in **Figure 1**. Our web-based interface allows users to record a cough sound on their laptop or mobile devices, reducing the risk of contagion and without having to download a mobile native app. Our rationale behind a web-based solution was to reduce any points of friction and avoid the perils of low uptake which COVID-19 tracing apps have experienced, further facilitating usage by wrapping the link to the web solution into a QR code. We expand more on our core system design decisions below. - -### **Model Exploration** - -Developing on a **Virtual Machine (VM)** allowed us to collaboratively iterate on different models. Furthermore, working on **GCP VMs** allowed us to harness their free research TPU program to train our more compute-intensive embedding models. Lastly, **gcloud** operations made it easy to serve models on **GCP’s AI Platform** and test their performance in-app. - -### **Model Deployment** - -As one of our model’s features, we used embeddings based on a commonly used computer vision model for sound data (**VGGish**). Given the large storage size of **VGGish**, we decided to run the model in the cloud using the **GCP AI Platform**. Once we obtained our embeddings after querying the **AI Platform** instance with the users’ cough audio, COVID-19 predictions were inferred in-app as this was not computationally expensive. It is important to note that also our cough detection model (sourced from [Coughvid](https://c4science.ch/diffusion/10770/browse/master/notebooks/cough_classification_example.ipynb)), which made sure that the user submitted a cough and not any random noise before we evaluated the audio for COVID-19, was run in-app. In addition, we built an automatic yet accurate fall-back procedure which evaluates COVID-19 risk without **VGGish** embeddings whenever the **AI Platform** is unavailable or the hosted model is shut down due to cost reasons. We are planning for the **VGGish**-based inference to be available continuously once our model's accuracy is further improved. - -### **Iterative Development and User Experience** - -We chose to develop our application using **Streamlit** because its powerful library allowed us to create an MVP version of our system quickly and to then iterate towards a more polished user experience. As sound recording was not a native feature of **Streamlit**, we build a custom **Streamlit** component in Javascript to handle recordings and device permissions which we plan to open-source to **Streamlit's** [components gallery](https://streamlit.io/gallery?type=components&category=featured). - -### **App Deployment** - -We decided to host our **Streamlit** application as a Docker container on **GCP Cloud Run** because we wanted to leverage the smooth connections between the different components we had already built on GCP in addition to the inherent scalability that would be necessary if our app was used at any testing centers. - -### **Continuous Integration and Deployment (CI/CD)** - -We decided to leverage **GitHub Actions** because of the CI/CD capability where any changes we pushed to our **Streamlit** application repository were automatically deployed to GCP. 
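
The fall-back behaviour described under Model Deployment could be sketched roughly as follows. The AI Platform call follows the standard online-prediction client pattern, but the project and model names, the feature lists, and the scikit-learn-style classifiers passed in are all placeholders rather than our real configuration.

```python
from googleapiclient import discovery

def vggish_embedding(audio_features, project="example-project", model="vggish"):
    """Query the hosted VGGish model; return None if AI Platform is unavailable."""
    try:
        service = discovery.build("ml", "v1")
        name = f"projects/{project}/models/{model}"
        response = service.projects().predict(
            name=name, body={"instances": [audio_features]}).execute()
        return response["predictions"][0]
    except Exception:
        return None  # AI Platform unreachable or the hosted model is shut down

def covid_risk(full_model, fallback_model, audio_features, clinical_features):
    embedding = vggish_embedding(audio_features)
    if embedding is None:
        # Degrade gracefully: score the cough without the embedding-based features.
        return fallback_model.predict_proba([audio_features + clinical_features])
    return full_model.predict_proba([embedding + audio_features + clinical_features])
```
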
-
-
-## **Machine Learning**
-
-### **Datasets**
-
-We used the crowdsourced [Coughvid dataset](https://zenodo.org/record/4048312#.X4laBNAzY2w), which was created by a team of researchers from [EPFL](https://www.epfl.ch/labs/esl/) and provides over **20,000** crowdsourced cough recordings representing a wide range of ages, genders, geographic locations, and COVID-19 statuses. After cleaning and balancing the dataset for our specific use case, we ended up with **699** samples for each class: healthy, symptomatic and COVID-19 positive, out of which **559** were assigned to the training set using an 80/20 random train-test split. We then augmented the classes for which we did not have more samples available by adding random perturbations (Gaussian noise, time shifts, time stretches, and pitch shifts), making sure to apply them to an equal number of samples in each class. This resulted in a balanced training dataset with **1677** samples for each class and a non-augmented, balanced testing set containing **140** samples for each class.
-
-### **Feature Selection**
-
-When it came to feature selection we had to be thoughtful in how we iterated on our model. First, we wanted to provide interpretable predictions to our users while ensuring that our approaches were grounded in the medical literature on COVID-19. Secondly, we were working with limited, augmented data, so we wanted to ensure that our model was focusing on the relevant parts of the cough sounds instead of overfitting to noise. After trying out multiple models, we chose a shallow gradient boosted decision tree model that used three categories of features to provide a COVID-19 prediction: embedding-based features, audio features, and clinical background information.
-
-1. **Embedding-Based Features**
-
-    When a user submits a cough, our model extracts a 128-dimensional embedding of the audio data using a computer vision model named **VGGish**, all in the cloud. Through our research, we recognized different patterns in a healthy person's audio embeddings when compared to a COVID-19 positive individual's audio embedding. Therefore, we used specific segments of audio sample embeddings as features to help our model gauge the risk that the user has COVID-19.
-
-2. **Audio Features**
-
-    When a user submits a cough, our model calculates various measurements that help capture what the medical community has identified as a dry cough associated with COVID-19 infection. Specifically, we consider the **maximum signal**, **median signal** and **spectral bandwidth** of an audio recording, which stand for the loudest point of the audio, the average loudness of the audio, and the standard deviation of a cough’s audio frequencies over time, respectively. We use these metrics as features in our model.
-
-3. **Clinical Features**
-
-    Lastly, our app uses clinically relevant background information provided by the user to make better predictions. These features are the patient's age, history of respiratory conditions, and current fever and muscle pain statuses.
-
-
-### **Model Iterations**
-
-#### **First Iteration Cycle**
-
-Our first iteration of the model involved performing a simple logistic regression on **Mel-Frequency Cepstrum Coefficient (MFCC)** features extracted from a user’s submitted cough. We sourced this initial feature set from [Coughvid’s public repository](https://coughvid.epfl.ch/about/). 
Using this baseline model, we were able to achieve a **60% ROC-AUC** score on the binary classification task of predicting whether a user was healthy (including symptomatic, COVID-19-untested users) or COVID-19 positive. We used the [SHAP library](https://github.com/slundberg/shap) to interpret our model and evaluate which features were most important, which showed that the model was focusing particularly on the **spectral bandwidth** of the cough sample.
-
-#### **Second Iteration Cycle**
-
-For our second iteration we decided to build a **multi-class** model, as it was important to us to distinguish between COVID-19 positive individuals and COVID-19 symptomatic (which includes some flu symptoms) but untested individuals. To achieve this goal, our model was trained to predict one of the three classes *healthy*, *COVID-19 symptomatic*, and *COVID-19*. As part of this process, we created a deep **Convolutional Neural Network (CNN)** built on top of [Resnet-50](https://arxiv.org/abs/1512.03385), where the input was the mel-frequency spectrogram of the user-submitted cough audio (see **Figure 2.**). This model achieved high accuracy on the training data; however, we failed to regularize it sufficiently to show promising validation results.
-
-#### **Third Iteration Cycle**
-
-The third iteration produced the model we currently have deployed. Using our learnings from the past two iterations, we expanded our feature set by incorporating **VGGish** embeddings, a narrowed selection of audio features (including the **spectral bandwidth**), as well as the clinical background information provided by the user. One of the methods we used to identify differences in the 128-dimensional embeddings between the three classes was to look at the absolute deviations in the medians for each pair of classes (see **Figure 3.**). In order to prevent overfitting, we chose to train a shallow gradient boosted decision tree model, which achieved high validation accuracies in the multi-class setting. 
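
A rough sketch of this third-iteration setup is shown below. The feature dimensions, the XGBoost implementation of the gradient boosted trees, and the hyperparameters are illustrative assumptions; only the overall recipe (concatenate embedding, audio, and clinical features, fit a shallow tree ensemble, inspect it with SHAP) follows the description above.

```python
import numpy as np
import shap
import xgboost as xgb
from sklearn.model_selection import train_test_split

# Placeholder features standing in for the real pipeline: selected VGGish
# embedding dimensions, the three audio features (max signal, median signal,
# spectral bandwidth), and clinical inputs (age, respiratory history, fever,
# muscle pain). Labels: 0 = healthy, 1 = symptomatic, 2 = COVID-19.
rng = np.random.default_rng(0)
n = 2097
X = np.hstack([
    rng.normal(size=(n, 16)),          # embedding-based features
    rng.normal(size=(n, 3)),           # audio features
    rng.integers(0, 2, size=(n, 4)),   # clinical features
])
y = rng.integers(0, 3, size=n)

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

# A deliberately shallow ensemble to limit overfitting on the small dataset.
clf = xgb.XGBClassifier(max_depth=3, n_estimators=150, learning_rate=0.1)
clf.fit(X_train, y_train)

# SHAP values show which features (e.g. spectral bandwidth) drive each prediction.
explainer = shap.TreeExplainer(clf)
shap_values = explainer.shap_values(X_val)
```
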
- - -
-
-
- -
- Figure 2. A Typical Mel-Frequency Spectrogram of a Cough Recording -
-
-
-
- -
- Figure 3. Absolute Deviations in the Medians of VGGish Embeddings Between the 3 Classes -
-
-
-
-
- -  - -## **System Evaluation** - -### **Model Performance** - -Considering overall performance, our model is achieving a multi-class **ROC-AUC of 71%** as broken down in **Figure 4.** which is a strong improvement over our baseline logistic regression algorithm (60% ROC-AUC). A class that is particularly important to attain high accuracy on is the symptomatic class, which represents users from our dataset who were symptomatic but did not have COVID-19 or had not yet received a COVID-19 PCR test result. - -Considering the validation set performance we are excited that our model is quite accurate in predicting a user is healthy given he or she is indeed healthy (**high recall on class "healthy"**), as shown in the confusion matrix in **Figure 5**. In addition, given a model predicts one has COVID-19, there is quite a high chance that a person actually has COVID-19 (**high precision on class "COVID-19"**). At the same time, we recognize that there is a lot of room for improvement, especially for symptomatic patients who may have a cold or pneumonia, but not COVID-19. - - -
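
Metrics of this kind are typically computed with scikit-learn's one-vs-rest ROC-AUC and confusion-matrix utilities; the sketch below uses randomly generated stand-ins for the validation labels and predicted probabilities rather than our actual model outputs.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, roc_auc_score

# Stand-ins for the 420-sample balanced validation set: y_val holds the true
# classes (0 = healthy, 1 = symptomatic, 2 = COVID-19) and proba the model's
# predicted class probabilities.
rng = np.random.default_rng(0)
y_val = rng.integers(0, 3, size=420)
proba = rng.dirichlet(np.ones(3), size=420)

ovr_auc = roc_auc_score(y_val, proba, multi_class="ovr")   # summarised in Figure 4
cm = confusion_matrix(y_val, proba.argmax(axis=1))         # plotted in Figure 5
print(f"One-vs-rest ROC-AUC: {ovr_auc:.2f}")
print(cm)  # rows: true class, columns: predicted class
```
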
- -
-
-
- -
- Figure 4. One-Versus-Rest (OVR) Receiver Operating
Characteristic (ROC) Curve for the 3 Classes -
-
-
- -
- Figure 5. Confusion Matrix of Predictions Achieved on the Balanced 420-Sample Validation Set -
-
-
- -  - -### **User Experience** - -To make sure our app met the usability requirements for the audience we initially targeted, we have run a series of user experience experiments. Our results indicated that our test users were initially confused about the wording used in our application and the different user flows offered. We have since addressed these issues by simplifying the language and our user interface as much as possible in the context of a **non-deep-linking single page application (SPA)**. - - -## **Application Demonstration** - -### **Use Case 1: Getting a COVID-19 Risk Assessment as a New User** - -The main utility of our app is to provide users with a risk assessment of their COVID-19 status. We decided to build a web application to make sure our application was ubiquitously used and to avoid the perils of low uptake which many COVID-19 screening mobile apps experienced. We achieve this in 4 steps, summarized in **Figure 6.** and shown in the app in **Figure 7**. - -
-
- -
-Figure 6. New User Journey to Get a COVID-19 Risk Assessment -
-
- -  - -
- -
-
-
- -
-
- (a) User coughs near the microphone. -
-
-
-
- -
-
- (b) User receives their risk assessment. -
-
-
-
-
-
- -
-
- (c) User consents to upload data for research purposes and learns more about the prediction. -
-
-
-
- -
-
- (d) User types in a unique identifier so that future PCR results can be linked to the submitted data. -
-
-
-
-
-Figure 7. Use Case 1 Demonstration in Streamlit -
- -  - -### **Use Case 2: Uploading PCR Test Results as a Returning User** - -The secondary goal of our app is to collect a large COVID-19 cough dataset. We achieve this in two additional steps to the new user journey, summarized in **Figure 8.** and shown in the app in **Figure 9**. - -
-
- -
-Figure 8. Returning User Journey to Upload a PCR Result -
-
- -  - -
-
- -
-Figure 9. Use Case 2 Demonstration in Streamlit -
-
-
-
-## **Reflection**
-
-We learned so much while collaborating on this project and truly had an amazing time working together, even when things sometimes didn't go as planned! Here is a summary of our key takeaways:
-
-### **What Worked Well**
-
-1. **Streamlit**
-
-    Streamlit allowed us to quickly create an **MVP** with a decent-looking **UI**, which was awesome! Streamlit's built-in functionality with **Python** libraries such as **Matplotlib** was really helpful, because we were able to transfer work from our model development notebooks into the ML production environment without the need for much change.
-
-2. **GCP AI Platform**
-
-    Uploading models to **AI Platform** and letting it handle the serving and scaling problems for us without much cost was really helpful. We were able to complete this whole project using only the free tier in GCP with the initial **$300** provided to each new account.
-
-### **What Did Not Work As Planned**
-
-1. **Latency on GCP App Engine**
-
-    One issue that we had to deal with was significant latency that arose when we hosted our app on GCP's **App Engine**, because we could not use more powerful machines. Specifically, there were some operations we did not send to **AI Platform** but ran in the app instead, including the operations for displaying to users how we **interpreted** their cough using cough segmentation and **mel-frequency spectrograms**. These operations had a higher latency than we thought at first. When developing locally, we would add a feature that easily ran on our machines but then observe high latencies once it was uploaded to GCP's App Engine.
-
-### **Next Steps**
-
-1. **Improving Latency**
-
-    We are currently exploring configurations for **GCP App Engine** that will allow us to better serve users with lower latency while at the same time reducing costs. We are also working on improving the **caching** within our **Streamlit** application to help combat latency issues.
-
-    **EDIT:** In the 4 days since our public demo for CS 329S, we have migrated our application to **GCP Cloud Run** and used more advanced **caching** to reduce latency. Our application now runs with negligible latency and at less than 1/10 of the initial cost.
-
-2. **Improving Model Accuracy**
-
-    One of our goals for this project is to facilitate the collection of crowdsourced COVID-19 cough datasets, because we experienced how difficult it is to train an accurate model with limited data. However, we also recognize that there are changes we can make to our model which will improve its performance on the currently available data, and we look forward to conducting more experiments.
-
-## **Broader Impacts**
-
-When we first learned about the possibility of detecting COVID-19 from cough recordings, we were immediately drawn to contributing to the world's fight against the pandemic. It was such a privilege to tackle this challenge while learning about the process of designing end-to-end, user-centric machine learning systems.
-
-As a team, we recognize that providing someone with a COVID-19 risk assessment is not a task to take lightly, as mispredictions can be very harmful. To make our intent and accuracy levels clear, we made sure to include disclaimers and related extra information about our algorithms, extracted features, and model performance transparently in our app. Therefore, in our design decisions, we chose to focus on creating an application that was informative and interpretable. 
Furthermore, people’s COVID-19 statuses are sensitive data so we only use all information for the specific purpose it was collected for. - -While we believe this project is a start in the right direction, we recognize there are more improvements to be made before our application is ready for the real world, most importantly higher prediction accuracy. - -## **Acknowledgements** - -Thank you to the teaching staff of CS 329S, Chip, Karan, Michael (our project mentor) and Xi for all their support throughout this project. - -Thank you Amil Khanzada and the [Virufy team](https://virufy.org/team.html) for promoting work on COVID-19 risk prediction using cough samples. - - -## **References** - -[1] The Coughvid dataset which we used for training and testing examples: [The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms (Zenodo)](https://zenodo.org/record/4048312#.YFQc5UiSnFZ) - -[2] The Coughvid Repo which provided us feature extraction for training our first iteration of models and heavily influenced our research: [COUGHVID · R10770 (c4science.ch)](https://c4science.ch/diffusion/10770/) - -[3] The Coughvid team’s current application (check it out!): [Coughvid (epfl.ch)](https://coughvid.epfl.ch/) - -[4] A tutorial provided to us in class by Daniel Bourke which we referenced when building our end-to-end system: [mrdbourke/cs329s-ml-deployment-tutorial: Code and files to go along with CS329s machine learning model deployment tutorial. (github.com)](https://github.com/mrdbourke/cs329s-ml-deployment-tutorial) - -[5] In order to create our application we used Streamlit: [Streamlit • The fastest way to build and share data apps](https://streamlit.io/) - -**There are various papers we referenced during research which helped us improve our model including:** - -[6] [Exploring Automatic Diagnosis of COVID-19 from Crowdsourced Respiratory Sound Data](https://arxiv.org/pdf/2006.05919.pdf) - -[7] [Deep Residual Learning for Image Recognition](https://arxiv.org/abs/1512.03385) - -[8] [COVID-19 Sounds App - University of Cambridge (covid-19-sounds.org)](https://covid-19-sounds.org/de/blog/detect_covid_kdd.html) - -[9] [AI4COVID-19 - AI Enabled Preliminary Diagnosis for COVID-19 from Cough Samples via an App](https://arxiv.org/pdf/2004.01275.pdf) - -[10] [CNN Architectures for Large-Scale Audio Classification](https://arxiv.org/abs/1609.09430). - -**Let's defeat the pandemic!** - - - - - - - - - diff --git a/_site/Fact-Checking-Tool-for-Public-Health-Claims/index.html b/_site/Fact-Checking-Tool-for-Public-Health-Claims/index.html deleted file mode 100644 index 1182253..0000000 --- a/_site/Fact-Checking-Tool-for-Public-Health-Claims/index.html +++ /dev/null @@ -1,507 +0,0 @@ - - - - - Fact-Checking Tool for Public Health Claims - CS 329S Winter 2021 Reports - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
-
-
- - -
-
-

Fact-Checking Tool for Public Health Claims

-
2021, Mar 12    
-
-

The Team

-
    -
  • Alex Gui
  • -
  • Vivek Lam
  • -
  • Sathya Chitturi
  • -
- -

Problem Definition

- -

Due to the nature and popularity of social networking sites, misinformation can propagate rapidly, leading to widespread dissemination of misleading and even harmful information. A plethora of misinformation can make it hard for the public to understand which claims hold merit and which are baseless. The process of researching and validating claims can be time-consuming and difficult, leading many users to read articles without ever validating them. To tackle this issue, we made an easy-to-use tool that helps automate the fact-checking of various claims, focusing on the area of public health. Based on the text the user puts into the search box, our system generates a prediction that classifies the claim as one of True, False, Mixed or Unproven. Additionally, we developed a model which matches sentences in a news article against common claims that exist in a training set of fact-checking data. Much of the prior inspiration for this work can be found in Kotonya et al., where the authors generated the dataset used in this project and developed a method to evaluate the veracity of claims and corresponding explanations. With this in mind, we tried to address veracity prediction and explainability in our analysis of news articles.

- -

System Design

- -

Our system design used the following steps: 1) development of ML models and integration with Streamlit, 2) packaging of the application into a Docker container, and 3) deployment of the application using Google App Engine.

- -
-
- -
-
- -
    -
  1. In order to allow users to have an interactive experience, we designed a web application using Streamlit for fake news detection and claim evaluation. We chose Streamlit for three primary reasons: amenability to rapid prototyping, ready integration with existing ML pipelines, and a clean user interface. Crucial to the interface design was allowing the users a number of different ways to interact with the platform. Here we allowed the users to either enter text into text boxes directly or enter a URL from which the text could be automatically scraped using the Beautiful Soup Python library (a minimal sketch of this input handling is shown after this list). Using this design pipeline, we were able to quickly get an ML-powered web application working on a local host!
  2. -
  3. To begin the process of converting our locally hosted website to a searchable website, we used Docker containers. Docker is a tool that can help easily package a local project into an environment known as a container that can be run on another machine. For this project, our Docker container hosted the machine learning models, relevant training and testing data, the Streamlit application file (app.py) as well as a file named “requirements.txt” which contained a list of names of packages needed to run the application.
  4. -
  5. With our application packaged, we deployed our Docker container on Google App Engine using the Google Cloud SDK. Essentially, this created a CPU (or sets of CPUs) in the cloud to host the web-app and models. We opted for an auto-scaling option which means that the number of CPUs automatically scale with the number of requests. For example, many CPU cores will be used in periods of high traffic and few CPU cores will be used in periods of low traffic. Here, it is worth noting that we considered other choices for where to host the model including Amazon AWS and Heroku. We opted for Google App Engine over Heroku since we needed more than 500MB of storage; furthermore, we preferred App Engine to AWS in order to take advantage of the $300 free GCP credit!
  6. -
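As a rough illustration of the input handling described in step 1, a minimal Streamlit sketch might look like the following. The `scrape_article` helper and the paragraph-based scraping are assumptions for illustration, not the project's actual `app.py`:

```python
# Hypothetical excerpt: accept raw text or a URL, scraping the article body
# with Beautiful Soup before passing it to the models.
import requests
import streamlit as st
from bs4 import BeautifulSoup

def scrape_article(url: str) -> str:
    """Fetch a page and return its concatenated paragraph text."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return " ".join(p.get_text(" ", strip=True) for p in soup.find_all("p"))

st.title("Fact-Checking Tool for Public Health Claims")
mode = st.radio("Input type", ["Paste text", "Enter URL"])

if mode == "Paste text":
    article_text = st.text_area("Article text")
else:
    url = st.text_input("Article URL")
    article_text = scrape_article(url) if url else ""

if st.button("Check claims") and article_text:
    st.write(article_text[:500])  # placeholder: the real app would run the classifier here
```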
- -

Machine Learning Component

- -

We build our model to achieve two tasks: veracity prediction and relevant fact-check recommendation. The veracity prediction model is a classifier that takes in a text input and predicts it to be one of true, false, mixed and unproven with the corresponding probabilities. The model is trained on PUBHEALTH, an open source dataset of fact-checked health claims. The dataset contains 11.8k health claims, with the original texts and expert-annotated ground-truth labels and explanations. More details about the dataset can be found here.

- -

We first trained a baseline LSTM (Long Short-Term Memory network), a recurrent neural network that is widely used in text classification tasks. We fit the tokenizer and classification model from scratch using TensorFlow and Keras. We trained the model for 3 epochs using an embedding dimension of 32. With a very simple architecture, we were able to get decent results on the held-out test set (see Table 1).
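A minimal sketch of such a baseline, assuming only the embedding dimension of 32 and the 3 epochs mentioned above; the vocabulary size, sequence length, LSTM width, and the toy training examples are illustrative guesses:

```python
# Baseline LSTM classifier sketch; train_texts/train_labels stand in for the
# PUBHEALTH training split (labels 0-3 = true/false/mixed/unproven).
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences

train_texts = ["the claim is supported by cdc guidance",
               "miracle cure reverses disease overnight"]
train_labels = np.array([0, 1])
VOCAB_SIZE, MAX_LEN, NUM_CLASSES = 20000, 256, 4

tokenizer = Tokenizer(num_words=VOCAB_SIZE, oov_token="<unk>")
tokenizer.fit_on_texts(train_texts)
x_train = pad_sequences(tokenizer.texts_to_sequences(train_texts), maxlen=MAX_LEN)

model = tf.keras.Sequential([
    layers.Embedding(VOCAB_SIZE, 32, input_length=MAX_LEN),
    layers.LSTM(64),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(x_train, train_labels, epochs=3)
```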

- -

In the next iterations, we improved on the baseline model by leveraging the state-of-the-art language model DistilBERT with the Hugging Face API. Compared to the LSTM, which reads the sequence in one direction, BERT makes use of the Transformer, which encodes the entire sequence at once and is thus able to learn word embeddings with deeper context. We used the lightweight pretrained model DistilBERT (a distilled version of BERT; more details can be found here) and fine-tuned it on the same training dataset. We trained the model for 5 epochs using 500 warm-up steps and a weight decay of 0.02. All prediction metrics improved on the test set. The DistilBERT model takes 5x longer to train; however, at inference time, both models are fast at generating online predictions.
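A hedged sketch of that fine-tuning setup with the Hugging Face `Trainer`. Only the 5 epochs, 500 warm-up steps, and 0.02 weight decay come from the description above; the dataset wrapper, batch size, and base checkpoint are assumptions:

```python
# Fine-tuning DistilBERT on PUBHEALTH-style (text, label) pairs.
import torch
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=4)  # true / false / mixed / unproven

class ClaimDataset(torch.utils.data.Dataset):
    def __init__(self, texts, labels):
        self.enc = tokenizer(texts, truncation=True, padding=True, max_length=512)
        self.labels = labels
    def __len__(self):
        return len(self.labels)
    def __getitem__(self, i):
        item = {k: torch.tensor(v[i]) for k, v in self.enc.items()}
        item["labels"] = torch.tensor(self.labels[i])
        return item

train_texts, train_labels = ["example claim text"], [0]  # stand-in for PUBHEALTH
args = TrainingArguments(output_dir="distilbert-pubhealth", num_train_epochs=5,
                         warmup_steps=500, weight_decay=0.02,
                         per_device_train_batch_size=16)
Trainer(model=model, args=args, train_dataset=ClaimDataset(train_texts, train_labels)).train()
```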

- -

- -

The supervised-learning approach is extremely efficient and has good precision for capturing signals of misinformation. However, the end-to-end neural network is a black box and its predictions are never perfect, so it is very unsatisfying for users to receive only a prediction without knowing how the model arrived at its decision. Additionally, users don't gain new information or become better informed from reading a single classifier result, which defeats the overall purpose of the application. Therefore, we implemented a relevant-claim recommendation feature to promote explainability and trust. Based on the user input, our app searches for claims in the training data that are similar to the input sentences. This provides two additional benefits: 1) users gain proxy knowledge of what kind of signals our classifier learned, and 2) users can read relevant health claims that have been fact-checked by reliable sources to better understand the subject matter.

- -

For implementation, we encode the data on a sentence level using Sentence-BERT. The top recommendations are generated by looking for the nearest neighbors in the embedding space. For each sentence in the user input, we look for most similar claims in the training dataset using cosine similarity. We returned the trigger sentence and most relevant claims with similarity scores above 0.8.
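A minimal sketch of this matching step with the `sentence-transformers` library; the specific encoder checkpoint is an assumption, since the report only specifies Sentence-BERT and the 0.8 threshold:

```python
# Return fact-checked claims whose cosine similarity to an input sentence
# exceeds the threshold, together with the triggering sentence and score.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint

def recommend_claims(input_sentences, fact_checked_claims, threshold=0.8):
    sent_emb = encoder.encode(input_sentences, convert_to_tensor=True)
    claim_emb = encoder.encode(fact_checked_claims, convert_to_tensor=True)
    scores = util.pytorch_cos_sim(sent_emb, claim_emb)  # (n_sentences, n_claims)
    matches = []
    for i, sentence in enumerate(input_sentences):
        for j, claim in enumerate(fact_checked_claims):
            if float(scores[i][j]) >= threshold:
                matches.append((sentence, claim, float(scores[i][j])))
    return matches
```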

- -

- -

System Evaluation

- -

We conducted offline evaluation on the Pubhealth held-out test set (n = 1235). The first table showed the overall accuracy of the two models. Since our task is multi-label classification, we are interested in the performance per each class, particularly in how discriminative our model is in flagging false information.

| Model | Accuracy (Overall) | F1 (False) |
|-------|--------------------|------------|
| LSTM | 0.667 | 0.635 |
| DistilBERT Fine Tuned | 0.685 | 0.674 |
- -

Table 1: Overall accuracy of the two models

| Label | F1 | Precision | Recall |
|-------|-----|-----------|--------|
| False (n = 388) | 0.674 | 0.683 | 0.665 |
| Mixed (n = 201) | 0.400 | 0.365 | 0.443 |
| True (n = 599) | 0.829 | 0.846 | 0.813 |
| Unproven (n = 47) | 0.286 | 0.324 | 0.255 |
- -

Table 2: F1, Precision and Recall per class of Fine-tuned BERT model

- -

Our overall accuracy isn't amazing, but it is consistent with the results reported in the original PUBHEALTH paper. There are several explanations:

-
    -
  1. Multi-label classification is inherently challenging, and since our class sizes are imbalanced, we sometimes suffer from poor performance in the minority classes.
  2. -
  3. Text embeddings themselves don’t provide rich enough signals to verify the veracity of content. Our model might be able to pick up certain writing styles and keywords, but they lack the power to predict things that are outside of what experts have fact-checked.
  4. -
  5. It is very hard to predict “mixed” and “unproven” (Table 2).
  6. -
- -

However looking at the breakdown performance per class, we observe that the model did particularly well in predicting true information, meaning that most verified articles aren’t flagged as false or otherwise. This is good because it is equally damaging for the model to misclassify truthful information, and thus make users trust our platform less. It also means that if we count mixed and unproven as “potentially containing false information”, our classifier actually achieved good accuracy on a binary label prediction task (>80%).
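The binary view mentioned above can be reproduced with a small helper that collapses the four labels into two before scoring; the label strings and toy predictions below are illustrative:

```python
# Collapse "false", "mixed" and "unproven" into a single "flagged" bucket and
# re-score the predictions as a binary task.
from sklearn.metrics import accuracy_score, classification_report

NOT_TRUE = {"false", "mixed", "unproven"}

def to_binary(labels):
    return ["flagged" if y in NOT_TRUE else "true" for y in labels]

y_true = ["true", "false", "mixed", "true", "unproven", "true"]
y_pred = ["true", "false", "false", "true", "mixed", "mixed"]

print(accuracy_score(to_binary(y_true), to_binary(y_pred)))
print(classification_report(to_binary(y_true), to_binary(y_pred)))
```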

- -

Some interesting examples

- -

In addition to system-level evaluation, we provide some interesting instances where the model did particularly well and poorly.

- -

Case 1 (Success, DistilBERT): False information, Model prediction: mixture, p = 0.975

- -

“The notion that the cancer industry isn’t truly looking for a ‘cure’ may seem crazy to many, but the proof seems to be in the numbers. As noted by Your News Wire, if any of the existing low-cost, natural and alternative cancer treatments were ever to be approved, then the healthcare industry’s cornerstone revenue producer would vanish within months. Sadly, it doesn’t appear that big pharma would ever want that to happen. The industry seems to be what is keeping us from a cure. Lets think about how big a business cancer has become over the years. In the 1940’s, before all of the technology and innovation that we have today, just one out of every 16 people was stricken with cancer; by the 70’s, that ratio fell to 1 in 10. Today, one in two males are at risk of developing some form of cancer, and for women that ratio is one in three.”

- -

This is an example of a very successful prediction. The above article leveraged correct data to draw false conclusions. For example, the claim that the cancer rate has increased is true information that was included in the training database, but the writing itself is misleading. The model did a good job of predicting mixture.

- -

Case 2 (Failure, DistilBERT): False information, Model prediction: true, p = 0.993

- -

“WUHAN, China, December 23, 2020 (LifeSiteNews) – A study of almost 10 million people in Wuhan, China, found that asymptomatic spread of COVID-19 did not occur at all, thus undermining the need for lockdowns, which are built on the premise of the virus being unwittingly spread by infectious, asymptomatic people. Published in November in the scientific journal Nature Communications, the paper was compiled by 19 scientists, mainly from the Huazhong University of Science and Technology in Wuhan, but also from scientific institutions across China as well as in the U.K. and Australia. It focused on the residents of Wuhan, ground zero for COVID-19, where 9,899,828 people took part in a screening program between May 14 and June 1, which provided clear results as to the possibility of any asymptomatic transmission of the virus.”

- -

This is a case of the model failing completely. We suspect that this is because the article is written very appropriately, and quoted prestigious scientific journals, which all made the claim look legitimate. Given that there is no exact similar claim matched in the training data, the model tends to classify it as true.

- -

Slice analysis

- -

We performed an analysis of the LSTM model performance on various testing dataset slices. Our rationale for doing these experiments was that the LSTM likely makes a number of predictions based on writing style or similar semantics rather than the actual content. Thus, it is very possible that an article written in a "non-standard" style but containing True information might be predicted to be False. Our slices, which included word count, percentage of punctuation, average sentence length, and date published, were intended to be style features that might help us learn more about our model's biases.
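A sketch of how one such style slice can be built and scored per class; the column names are assumptions about how the test set predictions might be stored:

```python
# Split the test set on word count and report per-class accuracy for each side.
import pandas as pd

def evaluate_word_count_slices(df: pd.DataFrame, split_words: int = 500):
    """Expects columns 'text', 'label' and 'prediction'."""
    df = df.assign(word_count=df["text"].str.split().str.len())
    for name, part in {"lower": df[df["word_count"] <= split_words],
                       "upper": df[df["word_count"] > split_words]}.items():
        correct = (part["label"] == part["prediction"])
        print(f"{name} split (n={len(part)}):")
        print(correct.groupby(part["label"]).mean().round(3))
```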

- -

Here we would like to highlight an example of the difficulty in interpreting the results of a slice based analysis for a multi-class problem. In this example, we slice the dataset by word count and create two datasets corresponding to whether the articles contain more than or less than 500 words. We found that the accuracy for the shorter articles was 0.77 while the accuracy for the larger articles was 0.60. Although this seems like a large difference in performance, there are some hidden subtleties that are worth considering further. In Table 3, we show the per-class performance for both splits as well as the number of samples in each split. Here, it is clear to see that class distributions of the two datasets are quite different, making a fair comparison challenging. For instance, it is likely that we do well on the lower split dataset because it contains a large fraction of True articles which is the class which is best predicted by our model.

| Labels | Lower Split Accuracy | Lower Split Nsamples | Upper Split Accuracy | Upper Split Nsamples |
|--------|----------------------|----------------------|----------------------|----------------------|
| False | 0.7320 | 97 | 0.7526 | 291 |
| Mixture | 0.2041 | 49 | 0.2368 | 152 |
| True | 0.8810 | 311 | 0.7118 | 288 |
| Unproven | 0.3636 | 11 | 0.0588 | 34 |
- -

Table 3: Slice analysis on word count

- -

Similarity Matching

- -

To evaluate the quality of similarity matching, one proxy is to look at the cosine similarity score of the recommended claims. Since we only returned those with similarity scores of more than 0.8, the matching results should be close to each other in the embedding space. However it is less straightforward to evaluate the embedding quality. For the scope of this project, we did not conduct systematic evaluation of semantic similarities of the recommended claims. But we did observe empirically that the recommended claims are semantically relevant to the input article, but they don’t always provide correction to false information. We provide one example in our app demonstration section.

- -

Application Demonstration

- -

To serve users, we opted to create a web application for deployment. We settled on this choice as it enabled a highly interactive and user friendly interface. In particular, it is easy to access the website URL via either a phone or a laptop.

- -

- -

There are three core tabs in our streamlit web-application: Fake News Prediction, Similarity Matching and Testing.

- -

Fake News Prediction Tab

- -

The Fake News Prediction tab allows the user to make predictions as to whether a news article contains false claims (“False”), true claims (“True”), claims of unproven veracity (“Unproven”) or claims which have both true and false elements (“Mixed”). Below, we show an example prediction on text from the following article: Asymptomatic transmission of COVID-19 didn’t occur at all, study of 10 million finds. Here, our LSTM model correctly identifies that this article contains false claims!

- -
- -
- -

Similarity Matching Tab

- -

The Similarity Matching Tab compares sentences in a user input article to fact checked claims made in the PUBHEALTH fact-check dataset. Again we allow users the option of being able to enter either a URL or text. The following video demonstrates the web-app usage when provided the URL corresponding to the article: Study: Covid-19 deadlier than flu. Here, it is clear that the model identifies some relevant claims made in the article including the number of deaths from covid, as well as comparisons between covid and the flu.

- -
- -
- -

Testing Tab

- -

Finally, our “Testing” tab allows users to see the impact of choosing different PUBHEALTH testing slices on the performance of the baseline LSTM model. For this tab, we allow the user to select the break point for the split. For instance, for the word-count slicing type, if we select 200, we create two datasets: one with only articles shorter than 200 words and another with only articles longer than 200 words. Check out the video below for a demo of slicing the dataset on the punctuation condition!

- -
- -
- -

Reflection

- -

Overall, our project was quite successful as a proof of concept for the design of a larger ML-driven fact-checking platform. We succeeded in developing two models (LSTM and DistilBERT) that can reasonably detect fake news on a wide range of user articles. We achieved promising results on a held-out testing set and found that our model was relatively stable across some common dataset slices. Furthermore, for some inputs, our Sentence-BERT model was able to detect claims in the article which were similar to those contained within our training set. We were also able to allocate work and integrate seamlessly among our three team members. Although all members contributed significantly to each part of the project, Alex focused on model training and validation while Vivek and Sathya focused on the UI and deployment. Despite the successes of this project, there are several things that either don't work or need improvement.

- -

One major area for improvement is the sentence claim matcher. Currently, when an article makes claims that fall outside the distribution of the training dataset, there are no relevant training claims to match against. This is inherently due to the lack of training data needed for these types of applications. To address this issue, it would be useful to periodically scrape fact-checked claims and evidence from websites such as Snopes to keep the database up to date and growing. Additionally, we could incorporate user feedback into our database after it has been reviewed by us or an external fact-checking group.

- -

Another issue is that we have two separate features, one where the veracity of an article is predicted based primarily on style (LSTM and DistilBERT models), and one where we attempt to extract the content by matching with fact checked claims. An ideal model would be able to combine style and content. Additionally, the claims that we can match sentences to are limited by the data in our training set.

- -

Another improvement we could make pertains to the testing tab. Currently we output the per-class accuracy, but we could additionally output several figures such as histograms and confusion matrices. Better visualization will help users understand quickly how the models perform on different slices.

- -

Broader Impacts

- -

Fake news poses a tremendous risk to the general public. With the high barrier required to fact check certain claims and articles we hope that this project will start to alleviate some of this burden from casual internet users and help people better decide what information they can trust. Although this is the intended use case of our project, we recognize that there is potential harm that can arise from the machine learning models predicting the wrong veracity for some articles. One can easily imagine that if our model predicts that an article has true information, but it is actually fake news this would only cause the user to further believe in the article. To try to mitigate this type of issue, we used the sentence claim matching algorithm where article sentences can be matched to fact-checked claims. If this approach is done correctly the user will in theory have access to training claims that are similar to those in the article and the label associated with the training claims. In addition, we chose to include a tab which showed how our model performed on different slices of the testing data. We believe showing this type of data to users could be a very useful tool for harm mitigation as it allows the users to more fully assess potential biases in the models. At the end of the day because these models are all imperfect we include a disclaimer that these predictions are not a substitute for professional fact-checking.

- -

Contributions

- -

All members contributed significantly to each part of the project. Alex focused more on model training and development. Vivek and Sathya focused more on UI and deployment. We gratefully acknowledge helpful discussions and feedback from Chip Huyen and Xi Yan throughout the project! In addition, special thanks to Navya Lam and Callista Wells for helping find errors and bugs in our UI.

- -

References

- -

We referred to the following models to guide ML model development:

- -
    -
  • Sanh, V., Debut, L., Chaumond, J. and Wolf, T., 2019. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108.
  • -
  • Reimers, N. and Gurevych, I., 2019. Sentence-bert: Sentence embeddings using siamese bert-networks. arXiv preprint arXiv:1908.10084.
  • -
- -

We used data from the following dataset:

- -
    -
  • Kotonya, N. and Toni, F., 2020. Explainable automated fact-checking for public health claims. arXiv preprint arXiv:2010.09926.
  • -
- -

We found the following tutorials and code very helpful for model deployment via Streamlit/Docker/App Engine.

- -
    -
  • Jesse E. Agbe (2021), GitHub repository, https://github.com/Jcharis/Streamlit_DataScience_Apps
  • -
  • Daniel Bourke (2021), GitHub repository https://github.com/mrdbourke/cs329s-ml-deployment-tutorial
  • -
- -

Core technologies used:

- -
    -
  • Tensorflow, Pytorch, Keras, Streamlit, Docker, Google App Engine
  • -
- - -
-
- -
- - - -
-
- -
-
-
- -
- -
- - - - - diff --git a/_site/README.md b/_site/README.md index 791857e..e823a48 100644 --- a/_site/README.md +++ b/_site/README.md @@ -1,8 +1,4 @@ -# CS 329S (Winter 2021) Final Project Reports - -We recommend that you write your report in Google Docs then migrate it Markdown. I find the migration fairly straightforward, and you can also use the [Docs to Markdown add-on](https://workspace.google.com/marketplace/app/docs_to_markdown/700168918607) to automatically convert Docs to Markdown. - -Once you've had your post in Markdown, create a pull request to add your post to the course's website. +# Template Residência em TIC BRISA - UnB FGA 1. Fork this repository. 2. Clone it to your local machine. @@ -16,6 +12,3 @@ Once you've had your post in Markdown, create a pull request to add your post to You might need to install Jekyll. If you're note familiar with Jekyll, you can find [Jekyll's installation instructions here](https://docs.github.com/en/github/working-with-github-pages/testing-your-github-pages-site-locally-with-jekyll). - -Let us know if you have any question! - diff --git a/_site/assets/css/main.css b/_site/assets/css/main.css index 3ccc8f4..26db355 100755 --- a/_site/assets/css/main.css +++ b/_site/assets/css/main.css @@ -773,7 +773,7 @@ html { } body { - font-family: 'Lato', sans-serif; + font-family: 'Source Sans Pro', sans-serif; color: #515151; background-color: #fbfbfb; margin: 0; @@ -933,7 +933,7 @@ table tfoot td { width: 240px; height: 100%; padding: 20px 10px; - background-color: #ffffff; + background-color: #3056a4; } .about { @@ -948,7 +948,7 @@ table tfoot td { -webkit-border-radius: 100%; border-radius: 100%; overflow: hidden; - background-color: #333030; + background-color: #ffffff; } .about img { @@ -975,7 +975,7 @@ table tfoot td { padding-bottom: 15px; font-size: 16px; text-transform: uppercase; - color: #333030; + color: #ffffff; font-weight: 700; } @@ -992,11 +992,11 @@ table tfoot td { height: 7px; -webkit-border-radius: 100%; border-radius: 100%; - background-color: #515151; + background-color: #ffffff; } .about p { - font-size: 16px; + font-size: 22px; margin: 0 0 10px; } @@ -1023,7 +1023,7 @@ table tfoot td { .contact .contact-title { position: relative; - color: #333030; + color: #ffffff; font-weight: 400; font-size: 12px; margin: 0 0 5px; @@ -1043,7 +1043,7 @@ table tfoot td { position: absolute; top: 50%; left: 0; - background-color: #515151; + background-color: #ffffff; } .contact .contact-title::after { @@ -1058,7 +1058,7 @@ table tfoot td { position: absolute; top: 50%; right: 0; - background-color: #515151; + background-color: #ffffff; } .contact ul { @@ -1078,7 +1078,7 @@ table tfoot td { } .contact ul li a { - color: #515151; + color: #ffffff; display: block; padding: 5px; font-size: 18px; @@ -1088,7 +1088,7 @@ table tfoot td { } .contact ul li a:hover { - color: #333030; + color: #ffffff; -webkit-transform: scale(1.2); -ms-transform: scale(1.2); transform: scale(1.2); @@ -1171,10 +1171,14 @@ footer .copyright { margin-top: 0; } +.page-title { + color: #ffffff; +} + a.older-posts, a.newer-posts { font-size: 18px; display: inline-block; - color: #515151; + color: #ffffff; -webkit-transition: -webkit-transform .2s; transition: -webkit-transform .2s; -o-transition: transform .2s; diff --git a/_site/assets/css/scss/main.scss b/_site/assets/css/scss/main.scss index a566e74..8daa721 100755 --- a/_site/assets/css/scss/main.scss +++ b/_site/assets/css/scss/main.scss @@ -11,7 +11,7 @@ html { } body { - font-family: 'Lato', sans-serif; + font-family: 'Source 
Sans Pro', sans-serif; color: $body-color; background-color: #fbfbfb; margin: 0; diff --git a/_site/assets/img/lappis.png b/_site/assets/img/lappis.png new file mode 100644 index 0000000..e13874e Binary files /dev/null and b/_site/assets/img/lappis.png differ diff --git a/_site/context-graph-generator/index.html b/_site/context-graph-generator/index.html deleted file mode 100644 index 6dc785e..0000000 --- a/_site/context-graph-generator/index.html +++ /dev/null @@ -1,415 +0,0 @@ - - - - - Building a Context Graph Generator - CS 329S Winter 2021 Reports - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
-
-
- - -
-
-

Building a Context Graph Generator

-
2021, Mar 18    
-
-

The Team

-
    -
  • Manan Shah
  • -
  • Lauren Zhu
  • -
  • Ella Hofmann-Coyle
  • -
  • Blake Pagon
  • -
- -

Problem Definition

- - -

Get this—55% of users read online articles for less than 15 seconds. The general problem of understanding large spans of content is painstaking, with no efficient solution.

- -

Current attempts rely on abstractive text summarization, which simply shortens text to its most relevant sentences and often obscures the core components that make writing individual and meaningful. Other methods offer a smart search over articles, a sort of intelligent command+F. But that forces the user not only to look at small bits and pieces of content at a time, but also to know what they're looking for ahead of time.

- -

What if, rather than engaging in this lengthy process of understanding content manually, people can leverage a tool that generates analytical, concept graphs over text? These graphs would provide the first-ever standardized, structured, and interpretable medium to understand and draw connections from large spans of content. We created such an interface where users are able to visualize and interact with the key concepts over swaths of text in a meaningful and unique way. Users can even compare and contrast concept graphs between multiple articles with intersecting concepts, to further their analysis and understanding.

- -

So with users spending less time on individual websites, we make it easier for them to absorb the key components of the content they want in an interactive graphical representation. Specifically, we enable readers to:

-
    -
  • Grasp the main concepts presented in an article
  • -
  • Understand the connections between topics
  • -
  • Draw insights across multiple articles fluidly and intuitively
  • -
- -

System Design

- -

Graph Generation

- -

- -

With the goal of making information digestion faster, we designed a concept graph generator that includes relevant information on a text’s main concepts and visually shows how topics relate to one another. To achieve this, our graph outputs nodes that represent article concepts, and edges that represent links between the concepts.

- -

We use Streamlit for the front end interface and a custom version of the Streamlit graphing libraries to display our graphs. The resulting graphs are interactive—users can move the graph and individual nodes as well as zoom in and out freely, or click a node to receive a digest of the topic in textual context.

- -

Additionally, we provide users with a threshold slider that allows users to decide how many nodes/connections they want their graph to provide. This customization doubles as an optimization for the shape and density of the graph. How this works is that connections between nodes are determined by a similarity score between the nodes (via cosine similarity on the word embeddings). A connection is drawn between two topics if the score is above the threshold from the slider. This means that as the slider moves further to the left, the lower threshold makes the graph generate more nodes, and the resulting graph would be more dense.
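A minimal sketch of how the slider value can translate into graph edges, assuming topic embeddings have already been computed; the NumPy-based helper below is illustrative rather than the project's actual implementation:

```python
# Connect any two topics whose embedding cosine similarity clears the
# user-selected threshold.
import numpy as np

def build_edges(topics, embeddings, threshold=0.65):
    """embeddings: (n_topics, dim) array aligned with `topics`."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T
    return [(topics[i], topics[j], float(sims[i, j]))
            for i in range(len(topics))
            for j in range(i + 1, len(topics))
            if sims[i, j] >= threshold]

# In Streamlit, the threshold could come from a slider, e.g.
# threshold = st.slider("Connection threshold", 0.0, 1.0, 0.65)
```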

- -

Working with Multiple Graphs

- -

Beyond generating graphs from a single text source, users can combine graphs they have previously generated to see how concepts from several articles interrelate. In the below example, we see how two related articles interact when graphed together. Here we have one from the Bitcoin Wikipedia page and the other from the Decentralized Finance page. We can distill from a quick glance that both articles discuss networking, bitcoin, privacy, blockchain and currency concepts (as indicated by green nodes), but diverge slightly as the Bitcoin article focuses on the system specification of Bitcoin and the Decentralized Finance article talks more about impacts of Bitcoin on markets. The multi-graph option allows users to not only assess the contents of several articles all at once with a single glance, but also reveals insights on larger concepts through visualizing the interconnections of the two sources. A user could use this tool to obtain a holistic view on any area of research they want to delve into.

- -

- -

Visual Embedding Generation

- -

- -

An additional feature our system provides is a tool to plot topics in 2D and 3D space, offering a new way of representing topic relations. Even better, we use the Plotly library to make these plots interactive! The embedding tool simply takes the embedding that corresponds to each topic node in our graph and projects it into 2D space. Topic clustering indicates high similarity or strong relationships between those topics, while large distances between topics indicate dissimilarity. The same logic applies to the 3D representation; we give users the ability to upgrade their 2D plots to 3D, if they're feeling especially adventurous.
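A hedged sketch of that projection step with scikit-learn and Plotly. PCA is consistent with the PCA-based embedding mentioned later in the demo section, but the helper name and arguments are assumptions:

```python
# Project topic embeddings to 2D or 3D with PCA and plot them interactively.
import numpy as np
import plotly.express as px
from sklearn.decomposition import PCA

def plot_topic_embeddings(topics, embeddings, dims=2):
    coords = PCA(n_components=dims).fit_transform(np.asarray(embeddings))
    if dims == 2:
        fig = px.scatter(x=coords[:, 0], y=coords[:, 1], text=topics)
    else:
        fig = px.scatter_3d(x=coords[:, 0], y=coords[:, 1], z=coords[:, 2], text=topics)
    return fig  # in Streamlit: st.plotly_chart(fig)
```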

- -

Deployment and Caching

- -

We deployed our app on the Google Cloud Platform (GCP) via Docker. In particular, we sent a cloud built docker image to Google Cloud, and set up a powerful VM that launched the app from that Docker image. For any local updates to the application, redeploying was quite simple, requiring us to rebuild the image using Google Cloud Build and point the VM to the updated image.

- -

To speed up performance of our app, we cache graphs globally. Let’s say you are trying to graph an article about Taylor Swift’s incredible Folklore album, but another user had recently generated the same graph. Our caching system ensures that the cached graph would be quickly served instead of being re-generated, doing so by utilizing Streamlit’s global application cache. Our initial caching implementation resulted in User A’s generated and named graphs appearing in a User B’s application. To fix this, we updated each user’s session state individually instead of using one global state over all users, therefore preventing User A’s queries from interfering with User B’s experience.
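An illustrative sketch of that two-level approach: the expensive text-to-graph computation is cached globally, while named graphs live in per-user session state. The decorators shown and the `expensive_text_to_graph` helper are assumptions, and the session-state API depends on the Streamlit version:

```python
import streamlit as st

@st.cache(allow_output_mutation=True)        # shared across all users and sessions
def generate_graph(text: str):
    return expensive_text_to_graph(text)     # placeholder for the real pipeline

if "saved_graphs" not in st.session_state:   # private to the current user session
    st.session_state.saved_graphs = {}

text = st.text_area("Paste text to graph")
name = st.text_input("Save graph as", value="my-graph")
if st.button("Generate") and text:
    st.session_state.saved_graphs[name] = generate_graph(text)
```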

- -

Our Backend: Machine Learning for Concept Extraction and Graph Generation

- - -

Our concept graph generation pipeline is displayed in Figure 1. Users are allowed to provide either custom input (arbitrarily-formatted text) or a web URL, which we parse and extract relevant textual information from. We next generate concepts from that text using numerous concept extraction techniques, including TF-IDF and PMI-based ranking over extracted n-grams: unigrams, bigrams, and trigrams. The resulting combined topics are culled to the most relevant ones, and subsequently contextualized by sentences that contain the topics. Finally, each topic is embedded according to its relevant context, and these final embeddings are used to compute (cosine) similarities. We then define edges among topics with a high enough similarity and present these outputs as a graph visualization. Our primary machine intelligence pipelines are introduced in (1) our TF-IDF concept extraction of relevant topics from the input text and (2) our generation of BERT embeddings of each topic using the contextual information of the topic within the input text.

- -

- -

Pipeline Illustration: A diagram of our text-to-graph pipeline, which uses machine intelligence models to extract concepts from an arbitrary input span of text.

- -

Our concept extraction pipeline started with the most frequent unigrams and bigrams present in the input text, but we soon realized that doing so populated our graph with meaningless words that had little to do with the article and instead represented common terms and phrases broadly used in the English language. Although taking stopwords into account and further ranking bigrams by their pointwise mutual information partially resolved this issue, we were unable to consistently obtain concepts that accurately represented the input. We properly resolved this issue by pre-processing a large Wikipedia dataset consisting of 6 million examples to extract “inverse document frequencies’’ for common unigrams, bigrams, and trigrams. We then rank each topic according to its term frequency-inverse document frequency (TF-IDF) ratio, representing the uniqueness of the term to the given article compared to the overall frequency of the term in a representative sample of English text. TF-IDF let us properly select topics that were unique to the input documents, significantly improving our graph quality.
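A simplified sketch of that ranking step, assuming the IDF lookup table has been precomputed offline from the Wikipedia dump; the function and argument names are illustrative:

```python
# Score candidate n-grams by their term frequency in the article times their
# precomputed inverse document frequency, then keep the top-k topics.
from collections import Counter

def rank_topics(candidate_ngrams, idf_table, default_idf=10.0, top_k=20):
    counts = Counter(candidate_ngrams)
    total = sum(counts.values())
    scores = {g: (counts[g] / total) * idf_table.get(g, default_idf)
              for g in counts}
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```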

- -

To embed extracted topics, we initially used pre-trained embeddings from GloVe and word2vec. Both of these algorithms embed words using neural networks trained on context windows that place similar words close to each other in the embedding space. A limitation with these representations is that they fail to consider larger surrounding context when making predictions. This was particularly problematic for our use-case, as using pre-trained context-independent word embeddings would yield identical graphs for a set of concepts. And when we asked users to evaluate the quality of the generated graphs, the primary feedback was that the graph represented abstract connections between concepts as opposed to being drawn from the text itself. Taking this into account, we knew that the graphs we wanted should be both meaningful and specific to their input articles.

- -

In order to resolve this issue and generate contextually-relevant graphs, we introduced a BERT embedding model that embeds each concept along with its surrounding context, producing an embedding for each concept that was influenced by the article it was present in. Our BERT model is pre-trained on BookCorpus, a dataset consisting of 11,038 unpublished books and English Wikipedia (excluding lists, tables and headers). We used embeddings from the final layer of the BERT model—averaged across all WordPiece-split tokens describing the input concept—to create our final 1024-dimensional embeddings for each concept. We implemented caching mechanisms to ensure that identical queries would have their associated embeddings and adjacency matrices cached for future use. This improves the efficiency of the overall process and even guarantees graph generation completes in under 30 seconds for user inputs of reasonable length (it’s usually faster than that).
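A rough sketch of that contextual embedding step with the `transformers` library. The `bert-large-uncased` checkpoint matches the 1024-dimensional vectors mentioned above, but the span-locating logic here is a simplified assumption:

```python
# Embed a concept by averaging the final-layer vectors of its WordPiece tokens
# inside the surrounding context.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased")
model = AutoModel.from_pretrained("bert-large-uncased")

def embed_concept(concept: str, context: str) -> torch.Tensor:
    enc = tokenizer(context, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**enc).last_hidden_state[0]            # (seq_len, 1024)
    concept_ids = tokenizer(concept, add_special_tokens=False)["input_ids"]
    ids = enc["input_ids"][0].tolist()
    for start in range(len(ids) - len(concept_ids) + 1):      # naive span search
        if ids[start:start + len(concept_ids)] == concept_ids:
            return hidden[start:start + len(concept_ids)].mean(dim=0)
    return hidden.mean(dim=0)                                 # fallback: whole context
```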

- -

System Evaluation

- -

Since we are working with unstructured data and unsupervised learning, we had to be a little more creative in how we evaluated our model’s performance. To start, we created a few metrics to gather for generated graphs that would help us better quantify the performance of our system. The metrics include:

- -
    -
  • The time to generate a novel, uncached graph
  • -
  • The number of nodes and edges generated, along with average node degree
  • -
  • Ratings of pop-up context digest quality
  • -
  • A graph-level label of whether the graph is informative overall
  • -
  • The number of generated topics that provide no insight
  • -
  • The number of topics that are substrings of another topic
  • -
- -

When designing metrics to track, our main goal was to capture the utility of our app to users. The runtime of graph generation is paramount, as users can easily grow impatient with wait times that are too long. The number of nodes shows how many topics we present to the user, the number of edges indicates how many connections our tool is able to find, and the average degree captures the synergy between those two. The pop-up context digests can either provide useful or irrelevant additional information. Having a general sense for the overall quality of information in graphs is important to note. Nodes generated based on irrelevant topics waste users’ time, so we want to minimize that. Similarly, nodes with topics that are substrings of other topics in the graph are also unwanted, as they indicate redundant information in our graph.

- -

With our metrics defined, we began generating graphs and annotating them by hand. We found that the average graph takes 20.4 seconds to generate and has 13.27 nodes and 13.55 edges, leading to an average node degree of 1.02. Overall, we are happy with the graph generation time that we measured — 20 seconds is a reasonable expectation for our users, especially considering we are not currently using a GPU. On average, we found that the graphs were informative 68% of the time. The times that they were not were caused either by too high of a threshold or poor topic generation. In particular, we noticed that performance was poor on articles that covered many different areas of a topic, such as articles discussing matchup predictions for March Madness. While the overarching theme of college basketball was the main focus of those articles, they discussed many different teams, which led the model to have a tough time parsing out common threads, such as the importance of an efficient offense and lockdown defense on good teams, throughout the article.

- -

Our default graph generation uses a threshold of 0.65 for the cosine similarity between topics to form an edge between them. For reference, we also tested our graph generation with thresholds of 0.6 and 0.7 for the edge cosine similarity and found that they yielded an average node degree of 1.71 and 0.81, respectively. An average node degree of 1.71 is too high and floods the user with many frivolous connections between topics. An average node degree of 0.81, on the other hand, doesn’t show enough of the connections that actually exist between topics. Therefore, a threshold of 0.65, with an average node degree of 1.02, provides a nice balance between topics presented and the connections between them.

- -

As for the errors we were scanning for, we found that on average, 12.33% of nodes in every graph were topics that added nothing and 17.81% of nodes were simply substrings of another topic in the graph. Therefore, about 69.86% of the nodes that we present to users are actually relevant. This tells us that users on our site may spend some time sifting through irrelevant topics, which we hope to improve in the future. We additionally rated the quality of the contextual information displayed in each node’s pop-up digest window, and found that (on a scale of 0-1) our ratings averaged 0.71. This was largely caused by lack of sufficient filtering applied to the sentences displayed. Filtering and curation heuristics for these digests is another potential area of growth.

- -

Application Demonstration

- -

Interested users should visit our application at this URL, where they are presented with a clean, simple, and intuitive interface that allows them to either (a) input custom text to graph, (b) input a web URL to graph, or (c) generate a combined graph from two of their previously saved graphs. Upon entering custom text or a web URL, users are shown a progress bar estimating the time of graph generation (or an instant graph if the query has been cached from previous uses of the website). Generated graphs are interactive, allowing users to click on nodes to see the context in which they appear in the input document. We also present other modes of visualization, including a 2D and 3D PCA-based embedding of the concepts, which provide a different perspective of relationships between concepts. Users can also save graphs locally (to their local browser cache), and subsequently combine them to link concepts together across numerous distinct documents.

- -

Our team chose to use a web interface as it proved to be the most intuitive and straightforward way for users to provide custom input and interact with the produced graphs. We implemented our own customizations to the Streamlit default graphing library (in this fork) to enable enhanced interactivity, and we employed Streamlit to ensure a seamless development process between our Python backend and the frontend interface.

- -

Watch us demo our platform and give it a try!

- - - -

Reflection

- -

What worked well?

-

We had such a rewarding and exciting experience this quarter building out this application. From day one, we were all sold on the context graph idea and committed a lot of time and energy into it. We are so happy with the outcome that we want to continue working on it next quarter. We will soon reach out to some of the judges that were present at the demo but didn’t make it to our room.

- -

While nontrivial, building out the application was a smooth process for several reasons: technical decisions, use of Streamlit, great camaraderie. Let’s break these down.

- -

Our topic retrieval process is quite simple, using the highest-frequency n-grams weighted by PMI scores, with n = {1, 2, 3}. TF-IDF was a good addition to topic filtering (the graphs were more robust as a result), but because it was slower we added it as a checkbox option for the user. Sentence/context retrieval required carefully designed regular expressions, but proved to work incredibly efficiently once properly implemented. We then had to shape the topics and contexts correctly and, after passing them through BERT, compute cosine similarities. For displaying graphs, we utilized a Streamlit component called streamlit-agraph. While it had all the basic functionality we needed, there were things we wanted to add on top of it (e.g. clicking on nodes to display context), which required forking the repo and making custom changes on our end.

- -

Due to the nature of our project, it was pretty feasible to build out the MVP on Streamlit and iterate by small performance improvements and new features. This made individual contributions easy to line up with Git issues and to execute on different branches. It also helped that we have an incredible camaraderie already, as we all met in Stanford’s study abroad program in Florence in 2019.

- -

What didn’t work as well?

-

To be honest, nothing crazy here. We had some obscure bugs from BERT embeddings that would occur rarely but at random, as well as graph generation bugs if inputs were too small. We got around them with try/catch blocks, but could have looked into them with a little more attention.

- -

If we had unlimited time & unlimited resources…

-

Among the four of us, we made our best guesses as to what the best features would be for our application. Of course if time permitted, we could conduct serious user research about what people are looking for, and we could build exactly that. But apart from that, there are actual action items moving forward, discussed below.

- -

We wanted to create an accessible tutorial or perhaps some guides either on the website or an accessible location. This may actually no longer be necessary because we can point to the tutorial provided in this blog (see Application Demonstration). We saw in many cases that without any context of what our application does, users may not know what our app is for or how they could get the most out of it.

- -

On top of this, future work includes adding a better URL (i.e. contextgraph.io), including a Chrome extension, building more fluid topic digests in our pop-ups, and submitting a pull request to the streamlit-agraph component with our added functionality—in theory we could then deploy this for free via Streamlit.

- -

Broader Impacts

-

Context Graph Generator Impacts:

- -
    -
  • -

    Summarization: Our intuitive interface combined with robust graph generation enables users to understand large bodies of text with a simple glance.

    -
  • -
  • -

    Textual Insights: The extensive features we offer from multi-graphing to TF-IDF topic generation to context summarization for each node enables users to generate analysis and insights for their inquiries on the fly.

    -
  • -
- -

Our aim in creating this tool is to empower individuals to obtain the information they need with ease, so they are empowered to achieve their goals at work or in their personal lives. Whether the user has to synthesize large amounts of information for their business or simply seek to stay informed while on a busy schedule, our tool is here to help!

- -

When considering the ethical implications of such a tool, it becomes apparent that while a context graph largely positively impacts users, it’s important to consider how it could become a weapon of misinformation. When a user provides text for the graph generator to analyze, we do not perform fact checking of the provided text. We believe this is reasonable considering that our platform is an analysis tool. Additionally, because we are also operating only natively within our site and graphs are not shareable, there is no possibility of a generated graph object being shared to inform others (one could take a screenshot of the graph, however, most detailed information is embedded in the nodes’ pop-up). If we were to make graphs shareable or integrate our tool into other platforms, we run the risk of being a tool of misinformation if users were to share graphs that help people quickly digest information. As we continue to work on our platform, we will keep this scenario top of mind and work to find ways to prevent such an outcome.

- -

Contributions

-
    -
  • Blake: Worked on generating PCA projection plots from embeddings, saving graphs, graph combination, and Streamlit UI.
  • -
  • Ella: Worked on graph topic generation (primarily TF-IDF & data processing), reducing skew in embeddings of overlapping topics, and Streamlit UI
  • -
  • Lauren: Worked on graph topic generation, GCP deployment, and Streamlit UI
  • -
  • Manan: Worked on graph topic generation, embedding and overall graph generation, streamlit-agraph customization for node popup context digests, and Streamlit UI
  • -
- -

References

- - -

Our system was built using Streamlit, Plotly and HuggingFace’s BERT model. To deploy our system, we used Docker and GCP.

- -

We utilized the Tensorflow Wikipedia English Dataset for IDF preprocessing as well.

- - -
-
- -
- - - -
-
- -
-
-
- -
- -
- - - - - diff --git a/_site/covidbot-report/index.html b/_site/covidbot-report/index.html deleted file mode 100644 index bb4b329..0000000 --- a/_site/covidbot-report/index.html +++ /dev/null @@ -1,386 +0,0 @@ - - - - - CovidBot Project Report - CS 329S Winter 2021 Reports - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
- - - -
-
-
- - -
-
-

CovidBot Project Report

-
2021, Mar 17    
-
-

The Team

-
    -
  • Andra Fehmiu
  • -
  • Bernard Lange
  • -
- -

Problem Definition

- -

One year of the global coronavirus pandemic has led to 550,574 deaths and 30,293,016 cases in the United States so far. It has forced the government and authorities to implement various restrictions and recommendations to hinder its spread until a long-lasting solution is determined. In this constantly evolving political and news landscape, it has been challenging for people all over the world, including those in the U.S., to remain informed about Covid-related matters (i.e. Covid-19 symptoms, recommended actions and guidelines, nearest test and vaccination centers, active restrictions and rules, etc.). Medical call centers and providers have also been overloaded with volumes of individuals seeking reliable answers to their Covid-19-related questions and/or seeking guidance with their Covid-19 symptoms.

- -

To tackle the challenges that have arisen due to these unusual circumstances, we have decided to build CovidBot, a Covid-19 Chatbot. CovidBot provides easy access to the most up-to-date Covid-19 news and information to individuals in the U.S and, as a result, eases the burden of medical providers. CovidBot enables us to standardize the answers to the most prevalent Covid-19-related questions, such as What are Covid-19 symptoms? and How does Covid spread?, based on the information provided by WHO and CDC and provide them instantaneously and simultaneously to thousands of users seeking assistance. We have also added capabilities for handling user-specific queries that require personalized responses (i.e. When is my next test? When did I test positive?, etc.). Thus, CovidBot is able to answer both general, frequently-asked questions about COVID-19 and user-specific questions.

- -

Having come across multiple articles, such as the one by Harvard Business Review about hospitals using AI to battle Covid-19, it was apparent to us that there is a clear need for a CovidBot that could also be easily integrated and used by hospitals and medical centers around the U.S. While searching for available open-source code to build chatbots for Covid-19, we realized that the existing Covid question-answering models and chatbots were either limited in their capabilities and/or not accessible. For example, Deepset’s Covid question-answering model API [2] and UI were taken offline in June 2020. The Covid question-answering model deployed by students at Korea University [3] provides out-of-date Covid-related news and information. When we asked “What vaccines are available?”, we were given an answer containing a scholarly article from 2016 about the different types of vaccines in general (see Figure 1), as opposed to our chatbot’s QA model, which is able to provide an accurate and up-to-date answer to this question by listing the Pfizer and Moderna vaccines (see Figure 4). In addition, none of the Covid chatbots we came across have implemented the capabilities necessary to address user-specific queries and provide personalized responses.

- -
-
- -
-Figure 1. CovidAsk's response to the user query “What vaccines are available?” -
- -

The bot can be used to find the most up-to-date Covid-related information at the time of writing, can provide answers to personal or general questions, and can be easily integrated with various popular social platforms that people use to communicate (e.g. Slack, Facebook Messenger, etc.). The implementation behind the CovidBot is available at https://github.com/BenQLange/CovidBot.

- -

System Design

- -

Our general framework is visualized in Figure 2 and comprises the following modules. The Natural Language Modelling module handles queries and generates responses. The datasets used for training, and any general information used or stored during inference, are encapsulated in the Knowledge Base. The data-driven models and all the datasets are described in further detail in the Machine Learning Component section. All personal data, e.g. user-bot interaction history, personal information, and analytics, is stored in the Internal Data Storage. Finally, the Dialog Deployment Engine module enables interaction with our bot via popular messaging platforms such as Facebook Messenger and Slack. The deployment framework used is Google’s Dialogflow. We decided to use it for building our conversational interface due to its straightforward integration with our ML system via the webhook service and with various popular messaging platforms (e.g. Slack, Facebook Messenger, Skype, etc.). This, in turn, makes the deployment of the chatbot easier and makes CovidBot easy to use for our end users.
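A hedged sketch of the webhook side of that integration, using Flask. The route, the `generate_response` helper, and the routing logic are assumptions; only the Dialogflow fulfillment request/response shape is standard:

```python
# Minimal Dialogflow fulfillment webhook: read the user's query text and send
# back whatever the response-generation models produce.
from flask import Flask, jsonify, request

app = Flask(__name__)

@app.route("/webhook", methods=["POST"])
def webhook():
    req = request.get_json(force=True)
    user_query = req["queryResult"]["queryText"]   # Dialogflow ES request format
    answer = generate_response(user_query)         # placeholder: intent routing + QA models
    return jsonify({"fulfillmentText": answer})    # text relayed back to the messaging platform

if __name__ == "__main__":
    app.run(port=5000)
```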

- -
-
- -
-Figure 2. Overview of the CovidBot Architecture -
- -
-
- -
-Figure 3. General CovidBot Framework -
- -

Machine Learning Component

- -

Our general CovidBot framework is visualized in Figure 3. The CovidBot is powered by multiple ML models running simultaneously. Depending on the type of the query, whether it relates to personal or general COVID-19 information, different models are responsible for response generation.

- -

We build an intent classifier using the GPT-3 model thanks to the OpenAI Beta access. GPT-3 uses the Transformer framework and has 175B trainable parameters, 96 layers, 96 heads in each layer, each head with a dimension of 128. To successfully perform intent classification, GPT-3 requires only a few examples of correct classification. Depending on the intent of the user’s query, we either use a GPT-3 model to generate and extract personalized (user-specific) response or a Covid General QA model that uses either DialoGPT, RoBERTa, or GPT-3 to generate a response. If the query is personal, GPT-3 extracts the type of the information provided, e.g. I have a test on the 2nd of April, and stores it locally. If it is a question, e.g. When was my last negative test?, it loads locally stored information based on which GPT-3 generates the answer.
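A sketch of such a few-shot intent classifier, written against the 2021-era OpenAI Completion endpoint; the prompt examples and engine name are illustrative:

```python
import openai  # assumes openai.api_key has been configured

PROMPT = """Classify each message as "personal" or "general".

Message: What are Covid-19 symptoms?
Intent: general
Message: When is my next test?
Intent: personal
Message: How does Covid spread?
Intent: general
Message: {query}
Intent:"""

def classify_intent(query: str) -> str:
    response = openai.Completion.create(
        engine="davinci",
        prompt=PROMPT.format(query=query),
        max_tokens=3,
        temperature=0,
    )
    return response.choices[0].text.strip()
```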

- -

The answers to general COVID-19 questions are generated by the DialoGPT by default. However, we have also built in an additional capability to pick RoBERTa, or GPT-3. Although the GPT-3 model is a powerful text generation model, we can not fine-tune the model to our tasks and we have a limited number of input tokens. This limits the amount of knowledge about COVID-19 which is provided to the model making it inadequate for our task. For this reason, we build 2 additional models, namely RoBERTa and DialoGPT, that do not have these limitations.

- -

RoBERTa [5] is a retrained BERT model that builds on BERT’s language masking strategy and removes its next-sentence pretraining objective. We use the RoBERTa model fine-tuned on a SQuAD-style CORD-19 dataset provided by Deepset, which is publicly available on HuggingFace. After testing the model performance and inspecting the Covid QA dataset, we observed that many of the annotated examples contain non-Covid content, which is reflected in the poor performance of the Covid QA model. Because of this, we fine-tuned the RoBERTa model again using our custom dataset containing Covid-related FAQ pages from the CDC and WHO websites. Although far from ideal, the RoBERTa model results after this iteration were more reasonable, indicating the importance of a larger and higher-quality dataset in providing more robust answers. Another important observation is that, even with GPU acceleration, the RoBERTa Covid QA model is slow and would not be suitable for production as is. Thus, to reduce inference time, we implemented a retrieval-based RoBERTa model where the retriever scans through the documents in the training set and returns only a small number of them that are most relevant to the query. The retrieval methods considered are TF-IDF, Elastic Search, and DPR (all implemented using the Haystack package). However, even with the retrieval methods implemented, the model is still slower than the other models and requires further optimization to be deployed in production.
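As a rough illustration of the extractive QA step (not the full retrieval-augmented setup described above), the publicly available Deepset checkpoint referenced in the Notes section can be queried with the HuggingFace `pipeline` API; the context passage below is a made-up example rather than a scraped FAQ entry.

```python
from transformers import pipeline

# Deepset's RoBERTa checkpoint fine-tuned on SQuAD-style Covid data.
qa = pipeline("question-answering", model="deepset/roberta-base-squad2-covid")

# In the full system a retriever would first select the most relevant FAQ passages;
# here we pass a single hand-written context purely for illustration.
context = (
    "COVID-19 vaccines authorized for emergency use in the United States "
    "include the Pfizer-BioNTech and Moderna vaccines."
)
result = qa(question="What vaccines are available?", context=context)
print(result["answer"], result["score"])
```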

- -

DialoGPT model is based on the GPT-2 transformer [6] which uses masked multi-head self-attention layers on the web collected data [7]. It is a small 12-layer architecture and uses byte pair encoding tokenizer [8]. The model was accessed through HuggingFace. It applies a similar training regime as OpenAI’s GPT-2 where conversation generation is framed as language modelling task with all previous dialog turns concatenated and ended with the end-of-text token.
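For reference, a minimal sketch of multi-turn response generation with DialoGPT via HuggingFace is shown below; the checkpoint name and decoding parameters are illustrative assumptions rather than CovidBot's exact configuration.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/DialoGPT-small")
model = AutoModelForCausalLM.from_pretrained("microsoft/DialoGPT-small")

def reply(history_ids, user_message):
    """Append the user turn to the dialog history and generate the bot's next turn."""
    new_ids = tokenizer.encode(user_message + tokenizer.eos_token, return_tensors="pt")
    input_ids = new_ids if history_ids is None else torch.cat([history_ids, new_ids], dim=-1)
    # Dialog turns are concatenated and separated by the end-of-text token, as in the paper.
    history_ids = model.generate(input_ids, max_length=512, pad_token_id=tokenizer.eos_token_id)
    answer = tokenizer.decode(history_ids[:, input_ids.shape[-1]:][0], skip_special_tokens=True)
    return history_ids, answer

history = None
history, answer = reply(history, "What are the symptoms of Covid-19?")
print(answer)
```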

- -

To fine-tune the pre-trained DialoGPT and RoBERTa models, we build scraper functions that collect data from the CDC and WHO FAQ pages. Our custom Covid QA dataset has 516 examples of Covid-related questions and answers and both models’ performance improves noticeably after fine-tuning them with this dataset.
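The sketch below shows one way such a scraper could collect question-answer pairs from an FAQ page with `requests` and BeautifulSoup; the URL and the HTML selectors are placeholders, since real FAQ page structures differ and change over time.

```python
import requests
from bs4 import BeautifulSoup

def scrape_faq(url: str):
    """Collect (question, answer) pairs from a simple FAQ page. Selectors are hypothetical."""
    html = requests.get(url, timeout=30).text
    soup = BeautifulSoup(html, "html.parser")
    pairs = []
    # Assumes each FAQ entry is a heading followed by a paragraph; real pages need custom selectors.
    for question in soup.select("h3"):
        answer = question.find_next("p")
        if answer is not None:
            pairs.append((question.get_text(strip=True), answer.get_text(strip=True)))
    return pairs

faq_pairs = scrape_faq("https://example.org/covid-faq")  # placeholder URL
```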

- -

System Evaluation

- -

In order to evaluate the performance of our Covidbot system, we integrated each of the 3 response generation models into the messenger platform using Dialogflow and simulated multiple user-bot interactions per session. We validated the performance of our system by testing it using different types of queries; these queries include: semantically different queries, queries with different intents (personal vs. general) as well as queries that are both implicitly and explicitly related to Covid (e.g. ”implicit” queries are “What is quarantine?”, “Are vaccines available?” vs. “explicit” queries are “What are Covid-19 symptoms?”, “What is Covid-19?”).

- -

We also evaluated the latency and throughput of our system in generating responses for queries with different complexity levels and also in generating responses when multiple users are using it simultaneously.

- -

We also asked our peers to interact with the CovidBot and give us feedback based on the bot’s responses to their queries, and they were all satisfied with the performance of our bot. They thought the answers CovidBot gave were reasonable and the only remark they made was that the bot’s responses occasionally contained incomplete sentences, which is a limitation we are aware of and will work on improving for the next iteration.

- -

If we had more users testing the system and we had an environment that resembles more the real-time production environment then we would also analyze some user-experience metrics (i.e. the average number of questions asked, the total number of sessions that are simultaneously active), as well as bot-quality metrics (i.e. the most frequent responses given, percentage of fallback responses where the chatbot did not know the answer to a question). We would also integrate an optional CovidBot rating feature that uses “thumbs up/down” buttons in order to allow users to rate their experience using the system at the end of each session.

- -

Application Demonstration

- -

Figure 4. CovidBot Demonstration for Personal and General Covid Question-Answering

- -

In terms of the core interface decisions, we chose to build a chatbot through a messenger platform as a channel. We use Dialogflow, Google’s conversational interface designer, because it allows us to seamlessly integrate our ChatBot with different, popular messenger platforms and other types of deployment methods, such as Google Assistant.

- -

As can be seen in Figure 4, the latest version of our CovidBot is deployed on Slack and provides a visual interface that can appear on both desktop and mobile. This allows users to easily access the CovidBot without having to open their web browser and makes their user experience smoother. We assume a good amount of users are familiar with similar interactions to the ones they have with our CovidBot. The bot is initialized by asking the user about the model they want to use for response generation, giving them the freedom to pick and explore the models on their own. By default or if a user inputs an invalid model name, we use the DialoGPT model. After initializing a response generating model, we begin by asking our CovidBot more general Covid questions, such as: “What are the symptoms of Covid-19?”, “Are there vaccines available?”. For all questions, we receive satisfactory and up-to-date responses as shown in Figure 4. When the CovidBot identifies a personal statement, e.g. “I have a test on the 22nd of April”, it will store it locally and reply “Noted down!”. Based on the locally stored information, the bot is capable of answering personal questions, such as “When is my next test?”.

- -

Given that there is already a significant amount of Covid-related news and information on the web, we believe that deploying CovidBot is essential in this ever-changing Covid-19 landscape which can (and does) become overwhelming at times for a lot of people.

- -

As part of this project, we built an AI-driven bot because text generation is a difficult task, especially in this context, where the term “Covid” has many synonyms. Given the gravity of the Covid-19 pandemic and the need to spread accurate Covid-related information, it is highly important to build a model that is able to recognize, analyze, and understand a wide variety of queries and derive meaning from implications without relying solely on syntactic commands.

- -

Reflection

- -

We believe that we have achieved all our major objectives with the CovidBot framework. All models trained on the dataset scraped from the WHO and CDC websites outperformed our expectations both in terms of information accuracy, and inference time. They are also efficient enough to enable regular updates/re-trainings on a daily basis as more information becomes available. Model deployment with Google’s Dialogflow to Slack was also surprisingly easy making the bot easy to share. One of the issues which should be addressed is our reliance on GPT-3 provided by the OpenAI API Beta Access to perform intent classification and personal queries handling. However, we think that training both intent classification and personal response generation shouldn’t be more challenging than the general response generation achieved with DialoGPT and RoBERTa.

- -

We would like to thank CS329s course staff for advice during the development of the CovidBot and for the access to the OpenAI API Beta Access.

- -

Broader Impacts

- -

The intended uses of the CovidBot include getting the most up-to-date Covid-related news and receiving personal reminders about Covid-related matters (i.e. testing dates etc). However, we do not intend to have the CovidBot substitute doctors, which is why it is highly important for us to ensure that users understand that they should not be using the bot to seek for serious medical advice as it could have significant health consequences for the users. We have attempted to mitigate harms associated with this unintended use of the system by carefully picking the examples included in our custom Covid QA dataset, which come from trusted health organizations and agencies that also take precautions when answering FAQs in their website in order to prevent the same unintended uses as ours. As a concrete example, there is a publicly available dataset that includes examples of Covid-related conversations between patients and doctors, but we decided to not include it in our model fine-tuning step in order to mitigate the harms associated with having our bot respond like a doctor.

- -

In the future, we could perform analysis of the type of queries being inputted into our system and see if we can detect a pattern in how users interact with the bot. We could also implement features that are easy to notice (i.e. a disclaimer below the query bar) in order to remind users of the intended use cases of our CovidBot.

- -

Contributions

- -

Andra worked on data collection and preprocessing, the RoBERTa models, and integration of models for chatbot deployment using Dialogflow.

- -

Bernard worked on the DialoGPT models, GPT-3 integration and CovidBot system design.

- -

References

- -

[1] Wittbold, K., Carroll, C., Iansiti, M., Zhang, H. and Landman, A., 2021. How Hospitals Are Using AI to Battle Covid-19. [online] Harvard Business Review. Available at: <https://hbr.org/2020/04/how-hospitals-are-using-ai-to-battle-covid-19> [Accessed 19 March 2021].

- -

[2] Möller, T., Reina, A., Jayakumar, R. and Pietsch, M., 2020, July. COVID-QA: A Question Answering Dataset for COVID-19. In Proceedings of the 1st Workshop on NLP for COVID-19 at ACL 2020.

- -

[3] Lee, J., Yi, S.S., Jeong, M., Sung, M., Yoon, W., Choi, Y., Ko, M. and Kang, J., 2020. Answering questions on covid-19 in real-time. arXiv preprint arXiv:2006.15830.

- -

[4] Brown, T.B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A. and Agarwal, S., 2020. Language models are few-shot learners. arXiv preprint arXiv:2005.14165.

- -

[5] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L. and Stoyanov, V., 2019. Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692.

- -

[6] Zhang, Y., Sun, S., Galley, M., Chen, Y.C., Brockett, C., Gao, X., Gao, J., Liu, J. and Dolan, B., 2019. Dialogpt: Large-scale generative pre-training for conversational response generation. arXiv preprint arXiv:1911.00536.

- -

[7] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D. and Sutskever, I., 2019. Language models are unsupervised multitask learners. OpenAI blog, 1(8), p.9.

- -

[8] Sennrich, R., Haddow, B. and Birch, A., 2015. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909.

Notes

1. https://hbr.org/2020/04/how-hospitals-are-using-ai-to-battle-covid-19
2. https://github.com/deepset-ai/COVID-QA
3. https://github.com/dmis-lab/covidAsk
4. https://ai.facebook.com/blog/roberta-an-optimized-method-for-pretraining-self-supervised-nlp-systems/
5. https://allenai.org/data/cord-19
6. https://huggingface.co/deepset/roberta-base-squad2-covid
diff --git a/_site/dashcam-data-valuation/index.html b/_site/dashcam-data-valuation/index.html
deleted file mode 100644
index 7c24b1a..0000000
--- a/_site/dashcam-data-valuation/index.html
+++ /dev/null

An active data valuation system for dashcam data crowdsourcing

-
2021, Mar 19    
-
- - -

https://cs329s.aimeup.com

- -

The Team

- Soheil Hor, soheilh@stanford.edu
- Sebastian Hurubaru, hurubaru@stanford.edu

Problem definition

- -

Data diversity is one of the main practical issues that limits the ability of machine learning algorithms to generalise well to unseen test cases in industrial settings. In scenarios like data-driven perception in autonomous cars, this issue translates to acquiring a diverse train and test set covering different roads and traffic scenarios. On the other hand, the increased availability and reduced cost of HD cameras have resulted in drivers opting to install inexpensive cameras (dashcams) on their cars, creating the potential for a virtually infinite source of diverse training data for autonomous driving applications. This data availability is ideal from a machine learning engineer’s point of view, but the costs of data transfer, storage, clean-up and labeling limit the success of such uncontrolled data-crowdsourcing approaches. More importantly, the data owners might prefer not to send all of their data to the cloud because of privacy concerns. We propose a local unsupervised dataset evaluation system that can prioritize the samples needed for training a centralized model without uploading every sample to the cloud, thereby eliminating the costs of data transfer and labeling directly at the source.

- -

System design

- -
-
- -
Block diagram of the proposed system
- -

As explained above, our goal is to optimise the training set of an ML model through distributed data valuation. One of the well-known approaches to this problem is to prioritise samples based on their corresponding model uncertainty. Our proposed approach uses a local “loss-predictor network” to quantify the value of each sample at each client before it is transmitted to a central server. The proposed system consists of two main modules: the centralized server and the local data source clients. Please see Figure 1 for more details.

- -

Module 1: The Centralized server

- -

The goal of the server module is to:

- -
1. Gather data from different data sources (clients)
2. Retrain and update the backbone model based on the updated training set (for labeled data)
3. Train the loss prediction module based on the updated backbone model
4. Transmit the weights of the updated loss prediction module to each client
- -

Module 2: Local data source clients

- -

The goal of each client is to:

- -
1. Estimate the backbone model’s loss for each local sample using the local loss-prediction model
2. Select the most valuable (valid) samples based on the predicted loss
3. Transmit the selected samples to the centralized module
- -

In order to make the system available to users, we chose AWS as our cloud platform. Once a user decides to upload any data with us, it gets stored on AWS S3. To deal with the concurrency issues that arise when multiple users share data with us, we created a scheduler using AWS CloudWatch that triggers the online learning at a specified time interval. The centralized server, which performs the online learning, was implemented as a Lambda function configured with a Docker image runtime. By using the scheduler and allowing only one instance of the online training at any time, all newly available data is processed exactly once and the new model is made available to all clients at the same time. While the model is being trained, we wanted to prevent users from evaluating and uploading data, since this would allow them to be rewarded for data that could be worthless after the training completes. To enable this, we used AWS IoT to push the training stats and progress from the online learning Lambda function to all running clients, based on which each client decides when to make the platform available to users again.

- -

As client data privacy was our main concern, when users evaluate pictures this has to happen without sending any data to us. Therefore, at the end of each run, the online learning component generates a browser-ready model and uploads it to AWS S3 with a new, incremented version. Whenever a client wants to evaluate some data, it checks whether a new version is available and always fetches the newest model version. All of this was done using TensorFlow JS.
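As an illustration of the export step, a Keras model can be converted into the browser-ready TensorFlow.js format with the `tensorflowjs` Python package; the placeholder model and output path below are assumptions, not the project's actual export script.

```python
import tensorflow as tf
import tensorflowjs as tfjs

# Placeholder model standing in for the trained loss predictor.
loss_predictor = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(10,))])

# Writes model.json plus weight shards; the directory is then uploaded to S3 under a new version prefix.
tfjs.converters.save_keras_model(loss_predictor, "exported_model/")
```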

- -

In order to protect the data from the outside world, and only allow access to the resources over the web app for all unauthenticated users, AWS IAM was employed and with CloudFormation configuration files we could set up the full security layer automatically.

- -

Now in order to create all the infrastructure automatically based on the code changes, allowing us to have both a test and production environment, we have employed an infrastructure as a code approach. For this we used AWS Amplify and AWS SAM allowing us to leverage AWS CloudFormation services.

- -

Machine learning component

-
-
- -
Block diagram of the ML component
- -

Our approach to the on-the-edge data valuation problem is based on recent advances in loss-predictor models. In simple words, a loss-predictor model is a model that tries to estimate another model’s loss as a function of its inputs. We use the predicted loss as a measure of model uncertainty that can be calculated without access to ground-truth labels, enabling each sample to be evaluated directly at the time of capture.

- -

For the backbone model, we converted a pre-trained YOLOv3 [1] directly from the original Darknet implementation. We then evaluated the converted model on a dashcam dataset that is publicly available online (the BDD100K dataset, available at https://bdd-data.berkeley.edu/).

- -

For the loss predictor model we decided to go with a small CNN that can be directly implemented on the browser (Tiny-VGG). We trained the Tiny-VGG model on the classification loss resulting from running the backbone model on unseen data.
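The sketch below illustrates the idea of a small, browser-deployable CNN trained to regress the backbone's classification loss per image; the layer sizes, input resolution, and training data are assumptions, not the authors' actual Tiny-VGG configuration.

```python
import tensorflow as tf

def build_loss_predictor(input_shape=(64, 64, 3)):
    """Small VGG-style CNN mapping an image to a predicted backbone loss (a single scalar)."""
    return tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, activation="relu", padding="same", input_shape=input_shape),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Conv2D(32, 3, activation="relu", padding="same"),
        tf.keras.layers.MaxPooling2D(),
        tf.keras.layers.Flatten(),
        tf.keras.layers.Dense(64, activation="relu"),
        tf.keras.layers.Dense(1, activation="relu"),  # predicted loss is non-negative
    ])

loss_predictor = build_loss_predictor()
loss_predictor.compile(optimizer="adam", loss="mse")
# images: batch of frames; backbone_losses: per-frame classification loss of the backbone on those frames.
# loss_predictor.fit(images, backbone_losses, epochs=10)
```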

- -

The implemented system has 2 interconnected training loops:

- -

First is the “offline” training loop, which requires labeled data and trains the loss-predictor model to better predict the loss of the backbone model. Since our system did not include a labeling scheme, we ran this loop only once (using a labeled subset of the BDD100K dataset) and then used the learned weights as the starting point for the second training loop (online learning).

- -

For the online learning training loop we start with the weights extracted from the offline training phase and then retrain the loss-predictor model whenever the centralized unlabeled dataset is updated. The challenge here is how to retrain the loss predictor model on these samples without having access to the labels. The way that we approached this problem is by considering the fact that the backbone model’s loss on these samples will be zero once they are labeled and added to the backbone model’s training set. Based on this assumption we decided to use the new samples with loss of zero as an online learning alternative to the larger offline learning loop.

- -

System evaluation

- -

One of our main challenges was to map a measure like the loss of a model to a quantitative and fair value in dollars. For this task we first did an empirical analysis of the distribution of the classification loss values of the backbone model. Figure 3 shows the empirical distribution of losses for the YOLO V3 model. We used this empirical probability distribution to calculate how likely is observing each sample in comparison to a randomized data capture approach with uniform probability of observing each sample. We defined the value for each sample as follows:

- -

[Equation: the value assigned to each sample, defined in terms of the empirical loss probability, the uniform baseline probability, and the Base Sample Value.]

In this equation, the first term is the empirical probability of each loss as shown in Figure 3, the second is the probability that each loss would be observed if the loss distribution were uniform (10% for the 10-bin histogram shown in Figure 3), and BSV is the “Base Sample Value” chosen by the system designer. Based on our initial research, the value that companies like Kerb and lvl5 have assigned to dashcam videos is around $3 per hour of video recordings, which roughly translates to 0.1 cents per frame assuming a 1 fps key-frame extraction rule. However, since in our system the samples are assumed to be much more diverse than a video and we require manual selection of the samples by the user, we assumed a 10-cent base sample value for each frame.

- -

We observed one caveat for this method in practice: because even the smallest losses have a non-zero value (since the probability of observing any loss is non-zero), already-sold samples could be monetized again if the loss-predictor model does not give exactly zero loss for its training set (which can be the case in online learning). We dealt with this problem by adding a “dead-zone” to our valuation heuristic so that samples with losses below a specific threshold have zero value (in our latest implementation we found a threshold of 0.27 to work well with our data).
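A minimal sketch of such a histogram-based valuation heuristic with a dead zone is shown below; the exact functional form of the value (here the base sample value scaled by the ratio of the uniform bin probability to the empirical bin probability) is our reading of the description above and should be treated as an assumption.

```python
import numpy as np

def value_samples(predicted_losses, reference_losses, bsv=0.10, n_bins=10, dead_zone=0.27):
    """Assign a dollar value to each sample from its predicted loss.

    reference_losses: losses used to build the empirical histogram (Figure 3).
    bsv: base sample value in dollars; dead_zone: losses below this are worth nothing.
    """
    counts, bin_edges = np.histogram(reference_losses, bins=n_bins)
    p_emp = counts / counts.sum()            # empirical probability of each loss bin
    p_uniform = 1.0 / n_bins                 # 10% for a 10-bin histogram

    values = []
    for loss in predicted_losses:
        if loss < dead_zone:                 # already-learned samples are worth nothing
            values.append(0.0)
            continue
        bin_idx = int(np.clip(np.searchsorted(bin_edges, loss) - 1, 0, n_bins - 1))
        rarity = p_uniform / max(p_emp[bin_idx], 1e-6)   # rarer losses are more valuable
        values.append(bsv * rarity)
    return values
```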

- -
-
- -
Empirical expected probability of classification loss values of the backbone model
- -

Application demonstration

- -

We made our application available online, to allow all users access to it. We have two links available, https://cs329s.aimeup.com for the production environment and https://cs329s-test.aimeup.com for the test environment. By choosing the production environment we click on the browse data button and load some on-the-topic pictures and hit the Run button:

- -

[screenshot]

- -

We could see the model generated some scores which get mapped to a fair value in U.S. dollars. All this data can be exported to Excel/PDF by using the buttons available in the spreadsheet toolbar. Search is also possible, if any picture can be referenced by name, to avoid scrolling when using a large number of pictures.

- -

After selecting one picture and uploading it, the online learning gets activated and the functionality on all clients is disabled during this time providing a real time progress of the training, as can be seen in the screenshot below:

- -

[screenshot]

- -

To assess what is going on in the backend we have built a monitor page that can be opened by pressing the “Open Monitor” button. From that moment on, all the backend resources will push notifications to it. After uploading the picture and during the online training we can see the following:

- -

[screenshot]

- -

After running the new model on the same pictures, the fair value of the uploaded pictures goes down to 0, meaning that the model has learned the features available in it.

- -

[screenshot]

- -

Reflection

- -

First challenge that we encountered is how to fetch a model from a secure site, where each file can get accessed over a secured private link and run it in the browser. TensorFlow JS unfortunately does not support this kind of operation, so we had to implement this ourselves.

- -

One major setback in our project was our third teammate suddenly dropping the course, which we could have seen coming from his unresponsiveness in the first couple of weeks of the quarter.

- -

Another major challenge was dealing with model instability while retraining the loss predictor model in our online training loop. Our decision to also have the original training set to “refresh” the training helped a lot.

- -

One issue that we did not count on was the fact that debugging an online learning system requires a very detailed logging and version control system that enables following the dynamic performance of the model. We ended up implementing a basic version of a logging system but still it was very hard to predict how the model would behave after a few retraining sessions.

- -

Infrastructure as code is a powerful tool that does more than one would expect, but it can lead to unexplainable behavior. Two examples that gave us some headaches:

- one cannot rely on data in the temporary folder inside a Lambda function container persisting between calls
- AWS S3 still delivers cached data, despite calling the API with caching disabled; just deleting the files and uploading them again helped!
- -

Given unlimited time and resources we would incorporate a labeling block into the system and close the loop on active data capture and labeling by retraining the backbone model on the centralized training set.

- -

Broader Impacts

- -

Since our valuation system is fully automated and does not have access to labels for the input data it could be manipulated in many different ways. For instance, one could monetize several copies of the same image (or maybe slightly different versions of one image) and leverage the fact that the loss predictor model can not be trained separately for each individual image. Or because the values are assigned to samples based on how unexpected each sample is, out of context samples can be easily monetized if the users intend to trick the system. The way that we have dealt with this issue is by first, limiting number of uploads that a user can do to an upload attempt every 5 minutes, and we also train the loss-predictor model between different uploads in order to reduce the loss values corresponding to all of the uploaded samples at each iteration. As a result, the users will be able to monetize unrelated or repeated images only once.

- -

Detecting repeated or unrelated images can be pretty straightforward using irregularity detection methods like one-class SVM but we have not currently implemented such a method.
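For instance, a one-class SVM fitted on fixed-length features of previously accepted frames could flag out-of-distribution or near-duplicate uploads; the feature extraction step below is left abstract (random placeholder arrays) and the hyperparameters are illustrative only.

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Features of frames already accepted by the system (e.g. flattened CNN embeddings).
accepted_features = np.random.rand(500, 128)    # placeholder data
candidate_features = np.random.rand(10, 128)    # features of a new upload

detector = OneClassSVM(kernel="rbf", nu=0.05, gamma="scale")
detector.fit(accepted_features)

# +1 means the candidate looks like normal dashcam data, -1 flags it as irregular.
flags = detector.predict(candidate_features)
suspicious = candidate_features[flags == -1]
```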

- -

References

- -

[1] Redmon, Joseph, and Ali Farhadi. “Yolov3: An incremental improvement.” arXiv preprint arXiv:1804.02767 (2018).

diff --git a/_site/index.html b/_site/index.html
index 68c0511..c20c54c 100644
--- a/_site/index.html
+++ b/_site/index.html
- CS 329S
- Machine Learning Systems Design
+ Residência BRISA
[Deleted post listing: "ML Production System For Detecting Covid-19 From Coughs", "An active data valuation system for dashcam data crowdsourcing", "Virufy on-Device Detection for COVID-19", "Stylify", "Building a Context Graph Generator", "Virufy Asymptomatic COVID-19 Detection - Cloud Solution", "Tender Matching People to Recipes", "CovidBot Project Report", "Fact-Checking Tool for Public Health Claims"]
@@ -327,7 +124,7 @@

RecSys

-    ga('create', 'UA-103070243-6', 'auto');
+    ga('create', '', 'auto');
     ga('send', 'pageview');

diff --git a/_site/tender-recipe-recommendations/index.html b/_site/tender-recipe-recommendations/index.html
deleted file mode 100644
index 001cde4..0000000
--- a/_site/tender-recipe-recommendations/index.html
+++ /dev/null
- - - -
-
-
- - -
-
-

Tender Matching People to Recipes

-
2021, Mar 18    
-
-

The Team

- Justin Xu: justinx@stanford.edu
- Makena Low: makenal@stanford.edu
- Joshua Dong: kindled@stanford.edu

Github Repo: https://github.com/justinxu421/recipe_rex

- -

Problem Definition

- -

You’re at the grocery store, looking through the shelves of copious options, unsure of what you’d actually enjoy cooking or eating. You have a sense of what you’re craving, but a search on Google for your latest craving can take an hour and many tabs to surface the recipes that hit it. Your cravings also keep changing. New Years’ resolutions for healthy food turn into longing for quick meals, which cascade into a phase of comfort food. Discovering new food takes a lot of effort.

- -

Introducing Tender

- -

Our app is like Tinder. But instead of matching people to people, it matches people to recipes with a similar focus on simplicity. When you first open the app, you’re shown the bios of 4 recipes: their profile photos, names, and who they are. After 10 rounds of choices, our app recommends the best matches to your craving from over 2000 recipes curated by blogs covering many traditional Asian recipes and fusion concepts.

- -

As you choose, we guess your preferences for meats and starches and show them to you in graphs on the bottom of the screen. This helps narrow down our search for the most similar recipes to the ones you chose. And if you’re in the mood for desserts or sides instead of a main dish, you can choose to explore those options on the side bar.

- -

In our next sections, we’ll describe how we evaluated our app, designed the algorithms under the hood, related works, and reflections on our work.

- - - -

Recommendations are a hot topic in today’s society, driving apps like Spotify and TikTok, which rely on quickly changing trends and massive amounts of input data and user information to build out their algorithms. As a result, many of these algorithms did not seem to satisfy the constraints of our use case, as they rely on repeat user interactions and massive amounts of data.

- -

However, inspiration for the algorithm we utilized for this project was partially taken from the original 2010 Youtube Recommendation Paper [1]. This paper showed a two-step recommendation algorithm where the first step was generating candidates based on a user’s recent activity and the second step was a ranking algorithm which was based on user preferences, which seemed to be a general interpretable approach for our problem.

- -

Video Demo

- -

[video demo]

- -

System Evaluation

- -

To validate the quality of our recommendations, we compare our recommendations to a random sample of recipes, and the user has to choose which they prefer the most. The percentage of their choices that we recommend is our validation score. If it’s 50%, we’re doing no better than random, giving us a natural baseline.

- -
-
- -
-
- -

To perform a slice based analysis of our recommendations on different types of cravings, we designed 8 cravings for soupy and stirfry main dishes. We then asked 10 young adults in their early 20s to choose a craving, and use the app. We intentionally left the cravings up to interpretation and gave users the freedom to choose recipes.

- -
-
- -
-
- -

Across our 54 user tests, our recommendation system achieved a validation score of 68%, beating our baseline by 18%. We scored 67% across soupy dishes and 71% across stir-fry dishes. Looking at the breakdown of scores, we find performance on vegetable soup and vegetarian stir-fry to be close to average, a surprise since these are the least represented dishes from our recipe websites.

- -

[figure]

- -

System Design

- -

Our modeling relied on libraries like FastText [2], scikit-learn [3], Pandas [4], and NumPy [5]. We chose Streamlit [6] for the front end and deployment of our app to keep our codebase in Python and for faster iteration. Below is a simple diagram of our algorithms. We’ll step through each part in the next sections.

- -
-
- -
-
- -

Feature Engineering our Recipe Embeddings

- -

Using recipe-scraper [7], we found the following features for all our recipes.

- -
-
- -
-
- -

We curated recipes for main dishes, desserts, and sides, receiving recipe counts of 1737, 362, 221 respectively. For each recipe, we created a joint embedding of the nutrition and ingredients.

- -

Our nutrition embedding gave a binary label to a recipe for exceeding the 75th percentile of fat, protein, carbohydrate, sodium, or sugar content across recipes.

- -

Our ingredients embedding was created in a few steps. First, unigrams and bigrams were extracted from every list of ingredients. Bigrams captured words like “soy sauce” and “chicken breast”. Among the 989 which occurred over 20 times, 359 ingredient grams were manually labeled and kept. Each of these grams were mapped to a 300-dimension embedding using a pretrained FastText language model. FastText [2] forms word embeddings by averaging subword embeddings. This allows it to generalize to unseen vocabulary in our ingredient grams, unlike Word2Vec [11]. To create a sentence embedding from all the ingredients of a recipe, we took an inverse frequency weighted average of the individual ingredient embeddings based on the smooth inverse frequency method introduced by Arora et al [8]. Compared to using SentenceBERT [12], this better takes advantage of our domain specific ingredient frequency.
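A minimal sketch of that weighted-average step is shown below; the `word_vectors` lookup (e.g. pretrained FastText vectors) and the smoothing constant `a` are assumptions, and the full method of Arora et al. also removes a common component, which is omitted here.

```python
import numpy as np

def sif_embedding(ingredient_grams, word_vectors, gram_counts, a=1e-3, dim=300):
    """Smooth-inverse-frequency weighted average of ingredient-gram embeddings for one recipe.

    word_vectors: mapping gram -> 300-d vector (e.g. pretrained FastText vectors)
    gram_counts:  dict of gram frequencies across the whole recipe corpus
    """
    total = sum(gram_counts.values())
    vectors, weights = [], []
    for gram in ingredient_grams:
        if gram in word_vectors:
            freq = gram_counts[gram] / total
            weights.append(a / (a + freq))        # rarer grams get higher weight
            vectors.append(word_vectors[gram])
    if not vectors:
        return np.zeros(dim)
    return np.average(np.stack(vectors), axis=0, weights=weights)
```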

- -

To create a joint embedding with a balanced influence from the nutrition and ingredients, we projected our ingredient embeddings into the subspace formed by their first 5 principal components, which explained 49% of the variance. Extending to 10 principal components would have explained an additional 12% of variance.

- -

To evaluate the semantic understanding represented by our principal components, we examined how cuisine is clustered along the first two principal components. Below on the left includes Chinese cuisine in bright red, our dominant class. To better visualize the clustering of our minority classes, we show the same graph excluding Chinese cuisine on the right. Without explicitly including cuisine in our embedding, we find that it keeps similar cuisines close to each other while also capturing intra-cuisine variance. This supports our hypothesis that our embedding incorporates semantic understanding.

- -

[figure]

- -

Our final joint embedding was a 5-dimensional ingredient embedding stacked on a 5 dimensional nutrition embedding.

- -

Designing our Recommendation Algorithms

- -

The main driver to our recommendation engine was a k-nearest neighbor [13] recommendation system. Given the fact that our dataset was relatively small, at around 2000 recipes, a nearest neighbor approach seemed to make the most sense in terms of finding similar recipes with cosine distance.
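A sketch of that nearest-neighbour lookup over the 10-dimensional joint embeddings, using scikit-learn with cosine distance, might look as follows; the embedding matrix here is random placeholder data.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

recipe_embeddings = np.random.rand(2000, 10)   # placeholder for the joint embeddings
knn = NearestNeighbors(n_neighbors=5, metric="cosine")
knn.fit(recipe_embeddings)

# Average the embeddings of the recipes the user picked, then look up the closest recipes.
liked = recipe_embeddings[[3, 17, 42]]
query = liked.mean(axis=0, keepdims=True)
distances, indices = knn.kneighbors(query)
print(indices[0])   # indices of recommended recipes
```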

- -

To make it a true machine learning application, our app needed to learn user preferences as it proceeded! To do this, we generated a coarser, more interpretable labeling system for every recipe to capture some taste preferences a user might have coming in. The two main categories we selected were meat and starch. These two categories were chosen given the fact that the user may have dietary restrictions. The labels were generated through a title / ingredient list keyword match.

- -

[figure]

- -

We see that these categories are well distributed across our ingredient embedding space.

- -

[figure]

- -

Given these taste labels, we can then restructure our search problem as a multi-armed bandit problem [10]. The goal of the algorithm is to generate a sampling procedure over the arms that determines, with high accuracy, the expected payout of each arm. In this problem, our “arms” are the individual’s taste preferences, and the payout is the probability, given all choices, that the individual will select a particular preference. Since the hypothesis is that users come into our app with a particular taste preference in mind, they will likely select recipes matching that preference.

- -
-
- -
-
- -

One solution to the bandit problem with optimal regret bounds (rough amortized long term deviation from optimal) is the UCB (Upper Confidence Bound) [9] algorithm, which selects the arm with the highest upper bound to the confidence interval of the payouts.

- -

This approach simulates exploration vs. exploitation since in the beginning, this algorithm will select taste preferences that have not been selected yet (due to high variance), but as the user proceeds, it will start to recommend more recipes matching the user’s preferences (exploitation phase), as it gains higher confidence for the value of their preferences.
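A compact sketch of UCB over taste-preference “arms” is shown below; the reward definition (1 if the chosen recipe carries that taste label) and the exploration constant are illustrative assumptions.

```python
import math
import random

class UCBPreferences:
    """Upper Confidence Bound over coarse taste labels (e.g. meats and starches)."""

    def __init__(self, arms):
        self.counts = {arm: 0 for arm in arms}    # times each preference was "pulled"
        self.values = {arm: 0.0 for arm in arms}  # running mean reward per preference

    def select(self, t):
        # Unseen arms first, then the arm with the highest upper confidence bound.
        for arm, n in self.counts.items():
            if n == 0:
                return arm
        return max(
            self.counts,
            key=lambda a: self.values[a] + math.sqrt(2 * math.log(t) / self.counts[a]),
        )

    def update(self, arm, reward):
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]

bandit = UCBPreferences(["chicken", "beef", "tofu", "rice", "noodles"])
for t in range(1, 11):
    arm = bandit.select(t)
    reward = 1.0 if random.random() < 0.3 else 0.0   # stand-in for "user picked a recipe with this label"
    bandit.update(arm, reward)
```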

- -

Reflections

- -

Overall, the project experience was very positive. Our team’s general dynamic and workflow worked well together. Sometimes it was a little difficult to divide up work, as some next steps were a little conditional on the previous part, so it was hard to parallelize, especially in a remote situation. On the other hand, since this project was a full application, we were able to divide work flow into the general “front end” vs. “back end” aspects that we had to handle to some extent, switching off who was working on what at different stages, and were still able to build team camaraderie by pair programming too.

- -

The tech stack that we decided on in the beginning also worked out well, since everything was able to be in python and easily integrated together. We had to pivot a couple of times in terms of what we were designing, especially in the direction of away from black box approaches and more into the interpretable methods + UX considerations.

- -

If given more time, we would try to incorporate more features to create a richer embedding space in combination with more recipes in our database to generate more personalized recommendations. On the engineering side, we’d also try to fully deploy our app, incorporating database storage and user memory in order to preserve information across multiple uses of our app. This would also enable many more machine learning features, including labeled data as we log users using our app, and collecting information for personalization. Streamlit’s public deploy also wasn’t able to handle multiple users using the app at the same time because of shared state space. We’d probably want to migrate our tech stack to something more robust, as well as provide more flexibility in terms of the UX design.

- -

We were not super ambitious about the technology we used, so we’d like to also incorporate some of the concepts we learned in class, like online learning and edge computing, and setting up the general DevOps workflow (maybe if we turn our app into a startup)!

- -

Broader Impact

- -

We see an app like this flattening the activation energy for young adults in a hurry to plan out meals they’ll enjoy cooking and eating. Instead of many searches and open tabs to gather together a few options that satisfy one’s craving, they can come to an app like ours.

- -

One audience we have had a challenge serving is people with dietary restrictions. For example, an early version of our app had a difficult time distinguishing red meat from non-red meat. Using filters learned by our UCB algorithm and sourcing more recipes that are kosher and vegetarian has helped. Our app could unintentionally exclude guests with dietary restrictions from the tables of users who come to use it often.

- -

We attempted to combat this problem by being mindful of selecting recipes from a variety of sources and cuisines, including many vegetarian dishes + a variety of meats / taste profiles. However, naturally, our dataset is still heavily Chinese/Korean/Japanese skewed due to the popularity of East Asian cuisine.

- -

References

- -
1. Davidson, James & Liebald, Benjamin & Liu, Junning & Nandy, Palash & Vleet, Taylor & Gargi, Ullas & Gupta, Sujoy & He, Yu & Lambert, Michel & Livingston, Blake & Sampath, Dasarathi. (2010). The YouTube video recommendation system
2. P. Bojanowski, E. Grave, A. Joulin, T. Mikolov, Enriching Word Vectors with Subword Information
3. Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, Édouard Duchesnay. Scikit-learn: Machine Learning in Python, Journal of Machine Learning Research, 12, 2825-2830 (2011)
4. Wes McKinney. Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, 51-56 (2010)
5. Charles R. Harris, K. Jarrod Millman, Stéfan J. van der Walt, Ralf Gommers, Pauli Virtanen, David Cournapeau, Eric Wieser, Julian Taylor, Sebastian Berg, Nathaniel J. Smith, Robert Kern, Matti Picus, Stephan Hoyer, Marten H. van Kerkwijk, Matthew Brett, Allan Haldane, Jaime Fernández del Río, Mark Wiebe, Pearu Peterson, Pierre Gérard-Marchant, Kevin Sheppard, Tyler Reddy, Warren Weckesser, Hameer Abbasi, Christoph Gohlke & Travis E. Oliphant. Array programming with NumPy, Nature
6. Ashish Shukla, Charly Wargnier, Christian Klose, Fanilo Andrianasolo, Jesse Agbemabiase, Johannes Rieke, José Manuel Nápoles, Tyler Richards. Streamlit
7. recipe-scraper https://github.com/hhursev/recipe-scrapers/
8. Sanjeev Arora, Yingyu Liang and Tengyu Ma. A Simple but Tough-to-Beat Baseline for Sentence Embeddings. ICML 2017
9. Sébastien Bubeck, Nicolò Cesa-Bianchi: Regret Analysis of Stochastic and Nonstochastic Multi-armed Bandit Problems. CoRR abs/1204.5721 (2012)
10. Auer, Peter, et al. "The nonstochastic multiarmed bandit problem." SIAM Journal on Computing 32.1 (2002): 48-77.
11. Mikolov, Tomas, et al. "Efficient estimation of word representations in vector space." arXiv preprint arXiv:1301.3781 (2013).
12. Reimers, Nils, and Iryna Gurevych. "Sentence-BERT: Sentence embeddings using siamese BERT-networks." arXiv preprint arXiv:1908.10084 (2019).
13. Cunningham, Padraig, and Sarah Jane Delany. "k-Nearest Neighbour Classifiers." arXiv preprint arXiv:2004.04523 (2020).
- - -
diff --git a/_site/virufy-cloud/index.html b/_site/virufy-cloud/index.html
deleted file mode 100644
index aa9dc13..0000000
--- a/_site/virufy-cloud/index.html
+++ /dev/null

Virufy Asymptomatic COVID-19 Detection - Cloud Solution

-
2021, Mar 18    
-
-

The Team

-
- Taiwo Alabi
- Alex Li
- Chloe He
- Ishan Shah
- -

I. Problem Definition

- -

By March 2021, the SARS-CoV-2 virus has infected nearly 120 million people worldwide and claimed more than 2.65 million lives [1]. Moreover, a large percentage of cases were never diagnosed because of hospital overflows and asymptomatic carriers. According to a recent study published in the JAMA Network, an estimated 59% of all COVID-19 transmissions may be attributed to people without symptoms, including 35% who unknowingly spread the virus before showing symptoms and 24% who never experience symptoms [2]. Therefore, in order to prevent further spread of the virus, it’s crucial to have screening tools that are not only fast, accurate, and scalable, but also accessible and affordable to the general population. However, such tools do not currently exist.

- -

Since spring 2020, AI researchers have started exploring the use of machine learning algorithms to detect COVID in a cough. Researchers at MIT and the University of Oklahoma believe that the asymptomatic cases might not be “truly asymptomatic,” and that using signal processing and machine learning methods, we may be able to extract subtle features in cough sounds which are indistinguishable to the human ear [3]. In the past year, there have been a number of related projects around the world: AI4Covid-19 at the University of Oklahoma, Cough against COVID-19 at Wadhwani AI, Opensigma at MIT, Saama AI research, among others.

- -

However, existing cough prediction projects have varying performances and often require high-quality data because the models were trained on audio samples that were recorded in clinical settings and appropriately preprocessed [4]. Some models do not aim only at COVID detection but at all respiratory conditions, which makes it harder to balance between different performance metrics and therefore unsuitable to the needs of minimizing false negatives for the purpose of COVID prevention. These challenges motivated our project, as we hope to build a cloud-computing system better suited for detecting COVID in various types of cough samples and a prescreening tool that is easily accessible, free for all, and produces nearly instantaneous results.

- -

II. System design

- -

Virufy cloud needed an online prediction machine learning system with a low-latency inference capability. We also needed to comply with HIPAA privacy rules with regards to health data collection and sharing.

- -

Hence, the machine learning system we designed is hosted in the cloud on a beefy EC2 t3 instance with GPU acceleration. An elastic IP address was assigned to the EC2 instance and the main app was served on port 8080. The DNS name raphaelalabi.com was used to redirect all traffic to the elastic IP address through the open port.

- -

To comply with HIPAA privacy rules, we decided not to provide the option for users to enter personal information. This ensured complete anonymization of the entire process since data from user, user waveform .wav file, is run through the inference engine and subsequently not stored anywhere in the pipeline.

- -
-
- -
-
- -

The data flow diagram for the system is shown above, with the DNS forwarding traffic to the EC2 instance. The EC2 instance performs three processing steps to keep latency low:

- -
1. Converting the waveform (.wav file) to a Mel-frequency spectrogram and Mel-frequency cepstral coefficients (MFCCs)
2. Running a pre-trained XGBoost model from COUGHVID to help validate that there is an actual cough sound in the waveform file
3. Using the inference model to infer the probability that the Mel-frequency spectrogram and MFCCs contain COVID-sound biomarkers
- -

These 3 processes run asynchronously and the current latency is ~2sec, from uploading a cough sound to getting a positive or negative result output.
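A rough sketch of the first step, extracting MFCCs and a mel spectrogram from an uploaded .wav file with librosa, is shown below; the sample rate, number of coefficients, and pooling over time are illustrative assumptions.

```python
import librosa
import numpy as np

def extract_features(wav_path, sr=22050, n_mfcc=39):
    """Return MFCC coefficients and a log-mel spectrogram for one cough recording."""
    audio, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=audio, sr=sr, n_mfcc=n_mfcc)
    mfcc_features = mfcc.mean(axis=1)                  # one coefficient vector per recording
    mel = librosa.feature.melspectrogram(y=audio, sr=sr, n_mels=64)
    mel_db = librosa.power_to_db(mel, ref=np.max)      # log-scaled spectrogram "image"
    return mfcc_features, mel_db
```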

- -

The system also has an automated model deployment script that can automate deployment with only one line of code to an Ubuntu deep learning AMI image. The automated script makes it so much easier to deploy by taking care of all dependencies and co-dependencies during deployment. In addition, we also have an automated model validation script that can evaluate performance of many models and give their specificity and sensitivity to COVID-19 using a customized dataset that is also downloaded into the EC2 instance and kept in the repo.

- -

We needed a t3 instance with GPU acceleration because the core of our inference engine uses a convolutional neural network that benefits from GPU acceleration. We also decided to separate the inference step from the pre-processing and input-data validation steps to ensure modularity and easier error tracking.

- -

The machine learning system we built also has an error-tracking log file in the server that could be used to debug the system when necessary. By incorporating error logging capability, automated model evaluation and validation, automated model deployment, and model inference, we have built and demonstrated a well-rounded system that can serve users from around the world at low latency speed. In addition the model evaluation allows for continuous integration and deployment- CI/CD- since it allows uploading many models and evaluating those models in the cloud. Thus enabling an almost seamless switch from one inference algorithm to another inference algorithm.

- -

One flaw the system currently faces in production is susceptibility to attacks. The URL to our EC2 instance is public and we left the port open to the entire world. Although this made it easy to deploy and serve the model, it also exposes us to DoS attacks.

- -

In addition, the system is currently not horizontally scalable. To enable horizontal scaling using a load balancer on AWS, we would need to integrate and use AWS Elastic Beanstalk.

- -

III. Machine Learning

- -

We started out with the hypothesis that cough sound from COVID-19 positive carriers could be differentiated from cough sound from unaffected people. We pre-processed cough recordings from two open-sourced COVID-19 related datasets, Coswara[5] and COUGHVID[6]. Extracted features include the recording waveform, age, and gender. We take all positive samples and randomly selected subsets of negative samples from the datasets to compensate for class imbalance. We also tried taking all samples and assigning different class weight combinations, even though this approach did not perform as well.

- -

Mel-frequency cepstral coefficients (MFCCs) and mel-frequency spectrograms [7] have been used to extract audio signature from cough recordings. Our main approach is to build two branches of the modelling pipeline that can handle those different engineered-features separately, which are sequentially merged together for a single binary classification task.

- -

We received 39 numerical coefficients from the MFCCs as output, for which we built a two-layer dense model. The spectrograms are in image format (64x64x3), for which an ImageNet approach can be applied. We attempted numerous models pre-trained on ImageNet, including ResNet50, ResNet101, InceptionV2, DenseNet121, etc.; ResNet50 was shown to perform the best. The output of the pre-trained base model is passed to a global average pooling layer, a dense layer and a dropout layer. We merged the outputs of the two-layer dense model for MFCCs and the convolutional neural net model for spectrograms, and passed the merged output through another two dense layers with a shrinking number of nodes. The final output is a single node with a sigmoid activation function.
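The sketch below illustrates this two-branch architecture in Keras; the layer widths, dropout rate, and the exact way the branches are merged are assumptions based on the description rather than the authors' exact model.

```python
import tensorflow as tf

# Branch 1: 39 MFCC coefficients through a small dense model.
mfcc_in = tf.keras.Input(shape=(39,))
x1 = tf.keras.layers.Dense(64, activation="relu")(mfcc_in)
x1 = tf.keras.layers.Dense(32, activation="relu")(x1)

# Branch 2: 64x64x3 mel-spectrogram "image" through a pre-trained ResNet50 base.
spec_in = tf.keras.Input(shape=(64, 64, 3))
base = tf.keras.applications.ResNet50(include_top=False, weights="imagenet", input_shape=(64, 64, 3))
x2 = base(spec_in)
x2 = tf.keras.layers.GlobalAveragePooling2D()(x2)
x2 = tf.keras.layers.Dense(128, activation="relu")(x2)
x2 = tf.keras.layers.Dropout(0.3)(x2)

# Merge and shrink down to a single sigmoid output.
merged = tf.keras.layers.Concatenate()([x1, x2])
merged = tf.keras.layers.Dense(64, activation="relu")(merged)
merged = tf.keras.layers.Dense(16, activation="relu")(merged)
out = tf.keras.layers.Dense(1, activation="sigmoid")(merged)

model = tf.keras.Model(inputs=[mfcc_in, spec_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=[tf.keras.metrics.AUC()])
```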

- -

Alternatively, we tried automatic neural architecture search using AutoKeras in order to systematically test other architectures. However, we did not achieve the same level of performance on the test set as with the hand-built architecture described in the previous paragraph.

- -

The dataset was randomly shuffled and split into 75% training, 15% validation and 15% test sets. During training, we grid-searched different optimizers and determined that Adam works best. After training, we found the best cut-off for binary classification with Youden’s J statistic (sensitivity + specificity − 1) [7].
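A short sketch of picking that cut-off from the ROC curve with scikit-learn follows; the label and score arrays are placeholders standing in for the held-out evaluation data.

```python
import numpy as np
from sklearn.metrics import roc_curve

# y_true: ground-truth labels, y_score: sigmoid outputs of the model (placeholders here).
y_true = np.random.randint(0, 2, size=1000)
y_score = np.random.rand(1000)

fpr, tpr, thresholds = roc_curve(y_true, y_score)
j_scores = tpr - fpr                      # Youden's J = sensitivity + specificity - 1
best_threshold = thresholds[np.argmax(j_scores)]
print(f"Optimal cut-off by Youden's J: {best_threshold:.6f}")
```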

- -

IV. System evaluation

| Gender | # samples | Accuracy | Weighted F1 | Sensitivity | Specificity |
|--------|-----------|----------|-------------|-------------|-------------|
| Female | 864       | 0.7049   | 0.73        | 0.93        | 0.63        |
| Male   | 1,968     | 0.6951   | 0.74        | 0.91        | 0.65        |

| Age group | # samples | Accuracy | Weighted F1 | Sensitivity | Specificity |
|-----------|-----------|----------|-------------|-------------|-------------|
| <= 20     | 288       | 0.7222   | 0.74        | 0.93        | 0.64        |
| 21-40     | 1,680     | 0.7065   | 0.75        | 0.96        | 0.66        |
| 41-60     | 576       | 0.6867   | 0.73        | 0.88        | 0.65        |
| > 60      | 48        | 0.7292   | 0.65        | 1           | 0.67        |

Table 1: Slice-based analysis results using an optimized cutoff of 0.012245 across a) gender groups and b) age groups.

- -

We obtained the cutoff based on AUC analysis on the test set. At the threshold, the test set performance achieves 79.71% sensitivity and 49.20% specificity. The FDA guidance to COVID-19 testing solution explicitly mentions that sensitivity and specificity are the main metrics [8]. We believed that sensitivity is the most important metric of success as a screening tool, as it measures how many actual COVID-positives were captured by the model. Achieving approximately 80% sensitivity shows that we can correctly identify those with COVID-19 virus with substantial success. Admittedly, we did not achieve great specificity: 49% specificity means that there is roughly one false positive for every positive prediction. However, from a public health perspective, we think it is far more costly for a restaurant to admit infected customers than to send more people to PCR tests than necessary.

- -

We performed slice-based analysis across different age and gender groups in order to evaluate the performance of our model and address model weaknesses. We did inference on the entire dataset. Using an optimized cutoff of 0.012245, we found that the model achieved almost exactly the same accuracies and F1 scores among the male and female populations. Individuals between the age of 21 and 40 make up most of the population from which the cough samples were crowdsourced; despite large differences in the number of samples across age groups, the model was as accurate in the 21-40 age group as in the >60 age group. These results demonstrate that the cough signatures are generalizable across different gender and age groups, and that the model is not biased towards any gender or age groups.

- -

Separately, we also created an automated system evaluation that can provide analysis of multiple models as well as their inference latencies right within the production environment. These scripts run in both Python and Bash. By using the automated script, we were able to measure a 2x increase in latency when moving from a tri-layer CNN architecture to a ResNet50 architecture. However, what we gave up in latency we more than made up for in specificity and sensitivity: the ResNet50 model achieves a sensitivity score of 0.79 and a specificity score of 0.9, while the tri-layer CNN architecture has a sensitivity score of 0.6 and a specificity score of 0.59 in production. The analysis was performed using a separate holdout dataset that was manually culled and curated for sound veracity and clarity.

- -

In addition, we evaluated the performance of the pretrained XGBoost cough validation classifier, which has some drawbacks. Specifically, the algorithm tends to misclassify audio recordings that have low-pitch or quiet coughs as non-cough files.

- -

V. Application demonstration

- -

We chose to make this a web application running on AWS and routed to Taiwo's URL (raphaelalabi.com) to keep things simple for the user and deliver a result within a couple of clicks. The feature set of the web app is deliberately small: upload a .wav file in the browser and click the "Process" button. The app can reject the input because of a wrong file format, audio outside the length boundaries (0.5 - 30 seconds), or audio not detected as a cough, and it outputs the appropriate error message (image below).
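Conceptually, the server-side checks look roughly like the Flask-style sketch below; the framework choice here is an assumption, and `is_cough`/`predict_covid` are hypothetical stand-ins for the XGBoost cough validator and the model inference step.

```python
# Hedged sketch of the input validation described above (Flask assumed;
# is_cough / predict_covid are hypothetical placeholders).
import librosa
from flask import Flask, request, jsonify

app = Flask(__name__)
MIN_SEC, MAX_SEC = 0.5, 30.0

def is_cough(signal, sr) -> bool:
    return True        # stand-in for the pretrained XGBoost cough validator

def predict_covid(signal, sr) -> float:
    return 0.5         # stand-in for featurization + ResNet50 inference

@app.route("/process", methods=["POST"])
def process():
    upload = request.files.get("audio")
    if upload is None or not upload.filename.lower().endswith(".wav"):
        return jsonify(error="Please upload a .wav file."), 400
    signal, sr = librosa.load(upload, sr=None)
    if not (MIN_SEC <= len(signal) / sr <= MAX_SEC):
        return jsonify(error="Audio must be between 0.5 and 30 seconds."), 400
    if not is_cough(signal, sr):
        return jsonify(error="No cough detected, please re-upload."), 400
    return jsonify(probability=predict_covid(signal, sr))
```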

- -

If the input is accepted, you reach either a "positive" or "negative" landing page after a few seconds, based on a fixed threshold determined during model evaluation; both pages show the probability that you are an asymptomatic COVID carrier along with general guidelines. We decided to have only two landing pages because we did not feel confident setting more thresholds based on the limited evidence from our model evaluation.

- -

Instructions and Images:

- -
  1. Navigate to raphaelalabi.com
  2. Upload .wav file from your local directories and click "Process":
- -

[Screenshot: file upload page with "Process" button]

- -

That's it! The possible errors mentioned above display a message like this and ask you to re-upload:

- -

[Screenshots: error messages for rejected inputs]

- -

If the model successfully processes the data, you will get one of the following landing pages specifying a "positive" or "negative" result with disclaimers and guidance:

- -

[Screenshots: "positive" and "negative" result landing pages]

- -

VI. Reflection

- -

We believe that the infrastructure we built our system on worked well given the team members' varying skill sets. AWS was a great fit because its deep learning EC2 instances come preloaded with Anaconda and the other Linux tooling required to deploy our application, cutting out the time-consuming step of installing everything and configuring paths. Furthermore, Taiwo, who has more experience with the platform, deploys the app through his root account and created IAM accounts so the rest of us could easily access the same resources.

- -

Another success was keeping the code concise by properly compartmentalizing it. Essentially, we pull the .wav file through a simple API call, run it through a preprocessing function to featurize it and verify it is a cough, then run inference with a model that is loaded from an HDF5 file and trained separately from the system's codebase. This setup allowed us to iterate on the system and use Git with fewer roadblocks.
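A condensed sketch of that flow is shown below; the model file name and the mel-spectrogram settings are assumptions standing in for the real preprocessing function.

```python
# Hedged sketch of the serving path: featurize, then run inference on a model
# loaded from an HDF5 file (file name and feature settings are assumptions).
import librosa
import numpy as np
from tensorflow import keras

model = keras.models.load_model("covid_cough_model.hdf5")  # trained outside this codebase

def featurize(path: str, sr: int = 16000, n_mels: int = 128, frames: int = 128) -> np.ndarray:
    signal, _ = librosa.load(path, sr=sr)
    logmel = librosa.power_to_db(
        librosa.feature.melspectrogram(y=signal, sr=sr, n_mels=n_mels))
    logmel = logmel[:, :frames]                 # naive crop; real code pads/segments
    return logmel[..., np.newaxis]

def predict(path: str) -> float:
    features = featurize(path)
    return float(model.predict(features[np.newaxis, ...]).ravel()[0])
```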

- -

In general, our team communicated well over Slack and had an effective division of labor, consistently listing the remaining action items and assigning them. However, we could have met on Zoom more often and learned about each other's components in more depth, as we spent a fair bit of time in the chat playing catch-up.

- -

The most obvious drawback of our current system is the requirement that the file be a .wav, which often forces a user to manually convert their audio on a third-party website. Given a little more time, we would have solidified in-app recording and/or accepted and internally converted audio files of other formats.
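Server-side conversion could be as simple as the pydub-based sketch below; this is not in the current app and assumes an ffmpeg binary is available on the instance.

```python
# Hedged sketch: converting other formats (e.g. .mp3/.m4a) to .wav before the
# existing pipeline runs. Requires pydub and ffmpeg (added dependencies).
from pydub import AudioSegment

def to_wav(src_path: str, dst_path: str = "converted.wav") -> str:
    audio = AudioSegment.from_file(src_path)            # format inferred from file
    audio.set_channels(1).set_frame_rate(16000).export(dst_path, format="wav")
    return dst_path
```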

- -

A more subtle yet significant limitation comes from the data used to train and evaluate our model. The coughs could come from symptomatic carriers, not just asymptomatic ones, diluting our metrics. After manually listening to positive waveforms from Coswara, we could not tell whether some were asymptomatic forced coughs or naturally occurring coughs from symptomatic patients. We realized we cannot reliably differentiate the two because we do not have ground truth or curated datasets for both categories.

- -

With more time and resources, the first critical component to improve would be model performance, prioritizing sensitivity. We could only train a few architectures and hyperparameter combinations on a limited dataset, so we would want to expand on that with more research and compute power. We would also learn the ins and outs of audio data and its different features to expand the pre-processing code and, relatedly, implement segmentation methods to reduce noise in the input. The second component would be the general user experience. For example, if a positive result is returned, we should return a basic analysis of the waveform explaining the model's "decision", and possibly route the user to a PCR test based on their current location.

- -

We are operating under the umbrella of the larger Virufy non-profit, and hope portions of our work can be adapted into their codebase. Some of our team members are thinking about continuing to work on Virufy, and hope to see it succeed with the continued development of new features and more accurate models.

- -

VII. Broader Impacts

- -

This application is intended as a potentially fast and accurate COVID-19 screen based on an unforced cough waveform from an individual. We could see it used in airports, hospitals and other health institutions, care homes, and similar settings, where fast screening keeps regular traffic flowing while ensuring that people entering are not asymptomatic carriers.

- -

A potential harm of using this machine learning system is that a person with common viral pneumonia or a bad case of the flu could also be labelled by the algorithm as an asymptomatic carrier. The algorithm has not been calibrated on users who have flu, pneumonia, or other respiratory conditions, with or without COVID. Our belief is that such individuals may also carry the vocal biomarker for COVID that the model has learned and thus be classified as COVID-positive.

- -

Lastly, our system is designed as a prescreening tool, not a comprehensive test that would replace regular PCR or rapid testing procedures. We intend to make this as clear as possible by providing warnings and reminders in our web UI. Moreover, because the test is not 100% accurate and we expect some false negatives after deployment (even though we try to minimize them as much as possible), we heavily emphasize on our results page the need to continue following public health guidelines and quarantine procedures. For individuals who receive positive predictions, we prompt them to get a more reliable test (such as PCR) as soon as possible.

- -

VIII. Contributions

- -

Taiwo

- -
  - Wrote the front-end interface with bootstrapped HTML/JavaScript connected to Python.
  - Wrote the deployment scripts.
  - Wrote the automated testing scripts.
  - Wrote the general framework of the API for pre-processing and inference.
  - Engineered the use of AWS (EC2) and an Elastic IP address for on-cloud prediction.
  - Worked on data pre-processing for COUGHVID and re-wrote the initial baseline algorithm that gave the team a first look at model performance.
  - Worked on the initial model with multi-band CNN and DNN.
- -

Chloe

- -
  - Deployed the model on EC2, set up serving on AWS, and routed it to a custom domain (through Namecheap).
  - Designed the web UI (front-end).
  - Prepared the workshop presentation and final presentation.
- -

Alex

- -
  - Conducted deep and detailed work on model training, development, and analysis; the resulting model was used in the final presentation.
  - Defined the cut-off threshold for the ResNet machine learning model.
  - Performed slice-based analysis to evaluate model performance across age and gender groups.
  - Did initial exploratory work with SageMaker and the GCP AI Platform for model hosting.
  - Made the MVP demo slides and presentation video.
- -

Ishan

- -
  - Integrated the cough-validation XGBoost model into our codebase and verified its compatibility with the existing system on the EC2 instance.
  - Added functionality such as checking the length of the input sound.
  - Made the UI modifications needed for the final product.
  - Prepared appropriate examples and conducted the MVP and final demos.
- -

GitHub Repo URL:

- -

The URL to the GitHub repo with all the code is: https://github.com/taiworaph/covid_cough

- -

References

- -

[1] COVID-19 Dashboard by the Center for Systems Science and Engineering (CSSE) at Johns Hopkins University (JHU). (n.d.). Retrieved March 17, 2021, from https://coronavirus.jhu.edu/map.html

- -

[2] Johansson MA, Quandelacy TM, Kada S, et al. SARS-CoV-2 Transmission From People Without COVID-19 Symptoms. JAMA Netw Open. 2021;4(1):e2035057. doi:10.1001/jamanetworkopen.2020.35057

- -

[3] Scudellari, M. (2020, November 4). AI Recognizes COVID-19 in the Sound of a Cough. Retrieved March 17, 2021, from https://spectrum.ieee.org/the-human-os/artificial-intelligence/medical-ai/ai-recognizes-covid-19-in-the-sound-of-a-cough

- -

[4] Fakhry, Ahmed, et al. “Virufy: A Multi-Branch Deep Learning Network for Automated Detection of COVID-19.” arXiv preprint arXiv:2103.01806 (2021).

- -

[5] Neeraj Sharma, Prashant Krishnan, Rohit Kumar, Shreyas Ramoji, Srikanth Raj Chetupalli, Nirmala R., Prasanta Kumar Ghosh, and Sriram Ganapathy. Coswara – A Database of Breathing, Cough, and Voice Sounds for COVID-19 Diagnosis. arXiv:2005.10548 [cs, eess], August 2020. URL http://arxiv.org/abs/2005.10548.

- -

[6] Lara Orlandic, Tomas Teijeiro, and David Atienza. The COUGHVID crowdsourcing dataset: A corpus for the study of large-scale cough analysis algorithms. arXiv:2009.11644 [cs, eess], September 2020. URL http://arxiv.org/abs/2009.11644.

- -

[7] Brownlee, J. (2021, January 04). A gentle introduction to threshold-moving for imbalanced classification. Retrieved March 17, 2021, from https://machinelearningmastery.com/threshold-moving-for-imbalanced-classification/

- -

[8] Center for Devices and Radiological Health. (n.d.). EUA Authorized Serology Test Performance. Retrieved March 17, 2021, from https://www.fda.gov/medical-devices/coronavirus-disease-2019-covid-19-emergency-use-authorizations-medical-devices/eua-authorized-serology-test-performance

diff --git a/_site/virufy-on-device-detection-for-covid-19/index.html b/_site/virufy-on-device-detection-for-covid-19/index.html
deleted file mode 100644
index 117e805..0000000
--- a/_site/virufy-on-device-detection-for-covid-19/index.html
+++ /dev/null
@@ -1,426 +0,0 @@

Virufy on-Device Detection for COVID-19

2021, Mar 19

The Team

- - -

Problem Description

- -

COVID-19 testing is inadequate, especially in developing countries. Testing is scarce, requires trained nurses with costly equipment, and is expensive, limiting how many people can obtain their results. Also, many people in developing countries cannot risk taking tests because results are not anonymous, and a positive result may mean a loss of day-to-day work income and starvation for their families, which further allows COVID-19 to spread.

- -

Numerous attempts have been made to solve this problem with partial success, including contact-tracing apps, which have often not been widely adopted due to privacy concerns. Pharmaceutical companies have also fast-tracked development of vaccines, but these still will not be widely available in developing countries for some time.

- -

To combat these problems, we propose a free smartphone app to detect COVID-19 from cough recordings through machine learning analysis of audio signals, which would allow for mass-scale testing and could effectively stop the spread of the virus.

- -

We decided to use offline edge prediction for our app for several reasons. Especially in developing countries, Internet connectivity and latency are limited, and people often face censorship. Data privacy regulations such as GDPR are now commonplace, and on-device prediction allows for diagnoses without personal information or health data crossing borders. Because our app could potentially serve billions of predictions daily, edge prediction is also more cost-effective: building and scaling cloud infrastructure to serve all of those predictions would be costly and difficult to maintain.

- -

System Design

- -

In designing our system and pipeline, we first and foremost kept in mind that this pipeline would run offline on edge devices in developing countries, including outdated phones with weak CPUs. We aimed for a pipeline that could efficiently process data, run a simple model, and return a prediction within a minute. To do this, we simplified our model, sacrificing some "expressiveness" in exchange for reduced complexity, and kept the data preprocessing straightforward.

- -

For the frontend, we decided on a web app because it runs in the browser and is therefore operating-system-agnostic; in comparison, native apps may only run on certain operating systems. Our frontend is written in ReactJS + TypeScript, an industry standard for modern web development. It employs responsive web design principles to be compatible with the wide range of screen sizes and aspect ratios present on different devices. Internally, the frontend calls a TensorFlow.js (TFJS) model for inference.

- -
-
- -
-
- -

We chose the TensorFlow.js (TFJS) framework because it is supported in web browsers. The TFJS Speech Commands library provides a JavaScript implementation of the Fourier transform (browser FFT) that allows straightforward preprocessing of the raw audio. We trained a vanilla TensorFlow model on background-noise examples provided by the sample TFJS Speech Commands code, along with a dataset of thousands of coughs labeled with COVID-19 test results, so that our model could distinguish coughs from background noise. We then converted this trained model into the TFJS LayersModel format (with the model architecture as a JSON file and weights in .bin files), so that we could integrate it into the front-end JavaScript code for in-browser, on-device inference.
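The Keras-to-TFJS export is a one-liner with the `tensorflowjs` Python package; in this hedged sketch the model path and output directory are placeholders.

```python
# Hedged sketch: exporting a trained Keras model to the TFJS LayersModel format
# (model.json plus binary .bin weight shards). Paths are placeholders.
import tensorflowjs as tfjs
from tensorflow import keras

model = keras.models.load_model("cough_model.h5")        # assumed trained artifact
tfjs.converters.save_keras_model(model, "tfjs_model/")   # ready to host, e.g. on S3
```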

- -

Our system’s basic pipeline is as follows:

- -
  1. User opens our app
  2. The TFJS models are downloaded from S3 onto the user's device
  3. Microphone detects noise from user
  4. The Speech Commands library continuously preprocesses the audio by creating audio spectrograms
  5. The spectrograms are run through the model
  6. Only if the audio snippet is classified as a cough, the user will receive a prediction of whether they are COVID positive or negative
- -

It is worth noting that model files are downloaded and loaded into memory only when the user first opens the app. After this, no Internet access is required and the system is able to make predictions offline.

- -

Machine Learning Component

- -

The model that powers our application is based on the publicly available TensorFlow.js Speech Commands module. Our model is intended to be used with the WebAudio API supported by all major browsers and expects, as input, audio data preprocessed with the browser Fast Fourier Transform (FFT) used in WebAudio’s GetFloatFrequencyData. The result of this FFT is a spectrogram which represents a sound wave as a linear combination of single-frequency components. The spectrogram, which can be thought of as a 2D image, is passed through a convolutional architecture to obtain logits which can be used in multiclass prediction. Specifically, this model has 13 layers with four pairs of Conv2D to MaxPooling layers, two dropout layers, a flatten layer, and two dense layers.
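A hedged Keras sketch with this 13-layer shape is shown below; the input shape matches the Speech Commands browser-FFT spectrograms, but the filter counts, kernel sizes, dropout rates, and class count are assumptions rather than our exact configuration.

```python
# Hedged sketch of the 13-layer architecture described above
# (4 x Conv2D+MaxPooling, 2 x Dropout, Flatten, 2 x Dense). Hyperparameters assumed.
from tensorflow import keras
from tensorflow.keras import layers

def build_model(input_shape=(43, 232, 1), num_classes=3):
    model = keras.Sequential()
    model.add(keras.Input(shape=input_shape))
    for filters in (8, 32, 32, 32):                  # assumed filter counts
        model.add(layers.Conv2D(filters, (2, 8), activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Dropout(0.25))
    model.add(layers.Flatten())
    model.add(layers.Dense(256, activation="relu"))  # assumed width
    model.add(layers.Dropout(0.5))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

build_model().summary()
```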

- -

[Figure: model architecture]

- -

Because training from scratch is expensive, we started with a model trained on the Speech Commands dataset [1], which recognizes 20 common words such as the numbers "one" through "ten", the four directions "left", "right", "up", "down", and basic commands like "stop" and "go". We performed transfer learning on this model by removing the prediction layer and initializing a new one with the correct number of prediction classes. Afterwards, we fine-tuned the weights on the open-source COUGHVID dataset, which provides over 20,000 crowdsourced cough recordings spanning a range of characteristics, including gender, geographic location, age, and COVID status.
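A hedged sketch of that transfer-learning step is below; the base-model path, class count, freezing policy, and the `x_train`/`y_train` arrays are assumptions for illustration.

```python
# Hedged sketch: replacing the Speech Commands prediction head and fine-tuning
# on labeled cough spectrograms (paths, class count, and data are assumptions).
from tensorflow import keras
from tensorflow.keras import layers

base = keras.models.load_model("speech_commands_base.h5")       # assumed artifact
features = base.layers[-2].output                                # drop old softmax head
outputs = layers.Dense(3, activation="softmax", name="new_head")(features)
model = keras.Model(inputs=base.input, outputs=outputs)

for layer in model.layers[:-1]:                                  # optionally freeze body
    layer.trainable = False

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(x_train, y_train, validation_split=0.3, epochs=50)
```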

- -

To ensure that data is preprocessed in the same way during training and inference, we use a custom preprocessing TensorFlow model trained to emulate the browser FFT performed by WebAudio, producing the same spectrogram as output. This browser-FFT-emulating model is provided and maintained by TensorFlow Speech Commands. Creating our own training pipeline allowed us to select our model architecture based on existing research efforts and to fine-tune our hyperparameters.

- -

System Evaluation

- -

Offline evaluation was done on our model as a quick way to ensure it was working correctly. This meant setting aside 30% of our data as test data. To monitor offline testing, we used Weights and Biases. As shown below, 50 epochs were sufficient to achieve convergence in training and validation accuracies, with correspondingly decreasing losses. Here is an example of what we logged:

- -

[Figure: Weights & Biases training and validation curves]

- -

As the graphs and chart show, the training loss ("loss") was 0.1717, while the loss on the held-out test set ("val_loss") was 0.09781; the training accuracy ("acc") was 0.93298, while the test accuracy ("val_acc") was 0.96875. Additionally, we evaluated the model before and after TFJS conversion and found that accuracy and loss on both the training and test sets were identical. This mattered because we had initially been concerned that the conversion process would degrade model quality; we were pleasantly surprised that it did not.
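The W&B tracking amounts to adding a callback to `model.fit`; in this hedged sketch the project name, epoch count, and data variables are placeholders.

```python
# Hedged sketch: logging offline training runs to Weights & Biases.
import wandb
from wandb.keras import WandbCallback

wandb.init(project="virufy-edge", config={"epochs": 50})   # project name assumed
model.fit(x_train, y_train,
          validation_data=(x_test, y_test),
          epochs=wandb.config.epochs,
          callbacks=[WandbCallback()])   # logs loss/acc/val_loss/val_acc each epoch
```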

- -

The remainder of our evaluation was done through real-world testing. Although the gold standard would be large-scale, randomized clinical trials with data collected from a variety of demographic groups and recording devices, we did not have the time or resources to do that within the constraints of the class. Instead, we did informal evaluations on our own team members and friends in Argentina and Brazil.

- -

Anecdotally, the prediction was highly accurate on our group members, who were primarily Asian and all healthy. This remained true across a variety of devices such as smartphones, laptops, and tablets.

- -

The collection of external results was complicated by ethical considerations and lack of access to PCR tests to provide ground-truth labels. Nonetheless, we will note two cases in Brazil. One individual had recovered from a previous COVID-19 diagnosis; the model predicted that he had COVID-19. The other individual had COVID-19 but was predicted to be healthy. This illustrates the inherent challenge of translating models from development to production; model accuracy can degrade significantly due to distribution shift between the training and inference data.

- -

Application Demonstration

- -

In the beginning stages of the design process, prior to this course, the Virufy product designer determined the appropriate target audience by conducting user interviews. She selected potential interviewees based on demographic criteria such as being a citizen of selected Latin American countries or being tech-savvy and owning a cell phone.

- -

After gathering target-audience candidates from six Latin American countries as well as the U.S. and Pakistan, user interviews were conducted. The results from the interviews were then synthesized to create user personas. These personas helped her produce empathetic and user-centered designs throughout the whole design process.

- -

[Figure: user personas]

- -

Once initial ideation and designs were completed, the designer conducted a series of prototype user tests in which users were observed as they walked through the app mockup. The data from each user test was then synthesized to design new and improved iterations. After numerous user tests and iterations, the designer created a mockup of the demonstration application.

- -

Over the past month, we worked with the Virufy designer to adapt the design to our specific user needs given our novel contribution of edge prediction. Through discussions with hospitals and everyday users, alongside the technical limitations of TensorFlow.js, we settled on the design below, in which the user clicks the microphone to trigger model execution. We made the instructions simple and easy to follow, so users can record their cough and immediately get a prediction from our edge model, which runs very fast (under 200 ms on our laptops).

- -
- -
- -

Reflections

- -

Throughout the course of this 2-month project, we explored many areas technically, some of which were fruitful, and others of which were dead ends.

- -
  1. Google's Teachable Machine

     At the start of our project, we used the MFCC and mel-spectrogram audio features in our models based on state-of-the-art research, but ran into issues as the same preprocessing code was not supported on-device with TFJS. We reached out to Pete Warden, an expert in TinyML on Google's TensorFlow team, who pointed us to Teachable Machine, a web-based tool that uses TensorFlow.js to train models and generates code to integrate into JavaScript front ends. Although Teachable Machine is very simple and lightweight, we soon discovered it was not a feasible long-term solution for us, as it required manual recording and upload of training audio files and did not give us the flexibility to configure the model architecture as we had hoped. This ultimately led us to train our own custom model.

  2. Speech Commands Library

     TensorFlow's Speech Commands library provided a simple API to access a variety of important features like segmenting the continuous audio stream into one-second snippets and performing FFT feature extraction to obtain spectrograms. The availability of pre-existing training pipelines as well as example applications using Speech Commands provided a strong foundation for us to adapt our own pipeline and frontend application.

  3. Team Dynamics

     We compartmentalized responsibilities such that individual members were largely in charge of separate components of the system. Frequent communication via Slack was key to ensure that we all had a sense of the bigger picture.
- -

Overall, we learned over the quarter how to integrate frontend and backend codebases to build a production machine learning system, while utilizing APIs and libraries to expedite the process. Our knowledge also broadened as we considered the unique challenges of developing models for CPU-bound edge devices in the audio analysis domain.

- -

Continuing beyond this course, we would like to explore the following areas:

- -
  1. Model Performance

     State-of-the-art research papers suggest that accuracies as high as 98% are possible for larger neural networks. We would like to tune our tiny edge models to perform at similar accuracies.

  2. Dataset Diversity

     Our model development was limited by the lack of access to large-scale, demographically diverse, and accurately labelled datasets. For next steps, we hope to remedy this by leveraging the Coswara dataset, along with the larger datasets Virufy is collecting globally.

  3. Microphone Calibration

     We did not take into account the distribution shift between training and inference caused by differences in microphone hardware specifications across edge devices.

  4. Audio Compression

     The audio samples we trained on were of similar audio formats and frequencies. Exploring the effect of audio compression codecs such as MP3 on model performance may lead to interesting insights.

  5. Expansion to More Diseases

     COVID-19 is not the only disease that affects patient cough sounds. We believe our model can be enhanced with a multi-class classifier to distinguish COVID-19 from other respiratory illnesses such as the common cold and flu, along with asthma and pneumonia.

  6. Embedded Hardware

     An interesting area to explore is further shrinking our model to fit onto specialized embedded devices with microphones. Such devices could be cheaply produced and shipped globally to provide COVID detection without smartphones.
- -

Broader Impacts

- -

Our app is intended to be used by people in developing countries who need an anonymous solution for testing anytime, or by anyone in a community at risk of COVID-19. However, we have identified some unintended uses of our app.

- -

Because we intend to share our technology freely and because the algorithm runs on-device, competitors will easily be able to take our algorithm and create copies of our app and may even block access to our app and sell theirs for profit. To prevent this, we will open source our technology under terms requiring attribution to Virufy and prohibiting charging users for the use of the algorithm.

- -

Another risk is that people may begin to ignore medical advice and believe only in the algorithm and might use the results in place of an actual diagnostic test. This is very risky because if the algorithm mispredicts, we may be held liable. The spread of COVID-19 may increase if COVID-19 positive people become confident to socialize with false negative test results. To mitigate this, we intend to add disclaimers that our app is a pre-screening tool that should be used only in conjunction with medical providers’ input. Additionally, we will work closely with public health authorities to clinically validate our algorithm and ensure it is safe for usage.

- -
-
- -
- -
- -
-
- -

People may also start testing the algorithm with irrelevant recordings of random noises such as talking. To address this, we have equipped our algorithm with a cough detection pre-check layer to prevent any non-cough noises from being classified.

- -

Finally, people, especially in poorer contexts, may share the same smartphone among several users, which can increase the likelihood of spreading COVID-19. Thus, our instructions clearly state that users must disinfect their device and keep 20 feet away from others while recording.

- -

Code

- -

Our TensorFlow JavaScript audio preprocessing and model prediction code can be found here: https://github.com/dtch1997/virufy-tm-cough-app

- -

Our finalized progressive web application code can be found here: https://github.com/virufy/demo/tree/edge-xoor

- -

References

- -

We’re extremely grateful to Pete Warden, Jason Mayes, and Tiezhen Wang from Google’s TensorFlow.js team for their kind guidance on TinyML concepts and usage of the speech_commands library, both in class lecture and during the few weeks of our development.

- -

Jonatan Jaskilioff and the team at XOOR were very gracious to lend their support and guidance in integrating our JavaScript code into the progressive web app they had built pro bono for Virufy.

- -

We are also indebted to the broader Virufy team for guiding us on the real-world applicability and challenges of our edge device prediction project. We leveraged their deep insights from their members distributed across 20 developing countries in formulating our problem statement. Additionally, we built on top of the open source demo app that they had built prior based on intentions for real-life usage, along with their prior research findings and open source code for our model training.

- -

In preparing our final report, we are grateful to Colleen Wang for her kind support in editing the content of our post, Virufy lead UX designer Maisie Mora for helping explain the design process in the application demonstration section, and Saad Aslam for his kind support in converting our blog post to a nicely formatted HTML page.

- -

Finally, we cannot forget the great lessons and close guidance from Professor Chip Huyen and TA Michael Cooper who helped us open our eyes to production machine learning and formulate our problem to be attainable within the short 2 month course quarter.

- -

[1] Tensorflow Speech Commands dataset, https://arxiv.org/pdf/1804.03209.pdf

- -

[2] Teachable Machine, https://teachablemachine.withgoogle.com/

- -

[3] Virufy: A Multi-Branch Deep Learning Network for Automated Detection of COVID-19 https://arxiv.org/ftp/arxiv/papers/2103/2103.01806.pdf

- - - - - diff --git a/assets/css/main.css b/assets/css/main.css index 3ccc8f4..26db355 100755 --- a/assets/css/main.css +++ b/assets/css/main.css @@ -773,7 +773,7 @@ html { } body { - font-family: 'Lato', sans-serif; + font-family: 'Source Sans Pro', sans-serif; color: #515151; background-color: #fbfbfb; margin: 0; @@ -933,7 +933,7 @@ table tfoot td { width: 240px; height: 100%; padding: 20px 10px; - background-color: #ffffff; + background-color: #3056a4; } .about { @@ -948,7 +948,7 @@ table tfoot td { -webkit-border-radius: 100%; border-radius: 100%; overflow: hidden; - background-color: #333030; + background-color: #ffffff; } .about img { @@ -975,7 +975,7 @@ table tfoot td { padding-bottom: 15px; font-size: 16px; text-transform: uppercase; - color: #333030; + color: #ffffff; font-weight: 700; } @@ -992,11 +992,11 @@ table tfoot td { height: 7px; -webkit-border-radius: 100%; border-radius: 100%; - background-color: #515151; + background-color: #ffffff; } .about p { - font-size: 16px; + font-size: 22px; margin: 0 0 10px; } @@ -1023,7 +1023,7 @@ table tfoot td { .contact .contact-title { position: relative; - color: #333030; + color: #ffffff; font-weight: 400; font-size: 12px; margin: 0 0 5px; @@ -1043,7 +1043,7 @@ table tfoot td { position: absolute; top: 50%; left: 0; - background-color: #515151; + background-color: #ffffff; } .contact .contact-title::after { @@ -1058,7 +1058,7 @@ table tfoot td { position: absolute; top: 50%; right: 0; - background-color: #515151; + background-color: #ffffff; } .contact ul { @@ -1078,7 +1078,7 @@ table tfoot td { } .contact ul li a { - color: #515151; + color: #ffffff; display: block; padding: 5px; font-size: 18px; @@ -1088,7 +1088,7 @@ table tfoot td { } .contact ul li a:hover { - color: #333030; + color: #ffffff; -webkit-transform: scale(1.2); -ms-transform: scale(1.2); transform: scale(1.2); @@ -1171,10 +1171,14 @@ footer .copyright { margin-top: 0; } +.page-title { + color: #ffffff; +} + a.older-posts, a.newer-posts { font-size: 18px; display: inline-block; - color: #515151; + color: #ffffff; -webkit-transition: -webkit-transform .2s; transition: -webkit-transform .2s; -o-transition: transform .2s; diff --git a/assets/css/scss/_variables.scss b/assets/css/scss/_variables.scss index 7a75e8c..c8c4ae4 100755 --- a/assets/css/scss/_variables.scss +++ b/assets/css/scss/_variables.scss @@ -5,3 +5,4 @@ $gray: #ecf0f1; $dark-gray: #a0a0a0; $dark-blue: #263959; $dark: #333030; +$Regular: 'Source Sans Pro', sans-serif; \ No newline at end of file diff --git a/assets/css/scss/main.scss b/assets/css/scss/main.scss index a566e74..8daa721 100755 --- a/assets/css/scss/main.scss +++ b/assets/css/scss/main.scss @@ -11,7 +11,7 @@ html { } body { - font-family: 'Lato', sans-serif; + font-family: 'Source Sans Pro', sans-serif; color: $body-color; background-color: #fbfbfb; margin: 0; diff --git a/assets/img/lappis.png b/assets/img/lappis.png new file mode 100644 index 0000000..e13874e Binary files /dev/null and b/assets/img/lappis.png differ