This project is a part of the Data Science Working Group at Code for San Francisco. Other DSWG projects can be found at the main GitHub repo.
The purpose of this project is to help Code for America process and analyze free-text response data from GetCalFresh applications to better understand the circumstances in which people apply to the program. The goals of the project are to:
- Spellcheck the text for better downstream processing and analysis
- Remove personally identifiable information (PII) from the text
- Complete exploratory data analysis and topic modelling of the text
- Code for America
- https://www.codeforamerica.org, https://www.getcalfresh.org
- Partner contact: Eric Gianella
- Data Pipelines
- Natural Language Processing
- Object Oriented Programming
- Python
- Jupyter Notebook
- NLTK and other NLP libraries
19.4% of Californians did not have enough resources to meet basic needs in 2016 (source: the Economist). One of the initiatives supervised by the State of California to help those in need is CalFresh, also known as the Supplemental Nutrition Assistance Program (SNAP). CalFresh provides monthly food benefits to help low-income households purchase the food they need to maintain adequate nutrition. These benefits are issued on an Electronic Benefit Transfer (EBT) card, which looks like any other credit card. Because the application process can be confusing and difficult to navigate, Code for America’s GetCalFresh program works to ensure that everyone can access food assistance benefits.
Code for America wants to use the rich information in the free-response portion of GetCalFresh applications to better understand the sentiments and circumstances behind people's decisions to apply. The hope is that this will help educate the public and break common stereotypes and stigmas associated with recipients of food stamp programs. This information may also encourage others to apply. The DSWG is helping CFA process the text data so that spelling errors are corrected, which will allow personal information to be removed more effectively and will aid downstream analysis. We also plan to use machine learning and NLP methods such as topic modelling to help classify and quantify the circumstances described in the response text.
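As a rough illustration of the processing order described above (spelling correction first, then removal of personal information), here is a minimal sketch using only the Python standard library. It is not the project's custom spellchecker: the toy `VOCABULARY`, the regular expressions, and the function names `correct_token`, `correct_spelling`, `redact_pii`, and `process_response` are all hypothetical and chosen for illustration only. Pattern-based redaction like this only catches phone numbers and email addresses; names and addresses would need additional methods.

```python
import re
from difflib import get_close_matches

# Hypothetical toy vocabulary; a real pipeline would load a full word list.
VOCABULARY = (
    "i", "need", "help", "buying", "groceries", "for", "my",
    "children", "call", "lost", "job", "rent", "food",
)

# Assumed patterns for two common kinds of personal information.
PHONE_RE = re.compile(r"\b\d{3}[-.\s]?\d{3}[-.\s]?\d{4}\b")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def correct_token(token, vocabulary=VOCABULARY, cutoff=0.8):
    """Replace a misspelled alphabetic token with its closest vocabulary word."""
    # Split off leading and trailing punctuation so "childern," is handled.
    lead, core, trail = re.fullmatch(r"(\W*)(.*?)(\W*)", token).groups()
    # Leave numbers, emails, and known words untouched.
    if not core.isalpha() or core.lower() in vocabulary:
        return token
    matches = get_close_matches(core.lower(), vocabulary, n=1, cutoff=cutoff)
    if not matches:
        return token
    fixed = matches[0].capitalize() if core[0].isupper() else matches[0]
    return lead + fixed + trail


def correct_spelling(text):
    """Correct each whitespace-delimited token, preserving the spacing."""
    return re.sub(r"\S+", lambda m: correct_token(m.group()), text)


def redact_pii(text):
    """Replace email addresses and phone numbers with placeholder tags."""
    text = EMAIL_RE.sub("[EMAIL]", text)
    return PHONE_RE.sub("[PHONE]", text)


def process_response(text):
    """Spellcheck first, then redact, mirroring the order described above."""
    return redact_pii(correct_spelling(text))


if __name__ == "__main__":
    sample = "I need hlp buying groceris for my childern, call 415-555-0100"
    print(process_response(sample))
```

The spellchecking step deliberately skips any token that is not purely alphabetic, so phone numbers and email addresses reach the redaction step unchanged.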
- NLP processing pipelines
- Spellchecker improvements
- Data exploration/descriptive statistics
- Topic Modelling
- Clone this repo (for help see this tutorial).
- The data for this project contains sensitive information. Please reach out to the team leads for access.
- Data Processing Code (including our custom spellchecker) can be accessed here
- Sample Text Data for prototyping and exploring can be accessed here
Team Leads (Contacts): Rocio Ng (@rocio)
Name | Slack Handle |
---|---|
Ian Colrick | @icorlick |
Melodie Belot | @Melodie |
- If you haven't joined the SF Brigade Slack, you can do that here.
- Our Slack channel is `#datasci-calfresh`
- Feel free to contact team leads with any questions or if you are interested in contributing!