We're thrilled you're interested in joining SimPPL! This assignment is designed to give you a taste of the kind of engineering challenges we tackle: you will build a basic data collection platform that gathers data from one or more fringe social networks, within each platform's terms of service, with a focus on scalability and robust design.
Here are some examples to guide you in learning about smaller, so-called fringe platforms (existing data collection tools for these platforms can serve as references), listed in order of priority for the kinds of platforms we'd like to see you analyze.
Note: Pick a minimum of one and a maximum of two platforms from which you will gather data, present a set of inferences via graphs and analyses, and host your own system with search capabilities over the platform's data. Bonus points if you include mobile-first platforms or conduct multilingual or multimodal analysis. We added links to platforms you may not have heard of to make them easy to find.
- Gab
- Threads (Meta)
- ShareChat
- Josh
- 4chan
- Telegram
- Bitchute
- Rumble
- Moj
- Mastodon
- VK
- Win communities, especially Patriots.win
- Meta
We have built tools for collecting and analyzing data from Reddit and Twitter, including Parrot, which we used to study the sharing of news from certain unreliable Russian media providers. To ramp you up on extending such tools, and to expand your understanding of the broader social media ecosystem, we would like you to construct an analysis similar to Parrot's by studying one or two of the publicly accessible platforms listed above. We would like you to present an analysis of a broader range of viewpoints from different (apolitical or politically biased) groups. You may even pick a case study to present, e.g. a relevant controversy, campaign, or civic event.
In the long run, this research intends to accomplish the following objectives:
- Track different popular trends to understand how public content is propagated on different social media platforms.
- Identify posts containing misleading information with the use of claims verification mechanisms.
- Analyze the trends across a large number of influential accounts over time in order to report on the influence of a narrative.
- Data Collection: Develop a scraper that collects data from one of the listed social media platforms. Your scraper should gather post content, engagement metrics (likes, shares, etc.), user information, and timestamps, and it should be able to collect data based on hashtags, keywords, news links, and user queries (a minimal sketch of what this could look like follows the notes below). Implementing multiple data collection methods will earn bonus points.
- System Design: Create a system design (in words and with a diagram) that explains how your data collection solution can be scaled to handle a large volume of data, and potentially multiple social media platforms, without being blocked by those platforms at reasonable scale. Consider cloud infrastructure, data storage, processing pipelines, and any APIs or services you would use.
Note: a. You may use Figma or draw.io for the system design diagram but please commit it as a PNG or JPG image to your repository so we can access and evaluate it.
b. Platforms can easily block naive HTML scraping with Beautiful Soup, so that will not be a great solution.
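To make the data collection requirement concrete, here is a minimal sketch of one possible starting point: collecting hashtag-based posts from Mastodon (one of the listed platforms) through its documented public API using Python. The instance URL, the field selection, and the assumption that the instance allows unauthenticated reads of public timelines are illustrative; adapt them to whichever platform and access method you choose, and respect its rate limits and terms of service.

```python
import requests

# Assumption: an instance whose settings allow unauthenticated reads of public timelines.
INSTANCE = "https://mastodon.social"

def fetch_hashtag_posts(hashtag: str, limit: int = 40) -> list[dict]:
    """Return recent public posts tagged with `hashtag` as flat records."""
    resp = requests.get(
        f"{INSTANCE}/api/v1/timelines/tag/{hashtag}",
        params={"limit": limit},  # Mastodon caps this endpoint at 40 results per page
        timeout=30,
    )
    resp.raise_for_status()
    records = []
    for status in resp.json():
        records.append({
            "post_id": status["id"],
            "created_at": status["created_at"],
            "author": status["account"]["acct"],
            "content_html": status["content"],   # post body is returned as HTML
            "favourites": status["favourites_count"],
            "reblogs": status["reblogs_count"],
            "replies": status["replies_count"],
            "url": status["url"],
        })
    return records

if __name__ == "__main__":
    for post in fetch_hashtag_posts("climate")[:5]:
        print(post["created_at"], post["author"], post["favourites"])
```

A production version would add pagination (e.g. via the API's max_id parameter), retries with backoff, and persistence to a database instead of printing to stdout.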
We will evaluate your submission on the following criteria:
- Functionality: Does the scraper work reliably and collect the required data?
- Scalability: Is the system design well-thought-out and scalable? Does it address the potential challenges of handling large datasets and multiple platforms? Does the solution scale beyond one-time searches (e.g. scraping HTML off of web pages may be easy to implement, but it will not scale for social media platforms because it may violate their terms and will quickly get flagged and blocked)?
- Data Collection and Search: Is there a search functionality for data collection using news URLs, hashtags, keywords, or queries across the platform, made available through this solution?
- Documentation: Is the code and system design modular, well-documented, and easy for a beginner to understand?
Bonus Points: Create a system diagram illustrating your architecture for the entire data pipeline, from user queries to data storage. This diagram should detail how queries are handled, the specific data collected, the processing and transformation steps, and the storage solutions you would use. Explain how the system would handle concurrent requests and provide a high-level overview of your chosen cloud infrastructure and services, justifying your selection. A clear, well-explained diagram demonstrating a scalable design is highly valued.
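If it helps you reason about the concurrent-request part of the design, here is a deliberately simplified, in-process sketch of the pattern your diagram should capture: user queries enter a queue, a pool of workers consumes them, and results flow to storage. The collect_posts and store_posts functions are hypothetical placeholders for your own modules, and a real deployment would replace the in-process queue and threads with managed services (e.g. a message queue plus autoscaled workers).

```python
import queue
import threading

query_queue = queue.Queue()  # incoming user queries (in-process stand-in for a managed queue)

def collect_posts(query: dict) -> list:
    """Hypothetical placeholder: call your platform-specific scraper or API client here."""
    return []

def store_posts(posts: list) -> None:
    """Hypothetical placeholder: write records to your database or object store."""
    pass

def worker() -> None:
    # Each worker handles one query at a time; in a real deployment this role
    # would be played by autoscaled consumers reading from a managed queue.
    while True:
        query = query_queue.get()
        try:
            store_posts(collect_posts(query))
        finally:
            query_queue.task_done()

# Fixed worker count for illustration only.
for _ in range(4):
    threading.Thread(target=worker, daemon=True).start()

# Simulate a few concurrent user queries.
for q in ({"type": "hashtag", "value": "election"},
          {"type": "keyword", "value": "vaccine"}):
    query_queue.put(q)
query_queue.join()
```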
These instructions outline how to use GitHub for this assignment. Please follow them carefully to ensure your work is properly submitted.
- Fork the Repository:
- Go to the assignment repository provided by the instructor: [Insert Repository Link Here]
- Click the "Fork" button in the top right corner of the page. This creates a copy of the repository in your GitHub account.
- Clone Your Fork:
- Go to your forked repository (it will be in your GitHub account).
- Click the "Code" button (the green one) and copy the URL. This will be a git URL (ending in .git).
- Open a terminal or Git Bash on your local machine.
- Navigate to the directory where you want to work on the assignment using the cd command. For example:
cd /path/to/your/projects
- Clone your forked repository using the following command: git clone <your_forked_repository_url> (replace <your_forked_repository_url> with the URL you copied).
This will download the repository to your local machine.
- Develop Your Solution:
Work on your assignment within the cloned repository. Create your code files, visualizations, and any other required deliverables. Make sure to save your work regularly.
- Commit Your Changes:
- After making changes, you need to "stage" them for commit. This tells Git which changes you want to include in the next snapshot.
- Use one of the following commands to stage your changes:
- To stage all files in the current directory: git add .
- Or, to stage specific files: git add <file> ...
- Now, commit your staged changes with a descriptive message: git commit -m "Your commit message here" (replace "Your commit message here" with a brief description of the changes you made. Be clear and concise!)
- Push your commits back to your forked repository on GitHub: git push origin main (or, if you're working on a branch other than main, replace main with your branch name; origin refers to the remote repository you cloned from).
Please notify us of your submission by emailing [email protected] with the subject line "Submitting ML Engineer Assignment for SimPPL".
Please ensure you include:
- A detailed README file (with screenshots of your solution and a link to your hosted web platform).
- A text-based explanation of your code and the thought process underlying your system design.
- Detailed documentation and a PNG or JPEG file of your system design.
The last two items make it easier for us to run your code and evaluate the assignment.
Some tools and technologies you may find useful:
- OSINT Tools
- Colly
- AppWorld
- Scrapling
- Selenium
- Puppeteer
- DuckDB
- Cloudflare Workers
- Apache Superset
- Terraform
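As one illustration of how these tools could support the search requirement, here is a minimal sketch that stores collected posts in DuckDB (listed above) and runs a keyword search over them. The schema, file name, and sample record are illustrative assumptions rather than a prescribed design.

```python
import duckdb
from datetime import datetime

con = duckdb.connect("posts.duckdb")  # on-disk database file (name is an assumption)
con.execute("""
    CREATE TABLE IF NOT EXISTS posts (
        post_id    VARCHAR,
        platform   VARCHAR,
        author     VARCHAR,
        created_at TIMESTAMP,
        content    VARCHAR,
        likes      INTEGER,
        shares     INTEGER
    )
""")

# In practice these rows come from your scraper; one record is hard-coded here for illustration.
con.execute(
    "INSERT INTO posts VALUES (?, ?, ?, ?, ?, ?, ?)",
    ["123", "mastodon", "alice", datetime(2024, 5, 1, 10, 0),
     "Example post about an election", 12, 3],
)

def search_posts(keyword: str):
    """Case-insensitive keyword search over post content, newest first."""
    return con.execute(
        "SELECT platform, author, created_at, content "
        "FROM posts WHERE content ILIKE ? ORDER BY created_at DESC",
        [f"%{keyword}%"],
    ).fetchall()

print(search_posts("election"))
```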
Focus on the analysis you are presenting and the story you are telling us through it. A well-designed and scalable system is more important than a complex one with a ton of features. Consider using innovative technologies in a user-friendly manner to create unique features for your platform, such as AI-generated summaries that adapt to the data a user searches for on your platform.
Presentation matters! Make sure your submission is easy to understand. Create an intuitive and meaningful README file or a Wiki that can be used to review your solution. Host it so it is accessible to anyone. Share a video demo even if your solution is hosted, so that users understand how to interpret the insights you present.
At SimPPL, we're building tools to analyze how information spreads on social media, especially from unreliable sources. Your work will help inform how to scale our analysis to a wider range of platforms and handle larger datasets. This is crucial for tracking trends, identifying misinformation, and understanding how narratives spread online.
We're excited to see your solution!