SemanText is a corpus linguistics tool for Indonesian text, built with the Streamlit framework. It is designed to simplify the analysis of news articles sourced from popular Indonesian news publishers. Whether you're a linguist, researcher, or language enthusiast, SemanText offers a suite of features for exploring and understanding textual data.
- URL Scraper: Effortlessly scrape articles by providing a list of URLs saved in a .txt file, each separated by a line break.
- Multiple CSV Files: Seamlessly manage and handle multiple CSV files for efficient data organization.
- Most Frequent Words: Quickly identify and analyze the most frequently occurring words in the corpus.
- N-gram Extraction - Frequency: Uncover insights through the extraction and frequency analysis of N-grams.
- Rule-Based Collocation Extraction - Frequency: Utilizing the Stanza library, leverage three distinct patterns of Part-of-Speech tags grounded in the Universal Dependencies model for Indonesian:
  - NOUN + ADJ: Discover meaningful collocations where a noun is paired with an adjective.
  - NOUN + NOUN: Uncover insights into frequent noun pairings.
  - VERB + NOUN: Explore the dynamic interplay between verbs and nouns.
- Key Word in Context/Concordance: Gain contextual understanding by exploring the occurrence of specific keywords within the corpus.
- Export Data to CSV Format: Easily export analyzed data to CSV format for further in-depth analysis.
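
To make the rule-based collocation feature more concrete, here is a minimal sketch of frequency counting over the three POS patterns listed above. It is not SemanText's actual implementation: the function name `extract_collocations`, the input shape (lists of `(word, upos)` tuples), and the choice to match only directly adjacent words are assumptions made for illustration, and the sample sentences are tagged by hand rather than by Stanza.

```python
from collections import Counter

# The three Universal Dependencies POS patterns named in the feature list.
PATTERNS = {("NOUN", "ADJ"), ("NOUN", "NOUN"), ("VERB", "NOUN")}


def extract_collocations(sentences, patterns=PATTERNS):
    """Count adjacent word pairs whose UPOS tags match one of the patterns.

    `sentences` is a list of sentences, each a list of (word, upos) tuples.
    Returns a Counter keyed by (first_word, second_word, "TAG1+TAG2").
    """
    counts = Counter()
    for sentence in sentences:
        # Pair every token with its immediate right-hand neighbour.
        for (w1, t1), (w2, t2) in zip(sentence, sentence[1:]):
            if (t1, t2) in patterns:
                counts[(w1.lower(), w2.lower(), f"{t1}+{t2}")] += 1
    return counts


# Hand-tagged toy sentences standing in for tagger output (assumption).
tagged = [
    [("Harga", "NOUN"), ("beras", "NOUN"), ("naik", "VERB")],
    [("Mereka", "PRON"), ("membeli", "VERB"), ("rumah", "NOUN"), ("besar", "ADJ")],
    [("Harga", "NOUN"), ("beras", "NOUN"), ("mahal", "ADJ")],
]

collocations = extract_collocations(tagged)
for (w1, w2, pattern), freq in collocations.most_common():
    print(f"{w1} {w2}\t{pattern}\t{freq}")
```

With Stanza, the `(word, upos)` tuples could be built from the `text` and `upos` attributes of each word in a parsed sentence. Matching only adjacent words keeps the sketch short; a real extractor might also allow intervening tokens or use dependency relations.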
- Statistical Measures for Collocation Extraction: Calculate and display statistical measures, such as Pointwise Mutual Information (PMI), to identify and interpret collocations more reliably than with plain frequency alone.
- CoNLL-U File Analysis: Read and parse CoNLL-U files, a standard format for representing linguistic annotations.
- Keyword-Based Collocation Extraction: Extract collocations that include a certain keyword.
- Scraper - Google Search: Retrieve articles that match specific Google Search queries.
- Machine Learning - Topic Modelling: Identify and group similar articles based on their content using unsupervised machine learning algorithms.
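
As an illustration of the PMI measure mentioned in the list above, the following sketch scores adjacent word pairs with PMI(x, y) = log2(P(x, y) / (P(x) P(y))). The function name `bigram_pmi`, the `min_count` parameter, and the sample tokens are hypothetical, and the probabilities are simple relative frequencies, which is one common estimate rather than necessarily the one SemanText will use.

```python
import math
from collections import Counter


def bigram_pmi(tokens, min_count=1):
    """Return a dict mapping each adjacent word pair to its PMI score.

    Probabilities are estimated as relative frequencies of unigrams and
    bigrams in `tokens`. Pairs seen fewer than `min_count` times are skipped.
    """
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total_unigrams = sum(unigrams.values())
    total_bigrams = sum(bigrams.values())

    scores = {}
    for (w1, w2), count in bigrams.items():
        if count < min_count:
            continue
        p_xy = count / total_bigrams
        p_x = unigrams[w1] / total_unigrams
        p_y = unigrams[w2] / total_unigrams
        scores[(w1, w2)] = math.log2(p_xy / (p_x * p_y))
    return scores


# Toy token sequence (assumption, for illustration only).
tokens = "harga beras naik harga beras turun harga minyak naik".split()

scores = bigram_pmi(tokens)
for pair, score in sorted(scores.items(), key=lambda item: item[1], reverse=True):
    print(" ".join(pair), round(score, 3))
```

PMI tends to reward pairs made of rare words, so a pair seen once can outrank a pair seen several times. Raising `min_count` is a simple way to filter out such low-frequency pairs before ranking.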
- Python 3.7 or Higher: Ensure your Python version is 3.7 or higher for compatibility.
- Clone the Repository: Copy the repository to your local machine.
- Install Dependencies: Run the command `pip install -r requirements.txt` to install the required dependencies.
Experience the tool firsthand by exploring the demo here.
This project is licensed under the terms of the MIT license. Feel free to contribute and enhance the capabilities of SemanText. Your feedback and contributions are highly appreciated!