This is a personal fun project where I play with and learn about Parquet and Python.
Apache Parquet is a columnar storage file format optimized for use with big data processing frameworks. It is designed for efficiency and performance, making it a popular choice for storing and analyzing large datasets.
So far, I have built two features, both inspired by my own struggles with Parquet files.
Using this tab, you can upload a CSV file (max 200 MB in size) and convert it to a Parquet file.
Using this tab, you can upload a Parquet file, and it will show you details about its metadata and schema, plus a preview of the first 20 rows.
Right now, the whole app is in one file: `app/streamlit_app.py`
```
.
├── app                # parqueology app files
├── data               # Sample data
├── requirements.txt
└── README.md
```
Note: This app uses Streamlit, an open-source framework for building data apps. For simple experimental projects like this one, Streamlit is often recommended over heavier frameworks like Django or Flask.
Assuming you're on a Debian/Ubuntu-based Linux environment (personally, I use Linux Mint), here are the basic steps to follow:
- Ensure you have the latest versions of `python3`, `python3-pip`, `python3-venv`, and `git`:

```shell
sudo apt update
sudo apt install --only-upgrade python3 python3-pip python3-venv git
python3 --version
pip3 --version
git --version
```
- Clone this repo:

```shell
cd ~/Projects/  # Replace with your preferred directory for Git repos
git clone https://github.com/monjacoder/parqueology.git
cd parqueology
```
- Set up a virtual environment:

```shell
python3 -m venv venv
source venv/bin/activate
```
- Install dependencies:

```shell
pip install -r requirements.txt
```
- Run the app using `streamlit`:

```shell
streamlit run app/streamlit_app.py
```
If you have suggestions, feedback, or ideas, feel free to open an issue or reach out to me!