GLAMhack2021: Data-Diver

Automatically generate an overview of open source datasets

Project for GLAMhack 2021 (#GLAMhack2021), Fri 16th April - Sat 17th April

About

Finding and viewing open datasets can be time-consuming. A dataset can take a long time to download, and only after exploring it do you realise that it is not in the form you needed.

The idea is to create a tool that can run on open data providers' servers, as well as on a local computer, and that automatically generates an overview of the files and images in a dataset, along with summary statistics.

Implementation

The aim is to create a containerised process that takes a locally available filepath and returns an HTML page or JSON with:

a collage of example images
the number of files of each type
summary statistics with data quality indicators, if CSV files are present
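
As a rough illustration of the last two items, the following is a minimal sketch and not the project's actual code. The function names (`count_file_types`, `summarise_csv`, `build_overview`) are hypothetical, and "data quality" is reduced here to counting empty cells per column, which is only one possible measure. It also assumes the CSV files are UTF-8 encoded and have a header row.

```python
import csv
import json
from collections import Counter
from pathlib import Path


def count_file_types(folder):
    """Count the files under `folder` (recursively) by lowercase extension."""
    counts = Counter()
    for path in Path(folder).rglob("*"):
        if path.is_file():
            counts[path.suffix.lower() or "(no extension)"] += 1
    return dict(counts)


def summarise_csv(csv_path):
    """Return the row count and the number of empty cells per column."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        columns = list(reader.fieldnames or [])
        missing = {name: 0 for name in columns}
        rows = 0
        for row in reader:
            rows += 1
            for name in columns:
                # Short rows yield None for absent cells; treat them as empty.
                if not (row.get(name) or "").strip():
                    missing[name] += 1
    return {"rows": rows, "columns": columns, "missing_values": missing}


def build_overview(folder):
    """Assemble a JSON-serialisable overview of a folder (hypothetical layout)."""
    folder = Path(folder)
    csv_files = sorted(
        p for p in folder.rglob("*") if p.is_file() and p.suffix.lower() == ".csv"
    )
    return {
        "file_types": count_file_types(folder),
        "csv_summaries": {
            p.relative_to(folder).as_posix(): summarise_csv(p) for p in csv_files
        },
    }


if __name__ == "__main__":
    # Summarise the current directory and print the result as JSON.
    print(json.dumps(build_overview("."), indent=2))
```

The returned dictionary can be passed to `json.dumps` directly, or used as the input for an HTML template.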

How it Works

as server-side microservice

to follow

as locally run app

From the src folder, run:

python3 output_html.py

To Do

[X] get list of files in a zip (without unzipping it)
[X] return numbers of files in a zip (without unzipping it)
[X] use print statements to create a report in json
[X] generate report as standalone html file
[ ] tidy and refactor output_html.py
[ ] create Flask endpoint that accepts a filepath as parameter and returns a json summary and image collage
[ ] function to check file size of archive members before processing
[X] make photo collage (could unzip specific files)
[ ] make image resolution summary function
[ ] add flags for input parameters to scripts
[ ] add summaries for csv files
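
The zip-related items above (listing and counting members without unzipping, and checking member sizes before processing) can be sketched with the standard `zipfile` module. This is an illustrative sketch, not the code in `output_html.py`. The function name `summarise_zip` and the 50 MB threshold are assumptions made for the example.

```python
import zipfile
from collections import Counter
from pathlib import PurePosixPath

# Hypothetical threshold: members larger than this are flagged, not processed.
MAX_MEMBER_BYTES = 50 * 1024 * 1024


def summarise_zip(zip_path, max_member_bytes=MAX_MEMBER_BYTES):
    """Count archive members by extension using only the zip's directory.

    Nothing is extracted: ZipFile.infolist() reads the central directory,
    which already records each member's name and uncompressed size.
    """
    counts = Counter()
    oversized = []
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            if info.is_dir():
                continue  # skip explicit directory entries
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix or "(no extension)"] += 1
            # file_size is the uncompressed size recorded in the archive.
            if info.file_size > max_member_bytes:
                oversized.append(info.filename)
    return {
        "total_files": sum(counts.values()),
        "file_types": dict(counts),
        "oversized_members": oversized,
    }
```

Because only the archive's directory is read, this stays fast even for large archives. Individual members, such as a few images for the collage, could then be read selectively with `ZipFile.open`.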

Further Steps
[ ] add support for tar files
[ ] create microservice to run locally (e.g. Flask)
[ ] containerise app with Docker
[ ] decide if core functions should be split into a module
[ ] test speed of different methods
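
For the tar support item, one possible starting point is the standard `tarfile` module. The sketch below is an assumption about how this could look, and `summarise_tar` is a hypothetical name. Unlike zip, a tar archive has no central directory, so listing the members of a compressed tar means reading through the whole stream, which is worth keeping in mind when comparing the speed of different methods.

```python
import tarfile
from collections import Counter
from pathlib import PurePosixPath


def summarise_tar(tar_path):
    """Count the regular files in a tar archive by lowercase extension.

    Mode "r:*" lets tarfile detect the compression (none, gzip, bz2, xz).
    Members are iterated from their headers; no file is written to disk.
    """
    counts = Counter()
    with tarfile.open(tar_path, "r:*") as archive:
        for member in archive:
            if member.isfile():  # ignore directories, links and devices
                suffix = PurePosixPath(member.name).suffix.lower()
                counts[suffix or "(no extension)"] += 1
    return {"total_files": sum(counts.values()), "file_types": dict(counts)}
```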

Notes/References

project gdrive folder: https://drive.google.com/drive/u/0/folders/1NbOzxm78wAVe_ZJaiIUvrIm42m__ndFU

zipfile module: https://docs.python.org/3/library/zipfile.html

eth-library-lab/GlamHack2021-DataDiver
