Hackathon project for automatically generating an overview of datasets from an archive/.zip file.

eth-library-lab/GlamHack2021-DataDiver

Automatically generate an overview of open source datasets

Project for GLAMhack 2021 ( #GLAMhack2021 ) Fri 16th April - Sat 17th April

About

Finding and viewing open datasets can be time-consuming. A dataset can take a long time to download, and only after exploring it do you realise it is not in the form you needed.

The idea is to create a tool that can run on open data providers' servers, and also on a local computer, and that automatically generates an overview of the files and images in a dataset, along with summary statistics.

Implementation

The aim is to create a containerised process that takes a locally available filepath and returns an HTML page or JSON with:

  • a collage of example images
  • the number of files of each file type
  • summary statistics, including data quality, for any CSV files present
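
The file-type count can be sketched with the standard-library zipfile module, which reads the archive's directory without extracting anything. This is a minimal illustration and not the code in output_html.py: the function names count_file_types and build_demo_archive, and the "(no extension)" label, are assumptions made for the example.

```python
import io
import zipfile
from collections import Counter
from pathlib import PurePosixPath


def count_file_types(archive):
    """Count archive members by lowercase file extension without extracting them."""
    counts = Counter()
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directory entries are not files
            suffix = PurePosixPath(info.filename).suffix.lower()
            counts[suffix or "(no extension)"] += 1  # hypothetical label
    return dict(counts)


def build_demo_archive():
    """Create a small in-memory zip so the sketch runs without real data."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("images/a.jpg", b"fake")
        zf.writestr("images/b.JPG", b"fake")
        zf.writestr("data/table.csv", "x,y\n1,2\n")
        zf.writestr("README", "notes")
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(count_file_types(build_demo_archive()))
```

Extensions are lowercased so that .jpg and .JPG are counted together. A real dataset path can be passed in place of the in-memory demo archive.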

How it Works

As a server-side microservice

To follow.

As a locally run app

In the src folder, run:

python3 output_html.py

To Do

[X] get list of files in a zip (without unzipping it)
[X] return numbers of files in a zip (without unzipping it)
[X] use print statements to create a report in JSON
[X] generate report as standalone HTML file
[ ] tidy and refactor output_html.py
[ ] create Flask endpoint that accepts a filepath as a parameter and returns a JSON summary and image collage
[ ] function to check file size of archive members before processing
[X] make photo collage (could unzip specific files)
[ ] make image resolution summary function
[ ] add flags for input parameters to scripts
[ ] add summaries for CSV files
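
Two of the open items above, checking member sizes before processing and summarising CSV files, could be approached as follows. This is a sketch under assumptions, not the project's implementation: the helper names members_within_limit and summarise_csv_member, the 1000-byte limit, and the choice of "empty cells per column" as the data-quality measure are all hypothetical. It also assumes UTF-8 CSV files with a header row.

```python
import csv
import io
import zipfile


def members_within_limit(zf, max_bytes):
    """Return names of file members whose uncompressed size is at most max_bytes."""
    return [
        info.filename
        for info in zf.infolist()
        if not info.is_dir() and info.file_size <= max_bytes
    ]


def summarise_csv_member(zf, name, encoding="utf-8"):
    """Read one CSV member straight from the archive; count rows and empty cells."""
    with io.TextIOWrapper(zf.open(name), encoding=encoding, newline="") as text:
        reader = csv.reader(text)
        header = next(reader, [])  # assumes the first row holds column names
        empty = {column: 0 for column in header}
        rows = 0
        for row in reader:
            rows += 1
            for column, value in zip(header, row):
                if value.strip() == "":
                    empty[column] += 1
    return {"columns": header, "rows": rows, "empty_cells": empty}


def build_demo_archive():
    """Create a small in-memory zip with one CSV and one larger binary file."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as zf:
        zf.writestr("small.csv", "id,title\n1,Map\n2,\n")
        zf.writestr("large.bin", b"0" * 5000)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    with zipfile.ZipFile(build_demo_archive()) as zf:
        for name in members_within_limit(zf, max_bytes=1000):
            if name.lower().endswith(".csv"):
                print(name, summarise_csv_member(zf, name))
```

The size check reads only the archive's directory, so oversized members can be skipped before any decompression happens. The CSV member is streamed from the archive without being written to disk.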

Further Steps

[ ] add support for tar files
[ ] create microservice to run locally (e.g. Flask)
[ ] containerise app with Docker
[ ] decide if core functions should be split into a module
[ ] test speed of different methods
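
For the tar support item, one possible approach is to dispatch on the archive type and return the same listing for both formats. The sketch below uses only the standard-library zipfile and tarfile modules. The names list_archive_files and build_demo_tar are hypothetical and do not come from the repository.

```python
import io
import os
import tarfile
import tempfile
import zipfile


def list_archive_files(path):
    """Return (name, size_in_bytes) for each regular file in a zip or tar archive."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return [(i.filename, i.file_size) for i in zf.infolist() if not i.is_dir()]
    if tarfile.is_tarfile(path):
        with tarfile.open(path) as tf:  # default mode detects compression
            return [(m.name, m.size) for m in tf.getmembers() if m.isfile()]
    raise ValueError(f"unsupported archive format: {path}")


def build_demo_tar(path):
    """Write a tiny gzip-compressed tar so the sketch runs without real data."""
    data = b"x,y\n1,2\n"
    with tarfile.open(path, "w:gz") as tf:
        info = tarfile.TarInfo(name="data/table.csv")
        info.size = len(data)
        tf.addfile(info, io.BytesIO(data))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        demo = os.path.join(tmp, "demo.tar.gz")
        build_demo_tar(demo)
        print(list_archive_files(demo))
```

Unlike a zip, a tar archive has no central directory, so listing the members of a compressed tar means reading through the whole stream. This is worth keeping in mind when testing the speed of different methods.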

Notes/References

project gdrive folder: https://drive.google.com/drive/u/0/folders/1NbOzxm78wAVe_ZJaiIUvrIm42m__ndFU

zipfile module: https://docs.python.org/3/library/zipfile.html
