Distrowatch scraper/crawler (spider)

Download the whole DistroWatch database, with the information on each distribution saved to separate files.

Why do you need this

  • You want to survey or look up information about distributions
  • You're writing a thesis or an analytical paper
  • You're curious about statistics
  • You're learning how to write scripts, crawlers or scrapers

Requirements

Works on Arch, Ubuntu and Fedora. The current version is developed on Arch (Manjaro).

  • html2text
  • wget
  • sed
  • grep
  • bash/linux
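
Before running the scraper, it can help to confirm that the listed tools are on the PATH. The following is a minimal sketch, not part of the repository; `missing_tools` is a hypothetical helper name.

```shell
# Hypothetical helper (not part of parse.sh): print every tool from the
# argument list that cannot be found on the PATH.
missing_tools() {
    local tool
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

# Example call for the requirements listed above:
#   missing_tools html2text wget sed grep
```

If the call prints nothing, all listed tools were found.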

How to use the script in 6 steps

  1. Install the requirements (Arch: sudo pacman -S html2text wget git; use apt on Ubuntu or dnf on Fedora instead)
  2. Clone this repository (git clone https://github.com/sxiii/distrowatch-scraper)
  3. Enter the cloned folder (cd distr*)
  4. Make the script executable (chmod +x parse.sh)
  5. Run it (./parse.sh)
  6. Review its console output or file output (files are created in a folder named after the current date!)
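
Step 1 depends on the package manager. As a rough sketch, the helper below prints the install command for each of the three supported families. `install_command` is a hypothetical name, and the package names are copied from the Arch example; that they are identical under apt and dnf is an assumption, so verify them on your system.

```shell
# Hypothetical helper: print the install command for a package manager.
# Package names come from the Arch example; on Ubuntu/Fedora they are
# assumed (not verified) to be the same.
install_command() {
    local packages="html2text wget git"
    case "$1" in
        pacman) echo "sudo pacman -S $packages" ;;
        apt)    echo "sudo apt install $packages" ;;
        dnf)    echo "sudo dnf install $packages" ;;
        *)      echo "unsupported package manager: $1" >&2
                return 1 ;;
    esac
}
```

After installing, the remaining steps are the same everywhere: clone the repository, cd into it, chmod +x parse.sh and run ./parse.sh.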

How to view the results

They are laid out in the $(current.date) directory (if today is 12.12.2012, the directory will be named 12.12.2012). Inside this folder you'll find more than 800 files. Most of them end in ".results" or ".desc". The ".desc" files are the downloaded web pages with the full HTML source of each distribution's description. The ".results" files contain the extracted fields, arranged according to the following scheme:
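
To locate today's output folder from another script, something like the sketch below may work. It assumes the folder name is day.month.year (the 12.12.2012 example does not disambiguate day and month, so check what parse.sh actually creates) and it assumes GNU date. Both function names are hypothetical.

```shell
# Hypothetical helper: build the folder name for a date (default: today).
# ASSUMPTION: the format is dd.mm.yyyy; requires GNU date for -d.
results_dir_name() {
    date -d "${1:-today}" +%d.%m.%Y
}

# Hypothetical helper: count files with a given suffix in a folder,
# e.g. count_files "$(results_dir_name)" results
count_files() {
    local dir="$1" suffix="$2"
    find "$dir" -maxdepth 1 -type f -name "*.${suffix}" | wc -l
}
```

Comparing the counts for the results and desc suffixes is a quick way to spot distributions whose page was downloaded but not parsed.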

Results scheme

  • "Based On" - name of the distro that the current one is based on,
  • "Origin" - country of origin of the distribution,
  • "Architecture" - distribution architecture,
  • "Desktops" - desktops that the distro officially supports,
  • "Category" - the main use cases for this distribution,
  • "Status" - whether the distribution is active, dormant, discontinued, on the waiting list or being evaluated (statuses according to DistroWatch),
  • "Description" - the description itself,
  • "Website" - official web portal of the distro,
  • "Latest version" - latest published version of the distro.

There will also be a linux-clean.list file, which lists all current distribution names.
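
Once the files exist, simple statistics can be pulled out with grep and sed. The sketch below tallies the values of one field across all ".results" files. It assumes each field sits on its own line in the form "Field: value", which is a guess about the file layout, so inspect one real file first. `tally_field` is a hypothetical name.

```shell
# Hypothetical helper: count how often each value of a field occurs.
# ASSUMPTION: .results files contain lines shaped like "Origin: Germany".
tally_field() {
    local field="$1" dir="$2"
    grep -h -i "^${field}:" "$dir"/*.results \
        | sed -e 's/^[^:]*:[[:space:]]*//' \
        | sort | uniq -c | sort -rn
}

# Example call: tally_field "Origin" 12.12.2012
```

The output is one line per distinct value, most frequent first, each prefixed with its count.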

Note: this being the Linux world, you can port any distribution from a supported architecture to an unsupported one (rewrite, recheck and recompile it), or compile another desktop environment for it. Distribution statuses might also be wrong because of delayed information or plain human error. So to be sure, check every field yourself and keep in mind that this data "is not a diagnosis".

Future plans

  • Make the script output data & generate some fancy infographics after downloading the database
  • Support different output formats
  • Port the script to support other distribution websites
  • (maybe) get rid of html2text?
  • Make it run faster (in parallel?)
  • Add some sort of menu to the script

Bugs or errors

The script handles html2text output slightly differently depending on the system, because the html2text programs shipped with Arch Linux and Ubuntu differ: the Arch Linux one produces Markdown text from HTML, while the Ubuntu one produces plain text. Because of this, on Debian/Ubuntu you may need to edit the script or use the older Ubuntu version. The older Ubuntu version is on Pastebin (though it lacks the later improvements): https://pastebin.com/nnuVAJdJ
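
To find out which kind of html2text is installed, one option is to feed it a tiny HTML probe and look at the result. This is a sketch under one assumption: a Markdown-producing converter renders bold text as `**text**`. `html2text_flavor` is a hypothetical helper and is not part of parse.sh.

```shell
# Hypothetical helper: guess whether a converter emits Markdown or plain text.
# ASSUMPTION: Markdown output renders <b>probe</b> as **probe**.
html2text_flavor() {
    local converter="${1:-html2text}"
    local output
    command -v "$converter" >/dev/null 2>&1 || { echo "missing"; return 1; }
    output="$(printf '<p><b>probe</b></p>' | "$converter" 2>/dev/null)"
    case "$output" in
        *'**probe**'*) echo "markdown" ;;
        *)             echo "plain" ;;
    esac
}
```

Calling `html2text_flavor` without arguments checks the html2text on the PATH. A result of "plain" suggests the older Ubuntu version of the script may be the better fit.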

If you notice any other bugs, please create an issue.

Help and development

  • You can help improve this script. See the "Future plans" section
  • Implementing your own ideas and contributing them to this repository is welcome
  • Contact me on Telegram (fakesnowden) to exchange ideas and knowledge

Useful links on the topic

may the source be with you.

Last update

Still works in 2020; 911 active distributions as of 16 Oct 2020.