A Python library for scraping and analyzing public deliberation data from the OpenGov.gr platform, specifically targeting the Greek Ministries' public consultations at https://www.opengov.gr/home/category/consultations. A ready-made Greek Public Consultations Dataset built with this library is available on HuggingFace (see below).

This project provides tools to extract, analyze, and process data from Greece's public consultation platform. Since OpenGov.gr does not provide an official API, this library implements web scraping techniques to access:
- Consultation documents (Νομοσχέδια)
- Public comments (Σχόλια)
- Explanatory reports (Εκθέσεις)
- Consultation metadata (dates, status, ministry, etc.)
The project has been enhanced with an improved document classification system that accurately categorizes documents into six different types based on their content and purpose.
The complete database of scraped consultations is available in the HuggingFace repository. This SQLite database contains all consultations from OpenGov.gr, with the improved document classification system and extracted PDF content. You can download the `deliberation_data_gr_updated.db` file directly from the repository for immediate use in your research or applications.
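Once downloaded, the database can be queried with Python's built-in `sqlite3` module. A minimal sketch, assuming the table and column names described later in this README (and that `is_finished` is stored as 0/1):

```python
import sqlite3

# Minimal sketch: query the downloaded database. Table and column names follow
# the schema described in the Database Structure section; is_finished is assumed
# to be stored as 0/1.
conn = sqlite3.connect("deliberation_data_gr_updated.db")
conn.row_factory = sqlite3.Row

rows = conn.execute(
    """
    SELECT c.title, c.start_date, c.end_date, c.total_comments, m.name AS ministry
    FROM consultations AS c
    JOIN ministries AS m ON m.id = c.ministry_id
    WHERE c.is_finished = 1
    ORDER BY c.total_comments DESC
    LIMIT 10
    """
).fetchall()

for row in rows:
    print(f"{row['ministry']}: {row['title']} ({row['total_comments']} comments)")

conn.close()
```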
```
AI4Deliberation/
├── README.md                            # This documentation file
├── complete_scraper/                    # Main scraper implementation
│   ├── content_scraper.py               # Scraper for article content and comments
│   ├── db_models.py                     # SQLAlchemy database models
│   ├── db_population_report.py          # Tool to analyze database population
│   ├── list_consultations.py            # Tool to list all consultations to CSV
│   ├── metadata_scraper.py              # Scraper for consultation metadata
│   ├── scrape_all_consultations.py      # Scrape multiple consultations
│   ├── scrape_single_consultation.py    # Scrape a single consultation
│   ├── TODO.md                          # Project roadmap and completed features
│   └── utils.py                         # Utility functions for all scrapers
├── html_pipeline/                       # HTML text extraction pipeline
│   ├── README.md                        # Documentation for this pipeline
│   └── html_to_text.py                  # Script to extract text from raw HTML using docling
└── pdf_pipeline/                        # PDF processing pipeline implementation
    ├── export_documents_to_parquet.py   # Export documents for processing
    ├── process_document_redirects.py    # Resolve URL redirects for PDFs
    ├── process_pdfs_with_glossapi.py    # Extract content using GlossAPI
    ├── run_pdf_pipeline.py              # End-to-end pipeline orchestrator
    └── update_database_with_content.py  # Update DB with extracted content
```
- Comprehensive scraping: Extract data from all public consultations on OpenGov.gr
- Metadata extraction: Capture consultation titles, dates, ministry information, and status
- Deep content retrieval: Extract article text, comments, and structured discussion data
- Raw HTML capture: Store the raw HTML content of each article for downstream text extraction
- Document links: Gather links to official PDF documents (draft laws, reports, etc.)
- Incremental updates: Skip already scraped consultations unless forced to re-scrape
- HTML Text Extraction Pipeline: Dedicated pipeline using `docling` to extract clean text from stored raw HTML
- PDF document processing: Extract and analyze content from linked PDF documents using GlossAPI
- Extraction quality assessment: Evaluate and record the quality of PDF content extraction
- Robust error handling: Multiple fallback methods for data extraction
- Database storage: Store all data in a normalized SQLite database
- Analytics: Generate reports on database population and data quality
- Document classification: Categorize documents into six distinct types using a data-driven approach
The `scrape_all_consultations.py` script provides a powerful tool to scrape multiple consultations from OpenGov.gr. Below are examples of how to use it:
```bash
# Basic usage - scrape all consultations and store in the default database
python3 complete_scraper/scrape_all_consultations.py

# Scrape a limited number of consultations
python3 complete_scraper/scrape_all_consultations.py --max-count 10

# Scrape a specific page range
python3 complete_scraper/scrape_all_consultations.py --start-page 5 --end-page 10

# Force re-scrape of consultations already in the database
python3 complete_scraper/scrape_all_consultations.py --force-scrape

# Use a different database file
python3 complete_scraper/scrape_all_consultations.py --db-path "sqlite:///path/to/custom_db.db"

# Commit changes to database in smaller batches
python3 complete_scraper/scrape_all_consultations.py --batch-size 5
```
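If you want to drive incremental scrapes from another script or a scheduled job, the CLI can be invoked with `subprocess`. A minimal sketch, using only the flags shown above (the database path is an example value):

```python
import subprocess
import sys

# Minimal sketch: run an incremental scrape from another script or a cron job.
# Only flags documented above are used; the database URL is an example value.
cmd = [
    sys.executable,
    "complete_scraper/scrape_all_consultations.py",
    "--db-path", "sqlite:///data/deliberation_data_gr.db",
    "--batch-size", "10",
]

result = subprocess.run(cmd, capture_output=True, text=True)
if result.returncode != 0:
    print(result.stderr, file=sys.stderr)
    raise SystemExit("Scrape failed")
print("Incremental scrape completed")
```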
After running the main scraper (`scrape_all_consultations.py` or `scrape_single_consultation.py`) to populate the database with raw HTML, use the `html_pipeline` to extract clean text content using `docling`:
```bash
# Run the HTML text extraction pipeline
python3 html_pipeline/html_to_text.py --db-path path/to/your/database.db

# Limit the number of articles to process
python3 html_pipeline/html_to_text.py --db-path path/to/your/database.db --limit 500
```
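Conceptually, the pipeline reads each article's stored `raw_html`, converts it to clean text, and writes the result back to `articles.content`. The following is a rough sketch of that idea, assuming docling's `DocumentConverter` API and the `articles` schema described below; the actual `html_to_text.py` may differ:

```python
import sqlite3
import tempfile
from pathlib import Path

from docling.document_converter import DocumentConverter  # assumes docling is installed

# Sketch only: convert stored raw HTML to clean text and write it back to
# articles.content. The real html_pipeline/html_to_text.py may differ in detail.
conn = sqlite3.connect("deliberation_data_gr_updated.db")
converter = DocumentConverter()

rows = conn.execute(
    "SELECT id, raw_html FROM articles "
    "WHERE raw_html IS NOT NULL AND content IS NULL LIMIT 10"
).fetchall()

for article_id, raw_html in rows:
    # docling converts files/URLs, so write the HTML fragment to a temp file first
    with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False, encoding="utf-8") as tmp:
        tmp.write(raw_html)
        tmp_path = Path(tmp.name)
    result = converter.convert(tmp_path)
    text = result.document.export_to_markdown()
    conn.execute("UPDATE articles SET content = ? WHERE id = ?", (text, article_id))
    tmp_path.unlink()

conn.commit()
conn.close()
```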
See `html_pipeline/README.md` for more details.
To scrape a single consultation, use `scrape_single_consultation.py`:
```bash
python3 complete_scraper/scrape_single_consultation.py "https://www.opengov.gr/ministry_code/?p=consultation_id"
```
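The URL encodes both the ministry code and the consultation's post ID. A small sketch of pulling those out, assuming the `https://www.opengov.gr/<ministry_code>/?p=<consultation_id>` pattern shown above (the example values are placeholders):

```python
from urllib.parse import urlparse, parse_qs

def parse_consultation_url(url: str) -> tuple[str, str]:
    """Split an OpenGov.gr consultation URL into (ministry_code, post_id)."""
    parsed = urlparse(url)
    ministry_code = parsed.path.strip("/").split("/")[0]
    post_id = parse_qs(parsed.query).get("p", [""])[0]
    return ministry_code, post_id

# Placeholder values, matching the URL pattern above
print(parse_consultation_url("https://www.opengov.gr/ministry_code/?p=12345"))
# ('ministry_code', '12345')
```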
The project includes a dedicated PDF processing pipeline for extracting content from document links:
```bash
# Run the complete PDF processing pipeline
python3 pdf_pipeline/run_pdf_pipeline.py

# Run specific steps of the pipeline (1=export, 2=redirects, 3=processing, 4=database update)
python3 pdf_pipeline/run_pdf_pipeline.py --start=2 --end=4

# Run individual components for more control
python3 pdf_pipeline/export_documents_to_parquet.py    # Step 1: Export document URLs
python3 pdf_pipeline/process_document_redirects.py     # Step 2: Resolve URL redirects
python3 pdf_pipeline/process_pdfs_with_glossapi.py     # Step 3: Process PDFs with GlossAPI
python3 pdf_pipeline/update_database_with_content.py   # Step 4: Update database with content
```
The pipeline intelligently processes only documents that need content extraction, manages its own workspace, and provides detailed logs of the process. PDF content extraction is performed using GlossAPI, an advanced document processing library developed for extracting and analyzing Greek text from PDFs.
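To give a feel for what step 2 does, here is a minimal sketch of resolving a document URL's redirect chain with `requests`; it is not the actual `process_document_redirects.py` implementation, and the example URL is a placeholder:

```python
import requests

def resolve_redirect(url: str, timeout: float = 20.0) -> str:
    """Follow redirects to find the final download URL of a document (sketch only)."""
    try:
        # HEAD is usually enough to follow the redirect chain without downloading the PDF
        response = requests.head(url, allow_redirects=True, timeout=timeout)
        if response.status_code == 405:  # some servers reject HEAD requests
            response = requests.get(url, allow_redirects=True, timeout=timeout, stream=True)
        return response.url
    except requests.RequestException:
        return url  # keep the original URL if resolution fails

# Placeholder URL for illustration
print(resolve_redirect("https://www.opengov.gr/ministry_code/?p=document_link"))
```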
The scraped data is stored in a SQLite database with the following structure:
- `ministries`
  - `id`: Primary key
  - `code`: Ministry code used in URLs
  - `name`: Full ministry name
  - `url`: URL to ministry's main page

- `consultations`
  - `id`: Primary key
  - `post_id`: OpenGov.gr internal post ID
  - `title`: Consultation title
  - `start_date`: Start date of the consultation
  - `end_date`: End date of the consultation
  - `url`: Full URL to the consultation
  - `ministry_id`: Foreign key to ministries table
  - `is_finished`: Whether the consultation has ended
  - `accepted_comments`: Number of accepted comments
  - `total_comments`: Total comment count

- `documents`
  - `id`: Primary key
  - `consultation_id`: Foreign key to consultations table
  - `title`: Document title
  - `url`: URL to the document file
  - `type`: Document type (see classification below)

- `articles`
  - `id`: Primary key
  - `consultation_id`: Foreign key to consultations table
  - `post_id`: Internal post ID for the article
  - `title`: Article title
  - `raw_html`: Raw HTML content of the article body (populated by the main scraper)
  - `content`: Cleaned text content of the article (populated by the `html_pipeline` using docling)
  - `url`: URL to the article page

- `comments`
  - `id`: Primary key
  - `article_id`: Foreign key to articles table
  - `comment_id`: Internal comment ID
  - `username`: Name of commenter
  - `date`: Comment submission date
  - `content`: Full text of the comment
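For orientation, here is a short sketch of how two of these tables map to SQLAlchemy models. The authoritative definitions live in `complete_scraper/db_models.py`; the column types used here (e.g. string dates) are assumptions:

```python
from sqlalchemy import Boolean, Column, ForeignKey, Integer, String
from sqlalchemy.orm import declarative_base, relationship

Base = declarative_base()

class Ministry(Base):
    __tablename__ = "ministries"

    id = Column(Integer, primary_key=True)
    code = Column(String)   # ministry code used in URLs
    name = Column(String)   # full ministry name
    url = Column(String)    # URL to the ministry's main page

    consultations = relationship("Consultation", back_populates="ministry")


class Consultation(Base):
    __tablename__ = "consultations"

    id = Column(Integer, primary_key=True)
    post_id = Column(String)        # OpenGov.gr internal post ID
    title = Column(String)
    start_date = Column(String)     # exact date column types may differ in db_models.py
    end_date = Column(String)
    url = Column(String)
    ministry_id = Column(Integer, ForeignKey("ministries.id"))
    is_finished = Column(Boolean)
    accepted_comments = Column(Integer)
    total_comments = Column(Integer)

    ministry = relationship("Ministry", back_populates="consultations")
```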
Documents are classified into six categories:
- law_draft: Draft legislation documents containing both "ΣΧΕΔΙΟ" and "ΝΟΜΟΥ" (31.0%)
- analysis: Regulatory impact analysis documents containing both "ΑΝΑΛΥΣΗ" and "ΣΥΝΕΠΕΙΩΝ" (10.8%)
- deliberation_report: Public consultation reports containing both "ΕΚΘΕΣΗ" and "ΔΙΑΒΟΥΛΕΥΣΗ" (5.2%)
- other_draft: Other draft documents containing "ΣΧΕΔΙΟ" but not "ΝΟΜΟΥ" (8.5%)
- other_report: Other report documents containing "ΕΚΘΕΣΗ" but not "ΔΙΑΒΟΥΛΕΥΣΗ" (12.7%)
- other: Documents not falling into any of the above categories (31.8%)
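The keyword rules above translate directly into a small classifier. A sketch, assuming the keywords are matched against the (accent-stripped, uppercased) document title; the project's actual classifier may also consider document content:

```python
import unicodedata

def strip_accents(text: str) -> str:
    """Remove Greek accent marks so keyword matching is accent-insensitive."""
    return "".join(
        c for c in unicodedata.normalize("NFD", text)
        if unicodedata.category(c) != "Mn"
    )

def classify_document(title: str) -> str:
    """Apply the keyword rules listed above to a document title (sketch only)."""
    t = strip_accents(title).upper()
    if "ΣΧΕΔΙΟ" in t and "ΝΟΜΟΥ" in t:
        return "law_draft"
    if "ΑΝΑΛΥΣΗ" in t and "ΣΥΝΕΠΕΙΩΝ" in t:
        return "analysis"
    if "ΕΚΘΕΣΗ" in t and "ΔΙΑΒΟΥΛΕΥΣΗ" in t:
        return "deliberation_report"
    if "ΣΧΕΔΙΟ" in t:
        return "other_draft"
    if "ΕΚΘΕΣΗ" in t:
        return "other_report"
    return "other"

print(classify_document("Σχέδιο Νόμου για την ψηφιακή διακυβέρνηση"))  # law_draft
```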
This project is in the initial development phase. Contributions and feedback are welcome.