Changes of joacmue #14

Open
wants to merge 139 commits into
base: master
4177126
python3-ified some scripts
Dec 5, 2020
e7ad0c7
minor clean-up
Dec 5, 2020
ba7cc1c
made the --name only variant work
Dec 6, 2020
b1b935a
corrected fault with heading indentation
Dec 6, 2020
9487f75
Made tables work... sort of.
Dec 6, 2020
662fa16
made BR (line break?) work
Dec 7, 2020
fee8cde
fixed multi-row headers and lists
Dec 7, 2020
d13c8c8
skipping line breaks in "kommentar"
Dec 10, 2020
a8aaf26
Made lists render inside table (no indentation)
Dec 11, 2020
e761381
prettyfying alphanumeric list indices
Dec 11, 2020
69f3a71
clean-up of todos, tables should work now
Dec 17, 2020
a4609e0
python3-ed the print statements in lawgit.py, made .jsons better read…
Dec 17, 2020
5af2990
prettifying .json outputs (indents, utf-8 umlauts)
joacmue Dec 22, 2020
d1e84e8
made the banz scraper work again
joacmue Dec 23, 2020
4dca275
Added some notes on what this actually does
joacmue Mar 28, 2021
5d64c00
Merge remote-tracking branch 'gesetze-tools-upstream/master'
joacmue Mar 28, 2021
9c1aa53
Running lawdown in python3 helps -.-
joacmue Mar 28, 2021
f7ebdf1
Should have re-added most of the f' strings instead of the u' ones
joacmue Mar 28, 2021
f91a8cf
Some suggested changes from the PR
joacmue Apr 3, 2021
5bdfd5e
Suggested Changes from the PR
joacmue Apr 3, 2021
4da5c5c
re-adding bgbl folder to .gitignore
joacmue Apr 3, 2021
70f8035
python3-ified some scripts
Dec 5, 2020
6dc6e7a
minor clean-up
Dec 5, 2020
7dd3f8c
made the --name only variant work
Dec 6, 2020
ab9b64f
Made tables work... sort of.
Dec 6, 2020
4005d7e
made BR (line break?) work
Dec 7, 2020
52b80c8
fixed multi-row headers and lists
Dec 7, 2020
b324727
skipping line breaks in "kommentar"
Dec 10, 2020
8e49ec8
Made lists render inside table (no indentation)
Dec 11, 2020
f64454f
prettyfying alphanumeric list indices
Dec 11, 2020
f9cb296
clean-up of todos, tables should work now
Dec 17, 2020
b25e56a
python3-ed the print statements in lawgit.py, made .jsons better read…
Dec 17, 2020
937b86f
prettifying .json outputs (indents, utf-8 umlauts)
joacmue Dec 22, 2020
b29530a
made the banz scraper work again
joacmue Dec 23, 2020
d3d84ba
Added some notes on what this actually does
joacmue Mar 28, 2021
2bd7a75
Running lawdown in python3 helps -.-
joacmue Mar 28, 2021
ee9bf6b
Should have re-added most of the f' strings instead of the u' ones
joacmue Mar 28, 2021
eb3ac97
Some suggested changes from the PR
joacmue Apr 3, 2021
f36e726
Suggested Changes from the PR
joacmue Apr 3, 2021
09c5004
re-adding bgbl folder to .gitignore
joacmue Apr 3, 2021
35452ff
Merge branch 'master' of https://github.com/joacmue/gesetze-tools
joacmue Apr 6, 2021
17b5ac5
Corrected a copy typo of double brackets
joacmue Apr 6, 2021
d4ced4d
Removing two causes of linter errors
joacmue Apr 6, 2021
aac5558
removing banz_scraper python 2.x leftovers
joacmue Apr 6, 2021
f70a0ed
Removing some linter warnings
joacmue Apr 6, 2021
53b96fc
Minor clean-up
joacmue Apr 6, 2021
fb39d0a
minor clean-up
joacmue Apr 6, 2021
92b2bf7
Continuing to please the linter
joacmue Apr 6, 2021
823d076
Minor modifications.
darkdragon-001 Apr 6, 2021
2209244
Update data in separate commits/branches.
darkdragon-001 Apr 6, 2021
885061a
Some fixes
darkdragon-001 Apr 17, 2021
31f948a
Merge remote-tracking branch 'origin/master' into joacmue
darkdragon-001 Apr 18, 2021
246cc82
Removing regex qualifiers from non-regex strings
joacmue Apr 18, 2021
c41acef
Merge branch 'master' of https://github.com/joacmue/gesetze-tools
joacmue Apr 18, 2021
b065734
Re-adding the default flush when outside tables
joacmue Apr 18, 2021
49ef8a6
Removing special handling of lettered list indices
joacmue Apr 18, 2021
532da90
Cleaning up the backspaces in tables & lists
joacmue Apr 18, 2021
3a01c61
not printing leading line break on table headers
joacmue Apr 18, 2021
d48adf3
Removing mess around handling breaks
joacmue Apr 18, 2021
2264d43
Cleaning up custombreaks
joacmue Apr 25, 2021
279e19e
Adding empty cells for colspans
joacmue Apr 25, 2021
73aaf7b
Something was strange with the round function
joacmue Apr 25, 2021
a7b7e82
Making breaks on encounters of <BR> again
joacmue Apr 25, 2021
e4cf4bb
Removing special case for begin of <br>
joacmue Apr 25, 2021
2d13b97
Explicitly parsing colnames for colspans now
joacmue May 9, 2021
17bf66b
Making lawdown go over all laws without errors
joacmue May 13, 2021
fab87e6
Making multiline headers with colspan render nicer
joacmue May 14, 2021
b117e25
Cleaning up column list handling
joacmue May 14, 2021
358fd09
python3-ified some scripts
Dec 5, 2020
ac34cd9
minor clean-up
Dec 5, 2020
b1152ed
made the --name only variant work
Dec 6, 2020
ed7c5a0
Made tables work... sort of.
Dec 6, 2020
84c986b
made BR (line break?) work
Dec 7, 2020
923c6ec
fixed multi-row headers and lists
Dec 7, 2020
389e419
skipping line breaks in "kommentar"
Dec 10, 2020
abaefbf
Made lists render inside table (no indentation)
Dec 11, 2020
2264272
prettyfying alphanumeric list indices
Dec 11, 2020
c5b2048
clean-up of todos, tables should work now
Dec 17, 2020
1e21aaf
python3-ed the print statements in lawgit.py, made .jsons better read…
Dec 17, 2020
bd9763a
prettifying .json outputs (indents, utf-8 umlauts)
joacmue Dec 22, 2020
fe61b75
made the banz scraper work again
joacmue Dec 23, 2020
59eea7e
Added some notes on what this actually does
joacmue Mar 28, 2021
7128419
Running lawdown in python3 helps -.-
joacmue Mar 28, 2021
33e3ee6
Should have re-added most of the f' strings instead of the u' ones
joacmue Mar 28, 2021
5058f20
Some suggested changes from the PR
joacmue Apr 3, 2021
b0fa94d
Suggested Changes from the PR
joacmue Apr 3, 2021
7361631
re-adding bgbl folder to .gitignore
joacmue Apr 3, 2021
6ec2842
python3-ified some scripts
Dec 5, 2020
a93f191
minor clean-up
Dec 5, 2020
4042407
made the --name only variant work
Dec 6, 2020
0c916fd
corrected fault with heading indentation
Dec 6, 2020
0cd86c3
Made tables work... sort of.
Dec 6, 2020
6319695
made BR (line break?) work
Dec 7, 2020
c3262fa
fixed multi-row headers and lists
Dec 7, 2020
6ce7e2e
skipping line breaks in "kommentar"
Dec 10, 2020
ff0ec27
Made lists render inside table (no indentation)
Dec 11, 2020
c408faf
prettyfying alphanumeric list indices
Dec 11, 2020
7e042b1
clean-up of todos, tables should work now
Dec 17, 2020
355d84b
python3-ed the print statements in lawgit.py, made .jsons better read…
Dec 17, 2020
b5ef45c
prettifying .json outputs (indents, utf-8 umlauts)
joacmue Dec 22, 2020
699ab06
made the banz scraper work again
joacmue Dec 23, 2020
dce5570
Running lawdown in python3 helps -.-
joacmue Mar 28, 2021
4ec214e
Should have re-added most of the f' strings instead of the u' ones
joacmue Mar 28, 2021
0836f20
Some suggested changes from the PR
joacmue Apr 3, 2021
a6ad1e6
Suggested Changes from the PR
joacmue Apr 3, 2021
3716a55
re-adding bgbl folder to .gitignore
joacmue Apr 3, 2021
fe153d3
Corrected a copy typo of double brackets
joacmue Apr 6, 2021
1689312
Removing two causes of linter errors
joacmue Apr 6, 2021
aa9ad98
removing banz_scraper python 2.x leftovers
joacmue Apr 6, 2021
52e0641
Removing some linter warnings
joacmue Apr 6, 2021
170c0ff
Minor clean-up
joacmue Apr 6, 2021
f21b36f
minor clean-up
joacmue Apr 6, 2021
6227fcb
Continuing to please the linter
joacmue Apr 6, 2021
2d34d37
Removing regex qualifiers from non-regex strings
joacmue Apr 18, 2021
91bf7ea
Minor modifications.
darkdragon-001 Apr 6, 2021
f737049
Update data in separate commits/branches.
darkdragon-001 Apr 6, 2021
f1ec414
Some fixes
darkdragon-001 Apr 17, 2021
8e22485
Improve issue templates.
darkdragon-001 Apr 17, 2021
d2fcb9d
Try to fix formatting template.
darkdragon-001 Apr 18, 2021
f820c53
Enable CI also for PRs.
darkdragon-001 Apr 18, 2021
f8026bf
Re-adding the default flush when outside tables
joacmue Apr 18, 2021
06fe657
Removing special handling of lettered list indices
joacmue Apr 18, 2021
5dae428
Cleaning up the backspaces in tables & lists
joacmue Apr 18, 2021
4905e9a
not printing leading line break on table headers
joacmue Apr 18, 2021
1d2026a
Removing mess around handling breaks
joacmue Apr 18, 2021
45bb01b
Cleaning up custombreaks
joacmue Apr 25, 2021
99dc3c0
Adding empty cells for colspans
joacmue Apr 25, 2021
5f41808
Something was strange with the round function
joacmue Apr 25, 2021
2f61890
Making breaks on encounters of <BR> again
joacmue Apr 25, 2021
625bd5e
Removing special case for begin of <br>
joacmue Apr 25, 2021
83bcc5c
Explicitly parsing colnames for colspans now
joacmue May 9, 2021
724c33b
Making lawdown go over all laws without errors
joacmue May 13, 2021
074e674
Making multiline headers with colspan render nicer
joacmue May 14, 2021
bd96212
Cleaning up column list handling
joacmue May 14, 2021
0707c01
Rebased to master
joacmue May 15, 2021
10cff58
Re-rean bgbl_scraper, updated readme.md
joacmue May 15, 2021
c66ed0c
Merge branch 'master' of https://github.com/joacmue/gesetze-tools
joacmue May 15, 2021
978b871
aligned vkbl.json formatting with other files
joacmue May 15, 2021
4c76092
Minor fixes
darkdragon-001 May 15, 2021
2 changes: 2 additions & 0 deletions .gitignore
@@ -1,2 +1,4 @@
laws
laws-md
.vscode
__pycache__
75 changes: 58 additions & 17 deletions README.md
@@ -1,53 +1,94 @@
BundesGit Gesetze Tools
=======================
# BundesGit Gesetze Tools

These scripts are used to keep the law repository up to date.
These scripts are used to keep the [law repository](https://github.com/bundestag/gesetze) up to date.

Install requirements:

```bash
pip install -r requirements.txt
```

For help see their docstring, command line help or source code.
For help, see the docstring of the scripts, command line help or source code.


## lawde.py
## Download laws (`lawde.py`)

Downloads all laws as XML files from
[www.gesetze-im-internet.de](http://www.gesetze-im-internet.de/)
and extracts them to a directory.

Last tested: 2017-01-14 SUCCESS
### Usage

## lawdown.py
Update your list of laws first:
```bash
python3 lawde.py updatelist
```

Converts all XML laws to Markdown and copies them with other files related
You can then download all laws by calling
```bash
python3 lawde.py loadall
```
This will take approx. 2-3 hours.

Alternatively, you can look up the individual law you're interested in within [./data/laws.json](./data/laws.json), which is mostly a list of laws in this form:
```json
{"slug": "<shortname>", "name": "<longname>", "abbreviation": "<abbreviation>"}
```
You can download individual laws by calling
```bash
python3 lawde.py load <shortname>
```
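
The slug lookup described above can also be scripted. The following is a minimal sketch, assuming `data/laws.json` is a plain JSON list of entries with the `slug`, `name` and `abbreviation` keys shown above; the helper name `find_slug` is hypothetical and not part of `lawde.py`.

```python
import json
from pathlib import Path
from typing import Optional


def find_slug(laws: list, abbreviation: str) -> Optional[str]:
    """Return the slug of the first entry whose abbreviation matches, ignoring case."""
    wanted = abbreviation.casefold()
    for entry in laws:
        # Assumption: entries are dicts with "slug", "name" and "abbreviation" keys.
        if isinstance(entry, dict) and str(entry.get("abbreviation", "")).casefold() == wanted:
            return entry.get("slug")
    return None


if __name__ == "__main__":
    # Assumption: the file holds a plain JSON list; anything else yields None.
    path = Path("data/laws.json")
    if path.exists():
        with open(path, encoding="utf8") as f:
            print(find_slug(json.load(f), "BGB"))
```

The returned slug is what `python3 lawde.py load <shortname>` expects as its argument.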

Last tested: 2021-05-15 SUCCESS


## Convert to Markdown (`lawdown.py`)

Converts all downloaded XML laws to Markdown format and copies them with other files related
to the law into the specified working directory.

Last tested: 2017-01-14 SUCCESS
### Usage

```bash
python3 lawdown.py convert <inpath> <outpath>
python3 lawdown.py convert ./laws ./laws-md
```

Last tested: 2021-05-15 SUCCESS

## bgbl_scraper.py

## Scraper Bundesgesetzblatt (`bgbl_scraper.py`)

Scrapes the table of contents of all issues of the Bundesgesetzblatt and dumps
the result to JSON.

Last tested: 2021-03-30 SUCCESS
```bash
python3 bgbl_scraper.py data/bgbl.json
```

Last tested: 2021-05-15 SUCCESS

## banz_scraper.py

## Scraper Bundesanzeiger (`banz_scraper.py`)

Scrapes the table of contents of all available issues of the Bundesanzeiger and
dumps the result to JSON.

Last tested: 2017-01-14 SUCCESS
Last tested: 2020-12-23 SUCCESS

## vkbl_scraper.py

## Scraper Verkehrsblatt (`vkbl_scraper.py`)

Scrapes the table of contents of all available issues of the Verkehrsblatt and
dumps the result to JSON.

Last tested: 2017-01-14 SUCCESS
```bash
python3 vkbl_scraper.py data/vkbl.json
```

Last tested: 2021-05-15 SUCCESS


## lawgit.py
## Commit changes (`lawgit.py`)

Checks the repository's working directory for changes, tries to find relations
to table of content entries in BGBl and BAnz data, commits the changes to a branch
16 changes: 7 additions & 9 deletions banz_scraper.py
@@ -1,23 +1,18 @@
#!/usr/bin/env python3

"""BAnz-Scraper.

Usage:
banz_scraper.py <outputfile> [<minyear> [<maxyear>]]
banz_scraper.py -h | --help
banz_scraper.py --version

Options:
-h --help Show this screen.
--version Show version.

Duration Estimates:
2-5 minutes per year
30-75 minutes in total

Examples:
banz_scraper.py data/banz.json

"""
from pathlib import Path
import re
@@ -114,7 +109,7 @@ def get_items(self, year, date: Tuple[str, str]):
title_result = row.find(class_="title_result")

orig_date: Optional[str] = None
match = re.search(r'[Vv]om: (\d+)\. ([\wä]+) (\d{4})', str(title_result), re.U)
match = re.search(r'[Vv]om: (\d+)\. ([\wä]+) (\d{4})', str(title_result), re.U)
if match:
day = int(match.group(1))
month = self.MONTHS.index(match.group(2)) + 1
@@ -151,17 +146,20 @@ def main(arguments):
maxyear = arguments['<maxyear>'] or 10000
minyear = int(minyear)
maxyear = int(maxyear)
print(f"This will scrape information from the Bundesanzeiger between {minyear} and {maxyear}.")
print(f"Results will be stored in {arguments['<outputfile>']}")
print("You will see all dates with publications appear below as they are parsed.")
banz = BAnzScraper()
data = {}
if Path(arguments['<outputfile>']).exists():
with open(arguments['<outputfile>']) as f:
data = json.load(f)
data.update(banz.scrape(minyear, maxyear))
with open(arguments['<outputfile>'], 'w') as f:
json.dump(data, f, indent=4)
with open(arguments['<outputfile>'], 'w', encoding='utf8') as f:
json.dump(data, f, indent=4, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
from docopt import docopt
arguments = docopt(__doc__, version='BAnz-Scraper 0.0.1')
main(arguments)
main(arguments)
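
The date extraction in `get_items` above can be tried in isolation. This is a minimal sketch, assuming `MONTHS` lists the German month names in calendar order (the scraper's own `self.MONTHS` is not shown in this diff); the function name `parse_orig_date` and the `DD.MM.YYYY` output format are hypothetical.

```python
import re
from typing import Optional

# Assumption: mirrors the scraper's self.MONTHS, which this diff does not show.
MONTHS = ['Januar', 'Februar', 'März', 'April', 'Mai', 'Juni', 'Juli',
          'August', 'September', 'Oktober', 'November', 'Dezember']

# Same pattern as in the diff; re.U is already the default for str patterns in Python 3.
DATE_RE = re.compile(r'[Vv]om: (\d+)\. ([\wä]+) (\d{4})')


def parse_orig_date(text: str) -> Optional[str]:
    """Return the date as DD.MM.YYYY, or None if no known date is found."""
    match = DATE_RE.search(text)
    if match is None or match.group(2) not in MONTHS:
        # Unknown month names are skipped instead of raising ValueError from index().
        return None
    day = int(match.group(1))
    month = MONTHS.index(match.group(2)) + 1
    year = int(match.group(3))
    return f'{day:02d}.{month:02d}.{year}'


if __name__ == '__main__':
    print(parse_orig_date('Bekanntmachung Vom: 3. März 2021'))
```

Guarding against unknown month names is a choice made for this sketch; the diff calls `self.MONTHS.index(...)` directly.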
6 changes: 3 additions & 3 deletions bgbl_scraper.py
@@ -30,7 +30,6 @@

from typing import List


class BGBLScraper:
BASE_URL = 'http://www.bgbl.de/xaver/bgbl/'

@@ -151,6 +150,7 @@ def get_number_toc(self, number_id, number_did):
toc.append(d)
return toc


def main(arguments):
minyear = arguments['<minyear>'] or 0
maxyear = arguments['<maxyear>'] or 10000
@@ -162,8 +162,8 @@ def main(arguments):
with open(arguments['<outputfile>']) as f:
data = json.load(f)
data.update(bgbl.scrape(minyear, maxyear))
with open(arguments['<outputfile>'], 'w') as f:
json.dump(data, f, indent=4)
with open(arguments['<outputfile>'], 'w+', encoding='utf8') as f:
json.dump(data, f, indent=2, sort_keys=True, ensure_ascii=False)


if __name__ == '__main__':
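
Both scrapers follow the same load, update, rewrite pattern for their output file. The following is a minimal sketch of that pattern with the new `json.dump` options; the helper name `merge_into_json` is hypothetical, and unlike the diff it also passes `encoding='utf8'` when reading, which is an assumption about what is wanted once the file contains raw umlauts.

```python
import json
import tempfile
from pathlib import Path


def merge_into_json(outputfile: str, new_data: dict, indent: int = 2) -> dict:
    """Merge new_data into the JSON object stored at outputfile and rewrite it."""
    path = Path(outputfile)
    data = {}
    if path.exists():
        # Read with the same encoding that is used for writing below.
        with open(path, encoding='utf8') as f:
            data = json.load(f)
    data.update(new_data)
    with open(path, 'w', encoding='utf8') as f:
        # sort_keys keeps the file stable between runs; ensure_ascii=False
        # writes umlauts as-is instead of \uXXXX escapes.
        json.dump(data, f, indent=indent, sort_keys=True, ensure_ascii=False)
    return data


if __name__ == '__main__':
    with tempfile.TemporaryDirectory() as tmp:
        target = str(Path(tmp) / 'demo.json')
        merge_into_json(target, {'2021': 'Änderung'})
        merge_into_json(target, {'2020': 'Gesetz'})
        print(Path(target).read_text(encoding='utf8'))
```

Sorted keys and unescaped characters should keep diffs of the generated data files small and readable, which matters because the results are committed to the repository.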