Deploying to gh-pages from @ 9b2525a 🚀
MichalGawor committed Nov 20, 2023
1 parent 98c2fc4 commit ff52991
Showing 95 changed files with 3,341 additions and 0 deletions.
6 changes: 6 additions & 0 deletions .gitignore
@@ -0,0 +1,6 @@
*.pyc
__pycache__
*.swp
.~lock*
tables
venv*/
10 changes: 10 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,10 @@
# Changelog


## [1.0.0] - 20.11.2023

### Changed features
- Python 3.9+ required
- requirements.txt replaced with a pyproject.toml + Poetry build config
- `rfhg` as callable Python module

676 changes: 676 additions & 0 deletions LICENSE.txt

Large diffs are not rendered by default.

155 changes: 155 additions & 0 deletions README.md
@@ -0,0 +1,155 @@
ClarTable
=========

Installation
------------
Works with Python `^3.9`
```bash
git clone [email protected]:clarin-eric/resource-families-html-generator.git # via SSH or
git clone https://github.com/clarin-eric/resource-families-html-generator.git # via HTTPS
cd ./resource-families-html-generator/
pip install .
```

About
-----
*ClarTable* is a Python module that generates an HTML presentation layer for tabular data stored in .csv files.

### Usage

#### Locally:
```bash
usage: python -m rfhg [-h] -i PATH -r PATH -o PATH

Create html table from given data and rules.
To navigate static resources within the module prepend `static.`
to the path, eg. `-r static.rules/rules.json`

optional arguments:
  -h, --help  show this help message and exit
  -i PATH     path to a .csv file or folder with .csv files
  -r PATH     path to a .json file with rules
  -o PATH     path to file where output html table will be generated
```
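
The `static.` prefix described in the help text can be pictured with a small sketch. This is not the module's own `resolve_static`; the function name `resolve_static_sketch` and the `rfhg/static` location are assumptions made for illustration only.

```python
from pathlib import Path

# Assumed location of the packaged static resources; the real layout may differ.
STATIC_ROOT = Path("rfhg") / "static"
STATIC_PREFIX = "static."


def resolve_static_sketch(path: str, static_root: Path = STATIC_ROOT) -> Path:
    """Map `static.<relative path>` onto the assumed static directory.

    Paths without the prefix are returned unchanged (as Path objects).
    """
    if path.startswith(STATIC_PREFIX):
        return static_root / path[len(STATIC_PREFIX):]
    return Path(path)


print(resolve_static_sketch("static.rules/rules.json"))
print(resolve_static_sketch("my_rules.json"))
```

With this reading, `-r static.rules/rules.json` points at the rules shipped with the module, while a plain path such as `my_rules.json` is used as given.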


#### Via CI:
The HTML tables for resource families can be generated via GitHub. Push new .csv files to `/resource_families`; after processing, the generated tables will appear in the gh-pages branch.

### CSV format
To create an HTML table from a .csv file with the default rules, the file must contain __all of the following columns__ (in any order). Column names are case sensitive. If you need the generator to handle additional columns, contact <[email protected]> or adjust __rules.json__.

Make sure that your .csv files __use ; (semicolon)__ as the column separator.

A single cell may contain multiple paragraphs or structures separated by the __#SEP__ separator. In the example below, the Description cell consists of 3 paragraphs. Some cells depend on others: the Buttons cell holds 2 button names split by the separator, and Buttons_URL holds the respective URLs.

Corpus | Corpus_URL | Language | Size | Annotation | Licence | Description | Buttons | Buttons_URL | Publication | Publication_URL | Note
-------|------------|----------|------|------------|---------|-------------|---------|-------------|-------------|-----------------|-------
Example Corpus Name | www.examplaryurl.com | English | 100 million tokens | tokenised, PoS-tagged, lemmatised | CC-BY | First examplary sentence #SEPSecond examplary sentence to be started from new line #SEPExample with ```<a href="http://some.url">hyperlink</a>``` in it | Concordancer#SEPDownload | https://www.concordancer.com/ #SEPhttps://www.download.com | Smith et al. (3019) | https://publication.url | Note text to be displayed in button field

Resulting table:
![Example table](docs/media/example.png)
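
As a rough illustration of the format described above, the following sketch reads semicolon-separated data with the standard `csv` module and splits cells on `#SEP`. It is not how the generator itself reads its input; the helper `split_cell` and the shortened sample data are made up for this example.

```python
import csv
import io

SEP = "#SEP"

# A tiny in-memory stand-in for a .csv file (semicolon-separated).
sample = (
    "Corpus;Description;Buttons;Buttons_URL\n"
    "Example Corpus Name;First sentence #SEPSecond sentence;"
    "Concordancer#SEPDownload;"
    "https://www.concordancer.com/ #SEPhttps://www.download.com\n"
)


def split_cell(cell: str) -> list[str]:
    """Split a cell on #SEP and trim surrounding whitespace."""
    return [part.strip() for part in cell.split(SEP)]


rows = list(csv.DictReader(io.StringIO(sample), delimiter=";"))
row = rows[0]

paragraphs = split_cell(row["Description"])
# Pair each button name with the URL at the same position.
buttons = list(zip(split_cell(row["Buttons"]), split_cell(row["Buttons_URL"])))

print(paragraphs)
print(buttons)
```

Pairing by position is why the number of entries in Buttons and Buttons_URL needs to match.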

### Table titles and ordering
The table title is derived from the .csv file name, which follows the format X-table_title.csv, where X is the index used for table ordering.
Tables can be grouped into sections by placing them in an intermediate directory inside the corpora directory; directory names follow the same indexing scheme as .csv files.
For example, a corpora directory with the structure:
```bash
Historical corpora
├── 1-Historical corpora in the CLARIN infrastructure
│   ├── 1-Monolingual corpora.csv
│   └── 2-Multilingual corpora.csv
└── 2-Other historical corpora
    ├── 1-Monolingual corpora.csv
    └── 2-Multilingual corpora.csv
```
Will produce:

![Example corpora](docs/media/corpora.png)
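
The naming scheme can be sketched as a small parser that splits a name into its index and title. The function `parse_indexed_name` and the file name `10-Parallel corpora.csv` are hypothetical and only serve the example; the module's own `table_title` and `section_title` helpers may behave differently.

```python
import re
from pathlib import Path

# Pattern for "<index>-<title>"; assumes the index is a non-negative integer.
NAME_PATTERN = re.compile(r"^(\d+)-(.+)$")


def parse_indexed_name(path: str) -> tuple[int, str]:
    """Return (index, title) for a name such as '2-Multilingual corpora.csv'."""
    name = Path(path).name
    if name.endswith(".csv"):
        name = name[: -len(".csv")]
    match = NAME_PATTERN.match(name)
    if match is None:
        raise ValueError(f"expected 'X-title', got {name!r}")
    return int(match.group(1)), match.group(2)


files = ["2-Multilingual corpora.csv", "10-Parallel corpora.csv", "1-Monolingual corpora.csv"]
for index, title in sorted(parse_indexed_name(f) for f in files):
    print(index, title)
```

The sketch sorts by numeric index. The command-line entry point sorts file and directory names as plain strings, so with ten or more tables in one directory it may be safer to zero-pad the index (for example `01-`, `02-`).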

### Rules format
Rules are written as nested JSON objects describing tags and fields.
Given the rule:
```javascript
{"tags": [
    {"tag": "<table class=\"table\" cellspacing=\"2\">", "tags": [
        {"tag": "<thead>", "tags": [
            {"tag": "<tr>", "tags": [
                {"tag": "<th>", "text": "Corpus name"}
            ]}
        ]},
        {"tag": "<tbody>", "tags": [
            {"tag": "<tr>", "tags": [
                {"tag": "<td valign=\"top\">", "tags": [
                    {"tag": "<p>", "fields": [
                        {"text": "<strong>Field data</strong> will be inserted here: %s", "columns": ["column_name_in_csv_file"]}
                    ]}
                ]}
            ]}
        ]}
    ]}
]}
```

The generated HTML table with names of corpora, assuming the .csv file contains only 2 rows:
```html
<table class ="table" cellspacing="2">
<thead>
<tr>
<th valign="top">Corpus name
</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top">
<p>
<strong>Field data</strong> will be inserted here: NKJP 2.1.4
</p>
</td>
</tr>
</tbody>
<tbody>
<tr>
<td valign="top">
<p>
<strong>Field data</strong> will be inserted here: Common Crawl
</p>
</td>
</tr>
</tbody>
</table>

```
<table class ="table" cellspacing="2">
<thead>
<tr>
<th valign="top">Corpus name
</th>
</tr>
</thead>
<tbody>
<tr>
<td valign="top">
<p>Some text here
<strong>Field data</strong> will be inserted here: NKJP 2.1.4
</p>
</td>
</tr>
</tbody>
<tbody>
<tr>
<td valign="top">
<p>Some text here
<strong>Field data</strong> will be inserted here: Common Crawl
</p>
</td>
</tr>
</tbody>
</table>



The \<tbody\> tag encloses the tags and fields used to create rows. Only tags nested within \<tbody\> ... \</tbody\> can contain "fields": [].
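
To make the nesting concrete, here is a minimal sketch of how one rule node could be expanded for a single data row. It is not the `Clartable` implementation: the functions `closing_tag` and `render` are hypothetical, the sketch does not repeat \<tbody\> per row, and it performs no HTML escaping (cells may intentionally contain markup such as hyperlinks).

```python
import re


def closing_tag(opening: str) -> str:
    """Derive '</table>' from '<table class="table">'."""
    name = re.match(r"<\s*([A-Za-z][A-Za-z0-9]*)", opening).group(1)
    return f"</{name}>"


def render(node: dict, row: dict) -> str:
    """Render one rule node (and its children) for a single data row."""
    parts = [node["tag"]]
    if "text" in node:
        parts.append(node["text"])
    for field in node.get("fields", []):
        # Fill each %s placeholder with the value of the matching column.
        values = tuple(row[column] for column in field["columns"])
        parts.append(field["text"] % values)
    for child in node.get("tags", []):
        parts.append(render(child, row))
    parts.append(closing_tag(node["tag"]))
    return "".join(parts)


rule = {"tag": "<td valign=\"top\">", "tags": [
    {"tag": "<p>", "fields": [
        {"text": "<strong>Corpus:</strong> %s", "columns": ["Corpus"]}
    ]}
]}

print(render(rule, {"Corpus": "NKJP 2.1.4"}))
```

Each `%s` in a field's `text` is filled from the columns listed in `columns`, in order, so the number of placeholders has to match the number of columns.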


30 changes: 30 additions & 0 deletions pyproject.toml
@@ -0,0 +1,30 @@
[tool.poetry]
name = "resource_families_html_generator"
description = "CLARIN presentation layer generator for Resource Families"
version = "1.0.0-dev"
license = "./LICENSE.txt"
authors = [
"Michał Gawor <[email protected]>",
"Alexander König <[email protected]>",
]
maintainers = [
"Michał Gawor <[email protected]>",
"Alexander König <[email protected]>",
]
packages = [
{ include = "rfhg" },
]
include = [
"rfhg/static/*",
]

[tool.poetry.dependencies]
json5 = '0.9.14'
numpy = '1.26.2'
pandas = '2.1.3'
python = "^3.12"
python-dateutil = '2.8.2'

[build-system]
requires = ["poetry-core>=1.0.0"]
build-backend = "poetry.core.masonry.api"
Empty file added rfhg/__init__.py
Empty file.
59 changes: 59 additions & 0 deletions rfhg/__main__.py
@@ -0,0 +1,59 @@
#!/usr/bin/env python3

import argparse
import os
import re

from .clartable import Clartable
from .reader import read_data, read_rules, resolve_static
from .utils import table_title, section_title

parser = argparse.ArgumentParser(description='Create html table from given data and rules. To use static resources as arguments use `static.<path_inside_rfhg/static>`')
parser.add_argument('-i', metavar='PATH', default='static.resource_families/', help='path to a .csv file or folder with .csv files. Note that nesting data files inside multiple directories will generate nested tables respective to directory nesting.')
parser.add_argument('-r', metavar='PATH', default='static.rules/rules.json', help='path to json file with rules')
parser.add_argument('-o', metavar='PATH', required=True, help='path to file where output html table will be written')

args = parser.parse_args()


if __name__ == "__main__":
    rules = read_rules(args.r)
    clartable = Clartable(rules)

    # A non-existent output path is created as a directory; for a directory
    # the table is written to <directory>/<directory name>.html
    output_path = args.o
    if not os.path.exists(output_path):
        os.makedirs(output_path)
    if os.path.isdir(output_path):
        file_name = os.path.basename(os.path.normpath(output_path)) + '.html'
        output_path = os.path.join(output_path, file_name)

    # the context manager closes the output file even if generation fails
    with open(output_path, 'w') as output:
        # input is a single file
        input_path = resolve_static(args.i)
        if os.path.isfile(os.path.normpath(input_path)):
            print("Processing file: ", input_path)
            print(input_path)
            data = read_data(input_path)
            print(data)
            title = table_title(input_path)
            table = title + clartable.generate(data)
            output.write(table)
        # input is a folder
        else:
            print("Processing directory: ", input_path)
            for root, subdir, files in os.walk(input_path):
                # sorting in place fixes the order in which os.walk descends
                subdir.sort()
                files.sort()
                if len(files) > 0:
                    if os.path.basename(root) != '':
                        output.write(section_title(root))
                    for _file in files:
                        print("Processing file: ", _file)
                        data = read_data(os.path.join(root, _file))
                        # generate table:
                        if _file != '':
                            table = table_title(_file)
                        else:
                            table = ''
                        table += clartable.generate(data)
                        output.write(table)