diff --git a/.flake8 b/.flake8
index d7afb7d1..555b4381 100644
--- a/.flake8
+++ b/.flake8
@@ -16,12 +16,15 @@ exclude =
resources,
venv,
build,
- dedoc.egg-info
- docs/_build
+ dedoc.egg-info,
+ docs/_build,
+ scripts/fintoc2022/metric.py
# ANN101 - type annotations for self
+# T201 - prints found
+# JS101 - Multi-line container not broken after opening character
ignore =
ANN101
per-file-ignores =
scripts/*:T201
- scripts/benchmark_pdf_performance*:JS101,T201
+ scripts/benchmark_pdf_performance*:JS101
diff --git a/.pre-commit-config.yaml b/.pre-commit-config.yaml
index 76ee04b4..0f439368 100644
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@@ -3,7 +3,7 @@ repos:
rev: 5.0.4
hooks:
- id: flake8
- exclude: \.github|.*__init__\.py|resources|docs|venv|build|dedoc\.egg-info
+ exclude: \.github|.*__init__\.py|resources|docs|venv|build|dedoc\.egg-info|scripts/fintoc2022/metric.py
args:
- "--config=.flake8"
additional_dependencies: [
diff --git a/README.md b/README.md
index a4c02e92..519ebb7b 100644
--- a/README.md
+++ b/README.md
@@ -1,6 +1,10 @@
# Dedoc
+[![License](http://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)
[![Documentation Status](https://readthedocs.org/projects/dedoc/badge/?version=latest)](https://dedoc.readthedocs.io/en/latest/?badge=latest)
+[![GitHub release](https://img.shields.io/github/release/ispras/dedoc.svg)](https://github.com/ispras/dedoc/releases/)
+[![Demo dedoc-readme.hf.space](https://img.shields.io/website-up-down-green-red/https/huggingface.co/spaces/dedoc/README.svg)](https://dedoc-readme.hf.space)
+[![Docker Hub](https://img.shields.io/docker/pulls/dedocproject/dedoc.svg)](https://hub.docker.com/r/dedocproject/dedoc/ "Docker Pulls")
![Dedoc](https://github.com/ispras/dedoc/raw/master/dedoc_logo.png)
@@ -39,26 +43,26 @@ In 2022, the system won a grant to support the development of promising AI projects
## Document format description
The system processes different document formats. The main formats are listed below:
-| Format group | Description |
-|-----------------------|---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
-| Office formats | DOCX, XLSX, PPTX and formats that canbe converted to them. Handling of these for-mats is held by analysis of format inner rep-resentation and using specialized libraries ([python-docx](https://python-docx.readthedocs.io/en/latest/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)) |
-| HTML, EML, MHTML | HTML documents are parsed using tagsanalysis, HTML handler is used for han-dling documents of other formats in thisgroup |
-| TXT | Only raw textual content is analyzed |
-| Archives | Attachments of the archive are analyzed | |
-| PDF,document images | Copyable PDF documents (with a textual layer) can be handled using [pdfminer-six](https://pdfminersix.readthedocs.io/en/latest/) library or [tabby](https://github.com/sunveil/ispras_tbl_extr) software. Non-copyable PDF documents or imagesare handled using [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract), machine learning methods (including neural network methods) and [image processing methods](https://opencv.org/) |
+| Format group | Description |
+|----------------------|-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
+| Office formats       | DOCX, XLSX, PPTX and formats that can be converted to them. Handling of these formats relies on analysis of the format's inner representation and on specialized libraries ([python-docx](https://python-docx.readthedocs.io/en/latest/), [BeautifulSoup](https://www.crummy.com/software/BeautifulSoup/bs4/doc/)) |
+| HTML, EML, MHTML     | HTML documents are parsed using tag analysis; the HTML handler is also used for documents of the other formats in this group |
+| TXT | Only raw textual content is analyzed |
+| Archives             | Attachments of the archive are analyzed |
+| PDF, document images | Copyable PDF documents (with a textual layer) can be handled using [pdfminer-six](https://pdfminersix.readthedocs.io/en/latest/) library or [tabby](https://github.com/sunveil/ispras_tbl_extr) software. Non-copyable PDF documents or images are handled using [Tesseract-OCR](https://github.com/tesseract-ocr/tesseract), machine learning methods (including neural network methods) and [image processing methods](https://opencv.org/) |
## Examples of processed scanned documents
* Dedoc can only process scanned black and white documents, such as technical specifications, regulations, articles, etc.
-
-
+
+
* In particular, dedoc recognizes tabular information only from tables with explicit boundaries. Here are examples of documents that can be processed by dedoc's image handler:
-
-
+
+
* The system also automatically detects and corrects the orientation of scanned documents
-## Example of structure extractor
-
-
+## Examples of structure extractors
+
+
## Impact
@@ -66,25 +70,26 @@ This project may be useful as a first step of automatic document analysis pipeline
Dedoc is in demand for information analytic systems, information leak monitoring systems, as well as for natural language processing systems.
The library is intended for application use by developers of systems for automatic analysis and structuring of electronic documents, including for further search in electronic documents.
-# Online-Documentation
-Relevant documentation of the dedoc is available [here](https://dedoc.readthedocs.io/en/latest/)
+# Documentation
+Relevant documentation of dedoc is available [here](https://dedoc.readthedocs.io/en/latest/)
# Demo
-You can try dedoc's demo: https://dedoc-readme.hf.space.
-We have a video to demonstrate how to use the system: https://www.youtube.com/watch?v=ZUnPYV8rd9A.
+* You can try [dedoc demo](https://dedoc-readme.hf.space)
+* You can watch [video about dedoc](https://www.youtube.com/watch?v=ZUnPYV8rd9A)
-![Web_interface](docs/source/_static/web_interface.png)
+![](https://github.com/ispras/dedoc/raw/master/docs/source/_static/web_interface.png)
-![dedoc_demo](docs/source/_static/dedoc_short.gif)
+![](https://github.com/ispras/dedoc/raw/master/docs/source/_static/dedoc_short.gif)
-# Some our publications
+# Publications related to dedoc
-* Article on [Habr](https://habr.com/ru/companies/isp_ras/articles/779390/), where we describe our system in detail
-* [Our article](https://aclanthology.org/2022.fnp-1.13.pdf) from the FINTOC 2022 competition. We are the winners :smiley: :trophy:!
+* Article [ISPRAS@FinTOC-2022 shared task: Two-stage TOC generation model](https://aclanthology.org/2022.fnp-1.13.pdf) for the [FinTOC 2022 Shared Task](https://wp.lancs.ac.uk/cfie/fintoc2022/). We are the winners :smiley: :trophy:!
+* Article on habr.com [Dedoc: как автоматически извлечь из текстового документа всё и даже немного больше](https://habr.com/ru/companies/isp_ras/articles/779390/) in Russian (2023)
+* Article [Dedoc: A Universal System for Extracting Content and Logical Structure From Textual Documents](https://ieeexplore.ieee.org/abstract/document/10508151/) in English (2023)
# Installation instructions
-****************************************
+
This project has REST Api and you can run it in Docker container.
Also, dedoc can be installed as a library via `pip`.
There are two ways to install and run dedoc as a web application or a library that are described below.
diff --git a/VERSION b/VERSION
index 61618788..fae692e4 100644
--- a/VERSION
+++ b/VERSION
@@ -1 +1 @@
-2.2
\ No newline at end of file
+2.2.1
\ No newline at end of file
diff --git a/dedoc/api/api_args.py b/dedoc/api/api_args.py
index 1c260b37..f139733f 100644
--- a/dedoc/api/api_args.py
+++ b/dedoc/api/api_args.py
@@ -7,7 +7,7 @@
@dataclass
class QueryParameters:
# type of document structure parsing
- document_type: str = Form("", enum=["", "law", "tz", "diploma"], description="Document domain")
+ document_type: str = Form("", enum=["", "law", "tz", "diploma", "article", "fintoc"], description="Document domain")
structure_type: str = Form("tree", enum=["linear", "tree"], description="Output structure type")
return_format: str = Form("json", enum=["json", "html", "plain_text", "tree", "collapsed_tree", "ujson", "pretty_json"],
description="Response representation, most types (except json) are used for debug purposes only")
@@ -29,7 +29,7 @@ class QueryParameters:
# pdf handling
pdf_with_text_layer: str = Form("auto_tabby", enum=["true", "false", "auto", "auto_tabby", "tabby"],
description="Extract text from a text layer of PDF or using OCR methods for image-like documents")
- language: str = Form("rus+eng", enum=["rus+eng", "rus", "eng"], description="Recognition language")
+ language: str = Form("rus+eng", enum=["rus+eng", "rus", "eng", "fra", "spa"], description="Recognition language")
pages: str = Form(":", description='Page numbers range for reading PDF or images, "left:right" means read pages from left to right')
is_one_column_document: str = Form("auto", enum=["auto", "true", "false"],
description='One or multiple column document, "auto" - predict number of page columns automatically')
diff --git a/dedoc/api/api_utils.py b/dedoc/api/api_utils.py
index 1287912d..dd27cfc1 100644
--- a/dedoc/api/api_utils.py
+++ b/dedoc/api/api_utils.py
@@ -3,6 +3,7 @@
from dedoc.data_structures import LineMetadata
from dedoc.data_structures.concrete_annotations.bold_annotation import BoldAnnotation
from dedoc.data_structures.concrete_annotations.italic_annotation import ItalicAnnotation
+from dedoc.data_structures.concrete_annotations.reference_annotation import ReferenceAnnotation
from dedoc.data_structures.concrete_annotations.strike_annotation import StrikeAnnotation
from dedoc.data_structures.concrete_annotations.subscript_annotation import SubscriptAnnotation
from dedoc.data_structures.concrete_annotations.superscript_annotation import SuperscriptAnnotation
@@ -116,7 +117,7 @@ def json2html(text: str, paragraph: TreeNode, tables: Optional[List[Table]], tab
if table2id is None:
table2id = {table.metadata.uid: table_id for table_id, table in enumerate(tables)}
- ptext = __annotations2html(paragraph, table2id)
+ ptext = __annotations2html(paragraph, table2id, tabs=tabs)
if paragraph.metadata.hierarchy_level.line_type in [HierarchyLevel.header, HierarchyLevel.root]:
ptext = f"{ptext.strip()}"
@@ -125,7 +126,10 @@ def json2html(text: str, paragraph: TreeNode, tables: Optional[List[Table]], tab
else:
ptext = ptext.strip()
- text += f'<p> {"&nbsp;" * tabs} {ptext} <sub> id = {paragraph.node_id} ; type = {paragraph.metadata.hierarchy_level.line_type} </sub></p>'
+ ptext = f'<p> {"&nbsp;" * tabs} {ptext} <sub> id = {paragraph.node_id} ; type = {paragraph.metadata.hierarchy_level.line_type} </sub></p>'
+ if hasattr(paragraph.metadata, "uid"):
+ ptext = f'<div id="{paragraph.metadata.uid}">{ptext}</div>'
+ text += ptext
for subparagraph in paragraph.subparagraphs:
text = json2html(text=text, paragraph=subparagraph, tables=None, tabs=tabs + 4, table2id=table2id)
@@ -157,6 +161,9 @@ def __value2tag(name: str, value: str) -> str:
if name == UnderlinedAnnotation.name:
return "u"
+ if name == ReferenceAnnotation.name:
+ return "a"
+
if value.startswith("heading "):
level = value[len("heading "):]
return "h" + level if level.isdigit() and int(level) < 7 else "strong"
@@ -164,7 +171,7 @@ def __value2tag(name: str, value: str) -> str:
return value
-def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int]) -> str:
+def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int], tabs: int = 0) -> str:
indexes = dict()
for annotation in paragraph.annotations:
@@ -177,7 +184,7 @@ def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int]) -> str:
SubscriptAnnotation.name,
SuperscriptAnnotation.name,
UnderlinedAnnotation.name]
- check_annotations = bool_annotations + ["table"]
+ check_annotations = bool_annotations + ["table", "reference"]
if name not in check_annotations and not value.startswith("heading "):
continue
elif name in bool_annotations and annotation.value == "False":
@@ -187,10 +194,13 @@ def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int]) -> str:
indexes.setdefault(annotation.start, "")
indexes.setdefault(annotation.end, "")
if name == "table":
- indexes[annotation.start] += f'<a href="#{tag}"> table#{table2id[tag]} </a>'
+ indexes[annotation.start] += f'<a href="#{tag}">'
+ indexes[annotation.end] += f'</a> (table {table2id[tag]})'
+ elif name == "reference":
+ indexes[annotation.start] += f'<{tag} href="#{value}">'
+ indexes[annotation.end] = f"</{tag}>" + indexes[annotation.end]
else:
- indexes[annotation.start] += "<" + tag + ">"
- indexes[annotation.end] = "</" + tag + ">" + indexes[annotation.end]
+ indexes[annotation.start] += f"<{tag}>"
+ indexes[annotation.end] = f"</{tag}>" + indexes[annotation.end]
insert_tags = sorted([(index, tag) for index, tag in indexes.items()], reverse=True)
text = paragraph.text
@@ -198,12 +208,13 @@ def __annotations2html(paragraph: TreeNode, table2id: Dict[str, int]) -> str:
for index, tag in insert_tags:
text = text[:index] + tag + text[index:]
- return text.replace("\n", " ")
+ return text.replace("\n", f'<br>{"&nbsp;" * tabs}')
def table2html(table: Table, table2id: Dict[str, int]) -> str:
uid = table.metadata.uid
- text = f"
table {table2id[uid]}:
"
+ table_title = f" {table.metadata.title}" if table.metadata.title else ""
+ text = f"
table {table2id[uid]}:{table_title}
"
text += f'
\n\n'
for row in table.cells:
text += "
\n"
diff --git a/dedoc/api/schema/document_metadata.py b/dedoc/api/schema/document_metadata.py
index 197bfbc1..4d814fc3 100644
--- a/dedoc/api/schema/document_metadata.py
+++ b/dedoc/api/schema/document_metadata.py
@@ -1,5 +1,3 @@
-from typing import Optional
-
from pydantic import BaseModel, Extra, Field
@@ -18,4 +16,3 @@ class Config:
created_time: int = Field(description="Creation time of the document in the UnixTime format", example=1590579805)
access_time: int = Field(description="File access time in the UnixTime format", example=1590579805)
file_type: str = Field(description="Mime type of the file", example="application/vnd.oasis.opendocument.text")
- other_fields: Optional[dict] = Field(description="Other optional fields")
diff --git a/dedoc/api/schema/line_metadata.py b/dedoc/api/schema/line_metadata.py
index 0c08dabe..37e893d8 100644
--- a/dedoc/api/schema/line_metadata.py
+++ b/dedoc/api/schema/line_metadata.py
@@ -13,4 +13,3 @@ class Config:
paragraph_type: str = Field(description="Type of the document line/paragraph (header, list_item, list) and etc.", example="raw_text")
page_id: int = Field(description="Page number of the line/paragraph beginning", example=0)
line_id: Optional[int] = Field(description="Line number", example=1)
- other_fields: Optional[dict] = Field(description="Some other fields")
diff --git a/dedoc/api/web/index.html b/dedoc/api/web/index.html
index 5ca05cec..055ef58b 100644
--- a/dedoc/api/web/index.html
+++ b/dedoc/api/web/index.html
@@ -38,6 +38,7 @@
Type of document structure parsing
+
document_type
@@ -137,6 +138,8 @@
PDF handling
+ <option value="fra">fra</option>
+ <option value="spa">spa</option>
language
diff --git a/dedoc/attachments_handler/attachments_handler.py b/dedoc/attachments_handler/attachments_handler.py
index 1935e5d2..b657dd88 100644
--- a/dedoc/attachments_handler/attachments_handler.py
+++ b/dedoc/attachments_handler/attachments_handler.py
@@ -72,7 +72,7 @@ def handle_attachments(self, document_parser: "DedocManager", document: Unstruct
# return empty ParsedDocument with Meta information
parsed_file = self.__get_empty_document(document_parser=document_parser, attachment=attachment, parameters=parameters_copy)
- parsed_file.metadata.set_uid(attachment.uid)
+ parsed_file.metadata.uid = attachment.uid
attachments.append(parsed_file)
return attachments
diff --git a/dedoc/data_structures/concrete_annotations/reference_annotation.py b/dedoc/data_structures/concrete_annotations/reference_annotation.py
index e629ba8b..52a45f1d 100644
--- a/dedoc/data_structures/concrete_annotations/reference_annotation.py
+++ b/dedoc/data_structures/concrete_annotations/reference_annotation.py
@@ -27,7 +27,7 @@ class ReferenceAnnotation(Annotation):
page_id=10,
line_id=189,
tag_hierarchy_level=HierarchyLevel(level1=2, level2=0, paragraph_type="bibliography_item")),
- other_fields={"uid": "97cfac39-f0e3-11ee-b81c-b88584b4e4a1"}
+ uid="97cfac39-f0e3-11ee-b81c-b88584b4e4a1"
),
annotations=[]
)
diff --git a/dedoc/data_structures/document_metadata.py b/dedoc/data_structures/document_metadata.py
index 134ba6a4..beec9c56 100644
--- a/dedoc/data_structures/document_metadata.py
+++ b/dedoc/data_structures/document_metadata.py
@@ -1,4 +1,5 @@
import uuid
+from typing import Dict, Union
from dedoc.api.schema.document_metadata import DocumentMetadata as ApiDocumentMetadata
from dedoc.data_structures.serializable import Serializable
@@ -17,8 +18,8 @@ def __init__(self,
created_time: int,
access_time: int,
file_type: str,
- other_fields: dict = None,
- uid: str = None) -> None:
+ uid: str = None,
+ **kwargs: Dict[str, Union[str, int, float]]) -> None:
"""
:param uid: document unique identifier (useful for attached files)
:param file_name: original document name (before rename and conversion, so it can contain non-ascii symbols, spaces and so on)
@@ -28,7 +29,6 @@ def __init__(self,
:param created_time: time of the creation in unixtime
:param access_time: time of the last access to the file in unixtime
:param file_type: mime type of the file
- :param other_fields: additional fields of user metadata
"""
self.file_name = file_name
self.temporary_file_name = temporary_file_name
@@ -37,32 +37,9 @@ def __init__(self,
self.created_time = created_time
self.access_time = access_time
self.file_type = file_type
- self.other_fields = {}
- if other_fields is not None and len(other_fields) > 0:
- self.extend_other_fields(other_fields)
- self.uid = f"doc_uid_auto_{uuid.uuid1()}" if uid is None else uid
-
- def set_uid(self, uid: str) -> None:
- self.uid = uid # noqa
-
- def extend_other_fields(self, new_fields: dict) -> None:
- """
- Add new attributes to the class and to the other_fields dictionary.
-
- :param new_fields: fields to add
- """
- assert (new_fields is not None)
- assert (len(new_fields) > 0)
-
- for key, value in new_fields.items():
+ for key, value in kwargs.items():
setattr(self, key, value)
- self.other_fields[key] = value
+ self.uid = f"doc_uid_auto_{uuid.uuid1()}" if uid is None else uid
def to_api_schema(self) -> ApiDocumentMetadata:
- api_document_metadata = ApiDocumentMetadata(uid=self.uid, file_name=self.file_name, temporary_file_name=self.temporary_file_name, size=self.size,
- modified_time=self.modified_time, created_time=self.created_time, access_time=self.access_time,
- file_type=self.file_type, other_fields=self.other_fields)
- if self.other_fields is not None:
- for (key, value) in self.other_fields.items():
- setattr(api_document_metadata, key, value)
- return api_document_metadata
+ return ApiDocumentMetadata(**vars(self))
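A sketch of what the constructor change means for callers: arbitrary extra metadata is now passed as keyword arguments and lands directly as attributes, and `to_api_schema` forwards the whole `vars(self)` dict, assuming the pydantic schema still permits extra fields (it imports `Extra`). Field values below are invented:

```python
from dedoc.data_structures.document_metadata import DocumentMetadata

metadata = DocumentMetadata(
    file_name="report.docx",
    temporary_file_name="report.docx",
    size=12345,
    modified_time=1590579805,
    created_time=1590579805,
    access_time=1590579805,
    file_type="application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    author="I. Ivanov",  # was: other_fields={"author": "I. Ivanov"}
)
assert metadata.author == "I. Ivanov"  # extra kwargs become plain attributes
api_schema = metadata.to_api_schema()  # vars(self) carries them to the API model
```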
diff --git a/dedoc/data_structures/line_metadata.py b/dedoc/data_structures/line_metadata.py
index 19b6730a..e9be87a3 100644
--- a/dedoc/data_structures/line_metadata.py
+++ b/dedoc/data_structures/line_metadata.py
@@ -1,4 +1,4 @@
-from typing import Optional
+from typing import Dict, Optional, Union
from dedoc.api.schema.line_metadata import LineMetadata as ApiLineMetadata
from dedoc.data_structures.hierarchy_level import HierarchyLevel
@@ -15,7 +15,7 @@ def __init__(self,
line_id: Optional[int],
tag_hierarchy_level: Optional[HierarchyLevel] = None,
hierarchy_level: Optional[HierarchyLevel] = None,
- other_fields: Optional[dict] = None) -> None:
+ **kwargs: Dict[str, Union[str, int, float]]) -> None:
"""
:param page_id: page number where paragraph starts, the numeration starts from page 0
:param line_id: line number inside the entire document, the numeration starts from line 0
@@ -23,33 +23,19 @@ def __init__(self,
(usually information got from tags e.g. in docx or html readers)
:param hierarchy_level: the hierarchy level of the line extracted by some of the structure extractors - the result type and level of the line.
The lower the level of the hierarchy, the closer it is to the root, it's used to construct document tree.
- :param other_fields: additional fields of user metadata
"""
self.tag_hierarchy_level = HierarchyLevel(None, None, can_be_multiline=True, line_type=HierarchyLevel.unknown) \
if tag_hierarchy_level is None else tag_hierarchy_level
self.hierarchy_level = hierarchy_level
self.page_id = page_id
self.line_id = line_id
- self.__other_fields = {}
- if other_fields is not None and len(other_fields) > 0:
- self.extend_other_fields(other_fields)
-
- def extend_other_fields(self, new_fields: dict) -> None:
- """
- Add new attributes to the class and to the other_fields dictionary.
-
- :param new_fields: fields to add
- """
- assert (new_fields is not None)
- assert (len(new_fields) > 0)
-
- for key, value in new_fields.items():
+ for key, value in kwargs.items():
setattr(self, key, value)
- self.__other_fields[key] = value
def to_api_schema(self) -> ApiLineMetadata:
paragraph_type = self.hierarchy_level.line_type if self.hierarchy_level is not None else HierarchyLevel.raw_text
- api_line_metadata = ApiLineMetadata(page_id=self.page_id, line_id=self.line_id, paragraph_type=paragraph_type, other_fields=self.__other_fields)
- for key, value in self.__other_fields.items():
- setattr(api_line_metadata, key, value)
+ api_line_metadata = ApiLineMetadata(page_id=self.page_id, line_id=self.line_id, paragraph_type=paragraph_type)
+ for key, value in vars(self).items():
+ if not hasattr(api_line_metadata, key) and key not in ("tag_hierarchy_level", "hierarchy_level"):
+ setattr(api_line_metadata, key, value)
return api_line_metadata
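Line metadata follows the same kwargs pattern; a sketch showing that extra attributes survive the trip to the API schema (the `html_tag` field mirrors the one set by the HTML reader further below, and again assumes the pydantic model allows extra fields):

```python
from dedoc.data_structures.line_metadata import LineMetadata

metadata = LineMetadata(page_id=0, line_id=3, html_tag="h2")  # was: other_fields={"html_tag": "h2"}
api_metadata = metadata.to_api_schema()
# everything in vars(self) except the two hierarchy-level fields is copied over
assert api_metadata.html_tag == "h2"
```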
diff --git a/dedoc/dedoc_manager.py b/dedoc/dedoc_manager.py
index 301498d4..5600dd29 100644
--- a/dedoc/dedoc_manager.py
+++ b/dedoc/dedoc_manager.py
@@ -105,8 +105,8 @@ def __parse_no_error_handling(self, file_path: str, parameters: Dict[str, str])
# Step 3 - Adding meta-information
metadata = self.document_metadata_extractor.extract(file_path=tmp_file_path, converted_filename=os.path.basename(converted_file_path),
- original_filename=file_name, parameters=parameters, other_fields=unstructured_document.metadata)
- unstructured_document.metadata = metadata
+ original_filename=file_name, parameters=parameters)
+ unstructured_document.metadata = {**unstructured_document.metadata, **metadata}
self.logger.info(f"Add metadata of file {file_name}")
# Step 4 - Extract structure
diff --git a/dedoc/download_models.py b/dedoc/download_models.py
index 643cf30e..b520a7df 100644
--- a/dedoc/download_models.py
+++ b/dedoc/download_models.py
@@ -15,7 +15,8 @@
scan_orientation_efficient_net_b0="9ea283f3d346ae4fdd82463a9f60b5369a3ffb58",
font_classifier="db4481ad60ab050cbb42079b64f97f9e431feb07",
paragraph_classifier="00bf989876cec171c1cf9859a6b712af6445e864",
- line_type_classifiers="2e498d1ec82b72c1a96ba0d25344b71402997013"
+ line_type_classifiers="2e498d1ec82b72c1a96ba0d25344b71402997013",
+ fintoc_classifiers="42f8ada99a5da608139b078c93bebfffc5b30263"
)
@@ -42,6 +43,14 @@ def download(resources_path: str) -> None:
repo_name="line_type_classifiers",
hub_name=f"{classifier_type}.pkl.gz")
+ fintoc_classifiers_resources_path = os.path.join(resources_path, "fintoc_classifiers")
+ for language in ("en", "fr", "sp"):
+ for classifier_type in ("target", "binary"):
+ download_from_hub(out_dir=fintoc_classifiers_resources_path,
+ out_name=f"{classifier_type}_classifier_{language}.pkg.gz",
+ repo_name="fintoc_classifiers",
+ hub_name=f"{classifier_type}_classifier_{language}_txt_layer.pkg.gz")
+
if __name__ == "__main__":
resources_path = get_config()["resources_path"]
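After this change, running the download script (`python dedoc/download_models.py`) should leave six new classifier files next to the existing model weights. A sketch for sanity-checking the layout; the paths simply follow the `out_name` pattern in the loop above:

```python
import os

from dedoc.config import get_config

fintoc_dir = os.path.join(get_config()["resources_path"], "fintoc_classifiers")
for language in ("en", "fr", "sp"):
    for classifier_type in ("target", "binary"):
        path = os.path.join(fintoc_dir, f"{classifier_type}_classifier_{language}.pkg.gz")
        print(path, "OK" if os.path.isfile(path) else "missing")
```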
diff --git a/dedoc/manager_config.py b/dedoc/manager_config.py
index 679db954..35815ecf 100644
--- a/dedoc/manager_config.py
+++ b/dedoc/manager_config.py
@@ -1,7 +1,5 @@
from typing import Optional
-from dedoc.readers.article_reader.article_reader import ArticleReader
-
def _get_manager_config(config: dict) -> dict:
"""
@@ -23,6 +21,7 @@ def _get_manager_config(config: dict) -> dict:
from dedoc.metadata_extractors.concrete_metadata_extractors.pdf_metadata_extractor import PdfMetadataExtractor
from dedoc.metadata_extractors.metadata_extractor_composition import MetadataExtractorComposition
from dedoc.readers.archive_reader.archive_reader import ArchiveReader
+ from dedoc.readers.article_reader.article_reader import ArticleReader
from dedoc.readers.csv_reader.csv_reader import CSVReader
from dedoc.readers.docx_reader.docx_reader import DocxReader
from dedoc.readers.email_reader.email_reader import EmailReader
@@ -41,9 +40,11 @@ def _get_manager_config(config: dict) -> dict:
from dedoc.structure_constructors.concrete_structure_constructors.linear_constructor import LinearConstructor
from dedoc.structure_constructors.concrete_structure_constructors.tree_constructor import TreeConstructor
from dedoc.structure_constructors.structure_constructor_composition import StructureConstructorComposition
+ from dedoc.structure_extractors.concrete_structure_extractors.article_structure_extractor import ArticleStructureExtractor
from dedoc.structure_extractors.concrete_structure_extractors.classifying_law_structure_extractor import ClassifyingLawStructureExtractor
from dedoc.structure_extractors.concrete_structure_extractors.default_structure_extractor import DefaultStructureExtractor
from dedoc.structure_extractors.concrete_structure_extractors.diploma_structure_extractor import DiplomaStructureExtractor
+ from dedoc.structure_extractors.concrete_structure_extractors.fintoc_structure_extractor import FintocStructureExtractor
from dedoc.structure_extractors.concrete_structure_extractors.foiv_law_structure_extractor import FoivLawStructureExtractor
from dedoc.structure_extractors.concrete_structure_extractors.law_structure_excractor import LawStructureExtractor
from dedoc.structure_extractors.concrete_structure_extractors.tz_structure_extractor import TzStructureExtractor
@@ -93,7 +94,9 @@ def _get_manager_config(config: dict) -> dict:
DefaultStructureExtractor.document_type: DefaultStructureExtractor(config=config),
DiplomaStructureExtractor.document_type: DiplomaStructureExtractor(config=config),
TzStructureExtractor.document_type: TzStructureExtractor(config=config),
- ClassifyingLawStructureExtractor.document_type: ClassifyingLawStructureExtractor(extractors=law_extractors, config=config)
+ ClassifyingLawStructureExtractor.document_type: ClassifyingLawStructureExtractor(extractors=law_extractors, config=config),
+ ArticleStructureExtractor.document_type: ArticleStructureExtractor(config=config),
+ FintocStructureExtractor.document_type: FintocStructureExtractor(config=config)
}
return dict(
diff --git a/dedoc/metadata_extractors/abstract_metadata_extractor.py b/dedoc/metadata_extractors/abstract_metadata_extractor.py
index 3aa74bfe..02b1a8e4 100644
--- a/dedoc/metadata_extractors/abstract_metadata_extractor.py
+++ b/dedoc/metadata_extractors/abstract_metadata_extractor.py
@@ -11,8 +11,7 @@ def can_extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> bool:
+ parameters: Optional[dict] = None) -> bool:
"""
Check if this extractor can handle the given file. Return True if the extractor can handle it and False otherwise.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.extract` documentation to get the information about parameters.
@@ -24,8 +23,7 @@ def extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> dict:
+ parameters: Optional[dict] = None) -> dict:
"""
Extract metadata from file if possible, i.e. method :meth:`can_extract` returned True.
@@ -35,7 +33,6 @@ def extract(self,
by default it's a name from the file_path. Converted file should be located in the same directory as the file before converting.
:param original_filename: name of the file before renaming (if dedoc manager is used), by default it's a name from the file_path
:param parameters: additional parameters for document parsing, see :ref:`parameters_description` for more details
- :param other_fields: other fields that should be added to the document's metadata
:return: dict with metadata information about the document
"""
pass
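Since `other_fields` is gone from the interface, custom extractors return any extra keys in the flat metadata dict. A minimal sketch of a subclass against the new signature; the class and its `line_count` field are illustrative, not part of dedoc:

```python
from typing import Optional

from dedoc.metadata_extractors.abstract_metadata_extractor import AbstractMetadataExtractor


class TxtStatsMetadataExtractor(AbstractMetadataExtractor):
    def can_extract(self, file_path: str, converted_filename: Optional[str] = None,
                    original_filename: Optional[str] = None, parameters: Optional[dict] = None) -> bool:
        return file_path.lower().endswith(".txt")

    def extract(self, file_path: str, converted_filename: Optional[str] = None,
                original_filename: Optional[str] = None, parameters: Optional[dict] = None) -> dict:
        # extra keys now live at the top level of the returned dict
        with open(file_path, encoding="utf-8", errors="ignore") as file:
            return {"line_count": sum(1 for _ in file)}
```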
diff --git a/dedoc/metadata_extractors/concrete_metadata_extractors/base_metadata_extractor.py b/dedoc/metadata_extractors/concrete_metadata_extractors/base_metadata_extractor.py
index 0e467760..2fc984a0 100644
--- a/dedoc/metadata_extractors/concrete_metadata_extractors/base_metadata_extractor.py
+++ b/dedoc/metadata_extractors/concrete_metadata_extractors/base_metadata_extractor.py
@@ -32,8 +32,7 @@ def can_extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> bool:
+ parameters: Optional[dict] = None) -> bool:
"""
This extractor can handle any file so the method always returns True.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.can_extract` documentation to get the information about parameters.
@@ -44,8 +43,7 @@ def extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> dict:
+ parameters: Optional[dict] = None) -> dict:
"""
Gets the basic meta-information about the file.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.extract` documentation to get the information about parameters.
@@ -55,14 +53,9 @@ def extract(self,
meta_info = self._get_base_meta_information(file_dir, file_name, original_filename)
if parameters.get("is_attached", False) and str(parameters.get("return_base64", "false")).lower() == "true":
- other_fields = {} if other_fields is None else other_fields
+ with open(os.path.join(file_dir, converted_filename), "rb") as file:
+ meta_info["base64_encode"] = b64encode(file.read()).decode("utf-8")
- path = os.path.join(file_dir, converted_filename)
- with open(path, "rb") as file:
- other_fields["base64_encode"] = b64encode(file.read()).decode("utf-8")
-
- if other_fields is not None and len(other_fields) > 0:
- meta_info["other_fields"] = other_fields
return meta_info
@staticmethod
diff --git a/dedoc/metadata_extractors/concrete_metadata_extractors/docx_metadata_extractor.py b/dedoc/metadata_extractors/concrete_metadata_extractors/docx_metadata_extractor.py
index be0964c2..cab05fa3 100644
--- a/dedoc/metadata_extractors/concrete_metadata_extractors/docx_metadata_extractor.py
+++ b/dedoc/metadata_extractors/concrete_metadata_extractors/docx_metadata_extractor.py
@@ -30,8 +30,7 @@ def can_extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> bool:
+ parameters: Optional[dict] = None) -> bool:
"""
Check if the document has .docx extension.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.can_extract` documentation to get the information about parameters.
@@ -43,8 +42,7 @@ def extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> dict:
+ parameters: Optional[dict] = None) -> dict:
"""
Add the predefined list of metadata for the docx documents.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.extract` documentation to get the information about parameters.
@@ -52,19 +50,14 @@ def extract(self,
parameters = {} if parameters is None else parameters
file_dir, file_name, converted_filename, original_filename = self._get_names(file_path, converted_filename, original_filename)
- result = super().extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters,
- other_fields=other_fields)
+ base_fields = super().extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters)
+ docx_fields = self._get_docx_fields(os.path.join(file_dir, converted_filename))
- file_path = os.path.join(file_dir, converted_filename)
- docx_other_fields = self._get_docx_fields(file_path)
-
- result["other_fields"] = {**result.get("other_fields", {}), **docx_other_fields}
+ result = {**base_fields, **docx_fields}
return result
def __convert_date(self, date: Optional[datetime]) -> Optional[int]:
- if date is not None:
- return int(date.timestamp())
- return None
+ return None if date is None else int(date.timestamp())
def _get_docx_fields(self, file_path: str) -> dict:
assert os.path.isfile(file_path)
diff --git a/dedoc/metadata_extractors/concrete_metadata_extractors/image_metadata_extractor.py b/dedoc/metadata_extractors/concrete_metadata_extractors/image_metadata_extractor.py
index 465c9dea..60bec824 100644
--- a/dedoc/metadata_extractors/concrete_metadata_extractors/image_metadata_extractor.py
+++ b/dedoc/metadata_extractors/concrete_metadata_extractors/image_metadata_extractor.py
@@ -52,8 +52,7 @@ def can_extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> bool:
+ parameters: Optional[dict] = None) -> bool:
"""
Check if the document has image-like extension (".png", ".jpg", ".jpeg").
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.can_extract` documentation to get the information about parameters.
@@ -65,20 +64,16 @@ def extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> dict:
+ parameters: Optional[dict] = None) -> dict:
"""
Add the predefined list of metadata for images.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.extract` documentation to get the information about parameters.
"""
file_dir, file_name, converted_filename, original_filename = self._get_names(file_path, converted_filename, original_filename)
- result = super().extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters,
- other_fields=other_fields)
+ base_fields = super().extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters)
- path = os.path.join(file_dir, converted_filename)
- exif_fields = self._get_exif(path)
- if len(exif_fields) > 0:
- result["other_fields"] = {**result.get("other_fields", {}), **exif_fields}
+ exif_fields = self._get_exif(os.path.join(file_dir, converted_filename))
+ result = {**base_fields, **exif_fields}
return result
def __encode_exif(self, exif: Union[str, bytes]) -> Optional[str]:
diff --git a/dedoc/metadata_extractors/concrete_metadata_extractors/note_metadata_extarctor.py b/dedoc/metadata_extractors/concrete_metadata_extractors/note_metadata_extarctor.py
index e0dc4b6e..7c33e290 100644
--- a/dedoc/metadata_extractors/concrete_metadata_extractors/note_metadata_extarctor.py
+++ b/dedoc/metadata_extractors/concrete_metadata_extractors/note_metadata_extarctor.py
@@ -21,8 +21,7 @@ def can_extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> bool:
+ parameters: Optional[dict] = None) -> bool:
"""
Check if the document has .note.pickle extension.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.can_extract` documentation to get the information about parameters.
@@ -34,8 +33,7 @@ def extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> dict:
+ parameters: Optional[dict] = None) -> dict:
"""
Add the predefined list of metadata for the .note.pickle documents.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.extract` documentation to get the information about parameters.
@@ -47,16 +45,13 @@ def extract(self,
with open(file_path, "rb") as infile:
note_dict = pickle.load(infile)
- fields = {"author": note_dict["author"]}
- other_fields = {**other_fields, **fields} if other_fields is not None else fields
-
meta_info = dict(file_name=original_filename,
file_type="note",
size=note_dict["size"],
access_time=note_dict["modified_time"],
created_time=note_dict["created_time"],
modified_time=note_dict["modified_time"],
- other_fields=other_fields)
+ author=note_dict["author"])
return meta_info
except Exception:
raise BadFileFormatError(f"Bad note file:\n file_name = {os.path.basename(file_path)}. Seems note-format is broken")
diff --git a/dedoc/metadata_extractors/concrete_metadata_extractors/pdf_metadata_extractor.py b/dedoc/metadata_extractors/concrete_metadata_extractors/pdf_metadata_extractor.py
index e3502e44..78fc2ac6 100644
--- a/dedoc/metadata_extractors/concrete_metadata_extractors/pdf_metadata_extractor.py
+++ b/dedoc/metadata_extractors/concrete_metadata_extractors/pdf_metadata_extractor.py
@@ -44,8 +44,7 @@ def can_extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> bool:
+ parameters: Optional[dict] = None) -> bool:
"""
Check if the document has .pdf extension.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.can_extract` documentation to get the information about parameters.
@@ -57,19 +56,15 @@ def extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> dict:
+ parameters: Optional[dict] = None) -> dict:
"""
Add the predefined list of metadata for the pdf documents.
Look to the :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.extract` documentation to get the information about parameters.
"""
file_dir, file_name, converted_filename, original_filename = self._get_names(file_path, converted_filename, original_filename)
- result = super().extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters,
- other_fields=other_fields)
- path = os.path.join(file_dir, converted_filename)
- pdf_fields = self._get_pdf_info(path)
- if len(pdf_fields) > 0:
- result["other_fields"] = {**result.get("other_fields", {}), **pdf_fields}
+ base_fields = super().extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters)
+ pdf_fields = self._get_pdf_info(os.path.join(file_dir, converted_filename))
+ result = {**base_fields, **pdf_fields}
return result
def _get_pdf_info(self, path: str) -> dict:
diff --git a/dedoc/metadata_extractors/metadata_extractor_composition.py b/dedoc/metadata_extractors/metadata_extractor_composition.py
index ba46c4b0..8505dcf3 100644
--- a/dedoc/metadata_extractors/metadata_extractor_composition.py
+++ b/dedoc/metadata_extractors/metadata_extractor_composition.py
@@ -21,16 +21,13 @@ def extract(self,
file_path: str,
converted_filename: Optional[str] = None,
original_filename: Optional[str] = None,
- parameters: Optional[dict] = None,
- other_fields: Optional[dict] = None) -> dict:
+ parameters: Optional[dict] = None) -> dict:
"""
Extract metadata using one of the extractors if suitable extractor was found.
Look to the method :meth:`~dedoc.metadata_extractors.AbstractMetadataExtractor.extract` of the class
:class:`~dedoc.metadata_extractors.AbstractMetadataExtractor` documentation to get the information about method's parameters.
"""
for extractor in self.extractors:
- if extractor.can_extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters,
- other_fields=other_fields):
- return extractor.extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters,
- other_fields=other_fields)
+ if extractor.can_extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters):
+ return extractor.extract(file_path=file_path, converted_filename=converted_filename, original_filename=original_filename, parameters=parameters)
raise Exception(f"Can't extract metadata from from file {os.path.basename(file_path)}")
diff --git a/dedoc/readers/article_reader/article_reader.py b/dedoc/readers/article_reader/article_reader.py
index f2169452..fcb21cfb 100644
--- a/dedoc/readers/article_reader/article_reader.py
+++ b/dedoc/readers/article_reader/article_reader.py
@@ -11,6 +11,7 @@
from dedoc.data_structures.unstructured_document import UnstructuredDocument
from dedoc.extensions import recognized_mimes
from dedoc.readers.base_reader import BaseReader
+from dedoc.structure_extractors.feature_extractors.list_features.list_utils import get_dotted_item_depth
from dedoc.utils.parameter_utils import get_param_document_type
from dedoc.utils.utils import get_mime_extension
@@ -33,7 +34,8 @@ def read(self, file_path: str, parameters: Optional[dict] = None) -> Unstructure
using beautifulsoup library.
As a result, the method fills the class :class:`~dedoc.data_structures.UnstructuredDocument`.
Article reader adds additional information to the `tag_hierarchy_level` of :class:`~dedoc.data_structures.LineMetadata`.
- The method extracts information about ``authors``, ``bibliography items``, ``sections``, and ``tables``.
+ The method extracts information about ``authors``, ``keywords``, ``bibliography items``, ``sections``, and ``tables``.
+ In table cells, the ``colspan`` attribute can be filled according to GROBID's "cols" attribute.
You can find more information about the extracted information from GROBID system on the page :ref:`article_structure`.
Look to the documentation of :meth:`~dedoc.readers.BaseReader.read` to get information about the method's parameters.
@@ -51,12 +53,13 @@ def read(self, file_path: str, parameters: Optional[dict] = None) -> Unstructure
self.logger.warning(warning)
return UnstructuredDocument(tables=[], lines=[], attachments=[], warnings=[warning])
- soup = BeautifulSoup(response.text, features="lxml")
+ soup = BeautifulSoup(response.text, features="xml")
lines = self.__parse_title(soup)
- if soup.biblstruct is not None:
- authors = soup.biblstruct.find_all("author")
+ if soup.biblStruct is not None:
+ authors = soup.biblStruct.find_all("author")
lines += [line for author in authors for line in self.__parse_author(author)]
+ lines += self.__parse_keywords(soup.keywords)
bib_lines, bib2uid = self.__parse_bibliography(soup)
tables, table2uid = self.__parse_tables(soup)
@@ -129,17 +132,19 @@ def __create_line(self, text: str, hierarchy_level_id: Optional[int] = None, par
hierarchy_level = HierarchyLevel(level_1=hierarchy_level_id, level_2=0, can_be_multiline=False, line_type=paragraph_type)
return LineWithMeta(line=text,
- metadata=LineMetadata(page_id=0, line_id=0, tag_hierarchy_level=hierarchy_level, other_fields=other_fields),
+ metadata=LineMetadata(page_id=0, line_id=0, tag_hierarchy_level=hierarchy_level, **other_fields),
annotations=annotations)
def __parse_affiliation(self, affiliation_tag: Tag) -> List[LineWithMeta]:
lines = [self.__create_line(text=affiliation_tag.get("key"), hierarchy_level_id=2, paragraph_type="author_affiliation")]
- if affiliation_tag.orgname:
- lines.append(self.__create_line(text=self.__tag2text(affiliation_tag.orgname), hierarchy_level_id=3, paragraph_type="org_name"))
+ if affiliation_tag.orgName:
+ lines.append(self.__create_line(text=self.__tag2text(affiliation_tag.orgName), hierarchy_level_id=3, paragraph_type="org_name"))
if affiliation_tag.address:
- lines.append(self.__create_line(text=affiliation_tag.address.text, hierarchy_level_id=3, paragraph_type="address"))
+ lines.append(self.__create_line(text=self.__remove_newlines(affiliation_tag.address).get_text(separator=", "),
+ hierarchy_level_id=3,
+ paragraph_type="address"))
return lines
@@ -169,11 +174,11 @@ def __parse_author(self, author_tag: Tag) -> List[LineWithMeta]:
"""
lines = [self.__create_line(text="", hierarchy_level_id=1, paragraph_type="author")]
- first_name = self.__get_tag_by_hierarchy_path(author_tag, ["persname", "forename"])
+ first_name = self.__get_tag_by_hierarchy_path(author_tag, ["persName", "forename"])
if first_name:
lines.append(self.__create_line(text=first_name, hierarchy_level_id=2, paragraph_type="author_first_name"))
- surname = self.__get_tag_by_hierarchy_path(author_tag, ["persname", "surname"])
+ surname = self.__get_tag_by_hierarchy_path(author_tag, ["persName", "surname"])
if surname:
lines.append(self.__create_line(text=surname, hierarchy_level_id=2, paragraph_type="author_surname"))
@@ -187,6 +192,21 @@ def __parse_author(self, author_tag: Tag) -> List[LineWithMeta]:
return lines
+ def __parse_keywords(self, keywords_tag: Tag) -> List[LineWithMeta]:
+ """
+ <keywords>
+ <term>Multi-Object Tracking</term>
+ <term>Data Association</term>
+ <term>Survey</term>
+ </keywords>
+ """
+ if keywords_tag is None:
+ return []
+
+ lines = [self.__create_line(text="", hierarchy_level_id=1, paragraph_type="keywords")]
+ lines += [self.__create_line(text=item.text, hierarchy_level_id=2, paragraph_type="keyword") for item in keywords_tag.find_all("term")]
+ return lines
+
def __create_line_with_refs(self, content: List[Tuple[str, Tag]], bib2uid: dict, table2uid: dict) -> LineWithMeta:
text = ""
start = 0
@@ -219,20 +239,31 @@ def __parse_text(self, soup: Tag, bib2uid: dict, table2uid: dict) -> List[LineWi
lines.append(self.__create_line(text="Abstract", hierarchy_level_id=1, paragraph_type="abstract"))
lines.append(self.__create_line(text=self.__tag2text(abstract)))
- for text in soup.find_all("text"):
- for part in text.find_all("div"):
- # TODO: Beautifulsoup doesn't read tags from input XML file. WTF!
- # As a result we lose section number in text (see example above)
- # Need to fix this in the future.
- number = part.head.get("n") + " " if part.head else ""
- line_text = str(part.contents[0]) if len(part.contents) > 0 else None
- if line_text is not None and len(line_text) > 0:
- lines.append(self.__create_line(text=number + line_text, hierarchy_level_id=1, paragraph_type="section"))
- for subpart in part.find_all("p"):
- if subpart.string is not None:
- lines.append(self.__create_line_with_refs(subpart.string, bib2uid, table2uid))
- elif subpart.contents and len(subpart.contents) > 0:
- lines.append(self.__create_line_with_refs(subpart.contents, bib2uid, table2uid))
+ for part in soup.body.find_all("div"):
+ lines.extend(self.__parse_section(part, bib2uid, table2uid))
+
+ for other_text_type in ("acknowledgement", "annex"):
+ for text_tag in soup.find_all("div", attrs={"type": other_text_type}):
+ for part in text_tag.find_all("div"):
+ lines.extend(self.__parse_section(part, bib2uid, table2uid))
+
+ return lines
+
+ def __parse_section(self, section_tag: Tag, bib2uid: dict, table2uid: dict) -> List[LineWithMeta]:
+ lines = []
+ number = section_tag.head.get("n") if section_tag.head else ""
+ number = number + " " if number else ""
+ section_depth = get_dotted_item_depth(number)
+ section_depth = section_depth if section_depth > 0 else 1
+
+ line_text = section_tag.head.string if section_tag.head else None
+ if line_text is not None and len(line_text) > 0:
+ lines.append(self.__create_line(text=number + line_text, hierarchy_level_id=section_depth, paragraph_type="section"))
+ for subpart in section_tag.find_all("p"):
+ if subpart.string is not None:
+ lines.append(self.__create_line_with_refs(subpart.string + "\n", bib2uid, table2uid))
+ elif subpart.contents and len(subpart.contents) > 0:
+ lines.append(self.__create_line_with_refs(subpart.contents, bib2uid, table2uid))
return lines
@@ -265,12 +296,26 @@ def __parse_tables(self, soup: Tag) -> Tuple[List[Table], dict]:
tag_tables = soup.find_all("figure", {"type": "table"})
for table in tag_tables:
- row_cells = []
+ table_cells = []
head = table.contents[0] if len(table.contents) > 0 and isinstance(table.contents[0], str) else self.__tag2text(table.head)
- title = head + self.__tag2text(table.figdesc)
+ title = head + self.__tag2text(table.figDesc)
for row in table.table.find_all("row"):
- row_cells.append([CellWithMeta(lines=[self.__create_line(self.__tag2text(cell))]) for cell in row.find_all("cell")])
- tables.append(Table(cells=row_cells, metadata=TableMetadata(page_id=0, title=title)))
+ row_cells = []
+ for cell in row.find_all("cell"):
+ cell_text = self.__create_line(self.__tag2text(cell))
+ colspan = int(cell.get("cols", 1))
+ row_cells.append(CellWithMeta(lines=[cell_text], colspan=colspan))
+
+ if colspan > 1:
+ row_cells.extend([CellWithMeta(lines=[cell_text], invisible=True) for _ in range(colspan - 1)])
+
+ table_cells.append(row_cells)
+
+ # ignore empty tables
+ if len(table_cells) == 0:
+ continue
+
+ tables.append(Table(cells=table_cells, metadata=TableMetadata(page_id=0, title=title)))
table2uid["#" + table.get("xml:id")] = tables[-1].metadata.uid
return tables, table2uid
@@ -310,12 +355,12 @@ def __parse_bibliography(self, soup: Tag) -> Tuple[List[LineWithMeta], dict]:
# according GROBID description
level_2_paragraph_type = {"a": "title", "j": "title_journal", "s": "title_series", "m": "title_conference_proceedings"}
- bibliography = soup.find("listbibl", recursive=True)
+ bibliography = soup.find("listBibl", recursive=True)
lines.append(self.__create_line(text="bibliography", hierarchy_level_id=1, paragraph_type="bibliography"))
if not bibliography:
return lines, cites
- bib_items = bibliography.find_all("biblstruct")
+ bib_items = bibliography.find_all("biblStruct")
if not bib_items:
return lines, cites
@@ -331,19 +376,19 @@ def __parse_bibliography(self, soup: Tag) -> Tuple[List[LineWithMeta], dict]:
lines.append(self.__create_line(text=self.__tag2text(title), hierarchy_level_id=3, paragraph_type=paragraph_type))
lines += [ # parse bib authors
- self.__create_line(text=author.get_text(), hierarchy_level_id=3, paragraph_type="author")
+ self.__create_line(text=self.__remove_newlines(author).get_text(separator=" "), hierarchy_level_id=3, paragraph_type="author")
for author in bib_item.find_all("author", recursive=True) if author
]
lines += [ # parse biblScope
self.__create_line(text=self.__tag2text(bibl_scope), hierarchy_level_id=3, paragraph_type="biblScope_volume")
- for bibl_scope in bib_item.find_all("biblscope", {"unit": "volume"}, recursive=True) if bibl_scope
+ for bibl_scope in bib_item.find_all("biblScope", {"unit": "volume"}, recursive=True) if bibl_scope
]
try:
lines += [ # parse values
self.__create_line(text=f"{bibl_scope.get('from')}-{bibl_scope.get('to')}", hierarchy_level_id=3, paragraph_type="biblScope_page")
- for bibl_scope in bib_item.find_all("biblscope", {"unit": "page"}, recursive=True) if bibl_scope
+ for bibl_scope in bib_item.find_all("biblScope", {"unit": "page"}, recursive=True) if bibl_scope
]
finally:
self.logger.warning("Grobid parsing warning: was non-standard format")
@@ -363,3 +408,9 @@ def __parse_bibliography(self, soup: Tag) -> Tuple[List[LineWithMeta], dict]:
def __parse_title(self, soup: Tag) -> List[LineWithMeta]:
return [self.__create_line(text=self.__tag2text(soup.title), hierarchy_level_id=0, paragraph_type="root")]
+
+ def __remove_newlines(self, tag: Tag) -> Tag:
+ for item in tag:
+ if not isinstance(item, Tag):
+ item.extract()
+ return tag
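The `features="lxml"` → `features="xml"` switch is what makes the camelCase fixes above work: bs4's HTML parsers lowercase tag names, so GROBID's TEI tags such as `biblStruct`, `persName` and `orgName` were previously reachable only in lowercase. A small demonstration (requires `lxml`, which backs both parsers):

```python
from bs4 import BeautifulSoup

tei = "<TEI><listBibl><biblStruct/></listBibl></TEI>"
print(BeautifulSoup(tei, features="lxml").find("biblStruct"))  # None: the tag was lowercased
print(BeautifulSoup(tei, features="xml").find("biblStruct"))   # <biblStruct/>
```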
diff --git a/dedoc/readers/html_reader/html_reader.py b/dedoc/readers/html_reader/html_reader.py
index 83fb2085..bea5af70 100644
--- a/dedoc/readers/html_reader/html_reader.py
+++ b/dedoc/readers/html_reader/html_reader.py
@@ -83,7 +83,7 @@ def __handle_block(self, tag: Union[Tag], filepath_hash: str, handle_invisible_t
block_lines = self.__handle_single_tag(tag=tag, filepath_hash=filepath_hash, uid=tag_uid, table=table)
for line in block_lines:
if not getattr(line.metadata, "html_tag", None):
- line.metadata.extend_other_fields({"html_tag": tag.name})
+ line.metadata.html_tag = tag.name
return block_lines
def __handle_single_tag(self, tag: Tag, filepath_hash: str, uid: str, table: Optional[bool] = False) -> List[LineWithMeta]:
@@ -97,7 +97,7 @@ def __handle_single_tag(self, tag: Tag, filepath_hash: str, uid: str, table: Opt
line_type = HierarchyLevel.unknown if header_level == 0 else HierarchyLevel.header
tag_uid = hashlib.md5((uid + text).encode()).hexdigest()
line = self.__make_line(line=text, line_type=line_type, header_level=header_level, uid=tag_uid, filepath_hash=filepath_hash, annotations=annotations)
- line.metadata.extend_other_fields({"html_tag": tag.name})
+ line.metadata.html_tag = tag.name
return [line]
def __read_blocks(self, block: Tag, filepath_hash: str = "", handle_invisible_table: bool = False, table: Optional[bool] = False,
diff --git a/dedoc/readers/pdf_reader/pdf_base_reader.py b/dedoc/readers/pdf_reader/pdf_base_reader.py
index 8372fb92..4dd00c9b 100644
--- a/dedoc/readers/pdf_reader/pdf_base_reader.py
+++ b/dedoc/readers/pdf_reader/pdf_base_reader.py
@@ -53,7 +53,7 @@ class PdfBaseReader(BaseReader):
def __init__(self, *, config: Optional[dict] = None) -> None:
super().__init__(config=config)
- self.config["n_jobs"] = config.get("n_jobs", 1)
+ self.config["n_jobs"] = self.config.get("n_jobs", 1)
self.table_recognizer = TableRecognizer(config=self.config)
self.metadata_extractor = LineMetadataExtractor(config=self.config)
self.attachment_extractor = PDFAttachmentsExtractor(config=self.config)
@@ -88,13 +88,13 @@ def read(self, file_path: str, parameters: Optional[dict] = None) -> Unstructure
attachments_dir=attachments_dir
)
- lines, scan_tables, attachments, warnings, other_fields = self._parse_document(file_path, params_for_parse)
+ lines, scan_tables, attachments, warnings, metadata = self._parse_document(file_path, params_for_parse)
tables = [scan_table.to_table() for scan_table in scan_tables]
if param_utils.get_param_with_attachments(parameters) and self.attachment_extractor.can_extract(file_path):
attachments += self.attachment_extractor.extract(file_path=file_path, parameters=parameters)
- result = UnstructuredDocument(lines=lines, tables=tables, attachments=attachments, warnings=warnings, metadata=other_fields)
+ result = UnstructuredDocument(lines=lines, tables=tables, attachments=attachments, warnings=warnings, metadata=metadata)
return self._postprocess(result)
def _parse_document(self, path: str, parameters: ParametersForParseDoc) -> (
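The `n_jobs` one-liner above fixes a crash rather than a cosmetic issue: `BaseReader.__init__` normalizes `self.config` even when the `config` argument is `None`, so reading from the raw argument raised `AttributeError` for readers constructed without a config. A sketch of the call that the fix makes safe (assuming `BaseReader` substitutes a dict for a `None` config, which the new line relies on):

```python
from dedoc.readers import PdfTxtlayerReader

# config defaults to None; the old `config.get("n_jobs", 1)` raised
# AttributeError here, while `self.config.get("n_jobs", 1)` works.
reader = PdfTxtlayerReader()
```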
diff --git a/dedoc/structure_extractors/__init__.py b/dedoc/structure_extractors/__init__.py
index 404d915c..20f6d350 100644
--- a/dedoc/structure_extractors/__init__.py
+++ b/dedoc/structure_extractors/__init__.py
@@ -4,11 +4,12 @@
from .concrete_structure_extractors.article_structure_extractor import ArticleStructureExtractor
from .concrete_structure_extractors.classifying_law_structure_extractor import ClassifyingLawStructureExtractor
from .concrete_structure_extractors.diploma_structure_extractor import DiplomaStructureExtractor
+from .concrete_structure_extractors.fintoc_structure_extractor import FintocStructureExtractor
from .concrete_structure_extractors.foiv_law_structure_extractor import FoivLawStructureExtractor
from .concrete_structure_extractors.law_structure_excractor import LawStructureExtractor
from .concrete_structure_extractors.tz_structure_extractor import TzStructureExtractor
from .structure_extractor_composition import StructureExtractorComposition
__all__ = ['AbstractStructureExtractor', 'AbstractLawStructureExtractor', 'ArticleStructureExtractor', 'ClassifyingLawStructureExtractor',
- 'DefaultStructureExtractor', 'DiplomaStructureExtractor', 'FoivLawStructureExtractor', 'LawStructureExtractor', 'TzStructureExtractor',
- 'StructureExtractorComposition']
+ 'DefaultStructureExtractor', 'DiplomaStructureExtractor', 'FintocStructureExtractor', 'FoivLawStructureExtractor', 'LawStructureExtractor',
+ 'TzStructureExtractor', 'StructureExtractorComposition']
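With the `__init__.py` change, the new extractor becomes part of the package's public API; for instance (constructing it loads reader and classifier resources, so this only illustrates the import path):

```python
from dedoc.structure_extractors import FintocStructureExtractor

extractor = FintocStructureExtractor(config={})
print(extractor.document_type)  # -> fintoc
```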
diff --git a/dedoc/structure_extractors/concrete_structure_extractors/fintoc_structure_extractor.py b/dedoc/structure_extractors/concrete_structure_extractors/fintoc_structure_extractor.py
new file mode 100644
index 00000000..0d78c783
--- /dev/null
+++ b/dedoc/structure_extractors/concrete_structure_extractors/fintoc_structure_extractor.py
@@ -0,0 +1,134 @@
+import os
+import re
+from typing import Dict, List, Optional, Tuple, Union
+
+import pandas as pd
+
+from dedoc.config import get_config
+from dedoc.data_structures import HierarchyLevel, LineWithMeta, UnstructuredDocument
+from dedoc.structure_extractors import AbstractStructureExtractor
+from dedoc.structure_extractors.feature_extractors.fintoc_feature_extractor import FintocFeatureExtractor
+from dedoc.structure_extractors.feature_extractors.toc_feature_extractor import TOCFeatureExtractor
+from dedoc.structure_extractors.line_type_classifiers.fintoc_classifier import FintocClassifier
+
+
+class FintocStructureExtractor(AbstractStructureExtractor):
+ """
+    This class is an implementation of the TOC extractor for the `FinTOC 2022 Shared task <https://wp.lancs.ac.uk/cfie/fintoc2022/>`_.
+ The code is a modification of the winner's solution (ISP RAS team).
+
+    This structure extractor is used for English, French and Spanish financial prospectuses in PDF format (with a textual layer).
+    It is recommended to use :class:`~dedoc.readers.PdfTxtlayerReader` to obtain document lines.
+    You can find a more detailed description of this type of structure in the section :ref:`fintoc_structure`.
+ """
+ document_type = "fintoc"
+
+ def __init__(self, *, config: Optional[dict] = None) -> None:
+ super().__init__(config=config)
+ from dedoc.readers import PdfTxtlayerReader # to exclude circular imports
+ self.pdf_reader = PdfTxtlayerReader(config=self.config)
+ self.toc_extractor = TOCFeatureExtractor()
+ self.features_extractor = FintocFeatureExtractor()
+ self.languages = ("en", "fr", "sp")
+ path = os.path.join(get_config()["resources_path"], "fintoc_classifiers")
+ self.classifiers = {language: FintocClassifier(language=language, weights_dir_path=path) for language in self.languages}
+ self.toc_item_regexp = re.compile(r'"([^"]+)" (\d+)')
+ self.empty_string_regexp = re.compile(r"^\s*\n$")
+
+ def extract(self, document: UnstructuredDocument, parameters: Optional[dict] = None, file_path: Optional[str] = None) -> UnstructuredDocument:
+ """
+        According to the `FinTOC 2022 <https://wp.lancs.ac.uk/cfie/fintoc2022/>`_ title detection task, lines are classified as titles and non-titles.
+ The information about titles is saved in ``line.metadata.hierarchy_level`` (:class:`~dedoc.data_structures.HierarchyLevel` class):
+
+        - Title lines have ``HierarchyLevel.header`` type, and their depth (``HierarchyLevel.level_2``) corresponds to \
+        the depth of the TOC item from the FinTOC 2022 TOC generation task.
+        - Non-title lines have ``HierarchyLevel.raw_text`` type, and their depth isn't determined.
+
+ :param document: document content that has been received from some of the readers (:class:`~dedoc.readers.PdfTxtlayerReader` is recommended).
+        :param parameters: for this structure extractor, the "language" parameter is used for setting the document's language, e.g. ``parameters={"language": "en"}``. \
+ The following options are supported:
+
+ * "en", "eng" - English (default);
+ * "fr", "fra" - French;
+ * "sp", "spa" - Spanish.
+ :param file_path: path to the file on disk.
+ :return: document content with added additional information about title/non-title lines and hierarchy levels of titles.
+ """
+ parameters = {} if parameters is None else parameters
+ language = self.__get_param_language(parameters=parameters)
+
+ features, documents = self.get_features(documents_dict={file_path: document.lines})
+ predictions = self.classifiers[language].predict(features)
+ lines: List[LineWithMeta] = documents[0]
+ assert len(lines) == len(predictions)
+
+ for line, prediction in zip(lines, predictions):
+ if prediction > 0:
+ line.metadata.hierarchy_level = HierarchyLevel(level_1=1, level_2=prediction, line_type=HierarchyLevel.header, can_be_multiline=True)
+ else:
+ line.metadata.hierarchy_level = HierarchyLevel.create_raw_text()
+ document.lines = lines
+
+ return document
+
+ def __get_param_language(self, parameters: dict) -> str:
+ language = parameters.get("language", "en")
+
+ if language in ("en", "eng", "rus+eng"):
+ return "en"
+
+ if language in ("fr", "fra"):
+ return "fr"
+
+ if language in ("sp", "spa"):
+ return "sp"
+
+ if language not in self.languages:
+ self.logger.warning(f"Language {language} is not supported by this extractor. Use default language (en)")
+ return "en"
+
+ def get_features(self, documents_dict: Dict[str, List[LineWithMeta]]) -> Tuple[pd.DataFrame, List[List[LineWithMeta]]]:
+ toc_lines, documents = [], []
+ for file_path, document_lines in documents_dict.items():
+ toc_lines.append(self.__get_toc(file_path=file_path))
+ documents.append(self.__filter_lines(document_lines))
+ features = self.features_extractor.transform(documents=documents, toc_lines=toc_lines)
+ return features, documents
+
+ def __filter_lines(self, lines: List[LineWithMeta]) -> List[LineWithMeta]:
+ special_unicode_symbols = [u"\uf0b7", u"\uf0d8", u"\uf084", u"\uf0a7", u"\uf0f0", u"\x83"]
+
+ lines = [line for line in lines if not self.empty_string_regexp.match(line.line)]
+ for line in lines:
+ for ch in special_unicode_symbols:
+ line.set_line(line.line.replace(ch, ""))
+
+ return lines
+
+ def __get_toc(self, file_path: Optional[str]) -> List[Dict[str, Union[LineWithMeta, str]]]:
+ """
+        Try to get the TOC from the PDF automatically; if that fails, extract it using regular expressions.
+ """
+ if file_path is None or not file_path.lower().endswith(".pdf"):
+ return []
+
+ toc = self.__get_automatic_toc(path=file_path)
+ if len(toc) > 0:
+ self.logger.info(f"Got automatic TOC from {os.path.basename(file_path)}")
+ return toc
+
+ parameters = {"is_one_column_document": "True", "need_header_footer_analysis": "True", "pages": ":10"}
+ lines = self.pdf_reader.read(file_path=file_path, parameters=parameters).lines
+ return self.toc_extractor.get_toc(lines)
+
+ def __get_automatic_toc(self, path: str) -> List[Dict[str, Union[LineWithMeta, str]]]:
+ result = []
+ with os.popen(f'pdftocio -p "{path}"') as out:
+ toc = out.readlines()
+
+ for line in toc:
+ match = self.toc_item_regexp.match(line.strip())
+ if match:
+ result.append({"line": LineWithMeta(match.group(1)), "page": match.group(2)})
+
+ return result
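An end-to-end usage sketch for the new extractor, following its docstring's recommendation to obtain lines with `PdfTxtlayerReader` (the file name is a placeholder):

```python
from dedoc.data_structures import HierarchyLevel
from dedoc.readers import PdfTxtlayerReader
from dedoc.structure_extractors import FintocStructureExtractor

reader = PdfTxtlayerReader(config={})
extractor = FintocStructureExtractor(config={})

document = reader.read(file_path="prospectus.pdf")
document = extractor.extract(document, parameters={"language": "en"}, file_path="prospectus.pdf")

for line in document.lines:
    if line.metadata.hierarchy_level.line_type == HierarchyLevel.header:
        print(line.metadata.hierarchy_level.level_2, line.line.strip())  # title depth + text
```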
diff --git a/dedoc/structure_extractors/feature_extractors/fintoc_feature_extractor.py b/dedoc/structure_extractors/feature_extractors/fintoc_feature_extractor.py
new file mode 100644
index 00000000..82e53111
--- /dev/null
+++ b/dedoc/structure_extractors/feature_extractors/fintoc_feature_extractor.py
@@ -0,0 +1,158 @@
+import re
+from collections import defaultdict
+from typing import Dict, Iterator, List, Optional, Tuple
+
+import pandas as pd
+from Levenshtein._levenshtein import ratio
+
+from dedoc.data_structures.line_with_meta import LineWithMeta
+from dedoc.structure_extractors.feature_extractors.abstract_extractor import AbstractFeatureExtractor
+from dedoc.structure_extractors.feature_extractors.list_features.list_features_extractor import ListFeaturesExtractor
+from dedoc.structure_extractors.feature_extractors.list_features.prefix.any_letter_prefix import AnyLetterPrefix
+from dedoc.structure_extractors.feature_extractors.list_features.prefix.bracket_prefix import BracketPrefix
+from dedoc.structure_extractors.feature_extractors.list_features.prefix.bracket_roman_prefix import BracketRomanPrefix
+from dedoc.structure_extractors.feature_extractors.list_features.prefix.bullet_prefix import BulletPrefix
+from dedoc.structure_extractors.feature_extractors.list_features.prefix.dotted_prefix import DottedPrefix
+from dedoc.structure_extractors.feature_extractors.list_features.prefix.empty_prefix import EmptyPrefix
+from dedoc.structure_extractors.feature_extractors.list_features.prefix.letter_prefix import LetterPrefix
+from dedoc.structure_extractors.feature_extractors.list_features.prefix.roman_prefix import RomanPrefix
+from dedoc.structure_extractors.feature_extractors.paired_feature_extractor import PairedFeatureExtractor
+from dedoc.structure_extractors.feature_extractors.toc_feature_extractor import TOCFeatureExtractor
+from dedoc.structure_extractors.feature_extractors.utils_feature_extractor import normalization_by_min_max
+from dedoc.structure_extractors.hierarchy_level_builders.utils_reg import regexps_year
+
+
+class FintocFeatureExtractor(AbstractFeatureExtractor):
+
+ def __init__(self) -> None:
+ self.paired_feature_extractor = PairedFeatureExtractor()
+ self.prefix_list = [BulletPrefix, AnyLetterPrefix, LetterPrefix, BracketPrefix, BracketRomanPrefix, DottedPrefix, RomanPrefix]
+ self.list_feature_extractors = [
+ ListFeaturesExtractor(window_size=10, prefix_list=self.prefix_list),
+ ListFeaturesExtractor(window_size=25, prefix_list=self.prefix_list),
+ ListFeaturesExtractor(window_size=100, prefix_list=self.prefix_list)
+ ]
+ self.prefix2number = {prefix.name: i for i, prefix in enumerate(self.prefix_list, start=1)}
+ self.prefix2number[EmptyPrefix.name] = 0
+
+ def parameters(self) -> dict:
+ return {}
+
+ def fit(self, documents: List[List[LineWithMeta]], y: Optional[List[str]] = None) -> "AbstractFeatureExtractor":
+ return self
+
+ def transform(self, documents: List[List[LineWithMeta]], y: Optional[List[str]] = None, toc_lines: Optional[List[List[dict]]] = None) -> pd.DataFrame:
+ assert len(documents) > 0
+ result_matrix = pd.concat([self.__process_document(document, d_toc_lines) for document, d_toc_lines in zip(documents, toc_lines)], ignore_index=True)
+ result_matrix = pd.concat([result_matrix, self.paired_feature_extractor.transform(documents)], axis=1)
+ features = sorted(result_matrix.columns)
+ result_matrix = result_matrix[features].astype(float)
+ return result_matrix[features]
+
+ def __process_document(self, lines: List[LineWithMeta], toc: Optional[list] = None) -> pd.DataFrame:
+ features_df = pd.DataFrame(self.__look_at_prev_line(document=lines, n=1))
+ features_df["line_relative_length"] = self.__get_line_relative_length(lines)
+
+ list_features = pd.concat([f_e.one_document(lines)[1] for f_e in self.list_feature_extractors], axis=1)
+
+ page_ids = [line.metadata.page_id for line in lines]
+ start_page, finish_page = (min(page_ids), max(page_ids)) if page_ids else (0, 0)
+
+ total_lines = len(lines)
+ one_line_features_dict = defaultdict(list)
+ for line in lines:
+ for item in self.__one_line_features(line, total_lines, start_page=start_page, finish_page=finish_page, toc=toc):
+ feature_name, feature = item[0], item[1]
+ one_line_features_dict[feature_name].append(feature)
+
+ one_line_features_df = pd.DataFrame(one_line_features_dict)
+ one_line_features_df["font_size"] = self._normalize_features(one_line_features_df.font_size)
+
+ one_line_features_df = self.prev_next_line_features(one_line_features_df, 3, 3)
+ result_matrix = pd.concat([one_line_features_df, features_df, list_features], axis=1)
+ result_matrix["page_id"] = [line.metadata.page_id for line in lines]
+ return result_matrix
+
+ def __look_at_prev_line(self, document: List[LineWithMeta], n: int = 1) -> Dict[str, List]:
+ """
+        Look at the previous line and compare it with the current line.
+
+        :param document: list of lines
+        :param n: how many lines back to look
+ :return: dict of features
+ """
+ res = defaultdict(list)
+ for line_id, _ in enumerate(document):
+ if line_id >= n:
+ prev_line = document[line_id - n]
+ res["prev_line_ends"].append(prev_line.line.endswith((".", ";")))
+ res["prev_ends_with_colon"].append(prev_line.line.endswith(":"))
+ res["prev_is_space"].append(prev_line.line.lower().isspace())
+ else:
+ res["prev_line_ends"].append(False)
+ res["prev_ends_with_colon"].append(False)
+ res["prev_is_space"].append(False)
+ return res
+
+ def __get_line_relative_length(self, lines: List[LineWithMeta]) -> List[float]:
+ max_len = max([len(line.line) for line in lines])
+ relative_lengths = [len(line.line) / max_len for line in lines]
+ return relative_lengths
+
+ def __one_line_features(self, line: LineWithMeta, total_lines: int, start_page: int, finish_page: int, toc: Optional[list]) -> Iterator[tuple]:
+ yield "normalized_page_id", normalization_by_min_max(line.metadata.page_id, min_v=start_page, max_v=finish_page)
+ yield "indentation", self._get_indentation(line)
+ yield "spacing", self._get_spacing(line)
+ yield "bold", self._get_bold(line)
+ yield "italic", self._get_italic(line)
+ yield from self._get_color(line)
+ yield "font_size", self._get_size(line)
+
+ yield "line_id", normalization_by_min_max(line.metadata.line_id, min_v=0, max_v=total_lines)
+ yield "num_year_regexp", len(regexps_year.findall(line.line))
+ yield "endswith_dot", line.line.endswith(".")
+ yield "endswith_semicolon", line.line.endswith(";")
+ yield "endswith_colon", line.line.endswith(":")
+ yield "endswith_comma", line.line.endswith(",")
+ yield "startswith_bracket", line.line.strip().startswith(("(", "{"))
+
+ bracket_cnt = 0
+ for char in line.line:
+ if char == "(":
+ bracket_cnt += 1
+ elif char == ")":
+ bracket_cnt = max(0, bracket_cnt - 1)
+ yield "bracket_num", bracket_cnt
+
+ probable_toc_title = re.sub(r"[\s:]", "", line.line).lower()
+ yield "is_toc_title", probable_toc_title in TOCFeatureExtractor.titles
+ yield from self.__find_in_toc(line, toc)
+
+ line_length = len(line.line) + 1
+ yield "supper_percent", sum((1 for letter in line.line if letter.isupper())) / line_length
+ yield "letter_percent", sum((1 for letter in line.line if letter.isalpha())) / line_length
+ yield "number_percent", sum((1 for letter in line.line if letter.isnumeric())) / line_length
+ yield "words_number", len(line.line.split())
+
+ def __find_in_toc(self, line: LineWithMeta, toc: Optional[List[dict]]) -> Iterator[Tuple[str, int]]:
+ if toc is None:
+ yield "is_toc", 0
+ yield "in_toc", 0
+ yield "toc_exists", 0
+ else:
+ is_toc, in_toc, toc_exists = 0, 0, int(len(toc) > 0)
+ line_text = line.line.lower().strip()
+ for item in toc:
+ if ratio(line_text, item["line"].line.lower()) < 0.8:
+ continue
+ # toc entry found
+ try:
+ is_toc = 0 if line.metadata.page_id + 1 == int(item["page"]) else 1
+ in_toc = 1 if line.metadata.page_id + 1 == int(item["page"]) else 0
+ except TypeError:
+ pass
+ break
+
+ yield "is_toc", is_toc
+ yield "in_toc", in_toc
+ yield "toc_exists", toc_exists
diff --git a/dedoc/structure_extractors/feature_extractors/paired_feature_extractor.py b/dedoc/structure_extractors/feature_extractors/paired_feature_extractor.py
new file mode 100755
index 00000000..130a5560
--- /dev/null
+++ b/dedoc/structure_extractors/feature_extractors/paired_feature_extractor.py
@@ -0,0 +1,87 @@
+import json
+from typing import Callable, List, Optional
+
+import numpy as np
+import pandas as pd
+
+from dedoc.data_structures.concrete_annotations.bbox_annotation import BBoxAnnotation
+from dedoc.data_structures.concrete_annotations.size_annotation import SizeAnnotation
+from dedoc.data_structures.line_with_meta import LineWithMeta
+from dedoc.structure_extractors.feature_extractors.abstract_extractor import AbstractFeatureExtractor
+from dedoc.utils.utils import flatten
+
+
+class PairedFeatureExtractor(AbstractFeatureExtractor):
+ """
+ This class is used as an auxiliary feature extractor to the main extractor.
+ It allows to add "raw" features related to the lines importance.
+ Based on one line property (size, indentation) it computes a raw line's depth inside the document tree.
+
+ Example:
+ For lines
+ line1 (size=16)
+ line2 (size=14)
+ line3 (size=12)
+ line4 (size=12)
+ line5 (size=14)
+ line6 (size=12)
+ We will obtain a feature vector (raw_depth_size)
+ [0, 1, 2, 2, 1, 2]
+ """
+
+ def parameters(self) -> dict:
+ return {}
+
+ def fit(self, documents: List[List[LineWithMeta]], y: Optional[List[str]] = None) -> "AbstractFeatureExtractor":
+ return self
+
+ def transform(self, documents: List[List[LineWithMeta]], y: Optional[List[str]] = None) -> pd.DataFrame:
+ df = pd.DataFrame()
+ df["raw_depth_size"] = list(flatten([self._handle_one_document(document, self.__get_size) for document in documents]))
+ df["raw_depth_indentation"] = list(flatten([self._handle_one_document(document, self._get_indentation) for document in documents]))
+ return df
+
+    def _handle_one_document(self, document: List[LineWithMeta], get_feature: Callable) -> List[int]:
+ if len(document) == 0:
+ return []
+ if len(document) == 1:
+ return [0]
+
+ features = [get_feature(line) for line in document]
+ std = np.std(features)
+ result = []
+ stack = []
+
+ for line in document:
+ while len(stack) > 0 and self.__compare_lines(stack[-1], line, get_feature, std) <= 0: # noqa
+ stack.pop()
+ result.append(len(stack))
+ stack.append(line)
+
+ return result
+
+ def __get_size(self, line: LineWithMeta) -> float:
+ annotations = line.annotations
+ size_annotation = [annotation for annotation in annotations if annotation.name == SizeAnnotation.name]
+ if len(size_annotation) > 0:
+ return float(size_annotation[0].value)
+
+ bbox_annotation = [annotation for annotation in annotations if annotation.name == BBoxAnnotation.name]
+ if len(bbox_annotation) > 0:
+ bbox = json.loads(bbox_annotation[0].value)
+ return bbox["height"]
+
+ return 0
+
+    def __compare_lines(self, first_line: LineWithMeta, second_line: LineWithMeta, get_feature: Callable, threshold: float = 0) -> int:
+ first_feature = get_feature(first_line)
+ second_feature = get_feature(second_line)
+
+ if first_feature > second_feature + threshold:
+ return 1
+
+ if second_feature > first_feature + threshold:
+ return -1
+
+ return 0
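The stack loop in `_handle_one_document` is easier to see on bare numbers: each line's raw depth is the number of "still open" lines with a strictly larger feature value. A minimal sketch reproducing the docstring example, with the std-based threshold set to zero for clarity:

```python
from typing import List

def raw_depth(values: List[float]) -> List[int]:
    """Depth of each value = number of strictly larger values still open before it."""
    result, stack = [], []
    for value in values:
        while stack and stack[-1] <= value:  # pop anything not strictly larger
            stack.pop()
        result.append(len(stack))
        stack.append(value)
    return result

print(raw_depth([16, 14, 12, 12, 14, 12]))  # -> [0, 1, 2, 2, 1, 2], as in the docstring
```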
diff --git a/dedoc/structure_extractors/feature_extractors/toc_feature_extractor.py b/dedoc/structure_extractors/feature_extractors/toc_feature_extractor.py
index 28fab042..a0000e0a 100644
--- a/dedoc/structure_extractors/feature_extractors/toc_feature_extractor.py
+++ b/dedoc/structure_extractors/feature_extractors/toc_feature_extractor.py
@@ -1,5 +1,5 @@
import re
-from typing import List, Optional, Tuple, Union
+from typing import Dict, List, Optional, Tuple, Union
import numpy as np
from Levenshtein._levenshtein import ratio
@@ -17,11 +17,11 @@ class TOCFeatureExtractor:
"indice", "índice", "contenidos", "tabladecontenido" # spanish
)
- def get_toc(self, document: List[LineWithMeta]) -> List[dict]:
+ def get_toc(self, document: List[LineWithMeta]) -> List[Dict[str, Union[LineWithMeta, str]]]:
"""
Finds the table of contents in the given document
Returns:
- list of dictionaries with toc item and page number where it is located: {"line", "page"}
+ list of dictionaries with toc item (LineWithMeta) and page number where it is located: {"line", "page"}
"""
corrected_lines, marks = self.__get_probable_toc(document)
diff --git a/dedoc/structure_extractors/line_type_classifiers/fintoc_classifier.py b/dedoc/structure_extractors/line_type_classifiers/fintoc_classifier.py
new file mode 100755
index 00000000..9e00e819
--- /dev/null
+++ b/dedoc/structure_extractors/line_type_classifiers/fintoc_classifier.py
@@ -0,0 +1,95 @@
+import gzip
+import logging
+import os
+import pickle
+from typing import Dict, List, Optional, Union
+
+import numpy as np
+import pandas as pd
+import xgbfir
+from xgboost import XGBClassifier
+
+from dedoc.download_models import download_from_hub
+
+
+class FintocClassifier:
+ """
+ Classifier of financial documents for the FinTOC 2022 Shared task (https://wp.lancs.ac.uk/cfie/fintoc2022/).
+ Lines are classified in two stages:
+ 1. Binary classification title/not title (title detection task)
+ 2. Classification of title lines into title depth classes (1-6) (TOC generation task)
+
+    More important lines have a smaller depth.
+ As a result:
+ 1. For non-title lines, classifier returns -1.
+ 2. For title lines, classifier returns their depth (from 1 to 6).
+ """
+
+ def __init__(self, language: str, weights_dir_path: Optional[str] = None) -> None:
+ """
+ :param language: language of data ("en", "fr", "sp")
+ :param weights_dir_path: path to directory with trained models weights
+ """
+ self.weights_dir_path = weights_dir_path
+ self.language = language
+ self.classifiers = {"binary": None, "target": None}
+
+ def predict(self, features: pd.DataFrame) -> List[int]:
+ """
+ Two-staged classification: title/not title and depth classification for titles.
+ For non-title lines, classifier returns -1, for title lines, classifier returns their depth (from 1 to 6).
+ """
+ binary_predictions = self.binary_classifier.predict(features)
+        # binary_predictions = [True, False, ...]; target classes are predicted only for the True items
+ target_predictions = self.target_classifier.predict(features[binary_predictions])
+ result = np.ones_like(binary_predictions) * -1
+ result[binary_predictions] = target_predictions
+        # return a list like [1, 2, 3, -1, -1, ...], where positive values mean header depth and -1 means a non-header line
+ return list(result)
+
+ def fit(self,
+ binary_classifier_parameters: Dict[str, Union[int, float, str]],
+ target_classifier_parameters: Dict[str, Union[int, float, str]],
+ features: pd.DataFrame,
+ features_names: List[str]) -> None:
+ self.classifiers["binary"] = XGBClassifier(**binary_classifier_parameters)
+ self.classifiers["target"] = XGBClassifier(**target_classifier_parameters)
+ self.binary_classifier.fit(features[features_names], features.label != -1)
+ self.target_classifier.fit(features[features_names][features.label != -1], features.label[features.label != -1])
+
+ def save(self, classifiers_dir_path: str, features_importances_dir_path: str, logger: logging.Logger, features_names: List[str], reader: str) -> None:
+ os.makedirs(classifiers_dir_path, exist_ok=True)
+ for classifier_type in ("binary", "target"):
+ with gzip.open(os.path.join(classifiers_dir_path, f"{classifier_type}_classifier_{self.language}_{reader}.pkg.gz"), "wb") as output_file:
+ pickle.dump(self.classifiers[classifier_type], output_file)
+ logger.info(f"Classifiers were saved in {classifiers_dir_path} directory")
+
+ os.makedirs(features_importances_dir_path, exist_ok=True)
+ for classifier_type in ("binary", "target"):
+ xgbfir.saveXgbFI(self.classifiers[classifier_type], feature_names=features_names,
+ OutputXlsxFile=os.path.join(features_importances_dir_path, f"feature_importances_{classifier_type}_{self.language}_{reader}.xlsx"))
+ logger.info(f"Features importances were saved in {features_importances_dir_path} directory")
+
+ @property
+ def binary_classifier(self) -> XGBClassifier:
+ return self.__lazy_load_weights("binary")
+
+ @property
+ def target_classifier(self) -> XGBClassifier:
+ return self.__lazy_load_weights("target")
+
+ def __lazy_load_weights(self, classifier_type: str) -> XGBClassifier:
+ if self.classifiers[classifier_type] is None:
+ assert self.weights_dir_path is not None
+ file_name = f"{classifier_type}_classifier_{self.language}.pkg.gz"
+ classifier_path = os.path.join(self.weights_dir_path, file_name)
+ if not os.path.isfile(classifier_path):
+ download_from_hub(out_dir=self.weights_dir_path,
+ out_name=file_name,
+ repo_name="fintoc_classifiers",
+ hub_name=f"{classifier_type}_classifier_{self.language}_txt_layer.pkg.gz")
+
+ with gzip.open(classifier_path, "rb") as input_file:
+ self.classifiers[classifier_type] = pickle.load(file=input_file)
+
+ return self.classifiers[classifier_type]
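The two-stage `predict` boils down to filling a vector of -1s with depths only where the binary classifier fired. A NumPy sketch with stubbed classifier outputs:

```python
import numpy as np

# Stubbed outputs for six lines: the binary mask says which lines are titles,
# and depths are predicted only for those lines, in order of appearance.
binary_predictions = np.array([True, False, True, True, False, False])
target_predictions = np.array([1, 2, 3])

result = np.ones_like(binary_predictions, dtype=int) * -1
result[binary_predictions] = target_predictions
print(list(result))  # -> [1, -1, 2, 3, -1, -1]
```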
diff --git a/docs/source/_static/add_new_structure_type/article_classifier_000000_UX6.json b/docs/source/_static/add_new_structure_type/article_classifier_000000_UX6.json
index c7e3da40..881a3c21 100644
--- a/docs/source/_static/add_new_structure_type/article_classifier_000000_UX6.json
+++ b/docs/source/_static/add_new_structure_type/article_classifier_000000_UX6.json
@@ -33,8 +33,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10005,
- "_LineMetadata__other_fields": {}
+ "line_id": 10005
},
"_annotations": [
{
@@ -184,8 +183,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10007,
- "_LineMetadata__other_fields": {}
+ "line_id": 10007
},
"_annotations": [
{
@@ -279,8 +277,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10012,
- "_LineMetadata__other_fields": {}
+ "line_id": 10012
},
"_annotations": [
{
@@ -437,8 +434,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10018,
- "_LineMetadata__other_fields": {}
+ "line_id": 10018
},
"_annotations": [
{
@@ -588,8 +584,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10027,
- "_LineMetadata__other_fields": {}
+ "line_id": 10027
},
"_annotations": [
{
@@ -781,8 +776,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10029,
- "_LineMetadata__other_fields": {}
+ "line_id": 10029
},
"_annotations": [
{
@@ -876,8 +870,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10040,
- "_LineMetadata__other_fields": {}
+ "line_id": 10040
},
"_annotations": [
{
@@ -1097,8 +1090,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10053,
- "_LineMetadata__other_fields": {}
+ "line_id": 10053
},
"_annotations": [
{
@@ -1346,8 +1338,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10066,
- "_LineMetadata__other_fields": {}
+ "line_id": 10066
},
"_annotations": [
{
@@ -1595,8 +1586,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10077,
- "_LineMetadata__other_fields": {}
+ "line_id": 10077
},
"_annotations": [
{
@@ -1816,8 +1806,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10087,
- "_LineMetadata__other_fields": {}
+ "line_id": 10087
},
"_annotations": [
{
@@ -2023,8 +2012,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10096,
- "_LineMetadata__other_fields": {}
+ "line_id": 10096
},
"_annotations": [
{
@@ -2216,8 +2204,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10107,
- "_LineMetadata__other_fields": {}
+ "line_id": 10107
},
"_annotations": [
{
@@ -2437,8 +2424,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10116,
- "_LineMetadata__other_fields": {}
+ "line_id": 10116
},
"_annotations": [
{
@@ -2630,8 +2616,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10126,
- "_LineMetadata__other_fields": {}
+ "line_id": 10126
},
"_annotations": [
{
@@ -2837,8 +2822,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10139,
- "_LineMetadata__other_fields": {}
+ "line_id": 10139
},
"_annotations": [
{
@@ -3086,8 +3070,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10149,
- "_LineMetadata__other_fields": {}
+ "line_id": 10149
},
"_annotations": [
{
@@ -3293,8 +3276,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10154,
- "_LineMetadata__other_fields": {}
+ "line_id": 10154
},
"_annotations": [
{
@@ -3430,8 +3412,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10165,
- "_LineMetadata__other_fields": {}
+ "line_id": 10165
},
"_annotations": [
{
@@ -3651,8 +3632,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10172,
- "_LineMetadata__other_fields": {}
+ "line_id": 10172
},
"_annotations": [
{
@@ -3816,8 +3796,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10174,
- "_LineMetadata__other_fields": {}
+ "line_id": 10174
},
"_annotations": [
{
@@ -3911,8 +3890,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10186,
- "_LineMetadata__other_fields": {}
+ "line_id": 10186
},
"_annotations": [
{
@@ -4146,8 +4124,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10200,
- "_LineMetadata__other_fields": {}
+ "line_id": 10200
},
"_annotations": [
{
@@ -4409,8 +4386,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10213,
- "_LineMetadata__other_fields": {}
+ "line_id": 10213
},
"_annotations": [
{
@@ -4658,8 +4634,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10227,
- "_LineMetadata__other_fields": {}
+ "line_id": 10227
},
"_annotations": [
{
@@ -4921,8 +4896,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10238,
- "_LineMetadata__other_fields": {}
+ "line_id": 10238
},
"_annotations": [
{
@@ -5142,8 +5116,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10252,
- "_LineMetadata__other_fields": {}
+ "line_id": 10252
},
"_annotations": [
{
@@ -5405,8 +5378,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10265,
- "_LineMetadata__other_fields": {}
+ "line_id": 10265
},
"_annotations": [
{
@@ -5654,8 +5626,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10278,
- "_LineMetadata__other_fields": {}
+ "line_id": 10278
},
"_annotations": [
{
@@ -5903,8 +5874,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10284,
- "_LineMetadata__other_fields": {}
+ "line_id": 10284
},
"_annotations": [
{
@@ -6054,8 +6024,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10296,
- "_LineMetadata__other_fields": {}
+ "line_id": 10296
},
"_annotations": [
{
@@ -6289,8 +6258,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10308,
- "_LineMetadata__other_fields": {}
+ "line_id": 10308
},
"_annotations": [
{
@@ -6524,8 +6492,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10320,
- "_LineMetadata__other_fields": {}
+ "line_id": 10320
},
"_annotations": [
{
@@ -6759,8 +6726,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10335,
- "_LineMetadata__other_fields": {}
+ "line_id": 10335
},
"_annotations": [
{
@@ -7036,8 +7002,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10348,
- "_LineMetadata__other_fields": {}
+ "line_id": 10348
},
"_annotations": [
{
@@ -7285,8 +7250,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10361,
- "_LineMetadata__other_fields": {}
+ "line_id": 10361
},
"_annotations": [
{
@@ -7534,8 +7498,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10375,
- "_LineMetadata__other_fields": {}
+ "line_id": 10375
},
"_annotations": [
{
@@ -7797,8 +7760,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10383,
- "_LineMetadata__other_fields": {}
+ "line_id": 10383
},
"_annotations": [
{
@@ -7976,8 +7938,7 @@
"line_type": "list_item"
},
"page_id": 0,
- "line_id": 10389,
- "_LineMetadata__other_fields": {}
+ "line_id": 10389
},
"_annotations": [
{
@@ -8127,8 +8088,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10401,
- "_LineMetadata__other_fields": {}
+ "line_id": 10401
},
"_annotations": [
{
@@ -8362,8 +8322,7 @@
"line_type": "raw_text"
},
"page_id": 0,
- "line_id": 10402,
- "_LineMetadata__other_fields": {}
+ "line_id": 10402
},
"_annotations": [
{
@@ -8450,8 +8409,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20011,
- "_LineMetadata__other_fields": {}
+ "line_id": 20011
},
"_annotations": [
{
@@ -8685,8 +8643,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20024,
- "_LineMetadata__other_fields": {}
+ "line_id": 20024
},
"_annotations": [
{
@@ -8934,8 +8891,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20038,
- "_LineMetadata__other_fields": {}
+ "line_id": 20038
},
"_annotations": [
{
@@ -9197,8 +9153,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20048,
- "_LineMetadata__other_fields": {}
+ "line_id": 20048
},
"_annotations": [
{
@@ -9404,8 +9359,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20059,
- "_LineMetadata__other_fields": {}
+ "line_id": 20059
},
"_annotations": [
{
@@ -9625,8 +9579,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20072,
- "_LineMetadata__other_fields": {}
+ "line_id": 20072
},
"_annotations": [
{
@@ -9874,8 +9827,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20073,
- "_LineMetadata__other_fields": {}
+ "line_id": 20073
},
"_annotations": [
{
@@ -9955,8 +9907,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20086,
- "_LineMetadata__other_fields": {}
+ "line_id": 20086
},
"_annotations": [
{
@@ -10204,8 +10155,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20098,
- "_LineMetadata__other_fields": {}
+ "line_id": 20098
},
"_annotations": [
{
@@ -10439,8 +10389,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20109,
- "_LineMetadata__other_fields": {}
+ "line_id": 20109
},
"_annotations": [
{
@@ -10660,8 +10609,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20121,
- "_LineMetadata__other_fields": {}
+ "line_id": 20121
},
"_annotations": [
{
@@ -10895,8 +10843,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20134,
- "_LineMetadata__other_fields": {}
+ "line_id": 20134
},
"_annotations": [
{
@@ -11144,8 +11091,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20143,
- "_LineMetadata__other_fields": {}
+ "line_id": 20143
},
"_annotations": [
{
@@ -11337,8 +11283,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20156,
- "_LineMetadata__other_fields": {}
+ "line_id": 20156
},
"_annotations": [
{
@@ -11586,8 +11531,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20171,
- "_LineMetadata__other_fields": {}
+ "line_id": 20171
},
"_annotations": [
{
@@ -11863,8 +11807,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20182,
- "_LineMetadata__other_fields": {}
+ "line_id": 20182
},
"_annotations": [
{
@@ -12084,8 +12027,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20196,
- "_LineMetadata__other_fields": {}
+ "line_id": 20196
},
"_annotations": [
{
@@ -12347,8 +12289,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20209,
- "_LineMetadata__other_fields": {}
+ "line_id": 20209
},
"_annotations": [
{
@@ -12596,8 +12537,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20219,
- "_LineMetadata__other_fields": {}
+ "line_id": 20219
},
"_annotations": [
{
@@ -12803,8 +12743,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20232,
- "_LineMetadata__other_fields": {}
+ "line_id": 20232
},
"_annotations": [
{
@@ -13052,8 +12991,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20243,
- "_LineMetadata__other_fields": {}
+ "line_id": 20243
},
"_annotations": [
{
@@ -13273,8 +13211,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20245,
- "_LineMetadata__other_fields": {}
+ "line_id": 20245
},
"_annotations": [
{
@@ -13368,8 +13305,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20253,
- "_LineMetadata__other_fields": {}
+ "line_id": 20253
},
"_annotations": [
{
@@ -13547,8 +13483,7 @@
"line_type": "raw_text"
},
"page_id": 1,
- "line_id": 20262,
- "_LineMetadata__other_fields": {}
+ "line_id": 20262
},
"_annotations": [
{
@@ -13747,8 +13682,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30003,
- "_LineMetadata__other_fields": {}
+ "line_id": 30003
},
"_annotations": [
{
@@ -13870,8 +13804,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30014,
- "_LineMetadata__other_fields": {}
+ "line_id": 30014
},
"_annotations": [
{
@@ -14091,8 +14024,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30028,
- "_LineMetadata__other_fields": {}
+ "line_id": 30028
},
"_annotations": [
{
@@ -14368,8 +14300,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30041,
- "_LineMetadata__other_fields": {}
+ "line_id": 30041
},
"_annotations": [
{
@@ -14617,8 +14548,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30056,
- "_LineMetadata__other_fields": {}
+ "line_id": 30056
},
"_annotations": [
{
@@ -14894,8 +14824,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30069,
- "_LineMetadata__other_fields": {}
+ "line_id": 30069
},
"_annotations": [
{
@@ -15143,8 +15072,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30079,
- "_LineMetadata__other_fields": {}
+ "line_id": 30079
},
"_annotations": [
{
@@ -15350,8 +15278,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30087,
- "_LineMetadata__other_fields": {}
+ "line_id": 30087
},
"_annotations": [
{
@@ -15529,8 +15456,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30098,
- "_LineMetadata__other_fields": {}
+ "line_id": 30098
},
"_annotations": [
{
@@ -15750,8 +15676,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30108,
- "_LineMetadata__other_fields": {}
+ "line_id": 30108
},
"_annotations": [
{
@@ -15957,8 +15882,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30121,
- "_LineMetadata__other_fields": {}
+ "line_id": 30121
},
"_annotations": [
{
@@ -16206,8 +16130,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30136,
- "_LineMetadata__other_fields": {}
+ "line_id": 30136
},
"_annotations": [
{
@@ -16483,8 +16406,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30150,
- "_LineMetadata__other_fields": {}
+ "line_id": 30150
},
"_annotations": [
{
@@ -16746,8 +16668,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30162,
- "_LineMetadata__other_fields": {}
+ "line_id": 30162
},
"_annotations": [
{
@@ -16981,8 +16902,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30173,
- "_LineMetadata__other_fields": {}
+ "line_id": 30173
},
"_annotations": [
{
@@ -17202,8 +17122,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30184,
- "_LineMetadata__other_fields": {}
+ "line_id": 30184
},
"_annotations": [
{
@@ -17423,8 +17342,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30197,
- "_LineMetadata__other_fields": {}
+ "line_id": 30197
},
"_annotations": [
{
@@ -17672,8 +17590,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30211,
- "_LineMetadata__other_fields": {}
+ "line_id": 30211
},
"_annotations": [
{
@@ -17935,8 +17852,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30222,
- "_LineMetadata__other_fields": {}
+ "line_id": 30222
},
"_annotations": [
{
@@ -18156,8 +18072,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30233,
- "_LineMetadata__other_fields": {}
+ "line_id": 30233
},
"_annotations": [
{
@@ -18377,8 +18292,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30243,
- "_LineMetadata__other_fields": {}
+ "line_id": 30243
},
"_annotations": [
{
@@ -18584,8 +18498,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30245,
- "_LineMetadata__other_fields": {}
+ "line_id": 30245
},
"_annotations": [
{
@@ -18679,8 +18592,7 @@
"line_type": "list_item"
},
"page_id": 2,
- "line_id": 30255,
- "_LineMetadata__other_fields": {}
+ "line_id": 30255
},
"_annotations": [
{
@@ -18886,8 +18798,7 @@
"line_type": "list_item"
},
"page_id": 2,
- "line_id": 30263,
- "_LineMetadata__other_fields": {}
+ "line_id": 30263
},
"_annotations": [
{
@@ -19065,8 +18976,7 @@
"line_type": "list_item"
},
"page_id": 2,
- "line_id": 30270,
- "_LineMetadata__other_fields": {}
+ "line_id": 30270
},
"_annotations": [
{
@@ -19230,8 +19140,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30278,
- "_LineMetadata__other_fields": {}
+ "line_id": 30278
},
"_annotations": [
{
@@ -19409,8 +19318,7 @@
"line_type": "raw_text"
},
"page_id": 2,
- "line_id": 30284,
- "_LineMetadata__other_fields": {}
+ "line_id": 30284
},
"_annotations": [
{
@@ -19560,8 +19468,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40004,
- "_LineMetadata__other_fields": {}
+ "line_id": 40004
},
"_annotations": [
{
@@ -19697,8 +19604,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40016,
- "_LineMetadata__other_fields": {}
+ "line_id": 40016
},
"_annotations": [
{
@@ -19932,8 +19838,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40032,
- "_LineMetadata__other_fields": {}
+ "line_id": 40032
},
"_annotations": [
{
@@ -20223,8 +20128,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40040,
- "_LineMetadata__other_fields": {}
+ "line_id": 40040
},
"_annotations": [
{
@@ -20402,8 +20306,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40053,
- "_LineMetadata__other_fields": {}
+ "line_id": 40053
},
"_annotations": [
{
@@ -20651,8 +20554,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40066,
- "_LineMetadata__other_fields": {}
+ "line_id": 40066
},
"_annotations": [
{
@@ -20900,8 +20802,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40080,
- "_LineMetadata__other_fields": {}
+ "line_id": 40080
},
"_annotations": [
{
@@ -21163,8 +21064,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40095,
- "_LineMetadata__other_fields": {}
+ "line_id": 40095
},
"_annotations": [
{
@@ -21440,8 +21340,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40106,
- "_LineMetadata__other_fields": {}
+ "line_id": 40106
},
"_annotations": [
{
@@ -21661,8 +21560,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40121,
- "_LineMetadata__other_fields": {}
+ "line_id": 40121
},
"_annotations": [
{
@@ -21938,8 +21836,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40123,
- "_LineMetadata__other_fields": {}
+ "line_id": 40123
},
"_annotations": [
{
@@ -22033,8 +21930,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40139,
- "_LineMetadata__other_fields": {}
+ "line_id": 40139
},
"_annotations": [
{
@@ -22324,8 +22220,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40152,
- "_LineMetadata__other_fields": {}
+ "line_id": 40152
},
"_annotations": [
{
@@ -22573,8 +22468,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40163,
- "_LineMetadata__other_fields": {}
+ "line_id": 40163
},
"_annotations": [
{
@@ -22794,8 +22688,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40174,
- "_LineMetadata__other_fields": {}
+ "line_id": 40174
},
"_annotations": [
{
@@ -23015,8 +22908,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40189,
- "_LineMetadata__other_fields": {}
+ "line_id": 40189
},
"_annotations": [
{
@@ -23292,8 +23184,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40202,
- "_LineMetadata__other_fields": {}
+ "line_id": 40202
},
"_annotations": [
{
@@ -23541,8 +23432,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40213,
- "_LineMetadata__other_fields": {}
+ "line_id": 40213
},
"_annotations": [
{
@@ -23762,8 +23652,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40222,
- "_LineMetadata__other_fields": {}
+ "line_id": 40222
},
"_annotations": [
{
@@ -23955,8 +23844,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40236,
- "_LineMetadata__other_fields": {}
+ "line_id": 40236
},
"_annotations": [
{
@@ -24218,8 +24106,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40245,
- "_LineMetadata__other_fields": {}
+ "line_id": 40245
},
"_annotations": [
{
@@ -24411,8 +24298,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40249,
- "_LineMetadata__other_fields": {}
+ "line_id": 40249
},
"_annotations": [
{
@@ -24534,8 +24420,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40261,
- "_LineMetadata__other_fields": {}
+ "line_id": 40261
},
"_annotations": [
{
@@ -24769,8 +24654,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40273,
- "_LineMetadata__other_fields": {}
+ "line_id": 40273
},
"_annotations": [
{
@@ -25004,8 +24888,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40285,
- "_LineMetadata__other_fields": {}
+ "line_id": 40285
},
"_annotations": [
{
@@ -25239,8 +25122,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40301,
- "_LineMetadata__other_fields": {}
+ "line_id": 40301
},
"_annotations": [
{
@@ -25530,8 +25412,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40309,
- "_LineMetadata__other_fields": {}
+ "line_id": 40309
},
"_annotations": [
{
@@ -25709,8 +25590,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40324,
- "_LineMetadata__other_fields": {}
+ "line_id": 40324
},
"_annotations": [
{
@@ -25986,8 +25866,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40337,
- "_LineMetadata__other_fields": {}
+ "line_id": 40337
},
"_annotations": [
{
@@ -26235,8 +26114,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40352,
- "_LineMetadata__other_fields": {}
+ "line_id": 40352
},
"_annotations": [
{
@@ -26512,8 +26390,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40363,
- "_LineMetadata__other_fields": {}
+ "line_id": 40363
},
"_annotations": [
{
@@ -26733,8 +26610,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40376,
- "_LineMetadata__other_fields": {}
+ "line_id": 40376
},
"_annotations": [
{
@@ -26982,8 +26858,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40389,
- "_LineMetadata__other_fields": {}
+ "line_id": 40389
},
"_annotations": [
{
@@ -27231,8 +27106,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40394,
- "_LineMetadata__other_fields": {}
+ "line_id": 40394
},
"_annotations": [
{
@@ -27368,8 +27242,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40405,
- "_LineMetadata__other_fields": {}
+ "line_id": 40405
},
"_annotations": [
{
@@ -27589,8 +27462,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40418,
- "_LineMetadata__other_fields": {}
+ "line_id": 40418
},
"_annotations": [
{
@@ -27838,8 +27710,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40430,
- "_LineMetadata__other_fields": {}
+ "line_id": 40430
},
"_annotations": [
{
@@ -28073,8 +27944,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40444,
- "_LineMetadata__other_fields": {}
+ "line_id": 40444
},
"_annotations": [
{
@@ -28336,8 +28206,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40457,
- "_LineMetadata__other_fields": {}
+ "line_id": 40457
},
"_annotations": [
{
@@ -28585,8 +28454,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40464,
- "_LineMetadata__other_fields": {}
+ "line_id": 40464
},
"_annotations": [
{
@@ -28750,8 +28618,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40475,
- "_LineMetadata__other_fields": {}
+ "line_id": 40475
},
"_annotations": [
{
@@ -28971,8 +28838,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40486,
- "_LineMetadata__other_fields": {}
+ "line_id": 40486
},
"_annotations": [
{
@@ -29192,8 +29058,7 @@
"line_type": "raw_text"
},
"page_id": 3,
- "line_id": 40495,
- "_LineMetadata__other_fields": {}
+ "line_id": 40495
},
"_annotations": [
{
@@ -29385,8 +29250,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50011,
- "_LineMetadata__other_fields": {}
+ "line_id": 50011
},
"_annotations": [
{
@@ -29620,8 +29484,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50022,
- "_LineMetadata__other_fields": {}
+ "line_id": 50022
},
"_annotations": [
{
@@ -29841,8 +29704,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50034,
- "_LineMetadata__other_fields": {}
+ "line_id": 50034
},
"_annotations": [
{
@@ -30076,8 +29938,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50048,
- "_LineMetadata__other_fields": {}
+ "line_id": 50048
},
"_annotations": [
{
@@ -30339,8 +30200,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50058,
- "_LineMetadata__other_fields": {}
+ "line_id": 50058
},
"_annotations": [
{
@@ -30546,8 +30406,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50067,
- "_LineMetadata__other_fields": {}
+ "line_id": 50067
},
"_annotations": [
{
@@ -30739,8 +30598,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50076,
- "_LineMetadata__other_fields": {}
+ "line_id": 50076
},
"_annotations": [
{
@@ -30932,8 +30790,7 @@
"line_type": "list_item"
},
"page_id": 4,
- "line_id": 50088,
- "_LineMetadata__other_fields": {}
+ "line_id": 50088
},
"_annotations": [
{
@@ -31167,8 +31024,7 @@
"line_type": "list_item"
},
"page_id": 4,
- "line_id": 50102,
- "_LineMetadata__other_fields": {}
+ "line_id": 50102
},
"_annotations": [
{
@@ -31430,8 +31286,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50109,
- "_LineMetadata__other_fields": {}
+ "line_id": 50109
},
"_annotations": [
{
@@ -31595,8 +31450,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50122,
- "_LineMetadata__other_fields": {}
+ "line_id": 50122
},
"_annotations": [
{
@@ -31844,8 +31698,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50134,
- "_LineMetadata__other_fields": {}
+ "line_id": 50134
},
"_annotations": [
{
@@ -32079,8 +31932,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50139,
- "_LineMetadata__other_fields": {}
+ "line_id": 50139
},
"_annotations": [
{
@@ -32216,8 +32068,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50142,
- "_LineMetadata__other_fields": {}
+ "line_id": 50142
},
"_annotations": [
{
@@ -32325,8 +32176,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50155,
- "_LineMetadata__other_fields": {}
+ "line_id": 50155
},
"_annotations": [
{
@@ -32574,8 +32424,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50164,
- "_LineMetadata__other_fields": {}
+ "line_id": 50164
},
"_annotations": [
{
@@ -32767,8 +32616,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50175,
- "_LineMetadata__other_fields": {}
+ "line_id": 50175
},
"_annotations": [
{
@@ -32988,8 +32836,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50186,
- "_LineMetadata__other_fields": {}
+ "line_id": 50186
},
"_annotations": [
{
@@ -33209,8 +33056,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50198,
- "_LineMetadata__other_fields": {}
+ "line_id": 50198
},
"_annotations": [
{
@@ -33444,8 +33290,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50211,
- "_LineMetadata__other_fields": {}
+ "line_id": 50211
},
"_annotations": [
{
@@ -33693,8 +33538,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50220,
- "_LineMetadata__other_fields": {}
+ "line_id": 50220
},
"_annotations": [
{
@@ -33886,8 +33730,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50231,
- "_LineMetadata__other_fields": {}
+ "line_id": 50231
},
"_annotations": [
{
@@ -34107,8 +33950,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50241,
- "_LineMetadata__other_fields": {}
+ "line_id": 50241
},
"_annotations": [
{
@@ -34314,8 +34156,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50255,
- "_LineMetadata__other_fields": {}
+ "line_id": 50255
},
"_annotations": [
{
@@ -34577,8 +34418,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50267,
- "_LineMetadata__other_fields": {}
+ "line_id": 50267
},
"_annotations": [
{
@@ -34812,8 +34652,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50281,
- "_LineMetadata__other_fields": {}
+ "line_id": 50281
},
"_annotations": [
{
@@ -35075,8 +34914,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50293,
- "_LineMetadata__other_fields": {}
+ "line_id": 50293
},
"_annotations": [
{
@@ -35310,8 +35148,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50304,
- "_LineMetadata__other_fields": {}
+ "line_id": 50304
},
"_annotations": [
{
@@ -35531,8 +35368,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50313,
- "_LineMetadata__other_fields": {}
+ "line_id": 50313
},
"_annotations": [
{
@@ -35724,8 +35560,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50322,
- "_LineMetadata__other_fields": {}
+ "line_id": 50322
},
"_annotations": [
{
@@ -35917,8 +35752,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50325,
- "_LineMetadata__other_fields": {}
+ "line_id": 50325
},
"_annotations": [
{
@@ -36026,8 +35860,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50338,
- "_LineMetadata__other_fields": {}
+ "line_id": 50338
},
"_annotations": [
{
@@ -36275,8 +36108,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50350,
- "_LineMetadata__other_fields": {}
+ "line_id": 50350
},
"_annotations": [
{
@@ -36510,8 +36342,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50361,
- "_LineMetadata__other_fields": {}
+ "line_id": 50361
},
"_annotations": [
{
@@ -36731,8 +36562,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50372,
- "_LineMetadata__other_fields": {}
+ "line_id": 50372
},
"_annotations": [
{
@@ -36952,8 +36782,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50383,
- "_LineMetadata__other_fields": {}
+ "line_id": 50383
},
"_annotations": [
{
@@ -37173,8 +37002,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50395,
- "_LineMetadata__other_fields": {}
+ "line_id": 50395
},
"_annotations": [
{
@@ -37408,8 +37236,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50396,
- "_LineMetadata__other_fields": {}
+ "line_id": 50396
},
"_annotations": [
{
@@ -37489,8 +37316,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50408,
- "_LineMetadata__other_fields": {}
+ "line_id": 50408
},
"_annotations": [
{
@@ -37724,8 +37550,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50419,
- "_LineMetadata__other_fields": {}
+ "line_id": 50419
},
"_annotations": [
{
@@ -37945,8 +37770,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50432,
- "_LineMetadata__other_fields": {}
+ "line_id": 50432
},
"_annotations": [
{
@@ -38194,8 +38018,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50443,
- "_LineMetadata__other_fields": {}
+ "line_id": 50443
},
"_annotations": [
{
@@ -38415,8 +38238,7 @@
"line_type": "raw_text"
},
"page_id": 4,
- "line_id": 50449,
- "_LineMetadata__other_fields": {}
+ "line_id": 50449
},
"_annotations": [
{
@@ -38566,8 +38388,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60008,
- "_LineMetadata__other_fields": {}
+ "line_id": 60008
},
"_annotations": [
{
@@ -38759,8 +38580,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60022,
- "_LineMetadata__other_fields": {}
+ "line_id": 60022
},
"_annotations": [
{
@@ -39022,8 +38842,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60035,
- "_LineMetadata__other_fields": {}
+ "line_id": 60035
},
"_annotations": [
{
@@ -39271,8 +39090,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60045,
- "_LineMetadata__other_fields": {}
+ "line_id": 60045
},
"_annotations": [
{
@@ -39478,8 +39296,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60060,
- "_LineMetadata__other_fields": {}
+ "line_id": 60060
},
"_annotations": [
{
@@ -39755,8 +39572,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60075,
- "_LineMetadata__other_fields": {}
+ "line_id": 60075
},
"_annotations": [
{
@@ -40032,8 +39848,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60086,
- "_LineMetadata__other_fields": {}
+ "line_id": 60086
},
"_annotations": [
{
@@ -40253,8 +40068,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60087,
- "_LineMetadata__other_fields": {}
+ "line_id": 60087
},
"_annotations": [
{
@@ -40334,8 +40148,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60091,
- "_LineMetadata__other_fields": {}
+ "line_id": 60091
},
"_annotations": [
{
@@ -40457,8 +40270,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60104,
- "_LineMetadata__other_fields": {}
+ "line_id": 60104
},
"_annotations": [
{
@@ -40706,8 +40518,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60115,
- "_LineMetadata__other_fields": {}
+ "line_id": 60115
},
"_annotations": [
{
@@ -40927,8 +40738,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60128,
- "_LineMetadata__other_fields": {}
+ "line_id": 60128
},
"_annotations": [
{
@@ -41176,8 +40986,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60140,
- "_LineMetadata__other_fields": {}
+ "line_id": 60140
},
"_annotations": [
{
@@ -41411,8 +41220,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60154,
- "_LineMetadata__other_fields": {}
+ "line_id": 60154
},
"_annotations": [
{
@@ -41674,8 +41482,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60162,
- "_LineMetadata__other_fields": {}
+ "line_id": 60162
},
"_annotations": [
{
@@ -41853,8 +41660,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60174,
- "_LineMetadata__other_fields": {}
+ "line_id": 60174
},
"_annotations": [
{
@@ -42088,8 +41894,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60186,
- "_LineMetadata__other_fields": {}
+ "line_id": 60186
},
"_annotations": [
{
@@ -42323,8 +42128,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60198,
- "_LineMetadata__other_fields": {}
+ "line_id": 60198
},
"_annotations": [
{
@@ -42558,8 +42362,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60211,
- "_LineMetadata__other_fields": {}
+ "line_id": 60211
},
"_annotations": [
{
@@ -42807,8 +42610,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60222,
- "_LineMetadata__other_fields": {}
+ "line_id": 60222
},
"_annotations": [
{
@@ -43028,8 +42830,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60235,
- "_LineMetadata__other_fields": {}
+ "line_id": 60235
},
"_annotations": [
{
@@ -43277,8 +43078,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60249,
- "_LineMetadata__other_fields": {}
+ "line_id": 60249
},
"_annotations": [
{
@@ -43540,8 +43340,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60264,
- "_LineMetadata__other_fields": {}
+ "line_id": 60264
},
"_annotations": [
{
@@ -43817,8 +43616,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60279,
- "_LineMetadata__other_fields": {}
+ "line_id": 60279
},
"_annotations": [
{
@@ -44094,8 +43892,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60292,
- "_LineMetadata__other_fields": {}
+ "line_id": 60292
},
"_annotations": [
{
@@ -44343,8 +44140,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60306,
- "_LineMetadata__other_fields": {}
+ "line_id": 60306
},
"_annotations": [
{
@@ -44606,8 +44402,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60317,
- "_LineMetadata__other_fields": {}
+ "line_id": 60317
},
"_annotations": [
{
@@ -44827,8 +44622,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60327,
- "_LineMetadata__other_fields": {}
+ "line_id": 60327
},
"_annotations": [
{
@@ -45034,8 +44828,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60340,
- "_LineMetadata__other_fields": {}
+ "line_id": 60340
},
"_annotations": [
{
@@ -45283,8 +45076,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60349,
- "_LineMetadata__other_fields": {}
+ "line_id": 60349
},
"_annotations": [
{
@@ -45476,8 +45268,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60360,
- "_LineMetadata__other_fields": {}
+ "line_id": 60360
},
"_annotations": [
{
@@ -45697,8 +45488,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60370,
- "_LineMetadata__other_fields": {}
+ "line_id": 60370
},
"_annotations": [
{
@@ -45904,8 +45694,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60381,
- "_LineMetadata__other_fields": {}
+ "line_id": 60381
},
"_annotations": [
{
@@ -46125,8 +45914,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60392,
- "_LineMetadata__other_fields": {}
+ "line_id": 60392
},
"_annotations": [
{
@@ -46346,8 +46134,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60406,
- "_LineMetadata__other_fields": {}
+ "line_id": 60406
},
"_annotations": [
{
@@ -46609,8 +46396,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60416,
- "_LineMetadata__other_fields": {}
+ "line_id": 60416
},
"_annotations": [
{
@@ -46816,8 +46602,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60428,
- "_LineMetadata__other_fields": {}
+ "line_id": 60428
},
"_annotations": [
{
@@ -47051,8 +46836,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60441,
- "_LineMetadata__other_fields": {}
+ "line_id": 60441
},
"_annotations": [
{
@@ -47300,8 +47084,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60448,
- "_LineMetadata__other_fields": {}
+ "line_id": 60448
},
"_annotations": [
{
@@ -47465,8 +47248,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60461,
- "_LineMetadata__other_fields": {}
+ "line_id": 60461
},
"_annotations": [
{
@@ -47714,8 +47496,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60472,
- "_LineMetadata__other_fields": {}
+ "line_id": 60472
},
"_annotations": [
{
@@ -47935,8 +47716,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60484,
- "_LineMetadata__other_fields": {}
+ "line_id": 60484
},
"_annotations": [
{
@@ -48170,8 +47950,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60495,
- "_LineMetadata__other_fields": {}
+ "line_id": 60495
},
"_annotations": [
{
@@ -48391,8 +48170,7 @@
"line_type": "raw_text"
},
"page_id": 5,
- "line_id": 60504,
- "_LineMetadata__other_fields": {}
+ "line_id": 60504
},
"_annotations": [
{
@@ -48584,8 +48362,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70011,
- "_LineMetadata__other_fields": {}
+ "line_id": 70011
},
"_annotations": [
{
@@ -48819,8 +48596,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70017,
- "_LineMetadata__other_fields": {}
+ "line_id": 70017
},
"_annotations": [
{
@@ -48970,8 +48746,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70029,
- "_LineMetadata__other_fields": {}
+ "line_id": 70029
},
"_annotations": [
{
@@ -49205,8 +48980,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70041,
- "_LineMetadata__other_fields": {}
+ "line_id": 70041
},
"_annotations": [
{
@@ -49440,8 +49214,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70054,
- "_LineMetadata__other_fields": {}
+ "line_id": 70054
},
"_annotations": [
{
@@ -49689,8 +49462,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70067,
- "_LineMetadata__other_fields": {}
+ "line_id": 70067
},
"_annotations": [
{
@@ -49938,8 +49710,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70080,
- "_LineMetadata__other_fields": {}
+ "line_id": 70080
},
"_annotations": [
{
@@ -50187,8 +49958,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70094,
- "_LineMetadata__other_fields": {}
+ "line_id": 70094
},
"_annotations": [
{
@@ -50450,8 +50220,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70105,
- "_LineMetadata__other_fields": {}
+ "line_id": 70105
},
"_annotations": [
{
@@ -50671,8 +50440,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70117,
- "_LineMetadata__other_fields": {}
+ "line_id": 70117
},
"_annotations": [
{
@@ -50906,8 +50674,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70122,
- "_LineMetadata__other_fields": {}
+ "line_id": 70122
},
"_annotations": [
{
@@ -51043,8 +50810,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70126,
- "_LineMetadata__other_fields": {}
+ "line_id": 70126
},
"_annotations": [
{
@@ -51166,8 +50932,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70139,
- "_LineMetadata__other_fields": {}
+ "line_id": 70139
},
"_annotations": [
{
@@ -51415,8 +51180,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70149,
- "_LineMetadata__other_fields": {}
+ "line_id": 70149
},
"_annotations": [
{
@@ -51622,8 +51386,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70159,
- "_LineMetadata__other_fields": {}
+ "line_id": 70159
},
"_annotations": [
{
@@ -51829,8 +51592,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70168,
- "_LineMetadata__other_fields": {}
+ "line_id": 70168
},
"_annotations": [
{
@@ -52022,8 +51784,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70180,
- "_LineMetadata__other_fields": {}
+ "line_id": 70180
},
"_annotations": [
{
@@ -52257,8 +52018,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70192,
- "_LineMetadata__other_fields": {}
+ "line_id": 70192
},
"_annotations": [
{
@@ -52492,8 +52252,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70203,
- "_LineMetadata__other_fields": {}
+ "line_id": 70203
},
"_annotations": [
{
@@ -52713,8 +52472,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70216,
- "_LineMetadata__other_fields": {}
+ "line_id": 70216
},
"_annotations": [
{
@@ -52962,8 +52720,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70224,
- "_LineMetadata__other_fields": {}
+ "line_id": 70224
},
"_annotations": [
{
@@ -53141,8 +52898,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70237,
- "_LineMetadata__other_fields": {}
+ "line_id": 70237
},
"_annotations": [
{
@@ -53390,8 +53146,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70249,
- "_LineMetadata__other_fields": {}
+ "line_id": 70249
},
"_annotations": [
{
@@ -53625,8 +53380,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70258,
- "_LineMetadata__other_fields": {}
+ "line_id": 70258
},
"_annotations": [
{
@@ -53818,8 +53572,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70271,
- "_LineMetadata__other_fields": {}
+ "line_id": 70271
},
"_annotations": [
{
@@ -54067,8 +53820,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70284,
- "_LineMetadata__other_fields": {}
+ "line_id": 70284
},
"_annotations": [
{
@@ -54316,8 +54068,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70297,
- "_LineMetadata__other_fields": {}
+ "line_id": 70297
},
"_annotations": [
{
@@ -54565,8 +54316,7 @@
"line_type": "raw_text"
},
"page_id": 6,
- "line_id": 70313,
- "_LineMetadata__other_fields": {}
+ "line_id": 70313
},
"_annotations": [
{
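The hunks above make one mechanical change across the whole example file: the serialized `_LineMetadata__other_fields` container is removed from every line's metadata, so any extra keys now appear directly on the `metadata` object. A minimal consumer-side sketch (hypothetical helper; `node` stands for one element of the JSON tree shown above):

```python
def read_line_ids(node: dict) -> tuple:
    """Read positional ids from a node of the serialized tree above."""
    meta = node["metadata"]
    # extras formerly nested under "_LineMetadata__other_fields" are now,
    # when present, plain keys of the metadata dict itself
    return meta["page_id"], meta["line_id"]
```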
diff --git a/docs/source/_static/code_examples/dedoc_creating_dedoc_document.py b/docs/source/_static/code_examples/dedoc_creating_dedoc_document.py
index b0069517..40a18c5d 100644
--- a/docs/source/_static/code_examples/dedoc_creating_dedoc_document.py
+++ b/docs/source/_static/code_examples/dedoc_creating_dedoc_document.py
@@ -7,7 +7,7 @@
hierarchy_level = HierarchyLevel(level_1=0, level_2=0, line_type="header", can_be_multiline=True)
-metadata = LineMetadata(page_id=0, line_id=1, tag_hierarchy_level=None, hierarchy_level=hierarchy_level, other_fields=None)
+metadata = LineMetadata(page_id=0, line_id=1, tag_hierarchy_level=None, hierarchy_level=hierarchy_level)
annotations = [LinkedTextAnnotation(start=0, end=5, value="Now the line isn't so simple :)"), BoldAnnotation(start=7, end=10, value="True")]
super_line = LineWithMeta(text, metadata=metadata, annotations=annotations)
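Callers that still pass `other_fields` to `LineMetadata` will now fail with a `TypeError`. A minimal sketch of the updated construction, assuming `HierarchyLevel` and `LineMetadata` are importable from `dedoc.data_structures` as used elsewhere in these example files:

```python
from dedoc.data_structures import HierarchyLevel, LineMetadata

hierarchy_level = HierarchyLevel(level_1=0, level_2=0, line_type="header", can_be_multiline=True)
# other_fields is gone from the signature, so it is simply omitted
metadata = LineMetadata(page_id=0, line_id=1, tag_hierarchy_level=None, hierarchy_level=hierarchy_level)
```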
diff --git a/docs/source/_static/code_examples/dedoc_return_format.py b/docs/source/_static/code_examples/dedoc_return_format.py
index 8d22432f..1d5ec0b7 100644
--- a/docs/source/_static/code_examples/dedoc_return_format.py
+++ b/docs/source/_static/code_examples/dedoc_return_format.py
@@ -55,6 +55,16 @@ def with_parsed_attachments_example() -> dict:
return json.loads(result)
+def article_example() -> dict:
+ with open("test_dir/article.pdf", "rb") as file:
+ files = {"file": ("article.pdf", file)}
+ r = requests.post("http://localhost:1231/upload", files=files, data=dict(document_type="article"))
+ result = r.content.decode("utf-8")
+
+ assert r.status_code == 200
+ return json.loads(result)
+
+
if __name__ == "__main__":
with open("../json_format_examples/basic_example.json", "w") as f:
json.dump(basic_example(), f, indent=2, ensure_ascii=False)
@@ -70,3 +80,6 @@ def with_parsed_attachments_example() -> dict:
with open("../json_format_examples/with_parsed_attachments.json", "w") as f:
json.dump(with_parsed_attachments_example(), f, indent=2, ensure_ascii=False)
+
+ with open("../json_format_examples/article_example.json", "w") as f:
+ json.dump(article_example(), f, indent=2, ensure_ascii=False)
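A possible follow-up to the new `article_example()`: walking the parsed article for author nodes. This is an illustrative sketch, assuming a dedoc container listening on localhost:1231 and the `content`/`structure` envelope that wraps the tree shown in `article_example.json`:

```python
result = article_example()  # requires the dedoc API to be running
structure = result["content"]["structure"]
# author nodes sit directly under the root node, as in article_example.json
authors = [node["text"].strip() for node in structure["subparagraphs"]
           if node["metadata"]["paragraph_type"] == "author"]
print(authors)
```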
diff --git a/docs/source/_static/code_examples/dedoc_usage_tutorial.py b/docs/source/_static/code_examples/dedoc_usage_tutorial.py
index 671a5ee6..8af1c6b1 100644
--- a/docs/source/_static/code_examples/dedoc_usage_tutorial.py
+++ b/docs/source/_static/code_examples/dedoc_usage_tutorial.py
@@ -64,10 +64,10 @@
metadata_extractor = DocxMetadataExtractor()
metadata_extractor.can_extract(file_path) # True
document.metadata = metadata_extractor.extract(file_path)
-document.metadata # {'file_name': 'example.docx', 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'size': 373795,
-# 'access_time': 1686825619, 'created_time': 1686825617, 'modified_time': 1686823541, 'other_fields': {'document_subject': '', 'keywords': '',
-# 'category': '', 'comments': '', 'author': '', 'last_modified_by': '', 'created_date': 1568725611, 'modified_date': 1686752726,
-# 'last_printed_date': None}}
+document.metadata # {'file_name': 'example.docx', 'temporary_file_name': 'example.docx',
+# 'file_type': 'application/vnd.openxmlformats-officedocument.wordprocessingml.document', 'size': 373839, 'access_time': 1713964145,
+# 'created_time': 1713958120, 'modified_time': 1709111749, 'document_subject': '', 'keywords': '', 'category': '', 'comments': '', 'author': '',
+# 'last_modified_by': 'python-docx', 'created_date': None, 'modified_date': 1714635406, 'last_printed_date': None}
"""Using attachments extractors"""
diff --git a/docs/source/_static/code_examples/test_dir/article.pdf b/docs/source/_static/code_examples/test_dir/article.pdf
new file mode 100644
index 00000000..6c74f192
Binary files /dev/null and b/docs/source/_static/code_examples/test_dir/article.pdf differ
diff --git a/docs/source/_static/json_format_examples/article_example.json b/docs/source/_static/json_format_examples/article_example.json
index 712c5841..41be6abf 100644
--- a/docs/source/_static/json_format_examples/article_example.json
+++ b/docs/source/_static/json_format_examples/article_example.json
@@ -7,8 +7,7 @@
"metadata": {
"paragraph_type": "root",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -18,8 +17,7 @@
"metadata": {
"paragraph_type": "author",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -29,8 +27,7 @@
"metadata": {
"paragraph_type": "author_first_name",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
@@ -41,8 +38,7 @@
"metadata": {
"paragraph_type": "author_surname",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
@@ -53,8 +49,7 @@
"metadata": {
"paragraph_type": "author_affiliation",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -64,20 +59,18 @@
"metadata": {
"paragraph_type": "org_name",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
{
"node_id": "0.0.2.1",
- "text": "\n45 rue dUlm\n75005\nParis\n",
+ "text": "45 rue dUlm, 75005, Paris",
"annotations": [],
"metadata": {
"paragraph_type": "address",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
@@ -90,8 +83,7 @@
"metadata": {
"paragraph_type": "author_affiliation",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -101,20 +93,18 @@
"metadata": {
"paragraph_type": "org_name",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
{
"node_id": "0.0.3.1",
- "text": "\n4 Avenue des Louvresses\n92230\nGennevilliers\n",
+ "text": "4 Avenue des Louvresses, 92230, Gennevilliers",
"annotations": [],
"metadata": {
"paragraph_type": "address",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
@@ -129,8 +119,7 @@
"metadata": {
"paragraph_type": "author",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -140,8 +129,7 @@
"metadata": {
"paragraph_type": "author_first_name",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
@@ -152,8 +140,7 @@
"metadata": {
"paragraph_type": "author_surname",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
@@ -164,8 +151,7 @@
"metadata": {
"paragraph_type": "author_affiliation",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -175,20 +161,18 @@
"metadata": {
"paragraph_type": "org_name",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
{
"node_id": "0.1.2.1",
- "text": "\nBelgium\n",
+ "text": "Belgium",
"annotations": [],
"metadata": {
"paragraph_type": "address",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
@@ -203,8 +187,7 @@
"metadata": {
"paragraph_type": "author",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -214,8 +197,7 @@
"metadata": {
"paragraph_type": "author_first_name",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
@@ -226,8 +208,7 @@
"metadata": {
"paragraph_type": "author_surname",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
@@ -238,8 +219,7 @@
"metadata": {
"paragraph_type": "author_affiliation",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -249,20 +229,18 @@
"metadata": {
"paragraph_type": "org_name",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
},
{
"node_id": "0.2.2.1",
- "text": "\nBelgium\n",
+ "text": "Belgium",
"annotations": [],
"metadata": {
"paragraph_type": "address",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
@@ -277,8 +255,7 @@
"metadata": {
"paragraph_type": "abstract",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -288,8 +265,7 @@
"metadata": {
"paragraph_type": "raw_text",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
@@ -297,13 +273,12 @@
},
{
"node_id": "0.4",
- "text": "Introduction",
+ "text": "1 Introduction",
"annotations": [],
"metadata": {
"paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -313,207 +288,206 @@
{
"start": 92,
"end": 95,
- "name": "bibliography_ref",
- "value": "bac4e44c-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10981248-0872-11ef-b95c-0242ac120002"
},
{
"start": 95,
"end": 98,
- "name": "bibliography_ref",
- "value": "bac4e4bb-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109a0b84-0872-11ef-b95c-0242ac120002"
},
{
"start": 201,
"end": 205,
- "name": "bibliography_ref",
- "value": "bac4e4ab-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "1099b8be-0872-11ef-b95c-0242ac120002"
},
{
"start": 205,
"end": 208,
- "name": "bibliography_ref",
- "value": "bac4e551-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109c6370-0872-11ef-b95c-0242ac120002"
},
{
"start": 208,
"end": 211,
- "name": "bibliography_ref",
- "value": "bac4e5cd-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109eab44-0872-11ef-b95c-0242ac120002"
},
{
"start": 211,
"end": 214,
- "name": "bibliography_ref",
- "value": "bac4e5dd-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109ef5e0-0872-11ef-b95c-0242ac120002"
},
{
"start": 846,
"end": 850,
- "name": "bibliography_ref",
- "value": "bac4e584-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109d578a-0872-11ef-b95c-0242ac120002"
},
{
"start": 850,
"end": 853,
- "name": "bibliography_ref",
- "value": "bac4e602-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109f92d4-0872-11ef-b95c-0242ac120002"
},
{
"start": 942,
"end": 946,
- "name": "bibliography_ref",
- "value": "bac4e516-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109b8ad6-0872-11ef-b95c-0242ac120002"
},
{
"start": 1055,
"end": 1059,
- "name": "bibliography_ref",
- "value": "bac4e4c5-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109a34ec-0872-11ef-b95c-0242ac120002"
},
{
"start": 1550,
"end": 1554,
- "name": "bibliography_ref",
- "value": "bac4e501-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109b29c4-0872-11ef-b95c-0242ac120002"
},
{
"start": 1619,
"end": 1623,
- "name": "bibliography_ref",
- "value": "bac4e480-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10990112-0872-11ef-b95c-0242ac120002"
},
{
"start": 1623,
"end": 1626,
- "name": "bibliography_ref",
- "value": "bac4e49b-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10997246-0872-11ef-b95c-0242ac120002"
},
{
"start": 1683,
"end": 1686,
- "name": "bibliography_ref",
- "value": "bac4e49b-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10997246-0872-11ef-b95c-0242ac120002"
},
{
"start": 1626,
"end": 1629,
- "name": "bibliography_ref",
- "value": "bac4e571-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109cf646-0872-11ef-b95c-0242ac120002"
},
{
"start": 1629,
"end": 1632,
- "name": "bibliography_ref",
- "value": "bac4e5ec-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109f2bbe-0872-11ef-b95c-0242ac120002"
},
{
"start": 1929,
"end": 1933,
- "name": "bibliography_ref",
- "value": "bac4e5ec-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109f2bbe-0872-11ef-b95c-0242ac120002"
},
{
"start": 1632,
"end": 1635,
- "name": "bibliography_ref",
- "value": "bac4e5f6-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109f6200-0872-11ef-b95c-0242ac120002"
},
{
"start": 1689,
"end": 1692,
- "name": "bibliography_ref",
- "value": "bac4e5f6-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109f6200-0872-11ef-b95c-0242ac120002"
},
{
"start": 1635,
"end": 1638,
- "name": "bibliography_ref",
- "value": "bac4e634-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10a07d98-0872-11ef-b95c-0242ac120002"
},
{
"start": 1692,
"end": 1695,
- "name": "bibliography_ref",
- "value": "bac4e634-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10a07d98-0872-11ef-b95c-0242ac120002"
},
{
"start": 1638,
"end": 1641,
- "name": "bibliography_ref",
- "value": "bac4e63d-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10a0afc0-0872-11ef-b95c-0242ac120002"
},
{
"start": 1677,
"end": 1680,
- "name": "bibliography_ref",
- "value": "bac4e42a-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "1097209a-0872-11ef-b95c-0242ac120002"
},
{
"start": 1680,
"end": 1683,
- "name": "bibliography_ref",
- "value": "bac4e46d-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10989a06-0872-11ef-b95c-0242ac120002"
},
{
"start": 1686,
"end": 1689,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
},
{
"start": 3412,
"end": 3416,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
},
{
"start": 4544,
"end": 4548,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
},
{
"start": 5206,
"end": 5210,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
},
{
"start": 6249,
"end": 6253,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
},
{
"start": 2381,
"end": 2385,
- "name": "bibliography_ref",
- "value": "bac4e499-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10996f12-0872-11ef-b95c-0242ac120002"
},
{
"start": 2405,
"end": 2408,
- "name": "bibliography_ref",
- "value": "bac4e461-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "1098687e-0872-11ef-b95c-0242ac120002"
},
{
"start": 2640,
"end": 2643,
- "name": "bibliography_ref",
- "value": "bac4e43e-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "1097c3a6-0872-11ef-b95c-0242ac120002"
},
{
"start": 3306,
"end": 3310,
- "name": "bibliography_ref",
- "value": "bac4e4b2-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "1099dbe6-0872-11ef-b95c-0242ac120002"
}
],
"metadata": {
"paragraph_type": "raw_text",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
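Alongside the `other_fields` removal, the hunks in this file rename the bibliography annotation from `bibliography_ref` to `reference` (the UUID values differ only because they appear to be regenerated on each parse). A one-line sketch of the matching consumer change (hypothetical helper; `node` is a node of this JSON tree):

```python
def bibliography_refs(node: dict) -> list:
    # annotation name changed in this diff: "bibliography_ref" -> "reference"
    return [a for a in node["annotations"] if a["name"] == "reference"]
```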
@@ -521,61 +495,59 @@
},
{
"node_id": "0.5",
- "text": "Methodology & limitations",
+ "text": "2 Methodology & limitations",
"annotations": [],
"metadata": {
"paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
"node_id": "0.5.0",
- "text": "The main goal of this paper is to provide sound techniques to evaluate how leakage-resilient PRGs/PRFs and masking combine. In this section, we provide a brief description of the methodology we will use for this purpose, and underline its limitations. The two main components, namely performance and security evaluations, are detailed in Sections 3 and 4, and then combined in Section 5. Our proposal essentially holds in five steps that we detail below.1. Fix the target security level. In the following, we will take the AES Rijndael with 128-bit key as case study. Since a small security degradation due to side-channel attacks is unavoidable, we will consider 120-bit, 100-bit and 80-bit target security levels for illustration. We do not go below 80-bit keys since it typically corresponds to current short-term security levels [9].2. Choose an implementation. Given a cryptographic algorithm, this essentially corresponds to the selection of a technology and possibly a set of countermeasures to incorporate in the designs to evaluate. In the following, we will consider both software and hardware implementations for illustration, since they lead to significantly different performance and security levels. As for countermeasures, different types of masking schemes will be considered.3. Evaluate performances / extract a cost function. Given an implementation, different metrics can be selected for this purpose (such as code size, RAM, or cycle count in software and area, frequency, throughput or power consumption in hardware). Both for software and hardware implementations, we will use combined functions, namely the \"code size × cycle count\" product and the \"area / throughput\" ratio. While our methodology would be perfectly applicable to other choices of metrics, we believe they are an interesting starting point to capture the efficiency of our different implementations. In particular for the hardware cases, such metrics are less dependent on the serial vs. parallel nature of the target architectures (see [26], Section 2).4. Evaluate security / extract the maximum number of measurements. This central part of our analysis first requires to select the attacks from which we will evaluate security. In the following, we will consider the \"standard DPA attacks\" described in [31] for this purpose. Furthermore, we will investigate them in the profiled setting of template attacks (i.e. assuming that the adversary can build a precise model for the leakage function) [6]. This choice is motivated by the goal of approaching worst-case evaluations [56]. Based on these attacks, we will estimate the security graphs introduced in [61], i.e. compute the adversaries' success rates in function of their time complexity and number of measurements. From a given security level (e.g. 120-bit time complexity), we will finally extract the maximum number of measurements per key tolerated, as can be bounded by the PRG construction1 .",
+ "text": "The main goal of this paper is to provide sound techniques to evaluate how leakage-resilient PRGs/PRFs and masking combine. In this section, we provide a brief description of the methodology we will use for this purpose, and underline its limitations. The two main components, namely performance and security evaluations, are detailed in Sections 3 and 4, and then combined in Section 5. Our proposal essentially holds in five steps that we detail below.\n1. Fix the target security level. In the following, we will take the AES Rijndael with 128-bit key as case study. Since a small security degradation due to side-channel attacks is unavoidable, we will consider 120-bit, 100-bit and 80-bit target security levels for illustration. We do not go below 80-bit keys since it typically corresponds to current short-term security levels [9].2. Choose an implementation. Given a cryptographic algorithm, this essentially corresponds to the selection of a technology and possibly a set of countermeasures to incorporate in the designs to evaluate. In the following, we will consider both software and hardware implementations for illustration, since they lead to significantly different performance and security levels. As for countermeasures, different types of masking schemes will be considered.\n3. Evaluate performances / extract a cost function. Given an implementation, different metrics can be selected for this purpose (such as code size, RAM, or cycle count in software and area, frequency, throughput or power consumption in hardware). Both for software and hardware implementations, we will use combined functions, namely the \"code size × cycle count\" product and the \"area / throughput\" ratio. While our methodology would be perfectly applicable to other choices of metrics, we believe they are an interesting starting point to capture the efficiency of our different implementations. In particular for the hardware cases, such metrics are less dependent on the serial vs. parallel nature of the target architectures (see [26], Section 2).4. Evaluate security / extract the maximum number of measurements. This central part of our analysis first requires to select the attacks from which we will evaluate security. In the following, we will consider the \"standard DPA attacks\" described in [31] for this purpose. Furthermore, we will investigate them in the profiled setting of template attacks (i.e. assuming that the adversary can build a precise model for the leakage function) [6]. This choice is motivated by the goal of approaching worst-case evaluations [56]. Based on these attacks, we will estimate the security graphs introduced in [61], i.e. compute the adversaries' success rates in function of their time complexity and number of measurements. From a given security level (e.g. 120-bit time complexity), we will finally extract the maximum number of measurements per key tolerated, as can be bounded by the PRG construction1 .",
"annotations": [
{
- "start": 833,
- "end": 836,
- "name": "bibliography_ref",
- "value": "bac4e46b-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 834,
+ "end": 837,
+ "name": "reference",
+ "value": "10988c82-0872-11ef-b95c-0242ac120002"
},
{
- "start": 2027,
- "end": 2031,
- "name": "bibliography_ref",
- "value": "bac4e4f7-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 2029,
+ "end": 2033,
+ "name": "reference",
+ "value": "109b1452-0872-11ef-b95c-0242ac120002"
},
{
- "start": 2295,
- "end": 2299,
- "name": "bibliography_ref",
- "value": "bac4e51d-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 2297,
+ "end": 2301,
+ "name": "reference",
+ "value": "109baa66-0872-11ef-b95c-0242ac120002"
},
{
- "start": 2486,
- "end": 2489,
- "name": "bibliography_ref",
- "value": "bac4e456-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 2488,
+ "end": 2491,
+ "name": "reference",
+ "value": "10982954-0872-11ef-b95c-0242ac120002"
},
{
- "start": 2566,
- "end": 2570,
- "name": "bibliography_ref",
- "value": "bac4e5e6-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 2568,
+ "end": 2572,
+ "name": "reference",
+ "value": "109f1e26-0872-11ef-b95c-0242ac120002"
},
{
- "start": 2647,
- "end": 2651,
- "name": "bibliography_ref",
- "value": "bac4e61b-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 2649,
+ "end": 2653,
+ "name": "reference",
+ "value": "109fee1e-0872-11ef-b95c-0242ac120002"
}
],
"metadata": {
"paragraph_type": "raw_text",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
@@ -588,19 +560,17 @@
"metadata": {
"paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
"node_id": "0.6.0",
- "text": "Compute a global cost metric (possibly with an application constraint). In case of security-bounded implementations, the previous security evaluation can be used to estimate how frequently one has to \"re-key\" within a leakageresilient construction. From this estimate, we derive the average number of AES encryptions to execute per 128-bit output. By multiplying this number with the cost function of our performance evaluations, we obtain a global metric for the implementation of an AES-based design ensuring a given security level. In case of security-unbounded implementations, re-keying is not sufficient to maintain the target security level independent of the number of measurements performed by the adversary. So the cost functions have to be combined with an application constraint, stating the maximum number of measurements that can be tolerated to maintain this security level.Quite naturally, such a methodology is limited in the same way as any performance and security evaluation. From the performance point-of-view, our investigations only apply to a representative subset of the (large) set of AES designs published in the literature. Because of place constraints, we first paid attention to state-of-the-art implementations and countermeasures, but applying our methodology to more examples is naturally feasible (and desirable). A very similar statement holds for security evaluations. Namely, we considered standard DPA attacks as a starting point, and because they typically correspond to the state-of-the-art in research and evaluation laboratories. Yet, cryptanalytic progresses can always appear2 . Besides, countermeasures such as masking may rely on physical assumptions that are difficult to compare rigorously (since highly technology-dependent), as will be detailed next with the case of \"glitches\".Note that these limitations are to a large extent inherent to the problem we tackle, and our results also correspond to the best we can hope in this respect. Hence, more than the practical conclusions that we draw in the following sections (that are of course important for current engineers willing to implement physically secure designs), it is the fact that we are able to compare the performance vs. security tradeoffs corresponding to the combination of leakage-resilient constructions with masking that is the most important contribution of this work. Indeed, these comparisons are dependent on the state-of-the-art implementations and attacks that are considered to be relevant for the selected algorithm.",
+ "text": "Compute a global cost metric (possibly with an application constraint). In case of security-bounded implementations, the previous security evaluation can be used to estimate how frequently one has to \"re-key\" within a leakageresilient construction. From this estimate, we derive the average number of AES encryptions to execute per 128-bit output. By multiplying this number with the cost function of our performance evaluations, we obtain a global metric for the implementation of an AES-based design ensuring a given security level. In case of security-unbounded implementations, re-keying is not sufficient to maintain the target security level independent of the number of measurements performed by the adversary. So the cost functions have to be combined with an application constraint, stating the maximum number of measurements that can be tolerated to maintain this security level.\nQuite naturally, such a methodology is limited in the same way as any performance and security evaluation. From the performance point-of-view, our investigations only apply to a representative subset of the (large) set of AES designs published in the literature. Because of place constraints, we first paid attention to state-of-the-art implementations and countermeasures, but applying our methodology to more examples is naturally feasible (and desirable). A very similar statement holds for security evaluations. Namely, we considered standard DPA attacks as a starting point, and because they typically correspond to the state-of-the-art in research and evaluation laboratories. Yet, cryptanalytic progresses can always appear2 . Besides, countermeasures such as masking may rely on physical assumptions that are difficult to compare rigorously (since highly technology-dependent), as will be detailed next with the case of \"glitches\".Note that these limitations are to a large extent inherent to the problem we tackle, and our results also correspond to the best we can hope in this respect. Hence, more than the practical conclusions that we draw in the following sections (that are of course important for current engineers willing to implement physically secure designs), it is the fact that we are able to compare the performance vs. security tradeoffs corresponding to the combination of leakage-resilient constructions with masking that is the most important contribution of this work. Indeed, these comparisons are dependent on the state-of-the-art implementations and attacks that are considered to be relevant for the selected algorithm.\n",
"annotations": [],
"metadata": {
"paragraph_type": "raw_text",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
@@ -608,103 +578,101 @@
},
{
"node_id": "0.7",
- "text": "Performance evaluations",
+ "text": "3 Performance evaluations",
"annotations": [],
"metadata": {
"paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
"node_id": "0.7.0",
- "text": "In this section, we provide our performance evaluations for unprotected and masked AES designs. As previously mentioned, we will consider both software and hardware examples for this purpose. In this context, the main challenge is to find implementations that are (reasonably) comparable. This turned out to be relatively easy in the software case, for which we selected a couple of implementations in 8-bit microcontrollers, i.e. typical targets for side-channel analysis. By contrast, finding implementations in the same technology turns out to be more challenging in hardware: transistor sizes have evolved from (more than) 130µm to (less than) 65ηm over the last 15 years (i.e. the period over which most countermeasures against side-channel attacks have been proposed). Hence, published performance evaluations for side-channel protected designs are rarely comparable. Yet, we could find several designs in a recent FPGA technology, namely the Xilinx Virtex-5 devices (that are based on a 65ηm process).The performances of the implementations we will analyze are summarized in Table 1. As previously mentioned, our software cost function is the frequently considered \"code size × cycle count\" metric, while we use the \"area / throughput\" ratio in the hardware (FPGA) case. As for the countermeasures evaluated, we first focused on the higher-order masking scheme proposed by Rivain and Prouff at CHES 2010, which can be considered as the state-of-the-art in software [53]. We then added the CHES 2011 polynomial masking scheme of Prouff and Roche [45] (and its implementation in [20]), as a typical example of \"glitchresistant\" solution relying on secret sharing and multiparty computation (see the discussion in the next paragraph). A similar variety of countermeasures is proposed in hardware, where we also consider an efficient but glitch-sensitive implementation proposed in [48], and a threshold AES implementation that is one of the most promising solutions to deal with glitches in this case [36]. Note that this latter implementation is based on an 8-bit architecture (rather than a 128-bit one for the others). So although our cost function is aimed at making comparisons between different architectures more reflective of the algorithms' and countermeasures' performances, more serial implementations as this one generally pay a small overhead due to their more complex control logic.Physical assumptions and glitches. As explicit in Table 1, countermeasures against side-channel attacks always rely on a number of physical assumptions.In the case of masking, a central one is that the leakage of the shares manipulated by the target implementation should be independent of each other [22]. Glitches, that are transient signals appearing during the computations in certain (e.g. CMOS) implementations, are a typical physical default that can cause this assumption to fail, as first put forward by Mangard et al. in [32]. There are two possible solutions to deal with such physical defaults: either by making explicit to cryptographic engineers that they have to prevent glitches at the physical level, or by designing countermeasures that can cope with glitches.Interestingly, the first solution is one aspect where hardware and software implementations significantly differ. Namely, while it is usually possible to ensure independent leakages in masked software, by ensuring a sufficient time separation between the manipulation of the shares, it is extremely difficult to avoid glitches in hardware [33]. 
Yet, even in hardware it is generally expected that the \"glitch signal\" will be more difficult to exploit by adversaries, especially if designers pay attention to this issue [35]. In this context, the main question is to determine the amplitude of this signal: if sufficiently reduced in front of the measurement noise, it may turn out that a glitch-sensitive masked implementation leads to improved security levels (compared to an unprotected one). Since this amplitude is highly technology-dependent, we will use it as a parameter to analyze the security of our hardware implementations in the next sections. Yet, we recall that it is a safe practice to focus on glitch-resistant implementations when it comes to hardware. Besides, we note that glitches are not the only physical default that may cause the independent leakage assumption to be contradicted in practice [42,51].",
+ "text": "In this section, we provide our performance evaluations for unprotected and masked AES designs. As previously mentioned, we will consider both software and hardware examples for this purpose. In this context, the main challenge is to find implementations that are (reasonably) comparable. This turned out to be relatively easy in the software case, for which we selected a couple of implementations in 8-bit microcontrollers, i.e. typical targets for side-channel analysis. By contrast, finding implementations in the same technology turns out to be more challenging in hardware: transistor sizes have evolved from (more than) 130µm to (less than) 65ηm over the last 15 years (i.e. the period over which most countermeasures against side-channel attacks have been proposed). Hence, published performance evaluations for side-channel protected designs are rarely comparable. Yet, we could find several designs in a recent FPGA technology, namely the Xilinx Virtex-5 devices (that are based on a 65ηm process).\nThe performances of the implementations we will analyze are summarized in Table 1. As previously mentioned, our software cost function is the frequently considered \"code size × cycle count\" metric, while we use the \"area / throughput\" ratio in the hardware (FPGA) case. As for the countermeasures evaluated, we first focused on the higher-order masking scheme proposed by Rivain and Prouff at CHES 2010, which can be considered as the state-of-the-art in software [53]. We then added the CHES 2011 polynomial masking scheme of Prouff and Roche [45] (and its implementation in [20]), as a typical example of \"glitchresistant\" solution relying on secret sharing and multiparty computation (see the discussion in the next paragraph). A similar variety of countermeasures is proposed in hardware, where we also consider an efficient but glitch-sensitive implementation proposed in [48], and a threshold AES implementation that is one of the most promising solutions to deal with glitches in this case [36]. Note that this latter implementation is based on an 8-bit architecture (rather than a 128-bit one for the others). So although our cost function is aimed at making comparisons between different architectures more reflective of the algorithms' and countermeasures' performances, more serial implementations as this one generally pay a small overhead due to their more complex control logic.Physical assumptions and glitches. As explicit in Table 1, countermeasures against side-channel attacks always rely on a number of physical assumptions.In the case of masking, a central one is that the leakage of the shares manipulated by the target implementation should be independent of each other [22]. Glitches, that are transient signals appearing during the computations in certain (e.g. CMOS) implementations, are a typical physical default that can cause this assumption to fail, as first put forward by Mangard et al. in [32]. There are two possible solutions to deal with such physical defaults: either by making explicit to cryptographic engineers that they have to prevent glitches at the physical level, or by designing countermeasures that can cope with glitches.Interestingly, the first solution is one aspect where hardware and software implementations significantly differ. Namely, while it is usually possible to ensure independent leakages in masked software, by ensuring a sufficient time separation between the manipulation of the shares, it is extremely difficult to avoid glitches in hardware [33]. 
Yet, even in hardware it is generally expected that the \"glitch signal\" will be more difficult to exploit by adversaries, especially if designers pay attention to this issue [35]. In this context, the main question is to determine the amplitude of this signal: if sufficiently reduced in front of the measurement noise, it may turn out that a glitch-sensitive masked implementation leads to improved security levels (compared to an unprotected one). Since this amplitude is highly technology-dependent, we will use it as a parameter to analyze the security of our hardware implementations in the next sections. Yet, we recall that it is a safe practice to focus on glitch-resistant implementations when it comes to hardware. Besides, we note that glitches are not the only physical default that may cause the independent leakage assumption to be contradicted in practice [42,51].",
"annotations": [
{
- "start": 1088,
- "end": 1089,
+ "start": 1089,
+ "end": 1090,
"name": "table",
- "value": "d2ce350a-25be-4d05-9061-6f1d4cf8bdd1"
+ "value": "1c9f98e6-e1f8-49f3-8bf7-24022f2d1939"
},
{
- "start": 2456,
- "end": 2457,
+ "start": 2457,
+ "end": 2458,
"name": "table",
- "value": "d2ce350a-25be-4d05-9061-6f1d4cf8bdd1"
+ "value": "1c9f98e6-e1f8-49f3-8bf7-24022f2d1939"
},
{
- "start": 1472,
- "end": 1476,
- "name": "bibliography_ref",
- "value": "bac4e5cd-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 1473,
+ "end": 1477,
+ "name": "reference",
+ "value": "109eab44-0872-11ef-b95c-0242ac120002"
},
{
- "start": 1552,
- "end": 1556,
- "name": "bibliography_ref",
- "value": "bac4e58b-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 1553,
+ "end": 1557,
+ "name": "reference",
+ "value": "109d73e6-0872-11ef-b95c-0242ac120002"
},
{
- "start": 1584,
- "end": 1588,
- "name": "bibliography_ref",
- "value": "bac4e4c5-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 1585,
+ "end": 1589,
+ "name": "reference",
+ "value": "109a34ec-0872-11ef-b95c-0242ac120002"
},
{
- "start": 1885,
- "end": 1889,
- "name": "bibliography_ref",
- "value": "bac4e5a0-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 1886,
+ "end": 1890,
+ "name": "reference",
+ "value": "109df136-0872-11ef-b95c-0242ac120002"
},
{
- "start": 2005,
- "end": 2009,
- "name": "bibliography_ref",
- "value": "bac4e549-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 2006,
+ "end": 2010,
+ "name": "reference",
+ "value": "109c4dea-0872-11ef-b95c-0242ac120002"
},
{
- "start": 2701,
- "end": 2705,
- "name": "bibliography_ref",
- "value": "bac4e4d6-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 2702,
+ "end": 2706,
+ "name": "reference",
+ "value": "109a7a92-0872-11ef-b95c-0242ac120002"
},
{
- "start": 2931,
- "end": 2935,
- "name": "bibliography_ref",
- "value": "bac4e526-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 2932,
+ "end": 2936,
+ "name": "reference",
+ "value": "109bcd5c-0872-11ef-b95c-0242ac120002"
},
{
- "start": 3517,
- "end": 3521,
- "name": "bibliography_ref",
- "value": "bac4e531-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 3518,
+ "end": 3522,
+ "name": "reference",
+ "value": "109bf85e-0872-11ef-b95c-0242ac120002"
},
{
- "start": 3697,
- "end": 3701,
- "name": "bibliography_ref",
- "value": "bac4e541-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 3698,
+ "end": 3702,
+ "name": "reference",
+ "value": "109c2608-0872-11ef-b95c-0242ac120002"
},
{
- "start": 4394,
- "end": 4398,
- "name": "bibliography_ref",
- "value": "bac4e575-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 4395,
+ "end": 4399,
+ "name": "reference",
+ "value": "109d0ba4-0872-11ef-b95c-0242ac120002"
},
{
- "start": 4398,
- "end": 4401,
- "name": "bibliography_ref",
- "value": "bac4e5bc-f290-11ee-a6ed-b88584b4e4a1"
+ "start": 4399,
+ "end": 4402,
+ "name": "reference",
+ "value": "109e67a6-0872-11ef-b95c-0242ac120002"
}
],
"metadata": {
"paragraph_type": "raw_text",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
@@ -712,13 +680,12 @@
},
{
"node_id": "0.8",
- "text": "Security evaluations",
+ "text": "4 Security evaluations",
"annotations": [],
"metadata": {
"paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
@@ -728,7219 +695,6481 @@
{
"start": 520,
"end": 523,
- "name": "bibliography_ref",
- "value": "bac4e463-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10986b4e-0872-11ef-b95c-0242ac120002"
},
{
"start": 818,
"end": 822,
- "name": "bibliography_ref",
- "value": "bac4e505-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109b3b6c-0872-11ef-b95c-0242ac120002"
},
{
"start": 1045,
"end": 1048,
- "name": "bibliography_ref",
- "value": "bac4e44c-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10981248-0872-11ef-b95c-0242ac120002"
},
{
"start": 2048,
"end": 2052,
- "name": "bibliography_ref",
- "value": "bac4e4cd-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109a4482-0872-11ef-b95c-0242ac120002"
},
{
"start": 2520,
"end": 2524,
- "name": "bibliography_ref",
- "value": "bac4e516-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109b8ad6-0872-11ef-b95c-0242ac120002"
},
{
"start": 2604,
"end": 2608,
- "name": "bibliography_ref",
- "value": "bac4e5e6-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109f1e26-0872-11ef-b95c-0242ac120002"
},
{
"start": 3482,
"end": 3486,
- "name": "bibliography_ref",
- "value": "bac4e602-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109f92d4-0872-11ef-b95c-0242ac120002"
},
{
"start": 3864,
"end": 3868,
- "name": "bibliography_ref",
- "value": "bac4e4a3-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10999406-0872-11ef-b95c-0242ac120002"
},
{
"start": 3868,
"end": 3871,
- "name": "bibliography_ref",
- "value": "bac4e5c4-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109e782c-0872-11ef-b95c-0242ac120002"
}
],
"metadata": {
"paragraph_type": "raw_text",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
- }
- ]
- },
- {
- "node_id": "0.9",
- "text": "Evaluation setups",
- "annotations": [],
- "metadata": {
- "paragraph_type": "section",
- "page_id": 0,
- "line_id": 0,
- "other_fields": {}
- },
- "subparagraphs": [
+ },
{
- "node_id": "0.9.0",
- "text": "We will consider two types of setups in our evaluations: one for software, one for hardware. As illustrated in Figure 3 in the case of a Boolean-masked S-box implementation with two shares, the main difference is that the software performs all the operations sequentially, while the hardware performs them in parallel. We will further assume that the leakage of parallel operations is summed [40]. As previously mentioned, we will illustrate our analyses with a Hamming weight leakage function. Additionally, we will consider a noise variance of 10, corresponding to a Signal-to-Noise Ratio of 0.2 (as defined in [29]) 3 . This is a typical value, both for software implementations [11] and FPGA measurement boards [25].Let us denote the AES S-box as S, a byte of plaintext and key as x i and k i (respectively), the random shares used in masking as r j i (before the S-box) and m j i (after the S-box), the Hamming weight function as HW, the bitwise XOR as ⊕, the field multiplication used in polynomial masking as ⊗, and Gaussiandistributed noise random variables N j i . From these notations, we can specify the list of all our target implementations as summarized in Table 2.A couple of observations are worth being underlined as we now discuss.First, and as already mentioned, the main difference between software and hardware implementations is the number of exploitable leakage samples: there is a single such sample per plaintext in hardware while there are 16×(N m +1) ones in software (with N m the number of masks). Next, we only considered glitches in hardware (since it is generally possible to ensure independent leakage in software, by ensuring a sufficient time separation between the manipulation of the shares). We assumed that \"first-order glitches\" can appear in our Boolean-masked FPGA implementation, and modeled the impact of the mask as an additive binomial noise in this case. We further assumed that the amplitude of this first-order signal was reduced according to a factor f . This factor corresponds to the parameter used to quantify the amplitude of the glitches mentioned in the previous section. Note that this modeling is sound because the complexity of a first-order DPA only depends on the value of its SNR (which is equivalent to correlation and information theoretic metrics in this case, as proven in [31]). So even leakage functions deviating from the Hamming weight abstraction would lead to similar trends. Since the threshold implementation in [36] guarantees the absence of firstorder glitches, we only analyzed the possibility of second-order glitches for this one, and modeled them in the same way as just described (i.e. by considering the second mask M 2 i as an additive binomial noise, and reducing the amplitude of the second-order signal by a factor f ). Third, the chosen-plaintext construction of [34] is only applicable in hardware. Furthermore, we only evaluated its impact for the unprotected implementation, and the 1-mask Boolean one with glitches. As will become clear in the next section, this is because the data complexity bound to 256 (that is the maximum tolerated by design in this case) is only relevant when successful side-channel attacks occur for such small complexities (which was only observed for implementations with first-order signal).For convenience, we denoted each implementation in our experiments with three letters. The first one corresponds to the type of scenario considered, i.e. with Known (K) or carefully Chosen (C) plaintexts. 
The second one indicates [20,45]2nd-order KP whether we are in a Software (S) or Hardware (H) case study. The third one corresponds to the type of countermeasure selected, i.e. Unprotected (U), 1-or 2-mask Boolean (B 1 , B 2 ), 1-mask Polynomial (P 1 ) and 2-mask threshold (T 2 ). The additional star signals finally reflect the presence of (first-order or secondorder) glitches. For example, KHB * 1 is an AES design protected with a 1-mask Boolean scheme, implemented in an imperfect hardware leading to first-order glitches, and analyzed in the context of known (uniform) plaintexts.",
- "annotations": [
- {
- "start": 392,
- "end": 396,
- "name": "bibliography_ref",
- "value": "bac4e568-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 613,
- "end": 617,
- "name": "bibliography_ref",
- "value": "bac4e50e-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 682,
- "end": 686,
- "name": "bibliography_ref",
- "value": "bac4e476-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 715,
- "end": 719,
- "name": "bibliography_ref",
- "value": "bac4e4ee-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 2339,
- "end": 2343,
- "name": "bibliography_ref",
- "value": "bac4e51d-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 2486,
- "end": 2490,
- "name": "bibliography_ref",
- "value": "bac4e549-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 2850,
- "end": 2854,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 3542,
- "end": 3546,
- "name": "bibliography_ref",
- "value": "bac4e4c5-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 3546,
- "end": 3548,
- "name": "bibliography_ref",
- "value": "bac4e58b-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 1177,
- "end": 1178,
- "name": "table",
- "value": "6e093372-d147-4245-8aab-08ed5fe5c072"
- }
- ],
+ "node_id": "0.8.1",
+ "text": "4.1 Evaluation setups",
+ "annotations": [],
"metadata": {
- "paragraph_type": "raw_text",
+ "paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
- "subparagraphs": []
- }
- ]
- },
- {
- "node_id": "0.10",
- "text": "Template attacks and security graphs",
- "annotations": [],
- "metadata": {
- "paragraph_type": "section",
- "page_id": 0,
- "line_id": 0,
- "other_fields": {}
- },
- "subparagraphs": [
- {
- "node_id": "0.10.0",
- "text": "Given the leakage functions defined in Table 2, a template attack first requires to build a leakage model. In the following, and for each byte of the AES master key, we will consider Gaussian templates for unprotected implementations, and Gaussian for masked implementations. Let us denote the probability density function of a Gaussian distribution taken on input z, with mean µ (resp. mean vector µ) and variance σ 2 (resp. covariance matrix Σ) as N (z|µ, σ 2 ) (resp. N (z|µ, Σ)). This notation directly leads to models of the form:Prfor (software and hardware) unprotected implementations and:Prfor (software and hardware) masked implementations with two shares. The formula naturally extends to more shares, by just adding more sums over the masks. Note that in these models, all the noise (including the algorithmic one in hardware implementations) is captured by the Gaussian distribution 4 . Given these models, the template adversary will accumulate information on the key bytes k i , by computing products of probabilities corresponding to multiple plaintexts. Doing so and for each key byte, he will produce lists of 256 probabilities corresponding each possible candidate ki , defined as follows:i ],with the leakage vector L (j) respectively corresponding to l (j) i (resp. l (j) ) in the context of Equ. 1 (resp. Equ. 2) and l 1,(j) i , l 2,(j) i (resp. l (j) ) in the context of Equ. 3 (resp. Equ. 4) The number of measurements is given by q in Equ. 5. Next and for each target implementation, we will repeat 100 experiments. And for each value of q in these experiments, use a rank estimation algorithm to evaluate the time complexity needed to recover the full AES master key [61]. Eventually, we will build \"security graphs\" where the attack probability of success is provided in function of a time complexity and a number of measurements.Iterative DPA against constructions with carefully chosen plaintexts. Note that while standard DPA attacks are adequate to analyze the security of unprotected and masked implementations in a known-plaintext scenario, their divide-and-conquer strategy hardly applies to the PRF in [34], with carefullychosen plaintexts leading to key-dependent algorithmic noise. This is because the (maximum 256) constants c j used in this proposal are such that all 16 bytes are always identical. Hence, a standard DPA will provide a single list of probabilities, containing information about the 16 AES key bytes at once. In this case, we additionally considered the iterative DPA described in this previous reference, which essentially works by successively removing the algorithmic noise generated by the best-rated key bytes. While such an attack can only work under the assumption that the adversary has an very precise leakage model in hand, we use it as a representative of worst-case attack against such a construction.",
- "annotations": [
- {
- "start": 45,
- "end": 46,
- "name": "table",
- "value": "6e093372-d147-4245-8aab-08ed5fe5c072"
- },
- {
- "start": 1693,
- "end": 1697,
- "name": "bibliography_ref",
- "value": "bac4e61b-f290-11ee-a6ed-b88584b4e4a1"
- },
+ "subparagraphs": [
{
- "start": 2137,
- "end": 2141,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
+ "node_id": "0.8.1.0",
+ "text": "We will consider two types of setups in our evaluations: one for software, one for hardware. As illustrated in Figure 3 in the case of a Boolean-masked S-box implementation with two shares, the main difference is that the software performs all the operations sequentially, while the hardware performs them in parallel. We will further assume that the leakage of parallel operations is summed [40]. As previously mentioned, we will illustrate our analyses with a Hamming weight leakage function. Additionally, we will consider a noise variance of 10, corresponding to a Signal-to-Noise Ratio of 0.2 (as defined in [29]) 3 . This is a typical value, both for software implementations [11] and FPGA measurement boards [25].Let us denote the AES S-box as S, a byte of plaintext and key as x i and k i (respectively), the random shares used in masking as r j i (before the S-box) and m j i (after the S-box), the Hamming weight function as HW, the bitwise XOR as ⊕, the field multiplication used in polynomial masking as ⊗, and Gaussiandistributed noise random variables N j i . From these notations, we can specify the list of all our target implementations as summarized in Table 2.A couple of observations are worth being underlined as we now discuss.\nFirst, and as already mentioned, the main difference between software and hardware implementations is the number of exploitable leakage samples: there is a single such sample per plaintext in hardware while there are 16×(N m +1) ones in software (with N m the number of masks). Next, we only considered glitches in hardware (since it is generally possible to ensure independent leakage in software, by ensuring a sufficient time separation between the manipulation of the shares). We assumed that \"first-order glitches\" can appear in our Boolean-masked FPGA implementation, and modeled the impact of the mask as an additive binomial noise in this case. We further assumed that the amplitude of this first-order signal was reduced according to a factor f . This factor corresponds to the parameter used to quantify the amplitude of the glitches mentioned in the previous section. Note that this modeling is sound because the complexity of a first-order DPA only depends on the value of its SNR (which is equivalent to correlation and information theoretic metrics in this case, as proven in [31]). So even leakage functions deviating from the Hamming weight abstraction would lead to similar trends. Since the threshold implementation in [36] guarantees the absence of firstorder glitches, we only analyzed the possibility of second-order glitches for this one, and modeled them in the same way as just described (i.e. by considering the second mask M 2 i as an additive binomial noise, and reducing the amplitude of the second-order signal by a factor f ). Third, the chosen-plaintext construction of [34] is only applicable in hardware. Furthermore, we only evaluated its impact for the unprotected implementation, and the 1-mask Boolean one with glitches. As will become clear in the next section, this is because the data complexity bound to 256 (that is the maximum tolerated by design in this case) is only relevant when successful side-channel attacks occur for such small complexities (which was only observed for implementations with first-order signal).For convenience, we denoted each implementation in our experiments with three letters. The first one corresponds to the type of scenario considered, i.e. with Known (K) or carefully Chosen (C) plaintexts. 
The second one indicates [20,45]2nd-order KP whether we are in a Software (S) or Hardware (H) case study. The third one corresponds to the type of countermeasure selected, i.e. Unprotected (U), 1-or 2-mask Boolean (B 1 , B 2 ), 1-mask Polynomial (P 1 ) and 2-mask threshold (T 2 ). The additional star signals finally reflect the presence of (first-order or secondorder) glitches. For example, KHB * 1 is an AES design protected with a 1-mask Boolean scheme, implemented in an imperfect hardware leading to first-order glitches, and analyzed in the context of known (uniform) plaintexts.\n",
+ "annotations": [
+ {
+ "start": 392,
+ "end": 396,
+ "name": "reference",
+ "value": "109ce200-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 613,
+ "end": 617,
+ "name": "reference",
+ "value": "109b5dfe-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 682,
+ "end": 686,
+ "name": "reference",
+ "value": "1098d606-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 715,
+ "end": 719,
+ "name": "reference",
+ "value": "109ae9dc-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 2340,
+ "end": 2344,
+ "name": "reference",
+ "value": "109baa66-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 2487,
+ "end": 2491,
+ "name": "reference",
+ "value": "109c4dea-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 2851,
+ "end": 2855,
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 1177,
+ "end": 1178,
+ "name": "table",
+ "value": "355d0fa4-7326-4228-a163-31c56483f80d"
+ }
+ ],
+ "metadata": {
+ "paragraph_type": "raw_text",
+ "page_id": 0,
+ "line_id": 0
+ },
+ "subparagraphs": []
}
- ],
+ ]
+ },
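For intuition, the leakage model fixed in this setup (Hamming-weight leakage with Gaussian noise of variance 10, i.e. an SNR of 0.2) is easy to simulate. A minimal NumPy sketch, with function names of our own choosing rather than anything from the paper or from dedoc:

```python
import numpy as np

rng = np.random.default_rng(0)

def hamming_weight(values: np.ndarray) -> np.ndarray:
    # Hamming weight of 8-bit values.
    return np.unpackbits(values.astype(np.uint8)[:, None], axis=1).sum(axis=1)

def leak(values: np.ndarray, noise_var: float = 10.0) -> np.ndarray:
    # Noisy Hamming-weight leakage: HW(v) + N(0, noise_var).
    return hamming_weight(values) + rng.normal(0.0, np.sqrt(noise_var), values.shape)

# SNR = Var(signal) / Var(noise): the HW of a uniform byte has variance
# 8 * 0.25 = 2, so a noise variance of 10 gives the SNR of 0.2 quoted above.
signal_var = hamming_weight(np.arange(256, dtype=np.uint8)).var()
print(signal_var / 10.0)  # -> 0.2
```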
+ {
+ "node_id": "0.8.2",
+ "text": "4.2 Template attacks and security graphs",
+ "annotations": [],
"metadata": {
- "paragraph_type": "raw_text",
+ "paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
- "subparagraphs": []
- }
- ]
- },
- {
- "node_id": "0.11",
- "text": "Experimental results",
- "annotations": [],
- "metadata": {
- "paragraph_type": "section",
- "page_id": 0,
- "line_id": 0,
- "other_fields": {}
- },
- "subparagraphs": [
- {
- "node_id": "0.11.0",
- "text": "For illustration, the security graph of the AES implementation KHB 1 is given in Figure 4, where we additionally provide the maximum number of measurements tolerated to maintain security levels corresponding to 2 120 , 2 100 and 2 80 time complexity. All the implementations in Table 2 have been similarly evaluated and the result of these experiments are in Appendix A, Figures 8 to 13. Note that in the aforementioned case of iterative DPA (Appendix A, Figure 14), the adversary recovers the AES key bytes but still has to find their position within the AES state, which (roughly) corresponds to 16! ≈ 2 44 possibilities [2].",
- "annotations": [
- {
- "start": 284,
- "end": 285,
- "name": "table",
- "value": "6e093372-d147-4245-8aab-08ed5fe5c072"
- },
+ "subparagraphs": [
{
- "start": 623,
- "end": 626,
- "name": "bibliography_ref",
- "value": "bac4e432-f290-11ee-a6ed-b88584b4e4a1"
+ "node_id": "0.8.2.0",
+ "text": "Given the leakage functions defined in Table 2, a template attack first requires to build a leakage model. In the following, and for each byte of the AES master key, we will consider Gaussian templates for unprotected implementations, and Gaussian for masked implementations. Let us denote the probability density function of a Gaussian distribution taken on input z, with mean µ (resp. mean vector µ) and variance σ 2 (resp. covariance matrix Σ) as N (z|µ, σ 2 ) (resp. N (z|µ, Σ)). This notation directly leads to models of the form:Pr\nfor (software and hardware) unprotected implementations and:\nPr\nfor (software and hardware) masked implementations with two shares. The formula naturally extends to more shares, by just adding more sums over the masks. Note that in these models, all the noise (including the algorithmic one in hardware implementations) is captured by the Gaussian distribution 4 . Given these models, the template adversary will accumulate information on the key bytes k i , by computing products of probabilities corresponding to multiple plaintexts. Doing so and for each key byte, he will produce lists of 256 probabilities corresponding each possible candidate ki , defined as follows:i ],\nwith the leakage vector L (j) respectively corresponding to l (j) i (resp. l (j) ) in the context of Equ. 1 (resp. Equ. 2) and l 1,(j) i , l 2,(j) i (resp. l (j) ) in the context of Equ. 3 (resp. Equ. 4) The number of measurements is given by q in Equ. 5. Next and for each target implementation, we will repeat 100 experiments. And for each value of q in these experiments, use a rank estimation algorithm to evaluate the time complexity needed to recover the full AES master key [61]. Eventually, we will build \"security graphs\" where the attack probability of success is provided in function of a time complexity and a number of measurements.Iterative DPA against constructions with carefully chosen plaintexts. Note that while standard DPA attacks are adequate to analyze the security of unprotected and masked implementations in a known-plaintext scenario, their divide-and-conquer strategy hardly applies to the PRF in [34], with carefullychosen plaintexts leading to key-dependent algorithmic noise. This is because the (maximum 256) constants c j used in this proposal are such that all 16 bytes are always identical. Hence, a standard DPA will provide a single list of probabilities, containing information about the 16 AES key bytes at once. In this case, we additionally considered the iterative DPA described in this previous reference, which essentially works by successively removing the algorithmic noise generated by the best-rated key bytes. While such an attack can only work under the assumption that the adversary has an very precise leakage model in hand, we use it as a representative of worst-case attack against such a construction.",
+ "annotations": [
+ {
+ "start": 45,
+ "end": 46,
+ "name": "table",
+ "value": "355d0fa4-7326-4228-a163-31c56483f80d"
+ },
+ {
+ "start": 1697,
+ "end": 1701,
+ "name": "reference",
+ "value": "109fee1e-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 2141,
+ "end": 2145,
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
+ }
+ ],
+ "metadata": {
+ "paragraph_type": "raw_text",
+ "page_id": 0,
+ "line_id": 0
+ },
+ "subparagraphs": []
}
- ],
+ ]
+ },
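The scoring step described above (products of template probabilities over q traces, Equ. 5) can be sketched as follows. This is an illustrative simplification under our own assumptions, not the paper's code: `sbox` is the AES S-box as a 256-entry array, `mu[v]` the profiled mean for intermediate value v, and the masked case is modeled as a Gaussian mixture over a single output mask.

```python
import numpy as np
from scipy.stats import norm

def score_key_byte(traces, plaintexts, sbox, mu, sigma2, masks=None):
    # Return 256 log-scores, one per key-byte candidate (Equ. 5 in log domain).
    sd = np.sqrt(sigma2)
    scores = np.zeros(256)
    for k in range(256):
        target = sbox[plaintexts ^ k]          # intermediate values S(x XOR k)
        if masks is None:                      # unprotected: plain Gaussian template
            p = norm.pdf(traces, loc=mu[target], scale=sd)
        else:                                  # masked: mixture over the uniform mask
            p = np.mean([norm.pdf(traces, loc=mu[target ^ m], scale=sd)
                         for m in masks], axis=0)
        scores[k] = np.log(p).sum()            # product over the q traces
    return scores
```

Rank estimation over the 16 per-byte score lists (as in [61]) is a separate step that we do not sketch here.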
+ {
+ "node_id": "0.8.3",
+ "text": "4.3 Experimental results",
+ "annotations": [],
"metadata": {
- "paragraph_type": "raw_text",
+ "paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
- "subparagraphs": []
+ "subparagraphs": [
+ {
+ "node_id": "0.8.3.0",
+ "text": "For illustration, the security graph of the AES implementation KHB 1 is given in Figure 4, where we additionally provide the maximum number of measurements tolerated to maintain security levels corresponding to 2 120 , 2 100 and 2 80 time complexity. All the implementations in Table 2 have been similarly evaluated and the result of these experiments are in Appendix A, Figures 8 to 13. Note that in the aforementioned case of iterative DPA (Appendix A, Figure 14), the adversary recovers the AES key bytes but still has to find their position within the AES state, which (roughly) corresponds to 16! ≈ 2 44 possibilities [2].",
+ "annotations": [
+ {
+ "start": 284,
+ "end": 285,
+ "name": "table",
+ "value": "355d0fa4-7326-4228-a163-31c56483f80d"
+ },
+ {
+ "start": 623,
+ "end": 626,
+ "name": "reference",
+ "value": "109771bc-0872-11ef-b95c-0242ac120002"
+ }
+ ],
+ "metadata": {
+ "paragraph_type": "raw_text",
+ "page_id": 0,
+ "line_id": 0
+ },
+ "subparagraphs": []
+ }
+ ]
}
]
},
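As a quick sanity check of the 16! ≈ 2^44 figure quoted above (our own arithmetic, not code from the paper):

```python
import math

# Ordering 16 recovered-but-unplaced key bytes costs 16! permutations,
# i.e. about 2^44 as stated in the text.
print(math.log2(math.factorial(16)))  # -> 44.25
```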
{
- "node_id": "0.12",
- "text": "Security vs. performance tradeoffs",
+ "node_id": "0.9",
+ "text": "5 Security vs. performance tradeoffs",
"annotations": [],
"metadata": {
"paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
- "node_id": "0.12.0",
- "text": "We now combine the results in the previous sections to answer our main question. Namely, what is the best way to exploit masking and/or leakage-resilient primitives to resist standard DPA in hardware and software implementations?",
+ "node_id": "0.9.0",
+ "text": "We now combine the results in the previous sections to answer our main question. Namely, what is the best way to exploit masking and/or leakage-resilient primitives to resist standard DPA in hardware and software implementations?\n",
"annotations": [],
"metadata": {
"paragraph_type": "raw_text",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
- }
- ]
- },
- {
- "node_id": "0.13",
- "text": "Leakage-resilient PRGs",
- "annotations": [],
- "metadata": {
- "paragraph_type": "section",
- "page_id": 0,
- "line_id": 0,
- "other_fields": {}
- },
- "subparagraphs": [
+ },
{
- "node_id": "0.13.0",
- "text": "Let M be the maximum number of measurements tolerated to maintain a given security level for one of the implementations in section 4. The re-keying in leakage-resilient PRGs is such that it is exactly this number M that is limited by design (i.e. the value N in Figure 1 bounds M for the adversary), hence directly leading to security-bounded implementations. The global cost metric we use in this case can be written as M M -1 × cost f unction, where the first factor corresponds to the average number of AES encryptions that are used to produce each 128-bit output string, and the second one is the cost function of Table 1.A comparison of different leakage-resilient PRG implementations in software (i.e. based on different unprotected and protected AES implementations) is given in Figure 5 for 80-bit and 120-bit security levels (the results for 100-bit security are in Appendix A, Figure 15, left). The main observation in this context is that the straightforward implementation of the PRG with an unprotected AES design is the most efficient solution. This is mainly because moving from the smallest M value (i.e. M = 2, as imposed by the 120-bit security level in the unprotected case -see Figure 8-left) to large ones (e.g. M > 1000 for masked implementations) can only lead to a gain factor of 2 for the global cost metric, which is not justified in view of the performance overheads due to the masking. For a similar reason (i.e. the limited interest of increasing M ), the global cost metric is essentially independent of the target security level in the figure. In other words, there is little interest in decreasing this security level since it leads to poor performance improvements. The hardware implementations in Appendix A, Figures 15-right and 16 lead to essentially similar intuitions, as also witnessed by the limited impact of decreasing the amplitude of the glitch signal with the f factor (see the KHB * 1 and KHT * 2 implementations for which f = 10 in the latter figures).",
- "annotations": [
- {
- "start": 624,
- "end": 625,
- "name": "table",
- "value": "d2ce350a-25be-4d05-9061-6f1d4cf8bdd1"
- }
- ],
+ "node_id": "0.9.1",
+ "text": "5.1 Leakage-resilient PRGs",
+ "annotations": [],
"metadata": {
- "paragraph_type": "raw_text",
+ "paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
- "subparagraphs": []
- }
- ]
- },
- {
- "node_id": "0.14",
- "text": "Leakage-resilient PRFs",
- "annotations": [],
- "metadata": {
- "paragraph_type": "section",
- "page_id": 0,
- "line_id": 0,
- "other_fields": {}
- },
- "subparagraphs": [
- {
- "node_id": "0.14.0",
- "text": "Security-unbounded implementations. Let us now consider (stateless) leakage-resilient PRFs. As already mentioned, those constructions only bound the adversary's data complexity. The main observation in this case is that if random plaintexts are considered, such implementations can only be security-unbounded (with the slight cautionary note that we give below). This fact can be easily explained when the PRF is instantiated with an unprotected software implementation of the AES. What happens then is that the adversary can repeat his measurements to get rid of the physical noise, and consequently move from the security graph of Appendix A, Figure 8-left to the one of Appendix A, Figure 13-right. Such a \"repeating\" attack is exactly the one already mentioned in [34] to argue that bounded data complexity is not enough to bound (computational) security. In fact, it similarly applies to masked implementations. The only difference is that the adversary will not average his measurements, but rather combine them as in Equation 5. This is because given a leakage function, e.g. the Hamming weight one that leads to 9 distinguishable events, the distribution of the measurements in a masked implementation will lead to the same number of distinguishable events: the only difference is that more sampling will be necessary to distinguish them (see the appendices in [60] for a plot of these distributions). So if the number of measurements is not bounded, attacks with low time complexities as in Appendix A, Figure 13 right will always exist.One important consequence is that using the PRF construction in this context is essentially useless for all the AES implementations we consider in this paper. The only way to maintain a target security level for such stateless primitives is to limit the number of measurements by putting a constraint on the lifetime of the system. And this lifetime will be selected according to the maximum number of measurements tolerated that can be extracted from our security graphs, which now highly depends on the countermeasure selected. In other words, we can only evaluate the cost function and the security level attained independently in this case, as illustrated in Figure 6 for our software instances (the 100-bit security level is again given in Appendix A, Figure 17-left). Here, we naturally come back to the standard result that Boolean (resp. polynomial) masking increases security at the cost of performance overheads that are roughly quadratic (resp. cubic) in the number of shares. Note that the security level of the 1-mask polynomial scheme is higher than the 2-mask Boolean one for the noise variance we consider, which is consistent with the previous work of Roche and Prouff [54]. Similar conclusions are obtained with hardware implementations (Appendix A, Figure 17-right and Appendix A, Figure 18), for which the impact of glitches is now clearly visible. For example, a factor f = 10 essentially multiplies the number of measurements by f for the Boolean masking with first-order glitches, and f 2 for the threshold implementation with second-order glitches. Cautionary note. The statement that stateless leakage-resilient PRFs can only be security unbounded if known plaintexts are considered essentially relates to the fact that repeated measurements allow removing the effect of the noise and the masks in a leaking implementation. Yet, this claim should be slightly mitigated in the case of algorithmic noise in hardware implementations. 
Indeed, this part of the noise can only be averaged up to the data complexity bound that is imposed by the PRF design. Taking the example of our hardware implementations where all 16 S-boxes are manipulated in parallel, the SNR corresponding to algorithmic noise can be computed as the ratio between the variance of a uniformly distributed 8-bit values's Hamming weight (i.e. 2) and the variance of 15 such values (i.e. 30). Averaging this noise over M plaintexts will lead to SNRs of 1 15/M , which is already larger than 17 if M = 256 (i.e. a noise level for which the security graph will be extremely close to the worst case one of Appendix A, Figure 13-right). So although there is a \"gray area\" where a leakage-resilient PRF implemented in hardware can be (weakly) security-bounded, these contexts are of quite limited interest because the will imply bounds on the data complexity that are below 256, i.e. they anyway lead to less efficient solutions than the tweaked construction that we investigate in the next subsection.Security-bounded implementations. As just discussed, stateless primitives hardly lead to security bounded implementations if physical and algorithmic noise can be averaged -which is straightforwardly feasible in a known plaintext scenario. The tweaked construction in [34] aims at avoiding such a weakness by preventing the averaging of the algorithmic noise, thanks to the combined effect of hardware parallelism and carefully chosen plaintexts leading to keydependencies in this noise. Since only the physical noise can be averaged in this case, the bounded data complexity that the leakage-resilient PRF guarantees consequently leads to security-bounded implementations again. This is illustrated both by the standard DPAs (such as in Appendix A, Figures 10-right and 12-left) and the iterative attacks (such as in Appendix A, Figure 13) that can be performed against this PRF 5 . As in Section 5.1, we extracted the maximum data complexity D from these graphs, and produced as global cost metric:where the first factor corresponds to the (rounded) average number of AES encryptions needed to produce a 128-bit output, and the second one is the cost function of Table 1. A comparison of our different leakage-resilient PRFs instantiated with a hardware implementation of the AES and chosen plaintexts is given in Figure 7. Here again, we observe that the most efficient solution is to consider an unprotected design. Interestingly, we also observe that for the unprotected AES, the iterative attack is the worst case for the 80-bit security level (where it forces the re-keying after 97 plaintexts vs. 256 for the standard DPA), while the standard DPA is the worst-case for the 120-bit security level (where it forces the re-keying after 10 plaintexts vs. 37 for the iterative attack). This nicely fits the intuition that iterative attacks become more powerful as the data complexity increases, i.e. when the additional time complexity corresponding to the enumeration of a permutation over 16 bytes becomes small compared to the time complexity required to recover the 16 AES key bytes (unordered). ",
- "annotations": [
- {
- "start": 768,
- "end": 772,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 4800,
- "end": 4804,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 1369,
- "end": 1373,
- "name": "bibliography_ref",
- "value": "bac4e610-f290-11ee-a6ed-b88584b4e4a1"
- },
- {
- "start": 2732,
- "end": 2736,
- "name": "bibliography_ref",
- "value": "bac4e5d7-f290-11ee-a6ed-b88584b4e4a1"
- },
+ "subparagraphs": [
{
- "start": 5703,
- "end": 5704,
- "name": "table",
- "value": "d2ce350a-25be-4d05-9061-6f1d4cf8bdd1"
+ "node_id": "0.9.1.0",
+ "text": "Let M be the maximum number of measurements tolerated to maintain a given security level for one of the implementations in section 4. The re-keying in leakage-resilient PRGs is such that it is exactly this number M that is limited by design (i.e. the value N in Figure 1 bounds M for the adversary), hence directly leading to security-bounded implementations. The global cost metric we use in this case can be written as M M -1 × cost f unction, where the first factor corresponds to the average number of AES encryptions that are used to produce each 128-bit output string, and the second one is the cost function of Table 1.A comparison of different leakage-resilient PRG implementations in software (i.e. based on different unprotected and protected AES implementations) is given in Figure 5 for 80-bit and 120-bit security levels (the results for 100-bit security are in Appendix A, Figure 15, left). The main observation in this context is that the straightforward implementation of the PRG with an unprotected AES design is the most efficient solution. This is mainly because moving from the smallest M value (i.e. M = 2, as imposed by the 120-bit security level in the unprotected case -see Figure 8-left) to large ones (e.g. M > 1000 for masked implementations) can only lead to a gain factor of 2 for the global cost metric, which is not justified in view of the performance overheads due to the masking. For a similar reason (i.e. the limited interest of increasing M ), the global cost metric is essentially independent of the target security level in the figure. In other words, there is little interest in decreasing this security level since it leads to poor performance improvements. The hardware implementations in Appendix A, Figures 15-right and 16 lead to essentially similar intuitions, as also witnessed by the limited impact of decreasing the amplitude of the glitch signal with the f factor (see the KHB * 1 and KHT * 2 implementations for which f = 10 in the latter figures).",
+ "annotations": [
+ {
+ "start": 624,
+ "end": 625,
+ "name": "table",
+ "value": "1c9f98e6-e1f8-49f3-8bf7-24022f2d1939"
+ }
+ ],
+ "metadata": {
+ "paragraph_type": "raw_text",
+ "page_id": 0,
+ "line_id": 0
+ },
+ "subparagraphs": []
}
- ],
+ ]
+ },
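The global cost metric for leakage-resilient PRGs reconstructed above, M/(M-1) × cost function, directly explains the "gain factor of 2" bound mentioned in the text. A small sketch with hypothetical cost-function values (the real ones live in the paper's Table 1):

```python
def prg_global_cost(M: int, cost_function: float) -> float:
    # M/(M-1) AES runs per 128-bit output, times the implementation cost.
    return M / (M - 1) * cost_function

# Going from M = 2 (unprotected, 120-bit level) to very large M can at most
# halve the first factor, so a countermeasure whose cost function is more
# than twice the unprotected one never wins under this metric.
print(prg_global_cost(2, 1.0))     # 2.0  (hypothetical unprotected cost: 1.0)
print(prg_global_cost(1000, 4.0))  # ~4.0 (hypothetical masked cost: 4.0)
```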
+ {
+ "node_id": "0.9.2",
+ "text": "5.2 Leakage-resilient PRFs",
+ "annotations": [],
"metadata": {
- "paragraph_type": "raw_text",
+ "paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
- "subparagraphs": []
+ "subparagraphs": [
+ {
+ "node_id": "0.9.2.0",
+ "text": "Security-unbounded implementations. Let us now consider (stateless) leakage-resilient PRFs. As already mentioned, those constructions only bound the adversary's data complexity. The main observation in this case is that if random plaintexts are considered, such implementations can only be security-unbounded (with the slight cautionary note that we give below). This fact can be easily explained when the PRF is instantiated with an unprotected software implementation of the AES. What happens then is that the adversary can repeat his measurements to get rid of the physical noise, and consequently move from the security graph of Appendix A, Figure 8-left to the one of Appendix A, Figure 13-right. Such a \"repeating\" attack is exactly the one already mentioned in [34] to argue that bounded data complexity is not enough to bound (computational) security. In fact, it similarly applies to masked implementations. The only difference is that the adversary will not average his measurements, but rather combine them as in Equation 5. This is because given a leakage function, e.g. the Hamming weight one that leads to 9 distinguishable events, the distribution of the measurements in a masked implementation will lead to the same number of distinguishable events: the only difference is that more sampling will be necessary to distinguish them (see the appendices in [60] for a plot of these distributions). So if the number of measurements is not bounded, attacks with low time complexities as in Appendix A, Figure 13 right will always exist.One important consequence is that using the PRF construction in this context is essentially useless for all the AES implementations we consider in this paper. The only way to maintain a target security level for such stateless primitives is to limit the number of measurements by putting a constraint on the lifetime of the system. And this lifetime will be selected according to the maximum number of measurements tolerated that can be extracted from our security graphs, which now highly depends on the countermeasure selected. In other words, we can only evaluate the cost function and the security level attained independently in this case, as illustrated in Figure 6 for our software instances (the 100-bit security level is again given in Appendix A, Figure 17-left). Here, we naturally come back to the standard result that Boolean (resp. polynomial) masking increases security at the cost of performance overheads that are roughly quadratic (resp. cubic) in the number of shares. Note that the security level of the 1-mask polynomial scheme is higher than the 2-mask Boolean one for the noise variance we consider, which is consistent with the previous work of Roche and Prouff [54]. Similar conclusions are obtained with hardware implementations (Appendix A, Figure 17-right and Appendix A, Figure 18), for which the impact of glitches is now clearly visible. For example, a factor f = 10 essentially multiplies the number of measurements by f for the Boolean masking with first-order glitches, and f 2 for the threshold implementation with second-order glitches. Cautionary note. The statement that stateless leakage-resilient PRFs can only be security unbounded if known plaintexts are considered essentially relates to the fact that repeated measurements allow removing the effect of the noise and the masks in a leaking implementation. Yet, this claim should be slightly mitigated in the case of algorithmic noise in hardware implementations. 
Indeed, this part of the noise can only be averaged up to the data complexity bound that is imposed by the PRF design. Taking the example of our hardware implementations where all 16 S-boxes are manipulated in parallel, the SNR corresponding to algorithmic noise can be computed as the ratio between the variance of a uniformly distributed 8-bit values's Hamming weight (i.e. 2) and the variance of 15 such values (i.e. 30). Averaging this noise over M plaintexts will lead to SNRs of 1 15/M , which is already larger than 17 if M = 256 (i.e. a noise level for which the security graph will be extremely close to the worst case one of Appendix A, Figure 13-right). So although there is a \"gray area\" where a leakage-resilient PRF implemented in hardware can be (weakly) security-bounded, these contexts are of quite limited interest because the will imply bounds on the data complexity that are below 256, i.e. they anyway lead to less efficient solutions than the tweaked construction that we investigate in the next subsection.Security-bounded implementations. As just discussed, stateless primitives hardly lead to security bounded implementations if physical and algorithmic noise can be averaged -which is straightforwardly feasible in a known plaintext scenario. The tweaked construction in [34] aims at avoiding such a weakness by preventing the averaging of the algorithmic noise, thanks to the combined effect of hardware parallelism and carefully chosen plaintexts leading to keydependencies in this noise. Since only the physical noise can be averaged in this case, the bounded data complexity that the leakage-resilient PRF guarantees consequently leads to security-bounded implementations again. This is illustrated both by the standard DPAs (such as in Appendix A, Figures 10-right and 12-left) and the iterative attacks (such as in Appendix A, Figure 13) that can be performed against this PRF 5 . As in Section 5.1, we extracted the maximum data complexity D from these graphs, and produced as global cost metric:where the first factor corresponds to the (rounded) average number of AES encryptions needed to produce a 128-bit output, and the second one is the cost function of Table 1. A comparison of our different leakage-resilient PRFs instantiated with a hardware implementation of the AES and chosen plaintexts is given in Figure 7. Here again, we observe that the most efficient solution is to consider an unprotected design. Interestingly, we also observe that for the unprotected AES, the iterative attack is the worst case for the 80-bit security level (where it forces the re-keying after 97 plaintexts vs. 256 for the standard DPA), while the standard DPA is the worst-case for the 120-bit security level (where it forces the re-keying after 10 plaintexts vs. 37 for the iterative attack). This nicely fits the intuition that iterative attacks become more powerful as the data complexity increases, i.e. when the additional time complexity corresponding to the enumeration of a permutation over 16 bytes becomes small compared to the time complexity required to recover the 16 AES key bytes (unordered). ",
+ "annotations": [
+ {
+ "start": 768,
+ "end": 772,
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 4800,
+ "end": 4804,
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 1369,
+ "end": 1373,
+ "name": "reference",
+ "value": "109fc63c-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 2732,
+ "end": 2736,
+ "name": "reference",
+ "value": "109ed6dc-0872-11ef-b95c-0242ac120002"
+ },
+ {
+ "start": 5703,
+ "end": 5704,
+ "name": "table",
+ "value": "1c9f98e6-e1f8-49f3-8bf7-24022f2d1939"
+ }
+ ],
+ "metadata": {
+ "paragraph_type": "raw_text",
+ "page_id": 0,
+ "line_id": 0
+ },
+ "subparagraphs": []
+ }
+ ]
}
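The algorithmic-noise arithmetic from the cautionary note above can be replayed in a few lines (a sanity check of the quoted numbers, not code from the paper):

```python
# One uniform byte's Hamming weight has variance 2; the 15 other parallel
# S-boxes contribute algorithmic noise of variance 15 * 2 = 30.
signal_var, noise_var = 2.0, 15 * 2.0
print(signal_var / noise_var)         # SNR before averaging: 1/15

# Averaging the algorithmic noise over M plaintexts divides its variance by M,
# so the SNR grows to M/15; at the design bound M = 256 it already exceeds 17.
M = 256
print(signal_var / (noise_var / M))   # -> 17.07
```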
]
},
{
- "node_id": "0.15",
- "text": "Conclusion",
+ "node_id": "0.10",
+ "text": "6 Conclusion",
"annotations": [],
"metadata": {
"paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
- "node_id": "0.15.0",
+ "node_id": "0.10.0",
"text": "The results in this work essentially show that masking and leakage-resilient constructions hardly combine constructively. For (stateful) PRGs, our experiments indicate that both for software and hardware implementations, a leakageresilient design instantiated with an unprotected AES is the most efficient solution to reach any given security level. For stateless PRFs, they rather show that a bounded data complexity guarantee is (mostly) ineffective in bounding the (computational) complexity of the best attacks. So implementing masking and limiting the lifetime of the cryptographic implementation is the best solution in this case. Nevertheless, the chosen-plaintext tweak proposed in [34] is an interesting exception to this conclusion, as it leads to security-bounded hardware implementations for stateless primitives that are particularly interesting from an application point-of-view, e.g. for re-synchronization, challenge-response protocols, . . . Beyond the further analysis of such constructions, their extension to software implementations is an interesting scope for further research. In this respect, the combination of a chosen-plaintext leakage-resilient PRF with the shuffling countermeasure in [62] seems promising, as it could \"emulate\" the keydependent algorithmic noise ensuring security bounds in hardware. ",
"annotations": [
{
"start": 690,
"end": 694,
- "name": "bibliography_ref",
- "value": "bac4e539-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "109c08ee-0872-11ef-b95c-0242ac120002"
},
{
"start": 1214,
"end": 1218,
- "name": "bibliography_ref",
- "value": "bac4e623-f290-11ee-a6ed-b88584b4e4a1"
+ "name": "reference",
+ "value": "10a004ee-0872-11ef-b95c-0242ac120002"
}
],
"metadata": {
"paragraph_type": "raw_text",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": []
}
]
},
{
- "node_id": "0.16",
+ "node_id": "0.11",
"text": "A Additional figures",
"annotations": [],
"metadata": {
"paragraph_type": "section",
"page_id": 0,
- "line_id": 0,
- "other_fields": {}
- },
- "subparagraphs": []
- },
- {
- "node_id": "0.17",
- "text": "\n",
- "annotations": [],
- "metadata": {
- "paragraph_type": "section",
- "page_id": 0,
- "line_id": 0,
- "other_fields": {}
+ "line_id": 0
},
"subparagraphs": [
{
- "node_id": "0.17.0",
- "text": "Acknowledgements. F.-X. Standaert is an associate researcher of the . Work funded in parts by the through the project (CRASH) and the grant B- project.",
- "annotations": [],
- "metadata": {
- "paragraph_type": "raw_text",
- "page_id": 0,
- "line_id": 0,
- "other_fields": {}
- },
- "subparagraphs": []
- }
- ]
- },
- {
- "node_id": "0.18",
- "text": "
Acknowledgements. F.-X. Standaert is an associate researcher of the Belgian Fund for Scientific Research (FNRS-F.R.S.). Work funded in parts by the European Commission through the ERC project 280141 (CRASH) and the European ISEC action grant HOME/2010/ISEC/AG/INT-011 B-CCENTRE project.