add new info to tutorial
Nikita Shevtsov committed Nov 20, 2023
1 parent 49fc3d5 commit 9c235e7
Showing 3 changed files with 167 additions and 39 deletions.
91 changes: 78 additions & 13 deletions docs/source/_static/code_examples/dedoc_creating_dedoc_document.py
@@ -1,5 +1,9 @@
import uuid

from dedoc.data_structures import BoldAnnotation, CellWithMeta, HierarchyLevel, LineMetadata, LineWithMeta, \
LinkedTextAnnotation, Table, TableMetadata, AttachedFile, UnstructuredDocument
LinkedTextAnnotation, Table, TableMetadata, AttachedFile, UnstructuredDocument, TableAnnotation, AttachAnnotation
from dedoc.structure_constructors import TreeConstructor


text = "Simple text line"
simple_line = LineWithMeta(text)
@@ -12,10 +16,7 @@
)

metadata = LineMetadata(page_id=0, line_id=1, tag_hierarchy_level=None, hierarchy_level=hierarchy_level, other_fields=None)
annotations = [
LinkedTextAnnotation(0, 5, "Now it isn't so simple :)"),
BoldAnnotation(7, 10, "True")
]
annotations = [LinkedTextAnnotation(0, 5, "Now the line isn't so simple :)"), BoldAnnotation(7, 10, "True")]

super_line = LineWithMeta(text, metadata=metadata, annotations=annotations)

@@ -29,22 +30,86 @@
cells_row = []
for cell_text in row:
line_with_meta = LineWithMeta(cell_text, metadata=LineMetadata(page_id=0, line_id=None), annotations=[])
cell = CellWithMeta(lines=[line_with_meta])
cell = CellWithMeta(lines=[line_with_meta]) # CellWithMeta contains list of LineWithMeta
cells_row.append(cell)
cells_with_meta.append(cells_row)

table_metadata = TableMetadata(page_id=0, uid="table 1")
table_metadata = TableMetadata(page_id=0, uid="table")
table = Table(cells=cells_with_meta, metadata=table_metadata)

attached_file = AttachedFile(original_name="docx_example.png", tmp_file_path="test_dir/docx_example.png", need_content_analysis=True, uid='?')
table_line_metadata = LineMetadata(
page_id=0,
line_id=None,
hierarchy_level=HierarchyLevel(
level_1=1,
level_2=0,
can_be_multiline=False,
line_type="raw_text"
),
)
table_line = LineWithMeta("Line with simple table", metadata=table_line_metadata, annotations=[TableAnnotation("table", 0, 21)])

unstructured_document = UnstructuredDocument(tables=[table], lines=[super_line], attachments=[attached_file])
table_cells = [["Last name First name Patronymic", "Last name First name Patronymic", "Last name First name Patronymic"],
["Ivanov", "Ivan", "Ivanovich"],
["Petrov", "Petr", "Petrovich"]]
cells_with_meta = []  # start a new list so the first table's cells are not reused
from dedoc.metadata_extractors import BaseMetadataExtractor
metadata = BaseMetadataExtractor().extract_metadata(directory="./", filename="example.docx", converted_filename="example.doc", original_filename="example.docx")
unstructured_document.metadata = metadata
for row in table_cells:
cells_row = []
for cell_text in row:
line_with_meta = LineWithMeta(cell_text, metadata=LineMetadata(page_id=0, line_id=None), annotations=[])
cell = CellWithMeta([line_with_meta]) # CellWithMeta contains list of LineWithMeta
cells_row.append(cell)
cells_with_meta.append(cells_row)

cells_with_meta[0][0].colspan = 3
cells_with_meta[0][1].invisible = True
cells_with_meta[0][2].invisible = True

table_metadata = TableMetadata(page_id=0, uid="complicated_table")
complicated_table = Table(cells=cells_with_meta, metadata=table_metadata)

complicated_table_line_metadata = LineMetadata(
page_id=0,
line_id=None,
hierarchy_level=HierarchyLevel(
level_1=1,
level_2=0,
can_be_multiline=False,
line_type="raw_text"
),
)
complicated_table_line = LineWithMeta("complicated table line", metadata=complicated_table_line_metadata, annotations=[TableAnnotation("complicated_table", 0, 21)])

attached_file = AttachedFile(original_name="docx_example.png", tmp_file_path="test_dir/docx_example.png", need_content_analysis=False, uid=str(uuid.uuid4()))

attached_file_line_metadata = LineMetadata(
page_id=0,
line_id=None,
hierarchy_level=HierarchyLevel(
level_1=1,
level_2=0,
can_be_multiline=False,
line_type="raw_text"
),
)
attached_file_line = LineWithMeta("Line with attached file", metadata=attached_file_line_metadata, annotations=[AttachAnnotation(attached_file.uid, 0, 21)])

unstructured_document = UnstructuredDocument(
tables=[table, complicated_table],
lines=[super_line, table_line, complicated_table_line, attached_file_line],
attachments=[attached_file]
)

unstructured_document.metadata = {
"file_name": "my_document.txt",
"temporary_file_name": "my_document.txt",
"file_type": "txt",
"size": 11111, # in bytes
"access_time": 1696381364,
"created_time": 1696316594,
"modified_time": 1696381364
}

from dedoc.structure_constructors import TreeConstructor
structure_constructor = TreeConstructor()
parsed_document = structure_constructor.structure_document(document=unstructured_document, structure_type="tree")

Binary file added docs/source/_static/table_merged_horizontal.png
115 changes: 89 additions & 26 deletions docs/source/tutorials/creating_document_classes.rst
@@ -4,11 +4,11 @@ Creating Dedoc Document from basic data structures in code
Let's dig inside Dedoc data structures and build Dedoc document from scratch. During this tutorial you will learn:

* How to use data structures of Dedoc to store text, structure, tables, annotations, metadata, attachments
* What is Dedoc unified output representation of document
* What is inside the Dedoc unified output representation of document
* How document structure is defined

Raw document content is stored in :class:`~dedoc.data_structures.UnstructuredDocument`. This is de facto
a container with lists of data strucutres objects:
a container with lists of data structures objects:

* list of :class:`~dedoc.data_structures.Table`
* list of text lines :class:`~dedoc.data_structures.LineWithMeta`
@@ -21,17 +21,19 @@ Order of data structures in lists doesn't matter. All document hierarchy and str
LineWithMeta
------------

Basic block of Dedoc document is :class:`~dedoc.data_structures.LineWithMeta`:
The basic building block of a Dedoc document is :class:`~dedoc.data_structures.LineWithMeta` (a line with metadata):

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 4-5
:lines: 8-9

To specify hierarchy you should use :class:`~dedoc.data_structures.HierarchyLevel` class:
Each document contains a hierarchy of its elements. For example, a header line should be on a higher level than ordinary
paragraph lines. The hierarchy level is produced by :ref:`dedoc_structure_extractors` and may vary depending on the type
of document. To specify the hierarchy in our handmade document, use the :class:`~dedoc.data_structures.HierarchyLevel` class:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 7-11
:lines: 11-16
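
As a rough illustration of how a two-part level can order lines into a tree, here is a minimal sketch that does not use Dedoc at all. The ``SketchLine`` class, the ``nesting_depths`` helper and the level numbers are hypothetical; the sketch only assumes that levels are compared as ``(level_1, level_2)`` pairs and that smaller pairs sit closer to the root. Dedoc's own constructors may handle details such as multiline elements differently.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class SketchLine:
    """Hypothetical stand-in for a line; this is not Dedoc's LineWithMeta."""
    text: str
    level_1: int
    level_2: int

    def level(self) -> Tuple[int, int]:
        return (self.level_1, self.level_2)


def nesting_depths(lines: List[SketchLine]) -> List[int]:
    """Return the tree depth of each line; depth 1 is directly under the root."""
    stack: List[Tuple[int, int]] = []  # levels of the currently open ancestors
    depths = []
    for line in lines:
        # Close every open ancestor that is not strictly closer to the root.
        while stack and stack[-1] >= line.level():
            stack.pop()
        stack.append(line.level())
        depths.append(len(stack))
    return depths


lines = [
    SketchLine("Chapter 1", 1, 0),
    SketchLine("Section 1.1", 1, 1),
    SketchLine("Some paragraph", 2, 0),
    SketchLine("Chapter 2", 1, 0),
]
depths = nesting_depths(lines)
```

Python compares tuples element by element, so ``(1, 1)`` sorts after ``(1, 0)`` and before ``(2, 0)``, which is what places the section under the chapter and the paragraph under the section.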

Hierarchy levels are compared as tuples (``level_1``, ``level_2``): smaller values are closer to the root of the tree.
``level_1`` is the primary hierarchy dimension that defines the type of line:
@@ -55,20 +57,33 @@ Define metadata with :class:`~dedoc.data_structures.LineMetadata`:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 14
:lines: 18

Also there is an option to add some :ref:`annotations`:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 15-18
:lines: 19
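
To see which characters the two annotations in the example refer to, the following sketch slices the text by the annotation bounds. It assumes that ``start`` is inclusive and ``end`` is exclusive, like a Python slice. ``SketchAnnotation`` and the names ``"linked_text"`` and ``"bold"`` are illustrative stand-ins, not Dedoc classes.

```python
from typing import NamedTuple


class SketchAnnotation(NamedTuple):
    """Hypothetical stand-in for an annotation: a named value on a text span."""
    start: int
    end: int
    name: str
    value: str


def covered_text(text: str, annotation: SketchAnnotation) -> str:
    # Assumption: the span is half-open, [start, end).
    return text[annotation.start:annotation.end]


text = "Simple text line"
annotations = [
    SketchAnnotation(0, 5, "linked_text", "Now the line isn't so simple :)"),
    SketchAnnotation(7, 10, "bold", "True"),
]
spans = {annotation.name: covered_text(text, annotation) for annotation in annotations}
```

Under that assumption the first annotation covers ``"Simpl"`` and the second covers ``"tex"``.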

Now you can create a new :class:`~dedoc.data_structures.LineWithMeta` with hierarchy level, metadata and annotations:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 20

:lines: 21

A few words about the ``tag_hierarchy_level`` parameter: some readers extract information about hierarchy
directly from tags in the document. Dedoc stores this information as a :class:`~dedoc.data_structures.HierarchyLevel` object
in the ``tag_hierarchy_level`` property of :class:`~dedoc.data_structures.LineMetadata`. The readers that
create ``tag_hierarchy_level`` are:

* :class:`~dedoc.readers.DocxReader`
* :class:`~dedoc.readers.EmailReader`
* :class:`~dedoc.readers.HtmlReader`
* :class:`~dedoc.readers.JsonReader`
* :class:`~dedoc.readers.PdfImageReader`
* :class:`~dedoc.readers.PdfTabbyReader`
* :class:`~dedoc.readers.RawTextReader`

Table
-----
@@ -77,65 +92,113 @@ Imagine you have table like this:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 22-25
:lines: 23-26

The main building block of a table is :class:`~dedoc.data_structures.CellWithMeta`. To create a table, you should
make a list of lists of :class:`~dedoc.data_structures.CellWithMeta`.

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 27-34
:lines: 28-35

A table also has some metadata. Let's assume that our table is on the first page.
Use :class:`~dedoc.data_structures.TableMetadata`:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 36
:lines: 37

Finally, create :class:`~dedoc.data_structures.Table`:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 36
:lines: 38

To place the table at a specific position in the hierarchy, create a :class:`~dedoc.data_structures.LineWithMeta`
with a :class:`~dedoc.data_structures.TableAnnotation`:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 40-50
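
The link between a line and a table is just a shared identifier: the annotation carries the table's ``uid``. The sketch below shows that lookup with plain Python types. ``SketchTable``, ``SketchTableRef`` and ``resolve_tables`` are hypothetical names used only for illustration and are not part of Dedoc.

```python
from typing import List, NamedTuple


class SketchTable(NamedTuple):
    """Hypothetical table: a uid plus rows of cell texts."""
    uid: str
    rows: List[List[str]]


class SketchTableRef(NamedTuple):
    """Hypothetical annotation that points at a table by its uid."""
    table_uid: str
    start: int
    end: int


def resolve_tables(refs: List[SketchTableRef], tables: List[SketchTable]) -> List[SketchTable]:
    """Return the tables that the references point at, skipping unknown uids."""
    by_uid = {table.uid: table for table in tables}
    return [by_uid[ref.table_uid] for ref in refs if ref.table_uid in by_uid]


tables = [
    SketchTable("table", [["a", "b"]]),
    SketchTable("complicated_table", [["c", "d"]]),
]
refs = [SketchTableRef("complicated_table", 0, 21), SketchTableRef("missing", 0, 5)]
resolved = resolve_tables(refs, tables)
```

Because the order of tables in the document's list does not matter, a mistyped ``uid`` is the main way for a table to lose its place in the hierarchy.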

Let's try to construct a more complicated table, such as this one:

.. image:: ../_static/table_merged_horizontal.png
:width: 700px

The first steps are almost the same as for the previous table:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 52-62

Then set the ``colspan`` parameter of the first cell of the first row to 3, as in HTML.
Set ``invisible`` to ``True`` on the other two cells of the row:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 64-66
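
To make the ``colspan`` and ``invisible`` convention concrete, here is a small sketch that renders a row to HTML while keeping the grid rectangular. ``SketchCell``, ``visible_width`` and ``to_html_row`` are hypothetical and only mimic the two attributes used in the example; they are not Dedoc's ``CellWithMeta`` or its rendering code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SketchCell:
    """Hypothetical cell with the two attributes used for merged cells."""
    text: str
    colspan: int = 1
    invisible: bool = False


def visible_width(row: List[SketchCell]) -> int:
    """Number of grid columns covered by the visible cells of a row."""
    return sum(cell.colspan for cell in row if not cell.invisible)


def to_html_row(row: List[SketchCell]) -> str:
    parts = []
    for cell in row:
        if cell.invisible:
            continue  # this position is covered by a neighbouring merged cell
        span = f' colspan="{cell.colspan}"' if cell.colspan > 1 else ""
        parts.append(f"<td{span}>{cell.text}</td>")
    return "<tr>" + "".join(parts) + "</tr>"


header = [SketchCell("Last name First name Patronymic") for _ in range(3)]
header[0].colspan = 3
header[1].invisible = True
header[2].invisible = True
body = [SketchCell("Ivanov"), SketchCell("Ivan"), SketchCell("Ivanovich")]

header_html = to_html_row(header)
```

Keeping the hidden cells in the list means every row still has three entries, so a cell can be addressed by row and column index even when cells are merged.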

The table is done!

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 68-69

Add the table to a :class:`~dedoc.data_structures.LineWithMeta`:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 71-81

AttachedFile
------------

Also we can attach some files: TODO что такое uid
Also we can attach some files:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 39
:lines: 83

Unstructured Document
---------------------
As with tables, link the attachment to a line using :class:`~dedoc.data_structures.AttachAnnotation`:

Now we are ready to create :class:`~dedoc.data_structures.UnstructuredDocument` object:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 41
:lines: 85-95


Unstructured Document
---------------------

There is an option to add file metadata to document:
Now we are ready to create :class:`~dedoc.data_structures.UnstructuredDocument` object:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 43-45
:lines: 97-101


Parsed Document
---------------

There are several ways how the structure of document can be represented. In this tutorial
we will utilize :class:`~dedoc.structure_constructors.TreeConstructor` that
returns document tree from unstrucutred document:

returns a document tree from an unstructured document. However, we need to add some file
metadata to create the tree representation. File metadata is usually extracted by Dedoc, but because we are
building the document from scratch, we have to add it ourselves.

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 47-51
:lines: 103-111
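
If the document corresponds to a real file on disk, the same fields can be filled from the file system instead of being typed by hand. The sketch below uses only the standard library; ``sketch_file_metadata`` is a hypothetical helper, and the key names are simply copied from the dictionary in the example, so it is not a description of how Dedoc's own metadata extractor works.

```python
import os
import tempfile
from typing import Dict, Union


def sketch_file_metadata(path: str) -> Dict[str, Union[str, int]]:
    """Build a metadata dict with the same keys as the handwritten example."""
    stat = os.stat(path)
    name = os.path.basename(path)
    return {
        "file_name": name,
        "temporary_file_name": name,
        "file_type": os.path.splitext(name)[1].lstrip("."),
        "size": stat.st_size,  # in bytes
        "access_time": int(stat.st_atime),
        # Note: on Unix st_ctime is the time of the last metadata change, not creation.
        "created_time": int(stat.st_ctime),
        "modified_time": int(stat.st_mtime),
    }


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "my_document.txt")
    with open(path, "w") as file:
        file.write("hello")
    file_metadata = sketch_file_metadata(path)
```

The timestamps are Unix times in seconds, which matches the integer values used in the handwritten dictionary.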

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 113-114

Great job! You just created your first document in Dedoc format from scratch!
To get the tree as a dict:

.. literalinclude:: ../_static/code_examples/dedoc_creating_dedoc_document.py
:language: python
:lines: 116

Great job! You just created from scratch your first document in Dedoc format!
