Commit 56594ab

Address #4324
Offer option to return raw clustering results.
1 parent cc114d7 commit 56594ab

2 files changed, +8 -3 lines changed

docs/page.rst

Lines changed: 4 additions & 2 deletions
@@ -438,15 +438,17 @@ In a nutshell, this is what you can do with PyMuPDF:
       .. image:: images/img-markers.*
          :scale: 100

-   .. method:: cluster_drawings(clip=None, drawings=None, x_tolerance=3, y_tolerance=3)
+   .. method:: cluster_drawings(clip=None, drawings=None, x_tolerance=3, y_tolerance=3, final_filter=True)

       Cluster vector graphics (synonyms are line-art or drawings) based on their geometrical vicinity. The method walks through the output of :meth:`Page.get_drawings` and joins paths whose `path["rect"]` are closer to each other than some tolerance values (given in the arguments). The result is a list of rectangles that each wrap things like tables (with gridlines), pie charts, bar charts, etc.

       :arg rect_like clip: only consider paths inside this area. The default is the full page.

       :arg list drawings: (optional) provide a previously generated output of :meth:`Page.get_drawings`. If `None` the method will execute the method.

-      :arg float x_tolerance:
+      :arg float x_tolerance / y_tolerance: Assume vector graphics to be close enough neighbors for belonging to the same rectangle. Default is 3 points.
+
+      :arg bool final_filter: If `True` (default), the method will remove rectangles having width or height smaller than the respective tolerance value. If `False` no such filtering is done.

    .. method:: find_tables(clip=None, strategy=None, vertical_strategy=None, horizontal_strategy=None, vertical_lines=None, horizontal_lines=None, snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None, join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None, edge_min_length=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=None, intersection_x_tolerance=None, intersection_y_tolerance=None, text_tolerance=None, text_x_tolerance=None, text_y_tolerance=None, add_lines=None)
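
The documentation text above describes joining path rectangles that lie within `x_tolerance` / `y_tolerance` of each other. The following is only a rough sketch of that idea using plain tuples; it is not the PyMuPDF implementation, and the names `Rect`, `are_close`, `union` and `cluster` are hypothetical helpers introduced for illustration.

```python
from typing import List, Tuple

# Hypothetical stand-in for path["rect"]: (x0, y0, x1, y1)
Rect = Tuple[float, float, float, float]


def are_close(a: Rect, b: Rect, x_tol: float, y_tol: float) -> bool:
    """True if the gap between a and b is at most x_tol horizontally
    and at most y_tol vertically (overlapping rectangles also qualify)."""
    return (
        a[0] - x_tol <= b[2]
        and b[0] - x_tol <= a[2]
        and a[1] - y_tol <= b[3]
        and b[1] - y_tol <= a[3]
    )


def union(a: Rect, b: Rect) -> Rect:
    """Smallest rectangle containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def cluster(rects: List[Rect], x_tol: float = 3, y_tol: float = 3) -> List[Rect]:
    """Greedily merge rectangles that are within the given tolerances."""
    pending = list(rects)
    clusters = []
    while pending:
        current = pending.pop(0)
        merged = True
        while merged:  # keep absorbing neighbors until the cluster stops growing
            merged = False
            for other in pending[:]:
                if are_close(current, other, x_tol, y_tol):
                    current = union(current, other)
                    pending.remove(other)
                    merged = True
        clusters.append(current)
    # Same ordering key as in the diff: bottom edge first, then left edge
    return sorted(clusters, key=lambda r: (r[3], r[0]))


boxes = [(0, 0, 10, 10), (12, 0, 20, 10), (100, 100, 110, 110)]
print(cluster(boxes))
```

In this sketch the first two boxes are 2 points apart horizontally, so they merge under the default tolerance of 3, while the third box stays separate.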

src/__init__.py

Lines changed: 4 additions & 1 deletion
@@ -9117,7 +9117,8 @@ def remove_rotation(self):
         return rot  # the inverse of the generated derotation matrix

     def cluster_drawings(
-        self, clip=None, drawings=None, x_tolerance: float = 3, y_tolerance: float = 3
+        self, clip=None, drawings=None, x_tolerance: float = 3, y_tolerance: float = 3,
+        final_filter: bool = True,
     ) -> list:
         """Join rectangles of neighboring vector graphic items.

@@ -9210,6 +9211,8 @@ def are_neighbors(r1, r2):
             prects = sorted(set(prects), key=lambda r: (r.y1, r.x0))

         new_rects = sorted(set(new_rects), key=lambda r: (r.y1, r.x0))
+        if not final_filter:
+            return new_rects
         return [r for r in new_rects if r.width > delta_x and r.height > delta_y]

     def get_fonts(self, full=False):
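
To illustrate what the new early return changes, here is a small self-contained sketch of the final filtering step in isolation. It assumes, based on the documentation change in this commit, that `delta_x` and `delta_y` correspond to `x_tolerance` and `y_tolerance`. `Box` and `apply_final_filter` are hypothetical names and not part of PyMuPDF.

```python
from typing import List, NamedTuple


class Box(NamedTuple):
    """Hypothetical minimal rectangle with the attributes the filter uses."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def apply_final_filter(
    boxes: List[Box],
    x_tolerance: float = 3,
    y_tolerance: float = 3,
    final_filter: bool = True,
) -> List[Box]:
    """Mirror the tail of the diff: return raw results, or drop small boxes."""
    if not final_filter:
        return list(boxes)  # raw clustering results, nothing removed
    # Strict comparison as in the diff: a box exactly as wide or as high
    # as the tolerance is also removed.
    return [b for b in boxes if b.width > x_tolerance and b.height > y_tolerance]


boxes = [
    Box(0, 0, 50, 40),    # large enough in both directions
    Box(0, 0, 100, 0.5),  # thin horizontal line
    Box(0, 0, 3, 20),     # width equal to the tolerance
]
print(apply_final_filter(boxes))                      # only the first box survives
print(apply_final_filter(boxes, final_filter=False))  # all three boxes
```

With `final_filter=False`, thin items such as single lines remain in the result, which appears to be the "raw clustering results" the commit message refers to.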
