Commit eaddf59

JorjMcKie authored and julian-smith-artifex-com committed
Address #4324
Offer option to return raw clustering results.
1 parent f6e81d6 commit eaddf59

2 files changed: +8 -3 lines changed

docs/page.rst

Lines changed: 4 additions & 2 deletions
@@ -439,15 +439,17 @@ In a nutshell, this is what you can do with PyMuPDF:
       .. image:: images/img-markers.*
          :scale: 100

-   .. method:: cluster_drawings(clip=None, drawings=None, x_tolerance=3, y_tolerance=3)
+   .. method:: cluster_drawings(clip=None, drawings=None, x_tolerance=3, y_tolerance=3, final_filter=True)

       Cluster vector graphics (synonyms are line-art or drawings) based on their geometrical vicinity. The method walks through the output of :meth:`Page.get_drawings` and joins paths whose `path["rect"]` are closer to each other than some tolerance values (given in the arguments). The result is a list of rectangles that each wrap things like tables (with gridlines), pie charts, bar charts, etc.

       :arg rect_like clip: only consider paths inside this area. The default is the full page.

       :arg list drawings: (optional) provide a previously generated output of :meth:`Page.get_drawings`. If `None` the method will execute the method.

-      :arg float x_tolerance:
+      :arg float x_tolerance / y_tolerance: Assume vector graphics to be close enough neighbors for belonging to the same rectangle. Default is 3 points.
+
+      :arg bool final_filter: If `True` (default), the method will to remove rectangles having width or height smaller than the respective tolerance value. If `False` no such filtering is done.

    .. method:: find_tables(clip=None, strategy=None, vertical_strategy=None, horizontal_strategy=None, vertical_lines=None, horizontal_lines=None, snap_tolerance=None, snap_x_tolerance=None, snap_y_tolerance=None, join_tolerance=None, join_x_tolerance=None, join_y_tolerance=None, edge_min_length=3, min_words_vertical=3, min_words_horizontal=1, intersection_tolerance=None, intersection_x_tolerance=None, intersection_y_tolerance=None, text_tolerance=None, text_x_tolerance=None, text_y_tolerance=None, add_lines=None)

src/__init__.py

Lines changed: 4 additions & 1 deletion
@@ -9170,7 +9170,8 @@ def remove_rotation(self):
         return rot # the inverse of the generated derotation matrix

     def cluster_drawings(
-        self, clip=None, drawings=None, x_tolerance: float = 3, y_tolerance: float = 3
+        self, clip=None, drawings=None, x_tolerance: float = 3, y_tolerance: float = 3,
+        final_filter: bool = True,
     ) -> list:
         """Join rectangles of neighboring vector graphic items.

@@ -9263,6 +9264,8 @@ def are_neighbors(r1, r2):
             prects = sorted(set(prects), key=lambda r: (r.y1, r.x0))

         new_rects = sorted(set(new_rects), key=lambda r: (r.y1, r.x0))
+        if not final_filter:
+            return new_rects
         return [r for r in new_rects if r.width > delta_x and r.height > delta_y]

     def get_fonts(self, full=False):
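
The effect of the new `final_filter` flag on the tail of `cluster_drawings` can be sketched in isolation. This is a minimal illustration, not the library code: `Rect` is a stand-in for `pymupdf.Rect`, `filter_clusters` is a hypothetical helper, and `delta_x` / `delta_y` are assumed to correspond to `x_tolerance` / `y_tolerance`, as the documentation change suggests.

```python
from typing import NamedTuple


class Rect(NamedTuple):
    """Minimal stand-in for pymupdf.Rect (illustration only)."""
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def width(self) -> float:
        return self.x1 - self.x0

    @property
    def height(self) -> float:
        return self.y1 - self.y0


def filter_clusters(rects, delta_x=3, delta_y=3, final_filter=True):
    """Hypothetical helper mirroring the last lines of cluster_drawings()."""
    # Deduplicate, then order by bottom edge and left edge, as in the diff.
    new_rects = sorted(set(rects), key=lambda r: (r.y1, r.x0))
    if not final_filter:
        return new_rects  # raw clustering result, nothing dropped
    # Keep only rectangles larger than the tolerance in both directions.
    return [r for r in new_rects if r.width > delta_x and r.height > delta_y]


clusters = [
    Rect(0, 0, 100, 50),    # table-sized cluster
    Rect(10, 60, 200, 61),  # thin horizontal rule, height 1
    Rect(0, 0, 100, 50),    # duplicate of the first
]
filtered = filter_clusters(clusters)
raw = filter_clusters(clusters, final_filter=False)
```

With the default setting the thin rule is dropped because its height does not exceed the tolerance. With `final_filter=False` it is kept, which seems to be the point of the "raw clustering results" mentioned in the commit message.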
