Merge pull request #862 from jsvine/develop
v0.9.0
jsvine authored Apr 13, 2023
2 parents 1d5d646 + 3e0f9d7 commit 255eaac
Showing 20 changed files with 419 additions and 191 deletions.
19 changes: 19 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,25 @@

All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/).

## [0.9.0] - 2023-04-13

### Changed

- Make word segmentation (via `WordExtractor.char_begins_new_word(...)`) more explicit and rigorous; should help in catching edge cases in the future. ([6acd580](https://github.com/jsvine/pdfplumber/commit/6acd580) + [ebb93ea](https://github.com/jsvine/pdfplumber/commit/ebb93ea) + [#840](https://github.com/jsvine/pdfplumber/discussions/840#discussioncomment-5312166))
- Use `curve_edge` objects (instead of just `line` and `rect_edge` objects) in default table-detection strategy. ([6f6b465](https://github.com/jsvine/pdfplumber/commit/6f6b465) + [#858](https://github.com/jsvine/pdfplumber/discussions/858))
- By default, expand ligatures into their constituent letters (e.g., `ﬃ` to `ffi`), and add the `expand_ligatures` boolean parameter to text-extraction methods. ([86e935d](https://github.com/jsvine/pdfplumber/commit/86e935d) + [#598](https://github.com/jsvine/pdfplumber/issues/598))

### Added

- Add `Page.extract_text_lines(...)` method. ([4b37397](https://github.com/jsvine/pdfplumber/commit/4b37397) + [#852](https://github.com/jsvine/pdfplumber/discussions/852))
- Add `main_group`, `return_groups`, `return_chars` parameters to `Page.search(...)`. ([4b37397](https://github.com/jsvine/pdfplumber/commit/4b37397))
- Add `.curve_edges` property to `PDF` and `Page`. ([6f6b465](https://github.com/jsvine/pdfplumber/commit/6f6b465))

### Fixed

- Fix handling of bytes-typed fontnames. ([9441ff7](https://github.com/jsvine/pdfplumber/commit/9441ff7) + [#461](https://github.com/jsvine/pdfplumber/discussions/461) + [#842](https://github.com/jsvine/pdfplumber/discussions/842))
- Fix handling of whitespace-only and empty results of `Page.search(...)`. ([6f6b465](https://github.com/jsvine/pdfplumber/commit/6f6b465) + [#853](https://github.com/jsvine/pdfplumber/discussions/853))

## [0.8.1] - 2023-04-08
### Fixed

4 changes: 2 additions & 2 deletions CITATION.cff
@@ -1,8 +1,8 @@
cff-version: 1.2.0
title: pdfplumber
type: software
version: 0.8.1
date-released: "2023-04-08"
version: 0.9.0
date-released: "2023-04-13"
authors:
  - family-names: "Singer-Vine"
    given-names: "Jeremy"
9 changes: 6 additions & 3 deletions README.md
@@ -230,7 +230,9 @@ my_char_rotation = my_char_ctm.skew_x
|`non_stroking_color`|The curve’s fill color.|
|`object_type`| "curve"|

Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to two derived lists of objects: `.rect_edges` (which decomposes each rectangle into its four lines) and `.edges` (which combines `.rect_edges` with `.lines`).
#### Derived properties

Additionally, both `pdfplumber.PDF` and `pdfplumber.Page` provide access to several derived lists of objects: `.rect_edges` (which decomposes each rectangle into its four lines), `.curve_edges` (which does the same for `curve` objects), and `.edges` (which combines `.rect_edges`, `.curve_edges`, and `.lines`).
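As a rough illustration of what "decomposing" a curve means here, the sketch below splits a curve's points into one straight edge per consecutive pair. It is a simplified stand-in, not the library's code: the function name `sketch_curve_to_edges`, the sample curve, and the `bottom` field are assumptions, and the real edge dicts may carry additional attributes.

```python
from typing import Any, Dict, List, Tuple

Point = Tuple[float, float]


def sketch_curve_to_edges(curve: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Split a curve into one straight edge per consecutive pair of points."""
    pts: List[Point] = curve["pts"]
    return [
        {
            "object_type": "curve_edge",
            "x0": min(p0[0], p1[0]),
            "x1": max(p0[0], p1[0]),
            "top": min(p0[1], p1[1]),
            "bottom": max(p0[1], p1[1]),  # assumed field, mirroring "top"
        }
        for p0, p1 in zip(pts, pts[1:])
    ]


# A hypothetical three-point curve yields two edges.
example_curve = {"object_type": "curve", "pts": [(0, 0), (10, 0), (10, 5)]}
example_edges = sketch_curve_to_edges(example_curve)
```

A curve with `n` points therefore contributes `n - 1` entries to `.curve_edges`, each tagged with `"object_type": "curve_edge"`.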

#### `image` properties

@@ -320,10 +322,11 @@ If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyErr

| Method | Description |
|--------|-------------|
|`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[], split_at_punctuation=False)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation characters specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. |
|`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see above), the first step in calculating the layout.</p></li></ul>|
|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
|`.search(pattern, regex=True, case=True, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|
|`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation characters specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|
|`.search(pattern, regex=True, case=True, main_group=0, return_groups=True, return_chars=True, layout=False, **kwargs)`|*Experimental feature* that allows you to search a page's text, returning a list of all instances that match the query. For each instance, the response dictionary object contains the matching text, any regex group matches, the bounding box coordinates, and the char objects themselves. `pattern` can be a compiled regular expression, an uncompiled regular expression, or a non-regex string. If `regex` is `False`, the pattern is treated as a non-regex string. If `case` is `False`, the search is performed in a case-insensitive manner. Setting `main_group` restricts the results to a specific regex group within the `pattern` (default of `0` means the entire match). Setting `return_groups` and/or `return_chars` to `False` will exclude the lists of the matched regex groups and/or characters from being added (as `"groups"` and `"chars"` to the return dicts). The `layout` parameter operates as it does for `.extract_text(...)`. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`. __Note__: Zero-width and all-whitespace matches are discarded, because they (generally) have no explicit position on the page. |
|`.dedupe_chars(tolerance=1)`| Returns a version of the page with duplicate chars — those sharing the same text, fontname, size, and positioning (within `tolerance` x/y) as other characters — removed. (See [Issue #71](https://github.com/jsvine/pdfplumber/issues/71) to understand the motivation.)|
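
The `main_group` and `return_groups` semantics described for `.search(...)` can be sketched on a plain string with Python's `re` module. This is only an approximation under stated assumptions: `sketch_search` is a hypothetical helper, and the `start`/`end` offsets stand in for the bounding-box coordinates and char objects that the real method returns.

```python
import re
from typing import Any, Dict, List


def sketch_search(
    pattern: str,
    text: str,
    main_group: int = 0,
    return_groups: bool = True,
) -> List[Dict[str, Any]]:
    """Return one dict per match, restricted to the span of `main_group`."""
    results: List[Dict[str, Any]] = []
    for m in re.finditer(pattern, text):
        matched = m.group(main_group)
        # Skip unmatched groups, zero-width matches, and all-whitespace matches.
        if matched is None or not matched.strip():
            continue
        result: Dict[str, Any] = {
            "text": matched,
            "start": m.start(main_group),  # hypothetical stand-in for position
            "end": m.end(main_group),
        }
        if return_groups:
            result["groups"] = m.groups()
        results.append(result)
    return results


# Only the second capture group is reported as the match text.
hits = sketch_search(r"(\d+)-(\d+)", "a 12-34 b", main_group=2)
blank_hits = sketch_search(r"\s*", "a b")
```

With `main_group=2`, the reported text and position cover only the second group, while `groups` still lists every capture. The whitespace-only pattern returns nothing, mirroring the note about discarded matches.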

## Extracting tables
26 changes: 13 additions & 13 deletions examples/notebooks/ag-energy-roundup-curves.ipynb

Large diffs are not rendered by default.

16 changes: 8 additions & 8 deletions examples/notebooks/extract-table-ca-warn-report.ipynb

Large diffs are not rendered by default.

14 changes: 7 additions & 7 deletions examples/notebooks/extract-table-nics.ipynb

Large diffs are not rendered by default.

112 changes: 21 additions & 91 deletions examples/notebooks/san-jose-pd-firearm-report.ipynb

Large diffs are not rendered by default.

2 changes: 1 addition & 1 deletion pdfplumber/_version.py
@@ -1,2 +1,2 @@
version_info = (0, 8, 1)
version_info = (0, 9, 0)
__version__ = ".".join(map(str, version_info))
12 changes: 10 additions & 2 deletions pdfplumber/container.py
@@ -10,7 +10,7 @@


class Container(object):
    cached_properties = ["_rect_edges", "_edges", "_objects"]
    cached_properties = ["_rect_edges", "_curve_edges", "_edges", "_objects"]

    @property
    def pages(self) -> Optional[List[Any]]:
@@ -73,12 +73,20 @@ def rect_edges(self) -> T_obj_list:
        self._rect_edges: T_obj_list = list(chain(*rect_edges_gen))
        return self._rect_edges

    @property
    def curve_edges(self) -> T_obj_list:
        if hasattr(self, "_curve_edges"):
            return self._curve_edges
        curve_edges_gen = (utils.curve_to_edges(r) for r in self.curves)
        self._curve_edges: T_obj_list = list(chain(*curve_edges_gen))
        return self._curve_edges

    @property
    def edges(self) -> T_obj_list:
        if hasattr(self, "_edges"):
            return self._edges
        line_edges = list(map(utils.line_to_edge, self.lines))
        self._edges: T_obj_list = self.rect_edges + line_edges
        self._edges: T_obj_list = line_edges + self.rect_edges + self.curve_edges
        return self._edges

    @property
44 changes: 43 additions & 1 deletion pdfplumber/page.py
@@ -71,6 +71,27 @@
from .display import PageImage
from .pdf import PDF

# via https://git.ghostscript.com/?p=mupdf.git;a=blob;f=source/pdf/pdf-font.c;h=6322cedf2c26cfb312c0c0878d7aff97b4c7470e;hb=HEAD#l774 # noqa

CP936_FONTNAMES = {
    b"\xcb\xce\xcc\xe5": "SimSun,Regular",
    b"\xba\xda\xcc\xe5": "SimHei,Regular",
    b"\xbf\xac\xcc\xe5_GB2312": "SimKai,Regular",
    b"\xb7\xc2\xcb\xce_GB2312": "SimFang,Regular",
    b"\xc1\xa5\xca\xe9": "SimLi,Regular",
}


def fix_fontname_bytes(fontname: bytes) -> str:
    if b"+" in fontname:
        split_at = fontname.index(b"+") + 1
        prefix, suffix = fontname[:split_at], fontname[split_at:]
    else:
        prefix, suffix = b"", fontname

    suffix_new = CP936_FONTNAMES.get(suffix, str(suffix)[2:-1])
    return str(prefix)[2:-1] + suffix_new


class Page(Container):
    cached_properties: List[str] = Container.cached_properties + ["_layout"]
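
To see what `fix_fontname_bytes` does with a few inputs, here is a self-contained copy of the function with a trimmed lookup table. The sample font names (including the `ABCDEF+` subset prefix) are made-up inputs for illustration, not values taken from a real PDF.

```python
# Trimmed copy of the lookup table from the hunk above (two entries only).
CP936_FONTNAMES = {
    b"\xcb\xce\xcc\xe5": "SimSun,Regular",
    b"\xba\xda\xcc\xe5": "SimHei,Regular",
}


def fix_fontname_bytes(fontname: bytes) -> str:
    # Split off an optional "PREFIX+" part, keeping the "+" with the prefix.
    if b"+" in fontname:
        split_at = fontname.index(b"+") + 1
        prefix, suffix = fontname[:split_at], fontname[split_at:]
    else:
        prefix, suffix = b"", fontname

    # Known CP936 names are mapped; anything else falls back to the
    # bytes repr with the leading b' and trailing ' stripped.
    suffix_new = CP936_FONTNAMES.get(suffix, str(suffix)[2:-1])
    return str(prefix)[2:-1] + suffix_new


known = fix_fontname_bytes(b"ABCDEF+\xcb\xce\xcc\xe5")  # hypothetical input
plain = fix_fontname_bytes(b"Helvetica")
unknown = fix_fontname_bytes(b"\xff\xfe")
```

Note that unrecognized non-ASCII bytes are not decoded: they come back as their escaped repr text (a backslash followed by `xff`, and so on), which keeps the result a valid `str` without guessing an encoding.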
@@ -221,6 +242,10 @@ def process_attr(item: Tuple[str, Any]) -> Optional[Tuple[str, Any]]:
            attr["stroking_color"] = gs.scolor
            attr["non_stroking_color"] = gs.ncolor

            # Handle (rare) byte-encoded fontnames
            if isinstance(attr["fontname"], bytes):
                attr["fontname"] = fix_fontname_bytes(attr["fontname"])

        if "pts" in attr:
            attr["pts"] = list(map(self.point2coord, attr["pts"]))

@@ -306,10 +331,20 @@ def search(
        pattern: Union[str, Pattern[str]],
        regex: bool = True,
        case: bool = True,
        main_group: int = 0,
        return_chars: bool = True,
        return_groups: bool = True,
        **kwargs: Any,
    ) -> List[Dict[str, Any]]:
        textmap = self.get_textmap(**kwargs)
        return textmap.search(pattern, regex=regex, case=case)
        return textmap.search(
            pattern,
            regex=regex,
            case=case,
            main_group=main_group,
            return_chars=return_chars,
            return_groups=return_groups,
        )

    def extract_text(self, **kwargs: Any) -> str:
        return self.get_textmap(**kwargs).as_string
@@ -320,6 +355,13 @@ def extract_text_simple(self, **kwargs: Any) -> str:
    def extract_words(self, **kwargs: Any) -> T_obj_list:
        return utils.extract_words(self.chars, **kwargs)

    def extract_text_lines(
        self, strip: bool = True, return_chars: bool = True, **kwargs: Any
    ) -> T_obj_list:
        return self.get_textmap(**kwargs).extract_text_lines(
            strip=strip, return_chars=return_chars
        )

    def crop(
        self, bbox: T_bbox, relative: bool = False, strict: bool = True
    ) -> "CroppedPage":
14 changes: 8 additions & 6 deletions pdfplumber/utils/geometry.py
@@ -193,6 +193,7 @@ def curve_to_edges(curve: T_obj) -> T_obj_list:
    point_pairs = zip(curve["pts"], curve["pts"][1:])
    return [
        {
            "object_type": "curve_edge",
            "x0": min(p0[0], p1[0]),
            "x1": max(p0[0], p1[0]),
            "top": min(p0[1], p1[1]),
@@ -253,12 +254,13 @@ def line_to_edge(line: T_obj) -> T_obj:


def obj_to_edges(obj: T_obj) -> T_obj_list:
    return {
        "line": lambda x: [line_to_edge(x)],
        "rect": rect_to_edges,
        "rect_edge": rect_to_edges,
        "curve": curve_to_edges,
    }[obj["object_type"]](obj)
    t = obj["object_type"]
    if "_edge" in t:
        return [obj]
    elif t == "line":
        return [line_to_edge(obj)]
    else:
        return {"rect": rect_to_edges, "curve": curve_to_edges}[t](obj)


def filter_edges(
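
The practical effect of the new `obj_to_edges` dispatch can be checked in isolation. In the sketch below the three converters are hypothetical stubs that only tag their output, so the example shows the routing logic and nothing about the real geometry code.

```python
from typing import Any, Dict, List

T_obj = Dict[str, Any]


# Hypothetical stand-ins for the real converters; they only tag their output.
def stub_line_to_edge(obj: T_obj) -> T_obj:
    return {"from": "line"}


def stub_rect_to_edges(obj: T_obj) -> List[T_obj]:
    return [{"from": "rect"} for _ in range(4)]


def stub_curve_to_edges(obj: T_obj) -> List[T_obj]:
    return [{"from": "curve"}]


def sketch_obj_to_edges(obj: T_obj) -> List[T_obj]:
    t = obj["object_type"]
    if "_edge" in t:
        # Anything already tagged as an edge is passed through unchanged.
        return [obj]
    elif t == "line":
        return [stub_line_to_edge(obj)]
    else:
        # Any other object type must be "rect" or "curve"; others raise KeyError.
        return {"rect": stub_rect_to_edges, "curve": stub_curve_to_edges}[t](obj)


passthrough = sketch_obj_to_edges({"object_type": "curve_edge"})
rect_result = sketch_obj_to_edges({"object_type": "rect"})
```

Under the old lookup table, a `rect_edge` object was sent through `rect_to_edges` again and a `curve_edge` object had no entry at all; with the new branch, any object whose type contains `_edge` is returned as a single-item list.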