Merge pull request #936 from jsvine/develop
v0.10.0
jsvine authored Jul 16, 2023
2 parents ae676ae + 28c0afc commit 00386ad
Showing 29 changed files with 516 additions and 203 deletions.
5 changes: 5 additions & 0 deletions .github/ISSUE_TEMPLATE/bug-report.md
@@ -11,6 +11,11 @@ assignees: ''
*A clear and concise description of what the bug is.*


## Have you tried [repairing](../../docs/repairing.md) the PDF?

*Please try running your code with `pdfplumber.open(..., repair=True)` before submitting a bug report.*


## Code to reproduce the problem

*Paste it here, or attach a Python file.*
21 changes: 9 additions & 12 deletions .github/workflows/tests.yml
@@ -7,15 +7,15 @@ jobs:
runs-on: ubuntu-latest

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3

- name: Set up Python 3.9
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: 3.9

- name: Configure pip caching
uses: actions/cache@v2
uses: actions/cache@v3
with:
path: ~/.cache/pip
key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt')}}-${{ hashFiles('**/requirements-dev.txt') }}
@@ -44,24 +44,21 @@ jobs:
strategy:
fail-fast: false
matrix:
python-version: [3.7, 3.8, 3.9, "3.10"]
python-version: [3.8, 3.9, "3.10", "3.11"]

steps:
- uses: actions/checkout@v2
- uses: actions/checkout@v3

- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
uses: actions/setup-python@v4
with:
python-version: ${{ matrix.python-version }}

- name: Install ghostscript & imagemagick
run: sudo apt update && sudo apt install ghostscript libmagickwand-dev

- name: Remove policy.xml
run: sudo rm /etc/ImageMagick-6/policy.xml
- name: Install ghostscript
run: sudo apt update && sudo apt install ghostscript

- name: Configure pip caching
uses: actions/cache@v2
uses: actions/cache@v3
with:
path: ~/.cache/pip
key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt')}}-${{ hashFiles('**/requirements-dev.txt') }}
25 changes: 25 additions & 0 deletions CHANGELOG.md
@@ -2,6 +2,31 @@

All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/).

## [0.10.0] - 2023-07-16

### Changed

- Normalize color representation to `tuple[float|int, ...]` (#[917](https://github.com/jsvine/pdfplumber/issues/917)). ([57d51bb](https://github.com/jsvine/pdfplumber/commit/57d51bb))
- Replace Wand with pypdfium2 for page.to_image(...). ([b049373](https://github.com/jsvine/pdfplumber/commit/b049373))

### Added

- Add `pdfplumber.repair(...)` and `.open(repair=True)` (#[824](https://github.com/jsvine/pdfplumber/issues/824)). ([db6ae97](https://github.com/jsvine/pdfplumber/commit/db6ae97))
- Add Page.find_table(...) (#[873](https://github.com/jsvine/pdfplumber/issues/873)). ([3772af6](https://github.com/jsvine/pdfplumber/commit/3772af6))
- Add `quantize=True`, `colors=256`, `bits=8` arguments/defaults to `PageImage.save(...)`. ([b049373](https://github.com/jsvine/pdfplumber/commit/b049373))
- Extract and handle patterns + (some) color spaces. ([97ca4b0](https://github.com/jsvine/pdfplumber/commit/97ca4b0))

### Removed

- Remove support for Python 3.7 ([EOL'ed June 2023](https://endoflife.date/python)). ([c9d24d5](https://github.com/jsvine/pdfplumber/commit/c9d24d5))
- Remove vestigial 'font' and 'name' properties from PDF objects. ([6d62054](https://github.com/jsvine/pdfplumber/commit/6d62054))

### Fixed

- Fix bug for re-crops that use relative=True (#[914](https://github.com/jsvine/pdfplumber/issues/914)). ([0de6da9](https://github.com/jsvine/pdfplumber/commit/0de6da9))
- Handle `use_text_flow` more consistently (#[912](https://github.com/jsvine/pdfplumber/issues/912)). ([b1db5b8](https://github.com/jsvine/pdfplumber/commit/b1db5b8))


## [0.9.0] - 2023-04-13

### Changed
4 changes: 2 additions & 2 deletions CITATION.cff
@@ -1,8 +1,8 @@
cff-version: 1.2.0
title: pdfplumber
type: software
version: 0.9.0
date-released: "2023-04-13"
version: 0.10.0
date-released: "2023-07-16"
authors:
- family-names: "Singer-Vine"
given-names: "Jeremy"
39 changes: 18 additions & 21 deletions README.md
@@ -6,7 +6,7 @@ Plumb a PDF for detailed information about each text character, rectangle, and l

Works best on machine-generated, rather than scanned, PDFs. Built on [`pdfminer.six`](https://github.com/goulu/pdfminer).

Currently [tested](tests/) on [Python 3.7, 3.8, 3.9, 3.10](.github/workflows/tests.yml).
Currently [tested](tests/) on [Python 3.8, 3.9, 3.10, 3.11](.github/workflows/tests.yml).

Translations of this document are available in: [Chinese (by @hbh112233abc)](https://github.com/hbh112233abc/pdfplumber/blob/stable/README-CN.md).

@@ -158,8 +158,11 @@ Each object is represented as a simple Python `dict`, with the following propert
|`bottom`| Distance of bottom of the character from top of page.|
|`doctop`| Distance of top of character from top of document.|
|`matrix`| The "current transformation matrix" for this character. (See below for details.)|
|`stroking_color`|The color of the character's outline (i.e., stroke), expressed as a tuple or integer, depending on the “color space” used.|
|`non_stroking_color`|The character's interior color.|
|`ncs`|TKTK|
|`stroking_pattern`|TKTK|
|`non_stroking_pattern`|TKTK|
|`stroking_color`|The color of the character's outline (i.e., stroke). See [docs/colors.md](docs/colors.md) for details.|
|`non_stroking_color`|The character's interior color. See [docs/colors.md](docs/colors.md) for details.|
|`object_type`| "char"|

__Note__: A character’s `matrix` property represents the “current transformation matrix,” as described in Section 4.2.2 of the [PDF Reference](https://ghostscript.com/~robin/pdf_reference17.pdf) (6th Ed.). The matrix controls the character’s scale, skew, and positional translation. Rotation is a combination of scale and skew, but in most cases can be considered equal to the x-axis skew. The `pdfplumber.ctm` submodule defines a class, `CTM`, that assists with these calculations. For instance:
@@ -186,8 +189,8 @@ my_char_rotation = my_char_ctm.skew_x
|`bottom`| Distance of bottom of the line from top of page.|
|`doctop`| Distance of top of line from top of document.|
|`linewidth`| Thickness of line.|
|`stroking_color`|The color of the line, expressed as a tuple or integer, depending on the “color space” used.|
|`non_stroking_color`|The non-stroking color specified for the line’s path.|
|`stroking_color`|The color of the line. See [docs/colors.md](docs/colors.md) for details.|
|`non_stroking_color`|The non-stroking color specified for the line’s path. See [docs/colors.md](docs/colors.md) for details.|
|`object_type`| "line"|

#### `rect` properties
@@ -205,8 +208,8 @@ my_char_rotation = my_char_ctm.skew_x
|`bottom`| Distance of bottom of the rectangle from top of page.|
|`doctop`| Distance of top of rectangle from top of document.|
|`linewidth`| Thickness of line.|
|`stroking_color`|The color of the rectangle's outline, expressed as a tuple or integer, depending on the “color space” used.|
|`non_stroking_color`|The rectangle’s fill color.|
|`stroking_color`|The color of the rectangle's outline. See [docs/colors.md](docs/colors.md) for details.|
|`non_stroking_color`|The rectangle’s fill color. See [docs/colors.md](docs/colors.md) for details.|
|`object_type`| "rect"|

#### `curve` properties
@@ -226,8 +229,8 @@ my_char_rotation = my_char_ctm.skew_x
|`doctop`| Distance of curve's highest point from top of document.|
|`linewidth`| Thickness of line.|
|`fill`| Whether the shape defined by the curve's path is filled.|
|`stroking_color`|The color of the curve's outline, expressed as a tuple or integer, depending on the “color space” used.|
|`non_stroking_color`|The curve’s fill color.|
|`stroking_color`|The color of the curve's outline. See [docs/colors.md](docs/colors.md) for details.|
|`non_stroking_color`|The curve’s fill color. See [docs/colors.md](docs/colors.md) for details.|
|`object_type`| "curve"|

#### Derived properties
@@ -247,17 +250,12 @@ If you pass the `pdfminer.six`-handling `laparams` parameter to `pdfplumber.open

`pdfplumber`'s visual debugging tools can be helpful in understanding the structure of a PDF and the objects that have been extracted from it.

__Note:__ To use this feature, you'll also need to have two additional pieces of software installed on your computer:

- [`ImageMagick`](https://www.imagemagick.org/). [Installation instructions here](http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-debian).
- [`ghostscript`](https://www.ghostscript.com). [Installation instructions here](https://ghostscript.readthedocs.io/en/latest/Install.html), or simply `apt install ghostscript` (Ubuntu) / `brew install ghostscript` (Mac).


### Creating a `PageImage` with `.to_image()`

To turn any page (including cropped pages) into a `PageImage` object, call `my_page.to_image()`. You can optionally pass *one* of the following keyword arguments:

- `resolution`: The desired number of pixels per inch. Defaults to 72. See note below.
- `resolution`: The desired number of pixels per inch. Defaults to 72.
- `width`: The desired image width in pixels.
- `height`: The desired image height in pixels.

@@ -267,12 +265,10 @@ For instance:
im = my_pdf.pages[0].to_image(resolution=150)
```

From a script or REPL, `im.show()` will open the image in your local image viewer. But `PageImage` objects also play nicely with IPython/Jupyter notebooks; they automatically render as cell outputs. For example:
From a script or REPL, `im.show()` will open the image in your local image viewer. But `PageImage` objects also play nicely with Jupyter notebooks; they automatically render as cell outputs. For example:

![Visual debugging in Jupyter](examples/screenshots/visual-debugging-in-jupyter.png "Visual debugging in Jupyter")

*Note*: `pdfplumber` passes the `resolution` parameter to [Wand](https://docs.wand-py.org/en/latest/wand/image.html#wand.image.Image), the Python library we use for image conversion. Wand will create the image with the desired number of total pixels of height/width, but does not fully respect the `resolution` in the strict sense of that word: Although PNGs are capable of storing an image's resolution density as metadata, Wand's PNGs do not.

*Note*: `.to_image(...)` works as expected with `Page.crop(...)`/`CroppedPage` instances, but is unable to incorporate changes made via `Page.filter(...)`/`FilteredPage` instances.


@@ -283,7 +279,7 @@ From a script or REPL, `im.show()` will open the image in your local image viewe
|`im.reset()`| Clears anything you've drawn so far.|
|`im.copy()`| Copies the image to a new `PageImage` object.|
|`im.show()`| Opens the image in your local image viewer.|
|`im.save(path_or_fileobject, format="PNG")`| Saves the annotated image.|
|`im.save(path_or_fileobject, format="PNG", quantize=True, colors=256, bits=8)`| Saves the annotated image as a PNG file. The default arguments quantize the image to a palette of 256 colors, saving the PNG with 8-bit color depth. You can disable quantization by passing `quantize=False` or adjust the size of the color palette by passing `colors=N`.|

### Drawing methods

@@ -322,7 +318,7 @@ If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyErr

| Method | Description |
|--------|-------------|
|`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see above), the first step in calculating the layout.</p></li></ul>|
|`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see below), the first step in calculating the layout.</p></li></ul>|
|`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
|`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation characters specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
|`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|
@@ -346,8 +342,9 @@ If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyErr
| Method | Description |
|--------|-------------|
|`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|
|`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|
|`.extract_table(table_settings={})`|Returns the text extracted from the *largest* table on the page, represented as a list of lists, with the structure `row -> cell`. (If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.)|
|`.extract_table(table_settings={})`|Returns the text extracted from the *largest* table on the page (see `.find_table(...)` above), represented as a list of lists, with the structure `row -> cell`.|
|`.debug_tablefinder(table_settings={})`|Returns an instance of the `TableFinder` class, with access to the `.edges`, `.intersections`, `.cells`, and `.tables` properties.|
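
The "largest table" rule described for `.find_table(...)` and `.extract_table(...)` can be sketched in plain Python. This is only an illustration of the selection rule as stated in the table above (most cells wins, ties go to the table closest to the top of the page), not `pdfplumber`'s actual implementation. `TableStub` and `pick_largest` are hypothetical names, and returning `None` for an empty list is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class TableStub:
    """Hypothetical stand-in for a detected table (not pdfplumber's Table class)."""

    bbox: Tuple[float, float, float, float]  # (x0, top, x1, bottom)
    n_cells: int


def pick_largest(tables: List[TableStub]) -> Optional[TableStub]:
    """Return the table with the most cells; break ties by smallest `top`."""
    if not tables:
        # Assumption for this sketch: no tables means no result.
        return None
    # Negating the cell count makes "more cells" sort first; bbox[1] is `top`.
    return min(tables, key=lambda t: (-t.n_cells, t.bbox[1]))


lower = TableStub(bbox=(0, 300, 100, 400), n_cells=12)
upper = TableStub(bbox=(0, 50, 100, 150), n_cells=12)
small = TableStub(bbox=(0, 10, 100, 40), n_cells=4)

# `upper` and `lower` tie on cell count, so the one nearer the top is chosen.
chosen = pick_largest([lower, upper, small])
```

With the real library, the equivalent call would be `my_page.find_table()`, which returns a `Table` whose `.extract(...)` method yields the `row -> cell` lists.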

For example:
41 changes: 41 additions & 0 deletions docs/colors.md
@@ -0,0 +1,41 @@
# Colors

In the PDF specification, as well as in `pdfplumber`, most graphical objects can have two color attributes:

- `stroking_color`: The color of the object's outline
- `non_stroking_color`: The color of the object's interior, or "fill"

In the PDF specification, colors have both a "color space" and a "color value".

## Color Spaces

Valid color spaces are grouped into three categories:

- Device color spaces
- `DeviceGray`
- `DeviceRGB`
- `DeviceCMYK`
- CIE-based color spaces
- `CalGray`
- `CalRGB`
- `Lab`
- `ICCBased`
- Special color spaces
- `Indexed`
- `Pattern`
- `Separation`
- `DeviceN`

To read more about the differences between those color spaces, see section 4.5 [here](https://ghostscript.com/~robin/pdf_reference17.pdf).

`pdfplumber` aims to expose those color spaces as `scs` (stroking color space) and `ncs` (non-stroking color space), represented as a __string__.

__Caveat__: The only information `pdfplumber` can __currently__ expose is the non-stroking color space for `char` objects. The rest (stroking color space for `char` objects and either color space for the other types of objects) will require a pull request to `pdfminer.six`.

## Color Values

The color value determines *what specific color* in the color space should be used. With the exception of the "special color spaces," these color values are specified as a series of numbers. For `DeviceRGB`, for example, the color values are three numbers, representing the intensities of red, green, and blue.

In `pdfplumber`, those color values are exposed as `stroking_color` and `non_stroking_color`, represented as a __tuple of numbers__.

The pattern specified by the `Pattern` color space is exposed via the `non_stroking_pattern` and `stroking_pattern` attributes.
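
Because the color space is not yet exposed for most objects, code that consumes these tuples often has to guess how to interpret them. The sketch below is one such heuristic, not part of `pdfplumber`: `guess_rgb` is a hypothetical helper that assumes a 1-tuple is `DeviceGray`, a 3-tuple is `DeviceRGB`, and a 4-tuple is `DeviceCMYK`, and that uses a naive CMYK-to-RGB conversion with no color management. Those assumptions will be wrong for some PDFs (for example, `Lab` values also have three components).

```python
from typing import Optional, Sequence, Tuple


def guess_rgb(color: Optional[Sequence[float]]) -> Optional[Tuple[int, int, int]]:
    """Heuristically convert a color tuple to 8-bit RGB based on its length."""
    if color is None:
        return None
    if len(color) == 1:
        # Assumed DeviceGray: one intensity shared by all three channels.
        r = g = b = color[0]
    elif len(color) == 3:
        # Assumed DeviceRGB.
        r, g, b = color
    elif len(color) == 4:
        # Assumed DeviceCMYK; naive conversion, ignores ICC profiles.
        c, m, y, k = color
        r, g, b = (1 - c) * (1 - k), (1 - m) * (1 - k), (1 - y) * (1 - k)
    else:
        # Unknown component count: decline to guess.
        return None

    def to_byte(value: float) -> int:
        # Clamp to [0, 1] before scaling to 0..255.
        return round(255 * min(max(value, 0.0), 1.0))

    return (to_byte(r), to_byte(g), to_byte(b))
```

A typical use would be `guess_rgb(rect["non_stroking_color"])` for a `rect` dict taken from `page.rects`.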
11 changes: 11 additions & 0 deletions docs/repairing.md
@@ -0,0 +1,11 @@
# Repairing Malformed PDFs

Many parsing issues can be traced back to malformed PDFs.

Malformed PDFs can often be [fixed via Ghostscript](https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file).

`pdfplumber` lets you automatically run those repairs, in several ways:

- `pdfplumber.open(..., repair=True)` will repair your PDF on the fly (but not save the repaired version to disk).
- `pdfplumber.repair(path_to_pdf)` will return a `BytesIO` object holding the bytes of a repaired version of the original file.
- `pdfplumber.repair(path_to_pdf, outfile="path/to/repaired.pdf")` will write a repaired version of the original file to the indicated `outfile` path.
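
The three options above might be wrapped as follows. This is a minimal sketch that assumes `pdfplumber` 0.10.0 or later and a working Ghostscript installation; the helper names (`first_page_text`, `repaired_bytes`, `save_repaired_copy`) are made up for illustration. `pdfplumber` is imported inside each function only so that the module can be loaded on machines where it is not installed.

```python
from io import BytesIO
from pathlib import Path


def first_page_text(pdf_path, repair=False):
    """Return the first page's text, optionally repairing the PDF on the fly."""
    import pdfplumber  # lazy import; requires pdfplumber >= 0.10.0

    # With repair=True the repaired version is used in memory only.
    with pdfplumber.open(pdf_path, repair=repair) as pdf:
        return pdf.pages[0].extract_text()


def repaired_bytes(pdf_path) -> BytesIO:
    """Return a BytesIO holding a repaired version of the file."""
    import pdfplumber

    return pdfplumber.repair(pdf_path)


def save_repaired_copy(pdf_path, out_path) -> Path:
    """Write a repaired copy to `out_path` and return that path."""
    import pdfplumber

    pdfplumber.repair(pdf_path, outfile=out_path)
    return Path(out_path)
```

For example, `first_page_text("report.pdf", repair=True)` would parse a repaired in-memory copy of a hypothetical `report.pdf` without touching the file on disk.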
22 changes: 12 additions & 10 deletions examples/notebooks/ag-energy-roundup-curves.ipynb

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions examples/notebooks/extract-table-ca-warn-report.ipynb

Large diffs are not rendered by default.

18 changes: 9 additions & 9 deletions examples/notebooks/extract-table-nics.ipynb

Large diffs are not rendered by default.

10 changes: 5 additions & 5 deletions examples/notebooks/san-jose-pd-firearm-report.ipynb

Large diffs are not rendered by default.

2 changes: 2 additions & 0 deletions pdfplumber/__init__.py
@@ -3,6 +3,7 @@
"utils",
"pdfminer",
"open",
"repair",
"set_debug",
]

@@ -12,5 +13,6 @@
from . import utils
from ._version import __version__
from .pdf import PDF
from .repair import repair

open = PDF.open
2 changes: 1 addition & 1 deletion pdfplumber/_version.py
@@ -1,2 +1,2 @@
version_info = (0, 9, 0)
version_info = (0, 10, 0)
__version__ = ".".join(map(str, version_info))