Merge pull request #936 from jsvine/develop

v0.10.0
jsvine · Jul 16, 2023 · 00386ad
2 parents ae676ae + 28c0afc
commit 00386ad
Showing 29 changed files with 516 additions and 203 deletions.
diff --git a/.github/ISSUE_TEMPLATE/bug-report.md b/.github/ISSUE_TEMPLATE/bug-report.md
@@ -11,6 +11,11 @@ assignees: ''
 *A clear and concise description of what the bug is.*
 
 
+## Have you tried [repairing](../../docs/repairing.md) the PDF?
+
+*Please try running your code with `pdfplumber.open(..., repair=True)` before submitting a bug report.*
+
+
 ## Code to reproduce the problem
 
 *Paste it here, or attach a Python file.*

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -7,15 +7,15 @@ jobs:
     runs-on: ubuntu-latest
 
     steps:
-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v3
 
     - name: Set up Python 3.9
-      uses: actions/setup-python@v2
+      uses: actions/setup-python@v4
       with:
         python-version: 3.9
 
     - name: Configure pip caching
-      uses: actions/cache@v2
+      uses: actions/cache@v3
       with:
         path: ~/.cache/pip
         key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt')}}-${{ hashFiles('**/requirements-dev.txt') }}
@@ -44,24 +44,21 @@ jobs:
     strategy:
       fail-fast: false
       matrix:
-        python-version: [3.7, 3.8, 3.9, "3.10"]
+        python-version: [3.8, 3.9, "3.10", "3.11"]
 
     steps:
-    - uses: actions/checkout@v2
+    - uses: actions/checkout@v3
 
     - name: Set up Python ${{ matrix.python-version }}
-      uses: actions/setup-python@v2
+      uses: actions/setup-python@v4
       with:
         python-version: ${{ matrix.python-version }}
 
-    - name: Install ghostscript & imagemagick
-      run: sudo apt update && sudo apt install ghostscript libmagickwand-dev
-
-    - name: Remove policy.xml
-      run: sudo rm /etc/ImageMagick-6/policy.xml
+    - name: Install ghostscript
+      run: sudo apt update && sudo apt install ghostscript
 
     - name: Configure pip caching
-      uses: actions/cache@v2
+      uses: actions/cache@v3
       with:
         path: ~/.cache/pip
         key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt')}}-${{ hashFiles('**/requirements-dev.txt') }}

diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -2,6 +2,31 @@
 
 All notable changes to this project will be documented in this file. The format is based on [Keep a Changelog](http://keepachangelog.com/).
 
+## [0.10.0] - 2023-07-16
+
+### Changed
+
+- Normalize color representation to `tuple[float|int, ...]` (#[917](https://github.com/jsvine/pdfplumber/issues/917)). ([57d51bb](https://github.com/jsvine/pdfplumber/commit/57d51bb))
+- Replace Wand with pypdfium2 for `page.to_image(...)`. ([b049373](https://github.com/jsvine/pdfplumber/commit/b049373))
+
+### Added
+
+- Add `pdfplumber.repair(...)` and `.open(repair=True)` (#[824](https://github.com/jsvine/pdfplumber/issues/824)). ([db6ae97](https://github.com/jsvine/pdfplumber/commit/db6ae97))
+- Add `Page.find_table(...)` (#[873](https://github.com/jsvine/pdfplumber/issues/873)). ([3772af6](https://github.com/jsvine/pdfplumber/commit/3772af6))
+- Add `quantize=True`, `colors=256`, `bits=8` arguments/defaults to `PageImage.save(...)`. ([b049373](https://github.com/jsvine/pdfplumber/commit/b049373))
+- Extract and handle patterns + (some) color spaces. ([97ca4b0](https://github.com/jsvine/pdfplumber/commit/97ca4b0))
+
+### Removed
+
+- Remove support for Python 3.7 ([EOL'ed June 2023](https://endoflife.date/python)). ([c9d24d5](https://github.com/jsvine/pdfplumber/commit/c9d24d5))
+- Remove vestigial 'font' and 'name' properties from PDF objects. ([6d62054](https://github.com/jsvine/pdfplumber/commit/6d62054))
+
+### Fixed
+
+- Fix bug for re-crops that use `relative=True` (#[914](https://github.com/jsvine/pdfplumber/issues/914)). ([0de6da9](https://github.com/jsvine/pdfplumber/commit/0de6da9))
+- Handle `use_text_flow` more consistently (#[912](https://github.com/jsvine/pdfplumber/issues/912)). ([b1db5b8](https://github.com/jsvine/pdfplumber/commit/b1db5b8))
+
+
 ## [0.9.0] - 2023-04-13
 
 ### Changed

diff --git a/CITATION.cff b/CITATION.cff
@@ -1,8 +1,8 @@
 cff-version: 1.2.0
 title: pdfplumber
 type: software
-version: 0.9.0
-date-released: "2023-04-13"
+version: 0.10.0
+date-released: "2023-07-16"
 authors:
   - family-names: "Singer-Vine"
     given-names: "Jeremy"

diff --git a/README.md b/README.md
@@ -6,7 +6,7 @@ Plumb a PDF for detailed information about each text character, rectangle, and l
 
 Works best on machine-generated, rather than scanned, PDFs. Built on [`pdfminer.six`](https://github.com/goulu/pdfminer). 
 
-Currently [tested](tests/) on [Python 3.7, 3.8, 3.9, 3.10](.github/workflows/tests.yml).
+Currently [tested](tests/) on [Python 3.8, 3.9, 3.10, 3.11](.github/workflows/tests.yml).
 
 Translations of this document are available in: [Chinese (by @hbh112233abc)](https://github.com/hbh112233abc/pdfplumber/blob/stable/README-CN.md).
 
@@ -158,8 +158,11 @@ Each object is represented as a simple Python `dict`, with the following propert
 |`bottom`| Distance of bottom of the character from top of page.|
 |`doctop`| Distance of top of character from top of document.|
 |`matrix`| The "current transformation matrix" for this character. (See below for details.)|
-|`stroking_color`|The color of the character's outline (i.e., stroke), expressed as a tuple or integer, depending on the “color space” used.|
-|`non_stroking_color`|The character's interior color.|
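+|`ncs`|The character's non-stroking color space, represented as a string. See [docs/colors.md](docs/colors.md) for details.|
+|`stroking_pattern`|The pattern specified for the character's stroke when the `Pattern` color space is used. See [docs/colors.md](docs/colors.md) for details.|
+|`non_stroking_pattern`|The pattern specified for the character's interior when the `Pattern` color space is used. See [docs/colors.md](docs/colors.md) for details.|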
+|`stroking_color`|The color of the character's outline (i.e., stroke). See [docs/colors.md](docs/colors.md) for details.|
+|`non_stroking_color`|The character's interior color. See [docs/colors.md](docs/colors.md) for details.|
 |`object_type`| "char"|
 
 __Note__: A character’s `matrix` property represents the “current transformation matrix,” as described in Section 4.2.2 of the [PDF Reference](https://ghostscript.com/~robin/pdf_reference17.pdf) (6th Ed.). The matrix controls the character’s scale, skew, and positional translation. Rotation is a combination of scale and skew, but in most cases can be considered equal to the x-axis skew. The `pdfplumber.ctm` submodule defines a class, `CTM`, that assists with these calculations. For instance:
@@ -186,8 +189,8 @@ my_char_rotation = my_char_ctm.skew_x
 |`bottom`| Distance of bottom of the line from top of page.|
 |`doctop`| Distance of top of line from top of document.|
 |`linewidth`| Thickness of line.|
-|`stroking_color`|The color of the line, expressed as a tuple or integer, depending on the “color space” used.|
-|`non_stroking_color`|The non-stroking color specified for the line’s path.|
+|`stroking_color`|The color of the line. See [docs/colors.md](docs/colors.md) for details.|
+|`non_stroking_color`|The non-stroking color specified for the line’s path. See [docs/colors.md](docs/colors.md) for details.|
 |`object_type`| "line"|
 
 #### `rect` properties
@@ -205,8 +208,8 @@ my_char_rotation = my_char_ctm.skew_x
 |`bottom`| Distance of bottom of the rectangle from top of page.|
 |`doctop`| Distance of top of rectangle from top of document.|
 |`linewidth`| Thickness of line.|
-|`stroking_color`|The color of the rectangle's outline, expressed as a tuple or integer, depending on the “color space” used.|
-|`non_stroking_color`|The rectangle’s fill color.|
+|`stroking_color`|The color of the rectangle's outline. See [docs/colors.md](docs/colors.md) for details.|
+|`non_stroking_color`|The rectangle’s fill color. See [docs/colors.md](docs/colors.md) for details.|
 |`object_type`| "rect"|
 
 #### `curve` properties
@@ -226,8 +229,8 @@ my_char_rotation = my_char_ctm.skew_x
 |`doctop`| Distance of curve's highest point from top of document.|
 |`linewidth`| Thickness of line.|
 |`fill`| Whether the shape defined by the curve's path is filled.|
-|`stroking_color`|The color of the curve's outline, expressed as a tuple or integer, depending on the “color space” used.|
-|`non_stroking_color`|The curve’s fill color.|
+|`stroking_color`|The color of the curve's outline. See [docs/colors.md](docs/colors.md) for details.|
+|`non_stroking_color`|The curve’s fill color. See [docs/colors.md](docs/colors.md) for details.|
 |`object_type`| "curve"|
 
 #### Derived properties
@@ -247,17 +250,12 @@ If you pass the `pdfminer.six`-handling `laparams` parameter to `pdfplumber.open
 
 `pdfplumber`'s visual debugging tools can be helpful in understanding the structure of a PDF and the objects that have been extracted from it.
 
-__Note:__ To use this feature, you'll also need to have two additional pieces of software installed on your computer:
-
-- [`ImageMagick`](https://www.imagemagick.org/). [Installation instructions here](http://docs.wand-py.org/en/latest/guide/install.html#install-imagemagick-debian).
-- [`ghostscript`](https://www.ghostscript.com). [Installation instructions here](https://ghostscript.readthedocs.io/en/latest/Install.html), or simply `apt install ghostscript` (Ubuntu) / `brew install ghostscript` (Mac).
-
 
 ### Creating a `PageImage` with `.to_image()`
 
 To turn any page (including cropped pages) into a `PageImage` object, call `my_page.to_image()`. You can optionally pass *one* of the following keyword arguments:
 
-- `resolution`: The desired number of pixels per inch. Defaults to 72. See note below.
+- `resolution`: The desired number of pixels per inch. Defaults to 72.
 - `width`: The desired image width in pixels.
 - `height`: The desired image height in pixels.
 
@@ -267,12 +265,10 @@ For instance:
 im = my_pdf.pages[0].to_image(resolution=150)
 ```
 
-From a script or REPL, `im.show()` will open the image in your local image viewer. But `PageImage` objects also play nicely with IPython/Jupyter notebooks; they automatically render as cell outputs. For example:
+From a script or REPL, `im.show()` will open the image in your local image viewer. But `PageImage` objects also play nicely with Jupyter notebooks; they automatically render as cell outputs. For example:
 
 ![Visual debugging in Jupyter](examples/screenshots/visual-debugging-in-jupyter.png "Visual debugging in Jupyter")
 
-*Note*: `pdfplumber` passes the `resolution` parameter to [Wand](https://docs.wand-py.org/en/latest/wand/image.html#wand.image.Image), the Python library we use for image conversion. Wand will create the image with the desired number of total pixels of height/width, but does not fully respect the `resolution` in the strict sense of that word: Although PNGs are capable of storing an image's resolution density as metadata, Wand's PNGs do not.
-
 *Note*: `.to_image(...)` works as expected with `Page.crop(...)`/`CroppedPage` instances, but is unable to incorporate changes made via `Page.filter(...)`/`FilteredPage` instances.
 
 
@@ -283,7 +279,7 @@ From a script or REPL, `im.show()` will open the image in your local image viewe
 |`im.reset()`| Clears anything you've drawn so far.|
 |`im.copy()`| Copies the image to a new `PageImage` object.|
 |`im.show()`| Opens the image in your local image viewer.|
-|`im.save(path_or_fileobject, format="PNG")`| Saves the annotated image.|
+|`im.save(path_or_fileobject, format="PNG", quantize=True, colors=256, bits=8)`| Saves the annotated image as a PNG file. The default arguments quantize the image to a palette of 256 colors, saving the PNG with 8-bit color depth. You can disable quantization by passing `quantize=False` or adjust the size of the color palette by passing `colors=N`.|
 
 ### Drawing methods
 
@@ -322,7 +318,7 @@ If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyErr
 
 | Method | Description |
 |--------|-------------|
-|`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see above), the first step in calculating the layout.</p></li></ul>|
+|`.extract_text(x_tolerance=3, y_tolerance=3, layout=False, x_density=7.25, y_density=13, **kwargs)`| Collates all of the page's character objects into a single string.<ul><li><p>When `layout=False`: Adds spaces where the difference between the `x1` of one character and the `x0` of the next is greater than `x_tolerance`. Adds newline characters where the difference between the `doctop` of one character and the `doctop` of the next is greater than `y_tolerance`.</p></li><li><p>When `layout=True` (*experimental feature*): Attempts to mimic the structural layout of the text on the page(s), using `x_density` and `y_density` to determine the minimum number of characters/newlines per "point," the PDF unit of measurement. All remaining `**kwargs` are passed to `.extract_words(...)` (see below), the first step in calculating the layout.</p></li></ul>|
 |`.extract_text_simple(x_tolerance=3, y_tolerance=3)`| A slightly faster but less flexible version of `.extract_text(...)`, using a simpler logic.|
 |`.extract_words(x_tolerance=3, y_tolerance=3, keep_blank_chars=False, use_text_flow=False, horizontal_ltr=True, vertical_ttb=True, extra_attrs=[], split_at_punctuation=False, expand_ligatures=True)`| Returns a list of all word-looking things and their bounding boxes. Words are considered to be sequences of characters where (for "upright" characters) the difference between the `x1` of one character and the `x0` of the next is less than or equal to `x_tolerance` *and* where the difference between the `doctop` of one character and the `doctop` of the next is less than or equal to `y_tolerance`. A similar approach is taken for non-upright characters, but instead measuring the vertical, rather than horizontal, distances between them. The parameters `horizontal_ltr` and `vertical_ttb` indicate whether the words should be read from left-to-right (for horizontal words) / top-to-bottom (for vertical words). Changing `keep_blank_chars` to `True` will mean that blank characters are treated as part of a word, not as a space between words. Changing `use_text_flow` to `True` will use the PDF's underlying flow of characters as a guide for ordering and segmenting the words, rather than presorting the characters by x/y position. (This mimics how dragging a cursor highlights text in a PDF; as with that, the order does not always appear to be logical.) Passing a list of `extra_attrs` (e.g., `["fontname", "size"]`) will restrict each word to characters that share exactly the same value for each of those [attributes](#char-properties), and the resulting word dicts will indicate those attributes. Setting `split_at_punctuation` to `True` will enforce breaking tokens at the punctuation characters specified by `string.punctuation`; or you can specify the list of separating punctuation by passing a string, e.g., <code>split_at_punctuation='!"&\'()*+,.:;<=>?@[\]^\`\{\|\}~'</code>. Unless you set `expand_ligatures=False`, ligatures such as `ﬁ` will be expanded into their constituent letters (e.g., `fi`).|
 |`.extract_text_lines(layout=False, strip=True, return_chars=True, **kwargs)`|*Experimental feature* that returns a list of dictionaries representing the lines of text on the page. The `strip` parameter works analogously to Python's `str.strip()` method, and returns `text` attributes without their surrounding whitespace. (Only relevant when `layout = True`.) Setting `return_chars` to `False` will exclude the individual character objects from the returned text-line dicts. The remaining `**kwargs` are those you would pass to `.extract_text(layout=True, ...)`.|
@@ -346,8 +342,9 @@ If you're using `pdfplumber` on a Debian-based system and encounter a `PolicyErr
 | Method | Description |
 |--------|-------------|
 |`.find_tables(table_settings={})`|Returns a list of `Table` objects. The `Table` object provides access to the `.cells`, `.rows`, and `.bbox` properties, as well as the `.extract(x_tolerance=3, y_tolerance=3)` method.|
+|`.find_table(table_settings={})`|Similar to `.find_tables(...)`, but returns the *largest* table on the page, as a `Table` object. If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.|
 |`.extract_tables(table_settings={})`|Returns the text extracted from *all* tables found on the page, represented as a list of lists of lists, with the structure `table -> row -> cell`.|
-|`.extract_table(table_settings={})`|Returns the text extracted from the *largest* table on the page, represented as a list of lists, with the structure `row -> cell`. (If multiple tables have the same size — as measured by the number of cells — this method returns the table closest to the top of the page.)|
+|`.extract_table(table_settings={})`|Returns the text extracted from the *largest* table on the page (see `.find_table(...)` above), represented as a list of lists, with the structure `row -> cell`.|
 |`.debug_tablefinder(table_settings={})`|Returns an instance of the `TableFinder` class, with access to the `.edges`, `.intersections`, `.cells`, and `.tables` properties.|
 
 For example:
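Reviewer's sketch of the selection rule described in the `.find_table(...)` row above: the largest table by cell count wins, and ties go to the table nearest the top of the page. This is plain Python for illustration only. `FakeTable` and `pick_largest` are hypothetical names, not pdfplumber classes or its actual implementation, and the `(x0, top, x1, bottom)` bbox layout is an assumption.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # assumed layout: (x0, top, x1, bottom)


@dataclass
class FakeTable:
    """Hypothetical stand-in for a table: a bounding box plus its cells."""
    bbox: Box
    cells: List[Box]


def pick_largest(tables: List[FakeTable]) -> Optional[FakeTable]:
    """Return the table with the most cells; ties go to the smallest `top`."""
    if not tables:
        return None
    # Key: more cells first (negated count), then closer to the top of the page.
    return min(tables, key=lambda t: (-len(t.cells), t.bbox[1]))


cell = (0, 0, 1, 1)
small = FakeTable(bbox=(0, 10, 100, 50), cells=[cell] * 2)
upper = FakeTable(bbox=(0, 100, 100, 200), cells=[cell] * 4)
lower = FakeTable(bbox=(0, 300, 100, 400), cells=[cell] * 4)

# `upper` and `lower` tie on cell count, so the one nearer the top is chosen.
print(pick_largest([lower, small, upper]) is upper)  # True
```

Using a single sort key keeps the tie-break explicit; a table with fewer cells never wins, however close to the top it sits.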

diff --git a/docs/colors.md b/docs/colors.md
@@ -0,0 +1,41 @@
+# Colors
+
+In the PDF specification, as well as in `pdfplumber`, most graphical objects can have two color attributes:
+
+- `stroking_color`: The color of the object's outline
+- `non_stroking_color`: The color of the object's interior, or "fill"
+
+In the PDF specification, colors have both a "color space" and a "color value".
+
+## Color Spaces
+
+Valid color spaces are grouped into three categories:
+
+- Device color spaces
+    - `DeviceGray`
+    - `DeviceRGB`
+    - `DeviceCMYK`
+- CIE-based color spaces
+    - `CalGray`
+    - `CalRGB`
+    - `Lab`
+    - `ICCBased`
+- Special color spaces
+    - `Indexed`
+    - `Pattern`
+    - `Separation`
+    - `DeviceN`
+
+To read more about the differences between those color spaces, see Section 4.5 of the [PDF Reference](https://ghostscript.com/~robin/pdf_reference17.pdf) (6th Ed.).
+
+`pdfplumber` aims to expose those color spaces as `scs` (stroking color space) and `ncs` (non-stroking color space), represented as a __string__.
+
+__Caveat__: The only information `pdfplumber` can __currently__ expose is the non-stroking color space for `char` objects. The rest (stroking color space for `char` objects and either color space for the other types of objects) will require a pull request to `pdfminer.six`.
+
+## Color Values
+
+The color value determines *what specific color* in the color space should be used. With the exception of the "special color spaces," these color values are specified as a series of numbers. For `DeviceRGB`, for example, the color values are three numbers, representing the intensities of red, green, and blue.
+
+In `pdfplumber`, those color values are exposed as `stroking_color` and `non_stroking_color`, represented as a __tuple of numbers__.
+
+The pattern specified by the `Pattern` color space is exposed via the `non_stroking_pattern` and `stroking_pattern` attributes.
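Reviewer's sketch of what "a tuple of numbers" means in practice for callers who previously received either a bare number or a sequence, depending on the color space. `normalize_color` is a hypothetical helper written for this note, not a function taken from pdfplumber; it only illustrates the shape of the normalized values.

```python
from typing import Optional, Sequence, Tuple, Union

Number = Union[int, float]


def normalize_color(
    value: Union[None, Number, Sequence[Number]]
) -> Optional[Tuple[Number, ...]]:
    """Hypothetical helper: coerce a raw color value into a tuple of numbers."""
    if value is None:
        return None  # no color specified for this object
    if isinstance(value, (int, float)):
        return (value,)  # a single component, e.g. a DeviceGray intensity
    return tuple(value)  # several components, e.g. DeviceRGB (3) or DeviceCMYK (4)


print(normalize_color(0.5))        # (0.5,)
print(normalize_color([1, 0, 0]))  # (1, 0, 0)
print(normalize_color(None))       # None
```

With a uniform tuple, downstream code can compare colors or check `len(color)` without first testing whether the value is a scalar.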
diff --git a/docs/repairing.md b/docs/repairing.md
@@ -0,0 +1,11 @@
+# Repairing Malformed PDFs
+
+Many parsing issues can be traced back to malformed PDFs.
+
+Malformed PDFs can often be [fixed via Ghostscript](https://superuser.com/questions/278562/how-can-i-fix-repair-a-corrupted-pdf-file).
+
+`pdfplumber` lets you automatically run those repairs, in several ways:
+
+- `pdfplumber.open(..., repair=True)` will repair your PDF on the fly (but not save the repaired version to disk).
+- `pdfplumber.repair(path_to_pdf)` will return a `BytesIO` object holding the bytes of a repaired version of the original file.
+- `pdfplumber.repair(path_to_pdf, outfile="path/to/repaired.pdf")` will write a repaired version of the original file to the indicated `outfile` path.
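Reviewer's sketch of the first and third options listed above, wrapped in two small functions. The function names and the file paths in the usage comments are hypothetical; the calls to `pdfplumber.open(..., repair=True)` and `pdfplumber.repair(..., outfile=...)` follow the descriptions in this file. Running them assumes pdfplumber 0.10.0 or later and a working Ghostscript installation.

```python
def first_page_text(path, repair=False):
    """Open a PDF, optionally repairing it on the fly, and return page 1's text."""
    # Imported lazily so this sketch can be loaded without pdfplumber installed.
    import pdfplumber

    with pdfplumber.open(path, repair=repair) as pdf:
        return pdf.pages[0].extract_text()


def save_repaired_copy(path, outfile):
    """Write a repaired version of `path` to `outfile` and return that path."""
    import pdfplumber

    pdfplumber.repair(path, outfile=outfile)
    return outfile


# Hypothetical usage (the paths are placeholders):
# text = first_page_text("malformed.pdf", repair=True)
# save_repaired_copy("malformed.pdf", "repaired.pdf")
```

Repairing on the fly suits one-off parsing; writing the repaired copy to disk avoids re-running the repair on every later open.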
diff --git a/examples/notebooks/ag-energy-roundup-curves.ipynb b/examples/notebooks/ag-energy-roundup-curves.ipynb
diff --git a/examples/notebooks/extract-table-ca-warn-report.ipynb b/examples/notebooks/extract-table-ca-warn-report.ipynb
diff --git a/examples/notebooks/extract-table-nics.ipynb b/examples/notebooks/extract-table-nics.ipynb
diff --git a/examples/notebooks/san-jose-pd-firearm-report.ipynb b/examples/notebooks/san-jose-pd-firearm-report.ipynb
diff --git a/pdfplumber/__init__.py b/pdfplumber/__init__.py
@@ -3,6 +3,7 @@
     "utils",
     "pdfminer",
     "open",
+    "repair",
     "set_debug",
 ]
 
@@ -12,5 +13,6 @@
 from . import utils
 from ._version import __version__
 from .pdf import PDF
+from .repair import repair
 
 open = PDF.open
diff --git a/pdfplumber/_version.py b/pdfplumber/_version.py
@@ -1,2 +1,2 @@
-version_info = (0, 9, 0)
+version_info = (0, 10, 0)
 __version__ = ".".join(map(str, version_info))