Commit

Merge pull request #300 from VikParuchuri/dev
Better tables, better markdown output, header level detection, table of contents
VikParuchuri authored Oct 17, 2024
2 parents 6534333 + d7b204b commit ea845fd
Showing 33 changed files with 2,646 additions and 2,767 deletions.
4 changes: 0 additions & 4 deletions .github/workflows/tests.yml
@@ -29,10 +29,6 @@ jobs:
run: |
poetry run python benchmarks/overall.py benchmark_data/pdfs benchmark_data/references report.json
poetry run python scripts/verify_benchmark_scores.py report.json --type marker
- name: Run table benchmark
run: |
poetry run python benchmarks/table.py tables.json
poetry run python scripts/verify_benchmark_scores.py tables.json --type table
1 change: 1 addition & 0 deletions .gitignore
@@ -8,6 +8,7 @@ wandb
*.dat
report.json
benchmark_data
debug_data

# Byte-compiled / optimized / DLL files
__pycache__/
4 changes: 2 additions & 2 deletions CLA.md
@@ -1,6 +1,6 @@
Marker Contributor Agreement

This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Vikas Paruchuri. The term "you" shall mean the person or entity identified below.
This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Endless Labs, Inc. The term "you" shall mean the person or entity identified below.

If you agree to be bound by these terms, sign by writing "I have read the CLA document and I hereby sign the CLA" in response to the CLA bot Github comment. Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.

@@ -20,5 +20,5 @@ If you or your affiliates institute patent litigation against any entity (includ
- each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this MCA;
- to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and
- each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws.
You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. Vikas Paruchuri may publicly disclose your participation in the project, including the fact that you have signed the MCA.
You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. Endless Labs, Inc. may publicly disclose your participation in the project, including the fact that you have signed the MCA.
6. This MCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.
57 changes: 41 additions & 16 deletions README.md
@@ -42,7 +42,7 @@ See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instruc

I want marker to be as widely accessible as possible, while still funding my development/training costs. Research and personal usage is always okay, but there are some restrictions on commercial usage.

The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/). If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).

# Hosted API

@@ -89,6 +89,7 @@ First, some configuration:
- Inspect the settings in `marker/settings.py`. You can override any settings with environment variables.
- Your torch device will be automatically detected, but you can override this. For example, `TORCH_DEVICE=cuda`.
- By default, marker will use `surya` for OCR. Surya is slower on CPU, but more accurate than tesseract. It also doesn't require you to specify the languages in the document. If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above). If you don't want OCR at all, set `OCR_ENGINE` to `None`.
- Some PDFs, even digital ones, have bad text in them. Set `OCR_ALL_PAGES=true` to force OCR if you find bad output from marker.

## Interactive App

@@ -107,15 +108,15 @@ marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --ma

- `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM. Higher numbers will take more VRAM, but process faster. Set to 2 by default. The default batch sizes will take ~3GB of VRAM.
- `--max_pages` is the maximum number of pages to process. Omit this to convert the entire document.
- `--start_page` is the page to start from (default is None, will start from the first page).
- `--langs` is an optional comma separated list of the languages in the document, for OCR. Optional by default, required if you use tesseract.
- `--ocr_all_pages` is an optional argument to force OCR on all pages of the PDF. If this or the env var `OCR_ALL_PAGES` are true, OCR will be forced.

The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py). If you need more languages, you can use any language supported by [Tesseract](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016) if you set `OCR_ENGINE` to `ocrmypdf`. If you don't need OCR, marker can work with any language.

## Convert multiple files

```shell
marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000
marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10
```

- `--workers` is the number of pdfs to convert at once. This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage. Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
@@ -136,7 +137,7 @@ You can use language names or codes. The exact codes depend on the OCR engine.
## Convert multiple files on multiple GPUs

```shell
MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
```

- `METADATA_FILE` is an optional path to a json file with metadata about the pdfs. See above for the format.
@@ -146,19 +147,52 @@ MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 mar

Note that the env variables above are specific to this script, and cannot be set in `local.env`.

# Output format

The output will be a markdown file, but there will also be a metadata json file that gives information about the conversion process. It has these fields:

```json
{
  "languages": null, // any languages that were passed in
  "filetype": "pdf", // type of the file
  "pdf_toc": [], // the table of contents from the pdf
  "computed_toc": [], // the computed table of contents
  "pages": 10, // page count
  "ocr_stats": {
    "ocr_pages": 0, // number of pages OCRed
    "ocr_failed": 0, // number of pages where OCR failed
    "ocr_success": 0,
    "ocr_engine": "none"
  },
  "block_stats": {
    "header_footer": 0,
    "code": 0, // number of code blocks
    "table": 2, // number of tables
    "equations": {
      "successful_ocr": 0,
      "unsuccessful_ocr": 0,
      "equations": 0
    }
  }
}
```
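
As a rough illustration of consuming this metadata, the sketch below parses a document with the fields listed above and prints a one-line summary. The sample values and the `summarize_meta` helper are invented for the example, and the `.get` fallbacks are an assumption about which keys might be missing. JSON itself has no comment syntax, so the `//` annotations above are presumably documentation only and are left out of the sample.

```python
import json

# Sample metadata mirroring the documented fields (all values are invented).
SAMPLE_META = """
{
  "languages": null,
  "filetype": "pdf",
  "pdf_toc": [],
  "computed_toc": [{"title": "Introduction", "level": 1, "page": 0}],
  "pages": 10,
  "ocr_stats": {"ocr_pages": 0, "ocr_failed": 0, "ocr_success": 0, "ocr_engine": "none"},
  "block_stats": {
    "header_footer": 0,
    "code": 0,
    "table": 2,
    "equations": {"successful_ocr": 0, "unsuccessful_ocr": 0, "equations": 0}
  }
}
"""


def summarize_meta(meta: dict) -> str:
    """Hypothetical helper: condense the metadata into one readable line."""
    ocr = meta.get("ocr_stats", {})
    blocks = meta.get("block_stats", {})
    return (
        f"{meta.get('pages', 0)} pages, "
        f"{ocr.get('ocr_pages', 0)} OCRed ({ocr.get('ocr_engine', 'unknown')}), "
        f"{blocks.get('table', 0)} tables, "
        f"{len(meta.get('computed_toc', []))} computed TOC entries"
    )


meta = json.loads(SAMPLE_META)
print(summarize_meta(meta))
```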

# Troubleshooting

There are some settings that you may find useful if things aren't working the way you expect:

- `OCR_ALL_PAGES` - set this to true to force OCR all pages. This can be very useful if the table layouts aren't recognized properly by default, or if there is garbled text.
- `OCR_ALL_PAGES` - set this to true to force OCR on all pages. This can be very useful if there is garbled text in the output of marker.
- `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
- `OCR_ENGINE` - can set this to `surya` or `ocrmypdf`.
- `DEBUG` - setting this to `True` shows ray logs when converting multiple pdfs
- Verify that you set the languages correctly, or passed in a metadata file.
- If you're getting out of memory errors, decrease the worker count (or increase the `VRAM_PER_TASK` setting). You can also try splitting up long PDFs into multiple files.

In general, if output is not what you expect, trying to OCR the PDF is a good first step. Not all PDFs have good text/bboxes embedded in them.
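
One way to apply these settings from a script is to pass them as environment variables when launching `marker_single`. The sketch below only builds the command and environment, and does not run anything. The `build_marker_single_call` helper is hypothetical; the command name, the `--max_pages` flag, and the variable names are taken from the README text in this diff.

```python
import os


def build_marker_single_call(pdf_path, out_dir, force_ocr=False, torch_device=None, max_pages=None):
    """Hypothetical helper: return (argv, env) for a marker_single invocation."""
    env = dict(os.environ)  # copy, so the current process environment is untouched
    if force_ocr:
        env["OCR_ALL_PAGES"] = "true"  # force OCR when embedded text is garbled
    if torch_device:
        env["TORCH_DEVICE"] = torch_device  # override automatic device detection

    argv = ["marker_single", pdf_path, out_dir]
    if max_pages is not None:
        argv += ["--max_pages", str(max_pages)]
    return argv, env


argv, env = build_marker_single_call(
    "/path/to/file.pdf",
    "/path/to/output/folder",
    force_ocr=True,
    torch_device="cuda",
    max_pages=5,
)
print(" ".join(argv))
print({key: env[key] for key in ("OCR_ALL_PAGES", "TORCH_DEVICE")})
# To actually run it (requires marker to be installed):
#   subprocess.run(argv, env=env, check=True)
```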

## Debugging

Set `DEBUG=true` to save data to the `debug` subfolder in the marker root directory. This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.

## Useful settings

These settings can improve/change output quality:
@@ -210,21 +244,13 @@ poetry install
Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:

```shell
python benchmark/overall.py data/pdfs data/references report.json --nougat
python benchmarks/overall.py data/pdfs data/references report.json --nougat
```

This will benchmark marker against other text extraction methods. It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.

Omit `--nougat` to exclude nougat from the benchmark. I don't recommend running nougat on CPU, since it is very slow.

### Table benchmark

There is a benchmark for table parsing, which you can run with:

```shell
python benchmarks/table.py test_data/tables.json
```

# Thanks

This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -233,6 +259,5 @@ This work would not have been possible without amazing open source models and da
- Texify
- Pypdfium2/pdfium
- DocLayNet from IBM
- ByT5 from Google

Thank you to the authors of these models and datasets for making them available to the community!
2 changes: 1 addition & 1 deletion benchmarks/overall.py
@@ -48,7 +48,7 @@ def nougat_prediction(pdf_filename, batch_size=1):


def main():
    parser = argparse.ArgumentParser(description="Benchmark PDF to MD conversion. Needs source pdfs, and a refernece folder with the correct markdown.")
    parser = argparse.ArgumentParser(description="Benchmark PDF to MD conversion. Needs source pdfs, and a reference folder with the correct markdown.")
    parser.add_argument("in_folder", help="Input PDF files")
    parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
    parser.add_argument("out_file", help="Output filename")
77 changes: 0 additions & 77 deletions benchmarks/table.py

This file was deleted.

8 changes: 3 additions & 5 deletions convert_single.py
@@ -2,6 +2,7 @@

import pypdfium2 # Needs to be at the top to avoid warnings
import os

os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" # For some reason, transformers decided to use .isin for a simple op, which is not supported on MPS

import argparse
@@ -22,23 +23,20 @@ def main():
    parser.add_argument("--start_page", type=int, default=None, help="Page to start processing at")
    parser.add_argument("--langs", type=str, help="Optional languages to use for OCR, comma separated", default=None)
    parser.add_argument("--batch_multiplier", type=int, default=2, help="How much to increase batch sizes")
    parser.add_argument("--debug", action="store_true", help="Enable debug logging", default=False)
    parser.add_argument("--ocr_all_pages", action="store_true", help="Force OCR on all pages", default=False)
    args = parser.parse_args()

    langs = args.langs.split(",") if args.langs else None

    fname = args.filename
    model_lst = load_all_models()
    start = time.time()
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page, ocr_all_pages=args.ocr_all_pages)
    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)

    fname = os.path.basename(fname)
    subfolder_path = save_markdown(args.output, fname, full_text, images, out_meta)

    print(f"Saved markdown to the {subfolder_path} folder")
    if args.debug:
        print(f"Total time: {time.time() - start}")
    print(f"Total time: {time.time() - start}")


if __name__ == "__main__":
67 changes: 67 additions & 0 deletions marker/cleaners/headings.py
@@ -1,4 +1,7 @@
from collections import defaultdict
from typing import List
import numpy as np
from sklearn.cluster import KMeans

from marker.settings import settings
from marker.schema.bbox import rescale_bbox
@@ -57,3 +60,67 @@ def split_heading_blocks(pages: List[Page]):
new_blocks.append(copied_block)

page.blocks = new_blocks


def bucket_headings(line_heights, num_levels=settings.HEADING_LEVEL_COUNT):
    if len(line_heights) <= num_levels:
        return []

    data = np.asarray(line_heights).reshape(-1, 1)
    labels = KMeans(n_clusters=num_levels, random_state=0, n_init="auto").fit_predict(data)
    data_labels = np.concatenate([data, labels.reshape(-1, 1)], axis=1)
    data_labels = np.sort(data_labels, axis=0)

    cluster_means = {label: np.mean(data_labels[data_labels[:, 1] == label, 0]) for label in np.unique(labels)}
    label_max = None
    label_min = None
    heading_ranges = []
    prev_cluster = None
    for row in data_labels:
        value, label = row
        if prev_cluster is not None and label != prev_cluster:
            prev_cluster_mean = cluster_means[prev_cluster]
            cluster_mean = cluster_means[label]
            if cluster_mean * settings.HEADING_MERGE_THRESHOLD < prev_cluster_mean:
                heading_ranges.append((label_min, label_max))
                label_min = None
                label_max = None

        label_min = value if label_min is None else min(label_min, value)
        label_max = value if label_max is None else max(label_max, value)
        prev_cluster = label

    if label_min is not None:
        heading_ranges.append((label_min, label_max))

    heading_ranges = sorted(heading_ranges, key=lambda x: x[0], reverse=True)

    return heading_ranges
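
The range-building loop in `bucket_headings` can be sketched in plain Python, without NumPy or scikit-learn, assuming the cluster labels have already been computed. The function name, the sample heights and labels, and the `merge_threshold=0.25` default are all hypothetical, since the value of `settings.HEADING_MERGE_THRESHOLD` is not shown in this diff. The sketch sorts the (height, label) pairs jointly. As far as I can tell, `np.sort(data_labels, axis=0)` sorts each column independently, which can separate a height from its own label, so the joint sort reflects an assumption about the intended behavior rather than what the committed code does.

```python
from collections import defaultdict
from statistics import mean


def ranges_from_labeled_heights(heights, labels, merge_threshold=0.25):
    """Group (height, label) pairs into (min, max) ranges, tallest range first."""
    # Sort the pairs together so every height keeps its own cluster label.
    pairs = sorted(zip(heights, labels))

    by_label = defaultdict(list)
    for height, label in pairs:
        by_label[label].append(height)
    cluster_means = {label: mean(values) for label, values in by_label.items()}

    ranges = []
    low = high = None
    prev_label = None
    for height, label in pairs:
        if prev_label is not None and label != prev_label:
            # Same comparison as in the diff: close the current range when the
            # new cluster's scaled mean is below the previous cluster's mean.
            if cluster_means[label] * merge_threshold < cluster_means[prev_label]:
                ranges.append((low, high))
                low = high = None
        low = height if low is None else min(low, height)
        high = height if high is None else max(high, height)
        prev_label = label

    if low is not None:
        ranges.append((low, high))

    return sorted(ranges, key=lambda r: r[0], reverse=True)


# Invented line heights with pre-assigned cluster labels.
heights = [24.0, 12.0, 18.0, 12.5, 23.5, 18.5]
labels = [2, 0, 1, 0, 2, 1]
print(ranges_from_labeled_heights(heights, labels))
```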


def infer_heading_levels(pages: List[Page], height_tol=.99):
    all_line_heights = []
    for page in pages:
        for block in page.blocks:
            if block.block_type not in ["Title", "Section-header"]:
                continue

            all_line_heights.extend([l.height for l in block.lines])

    heading_ranges = bucket_headings(all_line_heights)

    for page in pages:
        for block in page.blocks:
            if block.block_type not in ["Title", "Section-header"]:
                continue

            block_heights = [l.height for l in block.lines]  # Account for rotation
            avg_height = sum(block_heights) / len(block_heights)
            for idx, (min_height, max_height) in enumerate(heading_ranges):
                if avg_height >= min_height * height_tol:
                    block.heading_level = idx + 1
                    break

            if block.heading_level is None:
                block.heading_level = settings.HEADING_DEFAULT_LEVEL
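
The level lookup at the end of `infer_heading_levels` can be tried in isolation with a small sketch. It assumes the ranges arrive sorted tallest first, as `bucket_headings` returns them. The `heading_level_for` name, the sample ranges, and `default_level=2` are hypothetical; the actual value of `settings.HEADING_DEFAULT_LEVEL` is not visible in this diff.

```python
def heading_level_for(avg_height, heading_ranges, height_tol=0.99, default_level=2):
    """Return the 1-based level of the first range whose minimum the height reaches."""
    for idx, (min_height, _max_height) in enumerate(heading_ranges):
        # The tolerance lets heights just under a range's minimum still match it.
        if avg_height >= min_height * height_tol:
            return idx + 1
    # No range matched, so fall back to the default level.
    return default_level


# Invented ranges, tallest first.
ranges = [(23.5, 24.0), (18.0, 18.5), (12.0, 12.5)]
for height in (24.0, 18.2, 12.0, 9.0):
    print(height, heading_level_for(height, ranges))
```

Because only the lower bound of each range is checked, a height that falls between two ranges takes the level of the smaller range.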

29 changes: 29 additions & 0 deletions marker/cleaners/toc.py
@@ -0,0 +1,29 @@
from typing import List

from marker.schema.page import Page


def get_pdf_toc(doc, max_depth=15):
    toc = doc.get_toc(max_depth=max_depth)
    toc_list = []
    for item in toc:
        list_item = {
            "title": item.title,
            "level": item.level,
            "page": item.page_index,
        }
        toc_list.append(list_item)
    return toc_list


def compute_toc(pages: List[Page]):
    toc = []
    for page in pages:
        for block in page.blocks:
            if block.block_type in ["Title", "Section-header"]:
                toc.append({
                    "title": block.prelim_text,
                    "level": block.heading_level,
                    "page": page.pnum
                })
    return toc
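
To see what `compute_toc` produces, the sketch below runs the same logic on minimal stand-in classes and renders the result as an indented outline. The `Block` and `Page` dataclasses are simplified, hypothetical substitutes for marker's schema, and `render_outline` is an invented helper. It assumes every heading block already has a `heading_level` set.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Block:  # hypothetical stand-in for marker's block schema
    block_type: str
    prelim_text: str
    heading_level: Optional[int] = None


@dataclass
class Page:  # hypothetical stand-in for marker.schema.page.Page
    pnum: int
    blocks: List[Block] = field(default_factory=list)


def compute_toc(pages):
    """Same logic as in the diff: one entry per Title or Section-header block."""
    toc = []
    for page in pages:
        for block in page.blocks:
            if block.block_type in ["Title", "Section-header"]:
                toc.append({
                    "title": block.prelim_text,
                    "level": block.heading_level,
                    "page": page.pnum,
                })
    return toc


def render_outline(toc):
    """Invented helper: indent each entry by two spaces per heading level."""
    return "\n".join(
        f"{'  ' * (entry['level'] - 1)}{entry['title']} (page {entry['page']})"
        for entry in toc
    )


pages = [
    Page(0, [Block("Title", "Marker", 1), Block("Text", "Intro paragraph")]),
    Page(1, [Block("Section-header", "Installation", 2)]),
]
toc = compute_toc(pages)
print(render_outline(toc))
```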