Merge pull request #300 from VikParuchuri/dev

Better tables, better markdown output, header level detection, table of contents
VikParuchuri · Oct 17, 2024 · ea845fd
2 parents 6534333 + d7b204b
commit ea845fd

Showing 33 changed files with 2,646 additions and 2,767 deletions.
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
@@ -29,10 +29,6 @@ jobs:
         run: |
           poetry run python benchmarks/overall.py benchmark_data/pdfs benchmark_data/references report.json
           poetry run python scripts/verify_benchmark_scores.py report.json --type marker
-      - name: Run table benchmark
-        run: |
-          poetry run python benchmarks/table.py tables.json
-          poetry run python scripts/verify_benchmark_scores.py tables.json --type table
         
           
 
diff --git a/.gitignore b/.gitignore
@@ -8,6 +8,7 @@ wandb
 *.dat
 report.json
 benchmark_data
+debug_data
 
 # Byte-compiled / optimized / DLL files
 __pycache__/

diff --git a/CLA.md b/CLA.md
@@ -1,6 +1,6 @@
 Marker Contributor Agreement
 
-This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Vikas Paruchuri. The term "you" shall mean the person or entity identified below. 
+This Marker Contributor Agreement ("MCA") applies to any contribution that you make to any product or project managed by us (the "project"), and sets out the intellectual property rights you grant to us in the contributed materials. The term "us" shall mean Endless Labs, Inc. The term "you" shall mean the person or entity identified below. 
 
 If you agree to be bound by these terms, sign by writing "I have read the CLA document and I hereby sign the CLA" in response to the CLA bot Github comment. Read this agreement carefully before signing. These terms and conditions constitute a binding legal agreement.
 
@@ -20,5 +20,5 @@ If you or your affiliates institute patent litigation against any entity (includ
    - each contribution that you submit is and shall be an original work of authorship and you can legally grant the rights set out in this MCA; 
    - to the best of your knowledge, each contribution will not violate any third party's copyrights, trademarks, patents, or other intellectual property rights; and 
    - each contribution shall be in compliance with U.S. export control laws and other applicable export and import laws.
-You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. Vikas Paruchuri may publicly disclose your participation in the project, including the fact that you have signed the MCA. 
+You agree to notify us if you become aware of any circumstance which would make any of the foregoing representations inaccurate in any respect. Endless Labs, Inc. may publicly disclose your participation in the project, including the fact that you have signed the MCA. 
 6. This MCA is governed by the laws of the State of California and applicable U.S. Federal law. Any choice of law rules will not apply.
diff --git a/README.md b/README.md
@@ -42,7 +42,7 @@ See [below](#benchmarks) for detailed speed and accuracy benchmarks, and instruc
 
 I want marker to be as widely accessible as possible, while still funding my development/training costs.  Research and personal usage is always okay, but there are some restrictions on commercial usage.
 
-The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
+The weights for the models are licensed `cc-by-nc-sa-4.0`, but I will waive that for any organization under $5M USD in gross revenue in the most recent 12-month period AND under $5M in lifetime VC/angel funding raised. You also must not be competitive with the [Datalab API](https://www.datalab.to/).  If you want to remove the GPL license requirements (dual-license) and/or use the weights commercially over the revenue limit, check out the options [here](https://www.datalab.to).
 
 # Hosted API
 
@@ -89,6 +89,7 @@ First, some configuration:
 - Inspect the settings in `marker/settings.py`.  You can override any settings with environment variables.
 - Your torch device will be automatically detected, but you can override this.  For example, `TORCH_DEVICE=cuda`.
 - By default, marker will use `surya` for OCR.  Surya is slower on CPU, but more accurate than tesseract.  It also doesn't require you to specify the languages in the document.  If you want faster OCR, set `OCR_ENGINE` to `ocrmypdf`. This also requires external dependencies (see above).  If you don't want OCR at all, set `OCR_ENGINE` to `None`.
+- Some PDFs, even digital ones, have bad text in them.  Set `OCR_ALL_PAGES=true` to force OCR if you find bad output from marker.
 
 ## Interactive App
 
@@ -107,15 +108,15 @@ marker_single /path/to/file.pdf /path/to/output/folder --batch_multiplier 2 --ma
 
 - `--batch_multiplier` is how much to multiply default batch sizes by if you have extra VRAM.  Higher numbers will take more VRAM, but process faster.  Set to 2 by default.  The default batch sizes will take ~3GB of VRAM.
 - `--max_pages` is the maximum number of pages to process.  Omit this to convert the entire document.
+- `--start_page` is the page to start from (default is None, will start from the first page).
 - `--langs` is an optional comma separated list of the languages in the document, for OCR.  Optional by default, required if you use tesseract.
-- `--ocr_all_pages` is an optional argument to force OCR on all pages of the PDF.  If this or the env var `OCR_ALL_PAGES` are true, OCR will be forced.
 
 The list of supported languages for surya OCR is [here](https://github.com/VikParuchuri/surya/blob/master/surya/languages.py).  If you need more languages, you can use any language supported by [Tesseract](https://tesseract-ocr.github.io/tessdoc/Data-Files#data-files-for-version-400-november-29-2016) if you set `OCR_ENGINE` to `ocrmypdf`.  If you don't need OCR, marker can work with any language.
 
 ## Convert multiple files
 
 ```shell
-marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10 --min_length 10000
+marker /path/to/input/folder /path/to/output/folder --workers 4 --max 10
 ```
 
 - `--workers` is the number of pdfs to convert at once.  This is set to 1 by default, but you can increase it to increase throughput, at the cost of more CPU/GPU usage.  Marker will use 5GB of VRAM per worker at the peak, and 3.5GB average.
@@ -136,7 +137,7 @@ You can use language names or codes.  The exact codes depend on the OCR engine.
 ## Convert multiple files on multiple GPUs
 
 ```shell
-MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
+METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 marker_chunk_convert ../pdf_in ../md_out
 ```
 
 - `METADATA_FILE` is an optional path to a json file with metadata about the pdfs.  See above for the format.
@@ -146,19 +147,52 @@ MIN_LENGTH=10000 METADATA_FILE=../pdf_meta.json NUM_DEVICES=4 NUM_WORKERS=15 mar
 
 Note that the env variables above are specific to this script, and cannot be set in `local.env`.
 
+# Output format
+
+The output will be a markdown file, but there will also be a metadata json file that gives information about the conversion process.  It has these fields:
+
+```json
+{
+    "languages": null, // any languages that were passed in
+    "filetype": "pdf", // type of the file
+    "pdf_toc": [], // the table of contents from the pdf
+    "computed_toc": [], //the computed table of contents
+    "pages": 10, // page count
+    "ocr_stats": {
+        "ocr_pages": 0, // number of pages OCRed
+        "ocr_failed": 0, // number of pages where OCR failed
+        "ocr_success": 0,
+        "ocr_engine": "none"
+    },
+    "block_stats": {
+        "header_footer": 0,
+        "code": 0, // number of code blocks
+        "table": 2, // number of tables
+        "equations": {
+            "successful_ocr": 0,
+            "unsuccessful_ocr": 0,
+            "equations": 0
+        }
+    }
+}
+```
+
 # Troubleshooting
 
 There are some settings that you may find useful if things aren't working the way you expect:
 
-- `OCR_ALL_PAGES` - set this to true to force OCR all pages.  This can be very useful if the table layouts aren't recognized properly by default, or if there is garbled text.
+- `OCR_ALL_PAGES` - set this to true to force OCR all pages.  This can be very useful if there is garbled text in the output of marker.
 - `TORCH_DEVICE` - set this to force marker to use a given torch device for inference.
 - `OCR_ENGINE` - can set this to `surya` or `ocrmypdf`.
-- `DEBUG` - setting this to `True` shows ray logs when converting multiple pdfs
 - Verify that you set the languages correctly, or passed in a metadata file.
 - If you're getting out of memory errors, decrease worker count (increased the `VRAM_PER_TASK` setting).  You can also try splitting up long PDFs into multiple files.
 
 In general, if output is not what you expect, trying to OCR the PDF is a good first step.  Not all PDFs have good text/bboxes embedded in them.
 
+## Debugging
+
+Set `DEBUG=true` to save data to the `debug` subfolder in the marker root directory.  This will save images of each page with detected layout and text, as well as output a json file with additional bounding box information.
+
 ## Useful settings
 
 These settings can improve/change output quality:
@@ -210,21 +244,13 @@ poetry install
 Download the benchmark data [here](https://drive.google.com/file/d/1ZSeWDo2g1y0BRLT7KnbmytV2bjWARWba/view?usp=sharing) and unzip. Then run the overall benchmark like this:
 
 ```shell
-python benchmark/overall.py data/pdfs data/references report.json --nougat
+python benchmarks/overall.py data/pdfs data/references report.json --nougat
 ```
 
 This will benchmark marker against other text extraction methods.  It sets up batch sizes for nougat and marker to use a similar amount of GPU RAM for each.
 
 Omit `--nougat` to exclude nougat from the benchmark.  I don't recommend running nougat on CPU, since it is very slow.
 
-### Table benchmark
-
-There is a benchmark for table parsing, which you can run with:
-
-```shell
-python benchmarks/table.py test_data/tables.json
-```
-
 # Thanks
 
 This work would not have been possible without amazing open source models and datasets, including (but not limited to):
@@ -233,6 +259,5 @@ This work would not have been possible without amazing open source models and da
 - Texify
 - Pypdfium2/pdfium
 - DocLayNet from IBM
-- ByT5 from Google
 
 Thank you to the authors of these models and datasets for making them available to the community!
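
Review note: the new "Output format" section in the README lists the fields of the metadata json file. As a minimal sketch of how a consumer might read it, the snippet below parses a sample with the same keys and pulls out a few counts. The sample values and the `summarize_meta` helper are illustrative assumptions and are not part of marker. The `//` comments shown in the README are annotations and are left out here because they are not valid JSON.

```python
import json

# Sample metadata using the field names from the README diff.
# The values are illustrative, not real output.
SAMPLE_META = """
{
    "languages": null,
    "filetype": "pdf",
    "pdf_toc": [],
    "computed_toc": [],
    "pages": 10,
    "ocr_stats": {
        "ocr_pages": 0,
        "ocr_failed": 0,
        "ocr_success": 0,
        "ocr_engine": "none"
    },
    "block_stats": {
        "header_footer": 0,
        "code": 0,
        "table": 2,
        "equations": {
            "successful_ocr": 0,
            "unsuccessful_ocr": 0,
            "equations": 0
        }
    }
}
"""


def summarize_meta(meta: dict) -> dict:
    """Hypothetical helper: collect a few headline numbers from the metadata."""
    ocr = meta.get("ocr_stats", {})
    blocks = meta.get("block_stats", {})
    return {
        "pages": meta.get("pages"),
        "ocr_pages": ocr.get("ocr_pages", 0),
        "tables": blocks.get("table", 0),
        # computed_toc may be empty or null, so fall back to an empty list
        "toc_entries": len(meta.get("computed_toc") or []),
    }


meta = json.loads(SAMPLE_META)
summary = summarize_meta(meta)
print(summary)
```
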
diff --git a/benchmarks/overall.py b/benchmarks/overall.py
@@ -48,7 +48,7 @@ def nougat_prediction(pdf_filename, batch_size=1):
 
 
 def main():
-    parser = argparse.ArgumentParser(description="Benchmark PDF to MD conversion.  Needs source pdfs, and a refernece folder with the correct markdown.")
+    parser = argparse.ArgumentParser(description="Benchmark PDF to MD conversion.  Needs source pdfs, and a reference folder with the correct markdown.")
     parser.add_argument("in_folder", help="Input PDF files")
     parser.add_argument("reference_folder", help="Reference folder with reference markdown files")
     parser.add_argument("out_file", help="Output filename")

diff --git a/benchmarks/table.py b/benchmarks/table.py
diff --git a/convert_single.py b/convert_single.py
@@ -2,6 +2,7 @@
 
 import pypdfium2 # Needs to be at the top to avoid warnings
 import os
+
 os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1" # For some reason, transformers decided to use .isin for a simple op, which is not supported on MPS
 
 import argparse
@@ -22,23 +23,20 @@ def main():
     parser.add_argument("--start_page", type=int, default=None, help="Page to start processing at")
     parser.add_argument("--langs", type=str, help="Optional languages to use for OCR, comma separated", default=None)
     parser.add_argument("--batch_multiplier", type=int, default=2, help="How much to increase batch sizes")
-    parser.add_argument("--debug", action="store_true", help="Enable debug logging", default=False)
-    parser.add_argument("--ocr_all_pages", action="store_true", help="Force OCR on all pages", default=False)
     args = parser.parse_args()
 
     langs = args.langs.split(",") if args.langs else None
 
     fname = args.filename
     model_lst = load_all_models()
     start = time.time()
-    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page, ocr_all_pages=args.ocr_all_pages)
+    full_text, images, out_meta = convert_single_pdf(fname, model_lst, max_pages=args.max_pages, langs=langs, batch_multiplier=args.batch_multiplier, start_page=args.start_page)
 
     fname = os.path.basename(fname)
     subfolder_path = save_markdown(args.output, fname, full_text, images, out_meta)
 
     print(f"Saved markdown to the {subfolder_path} folder")
-    if args.debug:
-        print(f"Total time: {time.time() - start}")
+    print(f"Total time: {time.time() - start}")
 
 
 if __name__ == "__main__":

diff --git a/marker/cleaners/headings.py b/marker/cleaners/headings.py
@@ -1,4 +1,7 @@
+from collections import defaultdict
 from typing import List
+import numpy as np
+from sklearn.cluster import KMeans
 
 from marker.settings import settings
 from marker.schema.bbox import rescale_bbox
@@ -57,3 +60,67 @@ def split_heading_blocks(pages: List[Page]):
                 new_blocks.append(copied_block)
 
         page.blocks = new_blocks
+
+
+def bucket_headings(line_heights, num_levels=settings.HEADING_LEVEL_COUNT):
+    if len(line_heights) <= num_levels:
+        return []
+
+    data = np.asarray(line_heights).reshape(-1, 1)
+    labels = KMeans(n_clusters=num_levels, random_state=0, n_init="auto").fit_predict(data)
+    data_labels = np.concatenate([data, labels.reshape(-1, 1)], axis=1)
+    data_labels = np.sort(data_labels, axis=0)
+
+    cluster_means = {label: np.mean(data_labels[data_labels[:, 1] == label, 0]) for label in np.unique(labels)}
+    label_max = None
+    label_min = None
+    heading_ranges = []
+    prev_cluster = None
+    for row in data_labels:
+        value, label = row
+        if prev_cluster is not None and label != prev_cluster:
+            prev_cluster_mean = cluster_means[prev_cluster]
+            cluster_mean = cluster_means[label]
+            if cluster_mean * settings.HEADING_MERGE_THRESHOLD < prev_cluster_mean:
+                heading_ranges.append((label_min, label_max))
+                label_min = None
+                label_max = None
+
+        label_min = value if label_min is None else min(label_min, value)
+        label_max = value if label_max is None else max(label_max, value)
+        prev_cluster = label
+
+    if label_min is not None:
+        heading_ranges.append((label_min, label_max))
+
+    heading_ranges = sorted(heading_ranges, key=lambda x: x[0], reverse=True)
+
+    return heading_ranges
+
+
+def infer_heading_levels(pages: List[Page], height_tol=.99):
+    all_line_heights = []
+    for page in pages:
+        for block in page.blocks:
+            if block.block_type not in ["Title", "Section-header"]:
+                continue
+
+            all_line_heights.extend([l.height for l in block.lines])
+
+    heading_ranges = bucket_headings(all_line_heights)
+
+    for page in pages:
+        for block in page.blocks:
+            if block.block_type not in ["Title", "Section-header"]:
+                continue
+
+            block_heights = [l.height for l in block.lines] # Account for rotation
+            avg_height = sum(block_heights) / len(block_heights)
+            for idx, (min_height, max_height) in enumerate(heading_ranges):
+                if avg_height >= min_height * height_tol:
+                    block.heading_level = idx + 1
+                    break
+
+            if block.heading_level is None:
+                block.heading_level = settings.HEADING_DEFAULT_LEVEL
+
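
Review note on `bucket_headings`: `np.sort(data_labels, axis=0)` sorts each column independently, so the height column and the label column may no longer line up row by row after the sort. The sketch below is one possible way to express the same range-merging and level-assignment steps in plain Python while keeping each height attached to its cluster label. It is an illustration under stated assumptions, not the project's implementation: the function names, the `merge_threshold` and `default_level` parameters, and the sample values are hypothetical stand-ins for `settings.HEADING_MERGE_THRESHOLD` and `settings.HEADING_DEFAULT_LEVEL`, whose actual values are not shown in this diff.

```python
from statistics import mean


def merge_heading_ranges(pairs, merge_threshold):
    """pairs: iterable of (line_height, cluster_label).

    Returns a list of (min_height, max_height) ranges, largest first.
    """
    # Sort whole rows by height so every height keeps its own label.
    rows = sorted(pairs, key=lambda p: p[0])

    heights_by_label = {}
    for height, label in rows:
        heights_by_label.setdefault(label, []).append(height)
    cluster_means = {label: mean(hs) for label, hs in heights_by_label.items()}

    ranges = []
    lo = hi = None
    prev = None
    for height, label in rows:
        if prev is not None and label != prev:
            # Same comparison as in the diff: close the current range when
            # the new cluster mean, scaled by the threshold, is below the
            # previous cluster mean.
            if cluster_means[label] * merge_threshold < cluster_means[prev]:
                ranges.append((lo, hi))
                lo = hi = None
        lo = height if lo is None else min(lo, height)
        hi = height if hi is None else max(hi, height)
        prev = label

    if lo is not None:
        ranges.append((lo, hi))
    return sorted(ranges, key=lambda r: r[0], reverse=True)


def heading_level(avg_height, heading_ranges, default_level, height_tol=0.99):
    """Return the 1-based index of the first range whose lower bound
    (relaxed by height_tol) the average height reaches, else default_level."""
    for idx, (min_height, _max_height) in enumerate(heading_ranges):
        if avg_height >= min_height * height_tol:
            return idx + 1
    return default_level


# Illustrative data: (line height, cluster label). Labels would come from
# the clustering step in the diff; here they are written by hand.
pairs = [(14.0, 1), (10.0, 0), (20.0, 2), (10.5, 0)]
ranges = merge_heading_ranges(pairs, merge_threshold=0.25)
print(ranges)
print(heading_level(13.9, ranges, default_level=2))
```

With the hand-written labels above, the tallest lines map to level 1, and a block slightly shorter than a range's lower bound still matches that range because of `height_tol`.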
diff --git a/marker/cleaners/toc.py b/marker/cleaners/toc.py
@@ -0,0 +1,29 @@
+from typing import List
+
+from marker.schema.page import Page
+
+
+def get_pdf_toc(doc, max_depth=15):
+    toc = doc.get_toc(max_depth=max_depth)
+    toc_list = []
+    for item in toc:
+        list_item = {
+            "title": item.title,
+            "level": item.level,
+            "page": item.page_index,
+        }
+        toc_list.append(list_item)
+    return toc_list
+
+
+def compute_toc(pages: List[Page]):
+    toc = []
+    for page in pages:
+        for block in page.blocks:
+            if block.block_type in ["Title", "Section-header"]:
+                toc.append({
+                    "title": block.prelim_text,
+                    "level": block.heading_level,
+                    "page": page.pnum
+                })
+    return toc
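
Review note: to show what `compute_toc` produces, here is a small self-contained sketch. `FakeBlock` and `FakePage` are hypothetical stand-ins that carry only the attributes the function reads (`block_type`, `prelim_text`, `heading_level`, `pnum`, `blocks`). They are not marker's real `Block` and `Page` schema classes, and the sample text is invented.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class FakeBlock:
    # Hypothetical stand-in for a marker block
    block_type: str
    prelim_text: str
    heading_level: Optional[int] = None


@dataclass
class FakePage:
    # Hypothetical stand-in for a marker page
    pnum: int
    blocks: List[FakeBlock] = field(default_factory=list)


def compute_toc(pages):
    """Same logic as the function added in this diff: one entry per heading block."""
    toc = []
    for page in pages:
        for block in page.blocks:
            if block.block_type in ["Title", "Section-header"]:
                toc.append({
                    "title": block.prelim_text,
                    "level": block.heading_level,
                    "page": page.pnum,
                })
    return toc


pages = [
    FakePage(pnum=0, blocks=[
        FakeBlock("Title", "A Sample Document", heading_level=1),
        FakeBlock("Text", "Body text is skipped."),
    ]),
    FakePage(pnum=1, blocks=[
        FakeBlock("Section-header", "Methods", heading_level=2),
    ]),
]

toc = compute_toc(pages)
for entry in toc:
    print(entry)
```

Only `Title` and `Section-header` blocks contribute entries, so the `level` values are meaningful only if heading levels have already been assigned to those blocks.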