Segmenting Markdown-converted PDFs into pages #86

umarbutler · 2024-02-17T03:41:26Z

Hi @VikParuchuri,
Thank you very much for creating this invaluable package which I have found extremely useful in several projects already. I just wanted to ask if an option could be added to indicate where pages start and end in the outputted Markdown? Even having the ability to add a custom delimiter such as <page> would help.


umarbutler · 2024-02-17T10:44:43Z

For anyone else interested in preserving page boundaries, I managed to add a page delimiter by:

Replacing the merge_lines() function in markdown.py with the following:

def merge_lines(blocks, page_blocks: List[Page]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page_i, page in enumerate(blocks):
        for block in page:
            block_type = block.most_common_block_type()
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type
                    )
                )
                block_text = ""

            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x

                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text

        # Append the page delimiter (U+FFFC, the object replacement character)
        # after every page except the last
        if page_i != len(blocks) - 1:
            block_text += '\ufffc'

    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type
        )
    )
    return text_blocks

Replacing lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå" in the line_separator() function of markdown.py with lowercase_letters = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå\ufffc" (the same string with the delimiter character U+FFFC appended, written here as a Python escape). This ensures that delimiters do not cause newlines to be inserted in the middle of lines.
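To illustrate why appending the delimiter to the character set matters, here is a minimal sketch. The `ends_with_member` helper and its pattern are hypothetical and are not marker's actual regex; the sketch only assumes that line_separator() tests text against a character class built from lowercase_letters, and that the delimiter is U+FFFC.

```python
import re

# Character set quoted above, before and after the change.
BASE_LETTERS = "a-zà-öø-ÿа-яşćăâđêôơưþðæøå"
PATCHED_LETTERS = BASE_LETTERS + "\ufffc"  # U+FFFC added as one more member


def ends_with_member(text: str, letters: str) -> bool:
    """Hypothetical check: does `text` end with a character from the class?"""
    return re.search(rf"[{letters}]$", text) is not None


# Text accumulated up to a page break, with the delimiter appended to it.
page_end = "the sentence runs on\ufffc"

before = ends_with_member(page_end, BASE_LETTERS)     # delimiter not in the class
after = ends_with_member(page_end, PATCHED_LETTERS)   # delimiter treated like a letter
```

With the unpatched set, text ending in the delimiter no longer looks like a line that continues, which is the situation the change is meant to avoid.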

This uses U+FFFC (Unicode's object replacement character) instead of <page> because it is a single character and can therefore be added directly to the lowercase_letters regex character set, without having to rework the regex patterns. You may replace it with any other single character of your choosing.

This is a bit of a hacky solution so I'd still like to see page segmentation implemented officially in marker.
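Once the delimiter is in the output, the pages can be recovered by splitting on it. The following is a minimal sketch, assuming the delimiter is U+FFFC and that it survives unchanged into the final Markdown string; `split_pages` is a hypothetical helper, not part of marker.

```python
PAGE_DELIMITER = "\ufffc"  # assumed delimiter: Unicode object replacement character


def split_pages(markdown: str, delimiter: str = PAGE_DELIMITER) -> list[str]:
    """Split converted Markdown into per-page strings, trimming stray whitespace."""
    return [page.strip() for page in markdown.split(delimiter)]


# Made-up sample output with two page breaks.
sample = "# Title\n\nFirst page text.\ufffcSecond page text.\n\n\ufffcThird page."
pages = split_pages(sample)
```

If you chose a different delimiter character, pass it as the second argument.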


nunamia · 2024-02-21T08:52:27Z

Yes, you need to edit schema.py

and edit markdown.py:
def merge_lines(blocks, page_blocks: List[Page]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    block_pnum = 0
    common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page in blocks:
        for block in page:
            block_pnum = block.pnum
            block_type = block.most_common_block_type()
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type,
                        pnum=block_pnum
                    )
                )
                block_text = ""
            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text

    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type,
            pnum=block_pnum
        )
    )
    return text_blocks

Terranic · 2024-04-28T07:54:14Z

@nunamia How about opening a pull request to merge this solution?

However, I'm seeing issues with the page numbers. I have a document from the EU Parliament in which every page has content, but the same page number appears too often and the numbering jumps.

umarbutler · 2024-05-10T01:26:16Z

@Terranic Try out my solution; I haven't run into that issue with it.

VikParuchuri · 2024-05-12T02:14:28Z

Thanks for the script, @umarbutler. This is on my list of features to include, as a few people have asked for it.

HaileyStorm · 2024-05-13T20:49:02Z

Here's a script to monkeypatch Marker with @umarbutler's solution:

import ast
import inspect
import marker.postprocessors.markdown


class MarkdownTransformer(ast.NodeTransformer):
    def __init__(self):
        self.current_function = None

    def visit_FunctionDef(self, node):
        # Store the current function name
        self.current_function = node.name
        # Visit all the child nodes within the function
        self.generic_visit(node)
        # Reset current function name to None after leaving the function
        self.current_function = None
        return node

    def visit_Assign(self, node):
        if self.current_function == 'line_separator':
            if isinstance(node.targets[0], ast.Name) and node.targets[0].id == 'lowercase_letters':
                if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
                    original_value = node.value.value
                    # Append the page delimiter (U+FFFC) to the character set
                    new_value = original_value + '\ufffc'
                    node.value = ast.Constant(value=new_value)
        return node

    def visit_For(self, node):
        if self.current_function == 'merge_lines':
            # Check if the loop iterates over a variable named 'page'
            if isinstance(node.target, ast.Name) and node.target.id == 'page':
                # Change the loop to use enumerate
                node.iter = ast.Call(
                    func=ast.Name(id='enumerate', ctx=ast.Load()),
                    args=[node.iter],
                    keywords=[]
                )
                node.target = ast.Tuple(elts=[
                    ast.Name(id='page_i', ctx=ast.Store()),
                    ast.Name(id='page', ctx=ast.Store())
                ], ctx=ast.Store())

                # Create the additional check and append operation
                page_check = ast.parse("""
if page_i != len(blocks) - 1:
    block_text += '\ufffc'
""").body[0]
                node.body.append(page_check)
        return node


# Get the source code and make the AST
markdown_source = inspect.getsource(marker.postprocessors.markdown)
markdown_ast = ast.parse(markdown_source)

# Create the AST transformer instance
markdown_transformer = MarkdownTransformer()

# Perform the transformation (explores the tree and applies defined transformation functions, returning the new tree)
markdown_ast = markdown_transformer.visit(markdown_ast)
# Fix missing locations in the modified AST
ast.fix_missing_locations(markdown_ast)

# Replace the functions in the actual module - e.g. internal module calls to
# marker.postprocessors.markdown.line_separator will use the updated version.
exec(compile(markdown_ast, filename='<ast>', mode='exec'), marker.postprocessors.markdown.__dict__)
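For anyone who wants to check the technique in isolation before applying it to Marker, here is a small self-contained sketch of the same AST-rewriting approach on a toy module. The toy source, the `AppendToConstant` class, and the function names in it are made up for illustration; only the standard-library `ast` calls are real.

```python
import ast

# Toy stand-in for a module; both functions assign the same variable name.
SOURCE = '''
def line_separator():
    lowercase_letters = "a-z"
    return lowercase_letters

def other():
    lowercase_letters = "a-z"
    return lowercase_letters
'''


class AppendToConstant(ast.NodeTransformer):
    """Append a suffix to one string assignment inside one named function."""

    def __init__(self, function_name, variable_name, suffix):
        self.function_name = function_name
        self.variable_name = variable_name
        self.suffix = suffix
        self._inside = False

    def visit_FunctionDef(self, node):
        # Only descend into the function we want to patch.
        if node.name == self.function_name:
            self._inside = True
            self.generic_visit(node)
            self._inside = False
        return node

    def visit_Assign(self, node):
        target = node.targets[0]
        if (
            self._inside
            and isinstance(target, ast.Name)
            and target.id == self.variable_name
            and isinstance(node.value, ast.Constant)
            and isinstance(node.value.value, str)
        ):
            node.value = ast.Constant(value=node.value.value + self.suffix)
        return node


tree = ast.parse(SOURCE)
tree = AppendToConstant("line_separator", "lowercase_letters", "\ufffc").visit(tree)
ast.fix_missing_locations(tree)  # new nodes need line/column info to compile

# Execute the rewritten module into a fresh namespace and call both functions.
namespace = {}
exec(compile(tree, filename="<ast>", mode="exec"), namespace)
patched_value = namespace["line_separator"]()
untouched_value = namespace["other"]()
```

Executing into a fresh dict keeps the experiment away from any real module; the script above instead executes into the module's own `__dict__` so that Marker's internal calls pick up the patched functions.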

knysfh · 2024-06-03T07:21:37Z

To save others some debugging: using @umarbutler's method requires changing two files, marker/schema/merged.py and marker/postprocessors/markdown.py.

Note: tested on marker-pdf==0.2.5.

merged.py

from collections import Counter
from typing import List, Optional

from pydantic import BaseModel

from marker.schema.bbox import BboxElement


class MergedLine(BboxElement):
    text: str
    fonts: List[str]

    def most_common_font(self):
        counter = Counter(self.fonts)
        return counter.most_common(1)[0][0]


class MergedBlock(BboxElement):
    lines: List[MergedLine]
    pnum: int
    block_type: Optional[str]


class FullyMergedBlock(BaseModel):
    text: str
    block_type: str
    pnum: int

markdown.py: replace the merge_lines function.

def merge_lines(blocks: List[List[MergedBlock]]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    block_pnum = 0
    # common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page_i, page in enumerate(blocks):
        for block in page:
            block_pnum = block.pnum
            block_type = block.block_type
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type,
                        pnum=block_pnum
                    )
                )
                block_text = ""

            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x

                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text

        # Append the page delimiter (U+FFFC, the object replacement character)
        # after every page except the last
        if page_i != len(blocks) - 1:
            block_text += '\ufffc'

    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type,
            pnum=block_pnum
        )
    )
    return text_blocks
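With a pnum on every merged block, the blocks can also be regrouped into pages directly instead of splitting on the delimiter. A rough sketch, assuming objects shaped like FullyMergedBlock above; the `Block` dataclass and `group_by_page` helper are hypothetical stand-ins so the example runs without marker installed.

```python
from dataclasses import dataclass


@dataclass
class Block:
    """Stand-in with the same fields as FullyMergedBlock above."""
    text: str
    block_type: str
    pnum: int


def group_by_page(blocks: list[Block]) -> dict[int, str]:
    """Join block texts per page number, keeping first-seen page order."""
    pages: dict[int, list[str]] = {}
    for block in blocks:
        pages.setdefault(block.pnum, []).append(block.text)
    return {pnum: "\n\n".join(texts) for pnum, texts in pages.items()}


# Made-up blocks spanning two pages.
blocks = [
    Block("Title", "Title", 0),
    Block("Intro paragraph.", "Text", 0),
    Block("Next page text.", "Text", 1),
]
pages = group_by_page(blocks)
```

Note that a merged block carries a single pnum, so text merged across a page break ends up on one page in this grouping.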

davidpomerenke added a commit to danu-insight/marker that referenced this issue Feb 20, 2024

Add special character on page break

3154adc

see VikParuchuri#86

nunamia mentioned this issue Feb 21, 2024

How to add option to marker page range of pdf #84

Open

VikParuchuri mentioned this issue Jun 17, 2024

Bugfixes and new features #197

Merged

VikParuchuri closed this as completed in #197 Jun 17, 2024

wciq1208 mentioned this issue Jul 9, 2024

Crashed in a multi-threaded environment #225

Closed
