Segmenting Markdown-converted PDFs into pages #86
Comments
For anyone else interested in preserving page boundaries, I managed to add a page delimiter by:
This uses This is a bit of a hacky solution so I'd still like to see page segmentation implemented officially in |
@nunamia How about merging this solution? However, I'm observing issues with the page numbers. I have a document from the EU Parliament where every page has content, but the page numbers appear too often and jump |
@Terranic Try out my solution, I haven't found that issue with it. |
Thanks for the script @umarbutler . This is on my list of features to include, as a few people have asked for it |
Here's a script to monkeypatch Marker with @umarbutler's solution:

import ast
import inspect

import marker.postprocessors.markdown


class MarkdownTransformer(ast.NodeTransformer):
    def __init__(self):
        self.current_function = None

    def visit_FunctionDef(self, node):
        # Store the current function name
        self.current_function = node.name
        # Visit all the child nodes within the function
        self.generic_visit(node)
        # Reset current function name to None after leaving the function
        self.current_function = None
        return node

    def visit_Assign(self, node):
        if self.current_function == 'line_separator':
            if isinstance(node.targets[0], ast.Name) and node.targets[0].id == 'lowercase_letters':
                if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
                    original_value = node.value.value  # might want node.value.s
                    new_value = original_value + '|'
                    node.value = ast.Constant(value=new_value)
        return node

    def visit_For(self, node):
        if self.current_function == 'merge_lines':
            # Check if the loop iterates over a variable named 'page'
            if isinstance(node.target, ast.Name) and node.target.id == 'page':
                # Change the loop to use enumerate
                node.iter = ast.Call(
                    func=ast.Name(id='enumerate', ctx=ast.Load()),
                    args=[node.iter],
                    keywords=[]
                )
                node.target = ast.Tuple(elts=[
                    ast.Name(id='page_i', ctx=ast.Store()),
                    ast.Name(id='page', ctx=ast.Store())
                ], ctx=ast.Store())
                # Create the additional check and append operation
                page_check = ast.parse("""
if page_i != len(blocks) - 1:
    block_text += ''
""").body[0]
                node.body.append(page_check)
        return node


# Get the source code and make the AST
markdown_source = inspect.getsource(marker.postprocessors.markdown)
markdown_ast = ast.parse(markdown_source)
# Create the AST transformer instance
markdown_transformer = MarkdownTransformer()
# Perform the transformation (explores the tree and applies defined transformation functions, returning the new tree)
markdown_ast = markdown_transformer.visit(markdown_ast)
# Fix missing locations in the modified AST
ast.fix_missing_locations(markdown_ast)
# Replace the functions in the actual module - e.g. internal module calls to
# marker.postprocessors.markdown.line_separator will use the updated version.
exec(compile(markdown_ast, filename='<ast>', mode='exec'), marker.postprocessors.markdown.__dict__)
|
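For anyone who wants to see the AST-rewriting trick in isolation, here is a minimal sketch that runs without Marker installed. The `join_pages` function, the `text` variable and the `<page>` delimiter are hypothetical stand-ins chosen for illustration, not Marker's real code. It only demonstrates the same two steps as the script above: wrap the loop in `enumerate`, then append a "not the last page" check to the loop body.

```python
import ast

# Hypothetical stand-in for a library function; this is NOT Marker's code.
SOURCE = '''
def join_pages(pages):
    text = ""
    for page in pages:
        text += page
    return text
'''


class AddDelimiter(ast.NodeTransformer):
    def visit_For(self, node):
        self.generic_visit(node)
        if isinstance(node.target, ast.Name) and node.target.id == "page":
            # for page in pages  ->  for page_i, page in enumerate(pages)
            node.iter = ast.Call(
                func=ast.Name(id="enumerate", ctx=ast.Load()),
                args=[node.iter],
                keywords=[],
            )
            node.target = ast.Tuple(
                elts=[
                    ast.Name(id="page_i", ctx=ast.Store()),
                    ast.Name(id="page", ctx=ast.Store()),
                ],
                ctx=ast.Store(),
            )
            # Append a delimiter after every page except the last one.
            # "<page>" is an assumed delimiter, used only for this demo.
            check = ast.parse(
                "if page_i != len(pages) - 1:\n    text += '<page>'"
            ).body[0]
            node.body.append(check)
        return node


tree = AddDelimiter().visit(ast.parse(SOURCE))
ast.fix_missing_locations(tree)

# Compile the rewritten tree into a fresh namespace and call the result.
namespace = {}
exec(compile(tree, filename="<ast>", mode="exec"), namespace)
result = namespace["join_pages"](["a", "b", "c"])
print(result)
```

Executing into a fresh dictionary keeps the demo from touching any real module. The script above instead executes into the module's own `__dict__`, so that the library's internal calls pick up the patched functions.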
To save others some debugging: @umarbutler's method requires changing two files. Note: tested on marker-pdf==0.2.5.

merged.py:

from collections import Counter
from typing import List, Optional

from pydantic import BaseModel

from marker.schema.bbox import BboxElement


class MergedLine(BboxElement):
    text: str
    fonts: List[str]

    def most_common_font(self):
        counter = Counter(self.fonts)
        return counter.most_common(1)[0][0]


class MergedBlock(BboxElement):
    lines: List[MergedLine]
    pnum: int
    block_type: Optional[str]


class FullyMergedBlock(BaseModel):
    text: str
    block_type: str
    pnum: int

markdown.py (replace the merge_lines function):

def merge_lines(blocks: List[List[MergedBlock]]):
    text_blocks = []
    prev_type = None
    prev_line = None
    block_text = ""
    block_type = ""
    block_pnum = 0
    # common_line_heights = [p.get_line_height_stats() for p in page_blocks]
    for page_i, page in enumerate(blocks):
        for block in page:
            block_pnum = block.pnum
            block_type = block.block_type
            if block_type != prev_type and prev_type:
                text_blocks.append(
                    FullyMergedBlock(
                        text=block_surround(block_text, prev_type),
                        block_type=prev_type,
                        pnum=block_pnum
                    )
                )
                block_text = ""
            prev_type = block_type
            # Join lines in the block together properly
            for i, line in enumerate(block.lines):
                line_height = line.bbox[3] - line.bbox[1]
                prev_line_height = prev_line.bbox[3] - prev_line.bbox[1] if prev_line else 0
                prev_line_x = prev_line.bbox[0] if prev_line else 0
                prev_line = line
                is_continuation = line_height == prev_line_height and line.bbox[0] == prev_line_x
                if block_text:
                    block_text = line_separator(block_text, line.text, block_type, is_continuation)
                else:
                    block_text = line.text
        # This is where the magic happens!
        if page_i != len(blocks) - 1:
            block_text += ''
        # This is where the magic ends!
    # Append the final block
    text_blocks.append(
        FullyMergedBlock(
            text=block_surround(block_text, prev_type),
            block_type=block_type,
            pnum=block_pnum
        )
    )
    return text_blocks
|
Hi @VikParuchuri,
Thank you very much for creating this invaluable package which I have found extremely useful in several projects already. I just wanted to ask if an option could be added to indicate where pages start and end in the outputted Markdown? Even having the ability to add a custom delimiter such as
<page>
would help.
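To illustrate why a delimiter alone would be enough, here is a rough sketch of how the output could be segmented once one is present. It assumes the delimiter is the literal string `<page>` suggested above; `split_pages` and the sample text are hypothetical and not part of Marker.

```python
from typing import List

PAGE_DELIMITER = "<page>"  # assumed delimiter, taken from the suggestion above


def split_pages(markdown: str, delimiter: str = PAGE_DELIMITER) -> List[str]:
    """Split converted Markdown into per-page strings, trimming whitespace."""
    return [page.strip() for page in markdown.split(delimiter)]


# Hypothetical converter output with one delimiter between two pages.
sample = "# Title\n\nFirst page.\n\n<page>\n\nSecond page."
pages = split_pages(sample)
for number, page in enumerate(pages, start=1):
    print(f"--- page {number} ---")
    print(page)
```

A plain `str.split` only works if the delimiter can never occur in the document text itself, so a less common marker may be safer in practice.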