The uparxive project aims to provide an LLM-friendly dataset for the whole arXiv .tex source. A similar dataset is unarXive, but uparxive uses a different tool chain.
The Uparxive dataset is stored in `.json` format, which can be seamlessly converted into Markdown (`.md`) format.
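As an illustration of the conversion, a record such as the following can be flattened into Markdown. Note that the field names below (`title`, `sections`, `heading`, `text`) are hypothetical stand-ins, not the actual uparxive schema; inspect 1512.03385.json for the authoritative layout.

```python
# Hypothetical record layout -- the real uparxive schema may differ;
# see 1512.03385.json for the authoritative field names.
record = {
    "title": "Deep Residual Learning for Image Recognition",
    "sections": [
        {"heading": "Introduction", "text": "Deep networks are hard to train."},
    ],
}

def record_to_markdown(rec):
    """Flatten a (hypothetical) JSON record into a Markdown string."""
    parts = ["# " + rec["title"], ""]
    for sec in rec["sections"]:
        parts.append("## " + sec["heading"])
        parts.append(sec["text"])
        parts.append("")
    return "\n".join(parts)

print(record_to_markdown(record))
```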
The Uparxive dataset adheres to the following rules:
- Tables and Figures: Elements enclosed within `\begin{table} ... \end{table}` and `\begin{figure} ... \end{figure}` tags are extracted and appended at the end of the document for clarity and organization.
- Citations and References: Citations (`\cite{}`) and references (`\ref{}`) are converted to more explicit forms to improve readability, depending on the usage context. Examples include:
  - Direct mentions: `(See [Ref. [1,2] of ArXiv.1512.03385])`
  - Contextual references: `in [Ref. [1,2] of ArXiv.1512.03385]`
  - Equation/Section/Figure/Table references: `in [Equation [1] of ArXiv.1512.03385]`
- Mathematical Notations:
  - In-line Math: Single dollar signs `$` are used for in-line mathematical expressions, e.g., `$\alpha$`.
  - Block Math: Double dollar signs `$$` denote block mathematical expressions, e.g., `$$\mathbf{y}=\mathcal{F}(\mathbf{x},\{W_{i}\})+\mathbf{x}.$$`
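The citation-rewriting rule above can be sketched with a regular expression. This is an illustrative re-implementation, not the project's actual conversion code, and it assumes a pre-built mapping from BibTeX keys to the numbers in the paper's reference list:

```python
import re

# Hypothetical mapping from BibTeX keys to reference-list numbers --
# in a real pipeline this would come from parsing the paper's bibliography.
ref_numbers = {"he2015deep": 1, "simonyan2014very": 2}
arxiv_id = "ArXiv.1512.03385"

def explicit_cite(match):
    """Rewrite one \cite{...} command into the explicit bracketed form."""
    keys = [k.strip() for k in match.group(1).split(",")]
    nums = ",".join(str(ref_numbers[k]) for k in keys)
    return f"[Ref. [{nums}] of {arxiv_id}]"

tex = r"Residual learning \cite{he2015deep,simonyan2014very} helps."
converted = re.sub(r"\\cite\{([^}]*)\}", explicit_cite, tex)
print(converted)
# → Residual learning [Ref. [1,2] of ArXiv.1512.03385] helps.
```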
See 1512.03385.json and 1512.03385.md as examples.
- uparxive_metadata: huggingface
- uparxive: huggingface
- uparxive-reference: huggingface
- upar5iv (uparxive version) [Coming Soon][License check]
- unarxive (uparxive version) [Coming Soon][License check]
Note: The "full" version of uparxive was generated from all of arXiv.org, including non-permissively licensed papers. Make sure that your use of the data complies with each paper's licensing terms. (For information on papers' licenses, use arXiv's bulk metadata access.)
Note: papers under the CC BY-NC-ND license are not included in the dataset.
As of April 2024, there are around 2,450,893 papers in the arXiv source, of which the uparxive dataset covers about 1,700,000. The missing papers are mainly due to the lack of `.tex` source or to failures in the conversion process.
To effectively collect and process data from the arXiv source, follow the outlined tool chain and resources provided below:
- arXiv Bulk Data Access: Access and download bulk data directly from arXiv using the AWS S3 requester-pays dataset. Detailed instructions and access points can be found here: arXiv Bulk Data Access, or check this and this.
- arXiv API: For more specific data needs or metadata, use the arXiv API. Documentation and usage guidelines are available here: arXiv API.
- Important Note: When crawling arXiv source files, use `export.arxiv.org` instead of the official `arxiv.org` domain to avoid overloading the main site.
- To simplify API interactions, consider using the `arxiv.py` Python library. This wrapper facilitates easier querying and data retrieval from the arXiv API.
- Update as of 2025.04.30: Deyan Ginev (@Deyan) has published an HTML-packed arXiv dataset, known as ar5iv. This dataset provides HTML files that are more friendly for data extraction aimed at language model training or other NLP tasks.
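If you prefer to avoid the `arxiv.py` dependency, the points above can be combined with the standard library alone: build a query URL against the `export.arxiv.org` API endpoint and parse the returned Atom feed. The XML below is a trimmed, hand-written stand-in for a real API response, used only so the sketch is self-contained:

```python
import urllib.parse
import xml.etree.ElementTree as ET

# Build an API query URL against export.arxiv.org (never arxiv.org).
params = {"search_query": "all:residual learning", "max_results": "3"}
url = "http://export.arxiv.org/api/query?" + urllib.parse.urlencode(params)
print(url)

# The API answers with an Atom feed; this is a trimmed, hand-written sample.
ATOM = "http://www.w3.org/2005/Atom"
sample = f"""<feed xmlns="{ATOM}">
  <entry>
    <id>http://arxiv.org/abs/1512.03385v1</id>
    <title>Deep Residual Learning for Image Recognition</title>
  </entry>
</feed>"""

# Extract (id, title) pairs, qualifying tag names with the Atom namespace.
feed = ET.fromstring(sample)
entries = [
    (e.findtext(f"{{{ATOM}}}id"), e.findtext(f"{{{ATOM}}}title"))
    for e in feed.findall(f"{{{ATOM}}}entry")
]
print(entries)
```

Fetching `url` with any HTTP client and feeding the response body to `ET.fromstring` follows the same parsing path as the sample above.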
Obtain the `.json` format data:
- Compile the `.tex` file to a `.xml` file via LaTeXML:
  `python python_script/tex_to_xml.py --root [SourPath]`
- Convert the `.xml` file to a `.json` or `.md` file:
  `python python_script/xml_to_json.py --root [SourPath]`
- Update 2025.04.30: `html_to_dense_text.py` aims to convert the ar5iv dataset from `html` format to LLM-friendly data:
  `python python_script/html_to_json.py --root [SourPath]`
- Update 2025.05.22: Nougat provides a standard parser for the latexmlc HTML format: `from nougat.dataset.parser.html2md import html2md`. This solution is much more convenient than my custom implementation.
- Update 2025.06.17: we have now reached the V3 version of uparxive with the help of Nougat.
- Update 2025.06.24: we now release the aligned markdown-pdf subset; please see here for more details.
Turn to the Citation Retrieve section for more details.
- Resource: in order to retrieve the digital URL for each citation string, you need to collect the citation metadata from:
  - openalex snapshot: official page or the AWS S3 Open Data
  - crossref snapshot: Metadata Retrieval or the AWS S3 requester-pays dataset
  - arXiv metadata: arxiv_dataset
- Tool Chain:
- Citation Structure Tool:
- Citation Retreive Engine:
- Citation_Retreive_Script