harvest-and-collect.md

File metadata and controls

252 lines (94 loc) · 4.29 KB

Module harvest_and_collect {#id}

Sub-modules

Module harvest_and_collect.connect_to_arxiv {#id}

This module provides classes to connect to and harvest records from the ArXiv database.

Classes:

ArXivRecord : Represents a single record from the ArXiv database.

ArXivHarvester : Handles the connection to the ArXiv database and fetches records.

The ArXivRecord class parses XML data from a single ArXiv record into a Python object. It extracts the header and metadata from the record and checks if the record is valid.
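The parsing step described above can be sketched as follows. This is a minimal illustration, not the module's actual code: it assumes the records follow the OAI-PMH 2.0 layout (a `header` and a `metadata` element in the OAI namespace), and the helper `parse_record`, the dictionary keys, and the sample XML with its placeholder identifier are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Assumption: records use the OAI-PMH 2.0 namespace; the documentation
# above does not name the protocol.
OAI_NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Made-up sample record with a placeholder identifier, for illustration only.
SAMPLE = """
<record xmlns="http://www.openarchives.org/OAI/2.0/">
  <header>
    <identifier>oai:arXiv.org:0000.00000</identifier>
    <datestamp>2020-01-01</datestamp>
  </header>
  <metadata>
    <title>An illustrative title</title>
  </metadata>
</record>
"""


def parse_record(record_xml: ET.Element) -> dict:
    """Extract header and metadata and flag whether the record looks valid."""
    header = record_xml.find("oai:header", OAI_NS)
    metadata = record_xml.find("oai:metadata", OAI_NS)

    identifier = None
    if header is not None:
        identifier = header.findtext(
            "oai:identifier", default=None, namespaces=OAI_NS
        )

    # Hypothetical validity rule: header, metadata and identifier must exist.
    # `is not None` is required because an Element with no children is falsy.
    is_valid = (
        header is not None and metadata is not None and identifier is not None
    )
    return {
        "identifier": identifier,
        "header": header,
        "metadata": metadata,
        "is_valid": is_valid,
    }


record = parse_record(ET.fromstring(SAMPLE.strip()))
print(record["identifier"], record["is_valid"])
```

The validity rule used by the real ArXivRecord class is not documented here, so the check above should be read as one plausible choice.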

The ArXivHarvester class connects to the ArXiv database and fetches records. It handles HTTP exceptions and retries failed requests. It also handles pagination by using the resumption token provided by the ArXiv API.
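A rough sketch of the retry and pagination behaviour described above, under stated assumptions: the request parameters follow the OAI-PMH convention (a first `ListRecords` request carries a `metadataPrefix`, a follow-up request carries only the `resumptionToken`), and the helpers `build_params` and `fetch_with_retries`, the retry count, and the delay are hypothetical names and values, not the harvester's real interface.

```python
import time
from typing import Callable, Optional


def build_params(resumption_token: Optional[str] = None) -> dict:
    """Build query parameters for one page of records (OAI-PMH style, assumed)."""
    if resumption_token:
        # A follow-up request identifies the next page by its token alone.
        return {"verb": "ListRecords", "resumptionToken": resumption_token}
    # Assumption: the first request asks for the "arXiv" metadata format.
    return {"verb": "ListRecords", "metadataPrefix": "arXiv"}


def fetch_with_retries(
    fetch: Callable[[dict], str],
    params: dict,
    max_retries: int = 3,
    delay: float = 0.0,
) -> str:
    """Call `fetch` until it succeeds or `max_retries` attempts have failed."""
    last_error: Optional[Exception] = None
    for _ in range(max_retries):
        try:
            return fetch(params)
        except OSError as error:  # urllib.error.URLError is an OSError subclass
            last_error = error
            time.sleep(delay)
    raise RuntimeError(f"giving up after {max_retries} attempts") from last_error
```

Passing the transport in as `fetch` keeps the sketch independent of any particular HTTP client; the real class may instead raise its CustomHTTPException with the status code and resumption token so the caller can resume later.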

Classes

Class ArXivHarvester {#id}

class ArXivHarvester(
    **kwargs
)

A class to handle the connection to the ArXiv database and fetch records.

Raises: ArXivHarvester.CustomHTTPException : Custom HTTP exception that forwards the status code and the resumption token, if any.

Yields: next_record() : Yields the next record from the fetched records.

Class variables

Variable CustomHTTPException {#id}

Custom HTTP exception that forwards the status code and the resumption token, if any.

Methods

Method next_record {#id}
def next_record(
    self
) -> Generator[harvest_and_collect.connect_to_arxiv.ArXivRecord, Any, None]

A generator method that yields the next record from the fetched records.

This method continuously yields records from the fetched records list. If the list is empty, it fetches a new batch of records from the ArXiv database. If there are still no records after fetching, it stops the generator.

Yields: ArXivRecord : The next record from the fetched records.

Raises: CustomHTTPException : If an HTTP error occurs while fetching new records.
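The buffer-and-refill behaviour described for next_record can be sketched as a standalone generator. This is an illustration only: `fetch_batch` is a hypothetical callable standing in for the harvester's request to the ArXiv database, and plain strings stand in for ArXivRecord objects.

```python
from collections import deque
from typing import Callable, Deque, Generator, List


def next_record(
    fetch_batch: Callable[[], List[str]],
) -> Generator[str, None, None]:
    """Yield records one by one, refilling the buffer when it runs dry."""
    buffer: Deque[str] = deque()
    while True:
        if not buffer:
            # Buffer exhausted: ask for a new batch.
            buffer.extend(fetch_batch())
        if not buffer:
            # Still nothing after fetching: stop the generator.
            return
        yield buffer.popleft()


# Usage with a fake source that serves two batches and then runs out.
batches = [["a", "b"], ["c"]]
for item in next_record(lambda: batches.pop(0) if batches else []):
    print(item)
```

Any exception raised inside `fetch_batch` propagates to the consumer of the generator, which matches the documented behaviour of raising CustomHTTPException on HTTP errors.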

Class ArXivRecord {#id}

class ArXivRecord(
    record_xml: xml.etree.ElementTree.Element
)

A class to represent a single record from the ArXiv database.

Module harvest_and_collect.db_connexion {#id}

Using the data from a harvester, add records to the database.

Classes

Class GraphDBConnexion {#id}

class GraphDBConnexion(
    uri: str
)

Handles the database connection and provides functions to easily add records to it.

Methods

Method add_record {#id}
def add_record(
    self,
    record: harvest_and_collect.connect_to_arxiv.ArXivRecord
) -> None

Adds a record to the database.

This method checks if the record is valid. If it is, it opens a new session with the database and executes a write transaction using the _record_tx method.

Args: record : ArXivRecord : The record to be added to the database.

Raises: neo4j.exceptions.ServiceUnavailable : If the database is not available.
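The validate-then-write flow of add_record might look like the sketch below. It assumes a driver object that behaves like the Neo4j Python driver 5.x, where `driver.session()` is a context manager and `Session.execute_write` runs a transaction function. The Cypher query, the `Record` label, and the `is_valid` and `identifier` attributes on the record are hypothetical; the real `_record_tx` and graph schema are not documented here.

```python
from typing import Any


def _record_tx(tx: Any, identifier: str) -> None:
    """Transaction function: create the record node if it does not exist yet."""
    # Hypothetical schema: a single `Record` node keyed by its identifier.
    tx.run("MERGE (r:Record {identifier: $identifier})", identifier=identifier)


def add_record(driver: Any, record: Any) -> None:
    """Write one record inside a managed write transaction."""
    # Assumption: the record exposes `is_valid` and `identifier` attributes.
    if not record.is_valid:
        return  # invalid records are skipped
    with driver.session() as session:
        # With the Neo4j 5.x driver, a managed transaction may be retried
        # on transient errors, so the transaction function should be idempotent.
        session.execute_write(_record_tx, record.identifier)
```

Using `MERGE` rather than `CREATE` keeps the write idempotent, which matters if a harvest is resumed and the same record is seen twice.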

Method clean_database {#id}
def clean_database(
    self
) ‑> None

Deletes all nodes and relationships from the database.

Module harvest_and_collect.main {#id}

This module provides the main entry point for the ArXiv harvesting application.

The main function in this module sets up a connection to the ArXiv database and the Neo4j database, then fetches records from the ArXiv database and adds them to the Neo4j database. It also handles command line arguments for running the application in mock mode, specifying the Neo4j URI, and specifying the resumption token for the ArXiv database.

Functions

Function main {#id}

def main(
    mock=False,
    neo4j_uri='neo4j://localhost:7687',
    resumption_token=None
)
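One way the command line handling for these three parameters could be wired up is sketched below. The flag names `--mock`, `--neo4j-uri`, and `--resumption-token` and the helper `parse_args` are assumptions made for illustration; only the defaults are taken from the signature above.

```python
import argparse
from typing import List, Optional


def parse_args(argv: Optional[List[str]] = None) -> argparse.Namespace:
    """Parse the options that would be forwarded to main() (flag names assumed)."""
    parser = argparse.ArgumentParser(
        description="Harvest ArXiv records into a Neo4j database."
    )
    parser.add_argument(
        "--mock", action="store_true", help="run in mock mode"
    )
    parser.add_argument(
        "--neo4j-uri",
        default="neo4j://localhost:7687",
        help="URI of the Neo4j database",
    )
    parser.add_argument(
        "--resumption-token",
        default=None,
        help="token from which to resume a previous harvest",
    )
    return parser.parse_args(argv)


args = parse_args(["--mock", "--resumption-token", "abc"])
print(args.mock, args.neo4j_uri, args.resumption_token)
```

The parsed values map directly onto the keyword arguments of main, for example `main(mock=args.mock, neo4j_uri=args.neo4j_uri, resumption_token=args.resumption_token)`.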

Generated by pdoc 0.10.0 (https://pdoc3.github.io).