File Operations

Python scripts to perform the following file operations:

Jump to a line number in a file and read a line.
Read a large file as a stream of lines and keep only the lines that match some criteria.
Read a large file, keep only the lines that match some criteria, and write those lines to another file.
Read a JSON input and load it into an object.
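
The streaming and filtering operations listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the names `filter_lines` and `filter_to_file` and the predicate-based interface are assumptions made for this example.

```python
def filter_lines(path, predicate):
    """Lazily yield the lines of `path` for which predicate(line) is true.

    Hypothetical helper: the file is read one line at a time, so memory
    use stays small even when the file is very large.
    """
    with open(path, encoding="utf-8") as f:
        for line in f:
            if predicate(line):
                yield line


def filter_to_file(src, dst, predicate):
    """Write the matching lines of `src` to `dst`; return how many were written."""
    count = 0
    with open(dst, "w", encoding="utf-8") as out:
        for line in filter_lines(src, predicate):
            out.write(line)
            count += 1
    return count
```

A call such as `filter_to_file("in.xml", "out.xml", lambda line: "<title>" in line)` would copy only the lines containing `<title>`; the file names and the criterion here are placeholders.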

Timing Results

The following timings were obtained by reading a Wikimedia abstracts dump file (a 5.8 GB .xml file with about 75.6M lines; the file can be downloaded from here).

Adding line numbers to the file:
addLineNumber : 58.024850428 s
addLineNumber_inplace : 103.272668963 s
Reading a line at a given line number:
getline from the linecache module is not practical for large files, because it reads the whole file into memory.
getLine uses enumerate() to read the file line-by-line until it reaches the target line number.
getLine_binarysearch finds the given line number by binary search. The input file must already contain line numbers; the time needed to add them is reported above.
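
The approaches described above can be sketched as below. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, and the numbered file is assumed to store each line as `<number><TAB><text>`, which may differ from the format that addLineNumber actually writes.

```python
import os


def add_line_numbers(src, dst, sep="\t"):
    """Copy `src` to `dst`, prefixing each line with its 1-based number.

    The "<number><sep><text>" layout is an assumption of this sketch.
    """
    with open(src, encoding="utf-8") as fin, \
            open(dst, "w", encoding="utf-8") as fout:
        for number, line in enumerate(fin, start=1):
            fout.write(f"{number}{sep}{line}")


def get_line_sequential(path, target):
    """Read line by line with enumerate() until the target line is reached."""
    with open(path, encoding="utf-8") as f:
        for number, line in enumerate(f, start=1):
            if number == target:
                return line.rstrip("\n")
    return None  # the file has fewer than `target` lines


def get_line_bisect(path, target, sep=b"\t"):
    """Binary search over byte offsets in a file written by add_line_numbers."""
    with open(path, "rb") as f:
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            if mid > 0:
                # Step back one byte and finish that line, so the next
                # read starts at the first line boundary at or after `mid`.
                f.seek(mid - 1)
                f.readline()
            else:
                f.seek(0)
            line = f.readline()
            if not line:
                hi = mid  # no line starts at or after `mid`
                continue
            prefix, text = line.split(sep, 1)
            number = int(prefix)
            if number == target:
                return text.rstrip(b"\r\n").decode("utf-8")
            if number < target:
                lo = f.tell()  # target starts after the line just read
            else:
                hi = mid  # target starts before `mid`
    return None
```

The sequential version reads every line up to the target, while the binary search needs a number of seeks that grows only logarithmically with the file size, at the cost of the one-time numbering pass.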

Use ./tools/timingplot.py to generate an interactive Plotly plot. The timing data is in ./data/

Test Data
shakespeare.txt : "As You Like It" by William Shakespeare.
exoplanets.json : list of potentially habitable exoplanets, source: Wikipedia (accessed: Mar. 2021), table converted into a .json file.
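
As a rough illustration of the JSON operation, a file such as exoplanets.json could be loaded with the standard `json` module. The helper name `load_json` is an assumption for this sketch, and nothing is assumed about the keys or layout of the file itself.

```python
import json


def load_json(path):
    """Parse the JSON file at `path` into Python objects (dicts, lists, ...).

    Hypothetical helper; json.load reads the whole document at once, so
    this suits small test files rather than multi-gigabyte inputs.
    """
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```

If the table was converted to a JSON array, `load_json("exoplanets.json")` would return a Python list with one entry per row.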

Resources
Documentation can be viewed at: https://seyedb.github.io/file-ops/
