File Operations

License: MIT | docs: latest | coverage: 92%

Python scripts to perform the following file operations:

  • Jump to a given line number in a file and read that line.
  • Read a large file as a stream of lines and keep only the lines that match given criteria.
  • Read a large file, keep only the lines that match given criteria, and write those lines to another file.
  • Read a JSON input and load it into an object.
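The streaming, filtering, and JSON-loading operations listed above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in this repository: the function names `filter_lines`, `filter_to_file`, and `load_json` are hypothetical, and the files are assumed to be UTF-8 text.

```python
import json
from typing import Any, Callable, Iterator


def filter_lines(path: str, predicate: Callable[[str], bool]) -> Iterator[str]:
    """Lazily yield the lines of `path` for which `predicate` returns True."""
    with open(path, encoding="utf-8") as fh:
        for line in fh:  # the file object yields one line at a time
            if predicate(line):
                yield line


def filter_to_file(src: str, dst: str, predicate: Callable[[str], bool]) -> int:
    """Write the matching lines of `src` to `dst`; return how many were written."""
    count = 0
    with open(dst, "w", encoding="utf-8") as out:
        for line in filter_lines(src, predicate):
            out.write(line)
            count += 1
    return count


def load_json(path: str) -> Any:
    """Parse a JSON file into the corresponding Python object (dict, list, ...)."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)
```

Because `filter_lines` is a generator, only one line is held in memory at a time, which is what makes this pattern usable on multi-gigabyte inputs.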

Timing Results

The timings below were obtained on a Wikimedia abstracts dump file (a 5.8 GB .xml file with almost 75.6M lines; the file can be downloaded from here).

  • Adding line numbers to the file:
    addLineNumber : 58.024850428 s
    addLineNumber_inplace : 103.272668963 s

  • Reading a line at a given line number:
    getline from the linecache module is not practical for large files.
    getLine uses enumerate() to read the file line by line until it reaches the target line number.
    getLine_binarysearch finds the given line number using binary search. The input file must already contain line numbers; the time needed to add them is reported above.
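The approaches described above can be sketched as follows. This is an illustration under stated assumptions, not the repository's implementation: the function names are hypothetical, and the numbered file is assumed to prefix every line with its 1-based number followed by a tab. The binary search bisects on byte offsets, so it needs only a logarithmic number of seeks instead of a full scan.

```python
import os
from typing import Optional, Tuple


def add_line_numbers(src: str, dst: str) -> None:
    """Copy `src` to `dst`, prefixing each line with '<n><TAB>' (assumed format)."""
    with open(src, "rb") as fin, open(dst, "wb") as fout:
        for n, line in enumerate(fin, start=1):
            fout.write(b"%d\t" % n + line)


def get_line_sequential(path: str, target: int) -> Optional[str]:
    """Linear scan: read line by line until the 1-based `target` is reached."""
    with open(path, encoding="utf-8") as fh:
        for n, line in enumerate(fh, start=1):
            if n == target:
                return line
    return None


def _split(line: bytes) -> Tuple[int, bytes]:
    """Split a numbered line into (line number, original content)."""
    number, content = line.split(b"\t", 1)
    return int(number), content


def get_line_binary_search(path: str, target: int) -> Optional[bytes]:
    """Bisect on byte offsets of a file written by add_line_numbers."""
    with open(path, "rb") as fh:
        # Invariant: if the target line exists, it starts at an offset in [lo, hi),
        # and lo is always the start of a line.
        lo, hi = 0, os.path.getsize(path)
        while lo < hi:
            mid = (lo + hi) // 2
            fh.seek(mid)
            fh.readline()  # discard the rest of the line containing byte `mid`
            start = fh.tell()
            if start >= hi:
                break  # no line starts between mid and hi: finish with a scan
            number, content = _split(fh.readline())
            if number == target:
                return content
            if number < target:
                lo = fh.tell()  # target starts after the line just read
            else:
                hi = mid + 1  # target starts at or before byte `mid`
        # Scan the remaining window from its first line.
        fh.seek(lo)
        while fh.tell() < hi:
            line = fh.readline()
            if not line:
                break
            number, content = _split(line)
            if number == target:
                return content
            if number > target:
                break
    return None
```

Note that the binary search returns the line content as `bytes` with the number prefix removed, and `None` when the requested line number is not in the file.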

Use ./tools/timingplot.py to generate an interactive Plotly plot of the timings. The timing data is in ./data/

Test Data
shakespeare.txt : "As You Like It" by William Shakespeare.
exoplanets.json : a list of potentially habitable exoplanets, converted from a Wikipedia table (accessed: Mar. 2021) into a .json file.

Resources
Documentation can be viewed at: https://seyedb.github.io/file-ops/