Strange performance hit when converting seq[seq[string]] to List[List[str]] to pd.DataFrame #243

Open
@arkanoid87

I have two equivalent functions, one in Python and one in Nim (exported with nimpy):

from typing import List, Sequence

def py_myparser(data: str) -> List[Sequence[str]]:
    cols = ["colA","colB","colC","colD","colE"]
    rows: List[Sequence[str]] = []
    reading = False
    for line in (ll for ll in data.splitlines() if ll and ll.strip()):
        if not reading:
            if [ll.strip() for ll in line.split("|")][0:-1] == cols:
                reading = True
                continue
        else:
            row: Sequence[str] = [w.strip() for w in line.split("|")][0:-1]
            if len(row) == len(cols):
                rows.append(row)
            else:
                reading = False
    return rows

import std/[strutils]
import zero_functional
import nimpy

proc nim_myparser(data: string): seq[seq[string]] {.exportpy.} =
  const cols = ["colA","colB","colC","colD","colE"]
  var reading = false
  data.splitLines --> filter(it.strip.len > 0) --> createIter(lines)
  for line in lines():
    if not reading:
      if line.split('|') --> map(it.strip())[0..^2] == cols:
        reading = true
        continue
    else:
      let row = line.split('|') --> map(it.strip())[0..^2]
      if len(row) == len(cols):
        result.add(row)
      else:
        reading = false

I compile the Python module with these options:

--gc: "arc"
--d: "danger"
--app: "lib"
--passC: "-flto"
--passL: "-flto"

The final result is correct for both, but running %timeit -n 10 on fairly large data gives surprising timings.
It seems that converting the List[List[str]] result to pd.DataFrame is nearly free for the Python version (roughly 3 ms) but very expensive for the Nim version (roughly 146 ms):

%timeit -n 10 py_myparser(data)
# 265 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
%timeit -n 10 pd.DataFrame(py_myparser(data))
# 268 ms ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
%timeit -n 10 nim_myparser(data)
# 160 ms ± 7.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 pd.DataFrame(nim_myparser(data))
# 306 ms ± 8.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
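
To separate the two costs more directly than by subtracting %timeit results, the parse and the conversion can be timed inside the same loop. This is only a minimal sketch: `time_stages` is a hypothetical helper name, not part of nimpy or pandas, and it uses only the standard library so any parser/converter pair can be passed in.

```python
import time
from typing import Any, Callable, Dict


def time_stages(
    parser: Callable[[str], Any],
    convert: Callable[[Any], Any],
    data: str,
    repeat: int = 10,
) -> Dict[str, float]:
    """Return the best-of-`repeat` wall-clock seconds for each stage.

    `parser` and `convert` are whatever callables you want to compare
    (hypothetical placeholders here, e.g. a parser and pd.DataFrame).
    """
    parse_best = float("inf")
    convert_best = float("inf")
    for _ in range(repeat):
        t0 = time.perf_counter()
        rows = parser(data)   # stage 1: parsing only
        t1 = time.perf_counter()
        convert(rows)         # stage 2: conversion of the parsed rows only
        t2 = time.perf_counter()
        parse_best = min(parse_best, t1 - t0)
        convert_best = min(convert_best, t2 - t1)
    return {"parse_s": parse_best, "convert_s": convert_best}
```

Calling `time_stages(py_myparser, pd.DataFrame, data)` and `time_stages(nim_myparser, pd.DataFrame, data)` should show whether the extra time really lands in the `pd.DataFrame` call or somewhere else.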

The same timing occurs if I do the pd.DataFrame conversion via pyImport("pandas") on the Nim/nimpy side at the end of the function (and change the return type accordingly).

Any idea what's happening here? I'm trying to replace Python functions with Nim functions to speed things up, but I need pandas on the Python side, and this behavior is confusing to me.
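
One check that might help narrow this down is to compare what the two functions actually hand back to Python, since a difference in container or element types could plausibly change how pandas ingests the data. That is an untested assumption, not a diagnosis, and `describe_shape` below is a hypothetical helper written only for this comparison.

```python
from typing import Any, Tuple


def describe_shape(rows: Any) -> Tuple[str, str, str, int, int]:
    """Summarise a nested result as
    (outer type, row type, cell type, number of rows, length of first row).
    """
    outer = type(rows).__name__
    if len(rows) == 0:
        # Nothing to inspect below the outer container.
        return (outer, "", "", 0, 0)
    first = rows[0]
    row_type = type(first).__name__
    # Guard against an empty first row before looking at a cell.
    cell_type = type(first[0]).__name__ if len(first) else ""
    return (outer, row_type, cell_type, len(rows), len(first))
```

If `describe_shape(py_myparser(data))` and `describe_shape(nim_myparser(data))` return the same tuple, the difference is probably not in the visible Python types and has to be looked for elsewhere.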
