Strange performance hit when converting seq[seq[string]]] to List[List[str]]] to pd.Dataframe

I have two equivalent functions

```python
def py_myparser(data: str) -> List[Sequence[str]]:
    cols = ["colA","colB","colC","colD","colE"]
    rows: List[Sequence[str]] = []
    reading = False
    for line in (ll for ll in data.splitlines() if ll and ll.strip()):
        if not reading:
            if [ll.strip() for ll in line.split("|")][0:-1] == cols:
                reading = True
                continue
        else:
            row: Sequence[str] = [w.strip() for w in line.split("|")][0:-1]
            if len(row) == len(cols):
                rows.append(row)
            else:
                reading = False
    return rows
```

```nim
import std/[strutils]
import zero_functional
import nimpy

proc nim_myparser(data: string): seq[seq[string]] {.exportpy.} =
  const cols = ["colA","colB","colC","colD","colE"]
  var reading = false
  data.splitLines --> filter(it.strip.len > 0) --> createIter(lines)
  for line in lines():
    if not reading:
      if line.split('|') --> map(it.strip())[0..^2] == cols:
        reading = true
        continue
    else:
      let row = line.split('|') --> map(it.strip())[0..^2]
      if len(row) == len(cols):
        result.add(row)
      else:
        reading = false
```

I compile the python module with
```nim
--gc: "arc"
--d: "danger"
--app: "lib"
--passC: "-flto"
--passL: "-flto"
```

the final result is correct for both, but running `%timeit -n 10` on quite big data gives surprising timings.
Seems like `List[List[str]]]` conversion to `pd.DataFrame` is zero-cost for python version but is very expensive for nim version
```python
%timeit -n 10 py_myparser(data)
# 265 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
%timeit -n 10 pd.DataFrame(py_myparser(data))
# 268 ms ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
%timeit -n 10 nim_myparser(data)
# 160 ms ± 7.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 pd.DataFrame(nim_myparser(data))
# 306 ms ± 8.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```
same timing happens if I do the `pd.DataFrame` conversion via `pyIpmport("pandas")` in the nim/nimpy side at the end of the function (and changing the return type)

Any idea what's happening here? I'm trying to replace python function with nim functions to speeup things, but I need pandas on the python side, and this behavior is confusing to me.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Strange performance hit when converting seq[seq[string]]] to List[List[str]]] to pd.Dataframe #243

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Strange performance hit when converting seq[seq[string]]] to List[List[str]]] to pd.Dataframe #243

Description

Metadata

Metadata

Assignees

Labels

Projects

Milestone

Relationships

Development

Issue actions