
Strange performance hit when converting seq[seq[string]] to List[List[str]] to pd.DataFrame #243

Open
arkanoid87 opened this issue Dec 3, 2021 · 7 comments

Comments

@arkanoid87

I have two equivalent functions

from typing import List, Sequence

def py_myparser(data: str) -> List[Sequence[str]]:
    cols = ["colA","colB","colC","colD","colE"]
    rows: List[Sequence[str]] = []
    reading = False
    for line in (ll for ll in data.splitlines() if ll and ll.strip()):
        if not reading:
            if [ll.strip() for ll in line.split("|")][0:-1] == cols:
                reading = True
                continue
        else:
            row: Sequence[str] = [w.strip() for w in line.split("|")][0:-1]
            if len(row) == len(cols):
                rows.append(row)
            else:
                reading = False
    return rows
import std/[strutils]
import zero_functional
import nimpy

proc nim_myparser(data: string): seq[seq[string]] {.exportpy.} =
  const cols = ["colA","colB","colC","colD","colE"]
  var reading = false
  data.splitLines --> filter(it.strip.len > 0) --> createIter(lines)
  for line in lines():
    if not reading:
      if line.split('|') --> map(it.strip())[0..^2] == cols:
        reading = true
        continue
    else:
      let row = line.split('|') --> map(it.strip())[0..^2]
      if len(row) == len(cols):
        result.add(row)
      else:
        reading = false
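For reference, here is a minimal sketch of the kind of input these parsers accept. The pipe-delimited layout is inferred from the code above; the sample values are invented, and py_myparser is reproduced verbatim so the snippet is self-contained:

```python
from typing import List, Sequence

def py_myparser(data: str) -> List[Sequence[str]]:
    cols = ["colA", "colB", "colC", "colD", "colE"]
    rows: List[Sequence[str]] = []
    reading = False
    for line in (ll for ll in data.splitlines() if ll and ll.strip()):
        if not reading:
            if [ll.strip() for ll in line.split("|")][0:-1] == cols:
                reading = True
                continue
        else:
            row: Sequence[str] = [w.strip() for w in line.split("|")][0:-1]
            if len(row) == len(cols):
                rows.append(row)
            else:
                reading = False
    return rows

# Invented sample: a header row with the expected column names switches the
# parser into "reading" mode; a row with the wrong column count switches it off.
sample = """
header junk
colA | colB | colC | colD | colE |
1 | 2 | 3 | 4 | 5 |
6 | 7 | 8 | 9 | 10 |
trailer junk
"""

rows = py_myparser(sample)
# rows == [['1', '2', '3', '4', '5'], ['6', '7', '8', '9', '10']]
```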

I compile the python module with

--gc: "arc"
--d: "danger"
--app: "lib"
--passC: "-flto"
--passL: "-flto"

The final result is correct for both, but running %timeit -n 10 on fairly large data gives surprising timings.
It looks like the List[List[str]] to pd.DataFrame conversion is essentially free for the Python version but very expensive for the Nim version.

%timeit -n 10 py_myparser(data)
# 265 ms ± 4.94 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
%timeit -n 10 pd.DataFrame(py_myparser(data))
# 268 ms ± 15.7 ms per loop (mean ± std. dev. of 7 runs, 10 loops each) 
%timeit -n 10 nim_myparser(data)
# 160 ms ± 7.3 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit -n 10 pd.DataFrame(nim_myparser(data))
# 306 ms ± 8.46 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

The same timing occurs if I do the pd.DataFrame conversion via pyImport("pandas") on the nim/nimpy side at the end of the function (changing the return type accordingly).

Any idea what's happening here? I'm trying to replace Python functions with Nim functions to speed things up, but I need pandas on the Python side, and this behavior is confusing to me.

@yglukhov
Owner

yglukhov commented Dec 3, 2021

Marshalling seqs is pretty much a deep copy, so with big data it can get expensive. Invoking nim_myparser involves copying data from python to newly allocated nim string, and then copying nim result to the newly allocated python lists.

To avoid marshalling you can either use PyObjects instead of nim types, or use raw_buffers or some higher level abstractions over it.
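The deep-copy point can be illustrated from the Python side alone. This is only a rough analogy for what marshalling costs, not nimpy's actual code path:

```python
import copy
import timeit

# Data shaped like the parser output: 10k rows of 5 short strings.
rows = [["colA", "colB", "colC", "colD", "colE"] for _ in range(10_000)]

def pass_reference(x):
    # A Python-to-Python call hands over a reference; nothing is copied.
    return x

ref_time = timeit.timeit(lambda: pass_reference(rows), number=5)
# deepcopy models what marshalling must do: allocate every container and
# every element on the other side and copy the contents across.
copy_time = timeit.timeit(lambda: copy.deepcopy(rows), number=5)

print(f"reference: {ref_time:.6f}s  deep copy: {copy_time:.6f}s")
```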

@yglukhov yglukhov closed this as completed Dec 3, 2021
@arkanoid87
Author

> Marshalling seqs is pretty much a deep copy, so with big data it can get expensive. Invoking nim_myparser involves copying data from python to newly allocated nim string, and then copying nim result to the newly allocated python lists.
>
> To avoid marshalling you can either use PyObjects instead of nim types, or use raw_buffers or some higher level abstractions over it.

Sorry, but I fear I'm not understanding your answer completely. I am aware that with nim_myparser marshalling happens from Python string to Nim string on input, and from Nim seq[seq[string]] to Python List[List[str]] on output.
The problem is that the benchmark shows 160 ms to call nim_myparser(data) but 306 ms to call pd.DataFrame(nim_myparser(data)), while the amount of marshalling between Python and Nim should be the same in both cases.
Long story short: creating a DataFrame from py_myparser's output has no additional overhead, but creating one from nim_myparser's output has ~146 ms of overhead. AFAIK this overhead is incurred after all Python/Nim marshalling has completed.
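One way to test that claim is to time the DataFrame construction on its own, with the parsed list already materialised, so parsing and marshalling are excluded. A sketch with a stand-in list (with the real data you would use `rows = nim_myparser(data)` and `rows = py_myparser(data)` instead):

```python
import timeit
import pandas as pd

# Stand-in for the parser output; swap in nim_myparser(data) /
# py_myparser(data) to compare the two implementations.
rows = [[f"v{r}{c}" for c in range(5)] for r in range(10_000)]

# If the extra ~146 ms really happens after marshalling is finished, it
# should reappear here for the nim-produced list but not the python one.
df_time = timeit.timeit(lambda: pd.DataFrame(rows), number=3)
df = pd.DataFrame(rows)
print(f"pd.DataFrame on pre-built list: {df_time:.4f}s, shape={df.shape}")
```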

@yglukhov
Owner

yglukhov commented Dec 4, 2021

Oh, it seems I misread your benchmark. Can you verify that nim_myparser produces the same result as py_myparser? Also, a reproducible sample would be nice for me to play around with.

@yglukhov yglukhov reopened this Dec 4, 2021
@arkanoid87
Author

I've been trying to simplify the problem. Now it's about passing a (literal) unicode string and reversing it.

nimmodule.nim

import std/[unicode]
import nimpy


proc inputMarshal(data: string): string {.exportpy.} =
  return data.reversed


proc inputCast(data: PyObject): string {.exportpy.} =
  return data.to(string).reversed

nimmodule.nims

--gc: "arc"
--d: "danger"
--app: "lib"
--passC: "-flto"
--passL: "-flto"

benchmark.py

import random
import timeit
from inspect import getsource


def strgen(length: int) -> str:
    get_char = chr

    # Update this to include code point ranges to be sampled
    include_ranges = [
        (0x0021, 0x0021),
        (0x0023, 0x0026),
        (0x0028, 0x007E),
        (0x00A1, 0x00AC),
        (0x00AE, 0x00FF),
        (0x0100, 0x017F),
        (0x0180, 0x024F),
        (0x2C60, 0x2C7F),
        (0x16A0, 0x16F0),
        (0x0370, 0x0377),
        (0x037A, 0x037E),
        (0x0384, 0x038A),
        (0x038C, 0x038C),
    ]

    alphabet = [
        get_char(code_point) for current_range in include_ranges
        for code_point in range(current_range[0], current_range[1] + 1)
    ]
    return ''.join(random.choice(alphabet) for i in range(length))


def reverse_string(data: str) -> str:
    return data[::-1]


rndstr: str = strgen(5000)
reps: int = 100000

pyres = timeit.timeit(f"reverse_string('{rndstr}')",
                      setup=f"{getsource(reverse_string)}\ngc.enable()",
                      number=reps)

nimres_inputMarshal = timeit.timeit(f"nimmodule.inputMarshal('{rndstr}')",
                                    setup="import nimmodule; gc.enable()",
                                    number=reps)

nimres_inputCast = timeit.timeit(f"nimmodule.inputCast('{rndstr}')",
                                 setup="import nimmodule; gc.enable()",
                                 number=reps)

print(f"""
py: {pyres}
nim (inputMarshal): {nimres_inputMarshal}
nim (inputCast): {nimres_inputCast}
""".strip())

output

py: 0.7897346769750584
nim (inputMarshal): 7.288033179007471
nim (inputCast): 7.266386620991398

Also, according to my benchmarks, automatic input (un)marshalling and explicitly calling PyObject.to are equivalent. It also seems that Nim's unicode.reversed is much slower than Python's [::-1].

proc inputMarshal(data: string) {.exportpy.} =
  discard

proc inputCast(data: PyObject) {.exportpy.} =
  discard data.to(string)
nim (inputMarshal): 0.03680235700448975
nim (inputCast): 0.0365947810059879
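As a side note on semantics rather than speed: both Nim's unicode.reversed and [::-1] on a Python str reverse per code point. That handles multi-byte characters, but both can scramble combining sequences, as this small Python check shows:

```python
s = "héllo"                  # precomposed 'é': a single code point
assert s[::-1] == "olléh"    # per-code-point reversal keeps it intact

# With a decomposed character the accent is a separate code point, so the
# reversal attaches it to a different base letter -- rune-wise reversal in
# nim's std/unicode has the same behavior.
decomposed = "e\u0301x"      # 'e' + COMBINING ACUTE ACCENT + 'x'
assert decomposed[::-1] == "x\u0301e"
```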

@arkanoid87
Author

I just read about Python's string representation and that there's no guarantee its ABI won't change.

Also, I see that converting a Python string to a Nim string will always involve a copy:

proc pyStringToNim*(o: PPyObject, output: var string): bool =

I guess that unless I stay in UTF-8 land on the Python side, I can't optimize the Python-to-Nim interop.

Is this going to perform a copy?

python

nimmodule.reverseBuf("ABCDE".encode()) 

nim

proc reverseBuf(obj: PyObject): string {.exportpy.} =
  var pybuf: RawPyBuffer
  getBuffer(obj, pybuf, PyBUF_SIMPLE)
  result = ($cast[cstring](pybuf.buf)).reversed
  pybuf.release()  # release the buffer; the original snippet never did
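For comparison, this is how the same buffer question looks from pure Python: memoryview acquires the object's buffer without copying (the analogue of getBuffer with PyBUF_SIMPLE), and a copy only happens once a slice is materialised:

```python
data = "ABCDE".encode()      # bytes object, like the argument to reverseBuf

mv = memoryview(data)        # acquires the buffer: no copy of the payload
assert mv.obj is data        # the view still points at the original object

sliced = bytes(mv[1:4])      # materialising a slice is where the copy happens
mv.release()                 # analogue of pybuf.release() on the nim side

print(sliced)                # b'BCD'
```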

@arkanoid87
Author

arkanoid87 commented Dec 5, 2021

new results without unicode operations, just passing data and using identity functions

nimmodule.nim

import nimpy
import nimpy/raw_buffers


proc inputMarshal(data: string): string {.exportpy.} =
  return data

proc inputCast(data: PyObject): string {.exportpy.} =
  return data.to(string)

proc inputcstring(data: PyObject): string {.exportpy.} =
  var pybuf: RawPyBuffer
  getBuffer(data, pybuf, PyBUF_SIMPLE)
  result = $cast[cstring](pybuf.buf)
  pybuf.release()

nimmodule.nims

--gc: "arc"
--d: "danger"
--app: "lib"
--passC: "-flto"
--passL: "-flto"

benchmark.py

import string
from inspect import getsource
import timeit
import random
import nimporter
import nimmodule


def identitystr(data: str) -> str:
    return data


strlen: int = 5000
rndstr: str = ''.join(random.choice(
    string.ascii_uppercase + string.digits) for _ in range(strlen))
reps: int = 1000000


assert identitystr(rndstr) == nimmodule.inputMarshal(rndstr)
assert identitystr(rndstr) == nimmodule.inputCast(rndstr)
assert identitystr(rndstr) == nimmodule.inputcstring(rndstr.encode())


pyres = timeit.timeit(f"identitystr('{rndstr}')",
                      setup=f"{getsource(identitystr)}\ngc.enable()",
                      number=reps)

nimres_inputMarshal = timeit.timeit(f"nimmodule.inputMarshal('{rndstr}')",
                                    setup="import nimmodule; gc.enable()",
                                    number=reps)

nimres_inputCast = timeit.timeit(f"nimmodule.inputCast('{rndstr}')",
                                 setup="import nimmodule; gc.enable()",
                                 number=reps)

nimres_inputcstring = timeit.timeit(f"nimmodule.inputcstring('{rndstr}'.encode())",
                                    setup="import nimmodule; gc.enable()",
                                    number=reps)


print(f"""
py: {pyres:.5f}
nim (inputMarshal): {nimres_inputMarshal:.5f} x{nimres_inputMarshal/pyres:.2f}
nim (inputCast): {nimres_inputCast:.5f} x{nimres_inputCast/pyres:.2f}
nim (inputcstring): {nimres_inputcstring:.5f} x{nimres_inputcstring/pyres:.2f}
""".strip())

results

py: 0.04605
nim (inputMarshal): 0.92249 x20.03
nim (inputCast): 0.76446 x16.60
nim (inputcstring): 0.82511 x17.92

@yglukhov
Owner

yglukhov commented Dec 6, 2021

> new results without unicode operations, just passing data and using identity functions

Well there's still a lot more stuff going on in nim.

proc inputcstring(data: PyObject): string {.exportpy.} =
  var pybuf: RawPyBuffer
  getBuffer(data, pybuf, PyBUF_SIMPLE) # There's no copying here alright
  result = $cast[cstring](pybuf.buf) # But here you're allocating a nim string and copying cstring into it
  pybuf.release()
  # And then result is marshalled back to python, by allocating a python string and copying data into it, and result is deallocated

Some nuance to keep in mind. If nim string being marshalled to python string is not a valid utf8, it will fallback to python bytes. But it will cost extra performance as utf8 conversion is tried first.
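That fallback rule can be checked from Python alone: valid UTF-8 bytes round-trip to str, arbitrary bytes cannot, and the attempted-and-failed decode is the extra cost being paid:

```python
valid = "héllo".encode("utf-8")   # valid utf-8: would come through as str
invalid = b"\xff\xfe\x00abc"      # not valid utf-8: would fall back to bytes

assert valid.decode("utf-8") == "héllo"

try:
    invalid.decode("utf-8")
    decoded_ok = True
except UnicodeDecodeError:
    decoded_ok = False            # the failed decode attempt is the overhead

assert decoded_ok is False
```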

Your recent findings look obvious to me, as there's more marshalling than useful work. I don't follow how they relate to your initial benchmark.
