COPY INTO FROM ON CLIENT (#102)
* Announce that we support file transfers

* Pass existing BytesIO instead of creating it anew everywhere

* Respond to file transfer requests

* It can upload some data

* Don't look for acknowledgement when server cancelled the upload

* Allow to check for cancellation

* Do not include \n in the file name

* Allow uploading through a file-like object

* Return early if cancelled

* Mock _getblock_raw instead of _getblock

The latter is no longer called by mapi.Connection.cmd().

* Fix type error found by mypy

* Add some asserts to silence mypy

* Make pycodestyle happy

* Make abstract base class for uploaders

* Add cancel method to Uploader

* Support downloads

* Refactor: chunk_left -> chunk_used

* Add set_chunk_size

* Complain if the handler didn't do anything

* Fix missing 'return' keyword

* Avoid empty writes

* When cancelled, pretend to write all bytes

Before we returned 0 but that leads to endless loops
when called by the utf-8 codec.

* Start porting the  file transfer tests from monetdb-jdbc

They are very useful.

* whitespace

* Silly copy pasta

* More tests

* Use TextIOWrapper() directly

instead of calling codecs.getreader().
This allows us to force the line endings to "\n"

* Rename {Up,Down}loader.handle to handle_{up,down}load

So a class can implement Uploader and Downloader simultaneously

* Disconnect if the upload handler throws an exception

We must make sure that the server doesn't think the upload ended
successfully and unfortunately the only way to do that is by killing
the connection.

* Disconnect if the download handler throws an exception

* Normalize \r\n to \n in uploaded text

* Export Uploader and Downloader from the pymonetdb toplevel

* Add doc strings

* Reset default chunk size to 1 MiB

* Minor fixes

* Small fixes for pycodestyle and mypy

* Move file transfer code to separate module

* cleanup

* Add DefaultHandler

* Test uploads with DefaultHandler

* Thank you, mypy

* Path.is_relative_to was only introduced in Python 3.9

* Fix bad mistake in test

* Test downloads with DefaultHandler

* Delete leftover line of code

* Stop on end of file while skipping

* Roll back between subtests

* Test various skip amounts

* Do not forget to acknowledge when uploading empty file

* Add timeout mechanism to catch hanging tests

* Test empty downloads

* Fix bug in empty downloads

* Test the CR LF normalizer and fix some bugs

* Test DefaultHandler security

* Split generic tests and default handler tests

* Expand generated subtests into standalone

* Uncompress files automatically

* Also test default download handler

* Demonstrate that _getblock_socket is dead code

* Remove the dead code

* Simplify buffer management

* Update the documentation

* Document compression support and allow to disable it

* Combine the standalone subtests into a generator again

Having them separate was useful while there were many
bugs, now that it mostly works conciseness is more important.

* Test it on Windows

Line endings are different there.

* Fix the paths

* Drop close_fds

* add future dependency hoping it can then import 'past'

* do not capture stderr

* Record server stderr

* We'd really like to be able to see the server stderr

* fixes

* syntax

* Show the default encoding

* Encodings

* Be more careful with the default encoding

* When testing text uploads in binary mode, make sure it's utf-8

* Proofreading changes

* Remove duplicate code

* Rename DefaultHandler to SafeDirectoryHandler

* Improve the documentation

* Some gra corrections.

* Use with-block in example

* Clarify :meta private: in docstring

* Avoid shadowing 'mapi' import with 'mapi' parameter

* Clean unused imports

* Remove circular dependencies between .mapi and .filetransfer

* Add more type annotations

* Removed nested try block

* Start splitting filetransfer.py in separate files

* Split up the filetransfer module

* Adjust api.rst to new module structure

* Add documentation for handle_download

Co-authored-by: lrpereira <[email protected]>
joerivanruth and lrpereira authored Apr 12, 2022
1 parent a2f7166 commit ea38611
Showing 22 changed files with 2,264 additions and 57 deletions.
50 changes: 50 additions & 0 deletions .github/workflows/windows.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,50 @@
name: Windows Test
on:
  push:
    branches: [ master ]
  pull_request:
    branches: [ master ]
jobs:
  runtests:
    runs-on: windows-2019
    steps:
      - uses: actions/checkout@v2

      - uses: actions/setup-python@v1
        with:
          python-version: '3.6'

      # Unfortunately msiexec does not seem to work on the windows runner.
      # 7zip is able to unpack .msi files but it loses the directory structure.
      # We fix that in the next step.
      - name: Download MonetDB
        run: |
          curl https://www.monetdb.org/downloads/Windows/Jan2022-SP1/MonetDB5-SQL-Installer-x86_64-20220207.msi -o ${{ runner.temp }}\monetdb.msi --no-progress-meter
          dir ${{ runner.temp }}
          7z x ${{ runner.temp }}\monetdb.msi -o${{ runner.temp }}\staging
          dir ${{ runner.temp }}\staging
      # Run a script to restore the directory structure and see if it works (a little)
      - name: Install MonetDB
        run: |
          python tests/install_monetdb_from_msi_dir.py ${{ runner.temp }}\staging ${{ runner.temp }}\MONET
          dir ${{ runner.temp }}\MONET
          dir ${{ runner.temp }}\MONET\bin
          ${{ runner.temp }}\MONET\bin\mserver5.exe --help
      - name: Setup virtual environment
        run: |
          python -m venv venv
          venv\Scripts\Activate.ps1
          python -m pip install -r tests/requirements.txt
      # Script tests/windows_tests.py starts an mserver in the background
      # and runs pytest, excluding the Control tests.
      - name: run the tests
        run: |
          venv\Scripts\Activate.ps1
          mkdir ${{ runner.temp }}\dbfarm
          python tests/windows_tests.py ${{ runner.temp }}\MONET ${{ runner.temp }}\dbfarm demo 50000
          echo ""; echo ""; echo "================ SERVER STDERR: ==================="; echo ""
          type ${{ runner.temp }}\dbfarm\errlog
16 changes: 16 additions & 0 deletions doc/api.rst
@@ -38,6 +38,22 @@ MAPI
   :show-inheritance:


File Uploads and Downloads
==========================

Classes related to file transfer requests as used by COPY INTO ON CLIENT.

.. automodule:: pymonetdb.filetransfer
   :members: Uploader, Downloader, SafeDirectoryHandler
   :member-order: bysource

.. automodule:: pymonetdb.filetransfer.uploads
   :members: Upload

.. automodule:: pymonetdb.filetransfer.downloads
   :members: Download


MonetDB remote control
======================

2 changes: 1 addition & 1 deletion doc/development.rst
@@ -1,4 +1,4 @@
-development
+Development
===========

Github
16 changes: 15 additions & 1 deletion doc/examples.rst
@@ -1,7 +1,12 @@
Examples
========

-examples usage below::
+Here are some examples of how to use pymonetdb.
+
+Example session
+---------------
+
+::

    > # import the SQL module
    > import pymonetdb
@@ -49,6 +54,8 @@ examples usage below::
     ('commit_action', 'smallint', 1, 1, None, None, None),
     ('temporary', 'tinyint', 1, 1, None, None, None)]

MAPI Connection
---------------

If you would like to communicate with the database at a lower level
you can use the MAPI library::
@@ -60,3 +67,10 @@ you can use the MAPI library::
    > server.cmd("sSELECT * FROM tables;")
    ...


CSV Upload
----------

This is an example script that uploads some CSV data from the local file system:

.. literalinclude:: examples/uploadcsv.py
36 changes: 36 additions & 0 deletions doc/examples/uploadcsv.py
@@ -0,0 +1,36 @@
#!/usr/bin/env python3

import os
import pymonetdb

# Create the data directory and the CSV file
try:
    os.mkdir("datadir")
except FileExistsError:
    pass
with open("datadir/data.csv", "w") as f:
    for i in range(10):
        print(f"{i},item{i + 1}", file=f)

# Connect to MonetDB and register the upload handler
conn = pymonetdb.connect('demo')
handler = pymonetdb.SafeDirectoryHandler("datadir")
conn.set_uploader(handler)
cursor = conn.cursor()

# Set up the table
cursor.execute("DROP TABLE foo")
cursor.execute("CREATE TABLE foo(i INT, t TEXT)")

# Upload the data, this will ask the handler to upload data.csv
cursor.execute("COPY INTO foo FROM 'data.csv' ON CLIENT USING DELIMITERS ','")

# Check that it has loaded
cursor.execute("SELECT t FROM foo WHERE i = 9")
row = cursor.fetchone()
assert row[0] == 'item10'

# Goodbye
conn.commit()
cursor.close()
conn.close()
27 changes: 27 additions & 0 deletions doc/examples/uploaddyn.py
@@ -0,0 +1,27 @@
#!/usr/bin/env python3
import pymonetdb

class MyUploader(pymonetdb.Uploader):
    def handle_upload(self, upload, filename, text_mode, skip_amount):
        tw = upload.text_writer()
        for i in range(skip_amount, 1000):
            print(f'{i},number{i}', file=tw)

conn = pymonetdb.connect('demo')
conn.set_uploader(MyUploader())

cursor = conn.cursor()
cursor.execute("DROP TABLE foo")
cursor.execute("CREATE TABLE foo(i INT, t TEXT)")
cursor.execute("COPY 10 RECORDS OFFSET 7 INTO foo FROM 'data.csv' ON CLIENT USING DELIMITERS ','")
cursor.execute("SELECT COUNT(i), MIN(i), MAX(i) FROM foo")
row = cursor.fetchone()
print(row)
assert row[0] == 10   # ten records numbered
assert row[1] == 6    # offset 7 means skip first 6, that is, records 0, .., 5
assert row[2] == 15   # 10 records: 6, 7,8, 9,10,11, 12,13,14, and 15

# Goodbye
conn.commit()
cursor.close()
conn.close()
38 changes: 38 additions & 0 deletions doc/examples/uploadsafe.py
@@ -0,0 +1,38 @@
#!/usr/bin/env python3
import pathlib
import shutil
import pymonetdb

class MyUploader(pymonetdb.Uploader):
    def __init__(self, dir):
        self.dir = pathlib.Path(dir)

    def handle_upload(self, upload, filename, text_mode, skip_amount):
        # security check
        path = self.dir.joinpath(filename).resolve()
        if not str(path).startswith(str(self.dir.resolve())):
            return upload.send_error('Forbidden')
        # open
        tw = upload.text_writer()
        with open(path) as f:
            # skip
            for i in range(skip_amount):
                f.readline()
            # bulk upload
            shutil.copyfileobj(f, tw)

conn = pymonetdb.connect('demo')
conn.set_uploader(MyUploader('datadir'))

cursor = conn.cursor()
cursor.execute("DROP TABLE foo")
cursor.execute("CREATE TABLE foo(i INT, t TEXT)")
cursor.execute("COPY 10 RECORDS OFFSET 7 INTO foo FROM 'data.csv' ON CLIENT USING DELIMITERS ','")
cursor.execute("SELECT COUNT(i), MIN(i), MAX(i) FROM foo")
row = cursor.fetchone()
print(row)

# Goodbye
conn.commit()
cursor.close()
conn.close()
120 changes: 120 additions & 0 deletions doc/filetransfers.rst
@@ -0,0 +1,120 @@
File Transfers
==============

MonetDB supports the non-standard :code:`COPY INTO` statement to load a CSV-like
text file into a table or to dump a table to a text file. This statement has an
optional modifier :code:`ON CLIENT` to indicate that the server should not
try to open the file server-side, but should instead ask the client to open the
file on its behalf.

For example::

    COPY INTO mytable FROM 'data.csv' ON CLIENT
    USING DELIMITERS ',', E'\n', '"';

By default, if pymonetdb receives a file request from the server, it will refuse
it for security reasons. You do not want the server, or an attacker pretending
to be the server, to be able to read arbitrary files on your system or even
overwrite them.

To enable file transfers, create a `pymonetdb.Uploader` and/or
`pymonetdb.Downloader` and register them with your connection::

    transfer_handler = pymonetdb.SafeDirectoryHandler(datadir)
    conn.set_uploader(transfer_handler)
    conn.set_downloader(transfer_handler)

With this in place, the COPY INTO ON CLIENT statement above will ask to open
file data.csv in the given `datadir` and upload its contents. As its name
suggests, :class:`SafeDirectoryHandler` will only allow access to the files in
that directory.

Note that in this example we register the same handler object both as an
uploader and a downloader, but it is perfectly sensible to only register an
uploader, or only a downloader, or to use two separate handlers.

See the API documentation for details.


Make up data as you go
----------------------

You can also write your own transfer handlers. Instead of opening a file,
such a handler can make up the data on the fly, retrieve it from a remote
microservice, prompt the user interactively or do whatever else you come up
with:

.. literalinclude:: examples/uploaddyn.py
   :pyobject: MyUploader

In this example we called `upload.text_writer()` which yields a text-mode
file-like object. There is also `upload.binary_writer()` which yields a
binary-mode file-like object. This works even if the server requested a text
mode object, but in that case you have to make sure the bytes you write are valid
utf-8 and delimited with Unix line endings rather than Windows line endings.
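
As a minimal sketch of that last point, the helper below prepares text for a
binary writer by normalizing line endings and encoding as UTF-8. The function
name `encode_for_text_upload` is made up for this illustration, and an
`io.BytesIO` stands in for the object that `upload.binary_writer()` would
return.

```python
import io


def encode_for_text_upload(text: str) -> bytes:
    """Turn text into bytes suitable for a binary writer in text mode.

    Converts Windows (CRLF) line endings to Unix (LF) and encodes as UTF-8.
    Hypothetical helper, not part of pymonetdb.
    """
    return text.replace("\r\n", "\n").encode("utf-8")


# io.BytesIO is only a stand-in for the result of upload.binary_writer().
writer = io.BytesIO()
writer.write(encode_for_text_upload("1,één\r\n2,twee\r\n"))
payload = writer.getvalue()
```

If you use `upload.text_writer()` instead, you do not need a helper like this.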

If you want to refuse an up- or download, call `upload.send_error()` to send an
error message. This is only possible before any calls to `text_writer()` and
`binary_writer()`.

For custom downloaders the situation is similar, except that instead of
`text_writer` and `binary_writer`, the `download` parameter offers
`download.text_reader()` and `download.binary_reader()`.
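
A download handler could look roughly like the sketch below, which keeps the
downloaded text in memory instead of writing it to disk. Two things here are
assumptions rather than documented facts: the parameter list
`handle_download(download, filename, text_mode)` is modelled on
`handle_upload`, and `FakeDownload` is a hypothetical stand-in so the sketch
runs without a server. A real handler would subclass `pymonetdb.Downloader`.

```python
import io


class FakeDownload:
    """Hypothetical stand-in for the `download` object; not part of pymonetdb."""

    def __init__(self, text):
        self._text = text

    def text_reader(self):
        return io.StringIO(self._text)


class CollectingDownloader:
    """Keeps downloaded text in memory, keyed by the requested file name."""

    def __init__(self):
        self.files = {}

    # Assumed signature, mirroring handle_upload without skip_amount.
    def handle_download(self, download, filename, text_mode):
        self.files[filename] = download.text_reader().read()


collector = CollectingDownloader()
collector.handle_download(FakeDownload("1,a\n2,b\n"), "out.csv", True)
```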


Skip amount
-----------

MonetDB's :code:`COPY INTO` statement allows you to skip, for example, the first
line in a file using the modifier :code:`OFFSET 2`. In such a case,
the `skip_amount` parameter to `handle_upload` will be greater than zero.

Note that the offset in the SQL statement is 1-based, whereas the `skip_amount`
parameter has already been converted to be 0-based. In the example above
this allowed us to write :code:`for i in range(skip_amount, 1000):` rather
than :code:`for i in range(1000):`.
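
For handlers that read from a file, the same arithmetic can be sketched as
follows. The helper `skip_lines` is hypothetical, and the subtraction is only
shown to illustrate the relation between the two numbers: pymonetdb already
passes the converted value to the handler.

```python
import io


def skip_lines(f, skip_amount):
    """Discard the first `skip_amount` lines of the text file `f`.

    Stops early at end of file. Returns the number of lines actually skipped.
    Hypothetical helper, not part of pymonetdb.
    """
    skipped = 0
    while skipped < skip_amount:
        if not f.readline():  # an empty string means end of file
            break
        skipped += 1
    return skipped


sql_offset = 7                  # as in: COPY ... OFFSET 7 ...
skip_amount = sql_offset - 1    # the value the handler would receive
data = io.StringIO("".join(f"{i},number{i}\n" for i in range(10)))
skipped = skip_lines(data, skip_amount)
first_uploaded = data.readline()  # first line the server would get to see
```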


Cancellation
------------

If the server does not need all uploaded data, for example if you did::

    COPY 100 RECORDS INTO mytable FROM 'data.csv' ON CLIENT

the server may at some point cancel the upload. This does not happen instantly.
Instead, from time to time pymonetdb explicitly asks the server whether it is
still interested. By default this happens after every MiB of data, but that can
be configured using `upload.set_chunk_size()`. If the server answers that it is
no longer interested, pymonetdb will discard any further data written to the
writer. It is recommended to occasionally call `upload.is_cancelled()` to check
for this and exit early if the upload has been cancelled.

Upload handlers also have an optional method `cancel()` that you can override.
This method is called when pymonetdb receives the cancellation request.
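
The following sketch shows one way a data-generating handler might poll for
cancellation. `FakeUpload` is a hypothetical stand-in that only mimics
`text_writer()` and `is_cancelled()`, and the `check_every` interval is an
arbitrary choice for the illustration.

```python
import io


class FakeUpload:
    """Hypothetical stand-in for the `upload` object; not part of pymonetdb."""

    def __init__(self, cancel_after_rows):
        self._writer = io.StringIO()
        self._cancel_after_rows = cancel_after_rows

    def text_writer(self):
        return self._writer

    def is_cancelled(self):
        # Pretend the server loses interest after a number of rows.
        rows_written = self._writer.getvalue().count("\n")
        return rows_written >= self._cancel_after_rows


def generate_rows(upload, total_rows, check_every=100):
    """Write rows to the upload, checking for cancellation now and then.

    Returns the number of rows written.
    """
    tw = upload.text_writer()
    for i in range(total_rows):
        if i % check_every == 0 and upload.is_cancelled():
            return i  # stop early, the server is no longer interested
        print(f"{i},number{i}", file=tw)
    return total_rows


rows_written = generate_rows(FakeUpload(cancel_after_rows=250), total_rows=1000)
```

Checking only every so many rows keeps the overhead low. The price is that a
few rows may still be produced after the server has cancelled.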


Copying data from or to a file-like object
------------------------------------------

If you are moving large amounts of data between pymonetdb and a file-like object
such as a file, Python's `copyfileobj`_ function may come in handy:

.. literalinclude:: examples/uploadsafe.py
   :pyobject: MyUploader

However, note that copyfileobj does not handle cancellations as described above.

.. _copyfileobj: https://docs.python.org/3/library/shutil.html#shutil.copyfileobj
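
If cancellation matters, a hand-written copy loop is one alternative. The
sketch below is not part of pymonetdb. It copies in chunks like `copyfileobj`
does but consults a callback between chunks. The name `copy_until_cancelled`
and the default chunk size are assumptions made for this illustration.

```python
import io


def copy_until_cancelled(src, dst, is_cancelled, chunk_size=64 * 1024):
    """Copy from `src` to `dst` in chunks, stopping once `is_cancelled()`
    returns True.

    Returns the number of characters (or bytes) copied.
    """
    copied = 0
    while not is_cancelled():
        chunk = src.read(chunk_size)
        if not chunk:  # end of input
            break
        dst.write(chunk)
        copied += len(chunk)
    return copied


src = io.StringIO("x" * 10)
dst = io.StringIO()
# Pretend the server cancels once 4 characters have arrived.
copied = copy_until_cancelled(
    src, dst, lambda: len(dst.getvalue()) >= 4, chunk_size=2)
```

Inside a handler the call would presumably look like
`copy_until_cancelled(f, upload.text_writer(), upload.is_cancelled)`.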


Security considerations
-----------------------

If your handler accesses the file system or the network, it is absolutely critical
to carefully validate the file name you are given. Otherwise an attacker can take
over the server or the connection to the server and cause great damage.

An example of how to validate file system paths is given in the code sample above.
Similar considerations apply to text that is inserted into network URLs and other
resource identifiers.
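
One caveat about plain string prefix checks such as the `startswith` test in
that sample: a sibling directory like `datadir2` also starts with `datadir`.
The sketch below compares path components instead, using `Path.relative_to`,
which is also available on Python versions older than 3.9. The function name
`resolve_inside` is made up for this illustration. This is not necessarily how
:class:`SafeDirectoryHandler` does it.

```python
import os
import pathlib
import tempfile


def resolve_inside(base_dir, filename):
    """Return the resolved path of `filename` inside `base_dir`,
    or None if it would end up outside that directory.

    Hypothetical helper, not part of pymonetdb.
    """
    base = pathlib.Path(base_dir).resolve()
    candidate = (base / filename).resolve()
    try:
        # relative_to() raises ValueError if candidate is not below base.
        candidate.relative_to(base)
    except ValueError:
        return None
    return candidate


with tempfile.TemporaryDirectory() as tmp:
    datadir = os.path.join(tmp, "datadir")
    os.mkdir(datadir)
    ok = resolve_inside(datadir, "data.csv")
    escaped = resolve_inside(datadir, "../datadir2/data.csv")
    absolute = resolve_inside(datadir, os.path.join(tmp, "other.csv"))
```

A handler would call `upload.send_error()` when the helper returns `None`.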
1 change: 1 addition & 0 deletions doc/index.rst
@@ -12,6 +12,7 @@ Contents:
   :maxdepth: 2

   introduction
   filetransfers
   examples
   api
   development
6 changes: 5 additions & 1 deletion pymonetdb/__init__.py
@@ -20,6 +20,10 @@
from pymonetdb.sql.connections import Connection
from pymonetdb.sql.pythonize import *
from pymonetdb.exceptions import *
from pymonetdb.filetransfer import Downloader, Uploader
from pymonetdb.filetransfer.downloads import Download
from pymonetdb.filetransfer.uploads import Upload
from pymonetdb.filetransfer.directoryhandler import SafeDirectoryHandler

try:
    __version__ = pkg_resources.require("pymonetdb")[0].version
@@ -34,7 +38,7 @@
           'Timestamp', 'DateFromTicks', 'TimeFromTicks', 'TimestampFromTicks', 'DataError', 'DatabaseError', 'Error',
           'IntegrityError', 'InterfaceError', 'InternalError', 'NUMBER', 'NotSupportedError', 'OperationalError',
           'ProgrammingError', 'ROWID', 'STRING', 'TIME', 'Warning', 'apilevel', 'connect', 'paramstyle',
-          'threadsafety']
+          'threadsafety', 'Download', 'Downloader', 'Upload', 'Uploader', 'SafeDirectoryHandler']


def connect(*args, **kwargs) -> Connection: