Skip to content

Latest commit

 

History

History
218 lines (147 loc) · 9.36 KB

MIGRATING_FROM_OLDER_VERSIONS.rst

File metadata and controls

218 lines (147 loc) · 9.36 KB

Migrating to the new client-based S3 API

Version of smart_open prior to 5.0.0 used the boto3 resource API for communicating with S3. This API was easy to integrate for smart_open developers, but this came at a cost: it was not thread- or multiprocess-safe. Furthermore, as smart_open supported more and more options, the transport parameter list grew, making it less maintainable.

Starting with version 5.0.0, smart_open uses the client API instead of the resource API. Functionally, very little changes for the smart_open user. The only difference is in passing transport parameters to the S3 backend.

More specifically, the following S3 transport parameters are no longer supported:

  • multipart_upload_kwargs
  • object_kwargs
  • resource
  • resource_kwargs
  • session
  • singlepart_upload_kwargs

If you weren't using the above parameters, nothing changes for you.

However, if you were using any of the above, then you need to adjust your code. Here are some quick recipes below.

If you were previously passing session, then construct an S3 client from the session and pass that instead. For example, before:

smart_open.open('s3://bucket/key', transport_params={'session': session})

After:

smart_open.open('s3://bucket/key', transport_params={'client': session.client('s3')})

If you were passing resource, then replace the resource with a client, and pass that instead. For example, before:

resource = session.resource('s3', **resource_kwargs)
smart_open.open('s3://bucket/key', transport_params={'resource': resource})

After:

client = session.client('s3')
smart_open.open('s3://bucket/key', transport_params={'client': client})

If you were passing any of the *_kwargs parameters, you will need to include them in client_kwargs, keeping in mind the following transformations.

Parameter name Resource API method Client API function
multipart_upload_kwargs s3.Object.initiate_multipart_upload s3.Client.create_multipart_upload
object_kwargs s3.Object.get s3.Client.get_object
resource_kwargs s3.resource s3.client
singlepart_upload_kwargs s3.Object.put s3.Client.put_object

Most of the above is self-explanatory, with the exception of resource_kwargs. These were previously used mostly for passing a custom endpoint URL.

The client_kwargs dict can thus contain the following members:

  • s3.Client: initializer parameters, e.g. those to pass directly to the boto3.client function, such as endpoint_url.
  • s3.Client.create_multipart_upload
  • s3.Client.get_object
  • s3.Client.put_object

Here's a before-and-after example for connecting to a custom endpoint. Before:

session = boto3.Session(profile_name='digitalocean')
resource_kwargs = {'endpoint_url': 'https://ams3.digitaloceanspaces.com'}
with open('s3://bucket/key.txt', 'wb', transport_params={'resource_kwarg': resource_kwargs}) as fout:
    fout.write(b'here we stand')

After:

session = boto3.Session(profile_name='digitalocean')
client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com')
with open('s3://bucket/key.txt', 'wb', transport_params={'client': client}) as fout:
    fout.write(b'here we stand')

See README and HOWTO for more examples.

Migrating to the new dependency management subsystem

Smart_open has grown over the years to cover a lot of different storages, each with a different set of library dependencies. Not everybody needs all of them, so to make each smart_open installation leaner and faster, version 3.0.0 introduced a new, backward-incompatible installation method:

  • smart_open < 3.0.0: All dependencies were installed by default. No way to select just a subset during installation.
  • smart_open >= 3.0.0: No dependencies installed by default. Install the ones you need with e.g. pip install smart_open[s3] (only AWS), or smart_open[all] (install everything = same behaviour as < 3.0.0; use this for backward compatibility).

You can read more about the motivation and internal discussions for this change here.

Migrating to the new open function

Since 1.8.1, there is a smart_open.open function that replaces smart_open.smart_open. The new function offers several advantages over the old one:

  • 100% compatible with the built-in open function (aka io.open): it accepts all the parameters that the built-in open accepts.
  • The default open mode is now "r", the same as for the built-in open. The default for the old smart_open.smart_open function used to be "rb".
  • Fully documented keyword parameters (try help("smart_open.open"))

The instructions below will help you migrate to the new function painlessly.

First, update your imports:

>>> from smart_open import smart_open  # before
>>> from smart_open import open  # after

In general, smart_open uses io.open directly, where possible, so if your code already uses open for local file I/O, then it will continue to work. If you want to continue using the built-in open function for e.g. debugging, then you can import smart_open and use smart_open.open.

The default read mode is now "r" (read text). If your code was implicitly relying on the default mode being "rb" (read binary), you'll need to update it and pass "rb" explicitly.

Before:

>>> import smart_open
>>> smart_open.smart_open('s3://commoncrawl/robots.txt').read(32)  # 'rb' used to be the default
b'User-Agent: *\nDisallow: /'

After:

>>> import smart_open
>>> smart_open.open('s3://commoncrawl/robots.txt', 'rb').read(32)
b'User-Agent: *\nDisallow: /'

The ignore_extension keyword parameter is now called ignore_ext. It behaves identically otherwise.

The most significant change is in the handling on keyword parameters for the transport layer, e.g. HTTP, S3, etc. The old function accepted these directly:

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> session = boto3.Session(profile_name='smart_open')
>>> smart_open.smart_open(url, 'r', session=session).read(32)
'first line\nsecond line\nthird lin'

The new function accepts a transport_params keyword argument. It's a dict. Put your transport parameters in that dictionary.

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> params = {'session': boto3.Session(profile_name='smart_open')}
>>> open(url, 'r', transport_params=params).read(32)
'first line\nsecond line\nthird lin'

Renamed parameters:

  • s3_upload -> multipart_upload_kwargs
  • s3_session -> session

Removed parameters:

  • profile_name

The profile_name parameter has been removed. Pass an entire boto3.Session object instead.

Before:

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> smart_open.smart_open(url, 'r', profile_name='smart_open').read(32)
'first line\nsecond line\nthird lin'

After:

>>> url = 's3://smart-open-py37-benchmark-results/test.txt'
>>> params = {'session': boto3.Session(profile_name='smart_open')}
>>> open(url, 'r', transport_params=params).read(32)
'first line\nsecond line\nthird lin'

See help("smart_open.open") for the full list of acceptable parameter names, or view the help online here.

If you pass an invalid parameter name, the smart_open.open function will warn you about it. Keep an eye on your logs for WARNING messages from smart_open.