Allow multiple clients to use the same Transformer stream #288

Open · wants to merge 7 commits into base: master
Conversation

vdumoulin (Contributor)

Fixes #106.

@@ -407,6 +407,13 @@ dataset in the order defined by the iteration scheme.
(4, 2, 2) (4, 1)
(4, 2, 2) (4, 1)

.. warning::

:class:`DataStream` assumes that there is only one consumer of its

One consumer per epoch iterator or one consumer per stream?

vdumoulin (Author)

Per stream. I'll clarify that.

@bartvm (Member) commented Jan 27, 2016

Just wondering if I'm missing anything, but data streams should behave like any other iterator in Python, right? Or does "undefined" behaviour mean something different here?

@vdumoulin (Author)

What I'm trying to convey here is that users should not use data streams for more than one purpose, e.g. training and monitoring with the same data stream. Even though data streams behave like any other iterator in Python, I don't think transformers behave as one would expect, as demonstrated by the following piece of code:

import numpy

from fuel.datasets import IndexableDataset
from fuel.streams import DataStream
from fuel.schemes import SequentialScheme
from fuel.transformers import Mapping

def run(bug=True):
    # With `default_stream`, the dataset's default transformers are applied,
    # so `data_stream` ends up being a Mapping (i.e. a Transformer) instance.
    constructor = DataStream.default_stream if bug else DataStream
    dataset = IndexableDataset(numpy.arange(100))
    dataset.default_transformers = ((Mapping, [lambda d: (2 * d[0],)], dict()),)
    data_stream = constructor(dataset, iteration_scheme=SequentialScheme(100, 10))
    iterator_1 = data_stream.get_epoch_iterator()
    iterator_2 = data_stream.get_epoch_iterator()
    # Exhausting iterator_2 also exhausts the child iterator it shares
    # with iterator_1 ...
    for batch in iterator_2:
        pass
    # ... so this unexpectedly raises StopIteration when bug is True.
    next(iterator_1)

The problem is that Transformer instances maintain a child_epoch_iterator attribute that is set (and overwritten) each time their get_epoch_iterator method is called. In the code above, if the data stream happens to be a Transformer instance, the two client iterators unknowingly share the same child iterator.

I've seen many people get bitten by this (myself included), and I think a proper warning in the documentation would be a good thing.

@dwf (Contributor) commented Feb 3, 2016

It seems like the child_epoch_iterator ought to be owned by the iterator, in order to at least make it possible for the data stream to remain stateless (there's no contract that this is the case, of course, but it should probably be the default unless the user knows they're doing otherwise?).

@vdumoulin vdumoulin changed the title Document DataStream's 'one consumer' assumption Allow multiple clients to use the same Transformer stream Feb 3, 2016
@vdumoulin (Author)

@dwf @bartvm I decided to go ahead and fix this once and for all.

I implemented a new TransformerDataIterator class whose constructor expects a child_iterator argument in addition to the transformer itself. The Transformer.get_data method now expects a child_epoch_iterator argument which is passed by TransformerDataIterator.

This has the effect of moving ownership of the child epoch iterator from the stream to the iterator. Now that transformers are stateless, we can finally safely re-use data streams at multiple places in the code. This means no more weird StopIteration exceptions when you accidentally re-use the main loop data stream for monitoring! 😄
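As a toy sketch of that ownership change (the class and method names mirror the PR description, but the bodies here are hypothetical, not the PR's actual code): the iterator holds the child iterator and hands it to the transformer on every get_data call, so the transformer itself keeps no per-epoch state.

```python
class DoublingTransformer:
    """Toy stand-in for a stateless Transformer: get_data receives the
    child iterator explicitly instead of caching it on the instance."""
    def get_data(self, child_epoch_iterator):
        return 2 * next(child_epoch_iterator)


class TransformerDataIterator:
    """Owns the child epoch iterator, as proposed in this PR."""
    def __init__(self, transformer, child_iterator):
        self.transformer = transformer
        self.child_iterator = child_iterator

    def __iter__(self):
        return self

    def __next__(self):
        # StopIteration from the child propagates and ends this iterator.
        return self.transformer.get_data(self.child_iterator)


transformer = DoublingTransformer()
iterator_1 = TransformerDataIterator(transformer, iter(range(5)))
iterator_2 = TransformerDataIterator(transformer, iter(range(5)))
print(list(iterator_2))  # [0, 2, 4, 6, 8]
print(next(iterator_1))  # 0: exhausting iterator_2 no longer breaks iterator_1
```

With the fuel API as described in the PR, each call to get_epoch_iterator would construct a fresh child iterator for the new TransformerDataIterator, so two epoch iterators over the same transformer no longer interfere.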

@rizar (Contributor) commented Feb 4, 2016

Hold on guys, what will happen if the user creates two iterators from a transformer that uses a stateful data stream, e.g. a data stream over a TextFile? It seems like they will silently share the state...
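The concern can be illustrated with a toy sketch (hypothetical classes, not fuel's actual TextFile): when the read position lives on the stream object itself, two iterators created from it silently advance the same cursor.

```python
class StatefulSource:
    """Toy stand-in for a stateful stream such as one over a text file:
    the read cursor lives on the stream object, not on its iterators."""
    def __init__(self, lines):
        self.lines = lines
        self.pos = 0

    def iterator(self):
        # Every iterator created here reads through the one shared cursor.
        while self.pos < len(self.lines):
            line = self.lines[self.pos]
            self.pos += 1
            yield line


source = StatefulSource(["a", "b", "c", "d"])
it_1 = source.iterator()
it_2 = source.iterator()
print(next(it_1))  # "a"
print(next(it_2))  # "b", not "a": both iterators share the cursor
```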

@dwf (Contributor) commented Feb 8, 2016

@rizar I suppose it's a question of whether we want the stream object or the iterator object to be stateful, or both. Often, the situation is straightforward enough that the latter is clearly preferable. There are situations where it's not, though, see #310.

It seems like there's value in being able to separate a (stateless) description of a pipeline from the act of sucking things through that pipeline.

@bartvm (Member) commented Feb 8, 2016

There are other stateful transformers, such as the Cache transformer.

Perhaps on the long term the solution is to take the same approach as we do with datasets, storing the state of a pipeline separately from the actual transformer instances. You could imagine a single object, owned by the iterator, holding the state of all the transformers. When you call __next__ on the iterator, this stateful object is passed to all the get_data requests of each transformer which can read and write to the state. This way several iterators of the same pipeline can have entirely separated states.

Did that make sense?

@dwf (Contributor) commented Feb 8, 2016

Yes, it does. Though it's unclear to me why you'd want a single state object rather than one per transformer, since you already have a natural home for such state: the iterator. So the iterator, when querying the stream object, passes a state object (call it a "context", maybe, since that's what people call containers of stateful nonsense with no better descriptor) that consists of the child iterator it's receiving data from and whatever state the stream object needs to do its thing.

Anyway, this PR clearly doesn't quite fit the bill.

@bartvm (Member) commented Feb 8, 2016

I think we're talking about the same thing, but just to clarify: DataIterator.__next__ does something like

from collections import defaultdict

import six


class DataIterator(six.Iterator):
    def __init__(self, data_stream):
        self.data_stream = data_stream
        # Maps each transformer/stream to its own private state dict.
        self.context = defaultdict(dict)

    def __next__(self):
        return self.data_stream.get_data(self.context)

And then stateful transformers and streams do something like:

class StatefulTransformer(Transformer):
    def open(self, state):
        # `state` is this transformer's private entry in the context.
        state['file'] = open('some_file.txt')

    def get_data(self, context, request=None):
        if not context[self]:
            self.open(context[self])
        return context[self]['file'].readline()

@vdumoulin (Author)

I think I understand what you're getting at. I'll try to come up with a tentative implementation in the near future.
