Fourth development release
We are pleased to announce the release of Thunder 0.4.0.
This release introduces some major API changes, especially around loading and converting data types. It also brings some substantial updates to the documentation and tutorials, and better support for data sets stored on Amazon S3. While some big changes have been made, we feel that this new architecture provides a more solid foundation for the project, better supporting existing use cases, and encouraging contributions. Please read on for more!
Major Changes
- Data representation. Most data in Thunder now exists as subclasses of the new thunder.rdds.Data object. This wraps a PySpark RDD and provides several general convenience methods. Users will typically interact with two main subclasses of data, thunder.rdds.Images and thunder.rdds.Series, representing spatially- and temporally-oriented data sets, respectively. A common workflow will be to load image data into an Images object and then convert it to a Series object for further analysis, or just to convert Images directly to Series data.
- Loading data. The main entry point for most users remains the thunder.utils.context.ThunderContext object, available in the interactive shell as tsc, but this class has many new, expanded, or renamed methods, in particular loadImages(), loadSeries(), loadImagesAsSeries(), and convertImagesToSeries(). Please see the Thunder Context tutorial and the API documentation for more examples and detail.
- New methods for manipulating and processing images and series data, including refactored versions of some earlier analyses (e.g. routines from the package previously known as timeseries).
- Documentation has been expanded, and new tutorials have been added.
- Core API components are now exposed at the top-level for simpler importing, e.g. from thunder import Series or from thunder import ICA
- Improved support for loading image data directly from Amazon S3, using the boto AWS client library. The load* methods in ThunderContext now all support s3n:// schema URIs as data path specifiers.
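To make the relationship between spatially- and temporally-oriented data more concrete, here is a minimal sketch in plain Python. It is not Thunder's implementation and does not use Spark. The helper name images_to_series and the record layouts, with images keyed by a time index and series keyed by a (row, column) position, are assumptions made for illustration only.

```python
def images_to_series(images):
    """Regroup (time_index, image) records into ((row, col), values) records.

    `images` is a list of (time_index, image) pairs, where each image is a
    list of rows and all images share the same dimensions. This layout is a
    hypothetical stand-in for image records, not Thunder's actual format.
    """
    # Sort by time index so that every series comes out in temporal order.
    ordered = sorted(images, key=lambda record: record[0])
    n_rows = len(ordered[0][1])
    n_cols = len(ordered[0][1][0])

    series = []
    for r in range(n_rows):
        for c in range(n_cols):
            # Collect the value at this pixel across all time points.
            values = [image[r][c] for _, image in ordered]
            series.append(((r, c), values))
    return series


# Three 2x2 frames, deliberately listed out of temporal order.
images = [
    (1, [[10, 11], [12, 13]]),
    (0, [[0, 1], [2, 3]]),
    (2, [[20, 21], [22, 23]]),
]
series = images_to_series(images)
```

Each resulting record pairs one pixel position with its values over time, which is the orientation that time series analyses operate on. In Thunder itself this regrouping is performed on distributed data by the conversion methods listed above.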
Notes about requirements and environments
- Spark 1.1.0 is required for full functionality. Most features also work with earlier versions of Spark, with the exception of loading flat binary data.
- “Hadoop 1” jars as packaged with Spark are recommended, but Thunder should work fine if recompiled against the CDH4, CDH5, or “Hadoop 2” builds.
- Python 2 is required, version 2.6 or greater.
- PIL/pillow libraries are used to handle TIFF images. We have encountered some issues working with these libraries, particularly on OS X 10.9. Some errors related to image loading may be traceable to a broken PIL/pillow installation.
- This release has been tested most extensively in three environments: local usage, a private research compute cluster, and Amazon EC2 clusters stood up using the thunder-ec2 script packaged with the distribution.
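As a rough aid for checking an environment against the notes above, the following sketch tests the Python version constraint and whether the PIL namespace can be imported at all. Both function names are hypothetical and are not part of Thunder. A successful import does not prove that the PIL/pillow installation can decode TIFF files, so treat this as a first sanity check only.

```python
import sys


def python_version_supported(version_info=sys.version_info):
    """Return True for Python 2.6 or any later 2.x release."""
    major, minor = version_info[0], version_info[1]
    return major == 2 and minor >= 6


def pillow_importable():
    """Return True if the PIL namespace (from PIL or pillow) imports cleanly.

    Any exception raised during import is treated as a failed check, since a
    broken installation may fail in ways other than ImportError.
    """
    try:
        import PIL.Image  # noqa: F401
    except Exception:
        return False
    return True


python_ok = python_version_supported()
pillow_ok = pillow_importable()
```

If either value is False, fixing the interpreter version or reinstalling PIL/pillow is a reasonable first step before investigating image loading errors further.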
Future Directions
Thunder is still young, and will continue to grow. Now is a great time to get involved! While we will try to minimize changes to the API, it should not yet be considered stable, and may change in upcoming releases. That said, if you are using or contemplating using Thunder in a production environment, please reach out and let us know what you’re working on, or post to the mailing list.
Contributors
Jascha Swisher (@industrial-sloth): loading functionality, data types, AWS compatibility, API
Jeremy Freeman (@freeman-lab): API, data types, analyses, general performance and stability