Fourth development release

@freeman-lab freeman-lab released this 16 Oct 05:35
· 1435 commits to master since this release

We are pleased to announce the release of Thunder 0.4.0.

This release introduces major API changes, especially around loading and converting data types. It also brings substantial updates to the documentation and tutorials, and better support for data sets stored on Amazon S3. Although the changes are significant, we feel that the new architecture provides a more solid foundation for the project, better supports existing use cases, and encourages contributions. Please read on for more!

Major Changes

  • Data representation. Most data in Thunder now exists as subclasses of the new thunder.rdds.Data object. This wraps a PySpark RDD and provides several general convenience methods. Users will typically interact with two main subclasses of data, thunder.rdds.Images and thunder.rdds.Series, representing spatially- and temporally-oriented data sets, respectively. A common workflow is to load image data into an Images object and then convert it to a Series object for further analysis, or to convert the images directly into Series data in a single step.
  • Loading data. The main entry point for most users remains the thunder.utils.context.ThunderContext object, available in the interactive shell as tsc, but this class has many new, expanded, or renamed methods, in particular loadImages(), loadSeries(), loadImagesAsSeries(), and convertImagesToSeries(). Please see the Thunder Context tutorial and the API documentation for more examples and detail.
  • New methods for manipulating and processing images and series data, including refactored versions of some earlier analyses (e.g. routines from the package previously known as timeseries).
  • Documentation has been expanded, and new tutorials have been added.
  • Core API components are now exposed at the top level for simpler importing, e.g. from thunder import Series or from thunder import ICA.
  • Improved support for loading image data directly from Amazon S3, using the boto AWS client library. The load* methods in ThunderContext now all support s3n:// scheme URIs as data path specifiers.
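
To make the Images-to-Series conversion described above more concrete, here is a conceptual sketch in plain Python. It does not use Thunder or Spark and is not how Thunder implements the conversion. It assumes, for illustration only, that image records are keyed by a time index and that series records are keyed by pixel coordinates. The function name images_to_series and the toy data are hypothetical.

```python
# Conceptual sketch only: plain Python lists stand in for distributed
# records. This is not Thunder's implementation or API.

def images_to_series(images):
    """Regroup (time, image) records into (pixel coordinate, time series) records.

    `images` is a list of (t, image) pairs, where each image is a list of
    rows and all images are assumed to share the same dimensions.
    """
    # Order records by time so each series comes out in temporal order.
    ordered = sorted(images, key=lambda record: record[0])
    if not ordered:
        return []
    first = ordered[0][1]
    n_rows, n_cols = len(first), len(first[0])
    series = []
    for row in range(n_rows):
        for col in range(n_cols):
            # Collect the value at this pixel from every time point.
            values = [image[row][col] for _, image in ordered]
            series.append(((row, col), values))
    return series


# Three 2x2 "images" recorded at times 0, 1, 2 (hypothetical toy data).
images = [
    (0, [[1, 2], [3, 4]]),
    (1, [[5, 6], [7, 8]]),
    (2, [[9, 10], [11, 12]]),
]

series = images_to_series(images)
for key, values in series:
    print(key, values)
```

Each output record pairs one pixel coordinate with that pixel's values over time, which is the temporally-oriented view that Series is described as providing. In Thunder itself this regrouping would be handled by the ThunderContext methods listed above rather than by user code.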

Notes about requirements and environments

  • Spark 1.1.0 is required. Most functionality will be intact with earlier versions of Spark, with the exception of loading flat binary data.
  • “Hadoop 1” jars as packaged with Spark are recommended, but Thunder should work fine if recompiled against the CDH4, CDH5, or “Hadoop 2” builds.
  • Python 2 is required, version 2.6 or greater.
  • PIL/pillow libraries are used to handle TIFF images. We have encountered some issues working with these libraries, particularly on OS X 10.9. Some errors related to image loading may be traceable to a broken PIL/pillow installation.
  • This release has been tested most extensively in three environments: local usage, a private research compute cluster, and Amazon EC2 clusters stood up using the thunder-ec2 script packaged with the distribution.

Future Directions

Thunder is still young, and will continue to grow. Now is a great time to get involved! While we will try to minimize changes to the API, it should not yet be considered stable, and may change in upcoming releases. That said, if you are using or contemplating using Thunder in a production environment, please reach out and let us know what you’re working on, or post to the mailing list.

Contributors

Jascha Swisher (@industrial-sloth): loading functionality, data types, AWS compatibility, API
Jeremy Freeman (@freeman-lab): API, data types, analyses, general performance and stability