This is a patch-level release that fixes two issues:
- Allow "skip" to be set on MongoInputSplit (HADOOP-304)
- Correctly handle renaming nested fields in Hive (HADOOP-303)
Thanks to mkrstic for the patch for HADOOP-304!
For complete details on the issues resolved in 2.0.2, consult the release notes on Jira: https://jira.mongodb.org/browse/HADOOP/fixforversion/17932
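As an illustration of the HADOOP-304 change, the skip can be set directly on a split before it is handed to a record reader. This is only a sketch: it assumes a setSkip accessor that mirrors the split's existing limit support, so verify the exact signatures against the 2.0.2 sources.

```java
import com.mongodb.hadoop.input.MongoInputSplit;

public class SkipSplitSketch {
    // Sketch only: assumes MongoInputSplit exposes a setSkip accessor
    // (HADOOP-304) alongside the setLimit support added in 1.5.1.
    public static MongoInputSplit windowedSplit() {
        MongoInputSplit split = new MongoInputSplit();
        split.setSkip(1000);   // skip the first 1000 documents of the split's range
        split.setLimit(500);   // read at most 500 documents after the skip
        return split;
    }
}
```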
This is a patch-level release that adds the noTimeout option to the cursor used by MongoPaginatingSplitter. More details on the issue can be found on its Jira page: https://jira.mongodb.org/browse/HADOOP-295
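For context, noTimeout is a driver-level cursor option; the fix turns it on for the splitter's paginating cursor so the server does not reap the cursor while a long-running job reads from it slowly. A minimal standalone illustration using the 3.x Java driver, not the connector's internal code (connection string and namespace are placeholders):

```java
import com.mongodb.MongoClient;
import com.mongodb.MongoClientURI;
import com.mongodb.client.FindIterable;
import com.mongodb.client.MongoCollection;
import org.bson.Document;

public class NoTimeoutSketch {
    public static void main(String[] args) {
        MongoClient client = new MongoClient(new MongoClientURI("mongodb://localhost:27017"));
        MongoCollection<Document> coll = client.getDatabase("test").getCollection("data");
        // The option the fix enables: the server will not expire this
        // cursor after its usual idle timeout.
        FindIterable<Document> it = coll.find().noCursorTimeout(true);
        for (Document doc : it) {
            // ... consume documents at whatever pace the job dictates
        }
        client.close();
    }
}
```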
This is a major release that introduces several new features. As a major release, it also removes several deprecated methods and objects, breaking the API in ways that should not affect most users of the Hadoop tools.
Some of the major new features introduced in this version include:
- Ability to collocate Hadoop nodes and MongoDB shards for data locality (HADOOP-202)
- Add GridFSInputFormat (HADOOP-272)
- Add MongoSampleSplitter (HADOOP-283)
- Support document replacement (HADOOP-263)
- Add back support for Hadoop 1.2.x (HADOOP-246)
For complete details on the issues resolved in 2.0.0, consult the release notes on Jira: https://jira.mongodb.org/browse/HADOOP/fixforversion/15622
This version is identical to 2.0.0-rc0.
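The basic job wiring is unchanged by the 2.0.0 additions. Here is a minimal sketch of a connector job using the MongoConfigUtil helpers (the URIs are placeholders, and the new GridFSInputFormat can be swapped in the same way):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

import com.mongodb.hadoop.MongoInputFormat;
import com.mongodb.hadoop.MongoOutputFormat;
import com.mongodb.hadoop.util.MongoConfigUtil;

public class ConnectorJobSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Placeholder URIs; point these at real input/output namespaces.
        MongoConfigUtil.setInputURI(conf, "mongodb://localhost:27017/db.in");
        MongoConfigUtil.setOutputURI(conf, "mongodb://localhost:27017/db.out");

        Job job = Job.getInstance(conf, "mongo-hadoop sketch");
        job.setJarByClass(ConnectorJobSketch.class);
        job.setInputFormatClass(MongoInputFormat.class);
        job.setOutputFormatClass(MongoOutputFormat.class);
        // ... configure mapper, reducer, and key/value classes as usual
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```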
This is a major release that introduces several new features. As a major release, it also removes several deprecated methods and objects, breaking the API in ways that should not affect most users of the Hadoop tools.
Some of the major new features introduced in this version include:
- Ability to collocate Hadoop nodes and MongoDB shards for data locality (HADOOP-202)
- Add GridFSInputFormat (HADOOP-272)
- Add MongoSampleSplitter (HADOOP-283)
- Support document replacement (HADOOP-263)
- Add back support for Hadoop 1.2.x (HADOOP-246)
For complete details on the issues resolved in 2.0.0, consult the release notes on Jira: https://jira.mongodb.org/browse/HADOOP/fixforversion/15622
This release candidate became the stable 2.0.0 version.
This release fixes a couple of issues with the "pymongo-spark" library, including a bug where datetimes were being decoded to java.util.GregorianCalendar and another bug where pymongo-spark was not working in non-local Spark setups.
For complete details on the issues resolved in 1.5.2, consult the release notes on Jira: https://jira.mongodb.org/browse/HADOOP/fixforversion/16602
This release features a few fixes on top of 1.5.0, including patching a few MongoDB connection leaks, avoiding a warning when using MongoUpdateStorage with Pig, and allowing a limit to be set on MongoInputSplits.
For complete details on the issues resolved in 1.5.1, consult the release notes on Jira: https://jira.mongodb.org/browse/HADOOP/fixforversion/16544
This release features major improvements to Pig, Hive, and Spark. Pig and Hive both gain the ability to push down simple queries and projections to MongoDB, potentially saving time and memory when running MapReduce jobs. Newly included UDFs allow writing MongoDB-specific types from Pig jobs and extracting timestamp information from ObjectIds. A new "pymongo-spark" library (under spark/src/main/python) allows using PyMongo objects with the connector, greatly simplifying the Python interface to Spark when running with MongoDB.
For a complete list of tickets resolved in this release, see the release notes on Jira: https://jira.mongodb.org/browse/HADOOP/fixforversion/15466
Changes from rc0:
- [HADOOP-255] Return null early in getTypeForBSON if input is null.
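pymongo-spark covers the Python side; on the JVM side the connector plugs into Spark through the standard Hadoop RDD API. A sketch of that wiring (the URI is a placeholder):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaSparkContext;
import org.bson.BSONObject;

import com.mongodb.hadoop.MongoInputFormat;

public class SparkReadSketch {
    public static void main(String[] args) {
        JavaSparkContext sc = new JavaSparkContext(
                new SparkConf().setAppName("mongo-hadoop read sketch"));
        Configuration conf = new Configuration();
        conf.set("mongo.input.uri", "mongodb://localhost:27017/db.in");  // placeholder
        // Keys are the documents' _id values; values are the documents.
        JavaPairRDD<Object, BSONObject> docs = sc.newAPIHadoopRDD(
                conf, MongoInputFormat.class, Object.class, BSONObject.class);
        System.out.println("documents: " + docs.count());
        sc.stop();
    }
}
```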
This release features major improvements to Pig, Hive, and Spark. Pig and Hive both gain the ability to push down simple queries and projections to MongoDB, potentially saving time and memory when running MapReduce jobs. Newly included UDFs allow writing MongoDB-specific types from Pig jobs and extracting timestamp information from ObjectIds. A new "pymongo-spark" library (under spark/src/main/python) allows using PyMongo objects with the connector, greatly simplifying the Python interface to Spark when running with MongoDB.
For a complete list of tickets resolved in this release, see the release notes on Jira: https://jira.mongodb.org/browse/HADOOP/fixforversion/15466
This is a minor release that contains small improvements and bug fixes on top of 1.4.0:
- [HADOOP-231] (Python) Streaming reports success but output collection stays empty
- [HADOOP-226] HiveException: java.lang.ClassCastException: java.lang.String cannot be cast to java.sql.Timestamp
- [HADOOP-219] Do not log the username:password portion of the mongo connection URI to hadoop logs
- [HADOOP-206] Update progress inside MongoOutputCommitter so that Hadoop doesn't time out the commit
This stable release also includes all features and fixes from the 1.4-rc0 release described below.
- [HADOOP-204] Allow concurrent access to MongoRecordReader instances
- [HADOOP-201] Support mongo.auth.uri in StandaloneMongoSplitter
- [HADOOP-196] Update Hadoop dependencies
- [HADOOP-195] 3.0 Java driver compatibility
- [HADOOP-188] Support MapWritable
- [HADOOP-179] When mongo.output.uri has a replica set specified, mongo-hadoop fails
- [HADOOP-175] Records dropped due to incorrectly computed file splits
- [HADOOP-173] Bulk write support from MongoOutputFormat
- [HADOOP-170] Pig integration doesn't call close() on Client
- [HADOOP-153] Add capability of BSONLoader.java to parse UUID
- [HADOOP-152] NumberFormatExceptions when splitting on a sharded, replica set cluster
- [HADOOP-151] Fix MongoUpdateWritable serialization
- [HADOOP-150] Use Primary read preference when sending splitVector command in StandaloneMongoSplitter
- [HADOOP-143] MongoConfigUtil.getCollection() creates orphaned MongoClients
- [HADOOP-110] Add no-args constructor for all splitters for multi-collection input
- [HADOOP-98] Handle binary types in Pig schema mode
- [HADOOP-94] BSONLoader failing to parse binary data
- [HADOOP-93] Processing GUID data when importing data from mongo to HDFS using Pig
- [HADOOP-82] Use OutputCommitter with MongoOutputFormat
- Fixed file distribution for streaming addon files
- Fixed Thrift dependency for CDH3.
- Added build support for the treasury yield example.
- Added a streaming example M/R job using the Enron email corpus
- HADOOP-29: Removed excessive logging for each tuple stored in MongoDB (RJurney)
- Streaming: Added support for Python generators in reduce functions (MLew)
- Pig: Fixed export of tuples to MongoDB as maps
- Fixed CDH4 build flags to correct compilation step.
- Fixed Hadoop build for dependencies across versions.
- Added a "load-sample-data" task to use for loading samples into mongo for testing/demos
- Hadoop 0.22.x support now works for those who need it (although I believe it's a deprecated branch)
- Stock Apache Hadoop 0.23.x now builds, using the actual 0.23.1 release... insanity around the MapReduce dependency
- added twitter hashtag examples
- Relocated the pymongo_hadoop module to a new "language_support" subdirectory. Created a setup.py file to build an egg / package. Available on PyPI as 'pymongo_hadoop'.
- Fixed pymongo_hadoop output to use BSON.encode
- Added support to streaming for the -file flag to distribute files out to the cluster if they don't exist.
- Made InputFormat and OutputFormat implicit on Streaming jobs, defaulting to the Mongo ones.
- Streaming now builds as a fat assembly jar and works.
- Added a 0.23 / CDH4 build. No longer allow raw "cdh" or "cloudera" build artifacts, to avoid confusion as to 'which Cloudera?'
- Added a 0.23 build, based on Cloudera's current distro (should be binary-compatible with stock)
- If combiner is not specified, do not pass it to Hadoop. While the combiner should be optional, giving Hadoop a null combiner will result in a NullPointerException.
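In Job terms, that guard amounts to the following sketch (the real change lives in the streaming job setup):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Reducer;

public class CombinerGuard {
    // Only register a combiner when one was actually configured;
    // handing Hadoop a null combiner class results in a NullPointerException.
    @SuppressWarnings("rawtypes")
    public static void maybeSetCombiner(Job job, Class<? extends Reducer> combinerClass) {
        if (combinerClass != null) {
            job.setCombinerClass(combinerClass);
        }
    }
}
```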
- Initial Release, Release Candidate