Cascading Change Log
1.2.5
Removed accidental SLF4J dependencies.
Fixed bug where ISE was thrown if c.f.Flow#stop() was called immediately after #start().
1.2.4
Added info logging of current split input path with a task, if any.
Fixed bug in c.o.f.And, c.o.f.Or, and c.o.f.Xor where the sub-select of arguments was not honored.
Added info log message when writing "direct" to a filesystem, bypassing the temporary folder and removing the need to
rename the output file to its target location.
Fixed bug where, if all paths that match a glob pattern are empty, no exception was thrown, causing Hadoop to throw
a java.lang.ArrayIndexOutOfBoundsException.
Updated planner to issue an error message if a tail c.p.Pipe instance does not properly bind to a c.t.Tap instance.
1.2.3
Added c.f.Flow#setMaxConcurrentSteps to set the maximum number of steps that can be submitted concurrently.
Fixed bug where NPE was thrown when c.c.CascadeConnector tried to unwind nested c.t.MultiSourceTap instances.
Fixed bug where c.t.Fields#append() would fail when appending unordered selectors.
Updated c.f.FlowProcess to include #isCounterStatusInitialized() to test if the underlying reporting framework
is initialized.
Updated c.f.FlowProcess#keepAlive() method to fail silently if the underlying reporting framework is not initialized.
Updated error message thrown by c.f.FlowStep when unable to find c.t.Tap or c.p.Pipe instances in the flow plan due
to a Class serialized field not implementing #hashCode() or #equals() and relying on the object identity.
Added error message explaining the Hadoop mapred.jobtracker.completeuserjobs.maximum property needs to be increased
when dealing with large numbers of jobs. Also caching success value to lower chance of failure.
Fixed bug in c.t.GlobHfs where #equals() and #hashCode() were not consistent between calls.
1.2.2
Fixed bug where OOME caught from within the source c.t.Tap was not being re-thrown properly.
Added #getMapProgress() and #getReduceProgress() to c.f.h.HadoopStepStats.
Fixed NPE with some invocations of c.t.TupleEntry ctor.
Fixed bug where if an operation declared it returned Fields.ARGS and the argument selector used positions, the
outgoing values may merge incorrectly.
1.2.1
Changed info message to not announce ambiguous source trap if none has been set.
Fixed bug where if the c.o.Function result c.t.Tuple was passed immediately to a c.p.Group, it may become modified.
Fixed bug where c.t.TupleEntryIterator#hasNext() failed if called again after returning false.
Fixed issue where a reduce task may fail with an OOME during sorting.
1.2.0
Added c.p.a.AverageBy sub-assembly for optimizing averaging processes.
Added c.p.c.GroupClosure#getFlowProcess method to allow c.p.c.Joiner implementations access to current
properties and counters.
Added c.s.CascadingStats methods for accessing available counter groups and names.
Added c.s.WritableSequenceFile as a convenience for reading/writing sequence files holding custom Hadoop
Writable types in either the key, value, or key and value positions.
Added retrieve/publish support to the Conjars repo via Ivy.
Added the c.p.a.AggregateBy class to encapsulate parallel partial Function aggregations and their reduce
side Aggregator. This is a superior alternative to so-called MapReduce Combiners. See javadoc for details.
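The partial-aggregation idea behind AggregateBy and AverageBy can be sketched in plain Java. This is only an illustration and does not use the Cascading API: the class PartialAverageSketch and its methods are hypothetical, and it ignores any caching or memory limits the real classes may apply. Each split folds its records into per-key {sum, count} pairs, and the reduce side merges those pairs before dividing once.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Cascading code.
final class PartialAverageSketch {
    static Map.Entry<String, Double> entry(String key, double value) {
        return new SimpleEntry<>(key, value);
    }

    // Map side: fold one split's records into per-key {sum, count} pairs.
    static Map<String, double[]> partial(List<Map.Entry<String, Double>> records) {
        Map<String, double[]> result = new HashMap<>();
        for (Map.Entry<String, Double> record : records) {
            double[] acc = result.computeIfAbsent(record.getKey(), k -> new double[2]);
            acc[0] += record.getValue(); // running sum
            acc[1] += 1;                 // running count
        }
        return result;
    }

    // Reduce side: merge the partials from every split, then divide once.
    static Map<String, Double> average(List<Map<String, double[]>> partials) {
        Map<String, double[]> totals = new HashMap<>();
        for (Map<String, double[]> partial : partials) {
            for (Map.Entry<String, double[]> e : partial.entrySet()) {
                double[] acc = totals.computeIfAbsent(e.getKey(), k -> new double[2]);
                acc[0] += e.getValue()[0];
                acc[1] += e.getValue()[1];
            }
        }
        Map<String, Double> averages = new HashMap<>();
        for (Map.Entry<String, double[]> e : totals.entrySet()) {
            averages.put(e.getKey(), e.getValue()[0] / e.getValue()[1]);
        }
        return averages;
    }

    public static void main(String[] args) {
        Map<String, double[]> split1 = partial(Arrays.asList(entry("a", 1.0), entry("a", 2.0)));
        Map<String, double[]> split2 = partial(Arrays.asList(entry("a", 6.0)));
        System.out.println(average(Arrays.asList(split1, split2)).get("a")); // prints 3.0
    }
}
```

Carrying the count alongside the sum is what keeps the result correct: averaging the two per-split averages (1.5 and 6.0) would give 3.75 instead of 3.0.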
Changed c.o.Debug to print the number of tuples encountered on #cleanup().
Changed c.s.TextDelimited to always return the expected number of fields even if they are not parsed from
the current line and strict is false, unless Fields.ALL or Fields.UNKNOWN is declared.
Added c.p.a.SumBy sub-assembly for optimizing summing processes.
Added c.p.a.CountBy sub-assembly for optimizing counting processes.
Added c.s.CascadingStatus.Status.Skipped state so skipped c.f.Flow instances can be identified.
Added c.f.Flow#setSubmitPriority() to allow for custom order of Flows.
Fixed bug where c.t.MultiSourceTap#pathExists() would return true if one of the child paths was missing.
Changed c.c.CascadeConnector to fail if it detects cycles in the set of given c.f.Flow instances to manage.
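One common way to detect cycles while ordering dependent jobs is Kahn's topological sort. The sketch below is an assumption about the general technique, not the CascadeConnector implementation: the class FlowCycleCheck is hypothetical, and flows are reduced to names with an explicit "runs before" edge map.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Cascading code.
final class FlowCycleCheck {
    // successors maps a flow name to the flows that must run after it.
    static List<String> topologicalOrder(Map<String, List<String>> successors) {
        Map<String, Integer> inDegree = new LinkedHashMap<>();
        for (String node : successors.keySet()) {
            inDegree.putIfAbsent(node, 0);
        }
        for (List<String> targets : successors.values()) {
            for (String target : targets) {
                inDegree.merge(target, 1, Integer::sum);
            }
        }
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : inDegree.entrySet()) {
            if (e.getValue() == 0) {
                ready.add(e.getKey());
            }
        }
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String node = ready.remove();
            order.add(node);
            for (String target : successors.getOrDefault(node, Collections.<String>emptyList())) {
                if (inDegree.merge(target, -1, Integer::sum) == 0) {
                    ready.add(target);
                }
            }
        }
        // Any node never released still has an incoming edge, so it sits on a cycle
        // or downstream of one.
        if (order.size() != inDegree.size()) {
            throw new IllegalStateException("cycle detected among flows");
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> flows = new LinkedHashMap<>();
        flows.put("import", Arrays.asList("clean"));
        flows.put("clean", Arrays.asList("report"));
        System.out.println(topologicalOrder(flows)); // prints [import, clean, report]
    }
}
```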
Disable Hadoop warning about not using "options parser".
Added #isSource() and #isSink() methods to c.s.Scheme so that some Scheme instances can report they are either
sink or source only.
Added c.t.Fields#merge() method to allow simple merging of Fields instances while discarding duplicate names and
positions.
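The merge behavior described for Fields#merge() can be illustrated with plain lists of field names. This is a sketch under assumptions: FieldsMergeSketch is a hypothetical class, it models names only (not positions), and it assumes first-seen order is kept, which this entry does not state.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch, not the Fields implementation.
final class FieldsMergeSketch {
    // Concatenates the field sets, keeping only the first occurrence of each name.
    @SafeVarargs
    static List<String> merge(List<String>... fieldSets) {
        Set<String> seen = new LinkedHashSet<>();
        for (List<String> fields : fieldSets) {
            seen.addAll(fields);
        }
        return new ArrayList<>(seen);
    }

    public static void main(String[] args) {
        System.out.println(merge(Arrays.asList("id", "name"), Arrays.asList("name", "age")));
        // prints [id, name, age]
    }
}
```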
Added convenience methods on c.c.CascadeConnector#connect() and c.f.FlowConnector#connect() to accept
j.u.Collection<Flow> and j.u.Collection<Pipe> arguments, respectively.
Added Riffle support via the new c.f.ProcessFlow wrapper class. Riffle allows for non-Cascading jobs and/or
sets of iterative Flows to participate in a c.c.Cascade.
Changed c.c.Cascade instances to disable parallel execution if more than one Flow is a local only job.
Added c.c.Cascade#setMaxConcurrentFlows() property that limits the number of concurrently running Flows.
Added c.c.Cascade#writeDOT method for visualizing the dependencies between flow instances.
Added c.p.a.Unique sub-assembly for optimizing de-duping processes.
Changed c.s.TextDelimited to accept Fields.ALL or Fields.UNKNOWN for arbitrarily sized or unknown records.
Changed c.t.MultiSourceTap to support #openForRead().
Added c.t.Comparison and c.t.StreamComparator interfaces which allow for custom types to be
lazily deserialized during sort comparisons.
Added support for lazy deserialization during c.t.Tuple comparisons while shuffle sorting.
1.1.3
Added publishing of artifacts to the conjars.org jar repo via Ivy.
Added method c.s.CascadingStats#getCurrentDuration to return the current execution duration whether or not the
process/work is finished.
Fixed issues where c.t.Fields#getIndex may return invalid results if accessed from multiple threads simultaneously.
Fixed NPE when attempting to increment a counter before the first map/reduce invocation. Now throws a more
informative ISE message.
Fixed possible NPE when accessing counters via c.f.h.HadoopStepStats.
Fixed bug in c.s.TextDelimited where some unquoted empty values would not be properly parsed.
Added c.f.FlowStep#setName() method to allow override of MR job names. Use in conjunction with
FlowStep#containsPipeNamed() to find appropriate steps.
Fixed bug where c.f.MultiMapReducePlanner did not detect a split after a c.p.GroupBy or c.p.CoGroup where
one or more of the immediate pipes is a c.p.Every instance. An Each split is allowed.
Fixed c.t.TupleEntry#set method so that it may take a c.t.Fields instance for a field name.
Fixed NPE in c.t.TempHfs when parent c.f.Flow is used in a Cascade under certain conditions.
Fixed bug where mixed absolute and relative paths did not result in a proper topological sort when used
in a c.c.Cascade.
Fixed bug where a c.c.Cascade of c.f.Flow and c.f.MapReduceFlow instances did not properly sort topologically.
Added c.c.Cascade#writeDOT method to simplify debugging Cascade instances.
1.1.2
Fixed bug preventing c.s.TextDelimited schemes from being used with a c.t.TemplateTap.
Updated c.t.Scheme base class to force Fields.ALL source declaration to Fields.UNKNOWN, and to force Fields.UNKNOWN
sink declaration to Fields.ALL.
Fixed bug where if null was passed to c.s.TextLine sinkCompression, the behavior would be undefined.
Added back c.t.Tuple#add( Comparable ) to remain backwards compatible with 1.0.
Fixed bug preventing Fields.ALL selector in c.p.Every when incoming positions are used instead of field names
and the given aggregator declares field names.
Fixed bug that prevented the configured codecs from loading for co-group spills.
Fixed bug where c.s.TextDelimited would fail on delimiters that are also regex special characters.
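Failures of this kind typically appear when a delimiter string is handed unescaped to a regex-based splitter. The sketch below uses only java.util.regex; the class DelimiterQuoting is hypothetical and is not the TextDelimited code, and whether the actual fix used Pattern.quote is an assumption.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical sketch, not Cascading code.
final class DelimiterQuoting {
    // Splits on the delimiter as literal text by quoting it before compiling.
    // The -1 limit keeps empty fields, including trailing ones.
    static String[] splitLiteral(String line, String delimiter) {
        return Pattern.compile(Pattern.quote(delimiter)).split(line, -1);
    }

    public static void main(String[] args) {
        String line = "a|b||c";
        // Unquoted, "|" is a regex alternation of two empty patterns, so it
        // matches the empty string at every position rather than the pipe character.
        System.out.println(Arrays.toString(line.split("|")));
        System.out.println(Arrays.toString(splitLiteral(line, "|"))); // prints [a, b, , c]
    }
}
```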
Fixed random j.u.ConcurrentModificationException error when running in Hadoop local mode by synchronizing
the c.f.s.StackElement#closeTraps method.
Fixed missing property values when stored in a nested j.u.Properties object.
Fixed NPE when counter group does not exist yet when querying c.s.FlowStats#getCounterValue.
1.1.1
Fixed bug where some unsafe operations followed by named c.p.Pipe instances were not considered during planning.
Removed imports for SLF4J and replaced with Apache Log4j in c.s.TextDelimited.
Fixed bug where c.t.Fields.SWAP did not properly resolve when following a c.p.Every pipe.
1.1.0
Fixed bug where a c.t.Fields instance can be marked as ordered when modified via #set call.
Changed c.p.CoGroup to detect self-joins and optimize for them.
Changed trap handling to include failures from source and sink c.t.Tap instances. The source Tap will inherit
the assembly head trap and the sink will inherit the assembly tail trap.
Deprecated c.t.Tuple#parse(). It does not properly handle null values or types other than primitives.
Changed c.f.s.StackElement to log a warning for each trap captured. This includes a truncated print of the offending
c.t.TupleEntry and the thrown exception and stack trace. Traps being for exceptional cases, logging exceptions is a
reasonable response.
Changed map and reduce operation stack so that collected c.t.Tuple instances do not remain 'unmodifiable' after
being collected via the c.t.TupleEntryCollector.
Add #getArgumentFields() to c.o.OperationCall for all operations.
Added support for custom EMR properties used for managing task attempt temporary paths on some filesystems.
Changed c.t.TemplateTap to support an openTapsThreshold value. The default open taps is 300. After the threshold
is met, 10% of the least recently used open taps will be closed.
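The threshold behavior described above can be sketched with an access-ordered LinkedHashMap. This is a hypothetical model, not the TemplateTap code: OpenTapsSketch stands in for the tap, a StringBuilder stands in for an open writer, and computing the 10% from the threshold (rather than from the current open count) is an assumption.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Cascading code.
final class OpenTapsSketch {
    private final int threshold;
    // accessOrder=true makes iteration run from least to most recently used.
    private final LinkedHashMap<String, StringBuilder> open = new LinkedHashMap<>(16, 0.75f, true);
    final List<String> closed = new ArrayList<>();

    OpenTapsSketch(int threshold) {
        this.threshold = threshold;
    }

    // Returns the stand-in writer for a path, opening it if needed.
    StringBuilder writerFor(String path) {
        StringBuilder writer = open.get(path); // a hit marks the entry as recently used
        if (writer == null) {
            if (open.size() >= threshold) {
                purge();
            }
            writer = new StringBuilder();
            open.put(path, writer);
        }
        return writer;
    }

    // Closes 10% of the threshold (at least one), least recently used first.
    private void purge() {
        int toClose = Math.max(1, threshold / 10);
        Iterator<Map.Entry<String, StringBuilder>> it = open.entrySet().iterator();
        while (toClose-- > 0 && it.hasNext()) {
            closed.add(it.next().getKey());
            it.remove();
        }
    }

    int openCount() {
        return open.size();
    }

    public static void main(String[] args) {
        OpenTapsSketch taps = new OpenTapsSketch(10);
        for (int i = 0; i < 10; i++) {
            taps.writerFor("p" + i);
        }
        taps.writerFor("p0");  // touch p0 so that p1 becomes the least recently used
        taps.writerFor("p10"); // reaching the threshold closes one entry first
        System.out.println(taps.closed + " " + taps.openCount()); // prints [p1] 10
    }
}
```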
Changed c.t.Fields#setComparator to accept Fields instances as the fieldName argument.
Only the first field name or position is considered.
Changed c.t.TupleEntry 'get as type' accessors to now also accept c.t.Fields instances as the fieldName argument. Only
the first field name or position is considered.
Updated janino to 2.5.16.
Updated jgrapht to 0.8.1.
Changed c.f.s.FlowMapperStack to source key/value pairs once, instead of per branch.
Changed c.f.FlowPlanner to fail if not all sources or sinks are bound to heads or tails, respectively.
Changed c.t.TupleOutputStream to lookup tuple element writers by Class identity.
Added j.b.ConstructorProperties annotation to relevant class constructors.
Added new convenience method c.p.Pipe#names to return an array of all the pipe names in an assembly. This supports
the dynamic creation of traps from opaque assemblies.
Added new c.s.Scheme type c.s.TextDelimited to allow native support for delimited text files.
Added optimization during CoGrouping where the left-most (LHS) pipe will never be accumulated; instead, the values iterator
will be used directly. This allows for the most dense values to be on the LHS, and the most sparse to be on the
RHS of the join.
Added new counters for tuple spills and reads. Also logs grouping after first spill.
Added compression of object serialization and deserialization, on by default. This improves reliability
of very large jobs with very large numbers of input files.
Fixed bad cast of j.l.Error when caught in map/reduce pipeline stack.
Added c.t.Fields#rename to simplify Fields instance manipulations.
Added support for resultGroupFields in c.p.CoGroup. This allows the outgoing grouping fields to be set.
Added c.t.h.BytesSerialization and c.t.h.BytesComparator to allow for c.t.Tuple instances
to hold raw byte arrays (byte[]), and allow joining, grouping, and secondary sorting.
Changed c.t.Tuple and underlying framework to support j.l.Object instead of j.l.Comparable. Note that
Tuple#get() returns Comparable to maintain backwards compatibility.
Added support for custom j.u.Comparator instances to control the grouping and sort orders in c.p.CoGroup and
c.p.GroupBy via the c.t.Fields class.
Added support for planner managed debugging levels via the c.o.DebugLevel enum. Now c.o.Debug operations
can be planned out at runtime in the same manner as c.o.Assertion operations.
Refactored xpath operations to re-use j.x.p.DocumentBuilder instances.
Refactored fields resolver framework to emit consistent error messages across all field resolution types.
Fixed bug where c.t.Tuples would fail when coercing non-standard java types or primitives.
Fixed bug where c.t.Tap instances that returned true for #isWriteDirect() were not properly being initialized
when used as a sink.
Added guid like ID values to c.f.Flow and c.c.Cascade instances.
Refactored reduce side grouping and co-grouping operations to remove redundant code calls.
Added ability to capture Hadoop specific job details like task start and stop times, and all available counter values.
Added accessor for increment counters on c.s.CascadingStats. This allows applications to pull aggregate counter
values from c.c.Cascade, c.f.Flow, or c.f.FlowSteps.
Added c.t.GlobHfs c.t.Tap type that accepts Hadoop style globbing syntax. This allows multiple files that match
a given pattern to be used as the sources to a Flow.
Added c.o.s.State and c.o.s.Counter helper operations that respectively set 'state' and increment counters.
Added c.f.FlowProcess#setStatus method to allow for text status messages to be posted.
Added c.o.a.AssertNotEquals assertion type.
Removed planner restriction that traps must not cross map/reduce boundaries. This allows for a single c.t.Tap
trap to be used across a whole branch, regardless of underlying topology.
Added new c.t.Fields field set type named Fields.SWAP. Can only be used as a result selector. Specifies operation
results will replace the argument fields. The remaining input fields will remain intact.
Deprecated c.t.SinkMode#APPEND and replaced with c.t.SinkMode#UPDATE.
Added c.t.MultiSinkTap to allow for simultaneous writes to multiple unique locations.
Added support for compression of c.t.SpillableTupleList by default in order to speed up c.p.CoGrouping operations
where there are very large numbers of values per grouping key.
Added c.o.f.SetValue function for setting values based on the result of a c.o.Filter instance.
Added support for configuring polling interval of job status via c.f.h.MultiMapReducePlanner.
Added c.f.h.MultiMapReducePlanner optimization to detect 'equivalent' adjacent c.t.Tap instances in a c.f.Flow.
This can drastically reduce the number of jobs when there are intermediate sinks between pipe assemblies.
If the taps are not compatible, a job will be inserted to convert the temp tap data to the sink format.
Added support for 'safe' c.o.Operations. By default Operations are safe, that is, they have no side-effects, or
if they do, they are idempotent. Non-safe operations are treated differently by the c.f.h.MultiMapReducePlanner.
Added new c.t.Fields field set type named Fields.REPLACE. Can only be used as a result selector. Specifies the
operation results will replace values in fields with the same names. That is, inline values can be replaced in a
single c.p.Each or c.p.Every. It is especially useful when used with Fields.ARGS as the operation field declaration.
Fix for case where one side of a branch multiplexed in a mapper could step on c.t.Tuple values before being
handed to the next branch. Previous fix was only for CoGroup; this supports GroupBy merges.
1.0.18
Changed c.t.Tuple#print to not quote null elements to distinguish between 'null' Strings and null values.
Changed planner exception messages to quote head and tail names.
Changed log messages to info when hdfs client finalizer hook cannot be found.
Fix for NPE in c.t.h.MultiInputFormat during certain testing scenarios. Also changed proportioning to honor
suggested numSplits value.
Fix for temp files starting with underscores (_) causing them to be ignored.
Fix for mixed types in properties object causing ClassCastExceptions.
Fix for case where one side of a branch multiplexed in a mapper could step on c.t.Tuple values before being
handed to the next branch.
Fix for edge case where Cascading jars are stored in Hadoop classpath and deserialization of c.f.Flow fails.
Fix for bad cast of j.l.Error when caught in map/reduce pipeline stack.
Fix for bug when selecting positional Fields from positional Fields.
Fix for case when c.o.Aggregator#start is called when there are no values to iterate across in current grouping.
1.0.17
Changed behavior when cleaning temp files that allows shutdown to continue even if an exception is thrown
during temp file delete.
Fix bug where c.f.FlowProcess#openTapForRead() included current input file values in iterator.
Fix for intermediate temp files not being cleaned up on c.f.Flow#stop().
Fixed bug where NPE is thrown if all hadoop default properties are not available.
1.0.16
Fixed bug where in some instances o.a.h.m.JobConf hangs when instantiated during co-grouping.
Fixed bug in c.CascadingTestCase#invokeBuffer where the output collector was not properly being set. Added
new methods on #invokeBuffer and #invokeAggregator to take a grouping c.t.TupleEntry.
1.0.15
Fixed bug where c.t.Fields did not check for a null field name or position on the ctor.
Fixed bug in c.u.Util#join() methods where if the first value was empty, the delimiter was not properly applied.
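One plausible cause of a join bug like this (an assumption; the Util#join source is not shown here) is deciding whether to emit the delimiter by checking if the output buffer is still empty, which skips the delimiter after an empty first value. The hypothetical sketch below keys the decision on the element index instead.

```java
// Hypothetical sketch, not the Util#join implementation.
final class JoinSketch {
    static String join(String delimiter, String... values) {
        StringBuilder buffer = new StringBuilder();
        for (int i = 0; i < values.length; i++) {
            // Test the index, not buffer.length(): an empty first value leaves
            // the buffer empty, and a length check would drop the delimiter.
            if (i != 0) {
                buffer.append(delimiter);
            }
            buffer.append(values[i]);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.println(join(",", "", "a", "b")); // prints ,a,b
    }
}
```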
Fixed issue in c.t.h.FSDigestOutputStream where seek() now must be implemented with modern versions of Hadoop.
1.0.14
Fixed bug in planner where JGraphT sometimes returns null instead of an empty List.
Fixed bug in c.o.x.XPathParser that prevented use of multiple xpath expressions.
Added configuration property allowing job polling interval to be configured per c.f.Flow via
Flow#setJobPollingInterval().
Updated ant build to not hard-code hadoop/lib sub-dir names.
1.0.13
Fixed bug where non-String j.u.Properties values were not being copied to the internal o.a.h.m.JobConf instance.
Fixed bug where custom serializations were not recognized during co-grouping spills inside c.t.SpillableTupleList.
1.0.12
Fixed bug where the c.f.FlowPlanner did not detect that tails were not bound to sinks, or that some tail references
were missing.
Fixed j.u.ConcurrentModificationException when using a c.c.CascadeConnector on c.f.Flows using a c.t.MultiSink
c.t.Tap.
Fixed bug where c.f.s.StackException was being wrapped preventing failures within sink c.t.Tap instances from
causing the c.f.Flow to fail. This mainly affected Flows using traps.
1.0.11
Added clearer error message when c.t.Tap is used as both source and sink in a given Flow.
Demoted all DEBUG related c.t.Tuple#print() calls to TRACE.
Fixed NPE when planner finds inconsistencies with c.t.Tap and c.p.Pipe names.
1.0.10
Updated planner error messages when field name collisions detected.
Fixed issue where temporary paths were not getting deleted consistently.
1.0.9
Fixed issue where reverse ordering a c.p.GroupBy was not possible when sortFields were not given.
Changed c.f.s.StackElement#close() behavior to close elements from the top of the stack.
1.0.8
Fixed bug where Hadoop FS shutdown hooks prevented cleanup of c.f.Flow intermediate files.
Fixed bug where c.t.MultiTap was not accounted for when planning a c.c.Cascade.
Fixed bug where operations in the default package caused NPE when calculating the stacktrace.
Added c.f.StepCounters enum and now increment the counters Tuples_Read, Tuples_Written, Tuples_Trapped.
Fixes for instabilities when using traps in some instances.
Workaround for bug in o.a.h.f.s.NativeS3FileSystem where a null is returned when getting a FileStatus array
in some cases.
1.0.7
Fixed bug where c.o.r.RegexSplitter did not consistently split incoming values if the value had blank
fields between the split delimiter. This only occurs if the incoming tuple is declared Fields.UNKNOWN
and won't affect any tuple with declared field names. Though this is an incompatible change, the bug
breaks the contract of the splitter.
Deprecated all S3 supporting classes, including c.t.S3fs. The s3n:// protocol is the preferred S3 interface.
Fixed bug where c.t.Hfs caused an NPE from the NativeS3FileSystem when attempting to delete the root directory.
Hfs now detects a delete is attempted on the root dir, and returns immediately.
1.0.6
Fixed bug where a uri path to a s3n://bucket/ could cause an NPE when determining mod time on the path.
Fixed bug where sink c.s.Scheme sink fields were not being consulted during planning. This fix may
cause planner errors in existing applications where the sink fields are not actually available in the incoming
tuple stream.
Updated application jar discovery to provide more sane defaults supporting simple cases.
Fixed bug where default properties in nested j.u.Properties object were not being copied.
1.0.5
Added check if num reducers is zero, if so, assume #reduce() has no intention of being called and return silently.
1.0.4
Updated split optimizer to perform a multipass optimization.
Fixed bug where c.f.MultiMapReducePlanner was not properly handling splits on named Pipe instances.
Added c.t.TemplateTap constructor arg that allows for independent tuple selection for use by template path.
Fixed bug where unsafe filename characters were leaking into temporary filenames, didn't take the first time.
1.0.3
Fixed bug in c.f.MultiMapReducePlanner where split and joins with the same source were not handled properly.
Fixed bug in c.f.Flow#writeDOT caused by changes in 1.0.2.
Fixed bug in c.o.t.DateFormatter and c.o.t.DateParser where the TimeZone value was not being properly set. This
fix could affect existing applications.
1.0.2
Added rules to verify no duplicate head or tail names exist in an assembly when calling c.f.FlowConnector#connect().
Currently a WARNING will be issued via the logger, next major release this will be an exception. This is a change
that was supported in prior releases, but turns out to allow error prone code. Two workarounds are available: bind
the same tap to both names in the tap map, or split from a single named c.p.Pipe instance.
Added support for c.o.e.ExpressionFunction to evaluate expressions with no input parameters.
Reverted MR job naming to include sink c.t.Tap name. More verbose, but easier for debugging.
Update c.c.Cascade to not delete c.f.Flow sinks if they are appendable before the Flow is executed.
Updated error messages to warn when internal element graphs remove all place holders resulting in an empty graph
usually due to missing linkages between pipe assemblies.
Allowing Fields.UNKNOWN to propagate through pipes that do not declare argument selectors. This is a relaxation
of the strict planning and seems very natural when assembling pipes to process unknown field sets. Reserving
the right to revert this feature if it causes unforeseen issues.
Fixed bug in c.o.f.UnGroup where the num arg value was improperly calculated.
Allow for white space in the serializations token property so it can be set in a config file simply.
Added new log message if no serialization token is found for a class being serialized out.
Fixed bug that allowed c.t.Fields instances to be nested in new Fields instances.
Updated many error messages to print the number of fields along with a list of the field names.
Fixed bug preventing custom c.s.Scheme types from using different key/value classes in some situations.
Fixed bug preventing c.t.TemplateTap from being written to in Reducer.
1.0.1
Improved error message for the case a Hadoop serializer/deserializer cannot be found.
Changed c.s.Scheme sourceFields default to Fields.UNKNOWN. sinkFields default remains Fields.ALL.
Fixed bug where unsafe filename characters were leaking into temporary filenames.
Changed SinkMode.APPEND support checks to be done in c.t.Hfs, instead of c.t.Tap.
1.0.0
Updated copyright messages.
Fixed bug where c.t.TuplePair threw an NPE during debugging.
Fixed bug where positional selectors failed against Fields.UNKNOWN.
Changed all constructors on c.p.Group to be protected. Must now use subclasses to construct.
Renamed c.t.Fields#minus to subtract.
0.10.0
Changed c.p.CoGroup "repeat" parameter to numSelfJoins to represent the actual number of self joins to be performed.
Thus a value of 1 will cause a single self join of a pipe. Users will need to decrement the current value by 1.
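The migration arithmetic is small enough to show directly. The helper below is hypothetical and is not part of Cascading; it only restates the "decrement by 1" rule from the entry above.

```java
// Hypothetical helper, not part of Cascading.
final class SelfJoinMigration {
    // Converts a value written for the old "repeat" parameter to numSelfJoins.
    static int toNumSelfJoins(int repeat) {
        return repeat - 1;
    }

    public static void main(String[] args) {
        // An old repeat of 2 becomes numSelfJoins of 1: a single self join.
        System.out.println(toNumSelfJoins(2)); // prints 1
    }
}
```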
Fixed bug with temporary filename generation where path created was too long.
Fixed Janino c.o.expression operations to require parameter names and types. Janino
was returning guessed parameter names in a nondeterministic order.
Fixed boolean type c.t.Tuple serialization.
Fixed c.p.GroupBy merging case where grouping field names were not properly resolved.
Changed c.o.r.RegexParser to emit variable sized Tuples if a fieldDeclaration is not given. Also will emit group
matches if there are any, otherwise the match is emitted.
Removed deprecated classes; c.o.t.Texts, c.o.r.Regexes, c.p.EndPipe.
Removed experimental c.p.EndPipe class.
Changed c.t.Tap#isUseTapCollector to Tap#isWriteDirect.
Changed c.t.Tap and c.f.Flow to return c.t.TupleEntryIterator instead of c.t.TupleIterator. This is more consistent
and more useful.
Added c.t.TemplateTap to support dynamically writing out c.t.Tuple values to unique directories.
Changed Cascading to support null values returned from c.t.Tap#source() and subsequently c.t.Scheme#source().
This allows for Schemes to skip records returned by an internal Hadoop InputFormat without having to implement
a custom Hadoop InputFormat or instrument a pipe assembly with a c.o.Filter.
0.9.0
Updated c.o.Debug to allow for printing field names and tuple values in intervals.
Changed planner to fail if traps are not contained within single Map or Reduce tasks. This prevents the chance of
multiple tasks writing to the same output location. Hadoop only partially supports appends, so it is not currently
possible to append subsequent jobs to existing trap files. Naming sections of a pipe assembly allows traps to be
bound to smaller sections of assemblies.
Added c.o.f.Sample and c.o.f.Limit Filters. Sample allows a given percentage of Tuples to pass. Limit only allows the
specified number of Tuples to pass.
c.p.Pipe instances now capture line numbers and classnames where they are instantiated so this information
can be printed out during planner failures.
Added c.f.FlowSkipStrategy interface to allow for pluggable rules for when to skip executing a c.f.Flow participating
in a c.c.Cascade. The default implementation is c.f.FlowSkipIfSinkStale, with an optional c.f.FlowSkipIfSinkExists.
Setting a skip strategy on a Cascade overrides all Flow instance strategies.
Fixed bug with c.t.Tuple#remove() method not correctly removing values from Tuple.
Updated c.t.Tap api to support c.t.SinkMode enums. This opens up ability to support appends in the near future.
Added support for Hadoop 0.19.x. This release skips Hadoop 0.18.x.
Changed project structure so that XML functions live in their own sub-project. This includes renaming the base
Cascading tree and jars to 'core'.
Fixed bug that prevented Fields.UNKNOWN input sources from being fed into a c.p.CoGroup for joining.
Changed all operations so that incoming c.t.Tuple and c.t.TupleEntry instances are unmodifiable. An
UnsupportedOperationException will be thrown on any attempt to modify argument tuples within an operation.
This enforces the rule argument tuples should not be modified to protect against concurrent modification in
parallel threads.
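The same contract exists in the JDK collections, which makes the behavior easy to demonstrate without Cascading. In this sketch a List stands in for an argument tuple and the class UnmodifiableArgs is hypothetical; the point is that an operation builds a new result rather than mutating what it was handed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch; a List stands in for an argument tuple.
final class UnmodifiableArgs {
    // Builds a new result instead of modifying the incoming arguments.
    static List<Object> withAppended(List<Object> arguments, Object value) {
        List<Object> copy = new ArrayList<>(arguments);
        copy.add(value);
        return copy;
    }

    // Probes whether the list rejects writes. If the list is modifiable,
    // the probe value is added and false is returned.
    static boolean rejectsMutation(List<Object> arguments) {
        try {
            arguments.add("probe");
            return false;
        } catch (UnsupportedOperationException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        List<Object> arguments =
            Collections.unmodifiableList(new ArrayList<Object>(Arrays.asList("a", 1)));
        System.out.println(rejectsMutation(arguments));     // prints true
        System.out.println(withAppended(arguments, "new")); // prints [a, 1, new]
    }
}
```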
Updated c.o.r.RegexMatcher base class to use j.u.r.Matcher#find() instead of #matches(). This is more consistent
with default behaviors of popular languages. Matcher is now also initialized in prepare() and reset() in
the operation to reduce overhead.
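The difference between the two java.util.regex.Matcher methods is shown below, together with the reuse pattern of creating the Matcher once and calling reset() per value. The class FindVersusMatches is hypothetical and is not the RegexMatcher code; creating the Matcher in the constructor is only an analogy for doing so in a prepare step.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch, not Cascading code.
final class FindVersusMatches {
    private final Matcher matcher;

    FindVersusMatches(String regex) {
        // Created once and reused for every value.
        this.matcher = Pattern.compile(regex).matcher("");
    }

    // True if the pattern occurs anywhere in the value.
    boolean found(String value) {
        return matcher.reset(value).find();
    }

    // True only if the pattern covers the entire value.
    boolean matchesWhole(String value) {
        return matcher.reset(value).matches();
    }

    public static void main(String[] args) {
        FindVersusMatches bs = new FindVersusMatches("b+");
        System.out.println(bs.found("aabbcc"));        // prints true
        System.out.println(bs.matchesWhole("aabbcc")); // prints false
    }
}
```

With find(), a pattern that should cover the whole value needs explicit ^ and $ anchors.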
Added new lifecycle methods to c.o.Operation, prepare and cleanup. These methods are called so that an Operation
instance can initialize and destroy any resources. They may be called more than once before the instance is
garbage collected.
Added a new operation called c.o.Buffer. Buffers are similar to Reduce in MapReduce. They are given an Iterator
of input arguments and can emit any number of result c.t.Tuple instances. For many problems, this is more
efficient than using an c.o.Aggregator operation. Only one c.p.Every pipe with a Buffer operation may
follow a GroupBy or CoGroup.
Fixed dot file writing so GraphViz can properly load.
Upgraded jgrapht library, requires JDK 1.6.
Fixed bug where selecting positions from a c.t.Fields.UNKNOWN declaration would return the first position, not
the specified position.
Renamed c.t.Fields.KEYS to c.t.Fields.GROUP to be consistent with the Cascading model.
Fixed bug where c.t.Tap may inappropriately delete a sink from a task.
Changed c.o.Aggregator to no longer use a Map for the context. Users can now specify custom types by returning
either a new instance from start() or recycling an instance passed into start(). This change will break all existing
implementations of Aggregator. Note, simply setting a new Map<Object,Object> on the call instance in start()
should be sufficient.
Changed all c.o.Function, c.o.Filter, c.o.Aggregator, c.o.ValueAssertion, and c.o.GroupAssertions to accept
a c.f.FlowProcess object on all relevant methods. FlowProcess provides call-backs into the underlying system
to get configuration properties, fire a "keep alive" ping, or increment a custom counter. This change will
break all existing implementations of the above interfaces.
Added ability to set serialization tokens via the cascading.serialization.tokens property. This complements the
c.t.h.SerializationToken annotation.
Optimized co-grouping operation by using c.t.IndexTuple instead of a nested c.t.Tuple.
Changed c.t.Tap and c.s.Scheme sink methods to take a c.t.TupleEntry, instead of c.t.Fields and c.t.Tuple
individually.
Added the c.t.h.SerializationToken Java Annotation. This allows for an int value to be written during serialization
instead of a Class name for custom objects nested in c.t.Tuple instances. This feature should dramatically reduce
the size of Tuples saved in SequenceFiles, and improve the general performance during 'shuffling' between Map and
Reduce stages.
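A rough sense of the size difference can be had from a plain-Java sketch that writes a class name and an int to a byte stream. The encoding used here (DataOutputStream writeUTF and writeInt) is an assumption for illustration and is not necessarily the format Cascading or Hadoop writes; the class name com.example.MyType is made up.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

final class TokenSizeSketch {

    // Bytes needed to write a class name as a length-prefixed UTF string.
    static int bytesForClassName(String className) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeUTF(className);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    // Bytes needed to write an int token in place of the class name.
    static int bytesForToken(int token) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeInt(token);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.size();
    }

    public static void main(String[] args) {
        System.out.println("class name: " + bytesForClassName("com.example.MyType") + " bytes");
        System.out.println("int token:  " + bytesForToken(128) + " bytes");
    }
}
```

The saving applies to every custom object written, so it grows with the number of Tuples in the stream.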
Added c.t.h.TupleSerialization, a Hadoop Serialization implementation. Tuple is no longer Hadoop Writable
and now relies on TupleSerialization for serialization support. Subsequently, nested objects in c.t.Tuple
only need to be j.l.Comparable. For them to be serialized properly, a Serialization implementation must be
registered with Hadoop. Note all primitive types are handled directly by Tuple, but custom types must
have a Serialization implementation, or must be Hadoop WritableComparable so that the default WritableSerialization
implementation will write them out.
0.8.3
Fix for c.p.CoGroup declared fields being generated out of order.
0.8.2
Added new properties via c.f.FlowConnector.setJarClass and c.f.FlowConnector.setJarPath for
setting the application jar file.
Fixed bug where job jar was not being inherited by subsequent MapReduce jobs when the first job was executed
in local mode.
Fixed bug where unserializable Operations were being squashed internally. c.f.Flow instances will now
fail immediately and be marked as 'failed'.
0.8.1
Fixed bug where c.t.Lfs did not force local mode for current MapReduce step.
Fixed bug where writing to a c.t.TupleCollector would fail if using a c.s.SequenceFile in some cases.
Added a few minor improvements to reduce stray object creations, and speedup c.t.Tuple serialization.
0.8.0
Updated c.o.x.TagSoupParser to accept 'features', use these features to recover past behaviors.
Updated janino and tagsoup libraries to 2.5.15 and 1.2, respectively. Note that tagsoup, in theory, is not
backwards compatible by default. See their release notes: http://home.ccil.org/~cowan/XML/tagsoup/#1.2
Added some forward compatible changes for supporting Hadoop 0.18 at the API level. Currently there are other
issues preventing some tests from passing on Hadoop 0.18.
Changed c.f.FlowException to return the parent c.f.Flow name.
Changed behavior of c.f.MultiMapReducePlanner to use c.t.h.MultiInputFormat to allow single Mappers
to support many different Hadoop InputFormat types simultaneously. This deprecates the need to normalize
sources to a map and reduces the number of jobs in a c.f.Flow in some cases.
Changed behavior of Cascading to allow for multiple paths from the same c.t.Tap source to be co-grouped on
via c.p.CoGroup. This allows for a kind of self-join where each stream is processed by a different operation
path within the Mapper.
Added c.o.f.And, c.o.f.Or, c.o.f.Xor, and c.o.f.Not logic operator c.o.Filter implementations. They should be used
to compose more complex filters from existing implementations.
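The idea of composing filters with logic operators can be sketched in plain Java, with j.u.f.Predicate standing in for c.o.Filter. The helper names are hypothetical and do not reflect the constructors of the Cascading classes.

```java
import java.util.function.Predicate;

// Hypothetical stand-in: Predicate plays the role of a Filter.
final class LogicFilterSketch {

    static <T> Predicate<T> and(Predicate<T> a, Predicate<T> b) {
        return a.and(b);
    }

    static <T> Predicate<T> or(Predicate<T> a, Predicate<T> b) {
        return a.or(b);
    }

    // True when exactly one of the two predicates is true.
    static <T> Predicate<T> xor(Predicate<T> a, Predicate<T> b) {
        return value -> a.test(value) ^ b.test(value);
    }

    static <T> Predicate<T> not(Predicate<T> a) {
        return a.negate();
    }

    public static void main(String[] args) {
        Predicate<Integer> isEven = n -> n % 2 == 0;
        Predicate<Integer> isPositive = n -> n > 0;
        // Composite built from two existing predicates.
        Predicate<Integer> evenAndNotPositive = and(isEven, not(isPositive));
        System.out.println(evenAndNotPositive.test(-4));
        System.out.println(evenAndNotPositive.test(4));
    }
}
```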
Changed the behavior of c.o.BaseOperation to properly initialize itself if it is a c.o.Filter instance. This
removes the requirement that Filter implementations must set declaredFields to Fields.ALL, as it makes no
sense for a Filter to declare fields.
Added c.f.PlannerException, a subclass of c.f.FlowException, and updated c.f.MultiMapReducePlanner to throw
it on failures. Functionality of writing DOT files has been moved from FlowException to PlannerException.
Added c.o.f.FilterNotNull and c.o.f.FilterNull filter classes.
Changed c.f.MultiMapReducePlanner to fail if it encounters an c.p.Each to c.p.Every chain. In these cases, a
c.p.Group type must be between them.
Deleted c.o.Cut class as it was effectively a duplicate of c.o.Identity.
Changed c.f.MultiMapReducePlanner to fail if a c.p.GroupAssertion is not accompanied by another c.o.Aggregator
operation. This is required so that the GroupAssertion does not change the passing tuple stream if it is planned out.
Changed c.f.MultiMapReducePlanner to no longer insert new c.p.Each( ..., new Identity(), ... ) as a place holder.
Renamed c.p.PipeAssembly to c.p.SubAssembly to better reflect its purpose, which is to encapsulate reusable
pipe assemblies in the same manner as a sub-process or sub-routine. A temporary c.p.PipeAssembly class has been
provided for backwards compatibility.
Fixed bug where c.t.TapCollector would throw an NPE if a custom Tap was not using paths.
Changed behavior of c.f.Flow where if a c.f.FlowListener throws an exception, the Flow instance receiving the
exception will stop (by calling Flow.stop()). Listeners will continue to fire as expected and Flow.complete()
will re-throw the thrown exception (as was the original behavior).
Added ability to set a Cascading specific temporary directory path for use by intermediate taps created
within c.f.Flow instances. Use c.t.Hfs.setTemporaryDirectory() to configure.
Fixed bug where the 'mapred.jar' property was being stepped on if previously set by the calling application.
Changed c.t.Tap and c.f.Flow to return c.t.TupleIterator and c.t.TupleCollector instead of c.t.TapIterator and
c.t.TapCollector, respectively.
Added c.t.Tap.flowInit( c.f.Flow flow ) to allow a given tap to know what flows it is participating in. It is called
immediately after the Flow instance is initialized.
Fixed bug with nested c.p.PipeAssembly instances where some nested assemblies threw an internal error from
the planner.
Changed c.o.Debug to accept a prefix text string that will be prefixed to every message.
Fixed bug where c.f.MultiMapReducePlanner would fail when normalizing inputs to a group where the inputs
passed through one or more splits.
Fixed bug where c.p.CoGroup silently stepped on input pipes with the same input name.
0.7.1
Fixed bug in c.f.MultiMapReducePlanner where a source used on more than one c.p.Group would cause an internal
error during planning.
Changed c.f.MultiMapReducePlanner to normalize heterogeneous sinks.
Changed c.f.MultiMapReducePlanner to keep a splitting c.p.Each on the previous step, instead of being duplicated
on each branch. If the Each is preceded by a source c.t.Tap, it will be duplicated across branches to reduce
the number of steps in the Flow.
Fixed bug in c.f.MultiMapReducePlanner where too many temp tap instances were being inserted while normalizing
the flow sources.
Changed c.t.Fields to fail if given duplicate field names.
Changed behavior if Hadoop FileInputSplit is not used and property "map.input.file" is not set. If there is one
source, it will be returned as the source for the mapper stack, otherwise an exception is thrown. Subsequently, joins
and merges of non-file sources are not supported until a discriminator can be passed to the mapper.
Fixed bug in c.t.Tuple where NPE was thrown under certain compareTo operations.
Fixed bug that prevented CoGrouping or Merging on the same source even though it was one or more Groupings away.
0.7.0
Changed project structure, removed 'examples' sub-project.
Updated to support Hadoop 0.17.x. This version is not API compatible with any Hadoop version less than 0.17.0.
Added ability to stop all c.f.Flows executing within a c.c.Cascade instance via the stop() method.
Changed c.f.FlowConnector to only take a Map of properties. These properties are passed downstream to various
subsystems. This removes the Hadoop JobConf constructor, but it still can be passed as a property value. Also
properties will be pushed into a default JobConf, bypassing any direct JobConf coupling in applications.
Changed c.f.Flow to automatically register a shutdown hook killing remote jobs on vm exit.
Changed c.f.Flow.stop() to immediately stop all running jobs.
Changed c.o.Operation to an interface and introduced c.o.BaseOperation. This makes creating custom Operation types
more flexible and intuitive. c.o.Filter, c.o.Function, c.o.Aggregator, and c.o.Assertion now extend c.o.Operation.
Added c.p.c.OuterJoin, c.p.c.MixedJoin, c.p.c.LeftJoin, and c.p.c.RightJoin c.p.c.CoGrouper classes. They
complement the default c.p.c.InnerJoin CoGrouper class.
Added support for passing an intermediateSchemeClass to the underlying planner to be used as the default c.s.Scheme
for intermediate c.t.Tap instances internal to a given c.f.Flow.
Fixed bug where c.p.Group is immediately followed by another c.p.Group (or their sub-classes) and fields could not
be resolved between them.
Added support for c.t.Tap instances implementing c.f.FlowListener. If implemented, they will automatically be
added to the Flow event listeners collection and will receive Flow events.
Fixed case where multiple source c.t.Tap instances return true for the containsFile method. Now verifies only one
Tap contains the file, and fails otherwise.
Changed c.s.TextLine to not set numSinkParts to 1 by default. Now uses the natural number of parts.
Changed MapReduce planner to force an intermediate file between branches with Hadoop incompatible source Taps
on joins/merges. If the taps are compatible (have same Scheme), all branches will be processed in same Mapper
before the c.p.Group.
Added merge capabilities in c.p.GroupBy. This allows multiple input branches to be grouped as if a single stream.
Fixed bug in c.t.TapCollector where writing to a Sequence file threw a NPE.
Added c.f.MapReduceFlow to support custom MapReduce jobs, allowing them to participate in a Cascade job.
0.6.1
Changed thrown c.f.FlowException instances to include cause message.
Fixed bug where empty sink or source map was not detected.
0.6.0
Changed default argument selector for c.p.Every to be Fields.ALL, to be consistent with the default value of c.p.Each.
Added support for assembly traps. If an exception is thrown from inside an c.o.Operation, the offending Tuple
can be saved to a file for later processing, allowing the job to complete.
Added support for stream assertions. STRICT and VALID assertions can be built into a pipe assembly, and optionally
planned out during runtime. Assertions will throw exceptions if they fail.
Changed c.o.a.First, Last, Min, and Max to optionally ignore specified values. Useful if you do not wish
for a 'default' value to be considered first, or last in a set.
Changed c.o.a.Sum to take a Class for coercion of the result value.
Changed c.o.a.Max and Min to use infinity as initial values so zero is bigger than a really small number
for Max, and zero is smaller than a really big number for Min.
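Why the initial value matters can be shown with a small plain-Java sketch. The class and method names are hypothetical, and double is used only for illustration; the sketch contrasts seeding a maximum with negative infinity against seeding it with zero.

```java
// Hypothetical illustration; not the Cascading aggregator classes.
final class InfinityInitSketch {

    // Seeds with negative infinity, so any real value replaces the seed.
    static double max(double... values) {
        double result = Double.NEGATIVE_INFINITY;
        for (double value : values) {
            result = Math.max(result, value);
        }
        return result;
    }

    // Seeds with positive infinity, so any real value replaces the seed.
    static double min(double... values) {
        double result = Double.POSITIVE_INFINITY;
        for (double value : values) {
            result = Math.min(result, value);
        }
        return result;
    }

    // Seeds with zero: wrong when every value is negative.
    static double maxSeededWithZero(double... values) {
        double result = 0.0;
        for (double value : values) {
            result = Math.max(result, value);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(max(-5.0, -2.0));
        System.out.println(maxSeededWithZero(-5.0, -2.0));
    }
}
```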
Changed order of JobConf initialization. c.f.FlowStep now is added to the JobConf last in order to catch
all lazily configured values.
Changed compile to include debug info by default.
Fixed bug in c.t.MultiTap where super scheme was not returned if available.
0.5.0
Added skipIfSinkExists property to c.f.Flow. Set to true if the c.c.Cascade should skip the Flow instance even
if the sink is stale and not set to be deleted on initialization.
Fixed bug in c.t.h.HttpFileSystem that URL escaped the ? prefixing the query string.
Fixed bug where a join with duplicate taps was not recognized during job planning. Now an appropriate error
message is displayed, instead of jobs completing with only one instance of the resource stream.
Fixed c.t.h.HttpFileSystem to remember authority information in the url and prefix it when missing.
Changed c.s.TextLine to accept either one or two source fields. If one, only the 'line' value
is sourced from the value, discarding the 'offset' value.
Added c.o.r.RegexSplitGenerator to support splitting single tuple values into multiple tuples based on a regex
delimiter. Includes new tests.
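The splitting behavior can be approximated in plain Java as follows. The method name is hypothetical, each list element stands in for one emitted single-value tuple, and how the real class treats empty or trailing values is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical stand-in: each list element represents one emitted tuple.
final class RegexSplitSketch {

    // Splits one value on a regex delimiter into separate values.
    static List<String> splitToValues(String value, String delimiterRegex) {
        Pattern delimiter = Pattern.compile(delimiterRegex);
        return new ArrayList<>(Arrays.asList(delimiter.split(value)));
    }

    public static void main(String[] args) {
        for (String part : splitToValues("a, b,c", "\\s*,\\s*")) {
            System.out.println(part);
        }
    }
}
```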
Added c.s.CascadeStats and c.s.FlowStats to provide access to current state and statistics of particular
Cascade, Flow, or the child Flows of a Cascade.
Added ability to sort grouping values with sort argument on c.p.GroupBy. Sorts can be reversed.
Added c.o.e.ExpressionFilter, the c.o.Filter analog to c.o.e.ExpressionFunction.
0.4.1
Fixed path normalization regex in c.u.Util where it munged any path starting with file:///.
0.4.0
Changed c.p.GroupBy default grouping fields to c.t.Fields.ALL from Fields.FIRST. This change provides a simple
way to sort a tuple stream based on the order of the tuple fields.
Changed c.f.FlowConnector to create c.f.Flow instances that will bypass the reducer if no c.p.Group is participating
in the assembly. Previously Group instances were inserted if missing. This allows a chain of c.p.Every instances
to be used to process/filter a tuple stream without invoking the reducer needlessly (if a sort isn't required).
This change also supports bypassing the default Hadoop OutputCollector in the mapper via the sink c.t.Tap instance.
Changed c.f.FlowStep behavior to run in 'local' mode if either the sink or source tap is a c.t.Lfs instance. This
allows for c.f.Flow instances to run mixed if configured to execute on a particular cluster by default. This behavior
supports complex import/export processes against the HDFS or other supported remote filesystem.
Changed behavior of c.t.Dfs to force use of HDFS. Previously Dfs would default to the local FileSystem
if the job was run in 'local' mode. Now a Dfs instance will cause failures if it cannot connect to an HDFS cluster.
Using c.t.Hfs will provide previous Dfs behavior. Hfs will use the 'default' filesystem if a scheme is not present
in the 'stringPath' (i.e. hdfs://host:port/some/path).
Added c.stats package to allow for collecting statistics of Cascades, Flows, and FlowSteps.
Updated c.f.Flow and c.c.Cascade log messages to be easier to follow when executing many flow instances
simultaneously.
Added compression flag to c.s.TextLine. Can now toggle compression (Hadoop style compression) per Tap instance.
This prevents clusters with compression enabled by default from exporting text files with a .deflate extension.
Added support for bypassing Hadoop OutputCollector via Tap.setUseTapCollector() method. Setting to true will force
Cascading to use the c.t.TapCollector instead. This bypasses bugs in Hadoop with custom FileSystem types. This will
always be true for http(s) and s3tp filesystems when using a c.t.Hfs Tap type (at least until HADOOP-3021 is resolved).
Added c.t.TupleCollector, complementing c.t.TupleIterator, for directly writing Tuple instances out via a c.t.Tap
instance.
Added c.f.FlowListener so that c.f.Flow instances can fire events on starting, completed, and throwable.
Changed c.t.h.S3HttpFileSystem so it can now create files remotely.
Renamed cascading.spill.threshold to cascading.cogroup.spill.threshold, so there is less chance of collision.
Made numerous optimizations to improve overall performance. Namely split and merge of key/value tuples to remove
redundancy in the stream between the mapper and reducer.
Changed c.p.Operators to push c.o.Operation results directly through to next operation without intermediate
collection. This should improve pipelining of large result streams and lower runtime memory footprint.
Changed c.c.Cascade so it now runs Flows in parallel if Hadoop is clustered, and there are no dependencies between the
Flows.
Moved c.Cascade and related classes to c.cascade package. Wanted to preempt any future ugliness.
Added support in c.t.h.S3HttpFileSystem for these properties: fs.s3tp.awsAccessKeyId and fs.s3tp.awsSecretAccessKey
0.3.0
Added ability to push Log4j logger properties to mapper/reducer via JobConf.
Use jobConf.set("log4j.logger","logger1=LEVEL,logger2=LEVEL")
Added missing equals() and hashCode() in c.t.MultiTap.
Added c.t.h.ZipInputFormat (and ZipSplit) to support zip files. c.s.TextLine supports transparent
reading of zip files if the filename ends with .zip, but cannot write to them. This code is
loosely based on HADOOP-1824. If the underlying filesystem is hdfs or file, splits will be created
for each ZipEntry. Otherwise ZipEntries are iterated over to be more stream friendly. Progress status is
supported.
Added http, https, and s3tp read-only file systems to Hadoop. Use these URLs, respectively:
http://, https://, and s3tp://AWS_ACCESS_KEY_ID:AWS_SECRET_ACCESS_KEY@bucket-name/key
Added c.o.t.DateFormatter supporting text formatting of time stamps created by c.o.t.DateParser.
Fixed bug where in complex assemblies, some Scopes were not resolved.
Fixed bug where tap instances were not being inserted before some CoGroup joins if there was a previous Group in the
assembly.
Upgraded JGraphT to 0.7.3
Changed c.t.SpillableTupleList to allow for iteration across entries.
Changed c.f.FlowException to optionally allow for printing of underlying pipe graph for debugging.
Added c.o.t.FieldFormatter function to format Tuples into complex strings using j.u.Formatter formatting.
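The kind of j.u.Formatter formatting involved can be sketched in plain Java. The class and method names are hypothetical, the values array stands in for the argument Tuple, and the use of Locale.ROOT is an assumption made to keep the output stable.

```java
import java.util.Formatter;
import java.util.Locale;

// Hypothetical stand-in: the values array plays the role of a Tuple.
final class FieldFormatterSketch {

    // Formats the values into a single string using a Formatter pattern.
    static String formatValues(String format, Object... values) {
        try (Formatter formatter = new Formatter(Locale.ROOT)) {
            return formatter.format(format, values).toString();
        }
    }

    public static void main(String[] args) {
        System.out.println(formatValues("%s scored %d points", "alice", 42));
    }
}
```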
Added c.o.a.Last aggregator to find the last value encountered in a group.
Changed c.o.a.Max and c.o.a.Min to maintain original value type. Will return null if no values are encountered.
Changed c.o.a.First to use Fields.ARG by default. Removed Fields constructor.
Added c.t.Fields.join(Fields...) method to allow for joining multiple Fields instances into a new instance.
Can retrieve Tuple values by field name through the TupleEntry class via the get(String) method.
Added c.t.TupleCollector interface to simplify the operation interfaces.
Added a Debug filter that will print to either stderr or stdout. Useful for debugging stream transformations.
Added CascadingTestCase base test class
Added Insert Function that allows for literal values to be inserted into the Tuple stream.
0.2.0