-
Notifications
You must be signed in to change notification settings - Fork 72
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Implement OD verifier algorithm #420
base: main
Are you sure you want to change the base?
Conversation
Introducing a novel algorithm to validate canonical order dependencies. This algorithm takes as input left and right column indices, context, and a flag indicating whether the dependency is ascending or descending. As output, it identifies rows where the dependency is violated through swaps or splits. Additionally, new options such as context and the ascending flag have been introduced. These parameters serve as inputs for the newly added algorithm. Furthermore, access rights in the class algos::fastod::ComplexStrippedPartition have been modified from private to protected. This adjustment was necessary as I developed a new class inheriting functionality from algos::fastod::ComplexStrippedPartition, requiring access to its private fields.
@@ -2,3 +2,4 @@ | |||
|
|||
#include "algorithms/od/fastod/fastod.h" | |||
#include "algorithms/od/order/order.h" | |||
#include "algorithms/od/od_verifier/od_verifier.h" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
od_verifier should be mentioned in verification_algorithms.h
void ODVerifier::LoadDataInternal() { | ||
relation_ = ColumnLayoutRelationData::CreateFrom(*input_table_, is_null_equal_null_); | ||
|
||
if (relation_->GetColumnData().empty()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You should maintain consistent style of writing braces.
} | ||
|
||
unsigned long long ODVerifier::ExecuteInternal() { | ||
auto start_time = std::chrono::system_clock::now(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to use existing methods. Use util::TimedInvoke (example).
split | ||
split, | ||
/* Canonical OD verifier algorithm */ | ||
od_verifier |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that OD verifier algorithms should be mentioned after OD mining algorithms.
@@ -11,7 +11,7 @@ using AlgorithmTypes = | |||
Apriori, metric::MetricVerifier, DataStats, fd_verifier::FDVerifier, HyUCC, | |||
PyroUCC, cfd::FDFirstAlgorithm, ACAlgorithm, UCCVerifier, Faida, Spider, Mind, | |||
Fastod, GfdValidation, EGfdValidation, NaiveGfdValidation, order::Order, | |||
dd::Split>; | |||
dd::Split, od_verifier::ODVerifier>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think OD verifier algorithms should be mentioned after OD mining algorithms.
|
||
// Returns the number of rows that violate the OD by split | ||
size_t GetNumRowsViolateBySplit() const { | ||
return row_violate_ods_by_split_.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to implement methods in .cpp.
|
||
// Returns the number of rows that violate the OD by swap | ||
size_t GetNumRowsViolateBySwap() const { | ||
return row_violate_ods_by_swap_.size(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to implement methods in .cpp.
@@ -0,0 +1,57 @@ | |||
#include "partition.h" | |||
|
|||
#include <strings.h> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this include necessary? If not, remove it.
|
||
for (size_t i = group_begin + 1; i < group_end; i++) { | ||
if (data_->GetValue((*sp_indexes_)[i], right) != group_value) { | ||
violates.emplace_back(std::pair<int, int>(right, (*sp_indexes_)[i])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can omit creating std::pair due to using emplace_back.
|
||
namespace algos::od_verifier { | ||
|
||
std::vector<std::pair<int, int>> Partition::CommonViolationBySplit(model::ColumnIndex right) const { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is better to have using X = std::pair<int, int>
for clarity of what std::pair<int, int> is.
@@ -76,7 +76,9 @@ BETTER_ENUM(AlgorithmType, char, | |||
order, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that OD mining algorithms should be mentioned next to each other, so you can move 'order'.
#include "config/common_option.h" | ||
|
||
namespace config { | ||
extern CommonOption<AscendingODFlagType> const kAscendingODOpt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'AscendingOD' option is useful only for OD verifier and won't be needed by other algorithms. So I think that it should be removed. Instead, you can use Option
as shown here.
#include "config/indices/type.h" | ||
|
||
namespace config { | ||
extern CommonOption<IndicesType> const kODContextOpt; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
'ODContext' option is useful only for OD verifier and won't be needed by other algorithms. So I think that it should be removed. Instead, you can use Option
as shown here.
|
||
// checks whether OD is violated and finds the rows where it is violated | ||
template <bool Ascending> | ||
void VerifyOD(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Template methods should be implemented in header file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
General comments about tests:
- You should test not only dependencies of the form
{X}: A ~ B
, but also{X}: [] -> A
. - You should test dependencies that contain several attributes in context.
src/tests/test_od_verifier.cpp
Outdated
|
||
struct ODVerifyingParams { | ||
algos::StdParamsMap params; | ||
size_t const row_violate_ods_by_split = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did I understand correctly that this field stores the number of rows that violate the order dependency by split? If so, the field should be renamed.
src/tests/test_od_verifier.cpp
Outdated
struct ODVerifyingParams { | ||
algos::StdParamsMap params; | ||
size_t const row_violate_ods_by_split = 0; | ||
size_t const row_violate_ods_by_swap = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Did I understand correctly that this field stores the number of rows that violate the order dependency by swap? If so, the field should be renamed.
// input data | ||
config::InputTable input_table_; | ||
config::EqNullsType is_null_equal_null_; | ||
IndicesType lhs_indices_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canonical ODs contain only one attribute (or don't contain attributes) in left side. Why lhs_indices_
type is std::vector?
config::InputTable input_table_; | ||
config::EqNullsType is_null_equal_null_; | ||
IndicesType lhs_indices_; | ||
IndicesType rhs_indices_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canonical ODs contain only one attribute in right side. Why rhs_indices_
type is std::vector?
|
||
namespace algos::od_verifier { | ||
|
||
class Partition : protected algos::fastod::ComplexStrippedPartition { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name of this class should be more specific.
} | ||
|
||
if (!is_first_group && values[prev_group_max_index].second > second) { | ||
violates.push_back(std::pair<int, int>(right, row_pos[i])); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use emplace_back
.
Implement novel algorithm for validating canonical ODs (order dependencies). This algorithm receives as input left and right column indices, context, and a flag to determine dependency order (ascending/descending). The algorithm outputs rows violated by swaps or splits.
Add unit tests for OD verification.
For more information about canonical ODs: http://www.vldb.org/pvldb/vol10/p721-szlichta.pdf