Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement OD verifier algorithm #420

Open
wants to merge 5 commits into
base: main
Choose a base branch
from

Conversation

vano105
Copy link

@vano105 vano105 commented May 27, 2024

Implement novel algorithm for validating canonical ODs (order dependencies). This algorithm receives as input left and right column indices, context, and a flag to determine dependency order (ascending/descending). The algorithm outputs rows violated by swaps or splits.
Add unit tests for OD verification.

For more information about canonical ODs: http://www.vldb.org/pvldb/vol10/p721-szlichta.pdf

vano105 added 4 commits May 27, 2024 20:43
Introducing a novel algorithm to validate canonical order dependencies.
This algorithm takes as input left and right column indices, context,
and a flag indicating whether the dependency is ascending or descending.
As output, it identifies rows where the dependency is violated through
swaps or splits. Additionally, new options such as context and the ascending
flag have been introduced. These parameters serve as inputs for the newly
added algorithm. Furthermore, access rights in the class
algos::fastod::ComplexStrippedPartition have been modified from private
to protected. This adjustment was necessary as I developed a new class
inheriting functionality from algos::fastod::ComplexStrippedPartition,
requiring access to its private fields.
@@ -2,3 +2,4 @@

#include "algorithms/od/fastod/fastod.h"
#include "algorithms/od/order/order.h"
#include "algorithms/od/od_verifier/od_verifier.h"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

od_verifier should be mentioned in verification_algorithms.h

void ODVerifier::LoadDataInternal() {
relation_ = ColumnLayoutRelationData::CreateFrom(*input_table_, is_null_equal_null_);

if (relation_->GetColumnData().empty()) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You should maintain consistent style of writing braces.

}

unsigned long long ODVerifier::ExecuteInternal() {
auto start_time = std::chrono::system_clock::now();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to use existing methods. Use util::TimedInvoke (example).

split
split,
/* Canonical OD verifier algorithm */
od_verifier
Copy link
Collaborator

@yakovypg yakovypg May 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that OD verifier algorithms should be mentioned after OD mining algorithms.

@@ -11,7 +11,7 @@ using AlgorithmTypes =
Apriori, metric::MetricVerifier, DataStats, fd_verifier::FDVerifier, HyUCC,
PyroUCC, cfd::FDFirstAlgorithm, ACAlgorithm, UCCVerifier, Faida, Spider, Mind,
Fastod, GfdValidation, EGfdValidation, NaiveGfdValidation, order::Order,
dd::Split>;
dd::Split, od_verifier::ODVerifier>;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think OD verifier algorithms should be mentioned after OD mining algorithms.


// Returns the number of rows that violate the OD by split
size_t GetNumRowsViolateBySplit() const {
return row_violate_ods_by_split_.size();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to implement methods in .cpp.


// Returns the number of rows that violate the OD by swap
size_t GetNumRowsViolateBySwap() const {
return row_violate_ods_by_swap_.size();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to implement methods in .cpp.

@@ -0,0 +1,57 @@
#include "partition.h"

#include <strings.h>
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this include necessary? If not, remove it.


for (size_t i = group_begin + 1; i < group_end; i++) {
if (data_->GetValue((*sp_indexes_)[i], right) != group_value) {
violates.emplace_back(std::pair<int, int>(right, (*sp_indexes_)[i]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can omit creating std::pair due to using emplace_back.


namespace algos::od_verifier {

std::vector<std::pair<int, int>> Partition::CommonViolationBySplit(model::ColumnIndex right) const {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It is better to have using X = std::pair<int, int> for clarity of what std::pair<int, int> is.

@@ -76,7 +76,9 @@ BETTER_ENUM(AlgorithmType, char,
order,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that OD mining algorithms should be mentioned next to each other, so you can move 'order'.

#include "config/common_option.h"

namespace config {
extern CommonOption<AscendingODFlagType> const kAscendingODOpt;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'AscendingOD' option is useful only for OD verifier and won't be needed by other algorithms. So I think that it should be removed. Instead, you can use Option as shown here.

#include "config/indices/type.h"

namespace config {
extern CommonOption<IndicesType> const kODContextOpt;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'ODContext' option is useful only for OD verifier and won't be needed by other algorithms. So I think that it should be removed. Instead, you can use Option as shown here.


// checks whether OD is violated and finds the rows where it is violated
template <bool Ascending>
void VerifyOD();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Template methods should be implemented in header file.

Copy link
Collaborator

@yakovypg yakovypg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General comments about tests:

  • You should test not only dependencies of the form {X}: A ~ B, but also {X}: [] -> A.
  • You should test dependencies that contain several attributes in context.


struct ODVerifyingParams {
algos::StdParamsMap params;
size_t const row_violate_ods_by_split = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I understand correctly that this field stores the number of rows that violate the order dependency by split? If so, the field should be renamed.

struct ODVerifyingParams {
algos::StdParamsMap params;
size_t const row_violate_ods_by_split = 0;
size_t const row_violate_ods_by_swap = 0;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did I understand correctly that this field stores the number of rows that violate the order dependency by swap? If so, the field should be renamed.

// input data
config::InputTable input_table_;
config::EqNullsType is_null_equal_null_;
IndicesType lhs_indices_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canonical ODs contain only one attribute (or don't contain attributes) in left side. Why lhs_indices_ type is std::vector?

config::InputTable input_table_;
config::EqNullsType is_null_equal_null_;
IndicesType lhs_indices_;
IndicesType rhs_indices_;
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Canonical ODs contain only one attribute in right side. Why rhs_indices_ type is std::vector?


namespace algos::od_verifier {

class Partition : protected algos::fastod::ComplexStrippedPartition {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Name of this class should be more specific.

}

if (!is_first_group && values[prev_group_max_index].second > second) {
violates.push_back(std::pair<int, int>(right, row_pos[i]));
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use emplace_back.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants