This repository has been archived by the owner on May 12, 2021. It is now read-only.

TAJO-2087: Support DirectOutputCommitter for AWS S3 file system #979

Open
wants to merge 49 commits into
base: master

Conversation

blrunner
Contributor

Here is prototype code for DirectOutputCommitter. This PR is not ready for review; it only shows my approach to implementing DirectOutputCommitter. The current version works as follows:

  • Register the commit history in the catalog (TODO).
  • Each task writes its output data directly to the final location.
  • In the commit phase, delete existing files according to the query type: first back up the existing files or directories to the staging directory, then delete the backed-up files or directories.
  • Update the status of the commit history in the catalog (TODO).
  • If the query fails, QueryMaster deletes the committed files and updates the status of the query history in the catalog (TODO).
  • When TajoMaster starts, it checks the status of the query histories in the catalog. If it finds a running query, it deletes the committed files and updates the status of that query history (TODO).
  • Add unit test cases for failed queries (TODO).
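
The backup, delete, and rollback steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code in this PR: local `java.nio.file` paths stand in for the S3 file system, the class and method names (`DirectCommitSketch`, `backupExisting`, `deleteBackup`, `rollback`) are hypothetical, and it assumes that files written by the current query can be recognized by a query-specific file name prefix.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical sketch: local paths stand in for the S3 file system.
class DirectCommitSketch {

  // Lists the direct children of dir, or nothing if dir does not exist.
  static List<Path> list(Path dir) throws IOException {
    if (!Files.isDirectory(dir)) {
      return List.of();
    }
    try (Stream<Path> children = Files.list(dir)) {
      return children.collect(Collectors.toList());
    }
  }

  // Assumption: output of the current query carries a query-specific prefix.
  static boolean isNewOutput(Path file, String queryPrefix) {
    return file.getFileName().toString().startsWith(queryPrefix);
  }

  // Commit, step 1: move files that existed before the query into staging.
  static void backupExisting(Path outputDir, Path stagingDir, String queryPrefix)
      throws IOException {
    Files.createDirectories(stagingDir);
    for (Path p : list(outputDir)) {
      if (!isNewOutput(p, queryPrefix)) {
        Files.move(p, stagingDir.resolve(p.getFileName()));
      }
    }
  }

  // Commit, step 2: the new output is final, so the backups can be deleted.
  static void deleteBackup(Path stagingDir) throws IOException {
    for (Path p : list(stagingDir)) {
      Files.delete(p);
    }
    Files.deleteIfExists(stagingDir);
  }

  // Failure path: remove the query's output and restore the backed-up files.
  static void rollback(Path outputDir, Path stagingDir, String queryPrefix)
      throws IOException {
    for (Path p : list(outputDir)) {
      if (isNewOutput(p, queryPrefix)) {
        Files.delete(p);
      }
    }
    for (Path p : list(stagingDir)) {
      Files.move(p, outputDir.resolve(p.getFileName()));
    }
    Files.deleteIfExists(stagingDir);
  }
}
```

In this sketch a successful overwrite calls `backupExisting` and then `deleteBackup`, while a failed query calls `rollback` instead of `deleteBackup`, so the previous data survives until the new output is known to be complete.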

@blrunner blrunner changed the title TAJO-2087: Implement DirectOutputCommitter TAJO-2087: Support DirectOutputCommitter for AWS S3 file system Mar 16, 2016
@blrunner
Contributor Author

I designed the direct output commit history table as follows:

Column Name | Column Type   | Null     | Description
QUERY_ID    | VARCHAR(128)  | NOT NULL | ID of the query (PRIMARY KEY)
PATH        | VARCHAR(4096) | NOT NULL | Output path of the table
START_TIME  | BIGINT        | NOT NULL | Query start time
END_TIME    | BIGINT        |          | Query finish time
QUERY_STATE | VARCHAR(50)   | NOT NULL | State of the query; TajoProtos.QueryState will be used

I implemented the necessary code in Query and TajoMaster as follows:

  • When a query starts, add its history to the catalog.
  • If the query fails, QueryMaster deletes the committed files and updates the status of the output commit history in the catalog.
  • When TajoMaster starts, it checks the status of the output commit histories in the catalog. If it finds a running query, it deletes the committed files and updates the status of that output commit history.

Unit test cases for failed queries are not yet implemented.
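
As a rough illustration of the startup check described above, the sketch below keeps commit history rows in memory and marks every query that is still in a running state as failed after cleaning up its output. It is a simplified stand-in, not the code in this PR: the class `CommitHistoryRecoverySketch`, the `OutputCleaner` callback, and the `State` enum are hypothetical names, an in-memory map replaces the catalog, and the real column would store a `TajoProtos.QueryState` value.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: an in-memory map stands in for the catalog table.
class CommitHistoryRecoverySketch {

  // Simplified stand-in for TajoProtos.QueryState.
  enum State { RUNNING, SUCCEEDED, FAILED }

  // One row of the commit history table.
  static final class History {
    final String queryId;   // QUERY_ID
    final String path;      // PATH
    final long startTime;   // START_TIME
    Long endTime;           // END_TIME, null while the query is running
    State state;            // QUERY_STATE

    History(String queryId, String path, long startTime) {
      this.queryId = queryId;
      this.path = path;
      this.startTime = startTime;
      this.state = State.RUNNING;
    }
  }

  // Hypothetical hook that removes the files a query wrote under a path.
  interface OutputCleaner {
    void deleteCommittedFiles(String path);
  }

  private final Map<String, History> catalog = new LinkedHashMap<>();

  // Called when a query starts.
  void register(String queryId, String path, long now) {
    catalog.put(queryId, new History(queryId, path, now));
  }

  // Called when a query finishes, successfully or not.
  void finish(String queryId, State state, long now) {
    History h = catalog.get(queryId);
    h.state = state;
    h.endTime = now;
  }

  History get(String queryId) {
    return catalog.get(queryId);
  }

  // Called at master startup: a row still marked RUNNING belongs to a query
  // that was interrupted, so its output is removed and the row is closed.
  List<String> recover(OutputCleaner cleaner, long now) {
    List<String> recovered = new ArrayList<>();
    for (History h : catalog.values()) {
      if (h.state == State.RUNNING) {
        cleaner.deleteCommittedFiles(h.path);
        h.state = State.FAILED;
        h.endTime = now;
        recovered.add(h.queryId);
      }
    }
    return recovered;
  }
}
```

Keeping the state in the catalog rather than in memory is what makes the restart case recoverable: after a crash, the only evidence that a query was writing to a final location is its history row.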

@blrunner
Contributor Author

I implemented unit test cases that verify the following:

  • DirectOutputCommitter recovers existing files successfully when a query fails.
  • DirectOutputCommitter removes output files successfully when a query fails.
  • When executing an INSERT INTO query, DirectOutputCommitter keeps the existing files.

For reference, the outputs of TestInsertQuery and TestTablePartitions with DirectOutputCommitter were equal to their outputs without DirectOutputCommitter.

@blrunner
Contributor Author

Testing with S3 finished successfully for the following cases:

  • Table types: partitioned and non-partitioned tables.
  • INSERT OVERWRITE and CTAS queries.
  • Insert into a table without DirectOutputCommitter, then insert into the same table with DirectOutputCommitter, and then check the output files of both queries.
  • While inserting data, kill the query, and then check two things: temporary data is deleted and the previous data is rolled back.
  • While inserting data, restart the Tajo cluster, and then check two things: temporary data is deleted and the previous data is rolled back.

…into direct-output-committer

Conflicts:
	tajo-core-tests/src/test/java/org/apache/tajo/engine/query/TestTablePartitions.java
…into direct-output-committer

Conflicts:
	tajo-catalog/tajo-catalog-server/src/main/resources/schemas/mariadb/mariadb.xml
	tajo-core/src/main/java/org/apache/tajo/master/TajoMaster.java