Issue 13: Initial Check in of the S3 Connector #19

Merged · 65 commits · Jul 15, 2021
Changes from 62 commits
Commits (65)
3271b61
Initial code base commit
karansinghneu May 5, 2021
92f185b
Fixing bugs in the previous PR
karansinghneu May 13, 2021
172d96e
Adding latest event listener logs
karansinghneu May 13, 2021
efe7de7
Fixing license header
karansinghneu Jun 8, 2021
74b580b
Fixing license
karansinghneu Jun 9, 2021
62c395c
Removing log directory and fixing license
karansinghneu Jun 9, 2021
756463a
Removing comments & fixing shell scripts
karansinghneu Jun 9, 2021
0eb70ba
Removing comments from shell scripts
karansinghneu Jun 9, 2021
8924b9e
Adding modified build file
karansinghneu Jun 16, 2021
fa86480
Removing log directory
karansinghneu Jun 16, 2021
9c1c0c3
Change integration tests to use scality s3 server
chipmaurer Jun 16, 2021
edb3629
Adding old gradle build
karansinghneu Jun 16, 2021
02abfba
Merge branch 'presto-karan-s3' of https://github.com/karansinghneu/pr…
chipmaurer Jun 16, 2021
c8875da
Down to 2 failed tests. TestDropTable and TestDropTableJson
chipmaurer Jun 17, 2021
3e72564
Set integration test run to include --info flag
chipmaurer Jun 17, 2021
1d88ff9
Use a profile for aws commands. Make sure to pull schema registry be…
chipmaurer Jun 18, 2021
1c26dbf
Put docker ps commands in startup to ensure containers are running
chipmaurer Jun 18, 2021
b2f9303
Use curl to test for s3 server readiness
chipmaurer Jun 24, 2021
a716810
Use Bash for shell
chipmaurer Jun 24, 2021
3cd7ee0
Add some debug and change setup method annotation
chipmaurer Jun 28, 2021
551e0d5
Force presto to reload tables after s3 server and schema registry con…
chipmaurer Jun 28, 2021
4e542e4
Show firewall and port info
chipmaurer Jun 28, 2021
e5f6c57
More debugging
chipmaurer Jun 28, 2021
fb6dd18
Fix syntax error
chipmaurer Jun 28, 2021
008669f
.github/workflows/s3-build.yml
chipmaurer Jun 28, 2021
4f45c85
More changes to action file
chipmaurer Jun 28, 2021
e48732b
More changes to action file - part 2
chipmaurer Jun 28, 2021
d2eb687
More changes to action file - part 2
chipmaurer Jun 28, 2021
b0b3f50
More changes to action file - part 3
chipmaurer Jun 28, 2021
b64bdf3
More changes to action file - part 4
chipmaurer Jun 28, 2021
3836927
Still trying to get this to work
chipmaurer Jun 28, 2021
c1e2d88
Still trying to get this to work - 1
chipmaurer Jun 28, 2021
67b7595
Make sure aws cli is install
chipmaurer Jun 28, 2021
3bd0904
Replace aws usage with s3curl
chipmaurer Jun 29, 2021
cf47f97
Turn on debugging in the integration test scripts
chipmaurer Jun 29, 2021
1280db2
Make scripts dumber
chipmaurer Jun 30, 2021
0e54feb
Force a schema reset before running tests
chipmaurer Jun 30, 2021
21b3009
Hopefully, this time it will really force a reset of the schema befor…
chipmaurer Jun 30, 2021
678cf9e
Revert back to starting containers in github action yml file
chipmaurer Jul 1, 2021
19b9248
Run integration test in same step as starting containers
chipmaurer Jul 1, 2021
a0f5bc0
Use network host instead of -p for docker run
chipmaurer Jul 1, 2021
3904688
Add network config check
chipmaurer Jul 1, 2021
4ee1fc9
Try to use docker proxy, and dump bucket contents
chipmaurer Jul 1, 2021
224a6d3
Fix yml error
chipmaurer Jul 1, 2021
1ecfaf7
Use localhost instead of 127.0.0.1 - yes, it's dumb but I'm running o…
chipmaurer Jul 1, 2021
1783ab0
Start containers with --bind arg
chipmaurer Jul 1, 2021
e24b314
Remove --bind arg
chipmaurer Jul 1, 2021
f444b1a
Replace localhost with 127.0.0.1
chipmaurer Jul 1, 2021
01a8531
Back out the yml stuff
chipmaurer Jul 1, 2021
506ea6f
add a stupid sleep
chipmaurer Jul 1, 2021
279ed7b
Make sure s3_start REALLY finishes
chipmaurer Jul 1, 2021
17735cc
Go back to starting containers in yml file
chipmaurer Jul 1, 2021
98c9a23
Fail setup on error. Catch listBuckets exception for slow systems du…
chipmaurer Jul 1, 2021
9025561
OK, maybe waiting up to 30 seconds isn't enough. Now waiting up to 2…
chipmaurer Jul 2, 2021
b5d58e5
Run docker logs if can't create bucket
chipmaurer Jul 2, 2021
ff3b5e0
Fix logic errors in setup and s3_start. Add more debug to s3 start s…
chipmaurer Jul 2, 2021
3b1a31e
Make sure hmac digest is installed
chipmaurer Jul 2, 2021
bbf92fd
Don't use sudo
chipmaurer Jul 2, 2021
b19200e
Move install of hmac to action file. Replace sh with bash in java se…
chipmaurer Jul 2, 2021
49646c8
Well, I guess sudo doesn't work
chipmaurer Jul 2, 2021
d3e1895
So, sudo does work, my bad
chipmaurer Jul 2, 2021
addf00c
Updating the README
chipmaurer Jul 2, 2021
d843147
Gradle file cleanup
chipmaurer Jul 6, 2021
d94e1a3
Fix a couple gradle issues, and integration test issues, and depracat…
chipmaurer Jul 6, 2021
68828e6
Applied comments from PR review
chipmaurer Jul 7, 2021
60 changes: 60 additions & 0 deletions .github/workflows/s3-build.yml
@@ -0,0 +1,60 @@
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

name: Java CI

on: [push]

env:
  BUILD_CACHE_PATH: ./*

jobs:
  build:
    name: Build
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - name: Set up JDK 8
        uses: actions/setup-java@v1
        with:
          java-version: 8
      - name: Build Output Cache
        uses: actions/[email protected]
        with:
          path: ${{env.BUILD_CACHE_PATH}}
          key: ${{github.run_id}}
      - name: Compile
        run: ./gradlew build

  integration:
    name: Setup and run integration tests
    needs: build
    runs-on: ubuntu-latest
    steps:
      - name: Build Output Cache
        uses: actions/[email protected]
        with:
          path: ${{env.BUILD_CACHE_PATH}}
          key: ${{github.run_id}}
      - name: Install libdigest-hmac-perl
        run: sudo apt-get install -y libdigest-hmac-perl
      - name: Integration Test Gradle Run
        run: ./gradlew --info -Pintegration test

  build_and_test_complete:
    name: CI Complete
    needs: [build, integration]
    runs-on: ubuntu-latest
    steps:
      - name: Check Build Status
        run: echo build, unit and integration tests successful.
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
/var/
166 changes: 165 additions & 1 deletion README.md
@@ -1 +1,165 @@
# presto-s3-connector
# S3 Presto connector

Presto is a distributed SQL query engine for big data. Presto uses connectors to query data from different storage sources. This repository contains the code for a connector (the S3 Presto connector) that queries data from many S3 compatible object stores. In many ways, the S3 connector is similar to the Hive connector, but it is specific to S3 storage and does not require configuration of, or access to, a Hive metastore.

Amazon S3, or Amazon Simple Storage Service, is a service offered by Amazon Web Services (AWS) that provides object storage through a web service interface. Numerous vendors supply S3 alternatives to AWS, and their products comply with the S3 protocol. S3 can typically store any type of object, which allows for uses such as storage for Internet applications, backup and recovery, disaster recovery, data archives, data lakes for analytics, and hybrid cloud storage.

See the [User Manual](https://prestodb.github.io/docs/current/) for Presto deployment instructions and end user documentation.

## Types of S3 Servers Evaluated

The S3 Presto connector has been evaluated with the following S3 compatible object storage servers:

- Dell Technologies Elastic Cloud Storage [ECS](https://www.delltechnologies.com/en-us/storage/ecs/index.htm)
- Scality Cloudserver [Scality](https://www.scality.com/)
- Minio Object Storage [MinIO](https://min.io/)

## Requirements

To build and run the S3 Presto connector, you must meet the following requirements:

* Linux
* To build: Java 11+ 64-bit. Both Oracle JDK and OpenJDK are supported (we build using Java 11 JDK but with Java 8 compatibility)
* To run: Java 8 Update 151 or higher (8u151+), 64-bit. Both Oracle JDK and OpenJDK are supported.
* Gradle 6.5.1+ (for building)
* Python 2.7+ (for running with the launcher script)
* S3 Compatible Storage server mentioned above
* Pravega Schema Registry version 0.2.0 or higher is recommended

## Building S3 Presto connector

S3 Presto connector is a standard Gradle project. Simply run the following command from the project root directory:

```shell
[root@lrmk226 ~]# ./gradlew clean build
```

On the first build, Gradle will download all the dependencies from various locations on the internet and cache them in the local repository (`~/.gradle/caches`), which can take a considerable amount of time. Subsequent builds will be faster.

S3 Presto connector has a set of unit tests that can take a few minutes to run. You can run the tests using this command:

```shell
[root@lrmk226 ~]# ./gradlew test
```

S3 Presto connector has a more comprehensive set of integration tests that take longer to run than the unit tests. You can run the integration tests using this command:

```shell
[root@lrmk226 ~]# ./gradlew test -Pintegration
```

The `--info` argument can be passed on the command line for more information during the integration test run.

## Installing Presto

If you haven't already done so, install the Presto server and default connectors on one or more Linux hosts. Instructions for downloading, installing and configuring Presto can be found here: https://prestodb.io/docs/current/installation/deployment.html.

When using the tar.gz Presto bundle downloaded from Maven Central, Presto can be installed in any location. Determine a location with sufficient available storage space. Using the wget command, download the gzip'd tar file from Maven using the link defined in the PrestoDB deployment section, similar to the steps below:

```shell
[root@lrmk226 ~]# pwd
/root
[root@lrmk226 ~]# wget https://repo1.maven.org/maven2/com/facebook/presto/presto-server/0.248/presto-server-0.248.tar.gz
[root@lrmk226 ~]# tar xvzf presto-server-0.248.tar.gz
[root@lrmk226 ~]# export PRESTO_HOME=/root/presto-server-0.248
```

Make a directory for the Presto configuration files:

```shell
[root@lrmk226 ~]# mkdir $PRESTO_HOME/etc
```

Now follow the directions to create the necessary configuration files for configuring Presto found in the PrestoDB documentation.

Note that if you are also running with Java 11, you may have to add the following to your etc/jvm.config:

```
-Djdk.attach.allowAttachSelf=true
```

## Installing and Configuring S3 Connector

The plugin file that gets created during the build process is `./build/distributions/s3-presto-connector-<VERSION>.tar.gz`. This file can be untar'd in the `$PRESTO_HOME/plugin` directory of a valid Presto installation. Like all Presto connectors, the S3 Presto connector uses a properties file to point to the storage provider (e.g. S3 server IP and port). Create a properties file similar to the one below, but replace the `#` characters with the appropriate IP addresses of the S3 storage server and the Pravega Schema Registry server of your configuration.

```shell
[root@lrmk226 ~]# cd $PRESTO_HOME/plugin
[root@lrmk226 ~]# ls *.gz
s3-presto-connector-0.1.0.tar.gz
[root@lrmk226 ~]# tar xvfz s3-presto-connector-0.1.0.tar.gz
[root@lrmk226 ~]# cat $PRESTO_HOME/etc/catalog/s3.properties
s3.s3SchemaFileLocationDir=etc/s3
s3.s3Port=9020
s3.s3UserKey=<S3USER>
s3.s3UserSecretKey=<S3SECRETKEY>
s3.s3Nodes=##.###.###.###,##.###.###.###

s3.schemaRegistryServerIP=##.###.###.###
s3.schemaRegistryPort=9092
s3.schemaRegistryNamespace=s3-schemas

s3.maxConnections=500
s3.s3SocketTimeout=5000
s3.s3ConnectionTimeout=5000
s3.s3ClientExecutionTimeout=5000
```

More options for the S3 secret key will be available in a future release.

If you have deployed Presto on more than one host (coordinator and one or more workers), you must download/copy the S3 connector gzip tar file to each node, and create the configuration properties file on all hosts.

## Running Presto Server

As mentioned in the PrestoDB documentation, use the launcher tool to start the Presto server on each node.
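For example, a typical start/status/stop cycle looks like the transcript below (paths assume the `$PRESTO_HOME` layout created earlier; adjust for your installation):

```shell
[root@lrmk226 ~]# $PRESTO_HOME/bin/launcher start
[root@lrmk226 ~]# $PRESTO_HOME/bin/launcher status
[root@lrmk226 ~]# $PRESTO_HOME/bin/launcher stop
```

Run these on the coordinator and on every worker node.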

## Running Presto in your IDE

After building Presto for the first time, you can load the project into your IDE and run the server in your IDE. We recommend using [IntelliJ IDEA](http://www.jetbrains.com/idea/). Because S3 Presto connector is a standard Gradle project, you can import it into your IDE. In IntelliJ, choose Import Project from the Quick Start box and point it to the root of the source tree. IntelliJ will identify the *.gradle files and prompt you to confirm.

After opening the project in IntelliJ, double check that the Java SDK is properly configured for the project:

* Open the File menu and select Project Structure
* In the SDKs section, ensure that a Java 11+ JDK is selected (create one if none exist)
* In the Project section, ensure the Project language level is set to 8.0 as Presto makes use of several Java 8 language features

Use the following options to create a run configuration that runs the Presto server using the S3 Presto connector:

* Main Class: 'com.facebook.presto.server.PrestoServer'
* VM Options: '-ea -XX:+UseG1GC -XX:G1HeapRegionSize=32M -XX:+UseGCOverheadLimit -XX:+ExplicitGCInvokesConcurrent -Xmx2G -Dconfig=etc/config.properties -Dcom.sun.xml.bind.v2.bytecode.ClassTailor.noOptimize=true -Dlog.levels-file=etc/log.properties'
* Working directory: '$MODULE_DIR$'
* Use classpath of module: 'presto-s3-connector.main'
* Add a 'Before Launch' task - Add a gradle task to run the 'jar' task for the 'presto-s3-connector' Gradle project.

Please note that some versions of IntelliJ do not display VM Options by default. If so, enable them via 'Modify options'.

Modify the s3.properties file in etc/catalog as previously described to point to a running S3 storage server, and a running Schema Registry.
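Once the server is up, queries can be issued through the Presto CLI. A sketch is shown below; the CLI executable name, the default 8080 HTTP port, and the `testdb`/`addressTable` names are assumptions based on the examples in this README, so adjust them for your deployment:

```shell
[root@lrmk226 ~]# ./presto-cli --server localhost:8080 --catalog s3 --schema testdb
presto:testdb> SHOW TABLES;
presto:testdb> SELECT * FROM addressTable;
```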

## Schema Definitions

Optionally, you may manually create schema definitions using a JSON file. The 'CREATE TABLE' and 'CREATE TABLE AS' Presto commands are also available to create ad-hoc tables. The JSON configuration files are read at server startup and should be located in the etc/s3 directory.

In the JSON schema example below, "testdb" is the Presto schema in the S3 catalog and "addressTable" is the name of the table. In "sources", "testbucket" is the name of the bucket, and "testdb/addressTable" is the S3 prefix location of the object data for the table. The objectDataFormat, hasHeaderRow and recordDelimiter settings are self-explanatory.

```json
{
  "schemas": [
    {
      "schemaTableName": {
        "schema_name": "testdb",
        "table_name": "addressTable"
      },
      "s3Table": {
        "name": "addressTable",
        "columns": [
          { "name": "Name", "type": "VARCHAR" },
          { "name": "Address", "type": "VARCHAR" }
        ],
        "sources": { "testbucket": ["testdb/addressTable"] },
        "objectDataFormat": "csv",
        "hasHeaderRow": "false",
        "recordDelimiter": "\n"
      }
    }
  ]
}
```
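Because a malformed schema file is only discovered at server startup, it can save time to validate the JSON before copying it into etc/s3. A minimal sketch, assuming the scratch path `/tmp/addressTable.json` (the path and file name are illustrative):

```shell
# Write the example schema definition to a scratch file (path is illustrative)
cat > /tmp/addressTable.json <<'EOF'
{
  "schemas": [
    {
      "schemaTableName": {
        "schema_name": "testdb",
        "table_name": "addressTable"
      },
      "s3Table": {
        "name": "addressTable",
        "columns": [
          { "name": "Name", "type": "VARCHAR" },
          { "name": "Address", "type": "VARCHAR" }
        ],
        "sources": { "testbucket": ["testdb/addressTable"] },
        "objectDataFormat": "csv",
        "hasHeaderRow": "false",
        "recordDelimiter": "\n"
      }
    }
  ]
}
EOF
# python3 -m json.tool exits non-zero on invalid JSON, so this only
# prints the success message when the file parses cleanly
python3 -m json.tool /tmp/addressTable.json > /dev/null && echo "schema JSON is valid"
```

If validation succeeds, copy the file into etc/s3 on each Presto node and restart the server so it is picked up.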


## Tests
The S3 Presto connector has two types of tests:
* unit tests
  * all unit tests run during developer builds
* integration tests
  * by default, only run on the CI server
  * use the `-Pintegration` flag to run them: `./gradlew test -Pintegration`

