Setting up Flume for an ADLS sink
Apache Flume is widely used to move large amounts of log data reliably and efficiently. One of the most frequently used destinations for those logs is HDFS (Hadoop Distributed File System), but it comes with the pain of "Do It Yourself". So here I will explore the option of sending all the logs to ADLS (Azure Data Lake Store). According to Microsoft, Azure Data Lake Store is HDFS for the cloud. Setting up an ADLS sink in Flume gives you the benefit of low maintenance and high reliability at a better scale.
ADLS is an HDFS-compatible file system, and we will take advantage of that in Flume. Flume comes with an HDFS sink which is highly configurable. We will be able to use all of those configurations for pushing logs to ADLS as well, with a few extra properties. Let's see how.
Let's start by setting up a normal HDFS sink in Flume and then modify a few things, and include a few jars, to get it working for ADLS. By the way, to keep things simple we will use a Memory channel and a Sequence Generator source in this article.
The first step is to set up the Flume conf properties file. In FLUME_HOME, under the /conf directory, you can find a template file named flume-conf.properties.template. I have modified it a bit to make it work with hdfs. Here is a sample one which works with hdfs: flume-conf-hdfs.properties. Notable changes are at lines 26, 35 and 49.
- @line26
agent.sinks = hadoop
Here we name the sink we are going to set up.
- @line35
agent.sinks.hadoop.type = hdfs
Here we set the type of sink we want to use. For us it's hdfs.
- @line49
agent.sinks.hadoop.hdfs.path = hdfs://localhost:8020/logs/
Here we set the hdfs path where the logs have to be stored.
- Lines 50-53 contain some of the frequently used hdfs properties.
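For reference, here is a minimal sketch of what such an HDFS configuration could look like, written out from the shell. It is not the exact sample file, so its line numbers do not match the @line references above. The source and channel names (seqGenSrc, memoryChannel), the channel capacity and the roll settings are my assumptions; adjust them to your setup.

```shell
# Assumption: FLUME_HOME points at your Flume installation;
# falls back to the current directory if it is not set.
CONF_DIR="${FLUME_HOME:-.}/conf"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/flume-conf-hdfs.properties" <<'EOF'
# One source, one channel and one sink, all owned by the agent named "agent".
agent.sources = seqGenSrc
agent.channels = memoryChannel
agent.sinks = hadoop

# Sequence Generator source: emits an incrementing counter as events.
agent.sources.seqGenSrc.type = seq
agent.sources.seqGenSrc.channels = memoryChannel

# Memory channel: simple, but buffered events are lost if the agent dies.
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 1000

# HDFS sink.
agent.sinks.hadoop.type = hdfs
agent.sinks.hadoop.channel = memoryChannel
agent.sinks.hadoop.hdfs.path = hdfs://localhost:8020/logs/

# Illustrative values for some commonly used HDFS sink properties.
agent.sinks.hadoop.hdfs.fileType = DataStream
agent.sinks.hadoop.hdfs.writeFormat = Text
agent.sinks.hadoop.hdfs.rollInterval = 300
agent.sinks.hadoop.hdfs.rollSize = 0
agent.sinks.hadoop.hdfs.rollCount = 0
EOF
```

Note that a source takes a plural `channels` property while a sink takes a singular `channel`; mixing them up is an easy mistake to make.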
This configuration file is good enough to set up an HDFS sink in Flume, assuming we have the right jars in the classpath. Later on we will see which jars have to be included. A point to note here is that installing Hadoop on the server is not mandatory for ADLS; including the right jars is sufficient, and that is what we are focusing on.
The only change required in this configuration file for ADLS to work is @line49. Here is the modified configuration file which works with an ADLS sink.
To set up ADLS we haven't changed the sink type; it's still hdfs. This is where the ADLS and HDFS compatibility comes in. The benefit is that the other hdfs-related configurations are valid for ADLS as well.
The part we changed is
- @line49
agent.sinks.hadoop.hdfs.path = adl://accountname.azuredatalakestore.net/logs/
which now points to an adl path rather than an hdfs path.
Here accountname is the name of your ADLS account. If you haven't created one already, please follow this article. Once you have set up the account, you also need three things: the application id, the authentication key and the OAuth 2.0 token endpoint. Flume needs these to connect to your account securely. Here is how to get them.
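Because only the path line differs, the ADLS configuration can be derived from the HDFS one mechanically. Below is a small sketch of that idea. The function name to_adls_conf and the output file name flume-conf-adls.properties are hypothetical, and the /logs/ directory is simply carried over from the example above.

```shell
# Read a Flume config on stdin and rewrite only the hdfs.path line so that
# it points at the given ADLS account. Everything else passes through unchanged.
to_adls_conf() {
  account="$1"   # your ADLS account name (placeholder)
  sed "s|^\(agent\.sinks\.hadoop\.hdfs\.path\) *=.*|\1 = adl://${account}.azuredatalakestore.net/logs/|"
}

# Usage (file names are assumptions):
#   to_adls_conf accountname < conf/flume-conf-hdfs.properties > conf/flume-conf-adls.properties
```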
After setting up the Flume conf properties file, we have to add a new file in the same FLUME_HOME/conf/ directory. This file should be named core-site.xml, and here is a template for it. core-site.xml is Hadoop's configuration file, and it must be present in Flume's classpath. Since we are focusing on a system where a Hadoop installation is not required, we place this xml in Flume's conf directory manually. Otherwise the core-site.xml in the Hadoop conf directory could have been modified, but that's a discussion for some other time. Note that this xml contains only ADLS-specific configuration, since that is all we need. Pay particular attention to the FILL-IN-HERE values, which you must fill in correctly according to your ADLS account.
- @line24: the value for dfs.adls.oauth2.refresh.url is the OAuth 2.0 token endpoint obtained in the previous step.
- @line29: the value for dfs.adls.oauth2.client.id is the application id obtained in the previous step.
- @line34: the value for dfs.adls.oauth2.credential is the authentication key obtained in the previous step.
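As a rough sketch, a core-site.xml along these lines could be written from the shell as shown below. It is not the exact template, so the @line numbers above do not apply to it. The three FILL-IN-HERE properties come from the list above; the fs.adl.impl, fs.AbstractFileSystem.adl.impl and token provider type entries are what I understand the Hadoop 2.8 ADLS module to expect, so treat them as assumptions and check them against the documentation for your Hadoop version.

```shell
# Assumption: FLUME_HOME points at your Flume installation;
# falls back to the current directory if it is not set.
CONF_DIR="${FLUME_HOME:-.}/conf"
mkdir -p "$CONF_DIR"

cat > "$CONF_DIR/core-site.xml" <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <!-- Map the adl:// scheme to the ADLS file system classes. -->
  <property>
    <name>fs.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.AdlFileSystem</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.adl.impl</name>
    <value>org.apache.hadoop.fs.adl.Adl</value>
  </property>
  <!-- Authenticate with the application id and key (client credentials). -->
  <property>
    <name>dfs.adls.oauth2.access.token.provider.type</name>
    <value>ClientCredential</value>
  </property>
  <!-- OAuth 2.0 token endpoint -->
  <property>
    <name>dfs.adls.oauth2.refresh.url</name>
    <value>FILL-IN-HERE</value>
  </property>
  <!-- Application id -->
  <property>
    <name>dfs.adls.oauth2.client.id</name>
    <value>FILL-IN-HERE</value>
  </property>
  <!-- Authentication key -->
  <property>
    <name>dfs.adls.oauth2.credential</name>
    <value>FILL-IN-HERE</value>
  </property>
</configuration>
EOF
```

Since this file holds the authentication key in plain text, it is worth restricting its permissions, for example with `chmod 600`.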
We are done with the ADLS-related configuration; the last remaining step is to include the right jars.
For ADLS to work we need to include a few Hadoop and Azure related jars. These would be unnecessary if Hadoop were already installed on the system, but we will not assume that and will include all the necessary jars. These jars are:
- hadoop-auth-2.8.1.jar
- hadoop-common-2.8.1.jar
- hadoop-azure-datalake-2.8.1.jar
- azure-data-lake-store-sdk-2.2.3.jar
All these jars should be placed in the FLUME_HOME/lib/ directory for Flume to pick them up. Specific versions are given for all of these jars because these are the ones I tested. You can certainly try later versions, but beware of compatibility issues.
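A quick way to confirm that nothing was missed is to check the lib directory before starting the agent. Here is a minimal sketch; the function name missing_jars is hypothetical, and it assumes the four files carry the usual .jar extension and the versions listed above.

```shell
# Print the name of every required jar that is absent from the given lib directory.
# Prints nothing when all four jars are present.
missing_jars() {
  lib_dir="$1"
  for jar in \
    hadoop-auth-2.8.1.jar \
    hadoop-common-2.8.1.jar \
    hadoop-azure-datalake-2.8.1.jar \
    azure-data-lake-store-sdk-2.2.3.jar
  do
    [ -f "$lib_dir/$jar" ] || echo "$jar"
  done
}

# Usage: missing_jars "$FLUME_HOME/lib"
```

This only checks that the files exist. It cannot tell you whether the versions are compatible with each other.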
We are done with setting up Flume for ADLS sink and you can run a test to see whether the logs are being uploaded into ADLS.
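To run the test, the agent can be started with the standard flume-ng launcher. Below is a sketch wrapped in a small function; the function name run_flume_agent and the config file name in the usage line are hypothetical, so substitute whatever you named your ADLS configuration.

```shell
# Start a Flume agent in the foreground.
#   $1: Flume installation directory (FLUME_HOME)
#   $2: path to the agent's properties file
run_flume_agent() {
  flume_home="$1"
  conf_file="$2"
  # --conf points at the directory holding core-site.xml, which puts it on the classpath.
  # --name must match the property prefix used in the config ("agent").
  # The -D option sends logs to the console on log4j 1.x based Flume releases; it is optional.
  "$flume_home/bin/flume-ng" agent \
    --conf "$flume_home/conf" \
    --conf-file "$conf_file" \
    --name agent \
    -Dflume.root.logger=INFO,console
}

# Usage: run_flume_agent "$FLUME_HOME" "$FLUME_HOME/conf/flume-conf-adls.properties"
```

If everything is wired up correctly, files should start to appear under the /logs/ directory of your ADLS account after a short while.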
Please be aware of these common pitfalls.
- ADLS account ACL - Make sure the user running Flume has access to the ADLS account and to the directory where the logs are being uploaded.
- Compatibility issues - If you are using different versions of the jars, you might see errors such as NoSuchMethodError or ClassNotFoundException. Make sure the jars you use are compatible with each other; unfortunately there is no easy way to verify this up front.
- Hadoop already present on system - If Hadoop is already present on the system and you want to use it rather than including your own jars, make sure to first set up ADLS on Hadoop. You can then skip adding core-site.xml to Flume.
- ADLS Source - ADLS or for that matter HDFS source is not available in Flume.
- Authentication errors - Double check your application id, authentication key and OAuth 2.0 token endpoint by following this article.
- ADLS Sink - The sink type in the Flume configuration is hdfs, not adls.