This repository has been archived by the owner on Nov 30, 2018. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathREADME.txt
88 lines (61 loc) · 3.16 KB
/
README.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
Welcome to HBase-Writer README.
This document can also be found online here:
http://code.google.com/p/hbase-writer/wiki/README
Specific versions of HBase-Writer now support different
version combinations of Heritrix and HBase. Please refer to
http://code.google.com/p/hbase-writer/wiki/VERSIONS
for a more detailed list.
= INTRODUCTION =
This is a processor for Heritrix that writes fetched pages to HBase.
The layout of this contribution is modeled after Doug Judds'
heritrix-hadoop-dfs-processor available off the heritrix home page.
This software is licensed under the LGPL. See accompanying LICENSE.txt document.
The HBase-Writer project is an extension to the Heritrix open
source crawler written by the Internet Archive (http://crawler.archive.org/)
that enables it to store crawled content directly into HBase tables (http://hbase.org/) running on the Hadoop Distributed FileSystem (http://lucene.apache.org/hadoop/).
HBase-Writer writes crawled content into a given hbase table as records.
In turn, these tables are directly supported by the Map/Reduce framework via HBase so Map/Reduce jobs can be done on them.
= GETTING STARTED =
HBase-Writer now supports Heritrix 2 and 3. Please refer to the corresponding
READMEHeritrix*.txt files for specific instructions.
http://code.google.com/p/hbase-writer/wiki/READMEHeritrix3
http://code.google.com/p/hbase-writer/wiki/READMEHeritrix2
= COMPILING THE SOURCE =
mvn clean compile
= BUILDING THE JAR =
mvn clean package
The hbase-writer-x.x.x.jar should be in the target/ directory.
You can get the hadoop, hbase and log4j dependency jars from your ${HOME}/.m2/repository/ directory.
For example:
cp ${HOME}/.m2/repository/org/apache/hadoop/hbase/0.20.1/hbase-0.20.1.jar ${HERITRIX_HOME}/lib/
cp ${HOME}/.m2/repository/org/apache/hadoop/zookeeper/3.2.1/zookeeper-3.2.1.jar ${HERITRIX_HOME}/lib/
cp ${HOME}/.m2/repository/org/apache/hadoop/hadoop-core/0.20.1/hadoop-core-0.20.1.jar ${HERITRIX_HOME}/lib/
cp ${HOME}/.m2/repository/log4j/log4j/1.2.16/log4j-1.2.16.jar ${HERITRIX_HOME}/lib/
= UPGRADING TO NEW HADOOP/HBASE/HERITRIX VERSIONS =
To build hbase-writer with new versions of hadoop, hbase or heritrix (or any of the dependencies), use a ${HOME}/.m2/settings.xml file.
A sample settings.xml file:
{{{
<?xml version="1.0" encoding="UTF-8"?>
<settings>
<profiles>
<profile>
<id>myBuild</id>
<properties>
<heritrix.version>3.1.0</heritrix.version>
<hbase.version>0.90.5</hbase.version>
<hadoop.version>0.20.205.0</hadoop.version>
<zookeeper.version>3.3.2</zookeeper.version>
</properties>
</profile>
</profiles>
</settings>
}}}
Place this file in your ${HOME}/.m2/ directory and run the maven build command:
mvn clean package -PmyBuild
= CONTRIBUTING SOURCE =
If you would like to contribute a patch to help improve the source code, please feel free to create an Issue in the Issues section on the project website.
Just attach your patch to the issue and a committer will merge the changes as soon as possible. Thank you.
= BUILDING THE SITE REPORT =
mvn clean site
= PING BACK =
Thanks to the Open Source community for all the support for releasing this project.