Commit

Merge pull request #23 from aecio/dev
Dependency upgrades and maintenance changes
aecio authored Aug 19, 2023
2 parents 5b438bf + 449feaa commit ba325bf
Showing 6 changed files with 103 additions and 26 deletions.
6 changes: 6 additions & 0 deletions .github/dependabot.yml
@@ -0,0 +1,6 @@
version: 2
updates:
- package-ecosystem: maven
directory: "/"
schedule:
interval: weekly
23 changes: 23 additions & 0 deletions .github/workflows/build.yml
@@ -0,0 +1,23 @@
name: crawler-commons-http-fetcher build

on: [push]

jobs:
build:
runs-on: ubuntu-latest
strategy:
matrix:
java: [ 8, 11, 17 ]
name: Java ${{ matrix.java }}
steps:
- uses: actions/checkout@v2

- name: Setup JDK
uses: actions/setup-java@v2
with:
distribution: 'temurin'
java-version: ${{ matrix.java }}
cache: 'maven'

- name: Build
run: mvn install javadoc:aggregate
18 changes: 18 additions & 0 deletions CHANGES.txt
@@ -8,3 +8,21 @@ Current Development 0.1-SNAPSHOT
- Port HttpClient code from crawler-commons (kkrugler) #1
- Allow configuration of HTTP proxies (aecio via kkrugler) #8
- Allow custom cookie store (aecio via kkrugler)
- Enable GitHub's dependabot (aecio)
- Enable GitHub Actions CI (aecio)
- Set Java version to 1.8 (aecio)
- Fix invalid HTML5 tags in Javadoc (aecio)
- Bump commons-io from 2.4 to 2.11.0
- Bump httpclient from 4.5.8 to 4.5.13
- Bump jetty version from 9.3.6.v20151106 to 9.4.44.v20210927
- Bump mockito-core from 1.8.0 to 4.2.0
- Bump junit from 4.7 to 4.13.2
- Bump forbiddenapis from 1.8 to 3.3
- Bump mockito-core from 4.2.0 to 4.6.1
- Bump Jetty from 9.4.44.v20210927 to 9.4.48.v20220622
- Bump maven-javadoc-plugin from 2.9.1 to 3.4.0
- Bump maven-source-plugin from 2.1.2 to 3.2.1
- Bump maven-surefire-plugin from 2.12 to 2.22.2
- Bump maven-release-plugin from 2.5.1 to 2.5.3
- Bump slf4j-api from 1.7.7 to 1.7.36
- Bump slf4j-log4j12 from 1.7.32 to 1.7.33
36 changes: 30 additions & 6 deletions README.md
@@ -1,14 +1,38 @@
# http-fetcher
Wrapper code for Apache HttpClient that provides common page fetching functionality

TODO - add more context here.
The Crawler Commons' http-fetcher is a Java library that provides common page fetching functionality needed in web crawlers.
Currently, it uses Apache HttpClient library to implement low-level HTTP communication.

An example of creating a fetcher with five threads that will only accept content identified by the server as text/html:
## Requirements
Currently, http-fetcher requires Java 8+.

## API

An example of creating a fetcher with five threads that will only accept content identified by the server as `text/html`:

``` java
BaseFetcher fetcher = new SimpleHttpFetcher(1, new UserAgent("mycrawler", "[email protected]", "http://domain.com"));
Set<String> validMimeTypes = new HashSet<String>();
// Data passed to UserAgent will be used to automatically create HTTP header 'User-Agent'
UserAgent userAgent = new UserAgent("mycrawler", "[email protected]", "http://domain.com");

// Instantiate the BaseFetcher object used to fetch pages
BaseFetcher fetcher = new SimpleHttpFetcher(1, userAgent);

// Configure the accepted mime-types
Set<String> validMimeTypes = new HashSet<>();
validMimeTypes.add("text/html");
fetcher.setValidMimeTypes(validMimeTypes);
FetchedResult result = fetcher.get("http://localhost:8089/");

try {
// Fetch the web page from the Web
FetchedResult result = fetcher.get("http://localhost:8089/");

// Read downloaded content (additional data is available via remaining methods from FetchedResult object)
String requestedUrl = result.getBaseUrl(); // the requested URL (same as above)
String finalUrl = result.getFetchedUrl(); // the final URL after redirects (if any)
byte[] page = result.getContent(); // the page data returned by server as a byte array
long fetchTime = result.getFetchTime(); // the time taken to download the page
String address = result.getHostAddress(); // the host address
} catch (BaseFetchException e) {
// The download has failed. Check the actual subclass of BaseFetchException to get error details.
}
```
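
The README example above restricts the fetcher to `text/html` through `setValidMimeTypes`. The sketch below illustrates what such a check could look like against a raw `Content-Type` header. It is not taken from http-fetcher: the class `MimeTypeFilter` and its methods are hypothetical names, and treating an empty set as "accept everything" is an assumption made for illustration only.

``` java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of http-fetcher: shows one way to compare
// a server-supplied Content-Type header against a set of accepted mime-types.
public class MimeTypeFilter {

    // Strip parameters such as "; charset=UTF-8" and normalize case.
    public static String baseMimeType(String contentType) {
        if (contentType == null) {
            return "";
        }
        int semicolon = contentType.indexOf(';');
        String base = semicolon >= 0 ? contentType.substring(0, semicolon) : contentType;
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Assumption: a null or empty set means no filtering is applied.
    public static boolean isAccepted(String contentType, Set<String> validMimeTypes) {
        if (validMimeTypes == null || validMimeTypes.isEmpty()) {
            return true;
        }
        return validMimeTypes.contains(baseMimeType(contentType));
    }

    public static void main(String[] args) {
        Set<String> validMimeTypes = new HashSet<>(Arrays.asList("text/html"));
        System.out.println(isAccepted("text/html; charset=UTF-8", validMimeTypes)); // prints "true"
        System.out.println(isAccepted("application/pdf", validMimeTypes));          // prints "false"
    }
}
```

Comparing only the part before the first `;` keeps a header such as `text/html; charset=UTF-8` from being rejected because of its charset parameter. The sketch uses Java 8 syntax only, in line with the Java version set by this commit.
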
36 changes: 21 additions & 15 deletions pom.xml
@@ -92,6 +92,12 @@
<name>Avi Hayun</name>
<email>[email protected]</email>
</developer>

<developer>
<id>aecio</id>
<name>Aécio Santos</name>
<email>[email protected]</email>
</developer>
</developers>

<build>
@@ -221,7 +227,7 @@
<plugin>
<groupId>de.thetaphi</groupId>
<artifactId>forbiddenapis</artifactId>
<version>1.8</version>
<version>3.3</version>
<configuration>
<!-- disallow undocumented classes like sun.misc.Unsafe: -->
<internalRuntimeForbidden>true</internalRuntimeForbidden>
@@ -335,24 +341,24 @@

<properties>
<!-- Dependencies -->
<httpclient.version>4.5.8</httpclient.version>
<commons-io.version>2.4</commons-io.version>
<slf4j-api.version>1.7.7</slf4j-api.version>
<httpclient.version>4.5.13</httpclient.version>
<commons-io.version>2.11.0</commons-io.version>
<slf4j-api.version>1.7.36</slf4j-api.version>

<!-- Dependencies for testing -->
<slf4j-log4j12.version>1.7.7</slf4j-log4j12.version>
<junit.version>4.7</junit.version>
<mockito-core.version>1.8.0</mockito-core.version>
<jetty.version>9.3.6.v20151106</jetty.version>
<slf4j-log4j12.version>1.7.33</slf4j-log4j12.version>
<junit.version>4.13.2</junit.version>
<mockito-core.version>4.6.1</mockito-core.version>
<jetty.version>9.4.48.v20220622</jetty.version>

<!-- Maven Plugin Dependencies -->
<maven-compiler-plugin.version>2.3.2</maven-compiler-plugin.version>
<maven-resources-plugin.version>2.5</maven-resources-plugin.version>
<maven-jar-plugin.version>2.4</maven-jar-plugin.version>
<maven-surfire-plugin.version>2.12</maven-surfire-plugin.version>
<maven-release-plugin.version>2.5.1</maven-release-plugin.version>
<maven-source-plugin.version>2.1.2</maven-source-plugin.version>
<maven-javadoc-plugin.version>2.9.1</maven-javadoc-plugin.version>
<maven-surfire-plugin.version>2.22.2</maven-surfire-plugin.version>
<maven-release-plugin.version>2.5.3</maven-release-plugin.version>
<maven-source-plugin.version>3.2.1</maven-source-plugin.version>
<maven-javadoc-plugin.version>3.4.0</maven-javadoc-plugin.version>
<maven-gpg-plugin.version>1.4</maven-gpg-plugin.version>
<apache-rat-plugin.version>0.8</apache-rat-plugin.version>
<maven-assembly-plugin.version>2.2.2</maven-assembly-plugin.version>
@@ -361,9 +367,9 @@

<!-- General Properties -->
<implementation.build>${scmBranch}@r${buildNumber}</implementation.build>
<javac.src.version>1.7</javac.src.version>
<javac.target.version>1.7</javac.target.version>
<maven.compiler.target>1.7</maven.compiler.target>
<javac.src.version>1.8</javac.src.version>
<javac.target.version>1.8</javac.target.version>
<maven.compiler.target>1.8</maven.compiler.target>
<maven.build.timestamp.format>yyyy-MM-dd HH:mm:ssZ</maven.build.timestamp.format>
<skipTests>false</skipTests>
<assembly.finalName>${project.build.finalName}</assembly.finalName>
10 changes: 5 additions & 5 deletions src/main/java/crawlercommons/fetcher/http/UserAgent.java
@@ -25,11 +25,11 @@
* User Agent enables us to describe characteristics of any http-fetcher
* agent. There are a number of constructor options to describe the following:
* <ol>
* <li><tt>_agentName</tt>: Primary agent name.</li>
* <li><tt>_emailAddress</tt>: The agent owners email address.</li>
* <li><tt>_webAddress</tt>: A web site/address representing the agent owner.</li>
* <li><tt>_browserVersion</tt>: Broswer version used for compatibility.</li>
* <li><tt>_crawlerVersion</tt>: Version of the user agents personal crawler. If
* <li><code>_agentName</code>: Primary agent name.</li>
* <li><code>_emailAddress</code>: The agent owners email address.</li>
* <li><code>_webAddress</code>: A web site/address representing the agent owner.</li>
* <li><code>_browserVersion</code>: Broswer version used for compatibility.</li>
* <li><code>_crawlerVersion</code>: Version of the user agents personal crawler. If
* this is not set, it defaults to the http-fetcher maven artifact version.</li>
* </ol>
*
