Crawler Commons' http-fetcher is a Java library that provides the common page-fetching functionality needed by web crawlers. It currently uses the Apache HttpClient library for low-level HTTP communication.
http-fetcher requires Java 11 or later.
An example of creating a fetcher with five threads that will only accept content identified by the server as text/html:
// Imports assume the crawlercommons.fetcher package layout; adjust them to match the http-fetcher version in use.
import java.util.HashSet;
import java.util.Set;

import crawlercommons.fetcher.BaseFetchException;
import crawlercommons.fetcher.BaseFetcher;
import crawlercommons.fetcher.FetchedResult;
import crawlercommons.fetcher.http.SimpleHttpFetcher;
import crawlercommons.fetcher.http.UserAgent;

// The data passed to UserAgent is used to automatically build the HTTP 'User-Agent' header
UserAgent userAgent = new UserAgent.Builder()
        .setAgentName("MyCrawler")
        .setCrawlerVersion("1.0")
        .setWebAddress("www.mycrawler.com/bot.html")
        .build();

// Instantiate the BaseFetcher object used to fetch pages
BaseFetcher fetcher = new SimpleHttpFetcher(5, userAgent);

// Configure the accepted MIME types
Set<String> validMimeTypes = new HashSet<>();
validMimeTypes.add("text/html");
fetcher.setValidMimeTypes(validMimeTypes);

try {
    // Fetch the web page
    FetchedResult result = fetcher.get("http://localhost:8089/");

    // Read the downloaded content (additional data is available via the remaining FetchedResult methods)
    String requestedUrl = result.getBaseUrl();   // the requested URL (same as above)
    String finalUrl = result.getFetchedUrl();    // the final URL after redirects (if any)
    byte[] page = result.getContent();           // the page data returned by the server as a byte array
    long fetchTime = result.getFetchTime();      // the time taken to download the page
    String address = result.getHostAddress();    // the host address
} catch (BaseFetchException e) {
    // The download has failed. Check the actual subclass of BaseFetchException for error details.
}
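
getContent() returns the page body as raw bytes. As a follow-up, the sketch below decodes those bytes into a String and separates a failed fetch from an empty page. It reuses only the calls shown above plus standard Java; the hard-coded UTF-8 charset and the variable names are illustrative assumptions, not part of the http-fetcher API, and a real crawler should derive the charset from the response headers instead.

import java.nio.charset.StandardCharsets;

// Continues from the snippet above: 'fetcher' is the configured SimpleHttpFetcher.
try {
    FetchedResult result = fetcher.get("http://localhost:8089/");
    byte[] page = result.getContent();

    // Assumption: the page is UTF-8 encoded. In practice, take the charset from
    // the Content-Type response header rather than hard-coding it.
    String html = new String(page, StandardCharsets.UTF_8);

    if (html.isEmpty()) {
        System.out.println("Empty body returned for " + result.getFetchedUrl());
    } else {
        System.out.println("Fetched " + page.length + " bytes from " + result.getFetchedUrl());
    }
} catch (BaseFetchException e) {
    // The concrete subclass of BaseFetchException describes what went wrong;
    // log it and decide whether to retry or skip the URL.
    System.err.println("Fetch failed: " + e.getClass().getSimpleName() + ": " + e.getMessage());
}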