|
| 1 | +# jsoup: Java HTML Parser |
| 2 | + |
| 3 | +**jsoup** is a Java library for working with real-world HTML. It provides a convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods. |
| 4 | + |
| 5 | + |
| 6 | +**jsoup** implements the [WHATWG HTML5](http://whatwg.org/html) specification, and parses HTML to the same DOM as modern browsers do. |
| 7 | + |
| 8 | +* scrape and [parse](https://jsoup.org/cookbook/input/parse-document-from-string) HTML from a URL, file, or string |
| 9 | +* find and [extract data](https://jsoup.org/cookbook/extracting-data/selector-syntax), using DOM traversal or CSS selectors |
| 10 | +* manipulate the [HTML elements](https://jsoup.org/cookbook/modifying-data/set-html), attributes, and text |
| 11 | +* [clean](https://jsoup.org/cookbook/cleaning-html/whitelist-sanitizer) user-submitted content against a safe white-list, to prevent XSS attacks |
| 12 | +* output tidy HTML |
| 13 | + |
| 14 | +jsoup is designed to deal with all varieties of HTML found in the wild. From pristine and validating markup to invalid tag soup, jsoup will create a sensible parse tree. |
| 15 | + |
| 16 | +See [**jsoup.org**](https://jsoup.org/) for downloads and the full [API documentation](https://jsoup.org/apidocs/). |
| 17 | + |
| 18 | +## Example |
| 19 | +Fetch the [Wikipedia](http://en.wikipedia.org/wiki/Main_Page) homepage, parse it to a [DOM](https://developer.mozilla.org/en-US/docs/Web/API/Document_Object_Model/Introduction), and select the headlines from the *In the News* section into a list of [Elements](https://jsoup.org/apidocs/index.html?org/jsoup/select/Elements.html) ([online sample](https://try.jsoup.org/~LGB7rk_atM2roavV0d-czMt3J_g)): |
| 20 | + |
| 21 | +```java |
| 22 | +Document doc = Jsoup.connect("http://en.wikipedia.org/").get(); |
| 23 | +Elements newsHeadlines = doc.select("#mp-itn b a"); |
| 24 | +``` |
| 25 | + |
| 26 | +## Open source |
| 27 | +jsoup is an open source project distributed under the liberal [MIT license](https://jsoup.org/license). The source code is available at [GitHub](https://github.com/jhy/jsoup/tree/master/src/main/java/org/jsoup). |
| 28 | + |
| 29 | +## Getting started |
| 30 | +1. [Download](https://jsoup.org/download) the latest jsoup jar (or add it to your Maven/Gradle build) |
| 31 | +2. Read the [cookbook](https://jsoup.org/cookbook/) |
| 32 | +3. Enjoy! |
| 33 | + |
| 34 | +## Development and support |
| 35 | +If you have any questions on how to use jsoup, or have ideas for future development, please get in touch via the [mailing list](https://jsoup.org/discussion). |
| 36 | + |
| 37 | +If you find any issues, please file a [bug](https://jsoup.org/bugs) after checking for duplicates. |
| 38 | + |
| 39 | +The [colophon](https://jsoup.org/colophon) covers the history of jsoup and the tools used to build it. |
| 40 | + |
| 41 | +## Status |
| 42 | +jsoup is in general release and is stable. |
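
The Example section of the README above stops at the `select` call. As a rough sketch of what might come next, the snippet below parses an inline HTML string and walks the selected links, so it needs no network access. The class name `HeadlineSketch`, the helper `headlines`, and the sample markup are hypothetical stand-ins for the Wikipedia page; the jsoup jar is assumed to be on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public class HeadlineSketch {
    // Hypothetical markup that mimics the shape of the "In the News" box.
    static final String SAMPLE_HTML =
        "<div id=\"mp-itn\"><ul>"
        + "<li><b><a href=\"/wiki/A\" title=\"Story A\">A</a></b> happens</li>"
        + "<li><b><a href=\"/wiki/B\" title=\"Story B\">B</a></b> happens</li>"
        + "</ul></div>";

    // Parse the HTML against a base URI and describe each selected link.
    static List<String> headlines(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        // Same selector as the README example: links in bold inside #mp-itn.
        Elements links = doc.select("#mp-itn b a");
        List<String> out = new ArrayList<>();
        for (Element link : links) {
            // absUrl resolves the relative href against the base URI.
            out.add(link.attr("title") + " -> " + link.absUrl("href"));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : headlines(SAMPLE_HTML, "http://en.wikipedia.org/")) {
            System.out.println(line);
        }
    }
}
```

Passing a base URI to `Jsoup.parse` is what lets `absUrl` turn relative links into absolute ones; a document fetched with `Jsoup.connect(...).get()` already carries its URL as the base.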