-
Notifications
You must be signed in to change notification settings - Fork 716
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
[SPARKNLP-1092] Adding support to read HTML files (#14449)
* [SPARKNLP-1089] Adding support to read HTML files * [SPARKNLP-1089] Adding documentation and support for set of URLs in python * [SPARKNLP-1089] Adding input validation in python * [SPARKNLP-1089] Minor fix to notebook
- Loading branch information
Showing
17 changed files
with
15,675 additions
and
6 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,296 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"![JohnSnowLabs](https://sparknlp.org/assets/images/logo.png)\n", | ||
"\n", | ||
"[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/JohnSnowLabs/spark-nlp/blob/master/examples/python/reader/SparkNLP_HTML_Reader_Demo.ipynb)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "tzcU5p2gdak9" | ||
}, | ||
"source": [ | ||
"# Introducing HTML reader in SparkNLP\n", | ||
"This notebook showcases the newly added `sparknlp.read().html()` method in Spark NLP that parses HTML content from both local files and real-time URLs into a Spark DataFrame.\n", | ||
"\n", | ||
"**Key Features:**\n", | ||
"- Ability to parse HTML from local directories and URLs.\n", | ||
"- Versatile support for varied data ingestion scenarios." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "RFOFhaEedalB" | ||
}, | ||
"source": [ | ||
"## Setup and Initialization\n", | ||
"Let's keep in mind a few things before we start 😊\n", | ||
"\n", | ||
"Support for reading html files was introduced in `Spark NLP 5.5.2`. Please make sure you have upgraded to the latest Spark NLP release." | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"- Let's install and setup Spark NLP in Google Colab\n", | ||
"- This part is pretty easy via our simple script" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"! wget -q http://setup.johnsnowlabs.com/colab.sh -O - | bash" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"For local files example we will download a couple of HTML files from Spark NLP Github repo:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": { | ||
"colab": { | ||
"base_uri": "https://localhost:8080/" | ||
}, | ||
"id": "ya8qZe00dalC", | ||
"outputId": "4399cc35-31d4-459c-bee8-d7eeba3d40cd" | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"--2024-11-05 20:02:19-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1089-Support-more-file-types-in-SparkNLP/src/test/resources/reader/html/example-10k.html\n", | ||
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", | ||
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", | ||
"HTTP request sent, awaiting response... 200 OK\n", | ||
"Length: 2456707 (2.3M) [text/plain]\n", | ||
"Saving to: ‘html-files/example-10k.html’\n", | ||
"\n", | ||
"\r", | ||
"example-10k.html 0%[ ] 0 --.-KB/s \r", | ||
"example-10k.html 100%[===================>] 2.34M --.-KB/s in 0.01s \n", | ||
"\n", | ||
"2024-11-05 20:02:19 (157 MB/s) - ‘html-files/example-10k.html’ saved [2456707/2456707]\n", | ||
"\n", | ||
"--2024-11-05 20:02:20-- https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/feature/SPARKNLP-1089-Support-more-file-types-in-SparkNLP/src/test/resources/reader/html/fake-html.html\n", | ||
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...\n", | ||
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", | ||
"HTTP request sent, awaiting response... 200 OK\n", | ||
"Length: 665 [text/plain]\n", | ||
"Saving to: ‘html-files/fake-html.html’\n", | ||
"\n", | ||
"fake-html.html 100%[===================>] 665 --.-KB/s in 0s \n", | ||
"\n", | ||
"2024-11-05 20:02:20 (41.9 MB/s) - ‘html-files/fake-html.html’ saved [665/665]\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!mkdir html-files\n", | ||
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/example-10k.html -P html-files\n", | ||
"!wget https://raw.githubusercontent.com/JohnSnowLabs/spark-nlp/master/src/test/resources/reader/html/fake-html.html -P html-files" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "EoFI66NAdalE" | ||
}, | ||
"source": [ | ||
"## Parsing HTML from Local Files\n", | ||
"Use the `html()` method to parse HTML content from local directories." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": { | ||
"colab": { | ||
"base_uri": "https://localhost:8080/" | ||
}, | ||
"id": "bAkMjJ1vdalE", | ||
"outputId": "c4bb38d4-963d-465b-e222-604dc6b617aa" | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Warning::Spark Session already created, some configs may not take.\n", | ||
"+--------------------+--------------------+--------------------+\n", | ||
"| path| content| html|\n", | ||
"+--------------------+--------------------+--------------------+\n", | ||
"|file:/content/htm...|<!DOCTYPE html>\\n...|[{Title, 0, My Fi...|\n", | ||
"|file:/content/htm...|<?xml version=\"1...|[{Title, 0, UNITE...|\n", | ||
"+--------------------+--------------------+--------------------+\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"import sparknlp\n", | ||
"html_df = sparknlp.read().html(\"./html-files\")\n", | ||
"\n", | ||
"html_df.show()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "VQD2k4E5dalF" | ||
}, | ||
"source": [ | ||
"## Parsing HTML from Real-Time URLs\n", | ||
"Use the `html()` method to fetch and parse HTML content from a URL or a set of URLs in real time." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": { | ||
"colab": { | ||
"base_uri": "https://localhost:8080/" | ||
}, | ||
"id": "MMTGmxLQdalG", | ||
"outputId": "57e99213-0fc7-483c-b7c2-695552fc8d73" | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Warning::Spark Session already created, some configs may not take.\n", | ||
"+--------------------+\n", | ||
"| html|\n", | ||
"+--------------------+\n", | ||
"|[{Title, 0, Examp...|\n", | ||
"+--------------------+\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"html_df = sparknlp.read().html(\"https://example.com/\")\n", | ||
"html_df.select(\"html\").show()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": { | ||
"colab": { | ||
"base_uri": "https://localhost:8080/" | ||
}, | ||
"id": "-psYdzWodalG", | ||
"outputId": "544cd7e3-93a6-465a-8b9a-52d487d63b21" | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Warning::Spark Session already created, some configs may not take.\n", | ||
"+--------------------+--------------------+\n", | ||
"| url| html|\n", | ||
"+--------------------+--------------------+\n", | ||
"|https://www.wikip...|[{Title, 0, Wikip...|\n", | ||
"|https://example.com/|[{Title, 0, Examp...|\n", | ||
"+--------------------+--------------------+\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"htmls_df = sparknlp.read().html([\"https://www.wikipedia.org\", \"https://example.com/\"])\n", | ||
"htmls_df.show()" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "FrVKxdySz8pR" | ||
}, | ||
"source": [ | ||
"### Configuration Parameters" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": { | ||
"id": "QOXXVx5e7Ri1" | ||
}, | ||
"source": [ | ||
"You can customize the font size used to identify paragraphs that should be treated as titles. By default, the font size is set to 16. However, if your HTML files require a different configuration, you can adjust this parameter accordingly. The example below demonstrates how to modify and work with this setting:" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 12, | ||
"metadata": { | ||
"colab": { | ||
"base_uri": "https://localhost:8080/" | ||
}, | ||
"id": "aNfN0fQC0Vzz", | ||
"outputId": "0b849a86-2d59-4415-981a-dcd9a9f7a14a" | ||
}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Warning::Spark Session already created, some configs may not take.\n", | ||
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | ||
"|html |\n", | ||
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | ||
"|[{Title, 0, My First Heading, {pageNumber -> 1}}, {Title, 0, My Second Heading, {pageNumber -> 1}}, {NarrativeText, 0, My first paragraph. lorem ipsum dolor set amet. if the cow comes home under the sun how do you fault the cow for it's worn hooves?, {pageNumber -> 1}}, {Title, 0, A Third Heading, {pageNumber -> 1}}, {Table, 0, Column 1 Column 2 Row 1, Cell 1 Row 1, Cell 2 Row 2, Cell 1 Row 2, Cell 2, {pageNumber -> 1}}]|\n", | ||
"+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"params = {\"titleFontSize\": \"12\"}\n", | ||
"html_df = sparknlp.read(params).html(\"./html-files/fake-html.html\")\n", | ||
"html_df.select(\"html\").show(truncate=False)" | ||
] | ||
} | ||
], | ||
"metadata": { | ||
"colab": { | ||
"provenance": [] | ||
}, | ||
"kernelspec": { | ||
"display_name": "Python 3 (ipykernel)", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.10.12" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 1 | ||
} |
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,15 @@ | ||
# Copyright 2017-2022 John Snow Labs | ||
# | ||
# Licensed under the Apache License, Version 2.0 (the "License"); | ||
# you may not use this file except in compliance with the License. | ||
# You may obtain a copy of the License at | ||
# | ||
# http://www.apache.org/licenses/LICENSE-2.0 | ||
# | ||
# Unless required by applicable law or agreed to in writing, software | ||
# distributed under the License is distributed on an "AS IS" BASIS, | ||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. | ||
# See the License for the specific language governing permissions and | ||
# limitations under the License. | ||
"""Module for reading different files types.""" | ||
from sparknlp.reader.sparknlp_reader import * |
Oops, something went wrong.