diff --git a/docs/getting_started.ipynb b/docs/getting_started.ipynb deleted file mode 100644 index 1daaf1fa..00000000 --- a/docs/getting_started.ipynb +++ /dev/null @@ -1,536 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Getting started with `delb`\n", - "\n", - "\n", - "This notebook gives a brief introduction with a few examples on TEI-XML\n", - "encoded transcriptions. The cells' output isn't pre-rendered, you need to\n", - "execute them in order to see the output of code cells. The first cell requires\n", - "an internet connection. The guide assumes that you have basic knowledge about\n", - "the Python programming language, XML markup and with text encodings that use\n", - "the latter.\n", - "\n", - "The full API reference is available at https://delb.readthedocs.io/en/latest/api.html .\n", - "\n", - "First, we will install the library, including the required dependencies to load \n", - "resources over `https` and then import the `Document` class:" - ], - "id": "8184c5ac7680ee04" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "! pip install delb[https-loader]\n", - "\n", - "# workaround for issue with entrypoint registration on mybinder.org\n", - "from _delb.plugins import https_loader # not necessary in local environments\n", - "\n", - "from delb import Document" - ], - "execution_count": null, - "id": "5bfe0c696fc56644", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Loading a document\n", - "\n", - "`delb` can instantiate document representations from a variety of source\n", - "arguments, namely URLs, strings and `lxml` objects. If these are not enough,\n", - "more document loaders can be implemented and configured to be used. See the API\n", - "doc's *Document loaders* section as well as the chapter on *Extending delb* for\n", - "more on this.\n", - "\n", - "Let's load R.L. Stevenson's *Treasure Island* from a web server:" - ], - "id": "fe9f87e20ac857d0" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "treasure_island = Document(\n", - " \"https://ota.bodleian.ox.ac.uk/repository/xmlui/bitstream/handle/20.500.12024/5730/5730.xml\"\n", - ")\n", - "print(f\"{str(treasure_island)[:367]}\\n[…]\\n{str(treasure_island)[-384:]}\")" - ], - "execution_count": null, - "id": "81ad40c9d44cf6e1", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Related API docs\n", - "\n", - "[Document class](https://delb.readthedocs.io/en/latest/api.html#delb.Document),\n", - "[Documment loaders](https://delb.readthedocs.io/en/latest/api.html#document-loaders)" - ], - "id": "4eda55a577afbe8a" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Node types\n", - "\n", - "The document's content is contained in a tree whose root node is available as\n", - "the `root` property of a document instance and is of the `TagNode` type." - ], - "id": "664c59e1c0feb8b0" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "root = treasure_island.root\n", - "print(\"The root node's name:\", root.universal_name)\n", - "print(\"The number of the root's child nodes:\", len(root))" - ], - "execution_count": null, - "id": "53730e46f039adf5", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Beside `TagNode`s, there are also `TextNode`s as well as `CommentNode`s and\n", - "`ProcessingInstruction`s. The latter two are filtered out by default, there's\n", - "more on this in the API doc's *Default filters* section.\n", - "\n", - "Querying document contents is of course a common task and in the context of the\n", - "whole document, CSS queries are straight-forward. So, what title is recorded in\n", - "the document's header section?" - ], - "id": "9fa8e92aff1a8bb1" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "title_node = treasure_island.css_select(\"titleStmt title\").first\n", - "print(title_node.full_text)" - ], - "execution_count": null, - "id": "12a9fc6e1229a217", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For our purpose we want to manipulate the title to really just contain the\n", - "work's title without further textual annotation, so we need to fetch the containing\n", - "text node and alter its content:" - ], - "id": "c8a1ce54c0ab6bde" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "text_node = title_node.first_child\n", - "text_node.content = \"Treasure Island\"\n", - "print(title_node.full_text)" - ], - "execution_count": null, - "id": "647c780d44aa414e", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Related API docs\n", - "\n", - "[`Document.root`](https://delb.readthedocs.io/en/latest/api.html#delb.Document.root),\n", - "[`TagNode` class](https://delb.readthedocs.io/en/latest/api.html#tag),\n", - "[`TextNode` class](https://delb.readthedocs.io/en/latest/api.html#text),\n", - "[`CommentNode` class](https://delb.readthedocs.io/en/latest/api.html#comment),\n", - "[`ProcessingInstruction` class](https://delb.readthedocs.io/en/latest/api.html#processing-instruction),\n", - "[default filters](https://delb.readthedocs.io/en/latest/api.html#default-filters),\n", - "[`Document.css_select`](https://delb.readthedocs.io/en/latest/api.html#delb.Document.css_select),\n", - "[`TagNode.full_text`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.full_text)" - ], - "id": "a52685e10002ab4a" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Navigating the tree\n", - "\n", - "Now, let's find out what the author record says and use different ways that\n", - "`delb` provides to navigate the tree.\n", - "\n", - "Given we know that the document has only one author and that the node\n", - "holding that information is following the `title`, it can be simply fetched\n", - "by targeting the following sibling tag:" - ], - "id": "c64b76f3eefe4a83" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "from delb import is_tag_node\n", - "\n", - "author_node = title_node.fetch_following_sibling(is_tag_node)\n", - "print(author_node.full_text)" - ], - "execution_count": null, - "id": "a24e41bdd2c7a754", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "But a more generic approach would consider that there could be several authors\n", - "and that there's no constraint regarding the order of nodes in the containing\n", - "`titleStmt` node. Therefore we define a filter that matches only nodes with\n", - "the name `author` and use it after one that only passes `TagNode` s to fetch\n", - "all matching child nodes of the `titleStmt`:" - ], - "id": "a6189b52c06264ff" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "def has_author_name(node: \"NodeBase\") -> bool:\n", - " return node.local_name == \"author\"\n", - "\n", - "title_statement = title_node.parent\n", - "author_nodes = title_statement.iterate_children(is_tag_node, has_author_name)\n", - "print(author_nodes)" - ], - "execution_count": null, - "id": "26853aa1ece14809", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Wait, that `author_nodes` is a generator object, now what? Or firstly, why?\n", - "Employing generators allows lazy evaluation of iterables and avoids large\n", - "intermediate containers that could be expected in operations like this:\n", - "\n", - "```python\n", - "def is_paragraph_node(node):\n", - " return is_tag_node(node) and node.local_name == \"p\"\n", - "\n", - "for node in root.iterate_descendants(is_paragraph_node):\n", - " # in the previous expression a rather big list could be allocated in memory\n", - " # while only one item is used at a time within the loop; also you might\n", - " # wanna `break` out of the loop earlier\n", - " pass\n", - "```\n", - "\n", - "Since we are after the author names and not the containing nodes, we can use\n", - "that generator (once) in a\n", - "[list comprehension](https://docs.python.org/glossary.html#term-list-comprehension):" - ], - "id": "9281f8d4dbe2ab1a" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "[node.full_text for node in author_nodes]" - ], - "execution_count": null, - "id": "ae2a53faa9e03a5d", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Related API docs\n", - "\n", - "[`TagNode.fetch_following_sibling`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.fetch_following_sibling),\n", - "[`is_tag_node`](https://delb.readthedocs.io/en/latest/api.html#delb.is_tag_node),\n", - "[`TagNode.local_name`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.local_name),\n", - "[`TagNode.parent`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.parent),\n", - "[`TagNode.iterate_children`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.iterate_children)," - ], - "id": "1cd6c58f5d17d349" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Rearranging nodes\n", - "\n", - "What about we create another document that contains a table of contents, or\n", - "rather a tree, and sparse title information? As this shall demonstrate the\n", - "manipulation of trees, we start off with a copy of the document (instead of\n", - "just extracting the information and building a new tree, which would be more\n", - "appropriate in a real application). First we define a namespace, register it\n", - "with a prefix for serializations and clone the root node." - ], - "id": "ecc68cc399ab787f" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "from delb import register_namespace\n", - "\n", - "TOC_NS = \"https://t.oc/\"\n", - "register_namespace(\"toc\", TOC_NS)\n", - "\n", - "root = treasure_island.clone().root" - ], - "execution_count": null, - "id": "e267cbeb265cef22", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next we select and clone the nodes containing the title and author information,\n", - "alter their namespace and place them as first child nodes of the root. Note\n", - "that except for `replace` all methods that add nodes to a tree, can take a\n", - "variable amount of nodes as arguments and therefore the destructuring notation\n", - "of iterables by prefixing such with `*` can be used." - ], - "id": "d595edbf40d853bf" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "nodes = [\n", - " node.clone(deep=True) for node in root.css_select(\"titleStmt title, titleStmt author\")\n", - "]\n", - "\n", - "for node in nodes:\n", - " node.namespace = TOC_NS\n", - "\n", - "root.prepend_children(*nodes)\n", - "print(str(root)[:216], \"…\")" - ], - "execution_count": null, - "id": "ab191969080f3355", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "It's noteworthy at this point that `delb` doesn't allow to just move nodes\n", - "around within or between trees. Following the paradigm that *explicit is better\n", - "than implicit* it doesn't detach nodes for you and you have to either `clone`\n", - "(as before or by passing the `clone` argument as `True`) or `detach` a\n", - "non-root node before you insert it into a tree.\n", - "\n", - "Then we get rid of the `teiHeader` and its contents. As the reference to the\n", - "detached nodes are lost, the nodes themselves will be removed from the heap\n", - "upon garbage collection." - ], - "id": "12f950bfd75b7e09" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "root.css_select(\"teiHeader\").first.detach()\n", - "print(str(root)[:215], \"…\")" - ], - "execution_count": null, - "id": "c12a787bc321ddeb", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "For the directory of sections, we'll go that more straight-forward way of\n", - "extracting information and assembling a new tree. So let's make a container for\n", - "the items:" - ], - "id": "2c2d7f92d640c476" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "contents = root.new_tag_node(\"contents\", namespace=TOC_NS)\n", - "print(contents)" - ], - "execution_count": null, - "id": "e9f51b91af8ab319", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We'll need to implement the data conversion in a function because we need to\n", - "recursively scan for subsections. Also, filter functions are defined and used\n", - "to iterate over the relevant nodes. To reduce imperative instructions, the\n", - "`tag` function is employed that allows a brief, declarative way to do build\n", - "subtrees." - ], - "id": "97eacb710789ce11" - }, - { - "cell_type": "code", - "metadata": {}, - "source": [ - "from delb import first, tag\n", - "\n", - "def is_head(node: \"NodeBase\") -> bool:\n", - " return node.local_name == \"head\"\n", - "\n", - "def is_section(node: \"NodeBase\") -> bool:\n", - " return node.local_name == \"div\"\n", - "\n", - "\n", - "def extract_section_titles(node: \"TagNode\") -> \"List[TagNode, ...]\":\n", - " result = []\n", - " for child_node in node.child_nodes(is_tag_node, is_section):\n", - " head = first(child_node.child_nodes(is_tag_node, is_head))\n", - " section_item = node.new_tag_node(\"section\", namespace=TOC_NS, children=[\n", - " tag(\"title\", head.full_text)\n", - " ])\n", - " result.append(section_item)\n", - "\n", - " subsections = extract_section_titles(child_node)\n", - " if subsections:\n", - " section_item.append_children(tag(\"subsections\", subsections))\n", - "\n", - " return result\n", - "\n", - "body = root.css_select(\"text body\").first\n", - "contents.append_children(*extract_section_titles(body))\n", - "root.append_child(contents)" - ], - "execution_count": null, - "id": "411129c3e15c652", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Finally we'll alter the root node's identity and get rid of the originating\n", - "contents. For that last bit, we'll actually put these target nodes into a list\n", - "that references these. Because like a list, a tree is a mutable object, and\n", - "must not be changed when iterating over it." - ], - "id": "8145031f753463b3" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "root.namespace = TOC_NS\n", - "root.local_name = \"book_contents\"\n", - "\n", - "def namespace_filter(node: \"NodeBase\") -> bool:\n", - " return node.namespace != TOC_NS\n", - "\n", - "for node in list(root.iterate_descendants(is_tag_node, namespace_filter)):\n", - " node.detach()" - ], - "execution_count": null, - "id": "fe0d5e5a758e163d", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Related API docs\n", - "\n", - "[`register_namespace`](https://delb.readthedocs.io/en/latest/api.html#delb.register_namespace),\n", - "[`Document.clone`](https://delb.readthedocs.io/en/latest/api.html#delb.Document.clone),\n", - "[`TagNode.css_select`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.css_select),\n", - "[`TagNode.namespace`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.namespace),\n", - "[`TagNode.prepend_children`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.prepend_children),\n", - "[`TagNode.detach`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.detach),\n", - "[`TagNode.new_tag_node`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.new_tag_node),\n", - "[`tag`](https://delb.readthedocs.io/en/latest/api.html#delb.tag),\n", - "[`TagNode.append_children`](https://delb.readthedocs.io/en/latest/api.html#delb.TagNode.append_children)" - ], - "id": "d57a93b06ab2e688" - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Saving documents\n", - "\n", - "All right then, to wrap all up, we'll attach the newly created tree to a\n", - "document, clean up the namespace-prefix-mess (well mostly, unfortunately\n", - "there's no way yet to declare a namespace that is already associated with a\n", - "prefix as a default namespace) and save the document to disk:" - ], - "id": "ecc5f02d4d2748af" - }, - { - "cell_type": "code", - "metadata": { - "trusted": true - }, - "source": [ - "contents = Document(root)\n", - "\n", - "from pathlib import Path\n", - "from tempfile import TemporaryDirectory\n", - "\n", - "with TemporaryDirectory() as tmp_path:\n", - " target = Path(tmp_path) / \"treasure_island_sections.xml\"\n", - " contents.save(target, pretty=True)\n", - " print(target.read_text())" - ], - "execution_count": null, - "id": "65bc00eec5bdfb85", - "outputs": [] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "#### Related API docs\n", - "\n", - "[`Document.save`](https://delb.readthedocs.io/en/latest/api.html#delb.Document.save)" - ], - "id": "cfe907fb3ed50ce8" - } - ] -} diff --git a/docs/index.rst b/docs/index.rst index 314caeb2..81aad9c2 100644 --- a/docs/index.rst +++ b/docs/index.rst @@ -5,7 +5,7 @@ design installation - Getting Started + Getting Started api/index changes license