[Bug]: rag_google_documentation.ipynb has issues in execution #788

rafiqhasan · 2024-06-18T14:42:56Z

File Name

/search/retrieval-augmented-generation/examples/rag_google_documentation.ipynb

What happened?

# Given a Google documentation URL, retrieve a list of all text chunks within h2 sections
def get_sections(url: str) -> list[str]:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    sections = []
    paragraphs = []

    body_div = soup.find("div", class_="devsite-article-body")
    for child in body_div.findChildren():
        if child.name == "p":
            paragraphs.append(child.get_text().strip())
        if child.name == "h2":
            sections.append(" ".join(paragraphs))
            break

    for header in soup.find_all("h2"):
        paragraphs = []
        nextNode = header.nextSibling
        while nextNode:
            if isinstance(nextNode, Tag):
                if nextNode.name in {"p", "ul"}:
                    paragraphs.append(nextNode.get_text().strip())
                elif nextNode.name == "h2":
                    sections.append(" ".join(paragraphs))
                    break
            nextNode = nextNode.nextSibling
    return sections

get_sections needs to handle pages that have no h2 element or no div with the devsite-article-body class. Currently, the line for child in body_div.findChildren(): raises an AttributeError when the page source contains no such div, because soup.find returns None in that case.
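A possible fix, as a minimal sketch rather than the notebook's actual code: the parsing is split into a hypothetical helper, extract_sections, so it can be checked without a network call. Falling back to the whole document when the devsite-article-body div is missing is an assumption about the desired behavior. So are keeping the text after the last h2 (the current loop only appends a section when it reaches the next h2), dropping empty chunks, and the timeout value.

```python
import requests
from bs4 import BeautifulSoup, Tag


def extract_sections(html: str) -> list[str]:
    """Return the non-empty text chunks of a page, split at h2 headings."""
    soup = BeautifulSoup(html, "html.parser")

    # Assumption: if the devsite-article-body div is missing, fall back to
    # the whole document instead of calling a method on None.
    body = soup.find("div", class_="devsite-article-body")
    if body is None:
        body = soup

    sections = []

    # Text before the first h2 (or all paragraphs when the page has no h2).
    intro = []
    for child in body.find_all(["p", "h2"]):
        if child.name == "h2":
            break
        intro.append(child.get_text().strip())
    sections.append(" ".join(intro))

    # Text following each h2, up to the next h2 or the end of its siblings.
    for header in body.find_all("h2"):
        paragraphs = []
        for sibling in header.next_siblings:
            if not isinstance(sibling, Tag):
                continue
            if sibling.name == "h2":
                break
            if sibling.name in {"p", "ul"}:
                paragraphs.append(sibling.get_text().strip())
        # Appended for every heading, including the last one on the page.
        sections.append(" ".join(paragraphs))

    # Drop empty chunks so callers do not have to filter them.
    return [section for section in sections if section]


def get_sections(url: str) -> list[str]:
    """Fetch a documentation URL and return its text chunks."""
    page = requests.get(url, timeout=10)  # timeout value is an assumption
    return extract_sections(page.content)
```

With this shape, the existing comprehension over URLS should keep working unchanged, and a page without the expected div yields whatever paragraph text it has instead of raising.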

Relevant log output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-440b9131ebc9> in <cell line: 1>()
----> 1 all_text = [t for url in URLS for t in get_sections(url) if t]

1 frames
<ipython-input-6-440b9131ebc9> in <listcomp>(.0)
----> 1 all_text = [t for url in URLS for t in get_sections(url) if t]

<ipython-input-5-73e0f3cdcce1> in get_sections(url)
      8 
      9     body_div = soup.find("div", class_="devsite-article-body")
---> 10     for child in body_div.findChildren():
     11         if child.name == "p":
     12             paragraphs.append(child.get_text().strip())

AttributeError: 'NoneType' object has no attribute 'findChildren'

CC: @holtskinner


holtskinner · 2024-07-30T16:14:17Z

@grivescorbett is the creator of this notebook.

holtskinner · 2024-07-30T16:15:47Z

Possible improvement to be made to this notebook:

The Document AI Layout Parser can handle HTML pages. This could be a way to extract the paragraph, title, and similar structural information without doing the manual HTML parsing.
