[Bug]: rag_google_documentation.ipynb has issues in execution #788

Open
rafiqhasan opened this issue Jun 18, 2024 · 2 comments

rafiqhasan commented Jun 18, 2024

File Name

/search/retrieval-augmented-generation/examples/rag_google_documentation.ipynb

What happened?

# Given a Google documentation URL, retrieve a list of all text chunks within h2 sections
def get_sections(url: str) -> list[str]:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    sections = []
    paragraphs = []

    body_div = soup.find("div", class_="devsite-article-body")
    for child in body_div.findChildren():
        if child.name == "p":
            paragraphs.append(child.get_text().strip())
        if child.name == "h2":
            sections.append(" ".join(paragraphs))
            break

    for header in soup.find_all("h2"):
        paragraphs = []
        nextNode = header.nextSibling
        while nextNode:
            if isinstance(nextNode, Tag):
                if nextNode.name in {"p", "ul"}:
                    paragraphs.append(nextNode.get_text().strip())
                elif nextNode.name == "h2":
                    sections.append(" ".join(paragraphs))
                    break
            nextNode = nextNode.nextSibling
    return sections

get_sections needs to handle pages that have no h2 headings or no div with the devsite-article-body class. Currently, when the page source contains no such div, soup.find(...) returns None, and the line for child in body_div.findChildren(): raises an AttributeError.
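One possible way to guard against this is sketched below. It is not the notebook author's fix. The helper name sections_from_html, the 30-second timeout, and the choice to skip the intro when the div is missing are all assumptions made for illustration. The sketch also appends each section after its sibling loop ends, because the original loop only appends when it meets a following h2 and so appears to drop the last section on a page.

```python
import requests
from bs4 import BeautifulSoup, Tag


def sections_from_html(html: str) -> list[str]:
    """Return text chunks for a page: the intro, then one chunk per h2 section.

    Hypothetical helper (not in the notebook), split out so the parsing can
    be exercised without a network call.
    """
    soup = BeautifulSoup(html, "html.parser")
    sections = []

    # soup.find() returns None when no matching div exists, so check first.
    body_div = soup.find("div", class_="devsite-article-body")
    if body_div is not None:
        # Intro: paragraphs before the first h2, or every paragraph in the
        # div when the page has no h2 at all.
        intro = []
        for child in body_div.find_all(True):
            if child.name == "h2":
                break
            if child.name == "p":
                intro.append(child.get_text().strip())
        sections.append(" ".join(intro))

    # One chunk per h2: gather sibling p/ul text up to the next h2.
    for header in soup.find_all("h2"):
        paragraphs = []
        for node in header.next_siblings:
            if not isinstance(node, Tag):
                continue
            if node.name == "h2":
                break
            if node.name in {"p", "ul"}:
                paragraphs.append(node.get_text().strip())
        # Appended after the loop so the final section is not lost.
        sections.append(" ".join(paragraphs))

    return sections


def get_sections(url: str) -> list[str]:
    """Fetch a documentation URL and split it into section chunks."""
    page = requests.get(url, timeout=30)  # timeout value is an assumption
    return sections_from_html(page.text)
```

With this version, a page without the devsite-article-body div or without any h2 yields an empty or intro-only list instead of raising, and the existing comprehension that filters with if t can stay as it is.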

Relevant log output

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-6-440b9131ebc9> in <cell line: 1>()
----> 1 all_text = [t for url in URLS for t in get_sections(url) if t]

1 frames
<ipython-input-6-440b9131ebc9> in <listcomp>(.0)
----> 1 all_text = [t for url in URLS for t in get_sections(url) if t]

<ipython-input-5-73e0f3cdcce1> in get_sections(url)
      8 
      9     body_div = soup.find("div", class_="devsite-article-body")
---> 10     for child in body_div.findChildren():
     11         if child.name == "p":
     12             paragraphs.append(child.get_text().strip())

AttributeError: 'NoneType' object has no attribute 'findChildren'

CC: @holtskinner

@holtskinner
Collaborator

@grivescorbett is the creator of this notebook.

@holtskinner
Collaborator

Possible improvement to be made to this notebook:

The Document AI Layout Parser can handle HTML pages. This could be a way to extract the paragraph, title, and similar information without parsing the HTML manually.
