import requests
from bs4 import BeautifulSoup, Tag


# Given a Google documentation URL, retrieve a list of all text chunks within h2 sections
def get_sections(url: str) -> list[str]:
    page = requests.get(url)
    soup = BeautifulSoup(page.content, "html.parser")

    sections = []
    paragraphs = []

    # Collect the introductory paragraphs that precede the first h2
    body_div = soup.find("div", class_="devsite-article-body")
    for child in body_div.findChildren():
        if child.name == "p":
            paragraphs.append(child.get_text().strip())
        if child.name == "h2":
            sections.append(" ".join(paragraphs))
            break

    # Collect the p and ul siblings that follow each h2, up to the next h2
    for header in soup.find_all("h2"):
        paragraphs = []
        nextNode = header.nextSibling
        while nextNode:
            if isinstance(nextNode, Tag):
                if nextNode.name in {"p", "ul"}:
                    paragraphs.append(nextNode.get_text().strip())
                elif nextNode.name == "h2":
                    sections.append(" ".join(paragraphs))
                    break
            nextNode = nextNode.nextSibling
    return sections
This function needs to be fixed to handle pages that have no h2 tag or no div with the devsite-article-body class. Currently, the line for child in body_div.findChildren(): raises an error when the page source contains no such div, because soup.find returns None in that case.
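One possible way to guard against both cases is sketched below. It is an illustration, not the notebook's actual fix. The name `extract_sections` is hypothetical, and the function takes the HTML as a string so it can be tried without a network call. Falling back to the whole document when the `devsite-article-body` div is absent, and skipping empty sections, are assumptions about the desired behavior. The sketch also appends each section after its sibling loop ends, because the quoted loop only appends when it reaches the next h2 and therefore never emits the last section.

```python
from bs4 import BeautifulSoup, Tag


def extract_sections(html: str) -> list[str]:
    """Return the intro text before the first h2, then one text chunk per h2 section."""
    soup = BeautifulSoup(html, "html.parser")

    # soup.find returns None when the div is missing; fall back to the whole
    # document (an assumed policy) instead of calling a method on None.
    body = soup.find("div", class_="devsite-article-body")
    if body is None:
        body = soup

    sections = []

    # Intro: paragraphs in document order up to the first h2,
    # or every paragraph when the page has no h2 at all.
    paragraphs = []
    for child in body.find_all(["p", "h2"]):
        if child.name == "h2":
            break
        paragraphs.append(child.get_text().strip())
    if paragraphs:
        sections.append(" ".join(paragraphs))

    # One chunk per h2: the p and ul siblings that follow it, up to the next h2.
    for header in soup.find_all("h2"):
        paragraphs = []
        for sibling in header.next_siblings:
            if not isinstance(sibling, Tag):
                continue  # skip bare strings between tags
            if sibling.name == "h2":
                break
            if sibling.name in {"p", "ul"}:
                paragraphs.append(sibling.get_text().strip())
        # Append after the loop so the final section is kept as well.
        if paragraphs:
            sections.append(" ".join(paragraphs))

    return sections
```

To use this on a live page, the response text from `requests.get(url)` could be passed in. A page with neither the div nor any h2 then yields a single chunk containing all of its paragraphs, or an empty list if it has none.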
The Document AI Layout Parser can handle HTML pages. This could be a way to extract paragraph, title, and similar structural information without parsing the HTML manually.
File Name
/search/retrieval-augmented-generation/examples/rag_google_documentation.ipynb
What happened?
get_sections needs to be fixed to handle pages that have no h2 tag or no div with the devsite-article-body class. Currently the code
    for child in body_div.findChildren():
runs into an error if no such tag is found in the page source.
Relevant log output
CC: @holtskinner