# Conduct research on given URLs without forgetting and add more research #734

Open. Wants to merge 3 commits into base: master.
docs/docs/gpt-researcher/tailored-research.md (8 changes: 5 additions & 3 deletions)
@@ -1,16 +1,18 @@
# Tailored Research
The GPT Researcher package allows you to tailor the research to your needs, such as researching on specific sources (URLs) or local documents, and even lets you specify the agent prompt instruction that guides the research.

### Research on Specific Sources 📚

You can specify the sources you want GPT Researcher to research by passing a list of URLs through the `source_urls` parameter. Research will then be conducted on the provided sources.

If you want GPT Researcher to perform additional research outside of the URLs you provided (i.e., to consult other websites it finds suitable for the query or sub-queries), set the parameter `add_additional_sources` to `True`. The default value of `False` restricts research to the websites you provide via `source_urls`.

```python
from gpt_researcher import GPTResearcher
import asyncio

async def get_report(query: str, report_type: str, sources: list) -> str:
    researcher = GPTResearcher(query=query, report_type=report_type, source_urls=sources, add_additional_sources=False)
    await researcher.conduct_research()
    report = await researcher.write_report()
    return report
```
gpt_researcher/master/agent.py (52 changes: 37 additions & 15 deletions)
@@ -23,6 +23,7 @@ def __init__(
report_source=ReportSource.Web.value,
tone: Tone = Tone.Objective,
source_urls=None,
add_additional_sources=False,
documents=None,
config_path=None,
websocket=None,
@@ -36,20 +37,30 @@ def __init__(
headers: dict = None, # Add headers parameter
):
"""
Initializes the GPTResearcher class with the specified parameters to set up the research environment.

Args:
query (str): The main query for which research is conducted.
report_type (str): Type of report to generate. Defaults to a research report.
report_source (str): The source of data for research. Defaults to web sources.
tone (Tone): The tone of the report; objective by default.
source_urls (list or None): Initial list of URLs for research.
add_additional_sources (bool): Whether to research additional sources beyond those in source_urls. Takes effect only when source_urls is set to a non-empty list.
documents (list or None): Predefined list of documents to use.
config_path (str or None): Path to the configuration file.
websocket: Websocket connection for real-time updates.
agent (str or None): Designated agent for conducting research.
role (str or None): Role of the agent if specified.
parent_query (str): Main query that this query is derived from if any.
subtopics (list): List of subtopics related to the main query from the user.
visited_urls (set): Set of URLs that have already been visited.
verbose (bool): Toggle for verbose output for debugging or detailed logs.
context (list): Initial context for the research.
headers (dict or None): HTTP headers for web requests.

Initializes internal state and prepares the researcher with the necessary configuration.
"""

self.headers = headers or {}
self.query: str = query
self.agent: str = agent
@@ -66,6 +77,7 @@ def __init__(
) or get_default_retriever()
self.context = context
self.source_urls = source_urls
self.add_additional_sources: bool = add_additional_sources
self.documents = documents
self.memory = Memory(self.cfg.embedding_provider, self.headers)
self.visited_urls: set[str] = visited_urls
@@ -92,12 +104,10 @@ def __init__(

async def conduct_research(self):
"""
Runs the GPT Researcher to conduct research on the specified sources.
"""
# Reset visited_urls and source_urls at the start of each research task
self.visited_urls.clear()
if self.report_source != ReportSource.Sources.value:
self.source_urls = []

if self.verbose:
await stream_output(
@@ -123,6 +133,18 @@ async def conduct_research(self):
# If specified, the researcher will use the given urls as the context for the research.
if self.source_urls:
self.context = await self.__get_context_by_urls(self.source_urls)
if len(self.context) == 0 and self.verbose:
# No relevant content was found in source_urls for the query or sub-query; fall back to the model's inherent knowledge
await stream_output(
"logs",
"answering_from_memory",
"🧐 I was unable to find relevant context in the provided sources...",
self.websocket,
)
# If the add_additional_sources parameter is set, gather more resources to create additional context via the default web search
if self.add_additional_sources:
additional_research = await self.__get_context_by_search(self.query)
self.context += ' '.join(additional_research)

elif self.report_source == ReportSource.Local.value:
document_data = await DocumentLoader(self.cfg.doc_path).load()
tests/research_test.py (103 changes: 103 additions & 0 deletions)
@@ -0,0 +1,103 @@
"""
Hi! The following test cases cover the new parameter `add_additional_sources` and the fix for the functional error with `source_urls` in the GPTResearcher class.

The source_urls parameter was being reset on every call to conduct_research, causing gptr to forget the given links. That has now been fixed, and a new parameter is introduced.
This parameter, named `add_additional_sources`, allows GPTR to research sources other than those provided via source_urls when set to True.
The default is False, i.e., no additional research will be conducted on newer sources.
"""

## Notes:
## Please uncomment the test case you want to run and comment out the rest.
## Thanks!



#### Test case 1 (original test case as control from https://docs.gptr.dev/docs/gpt-researcher/tailored-research)

from gpt_researcher.master.agent import GPTResearcher # Ensure this path is correct
import asyncio

async def get_report(query: str, report_type: str, sources: list) -> tuple:
researcher = GPTResearcher(query=query, report_type=report_type, source_urls=sources)
await researcher.conduct_research()
report = await researcher.write_report()
return report, researcher

if __name__ == "__main__":
query = "Research the latest advancements in AI and provide a detailed report in APA format including sources."
report_type = "research_report"
sources = ["https://en.wikipedia.org/wiki/Artificial_intelligence", "https://www.ibm.com/watson/ai"] # query is related

report, researcher = asyncio.run(get_report(query, report_type, sources))
print(report)

print(f"\nLength of the context = {len(researcher.get_research_context())}") # Should be non-zero: the query is related to the contents of the pages, so relevant context will be present



#### Test case 2 (Illustrating the problem, i.e., source_urls are not scoured. Hence, no relevant context)

# from gpt_researcher.master.agent import GPTResearcher # Ensure this path is correct
# import asyncio

# async def get_report(query: str, report_type: str, sources: list) -> tuple:
# researcher = GPTResearcher(query=query, report_type=report_type, source_urls=sources)
# await researcher.conduct_research()
# report = await researcher.write_report()
# return report, researcher

# if __name__ == "__main__":
# query = "What is Microsoft's business model?"
# report_type = "research_report"
# sources = ["https://www.apple.com", "https://en.wikipedia.org/wiki/Olympic_Games"] # query is UNRELATED.

# report, researcher = asyncio.run(get_report(query, report_type, sources))
# print(report)

# print(f"\nLength of the context = {len(researcher.get_research_context())}") # Should be 0 (zero): the query is UNRELATED to the contents of the pages, so no relevant context will be present



#### Test case 3 (Suggested solution - add_additional_sources parameter allows GPTR to scour more of the web and not restrict to source_urls)

# from gpt_researcher.master.agent import GPTResearcher # Ensure this path is correct
# import asyncio

# async def get_report(query: str, report_type: str, sources: list) -> tuple:
# researcher = GPTResearcher(query=query, report_type=report_type, source_urls=sources, add_additional_sources=True)
# await researcher.conduct_research()
# report = await researcher.write_report()
# return report, researcher

# if __name__ == "__main__":
# query = "What is Microsoft's business model?"
# report_type = "research_report"
# sources = ["https://www.apple.com", "https://en.wikipedia.org/wiki/Olympic_Games"] # query is UNRELATED

# report, researcher = asyncio.run(get_report(query, report_type, sources))
# print(report)

# print(f"\nLength of the context = {len(researcher.get_research_context())}") # Should be non-zero: although the query is UNRELATED to the pages, add_additional_sources is set, so gptr falls back to a default web search to gather context



#### Test case 4 (Furthermore, GPTR will create context in addition to source_urls if add_additional_sources is set, allowing a larger research scope)

# from gpt_researcher.master.agent import GPTResearcher # Ensure this path is correct
# import asyncio

# async def get_report(query: str, report_type: str, sources: list) -> tuple:
# researcher = GPTResearcher(query=query, report_type=report_type, source_urls=sources, add_additional_sources=True)
# await researcher.conduct_research()
# report = await researcher.write_report()
# return report, researcher

# if __name__ == "__main__":
# query = "What are the latest advancements in AI?"
# report_type = "research_report"
# sources = ["https://en.wikipedia.org/wiki/Artificial_intelligence", "https://www.ibm.com/watson/ai"] # query is related

# report, researcher = asyncio.run(get_report(query, report_type, sources))
# print(report)

# print(f"\nLength of the context = {len(researcher.get_research_context())}") # Should be non-zero: the query is related to the pages, and add_additional_sources additionally triggers a default web search to gather even more context