This is a series of experiments on how effectively an LLM agent can analyze and improve an academic argument. The experiments focus on historical arguments, which make an excellent playground: they are non-technical yet grounded in verifiable facts. In particular, I noticed that LLM answers to historical questions are promising but broad enough that they cannot be assessed effectively without significant further information, and the claims often break down when the LLM is asked for clarification without access to tools. I aim to see how far methods such as LLM self-examination, structured data generation, and access to external data can go toward improving a model's effective reasoning on complex questions.
The project is a Jupyter Notebook built on the LangChain API, using Google's low-end model gemini-1.5-flash.
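As a rough sketch of the setup (not the notebook's exact code), the model can be instantiated through LangChain's Google GenAI integration along these lines; the prompt shown here is purely illustrative:

```python
# Minimal sketch of the model setup, assuming the langchain-google-genai
# package and a GOOGLE_API_KEY environment variable are available.
from langchain_google_genai import ChatGoogleGenerativeAI

# gemini-1.5-flash: a cheap, fast model suited to iterative experiments.
llm = ChatGoogleGenerativeAI(model="gemini-1.5-flash", temperature=0)

# Illustrative call: ask the model to critique a historical claim.
response = llm.invoke(
    "Identify the strongest and weakest points in the argument that "
    "the printing press was the primary cause of the Reformation."
)
print(response.content)
```

The notebook's actual chains (self-examination, structured output, tool access) build on this basic model handle.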