A Python library for classifying legal judgments using the LangChain library and OpenAI.
Example use: On this case: https://www.caselaw.nsw.gov.au/decision/54a004453004262463c948bc
Prerequisites: Git and Poetry
Get a copy of the repo using git clone and then set up your Python environment and dependencies with poetry install:
git clone git@github.com:Sydney-Informatics-Hub/langchainlaw.git
cd langchainlaw
poetry install
The classify command will classify a directory containing judgments in the JSON format output by the nswcaselaw library, caching the LLM responses and writing the results out to a spreadsheet. It is configured using a JSON file with the following format:
{
  "providers": {
    "OpenAI": {
      "api_key": "SECRET_GOES_HERE",
      "organization": "ORG_KEY_GOES_HERE",
      "model": "gpt-4o"
    }
  },
  "provider": "OpenAI",
  "temperature": 0,
  "rate_limit": 15,
  "prompts": "./tests/sample_prompts.xlsx",
  "input": "./input/",
  "output": "./output/results.xlsx",
  "cache": "./output/cache",
  "test_prompts": "./outputs/test_prompts.txt"
}
You should make a copy of config.example.json as config.json before you add your API keys.
The configuration values for input and output files and directories are as follows:
- prompts: spreadsheet with the prompt questions - see below for format
- input: all .json files here will be read as cases
- output: results are written to this spreadsheet, one line per case
- cache: a directory will be created in this for each case, and results from the LLM for each prompt will be written to it in a file with that prompt's name
- test_prompts: text file to write all prompts when using --test
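Before running the classifier it can help to sanity-check a config file. The following is a minimal sketch, not part of langchainlaw: `check_config` and `REQUIRED_KEYS` are hypothetical names, and treating these particular keys as mandatory is an assumption based on the sample configuration above.

```python
from pathlib import Path

# Keys taken from the sample configuration above. Which of them the library
# actually requires is an assumption; this helper is hypothetical.
REQUIRED_KEYS = ["providers", "provider", "prompts", "input", "output", "cache"]


def check_config(config):
    """Return a list of human-readable problems found in a config dict."""
    problems = [f"missing key: {key}" for key in REQUIRED_KEYS if key not in config]

    # The selected provider should have a matching entry under "providers".
    provider = config.get("provider")
    if provider is not None and provider not in config.get("providers", {}):
        problems.append(f"provider {provider!r} has no entry under 'providers'")

    # The input directory is read for .json case files, so it should exist.
    input_dir = config.get("input")
    if input_dir is not None and not Path(input_dir).is_dir():
        problems.append(f"input directory not found: {input_dir}")

    return problems
```

An empty list means none of these checks failed; it does not guarantee that the API key or model name is valid.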
To run the classify command, use poetry run:
poetry run classify --config config.json
If you re-run the classifier, it will look in the cache for each case / prompt
combination and return a cached result if it exists, rather than going to the
LLM. To force the classifier to go to the LLM even if a cached result exists, use the --no-cache flag.
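The caching behaviour described above can be pictured as a simple file lookup keyed on case and prompt name. This is an illustrative sketch only, not the library's implementation: `cached_or_call`, its parameters, and the exact file layout (no file extension, UTF-8 text) are assumptions.

```python
from pathlib import Path


def cached_or_call(cache_dir, case_id, prompt_name, call_llm, use_cache=True):
    """Return the cached response for a case/prompt pair, calling the LLM on a miss.

    call_llm is any zero-argument function returning the response text.
    Passing use_cache=False mimics the effect of the --no-cache flag.
    """
    # One directory per case, one file per prompt (layout assumed).
    cache_file = Path(cache_dir) / case_id / prompt_name
    if use_cache and cache_file.exists():
        return cache_file.read_text(encoding="utf-8")

    response = call_llm()
    cache_file.parent.mkdir(parents=True, exist_ok=True)
    cache_file.write_text(response, encoding="utf-8")
    return response
```

In this sketch a forced call still overwrites the cache file, so the next cached run sees the latest response.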
Command-line options:
- --config FILE: specify the JSON config file
- --test: generate prompts and write them to the test_prompts file but don't call the LLM for classification
- --case CASEFILE: run the classifier for a single case, specified by its JSON filename
- --prompt PROMPT: run the classifier for only one prompt, specified by its name in the spreadsheet
- --no-cache: call the LLM even if there is a cached result for a prompt
GPT-4o sometimes adds 'notes' to its output even when instructed to return JSON - these notes are also saved to the cache, although they are ignored when building the results spreadsheet.
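If you read cached responses yourself, one way to separate the JSON from any surrounding notes is to decode the first JSON object and treat the rest as notes. This is a sketch under that assumption; `split_json_and_notes` is a hypothetical helper and may not match how langchainlaw handles notes internally.

```python
import json


def split_json_and_notes(response):
    """Return (parsed_json, notes) from an LLM reply that may wrap JSON in extra text."""
    start = response.find("{")
    if start == -1:
        # No JSON object at all: everything is notes.
        return None, response.strip()
    try:
        # raw_decode parses one JSON value and reports where it ended,
        # ignoring whatever text follows it.
        parsed, length = json.JSONDecoder().raw_decode(response[start:])
    except json.JSONDecodeError:
        return None, response.strip()
    notes = (response[:start] + response[start + length:]).strip()
    return parsed, notes
```

Taking the first `{` as the start of the JSON is a simplification: a note that itself contains a brace before the real answer would defeat it.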
You can use the Classifier object in your own Python scripts or notebooks:
from langchainlaw.langchainlaw import Classifier
from pandas import DataFrame
from pathlib import Path
import json

with open("config.json", "r") as cfh:
    config = json.load(cfh)

classifier = Classifier(config)
classifier.load_prompts(config["prompts"])

# classify a single case
output = classifier.classify("cases/123456789abcdef0.json")

# iterate over a directory and build a dataframe
results = []
for casefile in Path("cases").glob("*.json"):
    output = classifier.classify(casefile)
    results.append(classifier.as_dict(output))

df = DataFrame(results)
See the sample notebook for an example of using langchainlaw from a Jupyter notebook. To run this notebook locally use the following poetry command:
poetry run jupyter notebook notebook.ipynb
The notebook assumes that you have a config.json file with your OpenAI keys in the root directory of the repo.
Prompts are configured using an Excel spreadsheet; tests/sample_prompts.xlsx, referenced in the sample configuration above, is an example.
Cell A2 contains the system prompt: this is the message which is sent to the LLM as a System prompt and is used to set the persona for the rest of the chat. For example:
You are a legal research assistant helping an academic researcher to answer questions about a public judgment of a decision in inheritance law. You will be provided with the judgment and metadata as a JSON document. Please answer the questions about the judgment based only on information contained in the judgment. Where your answer comes from a specific paragraph in the judgment, provide the paragraph number as part of your answer. If you cannot answer any of the questions based on the judgment or metadata, do not make up information, but instead write "answer not found"
Cell A2 contains the template which is used to start each chat message. The string {judgment} is expanded to the JSON of the case being classified.
Based on the metadata and judgment in the following JSON {judgment},
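As a rough illustration of the expansion, the snippet below fills the `{judgment}` placeholder using plain Python string formatting. How langchainlaw performs the substitution internally is not shown here, and the `case` dictionary is made-up illustration data rather than real nswcaselaw output.

```python
import json

# Template text from the example above.
intro_template = "Based on the metadata and judgment in the following JSON {judgment},"

# Made-up stand-in for a case file; real nswcaselaw JSON will have other fields.
case = {"title": "Example v Example", "judgment": "1. The plaintiff seeks ..."}

# The placeholder is replaced by the JSON text of the whole case.
message_start = intro_template.format(judgment=json.dumps(case))
print(message_start)
```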
Each request to the LLM is a set of related questions configured with the prompts worksheet. The columns of this sheet are:
Prompt_name | return_type | repeats | prompt_question | return_instructions | additional_instructions | fields | question_description | example |
---|---|---|---|---|---|---|---|---|
prompt id | json or json_multiple | repeat json_multiple this many times | top-level question | description of JSON structure | additional instructions if required | unique field name for each sub-question | text of the sub-question | example answer |
For example, in the sample spreadsheet, the prompt dates has the following spreadsheet values:
Prompt_name | return_type | repeats | prompt_question | return_instructions | additional_instructions | fields | question_description | example |
---|---|---|---|---|---|---|---|---|
dates | json | | answer the following questions about the case: | Return your answer as a JSON object, following this example: | | filing_date | What is the filing date? DD/MM/YYYY | 5/6/2010 |
dates | | | | | | interlocutory | Does this judgment concern an interlocutory application? Answer "yes", "no" or "unclear" | yes |
dates | | | | | | interlocutory_date | If the judgment concerns an interlocutory application, what was the date of the application? DD/MM/YYYY | 4/3/2010 |
From these, the classifier will build the following prompt:
answer the following questions about the case:
Q1: what is the filing date? DD/MM/YYYY
Q2: does this judgment concern an interlocutory application? Answer "yes", "no" or "unclear"
Q3: if the judgment concerns an interlocutory application, what was the date of the application? DD/MM/YYYY
Return your answer as a JSON object, following this example:
{{
"filing_date": "5/6/2010",
"interlocutory": "yes",
"interlocutory_date": "4/3/2010"
}}
Note that the example JSON is constructed automatically from the example answers in the "example" column.
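The assembly step can be sketched as follows. `build_prompt` is a hypothetical helper written for illustration, not the classifier's own code. The doubled braces are assumed to be there because the prompt is later passed through a template engine that treats single braces as placeholders.

```python
import json

# Rows mirror the "dates" example above, using the spreadsheet column names.
rows = [
    {"fields": "filing_date",
     "question_description": "What is the filing date? DD/MM/YYYY",
     "example": "5/6/2010"},
    {"fields": "interlocutory",
     "question_description": 'Does this judgment concern an interlocutory application? Answer "yes", "no" or "unclear"',
     "example": "yes"},
    {"fields": "interlocutory_date",
     "question_description": "If the judgment concerns an interlocutory application, what was the date of the application? DD/MM/YYYY",
     "example": "4/3/2010"},
]


def build_prompt(prompt_question, return_instructions, rows):
    """Assemble numbered questions plus an example JSON answer (hypothetical helper)."""
    lines = [prompt_question]
    for number, row in enumerate(rows, start=1):
        lines.append(f"Q{number}: {row['question_description']}")
    lines.append(return_instructions)

    # Build the example answer from the "fields" and "example" columns.
    example = {row["fields"]: row["example"] for row in rows}
    example_json = json.dumps(example, indent=0)

    # Double the braces so a later template pass leaves them as literal braces.
    lines.append(example_json.replace("{", "{{").replace("}", "}}"))
    return "\n".join(lines)


prompt = build_prompt(
    "answer the following questions about the case:",
    "Return your answer as a JSON object, following this example:",
    rows,
)
print(prompt)
```

Unlike the prompt shown above, this sketch copies the question text verbatim from the rows, so capitalisation follows the spreadsheet.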
This project is partially funded by a 2022 University of Sydney Research Accelerator (SOAR) Prize awarded to Ben Chen.