diff --git a/docs/README.md b/docs/README.md
new file mode 100644
index 0000000..8711f8d
--- /dev/null
+++ b/docs/README.md
@@ -0,0 +1,11 @@
+Project-related docs:
+- [To-do](./to-do.md): project to-do
+- [Workflow](./workflow.md): workflow-related docs and to-do
+- [FactDB](./factdb.md): a database experiment
+- [Experiences](./experiences.md): project experiences
+- [Change-log](changelog.md): main project changes
+
+## Reminders
+LLMs are getting better and better at their jobs; we need to be wary of the temptation to outsource everything, including the foundation (facts) and decision making (thoughts, verdicts), to LLMs.
+
+Be aware of closed information ecosystems.
\ No newline at end of file
diff --git a/docs/datasets.md b/docs/datasets.md
new file mode 100644
index 0000000..87f19bc
--- /dev/null
+++ b/docs/datasets.md
@@ -0,0 +1,14 @@
+Collection of datasets and related models, tools, code, etc.
+
+## Index & Search
+- [Google Cloud Datasets](https://cloud.google.com/datasets)
+- [Google Dataset Search](https://datasetsearch.research.google.com)
+
+## Datasets
+### Data Commons
+A knowledge database by Google, covering public as well as private data.
+- [Concept](https://docs.datacommons.org/data_model.html)
+
+Models:
+- [DataGemma](https://blog.google/technology/ai/google-datagemma-ai-llm/)
+  * [paper](https://arxiv.org/abs/2409.13741)
\ No newline at end of file
diff --git a/docs/experience.md b/docs/experiences.md
similarity index 99%
rename from docs/experience.md
rename to docs/experiences.md
index a8f7cb9..f65b25d 100644
--- a/docs/experience.md
+++ b/docs/experiences.md
@@ -1,3 +1,2 @@
 ## Prompt
 - If there is unclear reasoning or logic in a prompt, which might have a huge impact on LLM's reasoning ability.
-
diff --git a/docs/factdb.md b/docs/factdb.md
new file mode 100644
index 0000000..c69d951
--- /dev/null
+++ b/docs/factdb.md
@@ -0,0 +1,39 @@
+Media, including social media, should follow some basic standards.
+
+Words are not just words: they expand, and they expand differently for different people at different times. They are like seeds from which a web grows outward. One goal of countering disinformation is to stop the wrong ones from expanding.
+
+There are some common expansion directions, for example:
+ - Connecting an object with a bad thing: blame
+
+## Goals
+- Index the web in the age of automation
+  - [ ] What are the new data (database) standards in the age of AI?
+  - [ ] How do LLMs store facts?
+  - [ ] Maybe we can learn how to build a better database by separating the fact part from the LLM model?
+  - [ ] With a separate and accurate way to store facts, maybe the better way to provide facts to an LLM is to connect the database to the LLM natively.
+  - research:
+    - https://arxiv.org/pdf/2308.09124
+    - https://arxiv.org/pdf/2310.05177
+- [ ] With the rapid development of the information environment, humans alone are destined to lose. The solution is machine vs. machine.
+
+## To-do
+### Getting Started
+- [ ] Collect examples of mis/disinformation. (Anything related to facts can be used as an example; focus on mis/disinformation for the moment.)
+  - [ ] How to connect the parts of one event in the database?
+  - [ ] How to connect entities to their roots, and what to choose as roots?
+  - [ ] From low-level factual events to high-level summary events.
+- [ ] Split the examples into events: time range, known entities, happenings (split so that each can be matched exactly to a source).
+- [ ] When a statement is a general conclusion, when and how should its related facts be updated to the latest?
+
+### Ensure
+- [ ] Semantic search as well as exact search
+
+## Database components
+- actual event
+
+## Case Study
+### Missing Context
+The facts are correct but missing context, which will lead people to misinterpret them.
+```
+A photo of people wearing T-shirts that say "Walz for Trump". These people might be named Walz, but they have never spoken to the Tim Walz.
+```
diff --git a/docs/to-do.md b/docs/to-do.md
index 329526c..c3a95c0 100644
--- a/docs/to-do.md
+++ b/docs/to-do.md
@@ -1,3 +1,11 @@
+## Goals
+This topic is very complicated. Here we try to define some clear goals:
+- Ability to fact-check given information.
+- An open-source database of facts, with connections and hierarchy between facts. Build a foundation.
+- Open-access tools for real-time fact-checking.
+- Broad media-entity quality assessment, including social media.
+- Enrich first-hand facts. Keep facts afloat.
+
 ## Roadmap
 - [ ] Check one line of single statement.
 - [ ] Check long paragraphs or a content source (URL, file, etc.)
@@ -5,6 +13,14 @@
 - [ ] Long-term memory and caching.
 - [ ] Fact-check standards and database.
 
+## Alphabet
+context:
+  - Every side (both question and answer in a QA setup) has context.
+  - There might be multiple kinds, for example social context.
+  - It can be commonly accepted knowledge or extra facts.
+
+- [ ] How to handle context in RAG? Need to ensure integrity.
+
 ## Work
 ### Frontend
 - [ ] API: Input string or url, output analysis
@@ -76,3 +92,13 @@ DSPy
 - [ ] Multi-language support.
 - [ ] Add logging and long-term memory.
 - [ ] Integrate with other fact-check services.
+
+### Legal
+- [ ] Copyright: citations, etc.
+
+## Considerations
+RAG:
+  - Order-Preserve RAG
+
+## References
+- [ ] https://github.com/ICTMCG/LLM-for-misinformation-research
\ No newline at end of file
diff --git a/src/pipeline/__init__.py b/src/pipeline/__init__.py
index 1ccd29e..4fdc16d 100644
--- a/src/pipeline/__init__.py
+++ b/src/pipeline/__init__.py
@@ -111,6 +111,7 @@ async def get_statements(self):
         if not self.statements:
             raise HTTPException(status_code=500, detail="No statements found")
 
+        self.statements = self.statements[:2]  # TODO: limit to 2 statements for now; long input is not supported yet.
         logging.info(f"statements: {self.statements}")
 
         # add statements to data with order
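The pipeline hunk above hard-codes `self.statements[:2]`. A minimal sketch of how that cap could be pulled into a small helper once long input is supported, assuming a hypothetical `cap_statements` helper and `MAX_STATEMENTS` constant (neither exists in this diff; names are illustrative only):

```python
import logging

# Hypothetical constant; the diff above hard-codes the cap at 2 because
# long input is not supported yet.
MAX_STATEMENTS = 2

def cap_statements(statements: list[str], limit: int = MAX_STATEMENTS) -> list[str]:
    """Return at most `limit` statements, logging whenever input is truncated."""
    if len(statements) > limit:
        logging.warning(
            "Truncating %d statements to %d (long input not supported yet)",
            len(statements),
            limit,
        )
    return statements[:limit]

# Usage mirroring the hunk: self.statements = cap_statements(self.statements)
```

Keeping the limit in one place would make it easier to raise (or remove) the cap later without touching the pipeline body.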