Merge pull request #28 from ittia-research/dev
Docs: add factdb, datasets; update to-do
etwk authored Oct 12, 2024
2 parents 80e6c3c + 70a77cc commit e485644
Showing 6 changed files with 91 additions and 1 deletion.
11 changes: 11 additions & 0 deletions docs/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
Projects related docs:
- [To-do](./to-do.md): project to-do
- [Workflow](./workflow.md): workflow related docs and to-do
- [FactDB](./factdb.md): a database experiment
- [Experiences](./experiences.md): project experiences
- [Change-log](./changelog.md): main project changes

## Reminders
LLMs are getting better and better at their jobs; we need to be aware of the temptation to outsource everything to them, including the foundation (facts) and decision making (thoughts, verdicts).

Be aware of closed information ecosystems.
14 changes: 14 additions & 0 deletions docs/datasets.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
Collection of datasets and related models, tools, code, etc.

## Index & Search
- [Google Cloud Datasets](https://cloud.google.com/datasets)
- [Google Dataset Search](https://datasetsearch.research.google.com)

## Datasets
### Data Commons
A knowledge database by Google, covering public as well as private data.
- [Concept](https://docs.datacommons.org/data_model.html)

Models:
- [DataGemma](https://blog.google/technology/ai/google-datagemma-ai-llm/)
  - [Paper](https://arxiv.org/abs/2409.13741)
1 change: 0 additions & 1 deletion docs/experience.md → docs/experiences.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,2 @@
## Prompt
- Unclear reasoning or logic in a prompt can have a huge impact on the LLM's reasoning ability.

39 changes: 39 additions & 0 deletions docs/factdb.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
Media, including social media, should follow some basic standards.

Words are not just words: they expand, and expand differently for different people at different times. They are like seeds from which a web grows outward. One goal of countering disinformation is to stop the wrong ones from expanding.

There are some common expansion directions, for example:
- Connecting an object with a bad thing: blame

## Goals
- Index the web in the age of automation
- [ ] What are the new data (database) standards in the age of AI?
  - [ ] How do LLMs store facts?
  - [ ] Maybe we can learn how to build a better database by separating the fact part from the LLM model?
  - [ ] With a separate and accurate way to store facts, maybe a better way to provide facts to an LLM is to connect the database to the LLM natively.
  - Research:
    - https://arxiv.org/pdf/2308.09124
    - https://arxiv.org/pdf/2310.05177
- [ ] With the rapid development of the information environment, humans alone are destined to lose. The solution is machine vs. machine.

## To-do
### Get Started
- [ ] Collect examples of mis/dis-information. (Anything related to facts can be used as an example; focus on mis/dis-information for now.)
- [ ] How to connect parts of one event in the database?
- [ ] How to connect entities to their roots, and what to choose as roots?
- [ ] From low-level factual events to high-level summary events.
- [ ] Split the examples into events: time range, known entities, happenings (split so that each can be matched exactly to a source).
- [ ] When a statement is a general conclusion, when and how should its related facts be updated to the latest?
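The event-splitting step above could be sketched with a simple record type. All field names here are assumptions for illustration, not an existing schema:

```python
from dataclasses import dataclass, field

@dataclass
class FactEvent:
    """One atomic factual event, split out of a larger statement.

    Field names are illustrative; the real schema is still an open question.
    """
    time_range: tuple  # (start, end) as ISO-8601 strings
    entities: list = field(default_factory=list)  # known entities involved
    happening: str = ""  # what happened, phrased so it can match a source exactly
    sources: list = field(default_factory=list)  # URLs backing the event

# Splitting one example statement into two atomic events
events = [
    FactEvent(("2024-01-01", "2024-01-31"), ["Entity A"], "signed agreement X"),
    FactEvent(("2024-02-01", "2024-02-01"), ["Entity A", "Entity B"], "met in City C"),
]
```

Keeping each event atomic is what allows exact matching back to a single source.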

### Ensure
- [ ] Semantic search as well as exact search
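A toy illustration of why both search modes are needed. Token overlap stands in for a real embedding model here, which is an assumption for demonstration only:

```python
def exact_search(query: str, docs: list) -> list:
    # Exact: the query must appear verbatim in the document.
    return [d for d in docs if query.lower() in d.lower()]

def semantic_search(query: str, docs: list, min_overlap: float = 0.5) -> list:
    # Stand-in for embedding similarity: fraction of query tokens present.
    q = set(query.lower().split())
    return [d for d in docs if len(q & set(d.lower().split())) / len(q) >= min_overlap]

docs = ["The summit was held in Geneva in 2023.",
        "Geneva hosted a 2023 summit on climate."]
# Exact search matches only the first doc; the semantic stand-in catches both phrasings.
exact_hits = exact_search("summit was held", docs)
semantic_hits = semantic_search("2023 Geneva summit", docs)
```

A production system would replace the token-overlap function with vector search, but the need to run both modes side by side stays the same.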

## Database components
- actual event

## Case Study
### Missing Context
The facts are correct but missing context, which will lead people to misinterpret them.
```
A photo of people wearing T-shirts that say "Warts for Trump". These people might be named Walz, but they have never spoken to Tim Walz.
```
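Cases like this suggest a fact record needs an explicit missing-context field so that "true but misleading" can be flagged. A minimal sketch, with all field names hypothetical:

```python
missing_context_case = {
    "facts": ["A photo shows people wearing campaign T-shirts."],
    "facts_verified": True,  # each individual fact checks out
    "missing_context": "The people in the photo share a surname with, "
                       "but are unrelated to, the politician implied.",
    "likely_misreading": "Viewers assume the politician's relatives endorsed the opponent.",
}

def needs_context_warning(case: dict) -> bool:
    # Facts can be individually true yet misleading without their context.
    return case["facts_verified"] and bool(case["missing_context"])
```

The point of the boolean check is that verification alone is not the stopping condition; a verdict should also report what context is absent.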
26 changes: 26 additions & 0 deletions docs/to-do.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,26 @@
## Goals
This topic is very complicated. Here we try to define some clear goals:
- Ability to fact-check given information.
- An open-source database of facts, with connections and hierarchy between facts, to build a foundation.
- Open access tools for real-time fact-checking.
- Broad assessment of media entity quality, including social media.
- Enrich first-hand facts and keep them afloat.

## Roadmap
- [ ] Check a single one-line statement.
- [ ] Check long paragraphs or a content source (URL, file, etc.).
- [ ] What are the ultimate goals?
- [ ] Long-term memory and caching.
- [ ] Fact-check standards and database.

## Alphabet
Context:
- Every side (both the question and the answer in a QA setup) has context.
- There might be multiple kinds, for example social context.
- It can be commonly accepted knowledge or extra facts.

- [ ] How to handle context in RAG? Integrity needs to be ensured.
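One way to preserve context integrity in RAG is to carry each chunk's surrounding context through retrieval instead of passing bare text to the LLM. A sketch under that assumption (the field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_title: str   # document-level context
    section: str     # local context within the document

def render_for_llm(chunks: list) -> str:
    # Keep each chunk tied to its context so the LLM never sees it bare.
    return "\n\n".join(
        f"[{c.doc_title} > {c.section}]\n{c.text}" for c in chunks
    )

chunks = [Chunk("Revenue rose 10%.", "Q3 Report", "Financials")]
prompt_context = render_for_llm(chunks)
```

Whether a breadcrumb prefix like this is enough, or whether adjacent text must also travel with the chunk, is exactly the open question in the to-do item above.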

## Work
### Frontend
- [ ] API: input a string or URL, output an analysis
@@ -76,3 +92,13 @@ DSPy
- [ ] Multi-language support.
- [ ] Add logging and long-term memory.
- [ ] Integrate with other fact-check services.

### Legal
- [ ] Copyright: citations, etc.

## Considerations
RAG:
- Order-Preserve RAG

## References
- [ ] https://github.com/ICTMCG/LLM-for-misinformation-research
1 change: 1 addition & 0 deletions src/pipeline/__init__.py
Original file line number Diff line number Diff line change
@@ -111,6 +111,7 @@ async def get_statements(self):

if not self.statements:
raise HTTPException(status_code=500, detail="No statements found")
        self.statements = self.statements[:2]  # TODO: limit to max 2 statements since long input isn't supported yet
logging.info(f"statements: {self.statements}")

# add statements to data with order
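The `[:2]` cap in this hunk is a stopgap. One possible direction, sketched here as an assumption rather than the project's plan, is to process long inputs in fixed-size batches instead of dropping statements:

```python
def batch_statements(statements: list, batch_size: int = 2) -> list:
    # Yield fixed-size batches so no statement is silently dropped.
    return [statements[i:i + batch_size]
            for i in range(0, len(statements), batch_size)]

# Five statements become three batches instead of being truncated to two.
batches = batch_statements(["s1", "s2", "s3", "s4", "s5"])
```

Each batch could then flow through the existing pipeline unchanged, with results merged in order afterwards.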
