Merge pull request #28 from ittia-research/dev
Docs: add factdb, datasets; update to-do
etwk authored Oct 12, 2024
2 parents 80e6c3c + 70a77cc commit e485644
Showing 6 changed files with 91 additions and 1 deletion.
11 changes: 11 additions & 0 deletions docs/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,11 @@
Projects related docs:
- [To-do](./to-do.md): project to-do
- [Workflow](./workflow.md): workflow related docs and to-do
- [FactDB](./factdb.md): a database experiment
- [Experiences](./experiences.md): project experiences
- [Change-log](./changelog.md): main project changes

## Reminders
LLMs are getting better and better at their jobs; we need to be aware of the temptation to outsource everything to them, including the foundation (facts) and decision making (thoughts, verdicts).

Be aware of closed information ecosystems.
14 changes: 14 additions & 0 deletions docs/datasets.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
Collection of datasets and related models, tools, code, etc.

## Index & Search
- [Google Cloud Datasets](https://cloud.google.com/datasets)
- [Google Dataset Search](https://datasetsearch.research.google.com)

## Datasets
### Data Commons
A knowledge database by Google, covering public as well as private data.
- [Concept](https://docs.datacommons.org/data_model.html)

Models:
- [DataGemma](https://blog.google/technology/ai/google-datagemma-ai-llm/)
  - [Paper](https://arxiv.org/abs/2409.13741)
1 change: 0 additions & 1 deletion docs/experience.md → docs/experiences.md
Original file line number Diff line number Diff line change
@@ -1,3 +1,2 @@
## Prompt
- Unclear reasoning or logic in a prompt can have a huge impact on the LLM's reasoning ability.

39 changes: 39 additions & 0 deletions docs/factdb.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,39 @@
Media, including social media, should follow some basic standards.

Words are not just words: they expand, and expand differently for different people at different times. They are like seeds from which a web grows outward. One goal of countering disinformation is to stop the wrong ones from expanding.

There are some common expansion directions, for example:
- Connecting an object with a bad thing: blame

## Goals
- Index the web in the age of automation
- [ ] What are the new data (database) standards in the age of AI?
  - [ ] How do LLMs store facts?
  - [ ] Maybe we can learn how to build a better database by separating the fact part from the LLM model?
  - [ ] With a separate and accurate way to store facts, maybe a better way to provide facts to an LLM is to connect the database to the LLM natively.
  - Research:
    - https://arxiv.org/pdf/2308.09124
    - https://arxiv.org/pdf/2310.05177
- [ ] With the rapid development of the information environment, humans alone are destined to lose. The solution is machine vs. machine.

## To-do
### Get Started
- [ ] Collect examples of mis/dis-information. (Anything related to facts can be used as an example; focus on mis/dis-information for now.)
- [ ] How to connect parts of one event in the database?
- [ ] How to connect entities to their roots, and what to choose as roots?
- [ ] From low-level factual events to high-level summary events.
- [ ] Split the examples into events: time range, known entities, happenings (split so that each can be matched exactly to a source).
- [ ] When a statement is a general conclusion, when and how should its related facts be updated to the latest?
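The event-splitting step above could be sketched with a simple record type. All field names here are assumptions for illustration, not an existing schema:

```python
from dataclasses import dataclass, field

@dataclass
class FactEvent:
    """One atomic factual event, split out of a larger statement.

    Field names are illustrative; the real schema is still an open question.
    """
    time_range: tuple  # (start, end) as ISO-8601 strings
    entities: list = field(default_factory=list)  # known entities involved
    happening: str = ""  # what happened, phrased so it can match a source exactly
    sources: list = field(default_factory=list)  # URLs backing the event

# Splitting one example statement into two atomic events
events = [
    FactEvent(("2024-01-01", "2024-01-31"), ["Entity A"], "signed agreement X"),
    FactEvent(("2024-02-01", "2024-02-01"), ["Entity A", "Entity B"], "met in City C"),
]
```

Keeping each event atomic is what allows exact matching back to a single source.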

### Ensure
- [ ] Semantic search as well as exact search
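A toy illustration of why both search modes are needed. Token overlap stands in for a real embedding model here, which is an assumption for demonstration only:

```python
def exact_search(query: str, docs: list) -> list:
    # Exact: the query must appear verbatim in the document.
    return [d for d in docs if query.lower() in d.lower()]

def semantic_search(query: str, docs: list, min_overlap: float = 0.5) -> list:
    # Stand-in for embedding similarity: fraction of query tokens present.
    q = set(query.lower().split())
    return [d for d in docs if len(q & set(d.lower().split())) / len(q) >= min_overlap]

docs = ["The summit was held in Geneva in 2023.",
        "Geneva hosted a 2023 summit on climate."]
# Exact search matches only the first doc; the semantic stand-in catches both phrasings.
exact_hits = exact_search("summit was held", docs)
semantic_hits = semantic_search("2023 Geneva summit", docs)
```

A production system would replace the token-overlap function with vector search, but the need to run both modes side by side stays the same.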

## Database components
- actual event

## Case Study
### Missing Context
The facts are correct but missing context, which will lead people to misinterpret them.
```
A photo of people wearing T-shirts that say "Warts for Trump". These people might be named Walz, but they have never spoken to Tim Walz.
```
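Cases like this suggest a fact record needs an explicit missing-context field so that "true but misleading" can be flagged. A minimal sketch, with all field names hypothetical:

```python
missing_context_case = {
    "facts": ["A photo shows people wearing campaign T-shirts."],
    "facts_verified": True,  # each individual fact checks out
    "missing_context": "The people in the photo share a surname with, "
                       "but are unrelated to, the politician implied.",
    "likely_misreading": "Viewers assume the politician's relatives endorsed the opponent.",
}

def needs_context_warning(case: dict) -> bool:
    # Facts can be individually true yet misleading without their context.
    return case["facts_verified"] and bool(case["missing_context"])
```

The point of the boolean check is that verification alone is not the stopping condition; a verdict should also report what context is absent.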
26 changes: 26 additions & 0 deletions docs/to-do.md
Original file line number Diff line number Diff line change
@@ -1,10 +1,26 @@
## Goals
This topic is very complicated. Here we try to define some clear goals:
- Ability to fact-check given information.
- An open-source database of facts, with connections and hierarchy between facts, to build a foundation.
- Open access tools for real-time fact-checking.
- Broad assessment of media entity quality, including social media.
- Enrich first-hand facts and keep them afloat.

## Roadmap
- [ ] Check a single one-line statement.
- [ ] Check long paragraphs or a content source (URL, file, etc.).
- [ ] What are the ultimate goals?
- [ ] Long-term memory and caching.
- [ ] Fact-check standards and database.

## Alphabet
Context:
- Every side (both the question and the answer in a QA setup) has context.
- There might be multiple kinds, for example social context.
- It can be commonly accepted knowledge or extra facts.

- [ ] How to handle context in RAG? Integrity needs to be ensured.
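One way to preserve context integrity in RAG is to carry each chunk's surrounding context through retrieval instead of passing bare text to the LLM. A sketch under that assumption (the field names are hypothetical):

```python
from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    doc_title: str   # document-level context
    section: str     # local context within the document

def render_for_llm(chunks: list) -> str:
    # Keep each chunk tied to its context so the LLM never sees it bare.
    return "\n\n".join(
        f"[{c.doc_title} > {c.section}]\n{c.text}" for c in chunks
    )

chunks = [Chunk("Revenue rose 10%.", "Q3 Report", "Financials")]
prompt_context = render_for_llm(chunks)
```

Whether a breadcrumb prefix like this is enough, or whether adjacent text must also travel with the chunk, is exactly the open question in the to-do item above.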

## Work
### Frontend
- [ ] API: input a string or URL, output an analysis
@@ -76,3 +92,13 @@ DSPy
- [ ] Multi-language support.
- [ ] Add logging and long-term memory.
- [ ] Integrate with other fact-check services.

### Legal
- [ ] Copyright: citations, etc.

## Considerations
RAG:
- Order-Preserve RAG

## References
- [ ] https://github.com/ICTMCG/LLM-for-misinformation-research
1 change: 1 addition & 0 deletions src/pipeline/__init__.py
Original file line number Diff line number Diff line change
@@ -111,6 +111,7 @@ async def get_statements(self):

if not self.statements:
raise HTTPException(status_code=500, detail="No statements found")
        self.statements = self.statements[:2]  # TODO: limit to max 2 statements since long input isn't supported yet
logging.info(f"statements: {self.statements}")

# add statements to data with order
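The `[:2]` cap in this hunk is a stopgap. One possible direction, sketched here as an assumption rather than the project's plan, is to process long inputs in fixed-size batches instead of dropping statements:

```python
def batch_statements(statements: list, batch_size: int = 2) -> list:
    # Yield fixed-size batches so no statement is silently dropped.
    return [statements[i:i + batch_size]
            for i in range(0, len(statements), batch_size)]

# Five statements become three batches instead of being truncated to two.
batches = batch_statements(["s1", "s2", "s3", "s4", "s5"])
```

Each batch could then flow through the existing pipeline unchanged, with results merged in order afterwards.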
