[data computer] add corpus similarity ws #138

Merged 15 commits on Oct 21, 2024
154 changes: 154 additions & 0 deletions services/data-computer/README.md
@@ -256,3 +256,157 @@ cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fe

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```


### v1/similarity

Compares short documents (titles, sentences, short abstracts) with one another and returns, for each document, the documents that are similar to it. This route is best used with a corpus of at least 6 to 7 documents.

An optional `output` parameter selects the type of output. Every mode returns the similar documents together with their similarity scores:
- 0 (default): the algorithm automatically chooses the documents most similar to each document.
- 1: the algorithm returns, for each document, all the other documents, sorted by proximity (most similar first).
- n (an integer greater than 1): the algorithm returns, for each document, at most the n closest documents, sorted by proximity (most similar first).
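
The three modes can be sketched as a small selection function. This is a minimal illustration based on the `corpus-similarity.py` script added in this pull request, not a drop-in replacement: the name `select_neighbours` and its signature are hypothetical, the 0.6 threshold is the one the script uses, and the inputs are assumed to be already sorted by decreasing score.

```python
def select_neighbours(ids, scores, output=0, threshold=0.6):
    """Pick the neighbours to return (hypothetical helper, for illustration).

    ids and scores are parallel lists sorted by decreasing score.
    """
    if output == 0:
        # Automatic mode: nothing is returned if even the best match is weak.
        if not scores or scores[0] < threshold:
            return [], []
        if len(scores) == 1:
            return list(ids), list(scores)
        # Cut the ranking at the largest drop between consecutive scores.
        drops = [a - b for a, b in zip(scores, scores[1:])]
        cut = drops.index(max(drops)) + 1
        return list(ids[:cut]), list(scores[:cut])
    if output == 1:
        # Everything, most similar first.
        return list(ids), list(scores)
    # At most `output` neighbours.
    return list(ids[:output]), list(scores[:output])
```

In automatic mode the cut is placed where the score falls the most, so a document with two close matches and one distant one keeps only the two close matches.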

For example, using example-similarity-json.tar.gz with the default output parameter (0), we get:

```json
[
  {
    "id": "Titre 1",
    "value": {
      "similarity": [
        "Titre 4",
        "Titre 2"
      ],
      "score": [
        0.9411764705882353,
        0.9349112426035503
      ]
    }
  },
  {
    "id": "Titre 2",
    "value": {
      "similarity": [
        "Titre 1"
      ],
      "score": [
        0.9349112426035503
      ]
    }
  },
  {
    "id": "Titre 3",
    "value": {
      "similarity": [
        "Titre 4"
      ],
      "score": [
        0.8888888888888888
      ]
    }
  },
  {
    "id": "Titre 4",
    "value": {
      "similarity": [
        "Titre 1"
      ],
      "score": [
        0.9411764705882353
      ]
    }
  }
]
```

With the parameter output=3, we get:

```json
[
  {
    "id": "Titre 1",
    "value": {
      "similarity": [
        "Titre 4",
        "Titre 2",
        "Titre 3"
      ],
      "score": [
        0.9411764705882353,
        0.9349112426035503,
        0.8757396449704142
      ]
    }
  },
  {
    "id": "Titre 2",
    "value": {
      "similarity": [
        "Titre 1",
        "Titre 4",
        "Titre 3"
      ],
      "score": [
        0.9349112426035503,
        0.8888888888888888,
        0.8651685393258427
      ]
    }
  },
  {
    "id": "Titre 3",
    "value": {
      "similarity": [
        "Titre 4",
        "Titre 1",
        "Titre 2"
      ],
      "score": [
        0.8888888888888888,
        0.8757396449704142,
        0.8651685393258427
      ]
    }
  },
  {
    "id": "Titre 4",
    "value": {
      "similarity": [
        "Titre 1",
        "Titre 3",
        "Titre 2"
      ],
      "score": [
        0.9411764705882353,
        0.8888888888888888,
        0.8888888888888888
      ]
    }
  }
]
```
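
In these responses, `similarity` and `score` are parallel arrays: the score at position i belongs to the document at position i. A client may find it easier to pair them up. The sketch below assumes the result has been read into a string holding the JSON array shown above; the helper name `pair_neighbours` is hypothetical.

```python
import json


def pair_neighbours(result_json):
    """Map each document id to a list of (similar id, score) pairs."""
    records = json.loads(result_json)
    return {
        record["id"]: list(
            zip(record["value"]["similarity"], record["value"]["score"])
        )
        for record in records
    }
```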

#### URL parameter(s)

| name | description |
| ------------------- | ------------------------------------------- |
| indent (true/false) | Indent the result that is returned immediately |
| output (0,1,n) | Choice of output |

#### HTTP header(s)

| name | description |
| ------ | ------------------------------------------------------------ |
| X-Hook | URL to call once the result is available (optional) |

#### Command-line example


```bash
# Send data for batch processing
cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/similarity" > output.json

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```
11 changes: 11 additions & 0 deletions services/data-computer/examples.http
@@ -113,3 +113,14 @@ X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-json.tar.gz


###
# @name v1Similarity
POST {{host}}/v1/similarity HTTP/1.1
Content-Type: application/x-tar
X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-similarity-json.tar.gz

1 change: 1 addition & 0 deletions services/data-computer/tests.hurl
@@ -73,3 +73,4 @@ HTTP 200
[{"id":"#1","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"a"}},{"id":"#2","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"b"}},{"id":"#3","value":{"sample":1,"frequency":0.3333333333333333,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"c"}},{"id":"#4","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"a"}},{"id":"#5","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"b"}}]

# TODO: add the two other routes (v1GraphSegment, v1Lda)
# TODO: add the rapido and similarity routes
65 changes: 65 additions & 0 deletions services/data-computer/v1/corpus-similarity.ini
@@ -0,0 +1,65 @@
# OpenAPI Documentation - JSON format (dot notation)
mimeType = application/json

post.operationId = post-v1-corpus-similarity
post.description = Web service that computes the similarity between the documents of a corpus
post.summary = Three output modes are available
post.tags.0 = data-computer
post.requestBody.content.application/x-tar.schema.type = string
post.requestBody.content.application/x-tar.schema.format = binary
post.requestBody.required = true
post.responses.default.description = Information needed to retrieve the data once it is ready
post.parameters.0.description = Indent the resulting JSON
post.parameters.0.in = query
post.parameters.0.name = indent
post.parameters.0.schema.type = boolean
post.parameters.1.description = URL to call when processing has finished
post.parameters.1.in = header
post.parameters.1.name = X-Webhook-Success
post.parameters.1.schema.type = string
post.parameters.1.schema.format = uri
post.parameters.1.required = false
post.parameters.2.description = URL to call when processing has failed
post.parameters.2.in = header
post.parameters.2.name = X-Webhook-Failure
post.parameters.2.schema.type = string
post.parameters.2.schema.format = uri
post.parameters.2.required = false

post.parameters.3.in = query
post.parameters.3.name = output
post.parameters.3.schema.type = integer
post.parameters.3.description = Number of similar documents to show in the output: 0 for automatic, 1 to show them all, any other number to show at most that many elements.


[env]
path = generator
value = corpus-similarity

[use]
plugin = basics
plugin = spawn

# Step 1 (generic): load the corpus file
[delegate]
file = charger.cfg

# Step 2 (generic): process the received items asynchronously
[fork]
standalone = true
logger = logger.cfg

# Step 2.1 (specific): run a computation over all the received items
[fork/exec]
# command should be executable !
command = ./v1/corpus-similarity.py
args = fix('-p')
args = env('output', "0")

# Step 2.2 (generic): save the result and signal that processing is finished
[fork/delegate]
file = recorder.cfg

# Step 3: immediately return a single element describing how to retrieve the result once it is ready
[delegate]
file = recipient.cfg
51 changes: 51 additions & 0 deletions services/data-computer/v1/corpus-similarity.py
@@ -0,0 +1,51 @@
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
import json
import sys
from difflib import SequenceMatcher

import numpy as np

# Load every input line; each line is a JSON array whose first item
# holds the document "id" and its text "value".
all_data = []
for line in sys.stdin:
    all_data.append(json.loads(line))

# Output mode passed after '-p': 0 = automatic, 1 = all, n = at most n.
print("ARGV : ", sys.argv, file=sys.stderr)
output = int(sys.argv[sys.argv.index('-p') + 1] if '-p' in sys.argv else 0)
print("Output is ", output, file=sys.stderr)

for line in all_data:
    data = line[0]
    currentTitle = data['value']
    currentId = data['id']
    idList = []
    ratioList = []
    # Score the current document against every other document.
    for line_cmp in all_data:
        data_cmp = line_cmp[0]
        otherId, title = data_cmp["id"], data_cmp["value"]
        if currentId == otherId:
            continue
        idList.append(otherId)
        ratioList.append(SequenceMatcher(None, currentTitle, title).ratio())

    # Sort by decreasing score; unlike zip(*...), this also works on an empty list.
    ranked = sorted(zip(ratioList, idList), reverse=True)
    ratioList = [ratio for ratio, _ in ranked]
    idList = [otherId for _, otherId in ranked]

    if output == 0:
        if not ratioList or ratioList[0] < 0.6:
            # No neighbour is similar enough.
            sim, score = [], []
        elif len(ratioList) == 1:
            # A single candidate leaves no gap to look for.
            sim, score = idList, ratioList
        else:
            # Cut the ranking at the largest drop between consecutive scores.
            diff = -np.diff(ratioList)
            argmx = int(np.argmax(diff - np.mean(diff)))
            sim = idList[:argmx + 1]
            score = ratioList[:argmx + 1]
    elif output == 1:
        sim, score = idList, ratioList
    else:
        sim = idList[:output]
        score = ratioList[:output]
    sys.stdout.write(json.dumps({"id": currentId, "value": {"similarity": sim, "score": score}}))
    sys.stdout.write('\n')
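
The score used by this script is `difflib.SequenceMatcher.ratio()`, which the standard library defines as 2*M/T, where M is the number of matching elements and T the total length of both sequences. The sketch below shows that measure on its own and a simple ranking built on it; the names `similarity` and `rank_neighbours` and the dictionary input are assumptions made for illustration, not part of the service.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Character-level similarity in [0, 1], as computed by difflib."""
    return SequenceMatcher(None, a, b).ratio()


def rank_neighbours(documents):
    """documents maps an id to its text.

    Returns, for each id, the other ids with their scores, most similar first.
    """
    ranking = {}
    for doc_id, text in documents.items():
        scored = [
            (similarity(text, other_text), other_id)
            for other_id, other_text in documents.items()
            if other_id != doc_id
        ]
        scored.sort(reverse=True)
        ranking[doc_id] = [(other_id, score) for score, other_id in scored]
    return ranking
```

Because every document is compared with every other one, the cost grows quadratically with the corpus size, which fits the short titles and abstracts this route targets.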