Commit

Merge pull request #138 from Inist-CNRS/services/data-computer/add-corpus-similarity-ws

[data computer] add corpus similarity ws
parmentf authored Oct 21, 2024
2 parents 8e3870b + ab663f1 commit 718e472
Showing 9 changed files with 416 additions and 41 deletions.
3 changes: 2 additions & 1 deletion package-lock.json


160 changes: 159 additions & 1 deletion services/data-computer/README.md
@@ -1,4 +1,4 @@
# ws-data-computer@2.15.0
# ws-data-computer@2.16.0

The `data-computer` service offers several **asynchronous** services for simple data computations and transformations.

@@ -256,3 +256,161 @@ cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fe

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```


### v1/corpus-similarity

Compares small documents (titles, sentences, short *abstracts*) with one another, and returns, for each document, the documents that are similar to it.
It is recommended to use this route with at least 6-7 documents in the corpus.

An optional `output` parameter selects the type of output according to its value:
- 0 (default): the algorithm automatically selects the documents most similar to each document
- 1: the algorithm returns, for each document, all the documents, ranked by proximity (most similar first)
- *n* (where *n* is an integer greater than 1): the algorithm returns, for each document, the *n* closest documents, ranked by proximity (most similar first), together with the similarity score of each document.
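As an illustration only (this is not the service's actual code), the three output modes can be thought of as different cutoffs applied to a document's ranked neighbor list; the `auto_cut` value below is a hypothetical placeholder for the automatic selection done by mode 0:

```python
# Hypothetical sketch of the three output modes; `auto_cut` stands in for
# the service's own automatic selection heuristic, which is not documented here.
def shape_output(ranked_ids, ranked_scores, output=0, auto_cut=2):
    if output == 1:          # full ranking, most similar first
        n = len(ranked_ids)
    elif output > 1:         # top-n ranking with scores
        n = output
    else:                    # output == 0: the service chooses the cutoff itself
        n = auto_cut
    return {"similarity": ranked_ids[:n], "score": ranked_scores[:n]}

ranked = ["Titre 4", "Titre 2", "Titre 3"]
scores = [0.9411764705882353, 0.9349112426035503, 0.8757396449704142]
print(shape_output(ranked, scores, output=3))
```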

> **Warning**: The ID field is used as the reference for each document.

For example, using `example-similarity-json.tar.gz` with the default `output` parameter (0) yields:

```json
[
{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2"
],
"score": [
0.9411764705882353,
0.9349112426035503
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9349112426035503
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4"
],
"score": [
0.8888888888888888
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9411764705882353
]
}
}
]
```

With `output=3`, the result is:

```json
[
{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2",
"Titre 3"
],
"score": [
0.9411764705882353,
0.9349112426035503,
0.8757396449704142
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1",
"Titre 4",
"Titre 3"
],
"score": [
0.9349112426035503,
0.8888888888888888,
0.8651685393258427
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4",
"Titre 1",
"Titre 2"
],
"score": [
0.8888888888888888,
0.8757396449704142,
0.8651685393258427
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1",
"Titre 3",
"Titre 2"
],
"score": [
0.9411764705882353,
0.8888888888888888,
0.8888888888888888
]
}
}
]
```
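A client consuming this format could, for instance, extract each document's closest match. The sketch below is hypothetical (not part of the service) and relies on the `similarity` arrays being sorted by decreasing score, as in the examples above:

```python
import json

# Hypothetical helper: given the JSON array returned by v1/corpus-similarity,
# build an {id: best match} mapping (first entry is the most similar).
def best_matches(results):
    return {
        doc["id"]: doc["value"]["similarity"][0]
        for doc in results
        if doc["value"]["similarity"]
    }

# Sample taken from the default-output example above.
sample = json.loads("""
[
  {"id": "Titre 1", "value": {"similarity": ["Titre 4", "Titre 2"],
                              "score": [0.9411764705882353, 0.9349112426035503]}},
  {"id": "Titre 2", "value": {"similarity": ["Titre 1"],
                              "score": [0.9349112426035503]}}
]
""")

print(best_matches(sample))  # {'Titre 1': 'Titre 4', 'Titre 2': 'Titre 1'}
```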

#### URL parameter(s)

| name                | description                            |
| ------------------- | -------------------------------------- |
| indent (true/false) | Indent the immediately returned result |
| output (0,1,n)      | Choice of output                       |

#### HTTP header(s)

| name   | description                                         |
| ------ | --------------------------------------------------- |
| X-Hook | URL to call when the result is available (optional) |

#### Command-line example


```bash
# Send data for batch processing
cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/corpus-similarity" > output.json

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```
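For reference, the retrieval step posts back the computing token returned by the first call; the token arrives as the `value` of the first element of the response array (the hurl test in this change captures it with `jsonpath "$[0].value"`). A hypothetical client-side helper building that retrieval body:

```python
import json

# Hypothetical helper (not part of the service): extract the computing token
# from the first response and build the body expected by the retrieve route.
def retrieve_body(first_response: str) -> str:
    token = json.loads(first_response)[0]["value"]
    return json.dumps([{"value": token}])

print(retrieve_body('[{"id": 1, "value": "abc123"}]'))  # [{"value": "abc123"}]
```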
11 changes: 11 additions & 0 deletions services/data-computer/examples.http
@@ -123,3 +123,14 @@ X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-json.tar.gz


###
# @name v1CorpusSimilarity
POST {{host}}/v1/corpus-similarity HTTP/1.1
Content-Type: application/x-tar
X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-similarity-json.tar.gz

66 changes: 33 additions & 33 deletions services/data-computer/package.json
@@ -1,35 +1,35 @@
{
  "private": true,
  "name": "ws-data-computer",
  "version": "2.15.0",
  "description": "Calculs sur fichier corpus compressé",
  "repository": {
    "type": "git",
    "url": "git+https://github.com/Inist-CNRS/web-services.git"
  },
  "keywords": [
    "ezmaster"
  ],
  "author": " <[email protected]>",
  "license": "MIT",
  "bugs": {
    "url": "https://github.com/Inist-CNRS/web-services/issues"
  },
  "homepage": "https://github.com/Inist-CNRS/web-services/#readme",
  "scripts": {
    "version:insert:readme": "sed -i \"s#\\(${npm_package_name}.\\)\\([\\.a-z0-9]\\+\\)#\\1${npm_package_version}#g\" README.md && git add README.md",
    "version:insert:swagger": "sed -i \"s/\\\"version\\\": \\\"[0-9]\\+.[0-9]\\+.[0-9]\\+\\\"/\\\"version\\\": \\\"${npm_package_version}\\\"/g\" swagger.json && git add swagger.json",
    "version:insert": "npm run version:insert:readme && npm run version:insert:swagger",
    "version:commit": "git commit -a -m \"release ${npm_package_name}@${npm_package_version}\"",
    "version:tag": "git tag \"${npm_package_name}@${npm_package_version}\" -m \"${npm_package_name}@${npm_package_version}\"",
    "version:push": "git push && git push --tags",
    "version": "npm run version:insert && npm run version:commit && npm run version:tag",
    "postversion": "npm run version:push",
    "build:dev": "docker build -t cnrsinist/${npm_package_name}:latest .",
    "start:dev": "npm run build:dev && docker run --name dev --rm --detach -p 31976:31976 cnrsinist/${npm_package_name}:latest",
    "stop:dev": "docker stop dev",
    "build": "docker build -t cnrsinist/${npm_package_name}:${npm_package_version} .",
    "start": "docker run --rm -p 31976:31976 cnrsinist/${npm_package_name}:${npm_package_version}",
    "publish": "docker push cnrsinist/${npm_package_name}:${npm_package_version}"
  }
  "private": true,
  "name": "ws-data-computer",
  "version": "2.16.0",
  "description": "Calculs sur fichier corpus compressé",
  "repository": {
    "type": "git",
    "url": "git+https://github.com/Inist-CNRS/web-services.git"
  },
  "keywords": [
    "ezmaster"
  ],
  "author": " <[email protected]>",
  "license": "MIT",
  "bugs": {
    "url": "https://github.com/Inist-CNRS/web-services/issues"
  },
  "homepage": "https://github.com/Inist-CNRS/web-services/#readme",
  "scripts": {
    "version:insert:readme": "sed -i \"s#\\(${npm_package_name}.\\)\\([\\.a-z0-9]\\+\\)#\\1${npm_package_version}#g\" README.md && git add README.md",
    "version:insert:swagger": "sed -i \"s/\\\"version\\\": \\\"[0-9]\\+.[0-9]\\+.[0-9]\\+\\\"/\\\"version\\\": \\\"${npm_package_version}\\\"/g\" swagger.json && git add swagger.json",
    "version:insert": "npm run version:insert:readme && npm run version:insert:swagger",
    "version:commit": "git commit -a -m \"release ${npm_package_name}@${npm_package_version}\"",
    "version:tag": "git tag \"${npm_package_name}@${npm_package_version}\" -m \"${npm_package_name}@${npm_package_version}\"",
    "version:push": "git push && git push --tags",
    "version": "npm run version:insert && npm run version:commit && npm run version:tag",
    "postversion": "npm run version:push",
    "build:dev": "docker build -t cnrsinist/${npm_package_name}:latest .",
    "start:dev": "npm run build:dev && docker run --name dev --rm --detach -p 31976:31976 cnrsinist/${npm_package_name}:latest",
    "stop:dev": "docker stop dev",
    "build": "docker build -t cnrsinist/${npm_package_name}:${npm_package_version} .",
    "start": "docker run --rm -p 31976:31976 cnrsinist/${npm_package_name}:${npm_package_version}",
    "publish": "docker push cnrsinist/${npm_package_name}:${npm_package_version}"
  }
}
6 changes: 3 additions & 3 deletions services/data-computer/swagger.json
@@ -3,7 +3,7 @@
"info": {
"title": "data-computer - Calculs sur fichier corpus compressé",
"summary": "Calculs sur un corpus compressé",
"version": "2.15.0",
"version": "2.16.0",
"termsOfService": "https://services.istex.fr/",
"contact": {
"name": "Inist-CNRS",
@@ -15,7 +15,7 @@
"x-comment": "Will be automatically completed by the ezs server."
},
{
"url": "http://vptdmjobs.intra.inist.fr:49191/",
"url": "http://vptdmjobs.intra.inist.fr:49196/",
"description": "Latest version for production",
"x-profil": "Standard"
}
@@ -30,4 +30,4 @@
}
}
]
}
}
89 changes: 86 additions & 3 deletions services/data-computer/tests.hurl
@@ -72,8 +72,92 @@ delay: 2000
HTTP 200
[{"id":"#1","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"a"}},{"id":"#2","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"b"}},{"id":"#3","value":{"sample":1,"frequency":0.3333333333333333,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"c"}},{"id":"#4","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"a"}},{"id":"#5","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"b"}}]

#
# group
################################ Test for Similarity ################################

POST {{host}}/v1/corpus-similarity
content-type: application/x-tar
x-hook: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
file,example-similarity-json.tar.gz;

HTTP 200
# Capture the computing token
[Captures]
computing_token: jsonpath "$[0].value"
[Asserts]
variable "computing_token" exists

# There should be a waiting time, representing the time taken to process the data.
# Fortunately, the data is small and the computing time short, so only a
# brief delay is needed.

# Version 4.1.0 of hurl added a delay option, whose value is in milliseconds.
# https://hurl.dev/blog/2023/09/24/announcing-hurl-4.1.0.html#add-delay-between-requests

POST {{host}}/v1/retrieve-json?indent=true
content-type: application/json
[Options]
delay: 1000
```
[
{
"value":"{{computing_token}}"
}
]
```

HTTP 200
[{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2"
],
"score": [
0.9411764705882353,
0.9349112426035503
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9349112426035503
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4"
],
"score": [
0.8888888888888888
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9411764705882353
]
}
}]


# TODO: add the two other routes (v1GraphSegment, v1Lda)
# TODO: add the rapido route

##################################### group-by ######################
POST {{host}}/v1/group-by
content-type: application/gzip
x-hook: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
@@ -109,4 +193,3 @@ HTTP 200
[{"id":"#1","value":["#1","#4"]},{"id":"#4","value":["#1","#4"]},{"id":"#2","value":["#2","#5"]},{"id":"#5","value":["#2","#5"]},{"id":"#3","value":["#3"]}]

#
# TODO: add the two other routes (v1GraphSegment, v1Lda)
