[data computer] add corpus similarity ws #138

Merged
merged 15 commits into from
Oct 21, 2024
3 changes: 2 additions & 1 deletion package-lock.json


160 changes: 159 additions & 1 deletion services/data-computer/README.md
@@ -1,4 +1,4 @@
# ws-data-computer@2.16.0

The `data-computer` service offers several **asynchronous** services for simple data computations and transformations.

@@ -256,3 +256,161 @@ cat input.tar.gz |curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fe

# When the corpus is processed, get the result
cat output.json |curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```


### v1/corpus-similarity

Compares small documents (titles, sentences, short *abstracts*) with one another, and returns for each document the documents that are similar to it.
It is recommended to use this route with at least 6-7 documents in the corpus.

An optional `output` parameter selects the type of output according to its value:
- 0 (default): the algorithm automatically selects the documents most similar to each document
- 1: the algorithm returns, for each document, all the other documents, ranked by proximity (most similar first)
- *n* (with *n* an integer greater than 1): the algorithm returns, for each document, the *n* closest documents, ranked by proximity (most similar first), together with the similarity score associated with each document

> **Warning**: the ID field is used as the reference for each document.
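The input format is not spelled out here; as a hedged sketch, each record in the gzipped JSON corpus is assumed to carry the `id` field used as the reference plus the text to compare (any field name other than `id` below is illustrative, not confirmed by this documentation):

```json
[
  { "id": "Titre 1", "value": "A short title, sentence, or abstract" },
  { "id": "Titre 2", "value": "Another short text to compare" }
]
```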

For example, using `example-similarity-json.tar.gz` with the default output parameter (0) yields:

```json
[
{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2"
],
"score": [
0.9411764705882353,
0.9349112426035503
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9349112426035503
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4"
],
"score": [
0.8888888888888888
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9411764705882353
]
}
}
]
```
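As an illustration only (not part of the service), a few lines of Python are enough to read the top match for each document out of a retrieved result; the structure assumed here is exactly the `id` / `value.similarity` / `value.score` layout shown above:

```python
import json

# Result structure as returned by v1/corpus-similarity (output=0),
# reproduced from the example above.
results = json.loads("""
[
  {"id": "Titre 1", "value": {"similarity": ["Titre 4", "Titre 2"],
                              "score": [0.9411764705882353, 0.9349112426035503]}},
  {"id": "Titre 3", "value": {"similarity": ["Titre 4"],
                              "score": [0.8888888888888888]}}
]
""")

def best_match(docs):
    """Map each document id to its closest neighbour and the associated score."""
    return {
        doc["id"]: (doc["value"]["similarity"][0], doc["value"]["score"][0])
        for doc in docs
        if doc["value"]["similarity"]  # skip documents with no neighbour
    }

for doc_id, (neighbour, score) in best_match(results).items():
    print(f"{doc_id} -> {neighbour} (score {score:.3f})")
```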

With the parameter output=3, the result is:

```json
[
{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2",
"Titre 3"
],
"score": [
0.9411764705882353,
0.9349112426035503,
0.8757396449704142
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1",
"Titre 4",
"Titre 3"
],
"score": [
0.9349112426035503,
0.8888888888888888,
0.8651685393258427
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4",
"Titre 1",
"Titre 2"
],
"score": [
0.8888888888888888,
0.8757396449704142,
0.8651685393258427
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1",
"Titre 3",
"Titre 2"
],
"score": [
0.9411764705882353,
0.8888888888888888,
0.8888888888888888
]
}
}
]
```

#### URL parameter(s)

| name                | description                            |
| ------------------- | -------------------------------------- |
| indent (true/false) | Indent the immediately returned result |
| output (0,1,n)      | Output selection                       |

#### HTTP header(s)

| name   | description                                            |
| ------ | ------------------------------------------------------ |
| X-Hook | URL to call when the result is available (optional)    |

#### Command-line example


```bash
# Send data for batch processing (append e.g. "?output=3" to change the output type)
cat input.tar.gz | curl --data-binary @- -H "X-Hook: https://webhook.site/dce2fefa-9a72-4f76-96e5-059405a04f6c" "http://localhost:31976/v1/corpus-similarity" > output.json

# When the corpus is processed, get the result
cat output.json | curl --data-binary @- "http://localhost:31976/v1/retrieve" > output.tar.gz
```
11 changes: 11 additions & 0 deletions services/data-computer/examples.http
@@ -123,3 +123,14 @@ X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-json.tar.gz


###
# @name v1CorpusSimilarity
POST {{host}}/v1/corpus-similarity HTTP/1.1
Content-Type: application/x-tar
X-Webhook-Success: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
X-Webhook-Failure: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9

< ./example-similarity-json.tar.gz

66 changes: 33 additions & 33 deletions services/data-computer/package.json
@@ -1,35 +1,35 @@
{
"private": true,
"name": "ws-data-computer",
"version": "2.16.0",
"description": "Calculs sur fichier corpus compressé",
"repository": {
"type": "git",
"url": "git+https://github.com/Inist-CNRS/web-services.git"
},
"keywords": [
"ezmaster"
],
"author": " <[email protected]>",
"license": "MIT",
"bugs": {
"url": "https://github.com/Inist-CNRS/web-services/issues"
},
"homepage": "https://github.com/Inist-CNRS/web-services/#readme",
"scripts": {
"version:insert:readme": "sed -i \"s#\\(${npm_package_name}.\\)\\([\\.a-z0-9]\\+\\)#\\1${npm_package_version}#g\" README.md && git add README.md",
"version:insert:swagger": "sed -i \"s/\\\"version\\\": \\\"[0-9]\\+.[0-9]\\+.[0-9]\\+\\\"/\\\"version\\\": \\\"${npm_package_version}\\\"/g\" swagger.json && git add swagger.json",
"version:insert": "npm run version:insert:readme && npm run version:insert:swagger",
"version:commit": "git commit -a -m \"release ${npm_package_name}@${npm_package_version}\"",
"version:tag": "git tag \"${npm_package_name}@${npm_package_version}\" -m \"${npm_package_name}@${npm_package_version}\"",
"version:push": "git push && git push --tags",
"version": "npm run version:insert && npm run version:commit && npm run version:tag",
"postversion": "npm run version:push",
"build:dev": "docker build -t cnrsinist/${npm_package_name}:latest .",
"start:dev": "npm run build:dev && docker run --name dev --rm --detach -p 31976:31976 cnrsinist/${npm_package_name}:latest",
"stop:dev": "docker stop dev",
"build": "docker build -t cnrsinist/${npm_package_name}:${npm_package_version} .",
"start": "docker run --rm -p 31976:31976 cnrsinist/${npm_package_name}:${npm_package_version}",
"publish": "docker push cnrsinist/${npm_package_name}:${npm_package_version}"
}
}
6 changes: 3 additions & 3 deletions services/data-computer/swagger.json
@@ -3,7 +3,7 @@
"info": {
"title": "data-computer - Calculs sur fichier corpus compressé",
"summary": "Calculs sur un corpus compressé",
"version": "2.16.0",
"termsOfService": "https://services.istex.fr/",
"contact": {
"name": "Inist-CNRS",
@@ -15,7 +15,7 @@
"x-comment": "Will be automatically completed by the ezs server."
},
{
"url": "http://vptdmjobs.intra.inist.fr:49196/",
"description": "Latest version for production",
"x-profil": "Standard"
}
@@ -30,4 +30,4 @@
}
}
]
}
89 changes: 86 additions & 3 deletions services/data-computer/tests.hurl
@@ -72,8 +72,92 @@ delay: 2000
HTTP 200
[{"id":"#1","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"a"}},{"id":"#2","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"b"}},{"id":"#3","value":{"sample":1,"frequency":0.3333333333333333,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"c"}},{"id":"#4","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"a"}},{"id":"#5","value":{"sample":2,"frequency":0.6666666666666666,"percentage":null,"sum":0,"count":5,"min":0,"max":0,"mean":0,"range":0,"midrange":0,"variance":0,"deviation":0,"population":3,"input":"b"}}]

################################ Test for Similarity ################################

POST {{host}}/v1/corpus-similarity
content-type: application/x-tar
x-hook: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
file,example-similarity-json.tar.gz;

HTTP 200
# Capture the computing token
[Captures]
computing_token: jsonpath "$[0].value"
[Asserts]
variable "computing_token" exists

# There should be a waiting time, representing the time taken to process the data.
# Fortunately, since the data is sparse and the computing time is short,
# a small delay is enough.

# Version 4.1.0 of hurl added a delay option, whose value is in milliseconds.
# https://hurl.dev/blog/2023/09/24/announcing-hurl-4.1.0.html#add-delay-between-requests

POST {{host}}/v1/retrieve-json?indent=true
content-type: application/json
[Options]
delay: 1000
```
[
{
"value":"{{computing_token}}"
}
]
```

HTTP 200
[{
"id": "Titre 1",
"value": {
"similarity": [
"Titre 4",
"Titre 2"
],
"score": [
0.9411764705882353,
0.9349112426035503
]
}
},
{
"id": "Titre 2",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9349112426035503
]
}
},
{
"id": "Titre 3",
"value": {
"similarity": [
"Titre 4"
],
"score": [
0.8888888888888888
]
}
},
{
"id": "Titre 4",
"value": {
"similarity": [
"Titre 1"
],
"score": [
0.9411764705882353
]
}
}]


# TODO: add the two other routes (v1GraphSegment, v1Lda)
# TODO: add the rapido route

##################################### group-by ######################
POST {{host}}/v1/group-by
content-type: application/gzip
x-hook: https://webhook.site/69300b22-a251-4c16-9905-f7ba218ae7e9
@@ -109,4 +193,3 @@ HTTP 200
[{"id":"#1","value":["#1","#4"]},{"id":"#4","value":["#1","#4"]},{"id":"#2","value":["#2","#5"]},{"id":"#5","value":["#2","#5"]},{"id":"#3","value":["#3"]}]

#