-
Notifications
You must be signed in to change notification settings - Fork 76
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Added new annotator solving the NLI problem #311
base: dev
Are you sure you want to change the base?
Conversation
|
||
RUN mkdir /cache | ||
|
||
COPY . . |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
сначала надо копировать только файл с зависимостями и ставить их, а уже потом копировать всю папку. Потому в твоем текущем вараинте, зависимости будут переустаналиваться каждый раз, когда меняются файлы в папке.
RUN mkdir /convert_model | ||
RUN tar -xf nocontext_tf_model.tar.gz --directory /convert_model | ||
|
||
ENV TRAINED_MODEL_PATH model.h5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
вижу, что ты добавил файл model.h5 - веса на гит. Так нельзя, их надо скачивать, даже если файл маленький
sentry-sdk==0.12.3 | ||
jinja2<=3.0.3 | ||
Werkzeug<=2.0.3 | ||
protobuf==3.20.* |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Надо либо ==, либо <= фиксированной версии, без звездочек, пожалуйста
annotators/ConveRTBasedNLI/server.py
Outdated
candidates = request.json.get("candidates", []) | ||
history = request.json.get("history", []) | ||
logger.info(f"Candidates: {candidates}") | ||
logger.info(f"History: {history}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
лишние логи можно убрать в дебаг
annotators/ConveRTBasedNLI/server.py
Outdated
logger.info(f"History: {history}") | ||
result = annotator.candidate_selection(candidates, history) | ||
total_time = time.time() - start_time | ||
logger.info(f"Annotator candidate prediction time: {total_time: .3f}s") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
какой аннотатор? в лога надо оставить название компоненты, как у других
annotators/ConveRTBasedNLI/server.py
Outdated
logger.info(f"sentence: {response}") | ||
result = annotator.response_encoding(response) | ||
total_time = time.time() - start_time | ||
logger.info(f"Annotator response encoding time: {total_time: .3f}s") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
какой аннотатор? в лога надо оставить название компоненты, как у других
state_formatters/dp_formatters.py
Outdated
history = [u["text"] for u in dialog["bot_utterances"][-20:]] | ||
else: | ||
history = [] | ||
return [{"candidates": hypots, "history": history}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
нет, в батче длины должны быть одинаковые. У тебя один сэмпл истории, его надо дублировать. Потому что батчи могут перемешаться
state_formatters/dp_formatters.py
Outdated
hypots = [h["text"] for h in hypotheses] | ||
last_bot_utterances = [u["text"] for u in dialog["bot_utterances"][-20:]] | ||
last_bot_utterances = [last_bot_utterances for h in hypotheses] | ||
return [{"sentences": hypots, "last_bot_utterances": last_bot_utterances}] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
last_bot_utterances = [u["text"] for u in dialog["bot_utterances"][-20:]]
return [{"sentences": hypots, "last_bot_utterances": [last_bot_utterances] * len(hypots)}]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
дополнительно к комментам по файлам:
- пройтись black (black -l 120 .) и flake8 по всем файлам, иначе не смерджится
- разрешить мердж конфликты
- добавить новый аннотатор в три места: components.json в корневой папке + создать component.yml и pipeline.yml внутри папки твоего аннотатора. как это делать можно посмотреть в свежем деве, например, в annotators/asr
annotators/ConveRTBasedNLI/test.py
Outdated
'entailment': 0.0019739873241633177, | ||
'neutral': 0.0290225762873888, | ||
'contradiction': 0.9690034985542297}] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wild guess: не закреплен random seed где-то, где стоило бы закрепить?
либо модель не переведена в режим инференса, нужно проверить
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
не убирай числа, используй round() до двух знаков после запятой
но вообще лучше дополнительно проверить то что я написала про сид и инференс
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
curr_is_toxics.append(is_toxic_or_contr_utterance) | ||
|
||
if is_toxic_or_contr_utterance: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
если тут меняется так, лучше изменить sentry_sdk.capture_message и msg для логгера (стр 105-111), добавить инфу, что это не только badlisted phrases, но и contradiction. иначе сообщение будет очень путать потом)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
limits: | ||
memory: 1G | ||
reservations: | ||
memory: 1G |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ты проверял работу аннотатора внутри дрима? у меня начиналось отлично, а на третьем utterance он вылетел dream_convert-based-nli_1 exited with code 137
, памяти не хватило. обязательно нужно проверять на 3+ диалогах, чем длиннее, тем лучше, хотя бы пять-десять ходов юзера
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
сейчас ты выделяешь ему 1гб на работу, вероятно нужно больше, попробуй поднять и посмотреть docker stats
common/utils.py
Outdated
@@ -1289,6 +1289,13 @@ def is_toxic_or_badlisted_utterance(annotated_utterance): | |||
return toxic_result or any([badlist_result.get(bad, False) for bad in ["bad_words", "inappropriate", "profanity"]]) | |||
|
|||
|
|||
def is_contradiction_utterance(annotated_utterance): | |||
contradiction_result = annotated_utterance.get("annotations", {}).get("convert_based_nli")["decision"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
безопаснее всегда через get доставать и с default value, предлагаю переделать на contradiction_result = annotated_utterance.get("annotations", {}).get("convert_based_nli", {}).get("decision", "")
common/utils.py
Outdated
contradiction_result = True if "contradiction" in contradiction_result else False | ||
|
||
return contradiction_result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
читабельнее в одну строку: return "contradiction" in contradiction_result
ARG DATA_URL=https://github.com/davidalami/ConveRT/releases/download/1.0/nocontext_tf_model.tar.gz | ||
ARG NEL_URL=https://github.com/Kolpnick/dream/raw/convert_based_nli/annotators/ConveRTBasedNLI/model.h5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
нужно написать Федору Игнатову, чтобы он положил эти файлы нужно положить в share и дал путь, как их скачивать (нужно качать с нашего share, а не с гитхаба)
COPY requirements.txt . | ||
RUN pip install -r requirements.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
сначала установить все зависимости, потом уже скачивать модельку (перенести эти строчки до скачивания моделек)
with sentry_sdk.push_scope() as scope: | ||
scope.set_extra("utterance", skill_data["text"]) | ||
scope.set_extra("selected_skills", skill_data) | ||
sentry_sdk.capture_message("response selector got candidate with badlisted phrases") | ||
sentry_sdk.capture_message("response selector got candidate with badlisted phrases and detected contradiction") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or, не and
или или, не то и то)
msg = ( | ||
"response selector got candidate with badlisted phrases:\n" | ||
"response selector got candidate with badlisted phrases and detected contradiction:\n" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
same
annotators/ConveRTBasedNLI/test.py
Outdated
'entailment': 0.0019739873241633177, | ||
'neutral': 0.0290225762873888, | ||
'contradiction': 0.9690034985542297}] | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
не убирай числа, используй round() до двух знаков после запятой
но вообще лучше дополнительно проверить то что я написала про сид и инференс
reservations: | ||
memory: 1G |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
проверил, работает ли с памятью 1.5G в дриме? минимум на 3 диалогах и 7 репликах юзера
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
имею в виду пробовал ли поднимать весь дрим дистрибутив и с ним разговаривать
если нет, это нужно сделать
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
если вдруг что-то будет падать во время поднятия, подтяни и подмердж свежий дев, должно исправить
FROM python:3.7.4 | ||
|
||
ARG DATA_URL=http://files.deeppavlov.ai/tmp/nocontext_tf_model.tar.gz | ||
ARG NEL_URL=http://files.deeppavlov.ai/tmp/model.h5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
почему h5 не архив? может заархивировать? сколько весит файл?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
давай пути к моделям принимать в виде параметров без дефолтных значейний. значения задавтаь в докер компоуз
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Модель занимает немного места - всего 10,5 Мб весит (которая model.h5)
ARG NEL_URL=http://files.deeppavlov.ai/tmp/model.h5 | ||
|
||
ENV CACHE_DIR /cache | ||
ENV TRAINED_MODEL_PATH /data/nli_model/model.h5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
плохо задать переменную в докерфайле, лучше не задавтаь вообще - зачем переменная-то?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Переменную с cache можно убрать, а вот переменная с путём к модели нужна - вдруг захочется в другое место её загружать)
ENV CONVERT_MODEL_PATH /data/convert_model | ||
|
||
WORKDIR /src | ||
RUN mkdir /cache |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
почему кэш, а не в дата?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
что в кэше лежит?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Кэш нужен при обучении: если обучать модель с 0, то в папку cache закачиваются датасеты и сохраняются checkpoints модели
if is_toxic_utterance: | ||
is_toxic_or_contr_utterance = is_toxic_utterance or is_contr_utterance | ||
|
||
curr_is_toxics.append(is_toxic_or_contr_utterance) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
лучше сделать это отдельными параметрами, чтобы можно было по дом параметру в докер компоуз файле менять "фильтруем бэдлист или нет, фильтрум контрадикшен или нет"
# Conflicts: # assistant_dists/dream/dev.yml # components.tsv
RUN mkdir /cache | ||
RUN mkdir /data | ||
RUN mkdir /data/nli_model/ | ||
RUN mkdir /data/convert_model/ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
mkdir
может принимать на вход несколько аргкментов, так что можно собрать 4 строки в одну вида mkdir /cache /data ...
RUN curl -L $NLI_URL --output /tmp/nli_model.tar.gz && tar -xf /tmp/nli_model.tar.gz -C /data/nli_model && rm /tmp/nli_model.tar.gz | ||
RUN curl -L $CONVERT_URL --output /tmp/conv_model.tar.gz && tar -xf /tmp/conv_model.tar.gz -C /data/convert_model && rm /tmp/conv_model.tar.gz |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
можно собрать в один вызов RUN
через &&
No description provided.