Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] ID Mismatch Error in VectorDB During Evaluation #1033

Open
e7217 opened this issue Dec 3, 2024 · 8 comments
Open

[BUG] ID Mismatch Error in VectorDB During Evaluation #1033

e7217 opened this issue Dec 3, 2024 · 8 comments
Assignees
Labels
bug Something isn't working High Priority

Comments

@e7217
Copy link
Contributor

e7217 commented Dec 3, 2024

Describe the bug
Hello,

I have a question regarding the error described below.
It appears that the error occurs because the id values in the corpus data located in the benchmark/data folder do not match the id values retrieved from the VectorDB.

  • corpus data :
    Image

  • vector db :
    Image

  • code :
    Image
    Image

Full logs

[12/03/24 16:58:20] ERROR    [__init__.py:60] >> Unexpected exception                                              __init__.py:60
                             ╭──────────────────────── Traceback (most recent call last) ────────────────────────╮               
                             │ /root/.pyenv/versions/3.10.13/lib/python3.10/runpy.py:196 in _run_module_as_main  │               
                             │                                                                                   │               
                             │   193 │   main_globals = sys.modules["__main__"].__dict__                         │               
                             │   194 │   if alter_argv:                                                          │               
                             │   195 │   │   sys.argv[0] = mod_spec.origin                                       │               
                             │ ❱ 196 │   return _run_code(code, main_globals, None,                              │               
                             │   197 │   │   │   │   │    "__main__", mod_spec)                                  │               
                             │   198                                                                             │               
                             │   199 def run_module(mod_name, init_globals=None,                                 │               
                             │                                                                                   │               
                             │ /root/.pyenv/versions/3.10.13/lib/python3.10/runpy.py:86 in _run_code             │               
                             │                                                                                   │               
                             │    83 │   │   │   │   │      __loader__ = loader,                                 │               
                             │    84 │   │   │   │   │      __package__ = pkg_name,                              │               
                             │    85 │   │   │   │   │      __spec__ = mod_spec)                                 │               
                             │ ❱  86 │   exec(code, run_globals)                                                 │               
                             │    87 │   return run_globals                                                      │               
                             │    88                                                                             │               
                             │    89 def _run_module_code(code, init_globals=None,                               │               
                             │                                                                                   │               
                             │ /root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/lib │               
                             │ s/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py:71 in <module> │               
                             │                                                                                   │               
                             │   68 │                                                                            │               
                             │   69 │   from debugpy.server import cli                                           │               
                             │   70 │                                                                            │               
                             │ ❱ 71 │   cli.main()                                                               │               
                             │   72                                                                              │               
                             │                                                                                   │               
                             │ /root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/lib │               
                             │ s/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py:5 │               
                             │ 01 in main                                                                        │               
                             │                                                                                   │               
                             │   498 │   │   │   │   "code": run_code,                                           │               
                             │   499 │   │   │   │   "pid": attach_to_pid,                                       │               
                             │   500 │   │   │   }[options.target_kind]                                          │               
                             │ ❱ 501 │   │   │   run()                                                           │               
                             │   502 │   except SystemExit as exc:                                               │               
                             │   503 │   │   log.reraise_exception(                                              │               
                             │   504 │   │   │   "Debuggee exited via SystemExit: {0!r}", exc.code, level="debug │               
                             │                                                                                   │               
                             │ /root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/lib │               
                             │ s/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py:3 │               
                             │ 51 in run_file                                                                    │               
                             │                                                                                   │               
                             │   348 │   log.describe_environment("Pre-launch environment:")                     │               
                             │   349 │                                                                           │               
                             │   350 │   log.info("Running file {0!r}", target)                                  │               
                             │ ❱ 351 │   runpy.run_path(target, run_name="__main__")                             │               
                             │   352                                                                             │               
                             │   353                                                                             │               
                             │   354 def run_module():                                                           │               
                             │                                                                                   │               
                             │ /root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/lib │               
                             │ s/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py:310 in run_path         │               
                             │                                                                                   │               
                             │   307 │   │   # Not a valid sys.path entry, so run the code directly              │               
                             │   308 │   │   # execfile() doesn't help as we want to allow compiled files        │               
                             │   309 │   │   code, fname = _get_code_from_file(run_name, path_name)              │               
                             │ ❱ 310 │   │   return _run_module_code(code, init_globals, run_name, pkg_name=pkg_ │               
                             │       script_name=fname)                                                          │               
                             │   311 │   else:                                                                   │               
                             │   312 │   │   # Finder is defined for path, so add it to                          │               
                             │   313 │   │   # the start of sys.path                                             │               
                             │                                                                                   │               
                             │ /root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/lib │               
                             │ s/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py:127 in _run_module_code │               
                             │                                                                                   │               
                             │   124 │   fname = script_name if mod_spec is None else mod_spec.origin            │               
                             │   125 │   with _TempModule(mod_name) as temp_module, _ModifiedArgv0(fname):       │               
                             │   126 │   │   mod_globals = temp_module.module.__dict__                           │               
                             │ ❱ 127 │   │   _run_code(code, mod_globals, init_globals, mod_name, mod_spec, pkg_ │               
                             │       script_name)                                                                │               
                             │   128 │   # Copy the globals of the temporary module, as they                     │               
                             │   129 │   # may be cleared when the temporary module goes away                    │               
                             │   130 │   return mod_globals.copy()                                               │               
                             │                                                                                   │               
                             │ /root/.vscode-server/extensions/ms-python.debugpy-2024.12.0-linux-x64/bundled/lib │               
                             │ s/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py:118 in _run_code        │               
                             │                                                                                   │               
                             │   115 │   run_globals.update(                                                     │               
                             │   116 │   │   __name__=mod_name, __file__=fname, __cached__=cached, __doc__=None, │               
                             │       __loader__=loader, __package__=pkg_name, __spec__=mod_spec                  │               
                             │   117 │   )                                                                       │               
                             │ ❱ 118 │   exec(code, run_globals)                                                 │               
                             │   119 │   return run_globals                                                      │               
                             │   120                                                                             │               
                             │   121                                                                             │               
                             │                                                                                   │               
                             │ /app/evaluate/main.py:186 in <module>                                             │               
                             │                                                                                   │               
                             │   183 │   │   │   opt["qa_data_path"] = os.path.join(current_dir,                 │               
"qa","parsed_{}_chunk_{}_qa.parquet".format(i,j))                           │               
                             │   184 │   │   │   opt["corpus_data_path"] = os.path.join(current_dir,             │               
"qa","parsed_{}_chunk_{}_corpus.parquet".format(i,j))                       │               
                             │   185 │   │   │   opt["project_dir"] = os.path.join(current_dir, "benchmark")     │               
                             │ ❱ 186 │   │   │   evaluate(**opt)                                                 │               
                             │   187                                                                             │               
                             │                                                                                   │               
                             │ /app/evaluate/main.py:158 in evaluate                                             │               
                             │                                                                                   │               
                             │   155 │   │   os.makedirs(project_dir)                                            │               
                             │   156 │                                                                           │               
                             │   157 │   evaluator = Evaluator(qa_data_path, corpus_data_path, project_dir=proje │               
                             │ ❱ 158 │   evaluator.start_trial(config, skip_validation=True)                     │               
                             │   159                                                                             │               
                             │   160                                                                             │               
                             │   161 if __name__ == "__main__":                                                  │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/evaluator.py:206 in start_trial                              │               
                             │                                                                                   │               
                             │   203 │   │   │   │   if i == 0:                                                  │               
                             │   204 │   │   │   │   │   previous_result = self.qa_data                          │               
                             │   205 │   │   │   │   logger.info(f"Running node line {node_line_name}...")       │               
                             │ ❱ 206 │   │   │   │   previous_result = run_node_line(                            │               
                             │   207 │   │   │   │   │   node_line, node_line_dir, previous_result, progress, ta │               
                             │   208 │   │   │   │   )                                                           │               
                             │   209                                                                             │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/node_line.py:52 in run_node_line                             │               
                             │                                                                                   │               
                             │   49 │                                                                            │               
                             │   50 │   summary_lst = []                                                         │               
                             │   51 │   for node in nodes:                                                       │               
                             │ ❱ 52 │   │   previous_result = node.run(previous_result, node_line_dir)           │               
                             │   53 │   │   node_summary_df = load_summary_file(                                 │               
                             │   54 │   │   │   os.path.join(node_line_dir, node.node_type, "summary.csv")       │               
                             │   55 │   │   )                                                                    │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/schema/node.py:57 in run                                     │               
                             │                                                                                   │               
                             │    54 │   def run(self, previous_result: pd.DataFrame, node_line_dir: str) -> pd. │               
                             │    55 │   │   logger.info(f"Running node {self.node_type}...")                    │               
                             │    56 │   │   input_modules, input_params = self.get_param_combinations()         │               
                             │ ❱  57 │   │   return self.run_node(                                               │               
                             │    58 │   │   │   modules=input_modules,                                          │               
                             │    59 │   │   │   module_params=input_params,                                     │               
                             │    60 │   │   │   previous_result=previous_result,                                │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/nodes/retrieval/run.py:173 in run_retrieval_node             │               
                             │                                                                                   │               
                             │   170 │   │   │   │   zip(modules, module_params),                                │               
                             │   171 │   │   │   )                                                               │               
                             │   172 │   │   )                                                                   │               
                             │ ❱ 173 │   │   semantic_results, semantic_times = run(semantic_modules, semantic_m │               
                             │   174 │   │   semantic_summary_df = save_and_summary(                             │               
                             │   175 │   │   │   semantic_modules,                                               │               
                             │   176 │   │   │   semantic_module_params,                                         │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/nodes/retrieval/run.py:71 in run                             │               
                             │                                                                                   │               
                             │    68 │   │   :return: First, it returns list of result dataframe.                │               
                             │    69 │   │   Second, it returns list of execution times.                         │               
                             │    70 │   │   """                                                                 │               
                             │ ❱  71 │   │   result, execution_times = zip(                                      │               
                             │    72 │   │   │   *map(                                                           │               
                             │    73 │   │   │   │   lambda task: measure_speed(                                 │               
                             │    74 │   │   │   │   │   task[0].run_evaluator,                                  │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/nodes/retrieval/run.py:73 in <lambda>                        │               
                             │                                                                                   │               
                             │    70 │   │   """
                             │    71 │   │   result, execution_times = zip(                                      │               
                             │    72 │   │   │   *map(                                                           │               
                             │ ❱  73 │   │   │   │   lambda task: measure_speed(                                 │               
                             │    74 │   │   │   │   │   task[0].run_evaluator,                                  │               
                             │    75 │   │   │   │   │   project_dir=project_dir,                                │               
                             │    76 │   │   │   │   │   previous_result=previous_result,                        │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/strategy.py:14 in measure_speed                              │               
                             │                                                                                   │               
                             │    11 │   Method for measuring execution speed of the function.                   │               
                             │    12 │   """                                                                     │               
                             │    13 │   start_time = time.time()                                                │               
                             │ ❱  14 │   result = func(*args, **kwargs)                                          │               
                             │    15 │   end_time = time.time()                                                  │               
                             │    16 │   return result, end_time - start_time                                    │               
                             │    17                                                                             │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/schema/base.py:27 in run_evaluator                           │               
                             │                                                                                   │               
                             │   24 │   ):                                                                       │               
                             │   25 │   │   log_to_file(content=f"Running {cls.__name__} with {cls.run_evaluator │               
                             │   26 │   │   instance = cls(project_dir, *args, **kwargs)                         │               
                             │ ❱ 27 │   │   result = instance.pure(previous_result, *args, **kwargs)             │               
                             │   28 │   │   del instance                                                         │               
                             │   29 │   │   return result                                                        │               
                             │   30                                                                              │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/utils/util.py:72 in wrapper                                  │               
                             │                                                                                   │               
                             │    69 │   def decorator_result_to_dataframe(func: Callable):                      │               
                             │    70 │   │   @functools.wraps(func)                                              │               
                             │    71 │   │   def wrapper(*args, **kwargs) -> pd.DataFrame:                       │               
                             │ ❱  72 │   │   │   results = func(*args, **kwargs)                                 │               
                             │    73 │   │   │   if len(column_names) == 1:                                      │               
                             │    74 │   │   │   │   df_input = {column_names[0]: results}                       │               
                             │    75 │   │   │   else:                                                           │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/nodes/retrieval/vectordb.py:73 in pure                       │               
                             │                                                                                   │               
                             │    70 │   │   queries = self.cast_to_run(previous_result)                         │               
                             │    71 │   │   pure_params = pop_params(self._pure, kwargs)                        │               
                             │    72 │   │   ids, scores = self._pure(queries, **pure_params)                    │               
                             │ ❱  73 │   │   contents = fetch_contents(self.corpus_df, ids)                      │               
                             │    74 │   │   return contents, ids, scores                                        │               
                             │    75 │                                                                           │               
                             │    76 │   def _pure(                                                              │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/utils/util.py:40 in fetch_contents                           │               
                             │                                                                                   │               
                             │    37 │   ):                                                                      │               
                             │    38 │   │   return list(map(lambda x: fetch_one_content(corpus_data, x, column_ │               
                             │    39 │                                                                           │               
                             │ ❱  40 │   result = flatten_apply(                                                 │               
                             │    41 │   │   fetch_contents_pure, ids, corpus_data=corpus_data, column_name=colu │               
                             │    42 │   )                                                                       │               
                             │    43 │   return result                                                           │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/utils/util.py:377 in flatten_apply                           │               
                             │                                                                                   │               
                             │   374 │   """                                                                     │               
                             │   375 │   df = pd.DataFrame({"col1": nested_list})                                │               
                             │   376 │   df = df.explode("col1")                                                 │               
                             │ ❱ 377 │   df["result"] = func(df["col1"].tolist(), **kwargs)                      │               
                             │   378 │   return df.groupby(level=0, sort=False)["result"].apply(list).tolist()   │               
                             │   379                                                                             │               
                             │   380                                                                             │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/utils/util.py:38 in fetch_contents_pure                      │               
                             │                                                                                   │               
                             │    35 │   def fetch_contents_pure(                                                │               
                             │    36 │   │   ids: List[str], corpus_data: pd.DataFrame, column_name: str         │               
                             │    37 │   ):                                                                      │               
                             │ ❱  38 │   │   return list(map(lambda x: fetch_one_content(corpus_data, x, column_ │               
                             │    39 │                                                                           │               
                             │    40 │   result = flatten_apply(                                                 │               
                             │    41 │   │   fetch_contents_pure, ids, corpus_data=corpus_data, column_name=colu │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/utils/util.py:38 in <lambda>                                 │               
                             │                                                                                   │               
                             │    35 │   def fetch_contents_pure(                                                │               
                             │    36 │   │   ids: List[str], corpus_data: pd.DataFrame, column_name: str         │               
                             │    37 │   ):                                                                      │               
                             │ ❱  38 │   │   return list(map(lambda x: fetch_one_content(corpus_data, x, column_ │               
                             │    39 │                                                                           │               
                             │    40 │   result = flatten_apply(                                                 │               
                             │    41 │   │   fetch_contents_pure, ids, corpus_data=corpus_data, column_name=colu │               
                             │                                                                                   │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/s │               
                             │ ite-packages/autorag/utils/util.py:57 in fetch_one_content                        │               
                             │                                                                                   │               
                             │    54 │   │   │   return None                                                     │               
                             │    55 │   │   fetch_result = corpus_data[corpus_data[id_column_name] == id_]      │               
                             │    56 │   │   if fetch_result.empty:                                              │               
                             │ ❱  57 │   │   │   raise ValueError(f"doc_id: {id_} not found in corpus_data.")    │               
                             │    58 │   │   else:                                                               │               
                             │    59 │   │   │   return fetch_result[column_name].iloc[0]                        │               
                             │    60 │   else:                                                                   │               
                             ╰───────────────────────────────────────────────────────────────────────────────────╯               
                             ValueError: doc_id: 5713eedc-1e2e-4395-8efe-06433f40a094 not found in corpus_data.                  
Exception ignored in: <function VectorDB.__del__ at 0x7f7096dde560>
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/autorag/nodes/retrieval/vectordb.py", line 66, in __del__
  File "/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/autorag/nodes/retrieval/base.py", line 28, in __del__
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/logging/__init__.py", line 1477, in info
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/logging/__init__.py", line 1624, in _log
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/logging/__init__.py", line 1634, in handle
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/logging/__init__.py", line 1696, in callHandlers
  File "/root/.pyenv/versions/3.10.13/lib/python3.10/logging/__init__.py", line 968, in handle
  File "/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/rich/logging.py", line 168, in emit
  File "/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/rich/logging.py", line 229, in render
  File "/root/.cache/pypoetry/virtualenvs/app-rag-sample-9TtSrW0h-py3.10/lib/python3.10/site-packages/rich/_log_render.py", line 43, in __call__
ImportError: sys.meta_path is None, Python is likely shutting down

When performing the evaluate step repeatedly, items seem to accumulate in the VectorDB.
Image

Should the number of items in the VectorDB collection always be reset to zero before performing the evaluate step?

It would be very helpful to understand the intention behind implementing this flow, as it will assist me in using this package more effectively.

Thank you for your assistance!

@e7217 e7217 added the bug Something isn't working label Dec 3, 2024
@vkehfdl1
Copy link
Contributor

vkehfdl1 commented Dec 3, 2024

@e7217 Did you use passage augmenter & validation?
From now, the passage augmenter is not supporting the validation process, so you have to unable the validation process while evaluation.
Using skip_validation parameter

@e7217
Copy link
Contributor Author

e7217 commented Dec 3, 2024

@vkehfdl1 umm... This is my current configuration, and I do not include the passage augmenter in any of my stages.

Image

vectordb:
- name: autorag_test
  db_type: milvus
  embedding_model: openai
  collection_name: autorag_test
  uri: http://192.xxx.xxx.xxx:19530
  embedding_batch: 50
  similarity_metric: l2
  # index_type: hnsw
  params: 
    nlist : 16384
node_lines:
- node_line_name: retrieve_node_line
  nodes:
    - node_type: retrieval
      strategy:
        metrics: [retrieval_f1, retrieval_ndcg, retrieval_map]
      top_k: 3
      modules:
        - module_type: vectordb
          vectordb: autorag_test
- node_line_name: post_retrieve_node_line
  nodes:
    - node_type: prompt_maker
      strategy:
        metrics:
          - metric_name: rouge
          - metric_name: sem_score
            embedding_model: openai
        generator_modules:
          - module_type: openai_llm
            llm: gpt-4o-mini
      modules:
        - module_type: fstring
          prompt:
          - | 
            단락을 읽고 질문에 답하세요. \n 질문 : {query} \n 단락: {retrieved_contents} \n 답변 :
          - |
            단락을 읽고 질문에 답하세요. 답할때 단계별로 천천히 고심하여 답변하세요. 반드시 단락 내용을 기반으로 말하고 거짓을 말하지 마세요. \n 질문: {query} \n 단락: {retrieved_contents} \n 답변 :
    - node_type: generator
      strategy:
        metrics: # bert_score 및 g_eval 사용 역시 추천합니다. 빠른 실행을 위해 여기서는 제외하고 하겠습니다.
          - metric_name: rouge
          - metric_name: sem_score
            embedding_model: openai
          - metric_name: bert_score
            lang: ko
          # - metric_name: g_eval
          #   metrics: ["consistency"]
          #   model: ['gpt-4o-mini']
      modules:
        - module_type: vllm
          llm: [
            Qwen/Qwen2.5-7B-Instruct,
            meta-llama/Llama-3.1-8B-instruct,
            meta-llama/Llama-3.2-3B-instruct,
            meta-llama/Llama-3.2-1B-instruct,
          ]
          # llm: meta-llama/Llama-3.2-3B-instruct
          temperature: [ 
            0,
            0.1, 
            # 0.5,
            ]
          # temperature: 0.1
          # tensor_parallel_size: 2 # If the gpu is two.
          max_tokens: 128
          max_model_len: 800
          gpu_memory_utilization: 0.8

@vkehfdl1
Copy link
Contributor

vkehfdl1 commented Dec 4, 2024

@e7217 Thanks for providing the configuration sir!
I will further investigate the issue. Thanks :)

@vkehfdl1 vkehfdl1 self-assigned this Dec 8, 2024
@vkehfdl1
Copy link
Contributor

Hi @e7217 Did you resolve the issue?
Because It was hard to reproduce the same error. Seems the id not in corpus data ingested earlier to the vector db.

@e7217
Copy link
Contributor Author

e7217 commented Dec 13, 2024

@vkehfdl1
Hello,

The issue has not been resolved yet. I’ve written some temporary code to work around the issue, but it’s not ideal for general use.

  1. Delete all trials in benchmark dir.
  2. Delete a collection in vectordb
  3. Try running the code for evaluation.

I haven’t found a proper solution yet, but I will give it another try.

Below is the temporary code I’m using:
Image
Image
Image
Image

@vkehfdl1
Copy link
Contributor

@e7217 Oh, in closer look of the YAML file, it looks like you used different vectordb at configuration and vectordb module?

Image

Can you double-check the YAML file?

@e7217
Copy link
Contributor Author

e7217 commented Dec 13, 2024

@vkehfdl1
Thanks for yout checking

I encountered an error when I changed vectordb: autorag_test to vectordb: milvus. I thought the value of vectordb should correspond to the name of an item defined in the vectordb list. However, should the value actually map to the db_type instead?

[12/13/24 10:29:50] ERROR    [__init__.py:60] >> Unexpected exception                                                                                                                           __init__.py:60
                             ╭────────────────────────────────────────────────────────────── Traceback (most recent call last) ───────────────────────────────────────────────────────────────╮               
                             │ /intelli-chatbot/evaluate/main.py:199 in <module>                                                                                                              │               
                             │                                                                                                                                                                │               
                             │   196 │   opt["qa_data_path"] = os.path.join(current_dir,                                                                                                      │               
                             │       "qa","parsed_{}_chunk_{}_qa.parquet".format(5,0))                                                                                                        │               
                             │   197 │   opt["corpus_data_path"] = os.path.join(current_dir,                                                                                                  │               
                             │       "qa","parsed_{}_chunk_{}_corpus.parquet".format(5,0))                                                                                                    │               
                             │   198 │   opt["project_dir"] = os.path.join(current_dir, "benchmark")                                                                                          │               
                             │ ❱ 199 │   evaluate(**opt)                                                                                                                                      │               
                             │   200 │                                                                                                                                                        │               
                             │   201 │   # for i in range(10):                                                                                                                                │               
                             │   202 │   #     for j in range(4):                                                                                                                             │               
                             │                                                                                                                                                                │               
                             │ /intelli-chatbot/evaluate/main.py:148 in evaluate                                                                                                              │               
                             │                                                                                                                                                                │               
                             │   145 │   │   os.makedirs(project_dir)                                                                                                                         │               
                             │   146 │                                                                                                                                                        │               
                             │   147 │   evaluator = Evaluator(qa_data_path, corpus_data_path, project_dir=project_dir)                                                                       │               
                             │ ❱ 148 │   evaluator.start_trial(config, skip_validation=True)                                                                                                  │               
                             │   149                                                                                                                                                          │               
                             │   150                                                                                                                                                          │               
                             │   151 def generate_sample_qa_set():                                                                                                                            │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/evaluator.py:214 in start_trial                          │               
                             │                                                                                                                                                                │               
                             │   211 │   │   │   │   if i == 0:                                                                                                                               │               
                             │   212 │   │   │   │   │   previous_result = self.qa_data                                                                                                       │               
                             │   213 │   │   │   │   logger.info(f"Running node line {node_line_name}...")                                                                                    │               
                             │ ❱ 214 │   │   │   │   previous_result = run_node_line(                                                                                                         │               
                             │   215 │   │   │   │   │   node_line, node_line_dir, previous_result, progress, task_eval                                                                       │               
                             │   216 │   │   │   │   )                                                                                                                                        │               
                             │   217                                                                                                                                                          │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/node_line.py:52 in run_node_line                         │               
                             │                                                                                                                                                                │               
                             │   49 │                                                                                                                                                         │               
                             │   50 │   summary_lst = []                                                                                                                                      │               
                             │   51 │   for node in nodes:                                                                                                                                    │               
                             │ ❱ 52 │   │   previous_result = node.run(previous_result, node_line_dir)                                                                                        │               
                             │   53 │   │   node_summary_df = load_summary_file(                                                                                                              │               
                             │   54 │   │   │   os.path.join(node_line_dir, node.node_type, "summary.csv")                                                                                    │               
                             │   55 │   │   )                                                                                                                                                 │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/schema/node.py:57 in run                                 │               
                             │                                                                                                                                                                │               
                             │    54 │   def run(self, previous_result: pd.DataFrame, node_line_dir: str) -> pd.DataFrame:                                                                    │               
                             │    55 │   │   logger.info(f"Running node {self.node_type}...")                                                                                                 │               
                             │    56 │   │   input_modules, input_params = self.get_param_combinations()                                                                                      │               
                             │ ❱  57 │   │   return self.run_node(                                                                                                                            │               
                             │    58 │   │   │   modules=input_modules,                                                                                                                       │               
                             │    59 │   │   │   module_params=input_params,                                                                                                                  │               
                             │    60 │   │   │   previous_result=previous_result,                                                                                                             │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/nodes/retrieval/run.py:173 in run_retrieval_node         │               
                             │                                                                                                                                                                │               
                             │   170 │   │   │   │   zip(modules, module_params),                                                                                                             │               
                             │   171 │   │   │   )                                                                                                                                            │               
                             │   172 │   │   )                                                                                                                                                │               
                             │ ❱ 173 │   │   semantic_results, semantic_times = run(semantic_modules, semantic_module_params)                                                                 │               
                             │   174 │   │   semantic_summary_df = save_and_summary(                                                                                                          │               
                             │   175 │   │   │   semantic_modules,                                                                                                                            │               
                             │   176 │   │   │   semantic_module_params,                                                                                                                      │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/nodes/retrieval/run.py:71 in run                         │               
                             │                                                                                                                                                                │               
                             │    68 │   │   :return: First, it returns list of result dataframe.                                                                                             │               
                             │    69 │   │   Second, it returns list of execution times.                                                                                                      │               
                             │    70 │   │   """                                                                                                                                              │               
                             │ ❱  71 │   │   result, execution_times = zip(                                                                                                                   │               
                             │    72 │   │   │   *map(                                                                                                                                        │               
                             │    73 │   │   │   │   lambda task: measure_speed(                                                                                                              │               
                             │    74 │   │   │   │   │   task[0].run_evaluator,                                                                                                               │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/nodes/retrieval/run.py:73 in <lambda>                    │               
                             │                                                                                                                                                                │               
                             │    70 │   │   """                                                                                                                                              │               
                             │    71 │   │   result, execution_times = zip(                                                                                                                   │               
                             │    72 │   │   │   *map(                                                                                                                                        │               
                             │ ❱  73 │   │   │   │   lambda task: measure_speed(                                                                                                              │               
                             │    74 │   │   │   │   │   task[0].run_evaluator,                                                                                                               │               
                             │    75 │   │   │   │   │   project_dir=project_dir,                                                                                                             │               
                             │    76 │   │   │   │   │   previous_result=previous_result,                                                                                                     │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/strategy.py:14 in measure_speed                          │               
                             │                                                                                                                                                                │               
                             │    11 │   Method for measuring execution speed of the function.                                                                                                │               
                             │    12 │   """                                                                                                                                                  │               
                             │    13 │   start_time = time.time()                                                                                                                             │               
                             │ ❱  14 │   result = func(*args, **kwargs)                                                                                                                       │               
                             │    15 │   end_time = time.time()                                                                                                                               │               
                             │    16 │   return result, end_time - start_time                                                                                                                 │               
                             │    17                                                                                                                                                          │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/schema/base.py:25 in run_evaluator                       │               
                             │                                                                                                                                                                │               
                             │   22 │   │   *args,                                                                                                                                            │               
                             │   23 │   │   **kwargs,                                                                                                                                         │               
                             │   24 │   ):                                                                                                                                                    │               
                             │ ❱ 25 │   │   instance = cls(project_dir, *args, **kwargs)                                                                                                      │               
                             │   26 │   │   result = instance.pure(previous_result, *args, **kwargs)                                                                                          │               
                             │   27 │   │   del instance                                                                                                                                      │               
                             │   28 │   │   return result                                                                                                                                     │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/nodes/retrieval/vectordb.py:56 in __init__               │               
                             │                                                                                                                                                                │               
                             │    53 │   │   super().__init__(project_dir)                                                                                                                    │               
                             │    54 │   │                                                                                                                                                    │               
                             │    55 │   │   vectordb_config_path = os.path.join(self.resources_dir, "vectordb.yaml")                                                                         │               
                             │ ❱  56 │   │   self.vector_store = load_vectordb_from_yaml(                                                                                                     │               
                             │    57 │   │   │   vectordb_config_path, vectordb, project_dir                                                                                                  │               
                             │    58 │   │   )                                                                                                                                                │               
                             │    59                                                                                                                                                          │               
                             │                                                                                                                                                                │               
                             │ /root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/vectordb/__init__.py:46 in load_vectordb_from_yaml       │               
                             │                                                                                                                                                                │               
                             │   43 │   │   )                                                                                                                                                 │               
                             │   44 │                                                                                                                                                         │               
                             │   45 │   target_dict = list(filter(lambda x: x["name"] == vectordb_name, vectordb_list))                                                                       │               
                             │ ❱ 46 │   target_dict[0].pop("name")  # delete a name key                                                                                                       │               
                             │   47 │   target_vectordb_name = target_dict[0].pop("db_type")                                                                                                  │               
                             │   48 │   target_vectordb_params = target_dict[0]                                                                                                               │               
                             │   49 │   return load_vectordb(target_vectordb_name, **target_vectordb_params)                                                                                  │               
                             ╰────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯               
                             IndexError: list index out of range                                                                                                                                              
Exception ignored in: <function VectorDB.__del__ at 0x7f77320b0e50>
Traceback (most recent call last):
  File "/root/.cache/pypoetry/virtualenvs/app-rag-sample-Sewi02OQ-py3.10/lib/python3.10/site-packages/autorag/nodes/retrieval/vectordb.py", line 63, in __del__
    del self.vector_store
AttributeError: vector_store

In the load_vectordb_from_yaml(), it seems that vectordb is identified through the name key. Am I misunderstanding this?

Image

@e7217
Copy link
Contributor Author

e7217 commented Dec 13, 2024

@vkehfdl1

I've thought about the potential cause of the issue.

The error occurs when the doc_id of a corpus does not exist in the retrieval result. However, the qa query is still able to retrieve content from another document in the vectordb. This issue may occur more frequently when top_k is set to a value greater than 1.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working High Priority
Projects
None yet
Development

No branches or pull requests

2 participants