Update searchmanager.py #2424

Open: saurabhbansal123 wants to merge 1 commit into main

saurabhbansal123

The integrated vectorizer needs a title field in the index, but that field is not defined when the initial index is created.

Purpose

Fixes index creation when integrated vectorization is enabled by adding the missing title field.

Does this introduce a breaking change?

When developers merge from main and run the server, azd up, or azd deploy, will this produce an error?
If you're not sure, try it out on an old environment.

[ ] Yes
[X] No

Does this require changes to learn.microsoft.com docs?

This repository is referenced by this tutorial, which includes deployment, settings, and usage instructions. If text or screenshots need to change in the tutorial,
check the box below and notify the tutorial author. A Microsoft employee can do this for you if you're an external contributor.

[ ] Yes
[X] No

Type of change

[X] Bugfix
[ ] Feature
[ ] Code style update (formatting, local variables)
[ ] Refactoring (no functional changes, no api changes)
[ ] Documentation content changes
[ ] Other... Please describe:

Code quality checklist

See CONTRIBUTING.md for more details.

  • The current tests all pass (python -m pytest).
  • I added tests that prove my fix is effective or that my feature works
  • I ran python -m pytest --cov to verify 100% coverage of added lines
  • I ran python -m mypy to check for type errors
  • I either used the pre-commit hooks or ran ruff and black manually on my code.

@pamelafox
Collaborator

Thanks for the PR, I've discussed with @mattgotteiner. He says that the title should not be strictly required for integrated vectorization, but that some developers may want it. If we support title, then we should also add code in prepdocslib that will set the title field. Could you make that change?

We may also want to gate this behind an environment variable.
@mattgotteiner What is the drawback to adding title to all search indexes? Index size?
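
A minimal sketch of the environment-variable gate suggested above. The variable name `USE_TITLE_FIELD`, the `FieldSpec` dataclass, and both helper functions are hypothetical stand-ins chosen for illustration; they are not part of prepdocslib or the Azure SDK.

```python
from __future__ import annotations

import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FieldSpec:
    """Hypothetical stand-in for an SDK index field definition."""

    name: str
    type: str = "Edm.String"
    searchable: bool = False
    filterable: bool = False


def env_flag(name: str, default: bool = False) -> bool:
    """Read a boolean flag from the environment (assumed convention)."""
    value = os.environ.get(name)
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def extra_index_fields(use_int_vectorization: bool, include_title: bool) -> list[FieldSpec]:
    """Return the extra fields to append when building a new index."""
    fields: list[FieldSpec] = []
    if use_int_vectorization:
        # parent_id is always added for integrated vectorization.
        fields.append(FieldSpec(name="parent_id", searchable=True, filterable=True))
        # title is opt-in, controlled by the (hypothetical) environment flag.
        if include_title:
            fields.append(FieldSpec(name="title", searchable=True))
    return fields
```

With this shape, the caller would pass `env_flag("USE_TITLE_FIELD")` as `include_title`, so existing deployments keep their current index schema unless they opt in.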

@saurabhbansal123
Author

saurabhbansal123 commented Mar 13, 2025

In my change, the title field is added to the index only when integrated vectorization is in use. The change is in /app/backend/prepdocslib/searchmanager.py:

if self.use_int_vectorization:
    logger.info("Including parent_id field in new index %s", self.search_info.index_name)
    fields.append(SearchableField(name="parent_id", type="Edm.String", filterable=True))
    logger.info("Including title field in new index %s", self.search_info.index_name)
    fields.append(SimpleField(name="title", type="Edm.String", searchable=True, retrievable=True))

This is not strictly required for the vectorizer itself, but the indexer (defined in /app/backend/prepdocslib/integratedvectorizerstrategy.py) needs the title field in order to show the PDF title in the search results:

indexer = SearchIndexer(
    name=indexer_name,
    description="Indexer to index documents and generate embeddings",
    skillset_name=f"{self.search_info.index_name}-skillset",
    target_index_name=self.search_info.index_name,
    data_source_name=f"{self.search_info.index_name}-blob",
    # Map the metadata_storage_name field to the title field in the index to display the PDF title in the search results
    field_mappings=[FieldMapping(source_field_name="metadata_storage_name", target_field_name="title")],
)
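
Conceptually, the field mapping above copies the blob's metadata_storage_name value into the index's title field. The following plain-Python sketch only illustrates that idea and is not how Azure AI Search implements it; the helper `apply_field_mappings` and the sample document are hypothetical.

```python
from __future__ import annotations


def apply_field_mappings(source_doc: dict, mappings: list[tuple[str, str]]) -> dict:
    """Copy each mapped source field into its target field name."""
    indexed = {}
    for source_field, target_field in mappings:
        # Only map fields that the source document actually provides.
        if source_field in source_doc:
            indexed[target_field] = source_doc[source_field]
    return indexed


# Hypothetical blob metadata, for illustration only.
blob_metadata = {"metadata_storage_name": "example.pdf"}
indexed_doc = apply_field_mappings(blob_metadata, [("metadata_storage_name", "title")])
print(indexed_doc)  # {'title': 'example.pdf'}
```

If the index has no title field, there is nothing for the mapping to target, which is why the index definition and the indexer need to agree.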
