Make inputs accepted by enrich-doc more flexible #394

Merged 1 commit on Sep 4, 2024

@Westwooo Westwooo commented Aug 20, 2024

This PR addresses the issues raised in #392.

Previously, the documents produced by vector enrich-doc used the contents of an id field in the doc being enriched as their ID. This breaks when the content does not contain an id field at all, or when that field does not match the doc's meta().id.

This change addresses this in two cases:

  1. When the input is from a doc get, the content does not need to be selected. The result of the doc get can be piped directly into vector enrich-doc, and the resulting doc uses the ID returned by the doc get.
  2. When the input is from a query, the query needs to select meta().id, or some other field, to be used as the ID of the resulting docs.
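
A minimal Rust sketch of the two cases, for illustration only. This is not the cbsh implementation: the `Input` enum, the `resolve_id` function, and the string-map row are hypothetical names and simplifications. It only shows the idea that the ID is taken from the doc get result when one is available, and otherwise from a column the query must have selected.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified view of the two kinds of piped input.
enum Input {
    /// Result of `doc get`: the document ID arrives with the result.
    DocGet { id: String },
    /// A row produced by a query: the ID has to be one of the selected columns.
    QueryRow { columns: HashMap<String, String> },
}

/// Picks the ID for the enriched doc, or fails when the row has no usable ID column.
fn resolve_id(input: &Input, id_column: &str) -> Result<String, String> {
    match input {
        Input::DocGet { id } => Ok(id.clone()),
        Input::QueryRow { columns } => columns
            .get(id_column)
            .cloned()
            .ok_or_else(|| "input incorrectly formatted".to_string()),
    }
}

fn main() {
    // Case 1: the ID comes straight from the doc get result.
    let from_get = Input::DocGet { id: "foo".to_string() };
    println!("{:?}", resolve_id(&from_get, "id"));

    // Case 2: the query selected meta().id, so an "id" column is present.
    let mut columns = HashMap::new();
    columns.insert("id".to_string(), "landmark_10019".to_string());
    let from_query = Input::QueryRow { columns };
    println!("{:?}", resolve_id(&from_query, "id"));
}
```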

The help command has been updated:

> vector enrich-doc -h
Enriches given JSON with embeddings of selected field

Usage:
  > vector enrich-doc {flags} <field>

Flags:
  -h, --help - Display the help message for this command
  --model <String> - the model to generate the embeddings with
  --dimension <Int> - dimension of the resulting embeddings
  --maxTokens <Int> - the token per minute limit for the provider/model
  --id-column <String> - the name of the id column if used with an input stream
  --vectorField <String> - the name of the field into which the embedding is written

Parameters:
  field <string>: the field from which the vector is generated

Examples:
  Open local json doc and enrich the field named 'description'
  > open ./local.json | vector enrich-doc description --model amazon.titan-embed-text-v2:0

  Fetch a single doc with id '12345' and enrich the field named 'description'
  > doc get 12345 | vector enrich-doc description --model models/text-embedding-004

  Fetch and enrich all landmark documents from travel sample and upload the results to couchabase
  > query  'SELECT meta().id, * FROM `travel-sample` WHERE type = "landmark"' | vector enrich-doc content --model amazon.titan-embed-text-v1 | doc upsert

Here are some examples:

  > doc get foo | select content | vector enrich-doc foo
Error:   × input incorrectly formatted
  help: Run 'vector enrich-doc --help' for examples with input from 'doc get' and 'query'
  
 > doc get foo | vector enrich-doc foo --dimension 4
Embedding batch 1/1
╭───┬─────┬───────────────────────────────╮
│ # │ id  │            content            │
├───┼─────┼───────────────────────────────┤
│ 0 │ foo │ ╭───────────┬───────────────╮ │
│   │     │ │ foo       │ bar           │ │
│   │     │ │           │ ╭───┬───────╮ │ │
│   │     │ │ fooVector │ │ 0 │  0.25 │ │ │
│   │     │ │           │ │ 1 │  0.19 │ │ │
│   │     │ │           │ │ 2 │ -0.86 │ │ │
│   │     │ │           │ │ 3 │  0.39 │ │ │
│   │     │ │           │ ╰───┴───────╯ │ │
│   │     │ ╰───────────┴───────────────╯ │
╰───┴─────┴───────────────────────────────╯

 > open foo.json | vector enrich-doc foo --id-column foo --dimension 4
Embedding batch 1/1
╭───┬─────┬───────────────────────────────╮
│ # │ id  │            content            │
├───┼─────┼───────────────────────────────┤
│ 0 │ bar │ ╭───────────┬───────────────╮ │
│   │     │ │ foo       │ bar           │ │
│   │     │ │           │ ╭───┬───────╮ │ │
│   │     │ │ fooVector │ │ 0 │  0.25 │ │ │
│   │     │ │           │ │ 1 │  0.19 │ │ │
│   │     │ │           │ │ 2 │ -0.86 │ │ │
│   │     │ │           │ │ 3 │  0.39 │ │ │
│   │     │ │           │ ╰───┴───────╯ │ │
│   │     │ ╰───────────┴───────────────╯ │
╰───┴─────┴───────────────────────────────╯

> query  'SELECT * FROM `travel-sample` WHERE type = "landmark" LIMIT 1' | vector enrich-doc content --dimension 4
Error:   × input incorrectly formatted
  help: Run 'vector enrich-doc --help' for examples with input from 'doc get' and 'query'

> query  'SELECT meta().id, * FROM `travel-sample` WHERE type = "landmark" LIMIT 1' | vector enrich-doc content --dimension 4
Embedding batch 1/1
╭───┬────────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮
│ # │       id       │                                                                                                         content                                                                                                         │
├───┼────────────────┼─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────┤
│ 0 │ landmark_10019 │ ╭───────────────┬─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╮ │
│   │                │ │ title         │ Gillingham (Kent)                                                                                                                                                                                   │ │
│   │                │ │ name          │ Royal Engineers Museum                                                                                                                                                                              │ │
│   │                │ │ alt           │                                                                                                                                                                                                     │ │
│   │                │ │ address       │ Prince Arthur Road, ME4 4UG                                                                                                                                                                         │ │
│   │                │ │ directions    │                                                                                                                                                                                                     │ │
│   │                │ │ phone         │ +44 1634 822839                                                                                                                                                                                     │ │
│   │                │ │ tollfree      │                                                                                                                                                                                                     │ │
│   │                │ │ email         │                                                                                                                                                                                                     │ │
│   │                │ │ url           │ http://www.remuseum.org.uk                                                                                                                                                                          │ │
│   │                │ │ hours         │ Tues - Fri 9.00am to 5.00pm, Sat - Sun 11.30am - 5.00pm                                                                                                                                             │ │
│   │                │ │ image         │                                                                                                                                                                                                     │ │
│   │                │ │ price         │                                                                                                                                                                                                     │ │
│   │                │ │ content       │ Adult - £6.99 for an Adult ticket that allows you to come back for further visits within a year (children's and concessionary tickets also available). Museum on military engineering and the       │ │
│   │                │ │               │ history of the British Empire. A quite extensive collection that takes about half a day to see. Of most interest to fans of British and military history or civil engineering. The outside          │ │
│   │                │ │               │ collection of tank mounted bridges etc can be seen for free. There is also an extensive series of themed special event weekends, admission to which is included in the cost of the annual ticket.   │ │
│   │                │ │               │ ╭──────────┬────────────────────╮                                                                                                                                                                   │ │
│   │                │ │ geo           │ │ lat      │ 51.39              │                                                                                                                                                                   │ │
│   │                │ │               │ │ lon      │ 0.54               │                                                                                                                                                                   │ │
│   │                │ │               │ │ accuracy │ RANGE_INTERPOLATED │                                                                                                                                                                   │ │
│   │                │ │               │ ╰──────────┴────────────────────╯                                                                                                                                                                   │ │
│   │                │ │ activity      │ see                                                                                                                                                                                                 │ │
│   │                │ │ type          │ landmark                                                                                                                                                                                            │ │
│   │                │ │ id            │ 10019                                                                                                                                                                                               │ │
│   │                │ │ country       │ United Kingdom                                                                                                                                                                                      │ │
│   │                │ │ city          │ Gillingham                                                                                                                                                                                          │ │
│   │                │ │ state         │                                                                                                                                                                                                     │ │
│   │                │ │               │ ╭───┬──────╮                                                                                                                                                                                        │ │
│   │                │ │ contentVector │ │ 0 │ 0.31 │                                                                                                                                                                                        │ │
│   │                │ │               │ │ 1 │ 0.05 │                                                                                                                                                                                        │ │
│   │                │ │               │ │ 2 │ 0.95 │                                                                                                                                                                                        │ │
│   │                │ │               │ │ 3 │ 0.07 │                                                                                                                                                                                        │ │
│   │                │ │               │ ╰───┴──────╯                                                                                                                                                                                        │ │
│   │                │ ╰───────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯ │
╰───┴────────────────┴─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────╯

@brett19 brett19 left a comment

Approved, though with a comment that would be good to understand.

)
} else {
// Else piped input is from a query, which needs to contain 3 columns, one to be used as the ID, one holding the json doc and finally one with the cluster
if rec.len() != 3 {

I assume there is some implied behaviour here, but it's strange that the query example above only selects the ID and the contents (via *), but the 3rd field doesn't appear. Where does the cluster come from?

@Westwooo Westwooo Sep 4, 2024

It comes from cbsh: the result of a query has an added column holding the identifier of the cluster against which the query was performed.

👤 Westwooo 🏠 local in 🗄 travel-sample._default._default
> query "SELECT meta().id, * FROM `travel-sample`.inventory.airline LIMIT 1"
╭───┬────────────┬──────────────────────────────┬─────────╮
│ # │     id     │           airline            │ cluster │
├───┼────────────┼──────────────────────────────┼─────────┤
│ 0 │ airline_10 │ ╭──────────┬───────────────╮ │ local   │
│   │            │ │ id       │ 10            │ │         │
│   │            │ │ type     │ airline       │ │         │
│   │            │ │ name     │ 40-Mile Air   │ │         │
│   │            │ │ iata     │ Q5            │ │         │
│   │            │ │ icao     │ MLA           │ │         │
│   │            │ │ callsign │ MILE-AIR      │ │         │
│   │            │ │ country  │ United States │ │         │
│   │            │ ╰──────────┴───────────────╯ │         │
╰───┴────────────┴──────────────────────────────┴─────────╯
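
This would also explain the earlier examples: with the cluster column appended, a plain SELECT * presumably yields two columns and is rejected, while SELECT meta().id, * yields three and is accepted. Below is a minimal Rust sketch of such a check, assuming a row is modelled as a list of column names. The `content_column` function, its signature, and the `CLUSTER_COLUMN` constant are hypothetical and not taken from the cbsh source.

```rust
/// Name of the column cbsh appends to query results (as seen in the output above).
const CLUSTER_COLUMN: &str = "cluster";

/// Hypothetical check: a query row must have exactly three columns,
/// the ID column, the JSON doc, and the cluster column.
/// Returns the name of the column holding the JSON doc.
fn content_column<'a>(columns: &[&'a str], id_column: &str) -> Result<&'a str, String> {
    let err = || "input incorrectly formatted".to_string();

    let has_id = columns.iter().any(|c| *c == id_column);
    let has_cluster = columns.iter().any(|c| *c == CLUSTER_COLUMN);
    if columns.len() != 3 || !has_id || !has_cluster {
        return Err(err());
    }

    // The doc is whichever column is neither the ID nor the cluster.
    columns
        .iter()
        .copied()
        .find(|c| *c != id_column && *c != CLUSTER_COLUMN)
        .ok_or_else(err)
}

fn main() {
    // SELECT * ...            => doc + cluster: rejected.
    println!("{:?}", content_column(&["travel-sample", "cluster"], "id"));
    // SELECT meta().id, * ... => id + doc + cluster: accepted.
    println!("{:?}", content_column(&["id", "airline", "cluster"], "id"));
}
```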

@Westwooo Westwooo merged commit 2c6f8d9 into main Sep 4, 2024
11 checks passed
@Westwooo Westwooo deleted the flexible_input_to_enrich-doc branch September 4, 2024 08:03