feat: add "returning" search option to select only specified fields from a document #770

Open

fasenderos

Implements #769.

This PR introduces the ability to pass an array of document fields to be returned via the returning option.

Initially, I considered using the existing getDocumentProperties() function. However, this function does not preserve the original structure of objects. Moreover, when dealing with nested objects, it only returns the deepest fields. This behavior forces users to specify all properties if they want to return the entire object, which can be cumbersome.

Given that getDocumentProperties() is widely used throughout the codebase, I decided to create a new function called pickDocumentProperties() that preserves the original structure of objects, allowing users to specify top-level keys for nested objects, which simplifies the process of selecting which fields to return.
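
To make the idea concrete, here is a minimal sketch of a structure-preserving pick. It is not the PR's actual pickDocumentProperties() implementation; the helper name `pickPaths`, the dot-separated path syntax, and the choice to silently skip missing paths are assumptions made for illustration.

```typescript
type PlainObject = Record<string, unknown>;

function isPlainObject(value: unknown): value is PlainObject {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Hypothetical helper (not the PR's code): copies the requested paths into a
// new object while keeping the original nesting. A top-level key such as
// "object1" returns the whole nested object; "object2.nested.string1"
// returns only that leaf, wrapped in its parent objects.
function pickPaths(doc: PlainObject, paths: string[]): PlainObject {
  const result: PlainObject = {};
  for (const path of paths) {
    const keys = path.split(".");

    // 1. Resolve the value in the source document; skip paths that do not exist.
    let value: unknown = doc;
    let found = true;
    for (const key of keys) {
      if (!isPlainObject(value) || !(key in value)) {
        found = false;
        break;
      }
      value = value[key];
    }
    if (!found) continue;

    // 2. Rebuild the parent objects in the result, then attach the value.
    //    Values are attached by reference, not deep-cloned.
    const leaf = keys.pop() as string;
    let target = result;
    for (const key of keys) {
      const existing = target[key];
      const next: PlainObject = isPlainObject(existing) ? existing : {};
      target[key] = next;
      target = next;
    }
    target[leaf] = value;
  }
  return result;
}
```

With a document like `{ title: "a", meta: { author: "x", tags: { k: 1 } } }`, picking `["title", "meta.tags.k"]` keeps `title` and rebuilds `meta.tags.k` under its original parents, while `meta.author` is left out.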

Note that the includeVectors option is ignored when the returning option is also provided. Prioritizing the user's explicit selection seemed more logical to me, though this behavior is open to discussion.
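
The precedence described above could be sketched roughly as follows. Everything here is hypothetical: `shapeHit`, `ResultOptions`, and the `vectorProps` parameter are illustrative names, and dropping vector properties when includeVectors is false is an assumption of this sketch rather than a description of Orama's actual behavior.

```typescript
type Doc = Record<string, unknown>;

interface ResultOptions {
  returning?: string[]; // hypothetical: the option proposed in this PR
  includeVectors?: boolean;
}

// Hypothetical helper: decides which fields of a matched document are returned.
function shapeHit(doc: Doc, vectorProps: string[], options: ResultOptions): Doc {
  const { returning, includeVectors = false } = options;

  // An explicit selection wins: includeVectors is not consulted at all,
  // so a vector field is returned if (and only if) the caller lists it.
  if (returning !== undefined && returning.length > 0) {
    const selected: Doc = {};
    for (const key of returning) {
      if (key in doc) selected[key] = doc[key];
    }
    return selected;
  }

  // No selection: fall back to the includeVectors flag.
  // Assumption of this sketch: vectors are simply removed when the flag is off.
  const shaped: Doc = { ...doc };
  if (!includeVectors) {
    for (const prop of vectorProps) delete shaped[prop];
  }
  return shaped;
}
```

Under this rule, `{ returning: ["embedding"], includeVectors: false }` still returns the `embedding` field, because the explicit selection takes priority.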

If everything looks good, I’ll proceed with adding the corresponding tests.


vercel bot commented Sep 4, 2024

@fasenderos is attempting to deploy a commit to the OramaSearch Team on Vercel.

A member of the Team first needs to authorize it.

@allevo
Collaborator

allevo commented Sep 5, 2024

Hi @fasenderos, thanks for your PR!

This solution creates a lot of temporary objects, which slows down the application.

Returning to the original question (#769), the application wants to reduce network traffic by specifying which properties to return.

Could this be achieved with a special serialization method? I'm thinking of how fastify uses an output schema to speed up serialization. Could you use a library like fast-json-stringify to address this?
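
As a rough illustration of the output-schema idea, the sketch below builds a serializer once from a flat schema and then emits only the declared fields. It is a simplified, dependency-free stand-in, not fast-json-stringify itself (which compiles specialized code from a JSON Schema); `compileSerializer`, `FlatSchema`, and the rule of skipping values whose type does not match are assumptions of this example.

```typescript
type FieldType = "string" | "number" | "boolean";
type FlatSchema = Record<string, FieldType>;

// Hypothetical stand-in for a schema-driven serializer: the schema is
// inspected once, and the returned function only ever touches the
// declared fields, so no intermediate "picked" object is created.
function compileSerializer(
  schema: FlatSchema
): (doc: Record<string, unknown>) => string {
  const fields = Object.entries(schema);
  return (doc) => {
    const parts: string[] = [];
    for (const [name, type] of fields) {
      const value = doc[name];
      // Assumption of this sketch: missing or mistyped values are skipped.
      if (typeof value !== type) continue;
      parts.push(JSON.stringify(name) + ":" + JSON.stringify(value));
    }
    return "{" + parts.join(",") + "}";
  };
}

// Usage: only `title` and `price` are declared, so `description` is dropped.
const serializeProduct = compileSerializer({ title: "string", price: "number" });
const json = serializeProduct({
  title: "Keyboard",
  price: 49.9,
  description: "a long text that should not travel over the network",
});
```

The output is a JSON string, which suits a server that writes it straight to the response, but not a caller that expects objects.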

@fasenderos
Author

fasenderos commented Sep 7, 2024

Hi @allevo, thanks for the reply.

> Could this be achieved with a special serialization method? I'm thinking of how fastify uses an output schema to speed up serialization. Could you use a library like fast-json-stringify to address this?

Yes, you can also achieve the same result with fast-json-stringify.

Before starting the implementation, I wanted to run some tests to check performance (time and memory usage). Based on the results, fast-json-stringify is always the fastest, while pickDocumentProperties() consistently uses the least memory. The "problem" with fast-json-stringify is that it would return an array of strings in the hits that would need to be parsed. I therefore also added tests with JSON.parse() and fast-json-parse, and these latter ones are consistently the slowest.

If you think I should add more use cases or modify the tests, let me know. I can still proceed with using fast-json-stringify, but I’d like to know if the data returned in the hits needs to be parsed or not.
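
The parsing question can be illustrated without any dependency. The sketch below uses the built-in `JSON.stringify` with an array replacer (an allowlist of property names applied at every nesting level) as a stand-in for a schema-based serializer; `serializeHits` and `parseHits` are hypothetical names, not part of Orama or of the benchmark script.

```typescript
type Doc = Record<string, unknown>;

// Serializing with an allowlist yields one JSON *string* per hit.
function serializeHits(docs: Doc[], fields: string[]): string[] {
  return docs.map((doc) => JSON.stringify(doc, fields));
}

// Callers that need objects must pay for a second pass over every hit.
function parseHits(serialized: string[]): Doc[] {
  return serialized.map((s) => JSON.parse(s) as Doc);
}

const docs: Doc[] = [
  { id: "1", title: "foo", body: "long text" },
  { id: "2", title: "bar", body: "more long text" },
];

const asStrings = serializeHits(docs, ["id", "title"]);
const asObjects = parseHits(asStrings);
```

Here `asStrings` holds strings such as `{"id":"1","title":"foo"}`, while `asObjects` holds plain objects again. That extra stringify-then-parse round trip is what the two "-and-parse" variants in the results below measure.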

Below are the results, and at the bottom, the script used.

1000 Runs on 10/100/1000 documents and serializing 1 property

RESULTS SUMMARY FOR 1000 Runs - Serializing 1 properties in 10 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.001 (fastest)
2° pick-document-properties in 0.0017
3° fast-json-stringify-and-normal-parse in 0.0042
4° fast-json-stringify-and-fast-parse in 0.0048

Memory used in byte (50% Percentile)
1° pick-document-properties in 1160 (least consuming)
2° fast-json-stringify in 1480
3° fast-json-stringify-and-normal-parse in 2200
4° fast-json-stringify-and-fast-parse in 2600


RESULTS SUMMARY FOR 1000 Runs - Serializing 1 properties in 100 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0047 (fastest)
2° pick-document-properties in 0.0089
3° fast-json-stringify-and-normal-parse in 0.0286
4° fast-json-stringify-and-fast-parse in 0.0341

Memory used in byte (50% Percentile)
1° pick-document-properties in 9800 (least consuming)
2° fast-json-stringify in 13000
3° fast-json-stringify-and-normal-parse in 20200
4° fast-json-stringify-and-fast-parse in 24200


RESULTS SUMMARY FOR 1000 Runs - Serializing 1 properties in 1000 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0429 (fastest)
2° pick-document-properties in 0.0813
3° fast-json-stringify-and-normal-parse in 0.286
4° fast-json-stringify-and-fast-parse in 0.3431

Memory used in byte (50% Percentile)
1° pick-document-properties in 96208 (least consuming)
2° fast-json-stringify in 128208
3° fast-json-stringify-and-normal-parse in 200216
4° fast-json-stringify-and-fast-parse in 240224

1000 Runs on 10/100/1000 documents and serializing 3 properties

RESULTS SUMMARY FOR 1000 Runs - Serializing 3 properties in 10 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0018 (fastest)
2° pick-document-properties in 0.0039
3° fast-json-stringify-and-normal-parse in 0.0062
4° fast-json-stringify-and-fast-parse in 0.0068

Memory used in byte (50% Percentile)
1° pick-document-properties in 1960 (least consuming)
2° fast-json-stringify in 3400
3° fast-json-stringify-and-normal-parse in 4760
4° fast-json-stringify-and-fast-parse in 5160


RESULTS SUMMARY FOR 1000 Runs - Serializing 3 properties in 100 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0071 (fastest)
2° pick-document-properties in 0.029
3° fast-json-stringify-and-normal-parse in 0.0442
4° fast-json-stringify-and-fast-parse in 0.0501

Memory used in byte (50% Percentile)
1° pick-document-properties in 17800 (least consuming)
2° fast-json-stringify in 32200
3° fast-json-stringify-and-normal-parse in 45800
4° fast-json-stringify-and-fast-parse in 49800


RESULTS SUMMARY FOR 1000 Runs - Serializing 3 properties in 1000 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0669 (fastest)
2° pick-document-properties in 0.2888
3° fast-json-stringify-and-normal-parse in 0.4408
4° fast-json-stringify-and-fast-parse in 0.4989

Memory used in byte (50% Percentile)
1° pick-document-properties in 176224 (least consuming)
2° fast-json-stringify in 320224
3° fast-json-stringify-and-normal-parse in 456248
4° fast-json-stringify-and-fast-parse in 496248

1000 Runs on 10/100/1000 documents and serializing 6 properties

RESULTS SUMMARY FOR 1000 Runs - Serializing 6 properties in 10 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0023 (fastest)
2° pick-document-properties in 0.0068
3° fast-json-stringify-and-normal-parse in 0.0093
4° fast-json-stringify-and-fast-parse in 0.0098

Memory used in byte (50% Percentile)
1° pick-document-properties in 3480 (least consuming)
2° fast-json-stringify in 6840
3° fast-json-stringify-and-normal-parse in 9080
4° fast-json-stringify-and-fast-parse in 9480


RESULTS SUMMARY FOR 1000 Runs - Serializing 6 properties in 100 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0142 (fastest)
2° pick-document-properties in 0.0581
3° fast-json-stringify-and-normal-parse in 0.0732
4° fast-json-stringify-and-fast-parse in 0.0784

Memory used in byte (50% Percentile)
1° pick-document-properties in 33000 (least consuming)
2° fast-json-stringify in 66600
3° fast-json-stringify-and-normal-parse in 89000
4° fast-json-stringify-and-fast-parse in 93000


RESULTS SUMMARY FOR 1000 Runs - Serializing 6 properties in 1000 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.1385 (fastest)
2° pick-document-properties in 0.5851
3° fast-json-stringify-and-normal-parse in 0.7507
4° fast-json-stringify-and-fast-parse in 0.8038

Memory used in byte (50% Percentile)
1° pick-document-properties in 328224 (least consuming)
2° fast-json-stringify in 664232
3° fast-json-stringify-and-normal-parse in 888256
4° fast-json-stringify-and-fast-parse in 928288

1000 Runs on 10/100/1000 documents and serializing 8 properties where one property is an entire object and another one is a single nested property of an object

RESULTS SUMMARY FOR 1000 Runs - Serializing 8 properties in 10 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.005 (fastest)
2° pick-document-properties in 0.0097
3° fast-json-stringify-and-normal-parse in 0.0243
4° fast-json-stringify-and-fast-parse in 0.0252

Memory used in byte (50% Percentile)
1° pick-document-properties in 5240 (least consuming)
2° fast-json-stringify in 21240
3° fast-json-stringify-and-normal-parse in 28200
4° fast-json-stringify-and-fast-parse in 28600


RESULTS SUMMARY FOR 1000 Runs - Serializing 8 properties in 100 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.0426 (fastest)
2° pick-document-properties in 0.0895
3° fast-json-stringify-and-normal-parse in 0.2381
4° fast-json-stringify-and-fast-parse in 0.2459

Memory used in byte (50% Percentile)
1° pick-document-properties in 50600 (least consuming)
2° fast-json-stringify in 210608
3° fast-json-stringify-and-normal-parse in 280216
4° fast-json-stringify-and-fast-parse in 284208


RESULTS SUMMARY FOR 1000 Runs - Serializing 8 properties in 1000 docs

Time elapsed in ms (50% Percentile)
1° fast-json-stringify in 0.4319 (fastest)
2° pick-document-properties in 0.9126
3° fast-json-stringify-and-normal-parse in 2.5254
4° fast-json-stringify-and-fast-parse in 2.5811

Memory used in byte (50% Percentile)
1° pick-document-properties in 504248 (least consuming)
2° fast-json-stringify in 2104264
3° fast-json-stringify-and-normal-parse in 2800488
4° fast-json-stringify-and-fast-parse in 2840296

Here is the script used for testing, run with: npx tsx ./packages/orama/test-orama.ts

import fastJson from "fast-json-stringify"
import fastParse from "fast-json-parse"
import { pickDocumentProperties } from "./src/utils"

const serialize = {
    // Return 1 prop for each document
    'props-1': { 
        fastJson: fastJson({ type: 'object', properties: { string1: { type: 'string' }}}),
        pick: ['string1']
    },
    // Return 3 props for each document
    'props-3': { 
        fastJson: fastJson({ 
            type: 'object',
            properties: { 
                string1: { type: 'string' }, 
                number1: { type: 'number' },
                bool1: { type: 'boolean' }
            }
        }),
        pick: ['string1', 'number1', 'bool1']
    },
    // Return 6 props for each document
    'props-6': {
        fastJson: fastJson({
            type: 'object',
            properties: { 
                string1: { type: 'string' }, 
                string2: { type: 'string' }, 
                number1: { type: 'number' },
                number2: { type: 'number' },
                bool1: { type: 'boolean' },
                bool2: { type: 'boolean' }
            }
        }),
        pick: ['string1', 'string2', 'number1', 'number2','bool1', 'bool2']
    },
    // Return 8 props for each document where 1 is an entire object and 1 is a single nested prop of another object
    'props-8': {
        fastJson: fastJson({
            type: 'object',
            properties: { 
                string1: { type: 'string' }, 
                string2: { type: 'string' }, 
                number1: { type: 'number' },
                number2: { type: 'number' },
                bool1: { type: 'boolean' },
                bool2: { type: 'boolean' },
                // entire object
                object1: {
                    type: 'object',
                    properties: {
                        string1: { type: 'string' },
                        string2: { type: 'string' },
                        number1: { type: 'number' },
                        number2: { type: 'number' },
                        bool1: { type: 'boolean' },
                        bool2: { type: 'boolean' },
                        nested: {
                            type: 'object',
                            properties: {
                                string1: { type: 'string' },
                                number1: { type: 'number' },
                                bool1: { type: 'boolean' },
                            }
                        }
                    }
                },
                // single nested fields object2.nested.string1
                object2: {
                    type: 'object',
                    properties: {
                        nested: {
                            type: 'object',
                            properties: {
                                string1: { type: 'string' }
                            }
                        }
                    }
                }
            }
        }),
        pick: ['string1', 'string2', 'number1', 'number2','bool1', 'bool2', 'object1', 'object2.nested.string1']
    }
}

function getNDocuments(n: number) {
    const response: any = []
    for (let index = 0; index < n; index++) {
        response.push({
            string1: 'foo bar',
            string2: 'foo bar',
            number1: 99.99,
            number2: 99.99,
            bool1: false,
            bool2: true,
            object1: {
                string1: 'foo bar',
                string2: 'foo bar',
                number1: 99.99,
                number2: 99.99,
                bool1: false,
                bool2: true,
                nested: {
                    string1: 'foo bar',
                    number1: 99.99,
                    bool1: false,
                }
            },
            object2: {
                string1: 'foo bar',
                string2: 'foo bar',
                number1: 99.99,
                number2: 99.99,
                bool1: false,
                bool2: true,
                nested: {
                    string1: 'foo bar',
                    number1: 99.99,
                    bool1: false,
                }
            },
        })
    }
    return response
}

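// Runs fn once and records the elapsed wall-clock time and the change in V8 heap usage.
// The heap delta is only an approximation: a garbage collection during the run
// can shrink it or even make it negative.
function profiling(fn: (docs: any[], props: number) => void, label: string, docs: any[], props: number) {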
    const memoryBefore = process.memoryUsage().heapUsed;
    const start = performance.now();

    fn(docs, props);

    const end = performance.now();
    const memoryAfter = process.memoryUsage().heapUsed;
    const time = end - start
    const memory = memoryAfter - memoryBefore
    return { label, time, memory, count: docs.length, props }
}

const groupBy = (array, key) => {
    return array.reduce((result, currentValue) => {
        const groupKey = currentValue[key];
        if (!result[groupKey]) result[groupKey] = [];
        result[groupKey].push(currentValue);
        return result;
    }, {});
};

const percentile = (arr, p) => {
    const index = Math.ceil(arr.length * (p / 100)) - 1;
    return arr[index];
}

const mean = (arr, prop) => {
    return arr.reduce((sum, item) => sum + item[prop], 0) / arr.length;
}

const roundTo = (num, decimals = 2) => {
    const factor = Math.pow(10, decimals);
    return Math.round(num * factor) / factor;
}

function printResults(results){
    // { 10: [], 100: []}
    const groupedByRuns = groupBy(results, 'count');
    for (const docs in groupedByRuns) {
        const groupedByLabel = groupBy(groupedByRuns[docs], 'label')
        const summary: any = {
            runs: 0,
            time: [],
            memory: []
        }
        // { fast-json-stringify: [], 'pick-document-properties': [] }
        for (const label in groupedByLabel) {
            const timeOrdered = [...groupedByLabel[label]]
            timeOrdered.sort((a, b) => a.time - b.time);
            
            const memoryOrdered = [...groupedByLabel[label]]
            memoryOrdered.sort((a, b) => a.memory - b.memory);
            
            const bestTime = timeOrdered[0];
            const worstTime = timeOrdered[timeOrdered.length - 1]

            const bestMemory = memoryOrdered[0];
            const worstMemory = memoryOrdered[memoryOrdered.length - 1]

            const avgTime = mean(timeOrdered, 'time')
            const avgMemory = mean(memoryOrdered, 'memory')

            const timePercentile25 = percentile(timeOrdered, 25)
            const timePercentile50 = percentile(timeOrdered, 50)
            const timePercentile75 = percentile(timeOrdered, 75)
            const timePercentile95 = percentile(timeOrdered, 95)

            const memoryPercentile25 = percentile(memoryOrdered, 25)
            const memoryPercentile50 = percentile(memoryOrdered, 50)
            const memoryPercentile75 = percentile(memoryOrdered, 75)
            const memoryPercentile95 = percentile(memoryOrdered, 95)

            summary.time.push(timePercentile50)
            summary.memory.push(memoryPercentile50)
            summary.runs = timeOrdered.length

            console.log(label)
            console.table([
                {
                    "Stats": 'Time ms',
                    "25%": roundTo(timePercentile25.time, 4),
                    "50%": roundTo(timePercentile50.time, 4),
                    "75%": roundTo(timePercentile75.time, 4),
                    "95%": roundTo(timePercentile95.time, 4),
                    "Average (Mean)": roundTo(avgTime, 4),
                    "Best (Min)": roundTo(bestTime.time, 4),
                    "Worst (Max)": roundTo(worstTime.time, 4),
                },
                {
                    "Stats": 'Memory byte',
                    "25%": roundTo(memoryPercentile25.memory, 4),
                    "50%": roundTo(memoryPercentile50.memory, 4),
                    "75%": roundTo(memoryPercentile75.memory, 4),
                    "95%": roundTo(memoryPercentile95.memory, 4),
                    "Average (Mean)": roundTo(avgMemory, 4),
                    "Best (Min)": roundTo(bestMemory.memory, 4),
                    "Worst (Max)": roundTo(worstMemory.memory, 4),
                }]
            );
        }
        summary.time.sort((a, b) => a.time - b.time)
        summary.memory.sort((a, b) => a.memory - b.memory)
        const fastest = summary.time[0]

        console.log(`\n\nRESULTS SUMMARY FOR ${summary.runs} Runs - Serializing ${fastest.props} properties in ${fastest.count} docs`)
        console.log(`\nTime elapsed in ms (50% Percentile)`)
        summary.time.forEach((item, index) => {
            console.log(`${index + 1}° ${item.label} in ${roundTo(item.time, 4)}${index === 0 ? ' (fastest)' : ''}`);
        })
        console.log(`\nMemory used in byte (50% Percentile)`)        
        summary.memory.forEach((item, index) => {
            console.log(`${index + 1}° ${item.label} in ${roundTo(item.memory, 4)}${index === 0 ? ' (least consuming)' : ''}`);
        })
    }
}

function useFastJson(docs, props) {
    const serializer = serialize[`props-${props}`].fastJson
    return docs.map((doc) => serializer(doc))
}

function useFastJsonAndNormalParse(docs, props) {
    const serializer = serialize[`props-${props}`].fastJson
    return docs.map((doc) => JSON.parse(serializer(doc)))
}

function useFastJsonAndFastParse(docs, props) {
    const serializer = serialize[`props-${props}`].fastJson
    return docs.map((doc) => fastParse(serializer(doc)).value)
}

function usePickDocumentProperties(docs, props){
    const properties = serialize[`props-${props}`].pick
    return docs.map((doc) => pickDocumentProperties(doc, properties))
}

const runs = (docs, props) => {
    const results: any = []
    for (let i = 0; i < 1000; i++) {
        results.push(profiling(useFastJson, 'fast-json-stringify', docs, props))
        results.push(profiling(useFastJsonAndNormalParse, 'fast-json-stringify-and-normal-parse', docs, props))        
        results.push(profiling(useFastJsonAndFastParse, 'fast-json-stringify-and-fast-parse', docs, props))
        results.push(profiling(usePickDocumentProperties, 'pick-document-properties', docs, props))
    }
    printResults(results)
}

const init = (props: 1 | 3 | 6 | 8) => {
    const docs = getNDocuments(10000)
    const docs_10 = docs.slice(0, 10)
    const docs_100 = docs.slice(0, 100)
    const docs_1000 = docs.slice(0, 1000)
    const docs_10000 = docs.slice(0, 10000)

    // Execute 1000 runs on 10 docs
    runs(docs_10, props)

    // Execute 1000 runs on 100 docs
    runs(docs_100, props)

    // Execute 1000 runs on 1.000 docs
    runs(docs_1000, props)

    // Execute 1000 runs on 10.000 docs
    runs(docs_10000, props)
}

init(1) // Serialize 1 prop for each document
init(3) // Serialize 3 props for each document
init(6) // Serialize 6 props for each document
init(8) // Serialize 8 props for each document where 1 is an entire object and 1 is a single nested prop of another object

@micheleriva
Member

One other thing to consider: @orama/orama must remain dependencies-free

@fasenderos
Author

> One other thing to consider: @orama/orama must remain dependencies-free

Ok, is there anything I need to do on this PR (besides the tests)?

3 participants