Memory Issue with bigQuery.createQueryStream in Node.js #1392
Comments
hey @gayathri-mandala, thanks for the report. Unfortunately, using the BigQuery REST API, we can't control the number of rows returned per page, so the memory issue can show up even when paging through results. I'm doing some work integrating the BigQuery Storage Read API, which allows reading a table using gRPC and the Arrow data format. That API also works in a true streaming fashion. The plan is to wrap up this work by the end of the month, after which users will be able to use it transparently in this package, or use the BigQuery Storage Read wrapper to fetch tables more explicitly. You can keep track here:
You can already use the BigQuery Storage Read API, but there is no client wrapper yet, which makes the developer experience not super nice. Here is a basic sample of how to use the raw client with the AVRO format (but I advise using the ARROW format; I can provide a sample if you want to go down that route): https://github.com/googleapis/nodejs-bigquery-storage/blob/main/samples/quickstart.js Also, I've just opened PR #1393 to reduce the memory footprint by cleaning up the cached rows.
Thanks for the information @alvarowolfx
@gayathri-mandala here is a sample for reading data with the BigQuery Storage API in Arrow format. In this sample I show that you can read a full table directly using the BQ Storage API, or run a query, obtain the destination table with the results, and then read it with the BQ Storage API.
async function main() {
// The read stream contains blocks of Arrow-encoded bytes.
const {RecordBatchReader} = require('apache-arrow');
const {Readable} = require('stream');
// See reference documentation at
// https://cloud.google.com/bigquery/docs/reference/storage
const {BigQuery} = require('@google-cloud/bigquery');
const {BigQueryReadClient} = require('@google-cloud/bigquery-storage');
const client = new BigQueryReadClient();
async function bigqueryStorageQuickstart() {
// Get current project ID. The read session is created in this project.
// This project can be different from that which contains the table.
const myProjectId = await client.getProjectId();
// We can run a query job and use the destination table to read data from it.
const bigquery = new BigQuery();
const sqlQuery = 'SELECT name, number, state from `bigquery-public-data.usa_names.usa_1910_current` where state = "CA"';
const [job] = await bigquery.createQueryJob({
query: sqlQuery,
location: 'US',
});
const [metadata] = await job.getMetadata();
const qconfig = metadata.configuration.query;
const dstTableRef = qconfig.destinationTable;
const projectId = dstTableRef.projectId;
const datasetId = dstTableRef.datasetId;
const tableId = dstTableRef.tableId;
// We can also use a fixed table instead.
// This example reads baby name data from the public datasets.
/*const projectId = 'bigquery-public-data';
const datasetId = 'usa_names';
const tableId = 'usa_1910_current';*/
const tableReference = `projects/${projectId}/datasets/${datasetId}/tables/${tableId}`;
const parent = `projects/${myProjectId}`;
/* We can limit the output columns to a subset of those in the table,
 * and set a simple filter, e.g. to only report names from the state of
 * Washington (WA). Both options are shown commented out below.
 */
const readOptions = {
selectedFields: [],
// selectedFields: ['name', 'number', 'state'],
// rowRestriction: 'state = "WA"',
};
// API request.
const request = {
parent,
readSession: {
table: tableReference,
// This API can also deliver data serialized in Apache Avro format.
// This example uses Apache Arrow.
dataFormat: 'ARROW',
readOptions,
},
};
const [session] = await client.createReadSession(request);
console.log(`session ${session.name} with ${session.streams.length} streams`);
const streams = [];
for (const readStream of session.streams) {
const readRowsRequest = {
// Required stream name and optional offset. Offset requested must be less than
// the last row read from readRows(). Requesting a larger offset is undefined.
readStream: readStream.name,
offset: 0,
};
const stream = client.readRows(readRowsRequest);
streams.push(stream);
}
async function* mergeStreams(readables) {
for (const readable of readables) {
for await (const chunk of readable) {
yield chunk;
}
}
}
const joined = Readable.from(mergeStreams(streams));
const rows = [];
joined
.on('error', console.error)
.on('data', data => {
try {
const buf = Buffer.concat([
session.arrowSchema.serializedSchema,
data.arrowRecordBatch.serializedRecordBatch,
]);
const reader = RecordBatchReader.from(buf);
const batches = reader.readAll();
for (const batch of batches) {
for (const row of batch.toArray()) {
rows.push(row);
}
}
} catch (error) {
console.log(error);
}
})
.on('end', () => {
console.log(`Got ${rows.length} rows`);
});
}
bigqueryStorageQuickstart();
}
main(...process.argv.slice(2));
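Note that, as written, the sample collects every decoded row into the rows array, so memory still grows with the size of the result. If bounding memory is the goal, a variation (hypothetical, reusing joined, session and RecordBatchReader from the sample above) could handle each Arrow batch as it is decoded and let it be garbage collected:
let rowCount = 0;
joined
  .on('error', console.error)
  .on('data', data => {
    const buf = Buffer.concat([
      session.arrowSchema.serializedSchema,
      data.arrowRecordBatch.serializedRecordBatch,
    ]);
    // Decode the batch, consume it, and drop the reference instead of
    // pushing every row into a long-lived array.
    for (const batch of RecordBatchReader.from(buf).readAll()) {
      rowCount += batch.numRows;
    }
  })
  .on('end', () => console.log(`Processed ${rowCount} rows`));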
We are having the same memory issues. Is there an update on this to make the data truly streamable?
@jcbowman with PR #1393, memory usage is reduced, but it still doesn't make the call 100% streamable - as explained in #1392 (comment), that's not possible when using the BigQuery v2 REST API. You can use the BQ Storage Read API, for which I added an example in #1392 (comment). We are also close to landing a wrapper to make the BQ Storage Read API easier to use with PR googleapis/nodejs-bigquery-storage#431
as we can't make the call fully streamable with the BigQuery v2 REST API
The bigQuery.createQueryStream loads an entire data set into memory. When we try to retrieve the data chunk-wise, it causes a memory issue. Upon checking the heap profiles, much of the data is stored in _cachedResponse, _cachedRows, and rows.
Environment details
@google-cloud/bigquery version: 7.8.0
Steps to reproduce
Here is the sample script
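(The original sample script is not included in this thread; a minimal sketch of the kind of consumer that reproduces the behavior, with a placeholder public-dataset query, might look like this:)
const {BigQuery} = require('@google-cloud/bigquery');

const bigquery = new BigQuery();
const query =
  'SELECT name, number, state FROM `bigquery-public-data.usa_names.usa_1910_current`';

let rowCount = 0;
bigquery
  .createQueryStream(query)
  .on('error', console.error)
  .on('data', () => {
    // Rows are emitted one at a time, but heap snapshots show the full
    // responses being retained internally (_cachedResponse, _cachedRows).
    rowCount++;
  })
  .on('end', () => console.log(`Read ${rowCount} rows`));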
When we have multiple connections, and for every connection request, the data gets loaded into memory, causing the memory size to increase.
Issue with autoPaginate
I tried using the autoPaginate field:
const queryStream = bigquery.createQueryStream(query, { autoPaginate: true });
However, it still behaves as if autoPaginate is set to false. Is there a way or field that allows us to retrieve the data in chunks rather than loading the entire data set into memory?
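One approach that keeps only a page of results in memory at a time is manual pagination with maxResults and pageToken via job.getQueryResults. This is a sketch, not a fix confirmed by the maintainers, and it assumes the page size is only an upper bound (as noted above, the REST API may return fewer rows per page):
const {BigQuery} = require('@google-cloud/bigquery');

async function pagedQuery(query) {
  const bigquery = new BigQuery();
  const [job] = await bigquery.createQueryJob({query});

  let pageToken;
  do {
    // With autoPaginate: false a single page is returned, along with a
    // nextQuery object carrying the token for the following page (if any).
    const [rows, nextQuery] = await job.getQueryResults({
      autoPaginate: false,
      maxResults: 10000, // upper bound per page; the API may return fewer
      pageToken,
    });
    console.log(`processing ${rows.length} rows`);
    pageToken = nextQuery && nextQuery.pageToken;
  } while (pageToken);
}

pagedQuery('SELECT name FROM `bigquery-public-data.usa_names.usa_1910_current`');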
Reference
Here it is mentioned that we need to end the stream after a certain amount of data. However, this approach could lead to data loss. How can we implement this correctly? Please provide a sample.