Background:
We are using AWS Security Lake to ingest various log sources in OCSF format, make that data queryable via AWS Athena, and ingest it into AWS OpenSearch. We are attempting to ingest Falco data by following this article: falcosidekick integration documentation.
Describe the bug:
After following the instructions in the article linked above, we are receiving Falco data in our Security Lake S3 bucket, and this data is queryable via S3 Select. However, the Lake Formation table generated by Security Lake returns a generic "Unable to Read Parquet File" error when queried via Athena. Additionally, we are using the AWS OpenSearch Ingestion pipeline with the Security Lake S3 Parquet OCSF pipeline template. Native Security Lake sources are ingested without error, but we see an error when Falco data is ingested. The error from the OpenSearch Ingestion pipeline (via CloudWatch) is as follows:
AWS support was contacted regarding this error. The following was their response:
"REPEATED" is a keyword in protobuf. It seems the files are being written from protobufs and the generated schema is not supported by the Avro parquet library used by OS ingestion. The source of this error is https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L303
There's some useful information in this Stack Overflow post:
https://stackoverflow.com/questions/72634350/parquetprotowriters-creates-an-unreadable-parquet-file
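To make the distinction in the response above concrete, here is a minimal sketch of the rule that Avro-based readers appear to enforce: a `repeated` field is only accepted when it sits directly inside a group annotated as `LIST` or `MAP`, which is the layout the Parquet format's LogicalTypes document prescribes. The `SchemaNode` class, the `find_bare_repeated` helper, and the field name `types` are all hypothetical and are not taken from Falcosidekick or from any Parquet library. The check is also simplified (it ignores legacy annotations such as `MAP_KEY_VALUE`).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SchemaNode:
    """Hypothetical, simplified stand-in for one Parquet schema field."""
    name: str
    repetition: str                      # "required", "optional", or "repeated"
    logical_type: Optional[str] = None   # e.g. "LIST", "MAP", "STRING"
    children: List["SchemaNode"] = field(default_factory=list)


def find_bare_repeated(node, parent=None, path=""):
    """Return dotted paths of repeated fields whose parent is not a LIST or MAP group."""
    here = f"{path}.{node.name}" if path else node.name
    bad = []
    if node.repetition == "repeated":
        wrapped = parent is not None and parent.logical_type in ("LIST", "MAP")
        if not wrapped:
            bad.append(here)
    for child in node.children:
        bad.extend(find_bare_repeated(child, node, here))
    return bad


# A bare repeated leaf, the kind of shape a direct protobuf-style mapping can produce.
# The field name "types" is an assumption used only for illustration.
bare = SchemaNode("root", "required", children=[
    SchemaNode("types", "repeated", "STRING"),
])

# The three-level LIST layout: an annotated group, a repeated group, then the element.
wrapped = SchemaNode("root", "required", children=[
    SchemaNode("types", "optional", "LIST", children=[
        SchemaNode("list", "repeated", children=[
            SchemaNode("element", "optional", "STRING"),
        ]),
    ]),
])

print(find_bare_repeated(bare))     # flags the bare repeated field
print(find_bare_repeated(wrapped))  # nothing to flag
```

A check of this kind could be run against the schema of a generated file to see which columns a stricter reader is likely to reject, assuming the schema has first been loaded into a comparable tree.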
How to reproduce it:
Follow the steps outlined in the article linked above to ingest Falco data into Security Lake.
Note that I was the one who suggested that this might be an issue with converting from proto to Parquet. Having gone through the Parquet library this repo uses to generate the files, it looks like REPEATED is a valid keyword in Parquet. The problem is that REPEATED is not being used correctly. See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md for a detailed description of how REPEATED should be used. I see an issue in these places:
Expected behaviour:
Environment:
Falco version
0.36.1 (x86_64) - from docker.io/falcosecurity/falco-no-driver:0.36.1
System info
Cloud provider or hardware configuration
AWS EKS - managed nodegroups
OS
Kernel:
Linux falco-6sck4 5.10.197-186.748.amzn2.x86_64 #1 SMP Tue Oct 10 00:30:07 UTC 2023 x86_64 GNU/Linux
Installation method:
Kubernetes
Additional context:
N/A