AWS Security Lake Parquet File Schema Format Issues upon AWS Opensearch Ingestion & AWS Athena Querying #728

m00lav · 2023-12-19T15:16:35Z

Background:

We are leveraging AWS Security Lake to ingest various log sources into OCSF, make this data queryable via AWS Athena, and ingest it into AWS OpenSearch. We are attempting to ingest Falco data by following this article: falcosidekick integration documentation.

Describe the bug:

After following the instructions in the article linked above, we are receiving Falco data in our Security Lake S3 bucket, and this data is queryable via S3 Select. However, the Lake Formation table generated by Security Lake returns a generic "Unable to Read Parquet File" error when we query it via Athena. Additionally, we are using the AWS OpenSearch Ingestion Pipeline with the Security Lake S3 Parquet OCSF pipeline template. Native Security Lake sources are ingested without error, but ingesting Falco data fails. The error from the OpenSearch Ingestion pipeline (via CloudWatch) is as follows:

java.lang.UnsupportedOperationException: REPEATED not supported outside LIST or MAP. Type: repeated binary types (STRING) = 0

AWS support was contacted regarding this error. The following was their response:

"REPEATED" is a keyword in protobuf. It seems the files are being written from protobufs and the generated schema is not supported by the Avro parquet library used by OS ingestion. The source of this error is https://github.com/apache/parquet-mr/blob/master/parquet-avro/src/main/java/org/apache/parquet/avro/AvroSchemaConverter.java#L303
  
There's some useful information in this stackoverflow post:  
https://stackoverflow.com/questions/72634350/parquetprotowriters-creates-an-unreadable-parquet-file

How to reproduce it:

Follow the steps outlined in the article linked above to ingest Falco data into Security Lake.
Create and configure an AWS OpenSearch Ingestion Pipeline to send Falco data to an OpenSearch domain: https://docs.aws.amazon.com/opensearch-service/latest/developerguide/configure-client-security-lake.html

Expected behaviour:

Falco data in Security Lake should be ingestible without error by an AWS OpenSearch Ingestion Pipeline.

Environment:

Falco version

0.36.1 (x86_64) - from docker.io/falcosecurity/falco-no-driver:0.36.1

System info

{
  "machine": "x86_64",
  "nodename": "falco-6sck4",
  "release": "5.10.197-186.748.amzn2.x86_64",
  "sysname": "Linux",
  "version": "#1 SMP Tue Oct 10 00:30:07 UTC 2023"
}

Cloud provider or hardware configuration

AWS EKS - managed nodegroups

OS

FALCO CONTAINER:
PRETTY_NAME="Debian GNU/Linux 12 (bookworm)"
NAME="Debian GNU/Linux"
VERSION_ID="12"
VERSION="12 (bookworm)"
VERSION_CODENAME=bookworm
ID=debian
HOME_URL="https://www.debian.org/"
SUPPORT_URL="https://www.debian.org/support"
BUG_REPORT_URL="https://bugs.debian.org/"

Kernel:

Linux falco-6sck4 5.10.197-186.748.amzn2.x86_64 #1 SMP Tue Oct 10 00:30:07 UTC 2023 x86_64 GNU/Linux

Installation method:

Kubernetes

Additional context:

N/A


Issif · 2023-12-19T17:43:41Z

Thanks for this report, I'll work on it asap.

asuresh8 · 2023-12-20T02:57:39Z

Note that I was the one who suggested it was an issue converting from proto to Parquet. After going through the Parquet library this repo uses to generate the files, it looks like REPEATED is a valid keyword in Parquet. The issue is that REPEATED is not used correctly. See https://github.com/apache/parquet-format/blob/master/LogicalTypes.md for a detailed description of how REPEATED should be used. I see an issue in these places:

If this field is repeated then OCSFSecurityFinding needs to be in a list or a map. I'm not sure if the top level of a Parquet file counts as a list.

If types is repeated then OCSFFIndingDetails needs to be in a list or a map. It is not.

If tags is repeated then OCSFFIndingDetails needs to be in a list or a map. It is not.

See this tip in the parquet-go library.
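
To make the rule above concrete, here is a minimal Python sketch of the kind of check that produces the "REPEATED not supported outside LIST or MAP" error. It is a simplified model, not the actual parquet-avro or falcosidekick code. The `Field` class, the `bare_repeated_fields` helper, and the trimmed-down `finding` schema are hypothetical; only the `types` field name comes from the error message, and the `list`/`element` nesting follows the LIST layout described in LogicalTypes.md.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Field:
    """Hypothetical, simplified stand-in for one node of a Parquet schema."""
    name: str
    repetition: str                    # "required", "optional", or "repeated"
    annotation: Optional[str] = None   # e.g. "LIST", "MAP", "STRING"
    children: List["Field"] = field(default_factory=list)


def bare_repeated_fields(node: Field, path: str = "") -> List[str]:
    """Return paths of repeated fields whose parent group is not LIST or MAP."""
    found = []
    parent_is_container = node.annotation in ("LIST", "MAP")
    for child in node.children:
        child_path = f"{path}.{child.name}" if path else child.name
        if child.repetition == "repeated" and not parent_is_container:
            found.append(child_path)
        found.extend(bare_repeated_fields(child, child_path))
    return found


# Assumed shape of the problematic schema: "types" is repeated directly
# inside a plain group, with no LIST wrapper around it.
bare = Field("root", "required", children=[
    Field("finding", "optional", children=[
        Field("types", "repeated", "STRING"),
    ]),
])

# Same data expressed with the three-level LIST layout:
# an annotated group, a repeated inner group, then the element.
wrapped = Field("root", "required", children=[
    Field("finding", "optional", children=[
        Field("types", "optional", "LIST", children=[
            Field("list", "repeated", children=[
                Field("element", "optional", "STRING"),
            ]),
        ]),
    ]),
])

print(bare_repeated_fields(bare))     # ['finding.types']
print(bare_repeated_fields(wrapped))  # []
```

In this model the first schema is flagged because `types` is repeated inside a group that carries no LIST or MAP annotation, while the second passes because the only repeated node sits directly under a LIST-annotated group.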

poiana · 2024-03-19T03:49:54Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

m00lav · 2024-03-19T14:04:35Z

/remove-lifecycle stale

poiana · 2024-09-22T16:10:53Z

Issues go stale after 90d of inactivity.

Mark the issue as fresh with /remove-lifecycle stale.

Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Provide feedback via https://github.com/falcosecurity/community.

/lifecycle stale

m00lav added the kind/bug Something isn't working label Dec 19, 2023

Issif added this to the 2.29.0 milestone Dec 19, 2023

poiana added the lifecycle/stale label Mar 19, 2024

poiana removed the lifecycle/stale label Mar 19, 2024

Issif self-assigned this Apr 30, 2024

Issif modified the milestones: 2.29.0, 2.30 Jun 24, 2024

poiana added the lifecycle/stale label Sep 22, 2024
