diff --git a/docs/architecture/collect.md b/docs/architecture/collect.md
new file mode 100644
index 00000000..943a44ef
--- /dev/null
+++ b/docs/architecture/collect.md
@@ -0,0 +1,122 @@
+# collect
+
+We need multiple kinds of data from our services.
+
+1. **Service performance**. Jemison itself has operating parameters we'd like to track. For example, we can use Golang's memory monitoring tools to [capture and analyze](https://stackoverflow.com/questions/24863164/how-to-analyze-golang-memory) things like heap usage, allocations, GCs, and so forth. This will help us spot leaks (if we have them) and so on.
+2. **Application performance**. At the application (as opposed to system) level, we may want to track the time of some activities. For example, how much time do we spend reading/writing objects to S3? Postgres? Unlike memory/heap usage, we have to explicitly capture this data as part of our application logic. (For example, we may want to periodically query [user table statistics](https://kaleman.netlify.app/getting-query-performance-stats-with-pg).)
+3. **Service use**. What queries do we receive, from where, and at what rate? What is the clickthrough rate on queries?
+
+These and other needs will arise. What we want is a common architecture, in keeping with the rest of the stack, for capturing this data.
+
+Given the architecture of Jemison as a whole, we'll leverage our queue. In a picture:
+
+```mermaid
+sequenceDiagram
+    autonumber
+    participant S as Service
+    participant Q as Queue
+    participant C as Collect
+    participant P as S3
+    %% ---
+    S->>Q: post data
+    C->>Q: poll for data
+    activate Q
+    Q->>C: grab data
+    deactivate Q
+    C->>P: store data to S3
+    note right of C: sort data to appropriate key
+```
+
+Because every service already knows how to talk to the queue, we'll use the queue to ship data from any given service to `collect`. It will be `collect`'s responsibility to pull the data from the queue, sort it, and store it at the appropriate location in S3 for processing.
+
+Consider the following concrete example:
+
+1. `fetch` will keep track of how many pages it fetches.
+2. Every 5 minutes, `fetch` will post to the `collect` queue a message containing:
+   1. The name of the service posting the data (e.g. `fetch`)
+   2. The name of the data being posted (e.g. `pages_fetched`)
+   3. A JSON document containing the data (e.g. `{ "five_minutes": 123, "since_startup": 123456, "epoch_nanos": 190231231231 }`)
+
+The `collect` service will pull this off the queue and stash the JSON document in S3. It will stash the data at
+
+```
+2025-01-22/fetch/pages_fetched/234234123123234.json
+```
+
+where the final component of the key is the number of nanoseconds since the Unix epoch, as recorded by the service in the `epoch_nanos` field.
+
+## data structures
+
+Data will ship in a JSON structure that looks like:
+
+```
+{
+  "source": <string>,
+  "data_id": <string>,
+  "data": <object>,
+  "epoch_nanos": <integer>
+}
+```
+
+This will provide a standard queueing format for all data types. It is possible that two services might record a piece of data at the same nanosecond, but 1) that seems unlikely, and 2) we will store different kinds of data at different paths in S3 anyway.
+
+### validation
+
+JSON is dangerous: it is easy to create and ship, somewhat easy to parse, and easy to make mistakes with (e.g. shipping `epoc_nanos` instead of `epoch_nanos`). Therefore, **every data element will be validated via JSON Schemas** ([documentation](https://json-schema.org/learn/getting-started-step-by-step#creating-your-first-schema)) before being stored to S3.
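+
+To make this concrete, here is a rough sketch, in Go, of what the validation step might look like. The `jsonschema` library, the `schemas/<source>/<data_id>.json` layout, and the `Message` and `Validate` names are illustrative assumptions rather than settled implementation choices (the schema directory and embedding are discussed below):
+
+```go
+// Sketch only: the package name, schema layout, and jsonschema library
+// are assumptions, not settled choices for the collect service.
+package collect
+
+import (
+	"embed"
+	"encoding/json"
+	"fmt"
+
+	"github.com/santhosh-tekuri/jsonschema/v5"
+)
+
+// Compiled JSON Schemas, embedded into the binary at build time.
+//go:embed schemas/*
+var schemas embed.FS
+
+// Message mirrors the top-level structure every service posts to the queue.
+type Message struct {
+	Source     string          `json:"source"`
+	DataID     string          `json:"data_id"`
+	Data       json.RawMessage `json:"data"`
+	EpochNanos int64           `json:"epoch_nanos"`
+}
+
+// Validate looks up the schema for (source, data_id) and applies it to the
+// data field, returning an error if the payload does not conform.
+func Validate(m Message) error {
+	path := fmt.Sprintf("schemas/%s/%s.json", m.Source, m.DataID)
+	raw, err := schemas.ReadFile(path)
+	if err != nil {
+		return fmt.Errorf("no schema for %s/%s: %w", m.Source, m.DataID, err)
+	}
+	schema, err := jsonschema.CompileString(path, string(raw))
+	if err != nil {
+		return fmt.Errorf("schema %s does not compile: %w", path, err)
+	}
+	var doc any
+	if err := json.Unmarshal(m.Data, &doc); err != nil {
+		return fmt.Errorf("data field is not valid JSON: %w", err)
+	}
+	return schema.Validate(doc)
+}
+```
+
+The important property is that a message which fails validation is rejected at this point and never reaches S3.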
+
+In the `collect` service directory, we will store schemas in a `schemas` directory. Schemas will be defined in Jsonnet. Why? So common patterns can be abstracted out over all of our schemas. As part of our build, we will:
+
+1. Compile our Jsonnet schemas into JSON Schemas.
+   1. This has the benefit of guaranteeing that the resulting schemas are valid JSON.
+2. Embed the JSON Schemas into the `collect` service via an [embedded filesystem](https://gobyexample.com/embed-directive).
+   1. This has the benefit of guaranteeing that the schemas will be available at time of operation.
+
+We might lay out the `schemas` directory using a `<source>/<data_id>` structure, to mirror the object we ship to `collect`.
+
+The schema for the top-level `collect` object might be described as:
+
+```json
+{
+  "$schema": "https://json-schema.org/draft/2020-12/schema",
+  "$id": "https://search.gov/collect.schema.json",
+  "title": "Data collection schema",
+  "description": "The schema applied to all top-level data collection objects",
+  "type": "object",
+  "properties": {
+    "source": {
+      "description": "The service generating the data",
+      "type": "string"
+    },
+    "data_id": {
+      "description": "A short identifier for the data, unique to the service",
+      "type": "string"
+    },
+    "data": {
+      "description": "The data itself",
+      "type": "object"
+    },
+    "epoch_nanos": {
+      "description": "The time the data was collected, in nanoseconds since the Unix epoch, according to the service",
+      "type": "integer"
+    }
+  }
+}
+```
+
+(Note that the above is expressed in JSON, not Jsonnet.) It is up to `collect` to use the `source` and `data_id` fields to look up the correct schema in order to validate the `data` field.
+
+## collect in operation
+
+When `collect` is in operation, it will do the following:
+
+```mermaid
+flowchart TD
+    start[pull data from queue]
+    start --> get[get service, data type]
+    get --> check{apply schema}
+    check -- is valid --> s3[store to S3]
+    check -- not valid --> error[throw error]
+```
+
+In short, all `collect` does is pull things from the queue, grab the JSON, validate the JSON, and write the JSON document to S3 at the appropriate key. It does not *process* data; it just makes sure it is in the correct *shape*.
\ No newline at end of file
diff --git a/docs/architecture/index.md b/docs/architecture/index.md
index a653fd64..0c058a46 100644
--- a/docs/architecture/index.md
+++ b/docs/architecture/index.md
@@ -45,5 +45,14 @@
 At this point, further services clean, process, and prepare the text for search.
 
 Read more about the [data processing pipeline](processing.md).
 
-## administration via API
+## a source of data
+
+We collect data about Jemison (to understand its performance) as well as how our users interact with Jemison (so we know what they're searching for, and we can deliver a better experience to the public).
+
+We serve the public. It is in their interest to have their privacy protected when they use Jemison. Therefore:
+
+1. We do **not** collect individually identifiable information.
+2. We do **not** collect information that would allow us to identify individuals through aggregate operations.
+
+**We hold the public's trust**, and part of maintaining that trust is that we will collect information that allows us to deliver a better service, while preserving the public's privacy as they work to find what they need from Federal services.