Adds UK Offshore Wind Farm dataset #14

Draft: wants to merge 5 commits into base: main
4 changes: 4 additions & 0 deletions README.md
@@ -32,6 +32,10 @@ https://s3.amazonaws.com/crate.sampledata/

Datasets used in courses at the [CrateDB Academy](https://learn.cratedb.com) can be found in the `academy` subdirectory.

## Developer Relations datasets

Datasets used in developer relations talks, workshops and other projects can be found in the `devrel` subdirectory.

## Contributions

Before adding files to this repository, please make sure to install [Git LFS]
90 changes: 90 additions & 0 deletions devrel/uk-offshore-wind-farm-data/README.md
@@ -0,0 +1,90 @@
# UK Offshore Wind Farms Dataset

This dataset contains data about UK offshore wind farms from [The Crown Estate](https://www.thecrownestate.co.uk/en-gb/what-we-do/asset-map/), which manages these installations. The textual descriptions of each wind farm project are from Wikipedia (see: [List of offshore wind farms in the United Kingdom](https://en.wikipedia.org/wiki/List_of_offshore_wind_farms_in_the_United_Kingdom)).

## Wind Farm Data

### wind_farms.json

This JSONL file contains details of 45 offshore wind farms. Each record includes an ID for the wind farm, its name, a description, and geo data in WKT format describing the boundaries of the project as one or more polygons. The coordinates of each turbine are also included, where known.

Each line in the file contains a JSON object with this structure:

```json
{
  "id": "TRTNK",
  "name": "Triton Knoll",
  "description": "Triton Knoll Wind Farm is an 857 MW...",
  "location": "POINT (0.840000 53.480000)",
  "territory": "England",
  "boundaries": "POLYGON ((0.87630538600007 53.4262737870001, ...))",
  "turbines": {
    "brand": "MHI Vestas",
    "model": "MHI Vestas v164-9.5",
    "locations": ["POINT (53.4212466 0.9478282)", ...],
    "howmany": 90
  },
  "capacity": 857.0,
  "url": "https://en.wikipedia.org/wiki/Triton_Knoll"
}
```

* `id` is a unique ID for each wind farm. These IDs are used as values for `windfarmid` in the performance data file.
* `location` is a single point identifying the location of the wind farm.
* `boundaries` is a polygon or multipolygon describing the outer boundaries of the wind farm.
* `capacity` is measured in MW.
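
As a rough illustration of the fields described above, the following Python sketch parses one JSONL line and extracts the coordinates from the WKT `location` value. The sample line reuses only the non-elided values from the example record. `parse_wkt_point` and `capacity_per_turbine` are hypothetical helper names, and the sketch assumes WKT's usual x-then-y ordering, that is, longitude first.

```python
import json
import re

# One JSONL line built from the non-elided fields of the example record.
SAMPLE_LINE = (
    '{"id": "TRTNK", "name": "Triton Knoll", '
    '"location": "POINT (0.840000 53.480000)", "territory": "England", '
    '"turbines": {"brand": "MHI Vestas", "howmany": 90}, "capacity": 857.0}'
)

# Matches strings such as "POINT (0.84 53.48)" and captures both numbers.
_POINT_RE = re.compile(
    r"^\s*POINT\s*\(\s*(-?\d+(?:\.\d+)?)\s+(-?\d+(?:\.\d+)?)\s*\)\s*$"
)


def parse_wkt_point(wkt: str) -> tuple[float, float]:
    """Return (x, y) from a WKT POINT string; assumed here to be (lon, lat)."""
    match = _POINT_RE.match(wkt)
    if match is None:
        raise ValueError(f"not a WKT POINT: {wkt!r}")
    return float(match.group(1)), float(match.group(2))


def capacity_per_turbine(record: dict) -> float:
    """Average capacity per turbine in MW (farm capacity / turbine count)."""
    return record["capacity"] / record["turbines"]["howmany"]


record = json.loads(SAMPLE_LINE)
lon, lat = parse_wkt_point(record["location"])
print(record["id"], lon, lat, round(capacity_per_turbine(record), 2))
```

For the sample record, dividing `capacity` by `howmany` gives roughly 9.5 MW per turbine, which is consistent with the turbine model name in the example.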

Here's an example table schema for this data, using full text indexing for the textual descriptions of each wind farm:

```sql
CREATE TABLE windfarms (
    id TEXT PRIMARY KEY,
    name TEXT,
    description TEXT INDEX USING fulltext WITH (analyzer='english'),
    location GEO_POINT,
    territory TEXT,
    boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025),
    turbines OBJECT(STRICT) AS (
        brand TEXT,
        model TEXT,
        locations ARRAY(GEO_POINT),
        howmany SMALLINT
    ),
    capacity DOUBLE PRECISION,
    url TEXT
);
```

## Wind Farm Performance Data

### wind_farm_output.json.gz

This compressed JSONL file forms the second part of this dataset. It contains data relating to the power output of each wind farm on an hourly basis. The data in this file covers the period TODO to TODO and contains TODO records.
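
A minimal Python sketch of reading a gzip-compressed JSONL stream, assuming the file is gzip-compressed as its `.gz` extension suggests. `iter_jsonl_gz` is a hypothetical helper name, and the in-memory buffer stands in for the real file, using a single sample reading from this dataset's documentation.

```python
import gzip
import io
import json
from typing import BinaryIO, Iterator


def iter_jsonl_gz(source: BinaryIO) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSONL stream."""
    with gzip.open(source, mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                yield json.loads(line)


# Build a tiny in-memory stand-in for the real file from one sample reading.
sample = (
    b'{"windfarmid": "SEGRN-1", "ts": 1724342400000, '
    b'"output": 981.6, "outputpercentage": 91.31}\n'
)
buffer = io.BytesIO(gzip.compress(sample))

readings = list(iter_jsonl_gz(buffer))
print(readings[0]["windfarmid"], len(readings))
```

To read the real file, open it in binary mode (for example `open("wind_farm_output.json.gz", "rb")`) and pass the file object to the same function. Because the function is a generator, the file is read line by line instead of being loaded into memory all at once.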

Each line of the file contains a JSON object with this structure:

```json
{
"windfarmid": "SEGRN-1",
"ts": 1724342400000,
"output": 981.6,
"outputpercentage": 91.31
}
```

* `windfarmid` is the ID of the wind farm that the reading is for. This maps to a value of `id` in the `windfarms` table.
* `ts` is the UNIX timestamp in milliseconds for when the output reading was taken.
* `output` is the output of the wind farm in MW.
* `outputpercentage` is the percentage of maximum output that the wind farm is operating at.
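
The fields above can be interpreted with a short Python sketch. It converts `ts` to a UTC datetime and derives the maximum output implied by a reading. The helper names are hypothetical, and the second function assumes that `outputpercentage` is simply `output` divided by the farm's maximum output, expressed as a percentage.

```python
from datetime import datetime, timezone


def reading_time(ts_ms: int) -> datetime:
    """Convert a millisecond UNIX timestamp to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(ts_ms / 1000, tz=timezone.utc)


def implied_max_output(output_mw: float, output_percentage: float) -> float:
    """Maximum output in MW implied by a reading (assumption: pct = output / max)."""
    if output_percentage == 0:
        raise ValueError("cannot infer maximum output from a 0% reading")
    return output_mw / (output_percentage / 100)


# The sample reading from the example record.
reading = {
    "windfarmid": "SEGRN-1",
    "ts": 1724342400000,
    "output": 981.6,
    "outputpercentage": 91.31,
}

print(reading_time(reading["ts"]).isoformat())
print(round(implied_max_output(reading["output"], reading["outputpercentage"])))
```

For the sample reading, the timestamp corresponds to 16:00 UTC on 22 August 2024, and the implied maximum output is about 1075 MW.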

Here's an example table schema for this data, including a generated column `day` that allows the data to be partitioned by day:

```sql
CREATE TABLE windfarm_output (
    windfarmid TEXT,
    ts TIMESTAMP WITHOUT TIME ZONE,
    day TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('day', ts),
    output DOUBLE PRECISION,
    outputpercentage DOUBLE PRECISION
) PARTITIONED BY (day);
```
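
To make the effect of the generated `day` column concrete, here is a small Python sketch of the same truncation applied to a millisecond timestamp. It is an approximation of what `date_trunc('day', ts)` computes, assuming truncation happens in UTC. `day_bucket` is a hypothetical name used only for illustration.

```python
from datetime import datetime, timezone

MS_PER_DAY = 24 * 60 * 60 * 1000


def day_bucket(ts_ms: int) -> int:
    """Truncate a millisecond UNIX timestamp to midnight UTC of the same day."""
    return ts_ms - (ts_ms % MS_PER_DAY)


ts = 1724342400000  # sample reading from this README (16:00 UTC)
bucket = day_bucket(ts)
print(bucket, datetime.fromtimestamp(bucket / 1000, tz=timezone.utc).date())

# A reading one hour later on the same UTC day lands in the same bucket.
assert day_bucket(ts + 60 * 60 * 1000) == bucket
```

With hourly readings, up to 24 rows per wind farm share each `day` value, so each partition holds one day of data across all wind farms.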
3 changes: 3 additions & 0 deletions devrel/uk-offshore-wind-farm-data/wind_farm_output.json
Git LFS file not shown
3 changes: 3 additions & 0 deletions devrel/uk-offshore-wind-farm-data/wind_farms.json
Git LFS file not shown