crate · simonprickett · Oct 29, 2024 · Aug 23, 2024 · Aug 27, 2024 · Aug 27, 2024
diff --git a/README.md b/README.md
@@ -32,6 +32,10 @@ https://s3.amazonaws.com/crate.sampledata/
 
 Datasets used in courses at the [CrateDB Academy](https://learn.cratedb.com) can be found in the `academy` subdirectory.
 
+## Developer Relations datasets
+
+Datasets used in developer relations talks, workshops and other projects can be found in the `devrel` subdirectory.
+
 ## Contributions
 
 Before adding files to this repository, please make sure to install [Git LFS]

diff --git a/devrel/uk-offshore-wind-farm-data/README.md b/devrel/uk-offshore-wind-farm-data/README.md
@@ -0,0 +1,90 @@
+# UK Offshore Wind Farms Dataset
+
+This dataset contains data about UK offshore windfarms from [The Crown Estate](https://www.thecrownestate.co.uk/en-gb/what-we-do/asset-map/), which manages these installations.  The textual descriptions of each wind farm project are from Wikipedia (see: [List of offshore wind farms in the United Kingdom](https://en.wikipedia.org/wiki/List_of_offshore_wind_farms_in_the_United_Kingdom)).
+
+## Wind Farm Data
+
+### wind_farms.json
+
+This JSONL file contains details of 45 offshore wind farms.  Each record includes an ID for the wind farm, name, description, and geo data in WKT format describing the boundaries of the project as one or more polygons.  The co-ordinates of each turbine are also included, where known.
+
+Each line in the file contains a JSON object with this structure:
+
+```json
+{
+    "id": "TRTNK", 
+    "name": "Triton Knoll", 
+    "description": "Triton Knoll Wind Farm is an 857 MW...",
+    "location": "POINT (0.840000 53.480000)", 
+    "territory": "England", 
+    "boundaries": "POLYGON ((0.87630538600007 53.4262737870001, ...))", 
+    "turbines": {
+        "brand": "MHI Vestas", 
+        "model": "MHI Vestas v164-9.5", 
+        "locations": ["POINT (53.4212466 0.9478282)", ...], 
+        "howmany": 90
+    }, 
+    "capacity": 857.0, 
+    "url": "https://en.wikipedia.org/wiki/Triton_Knoll"
+}
+```
+
+* `id` is a unique ID for each wind farm.  These IDs are used as values for `windfarmid` in the performance data file.
+* `location` is a singular point identifying the location of the wind farm.
+* `boundaries` is a polygon or multi polygon describing the outer boundaries of the wind farm.
+* `capacity` is measured in MW.
+
+Here's an example table schema for this data, using full text indexing for the textual descriptions of each wind farm:
+
+```sql
+CREATE TABLE windfarms (
+    id TEXT PRIMARY KEY,
+    name TEXT,
+    description TEXT INDEX USING fulltext WITH (analyzer='english'),
+    location GEO_POINT,
+    territory TEXT,
+    boundaries GEO_SHAPE INDEX USING geohash WITH (PRECISION='1m', DISTANCE_ERROR_PCT=0.025),
+    turbines OBJECT(STRICT) AS (
+        brand TEXT,
+        model TEXT,
+        locations ARRAY(GEO_POINT),
+        howmany SMALLINT
+    ),
+    capacity DOUBLE PRECISION,
+    url TEXT
+);
+```
+
+## Wind Farm Performance Data
+
+### wind_farm_output.json.gz
+
+This compressed JSONL file forms the second part of this dataset. It contains data relating to the power output of each wind farm on an hourly basis.  The data in this file covers the period from 19th August 2024 to 28th October 2024 and contains 75,825 records. 
+
+Each line of the file contains a JSON object with this structure:
+
+```json
+{
+    "windfarmid": "SEGRN-1",
+    "ts": 1724342400000,
+    "output": 981.6,
+    "outputpercentage": 91.31
+}
+```
+
+* `windfarmid` is the ID of the wind farm that the reading is for.  This maps to a value of `id` in the `windfarms` table.
+* `ts` is the UNIX timestamp in milliseconds for when the output reading was taken.
+* `output` is the output of the wind farm in MW.
+* `outputpercentage` is the percentage of maximum output that the wind farm is operating at.
+
+Here's an example table schema for this data - including a generated column `day` allowing us to partition the data by day:
+
+```sql
+CREATE TABLE windfarm_output (
+   windfarmid TEXT,
+   ts TIMESTAMP WITHOUT TIME ZONE,
+   day TIMESTAMP WITH TIME ZONE GENERATED ALWAYS AS date_trunc('day', ts),
+   output DOUBLE PRECISION,
+   outputpercentage DOUBLE PRECISION
+) PARTITIONED BY (day);
+```
diff --git a/devrel/uk-offshore-wind-farm-data/wind_farm_output.json.gz b/devrel/uk-offshore-wind-farm-data/wind_farm_output.json.gz
diff --git a/devrel/uk-offshore-wind-farm-data/wind_farms.json b/devrel/uk-offshore-wind-farm-data/wind_farms.json