diff --git a/MyApp/_pages/jsonl-format.md b/MyApp/_pages/jsonl-format.md
index 9a9855098a..de00e5b8a9 100644
--- a/MyApp/_pages/jsonl-format.md
+++ b/MyApp/_pages/jsonl-format.md
@@ -5,11 +5,15 @@ title: JSON Lines Data Format
+
+
+
+
[JSON Lines](https://jsonlines.org) is an efficient JSON data format parseable by streaming parsers and text processing tools
like Unix shell pipelines, whose streamable properties is making it a popular data format for maintaining large datasets
like the large AI datasets maintained on https://huggingface.co which is now accessible on [Auto HTML API](/auto-html-api) pages:
@@ -22,11 +26,19 @@ like the large AI datasets maintained on https://huggingface.co which is now acc
+It's added by default, which can be removed with:
+
+```csharp
+Plugins.RemoveAll(x => x is JsonlFormat);
+```
+
Which lists the `.jsonl` format like other data formats where it's accessible from `.jsonl` extension for `?format=jsonl` query string, e.g:
- https://blazor-gallery.servicestack.net/albums.jsonl
- https://blazor-gallery.servicestack.net/api/QueryAlbums?format=jsonl
+In addition to the standard `Accept: text/jsonl` HTTP Header.
+
### CSV Enumerable Behavior
The JSON Lines data format behaves the same way as the CSV format where if your Request DTO is annotated with either
@@ -60,6 +72,25 @@ public class Pocos : IReturn
}
```
+JSON Lines format is similar to the [CSV format](/csv-format) where each line in the file represents a separate JSON object.
+This makes it easy to process large datasets incrementally, without having to load the entire file into memory.
+
+```json
+{"id": 1, "name": "John Doe"}
+{"id": 2, "name": "Jane Doe"}
+{"id": 3, "name": "John Smith"}
+```
+
+These streamable properties in combination of the ubiquitous JSON format has led to its popularity where it can be processed by
+streaming parsers and text processing tools like Unix shell pipelines, making it an ideal format for handling large datasets efficiently,
+or quickly putting together a data pipeline for quick large scale experimentation.
+
+For example, you can pull out the `title` property from a JSON Lines file using the following command line using just `curl` and `jq`:
+
+:::sh
+curl https://blazor-gallery.servicestack.net/albums.jsonl -s | jq '.title'
+:::
+
### Async Streaming Parsing Example
The [HTTP Utils](/http-utils) extension methods makes it trivial to implement async streaming parsing where you can process
@@ -91,4 +122,92 @@ Which can also serialize to a `string`, `Stream` or `TextWriter`:
var jsonl = JsonlSerializer.SerializeToString(albums);
JsonlSerializer.SerializeToStream(albums, stream);
JsonlSerializer.SerializeToWriter(albums, textWriter);
-```
\ No newline at end of file
+```
+
+
+## Indexing AI-Generated Art Albums with JSONL
+
+In this example, we will walk you through a practical demonstration of using JSON Lines to index AI-generated art albums. We will fetch data
+from the Blazor Diffusion API, which provides "Creatives" generated based on user-provided text prompts. Our goal is to index the text and
+image URLs of these Creatives into a Typesense search server.
+
+Here is the code snippet with a practical implementation of interacting with `creatives.jsonl` endpoint:
+
+```csharp
+const string ApiUrl = "https://localhost:5001/creatives.jsonl";
+
+var provider = TypesenseUtils.InitProvider();
+var typesenseClient = provider.GetRequiredService();
+await typesenseClient.InitCollection();
+var sw = new Stopwatch();
+sw.Start();
+
+var lastIndexedCreative = 0;
+
+while (true)
+{
+ await using var stream = await ApiUrl.AddQueryParams(new() {
+ ["IdGreaterThan"] = lastIndexedCreative,
+ ["OrderBy"] = "Id",
+ })
+ .GetStreamFromUrlAsync();
+ await foreach (var line in stream.ReadLinesAsync())
+ {
+ var creative = line.FromJson();
+ lastIndexedCreative = creative.Id;
+ // processing
+ if (creative.Artifacts.Count == 0)
+ return;
+ var imageId = creative.PrimaryArtifactId
+ ?? creative.CuratedArtifactId
+ ?? creative.Artifacts.First().Id;
+ var artifact = creative.Artifacts.First(x => x.Id == imageId);
+ var indexedCreative = new IndexedCreative
+ {
+ Text = creative.Prompt,
+ Url = $"https://cdn.diffusion.works{artifact.FilePath}"
+ };
+ // Process and index the creative data
+ await typesenseClient.CreateDocument("Creatives", indexedCreative);
+ }
+
+ // sleep if no new data ..
+}
+```
+
+Here we have the service client initialization with the `TypesenseUtils.InitProvider()`. Then, inside an infinite loop, we fetch data using `GetStreamFromUrlAsync()` and stream it line by line using the `ReadLinesAsync()` method. After processing each line, we index the creative item into typesense.
+
+## Indexing Creative Data into Typesense
+
+With Typesense, implementation of full-text search for our data can be achieved efficiently. Once we have fetched the Creative data from Blazor Diffusion's API, we can proceed to index it into Typesense for easy searching and analysis of the indexed documents. To index the Creative data, we're using [a C# client library built by the community](https://github.com/DAXGRID/typesense-dotnet).
+
+Processed data are saved into IndexedCreative instances and are added into the Typesense server. Under the hood, the C# client library interacts with Typesense's API and handles all HTTP requests and responses for us.
+
+### Memory Foot Print
+
+Our application logs the memory usage over time, and we see it’s constant relative to the size of our process. Streamed JSONL parsing and async indexing allows us processing infinite datasets size without hitting the memory bounds.
+
+## Skipping Already Indexed Data Example
+
+For increased efficiency, instead of re-indexing all the data each time our program runs, we will only fetch and index data that hasn't been indexed already. We use `lastIndexedCreative` integer variable to keep track of the last creative indexed. On subsequent runs of our program, we fetch only new data by modifying the API URL to fetch Creative objects starting from the ID greater than `lastIndexedCreative`.
+
+This is done by appending the `IdGreaterThan` query parameter to our AutoQuery endpoint.
+
+```csharp
+await using var stream = await ApiUrl.AddQueryParams(new() {
+ ["IdGreaterThan"] = lastIndexedCreative,
+ ["OrderBy"] = "Id",
+ }).GetStreamFromUrlAsync();
+```
+
+## Visualizing Indexed Documents with Typesense Dashboard
+
+Another great community feature is the [Typesense dashboard project](https://github.com/bfritscher/typesense-dashboard), providing a user-friendly interface to manage and browse collections in a Typesense search server. The dashboard, which can be accessed via the browser or as a downloadable desktop app, provides real-time statistics and overview over the indexed collections.
+
+The indexed creative data can now be explored and analyzed using the dashboard's intuitive interface. You can apply filters, perform searches, and manipulate the stored data through the user interface.
+
+## Conclusion
+
+The JSON lines format is a versatile and efficient standard for working with large datasets. Its streamable properties make it particularly useful for handling big data and its ease of use makes it an ideal choice for developers working in any client language.
+
+Streaming data directly to and from HTTP APIs — a major power of the JSON lines format — can dramatically improve the performance of data-intensive applications. Furthermore, ServiceStack's new JSON Lines support makes it straight forward serve and consume these streams.
\ No newline at end of file