Update README.md
clipperhouse committed Aug 1, 2024
1 parent 7e06aa9 commit b1f4d11
Showing 1 changed file with 28 additions and 12 deletions.
40 changes: 28 additions & 12 deletions README.md
@@ -1,3 +1,5 @@
## Split.net

A more efficient splitter for bytes and strings, with a focus on zero allocation, in C#.

### Usage
@@ -29,37 +31,51 @@ world.
This package exists to save allocations on the hot path, if you are currently using something like `string.Split` from the standard library. Benchmarks:

```
-| Method | Mean | Error | StdDev | Throughput | Gen0 | Gen1 | Gen2 | Allocated |
-|------------------ |----------:|----------:|---------:|------------- |--------:|-------:|-------:|----------:|
-| Split.net | 92.68 us | 8.484 us | 0.465 us | 1.176 GB/s | - | - | - | - |
+| Method | Mean | Error | StdDev | Throughput | Gen0 | Gen1 | Gen2 | Allocated |
+|------------------ |----------:|---------:|---------:|----------- |--------:|-------:|-------:|----------:|
+| Split.net | 91.68 us | 0.804 us | 0.712 us | 1.19 GB/s | - | - | - | - |
```

Standard library:

```
-| Method | Mean | Error | StdDev | Throughput | Gen0 | Gen1 | Gen2 | Allocated |
-|------------------ |----------:|----------:|---------:|------------- |--------:|-------:|-------:|----------:|
-| StringSplit | 109.97 us | 13.953 us | 0.765 us | .991 GB/s | 49.3164 | 0.3662 | 0.1221 | 413352 B |
+| Method | Mean | Error | StdDev | Throughput | Gen0 | Gen1 | Gen2 | Allocated |
+|------------------ |----------:|---------:|---------:|----------- |--------:|-------:|-------:|----------:|
+| string.Split | 106.40 us | 0.138 us | 0.108 us | 1.02 GB/s | 49.3164 | 0.3662 | 0.1221 | 413352 B |
```

### Techniques

This package does two things to achieve zero allocations. First, it lazily iterates over the splits, instead of collecting them into an array.

-Second, those splits are `Span`s, which are a view into the underlying string, and can stay on the stack.
+Second, each split is a `Span`, which is a "view" into the underlying `string` or `byte[]`, and stays on the stack.
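
To make the two techniques concrete, here is a minimal sketch of a lazy, span-based splitter built only on `System` APIs (`AsSpan`, `IndexOf`, `Slice`). It is not Split.net's implementation or API: the names `LazySplitSketch`, `SplitLazily` and `SplitEnumerator` are hypothetical, and it handles only a single `char` separator.

```csharp
using System;

// Hypothetical illustration only; these names are not part of Split.net.
public static class LazySplitSketch
{
    public static SplitEnumerator SplitLazily(this string text, char separator)
        => new SplitEnumerator(text.AsSpan(), separator);

    // A ref struct can only live on the stack, so the enumerator itself
    // never causes a heap allocation.
    public ref struct SplitEnumerator
    {
        private ReadOnlySpan<char> remaining;
        private readonly char separator;
        private bool done;

        public SplitEnumerator(ReadOnlySpan<char> text, char separator)
        {
            remaining = text;
            this.separator = separator;
            done = false;
            Current = default;
        }

        // Each split is a view into the original string, not a copy.
        public ReadOnlySpan<char> Current { get; private set; }

        // foreach finds this by pattern; no interface is needed.
        public SplitEnumerator GetEnumerator() => this;

        public bool MoveNext()
        {
            if (done) return false;

            int index = remaining.IndexOf(separator);
            if (index < 0)
            {
                // No more separators: the rest of the input is the last split.
                Current = remaining;
                remaining = default;
                done = true;
                return true;
            }

            Current = remaining.Slice(0, index);
            remaining = remaining.Slice(index + 1);
            return true;
        }
    }
}

public static class Program
{
    public static void Main()
    {
        foreach (ReadOnlySpan<char> part in "hello,world".SplitLazily(','))
        {
            // ToString allocates; it is used here only to print the result.
            Console.WriteLine(part.ToString());
        }
        // prints "hello" then "world"
    }
}
```

Nothing is collected into an array: each call to `MoveNext` finds the next separator and exposes a slice of the original input, so a caller that only inspects the spans does no heap allocation.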

### Data types

This package supports `string`/`char` (UTF-16) and UTF-8 `byte[]`. We also support `Stream` of UTF-8 bytes and `TextReader`/`StreamReader` of `char`.

### Testing

We [test](https://github.com/clipperhouse/Split.net/tree/main/Tests) that Split.net returns identical results to `string.Split`, including various edge cases.

### Prior art

These are not original ideas! Here are a few other examples with a similar approach:

-[`System.MemoryExtensions.SpanSplitEnumerator`](https://github.com/dotnet/runtime/pull/104534) (I started this package by forking SpanSplitEnumerator.)
+- [`SpanSplitEnumerator`](https://github.com/dotnet/runtime/pull/104534) (This Split.net package started as a fork of `SpanSplitEnumerator`)

-[`Microsoft.Extensions.Primitives.StringTokenizer`](https://learn.microsoft.com/en-us/dotnet/core/extensions/primitives#the-stringtokenizer-type)
+- [`StringTokenizer`](https://learn.microsoft.com/en-us/dotnet/core/extensions/primitives#the-stringtokenizer-type)

-[`Microsoft.Toolkit.HighPerformance.Extensions.StringExtensions.Tokenize`](https://learn.microsoft.com/en-us/dotnet/api/microsoft.toolkit.highperformance.extensions.stringextensions.tokenize?view=win-comm-toolkit-dotnet-6.1)
+- [`StringExtensions.Tokenize`](https://learn.microsoft.com/en-us/dotnet/api/microsoft.toolkit.highperformance.extensions.stringextensions.tokenize?view=win-comm-toolkit-dotnet-6.1)

Each of the above is in the same ballpark of throughput and allocation as this package.

-### Why use this package, then?
+### Why use Split.net, then?

You might like the UTF-8 support, SplitAny, streams & readers, or heck maybe you just like the API. Feedback welcome.

### By the way

If you are splitting in order to get "words" from natural text, you may wish to use the Unicode definition of word boundaries, which I've implemented in [this package](https://github.com/clipperhouse/uax29.net).

-Well, it's in progress, but enhancements will include simpler UTF-8 support, as well as streams and readers.
+I've also implemented these ideas [in Go](https://github.com/clipperhouse/split).
