Update readme
kipcole9 committed Sep 14, 2021
1 parent dad76f5 commit 9688f6b
Showing 1 changed file with 12 additions and 12 deletions.
24 changes: 12 additions & 12 deletions README.md
@@ -1,31 +1,31 @@
# Unicode

![Build Status](https://api.cirrus-ci.com/github/elixir-unicode/unicode.svg)
[![Hex.pm](https://img.shields.io/hexpm/v/ex_unicode.svg)](https://hex.pm/packages/ex_unicode)
[![Hex.pm](https://img.shields.io/hexpm/dw/ex_unicode.svg?)](https://hex.pm/packages/ex_unicode)
[![Hex.pm](https://img.shields.io/hexpm/l/ex_unicode.svg)](https://hex.pm/packages/ex_unicode)
[![Hex.pm](https://img.shields.io/hexpm/v/unicode.svg)](https://hex.pm/packages/unicode)
[![Hex.pm](https://img.shields.io/hexpm/dw/unicode.svg?)](https://hex.pm/packages/unicode)
[![Hex.pm](https://img.shields.io/hexpm/l/unicode.svg)](https://hex.pm/packages/unicode)

Functions to return information about Unicode codepoints.

Elixir string are UTF8-encoded [Unicode](https://unicode.org) binaries. This is a flexible and complete encoding scheme for the worlds many scripts, characters and emjois. However since its a variable lenght encoding (using between one and four bytes for UTF8) it is harder to use high-performance byte-oriented functions to decompose strings.
Elixir strings are UTF8-encoded [Unicode](https://unicode.org) binaries. This is a flexible and complete encoding scheme for the worlds many scripts, characters and emjois. However since its a variable length encoding (using between one and four bytes for UTF8) it is harder to use high-performance byte-oriented functions to decompose strings.

Since checking strings and codepoints for certain attributes - like whether they are upper case, or symbols, or whitespace - is a common occurrence, a performant approach to such detection is useful.

It is tempting to assume the use of [US ASCII](https://en.wikipedia.org/wiki/ASCII) encoding and checking only for characters in that range. For example it is very common to see code in Elixir checking `codepoint in ?a..?z` to check for lowercase alphabetic characters. When the underlying programming language has no canonical form for a string beyond bytes this may be considered acceptable - the programmer is defining the script domain as he or she sees fit.

However Elixir string are declared to be [UTF8 encoded Unicode string](https://unicode.org/faq/utf_bom.html#utf8-1) it seems appropriate to make it easier to determins the characteristics of codepoints (and strings) using this standard.
However Elixir string are declared to be [UTF8 encoded Unicode string](https://unicode.org/faq/utf_bom.html#utf8-1) it seems appropriate to make it easier to determine the characteristics of codepoints (and strings) using this standard.

The Elixir standard library does not provide introspection beyond that required to support casing (String.downcase/1, String.upcase/1, String.capitalize/1). This library aims to *fill in the blanks* a little bit.

## Additional Unicode libraries

[ex_unicode](https://hex.pm/packages/ex_unicode) provides basic introspection into Unicode codepoints and strings. Additional libraries (either releases or in development) buil upon this library):
[ex_unicode](https://hex.pm/packages/unicode) provides basic introspection of Unicode codepoints and strings. Additional libraries (either released or in development) build upon this library):

* [unicode_set](https://github.com/elixir-unicode/unicode_set) implements functions to parse and match on [unicode sets](http://unicode.org/reports/tr35/#Unicode_Sets)

* [unicode_guards](https://github.com/elixir-unicode/unicode_guards) is a simple library implementing common function guards using `unicode_set` and `ex_unicode`
* [unicode_guards](https://github.com/elixir-unicode/unicode_guards) is a simple library implementing common function guards using `unicode_set` and `unicode`

* [unicode_string](https://github.com/elixir-unicode/unicode_string) is a library to implement efficient string splitting and replacing based upon unicode sets
* [unicode_string](https://github.com/elixir-unicode/unicode_string) is a library to implement efficient string splitting into words and sentences based upon the [Unicode Segementation](https://unicode.org/reports/tr29/) algorithm.

* [unicode_transform](https://github.com/elixir-unicode/unicode_transform) implements the [Unicode transform](https://unicode.org/reports/tr35/tr35-general.html#Transforms) specification.

@@ -172,18 +172,18 @@ The function `Unicode.unaccent/1` attempts to transform a Unicode string into a
## Recognition
The information functions are heavily inspired by [@qqwy's elixir-unicode package](https://github.com/Qqwy/elixir-unicode) and compatibility with some of the api is represented by including some of the doctests from that package.
The information functions are heavily inspired by [@qqwy's elixir-unicode package](https://github.com/Qqwy/elixir-unicode) and compatibility with some of the api is represented by including some of the doctests from that package. Originally published under the `:unicode` package name on hex, this original work is now replaced with this library code.
## Installation
The package can be installed by adding `ex_unicode` to your list of dependencies in `mix.exs`:
The package can be installed by adding `unicode` to your list of dependencies in `mix.exs`:
```elixir
def deps do
[
{:ex_unicode, "~> 1.0"}
{:unicode, "~> 1.13"}
]
end
```

The docs can be found at [https://hexdocs.pm/ex_unicode](https://hexdocs.pm/ex_unicode).
The docs can be found at [https://hexdocs.pm/unicode](https://hexdocs.pm/unicode).
