Skip to content

ICU MessageFormat for Javascript - i18n Plural and Gender Capable Messages

License

Notifications You must be signed in to change notification settings

tomasruizr/messageformat.js

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Build Status

messageformat.js

The experience and subtlety of your program's text can be important. MessageFormat (PluralFormat + SelectFormat) is a mechanism for handling both pluralization and gender in your applications. It can also lead to much better translations, as it was built by ICU to help solve those two problems for all known CLDR languages - likely all the ones you care about.

There is a good slide-deck on Plural and Gender in Translated Messages by Markus Scherer and Mark Davis. But, again, remember that many of these problems apply even if you're only outputting english.

See just how many different pluralization rules there are.

MessageFormat in Java-land technically incorporates all other type formatting (and the older ChoiceFormat) directly into its messages, however, in the name of filesize, messageformat.js only strives to implement SelectFormat and PluralFormat.

There are plans to pull in locale-aware NumberFormat parsing as a "plugin" to this library, but as of right now, it's best to pass things in preformatted (as suggested in the ICU docs).

We have also ported the Google Closure implementation of NumberFormat, but there is no direct integration of these two libraries. (They work well together!)

What problems does it solve?

A progression of strings in programs:

There are 1 results.

There are 1 result(s).

Number of results: 5.

These are generally unacceptable in this day and age. Not to mention the problem expands when you consider languages with 6 different pluralization rules. You may be using something like Gettext to solve this across multiple languages, but even Gettext falls flat.

What does it look like?

ICU bills the format as easy to read and write. It may be more easy to read and write, but I'd still suggest a tool for non-programmers. It looks a lot like Java's ChoiceFormat - but is different in a few significant ways, most notably its addition of the plural keyword, and more friendly select syntax.

{GENDER, select,
    male {He}
  female {She}
   other {They}
} found {NUM_RESULTS, plural,
            one {1 result}
          other {# results}
        } in {NUM_CATEGORIES, plural,
                  one {1 category}
                other {# categories}
             }.

Here's a few data sets against this message:

{
  "GENDER"         : "male",
  "NUM_RESULTS"    : 1,
  "NUM_CATEGORIES" : 2
}
> "He found 1 result in 2 categories."

{
  "GENDER"         : "female",
  "NUM_RESULTS"    : 1,
  "NUM_CATEGORIES" : 2
}
> "She found 1 result in 2 categories."

{
  "GENDER"         : "male",
  "NUM_RESULTS"    : 2,
  "NUM_CATEGORIES" : 1
}
> "He found 2 results in 1 category."

{
  "NUM_RESULTS"    : 2,
  "NUM_CATEGORIES" : 2
}
> "They found 2 results in 2 categories."

There is very little that needs to be repeated (until gender modifies more than one word), and there are equivalent/appropriate plural keys for every single language in the CLDR database. The syntax highlighting is less than ideal, but parsing a string like this gives you flexibility for your messages even if you're only dealing with english.

Features

  • Handles arbitrary nesting of pluralization and select rules.
  • Works on the server and the client.
  • No i18n necessary - you can use it for just well-formed english sentences. UX++;
  • Speed: Compiles messages to JavaScript code.
    • This is great for speed. Message formatting is just string concatenation.
    • Run a precompiler at build time and remove the need for most of the library. filesize--;
  • Compatible with other languages that support MessageFormat
  • Very whitespace tolerant.
  • Supports unicode characters
  • Most translation companies are beginning to be exposed to translations like this, even if it's not their core business.

Install

Node

> npm install messageformat

var MessageFormat = require('messageformat');

Browser

<script src="/path/to/messageformat-v0.x.x.js"></script>

Adding Provided Locales

<!-- after the messageformat.js include, but before you need to use the locale -->
<script src="/path/to/messageformat/locales/xx_xx.js"></script>

TODO:: In node, we can automatically pull in all known locales for you.

Adding Custom locales

// Any time after MessageFormat is included
MessageFormat.locale["locale_name"] = function () { ... };

// Or during instantiation
var mf = new MessageFormat( 'locale_name', function () { ... } );

Tests

These require node:

> make test

> make test-browser

Build Time Compilation

You really should take advantage of this. It is much faster than parsing in real-time.

I will eventually release a Handlebars and Require.js (r.js) plugin to do this automatically, but if you would like to output the raw javascript function, the following does that:

var mf = new MessageFormat('en');
var js_string_represenation = mf.precompile(
  mf.parse(
    'Your {NUM, plural, one{message} other{messages}} go here.'
  )
);

// This returns an unnamed - unreferenced function that needs to be passed the
// MessageFormat object. See the source of `MessageFormat.compile` for more details.

The CLI compiler

If you don't want to compile your templates programmatically, you can use the built in CLI compiler.

This tool is in early stage. It was tested on Linux and Windows, but if you find a bug, please create an issue.

Usage

> [sudo] npm install -g messageformat

> messageformat
Usage: messageformat -l [locale] [INPUT_DIR] [OUTPUT_DIR]

    --locale, -l        locale to use [mandatory]
    --inputdir, -i      directory containings messageformat files to compile       $PWD
    --output, -o        output where messageformat will be compiled                $PWD
    --watch, -w         watch `inputdir` for change                                false
    --namespace, -ns    object in the browser containing the templates             window.i18n
    --include, -I       Glob patterns for files to include in `inputdir`           **/*.json
    --stdout, -s        Print the result in stdout instead of writing in a file    false
    --verbose, -v       Print logs for debug                                       false

If your prefer looking at an example go there.

messageformat will read every JSON files in inputdir and compile them to output.

When using the CLI, the following commands will works exactly the same:

> messageformat --locale en ./example/en
> messageformat --locale en ./example/en ./i18n.js
> messageformat --locale en --inputdir ./example/en --output ./i18n.js

or even shorter

> cd example/en
> messageformat -l en

You can also do it with a unix pipe

> messageformat -l en --stdout > i18n.js

Take a look at the example inputdir and output

A watch mode is available with the --watch or -w option.

The JSON messageformat files

The original JSON files are simple objects, with a key and a messageformat string as value, like this one:

{
  "test": "Your {NUM, plural, one{message} other{messages}} go here."
}

The CLI walks into inputdir recursively so you can structure your messageformat with dirs and subdirs.

In the browser

Now that you have compiled your messageformat, you can use it in your html by adding a <script src="index.js"></script>.

In the browser, the global window.i18n is an object containing the messageformat compiled functions.

> i18n
Object
  colors: Object
    blue:   [ Function ]
    green:  [ Function ]
    red:    [ Function ]
  "sub/folder/plural": Object
    test:   [ Function ]

You could then use it:

$('<div>').text( window.i18n[ 'sub/folder/plural' ].test( { NUM: 1 } ) ).appendTo('#content');

The namespace window.i18n could be changed with the --namespace or -ns option.

Subdirectories messageformat are available in the window.i18n namespace, prefixed with their relative path :

> window.i18n['sub/folder/plural']
Object
* test: [ Function ]

sub/folder is the path, plural is the name of the JSON file, test is the key used.

A working example is available here.

No Frills

The most simple case of MessageFormat would involve no formatting. Just a string passthrough. This sounds silly, but often it's nice to always use the same i18n system when doing translations, and not everything takes variables.

// Insantiate a MessageFormat object on your locale
var mf = new MessageFormat('en');

// Compile a message
var message = mf.compile( 'This is a message.' ); // returns a function

// You can call the function to get data out
> message();
"This is a message."

// NOTE:: if a message _does_ require data to be passed in, an error is thrown if you do not.

Simple Variable Replacement

The second most simple way to use MessageFormat is for simple variable replacement. MessageFormat looks odd at first, but it's actually fairly simple. One way to think about the { and } is that every level of them bring you into and out-of literal and code mode.

By default (like in the previous example), you are just writing a literal. Then the first level of brackets brings you into one of several data-driven situations. The most simple is variable replacement.

Simply putting a variable name in between { and } will place that variable there in the output.

// Instantiate new MessageFormat object for your locale
var mf = new MessageFormat('en');

// Compile a message
var message = mf.compile('His name is {NAME}.');

// Then send that data into the function
> message({ "NAME" : "Jed" });
"His name is Jed."

// NOTE:: it's best to try and stick to keys that would be natively
//        tolerant in your JS runtimes (think valid JS variable names).

SelectFormat

SelectFormat is a lot like a switch statement for your messages. Most often it's used to select gender in a string. Here's an example:

// Insantiate an instance with your language settings
var mf = new MesssageFormat('en');
// Compile a message - returns a function
var message = mf.compile('{GENDER, select, male{He} female{She} other{They}} liked this.');

// Run your message function with your data
> message({"GENDER" : "male"});
"He liked this."

> message({"GENDER" : "female"});
"She liked this."

// The 'other' key is **required** and in the case of GENDER
// it should be phrased as if you are too far away to tell the gender of the subject.
> message({});
"They liked this."

PluralFormat

PluralFormat is a similar mechanism to SelectFormat (especially syntax wise), but it's specific to numbers, and the key that is chosen is generated by a Plural Function.

// Insantiate a new MessageFormat object
var mf = new MessageFormat('en');

// You can use the provided locales in the `/locale` folder
// (include the file directly after including messageformat.js
var mf = new MessageFormat( 'sl' );

// OR - you can pass a custom plural function to the MessageFormat constructor function.
var mf = new Message( 'requiredCustomName', function (n) {
  if ( n === 42 ) {
    return 'many';
  }
  return 'other';
});

// Then the numbers that are passed into a compiled message will run through this function to select
// the keys. This is for the 'en' locale:
var message = mf.compile('There {NUM_RESULTS, plural, one{is one result} other{are # results}}.');

// Then the data causes the function to output:

> message({"NUM_RESULTS" : 0});
"There are 0 results."

> message({"NUM_RESULTS" : 1});
"There is one result."

> message({"NUM_RESULTS" : 100});
"There are 100 results."

Named Keys

ICU declares the 6 named keys that CLDR defines for their plural form data. Those are:

  • zero
  • one
  • two
  • few
  • many
  • other (required)

All of them are fairly straight-forward, but do remember, that for some languages, they are more loose "guidelines" than they are exact.

The only required key is other. Your compilation will throw an error if you forget this. In english, and many other languages, the logic is simple:

If N equals 1, then ONE, otherwise OTHER

Other languages (take a peak at ar.js or sl.js) can get much more complicated.

Remember. English only uses one and other - so including zero will never get called, even when the number is 0

The most simple (to pluralize) languages have no pluralization rules an rely solely on the other named key.

{NUM, plural,
  zero  {There are zero - in a lang that needs it.}
  one   {There is one - in a lang that has it.}
  two   {There is two - in a lang that has it.}
  few   {There are a few - in a lang that has it.}
  many  {There are many - in a lang that has it.}
  other {There is a different amount than all the other stuff above.}
}

Literal Numeric Keys

There also exists the capability to put literal numbers as keys in a select statement. These are delimited by prefixing them with the = character. These will match single, specific numbers. If there is a match, that branch will immediately run, and the corresponding named key will not also run.

There are plenty of legitimate uses for this, especially when considering base cases and more pleasant language. But if you're a Douglas Adams fan, might use it like so:

You have {NUM_TASKS, plural,
            one {one task}
            other {# tasks}
            =42 {the answer to the life, the universe and everything tasks}
         } remaining.

When NUM_TASKS is 42, this outputs smiles. Remember, these have priority over the named keys.

PluralFormat - offset extension

ICU provided the ability to extend existing select and plural functionality, and the only official extension (that I could find) is the offset extension.

It goes after the plural declaration, and is used to generate sentences that break up a number into multiple sections. For instance:

You and 4 others added this to their profiles.

In this case, the total number of people who added 'this' to their profiles is actually 5. We can use the offset extension to help us with this.

var mf = new MessageFormat('en');

// For simplicity's sake, let's assume the base case here isn't silly.
// The test suite has a bigger offset example at the bottom
// Let's also assume neutral gender for the same reason

// Set the offset to 1
var message = mf.compile(
  'You {NUM_ADDS, plural, offset:1' +
          '=0{didnt add this to your profile}' + // Number literals, with a `=` do **NOT** use
          'zero{added this to your profile}' +   //   the offset value
          'one{and one other person added this to their profile}' +
          'other{and # others added this to their profiles}' +
      '}.'
);

// Tip: I like to consider the `=` prefixed number literals as more of an "inductive step"
// e.g. in this case, since (0 - 1) is _negative_ 1, we want to handle that base case.

> message({"NUM_ADDS" : 0 });
"You didnt add this to your profile."

> message({"NUM_ADDS" : 1 });
"You added this to your profile."

> message({"NUM_ADDS" : 2 });
"You and one other person added this to their profile."

> message({"NUM_ADDS" : 3 });
"You and 3 others added this to their profile."

Nesting

Very simply, you can nest both SelectFormat blocks into PluralFormat blocks, and visa-versa, as deeply as you'd like. Simply start the new block directly inside:

{SEL1, select,
  other {{PLUR1, plural,
          one {1}
          other {{SEL2, select,
                  other {deep in the heart.}
                }}
        }}
}

Escaping

messageformat.js tries to a good job of being tolerant of as much as possible, but some characters, like the ones used the actual MessageFormat spec itself, must be escaped to be a part of your string.

For {, } and # (only inside of a select value) literals, just escape them with a backslash. (If you are in a JS string, you'll need to escape the escape backslash so it'll look like two).

// Technically, it's just:

\{\}\#

// But in practice, since you're often dealing with string literals, it looks more like

var msg = mf.compile("\\{ {S, select, other{# is a \\#}} \\}");

> msg({S:5});
"{ 5 is a # }"

Why not Gettext?

Gettext can generally go only one level deep without hitting some serious roadblocks. For example, two plural elements in a sentence, or the combination of gender and plurals.

This would be prohibitively difficult with Gettext

He found 5 results in 2 categories.

She found 1 result in 1 category.

He found 2 results in 1 category.

It can likely be done with contexts/domains for gender and some extra plural forms work to pick contexts for the plurals, but it's less than ideal. Not to mention every translation must be completed in its entirety for every combination. That stinks too.

You can easily mix Gettext and MessageFormat by storing MessageFormat strings in your .po files. However, I would stop using the built in plural functions of Gettext.

I tend to only use Gettext on projects that are already using it in other languages, so we can share translations, otherwise, I like to live on the wild-side and use PluralFormat and SelectFormat.

Most Gettext tools will look up the Plural Forms for a given locale for you. This is also the opinion of PluralFormat. The library should just contain the known plural forms of every locale, and not force translators to reinput this information each time.

Version

0.1.8

TODO

  • Create a tool to help translators understand and use this format.
  • Template integration - I specifically want to make a build time handlebars.js plugin to build this logic into the template builds.

License

You may use this software under the Apache License, version 2.0.

You may contribute to this software under the Dojo CLA - http://dojofoundation.org/about/cla

Author

Major Contributors

Credits

Thanks to:

  • Bazaarvoice - my employer - for letting me do cool stuff like this.
  • Google has an implementation that is similar in Google Closure, I tried to vet my code against many of their tests.
  • Norbert Lindenberg for showing me how good it can be.

About

ICU MessageFormat for Javascript - i18n Plural and Gender Capable Messages

Resources

License

Stars

Watchers

Forks

Packages

No packages published

Languages

  • JavaScript 100.0%