ICU4R is an attempt to provide better Unicode support for Ruby, which has lacked it for a long time.
The current code is mostly a rewrite of string.c from Ruby 1.8.3.
ICU4R is a Ruby C-extension binding for the ICU library[1] and provides the following classes and functionality:
- String-like class with internal UTF16 storage;
- UCA rules for UString comparisons (<=>, casecmp);
- encoding(codepage) conversion;
- Unicode normalization;
- transliteration, also rule-based;

Bunch of locale-sensitive functions:
- upcase/downcase;
- string collation;
- string search;
- iterators over text line/word/char/sentence breaks;
- message formatting (number/currency/string/time);
- date and number parsing.
> ruby extconf.rb
> make && make check
> make install
Now, in your scripts, just require 'icu4r'.
To create RDoc, run
> sh tools/doc.sh
To build and use ICU4R you will need GCC and the ICU v3.4 libraries[2].
capitalize, capitalize!, swapcase, swapcase!
%, center, ljust, rjust
chomp, chomp!, chop, chop!
count, delete, delete!, squeeze, squeeze!, tr, tr!, tr_s, tr_s!
crypt, intern, sum, unpack
dump, each_byte, each_line
hex, oct, to_i, to_sym
reverse, reverse!
succ, succ!, next, next!, upto
UString uses the ICU regexp library. The pattern syntax is described in [./docs/UNICODE_REGEXPS] and in the ICU docs.
There are some differences between processing in Ruby Regexp and URegexp:
"test".u.gsub(ure("(e)(.)")) do |match|
  puts match[0] # => 'es' <--> $&
  puts match[1] # => 'e' <--> $1
  puts match[2] # => 's' <--> $2
end
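For comparison, here is a minimal sketch of the plain Ruby side of the same substitution, without ICU4R. With Ruby's own String#gsub, the block receives the matched text as a String, and the groups are read from $1 and $2 instead of being indexed on the block argument. The names seen and swapped are illustrative only.

```ruby
# Plain Ruby String#gsub, shown only for contrast with the URegexp block form.
seen = []
swapped = "test".gsub(/(e)(.)/) do |match|
  # match is the whole matched text (same as $&); groups come from $1 and $2.
  seen << [match, $1, $2]
  "#{$2}#{$1}"   # put the two captured characters back in reverse order
end

puts swapped
```

The block's return value replaces the match, so the two captured characters come back swapped.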
NOTE: URegexp treats as a digit not only the ASCII digits (0x0030-0x0039), but any Unicode character that has the property Decimal digit number (Nd), e.g.:
a = [?$, 0x1D7D9].pack("U*").u * 2
puts a.inspect_names
<U000024>DOLLAR SIGN
<U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
<U000024>DOLLAR SIGN
<U01D7D9>MATHEMATICAL DOUBLE-STRUCK DIGIT ONE
puts "abracadabra".u.gsub(/(b)/.U, a)
abbracadabbra
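To see why this matters, it may help to compare with Ruby's own Regexp. The sketch below is plain Ruby and does not use ICU4R. It assumes Ruby 1.9 or later, where \d matches only ASCII digits and the Nd property has to be requested explicitly with \p{Nd}. The variable names are illustrative.

```ruby
# Plain Ruby Regexp, shown only for contrast with URegexp digit handling.
double_struck_one = [0x1D7D9].pack("U*")  # MATHEMATICAL DOUBLE-STRUCK DIGIT ONE

ascii_digit_match = double_struck_one =~ /\d/       # \d covers only 0-9 here
nd_property_match = double_struck_one =~ /\p{Nd}/   # Nd covers any decimal digit

p ascii_digit_match
p nd_property_match
```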
/pattern/.U =~ "string".u
t = "text\ntest"
# ^,$ handling : URegexp multiline <-> Ruby default
t.u =~ ure('^\w+$', URegexp::MULTILINE)
=> #<UMatch:0xf6f7de04 @ranges=[0..3], @cg=[\u0074\u0065\u0078\u0074]>
t =~ /^\w+$/
=> 0
# . matches \n : URegexp DOTALL <-> /m
t.u =~ ure('.+test', URegexp::DOTALL)
=> #<UMatch:0xf6fa4d88 ...
t.u =~ /.+test/m
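The Ruby half of this comparison can be checked without ICU4R. The sketch below uses only Ruby's built-in Regexp: ^ and $ match at line boundaries by default, \A and \z anchor to the whole string, and . matches a newline only under /m. The variable names are illustrative.

```ruby
# Plain Ruby Regexp defaults, shown only for contrast with the URegexp flags.
t = "text\ntest"

line_anchored   = t =~ /^\w+$/     # ^ and $ work per line, so "text" matches
string_anchored = t =~ /\A\w+\z/   # \w cannot cross the newline, so no match
dot_default     = t =~ /.+test/    # . stops at the newline, so no match
dot_multiline   = t =~ /.+test/m   # /m lets . match the newline

p [line_anchored, string_anchored, dot_default, dot_multiline]
```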
The code is still slow, inefficient, and highly experimental, so it may have security holes, memory leaks, bugs, inconsistent documentation, and an incomplete test suite. Use it at your own risk.
Bug reports and feature requests are welcome :)
This extension module is copyrighted free software by Nikolai Lugovoi.
You can redistribute it and/or modify it under the terms of the MIT License.
Nikolai Lugovoi <meadow.nnick@gmail.com>