git/utf8.c, branch v2.2.0

Merge branch 'rs/export-strbuf-addchars'

2014-09-19T18:38:39Z

Code clean-up.

* rs/export-strbuf-addchars:
  strbuf: use strbuf_addchars() for adding a char multiple times
  strbuf: export strbuf_addchars()

Merge branch 'nd/strbuf-utf8-replace'

2014-09-09T19:54:02Z

* nd/strbuf-utf8-replace:
  utf8.c: fix strbuf_utf8_replace() consuming data beyond input string

strbuf: export strbuf_addchars()

2014-09-08T18:26:45Z

Move strbuf_addchars() to strbuf.c, where it belongs, and make it
available for other callers.

Signed-off-by: Rene Scharfe
Signed-off-by: Junio C Hamano
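A minimal sketch of what a helper like strbuf_addchars() does: grow the buffer, fill the new space with one repeated byte, and keep the NUL terminator in place. The `mini_buf` type and both function names below are hypothetical stand-ins so the example compiles on its own; they are not git's strbuf API.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for a growable string buffer. */
struct mini_buf {
	size_t alloc;	/* bytes allocated for buf */
	size_t len;	/* bytes in use, not counting the trailing NUL */
	char *buf;
};

/* Make room for 'extra' more bytes plus the trailing NUL. */
static void mini_buf_grow(struct mini_buf *sb, size_t extra)
{
	size_t need = sb->len + extra + 1;

	if (need > sb->alloc) {
		char *p = realloc(sb->buf, need);
		if (!p)
			abort();
		sb->buf = p;
		sb->alloc = need;
	}
}

/* Append the byte 'c' to the buffer 'n' times. */
void mini_buf_addchars(struct mini_buf *sb, int c, size_t n)
{
	mini_buf_grow(sb, n);
	memset(sb->buf + sb->len, c, n);
	sb->len += n;
	sb->buf[sb->len] = '\0';
}
```

A caller that pads output with spaces or dashes can then use one call instead of a loop that appends a single character at a time.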

utf8.c: fix strbuf_utf8_replace() consuming data beyond input string

2014-08-11T18:52:22Z

The main loop in strbuf_utf8_replace() could be summed up as:

  while ('src' is still valid) {
    1) advance 'src' to copy ANSI escape sequences
    2) advance 'src' to copy/replace visible characters
  }

The problem is that after #1, 'src' may have reached the end of the
string (so 'src' points to NUL) and #2 will continue to copy that NUL
as if it were a normal character. Because the output is stored in a
strbuf, this NUL is counted in the 'len' field as well. Check after #1
and break the loop if necessary.

The test does not look obvious, but the combination of %>>() should
make a call trace like this

  show_log()
  pretty_print_commit()
  format_commit_message()
  strbuf_expand()
  format_commit_item()
  format_and_pad_commit()
  strbuf_utf8_replace()

where %C(auto)%d would insert a color reset escape sequence at the end
of the string given to strbuf_utf8_replace(), and show_log() uses
fwrite() to send everything to stdout (including the incorrect NUL
inserted by strbuf_utf8_replace()).

Signed-off-by: Nguyễn Thái Ngọc Duy
Signed-off-by: Junio C Hamano
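The two-step loop and the added end-of-input check can be sketched in a simplified, ASCII-only form. This is not the code from utf8.c: `esc_sequence_len()`, `copy_visible()` and the `check_end` switch are hypothetical names for illustration, and the replace logic of step 2 is reduced to a plain one-byte copy. Passing `check_end = 0` reproduces the old behaviour so the two can be compared.

```c
#include <stddef.h>
#include <string.h>

/* Length of an ANSI color sequence "ESC [ digits/; m" at s, or 0 if none. */
static size_t esc_sequence_len(const char *s)
{
	const char *p = s;

	if (*p++ != '\033')
		return 0;
	if (*p++ != '[')
		return 0;
	while ((*p >= '0' && *p <= '9') || *p == ';')
		p++;
	if (*p++ != 'm')
		return 0;
	return p - s;
}

/*
 * Copy 'len' bytes of the NUL-terminated 'src' to 'dst' and return the
 * number of bytes written. 'dst' must have room for len + 1 bytes.
 */
size_t copy_visible(const char *src, size_t len, char *dst, int check_end)
{
	const char *end = src + len;
	char *out = dst;

	while (src < end) {
		size_t n;

		/* step 1: copy escape sequences verbatim */
		while ((n = esc_sequence_len(src)) != 0) {
			memcpy(out, src, n);
			src += n;
			out += n;
		}

		/* the fix: step 1 may have consumed the rest of the input */
		if (check_end && src >= end)
			break;

		/* step 2: copy one visible character (one byte in this sketch) */
		*out++ = *src++;
	}
	return out - dst;
}
```

For an input that ends in a color reset, such as the five bytes of "ab\033[m", the unchecked loop writes a sixth byte (the terminating NUL), while the checked loop stops at five.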

Merge branch 'tb/unicode-6.3-zero-width'

2014-06-06T18:29:38Z

Update the logic to compute the display width needed for utf8 strings
and allow us to more easily maintain the tables used in that logic.

We may want to let the users choose if codepoints with ambiguous
widths are treated as a double or single width in a follow-up patch.

* tb/unicode-6.3-zero-width:
  utf8: make it easier to auto-update git_wcwidth()
  utf8.c: use a table for double_width

utf8: make it easier to auto-update git_wcwidth()

2014-05-12T17:38:01Z

The function git_wcwidth() returns, for a given Unicode code point, the
width on the display:

  -1 for control characters,
   0 for combining or other non-visible code points,
   1 for e.g. ASCII,
   2 for double-width code points.

This table had originally been extracted for one Unicode version,
probably 3.2.

We now use two tables, one for zero-width and another for
double-width.

Make it easier to update these tables to a later version of Unicode by
factoring out the table from utf8.c into unicode_width.h and adding the
script update_unicode.sh to update the table based on the latest
Unicode specification files.

Thanks to Peter Krefting and Kevin Bracey for helping with their
Unicode knowledge.

Signed-off-by: Torsten Bögershausen
Signed-off-by: Junio C Hamano

utf8.c: use a table for double_width

2014-05-12T17:20:46Z

Refactor git_wcwidth() and replace the if-else-if chain.
Use the table double_width which is scanned by the bisearch() function,
which is already used to find combining code points.

Signed-off-by: Torsten Bögershausen
Signed-off-by: Junio C Hamano
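A table-driven width lookup of this kind can be sketched as a binary search over sorted, non-overlapping intervals. The function name `sketch_wcwidth()` is hypothetical, and the two tables hold only a few illustrative ranges; they are not the tables shipped with git. The `ucs_char_t` typedef is assumed here to be a plain unsigned int.

```c
#include <stddef.h>

typedef unsigned int ucs_char_t;

struct interval {
	ucs_char_t first;
	ucs_char_t last;
};

/* Illustrative excerpts only; both tables must stay sorted. */
static const struct interval zero_width[] = {
	{ 0x0300, 0x036F },	/* combining diacritical marks */
	{ 0x200B, 0x200F },	/* zero width space and friends */
};

static const struct interval double_width[] = {
	{ 0x1100, 0x115F },	/* Hangul Jamo initial consonants */
	{ 0xAC00, 0xD7A3 },	/* Hangul syllables */
};

/* Return 1 if ucs falls in one of table[0..max], else 0. */
static int bisearch(ucs_char_t ucs, const struct interval *table, int max)
{
	int min = 0;

	if (ucs < table[0].first || ucs > table[max].last)
		return 0;
	while (max >= min) {
		int mid = min + (max - min) / 2;

		if (ucs > table[mid].last)
			min = mid + 1;
		else if (ucs < table[mid].first)
			max = mid - 1;
		else
			return 1;
	}
	return 0;
}

/* -1 for control characters, 0 for zero-width, 2 for double-width, else 1. */
int sketch_wcwidth(ucs_char_t ch)
{
	if (ch == 0)
		return 0;
	if (ch < 32 || (ch >= 0x7f && ch < 0xa0))
		return -1;
	if (bisearch(ch, zero_width,
		     sizeof(zero_width) / sizeof(zero_width[0]) - 1))
		return 0;
	if (bisearch(ch, double_width,
		     sizeof(double_width) / sizeof(double_width[0]) - 1))
		return 2;
	return 1;
}
```

Keeping both classes in tables means a Unicode update only changes data, not control flow, which is what makes generating the tables from a script practical.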

Merge branch 'tb/unicode-6.3-zero-width'

2014-04-16T20:38:57Z

Teach our display-column-counting logic about decomposed umlauts and
friends.

* tb/unicode-6.3-zero-width:
  utf8.c: partially update to version 6.3

utf8.c: partially update to version 6.3

2014-04-09T17:14:05Z

Unicode 6.3 defines more code points as combining or accents. For
example, the character "ö" could be expressed as an "o" followed by
U+0308 COMBINING DIAERESIS (aka umlaut, double-dot-above). We should
consider that such a sequence of two codepoints occupies one display
column for alignment purposes, and for that, git_wcwidth() should
return 0 for them. Affected codepoints are:

  U+0358..U+035C
  U+0487
  U+05A2, U+05BA, U+05C5, U+05C7
  U+0604, U+0616..U+061A, U+0659..U+065F

Earlier Unicode standards had defined these as "reserved".

Only the range 0..U+07FF has been checked to see which codepoints need
to be marked as 0-width while preparing for this commit; more updates
may be needed.

Signed-off-by: Torsten Bögershausen
Signed-off-by: Junio C Hamano
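The effect on column counting can be illustrated with a small sketch. It assumes a deliberately limited setup: `decode_utf8()` and `display_columns()` are hypothetical helpers, only one- and two-byte UTF-8 sequences are decoded, only the block U+0300..U+036F is treated as zero-width, and every other code point is counted as one column.

```c
#include <stddef.h>

/* Decode one code point (1- or 2-byte UTF-8 only) and advance *p. */
static unsigned int decode_utf8(const unsigned char **p)
{
	const unsigned char *s = *p;
	unsigned int ch;

	if (s[0] < 0x80) {
		ch = s[0];
		*p = s + 1;
	} else if ((s[0] & 0xe0) == 0xc0 && (s[1] & 0xc0) == 0x80) {
		ch = ((s[0] & 0x1fu) << 6) | (s[1] & 0x3fu);
		*p = s + 2;
	} else {
		ch = 0xfffd;	/* unsupported or malformed: replacement char */
		*p = s + 1;
	}
	return ch;
}

/* Zero-width test for this sketch: combining diacritical marks only. */
static int is_zero_width(unsigned int ch)
{
	return ch >= 0x0300 && ch <= 0x036f;
}

/* Count display columns of a NUL-terminated UTF-8 string. */
int display_columns(const char *str)
{
	const unsigned char *p = (const unsigned char *)str;
	int columns = 0;

	while (*p) {
		unsigned int ch = decode_utf8(&p);

		if (!is_zero_width(ch))
			columns++;
	}
	return columns;
}
```

With this, the decomposed form "o" followed by U+0308 (bytes 0x6F 0xCC 0x88) and the precomposed "ö" (bytes 0xC3 0xB6) both count as one column, so aligned output stays aligned whichever form the text uses.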

utf8: use correct type for values in interval table

2014-02-18T23:51:40Z

We treat these as unsigned everywhere and compare against unsigned
values, so declare them using the typedef we already have for this.

While we're here, fix the indentation as well.

Signed-off-by: John Keeping
Signed-off-by: Junio C Hamano