NAME
    Unicode::Towctrans - Generate small case mapping tables

SYNOPSIS
        gen_wctrans
        gen_wctrans --safec
        gen_wctrans --musl
        gen_wctrans -n     # no network for default -v
        gen_wctrans -v 10
        gen_wctrans -v 10 --ud UnicodeData.txt.10 --out towctrans-10.h
        gen_wctrans --lower16
        gen_wctrans --fn __towcase
        gen_wctrans --min-excl 10000
        gen_wctrans --unroll 6
        gen_wctrans --bits 18:14:10
        gen_wctrans --bsearch
        gen_wctrans --bsearch-both
        gen_wctrans --if-tree --bsearch
        gen_wctrans --if-tree --bsearch-both
        gen_wctrans --table

DESCRIPTION
    gen_wctrans generates a towctrans.h header file with small and efficient
    case mapping tables. "musl" and "safeclib" use it to build the libc
    towupper() and towlower() functions and their secure variants
    towupper_s() and towlower_s().
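    As a reminder of the interface the generated tables sit behind, here is
    a minimal sketch that calls the standard towupper() and towlower() from
    <wctype.h>. The helper name check_ascii_roundtrip is hypothetical, and
    only ASCII is checked, since results for other characters depend on the
    libc and locale in use.

```c
#include <assert.h>
#include <wctype.h>

/* Hypothetical helper: round-trip ASCII through the libc entry points. */
static void check_ascii_roundtrip(void)
{
    assert(towupper(L'a') == L'A');   /* lowercase maps up */
    assert(towlower(L'A') == L'a');   /* uppercase maps down */
    assert(towupper(L'1') == L'1');   /* caseless input is returned as-is */
}
```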

    If the code may run on a system with a Turkish or Azeri locale, you
    need to compile with "-DHAVE_LOCALE_TR" so that the locale and the
    special Turkish i mappings are checked at run-time.
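    The Turkish and Azeri rule pairs dotted i with U+0130 and dotless
    U+0131 with I. The sketch below only illustrates that rule. The
    functions tr_towupper and tr_towlower and the is_turkic flag are
    hypothetical, and how the generated code detects the locale is not
    shown here.

```c
#include <assert.h>
#include <wctype.h>

/* Hypothetical wrapper: apply the Turkic i rules before the generic map. */
static wint_t tr_towupper(wint_t wc, int is_turkic)
{
    if (is_turkic) {
        if (wc == 0x0069) return 0x0130; /* i -> I WITH DOT ABOVE */
        if (wc == 0x0131) return 0x0049; /* DOTLESS I -> I */
    }
    return towupper(wc);
}

static wint_t tr_towlower(wint_t wc, int is_turkic)
{
    if (is_turkic) {
        if (wc == 0x0049) return 0x0131; /* I -> DOTLESS I */
        if (wc == 0x0130) return 0x0069; /* I WITH DOT ABOVE -> i */
    }
    return towlower(wc);
}

/* Usage check: same input, different result depending on the flag. */
static void check_turkic_i(void)
{
    assert(tr_towupper(0x0069, 1) == 0x0130);
    assert(tr_towupper(0x0069, 0) == 0x0049);
    assert(tr_towlower(0x0049, 1) == 0x0131);
    assert(tr_towlower(0x0049, 0) == 0x0069);
}
```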

    If you know that your iswalpha() works correctly (only with musl), then
    use "--with_iswalpha" to get a slightly faster function, e.g. for
    benchmarking.

    With "--lower16" it creates more and larger "casemaps" tables, and fewer
    long "casemapl" tables. Thus it finds those ranges earlier, at the cost
    of more cache misses. For "--bits" the fastest are 18:14:10 and
    12:12:8, the smallest is the default 16:8:8.

    With "--bsearch" the tolower check is done with a binary search, the
    toupper check does a linear search without early exit. It needs more
    space, and its performance is not as good as with "--lower16".

    With "--bsearch-both" it is faster but even bigger, as we also have to
    store the order of the upper maps and pairs to be able to binary search
    them.
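    The binary search idea can be sketched over a sorted table of ranges.
    This is not the generated code: struct case_range, lower_ranges and
    bsearch_tolower are hypothetical names, and the table holds only four
    well-known uppercase ranges whose lowercase forms lie 32 codepoints
    higher.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical range record: [first, last] maps by adding delta. */
struct case_range { unsigned first, last; int delta; };

/* Must stay sorted by 'first' for the binary search to work. */
static const struct case_range lower_ranges[] = {
    { 0x0041, 0x005A, 32 },  /* A..Z */
    { 0x00C0, 0x00D6, 32 },  /* Latin-1 A-grave..O-diaeresis */
    { 0x0391, 0x03A1, 32 },  /* Greek Alpha..Rho */
    { 0x0410, 0x042F, 32 },  /* Cyrillic A..Ya */
};

static unsigned bsearch_tolower(unsigned wc)
{
    size_t lo = 0;
    size_t hi = sizeof lower_ranges / sizeof lower_ranges[0];

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (wc < lower_ranges[mid].first)
            hi = mid;                 /* continue in the left half */
        else if (wc > lower_ranges[mid].last)
            lo = mid + 1;             /* continue in the right half */
        else
            return (unsigned)((int)wc + lower_ranges[mid].delta);
    }
    return wc;                        /* no range matched: unchanged */
}

static void check_bsearch_tolower(void)
{
    assert(bsearch_tolower(0x0041) == 0x0061);
    assert(bsearch_tolower(0x0391) == 0x03B1);
    assert(bsearch_tolower(0x00D7) == 0x00D7); /* multiplication sign */
}
```

    A real table has far more entries, which is where the logarithmic
    lookup pays off over a linear scan.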

    With "--table", the musl-new style, the size is much bigger, as we have
    to store mappings for all blocks. The lookup is much faster though.
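    A block table lookup can be sketched as follows. This is an
    illustration under assumptions, not the musl or generated layout: the
    256-codepoint block size, the names latin_block, block_table,
    table_init and table_tolower, and the partial Latin-1 data are all
    hypothetical.

```c
#include <assert.h>

enum { BLOCK_BITS = 8, BLOCK_SIZE = 1 << BLOCK_BITS, NUM_BLOCKS = 4 };

static int latin_block[BLOCK_SIZE];         /* deltas for U+0000..U+00FF */
static const int *block_table[NUM_BLOCKS];  /* null = block has no case */

/* Fill the sketch data at run-time; a generator would emit it statically. */
static void table_init(void)
{
    unsigned c;
    for (c = 0x41; c <= 0x5A; c++) latin_block[c] = 32; /* A..Z */
    for (c = 0xC0; c <= 0xD6; c++) latin_block[c] = 32; /* part of Latin-1 */
    block_table[0] = latin_block;
}

/* Two indexed loads and no search: high bits pick the block, low bits
   pick the delta inside it. */
static unsigned table_tolower(unsigned wc)
{
    unsigned block = wc >> BLOCK_BITS;
    if (block >= NUM_BLOCKS || !block_table[block])
        return wc;
    return wc + (unsigned)block_table[block][wc & (BLOCK_SIZE - 1)];
}
```

    The cost is visible in the data: every block with any cased character
    needs a full array, even if most of its entries are zero.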

    With "--if-tree" and "--bsearch" the tolower check is done with a
    binary search inlined as a ternary tree, the toupper check does a
    binary search.

    With "--if-tree" and "--bsearch-both" both lower and upper checks are
    done with a binary search inlined as a ternary tree. It trades data for
    more code. It is the fastest of the non-table variants, but also the
    biggest.
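    Inlining the search means the range bounds become constants in nested
    comparisons instead of table entries. A hand-written sketch over four
    uppercase ranges, with the hypothetical name iftree_tolower, could look
    like this. The generated tree covers many more ranges and may be shaped
    differently.

```c
#include <assert.h>

/* Hypothetical inlined search: each comparison halves the candidates. */
static unsigned iftree_tolower(unsigned wc)
{
    if (wc < 0x0391) {
        if (wc < 0x00C0) {
            if (wc >= 0x0041 && wc <= 0x005A)
                return wc + 32;       /* A..Z */
        } else if (wc <= 0x00D6) {
            return wc + 32;           /* Latin-1 A-grave..O-diaeresis */
        }
    } else {
        if (wc <= 0x03A1)
            return wc + 32;           /* Greek Alpha..Rho */
        if (wc >= 0x0410 && wc <= 0x042F)
            return wc + 32;           /* Cyrillic A..Ya */
    }
    return wc;                        /* not in any range: unchanged */
}

static void check_iftree_tolower(void)
{
    assert(iftree_tolower(0x0041) == 0x0061);
    assert(iftree_tolower(0x03A1) == 0x03C1);
    assert(iftree_tolower(0x00D7) == 0x00D7);
}
```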

    More tuning options are "--min-excl" and "--unroll". "--min-excl" gives
    a threshold for the size for the very first exclusion checks. The range
    must be larger than the given threshold. Default is 2500. "--unroll"
    sets the maximum array size for its loops to be unrolled and inlined.
    Default is 5.
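    An exclusion check rejects codepoints in large caseless ranges before
    any table is consulted. The sketch below assumes two such ranges, CJK
    Unified Ideographs and Hangul Syllables, both well above the default
    threshold of 2500. The function name is_excluded is hypothetical and
    the ranges actually emitted by gen_wctrans may differ.

```c
#include <assert.h>

/* Hypothetical early-out: nonzero if wc lies in a known caseless range. */
static int is_excluded(unsigned wc)
{
    return (wc >= 0x4E00 && wc <= 0x9FFF)    /* CJK Unified Ideographs */
        || (wc >= 0xAC00 && wc <= 0xD7A3);   /* Hangul Syllables */
}

static void check_is_excluded(void)
{
    assert(is_excluded(0x4E2D));     /* inside the CJK range */
    assert(is_excluded(0xAC00));     /* first Hangul syllable */
    assert(!is_excluded(0x0041));    /* 'A' still needs the real lookup */
}
```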

    "-v" sets the UnicodeData version to use or download. "-n" avoids the
    network and sets the default UnicodeData version to the UCD version
    from perl (which is usually older than the version from
    https://www.unicode.org/versions/latest/). "--ud" sets the name of the
    UnicodeData.txt file to use, default: UnicodeData.txt. "--out" sets the
    output filename, default: towctrans.h

    Support for the multi-character folding tables for wcsfc_s() in
    safeclib is also planned. The single-character "towupper" and
    "towlower" conversions are meaningless for many multi-character Unicode
    mappings, those with status F (full folding). Use a full string
    foldcasing function instead, such as safeclib "wcsfc_s", ICU
    "u_strToUpper" or libunistring "uc_toupper".
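    To see why a one-in, one-out function cannot express full folding,
    consider U+00DF, which folds to the two characters "ss". The sketch
    below is not the planned wcsfc_s() table. The function sketch_fullfold
    is hypothetical, it knows only two full foldings, and it falls back to
    towlower() as a stand-in for simple folding.

```c
#include <assert.h>
#include <stddef.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical: write the folding of wc to out and return its length. */
static size_t sketch_fullfold(wchar_t wc, wchar_t out[2])
{
    switch (wc) {
    case 0x00DF:                      /* sharp s folds to "ss" */
        out[0] = L's'; out[1] = L's';
        return 2;
    case 0xFB00:                      /* ff ligature folds to "ff" */
        out[0] = L'f'; out[1] = L'f';
        return 2;
    default:                          /* stand-in for simple folding */
        out[0] = (wchar_t)towlower((wint_t)wc);
        return 1;
    }
}

static void check_fullfold(void)
{
    wchar_t out[2];
    assert(sketch_fullfold(0x00DF, out) == 2);
    assert(out[0] == L's' && out[1] == L's');
    assert(sketch_fullfold(L'A', out) == 1 && out[0] == L'a');
}
```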

PERFORMANCE
    Currently it is small and fast enough compared to the other
    implementations, and in particular correct compared to glibc, which
    ignores characters from other locales.

    The bench uses Unicode 10.0 data ("-v 10") so that our tables match the
    Unicode version compiled into musl-old. Benchmark errors fall into the
    following categories, none of which are bugs in our code:

    Circled letters 0x24B6-0x24E9 (affects musl-old, 52 diffs)
        Our code correctly maps these per UnicodeData.txt (e.g.
        "towupper(0x24D0)=0x24B6"). musl-old does not map them at all.

    Georgian Mtavruli 0x1C90-0x1CBF (affects musl-new, 96 diffs)
        These uppercase Georgian letters were added in Unicode 11.0.
        musl-new includes them, but our Unicode 10.0 bench tables do not, so
        musl-new reports differences for every Mtavruli codepoint.

    Post-Unicode-10.0 additions (affects musl-new, 16+ diffs)
        Additional cased characters introduced after Unicode 10.0 (Osage,
        Adlam, etc.) are present in musl-new but absent from our Unicode
        10.0 tables.

    glibc errors
        glibc errors are caused by glibc ignoring cased characters from
        non-latin locales entirely.
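    The circled-letter case above is easy to verify by hand: the 26
    uppercase letters occupy 0x24B6-0x24CF, the 26 lowercase letters
    0x24D0-0x24E9, so each pair is 26 codepoints apart, which accounts for
    the 52 diffs. A minimal sketch with the hypothetical names
    circled_toupper and circled_tolower:

```c
#include <assert.h>

/* Hypothetical: map only the circled Latin letters, pass others through. */
static unsigned circled_toupper(unsigned wc)
{
    return (wc >= 0x24D0 && wc <= 0x24E9) ? wc - 26 : wc;
}

static unsigned circled_tolower(unsigned wc)
{
    return (wc >= 0x24B6 && wc <= 0x24CF) ? wc + 26 : wc;
}

static void check_circled(void)
{
    assert(circled_toupper(0x24D0) == 0x24B6); /* the example cited above */
    assert(circled_tolower(0x24B6) == 0x24D0);
    assert(circled_toupper(0x24B6) == 0x24B6); /* already uppercase */
}
```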

        make -C examples
        ./bench
                    my:        592 [us]  100.00 %
               my_excl:        451 [us]  131.26 %
              my_low16:        636 [us]   93.08 %
               my_bits:        569 [us]  104.04 %
            my_bsearch:        408 [us]  145.10 %
           my_bsearchb:        464 [us]  127.59 %
             my_unroll:        418 [us]  141.63 %
             my_iftree:        361 [us]  163.99 %   42 errors
            my_iftreeb:        351 [us]  168.66 %   86 errors
              my_table:         99 [us]  597.98 %
              musl-new:        100 [us]  592.00 %   9 errors
              musl-old:        868 [us]   68.20 %   3 errors
                 glibc:         98 [us]  604.08 %   15 errors

        wc -c towctrans-*.o
          3528 towctrans-my.o
          3608 towctrans-myexcl.o
          3632 towctrans-mylow16.o
          3920 towctrans-mybits.o
          3968 towctrans-mybsearch.o
          4864 towctrans-mybsearch-both.o
          3944 towctrans-myunroll.o
          8296 towctrans-myiftree.o
         10824 towctrans-myiftree-both.o
          6816 towctrans-mytable.o
          6848 towctrans-musl-new.o
          3464 towctrans-musl-old.o
         97440 towctrans-glibc.o
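    The percentage column in the bench output is consistent with the "my"
    time divided by the variant's time, so values above 100 % are faster
    than the default. This reading is inferred from the numbers, not taken
    from the bench source, and the function name relative_speed is
    hypothetical.

```c
#include <assert.h>

/* Hypothetical: speed relative to the baseline, in percent. */
static double relative_speed(double baseline_us, double variant_us)
{
    return 100.0 * baseline_us / variant_us;
}

static void check_relative_speed(void)
{
    double excl = relative_speed(592.0, 451.0);  /* my_excl row */
    double old  = relative_speed(592.0, 868.0);  /* musl-old row */
    assert(excl > 131.25 && excl < 131.27);      /* listed as 131.26 % */
    assert(old > 68.19 && old < 68.21);          /* listed as 68.20 % */
}
```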

    Results with more "--bits" size combinations follow. The variants
    reporting 5 errors just need some logical fixups.

    "--bits 16:10:8", "--bits 12:12:8" and others look promising, the best
    being twice as fast as the default.

         ./bench-bits.sh
                                                    C  CL P  PL EX
              16:8:8:        316 [us] 100.0 %       66 12 120 0 6
             16:16:8:        252 [us] 125.4 %       72 6 120 0 6
             16:10:8:        190 [us] 166.3 %       66 12 120 0 6
            18:14:10:        167 [us] 189.2 %       76 2 120 0 6    5 errors
             18:14:8:        157 [us] 201.3 %       76 2 120 0 6    5 errors
            18:12:10:        154 [us] 205.2 %       75 3 120 0 6    5 errors
             18:12:8:        155 [us] 203.9 %       75 3 120 0 6    5 errors
             16:12:6:        207 [us] 152.7 %       66 12 120 0 6   5 errors
             16:10:6:        327 [us] 96.6 %        66 12 120 0 6   5 errors
             14:10:8:        242 [us] 130.6 %       60 18 120 0 6   5 errors
             14:12:6:        157 [us] 201.3 %       56 22 120 0 6   5 errors
             12:12:8:        157 [us] 201.3 %       33 45 120 0 6   5 errors

         5248 towctrans-bmy.o (16:8:8)
         5320 towctrans-bmylow16.o (16:16:8)
         5656 towctrans-bmybits.o (16:10:8)
         5832 bits-12_12_8.o
         5760 bits-14_12_6.o
         5728 bits-14_10_8.o
         5680 bits-16_10_6.o
         5680 bits-16_12_6.o
         5440 bits-18_12_8.o
         5456 bits-18_12_10.o
         5352 bits-18_14_8.o
         5368 bits-18_14_10.o

INSTALLATION
    Perl 5.12 or later is required, as well as the LWP::UserAgent CPAN
    module.

    This module does not need to be installed. Running gen_wctrans is
    enough. However, for full testing and global installation run this:

       perl Makefile.PL
       make
       make test
       make test-all
       sudo make install

    or

       sudo apt install wget / sudo dnf install wget / ...
       sudo cp bin/gen_wctrans /usr/local/bin/
       cpan LWP::UserAgent / sudo apt install libwww-perl / ...

DEPENDENCIES
    This module requires a UnicodeData.txt file from Unicode Character
    Database, which is automatically downloaded if missing.

AUTHOR
    Reini Urban <rurban@cpan.org>

    Copyright(C) 2026 Reini Urban. All rights reserved

COPYRIGHT AND LICENSE
    This module is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.

    The generated files are MIT licensed. See the headers of the generated
    files.

SEE ALSO
    <https://www.unicode.org/reports/tr44/#Casemapping>
    <https://git.musl-libc.org/cgit/musl/tree/src/ctype/towctrans.c>
    <https://git.musl-libc.org/cgit/musl/tree/src/ctype/towctrans.c?id=e8aba58ab19a18f83d7f78e80d5e4f51e7e4e8a9>
    <https://github.com/rurban/safeclib/blob/master/src/extwchar/towctrans.c>
    <https://sourceware.org/git/?p=glibc.git;a=tree;f=wctype;;hb=HEAD>

