[erlang-questions] Atom Unicode Support

Mon Feb 1 15:13:32 CET 2016

I have pushed some initial work that creates a separate chunk called AtU8:

https://github.com/erlang/otp/compare/master...josevalim:jv-utf8-atom

First the atoms from the Atom chunk are loaded, followed by atoms from the
new AtU8 chunk. I have, however, run into one complication. When we
generate the atom table in the compiler, indexes are assigned in
first-seen order. So there is a chance we will have an atom table that looks
like this:

    #{0 => module_name,
      1 => 'some_utf8_atom_ł',
      2 => latin1_atom}

However, because we load the latin1 Atom chunk first, the loaded atom table
would look like this:

      0 => module_name
      1 => latin1_atom
      2 => 'some_utf8_atom_ł'

This won't work, because the indexes the compiler emitted no longer match
the loaded table. I have thought of a few solutions to this problem and
would appreciate feedback on which one is preferred:

1. Keep on using two separate chunks and change beam_asm to do two passes.
The first pass collects the latin1 atoms, so we know which indexes to
assign to the utf8 ones.

2. Keep on using two separate chunks and use negative indexes to encode
utf8 atoms. This means the compiler will build a table that looks like this:

    #{-1 => 'some_utf8_atom_ł',
      0 => module_name,
      1 => latin1_atom}

Which we will load to the same table as before:

      0 => module_name
      1 => latin1_atom
      2 => 'some_utf8_atom_ł'

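And we will translate negative indexes to the proper position by
calculating "num_atoms + index" when loading the bytecode, where num_atoms
is the total number of atoms in the loaded table (3 + (-1) = 2 in the
example above).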

3. Introduce a new chunk called AtoE (Atoms with Encoding) to *replace* the
existing Atom chunk. It will be quite similar to the current Atom chunk
except we will also include the encoding alongside each atom size and
contents. The Atom chunk will only be loaded if there is no AtoE chunk. If
we choose this option, we can also choose to always emit the new chunk or
emit the AtoE chunk only if there are Unicode atoms.

4. Similar to option 3, except that the new chunk keeps all atoms and
always stores them as UTF-8. Since the runtime translates them to latin1
later anyway, this is an option if we don't want to store the encoding in
the table.
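
To make the index handling in options 1 and 2 concrete, here is a minimal
sketch. It is not code from the branch linked above: the module name
atom_index_sketch and all function names are hypothetical, it uses the
same 0-based indexes as the examples in this message, and it assumes an
OTP release in which atoms may contain code points above 255.

```erlang
-module(atom_index_sketch).
-export([is_latin1/1, two_pass/1, negative_indexes/1, resolve/2]).

%% True when every code point of the atom fits in latin1 (0..255).
is_latin1(Atom) ->
    lists:all(fun(C) -> C =< 255 end, atom_to_list(Atom)).

%% Option 1 (sketch): split the first-seen list into latin1 and utf8
%% atoms, then number the latin1 atoms first and the utf8 atoms after
%% them, which matches the order in which the two chunks are loaded.
two_pass(Atoms) ->
    {Latin1, Utf8} = lists:partition(fun is_latin1/1, Atoms),
    Ordered = Latin1 ++ Utf8,
    maps:from_list(lists:zip(lists:seq(0, length(Ordered) - 1), Ordered)).

%% Option 2 (sketch): latin1 atoms get 0, 1, 2, ... and utf8 atoms get
%% -1, -2, -3, ... in first-seen order.
negative_indexes(Atoms) ->
    {Latin1, Utf8} = lists:partition(fun is_latin1/1, Atoms),
    L = lists:zip(lists:seq(0, length(Latin1) - 1), Latin1),
    U = lists:zip(lists:seq(-1, -length(Utf8), -1), Utf8),
    maps:from_list(L ++ U).

%% Loader side of option 2: "num_atoms + index" for negative indexes,
%% where NumAtoms is the total number of atoms in the loaded table.
%% Note that with this formula -1 maps to the last slot, -2 to the one
%% before it, and so on.
resolve(Index, NumAtoms) when Index < 0 -> NumAtoms + Index;
resolve(Index, _NumAtoms) -> Index.
```

For the list [module_name, 'some_utf8_atom_ł', latin1_atom], two_pass/1
gives the loaded layout shown above and negative_indexes/1 gives the
compiler table from option 2; resolve(-1, 3) then yields 2.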

Thank you!

PS: If you'd prefer this conversation to be moved off the list, please let
me know.

*José Valim*
www.plataformatec.com.br
Skype: jv.ptec
Founder and Director of R&D

On Mon, Feb 1, 2016 at 1:08 PM, José Valim <jose.valim@REDACTED>
wrote:

> Understood! Thank you.
>
>
> On Mon, Feb 1, 2016 at 12:59 PM, Björn Gustavsson <bjorn@REDACTED>
> wrote:
>
>> On Mon, Feb 1, 2016 at 9:44 AM, José Valim
>> <jose.valim@REDACTED> wrote:
>> > So I would say list_to_binary is behaving as expected and that it should
>> > not change as those "limitations" are there today. Same for port_command,
>> > as it expects iodata. Or am I missing something?
>>
>> My point is that we must look for code in
>> OTP that will break when the change to
>> atoms is made.
>>
>> As a hypothetical example, say that we
>> find the following code in some application:
>>
>>   Str = atom_to_list(Atom),
>>   .
>>   .
>>   .
>>   port_command(Port, Cmd, Str)
>>
>> We must look at the context to determine
>> what we should do. There could be one
>> of several solutions, for example:
>>
>> 1. If the atoms that can be passed to this
>> code have been internally generated we
>> could know that the resulting list is always
>> safe to send to the port. In that case we
>> don't need to update the code.
>>
>> 2. If the origin of the atom is unknown,
>> and the driver cannot handle UTF-8,
>> the solution could be to return an error
>> to the caller if the atom contains
>> non-latin1 characters.
>>
>> 3. If the driver can handle UTF-8 or can
>> be modified to handle UTF-8, the solution
>> could be to use atom_to_binary(Atom, utf8)
>> instead of atom_to_list/1.
>>
>> Basically, we must look at every atom_to_list/1
>> in the OTP code base and determine whether
>> it is safe or if it must be modified in some way.
>>
>> /Björn
>>
>> --
>> Björn Gustavsson, Erlang/OTP, Ericsson AB
>>
>
>
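
Solutions 2 and 3 from Björn's message quoted above could look roughly
like the sketch below. The module and function names are hypothetical and
nothing here is taken from OTP itself; it only uses atom_to_list/1 and
atom_to_binary/2, and assumes atoms may contain code points above 255.

```erlang
-module(atom_port_sketch).
-export([port_safe_string/1, port_utf8_data/1]).

%% Solution 2 (sketch): the driver cannot handle UTF-8, so return an
%% error to the caller when the atom has non-latin1 characters.
port_safe_string(Atom) when is_atom(Atom) ->
    Str = atom_to_list(Atom),
    case lists:all(fun(C) -> C =< 255 end, Str) of
        true -> {ok, Str};
        false -> {error, {non_latin1_atom, Atom}}
    end.

%% Solution 3 (sketch): the driver can handle UTF-8, so hand it the
%% UTF-8 encoded bytes instead of a list of code points.
port_utf8_data(Atom) when is_atom(Atom) ->
    atom_to_binary(Atom, utf8).
```

Solution 1 needs no helper at all: if the atoms are known to be
internally generated, the existing atom_to_list/1 call can stay as it is.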