String vs Unicode

My epoch of working on String is winding down (one thing I'd still love, though, is for StringLiteral to be non-materializable to String).

That said, I know there are bigger design decisions that others have thought about, e.g. Unicode support. I recently saw this blog post, which is a pretty interesting survey and covers some nice issues.

Is anyone interested in working on Unicode support in String, and does anyone have opinions? We have a design doc for String that it would be great to fill out with Unicode support, and I don't know that anyone is working on it.

-Chris

I am not sure about giant-grapheme-cluster attacks. The segmentation rules are meant for text as it is reasonably used, and extra processing power could be spent verifying that a grapheme is actually likely to be a single glyph.
But the rules are already complex. A code point maps to at most 4 UTF-8 bytes, a fixed ratio, but the number of code points per grapheme cluster is unbounded.
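
To make that asymmetry concrete, here is a small sketch (in Swift, purely because its String exposes grapheme, scalar, and UTF-8 views directly; this says nothing about how our String should be implemented): a single user-perceived character can carry an unbounded number of combining code points, while each code point stays within 4 UTF-8 bytes.

```swift
// One base character plus 100 combining acute accents (U+0301).
let zalgo = "e" + String(repeating: "\u{0301}", count: 100)

print(zalgo.count)                  // 1    grapheme cluster
print(zalgo.unicodeScalars.count)   // 101  code points
print(zalgo.utf8.count)             // 201  bytes (1 + 100 * 2)
```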

What do you limit the string length for?

  • Not getting DDoSed.
  • Charging the user for the number of characters typed.
  • Visual separation.

There are only 64 possible continuation-byte values in UTF-8. So an implicit breakpoint could be inserted after 64 bytes, effectively limiting a grapheme to at most 64 code points. This could be limited even further.
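
As a rough sketch of how such a breakpoint could behave (again in Swift for illustration; the `capClusters` name and the 64-byte budget are assumptions, not an existing API), an oversized cluster is split at scalar boundaries into separate pieces. The pieces are returned separately rather than concatenated, because gluing the scalars back into one string would simply re-form the cluster.

```swift
// Sketch only: split any grapheme cluster whose UTF-8 encoding exceeds `maxBytes`
// into bounded pieces; the split points are where the "implicit breakpoint" would go.
func capClusters(_ s: String, maxBytes: Int = 64) -> [String] {
    var pieces: [String] = []
    for cluster in s {                                  // Character == one grapheme cluster
        if cluster.utf8.count <= maxBytes {
            pieces.append(String(cluster))
            continue
        }
        var chunk = ""
        for scalar in cluster.unicodeScalars {
            let width = String(scalar).utf8.count       // a scalar is 1-4 UTF-8 bytes
            if chunk.utf8.count + width > maxBytes {
                pieces.append(chunk)                    // forced break inside the cluster
                chunk = ""
            }
            chunk.unicodeScalars.append(scalar)
        }
        pieces.append(chunk)
    }
    return pieces
}

// A base character plus 200 combining marks (401 bytes, one cluster) becomes bounded pieces.
let attack = "e" + String(repeating: "\u{0301}", count: 200)
print(capClusters(attack).map { $0.utf8.count })        // [63, 64, 64, 64, 64, 64, 18]
```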