String vs Unicode

My epoch of working on String is winding down (one thing I'd still love, though, is for StringLiteral to be non-materializable to String).

That said, I know there are bigger design decisions that others have thought about, e.g. Unicode support. I recently saw this blog post, which is a pretty interesting survey and covers some nice issues.

Is anyone interested in working on Unicode support in String, and does anyone have opinions? We have a design doc for String that it would be great to fill out with Unicode support, and I don't know that anyone is working on it.

-Chris

I am not sure about giant-grapheme-cluster attacks. The segmentation rules are meant for text as it is reasonably used, and extra processing power could be spent verifying that a grapheme is actually likely to be a single glyph.
But the rules are already complex. A code point maps to at most 4 UTF-8 bytes, a fixed ratio, but the number of code points per grapheme cluster is unbounded.
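
To make that asymmetry concrete, here is a small sketch (in Swift, purely because its String exposes grapheme, scalar, and UTF-8 views directly; this says nothing about how our String should be implemented): a single user-perceived character can carry an unbounded number of combining code points, while each code point stays within 4 UTF-8 bytes.

```swift
// One base character plus 100 combining acute accents (U+0301).
let zalgo = "e" + String(repeating: "\u{0301}", count: 100)

print(zalgo.count)                  // 1    grapheme cluster
print(zalgo.unicodeScalars.count)   // 101  code points
print(zalgo.utf8.count)             // 201  bytes (1 + 100 * 2)
```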

What do you limit the string length for?

  • Not getting DDoSed.
  • Charging the user for the number of characters typed.
  • Visual separation.

There are only 64 possible continuation-byte values in UTF-8. So an implicit breakpoint could be inserted after 64 bytes, effectively limiting a grapheme to at most 64 code points. This could be limited even further.
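
As a rough sketch of how such a breakpoint could behave (again in Swift for illustration; the `capClusters` name and the 64-byte budget are assumptions, not an existing API), an oversized cluster is split at scalar boundaries into separate pieces. The pieces are returned separately rather than concatenated, because gluing the scalars back into one string would simply re-form the cluster.

```swift
// Sketch only: split any grapheme cluster whose UTF-8 encoding exceeds `maxBytes`
// into bounded pieces; the split points are where the "implicit breakpoint" would go.
func capClusters(_ s: String, maxBytes: Int = 64) -> [String] {
    var pieces: [String] = []
    for cluster in s {                                  // Character == one grapheme cluster
        if cluster.utf8.count <= maxBytes {
            pieces.append(String(cluster))
            continue
        }
        var chunk = ""
        for scalar in cluster.unicodeScalars {
            let width = String(scalar).utf8.count       // a scalar is 1-4 UTF-8 bytes
            if chunk.utf8.count + width > maxBytes {
                pieces.append(chunk)                    // forced break inside the cluster
                chunk = ""
            }
            chunk.unicodeScalars.append(scalar)
        }
        pieces.append(chunk)
    }
    return pieces
}

// A base character plus 200 combining marks (401 bytes, one cluster) becomes bounded pieces.
let attack = "e" + String(repeating: "\u{0301}", count: 200)
print(capClusters(attack).map { $0.utf8.count })        // [63, 64, 64, 64, 64, 64, 18]
```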