[stdlib] [proposal] Codepoint as UTF-8 readonly view

proposal PR link

I think we should take a very hard look at what Windows and WASM ↔ JS support will require before we toss out the parameterized version. I don't think we have any option other than keeping it if we want support that isn't horrible. Keeping it also resolves all of the complaints around the "aaaa"[byte=1] syntax.

If we don't do that, then I think the linear typed reference for mutable views, which requires a fixup, makes the most sense.

Since we won’t be dealing with origins

How do you plan to not have origins in the iterator?


Yeah, I would also prefer parametrization, but doing it UTF-8-native for now doesn't really make anything harder. The inner storage would still be 4 bytes regardless of the encoding (in fact, a parametrized version would just bitcast the internal bytes).

By copying 1-4 bytes for each Unicode codepoint into the inline SIMD[DType.uint8, 4] storage. The string iterator already reads the first byte of each codepoint to know how many bytes to jump forward.
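For reference, that first-byte trick is plain UTF-8: the high bits of the leading byte encode the sequence length. A quick sketch in Python (illustrating only the byte-level mechanics; the helper names are made up and are not the actual stdlib API):

```python
def utf8_first_byte_sequence_length(b: int) -> int:
    """Length of the UTF-8 sequence whose leading byte is `b`.

    Assumes `b` is a valid leading byte (not a 0b10xxxxxx continuation).
    """
    if b < 0b1000_0000:    # 0xxxxxxx: ASCII, 1 byte
        return 1
    if b < 0b1110_0000:    # 110xxxxx: 2-byte sequence
        return 2
    if b < 0b1111_0000:    # 1110xxxx: 3-byte sequence
        return 3
    return 4               # 11110xxx: 4-byte sequence

def inline_codepoints(s: str) -> list[bytes]:
    """Copy each codepoint's 1-4 UTF-8 bytes into a fixed 4-byte buffer,
    mimicking the proposed SIMD[DType.uint8, 4] inline storage."""
    data = s.encode("utf-8")
    out, i = [], 0
    while i < len(data):
        n = utf8_first_byte_sequence_length(data[i])
        buf = bytearray(4)          # inline, origin-free storage
        buf[:n] = data[i:i + n]
        out.append(bytes(buf))
        i += n
    return out
```

For example, `inline_codepoints("aé")` gives `[b"a\x00\x00\x00", b"\xc3\xa9\x00\x00"]`: one 4-byte buffer per codepoint, with no reference back to the source string.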

We need the origin to make sure the string still exists while you’re iterating over it.

The iterator struct will probably still carry the origin as a parameter, but we need the iterator values themselves to not be bound to an origin, so that things like replace, lower and upper can return completely new values (that don't originate from the source data).

Oh, I see what you mean.

I disagree.

We should be trying to do zero copy as much as possible. If someone wants to do copies, they can move it to a codepoint iterator instead of a string span iterator.

I'm still sketching some of it out mentally, but here are some concrete examples.

This is more or less what the current slice iterator does:

struct CodepointSliceIter[str_origin]:
  var slice: StringSlice[Self.str_origin]

  fn __next__(mut self) raises StopIteration -> StringSlice[Self.str_origin]:
    var length = _utf8_first_byte_sequence_length(self.slice.as_bytes()[0])
    var ptr = self.slice.ptr
    self.slice = {ptr=ptr + length, length=self.slice.byte_length() - length}
    return {ptr=ptr, length=length}

I want to inline that data (1-4 bytes per codepoint) since we are dereferencing it anyway (we need to know the sequence length from the first byte).
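For comparison, here is the same advance logic sketched in Python (a hypothetical helper, not the real API): the zero-copy iterator hands out subviews that keep borrowing the source buffer, which is exactly the origin dependency the inlined variant would drop.

```python
def codepoint_slices(data: bytes):
    """Yield zero-copy views over each UTF-8 sequence in `data`,
    mirroring CodepointSliceIter's pointer-bump loop."""
    view = memoryview(data)
    i = 0
    while i < len(view):
        b = view[i]
        # same first-byte decoding the Mojo sketch relies on
        n = 1 if b < 0x80 else 2 if b < 0xE0 else 3 if b < 0xF0 else 4
        yield view[i:i + n]  # still borrows `data` (carries the "origin")
        i += n
```

`[bytes(v) for v in codepoint_slices("hé".encode("utf-8"))]` gives `[b"h", b"\xc3\xa9"]`, and every yielded view keeps the source buffer alive while it exists.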

Making the data inline and not tied to an origin means that we can do this (which is necessary):

struct LowerIter[str_origin]:
  var slice: StringSlice[Self.str_origin]

  fn __next__(mut self) raises StopIteration -> Codepoint:
    var length = _utf8_first_byte_sequence_length(self.slice.as_bytes()[0])
    var ptr = self.slice.ptr
    self.slice = {ptr=ptr + length, length=self.slice.byte_length() - length}
    # invented method; it might return the same value or a new one based
    # on the Unicode standard. But the returned value can't be bound to
    # str_origin, since it doesn't necessarily belong to it
    return Codepoint(ptr=ptr, length=length).lower()

What I'm unsure about, however, is how we can safely chain several Iterator[Codepoint]s together. We will probably have to keep the struct pointing to the origin of the other iterator, something like:

struct LowerIter[iter_origin]:
  comptime iteratorType = Iterator[Codepoint, Self.iter_origin]
  var inner: Self.iteratorType

  fn __init__(out self, ref[Self.iter_origin] iterator: Self.iteratorType): ...

  fn __next__(mut self) raises StopIteration -> Codepoint:
    # delegate to the wrapped iterator, which already yields inline
    # codepoints, and transform the value; no pointer arithmetic on the
    # source bytes is needed here
    return self.inner.__next__().lower()

I do think that, for example, GraphemeIter will need to return StringSlice, since a grapheme cluster can be arbitrarily big. But for many other string methods I think it makes sense to just inline the data, which allows new-value creation within the iterators and a sort of "fusion" of them when chained.
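That fusion can be sketched in Python with chained generators (hypothetical names, just to illustrate the shape): each stage pulls one codepoint value from the previous one, so no intermediate string is ever materialized.

```python
def codepoints(s: str):
    """Source stage: yield each codepoint as its own small owned value."""
    yield from s

def lowered(cps):
    """Fused stage: transform codepoints as they flow through, producing
    new values that aren't bound to the source buffer."""
    for cp in cps:
        yield cp.lower()

# Chaining the stages builds a pipeline; nothing runs until it's consumed.
pipeline = lowered(codepoints("MoJo"))
assert "".join(pipeline) == "mojo"
```

Adding another stage (split, filter, ...) just wraps the previous iterator again, which is the chaining question above: each wrapper must keep the one it wraps alive.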

I think we’d need to have a type which is either an owning view of a single codepoint or a mutable borrowed view. My assumption is that most codepoints remain unchanged, so we’d want to hand out big spans of codepoints in many cases.

I don’t think we should allow mutable string iteration at all. I can’t think of a single use-case where that is a good idea (other than ASCII vectorized ops).

I agree, many iterators like split and split_lines should (as they currently do) return StringSlice when possible. But there are several cases where, if they receive an Iterator[Codepoint], they won't be able to construct a StringSlice, since the data doesn't necessarily have a "pure" memory origin (values can be changed along the way). An example:

var data = "Something;Something_else"

# s is a `StringSlice`
for s in data.split(";"):
  if s in ("something", "Something"): ...

# s is a `Codepoint` since it can switch e.g. s[0] from "S" to "s"
for s in s.lower(): ...

# s is an `Iterator[Codepoint]` since it can switch e.g. s[0] from "S" to "s"
for s in s.lower().split(";"):
  # `Iterator[Codepoint]` should be comparable to a string
  if s == "something": ...

# s is a `StringSlice`
for s in s.split(";"):
  # `Iterator[Codepoint]` should be comparable to a string
  if s.lower() == "something": ...
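One concrete reason the values coming out of lower can't stay views into the source, shown here in Python: case mapping can change a codepoint's UTF-8 byte length, so the lowered bytes may not exist anywhere in the original buffer. For example, U+1E9E (ẞ) lowercases to U+00DF (ß):

```python
upper = "ẞ"              # U+1E9E LATIN CAPITAL LETTER SHARP S: 3 UTF-8 bytes
lowercased = upper.lower()  # U+00DF ß: only 2 UTF-8 bytes

assert upper.encode("utf-8") == b"\xe1\xba\x9e"
assert lowercased.encode("utf-8") == b"\xc3\x9f"
# The 2 lowered bytes appear nowhere in the source, so the result must be
# a new owned value rather than a slice into the original string.
```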

I wonder if we need something like Rust's bytes crate.