Class ZenoString

All Implemented Interfaces:
Comparable<UnicodeString>, AtomicMatchKey

public class ZenoString extends UnicodeString
A ZenoString is an implementation of UnicodeString that comprises a list of segments representing substrings of the total string. By convention the segments are not themselves ZenoStrings, so the structure is a shallow tree. An index holds pointers to the segments and their offsets within the string as a whole; this is used to locate the codepoint at any particular location in the string.

The segments will always be non-empty. An empty string contains no segments.
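The indexing idea described above can be illustrated with a small, self-contained model. This is only a sketch under stated assumptions, not the Saxon implementation: the class name `SegmentIndexSketch`, the use of `int[]` arrays as segments, and the binary search over start offsets are all hypothetical choices made for illustration.

```java
import java.util.List;

/** Hypothetical model of a segmented string with an offset index; not Saxon code. */
final class SegmentIndexSketch {
    private final int[][] segments; // code points of each segment; no segment is empty
    private final long[] offsets;   // offsets[i] = position of segment i's first code point
    private final long length;      // total number of code points

    SegmentIndexSketch(List<String> parts) {
        // Drop empty parts so every stored segment is non-empty;
        // an empty string therefore ends up with no segments at all.
        segments = parts.stream()
                .filter(p -> !p.isEmpty())
                .map(p -> p.codePoints().toArray())
                .toArray(int[][]::new);
        offsets = new long[segments.length];
        long total = 0;
        for (int i = 0; i < segments.length; i++) {
            offsets[i] = total;
            total += segments[i].length;
        }
        length = total;
    }

    long length() {
        return length;
    }

    int segmentCount() {
        return segments.length;
    }

    int codePointAt(long index) {
        if (index < 0 || index >= length) {
            throw new IndexOutOfBoundsException("index " + index + ", length " + length);
        }
        // Binary search for the last segment whose start offset is <= index.
        int lo = 0;
        int hi = segments.length - 1;
        while (lo < hi) {
            int mid = (lo + hi + 1) >>> 1;
            if (offsets[mid] <= index) {
                lo = mid;
            } else {
                hi = mid - 1;
            }
        }
        return segments[lo][(int) (index - offsets[lo])];
    }
}
```

In this model, locating a code point costs a search over the segment index plus one array access, which is why keeping the number of segments small matters.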

The key to the performance of the data structure (and the source of its name) is the algorithm for consolidating segments when strings are concatenated. It keeps the number of segments growing only logarithmically with the string size, and leaves short segments at the extremities so that further concatenation at either end remains efficient.

For further details, see the paper by Michael Kay presented at Balisage 2021.

  • Field Details

    • EMPTY

      public static final ZenoString EMPTY
      An empty ZenoString
  • Method Details

    • of

      public static ZenoString of(UnicodeString content)
      Construct a ZenoString from a supplied UnicodeString
      Parameters:
      content - the supplied UnicodeString
      Returns:
      the resulting ZenoString
    • codePoints

      public IntIterator codePoints()
      Get an iterator over the code points present in the string.
      Specified by:
      codePoints in class UnicodeString
      Returns:
      an iterator that delivers the individual code points
    • length

      public long length()
      Get the length of the string
      Specified by:
      length in class UnicodeString
      Returns:
      the number of code points in the string
    • isEmpty

      public boolean isEmpty()
      Ask whether the string is empty
      Overrides:
      isEmpty in class UnicodeString
      Returns:
      true if the length of the string is zero
    • getWidth

      public int getWidth()
      Get the number of bits needed to hold all the characters in this string
      Specified by:
      getWidth in class UnicodeString
      Returns:
      7 for ASCII characters, 8 for Latin-1, 16 for the BMP, 24 for general Unicode.
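The width classification can be illustrated on a plain java.lang.String. This is a minimal sketch, not the Saxon implementation: the class `WidthSketch` and method `widthOf` are hypothetical names, and treating the empty string as width 7 is an assumption made here for simplicity.

```java
/** Hypothetical helper that classifies a string by its largest code point. */
final class WidthSketch {
    static int widthOf(String s) {
        // Largest code point present; 0 for an empty string (assumption: width 7).
        int max = s.codePoints().max().orElse(0);
        if (max < 0x80) {
            return 7;   // ASCII only
        }
        if (max < 0x100) {
            return 8;   // fits in Latin-1
        }
        if (max < 0x10000) {
            return 16;  // fits in the Basic Multilingual Plane
        }
        return 24;      // needs the supplementary planes
    }
}
```

Under these assumptions, `widthOf("abc")` gives 7, `widthOf("caf\u00e9")` gives 8, `widthOf("\u20ac")` gives 16, and a string containing U+1F600 gives 24.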
    • indexOf

      public long indexOf(int codePoint, long from)
      Get the position of the first occurrence of the specified codepoint, starting the search at a given position in the string
      Specified by:
      indexOf in class UnicodeString
      Parameters:
      codePoint - the sought codePoint
      from - the position from which the search should start (0-based). A negative value is treated as zero. A position beyond the end of the string results in a return value of -1 (meaning not found).
      Returns:
      the position (0-based) of the first occurrence found, or -1 if not found
      Throws:
      IndexOutOfBoundsException - if the from value is out of range
    • indexWhere

      public long indexWhere(IntPredicate predicate, long from)
      Description copied from class: UnicodeString
      Get the position of the first occurrence of a codepoint that matches a supplied predicate, starting the search at a given position in the string
      Specified by:
      indexWhere in class UnicodeString
      Parameters:
      predicate - condition that the codepoint must satisfy
      from - the position from which the search should start (0-based). A negative value is treated as zero. A position beyond the end of the string results in a return value of -1 (meaning not found).
      Returns:
      the position (0-based) of the first codepoint to match the predicate, or -1 if not found
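The handling of the from argument described above can be sketched over a plain array of code points. This is an illustration only: `IndexWhereSketch` is a hypothetical class, the string is modelled as an `int[]`, and the predicate type is assumed to be `java.util.function.IntPredicate`.

```java
import java.util.function.IntPredicate;

/** Hypothetical model of a predicate search over an array of code points. */
final class IndexWhereSketch {
    static long indexWhere(int[] codePoints, IntPredicate predicate, long from) {
        // A negative start position is treated as zero.
        long start = Math.max(from, 0);
        // A start position at or beyond the end skips the loop and yields -1.
        for (long i = start; i < codePoints.length; i++) {
            if (predicate.test(codePoints[(int) i])) {
                return i;
            }
        }
        return -1; // not found
    }
}
```

For the code points of "a1b2" and the predicate `Character::isDigit`, this sketch returns 1 when searching from 0 or from -5, 3 when searching from 2, and -1 when searching from 10.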
    • codePointAt

      public int codePointAt(long index)
      Get the code point at a given position in the string
      Specified by:
      codePointAt in class UnicodeString
      Parameters:
      index - the given position (0-based)
      Returns:
      the code point at the given position
      Throws:
      IndexOutOfBoundsException - if the index is out of range
    • substring

      public UnicodeString substring(long start, long end)
      Get a substring of this codepoint sequence, with a given start and end position
      Specified by:
      substring in class UnicodeString
      Parameters:
      start - the start position (0-based): that is, the position of the first code point to be included
      end - the end position (0-based): specifically, the position of the first code point not to be included
      Returns:
      the requested substring
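The half-open range convention (start included, end excluded, both counted in code points) can be illustrated with a plain java.lang.String. This is a sketch only: `SubstringSketch` is a hypothetical helper, and its behaviour for out-of-range positions (an exception from the String constructor) is an assumption, since the documentation above does not specify it.

```java
/** Hypothetical helper showing code-point-based, half-open substring positions. */
final class SubstringSketch {
    static String substring(String s, long start, long end) {
        int[] codePoints = s.codePoints().toArray();
        // Copy the code points in the range [start, end).
        return new String(codePoints, (int) start, (int) (end - start));
    }
}
```

Because positions count code points rather than UTF-16 code units, `substring("a\uD83D\uDE00bc", 1, 3)` in this sketch yields the emoji followed by "b", whereas `String.substring(1, 3)` on the same input would yield only the two surrogates of the emoji.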
    • hasSubstring

      public boolean hasSubstring(UnicodeString other, long offset)
      Ask whether this string has another string as its content starting at a given offset
      Overrides:
      hasSubstring in class UnicodeString
      Parameters:
      other - the other string
      offset - the starting position in this string (counting in codepoints)
      Returns:
      true if the other string appears as a substring of this string starting at the given position.
      Throws:
      IndexOutOfBoundsException - if offset is less than zero or greater than the length of this string. Note that there is no exception if offset + other.length() exceeds this.length() - instead this results in a return value of false.
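The distinction between an invalid offset (exception) and a match that would run past the end (false) can be sketched over plain arrays of code points. `HasSubstringSketch` is a hypothetical model, not the Saxon implementation.

```java
/** Hypothetical model of the offset rules documented for hasSubstring. */
final class HasSubstringSketch {
    static boolean hasSubstring(int[] self, int[] other, long offset) {
        // Only the offset itself is range-checked.
        if (offset < 0 || offset > self.length) {
            throw new IndexOutOfBoundsException("offset " + offset + ", length " + self.length);
        }
        // Running past the end is not an error; it simply cannot match.
        if (offset + other.length > self.length) {
            return false;
        }
        for (int i = 0; i < other.length; i++) {
            if (self[(int) offset + i] != other[i]) {
                return false;
            }
        }
        return true;
    }
}
```

With the code points of "abcdef" as the subject, "cd" matches at offset 2 but not at offset 3, "efg" at offset 4 returns false without throwing, and offsets of -1 or 7 throw IndexOutOfBoundsException.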
    • concat

      public ZenoString concat(UnicodeString other)
      Concatenate another string
      Overrides:
      concat in class UnicodeString
      Parameters:
      other - the string to be appended to this one
      Returns:
      the result of the concatenation (neither input string is altered)
    • writeSegments

      public void writeSegments(UnicodeWriter writer) throws IOException
      Write each of the segments in turn to a UnicodeWriter
      Parameters:
      writer - the writer to which the string is to be written
      Throws:
      IOException
    • concatSegments

      public static UnicodeString concatSegments(UnicodeString left, UnicodeString right)
    • economize

      public UnicodeString economize()
      Get an equivalent UnicodeString that uses the most economical representation available
      Overrides:
      economize in class UnicodeString
      Returns:
      an equivalent UnicodeString
    • toString

      public String toString()
      Overrides:
      toString in class Object
    • debugSegmentLengths

      public List<Long> debugSegmentLengths()
      This method is for diagnostics and unit testing only: it exposes the lengths of the internal segments. This is an implementation detail that is subject to change and does not affect the exposed functionality.
      Returns:
      the lengths of the segments