Skip to main content

Token Segmentation

medium
Software engineer

You are given a text string text and an array dictionary, where each element in dictionary is a string in the format "<key>:<id>". Here, key represents a token string and id represents its corresponding identifier.

Your task is to implement a function that segments text into a sequence of tokens according to the rules below and returns a list of strings.

The tokenization process must follow these rules:

  1. Longest Match Priority: At each position in text, consider all dictionary keys that start at the current index and select the longest possible matching key. If multiple keys match, the longest one must be chosen.
  2. Greedy Consumption: Once a token is selected, consume all characters of that token and continue processing from the next unconsumed position in text.
  3. Literal Preservation: If no dictionary key matches starting at the current position, the single character at that position must be emitted as a literal token.

Output Format: If a segment matches a dictionary key, output its corresponding id. If a segment does not match any dictionary key, output the character itself as a string.

Example 1

Input

text = "applepiepear", dictionary = ["app:10", "apple:20", "pie:30"]

Output

["20", "30", "p", "e", "a", "r"]

Explanation

At the start, "apple" is matched instead of the shorter "app". Then "pie" is matched. The remaining characters do not match any token and are kept as literals.

Example 2

Input

text = "acdebe", dictionary = ["a:1", "b:2", "cd:3"]

Output

["1", "3", "e", "2", "e"]

Example 3

Input

text = "programmingprogrampropro", dictionary = ["pro:1", "program:2", "programming:3", "gram:4", "ming:5", "pr:6", "og:7"]

Output

["3", "2", "1", "1"]

Constraints

  • 1 <= text.length <= 10⁵
  • 0 <= dictionary.length <= 10⁵
  • All tokens in dictionary are unique.

Screening

ArrayGreedyHash TableSimulationStringTrie
Language
Code editor loads in the browser.

Output

Input

"applepiepear"
["app:10","apple:20","pie:30"]

Expected

["20","30","p","e","a","r"]

Your output

Run to see your output.