Token Segmentation

medium

Software engineer

You are given a text string text and an array dictionary, where each element in dictionary is a string in the format "<key>:<id>". Here, key represents a token string and id represents its corresponding identifier.

Your task is to implement a function that segments text into a sequence of tokens according to the rules below and returns a list of strings.

The tokenization process must follow these rules:

Longest Match Priority: At each position in text, consider all dictionary keys that start at the current index and select the longest possible matching key. If multiple keys match, the longest one must be chosen.
Greedy Consumption: Once a token is selected, consume all characters of that token and continue processing from the next unconsumed position in text.
Literal Preservation: If no dictionary key matches starting at the current position, the single character at that position must be emitted as a literal token.

Output Format: If a segment matches a dictionary key, output its corresponding id. If a segment does not match any dictionary key, output the character itself as a string.

Example 1

Input

text = "applepiepear", dictionary = ["app:10", "apple:20", "pie:30"]

Output

["20", "30", "p", "e", "a", "r"]

Explanation

At the start, "apple" is matched instead of the shorter "app". Then "pie" is matched. The remaining characters do not match any token and are kept as literals.

Example 2

Input

text = "acdebe", dictionary = ["a:1", "b:2", "cd:3"]

Output

["1", "3", "e", "2", "e"]

Example 3

Input

text = "programmingprogrampropro", dictionary = ["pro:1", "program:2", "programming:3", "gram:4", "ming:5", "pr:6", "og:7"]

Output

["3", "2", "1", "1"]

Constraints

1 <= text.length <= 10⁵
0 <= dictionary.length <= 10⁵
All tokens in dictionary are unique.

Screening

ArrayGreedyHash TableSimulationStringTrie

Token Segmentation

Overview

The problem asks us to simulate a classic lexical analysis process, often called tokenization or scanning. We are given a string text and a dictionary of token bindings, and we need to break the string into segments.

The two primary rules dictating our logic are Longest Match Priority and Greedy Consumption. At any character position, we want to look ahead to find the absolute longest dictionary key that matches the substring starting at our current position. Once we find that maximal match, we "consume" it (skip past it) and repeat the process. If no prefix matches any key in our dictionary, we fall back to Literal Preservation, emitting the single character and moving forward by one.

Brute Force

Our initial intuition might be to use a Hash Map. We can parse the input dictionary and populate a hash map where the keys are the token strings and the values are their corresponding IDs.

To tokenize the text, we can iterate through it with a pointer i. At each position i, we can look at all substrings starting at i. To strictly enforce the "longest match priority", we should check the longest possible substring first. To avoid unnecessary work, we can cap our maximum substring length to the length of the longest key present in the dictionary. If we find a match, we append its ID to our result array and advance our pointer by the length of the matched key. If we check down to length 1 and find no match, we append the literal character and advance our pointer by 1.

def tokenize_text(text: str, dictionary: list[str]) -> list[str]:
    # Parse dictionary into a hash map
    token_map = {}
    for item in dictionary:
        key, val = item.split(":", 1)
        token_map[key] = val
        
    # Find the maximum key length to optimize our search window
    max_len = max((len(k) for k in token_map.keys()), default=0)
    
    ans = []
    i = 0
    n = len(text)
    
    while i < n:
        best_match = None
        best_len = 0
        
        # Check from longest possible down to 1
        max_search = min(n - i, max_len)
        for L in range(max_search, 0, -1):
            sub = text[i:i+L]
            if sub in token_map:
                best_match = token_map[sub]
                best_len = L
                break # Found the longest due to iterating downwards
        
        if best_match is not None:
            ans.append(best_match)
            i += best_len
        else:
            ans.append(text[i])
            i += 1
            
    return ans

Time complexity: O(N * L^2 + K) — where N is the length of text, L is the maximum length of a key in the dictionary, and K is the total character length of all dictionary strings. Creating substrings of length up to L and hashing them takes O(L) time inside a loop that runs up to L times per character position N. Space complexity: O(N + K) — We store all keys and values in a Hash Map taking O(K) space, and our output array takes O(N) space in the worst case (all literals).

Optimal / best approach

While the brute force approach is functional, generating and hashing strings repeatedly is computationally expensive. We can optimize the prefix matching heavily by using a Trie (Prefix Tree).

Instead of generating substrings from our text and checking if they exist in a Hash Map, we can insert all our dictionary keys into a Trie. The terminal node of each valid key will store its corresponding id.

When iterating through the text at position i, we traverse down the Trie character by character. During this traversal, we keep track of the last valid id we passed and its corresponding length. We continue traversing down the Trie until the current text character doesn't match any child node. At that point, the last recorded id represents the longest match. We append the id, advance i by the match length, and repeat. If our traversal immediately fails or yields no valid id, we fall back to the literal character.

class TrieNode:
    def __init__(self):
        self.children = {}
        self.token_id = None

def tokenize_text(text: str, dictionary: list[str]) -> list[str]:
    # 1. Build the Trie
    root = TrieNode()
    for item in dictionary:
        key, val = item.split(':', 1)
        node = root
        for char in key:
            if char not in node.children:
                node.children[char] = TrieNode()
            node = node.children[char]
        node.token_id = val  # Marks the end of a valid dictionary key

    # 2. Tokenize the text using the Trie
    ans = []
    i = 0
    n = len(text)
    
    while i < n:
        node = root
        best_id = None
        best_len = 0
        curr_len = 0
        
        # Traverse the Trie to find the longest matching prefix
        for j in range(i, n):
            char = text[j]
            if char not in node.children:
                break
            
            node = node.children[char]
            curr_len += 1
            
            # If this node marks the end of a key, record it as the best so far
            if node.token_id is not None:
                best_id = node.token_id
                best_len = curr_len
        
        if best_id is not None:
            # Longest match found
            ans.append(best_id)
            i += best_len
        else:
            # No match found, fallback to literal preservation
            ans.append(text[i])
            i += 1
            
    return ans

Time complexity: O(N * L + K) — where N is the length of text, L is the maximum key length, and K is the total character length of the dictionary strings. Trie insertion takes O(K). Tokenizing steps forward through N characters, exploring the Trie up to L steps downwards, avoiding costly string concatenation and hashing. Space complexity: O(N + K) — The Trie requires O(K) space to store all the characters across all the dictionary nodes. The ans array requires O(N) space to store the token output.

Comments on the solution

No comments on the solution yet. Start a thread about the editorial or your own approach.

Language

Code editor loads in the browser.

Output

Input

text"applepiepear"

dictionary["app:10","apple:20","pie:30"]

Expected

["20","30","p","e","a","r"]

Your output

Run to see your output.

Example 1

Example 2

Example 3

Constraints

Interview stages

Topics

Token Segmentation

Overview

Brute Force

Optimal / best approach

Comments on the solution

Output