Token Segmentation
You are given a text string text and an array dictionary, where each element in dictionary is a string in the format "<key>:<id>". Here, key represents a token string and id represents its corresponding identifier.
Your task is to implement a function that segments text into a sequence of tokens according to the rules below and returns a list of strings.
The tokenization process must follow these rules:
- Longest Match Priority: At each position in
text, consider all dictionary keys that start at the current index and select the longest possible matching key. If multiple keys match, the longest one must be chosen. - Greedy Consumption: Once a token is selected, consume all characters of that token and continue processing from the next unconsumed position in
text. - Literal Preservation: If no dictionary key matches starting at the current position, the single character at that position must be emitted as a literal token.
Output Format: If a segment matches a dictionary key, output its corresponding id. If a segment does not match any dictionary key, output the character itself as a string.
Example 1
Input
text = "applepiepear", dictionary = ["app:10", "apple:20", "pie:30"]Output
["20", "30", "p", "e", "a", "r"]Explanation
At the start, "apple" is matched instead of the shorter "app". Then "pie" is matched. The remaining characters do not match any token and are kept as literals.
Example 2
Input
text = "acdebe", dictionary = ["a:1", "b:2", "cd:3"]Output
["1", "3", "e", "2", "e"]
Example 3
Input
text = "programmingprogrampropro", dictionary = ["pro:1", "program:2", "programming:3", "gram:4", "ming:5", "pr:6", "og:7"]Output
["3", "2", "1", "1"]
Constraints
1 <= text.length <= 10⁵0 <= dictionary.length <= 10⁵- All tokens in
dictionaryare unique.
Token Segmentation
Overview
The problem asks us to simulate a classic lexical analysis process, often called tokenization or scanning. We are given a string text and a dictionary of token bindings, and we need to break the string into segments.
The two primary rules dictating our logic are Longest Match Priority and Greedy Consumption. At any character position, we want to look ahead to find the absolute longest dictionary key that matches the substring starting at our current position. Once we find that maximal match, we "consume" it (skip past it) and repeat the process. If no prefix matches any key in our dictionary, we fall back to Literal Preservation, emitting the single character and moving forward by one.
Brute Force
Our initial intuition might be to use a Hash Map. We can parse the input dictionary and populate a hash map where the keys are the token strings and the values are their corresponding IDs.
To tokenize the text, we can iterate through it with a pointer i. At each position i, we can look at all substrings starting at i. To strictly enforce the "longest match priority", we should check the longest possible substring first. To avoid unnecessary work, we can cap our maximum substring length to the length of the longest key present in the dictionary. If we find a match, we append its ID to our result array and advance our pointer by the length of the matched key. If we check down to length 1 and find no match, we append the literal character and advance our pointer by 1.
def tokenize_text(text: str, dictionary: list[str]) -> list[str]:
# Parse dictionary into a hash map
token_map = {}
for item in dictionary:
key, val = item.split(":", 1)
token_map[key] = val
# Find the maximum key length to optimize our search window
max_len = max((len(k) for k in token_map.keys()), default=0)
ans = []
i = 0
n = len(text)
while i < n:
best_match = None
best_len = 0
# Check from longest possible down to 1
max_search = min(n - i, max_len)
for L in range(max_search, 0, -1):
sub = text[i:i+L]
if sub in token_map:
best_match = token_map[sub]
best_len = L
break # Found the longest due to iterating downwards
if best_match is not None:
ans.append(best_match)
i += best_len
else:
ans.append(text[i])
i += 1
return ans
Time complexity: O(N * L^2 + K) — where N is the length of text, L is the maximum length of a key in the dictionary, and K is the total character length of all dictionary strings. Creating substrings of length up to L and hashing them takes O(L) time inside a loop that runs up to L times per character position N.
Space complexity: O(N + K) — We store all keys and values in a Hash Map taking O(K) space, and our output array takes O(N) space in the worst case (all literals).
Optimal / best approach
While the brute force approach is functional, generating and hashing strings repeatedly is computationally expensive. We can optimize the prefix matching heavily by using a Trie (Prefix Tree).
Instead of generating substrings from our text and checking if they exist in a Hash Map, we can insert all our dictionary keys into a Trie. The terminal node of each valid key will store its corresponding id.
When iterating through the text at position i, we traverse down the Trie character by character. During this traversal, we keep track of the last valid id we passed and its corresponding length. We continue traversing down the Trie until the current text character doesn't match any child node. At that point, the last recorded id represents the longest match. We append the id, advance i by the match length, and repeat. If our traversal immediately fails or yields no valid id, we fall back to the literal character.
class TrieNode:
def __init__(self):
self.children = {}
self.token_id = None
def tokenize_text(text: str, dictionary: list[str]) -> list[str]:
# 1. Build the Trie
root = TrieNode()
for item in dictionary:
key, val = item.split(':', 1)
node = root
for char in key:
if char not in node.children:
node.children[char] = TrieNode()
node = node.children[char]
node.token_id = val # Marks the end of a valid dictionary key
# 2. Tokenize the text using the Trie
ans = []
i = 0
n = len(text)
while i < n:
node = root
best_id = None
best_len = 0
curr_len = 0
# Traverse the Trie to find the longest matching prefix
for j in range(i, n):
char = text[j]
if char not in node.children:
break
node = node.children[char]
curr_len += 1
# If this node marks the end of a key, record it as the best so far
if node.token_id is not None:
best_id = node.token_id
best_len = curr_len
if best_id is not None:
# Longest match found
ans.append(best_id)
i += best_len
else:
# No match found, fallback to literal preservation
ans.append(text[i])
i += 1
return ans
Time complexity: O(N * L + K) — where N is the length of text, L is the maximum key length, and K is the total character length of the dictionary strings. Trie insertion takes O(K). Tokenizing steps forward through N characters, exploring the Trie up to L steps downwards, avoiding costly string concatenation and hashing.
Space complexity: O(N + K) — The Trie requires O(K) space to store all the characters across all the dictionary nodes. The ans array requires O(N) space to store the token output.
Comments on the solution
No comments on the solution yet. Start a thread about the editorial or your own approach.
No comments yet for Token Segmentation. Be the first to start the thread.
No submissions recorded for this problem yet. Use Submit in the editor — each judge run is stored here.
Output
Input
Expected
Your output
Run to see your output.