Today I want to share a quick antecdote about struggling through deriving an equation for an algorithm. Yes, I’m aware it has label easy. I thought it would be easy too until I started programming it up, and my solution didn’t work. What I learned is that when I am having trouble, I shouldn’t try to solve the “best case” scenario first and then grab a ride on the fail bus because I couldn’t do it. Instead, I tried to derive a solution that was simple and logical to me, and then attempt improving on that. If this were the real world? I would have colleagues and others to share my thinking with, and you know what they say about counting heads. In my case, since it’s only me, I am mainly satisfied that I got some solution and it’s not the worst one *wink). Here is our problem:
Given a string, print the size of the longest possible substring that has exactly k unique characters.
And if there is no possible substring print -1
, because nothing like a good, sharp negative number
to tell you that you’ve ended up less far along than you were before!
Example
We can look at an actual string as an example and derive an expected solution. For
the string aabacbebebe
, with k = 3, we would expect the answer to be cbebebe
.
This is the longest with length 7, and it has unique characters c, b, and e.
My first assumption is that we aren’t reordering the string, and we aren’t dealing
with weird characters or considering upper/lowercase.
First Attempt
Have you noticed that posts like these always have more than one attempt? That’s because these problems are hard (I don’t care what the hacker-schmaker-geek board says). I first started out with this approach:
S='aabacbebebe'
k=3
longest=-1 # keep record of the longest
start=0 # start of the current substring being evaluated
chars=set() # keep a record of characters that we've seen
def update_longest(start, end, S, longest):
contender = S[start:end]
if longest == -1:
return contender
if len(contender) > len(longest):
longest=contender
print('New longest is %s' %longest)
return longest
for idx in range(len(S)):
s = S[idx]
# The character is in the current substring
if s in chars:
print('%s is in %s, continuing' %(s, chars))
longest = update_longest(start, idx, S, longest)
else:
# S isn't one of the characters, and we are at stopping point
if len(chars) == k:
print('final evaluation of %s,%s' %(s, chars))
longest = update_longest(start, idx, S, longest)
start=idx
chars=set(s)
else:
chars.add(s)
longest = update_longest(start, idx, S, longest)
print('added %s,%s' %(s, chars))
And I continually would get an answer that didn’t seem to move along the string (e.g.,
added a,{'a'}
a is in {'a'}, continuing
New longest is a
New longest is aa
added b,{'b', 'a'}
a is in {'b', 'a'}, continuing
New longest is aab
New longest is aaba
added c,{'c', 'b', 'a'}
b is in {'c', 'b', 'a'}, continuing
New longest is aabac
final evaluation of e,{'c', 'b', 'a'}
New longest is aabac
New longest is aabac
added b,{'b', 'e'}
e is in {'b', 'e'}, continuing
New longest is aabac
b is in {'b', 'e'}, continuing
New longest is aabac
e is in {'b', 'e'}, continuing
New longest is aabac
Why isn’t it working?
Option 1 is to throw an avocado against the wall, get frustrated, and give up. I haven’t actually done this because avocados are like gold to me, but I sure have fantastized about it. Option 2 is to try and understand why your soliution isn’t working, and come up with a good example of it. My first attempt wasn’t able to “look backwards,” and I’ll try to illustrate what I mean. When we hit the case that the number of unique characters is equal to k, and the next character isn’t in the set k, I am “resetting” the state by setting the start to be that current index. The problem with this is that it could be the case that we could step backwards along the string and grab some k-N characters that, along with our next character, would be the solution! For example if we have (and here I am splaying out the string and showing you indices):
0|1|2|3|4|5|6 # i
A|B|B|C|C|C|C
and we set k=2, once we hit the first index of C (i=3) we would first check the current start (i=0) through the end (i=2) to see if
the string (ABB
) could be our new longest. Regardless of this outcome, after the check we would throw away the ABB
by setting the new start to be our current index, which is 3, for the next loop. We would then go through the same
procedure again, but since we are only checking a long array of C’s, we wouldn’t meet our k=2 requirement and the presented solution would
be ABB
. Why is this wrong? Because the length of this solution is 3, and we could have done better finding BBCCCC
. We
didn’t do that because we had no logic in the implementation for looking backwards.
Attempt Two
Keep it simple! I decided to rethink the problem at this point. You see, I’m very block headed when it comes to algorithms, and once I realized I had this “backtracking” error I was becoming infinitely more frustrated creating loops inside loops. If the solution isn’t something simple that I can think of today and then remember tomorrow, it’s not a good solution for me. Instead of trying to shove all this logic into one complicated loop, I decided to step back and think of the problem much more simply. Gogo Dinosaur Strategy Session!
Strategy:
Hello my name is strategy! I don’t remember complicated things, but I like simple ALOT. Today we are going to:
- generate all substrings
- count unique characters
- The longest one with k wins!
def generate_substrings(S, k=3):
substrings = []
# Case 1: fewer characters in S than k, impossible
if len(S) < k:
return substrings
# Case 2: if the substring is == k, then it's the only option
elif len(S) == k:
return [S]
# Case 3: we can generate substrings!
for s1 in range(len(S)):
substrings.append(S[s1:])
return substrings
This would produce this output:
$ generate_substrings(S)
['aabacbebebe',
'abacbebebe',
'bacbebebe',
'acbebebe',
'cbebebe',
'bebebe',
'ebebe',
'bebe',
'ebe',
'be',
'e']
Note that we aren’t considering the number of unique characters, actually if the length of the above is less than our k, it’s impossible to have three unique, so let’s filter those out.
def generate_substrings(S, k=3):
substrings = []
# Case 1: fewer characters in S than k, impossible
if len(S) < k:
return substrings
# Case 2: if the substring is == k, then it's the only option
elif len(S) == k:
return [S]
# Case 3: we can generate substrings!
for s1 in range(len(S)):
if len(S[s1:]) >= k:
substrings.append(S[s1:])
return substrings
Now we get a slightly smaller list:
$ generate_substrings(S)
['aabacbebebe',
'abacbebebe',
'bacbebebe',
'acbebebe',
'cbebebe',
'bebebe',
'ebebe',
'bebe',
'ebe']
Okay, so now we want to just find the longest with k unique. Let’s do that.
substrings = generate_substrings(S, k=3)
longest = -1
for contender in substrings:
unique_chars = len(set(contender))
# Condition 1: the number of characters must be less than = k
if unique_chars == k:
# Case 1: we don't have a longest. Now we do
if longest == -1:
longest = contender
# Case 2: the new string is longer
elif len(contender) > len(longest):
longest = contender
We can then run the above and get the longest, cbebebe
(this I assume is
pronounced SEE BAYBAY+). I realize that it’s
probably not super efficient to do this in two steps - generating
all the substrings and then testing, but it’s so much more clean and logical, and
so is a solution I’m happy with.
More Optimized?
About 10 minutes after doing the above, I was again not happy with the solution. I didn’t have a better way, but had the intuition that I wanted to move the second (redundant) loop somehow into the first. Was there a reason we couldn’t check for the substring to be the longest as we generated it? Let’s try to make it better. Actually, no promises that this is better, but actually just different. My first strategy was to get a sense of how I was going to combine the two things:
def generate_longest(S, k=3):
## DEFINE LONGEST DEFAULT HERE
# Case 1: fewer characters in S than k, impossible
if len(S) < k:
return longest
# Case 2: if the substring is == k, AND it has k unique, only option
elif len(S) == k and len(set(S)) == k:
return S
# Case 3: we can check substrings
for s1 in range(len(S)):
substring = S[s1:]
## COMPARE AGAINST LONGEST HERE
return longest
In the above, you can see how (my thinking) started. I would need to rescope my thinking about the function to (instead of returning substrings) to return just the longest. This changed the first set of checks. Next I would need to do some comparison with the longest in Case 3. What I came up with is this:
def generate_longest(S, k=3):
# We can take len() of empty string
longest = ''
# Case 1: fewer characters in S than k, impossible
if len(S) < k:
return longest
# Case 2: if the substring is == k, AND it has k unique, only option
elif len(S) == k and len(set(S)) == k:
return S
# Case 3: we can check substrings
for s1 in range(len(S)):
substring = S[s1:]
# Exactly k! This could be a solution
if len(set(substring)) == k:
substring = S[s1:]
if len(substring) > len(longest):
longest = substring
# Greater than k, could be? But we need to trim
else:
while len(set(substring)) > k:
# Remove last element, try again
substring = substring[:-1] # remove last element
if len(set(substring)) == k and len(substring) > len(longest):
longest = substring
# If we still have empty for longest, return -1 as desired
if longest == '':
longest = -1
return longest
What I originally implemented was just checking if the unique characters was k, and if yes, comparing against the longest and calling success if it was of greater length. I then had a ruhroh moment because I realized that I was always comparing from some index to the end of the string, and it could very well be that we could find a solution by chopping off a character. So I added a while loop to eat away at the substring and do these checks up until hitting k unique characters. Did it work for the previously failing test?
$ generate_longest(S,2)
# 'BBCCCC'
It seems to! No promises for other tests, heh.
Final Dinosaur Drivelings
To summarize, from this I learned that it’s okay to approach things simply, and then try to improve upon. I’d even say you don’t have to find the perfect answer because there are others to help, and you can only do your best. I also think I disagree with returning a value of -1 given no longest substring. I would find it more natural to return an empty string `` which arguably is the longest string you could derive given that you cannot :).
Suggested Citation:
Sochat, Vanessa. "Longest K Unique Characters Substring." @vsoch (blog), 12 May 2018, https://vsoch.github.io/2018/longest-k-substring/ (accessed 18 Nov 24).