The DNA sequence is composed of a series of nucleotides abbreviated as 'A', 'C', 'G', and 'T'.
For example, "ACGAATTCCG" is a DNA sequence.
When studying DNA, it is useful to identify repeated sequences within the DNA.
Given a string s that represents a DNA sequence, return all the 10-letter-long sequences (substrings) that occur more than once in a DNA molecule. You may return the answer in any order.
Example 1:
Input: s = "AAAAACCCCCAAAAACCCCCCAAAAAGGGTTT"
Output: ["AAAAACCCCC","CCCCCAAAAA"]
Example 2:
Input: s = "AAAAAAAAAAAAA"
Output: ["AAAAAAAAAA"]
Constraints:
1 <= s.length <= 105
s[i] is either 'A', 'C', 'G', or 'T'.
Solutions
Solution 1: Hash Table
We define a hash table \(cnt\) to store the occurrence count of all substrings of length \(10\).
We iterate through all substrings of length \(10\) in the string \(s\). For the current substring \(t\), we update its count in the hash table. If the count of \(t\) is \(2\), we add it to the answer.
After the iteration, we return the answer array.
The time complexity is \(O(n \times 10)\), and the space complexity is \(O(n \times 10)\). Here, \(n\) is the length of the string \(s\).
This method essentially combines sliding window and hash. Similar to 0028. Find the Index of the First Occurrence in a String, this problem can use a hash function to reduce the time complexity of counting subsequences to \(O(1)\).
The time complexity is \(O(n)\), and the space complexity is \(O(n)\). Here, \(n\) is the length of the string \(s\).