fork download
  1. # -*- coding: utf-8 -*-
  2. '''
  3. A most frequent k-mer with up to d mismatches in Text is simply a string Pattern maximizing Countd(Text, Pattern) among all k-mers. Note that Pattern does not need to actually appear as a substring of Text; for example, as we saw above, AAAAA is the most frequent 5-mer with 1 mismatch in AACAAGCTGATAAACATTTAAAGAG, even though it does not appear exactly in this string. Keep this in mind while solving the following problem:
  4.  
  5. Frequent Words with Mismatches Problem: Find the most frequent k-mers with mismatches in a string.
  6. Input: A string Text as well as integers k and d. (You may assume k ≤ 12 and d ≤ 3.)
  7. Output: All most frequent k-mers with up to d mismatches in Text.
  8.  
  9. CODE CHALLENGE: Solve the Frequent Words with Mismatches Problem.
  10.  
  11. Sample Input:
  12. CGCCTAAATAGCCTCGCGGAGCCTTATGTCATACTCGTCCT
  13. Sample Output:
  14. GATG ATGC ATGT
  15. '''
  16.  
  17. #define kmers long 4
  18. s = 'CGCCTAAATAGCCTCGCGGAGCCTTATGTCATACTCGTCCT'
  19. motif_len = 3
  20. motif_dict = {}
  21. mismatch = 1
  22.  
  23. print('Sequence = ' + s)
  24.  
  25. #find unique k-mers in the sequence
  26. for i in range(len(s)-motif_len):
  27. motif = s[i:i+motif_len]
  28. if motif not in motif_dict:
  29. motif_dict[motif] = 1
  30. else:
  31. motif_dict[motif] += 1
  32.  
  33. #1: generate a list of motif
  34. motif_list = []
  35. for k in motif_dict:
  36. motif_list.append(k)
  37. print('Motifs found : '),
  38.  
  39. #2: check where the motifs are [wrongly commented]
  40. dict = {}
  41. input = s
  42. ylist = []
  43. for item in motif_list:
  44. motif = item
  45. results = []
  46. y = 0
  47. for n in range(len(input)-len(motif)+1):
  48. counter = 0
  49. sample = input[n:n+len(motif)]
  50. for i in range(len(sample)):
  51. if sample[i] == motif[i]:
  52. pass
  53. else:
  54. counter += 1
  55. if counter <= mismatch:
  56. results.append(n)
  57.  
  58. dict[item] = []
  59. for value in results:
  60. dict[item].append(value)
  61. y += 1
  62. ylist.append(y)
  63.  
  64. print('\nProgram Output:'),
  65. final_list = []
  66. for item in dict:
  67. if len(dict[item]) == max(ylist):
  68. print(item),
  69. final_list.append([item, dict[item]])
  70. #print('')
  71. print(final_list)
  72.  
  73. n=0
  74. for n in range(len(final_list)):
  75. for m in range(max(ylist)):
  76. f = final_list[n][1][m]
  77. final_list[n][1][m] = s[f:f+motif_len]
  78. #for item in final_list:
  79. #for n in range(len(item[1])):
  80. #print(item[1][n])
  81. #define the consensus sequence -> to do but not necessary for this case
  82. #for k in range(len(dict[item])): final_list[item].append(dict[item])
  83. #list founded k-mers
  84. print('\nSample Output: GATG ATGC ATGT') #same values, different order. It doesn't matter
Success #stdin #stdout 0.05s 65532KB
stdin
Standard input is empty
stdout
Sequence = CGCCTAAATAGCCTCGCGGAGCCTTATGTCATACTCGTCCT
Motifs found :  
Program Output: CGT [['CGT', [0, 2, 11, 14, 16, 21, 22, 26, 29, 35, 38]]]

Sample  Output: GATG ATGC ATGT