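import re

text = "Some text, containing some repeating words. It contains words that are repeating"

# Strip any <br> tags first; [^>]* stops at the tag's own closing bracket
# (the greedy .* would swallow everything up to the last '>' in the text).
cleaned = re.sub(r'<br[^>]*>', '', text)

# Extract runs of ASCII letters and lowercase them.
words = [w.lower() for w in re.findall(r'[a-zA-Z]+', cleaned)]

print(words)       # every word, in order of appearance
print(set(words))  # unique words; set ordering is arbitrary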
Standard input is empty
['some', 'text', 'containing', 'some', 'repeating', 'words', 'it', 'contains', 'words', 'that', 'are', 'repeating']
{'text', 'contains', 'some', 'are', 'that', 'words', 'it', 'containing', 'repeating'}
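
The set output above loses the original word order and does not show which words actually repeat. A minimal sketch of one way to get both, assuming the same tokenization; the names unique_in_order, counts, and repeated are illustrative and not part of the original snippet:

```python
import re
from collections import Counter

text = "Some text, containing some repeating words. It contains words that are repeating"

# Same tokenization as the snippet above: lowercase runs of ASCII letters.
words = [w.lower() for w in re.findall(r'[a-zA-Z]+', text)]

# dict.fromkeys keeps first-seen order (Python 3.7+), unlike set().
unique_in_order = list(dict.fromkeys(words))

# Count occurrences, then keep only the words seen more than once.
counts = Counter(words)
repeated = [w for w in unique_in_order if counts[w] > 1]

print(unique_in_order)
print(repeated)
```

Unlike the set, both lists here come out in the same order on every run.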