I am creating a new column in pandas using `str.contains` and a regex. For example:
df.loc[df['random_words'].str.contains(r'^(?=.*SEA)(?=.*MAN)(?!.*WOMAN)(?!.*CHILD)'),'Person'] = 'Fisherman'
There are many keywords to insert into the code, and new data is continuously being integrated. The descriptions to be searched differ slightly from document to document. There are about 50 possible outcome keywords, which remain constant, but there is a lot of overlap between the search terms, and I would also like to account for possible typos. To make the process more effective and less time-consuming, I am now considering FuzzyWuzzy and a Levenshtein ratio instead of regex. Does anybody have a suggestion on how to achieve output similar to the regex above using FuzzyWuzzy (or other machine learning / NLP mechanisms, for that matter)?
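To illustrate the kind of behaviour I mean, here is a minimal sketch of the required/excluded keyword logic. It uses the standard library's `difflib.SequenceMatcher` as a stand-in for FuzzyWuzzy's `fuzz.partial_ratio` (the `fuzzy_contains`/`classify` helpers, the sliding-window approach, and the `0.8` threshold are all my own assumptions, not an established API):

```python
from difflib import SequenceMatcher

def fuzzy_contains(text, keyword, threshold=0.8):
    """Return True if `keyword` fuzzily occurs somewhere in `text`.

    Slides a window of the keyword's length over the text and keeps
    the best similarity ratio; SequenceMatcher.ratio() is used here
    as a stand-in for a Levenshtein-style score.
    """
    text, keyword = text.upper(), keyword.upper()
    n = len(keyword)
    best = 0.0
    for i in range(max(len(text) - n, 0) + 1):
        best = max(best, SequenceMatcher(None, text[i:i + n], keyword).ratio())
    return best >= threshold

def classify(text, required, excluded, threshold=0.8):
    """Mimic the regex: all `required` keywords present, no `excluded` ones."""
    return (all(fuzzy_contains(text, kw, threshold) for kw in required)
            and not any(fuzzy_contains(text, kw, threshold) for kw in excluded))
```

This could then be applied per row, e.g. `df.loc[df['random_words'].apply(lambda t: classify(t, ['SEA', 'MAN'], ['WOMAN', 'CHILD'])), 'Person'] = 'Fisherman'` — though I suspect FuzzyWuzzy itself, or a trained classifier, would scale better to 50 outcome keywords.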