Regular Expression to find certain words in a document

dsquared2247 · October 17, 2021, 8:27pm

I’m a total newbie with regular expressions. What I want to do:

This is to help solve NY Times Spelling Bee. In a Drafts doc, I have a running list of words that are useful in solving the bee. About 1200 words now. For the spelling bee, you want to make as many words as possible from 7 letters. Each word must contain the middle letter. So, I would like to create a regular expression that finds every word in my Drafts doc that contains the Bee’s middle letter where the rest of the word’s letters come from the other 6 letters of the Bee for that day. For example, today’s Bee has the letters “UNCIEDT” and the middle letter is “U”.

My Drafts doc contains on separate lines:

Aura, Guru, Deduct, Induct, Uplift, etc.
I want a regexp that finds only Deduct, Induct.
So far, I have used [U] to find all the words containing U, but that also returns Aura, Guru, Uplift etc.

Andreas_Haberle · October 18, 2021, 4:20am

Great idea!

Regular expressions are a good choice for many problems.
The best resource for learning them is https://regex101.com/

BUT regular expressions are not built for defining conditions. So a regular expression would only solve part of this task.

You might get a simpler solution with simple string processing.

The trick would be:

define a search condition (letters of the word and known positions)
put all words in a JavaScript list (an indexed array)
loop through all words, checking each against the conditions
output the results, if any
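
The steps above can be sketched in a few lines of Python. This is only a minimal sketch, assuming the rule from the first post (a word must contain the required letter, and all of its letters must come from the day's seven). The function name bee_words and its parameters are made up for illustration.

```python
def bee_words(words, letters, required):
    """Return the words that contain `required` and use only characters from `letters`."""
    allowed = set(letters.lower())      # the search condition: the day's seven letters
    required = required.lower()
    matches = []
    for word in words:                  # loop through all words
        w = word.strip().lower()
        # keep the word if it has the required letter and no letter outside the set
        if w and required in w and set(w) <= allowed:
            matches.append(word.strip())
    return matches


# sample words from the first post
words = ["Aura", "Guru", "Deduct", "Induct", "Uplift"]
print(bee_words(words, "UNCIEDT", "U"))  # ['Deduct', 'Induct']
```

The set comparison ignores how often a letter is used, which is why "Deduct" passes even though D appears only once in the day's letters.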

Though Drafts is able to do this with a JavaScript action and a UI for the prompting and output, I would suggest thinking about an app like Pyto or Scriptable to solve this.

If you need help feel free to ask.

sylumer · October 18, 2021, 6:24am

The only way I can think to do that in pure regular expressions is to build it up with a lot of ors ("|"). Your middle letter is the challenge. I have never seen anything in regular expression notation to include a dynamic length comparison of two parts. Therefore, I think you would have to build the alternatives manually and explicitly and chain them all together with the ors.

From your description, I think this would cover three and four letter words with “u” as the middle letter, and you would then expand this out up to seven letters.

^(.)[u](.)$|^(..)[u](.)$|^(.)[u](..)$
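
If typing every alternative by hand gets tedious, the same kind of pattern could be generated. The sketch below assumes "middle" means the central position (either of the two central positions for even lengths), which is how the three alternatives above read. The helper name middle_letter_pattern and its defaults are hypothetical.

```python
import re


def middle_letter_pattern(letter, min_len=3, max_len=7):
    """Build one anchored alternative per word length and central position."""
    alternatives = []
    for length in range(min_len, max_len + 1):
        # odd lengths have one central index, even lengths have two
        if length % 2:
            centres = [length // 2]
        else:
            centres = [length // 2 - 1, length // 2]
        for i in centres:
            before, after = i, length - i - 1
            alternatives.append(f"^.{{{before}}}{re.escape(letter)}.{{{after}}}$")
    return "|".join(alternatives)


pattern = re.compile(middle_letter_pattern("u"), re.IGNORECASE)
for word in ["Aura", "Guru", "Deduct", "Induct", "Uplift"]:
    print(word, bool(pattern.search(word)))
```

On the sample words from the first post this matches Aura, Guru, Deduct and Induct but not Uplift: the pattern constrains where the "u" sits, not which other letters are allowed.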

I think you would be better off with relying more on coding of a solution rather than focusing on a regular expressions solution for this.

Hope that helps.

agiletortoise · October 18, 2021, 12:29pm

Closest I can think, assuming one word per line, would be ^[UNCIEDT]+$ which would find all the lines that contain only those letters.

dsquared2247 · October 18, 2021, 8:17pm

This is good but it finds words that don’t contain ‘U’. Thanks.

agiletortoise · October 18, 2021, 8:25pm

That’s where you get into the suitability of regular expression for this task. Regex is for finding patterns, but not for applying conditional logic and you will need some amount of logic applied to get exactly what you are looking for in this case.
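
One way to add that bit of logic is to let the regex do the pattern part and check the required letter in code. A rough sketch in Python, assuming one word per line; the name find_bee_words is invented for illustration, and "Edit" is a made-up extra word added to show a line without "U".

```python
import re

# lines that consist only of the day's letters, in any case
ONLY_BEE_LETTERS = re.compile(r"^[UNCIEDT]+$", re.IGNORECASE | re.MULTILINE)


def find_bee_words(text, required="u"):
    """Regex finds the candidate lines; a plain membership test applies the condition."""
    candidates = ONLY_BEE_LETTERS.findall(text)
    return [w for w in candidates if required.lower() in w.lower()]


draft = "Aura\nGuru\nDeduct\nInduct\nUplift\nEdit\n"
print(ONLY_BEE_LETTERS.findall(draft))  # ['Deduct', 'Induct', 'Edit']
print(find_bee_words(draft))            # ['Deduct', 'Induct']
```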

dsquared2247 · October 19, 2021, 1:35am

Yes, I think you are right! Thank you.

Andreas_Haberle · October 19, 2021, 2:34am

It is a common practice of mine to split one complex request into two or more simple ones. Whether doing that with regular expressions is a good idea, I would not judge…

This is only a concept example, NOT the solution.

The request "find a word with 8 letters containing no letter U" could be split up into:

Find all the 5 letter words
Make sure there is no D in it

The nice thing about simple string processing (lines of text or simple words) is the length function (OK, you might have to preprocess your Draft to get the words).
A simple Python script for illustration (tested on my iPad with Pyto):

use_file = False
if use_file:
    # get every word from a file, one per line
    with open('wordlist.txt') as f:
        words = f.read().splitlines()
else:
    words = """dachshund
hinterland
Zettelkasten
kaputt
angst
geist
rucksack
""".splitlines()

# get the search request from the user
search_length = int(input('how many characters: '))
forbidden_letters = input('the forbidden character: ')
for word in words:  # process every word
    if len(word) == search_length:
        # keep the word only if none of the forbidden letters occurs at ANY position
        # (the comparison is case-sensitive, as in the word list above)
        if all(letter not in word for letter in forbidden_letters):
            print(f'- found {word}')

Example call:

how many characters: 5
the forbidden character: d
- found angst
- found geist
>>>

Is this helpful?

sylumer · October 19, 2021, 5:57am

Are you sure? Surely those two steps give you 5 letter words with no D, not 8 letter words with a U as you state in the sentence that precedes it.

To extrapolate out your steps to the original request, I believe you would have the following for complete matching:

Match all the 3-7 letter words; though given the word list is predefined for this purpose I suspect that no such match needs to be automated.
Retrieve each word from the list and check that it has “u” as the middle character; not contains “u”, but explicitly is the middle character as per the definition and examples in the first post.
If the middle letter is “u”, check that all letters can be found at least an equal number of times in the original word; we can’t use a letter twice if it only appears once in the original word for example.
Store/output words that have successfully been matched against the above points.
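
The four steps above could look roughly like this in Python. It is a sketch of this stricter reading only, assuming "middle" means the central character (or either of the two central characters for even lengths) and that a letter may be used no more often than it appears in the day's letters. The names central_part and strict_matches are hypothetical.

```python
from collections import Counter


def central_part(word):
    """Return the middle character, or the two central characters for even lengths."""
    mid = len(word) // 2
    return word[mid] if len(word) % 2 else word[mid - 1:mid + 1]


def strict_matches(words, letters, middle):
    available = Counter(letters.lower())
    middle = middle.lower()
    found = []
    for word in words:
        w = word.lower()
        if not 3 <= len(w) <= 7:           # step 1: only 3-7 letter words
            continue
        if middle not in central_part(w):  # step 2: letter must sit in the middle
            continue
        # step 3: Counter subtraction keeps only letters used more often than available
        if Counter(w) - available:
            continue
        found.append(word)                 # step 4: store the match
    return found


print(strict_matches(["Aura", "Guru", "Deduct", "Induct", "Uplift"], "UNCIEDT", "U"))
# ['Induct']
```

Under this reading "Deduct" drops out because it uses D twice, so the result from the first post (Deduct, Induct) corresponds to a rule where letters may be reused.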

Andreas_Haberle · October 20, 2021, 9:04pm

I aimed at describing my concept not solving the original problem.

I edited my parts of the thread.

Hope that does not mess up everything.

BrettEllingson · March 8, 2025, 5:08am

Skimming this post in the future I can see the logic of your solution (even if it doesn’t precisely solve the problem).

Well done!