FAQ Search
Memberlist Usergroups
Profile
  Forum Statistics Register
 Log in to check your private messages
Log in to check your private messages
Moonpod Homepage Starscape Information Mr. Robot Information Free Game Downloads Starscape Highscore Table
Scrabble lookup coding challenge!
Goto page 1, 2, 3  Next
Post new topic   Reply to topic    Discussion Pod Forum Index -> Independent Game Development View previous topic :: View next topic  
 Author
Message
Weeble
Starscape Jedi
Starscape Jedi


Joined: 25 Apr 2003
Posts: 1143
Location: Glasgow, Scotland



PostPosted: Mon Jul 14, 2008 10:26 pm    Post subject: Scrabble lookup coding challenge! Reply with quote

I think we derailed this thread over here with all the talk of algorithms for finding possible all the possible words that can be built with a given set of letters. Let's have a thread specifically for it.

I don't know if it's exactly the problem that Poo Bear had to solve, but the challenge seems to be this:
  • Given a list of all allowed words, and a set of letters (possibly containing duplicates) find all arrangements of a subset of the letters that form valid words.


I'm working in Python right now, because it's easy, but if we're getting really competitive I might consider rewriting in C++. Here's what I did:

Code:

alphabet="ABCDEFGHIJKLMNOPQRSTUVWXYZ"
def bitcodeword(word):
    """Return a 26-bit bitfield where each 1 bit indicates
       the corresponding letter of the alphabet is present
       in the word. lsb is 'A', msb is 'Z'.
    """
    bit=1
    code=0
    for letter in alphabet:
        if letter in word:
            code|=bit
        bit<<=1
    return code   
def canmakeword(allowedletters,word):
    """Given a dictionary mapping letters to the number of
       each of those letters available, determine if the
       given word can be made by rearranging some or all of
       the letters."""
    remainingletters=dict(allowedletters)
    for ch in word:
        x = remainingletters[ch]-1
        if x<0:
            return False
        remainingletters[ch]=x
    return True

def lettercounters(letters):
    """Create a dictionary mapping letters to the number of
       each of those letters found in the provided
       string or sequence."""
    counters=dict((ch,0) for ch in alphabet)
    for l in letters:
        counters[l]+=1
    return counters

class WordTable(object):
    def __init__(self,filename):
        self.wordtable=[(word,bitcodeword(word))
                        for word in
                        (line.strip() for line in file(filename))
                        ]
    def wordlist(self):
        """Get all the words in the table."""
        return (word for (word, code) in self.wordtable)
    def naivefind(self,letters):
        """Given a set of letters, find all the words that
           can be made by rearranging some or all of them.
           This is pretty slow."""
        allowedletters=lettercounters(letters)
        return (word for word in
                self.wordlist()
                if canmakeword(allowedletters,word))
    def filterwords(self,letters):
        """Given a set of letters, find all the words
           that can be made by rearranging and multiplying
           some or all of them."""
        wordcode=bitcodeword(letters)
        return (word for (word, code) in
                self.wordtable
                if code&wordcode==code)
    def betterfind(self,letters):
        """Given a set of letters, find all the words that
           can be made by rearranging some or all of them.
           This is much faster, because it performs a fast
           test on each candidate word that will eliminate
           all those that need letters we don't have."""
        allowedletters=lettercounters(letters)
        return (word for word in
                self.filterwords(letters)
                if canmakeword(allowedletters,word))

Usage:
Code:

wt=WordTable("sowpods.txt")
list(wt.betterfind("STRAIN"))
['AI', 'AIN', 'AINS', 'AIR', 'AIRN', 'AIRNS', 'AIRS', 'AIRT', 'AIRTS', 'AIS', 'AIT', 'AITS', 'AN', 'ANI', 'ANIS', 'ANT', 'ANTI', 'ANTIS', 'ANTS', 'AR', 'ARIS', 'ARS', 'ART', 'ARTI', 'ARTIS', 'ARTS', 'AS', 'ASTIR', 'AT', 'IN', 'INS', 'INSTAR', 'INTRA', 'IS', 'ISNA', 'IT', 'ITA', 'ITAS', 'ITS', 'NA', 'NARIS', 'NAS', 'NAT', 'NATIS', 'NATS', 'NIS', 'NIT', 'NITS', 'RAI', 'RAIN', 'RAINS', 'RAIS', 'RAIT', 'RAITS', 'RAN', 'RANI', 'RANIS', 'RANT', 'RANTS', 'RAS', 'RAST', 'RAT', 'RATS', 'RIA', 'RIANT', 'RIAS', 'RIN', 'RINS', 'RIT', 'RITS', 'SAI', 'SAIN', 'SAINT', 'SAIR', 'SAN', 'SANT', 'SANTIR', 'SAR', 'SARI', 'SARIN', 'SAT', 'SATI', 'SATIN', 'SI', 'SIN', 'SIR', 'SIT', 'SITAR', 'SNAR', 'SNIRT', 'SNIT', 'SRI', 'ST', 'STAIN', 'STAIR', 'STAR', 'STARN', 'STIR', 'STRAIN', 'STRIA', 'TA', 'TAI', 'TAIN', 'TAINS', 'TAIS', 'TAN', 'TANS', 'TAR', 'TARN', 'TARNS', 'TARS', 'TARSI', 'TAS', 'TI', 'TIAR', 'TIARS', 'TIN', 'TINS', 'TIS', 'TRAIN', 'TRAINS', 'TRANS', 'TRIN', 'TRINS', 'TSAR']

I can't really tackle the slow loading problem. It takes a few seconds to load on my laptop. But what I'm interested in here is optimising the search time. The naive search takes a second or so for most searches. (My favourite test data is "RESTRAINTS".) The optimised search takes about a tenth of that. I imagine it could be substantially faster again in C++. The only trick for optimisation is that I make a 26-bit bitfield for each word in the dictionary, where each bit indicates whether the corresponding letter appears in the word. It doesn't indicate how many times the letter appears, just that it appears at least once. This lets me discard most words very quickly, but I'm fundamentally still ploughing through the whole dictionary.

I'm curious if there's a better algorithm for the problem. Colinvella suggests a trie combined with permutations of the input letters, but as Slyh points out, that will suffer from huge numbers of permutations for large numbers of letters, far more than there are words in the dictionary. I wonder if there's some way to be smarter, perhaps by preprocessing or cross-indexing the word table in some interesting way. Or maybe we could have the permutation generator give up on a branch of the permutation every time it can see that it's not going to find a word in the dictionary. Actually, that sounds quite promising.

PS - I've not really Googled the problem yet, because I'm enjoying it for the challenge, but I would not be surprised if there are well-known solutions for this problem.

EDIT - I fixed the WordTable constructor. It was ignoring the filename argument and always loading "sowpods.txt".
Back to top
View user's profile Visit poster's website MSN Messenger
colinvella



Joined: 15 Sep 2006
Posts: 82



PostPosted: Tue Jul 15, 2008 12:25 pm    Post subject: Reply with quote

Ach... I didn't notice the new thread.. but let me copy in what I posted in the old one:

-------------------------------
There might be a way of pruning out the combos by doing the trie searches simultaneously with the combo construction.

For example:
---
You choose a first letter and check if the trie map root has a child node for that letter (trivially true for the 1st letter as you can find words starting with all the letters in the alphabet, given a comprehensive dictionary)

Repeat the process by selecting the next letter from the remaining valid letters and check if the trie tree can be navigated with the current combo. If not, stop working with this combo and start with a new one.
---

I would agree with you on the performance of hash maps, were it not for the fact that the hashing function for the strings is itself typically of Order(N). So even assuming no bin conflicts (usually not an issue since good hash map implementations grow automatically to counter this), a trie map should more or less perform better. Bear also in mind that some trie implementations actually collapse degenerate sections of their branches so that given a word of M letters, you don't necessarily have to traverse M nodes. Also, you can exit early from the search as soon as it is clear that the word is not in the dictionary given the current initial sub-string. For example, with a word like "xzfoopy", the search algo will terminate as soon as it tests the letter "z" because there are no valid words starting with "xz" (again my assumption!)
-------------------------------

I'm tempted to give my idea a go with C#. Can any one please spare me the dictionary file?
Cheers
Back to top
View user's profile
Poo Bear
Pod Team
Pod Team


Joined: 14 Oct 2002
Posts: 4121
Location: Sheffield, UK



PostPosted: Tue Jul 15, 2008 12:37 pm    Post subject: Reply with quote

http://www.stackwords.com/sowpods.zip

you can download it here
Back to top
View user's profile Visit poster's website
Weeble
Starscape Jedi
Starscape Jedi


Joined: 25 Apr 2003
Posts: 1143
Location: Glasgow, Scotland



PostPosted: Tue Jul 15, 2008 12:50 pm    Post subject: Reply with quote

Yes, that sounds like a good approach. You will lop off huge chunks of the search where it would otherwise have been looking for all the combinations beginning with letters like "TSN" or whatever.

The other nice thing I realised the Python implementation gives you for free is that it's a generator, so you can run it incrementally, just like Poo Bear's implementation which runs in something like 10ms chunks. It can be quite frustrating trying to translate generators into C++; in extreme cases it involves writing whole state machines.
Back to top
View user's profile Visit poster's website MSN Messenger
colinvella



Joined: 15 Sep 2006
Posts: 82



PostPosted: Tue Jul 15, 2008 1:12 pm    Post subject: Reply with quote

Thanks... so the tests will be comparable.. ok let me give it a shot!
Back to top
View user's profile
colinvella



Joined: 15 Sep 2006
Posts: 82



PostPosted: Tue Jul 15, 2008 3:52 pm    Post subject: Reply with quote

oookaay...

my results using the simultaneous trie combo technique:

loading time into a plain list of 267751 words: 499ms
trie tree construction: 2439ms

simple search for "xylophone": <1ms

scrabble search for "xylophone" as set of valid letters: 8ms

valid words (can we please match for correctness?):
eh
el
en
enol
eon
epoxy
ex
exo
exon
expo
he
helo
help
hen
hep
hex
hexyl
hey
ho
hoe
hole
holey
holon
holp
holpen
holy
hon
hone
honey
hoo
hooey
hooley
hooly
hoon
hoop
hop
hope
hox
hoy
hoyle
hye
hyen
hyle
hyp
hype
hypo
leno
lep
lex
ley
lo
lone
loo
looey
loon
looney
loony
loop
loopy
lop
lope
lox
loy
lye
lyne
lynx
ne
nep
no
noel
noh
nole
nolo
noo
noop
nope
nox
noy
ny
nye
oe
oh
oho
ohone
ole
oleo
olpe
on
one
onely
only
ono
ony
onyx
oo
ooh
oon
oop
op
ope
open
openly
ox
oxen
oxo
oxy
oy
oye
pe
peh
pelon
pen
peon
peony
phenol
phenoxy
phenyl
pheon
phlox
pho
phon
phone
phoney
phono
phony
phooey
phyle
phylon
pleon
plex
ploy
ply
po
poh
pol
pole
poley
poleyn
polo
polony
poly
pone
poney
pony
poo
pooh
pool
poon
pox
poxy
pye
pylon
pyne
pyx
xylophone
ye
yeh
yelp
yen
yep
yex
yo
yon
yoop



[edit] p.s. I am running this as a C# app on an XPS M1710 DELL laptop. Some of the words seemed wierd and I thought it was due to a bug.. but all the spot checks I did resulted in valid words according to the given dictionary.


I'd be more than happy to provide the project source by the way Smile
Back to top
View user's profile
Weeble
Starscape Jedi
Starscape Jedi


Joined: 25 Apr 2003
Posts: 1143
Location: Glasgow, Scotland



PostPosted: Tue Jul 15, 2008 4:19 pm    Post subject: Reply with quote

Running the same query through the Python brute-force method gives:

eh
el
en
enol
eon
epoxy
ex
exo
exon
expo
he
helo
help
hen
hep
hex
hexyl
hey
ho
hoe
hole
holey
holon
holp
holpen
holy
hon
hone
honey
hoo
hooey
hooley
hooly
hoon
hoop
hop
hope
hox
hoy
hoyle
hye
hyen
hyle
hyp
hype
hypo
leno
lep
lex
ley
lo
lone
loo
looey
loon
looney
loony
loop
loopy
lop
lope
lox
loy
lye
lyne
lynx
ne
nep
no
noel
noh
nole
nolo
noo
noop
nope
nox
noy
ny
nye
oe
oh
oho
ohone
ole
oleo
olpe
on
one
onely
only
ono
ony
onyx
oo
ooh
oon
oop
op
ope
open
openly
ox
oxen
oxo
oxy
oy
oye
pe
peh
pelon
pen
peon
peony
phenol
phenoxy
phenyl
pheon
phlox
pho
phon
phone
phoney
phono
phony
phooey
phyle
phylon
pleon
plex
ploy
ply
po
poh
pol
pole
poley
poleyn
polo
polony
poly
pone
poney
pony
poo
pooh
pool
poon
pox
poxy
pye
pylon
pyne
pyx
xylophone
ye
yeh
yelp
yen
yep
yex
yo
yon
yoop

EDIT - Looks like it's the same! Very Happy
Back to top
View user's profile Visit poster's website MSN Messenger
colinvella



Joined: 15 Sep 2006
Posts: 82



PostPosted: Tue Jul 15, 2008 4:22 pm    Post subject: Reply with quote

Perfect match Smile
Back to top
View user's profile
Slyh



Joined: 25 Nov 2004
Posts: 480
Location: Karlsruhe, Germany



PostPosted: Tue Jul 15, 2008 7:03 pm    Post subject: Reply with quote

colinvella wrote:
my results using the simultaneous trie combo technique:

loading time into a plain list of 267751 words: 499ms
trie tree construction: 2439ms

2.5 seconds seems quite long. I guess there is some potential for optimization?!

colinvella wrote:
simple search for "xylophone": <1ms

scrabble search for "xylophone" as set of valid letters: 8ms

Quite impressive!
You cheated a little, though. Smile As there is an x and a y in your list of letters, which both are not a very common first letters in English words, your algorithm basically only has to look for "lophone" in the Trie. This results in a lot less relevant word combinations than the original set of letters. (Correct me if I'm wrong!)
How does the algorithm perform for this set of letters?: "eettaoinshrd"
(This will match 2633 words of my Scrabble list.)

colinvella wrote:
I'd be more than happy to provide the project source by the way Smile

I'd be more than happy to look at it. Smile
Back to top
View user's profile
colinvella



Joined: 15 Sep 2006
Posts: 82



PostPosted: Tue Jul 15, 2008 8:47 pm    Post subject: Reply with quote

Slyh wrote:
colinvella wrote:
my results using the simultaneous trie combo technique:

loading time into a plain list of 267751 words: 499ms
trie tree construction: 2439ms

2.5 seconds seems quite long. I guess there is some potential for optimization?!

colinvella wrote:
simple search for "xylophone": <1ms

scrabble search for "xylophone" as set of valid letters: 8ms

Quite impressive!
You cheated a little, though. Smile As there is an x and a y in your list of letters, which both are not a very common first letters in English words, your algorithm basically only has to look for "lophone" in the Trie. This results in a lot less relevant word combinations than the original set of letters. (Correct me if I'm wrong!)
How does the algorithm perform for this set of letters?: "eettaoinshrd"
(This will match 2633 words of my Scrabble list.)

colinvella wrote:
I'd be more than happy to provide the project source by the way Smile

I'd be more than happy to look at it. Smile


For the letters "eettaoinshrd" the query took 131ms and generated precisely 2633 words from "ad" to "tsar".

You do have a point that the chosen set was not a good representative. Would you say that the one you gave me is an average set or is it at the other extreme?

There is definitely room for optimisation.. Trie construction takes over 2 seconds but is an initialisation procedure that needs to be done only once... still I'm sure there's room for improvement. For starters, the construction is recursive.. I could have easily done it iteratively.

Secondly, the individual word search can easily be converted from recursive to iterative form as the recursion is degenerate like that of a binary search or a factorial function.

The scrabble search is more naturally coded recursively due to the ongoing combo construction. The potential combinatorial explosion of this process is pruned extensively by navigating down the trie trie. This search could also be formulated iteratively by using a manual stack structure to hold the state, which might make it a bit more efficient by a constant factor... but is a bit more complicated to code.

If you are willing to share an email or write to colinvella at gmail dot com I'll be happy to share the project. Smile


Last edited by colinvella on Tue Jun 22, 2010 7:59 am; edited 2 times in total
Back to top
View user's profile
colinvella



Joined: 15 Sep 2006
Posts: 82



PostPosted: Tue Jul 15, 2008 8:53 pm    Post subject: Reply with quote

One further note is that my algo may generate repeated words when letters occur more than once in the "set". For example, the set "hello" may give two instances of "hell" and "helo" (according to the given dictionary). The duplicates occur as a result of the permutations of duplicate letters.. the second instance of "hell" is the same as the first instance with the "L"s transposed. Hence, I need to perform a final cleanup, which I choose to do by sorting the list and iterating once to eliminated subsequent duplicates.

All the quoted times do include this last step btw Smile
Back to top
View user's profile
Weeble
Starscape Jedi
Starscape Jedi


Joined: 25 Apr 2003
Posts: 1143
Location: Glasgow, Scotland



PostPosted: Tue Jul 15, 2008 9:33 pm    Post subject: Reply with quote

How long does it take with this input?
"BANANANANANA"
(So chosen because the duplicate letters provide a *lot* of ways of making the same words over and over again.)
Back to top
View user's profile Visit poster's website MSN Messenger
colinvella



Joined: 15 Sep 2006
Posts: 82



PostPosted: Wed Jul 16, 2008 9:01 am    Post subject: Reply with quote

137ms

output:
aa
ab
aba
an
ana
anan
anana
ann
anna
ba
baa
ban
banana
na
naan
nab
nan
nana
nanna
Back to top
View user's profile
Rup



Joined: 19 May 2003
Posts: 363
Location: London, UK



PostPosted: Wed Jul 16, 2008 10:11 am    Post subject: Reply with quote

colinvella wrote:
There is definitely room for optimisation.. Trie construction takes over 2 seconds but is an initialisation procedure that needs to be done only once... still I'm sure there's room for improvement. For starters, the construction is recursive. I could have easily done it iteratively.

How are you storing the trie - as a separate C# object for each node? If this was C++ my gut reaction would be that the time is memory allocation, except that I don't think we're dealing with large objects anyway and I'm fairly sure the runtime gets memory from the OS (the slow bit) in multi-page blocks anyway. I don't know how that translates into C#'s object allocation. If you wanted to try and speed it up, though, I'd try implementing the trie as a single-object flat data structure e.g. storing offsets not pointers.

Frankly I don't think it's worth it, though - 3 seconds load time is nothing. Throw up a progress bar or a pretty picture and no-one will care.

I'm not sure what you mean by recursive / iterative for the trie construction; do you add the word at the top level each time or do you read the next word at the last leaf? If you're reading in a sorted list of words then the next word will generally be near the current leaf, so from the last leaf:

1. read next word from disk / memory
2. compare against last word; back-track up the tree until first letter mismatch
3. add nodes for the remaining letters until reach leaf

You could even store the dictionary in a format appropriate to that; an old shareware Scrabble game (that I cannibalised to make a Countdown game for my brother) stored its dictionary as something like as

<byte> backtrack count
new characters until next byte < 32 or EOF

i.e. to store "open, openly, ox, oxen" from your examples above you'd store "open0ly5x0en". That form can be trivially used to rebuild a tree. (Except back then I didn't know about those and just scanned the expanded word list in memory.) This does falls down though if you need to make some words capitalised only unless you treat the cases separately but you could introduce a marker for that - an end-of-word-that-needs-to-be-capitalised byte, say - or set the top bit on the backtrack count or something. And since we're only working on a small alphabet and a limited set of backtrack counts this would all Huffman-code well.
Back to top
View user's profile
colinvella



Joined: 15 Sep 2006
Posts: 82



PostPosted: Wed Jul 16, 2008 7:28 pm    Post subject: Reply with quote

I'm storing the trie as a bunch of linked nodes where each node is an instance of a TrieNode class. Each of these instances has 26 references for all the potential children and a boolean flag that indicates if the current path from the root represents a complete word. Memory allocation for the instances in .NET are ultimately resolved as native allocations so that's definitely not going to be faster than a native implementation. I could definitely be held in a flat array using indices. Actually, you could use a structure (forgot the name) such that for node at index N, it's children are 26N to 26N + 25. However, the trie tree will hardly be balanced in practice and mayy result in a lot of wasted space.. still it's 2008 so that shouldn't really be an issue Smile

By recusrive I mean defining, for example, the word insertion function in terms of itself. Here's an actual sample - the InsertWord method of TrieNode:

Code:

        public void InsertWord(String p_strWord)
        {
            char ch = p_strWord[0];

            int nOffset = ch - 'a';

            TrieNode trieNodeChild = Nodes[nOffset];
            if (trieNodeChild == null)
            {
                trieNodeChild = new TrieNode();
                Nodes[nOffset] = trieNodeChild;
            }

            if (p_strWord.Length == 1)
            {
                trieNodeChild.CompleteWord = true;
                return;
            }

            trieNodeChild.InsertWord(p_strWord.Substring(1));
        }


The above is the 'academically intuitive' way of implementing the algorithm. The iterative version would require a method external to the class, or perhaps a static method, to iterate from the root node down the appropriate path using a loop rather than calling the InsertWord method of a child node (i.e. recursively)

The following is the TrieNode class method for looking up a word in the dictionary. This is done by calling the function on the root node and let it recurse in terms of its descendant nodes:

Code:

        public bool ContainsWord(String p_strWord)
        {
            if (p_strWord.Length == 0)
                return false;

            char ch = p_strWord[0];

            int nOffset = ch - 'a';

            TrieNode trieNodeChild = Nodes[nOffset];
            if (trieNodeChild == null)
                return false;

            if (p_strWord.Length == 1 && trieNodeChild.CompleteWord)
                return true;

            return trieNodeChild.ContainsWord(p_strWord.Substring(1));
        }


Again, the above is easily converted from recursion to a loop.

Finally, here is the ScrabbleMatch algo:

Code:

        public List<String> ScrabbleMatch(String p_strValidLetters)
        {
            List<String> listWords = new List<String>();

            if (p_strValidLetters.Length == 0)
                return listWords;

            if (p_strValidLetters.Length == 1)
            {
                char ch = p_strValidLetters[0];

                int nOffset = ch - 'a';

                TrieNode trieNodeChild = Nodes[nOffset];

                if (trieNodeChild != null && trieNodeChild.CompleteWord)
                    listWords.Add(p_strValidLetters);

                return listWords;
            }

            // combo test
            for (int nIndex = 0; nIndex < p_strValidLetters.Length; nIndex++)
            {
                char ch = p_strValidLetters[nIndex];

                int nOffset = ch - 'a';

                TrieNode trieNodeChild = Nodes[nOffset];

                if (trieNodeChild == null)
                    continue;

                if (trieNodeChild.CompleteWord)
                    listWords.Add(ch + "");

                String p_strValidLettersSubset = p_strValidLetters.Remove(nIndex, 1);
                List<String> listSubWords = trieNodeChild.ScrabbleMatch(p_strValidLettersSubset);

                foreach (String strSubWord in listSubWords)
                    listWords.Add(ch + strSubWord);
            }

            return listWords;
        }
    }


The last method is harder to turn into a loop as at each invocation, it may call itself up to 26 times in a given level due to the generation of the combinations. As I said earlier though, most of these are eliminated by checking against the current path in the trie tree.. effectively checking if the current substring of the combination exists before permuting it further.
Back to top
View user's profile
Display posts from previous:   
Post new topic   Reply to topic    Discussion Pod Forum Index -> Independent Game Development All times are GMT
Goto page 1, 2, 3  Next
Page 1 of 3

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum


Powered by phpBB © 2001, 2005 phpBB Group