I was recently given a small programming challenge by a friend, which entailed splitting a text file into multiple text files, based on word length. I got run time down to 11 seconds. Can you do better?
The text file in question is a listing of dictionary words, A through Z, containing a total of 234,936 words. The goal is to have the program create a file containing words of each length in individual files. For example, all the words in the main list that are nine characters long would end up in a file called 9.txt.
Sounds pretty straightforward, no? My first swag at this didn’t use any pre-processing at all, and took a whopping twelve minutes to run. The initial draft, which gave the correct output, followed this flow:
- Open the main file
- Read a line, determine length of word
- Create filename, open file for append, write to file, close file
- Repeat until done
The speed killer in this process is the constant opening/writing/closing of files. The program was basically strangling itself on disk I/O operations, which resulted in a run time of about 11 minutes.
I gave the output to my friend, and he was happy. However, I knew I could improve this program. By doing some basic preprocessing, and performing most operations in memory, run time for the same list is down to about 11 seconds, compiled. Withing the IDE, run time is about 17 seconds.
The flow of operations is basically the same, with some exceptions:
- Get a count of words in the file
- Read the words into an array
- Scan the array to find the longest word length
- For all words of length n, put into a string
- Write entire string to file at once
- Repeat until all lengths are complete
While there are more steps in this method, it is much faster due to the large reduction of small disk I/O operations. Rather than writing each individual string to each file, almost 235,000 times, this method only writes to disk 24 times.
Think of it this way:
For words of length 12, there 20,460 words. You could write to disk 20,460 times, or you could write to disk once. Which would be faster?
I’ve attached the project and word list to this post. I can see at least one way to speed this up a bit more, but I was interested to see if anyone else could think of a way to speed it up? Can we get this to run even faster? I look forward to seeing your solutions!
Concepts
There are a few concepts demonstrated in this program:
- File input/output
- Basic string processing
- Dynamic arrays
Attachments
File | Uploaded | Size |
---|---|---|
389-20170924-080222-wordthing.zip | 9/24/2017 8:02:22 AM | 755606 |