James Joyce

    If you have ever wondered what it might be like to read minds, then consider reading James Joyce’s Ulysses.  But do not expect to understand what the minds are thinking or whatever they are doing, unless you are versed in the classics, as well as ancient and modern languages, and your vocabulary consists of about a half million words.

    According to Wikipedia, Ulysses contains more than 30,000 distinct words, but what exactly is a word?  To be sure, writers invent new words—Shakespeare invented a great many new words.

     It also goes without saying (but I am saying it) that words should not be expected to appear in the dictionary before they are invented.  Nor should every newly made up word be lookupable in a pocket dictionary.  However, it seems only fair to expect legitimate candidate words to find their way into a moderately large lexicon within, say, a hundred years after their first appearance in a significant book.

    A few years ago, in order to play with word puzzles and such, I compiled a Lexicon containing more than a quarter million English language words, about as many as the total number of words counting repetitions in Ulysses.  Words in this lexical database were gleaned from multiple on-line sources, but not from the book Ulysses itself.  So the question occurred to me, “what are the odds that a random ‘word’ from Ulysses exists in this database?”

    In my student days this kind of puzzle was called an empirical question.  It should only be necessary to check all unique words in the book against the contents of the selected reference Lexicon.  Such a comparison would ignore case, of course, and perhaps should include a few additional tweaks.  For a meaningful calculation it might be necessary to eliminate proper names and non-English language words (Ulysses has more than a few of these).  In the end it may not be practical to propose a precise answer to the odds question, due to its inherent ambiguities, or due to failing to notice or purposely ignoring subtleties, in preparing a Ulysses lexicon.

    The text of Ulysses is readily available for analysis—or for that matter, for reading—thanks to Project Gutenberg.  To the computer programmer, words are more-or-less equivalent to space-delimited substrings, after superfluous punctuation has been removed.  Internal punctuation like hyphens and apostrophes might need to be preserved.  I know of no perfect formula for extracting words from text, but space-delimited substrings are a fair approximation or starting point.  A more rigorous approach would be needed for real research.
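
    The space-delimited approximation described above can be sketched in a few lines.  This is a Python illustration, not the MUMPS code used for the exercise; the function name quasi_words and the exact set of characters trimmed are assumptions.  Punctuation is trimmed only from the ends of each piece, so internal hyphens and apostrophes are preserved.

```python
import string

# Characters trimmed from the ends of each space-delimited piece (an assumed set).
# Hyphens and apostrophes inside a word survive because only the ends are trimmed.
EDGE_PUNCT = string.punctuation + "“”‘’"

def quasi_words(line):
    """Split a line on whitespace, trim edge punctuation, and lowercase."""
    words = []
    for piece in line.split():
        word = piece.strip(EDGE_PUNCT).lower()
        if word:  # skip pieces that were nothing but punctuation
            words.append(word)
    return words

print(quasi_words("Stately, plump Buck Mulligan's well-worn bowl."))
```

    Trimming only the ends is a deliberate simplification: it keeps forms like well-worn intact, but it also keeps anything else that happens to be glued together inside a piece.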

    In any case, it should be interesting to compare the size of the Ulysses lexicon, as computed by this approximation method, to the Wikipedia article’s 30,000 number. Such a comparison would serve as a rough validity check on the method.

    And the number is 30,800.  On quick inspection of the imported Ulysses lexicon, more than a few words have a long dash stuck to them (a double hyphen ‘--’ in the text rendering of the book).  Re-do the import, removing long dashes. Good, now the number is 30,200.  I did not screen out proper names or non-English words, so 30,200 will do—it is close enough.  The beauty of the MUMPS programming language is that one can do this sort of quick-study very quickly. Creating a Lexicon database and supporting code to read in the book, clean up the words, and file them, count the number of occurrences of each, etc. took only about an hour.  I will summarize the method below.
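
    A toy illustration of why the long dashes matter, again in Python rather than MUMPS.  The sample sentence and the function name split_pieces are made up, and replacing ‘--’ with a space (rather than deleting it outright) is an assumption about what “removing” should mean.

```python
def split_pieces(line, fix_dashes=True):
    """Lowercase a line and split it on whitespace, optionally breaking at '--'."""
    if fix_dashes:
        # Replace the double hyphen with a space so the words on either side separate.
        line = line.replace("--", " ")
    return line.lower().split()

sample = "wait--he said--wait"
glued = set(split_pieces(sample, fix_dashes=False))  # dashes left in place
clean = set(split_pieces(sample))                    # dashes replaced by spaces

print(sorted(glued))
print(sorted(clean))
```

    With the dashes left in, each glued pair is counted as its own spurious unique “word,” which inflates the size of the lexicon.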

    But, before doing that let’s look at a few imported words.  Ha! The 74th H word (in alphabetic order) is hairynostrilled.  I do not remember it, but am nevertheless confident of two things: 1) it is in Ulysses and 2) it will not be in the other-sourced Lexicon.  True enough, it is here, “Ben Jumbo Dollard, Rubicund, musclebound, hairynostrilled,...” and it is not in the quarter-million-word comparison lexicon.

    Let’s try an on-line source: http://www.oxforddictionaries.com/

No results

    Perhaps intentionally selecting a compound word that looked funny was not fair...

    The 541st “M” word (in alphabetic order) is one I don’t know, but it looks like a word and I probably should know it.  It is maugre:  “But sir Leopold was passing grave maugre his word by cause he still had...”  Yes, it is in the big Lexicon and it is also in the dictionary.  The word means “in spite of,” and the on-line dictionary traces it to an Old French phrase meaning “bad pleasure.”  James Joyce knew something about that subject.

    Let’s try a “T” word.  Well, I have just learned another word, tatterdemalion: “feeble goosefat whore in a tatterdemalion gown of mildewed strawberry...”  It is in the big lexicon and apparently means something like ragamuffin.

    Skimming the Ulysses lexicon is revealing.  The majority of words that I do not recognize are proper names such as Poulaphouca (a place in County Wicklow), foreign words (generally part of a quoted phrase), compound words like the first example above, or sounds rendered as words.

    I wish there were a convenient way to filter proper names and foreign words.  My revised expectation, though, based on a quick scan of the Ulysses lexicon, is that a smaller proportion of terms will fail lookup in the larger lexicon than I had originally thought.

    Having started this exercise, I may as well finish it.  Setting aside many valid objections, cases that should be excluded and so forth, we compare the Ulysses lexicon to the considerably larger one that is based on other sources.  And the answer: approximately 1/4 of the unique terms in Ulysses (including foreign words, proper nouns, sounds, run-together words, and so forth) are not in the large lexicon, which contains only English language words and not very many proper nouns.
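
    The comparison itself amounts to a case-insensitive set difference.  A minimal Python sketch follows, using toy word lists in place of the real lexicons; the function name miss_rate and the sample lists are hypothetical.

```python
def miss_rate(book_words, reference_words):
    """Fraction of unique book words (ignoring case) absent from the reference."""
    book = {w.lower() for w in book_words}
    reference = {w.lower() for w in reference_words}
    if not book:
        return 0.0
    missing = book - reference  # unique book words with no match in the reference
    return len(missing) / len(book)

# Toy data only; the real lexicons hold tens or hundreds of thousands of entries.
book_sample = ["maugre", "tatterdemalion", "hairynostrilled", "Poulaphouca"]
reference_sample = ["Maugre", "tatterdemalion", "gown"]

print(miss_rate(book_sample, reference_sample))
```

    The rate is computed over unique words, not over every occurrence in the text, which matches the way the question was framed for the Ulysses lexicon.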

    In conclusion, no conclusion is possible, except that I should probably stay away from lexical analysis.

    For anyone who programs in MUMPS, and is familiar with the MUMPS File Manager —

  A.) FILE NAME:------------- BOOK LEXICON
                                                F.) FILE ACCESS:
  B.) FILE NUMBER:----------- 29340.5                DD______ @
                                                     Read____ @
  C.) NUM OF FLDS:----------- 4                      Write___ @
                                                     Delete__ @
  D.) DATA GLOBAL:----------- ^SIS(29340.5,          Laygo___ @

  E.) TOTAL GLOBAL ENTRIES:-- 30220             G.) PRINTING STATUS:-- Off
   .01          WORD   [0;1] [RF]
   1            BOOK   [1;0] [29340.51PA]                               <-Mult
       .01          BOOK   [0;1] [P1360105']
       .02          COUNT   [0;2] [NJ8,0]

    The programming steps to populate the database were approximately as follows:

  1. Read the text file into a scratch global.
  2. Inspect the global to determine where the book begins and ends (i.e., eliminating publisher data, etc.).
  3. For each line in the book, remove extraneous punctuation, convert case, split into space pieces (quasi-words).
  4. For each word either add it to the Lexicon or increment the count if it is already there.
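
    For readers who do not use MUMPS, steps 2 through 4 can be sketched in Python.  This is an illustration under stated assumptions, not the code that populated the File Manager file: the function name build_lexicon is hypothetical, the first and last line positions are assumed to have been found by inspection as in step 2, and a Counter stands in for the BOOK LEXICON file and its COUNT field.

```python
from collections import Counter
import string

def build_lexicon(lines, first, last):
    """Count quasi-words in lines[first:last]; bounds come from manual inspection."""
    counts = Counter()
    for line in lines[first:last]:
        # Break glued words apart at double hyphens, then split on whitespace.
        for piece in line.replace("--", " ").split():
            word = piece.strip(string.punctuation).lower()
            if word:
                counts[word] += 1  # adds the word at 1 or increments its count
    return counts

# Toy stand-in for the scratch global holding the text file, one line per entry.
scratch = ["Publisher header", "Yes I said yes", "I will Yes.", "End of book"]
lexicon = build_lexicon(scratch, 1, 3)

print(len(lexicon), lexicon.most_common(2))
```

    The number of keys in the Counter plays the role of TOTAL GLOBAL ENTRIES in the listing above.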

    The time required to parse and file all the words in Ulysses on my now retired quad-core AMD was about 2 seconds.  The time required to test words against the larger lexicon was negligible.