Attempt at R version of Peter Norvig’s Spelling Corrector

I recently came across a short tutorial by Peter Norvig on natural language processing, built around some interesting examples. I really liked his explanations and the fact that it was all in an IPython notebook, so readers can work through it alongside the text. I wanted to see how an R version would work, so this is my attempt at the first part of his tutorial, the spelling corrector. The knitted document is in the following location.
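For context, the data-preparation step that the comments below discuss (lower-casing the corpus and extracting word tokens) can be sketched in base R roughly as follows. This is a minimal illustration, not the post's actual code: the variable name `Big` follows the comments below, and the sample string stands in for Norvig's `big.txt` corpus.

```r
# A small sample string standing in for Norvig's big.txt corpus
Big <- "The quick brown fox -- jumped over the lazy dog, twice!"

# Lower-case the text and split on runs of non-letter characters
Words <- unlist(strsplit(tolower(Big), "[^a-z]+"))

# Drop any empty tokens produced by the split
Words <- Words[Words != ""]
```

This base-R approach is slow on a megabyte-scale corpus, which is exactly what the first comment below speeds up with the stringi package.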


3 Responses to Attempt at R version of Peter Norvig’s Spelling Corrector

  1. Florian says:

    Here is a roughly 100x faster way of generating the lower-case tokens in R (the data-preparation problem you describe at the beginning):

    # use a better string-processing package: stringi
    install.packages("stringi")

    # ensure the install worked (stringi links against ICU / libicu):
    library(stringi)

    # run the extraction (in about a second or two):
    Words <- stri_trans_tolower(stri_extract_all_words(Big)[[1]])  # stri_extract_words in older stringi versions

  2. Carlos p Marqui says:

    Another fast option is to use gsub to replace all unwanted punctuation via a regular expression, then strsplit to split the resulting strings into substrings, then unlist and eliminate any blank entries.

    wsplit <- strsplit(tolower(BIG), " ")

    x <- gsub("[^a-z]+", " ", wsplit[[1]])  # substitute anything other than [a-z] with a space
    x <- strsplit(x, " ")                   # split the resulting strings
    x <- unlist(x)                          # flatten the list of substrings
    x <- x[x != ""]                         # eliminate any blank entries
