Exploring Hotel Review Data from Trip Advisor with R

I wanted to use R to explore hotel review data. I chose to explore reviews for 3 hotels from Trip Advisor. First, I had to scrape the review data. I have described how I scraped the data here. I used the extracted review data and did the following exploratory analysis:
* Check ratings over time
* Check the frequent words in the top quotes for each review grouped by star rating
* Check if I can find any themes in reviews with simple k-means clustering

I have described the exploratory analysis of data here.
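
The last two steps in the list above can be sketched in base R. This is only a minimal illustration, not the code from the linked write-up: the `reviews` data frame, its made-up quotes, the tiny stop word list, and the choice of two clusters are all assumptions for the example.

```r
# Toy data: made-up review quotes and ratings, for illustration only
reviews <- data.frame(
  rating = c(5, 5, 4, 2, 1, 1),
  topquote = c("great location and friendly staff",
               "friendly staff and clean room",
               "clean room great location",
               "noisy room and slow service",
               "dirty room rude staff",
               "rude staff slow service"),
  stringsAsFactors = FALSE
)

# Split each quote into lowercase words
words <- strsplit(tolower(reviews$topquote), "[^a-z]+")

# Drop empty strings and a few very common words (hand-picked stop list)
stopwords <- c("and", "the", "a")
words <- lapply(words, function(w) w[nchar(w) > 0 & !(w %in% stopwords)])

# Word frequencies for a subset of reviews, most frequent first
top_words <- function(idx) sort(table(unlist(words[idx])), decreasing = TRUE)
high_freq <- top_words(reviews$rating >= 4)
low_freq  <- top_words(reviews$rating <= 2)

# Document-term matrix: one row per review, one column per word
vocab <- sort(unique(unlist(words)))
dtm <- t(sapply(words, function(w) table(factor(w, levels = vocab))))
storage.mode(dtm) <- "double"

# Simple k-means on the word counts (2 clusters is an arbitrary choice)
set.seed(1)
km <- kmeans(dtm, centers = 2, nstart = 10)

print(high_freq)
print(low_freq)
print(table(cluster = km$cluster, rating = reviews$rating))
```

With real reviews the vocabulary is much larger, so some filtering of rare words before clustering would probably be needed.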

I think my analysis was probably a bit simplistic. So far I haven’t found anything non-obvious from this exploratory analysis, but it was still a fun exercise. In the future, I will explore how topic model packages work with this data.

 

 


26 Responses to Exploring Hotel Review Data from Trip Advisor with R

  1. Lucas says:

    Nice, you may be interested in this article, which is in this month’s American Economic Review:

    Promotional Reviews: An Empirical Investigation of Online Review Manipulation
    American Economic Review Vol. 104, Issue 8 — August 2014

    It is restricted access, but the working paper version is here:
    http://www.nber.org/papers/w18340

  2. Khaled says:

    Hello, I enjoyed your post very much and tried it myself. However, I failed to complete the example. Starting from line 121, at the for loop statement, it gave me the following error:
    Error in data.frame(id = id, topquote = topquote, ratingdt = ratingdt, :
    arguments imply differing number of rows: 0, 10, 18
    In addition: Warning message:
    All formats failed to parse. No formats found.
    > dfrating=do.call(rbind,dfrating.l)
    > head(dfrating)
    [,1]
    [1,] NA
    [2,] NA
    [3,] NA
    [4,] NA
    [5,] NA
    [6,] NA
    Any help?!
    Thank you

    • notesofdabbler says:

      Thanks for your interest in the post. On looking into it, I found that the HTML tags used by Trip Advisor have changed a bit and my code was not written flexibly enough to handle that automatically. I was getting information based on class = ‘quote’. But now some entries have class = ‘quote isNew’ and others have class = ‘quote ‘ (with an extra space), which breaks the code. I have updated the code in the function getOnePage (lines 30 and 32) to get information based on the new tags. Now the code runs.

      I would suggest using Hadley Wickham’s rvest package for web scraping, since it makes scraping data much easier. There is also an example of scraping Trip Advisor data with the package. I will be using rvest for my future web scraping. It is very nice.

      Hope this helps.
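
      One way to make this kind of matching less brittle is to treat the class attribute as a set of space-separated tokens instead of comparing the whole string. The sketch below is only an illustration in base R and is not the code in getOnePage; the helper has_class and the class values ‘rating’ and ‘quoteBox’ are made up for the example.

      ```r
      # Class attribute values: the first three are the variants described above,
      # the last two are hypothetical non-matching classes
      classes <- c("quote", "quote isNew", "quote ", "rating", "quoteBox")

      # TRUE when `token` appears as a whole, space-separated token in the class string
      has_class <- function(x, token) {
        vapply(strsplit(trimws(x), "\\s+"),
               function(tok) token %in% tok,
               logical(1))
      }

      is_quote <- has_class(classes, "quote")
      print(data.frame(class = classes, is_quote = is_quote))
      ```

      The same idea can be written as an XPath test, //*[contains(concat(' ', normalize-space(@class), ' '), ' quote ')], which should match all three variants without matching a class such as ‘quoteBox’.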

  3. KB says:

    Dear Dabbler:
    I really appreciate your ongoing feedback and effort. It is extremely helpful for novice R users like me. In fact, I came here from an R-Bloggers post and I have been a huge fan of yours: hands-on practice like yours encourages me to keep doing and learning R.
    When you get a chance in the near future, I would love to learn more about scraping reviews from other sites (e.g. Expedia or Priceline; unlike TripAdvisor, they only allow guests who have stayed to write reviews, so I am hoping some different and interesting insights can be found). Of course, I am also looking forward to learning about topic models!
    Thank you so much for your kind contribution to the world!

    • notesofdabbler says:

      Thanks for your kind comments. I am also an R learner and am benefitting a lot from the community and R-Bloggers.

  4. JAS says:

    Hi Dabbler,

    Thank you for your post. I find your posts very easy to follow and I am learning a lot from them. I agree with KB above: as an R beginner, I find your posts extremely helpful. Please keep posting.

    As an exercise, I thought it would be interesting to know more about the person who made the comment, like where s/he is from and how many cities s/he has visited. I have tried for a few hours but still have no luck. Can you show me how I can get that information, please?

    Cheers

    • notesofdabbler says:

      Thanks for your interest. Your suggestion of doing some analysis on the folks who write reviews would be very interesting to look at. When I get some time, I will also explore how to do it. If I am able to do it, I will post it on the blog.

  5. Hector says:

    Your work is very good, thanks. I am trying to work with your post http://www.r-bloggers.com/learning-r-parameter-fitting-for-models-involving-differential-equations/
    However, I have several experiments with different initial concentrations for A. In theory the model should work for all of them, but I am wondering how to compute the different ssqres=preddf$conc-expdf$conc residuals and combine them into one, in order to obtain a single set of parameters.

    Regards

    hector

  6. Kevin says:

    Hi! I ran your code and received two errors.

    Error in data.frame(id = id, topquote = topquote, ratingdt = ratingdt, :
    arguments imply differing number of rows: 0, 10
    after for(i in 1:(length(morepg)+1))

    and
    Error in ns_fullrev[[1]] : subscript out of bounds
    In addition: Warning message:
    XML content does not seem to be XML: ‘NA’

    It ended with me getting only the first ten reviews.

    I’m not sure exactly what’s going on, I’m just learning R (and Python, and coding in general…).

    • notesofdabbler says:

      Thanks for bringing this to my attention. It looks like I am extracting information from the HTML based on the classes ‘quote isNew’ or ‘quote ‘. But there is now also an HTML class ‘quote’ (without the space). I modified the scrapeTripAdvisor.R code to reflect this (lines 39 and 41). I think it runs now.

      I wrote this code before I knew about the rvest package. It has a similar example where the code is more robust and the syntax much cleaner. At some point, I will refactor this code to be more robust and use rvest. If you are starting to learn web scraping in R, I would highly recommend learning rvest (which I also plan to do).

      • Kevin says:

        Hi! Thanks so much. Yeah, I’ve started to look into rvest. They have an example, but that one doesn’t function properly for me either. But it’s definitely a clean package and I’ll start to look into it more.

      • Thimios says:

        That’s an amazing tool!!! I saw the example with rvest, but it scrapes only one page of Trip Advisor. How can I scrape several pages for the same hotel?

      • Thimios says:

        Hello again. I created a loop to get the data from all the pages. The problem now is that I get NA in the ratings column. Please advise! Thank you!

  7. Ronak says:

    Hi – I liked your post and really appreciate your work. I tried it myself but succeeded only in scraping data for ‘jwmarriott’, not for ‘hamptoninn’ and ‘conrad’. I am running the exact code, only changing the hotel name (‘hamptoninn’) in line 100. It ended with me getting only the first ten reviews, and the rest is “#N/A”. Can you please suggest something?

    • Ronak says:

      Hi – I understand your code was not written flexibly enough to handle ‘hamptoninn’ and ‘conrad’ automatically. But I was trying to figure out the changes needed in the code so that I can extract data for these two as well. Trust me, I am struggling very hard. Can you please suggest the changes, as I need to extract the data for other hotels as well? Thank you in advance!

    • notesofdabbler says:

      Hi, I am sorry you are having trouble running the code. But when I try to run the code, it seems to work. I just changed “jwmarriott” to “hamptoninn” in line 100 and it seems to run for me without errors. I am running RStudio 0.99.465 on Windows 8 and the sessionInfo is below. I don’t know if some difference in versions might be causing issues for you.

      R version 3.1.1 (2014-07-10)
      Platform: x86_64-w64-mingw32/x64 (64-bit)

      locale:
      [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United States.1252 LC_MONETARY=English_United States.1252
      [4] LC_NUMERIC=C LC_TIME=English_United States.1252

      attached base packages:
      [1] stats graphics grDevices utils datasets methods base

      other attached packages:
      [1] ggplot2_1.0.0 lubridate_1.3.3 XML_3.98-1.1 RCurl_1.95-4.3 bitops_1.0-6

      loaded via a namespace (and not attached):
      [1] colorspace_1.2-4 digest_0.6.4 grid_3.1.1 gtable_0.1.2 MASS_7.3-33 memoise_0.2.1 munsell_0.4.2
      [8] plyr_1.8.1 proto_0.3-10 Rcpp_0.11.6 reshape2_1.4 scales_0.3.0 stringr_0.6.2 tools_3.1.1

  8. Elaine Crean says:

    Hey – I really hope I’m not spamming you, I thought I left a comment the other day…maybe you have to approve it…or I dunno :S

    Anyway, just in case. I just wanted to say how grateful I am for your code. I’m a part-time student doing my master’s in Digital Marketing. Using your code I was able to scrape TripAdvisor data for a company we are working with and I’m getting great insights!

    I have a small issue and was wondering if you could help me? For some reason at page 25 only blanks are collected… I’m not sure if you have an idea what this might be… maybe the sequence or the dataframe (I got some warnings) – any help at all would be SO appreciated!! Thanks 🙂

    • notesofdabbler says:

      Thanks for your interest in using the code. There were a few bugs I had fixed and the most recent version of the code is here. Let me know if using this fixes the issue you had. Also, if you check out the rvest package, it has much cleaner code for extracting info from websites and also has an example for scraping from Trip Advisor.

      • Elaine Crean says:

        Thanks a million! I can see some code in the latest version that I don’t have so I’ll try that out 🙂 thanks again!

  9. Elaine Crean says:

    Hi again! I’m still getting this error:

    Error in data.frame(id = id, topquote = topquote, ratingdt = ratingdt, :
    arguments imply differing number of rows: 10, 13

    Do you know what I’m doing wrong??

    • notesofdabbler says:

      Are you getting this error by running the code without any modifications, or is this with a different urlmainlist? I have seen errors like this when one of the extracted lists has fewer elements than the others.
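
      A minimal sketch of how that happens and how to spot the short field, using made-up vectors that stand in for one scraped page (the values, the helper pad, and the padding step are assumptions for illustration, not part of the original script):

      ```r
      # Hypothetical extracted fields for one page; ratingdt came back three short
      id <- as.character(1:10)
      topquote <- paste("quote", 1:10)
      ratingdt <- paste0("2015-01-", sprintf("%02d", 1:7))

      # data.frame() refuses vectors whose lengths are not compatible;
      # capture the error message instead of stopping
      res <- tryCatch(
        data.frame(id = id, topquote = topquote, ratingdt = ratingdt),
        error = function(e) conditionMessage(e)
      )
      print(res)

      # Checking the lengths first shows which field is short
      lens <- c(id = length(id), topquote = length(topquote),
                ratingdt = length(ratingdt))
      short <- names(lens)[lens < max(lens)]
      print(short)

      # Padding with NA keeps the page, but it assumes the missing entries are
      # at the end, so the padded rows may not line up with the right reviews
      pad <- function(x, n) c(x, rep(NA, n - length(x)))
      page <- data.frame(id = pad(id, max(lens)),
                         topquote = pad(topquote, max(lens)),
                         ratingdt = pad(ratingdt, max(lens)),
                         stringsAsFactors = FALSE)
      ```

      Printing the lengths for each page inside the loop is usually enough to see which extraction step is returning fewer elements.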

    • notesofdabbler says:

      I checked my code with the URL you provided, and that version is in this location. I couldn’t reproduce your error, and this version seems to be working. I had to do a couple of things in the code:
      1. In line 93, I changed the sequence to go until 650 (that’s the number in the URL of the last page of reviews)
      2. In line 139, for individual reviews, I was doing a gsub on “hotel_review”. But in this case I had to do it on “Attraction_Review” since that’s in the URL

      Also, only ~560 reviews are extracted out of 650. The ones that are not extracted seem not to be in English, and I didn’t take that into account in the code. I have also listed the session info in the code file (sometimes different versions may change certain things). Hope this helps.

      • Elaine Crean says:

        Hey 🙂
        I had made those changes earlier too but am still getting the error. I think it must be the versions… I reverted to the R version you are using… loaded ggplot2 in your version… still not working 😦 I’m not sure if I can get older versions of XML, RCurl or lubridate… will keep working on it! Thanks!

      • notesofdabbler says:

        Sorry that the code is still not working for you. I don’t know the reason. Anyway, when I ran the code, I had also created a data set with extracted reviews. You might find it useful for the time being. I will likely refactor this code to use rvest sometime in the future.
