Add additional modeling steps and plots

This commit is contained in:
Andrie de Vries 2016-03-30 21:57:15 +02:00
Родитель 1a40dae2ab
Коммит 1045ab324e
1 изменённых файлов: 109 добавлений и 136 удалений

Просмотреть файл

@ -1,96 +1,84 @@
## Getting Started with R
### Some Brief History
### Some Brief History
# R followed S. The S language was conceived by John Chambers, Rick Becker,
# Trevor Hastie, Allan Wilks and others at Bell Labs in the mid 1970's.
# S was made publicly available in the early 1980s. R, which is modeled
# Trevor Hastie, Allan Wilks and others at Bell Labs in the mid 1970s.
# S was made publically available in the early 1980's. R, which is modeled
# closely on S, was developed by Robert Gentleman and Ross Ihaka in the early
# 1990's while they were both faculty members at the University of Auckland.
# R was established as an open source project (www.r-project.org) in 1995.
# Since 1997 the R project has been managed by the R Core Group.
# When AT&T spun off Bell Labs in 1996, S was no longer freely available.
# When AT&T spun of Bell Labs in 1996, S was no longer freely available.
# S-PLUS is a commercial implementation of the S language developed by the
# Insightful corporation which is now sold by TIBCO software Inc.
# Insightful corporation which is now sold by TIBCO software Inc.
# The R Core Group: http://www.r-project.org/contributors.html
# Download R: http://cran.r-project.org/
# The R Core Group: http://www.r-project.org/contributors.html
# Download R: http://cran.r-project.org/
### How R is oganized
### How R is organized
# R is an interpreted functional language with objects. The core of
# R language contains the the data manipulation and statistical functions.
# Most of R's capabilities are delivered as user contributed packages that
# may be downloaded from CRAN. R ships with the "base and recommended"
# packages:
# http://cran.r-project.org/doc/FAQ/R-FAQ.html#Which-add_002don-packages-exist-for-R_003f
# may be downloaded from CRAN.R ships with the "base and recommended"
# packages:
# http://cran.r-project.org/doc/FAQ/R-FAQ.html#Which-add_002don-packages-exist-for-R_003f
### R RESOURCES
### R RESOURCES
# What is R? the movie: http://www.youtube.com/watch?v=TR2bHSJ_eck
# Search for R topics on the web: http://www.rseek.org
# search or R packages: http://www.rdocumentation.org
# A list of R Resources: http://www.revolutionanalytics.com/what-is-open-source-r/r-resources.php
# Quick R: http://www.statmethods.net/
# R Reference Card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf
# An online book: http://www.cookbook-r.com/
# Hadley Wickham's book, Advanced R: http://adv-r.had.co.nz
# CRAN Task Views: http://cran.r-project.org/web/views/
# Some help with packages: http://crantastic.org/
# The Bioconductor project for genomics: http://www.bioconductor.org/
# What is R? the movie: http://www.youtube.com/watch?v=TR2bHSJ_eck
# Search for R topics on the web: http://www.rseek.org
# search or R packages: http://www.rdocumentation.org
# A list of R Resources: http://www.revolutionanalytics.com/what-is-open-source-r/r-resources.php
# Quick R: http://www.statmethods.net/
# R Reference Card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf
# An online book: http://www.cookbook-r.com/
# Hadley Wickham's book, Advanced R: http://adv-r.had.co.nz
# CRAN Task Views: http://cran.r-project.org/web/views/
# Some help with packages: http://crantastic.org/
# The BIOCONDUCTOR PROJECT FOR GENOMICS: http://www.bioconductor.org/
### R Blogs
### R Blogs
# Revolutions blog: http://blog.revolutionanalytics.com/
# RBloggers: http://www.r-bloggers.com
# Revolutions blog: http://blog.revolutionanalytics.com/
# RBloggers: http://www.r-bloggers.com
### Getting Help
### Getting Help
# If you are looking for help with technical questions about the language
# please consult the community site (http://www.r-project.org) for frequently
# asked questions. Ask for help on one of the several R mailing lists
# http://www.r-project.org/mail.html or
# Stack Overflow http://stackoverflow.com/questions/tagged/r
# Stack Overflow http://stackoverflow.com/questions/tagged/r
### Looking at Packages
### Packages used in this set of examples
# Package | Use
# ---------- | ----------
# ggplot2 | Plots
### Looking at Packages
# You can extend the functionality of R by installing and loading packages.
# A package is simply a set of functions, and sometimes data
# Package authors can distribute their work on CRAN,
# https://cran.r-project.org/,
# in addition to other repositories (e.g. BioConductor) and GitHub.
# Package authors can distribute their work on CRAN, https://cran.r-project.org/,
# in addition to other repositors (e.g. BioConductor) and github
# For a list of contributed packages on CRAN, see https://cran.r-project.org/
# There are several ways to run a script in Visual Studio:
# 1) Line by line: with the cursor at the line, press CTRL-Enter
# 2) Multile lines: select the lines you want to run and press CTRL-Enter
# 3) Entire file: press CTRL-A to select all lines and press CTRL-Enter
# Simple calculation.
2 + 3
# Print a message.
print('Hello, World!')
# To get help on a function, use help(function_name) or ?function_name.
?help
?print
# Save all available installed packages on your machine
# to variable "packages".
# The variable's values can be viewed in Visual Studio's Variable Explorer
# by clicking on the magnifying button in the Value column.
packages <- installed.packages()
# List all available installed packages on your machine.
installed.packages()
# List all "attached" or loaded packages.
search()
search()
# You "attach" a package to make it's functions available
# You "attach" a package to make it's functions available,
# using the library() function.
# For example, the "foreign" package comes with R and contains
# functions to import data from other systems.
@ -100,153 +88,138 @@ library(foreign)
library(help = foreign)
# To install a new package, use install.packages()
# Install the ggplot2 package for its plotting capability.
suppressWarnings(
if (!require("ggplot2"))
install.packages("ggplot2"))
# Install the ggplot2 package for it's plotting capability.
if (!require("ggplot2")){
install.packages("ggplot2")
}
# Then load the package.
library("ggplot2")
# Notice that package:ggplot2 is now added to the search list.
search()
# Notice that package:ggplot2 is now added to the search list.
### A Simple Regression Example
### A Simple Regression Example
# Look at the data sets that come with the package.
data(package = "ggplot2")
# Note that the results in RTVS may pop up, or pop under, in a new window.
data(package = "ggplot2")
# ggplot2 contains a dataset called diamonds.
# Make this dataset available using the data() function.
# ggplot2 contains a dataset called diamonds. Make this dataset available using the data() function.
data(diamonds, package = "ggplot2")
# The following command returns all objects in the "global environment".
# Look for "diamonds" in the results.
# These objects also show up in Visual Studio's Variable Explorer.
# Create a listing of all objects in the "global environment". Look for "diamonds" in the results.
ls()
# The str command displays the internal structure of the diamonds dataframe.
# You can also view the internal structure
# in Visual Studio's Variable Explorer.
# Now investigate the structure of diamonds, a data frame with 53,940 observations
str(diamonds)
# Print the first few rows.
# Complete data can be viewed in Visual Studio's Variable Explorer
# by clicking on the magnifying button in the Value column.
head(diamonds)
# Print the last 6 lines.
# Print the last 6 lines.
tail(diamonds)
# Find out what kind of object it is.
# The class info also shows up in Visual Studio's Variable Explorer.
class(diamonds)
# Look at the dimension of the data frame.
# The dim info also shows up in Visual Studio's Variable Explorer.
dim(diamonds)
### Vectorized Code
# This next bit of code shows off a very powerful feature of the R language:
# how many functions are "vectorized". The function sapply() takes
# the function class() that we just used on the data frame and applies it
# to all of the columns of the data frame.
# The class info also shows up in Visual Studio's Variable Explorer.
sapply(diamonds, class) # Find out what kind of animals the variables are
### Plots in R
### Plots in R
# Create a random sample of the diamonds data.
diamondSample <- diamonds[sample(nrow(diamonds), 5000),]
dim(diamondSample)
# R has three systems for static graphics:
# base graphics, lattice and ggplot2.
# This example shows ggplot2 in action.
# R has three systems for static graphics: base graphics, lattice and ggplot2.
# This example uses ggplot2
# Set the font size so that it will be clearly legible.
theme_set(theme_gray(base_size = 18))
# Make a scatterplot.
# In this sample you use ggplot2.
ggplot(diamondSample, aes(x = carat, y = price)) +
geom_point(colour = "blue")
geom_point(colour = "blue")
# Add a log scale.
ggplot(diamondSample, aes(x = carat, y = price)) +
geom_point(colour = "blue") +
scale_x_log10()
geom_point(colour = "blue") +
scale_x_log10()
# Add a log scale for both scales.
# The relationship between price and carat looks
# to be linear on a log-log scale.
# Note that "Everything is linear if plotted log-log with a fat magic marker."
# -- http://www.daclarke.org/Humour/Engineering.html"
ggplot(diamondSample, aes(x = carat, y = price)) +
geom_point(colour = "blue") +
scale_x_log10() +
scale_y_log10()
geom_point(colour = "blue") +
scale_x_log10() +
scale_y_log10()
### Linear Regression in R
### Linear Regression in R
# Now, build a simple regression model, examine the results of the model and plot the points and the regression line.
# Now, let's build a simple regression model, examine the results of
# the model and plot the points and the regression line.
# Build the model. log of price explained by log of carat. This illustrates how linear regression works. Later we fit a model that includes the remaining variables
# Build the model to predict price using carat.
model <- lm(log(price) ~ log(carat), data = diamondSample)
#model <- lm(price ~ carat, data = diamondSample)
model <- lm(log(price) ~ log(carat) , data = diamondSample)
# Look at the results.
# Look at the results.
summary(model)
# R-squared = 0.9334, i.e. model explains 93.3% of variance
# Extract model coefficients.
coef(model)
coef(model)[1]
exp(coef(model)[1])
exp(coef(model)[1]) # exponentiate the log of price, to convert to original units
# Show the model in a plot.
ggplot(diamondSample, aes(x = carat, y = price)) +
geom_point(colour = "blue") +
geom_smooth(method = "lm", colour = "red", size = 2) +
scale_x_log10() +
scale_y_log10()
geom_point(colour = "blue") +
geom_smooth(method = "lm", colour = "red", size = 2) +
scale_x_log10() +
scale_y_log10()
### Regression Diagnostics
### Regression Diagnostics
# It is easy to get regression diagnostic plots. The same plot function
# that plots points either with a formula or with the coordinates also has
# a "method" for dealing with a model object.
# It is easy to get regression diagnostic plots. The same plot function that plots points either with a formula or with the coordinates also has a "method" for dealing with a model object.
# Set up for multiple plots on the same figure.
par(mfrow = c(2, 2))
# Look at some model diagnostics.
# check to see Q-Q plot to see linearity which means residuals are normally distributed
par(mfrow = c(2, 2)) # Set up for multiple plots on the same figure.
plot(model, col = "blue")
par(mfrow = c(1, 1)) # Rest plot layout to single plot on a 1x1 grid
### The Model Object
### The Model Object
# Finally, let's look at the model object. R packs everything that goes with
# the model, the fornula, and results into the object. You can pick out what
# you need by indexing into the model object.
# Finally, let's look at the model object. R packs everything that goes with the model, e.g. the formula and results into the object. You can pick out what you need by indexing into the model object.
str(model)
model$coefficients
coef(model)
model$coefficients # note this is the same as coef(model)
# Now fit a new model including more columns
model <- lm(log(price) ~ log(carat) + ., data = diamondSample) # Model log of price against all columns
summary(model)
# R-squared = 0.9824, i.e. model explains 98.2% of variance, i.e. a better model than previously
### Make prediction for the first several data points.
first <- 221
last <- 231
pred_data <- diamonds[first:last,]
pred <- predict(model, pred_data)
# The weight of each diamond in carats:
diamonds$carat[first:last]
# The actual price:
diamonds$price[first:last]
# The predicted price:
exp(pred)
# Create data frame of actual and predicted price
predicted_values <- data.frame(
actual = diamonds$price,
predicted = exp(predict(model, diamonds)) # anti-log of predictions
)
# Inspect predictions
head(predicted_values)
# Create plot of actuals vs predictions
ggplot(predicted_values, aes(x = actual, y = predicted)) +
geom_point(colour = "blue", alpha = 0.01) +
geom_smooth(colour = "red") +
coord_equal(ylim = c(0, 20000)) + # force equal scale
ggtitle("Linear model of diamonds data")