Add additional modeling steps and plots

This commit is contained in:
Andrie de Vries 2016-03-30 21:57:15 +02:00
Родитель 1a40dae2ab
Коммит 1045ab324e
1 изменённых файлов: 109 добавлений и 136 удалений

Просмотреть файл

@ -1,14 +1,15 @@
## Getting Started with R ## Getting Started with R
### Some Brief History ### Some Brief History
# R followed S. The S language was conceived by John Chambers, Rick Becker, # R followed S. The S language was conceived by John Chambers, Rick Becker,
# Trevor Hastie, Allan Wilks and others at Bell Labs in the mid 1970's. # Trevor Hastie, Allan Wilks and others at Bell Labs in the mid 1970s.
# S was made publicly available in the early 1980s. R, which is modeled # S was made publically available in the early 1980's. R, which is modeled
# closely on S, was developed by Robert Gentleman and Ross Ihaka in the early # closely on S, was developed by Robert Gentleman and Ross Ihaka in the early
# 1990's while they were both faculty members at the University of Auckland. # 1990's while they were both faculty members at the University of Auckland.
# R was established as an open source project (www.r-project.org) in 1995. # R was established as an open source project (www.r-project.org) in 1995.
# Since 1997 the R project has been managed by the R Core Group. # Since 1997 the R project has been managed by the R Core Group.
# When AT&T spun off Bell Labs in 1996, S was no longer freely available. # When AT&T spun of Bell Labs in 1996, S was no longer freely available.
# S-PLUS is a commercial implementation of the S language developed by the # S-PLUS is a commercial implementation of the S language developed by the
# Insightful corporation which is now sold by TIBCO software Inc. # Insightful corporation which is now sold by TIBCO software Inc.
@ -16,7 +17,7 @@
# Download R: http://cran.r-project.org/ # Download R: http://cran.r-project.org/
### How R is oganized ### How R is organized
# R is an interpreted functional language with objects. The core of # R is an interpreted functional language with objects. The core of
# R language contains the the data manipulation and statistical functions. # R language contains the the data manipulation and statistical functions.
@ -38,7 +39,7 @@
# Hadley Wickham's book, Advanced R: http://adv-r.had.co.nz # Hadley Wickham's book, Advanced R: http://adv-r.had.co.nz
# CRAN Task Views: http://cran.r-project.org/web/views/ # CRAN Task Views: http://cran.r-project.org/web/views/
# Some help with packages: http://crantastic.org/ # Some help with packages: http://crantastic.org/
# The Bioconductor project for genomics: http://www.bioconductor.org/ # The BIOCONDUCTOR PROJECT FOR GENOMICS: http://www.bioconductor.org/
### R Blogs ### R Blogs
@ -56,41 +57,28 @@
# Stack Overflow http://stackoverflow.com/questions/tagged/r # Stack Overflow http://stackoverflow.com/questions/tagged/r
### Packages used in this set of examples
# Package | Use
# ---------- | ----------
# ggplot2 | Plots
### Looking at Packages ### Looking at Packages
# You can extend the functionality of R by installing and loading packages. # You can extend the functionality of R by installing and loading packages.
# A package is simply a set of functions, and sometimes data # A package is simply a set of functions, and sometimes data
# Package authors can distribute their work on CRAN, # Package authors can distribute their work on CRAN, https://cran.r-project.org/,
# https://cran.r-project.org/, # in addition to other repositors (e.g. BioConductor) and github
# in addition to other repositories (e.g. BioConductor) and GitHub.
# For a list of contributed packages on CRAN, see https://cran.r-project.org/ # For a list of contributed packages on CRAN, see https://cran.r-project.org/
# List all available installed packages on your machine.
# There are several ways to run a script in Visual Studio: installed.packages()
# 1) Line by line: with the cursor at the line, press CTRL-Enter
# 2) Multile lines: select the lines you want to run and press CTRL-Enter
# 3) Entire file: press CTRL-A to select all lines and press CTRL-Enter
# Simple calculation.
2 + 3
# Print a message.
print('Hello, World!')
# To get help on a function, use help(function_name) or ?function_name.
?help
?print
# Save all available installed packages on your machine
# to variable "packages".
# The variable's values can be viewed in Visual Studio's Variable Explorer
# by clicking on the magnifying button in the Value column.
packages <- installed.packages()
# List all "attached" or loaded packages. # List all "attached" or loaded packages.
search() search()
# You "attach" a package to make it's functions available # You "attach" a package to make it's functions available,
# using the library() function. # using the library() function.
# For example, the "foreign" package comes with R and contains # For example, the "foreign" package comes with R and contains
# functions to import data from other systems. # functions to import data from other systems.
@ -100,78 +88,58 @@ library(foreign)
library(help = foreign) library(help = foreign)
# To install a new package, use install.packages() # To install a new package, use install.packages()
# Install the ggplot2 package for its plotting capability. # Install the ggplot2 package for it's plotting capability.
suppressWarnings( if (!require("ggplot2")){
if (!require("ggplot2")) install.packages("ggplot2")
install.packages("ggplot2")) }
# Then load the package. # Then load the package.
library("ggplot2") library("ggplot2")
# Notice that package:ggplot2 is now added to the search list.
search() search()
# Notice that package:ggplot2 is now added to the search list.
### A Simple Regression Example ### A Simple Regression Example
# Look at the data sets that come with the package. # Look at the data sets that come with the package.
# Note that the results in RTVS may pop up, or pop under, in a new window.
data(package = "ggplot2") data(package = "ggplot2")
# Note that the results in RTVS may pop up, or pop under, in a new window.
# ggplot2 contains a dataset called diamonds. # ggplot2 contains a dataset called diamonds. Make this dataset available using the data() function.
# Make this dataset available using the data() function.
data(diamonds, package = "ggplot2") data(diamonds, package = "ggplot2")
# The following command returns all objects in the "global environment". # Create a listing of all objects in the "global environment". Look for "diamonds" in the results.
# Look for "diamonds" in the results.
# These objects also show up in Visual Studio's Variable Explorer.
ls() ls()
# The str command displays the internal structure of the diamonds dataframe. # Now investigate the structure of diamonds, a data frame with 53,940 observations
# You can also view the internal structure
# in Visual Studio's Variable Explorer.
str(diamonds) str(diamonds)
# Print the first few rows. # Print the first few rows.
# Complete data can be viewed in Visual Studio's Variable Explorer
# by clicking on the magnifying button in the Value column.
head(diamonds) head(diamonds)
# Print the last 6 lines. # Print the last 6 lines.
tail(diamonds) tail(diamonds)
# Find out what kind of object it is. # Find out what kind of object it is.
# The class info also shows up in Visual Studio's Variable Explorer.
class(diamonds) class(diamonds)
# Look at the dimension of the data frame. # Look at the dimension of the data frame.
# The dim info also shows up in Visual Studio's Variable Explorer.
dim(diamonds) dim(diamonds)
### Vectorized Code
# This next bit of code shows off a very powerful feature of the R language:
# how many functions are "vectorized". The function sapply() takes
# the function class() that we just used on the data frame and applies it
# to all of the columns of the data frame.
# The class info also shows up in Visual Studio's Variable Explorer.
sapply(diamonds, class) # Find out what kind of animals the variables are
### Plots in R ### Plots in R
# Create a random sample of the diamonds data. # Create a random sample of the diamonds data.
diamondSample <- diamonds[sample(nrow(diamonds), 5000),] diamondSample <- diamonds[sample(nrow(diamonds), 5000),]
dim(diamondSample) dim(diamondSample)
# R has three systems for static graphics: # R has three systems for static graphics: base graphics, lattice and ggplot2.
# base graphics, lattice and ggplot2. # This example uses ggplot2
# This example shows ggplot2 in action.
# Set the font size so that it will be clearly legible. # Set the font size so that it will be clearly legible.
theme_set(theme_gray(base_size = 18)) theme_set(theme_gray(base_size = 18))
# Make a scatterplot. # In this sample you use ggplot2.
ggplot(diamondSample, aes(x = carat, y = price)) + ggplot(diamondSample, aes(x = carat, y = price)) +
geom_point(colour = "blue") geom_point(colour = "blue")
@ -181,32 +149,27 @@ ggplot(diamondSample, aes(x = carat, y = price)) +
scale_x_log10() scale_x_log10()
# Add a log scale for both scales. # Add a log scale for both scales.
# The relationship between price and carat looks
# to be linear on a log-log scale.
# Note that "Everything is linear if plotted log-log with a fat magic marker."
# -- http://www.daclarke.org/Humour/Engineering.html"
ggplot(diamondSample, aes(x = carat, y = price)) + ggplot(diamondSample, aes(x = carat, y = price)) +
geom_point(colour = "blue") + geom_point(colour = "blue") +
scale_x_log10() + scale_x_log10() +
scale_y_log10() scale_y_log10()
### Linear Regression in R ### Linear Regression in R
# Now, let's build a simple regression model, examine the results of # Now, build a simple regression model, examine the results of the model and plot the points and the regression line.
# the model and plot the points and the regression line.
# Build the model. log of price explained by log of carat. This illustrates how linear regression works. Later we fit a model that includes the remaining variables
# Build the model to predict price using carat.
model <- lm(log(price) ~ log(carat) , data = diamondSample) model <- lm(log(price) ~ log(carat) , data = diamondSample)
#model <- lm(price ~ carat, data = diamondSample)
# Look at the results. # Look at the results.
summary(model) summary(model)
# R-squared = 0.9334, i.e. model explains 93.3% of variance
# Extract model coefficients. # Extract model coefficients.
coef(model) coef(model)
coef(model)[1] coef(model)[1]
exp(coef(model)[1]) exp(coef(model)[1]) # exponentiate the log of price, to convert to original units
# Show the model in a plot. # Show the model in a plot.
ggplot(diamondSample, aes(x = carat, y = price)) + ggplot(diamondSample, aes(x = carat, y = price)) +
@ -218,35 +181,45 @@ ggplot(diamondSample, aes(x = carat, y = price)) +
### Regression Diagnostics ### Regression Diagnostics
# It is easy to get regression diagnostic plots. The same plot function # It is easy to get regression diagnostic plots. The same plot function that plots points either with a formula or with the coordinates also has a "method" for dealing with a model object.
# that plots points either with a formula or with the coordinates also has
# a "method" for dealing with a model object.
# Set up for multiple plots on the same figure.
par(mfrow = c(2, 2))
# Look at some model diagnostics. # Look at some model diagnostics.
# check to see Q-Q plot to see linearity which means residuals are normally distributed
par(mfrow = c(2, 2)) # Set up for multiple plots on the same figure.
plot(model, col = "blue") plot(model, col = "blue")
par(mfrow = c(1, 1)) # Rest plot layout to single plot on a 1x1 grid
### The Model Object ### The Model Object
# Finally, let's look at the model object. R packs everything that goes with # Finally, let's look at the model object. R packs everything that goes with the model, e.g. the formula and results into the object. You can pick out what you need by indexing into the model object.
# the model, the fornula, and results into the object. You can pick out what
# you need by indexing into the model object.
str(model) str(model)
model$coefficients model$coefficients # note this is the same as coef(model)
coef(model)
# Now fit a new model including more columns
model <- lm(log(price) ~ log(carat) + ., data = diamondSample) # Model log of price against all columns
summary(model)
# R-squared = 0.9824, i.e. model explains 98.2% of variance, i.e. a better model than previously
### Make prediction for the first several data points.
first <- 221 # Create data frame of actual and predicted price
last <- 231
pred_data <- diamonds[first:last,] predicted_values <- data.frame(
pred <- predict(model, pred_data) actual = diamonds$price,
# The weight of each diamond in carats: predicted = exp(predict(model, diamonds)) # anti-log of predictions
diamonds$carat[first:last] )
# The actual price:
diamonds$price[first:last] # Inspect predictions
# The predicted price: head(predicted_values)
exp(pred)
# Create plot of actuals vs predictions
ggplot(predicted_values, aes(x = actual, y = predicted)) +
geom_point(colour = "blue", alpha = 0.01) +
geom_smooth(colour = "red") +
coord_equal(ylim = c(0, 20000)) + # force equal scale
ggtitle("Linear model of diamonds data")