зеркало из https://github.com/microsoft/RTVS-docs.git
Add additional modeling steps and plots
This commit is contained in:
Родитель
1a40dae2ab
Коммит
1045ab324e
|
@ -1,96 +1,84 @@
|
||||||
## Getting Started with R
|
## Getting Started with R
|
||||||
|
|
||||||
### Some Brief History
|
### Some Brief History
|
||||||
|
|
||||||
# R followed S. The S language was conceived by John Chambers, Rick Becker,
|
# R followed S. The S language was conceived by John Chambers, Rick Becker,
|
||||||
# Trevor Hastie, Allan Wilks and others at Bell Labs in the mid 1970's.
|
# Trevor Hastie, Allan Wilks and others at Bell Labs in the mid 1970s.
|
||||||
# S was made publicly available in the early 1980’s. R, which is modeled
|
# S was made publically available in the early 1980's. R, which is modeled
|
||||||
# closely on S, was developed by Robert Gentleman and Ross Ihaka in the early
|
# closely on S, was developed by Robert Gentleman and Ross Ihaka in the early
|
||||||
# 1990's while they were both faculty members at the University of Auckland.
|
# 1990's while they were both faculty members at the University of Auckland.
|
||||||
# R was established as an open source project (www.r-project.org) in 1995.
|
# R was established as an open source project (www.r-project.org) in 1995.
|
||||||
# Since 1997 the R project has been managed by the R Core Group.
|
# Since 1997 the R project has been managed by the R Core Group.
|
||||||
# When AT&T spun off Bell Labs in 1996, S was no longer freely available.
|
# When AT&T spun of Bell Labs in 1996, S was no longer freely available.
|
||||||
# S-PLUS is a commercial implementation of the S language developed by the
|
# S-PLUS is a commercial implementation of the S language developed by the
|
||||||
# Insightful corporation which is now sold by TIBCO software Inc.
|
# Insightful corporation which is now sold by TIBCO software Inc.
|
||||||
|
|
||||||
# The R Core Group: http://www.r-project.org/contributors.html
|
# The R Core Group: http://www.r-project.org/contributors.html
|
||||||
# Download R: http://cran.r-project.org/
|
# Download R: http://cran.r-project.org/
|
||||||
|
|
||||||
|
|
||||||
### How R is oganized
|
### How R is organized
|
||||||
|
|
||||||
# R is an interpreted functional language with objects. The core of
|
# R is an interpreted functional language with objects. The core of
|
||||||
# R language contains the the data manipulation and statistical functions.
|
# R language contains the the data manipulation and statistical functions.
|
||||||
# Most of R's capabilities are delivered as user contributed packages that
|
# Most of R's capabilities are delivered as user contributed packages that
|
||||||
# may be downloaded from CRAN. R ships with the "base and recommended"
|
# may be downloaded from CRAN.R ships with the "base and recommended"
|
||||||
# packages:
|
# packages:
|
||||||
# http://cran.r-project.org/doc/FAQ/R-FAQ.html#Which-add_002don-packages-exist-for-R_003f
|
# http://cran.r-project.org/doc/FAQ/R-FAQ.html#Which-add_002don-packages-exist-for-R_003f
|
||||||
|
|
||||||
|
|
||||||
### R RESOURCES
|
### R RESOURCES
|
||||||
|
|
||||||
# What is R? the movie: http://www.youtube.com/watch?v=TR2bHSJ_eck
|
# What is R? the movie: http://www.youtube.com/watch?v=TR2bHSJ_eck
|
||||||
# Search for R topics on the web: http://www.rseek.org
|
# Search for R topics on the web: http://www.rseek.org
|
||||||
# search or R packages: http://www.rdocumentation.org
|
# search or R packages: http://www.rdocumentation.org
|
||||||
# A list of R Resources: http://www.revolutionanalytics.com/what-is-open-source-r/r-resources.php
|
# A list of R Resources: http://www.revolutionanalytics.com/what-is-open-source-r/r-resources.php
|
||||||
# Quick R: http://www.statmethods.net/
|
# Quick R: http://www.statmethods.net/
|
||||||
# R Reference Card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf
|
# R Reference Card: http://cran.r-project.org/doc/contrib/Short-refcard.pdf
|
||||||
# An online book: http://www.cookbook-r.com/
|
# An online book: http://www.cookbook-r.com/
|
||||||
# Hadley Wickham's book, Advanced R: http://adv-r.had.co.nz
|
# Hadley Wickham's book, Advanced R: http://adv-r.had.co.nz
|
||||||
# CRAN Task Views: http://cran.r-project.org/web/views/
|
# CRAN Task Views: http://cran.r-project.org/web/views/
|
||||||
# Some help with packages: http://crantastic.org/
|
# Some help with packages: http://crantastic.org/
|
||||||
# The Bioconductor project for genomics: http://www.bioconductor.org/
|
# The BIOCONDUCTOR PROJECT FOR GENOMICS: http://www.bioconductor.org/
|
||||||
|
|
||||||
|
|
||||||
### R Blogs
|
### R Blogs
|
||||||
|
|
||||||
# Revolutions blog: http://blog.revolutionanalytics.com/
|
# Revolutions blog: http://blog.revolutionanalytics.com/
|
||||||
# RBloggers: http://www.r-bloggers.com
|
# RBloggers: http://www.r-bloggers.com
|
||||||
|
|
||||||
|
|
||||||
### Getting Help
|
### Getting Help
|
||||||
|
|
||||||
# If you are looking for help with technical questions about the language
|
# If you are looking for help with technical questions about the language
|
||||||
# please consult the community site (http://www.r-project.org) for frequently
|
# please consult the community site (http://www.r-project.org) for frequently
|
||||||
# asked questions. Ask for help on one of the several R mailing lists
|
# asked questions. Ask for help on one of the several R mailing lists
|
||||||
# http://www.r-project.org/mail.html or
|
# http://www.r-project.org/mail.html or
|
||||||
# Stack Overflow http://stackoverflow.com/questions/tagged/r
|
# Stack Overflow http://stackoverflow.com/questions/tagged/r
|
||||||
|
|
||||||
|
|
||||||
### Looking at Packages
|
### Packages used in this set of examples
|
||||||
|
|
||||||
|
# Package | Use
|
||||||
|
# ---------- | ----------
|
||||||
|
# ggplot2 | Plots
|
||||||
|
|
||||||
|
|
||||||
|
### Looking at Packages
|
||||||
|
|
||||||
# You can extend the functionality of R by installing and loading packages.
|
# You can extend the functionality of R by installing and loading packages.
|
||||||
# A package is simply a set of functions, and sometimes data
|
# A package is simply a set of functions, and sometimes data
|
||||||
# Package authors can distribute their work on CRAN,
|
# Package authors can distribute their work on CRAN, https://cran.r-project.org/,
|
||||||
# https://cran.r-project.org/,
|
# in addition to other repositors (e.g. BioConductor) and github
|
||||||
# in addition to other repositories (e.g. BioConductor) and GitHub.
|
|
||||||
# For a list of contributed packages on CRAN, see https://cran.r-project.org/
|
# For a list of contributed packages on CRAN, see https://cran.r-project.org/
|
||||||
|
|
||||||
|
# List all available installed packages on your machine.
|
||||||
# There are several ways to run a script in Visual Studio:
|
installed.packages()
|
||||||
# 1) Line by line: with the cursor at the line, press CTRL-Enter
|
|
||||||
# 2) Multile lines: select the lines you want to run and press CTRL-Enter
|
|
||||||
# 3) Entire file: press CTRL-A to select all lines and press CTRL-Enter
|
|
||||||
|
|
||||||
# Simple calculation.
|
|
||||||
2 + 3
|
|
||||||
|
|
||||||
# Print a message.
|
|
||||||
print('Hello, World!')
|
|
||||||
|
|
||||||
# To get help on a function, use help(function_name) or ?function_name.
|
|
||||||
?help
|
|
||||||
?print
|
|
||||||
|
|
||||||
# Save all available installed packages on your machine
|
|
||||||
# to variable "packages".
|
|
||||||
# The variable's values can be viewed in Visual Studio's Variable Explorer
|
|
||||||
# by clicking on the magnifying button in the Value column.
|
|
||||||
packages <- installed.packages()
|
|
||||||
|
|
||||||
# List all "attached" or loaded packages.
|
# List all "attached" or loaded packages.
|
||||||
search()
|
search()
|
||||||
|
|
||||||
# You "attach" a package to make it's functions available
|
# You "attach" a package to make it's functions available,
|
||||||
# using the library() function.
|
# using the library() function.
|
||||||
# For example, the "foreign" package comes with R and contains
|
# For example, the "foreign" package comes with R and contains
|
||||||
# functions to import data from other systems.
|
# functions to import data from other systems.
|
||||||
|
@ -100,153 +88,138 @@ library(foreign)
|
||||||
library(help = foreign)
|
library(help = foreign)
|
||||||
|
|
||||||
# To install a new package, use install.packages()
|
# To install a new package, use install.packages()
|
||||||
# Install the ggplot2 package for its plotting capability.
|
# Install the ggplot2 package for it's plotting capability.
|
||||||
suppressWarnings(
|
if (!require("ggplot2")){
|
||||||
if (!require("ggplot2"))
|
install.packages("ggplot2")
|
||||||
install.packages("ggplot2"))
|
}
|
||||||
|
|
||||||
# Then load the package.
|
# Then load the package.
|
||||||
library("ggplot2")
|
library("ggplot2")
|
||||||
|
|
||||||
# Notice that package:ggplot2 is now added to the search list.
|
|
||||||
search()
|
search()
|
||||||
|
# Notice that package:ggplot2 is now added to the search list.
|
||||||
|
|
||||||
|
|
||||||
### A Simple Regression Example
|
### A Simple Regression Example
|
||||||
|
|
||||||
# Look at the data sets that come with the package.
|
# Look at the data sets that come with the package.
|
||||||
|
data(package = "ggplot2")
|
||||||
# Note that the results in RTVS may pop up, or pop under, in a new window.
|
# Note that the results in RTVS may pop up, or pop under, in a new window.
|
||||||
data(package = "ggplot2")
|
|
||||||
|
|
||||||
# ggplot2 contains a dataset called diamonds.
|
# ggplot2 contains a dataset called diamonds. Make this dataset available using the data() function.
|
||||||
# Make this dataset available using the data() function.
|
|
||||||
data(diamonds, package = "ggplot2")
|
data(diamonds, package = "ggplot2")
|
||||||
|
|
||||||
# The following command returns all objects in the "global environment".
|
# Create a listing of all objects in the "global environment". Look for "diamonds" in the results.
|
||||||
# Look for "diamonds" in the results.
|
|
||||||
# These objects also show up in Visual Studio's Variable Explorer.
|
|
||||||
ls()
|
ls()
|
||||||
|
|
||||||
# The str command displays the internal structure of the diamonds dataframe.
|
# Now investigate the structure of diamonds, a data frame with 53,940 observations
|
||||||
# You can also view the internal structure
|
|
||||||
# in Visual Studio's Variable Explorer.
|
|
||||||
str(diamonds)
|
str(diamonds)
|
||||||
|
|
||||||
# Print the first few rows.
|
# Print the first few rows.
|
||||||
# Complete data can be viewed in Visual Studio's Variable Explorer
|
|
||||||
# by clicking on the magnifying button in the Value column.
|
|
||||||
head(diamonds)
|
head(diamonds)
|
||||||
|
|
||||||
# Print the last 6 lines.
|
# Print the last 6 lines.
|
||||||
tail(diamonds)
|
tail(diamonds)
|
||||||
|
|
||||||
# Find out what kind of object it is.
|
# Find out what kind of object it is.
|
||||||
# The class info also shows up in Visual Studio's Variable Explorer.
|
|
||||||
class(diamonds)
|
class(diamonds)
|
||||||
|
|
||||||
# Look at the dimension of the data frame.
|
# Look at the dimension of the data frame.
|
||||||
# The dim info also shows up in Visual Studio's Variable Explorer.
|
|
||||||
dim(diamonds)
|
dim(diamonds)
|
||||||
|
|
||||||
|
|
||||||
### Vectorized Code
|
### Plots in R
|
||||||
|
|
||||||
# This next bit of code shows off a very powerful feature of the R language:
|
|
||||||
# how many functions are "vectorized". The function sapply() takes
|
|
||||||
# the function class() that we just used on the data frame and applies it
|
|
||||||
# to all of the columns of the data frame.
|
|
||||||
# The class info also shows up in Visual Studio's Variable Explorer.
|
|
||||||
sapply(diamonds, class) # Find out what kind of animals the variables are
|
|
||||||
|
|
||||||
|
|
||||||
### Plots in R
|
|
||||||
|
|
||||||
# Create a random sample of the diamonds data.
|
# Create a random sample of the diamonds data.
|
||||||
diamondSample <- diamonds[sample(nrow(diamonds), 5000),]
|
diamondSample <- diamonds[sample(nrow(diamonds), 5000),]
|
||||||
dim(diamondSample)
|
dim(diamondSample)
|
||||||
|
|
||||||
# R has three systems for static graphics:
|
# R has three systems for static graphics: base graphics, lattice and ggplot2.
|
||||||
# base graphics, lattice and ggplot2.
|
# This example uses ggplot2
|
||||||
# This example shows ggplot2 in action.
|
|
||||||
|
|
||||||
# Set the font size so that it will be clearly legible.
|
# Set the font size so that it will be clearly legible.
|
||||||
theme_set(theme_gray(base_size = 18))
|
theme_set(theme_gray(base_size = 18))
|
||||||
|
|
||||||
# Make a scatterplot.
|
# In this sample you use ggplot2.
|
||||||
ggplot(diamondSample, aes(x = carat, y = price)) +
|
ggplot(diamondSample, aes(x = carat, y = price)) +
|
||||||
geom_point(colour = "blue")
|
geom_point(colour = "blue")
|
||||||
|
|
||||||
# Add a log scale.
|
# Add a log scale.
|
||||||
ggplot(diamondSample, aes(x = carat, y = price)) +
|
ggplot(diamondSample, aes(x = carat, y = price)) +
|
||||||
geom_point(colour = "blue") +
|
geom_point(colour = "blue") +
|
||||||
scale_x_log10()
|
scale_x_log10()
|
||||||
|
|
||||||
# Add a log scale for both scales.
|
# Add a log scale for both scales.
|
||||||
# The relationship between price and carat looks
|
|
||||||
# to be linear on a log-log scale.
|
|
||||||
# Note that "Everything is linear if plotted log-log with a fat magic marker."
|
|
||||||
# -- http://www.daclarke.org/Humour/Engineering.html"
|
|
||||||
ggplot(diamondSample, aes(x = carat, y = price)) +
|
ggplot(diamondSample, aes(x = carat, y = price)) +
|
||||||
geom_point(colour = "blue") +
|
geom_point(colour = "blue") +
|
||||||
scale_x_log10() +
|
scale_x_log10() +
|
||||||
scale_y_log10()
|
scale_y_log10()
|
||||||
|
|
||||||
|
### Linear Regression in R
|
||||||
|
|
||||||
### Linear Regression in R
|
# Now, build a simple regression model, examine the results of the model and plot the points and the regression line.
|
||||||
|
|
||||||
# Now, let's build a simple regression model, examine the results of
|
# Build the model. log of price explained by log of carat. This illustrates how linear regression works. Later we fit a model that includes the remaining variables
|
||||||
# the model and plot the points and the regression line.
|
|
||||||
|
|
||||||
# Build the model to predict price using carat.
|
model <- lm(log(price) ~ log(carat) , data = diamondSample)
|
||||||
model <- lm(log(price) ~ log(carat), data = diamondSample)
|
|
||||||
#model <- lm(price ~ carat, data = diamondSample)
|
|
||||||
|
|
||||||
# Look at the results.
|
# Look at the results.
|
||||||
summary(model)
|
summary(model)
|
||||||
|
# R-squared = 0.9334, i.e. model explains 93.3% of variance
|
||||||
|
|
||||||
# Extract model coefficients.
|
# Extract model coefficients.
|
||||||
coef(model)
|
coef(model)
|
||||||
coef(model)[1]
|
coef(model)[1]
|
||||||
exp(coef(model)[1])
|
exp(coef(model)[1]) # exponentiate the log of price, to convert to original units
|
||||||
|
|
||||||
# Show the model in a plot.
|
# Show the model in a plot.
|
||||||
ggplot(diamondSample, aes(x = carat, y = price)) +
|
ggplot(diamondSample, aes(x = carat, y = price)) +
|
||||||
geom_point(colour = "blue") +
|
geom_point(colour = "blue") +
|
||||||
geom_smooth(method = "lm", colour = "red", size = 2) +
|
geom_smooth(method = "lm", colour = "red", size = 2) +
|
||||||
scale_x_log10() +
|
scale_x_log10() +
|
||||||
scale_y_log10()
|
scale_y_log10()
|
||||||
|
|
||||||
|
|
||||||
### Regression Diagnostics
|
### Regression Diagnostics
|
||||||
|
|
||||||
# It is easy to get regression diagnostic plots. The same plot function
|
# It is easy to get regression diagnostic plots. The same plot function that plots points either with a formula or with the coordinates also has a "method" for dealing with a model object.
|
||||||
# that plots points either with a formula or with the coordinates also has
|
|
||||||
# a "method" for dealing with a model object.
|
|
||||||
|
|
||||||
# Set up for multiple plots on the same figure.
|
|
||||||
par(mfrow = c(2, 2))
|
|
||||||
|
|
||||||
# Look at some model diagnostics.
|
# Look at some model diagnostics.
|
||||||
|
# check to see Q-Q plot to see linearity which means residuals are normally distributed
|
||||||
|
|
||||||
|
par(mfrow = c(2, 2)) # Set up for multiple plots on the same figure.
|
||||||
plot(model, col = "blue")
|
plot(model, col = "blue")
|
||||||
|
par(mfrow = c(1, 1)) # Rest plot layout to single plot on a 1x1 grid
|
||||||
|
|
||||||
|
|
||||||
### The Model Object
|
### The Model Object
|
||||||
|
|
||||||
# Finally, let's look at the model object. R packs everything that goes with
|
# Finally, let's look at the model object. R packs everything that goes with the model, e.g. the formula and results into the object. You can pick out what you need by indexing into the model object.
|
||||||
# the model, the fornula, and results into the object. You can pick out what
|
|
||||||
# you need by indexing into the model object.
|
|
||||||
str(model)
|
str(model)
|
||||||
model$coefficients
|
model$coefficients # note this is the same as coef(model)
|
||||||
coef(model)
|
|
||||||
|
# Now fit a new model including more columns
|
||||||
|
model <- lm(log(price) ~ log(carat) + ., data = diamondSample) # Model log of price against all columns
|
||||||
|
|
||||||
|
summary(model)
|
||||||
|
# R-squared = 0.9824, i.e. model explains 98.2% of variance, i.e. a better model than previously
|
||||||
|
|
||||||
|
|
||||||
### Make prediction for the first several data points.
|
|
||||||
first <- 221
|
# Create data frame of actual and predicted price
|
||||||
last <- 231
|
|
||||||
pred_data <- diamonds[first:last,]
|
predicted_values <- data.frame(
|
||||||
pred <- predict(model, pred_data)
|
actual = diamonds$price,
|
||||||
# The weight of each diamond in carats:
|
predicted = exp(predict(model, diamonds)) # anti-log of predictions
|
||||||
diamonds$carat[first:last]
|
)
|
||||||
# The actual price:
|
|
||||||
diamonds$price[first:last]
|
# Inspect predictions
|
||||||
# The predicted price:
|
head(predicted_values)
|
||||||
exp(pred)
|
|
||||||
|
# Create plot of actuals vs predictions
|
||||||
|
ggplot(predicted_values, aes(x = actual, y = predicted)) +
|
||||||
|
geom_point(colour = "blue", alpha = 0.01) +
|
||||||
|
geom_smooth(colour = "red") +
|
||||||
|
coord_equal(ylim = c(0, 20000)) + # force equal scale
|
||||||
|
ggtitle("Linear model of diamonds data")
|
||||||
|
|
||||||
|
|
Загрузка…
Ссылка в новой задаче