Merge pull request #4 from microsoft/ContentMove

Having Readable Markdown AND Azure Notebook Versions
This commit is contained in:
Shana Matthews 2019-10-15 13:33:07 -07:00 committed by GitHub
Parent c110572f04 7203c5b7ce
Commit 3000e1a5ba
No known key found for this signature
GPG key ID: 4AEE18F83AFDEB23
37 changed files: 10025 additions and 0 deletions

Binary data
.DS_Store vendored

Binary file not shown.

Binary data
Data Science 1_ Introduction to Python for Data Science/.DS_Store vendored

Binary file not shown.

View file

@ -0,0 +1,858 @@
# Introduction to Python
## Comments
```python
# this is the first comment
spam = 1 # and this is the second comment
# ... and now a third!
text = "# This is not a comment because it's inside quotes."
print(text)
```
## Python basics
### Arithmetic and numeric types
> **Learning goal:** By the end of this subsection, you should be comfortable with using numeric types in Python arithmetic.
#### Python numeric operators
```python
2 + 3
```
**Share**: What is the answer? Why?
```python
30 - 4 * 5
```
**Share**: What is the answer? Why?
```python
7 / 5
```
```python
3 * 3.5
```
```python
7.0 / 5
```
**Floor Division**
```python
7 // 5
```
**Remainder (modulo)**
```python
7 % 5
```
**Exponents**
```python
5 ** 2
```
```python
2 ** 5
```
**Share**: What is the answer? Why?
```python
-5 ** 2
```
```python
(-5) ** 2
```
```python
(30 - 4) * 5
```
### Variables
**Share**: What is the answer? Why?
```python
length = 15
width = 3 * 5
length * width
```
**Variables don't need types**
```python
length = 15
length
```
```python
length = 15.0
length
```
```python
length = 'fifteen'
length
```
**Share**: What will happen? Why?
```python
n
```
**Previous Output**
```python
tax = 11.3 / 100
price = 19.95
price * tax
```
```python
price + _
```
```python
round(_, 2)
```
**Multiple Variable Assignment**
```python
a, b, c = 3.2, 1, 6
a, b, c
```
### Expressions
**Share**: What is the answer? Why?
```python
2 < 5
```
*(Run after learners have shared the above)*
**Python Comparison Operators**:
![all of the comparison operators](https://notebooks.azure.com/sguthals/projects/data-science-1-instructor/raw/Images%2FScreen%20Shot%202019-09-10%20at%207.15.49%20AM.png)
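If the image does not load, a quick sketch like the following exercises each of these operators (the values are arbitrary):
```python
print(2 == 2)   # equal to
print(2 != 3)   # not equal to
print(2 < 3)    # less than
print(2 <= 2)   # less than or equal to
print(3 > 2)    # greater than
print(3 >= 3)   # greater than or equal to
```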
**Complex Expressions**
```python
a, b, c = 1, 2, 3
a < b < c
```
**Built-In Functions**
```python
min(3, 2.4, 5)
```
```python
max(3, 2.4, 5)
```
**Compound Expressions**
```python
1 < 2 and 2 < 3
```
### Exercise:
**Think, Pair, Share**
1. Quietly think about what would happen if you flipped one of the `<` to a `>`.
2. Share with the person next to you what you think will happen.
3. Try it out in the code cell below.
4. Share anything you thought was surprising.
```python
# Now flip around one of the simple expressions and see if the output matches your expectations:
```
**Or and Not**
**Share**: What is the answer? Why?
```python
1 < 2 or 1 > 2
```
```python
not (2 < 3)
```
### Exercise:
**Think, Pair, Share**
1. Quietly think about what the results would be. *Tip: Use paper!*
2. Share with the person next to you what you think will happen.
3. Try it out in the code cell below.
4. Share anything you thought was surprising.
5. Instructor Demo
```python
# Play around with compound expressions.
# Set i to different values to see what results this complex compound expression returns:
i = 7
(i == 2) or not (i % 2 != 0 and 1 < i < 5)
```
> **Takeaway:** Arithmetic operations on numeric data form the foundation of data science work in Python. Even sophisticated numeric operations are predicated on these basics, so mastering them is essential to doing data science.
## Strings
> **Learning goal:** By the end of this subsection, you should be comfortable working with strings at a basic level in Python.
```python
'spam eggs' # Single quotes.
```
```python
'doesn\'t' # Use \' to escape the single quote...
```
```python
"doesn't" # ...or use double quotes instead.
```
```python
'"Isn\'t," she said.'
```
```python
print('"Isn\'t," she said.')
```
**Pause**
Notice the difference between the previous two code cells when they are run.
```python
print('C:\some\name') # Here \n means newline!
```
```python
print(r'C:\some\name') # Note the r before the quote.
```
### String literals
**Think, Pair, Share**
```python
3 * 'un' + 'ium'
```
### Concatenating strings
```python
'Py' 'thon'
```
```python
prefix = 'Py'
prefix + 'thon'
```
### String indexes
**Think, Pair, Share**
```python
word = 'Python'
word[0]
```
**Share**
```python
word[5]
```
**Share**
```python
word[-1]
```
```python
word[-2]
```
```python
word[-6]
```
### Slicing strings
**Think, Pair, Share**
```python
word[0:2]
```
**Share**
```python
word[2:5]
```
```python
word[:2]
```
```python
word[4:]
```
```python
word[-2:]
```
**Share**
```python
word[:2] + word[2:]
```
```python
word[:4] + word[4:]
```
**TIP**
```
 +---+---+---+---+---+---+
 | P | y | t | h | o | n |
 +---+---+---+---+---+---+
 0   1   2   3   4   5   6
-6  -5  -4  -3  -2  -1
```
**Share**
```python
word[42] # The word only has 6 characters.
```
**Share**
```python
word[4:42]
```
```python
word[42:]
```
**Strings are Immutable**
```python
word[0] = 'J'
```
```python
word[2:] = 'py'
```
```python
'J' + word[1:]
```
```python
word[:2] + 'Py'
```
**Built-In Function: len**
```python
s = 'supercalifragilisticexpialidocious'
len(s)
```
**Built-In Function: str**
```python
str(2)
```
```python
str(2.5)
```
## Other data types
> **Learning goal:** By the end of this subsection, you should have a basic understanding of the remaining fundamental data types in Python and an idea of how and when to use them.
### Lists
```python
squares = [1, 4, 9, 16, 25]
squares
```
**Indexing and Slicing is the Same as Strings**
```python
squares[0]
```
```python
squares[-1]
```
```python
squares[-3:]
```
```python
squares[:]
```
**Think, Pair, Share**
```python
squares + [36, 49, 64, 81, 100]
```
**Lists are Mutable**
```python
cubes = [1, 8, 27, 65, 125]
4 ** 3
```
**Think, Pair, Share**
```python
# Replace the wrong value.
cubes
```
**Replace Many Values**
```python
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
letters
```
**Share**
```python
letters[2:5] = ['C', 'D', 'E']
letters
```
**Share**
```python
letters[2:5] = []
letters
```
```python
letters[:] = []
letters
```
**Built-In Functions: len**
```python
letters = ['a', 'b', 'c', 'd']
len(letters)
```
**Nesting**
**Think, Pair, Share**
```python
a = ['a', 'b', 'c']
n = [1, 2, 3]
x = [a, n]
x
```
**Share**
```python
x[0]
```
**Share**
```python
x[0][0]
```
### Exercise:
```python
# Nested lists come up a lot in programming, so it pays to practice.
# Which indices would you include after x to get c?
# How about to get 3?
```
### List object methods
**Share**
```python
beatles = ['John', 'Paul']
beatles.append('George')
beatles
```
**Share**
```python
beatles2 = ['John', 'Paul', 'George']
beatles2.append(['Stuart', 'Pete'])
beatles2
```
**Share**
```python
beatles.extend(['Stuart', 'Pete'])
beatles
```
**Share**
```python
beatles.index('George')
```
```python
beatles.count('John')
```
```python
beatles.remove('Stuart')
beatles
```
```python
beatles.pop()
```
```python
beatles.insert(1, 'Ringo')
beatles
```
```python
beatles.reverse()
beatles
```
```python
beatles.sort()
beatles
```
### Exercise:
```python
# What happens if you run beatles.extend(beatles)?
# How about beatles.append(beatles)?
```
### Tuples
```python
t = (1, 2, 3)
t
```
**Tuples are Immutable**
```python
t[1] = 2.0
```
```python
t[1]
```
```python
t[:2]
```
**Lists <-> Tuples**
```python
l = ['baked', 'beans', 'spam']
l = tuple(l)
l
```
```python
l = list(l)
l
```
### Membership testing
**Share**
```python
tup = ('a', 'b', 'c')
'b' in tup
```
```python
lis = ['a', 'b', 'c']
'a' not in lis
```
### Exercise:
```python
# What happens if you run lis in lis?
# Is that the behavior you expected?
# If not, think back to the nested lists we've already encountered.
```
### Dictionaries
```python
capitals = {'France': ('Paris', 2140526)}
```
```python
capitals['Nigeria'] = ('Lagos', 6048430)
capitals
```
### Exercise:
```python
# Now try adding another country (or something else) to the capitals dictionary
```
**Interacting with Dictionaries**
```python
capitals['France']
```
```python
capitals['Nigeria'] = ('Abuja', 1235880)
capitals
```
```python
len(capitals)
```
```python
capitals.popitem()
```
```python
capitals
```
> **Takeaway:** Regardless of how complex and voluminous the data you will work with, these basic data structures will repeatedly be your means for handling and manipulating it. Comfort with these basic data structures is essential to being able to understand and use Python code written by others.
### List comprehensions
> **Learning goal:** By the end of this subsection, you should understand how to create lists economically and programmatically.
```python
for x in range(1,11):
    print(x)
```
```python
numbers = [x for x in range(1,11)] # Remember to create a range 1 more than the number you actually want.
numbers
# The comprehension above is equivalent, step by step, to:
numbers = [x for x in range(1,11)]
numbers = [x for x in [1,2,3,4,5,6,7,8,9,10]]
numbers = [1,2,3,4,5,6,7,8,9,10]
```
```python
for x in range(1,11):
    print(x*x)
```
```python
squares = [x*x for x in range(1,11)]
squares
# The comprehension above is equivalent, step by step, to:
squares = [x*x for x in range(1,11)]
squares = [x*x for x in [1,2,3,4,5,6,7,8,9,10]]
squares = [1*1, 2*2, 3*3, 4*4, 5*5, 6*6, 7*7, 8*8, 9*9, 10*10]
squares = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```
**Demo**
```python
odd_squares = [x*x for x in range(1,11) if x % 2 != 0]
odd_squares
```
### Exercise:
```python
# Now use a list comprehension to generate a list of odd cubes
# from 1 to 2,197
```
> **Takeaway:** List comprehensions are a popular tool in Python because they enable the rapid, programmatic generation of lists. The economy and ease of use therefore make them an essential tool for you (in addition to a necessary topic to understand as you try to understand Python code written by others).
### Importing modules
> **Learning goal:** By the end of this subsection, you should be comfortable importing modules in Python.
```python
factorial(5)
```
```python
import math
math.factorial(5)
```
```python
from math import factorial
factorial(5)
```
> **Takeaway:** There are several Python modules that you will regularly use in conducting data science in Python, so understanding how to import them will be essential (especially in this training).

View file

@ -0,0 +1,731 @@
# Introduction to NumPy
**Library Alias**
```python
import numpy as np
```
## Built-In Help
### Exercise
```python
# Place your cursor after the period and press <TAB>:
np.
```
### Exercise
```python
# Replace 'add' below with a few different NumPy function names and look over the documentation:
np.add?
```
## NumPy arrays: a specialized data structure for analysis
> **Learning goal:** By the end of this subsection, you should have a basic understanding of what NumPy arrays are and how they differ from the other Python data structures you have studied thus far.
### Lists in Python
```python
myList = list(range(10))
myList
```
**List Comprehension with Types**
```python
[type(item) for item in myList]
```
**Share**
```python
myList2 = [True, "2", 3.0, 4]
[type(item) for item in myList2]
```
### Fixed-type arrays in Python
#### Creating NumPy arrays method 1: using Python lists
```python
# Create an integer array:
np.array([1, 4, 2, 5, 3])
```
**Think, Pair, Share**
```python
np.array([3.14, 4, 2, 3])
```
### Exercise
```python
# What happens if you construct an array using a list that contains a combination of integers, floats, and strings?
```
**Explicit Typing**
```python
np.array([1, 2, 3, 4], dtype='float32')
```
### Exercise
```python
# Try this using a different dtype.
# Remember that you can always refer to the documentation with the command np.array.
```
**Multi-Dimensional Array**
**Think, Pair, Share**
```python
# nested lists result in multi-dimensional arrays
np.array([range(i, i + 3) for i in [2, 4, 6]])
```
#### Creating NumPy arrays method 2: building from scratch
```python
np.zeros(10, dtype=int)
```
```python
np.ones((3, 5), dtype=float)
```
```python
np.full((3, 5), 3.14)
```
```python
np.arange(0, 20, 2)
```
```python
np.linspace(0, 1, 5)
```
```python
np.random.random((3, 3))
```
```python
np.random.normal(0, 1, (3, 3))
```
```python
np.random.randint(0, 10, (3, 3))
```
```python
np.eye(3)
```
```python
np.empty(3)
```
> **Takeaway:** NumPy arrays are a data structure similar to Python lists that provide high performance when storing and working on large amounts of homogeneous data—precisely the kind of data that you will encounter frequently in doing data science. NumPy arrays support many data types beyond those discussed in this course. With all of that said, however, don't worry about memorizing all of the NumPy dtypes. **It's often just necessary to care about the general kind of data you're dealing with: floating point, integer, Boolean, string, or general Python object.**
## Working with NumPy arrays: the basics
> **Learning goal:** By the end of this subsection, you should be comfortable working with NumPy arrays in basic ways.
**Similar to Lists:**
- **Arrays attributes**: Assessing the size, shape, and data types of arrays
- **Indexing arrays**: Getting and setting the value of individual array elements
- **Slicing arrays**: Getting and setting smaller subarrays within a larger array
- **Reshaping arrays**: Changing the shape of a given array (see the short sketch after this list)
- **Joining and splitting arrays**: Combining multiple arrays into one and splitting one array into multiple arrays
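Reshaping is not demonstrated later in this section, so here is a minimal sketch (the `flat` and `grid` names are just illustrative; `np` is the NumPy alias imported above):
```python
# Reshape a flat array of 12 elements into a 3x4 grid
flat = np.arange(12)
grid = flat.reshape((3, 4))
print(grid.shape)  # (3, 4)
```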
### Array attributes
```python
import numpy as np
np.random.seed(0) # seed for reproducibility
a1 = np.random.randint(10, size=6) # One-dimensional array
a2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
a3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
```
**Array Types**
```python
print("dtype:", a3.dtype)
```
### Exercise:
```python
# Change the values in this code snippet to look at the attributes for a1, a2, and a3:
print("a3 ndim: ", a3.ndim)
print("a3 shape:", a3.shape)
print("a3 size: ", a3.size)
```
### Exercise:
```python
# Explore the dtype for the other arrays.
# What dtypes do you predict them to have?
print("dtype:", a3.dtype)
```
### Indexing arrays
**Quick Review**
```python
a1
```
```python
a1[0]
```
```python
a1[4]
```
```python
a1[-1]
```
```python
a1[-2]
```
**Multi-Dimensional Arrays**
```python
a2
```
```python
a2[0, 0]
```
```python
a2[2, 0]
```
```python
a2[2, -1]
```
```python
a2[0, 0] = 12
a2
```
```python
a1[0] = 3.14159
a1
```
### Exercise:
```python
# What happens if you try to insert a string into a1?
# Hint: try both a string like '3' and one like 'three'
```
### Slicing arrays
#### One-dimensional slices
```python
a = np.arange(10)
a
```
```python
a[:5]
```
```python
a[5:]
```
```python
a[4:7]
```
**Slicing With Index**
```python
a[::2]
```
```python
a[1::2]
```
```python
a[::-1]
```
```python
a[5::-2]
```
#### Multidimensional slices
```python
a2
```
```python
a2[:2, :3]
```
```python
a2[:3, ::2]
```
```python
a2[::-1, ::-1]
```
#### Accessing array rows and columns
```python
print(a2[:, 0])
```
```python
print(a2[0, :])
```
```python
print(a2[0])
```
#### Slices are no-copy views
```python
print(a2)
```
```python
a2_sub = a2[:2, :2]
print(a2_sub)
```
```python
a2_sub[0, 0] = 99
print(a2_sub)
```
```python
print(a2)
```
#### Copying arrays
```python
a2_sub_copy = a2[:2, :2].copy()
print(a2_sub_copy)
```
```python
a2_sub_copy[0, 0] = 42
print(a2_sub_copy)
```
```python
print(a2)
```
### Joining and splitting arrays
#### Joining arrays
```python
a = np.array([1, 2, 3])
b = np.array([3, 2, 1])
np.concatenate([a, b])
```
```python
c = [99, 99, 99]
print(np.concatenate([a, b, c]))
```
```python
grid = np.array([[1, 2, 3],
[4, 5, 6]])
```
```python
np.concatenate([grid, grid])
```
#### Splitting arrays
**Think, Pair, Share**
```python
a = [1, 2, 3, 99, 99, 3, 2, 1]
a1, a2, a3 = np.split(a, [3, 5])
print(a1, a2, a3)
```
> **Takeaway:** Manipulating datasets is a fundamental part of preparing data for analysis. The skills you learned and practiced here will form building blocks for the more sophisticated data manipulation you will learn in later sections of this course.
## Sorting arrays
```python
a = np.array([2, 1, 4, 3, 5])
np.sort(a)
```
```python
print(a)
```
```python
a.sort()
print(a)
```
### Sorting along rows or columns
```python
rand = np.random.RandomState(42)
table = rand.randint(0, 10, (4, 6))
print(table)
```
```python
np.sort(table, axis=0)
```
```python
np.sort(table, axis=1)
```
### NumPy Functions vs Python Built-In Functions
| Operator | Equivalent ufunc | Description |
|:--------------|:--------------------|:--------------------------------------|
|``+`` |``np.add`` |Addition (e.g., ``1 + 1 = 2``) |
|``-`` |``np.subtract`` |Subtraction (e.g., ``3 - 2 = 1``) |
|``-`` |``np.negative`` |Unary negation (e.g., ``-2``) |
|``*`` |``np.multiply`` |Multiplication (e.g., ``2 * 3 = 6``) |
|``/`` |``np.divide`` |Division (e.g., ``3 / 2 = 1.5``) |
|``//`` |``np.floor_divide`` |Floor division (e.g., ``3 // 2 = 1``) |
|``**`` |``np.power`` |Exponentiation (e.g., ``2 ** 3 = 8``) |
|``%`` |``np.mod`` |Modulus/remainder (e.g., ``9 % 4 = 1``)|
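As a quick illustration, each operator maps to its ufunc and gives the same result (the array `a` here is just an example):
```python
a = np.arange(5)
print(a + 2)         # operator form
print(np.add(a, 2))  # equivalent ufunc form
```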
#### Exponents and logarithms
```python
a = [1, 2, 3]
print("a =", a)
print("e^a =", np.exp(a))
print("2^a =", np.exp2(a))
print("3^a =", np.power(3, a))
```
```python
a = [1, 2, 4, 10]
print("a =", a)
print("ln(a) =", np.log(a))
print("log2(a) =", np.log2(a))
print("log10(a) =", np.log10(a))
```
```python
a = [0, 0.001, 0.01, 0.1]
print("exp(a) - 1 =", np.expm1(a))
print("log(1 + a) =", np.log1p(a))
```
#### Specialized Functions
```python
from scipy import special
```
```python
# Gamma functions (generalized factorials) and related functions
a = [1, 5, 10]
print("gamma(a) =", special.gamma(a))
print("ln|gamma(a)| =", special.gammaln(a))
print("beta(a, 2) =", special.beta(a, 2))
```
> **Takeaway:** Universal functions in NumPy provide you with computational functions that are faster than regular Python functions, particularly when working on large datasets that are common in data science. This speed is important because it can make you more efficient as a data scientist and it makes a broader range of inquiries into your data tractable in terms of time and computational resources.
## Aggregations
> **Learning goal:** By the end of this subsection, you should be comfortable aggregating data in NumPy.
### Summing the values of an array
```python
myList = np.random.random(100)
np.sum(myList)
```
**NumPy vs Python Functions**
```python
large_array = np.random.rand(1000000)
%timeit sum(large_array)
%timeit np.sum(large_array)
```
### Minimum and maximum
```python
np.min(large_array), np.max(large_array)
```
```python
print(large_array.min(), large_array.max(), large_array.sum())
```
## Computation on arrays with broadcasting
> **Learning goal:** By the end of this subsection, you should have a basic understanding of how broadcasting works in NumPy (and why NumPy uses it).
```python
first_array = np.array([3, 6, 8, 1])
second_array = np.array([4, 5, 7, 2])
first_array + second_array
```
```python
first_array + 5
```
```python
one_dim_array = np.ones((1))
one_dim_array
```
```python
two_dim_array = np.ones((2, 2))
two_dim_array
```
```python
one_dim_array + two_dim_array
```
**Think, Pair, Share**
```python
horizontal_array = np.arange(3)
vertical_array = np.arange(3)[:, np.newaxis]
print(horizontal_array)
print(vertical_array)
```
```python
horizontal_array + vertical_array
```
## Comparisons, masks, and Boolean logic in NumPy
> **Learning goal:** By the end of this subsection, you should be comfortable with and understand how to use Boolean masking in NumPy in order to answer basic questions about your data.
### Example: Counting Rainy Days
Let's see masking in practice by examining the monthly rainfall statistics for Seattle. The data is in a CSV file from data.gov. To load the data, we will use pandas, which we will formally introduce in Section 4.
```python
import numpy as np
import pandas as pd
# Use pandas to extract rainfall as a NumPy array
rainfall_2003 = pd.read_csv('Data/Observed_Monthly_Rain_Gauge_Accumulations_-_Oct_2002_to_May_2017.csv')['RG01'][2:14].values
rainfall_2003
```
```python
%matplotlib inline
import matplotlib.pyplot as plt
```
```python
plt.bar(np.arange(1, len(rainfall_2003) + 1), rainfall_2003)
```
### Boolean operators
```python
np.sum((rainfall_2003 > 0.5) & (rainfall_2003 < 1))
```
```python
# Without parentheses, & binds more tightly than the comparisons, so
# rainfall_2003 > 0.5 & rainfall_2003 < 1 is parsed like this and raises an error:
rainfall_2003 > (0.5 & rainfall_2003) < 1
```
```python
np.sum(~((rainfall_2003 <= 0.5) | (rainfall_2003 >= 1)))
```
```python
print("Number of months without rain:", np.sum(rainfall_2003 == 0))
print("Number of months with rain: ", np.sum(rainfall_2003 != 0))
print("Months with more than 1 inch: ", np.sum(rainfall_2003 > 1))
print("Rainy months with < 1 inch: ", np.sum((rainfall_2003 > 0) &
(rainfall_2003 < 1)))
```
## Boolean arrays as masks
```python
rand = np.random.RandomState(0)
two_dim_array = rand.randint(10, size=(3, 4))
two_dim_array
```
```python
two_dim_array < 5
```
**Masking**
```python
two_dim_array[two_dim_array < 5]
```
```python
# Construct a mask of all rainy months
rainy = (rainfall_2003 > 0)
# Construct a mask of all summer months (June through September)
months = np.arange(1, 13)
summer = (months > 5) & (months < 10)
print("Median precip in rainy months in 2003 (inches): ",
np.median(rainfall_2003[rainy]))
print("Median precip in summer months in 2003 (inches): ",
np.median(rainfall_2003[summer]))
print("Maximum precip in summer months in 2003 (inches): ",
np.max(rainfall_2003[summer]))
print("Median precip in non-summer rainy months (inches):",
np.median(rainfall_2003[rainy & ~summer]))
```
> **Takeaway:** By combining Boolean operations, masking operations, and aggregates, you can quickly answer questions similar to those we posed about the Seattle rainfall data about any dataset. Operations like these will form the basis for the data exploration and preparation for analysis that will be our primary concerns in Sections 4 and 5.

View file

@ -0,0 +1,492 @@
# Introduction to Pandas
```python
import pandas as pd
```
```python
import numpy as np
```
## Fundamental panda data structures
### `Series` objects in pandas
```python
series_example = pd.Series([-0.5, 0.75, 1.0, -2])
series_example
```
```python
series_example.values
```
```python
series_example.index
```
```python
series_example[1]
```
```python
series_example[1:3]
```
### Explicit Indices
```python
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
series_example2
```
```python
series_example2['b']
```
### Exercise:
```python
# Do explicit Series indices work *exactly* the way you might expect?
# Try slicing series_example2 using its explicit index and find out.
```
### Series vs Dictionary
**Think, Pair, Share**
```python
population_dict = {'France': 65429495,
'Germany': 82408706,
'Russia': 143910127,
'Japan': 126922333}
population_dict
```
```python
population = pd.Series(population_dict)
population
```
### Interacting with Series
```python
population['Russia']
```
### Exercise
```python
# Try slicing on the population Series on your own.
# Would slicing be possible if Series keys were not ordered?
population['Germany':'Russia']
```
```python
# Try running population['Albania'] = 2937590 (or another country of your choice)
# What order do the keys appear in when you run population? Is it what you expected?
```
```python
population
```
```python
pop2 = pd.Series({'Spain': 46432074, 'France': 102321, 'Albania': 50532})
pop2
```
```python
population + pop2
```
### `DataFrame` object in pandas
```python
area_dict = {'Albania': 28748,
'France': 643801,
'Germany': 357386,
'Japan': 377972,
'Russia': 17125200}
area = pd.Series(area_dict)
area
```
```python
countries = pd.DataFrame({'Population': population, 'Area': area})
countries
```
```python
countries['Capital'] = ['Tirana', 'Paris', 'Berlin', 'Tokyo', 'Moscow']
countries
```
```python
countries = countries[['Capital', 'Area', 'Population']]
countries
```
```python
countries['Population Density'] = countries['Population'] / countries['Area']
countries
```
```python
countries['Area']
```
### Exercise
```python
# Now try accessing row data with a command like countries['Japan']
```
**Think, Pair, Share**
```python
countries.loc['Japan']
```
```python
countries.loc['Japan']['Area']
```
### Exercise
```python
# Can you think of a way to return the area of Japan without using .iloc?
# Hint: Try putting the column index first.
# Can you slice along these indices as well?
```
### DataSeries Creation
```python
countries['Debt-to-GDP Ratio'] = np.nan
countries
```
```python
debt = pd.Series([0.19, 2.36], index=['Russia', 'Japan'])
countries['Debt-to-GDP Ratio'] = debt
countries
```
```python
del countries['Capital']
countries
```
```python
countries.T
```
```python
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
```
## Manipulating data in pandas
### Index objects in pandas
```python
series_example = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
ind = series_example.index
ind
```
```python
ind[1]
```
```python
ind[::2]
```
**Share**
```python
ind[1] = 0  # Index objects are immutable, so this raises a TypeError.
```
### Set Properties
```python
ind_odd = pd.Index([1, 3, 5, 7, 9])
ind_prime = pd.Index([2, 3, 5, 7, 11])
```
**Think, Pair, Share**
In the code cell below, try out the intersection (`ind_odd & ind_prime`), union (`ind_odd | ind_prime`), and the symmetric difference (`ind_odd ^ ind_prime`) of `ind_odd` and `ind_prime`.
```python
# Try ind_odd & ind_prime (intersection), ind_odd | ind_prime (union),
# and ind_odd ^ ind_prime (symmetric difference):
```
### Data Selection in Series
```python
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
series_example2
```
```python
series_example2['b']
```
```python
'a' in series_example2
```
```python
series_example2.keys()
```
```python
list(series_example2.items())
```
```python
series_example2['e'] = 1.25
series_example2
```
### Indexers: `loc` and `iloc`
**Think, Pair, Share**
```python
series_example2.loc['a']
```
```python
series_example2.loc['a':'c']
```
**Share**
```python
series_example2.iloc[0]
```
```python
series_example2.iloc[0:2]
```
### Data Selection in DataFrames
```python
area = pd.Series({'Albania': 28748,
'France': 643801,
'Germany': 357386,
'Japan': 377972,
'Russia': 17125200})
population = pd.Series({'Albania': 2937590,
'France': 65429495,
'Germany': 82408706,
'Russia': 143910127,
'Japan': 126922333})
countries = pd.DataFrame({'Area': area, 'Population': population})
countries
```
```python
countries['Area']
```
```python
countries['Population Density'] = countries['Population'] / countries['Area']
countries
```
### DataFrame as two-dimensional array
```python
countries.values
```
```python
countries.T
```
```python
countries.iloc[:3, :2]
```
```python
countries.loc[:'Germany', :'Population']
```
### Exercise
```python
# Can you think of how to combine masking and fancy indexing in one line?
# Your masking could be something like countries['Population Density'] > 200
# Your fancy indexing could be something like ['Population', 'Population Density']
# Be sure to put the masking and fancy indexing inside the square brackets: countries.loc[]
```
# Operating on Data in Pandas
**Think, Pair, Share** for each of these sections.
## Index alignment with Series
For our first example, suppose we are combining two different data sources and find only the top five countries by *area* and the top five countries by *population*:
```python
area = pd.Series({'Russia': 17075400, 'Canada': 9984670,
'USA': 9826675, 'China': 9598094,
'Brazil': 8514877}, name='area')
population = pd.Series({'China': 1409517397, 'India': 1339180127,
'USA': 324459463, 'Indonesia': 322179605,
'Brazil': 207652865}, name='population')
```
```python
# Now divide these to compute the population density
pop_density = population / area
pop_density
```
```python
series1 = pd.Series([2, 4, 6], index=[0, 1, 2])
series2 = pd.Series([3, 5, 7], index=[1, 2, 3])
series1 + series2
```
```python
series1.add(series2, fill_value=0)
```
Much better!
## Index alignment with DataFrames
```python
rng = np.random.RandomState(42)
df1 = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('AB'))
df1
```
```python
df2 = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('BAC'))
df2
```
```python
# Add df1 and df2. Is the output what you expected?
df1 + df2
```
```python
fill = df1.stack().mean()
df1.add(df2, fill_value=fill)
```
## Operations between DataFrames and Series
Index and column alignment gets maintained in operations between a `DataFrame` and a `Series` as well. To see this, consider a common operation in data science, wherein we find the difference of a `DataFrame` and one of its rows. Because pandas inherits ufuncs from NumPy, pandas will compute the difference row-wise by default:
```python
df3 = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('WXYZ'))
df3
```
```python
df3 - df3.iloc[0]
```
```python
df3.subtract(df3['X'], axis=0)
```
```python
halfrow = df3.iloc[0, ::2]
halfrow
```
```python
df3 - halfrow
```

View file

@ -0,0 +1,622 @@
# Manipulating and Cleaning Data
## Exploring `DataFrame` information
> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.
```python
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
```
### `DataFrame.info`
**Dataset Alert**: Iris Data about Flowers
```python
iris_df.info()
```
### `DataFrame.head`
```python
iris_df.head()
```
### Exercise:
By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?
```python
# Hint: Consult the documentation by using iris_df.head?
```
### `DataFrame.tail`
```python
iris_df.tail()
```
> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with.
## Dealing with missing data
> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.
**None vs NaN**
### `None`: non-float missing data
```python
import numpy as np
example1 = np.array([2, None, 6, 8])
example1
```
**Think, Pair, Share**
```python
example1.sum()
```
**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them.
### `NaN`: missing float values
```python
np.nan + 1
```
```python
np.nan * 0
```
**Think, Pair, Share**
```python
example2 = np.array([2, np.nan, 6, 8])
example2.sum(), example2.min(), example2.max()
```
### Exercise:
```python
# What happens if you add np.nan and None together?
```
### `NaN` and `None`: null values in pandas
```python
int_series = pd.Series([1, 2, 3], dtype=int)
int_series
```
### Exercise:
```python
# Now set an element of int_series equal to None.
# How does that element show up in the Series?
# What is the dtype of the Series?
```
### Detecting null values
`isnull()` and `notnull()`
```python
example3 = pd.Series([0, np.nan, '', None])
```
```python
example3.isnull()
```
### Exercise:
```python
# Try running example3[example3.notnull()].
# Before you do so, what do you expect to see?
```
**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data.
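As a quick sketch of what that looks like on a `DataFrame` (the small `frame` below is made up for illustration):
```python
import numpy as np
import pandas as pd

frame = pd.DataFrame({'x': [1, np.nan], 'y': [None, 2]})
frame.isnull()  # returns a DataFrame of booleans with the same shape and index
```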
### Dropping null values
```python
example3 = example3.dropna()
example3
```
```python
example4 = pd.DataFrame([[1, np.nan, 7],
[2, 5, 8],
[np.nan, 6, 9]])
example4
```
**Think, Pair, Share**
```python
example4.dropna()
```
**Drop from Columns**
```python
example4.dropna(axis='columns')
```
`how='all'` will drop only rows or columns that contain all null values.
**Tip**: run `example4.dropna?`
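For example, a minimal sketch of `how='all'` (the `demo` DataFrame is made up for illustration):
```python
import numpy as np
import pandas as pd

demo = pd.DataFrame([[1, np.nan], [np.nan, np.nan]])
# Only the second row is entirely null, so only it gets dropped
demo.dropna(how='all')
```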
```python
example4[3] = np.nan
example4
```
### Exercise:
```python
# How might you go about dropping just column 3?
# Hint: remember that you will need to supply both the axis parameter and the how parameter.
```
The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept.
**Think, Pair, Share**
```python
example4.dropna(axis='rows', thresh=3)
```
### Filling null values
```python
example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example5
```
```python
example5.fillna(0)
```
### Exercise:
```python
# What happens if you try to fill null values with a string, like ''?
```
**Forward-fill**
```python
example5.fillna(method='ffill')
```
**Back-fill**
```python
example5.fillna(method='bfill')
```
**Specify Axis**
```python
example4
```
```python
example4.fillna(method='ffill', axis=1)
```
### Exercise:
```python
# What output does example4.fillna(method='bfill', axis=1) produce?
# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?
# Can you think of a longer code snippet to write that can fill all of the null values in example4?
```
**Fill with Logical Data**
```python
example4.fillna(example4.mean())
```
> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.
## Removing duplicate data
> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.
### Identifying duplicates: `duplicated`
```python
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
'numbers': [1, 2, 1, 3, 3]})
example6
```
```python
example6.duplicated()
```
### Dropping duplicates: `drop_duplicates`
```python
example6.drop_duplicates()
```
```python
example6.drop_duplicates(['letters'])
```
> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you spurious results!
## Combining datasets: merge and join
> **Learning goal:** By the end of this subsection, you should have a general knowledge of the various ways to combine `DataFrame`s.
### Categories of joins
`merge` carries out several types of joins: *one-to-one*, *many-to-one*, and *many-to-many*.
#### One-to-one joins
Consider combining two `DataFrame`s that contain different information on the same employees in a company:
```python
df1 = pd.DataFrame({'employee': ['Gary', 'Stu', 'Mary', 'Sue'],
'group': ['Accounting', 'Marketing', 'Marketing', 'HR']})
df1
```
```python
df2 = pd.DataFrame({'employee': ['Mary', 'Stu', 'Gary', 'Sue'],
'hire_date': [2008, 2012, 2017, 2018]})
df2
```
Combine this information into a single `DataFrame` using the `merge` function:
```python
df3 = pd.merge(df1, df2)
df3
```
#### Many-to-one joins
```python
df4 = pd.DataFrame({'group': ['Accounting', 'Marketing', 'HR'],
'supervisor': ['Carlos', 'Giada', 'Stephanie']})
df4
```
```python
pd.merge(df3, df4)
```
**Specify Key**
```python
pd.merge(df3, df4, on='group')
```
#### Many-to-many joins
```python
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Marketing', 'Marketing', 'HR', 'HR'],
'core_skills': ['math', 'spreadsheets', 'writing', 'communication',
'spreadsheets', 'organization']})
df5
```
```python
pd.merge(df1, df5, on='group')
```
#### `left_on` and `right_on` keywords
```python
df6 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
df6
```
```python
pd.merge(df1, df6, left_on="employee", right_on="name")
```
### Exercise:
```python
# Using the documentation, can you figure out how to use .drop() to get rid of the 'name' column?
# Hint: You will need to supply two parameters to .drop()
```
#### `left_index` and `right_index` keywords
```python
df1a = df1.set_index('employee')
df1a
```
```python
df2a = df2.set_index('employee')
df2a
```
```python
pd.merge(df1a, df2a, left_index=True, right_index=True)
```
### Exercise:
```python
# What happens if you specify only left_index or right_index?
```
**`join` for `DataFrame`s**
```python
df1a.join(df2a)
```
**Mix and Match**: `left_index`/`right_index` with `right_on`/`left_on`
```python
pd.merge(df1a, df6, left_index=True, right_on='name')
```
#### Set arithmetic for joins
```python
df5 = pd.DataFrame({'group': ['Engineering', 'Marketing', 'Sales'],
'core_skills': ['math', 'writing', 'communication']})
df5
```
```python
pd.merge(df1, df5, on='group')
```
**`intersection` for merge**
```python
pd.merge(df1, df5, on='group', how='inner')
```
### Exercise:
```python
# The keyword for performing an outer join is how='outer'. How would you perform it?
# What do you expect the output of an outer join of df1 and df5 to be?
```
**Share**
```python
pd.merge(df1, df5, how='left')
```
### Exercise:
```python
# Now run the right merge between df1 and df5.
# What do you expect to see?
```
#### `suffixes` keyword: dealing with conflicting column names
```python
df7 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
'rank': [1, 2, 3, 4]})
df7
```
```python
df8 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
'rank': [3, 1, 4, 2]})
df8
```
```python
pd.merge(df7, df8, on='name')
```
**Using suffixes to distinguish same-named columns**
```python
pd.merge(df7, df8, on='name', suffixes=['_left', '_right'])
```
### Concatenation in NumPy
**One-dimensional arrays**
```python
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
```
**Two-dimensional arrays**
```python
x = [[1, 2],
[3, 4]]
np.concatenate([x, x], axis=1)
```
### Concatenation in pandas
**Series**
```python
ser1 = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
ser2 = pd.Series(['d', 'e', 'f'], index=[4, 5, 6])
pd.concat([ser1, ser2])
```
**DataFrames**
```python
df9 = pd.DataFrame({'A': ['a', 'c'],
'B': ['b', 'd']})
df9
```
```python
pd.concat([df9, df9])
```
**Re-indexing**
```python
pd.concat([df9, df9], ignore_index=True)
```
**Changing Axis**
```python
pd.concat([df9, df9], axis=1)
```
> Note that while pandas will display this without error, you will get an error message if you try to assign this result as a new `DataFrame`. Column names in `DataFrame`s must be unique.
### Concatenation with joins
```python
df10 = pd.DataFrame({'A': ['a', 'd'],
'B': ['b', 'e'],
'C': ['c', 'f']})
df10
```
```python
df11 = pd.DataFrame({'B': ['u', 'x'],
'C': ['v', 'y'],
'D': ['w', 'z']})
df11
```
```python
pd.concat([df10, df11])
```
```python
pd.concat([df10, df11], join='inner')
```
```python
pd.concat([df10, df11], join_axes=[df10.columns])
```
#### `append()`
```python
df9.append(df9)
```
**Important point**: Unlike the `append()` and `extend()` methods of Python lists, the `append()` method in pandas does not modify the original object. It instead creates a new object with the combined data.
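A quick sketch of that behavior, reusing `df9` from above:
```python
appended = df9.append(df9)
print(len(df9))       # still 2 rows; df9 itself is unchanged
print(len(appended))  # 4 rows in the new object
```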
> **Takeaway:** A large part of the value you can provide as a data scientist comes from connecting multiple, often disparate datasets to find new insights. Learning how to join and merge data is thus an essential part of your skill set.

View file

@ -0,0 +1,252 @@
# Project
> **Learning goal:** By the end of this Capstone, you should be familiar with some of the ways to visually explore the data stored in `DataFrame`s.
Often when probing a new data set, it is invaluable to get high-level information about what the dataset holds. Earlier in this section we discussed using methods such as `DataFrame.info`, `DataFrame.head`, and `DataFrame.tail` to examine some aspects of a `DataFrame`. While these methods are critical, they are on their own often insufficient to get enough information to know how to approach a new dataset. This is where exploratory statistics and visualizations for datasets come in.
To see what we mean in terms of gaining exploratory insight (both visually and numerically), let's dig into one of the datasets that come with the scikit-learn library, the Boston Housing Dataset (though you will load it from a CSV file):
```python
import pandas as pd
df = pd.read_csv('Data/housing_dataset.csv')
df.head()
```
This dataset contains information collected by the U.S. Census Bureau concerning housing in the area of Boston, Massachusetts, and was first published in 1978. The dataset has 13 columns:
- **CRIM**: Per-capita crime rate by town
- **ZN**: Proportion of residential land zoned for lots over 25,000 square feet
- **INDUS**: Proportion of non-retail business acres per town
- **CHAS**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- **NOX**: Nitric oxides concentration (parts per 10 million)
- **RM**: Average number of rooms per dwelling
- **AGE**: Proportion of owner-occupied units built prior to 1940
- **DIS**: Weighted distances to five Boston employment centres
- **RAD**: Index of accessibility to radial highways
- **TAX**: Full-value property-tax rate per \$10,000
- **PTRATIO**: Pupil-teacher ratio by town
- **LSTAT**: Percent of lower-status portion of the population
- **MEDV**: Median value of owner-occupied homes in \$1,000s
One of the first methods we can use to better understand this dataset is `DataFrame.shape`:
```python
df.shape
```
The dataset has 506 rows and 13 columns.
To get a better idea of the contents of each column we can use `DataFrame.describe`, which returns the maximum value, minimum value, mean, and standard deviation of the numeric values in each column, in addition to the quartiles for each column:
```python
df.describe()
```
Because datasets can have so many columns in them, it can often be useful to transpose the results of `DataFrame.describe` to make better use of them.
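For instance, a quick way to do that (assuming `df` is the housing DataFrame loaded above):
```python
# Transpose the summary so each column of df becomes a row of statistics
df.describe().T
```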
Note that you can also examine specific descriptive statistics for columns without having to invoke `DataFrame.describe`:
```python
df['MEDV'].mean()
```
```python
df['MEDV'].max()
```
```python
df['AGE'].median()
```
```python
# Now find the maximum value in df['AGE'].
```
Other information that you will often want to see is the relationship between different columns. You do this with the `DataFrame.groupby` method. For example, you could examine the average MEDV (median value of owner-occupied homes) for each value of AGE (proportion of owner-occupied units built prior to 1940):
```python
df.groupby(['AGE'])['MEDV'].mean()
```
```python
# Now try to find the median value for AGE for each value of MEDV.
```
You can also apply a lambda function to each element of a `DataFrame` column by using the `apply` method. For example, say you wanted to create a new column that flags a row if more than 50 percent of owner-occupied homes were built before 1940:
```python
df['AGE_50'] = df['AGE'].apply(lambda x: x>50)
```
Once applied, you also see how many values returned true and how many false by using the `value_counts` method:
```python
df['AGE_50'].value_counts()
```
You can also examine figures from the groupby statement you created earlier:
```python
df.groupby(['AGE_50'])['MEDV'].mean()
```
You can also group by more than one variable, such as AGE_50 (the one you just created), CHAS (whether a town is on the Charles River), and RAD (an index measuring access to the Boston-area radial highways), and then evaluate each group for the average median home price in that group:
```python
groupby_twovar=df.groupby(['AGE_50','RAD','CHAS'])['MEDV'].mean()
```
You can then see what values are in this stacked group of variables:
```python
groupby_twovar
```
Let's take a moment to analyze these results in a little depth. The first row reports that communities with less than half of their houses built before 1940, with a highway-access index of 1, and that are not situated on the Charles River have a mean house price of \$24,667 (in 1970s dollars); the next row shows that communities that are similar except for being located on the Charles River have a mean house price of \$50,000.
One insight that pops out from continuing down this list is that, all else being equal, being located next to the Charles River can significantly increase the value of newer housing stock. The story is more ambiguous for communities dominated by older houses: proximity to the Charles significantly increases home prices in one community (presumably one farther from the city); for all others, being situated on the river either provides a modest increase in value or actually decreases mean home prices.
While groupings like this can be a great way to begin to interrogate your data, you might not care for the 'tall' format it comes in. In that case, you can unstack the data into a "wide" format:
```python
groupby_twovar.unstack()
```
```python
# How could you use groupby to get a sense of the proportion
# of residential land zoned for lots over 25,000 sq.ft.,
# the proportion of non-retail business acres per town,
# and the distance of towns from employment centers in Boston?
```
It is also often valuable to know how many unique values a column has in it with the `nunique` method:
```python
df['CHAS'].nunique()
```
Complementary to that, you will also likely want to know what those unique values are, which is where the `unique` method helps:
```python
df['CHAS'].unique()
```
You can use the `value_counts` method to see how many of each unique value there are in a column:
```python
df['CHAS'].value_counts()
```
Or you can easily plot a bar graph to visually see the breakdown:
```python
%matplotlib inline
df['CHAS'].value_counts().plot(kind='bar')
```
Note that the IPython magic command `%matplotlib inline` enables you to view the chart inline.
Let's pull back to the dataset as a whole for a moment. Two major things that you will look for in almost any dataset are trends and relationships. A typical relationship between variables to explore is the Pearson correlation, or the extent to which two variables are linearly related. The `corr` method will show this in table format for all of the columns in a `DataFrame`:
```python
df.corr(method='pearson')
```
Suppose you just wanted to look at the correlations between all of the columns and one variable. Let's examine just the correlation between all other variables and the percentage of owner-occupied houses built before 1940 (AGE). We will do this by accessing the column by index number:
```python
corr = df.corr(method='pearson')
corr_with_homevalue = corr.iloc[-1]
corr_with_homevalue[corr_with_homevalue.argsort()[::-1]]
```
With the correlations arranged in descending order, it's easy to start to see some patterns. Correlating AGE with a variable we created from AGE is a trivial correlation. However, it is interesting to note that the percentage of older housing stock in communities strongly correlates with air pollution (NOX) and the proportion of non-retail business acres per town (INDUS); at least in 1978 metro Boston, older towns are more industrial.
Graphically, we can see the correlations using a heatmap from the Seaborn library:
```python
import seaborn as sns
sns.heatmap(df.corr(),cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
```
Histograms are another valuable tool for investigating your data. For example, what is the overall distribution of prices of owner-occupied houses in the Boston area?
```python
import matplotlib.pyplot as plt
plt.hist(df['MEDV'])
```
The default bin size for the matplotlib histogram (essentially, how wide a range of values each histogram bar covers) is pretty large and might mask smaller details. To get a finer-grained view of the MEDV column, you can manually increase the number of bins in the histogram:
```python
plt.hist(df['MEDV'],bins=50)
```
Seaborn has a somewhat more attractive version of the standard matplotlib histogram: the distribution plot. This is a combination histogram and kernel density estimate (KDE) plot (essentially a smoothed histogram):
```python
sns.distplot(df['MEDV'])
```
Another commonly used plot is the Seaborn jointplot, which combines histograms for two columns along with a scatterplot:
```python
sns.jointplot(df['RM'], df['MEDV'], kind='scatter')
```
Unfortunately, many of the dots print over each other. You can help address this by adding some alpha blending, a setting that makes the dots partially transparent so that concentrations of points drawn over one another become apparent:
```python
sns.jointplot(df['RM'], df['MEDV'], kind='scatter', alpha=0.3)
```
Another way to see patterns in your data is with a two-dimensional KDE plot. Darker colors here represent a higher concentration of data points:
```python
sns.kdeplot(df['RM'], df['MEDV'], shade=True)
```
Note that while the KDE plot is very good at showing concentrations of data points, finer structures like linear relationships (such as the clear relationship between the number of rooms in homes and the house price) are lost in the KDE plot.
Finally, the pairplot in Seaborn allows you to see scatterplots and histograms for several columns in one table. Here we have played with some of the keywords to produce a more sophisticated and easier to read pairplot that incorporates both alpha blending and linear regression lines for the scatterplots.
```python
sns.pairplot(df[['RM', 'AGE', 'LSTAT', 'DIS', 'MEDV']], kind="reg", plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})
```
Visualization is the start of the really cool, fun part of data science. So play around with these visualization tools and see what you can learn from the data!
> **Takeaway:** An old joke goes: “What does a data scientist see when they look at a dataset? A bunch of numbers.” There is more than a little truth in that joke. Visualization is often the key to finding patterns and correlations in your data. While visualization often cannot deliver precise results, it can point you in the right direction to ask better questions and efficiently find value in the data.

Binary data
Data Science 1_ Introduction to Python for Data Science/Azure Notebook Files/.DS_Store vendored Normal file

Binary file not shown.

Binary data
Data Science 1_ Introduction to Python for Data Science/Reference Material/.DS_Store vendored Normal file

Binary file not shown.

File diff suppressed because it is too large. Load Diff

File diff suppressed because it is too large. Load Diff

View file

@ -0,0 +1,746 @@
# Introduction to Pandas
Having explored NumPy, it is time to get to know the other workhorse of data science in Python: pandas. The pandas library does so much to make working with data (importing, cleaning, and organizing it) easier that it is hard to imagine doing data science in Python without it.
But it was not always this way. Wes McKinney developed the library out of necessity in 2008 while at AQR Capital Management in order to have a better tool for dealing with data analysis. The library has since taken off as an open-source software project that has become a mature and integral part of the data science ecosystem. (In fact, some examples in this section will be drawn from McKinney's book, *Python for Data Analysis*.)
The name 'pandas' actually has nothing to do with Chinese bears but rather comes from the term *panel data*, a form of multi-dimensional data involving measurements over time that comes out of the econometrics and statistics community. Ironically, while panel data is a usable data structure in pandas, it is not generally used today and we will not examine it in this course. Instead, we will focus on the two most widely used data structures in pandas: `Series` and `DataFrame`s.
## Reminders about importing and documentation
Just as you imported NumPy under the alias ``np``, we will import pandas under the alias ``pd``:
```python
import pandas as pd
```
As with the NumPy convention, `pd` is an important and widely used convention in the data science world; we will use it here and we advise you to use it in your own coding.
As we progress through Section 5, don't forget that IPython provides tab completion and function documentation with the ``?`` character. If you don't understand something about a function you see in this section, take a moment and read the documentation; it can help a great deal. As a reminder, to display the built-in pandas documentation, use this code:
```ipython
In [4]: pd?
```
Because it can be useful to think about `Series` and `DataFrame`s in pandas as extensions of `ndarray`s in NumPy, go ahead and also import NumPy; you will want it for some of the examples later on:
```python
import numpy as np
```
Now, on to pandas!
## Fundamental panda data structures
Both `Series` and `DataFrame`s are a lot like the `ndarray`s you encountered in the last section. They provide clean, efficient data storage and handling at the scales necessary for data science. What both of them provide that `ndarray`s lack, however, are essential data-science features like flexibility when dealing with missing data and the ability to label data. These capabilities (along with others) help make `Series` and `DataFrame`s essential to the "data munging" that makes up so much of data science.
### `Series` objects in pandas
A pandas `Series` is a lot like an `ndarray` in NumPy: a one-dimensional array of indexed data.
You can create a simple Series from an array of data like this:
```python
series_example = pd.Series([-0.5, 0.75, 1.0, -2])
series_example
```
Similar to an `ndarray`, a `Series` upcasts entries to be of the same type of data (that `-2` integer in the original array became a `-2.00` float in the `Series`).
What is different from an `ndarray` is that the ``Series`` automatically wraps both a sequence of values and a sequence of indices. These are two separate objects within the `Series` object that you can access with the ``values`` and ``index`` attributes.
Try accessing the ``values`` first; they are just a familiar NumPy array:
```python
series_example.values
```
The ``index`` is also an array-like object:
```python
series_example.index
```
Just as with `ndarray`s, you can access specific data elements in a `Series` via the familiar Python square-bracket index notation and slicing:
```python
series_example[1]
```
```python
series_example[1:3]
```
Despite a lot of similarities, pandas `Series` have an important distinction from NumPy `ndarrays`: whereas `ndarrays` have *implicitly defined* integer indices (as do Python lists), pandas `Series` have *explicitly defined* indices. The best part is that you can set the index:
```python
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
series_example2
```
These explicit indices work exactly the way you would expect them to:
```python
series_example2['b']
```
### Exercise:
```python
# Do explicit Series indices work *exactly* the way you might expect?
# Try slicing series_example2 using its explicit index and find out.
```
With explicit indices in the mix, a `Series` is basically a fixed-length, ordered dictionary in that it maps arbitrarily typed index values to arbitrarily typed data values. But like `ndarray`s, these data are all of the same type, which is important. Just as the type-specific compiled code behind `ndarray` makes it more efficient than a Python list for certain operations, the type information of a pandas ``Series`` makes it much more efficient than a Python dictionary for certain operations.
But the connection between `Series` and dictionaries is nevertheless very real: you can construct a ``Series`` object directly from a Python dictionary:
```python
population_dict = {'France': 65429495,
'Germany': 82408706,
'Russia': 143910127,
'Japan': 126922333}
population = pd.Series(population_dict)
population
```
Did you see what happened there? The keys `Russia` and `Japan` switched places between the order in which they were entered in `population_dict` and the order in which they ended up in the `population` `Series` object. While Python dictionary keys have no guaranteed order, `Series` keys are ordered.
So, at one level, you can interact with `Series` as you would with dictionaries:
```python
population['Russia']
```
But you can also do powerful array-like operations with `Series` like slicing:
```python
# Try slicing on the population Series on your own.
# Would slicing be possible if Series keys were not ordered?
```
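One possible slice, shown here as a sketch (with an explicit index, both endpoint labels are included):
```python
# Label-based slicing on the explicit index; both endpoints are included
population['Germany':'Russia']
```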
You can also add elements to a `Series` the way that you would to an `ndarray`. Try it in the code cell below:
```python
# Try running population['Albania'] = 2937590 (or another country of your choice)
# What order do the keys appear in when you run population? Is it what you expected?
```
Another useful `Series` feature (and definitely a difference from dictionaries) is that a `Series` automatically aligns differently indexed data in arithmetic operations:
```python
pop2 = pd.Series({'Spain': 46432074, 'France': 102321, 'Albania': 50532})
population + pop2
```
Notice that in the case of Germany, Japan, Russia, and Spain (and Albania, depending on what you did in the previous exercise), the addition operation produced `NaN` (not a number) values. pandas does not treat missing values as `0`, but as `NaN` (and it can be helpful to think of arithmetic operations involving `NaN` as essentially `NaN + x = NaN`).
### `DataFrame` object in pandas
The other crucial data structure in pandas to get to know for data science is the `DataFrame`.
Like the ``Series`` object, ``DataFrame``s can be thought of either as generalizations of `ndarray`s (or as specializations of Python dictionaries).
Just as a ``Series`` is like a one-dimensional array with flexible indices, a ``DataFrame`` is like a two-dimensional array with both flexible row indices and flexible column names. Essentially, a `DataFrame` represents a rectangular table of data and contains an ordered collection of labeled columns, each of which can be a different value type (`string`, `int`, `float`, etc.).
The DataFrame has both a row and column index; in this way you can think of it as a dictionary of `Series`, all of which share the same index.
Let's take a look at how this works in practice. We will start by creating a `Series` called `area`:
```python
area_dict = {'Albania': 28748,
'France': 643801,
'Germany': 357386,
'Japan': 377972,
'Russia': 17125200}
area = pd.Series(area_dict)
area
```
Now you can combine this with the `population` `Series` you created earlier by using a dictionary to construct a single two-dimensional table containing data from both `Series`:
```python
countries = pd.DataFrame({'Population': population, 'Area': area})
countries
```
As with `Series`, note that `DataFrame`s also automatically order indices (in this case, the column indices `Area` and `Population`).
So far we have combined dictionaries together to compose a `DataFrame` (which has given our `DataFrame` a row-centric feel), but you can also create `DataFrame`s in a column-wise fashion. Consider adding a `Capital` column using our reliable old array-analog, a list:
```python
countries['Capital'] = ['Tirana', 'Paris', 'Berlin', 'Tokyo', 'Moscow']
countries
```
As with `Series`, even though initial indices are ordered in `DataFrame`s, subsequent additions to a `DataFrame` stay in the order added. However, you can explicitly change the order of `DataFrame` column indices this way:
```python
countries = countries[['Capital', 'Area', 'Population']]
countries
```
Commonly in a data science context, it is necessary to generate new columns of data from existing data sets. Because `DataFrame` columns behave like `Series`, you can do this by performing operations on them as you would with `Series`:
```python
countries['Population Density'] = countries['Population'] / countries['Area']
countries
```
Note: don't worry if IPython gives you a warning over this. The warning is IPython trying to be a little too helpful. The new column you created is an actual part of the `DataFrame` and not a copy of a slice.
We have stated before that `DataFrame`s are like dictionaries, and it's true. You can retrieve the contents of a column just as you would the value for a specific key in an ordinary dictionary:
```python
countries['Area']
```
What about using the row indices?
```python
# Now try accessing row data with a command like countries['Japan']
```
This returns an error: `DataFrame`s are dictionaries of `Series`, which are the columns. `DataFrame` rows often have heterogeneous data types, so different methods are necessary to access row data. For that, we use the `.loc` method:
```python
countries.loc['Japan']
```
Note that what `.loc` returns is an indexed object in its own right and you can access elements within it using familiar index syntax:
```python
countries.loc['Japan']['Area']
```
```python
# Can you think of a way to return the area of Japan without using .loc?
# Hint: Try putting the column index first.
# Can you slice along these indices as well?
```
Sometimes it is helpful in data science projects to add a column to a `DataFrame` without assigning values to it:
```python
countries['Debt-to-GDP Ratio'] = np.nan
countries
```
Again, you can disregard the warning (if it triggers) about adding the column this way.
You can also add columns to a `DataFrame` that do not have the same number of rows as the `DataFrame`:
```python
debt = pd.Series([0.19, 2.36], index=['Russia', 'Japan'])
countries['Debt-to-GDP Ratio'] = debt
countries
```
You can use the `del` command to delete a column from a `DataFrame`:
```python
del countries['Capital']
countries
```
In addition to their dictionary-like behavior, `DataFrames` also behave like two-dimensional arrays. For example, it can be useful at times when working with a `DataFrame` to transpose it:
```python
countries.T
```
Again, note that `DataFrame` columns are `Series` and thus the data types must be consistent, hence the upcasting to floating-point numbers. **If there had been strings in this `DataFrame`, everything would have been upcast to strings.** Use caution when transposing `DataFrame`s.
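To see that caveat concretely, here is a quick sketch using a throwaway `DataFrame` (the `mixed` name and its values are invented for illustration, not part of the running example):
```python
mixed = pd.DataFrame({'city': ['Paris', 'Berlin'],
                      'population_millions': [2.1, 3.6]})
print(mixed.dtypes)    # city is object (strings); population_millions is float64
print(mixed.T.dtypes)  # after transposing, every column becomes dtype object
```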
#### From a two-dimensional NumPy array
Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
If omitted, an integer index will be used for each:
```python
pd.DataFrame(np.random.rand(3, 2),
columns=['foo', 'bar'],
index=['a', 'b', 'c'])
```
## Manipulating data in pandas
A huge part of data science is manipulating data in order to analyze it. (One rule of thumb is that 80% of any data science project will be concerned with cleaning and organizing the data for the project.) So it makes sense to learn the tools that pandas provides for handling data in `Series` and especially `DataFrame`s. Because both of those data structures are ordered, let's first start by taking a closer look at what gives them their structure: the `Index`.
### Index objects in pandas
Both ``Series`` and ``DataFrame``s in pandas have explicit indices that enable you to reference and modify data in them. These indices are actually objects themselves. The ``Index`` object can be thought of as both an immutable array and a fixed-size set.
It's worth the time to get to know the properties of the `Index` object. Let's return to an example from earlier in the section to examine these properties.
```python
series_example = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
ind = series_example.index
ind
```
The ``Index`` works a lot like an array. We have already seen how to use standard Python indexing notation to retrieve values or slices:
```python
ind[1]
```
```python
ind[::2]
```
But ``Index`` objects are immutable; they cannot be modified via the normal means:
```python
ind[1] = 0
```
This immutability is a good thing: it makes it safer to share indices between multiple ``Series`` or ``DataFrame``s without the potential for problems arising from inadvertent index modification.
In addition to being array-like, an `Index` also behaves like a fixed-size set, including following many of the conventions used by Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way. Let's play around with this to see it in action.
```python
ind_odd = pd.Index([1, 3, 5, 7, 9])
ind_prime = pd.Index([2, 3, 5, 7, 11])
```
In the code cell below, try out the intersection (`ind_odd & ind_prime`), union (`ind_odd | ind_prime`), and the symmetric difference (`ind_odd ^ ind_prime`) of `ind_odd` and `ind_prime`.
```python
```
These operations may also be accessed via object methods, for example ``ind_odd.intersection(ind_prime)``. Below is a table listing some useful `Index` methods and properties.
| **Method** | **Description** |
|:---------------|:------------------------------------------------------------------------------------------|
| [`append`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) | Concatenate with additional `Index` objects, producing a new `Index` |
| [`diff`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html) | Compute set difference as an Index |
| [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) | Compute new `Index` by deleting passed values |
| [`insert`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html) | Compute new `Index` by inserting element at index `i` |
| [`is_monotonic`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.is_monotonic.html) | Returns `True` if each element is greater than or equal to the previous element |
| [`is_unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.is_unique.html) | Returns `True` if the Index has no duplicate values |
| [`isin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) | Compute boolean array indicating whether each value is contained in the passed collection |
| [`unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) | Compute the array of unique values in order of appearance |
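Here is a brief sketch of a few of these in action, reusing `ind_odd` and `ind_prime` from above (illustrative only, not an exhaustive tour):
```python
ind_odd.isin([3, 4, 5])              # Boolean array: True where 3 and 5 appear in ind_odd
pd.Index([2, 2, 3, 3, 5]).unique()   # unique values in order of appearance
ind_prime.append(ind_odd)            # concatenated Index; duplicates are kept
```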
### Data Selection in Series
As a refresher, a ``Series`` object acts in many ways like both a one-dimensional `ndarray` and a standard Python dictionary.
Like a dictionary, the ``Series`` object provides a mapping from a collection of arbitrary keys to a collection of arbitrary values. Back to an old example:
```python
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
series_example2
```
```python
series_example2['b']
```
You can also examine the keys/indices and values using dictionary-like Python tools:
```python
'a' in series_example2
```
```python
series_example2.keys()
```
```python
list(series_example2.items())
```
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:
```python
series_example2['e'] = 1.25
series_example2
```
#### Series as one-dimensional array
Because ``Series`` also provide array-style functionality, you can use the NumPy techniques we looked at in Section 3 like slices, masking, and fancy indexing:
```python
# Slicing using the explicit index
series_example2['a':'c']
```
```python
# Slicing using the implicit integer index
series_example2[0:2]
```
```python
# Masking
series_example2[(series_example2 > -1) & (series_example2 < 0.8)]
```
```python
# Fancy indexing
series_example2[['a', 'e']]
```
One note to avoid confusion. When slicing with an explicit index (i.e., ``series_example2['a':'c']``), the final index is **included** in the slice; when slicing with an implicit index (i.e., ``series_example2[0:2]``), the final index is **excluded** from the slice.
#### Indexers: `loc` and `iloc`
A great thing about pandas is that you can use a lot of different things for your explicit indices. A potentially confusing thing about pandas is that you can use a lot of different things for your explicit indices, including integers. To avoid confusion between integer indices that you might supply and the implicit integer indices that pandas generates, pandas provides special *indexer* attributes that explicitly expose certain indexing schemes.
(A technical note: These are not functional methods; they are attributes that expose a particular slicing interface to the data in the ``Series``.)
The ``loc`` attribute allows indexing and slicing that always references the explicit index:
```python
series_example2.loc['a']
```
```python
series_example2.loc['a':'c']
```
The ``iloc`` attribute enables indexing and slicing using the implicit, Python-style index:
```python
series_example2.iloc[0]
```
```python
series_example2.iloc[0:2]
```
A guiding principle of the Python language is the idea that "explicit is better than implicit." Professional code will generally use explicit indexing with ``loc`` and ``iloc`` and you should as well in order to make your code clean and readable.
### Data selection in DataFrames
``DataFrame``s also exhibit dual behavior, acting both like a two-dimensional `ndarray` and like a dictionary of ``Series`` sharing the same index.
#### DataFrame as dictionary of Series
Let's return to our earlier example of countries' areas and populations in order to examine `DataFrame`s as a dictionary of `Series`.
```python
area = pd.Series({'Albania': 28748,
'France': 643801,
'Germany': 357386,
'Japan': 377972,
'Russia': 17125200})
population = pd.Series({'Albania': 2937590,
'France': 65429495,
'Germany': 82408706,
'Russia': 143910127,
'Japan': 126922333})
countries = pd.DataFrame({'Area': area, 'Population': population})
countries
```
You can access the individual ``Series`` that make up the columns of a ``DataFrame`` via dictionary-style indexing of the column name:
```python
countries['Area']
```
And dictionary-style syntax can also be used to modify `DataFrame`s, such as by adding a new column:
```python
countries['Population Density'] = countries['Population'] / countries['Area']
countries
```
#### DataFrame as two-dimensional array
You can also think of ``DataFrame``s as two-dimensional arrays. You can examine the raw data in the `DataFrame`/data array using the ``values`` attribute:
```python
countries.values
```
Viewed this way, it makes sense that we can transpose the rows and columns of a `DataFrame` the same way we would an array:
```python
countries.T
```
`DataFrame`s also use the ``loc`` and ``iloc`` indexers. With ``iloc``, you can index the underlying array as if it were an `ndarray`, but with the ``DataFrame`` index and column labels maintained in the result:
```python
countries.iloc[:3, :2]
```
``loc`` also permits array-like slicing but using the explicit index and column names:
```python
countries.loc[:'Germany', :'Population']
```
You can also use array-like techniques such as masking and fancy indexing with `loc`.
```python
# Can you think of how to combine masking and fancy indexing in one line?
# Your masking could be something like countries['Population Density'] > 200
# Your fancy indexing could be something like ['Population', 'Population Density']
# Be sure to put the masking and fancy indexing inside the square brackets: countries.loc[]
```
#### Indexing conventions
In practice in the world of data science (and pandas more generally), *indexing* refers to columns while *slicing* refers to rows:
```python
countries['France':'Japan']
```
Such slices can also refer to rows by number rather than by index:
```python
countries[1:3]
```
Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
```python
countries[countries['Population Density'] > 200]
```
These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.
# Operating on Data in Pandas
As you begin to work in data science, operating on data is imperative. It is the very heart of data science. Another aspect of pandas that makes it a compelling tool for many data scientists is its capability to perform efficient element-wise operations on data. pandas builds on ufuncs from NumPy to supply these capabilities and then extends them to provide additional power for data manipulation:
- For unary operations (such as negation and trigonometric functions), ufuncs in pandas **preserve index and column labels** in the output.
- For binary operations (such as addition and multiplication), pandas automatically **aligns indices** when passing objects to ufuncs.
These critical features of ufuncs in pandas mean that data retains its context when operated on and, more important still, that errors are drastically reduced when you combine data from multiple sources.
## Index Preservation
pandas is explicitly designed to work with NumPy. As a result, all NumPy ufuncs will work on pandas ``Series`` and ``DataFrame`` objects.
We can see this more clearly if we create a simple ``Series`` and ``DataFrame`` of random numbers on which to operate.
```python
rng = np.random.RandomState(42)
ser_example = pd.Series(rng.randint(0, 10, 4))
ser_example
```
Did you notice the NumPy function we used with the variable `rng`? By specifying a seed for the random-number generator, you get the same result each time. This can be a useful trick when you need to produce pseudo-random output that also needs to be reproducible by others. (Go ahead and re-run the code cell above a couple of times to convince yourself that it produces the same output each time.)
```python
df_example = pd.DataFrame(rng.randint(0, 10, (3, 4)),
columns=['A', 'B', 'C', 'D'])
df_example
```
Let's apply a ufunc to our example `Series`:
```python
np.exp(ser_example)
```
The same thing happens with a slightly more complex operation on our example `DataFrame`:
```python
np.cos(df_example * np.pi / 4)
```
Note that you can use all of the ufuncs we discussed in Section 3 the same way.
## Index alignment
As mentioned above, when you perform a binary operation on two ``Series`` or ``DataFrame`` objects, pandas will align indices in the process of performing the operation. This is essential when working with incomplete data (and data is usually incomplete), but it is helpful to see this in action to better understand it.
### Index alignment with Series
For our first example, suppose we are combining two different data sources and find only the top five countries by *area* and the top five countries by *population*:
```python
area = pd.Series({'Russia': 17075400, 'Canada': 9984670,
'USA': 9826675, 'China': 9598094,
'Brazil': 8514877}, name='area')
population = pd.Series({'China': 1409517397, 'India': 1339180127,
'USA': 324459463, 'Indonesia': 322179605,
'Brazil': 207652865}, name='population')
```
```python
# Now divide these to compute the population density
```
Your resulting array contains the **union** of indices of the two input arrays: seven countries in total. All of the countries in the array without an entry (because they lacked either area data or population data) are marked with the now familiar ``NaN``, or "Not a Number," designation.
Index matching works the same way for built-in Python arithmetic expressions; any missing values are filled in with `NaN`. You can see this clearly by adding two `Series` that are slightly misaligned in their indices:
```python
series1 = pd.Series([2, 4, 6], index=[0, 1, 2])
series2 = pd.Series([3, 5, 7], index=[1, 2, 3])
series1 + series2
```
`NaN` values are not always convenient to work with; `NaN` combined with any other value results in `NaN`, which can be a pain, particularly if you are combining multiple data sources with missing values. To help with this, pandas allows you to specify a default value to use for missing entries in the operation. For example, calling `series1.add(series2)` is equivalent to calling `series1 + series2`, but with the method you can supply a fill value:
```python
series1.add(series2, fill_value=0)
```
Much better!
### Index alignment with DataFrames
The same kind of alignment takes place in both dimensions (columns and indices) when you perform operations on ``DataFrame``s.
```python
df1 = pd.DataFrame(rng.randint(0, 20, (2, 2)),
columns=list('AB'))
df1
```
```python
df2 = pd.DataFrame(rng.randint(0, 10, (3, 3)),
columns=list('BAC'))
df2
```
```python
# Add df1 and df2. Is the output what you expected?
```
Even though we passed the columns in a different order in `df2` than in `df1`, the indices were aligned correctly and sorted in the resulting union of columns.
You can also use fill values for missing values with `DataFrame`s. In this example, let's fill the missing values with the mean of all values in `df1` (computed by first stacking the rows of `df1`):
```python
fill = df1.stack().mean()
df1.add(df2, fill_value=fill)
```
This table lists Python operators and their equivalent pandas object methods:
| Python Operator | Pandas Method(s) |
|-----------------|---------------------------------------|
| ``+`` | ``add()`` |
| ``-`` | ``sub()``, ``subtract()`` |
| ``*`` | ``mul()``, ``multiply()`` |
| ``/`` | ``truediv()``, ``div()``, ``divide()``|
| ``//`` | ``floordiv()`` |
| ``%`` | ``mod()`` |
| ``**`` | ``pow()`` |
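The method forms matter because they accept arguments that the bare operators cannot, such as `fill_value`. A minimal sketch reusing the `series1` and `series2` defined earlier:
```python
series1 * series2                    # plain operator: NaN wherever the indices don't overlap
series1.mul(series2, fill_value=1)   # method form: missing entries are treated as 1
```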
## Operations between DataFrames and Series
Index and column alignment gets maintained in operations between a `DataFrame` and a `Series` as well. To see this, consider a common operation in data science, wherein we find the difference of a `DataFrame` and one of its rows. Because pandas inherits ufuncs from NumPy, pandas will compute the difference row-wise by default:
```python
df3 = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('WXYZ'))
df3
```
```python
df3 - df3.iloc[0]
```
But what if you need to operate column-wise? You can do this by using object methods and specifying the ``axis`` keyword.
```python
df3.subtract(df3['X'], axis=0)
```
And when you do operations between `DataFrame`s and `Series`, you still get automatic index alignment:
```python
halfrow = df3.iloc[0, ::2]
halfrow
```
Note that the output from that operation was transposed. That was so that we can subtract it from the `DataFrame`:
```python
df3 - halfrow
```
Remember, pandas preserves and aligns indices and columns so that data keeps its context. This will be of huge help to you in our next section when we look at data cleaning and preparation.

# Manipulating and Cleaning Data
This section marks a subtle change. Up until now, we have been introducing ideas and techniques in order to prepare you with a toolbox of techniques to deal with real-world situations. We are now going to start using some of those tools while also giving you some ideas about how and when to use them in your own work with data.
Real-world data is messy. You will likely need to combine several data sources to get the data you actually want. The data from those sources will be incomplete. And it will likely not be formatted in exactly the way you want in order to perform your analysis. It's for these reasons that most data scientists will tell you that about 80 percent of any project is spent just getting the data into a form ready for analysis.
## Exploring `DataFrame` information
> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.
Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. However, if the data set in your `DataFrame` has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? Fortunately, pandas provides some convenient tools to quickly look at overall information about a `DataFrame` in addition to the first few and last few rows.
In order to explore this functionality, we will import the Python scikit-learn library and use an iconic dataset that every data scientist has seen hundreds of times: British biologist Ronald Fisher's *Iris* data set used in his 1936 paper "The use of multiple measurements in taxonomic problems":
```python
import pandas as pd
from sklearn.datasets import load_iris
iris = load_iris()
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
```
### `DataFrame.info`
Let's take a look at this dataset to see what we have:
```python
iris_df.info()
```
From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers.
### `DataFrame.head`
Next, let's see what the first few rows of our `DataFrame` look like:
```python
iris_df.head()
```
### Exercise:
By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?
```python
# Hint: Consult the documentation by using iris_df.head?
```
### `DataFrame.tail`
The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:
```python
iris_df.tail()
```
In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.
> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with.
## Dealing with missing data
> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.
Most of the time, the datasets you want to use (or have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.
Pandas handles missing values in two ways. The first you've seen before in previous sections: `NaN`, or Not a Number. This is actually a special value that is part of the IEEE floating-point specification, and it is only used to indicate missing floating-point values.
For missing values apart from floats, pandas uses the Python `None` object. While it might seem confusing that you will encounter two different kinds of values that say essentially the same thing, there are sound programmatic reasons for this design choice and, in practice, going this route enables pandas to deliver a good compromise for the vast majority of cases. Notwithstanding this, both `None` and `NaN` carry restrictions that you need to be mindful of with regards to how they can be used.
### `None`: non-float missing data
Because `None` comes from Python, it cannot be used in NumPy and pandas arrays that are not of data type `'object'`. Remember, NumPy arrays (and the data structures in pandas) can contain only one type of data. This is what gives them their tremendous power for large-scale data and computational work, but it also limits their flexibility. Such arrays have to upcast to the “lowest common denominator,” the data type that will encompass everything in the array. When `None` is in the array, it means you are working with Python objects.
To see this in action, consider the following example array (note the `dtype` for it):
```python
import numpy as np
example1 = np.array([2, None, 6, 8])
example1
```
The reality of upcast data types carries two side effects with it. First, operations will be carried out at the level of interpreted Python code rather than compiled NumPy code. Essentially, this means that any operations involving `Series` or `DataFrames` with `None` in them will be slower. While you would probably not notice this performance hit, for large datasets it might become an issue.
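If you want to see the cost of that fallback for yourself, a rough, illustrative timing comparison might look like the following (exact numbers will vary by machine):
```python
# Summing a native integer array uses compiled NumPy code;
# summing an object-dtype array falls back to slower, interpreted Python iteration.
%timeit np.arange(1000000, dtype=int).sum()
%timeit np.arange(1000000, dtype=object).sum()
```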
The second side effect stems from the first. Because `None` essentially drags `Series` or `DataFrame`s back into the world of vanilla Python, using NumPy/pandas aggregations like `sum()` or `min()` on arrays that contain a ``None`` value will generally produce an error:
```python
example1.sum()
```
**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them.
### `NaN`: missing float values
In contrast to `None`, NumPy (and therefore pandas) supports `NaN` for its fast, vectorized operations and ufuncs. The bad news is that any arithmetic performed on `NaN` always results in `NaN`. For example:
```python
np.nan + 1
```
```python
np.nan * 0
```
The good news: aggregations run on arrays with `NaN` in them don't pop errors. The bad news: the results are not uniformly useful:
```python
example2 = np.array([2, np.nan, 6, 8])
example2.sum(), example2.min(), example2.max()
```
### Exercise:
```python
# What happens if you add np.nan and None together?
```
Remember: `NaN` is just for missing floating-point values; there is no `NaN` equivalent for integers, strings, or Booleans.
### `NaN` and `None`: null values in pandas
Even though `NaN` and `None` can behave somewhat differently, pandas is nevertheless built to handle them interchangeably. To see what we mean, consider a `Series` of integers:
```python
int_series = pd.Series([1, 2, 3], dtype=int)
int_series
```
### Exercise:
```python
# Now set an element of int_series equal to None.
# How does that element show up in the Series?
# What is the dtype of the Series?
```
In the process of upcasting data types to establish data homogeneity in `Series` and `DataFrame`s, pandas will willingly switch missing values between `None` and `NaN`. Because of this design feature, it can be helpful to think of `None` and `NaN` as two different flavors of "null" in pandas. Indeed, some of the core methods you will use to deal with missing values in pandas reflect this idea in their names:
- `isnull()`: Generates a Boolean mask indicating missing values
- `notnull()`: Opposite of `isnull()`
- `dropna()`: Returns a filtered version of the data
- `fillna()`: Returns a copy of the data with missing values filled or imputed
These are important methods to master and get comfortable with, so let's go over them each in some depth.
### Detecting null values
Both `isnull()` and `notnull()` are your primary methods for detecting null data. Both return Boolean masks over your data.
```python
example3 = pd.Series([0, np.nan, '', None])
```
```python
example3.isnull()
```
Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's nevertheless a perfectly good integer and pandas treats it as such. `''` is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.
Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values.
### Exercise:
```python
# Try running example3[example3.notnull()].
# Before you do so, what do you expect to see?
```
**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data.
### Dropping null values
Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) To see this in action, let's return to `example3`:
```python
example3 = example3.dropna()
example3
```
Note that this should look like your output from `example3[example3.notnull()]`. The difference here is that, rather than just indexing on the masked values, `dropna` has removed those missing values from the `Series` `example3`.
Because `DataFrame`s have two dimensions, they afford more options for dropping data.
```python
example4 = pd.DataFrame([[1, np.nan, 7],
[2, 5, 8],
[np.nan, 6, 9]])
example4
```
(Did you notice that pandas upcast two of the columns to floats to accommodate the `NaN`s?)
You cannot drop a single value from a `DataFrame`, so you have to drop full rows or columns. Depending on what you are doing, you might want to do one or the other, and so pandas gives you options for both. Because in data science, columns generally represent variables and rows represent observations, you are more likely to drop rows of data; the default setting for `dropna()` is to drop all rows that contain any null values:
```python
example4.dropna()
```
If necessary, you can drop NA values from columns. Use `axis=1` to do so:
```python
example4.dropna(axis='columns')
```
Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those settings in `dropna` with the `how` and `thresh` parameters.
By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action.
```python
example4[3] = np.nan
example4
```
### Exercise:
```python
# How might you go about dropping just column 3?
# Hint: remember that you will need to supply both the axis parameter and the how parameter.
```
The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept:
```python
example4.dropna(axis='rows', thresh=3)
```
Here, the first and last rows have been dropped because they contain only two non-null values each.
### Filling null values
Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice.
```python
example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
example5
```
You can fill all of the null entries with a single value, such as `0`:
```python
example5.fillna(0)
```
### Exercise:
```python
# What happens if you try to fill null values with a string, like ''?
```
You can **forward-fill** null values, which is to use the last valid value to fill a null:
```python
example5.fillna(method='ffill')
```
You can also **back-fill** to propagate the next valid value backward to fill a null:
```python
example5.fillna(method='bfill')
```
As you might guess, this works the same with `DataFrame`s, but you can also specify an `axis` along which to fill null values:
```python
example4
```
```python
example4.fillna(method='ffill', axis=1)
```
Notice that when a previous value is not available for forward-filling, the null value remains.
### Exercise:
```python
# What output does example4.fillna(method='bfill', axis=1) produce?
# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?
# Can you think of a longer code snippet to write that can fill all of the null values in example4?
```
You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values with the average of all of the values in the `DataFrame`:
```python
example4.fillna(example4.mean())
```
Notice that column 3 is still valueless: `example4.mean()` computes a mean for each column, and because column 3 contains only null values, its mean is itself `NaN`, so there is nothing valid to fill with.
> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.
## Removing duplicate data
> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.
In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries.
### Identifying duplicates: `duplicated`
You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an earlier one. Let's create another example `DataFrame` to see this in action.
```python
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
'numbers': [1, 2, 1, 3, 3]})
example6
```
```python
example6.duplicated()
```
### Dropping duplicates: `drop_duplicates`
`drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are `False`:
```python
example6.drop_duplicates()
```
Both `duplicated` and `drop_duplicates` default to considering all columns, but you can specify that they examine only a subset of columns in your `DataFrame`:
```python
example6.drop_duplicates(['letters'])
```
> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you spurious results!
## Combining datasets: merge and join
> **Learning goal:** By the end of this subsection, you should have a general knowledge of the various ways to combine `DataFrame`s.
Your most interesting analyses will often come from data melded together from more than one source. Because of this, pandas provides several methods of merging and joining datasets to make this necessary job easier:
- **`pandas.merge`** connects rows in `DataFrame`s based on one or more keys.
- **`pandas.concat`** concatenates or “stacks” together objects along an axis.
- The **`combine_first`** instance method enables you to splice together overlapping data to fill in missing values in one object with values from another (see the short sketch after this list).
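Because `combine_first` is not demonstrated elsewhere in this section, here is a minimal sketch of what it does (the `s1` and `s2` names and values are invented for illustration):
```python
s1 = pd.Series([1.0, np.nan, 3.0], index=['a', 'b', 'c'])
s2 = pd.Series([10.0, 20.0, 30.0, 40.0], index=['a', 'b', 'c', 'd'])
s1.combine_first(s2)  # keeps s1 where it has data, fills gaps from s2: a=1.0, b=20.0, c=3.0, d=40.0
```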
Let's examine merging data first, because it will be the most familiar to course attendees who are already familiar with SQL or other relational databases.
### Categories of joins
`merge` carries out several types of joins: *one-to-one*, *many-to-one*, and *many-to-many*. You use the same basic function call to implement all of them, and we will examine all three (because you will need all three at some point in your data delving, depending on the data). We will start with one-to-one joins because they are generally the simplest example.
#### One-to-one joins
Consider combining two `DataFrame`s that contain different information on the same employees in a company:
```python
df1 = pd.DataFrame({'employee': ['Gary', 'Stu', 'Mary', 'Sue'],
'group': ['Accounting', 'Marketing', 'Marketing', 'HR']})
df1
```
```python
df2 = pd.DataFrame({'employee': ['Mary', 'Stu', 'Gary', 'Sue'],
'hire_date': [2008, 2012, 2017, 2018]})
df2
```
Combine this information into a single `DataFrame` using the `merge` function:
```python
df3 = pd.merge(df1, df2)
df3
```
Pandas joined on the `employee` column because it was the only column common to both `df1` and `df2`. (Note also that the original indices of `df1` and `df2` were discarded by `merge`; this is generally the case with merges unless you conduct them by index, which we will discuss later on.)
#### Many-to-one joins
A many-to-one join is like a one-to-one join except that one of the two key columns contains duplicate entries. The `DataFrame` resulting from such a join will preserve those duplicate entries as appropriate:
```python
df4 = pd.DataFrame({'group': ['Accounting', 'Marketing', 'HR'],
'supervisor': ['Carlos', 'Giada', 'Stephanie']})
df4
```
```python
pd.merge(df3, df4)
```
The resulting `DataFrame` has an additional column for `supervisor`; that column has an extra occurrence of 'Giada' compared to `df4` because more than one employee in the merged `DataFrame` works in the 'Marketing' group.
Note that we didn't specify which column to join on. When you don't specify that information, `merge` uses the overlapping column names as the keys. However, that can be ambiguous; several columns might meet that condition. For that reason, it is a good practice to explicitly specify which key to join on. You can do this with the `on` parameter:
```python
pd.merge(df3, df4, on='group')
```
#### Many-to-many joins
What happens if the key columns in both of the `DataFrame`s you are joining contain duplicates? That gives you a many-to-many join:
```python
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
'Marketing', 'Marketing', 'HR', 'HR'],
'core_skills': ['math', 'spreadsheets', 'writing', 'communication',
'spreadsheets', 'organization']})
df5
```
```python
pd.merge(df1, df5, on='group')
```
Again, in order to avoid ambiguity as to which column to join on, it is a good idea to explicitly tell `merge` which one to use with the `on` parameter.
#### `left_on` and `right_on` keywords
What if you need to merge two datasets with no shared column names? For example, what if you are using a dataset in which the employee name is labeled as 'name' rather than 'employee'? In such cases, you will need to use the `left_on` and `right_on` keywords in order to specify the column names on which to join:
```python
df6 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
'salary': [70000, 80000, 120000, 90000]})
df6
```
```python
pd.merge(df1, df6, left_on="employee", right_on="name")
```
### Exercise:
```python
# Using the documentation, can you figure out how to use .drop() to get rid of the 'name' column?
# Hint: You will need to supply two parameters to .drop()
```
#### `left_index` and `right_index` keywords
Sometimes it can be more advantageous to merge on an index rather than on a column. The `left_index` and `right_index` keywords make it possible to join by index. Let's revisit some of our earlier example `DataFrame`s to see what this looks like in action.
```python
df1a = df1.set_index('employee')
df1a
```
```python
df2a = df2.set_index('employee')
df2a
```
To merge on the index, specify the `left_index` and `right_index` parameters in `merge`:
```python
pd.merge(df1a, df2a, left_index=True, right_index=True)
```
### Exercise:
```python
# What happens if you specify only left_index or right_index?
```
You can also use the `join` method for `DataFrame`s, which produces the same effect but merges on indices by default:
```python
df1a.join(df2a)
```
You can also mix and match `left_index`/`right_index` with `right_on`/`left_on`:
```python
pd.merge(df1a, df6, left_index=True, right_on='name')
```
#### Set arithmetic for joins
Let's return to many-to-many joins for a moment. A consideration that is unique to them is the *arithmetic* of the join, specifically the set arithmetic we use for the join. To illustrate what we mean by this, let's restructure an old example `DataFrame`:
```python
df5 = pd.DataFrame({'group': ['Engineering', 'Marketing', 'Sales'],
'core_skills': ['math', 'writing', 'communication']})
df5
```
```python
pd.merge(df1, df5, on='group')
```
Notice that after we have restructured `df5` and then re-run the merge with `df1`, we have only two entries in the result. This is because we merged on `group` and 'Marketing' was the only entry that appeared in the `group` column of both `DataFrame`s.
In effect, what we have gotten is the *intersection* of both `DataFrame`s. This is known as the inner join in the database world, and it is the default setting for `merge`, although we can certainly specify it:
```python
pd.merge(df1, df5, on='group', how='inner')
```
The complement of the inner join is the outer join, which returns the *union* of the two `DataFrame`s.
### Exercise:
```python
# The keyword for performing an outer join is how='outer'. How would you perform it?
# What do you expect the output of an outer join of df1 and df5 to be?
```
Notice in your resulting `DataFrame` that not every row in `df1` and `df5` had a value that corresponds to the union of the key values (the 'group' column). Pandas fills in these missing values with `NaN`s.
Inner and outer joins are not your only options. A *left join* returns all of the rows in the first (left-side) `DataFrame` supplied to `merge`, along with the rows from the other `DataFrame` that match the left-side key values (rows without a match are filled with `NaN`s):
```python
pd.merge(df1, df5, how='left')
```
### Exercise:
```python
# Now run the right merge between df1 and df5.
# What do you expect to see?
```
#### `suffixes` keyword: dealing with conflicting column names
Because you can join datasets, you will eventually join two with conflicting column names. Let's look at another example to see what we mean:
```python
df7 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
'rank': [1, 2, 3, 4]})
df7
```
```python
df8 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
'rank': [3, 1, 4, 2]})
df8
```
```python
pd.merge(df7, df8, on='name')
```
Each column name in a `DataFrame` must be unique, so in cases where two joined `DataFrame`s share column names (aside from the column serving as the key), the `merge` function automatically appends the suffix `_x` or `_y` to the conflicting column names in order to make them unique. In cases where it is best to control your column names, you can specify a custom suffix for `merge` to append through the `suffixes` keyword:
```python
pd.merge(df7, df8, on='name', suffixes=['_left', '_right'])
```
Note that these suffixes work even if there are multiple conflicting columns.
### Concatenation in NumPy
Concatenation in pandas is built off of the concatenation functionality for NumPy arrays. Here is what NumPy concatenation looks like:
- For one-dimensional arrays:
```python
x = [1, 2, 3]
y = [4, 5, 6]
z = [7, 8, 9]
np.concatenate([x, y, z])
```
- For two-dimensional arrays:
```python
x = [[1, 2],
[3, 4]]
np.concatenate([x, x], axis=1)
```
Notice that the `axis=1` parameter makes the concatenation occur along columns rather than rows. Concatenation in pandas looks similar to this.
### Concatenation in pandas
Pandas has a function, `pd.concat()`, that can be used for a simple concatenation of `Series` or `DataFrame` objects in a similar manner to `np.concatenate()` with ndarrays.
```python
ser1 = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
ser2 = pd.Series(['d', 'e', 'f'], index=[4, 5, 6])
pd.concat([ser1, ser2])
```
It also concatenates higher-dimensional objects, such as ``DataFrame``s:
```python
df9 = pd.DataFrame({'A': ['a', 'c'],
'B': ['b', 'd']})
df9
```
```python
pd.concat([df9, df9])
```
Notice that `pd.concat` has preserved the indexing even though that means that it has been duplicated. You can have the results re-indexed (and avoid potential confusion down the road) like so:
```python
pd.concat([df9, df9], ignore_index=True)
```
By default, `pd.concat` concatenates row-wise within the `DataFrame` (that is, `axis=0` by default). You can specify the axis along which to concatenate:
```python
pd.concat([df9, df9], axis=1)
```
Note that while pandas will display this without error, you will get an error message if you try to assign this result as a new `DataFrame`. Column names in `DataFrame`s must be unique.
### Concatenation with joins
Just as you did with merge above, you can use inner and outer joins when concatenating `DataFrame`s with different sets of column names.
```python
df10 = pd.DataFrame({'A': ['a', 'd'],
'B': ['b', 'e'],
'C': ['c', 'f']})
df10
```
```python
df11 = pd.DataFrame({'B': ['u', 'x'],
'C': ['v', 'y'],
'D': ['w', 'z']})
df11
```
```python
pd.concat([df10, df11])
```
As we saw earlier, the default join for this is an outer join and entries for which no data is available are filled with `NaN` values. You can also do an inner join:
```python
pd.concat([df10, df11], join='inner')
```
Another option is to directly specify the index of the remaining columns using the `join_axes` argument, which takes a list of index objects. Here, we will specify that the returned columns should be the same as those of the first input (`df10`):
```python
pd.concat([df10, df11], join_axes=[df10.columns])
```
#### `append()`
Because direct array concatenation is so common, ``Series`` and ``DataFrame`` objects have an ``append`` method that can accomplish the same thing in fewer keystrokes. For example, rather than calling ``pd.concat([df9, df9])``, you can simply call ``df9.append(df9)``:
```python
df9.append(df9)
```
**Important point**: Unlike the `append()` and `extend()` methods of Python lists, the `append()` method in pandas does not modify the original object. It instead creates a new object with the combined data.
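A quick way to convince yourself of that, reusing the `df9` defined above:
```python
appended = df9.append(df9)   # returns a new DataFrame
len(df9), len(appended)      # df9 still has 2 rows; the new object has 4
```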
> **Takeaway:** A large part of the value you can provide as a data scientist comes from connecting multiple, often disparate datasets to find new insights. Learning how to join and merge data is thus an essential part of your skill set.
## Exploratory statistics and visualization
> **Learning goal:** By the end of this subsection, you should be familiar with some of the ways to visually explore the data stored in `DataFrame`s.
Often when probing a new data set, it is invaluable to get high-level information about what the dataset holds. Earlier in this section we discussed using methods such as `DataFrame.info`, `DataFrame.head`, and `DataFrame.tail` to examine some aspects of a `DataFrame`. While these methods are critical, they are on their own often insufficient to get enough information to know how to approach a new dataset. This is where exploratory statistics and visualizations for datasets come in.
To see what we mean in terms of gaining exploratory insight (both visually and numerically), let's dig into one of the datasets that comes with the scikit-learn library, the Boston Housing Dataset (though you will load it from a CSV file):
```python
df = pd.read_csv('Data/housing_dataset.csv')
df.head()
```
This dataset contains information collected from the U.S. Census Bureau concerning housing in the area of Boston, Massachusetts, and was first published in 1978. The dataset has 13 columns:
- **CRIM**: Per-capita crime rate by town
- **ZN**: Proportion of residential land zoned for lots over 25,000 square feet
- **INDUS**: Proportion of non-retail business acres per town
- **CHAS**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
- **NOX**: Nitric oxides concentration (parts per 10 million)
- **RM**: Average number of rooms per dwelling
- **AGE**: Proportion of owner-occupied units built prior to 1940
- **DIS**: Weighted distances to five Boston employment centres
- **RAD**: Index of accessibility to radial highways
- **TAX**: Full-value property-tax rate per \$10,000
- **PTRATIO**: Pupil-teacher ratio by town
- **LSTAT**: Percent of lower-status portion of the population
- **MEDV**: Median value of owner-occupied homes in \$1,000s
One of the first methods we can use to better understand this dataset is `DataFrame.shape`:
```python
df.shape
```
The dataset has 506 rows and 13 columns.
To get a better idea of the contents of each column, we can use `DataFrame.describe`, which returns the maximum value, minimum value, mean, and standard deviation of numeric values in each column, in addition to the quartiles for each column:
```python
df.describe()
```
Because datasets can have so many columns, it can often be useful to transpose the results of `DataFrame.describe` to make them easier to read.
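For example, a minimal way to do that with the `df` already loaded:
```python
df.describe().T
```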
Note that you can also examine specific descriptive statistics for columns without having to invoke `DataFrame.describe`:
```python
df['MEDV'].mean()
```
```python
df['MEDV'].max()
```
```python
df['AGE'].median()
```
### Exercise:
```python
# Now find the maximum value in df['AGE'].
```
Other information that you will often want to see is the relationship between different columns. You do this with the `DataFrame.groupby` method. For example, you could examine the average MEDV (median value of owner-occupied homes) for each value of AGE (proportion of owner-occupied units built prior to 1940):
```python
df.groupby(['AGE'])['MEDV'].mean()
```
### Exercise:
```python
# Now try to find the median value for AGE for each value of MEDV.
```
You can also apply a lambda function to each element of a `DataFrame` column by using the `apply` method. For example, say you wanted to create a new column that flags a row if more than 50 percent of owner-occupied homes were built before 1940:
```python
df['AGE_50'] = df['AGE'].apply(lambda x: x>50)
```
Once applied, you can also see how many values returned `True` and how many `False` by using the `value_counts` method:
```python
df['AGE_50'].value_counts()
```
You can also examine figures from the groupby statement you created earlier:
```python
df.groupby(['AGE_50'])['MEDV'].mean()
```
You can also group by more than one variable, such as AGE_50 (the one you just created), CHAS (whether a town is on the Charles River), and RAD (an index measuring access to the Boston-area radial highways), and then evaluate each group for the average median home price in that group:
```python
groupby_twovar=df.groupby(['AGE_50','RAD','CHAS'])['MEDV'].mean()
```
You can then see what values are in this stacked group of variables:
```python
groupby_twovar
```
Let's take a moment to analyze these results in a little depth. The first row reports that communities with less than half of their houses built before 1940, with a highway-access index of 1, and that are not situated on the Charles River have a mean house price of \$24,667 (in 1970s dollars); the next row shows that communities that are similar except for being located on the Charles River have a mean house price of \$50,000.
One insight that pops out from continuing down this list is that, all else being equal, being located next to the Charles River can significantly increase the value of newer housing stock. The story is more ambiguous for communities dominated by older houses: proximity to the Charles significantly increases home prices in one community (presumably one farther away from the city); for all others, being situated on the river either provides a modest increase in value or actually decreases mean home prices.
While groupings like this can be a great way to begin to interrogate your data, you might not care for the 'tall' format it comes in. In that case, you can unstack the data into a "wide" format:
```python
groupby_twovar.unstack()
```
### Exercise:
```python
# How could you use groupby to get a sense of the proportion
# of residential land zoned for lots over 25,000 sq.ft.,
# the proportion of non-retail business acres per town,
# and the distance of towns from employment centers in Boston?
```
It is also often valuable to know how many unique values a column has in it with the `nunique` method:
```python
df['CHAS'].nunique()
```
Complementary to that, you will also likely want to know what those unique values are, which is where the `unique` method helps:
```python
df['CHAS'].unique()
```
You can use the `value_counts` method to see how many of each unique value there are in a column:
```python
df['CHAS'].value_counts()
```
Or you can easily plot a bar graph to visually see the breakdown:
```python
%matplotlib inline
df['CHAS'].value_counts().plot(kind='bar')
```
Note that the IPython magic command `%matplotlib inline` enables you to view the chart inline.
Let's pull back to the dataset as a whole for a moment. Two major things that you will look for in almost any dataset are trends and relationships. A typical relationship between variables to explore is the Pearson correlation, or the extent to which two variables are linearly related. The `corr` method will show this in table format for all of the columns in a `DataFrame`:
```python
df.corr(method='pearson')
```
Suppose you just want to look at the correlations between all of the columns and one variable. Let's examine the correlation between all other variables and the percentage of owner-occupied houses built before 1940 (AGE). We will do this by accessing the column by index number:
```python
corr = df.corr(method='pearson')
corr_with_homevalue = corr.iloc[-1]
corr_with_homevalue[corr_with_homevalue.argsort()[::-1]]
```
With the correlations arranged in descending order, it's easy to start to see some patterns. Correlating AGE with a variable we created from AGE is a trivial correlation. However, it is interesting to note that the percentage of older housing stock in communities strongly correlates with air pollution (NOX) and the proportion of non-retail business acres per town (INDUS); at least in 1978 metro Boston, older towns are more industrial.
Graphically, we can see the correlations using a heatmap from the Seaborn library:
```python
import seaborn as sns
sns.heatmap(df.corr(),cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
```
Histograms are another valuable tool for investigating your data. For example, what is the overall distribution of prices of owner-occupied houses in the Boston area?
```python
import matplotlib.pyplot as plt
plt.hist(df['MEDV'])
```
The default bin size for the matplotlib histogram (essentially how big the buckets of values are that go into each histogram bar) is pretty large and might mask smaller details. To get a finer-grained view of the MEDV column, you can manually increase the number of bins in the histogram:
```python
plt.hist(df['MEDV'],bins=50)
```
Seaborn has a somewhat more attractive version of the standard matplotlib histogram: the distribution plot. This is a combination histogram and kernel density estimate (KDE) plot (essentially a smoothed histogram):
```python
sns.distplot(df['MEDV'])
```
Another commonly used plot is the Seaborn jointplot, which combines histograms for two columns along with a scatterplot:
```python
sns.jointplot(df['RM'], df['MEDV'], kind='scatter')
```
Unfortunately, many of the dots print over each other. You can help address this by adding some alpha blending, a parameter that sets the transparency for the dots so that concentrations of them drawing over one another will be apparent:
```python
sns.jointplot(df['RM'], df['MEDV'], kind='scatter', alpha=0.3)
```
Another way to see patterns in your data is with a two-dimensional KDE plot. Darker colors here represent a higher concentration of data points:
```python
sns.kdeplot(df['RM'], df['MEDV'], shade=True)
```
Note that while the KDE plot is very good at showing concentrations of data points, finer structures like linear relationships (such as the clear relationship between the number of rooms in homes and the house price) are lost in the KDE plot.
Finally, the pairplot in Seaborn allows you to see scatterplots and histograms for several columns in one table. Here we have played with some of the keywords to produce a more sophisticated and easier to read pairplot that incorporates both alpha blending and linear regression lines for the scatterplots.
```python
sns.pairplot(df[['RM', 'AGE', 'LSTAT', 'DIS', 'MEDV']], kind="reg", plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})
```
Visualization is the start of the really cool, fun part of data science. So play around with these visualization tools and see what you can learn from the data!
&gt; **Takeaway:** An old joke goes: “What does a data scientist see when they look at a dataset? A bunch of numbers.” There is more than a little truth in that joke. Visualization is often the key to finding patterns and correlations in your data. While visualization often cannot deliver precise results, it can point you in the right direction to ask better questions and efficiently find value in the data.

View file

@ -0,0 +1,366 @@
# Machine Learning in Python
The content for this notebook was copied from the [Machine Learning in Python lab](https://github.com/Microsoft/computerscience/tree/master/Labs/Deep%20Learning/200%20-%20Machine%20Learning%20in%20Python).
This demo shows prediction of flight delays between airport pairs based on the day of the month using a random forest.
The demo concludes by visualizing the probability of on-time arrival between JFK and Atlanta Hartsfield-Jackson over consecutive days.
In this exercise, you will import a dataset from Azure blob storage and load it into the notebook. Jupyter notebooks are highly interactive, and since they can include executable code, they provide the perfect platform for manipulating data and building predictive models from it.
## Ingest
cURL is a familiar command-line tool for transferring data to or from servers using protocols such as HTTP, HTTPS, FTP, and FTPS.
In the code cell below, cURL is used to download the flight data from public blob storage to the working directory.
```python
!curl https://topcs.blob.core.windows.net/public/FlightData.csv -o flightdata.csv
```
Pandas will be used here to create a data frame in which the data will be manipulated and massaged for enhanced analysis.
Import the data and create a pandas DataFrame from it, and display the first five rows.
```python
import pandas as pd
df = pd.read_csv('flightdata.csv')
df.head()
```
The DataFrame that you created contains on-time arrival information for a major U.S. airline. It has more than 11,000 rows and 26 columns. (The output says "5 rows" because DataFrame's head function only returns the first five rows.) Each row represents one flight and contains information such as the origin, the destination, the scheduled departure time, and whether the flight arrived on time or late. You will learn more about the data, including its content and structure, in the next lab.
## Process
In the real world, few datasets can be used as is to train machine-learning models. It is not uncommon for data scientists to spend 80% or more of their time on a project cleaning, preparing, and shaping the data — a process sometimes referred to as data wrangling. Typical actions include removing duplicate rows, removing rows or columns with missing values or algorithmically replacing the missing values, normalizing data, and selecting feature columns. A machine-learning model is only as good as the data it is trained with. Preparing the data is arguably the most crucial step in the machine-learning process.
Before you can prepare a dataset, you need to understand its content and structure. In the previous steps, you imported a dataset containing on-time arrival information for a major U.S. airline. That data included 26 columns and thousands of rows, with each row representing one flight and containing information such as the flight's origin, destination, and scheduled departure time. You also loaded the data into the Jupyter notebook and used a simple Python script to create a pandas DataFrame from it.
To get a count of rows, run the following code:
```python
df.shape
```
Now take a moment to examine the 26 columns in the dataset. They contain important information such as the date that the flight took place (YEAR, MONTH, and DAY_OF_MONTH), the origin and destination (ORIGIN and DEST), the scheduled departure and arrival times (CRS_DEP_TIME and CRS_ARR_TIME), the difference between the scheduled arrival time and the actual arrival time in minutes (ARR_DELAY), and whether the flight was late by 15 minutes or more (ARR_DEL15).
Times in the dataset are expressed in 24-hour military time. For example, 1130 equals 11:30 a.m. and 1500 equals 3:00 p.m.
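The complete list of column names isn't reproduced here, but you can print it straight from the DataFrame (a quick check, not part of the original lab text):
```python
# list every column in the flight data
list(df.columns)
```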
One of the first things data scientists typically look for in a dataset is missing values. There's an easy way to check for missing values in pandas. To demonstrate, execute the following code:
```python
df.isnull().values.any()
```
The next step is to find out where the missing values are. To do so, execute the following code:
```python
df.isnull().sum()
```
Curiously, the 26th column ("Unnamed: 25") contains 11,231 missing values, which equals the number of rows in the dataset. This column was mistakenly created because the CSV file that you imported contains a comma at the end of each line. To eliminate that column, execute the following code:
```python
df = df.drop('Unnamed: 25', axis=1)
df.isnull().sum()
```
The DataFrame still contains a lot of missing values, but some of them are irrelevant because the columns containing them are not germane to the model that you are building. The goal of that model is to predict whether a flight you are considering booking is likely to arrive on time. If you know that the flight is likely to be late, you might choose to book another flight.
The next step, therefore, is to filter the dataset to eliminate columns that aren't relevant to a predictive model. For example, the aircraft's tail number probably has little bearing on whether a flight will arrive on time, and at the time you book a ticket, you have no way of knowing whether a flight will be cancelled, diverted, or delayed. By contrast, the scheduled departure time could have a lot to do with on-time arrivals. Because of the hub-and-spoke system used by most airlines, morning flights tend to be on time more often than afternoon or evening flights. And at some major airports, traffic stacks up during the day, increasing the likelihood that later flights will be delayed.
Pandas provides an easy way to filter out columns you don't want. Execute the following code:
```python
df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]
df.isnull().sum()
```
The only column that now contains missing values is the ARR_DEL15 column, which uses 0s to identify flights that arrived on time and 1s for flights that didn't. Use the following code to show the first five rows with missing values:
```python
df[df.isnull().values.any(axis=1)].head()
```
The reason these rows are missing ARR_DEL15 values is that they all correspond to flights that were canceled or diverted. You could call dropna on the DataFrame to remove these rows. But since a flight that is canceled or diverted to another airport could be considered "late," let's use the fillna method to replace the missing values with 1s.
Use the following code to replace missing values in the ARR_DEL15 column with 1s and display rows 177 through 184:
```python
df = df.fillna({'ARR_DEL15': 1})
df.iloc[177:185]
```
Use the following code to display the first five rows of the DataFrame:
```python
df.head()
```
The CRS_DEP_TIME column of the dataset you are using represents scheduled departure times. The granularity of the numbers in this column — it contains more than 500 unique values — could have a negative impact on accuracy in a machine-learning model. This can be resolved using a technique called binning or quantization. What if you divided each number in this column by 100 and rounded down to the nearest integer? 1030 would become 10, 1925 would become 19, and so on, and you would be left with a maximum of 24 discrete values in this column. Intuitively, it makes sense, because it probably doesn't matter much whether a flight leaves at 10:30 a.m. or 10:40 a.m. It matters a great deal whether it leaves at 10:30 a.m. or 5:30 p.m.
In addition, the dataset's ORIGIN and DEST columns contain airport codes that represent categorical machine-learning values. These columns need to be converted into discrete columns containing indicator variables, sometimes known as "dummy" variables. In other words, the ORIGIN column, which contains five airport codes, needs to be converted into five columns, one per airport, with each column containing 1s and 0s indicating whether a flight originated at the airport that the column represents. The DEST column needs to be handled in a similar manner.
In this portion of the exercise, you will "bin" the departure times in the CRS_DEP_TIME column and use pandas' get_dummies method to create indicator columns from the ORIGIN and DEST columns.
Use the following code to bin the departure times:
```python
import math
for index, row in df.iterrows():
    df.loc[index, 'CRS_DEP_TIME'] = math.floor(row['CRS_DEP_TIME'] / 100)
df.head()
```
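As an aside, the same binning can be done without an explicit loop by using vectorized integer division. This is an alternative to the loop above, not a step in the original lab, so don't run both or you will bin the column twice:
```python
# equivalent, vectorized binning: 1030 -> 10, 1925 -> 19, and so on
# (only run this in place of the iterrows loop above, not after it)
df['CRS_DEP_TIME'] = df['CRS_DEP_TIME'] // 100
df.head()
```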
Now use the following statements to generate indicator columns from the ORIGIN and DEST columns, while dropping the ORIGIN and DEST columns themselves:
```python
df = pd.get_dummies(df, columns=['ORIGIN', 'DEST'])
df.head()
```
## Predict
Machine learning, which facilitates predictive analytics using large volumes of data by employing algorithms that iteratively learn from that data, is one of the fastest growing areas of data science.
One of the most popular tools for building machine-learning models is Scikit-learn, a free and open-source toolkit for Python programmers. It has built-in support for popular regression, classification, and clustering algorithms and works with other Python libraries such as NumPy and SciPy. With Scikit-learn, a simple method call can replace hundreds of lines of hand-written code. Scikit-learn allows you to focus on building, training, tuning, and testing machine-learning models without getting bogged down coding algorithms.
In this lab, the third of four in a series, you will use Scikit-learn to build a machine-learning model utilizing on-time arrival data for a major U.S. airline. The goal is to create a model that might be useful in the real world for predicting whether a flight is likely to arrive on time. It is precisely the kind of problem that machine learning is commonly used to solve. And it's a great way to increase your machine-learning chops while getting acquainted with Scikit-learn.
The first statement imports Scikit-learn's train_test_split helper function. The second line uses the function to split the DataFrame into a training set containing 80% of the original data, and a test set containing the remaining 20%. The random_state parameter seeds the random-number generator used to do the splitting, while the first and second parameters are DataFrames containing the feature columns and the label column.
```python
from sklearn.model_selection import train_test_split
train_x, test_x, train_y, test_y = train_test_split(df.drop('ARR_DEL15', axis=1), df['ARR_DEL15'], test_size=0.2, random_state=42)
```
train_test_split returns four DataFrames. Use the following command to display the number of rows and columns in the DataFrame containing the feature columns used for training:
```python
train_x.shape
```
Now use this command to display the number of rows and columns in the DataFrame containing the feature columns used for testing:
```python
test_x.shape
```
You will train a classification model, which seeks to resolve a set of inputs into one of a set of known outputs.
Scikit-learn includes a variety of classes for implementing common machine-learning models. One of them is RandomForestClassifier, which fits multiple decision trees to the data and uses averaging to boost the overall accuracy and limit overfitting.
Execute the following code to create a RandomForestClassifier object and train it by calling the fit method.
```python
from sklearn.ensemble import RandomForestClassifier
model = RandomForestClassifier(random_state=13)
model.fit(train_x, train_y)
```
The output shows the parameters used in the classifier, including n_estimators, which specifies the number of trees in the forest, and max_depth, which specifies the maximum depth of the decision trees. The values shown are the defaults, but you can override any of them when creating the RandomForestClassifier object.
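For instance, if you wanted to override a couple of those defaults, you could construct and evaluate a second classifier like this (the parameter values here are arbitrary, shown only to illustrate overriding the defaults):
```python
from sklearn.ensemble import RandomForestClassifier

# a sketch: override n_estimators and max_depth instead of accepting the defaults
model_tuned = RandomForestClassifier(n_estimators=100, max_depth=10, random_state=13)
model_tuned.fit(train_x, train_y)
model_tuned.score(test_x, test_y)
```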
Now call the predict method to test the model using the values in test_x, followed by the score method to determine the mean accuracy of the model:
```python
predicted = model.predict(test_x)
model.score(test_x, test_y)
```
There are several ways to measure the accuracy of a classification model. One of the best overall measures for a binary classification model is Area Under Receiver Operating Characteristic Curve (sometimes referred to as "ROC AUC"), which essentially quantifies how often the model will make a correct prediction regardless of the outcome. In this exercise, you will compute an ROC AUC score for the model you built in the previous exercise and learn about some of the reasons why that score is lower than the mean accuracy output by the score method. You will also learn about other ways to gauge the accuracy of the model.
Before you compute the ROC AUC, you must generate prediction probabilities for the test set. These probabilities are estimates for each of the classes, or answers, the model can predict. For example, [0.88199435, 0.11800565] means that there's an 88% chance that a flight will arrive on time (ARR_DEL15 = 0) and a 12% chance that it won't (ARR_DEL15 = 1). The two probabilities add up to 100%.
Run the following code to generate a set of prediction probabilities from the test data:
```python
from sklearn.metrics import roc_auc_score
probabilities = model.predict_proba(test_x)
```
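If you want to peek at a few of the probability pairs described above, you can slice the array (an optional check, not in the original lab):
```python
# each row: [probability of on-time arrival, probability of a delay]
probabilities[:5]
```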
Now use the following statement to generate an ROC AUC score from the probabilities using Scikit-learn's roc_auc_score method:
```python
roc_auc_score(test_y, probabilities[:, 1])
```
Why is the AUC score lower than the mean accuracy computed in the previous exercise?
The output from the score method reflects how many of the items in the test set the model predicted correctly. This score is skewed by the fact that the dataset the model was trained and tested with contains many more rows representing on-time arrivals than rows representing late arrivals. Because of this imbalance in the data, you are more likely to be correct if you predict that a flight will be on time than if you predict that a flight will be late.
ROC AUC takes this into account and provides a more accurate indication of how likely it is that a prediction of on-time or late will be correct.
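You can verify that imbalance directly by counting the labels in the test set (a quick check, not part of the original lab):
```python
# proportion of on-time (0.0) versus late (1.0) arrivals among the test labels
test_y.value_counts(normalize=True)
```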
You can learn more about the model's behavior by generating a confusion matrix, also known as an error matrix. The confusion matrix quantifies the number of times each answer was classified correctly or incorrectly. Specifically, it quantifies the number of false positives, false negatives, true positives, and true negatives. This is important, because if a binary classification model trained to recognize cats and dogs is tested with a dataset that is 95% dogs, it could score 95% simply by guessing "dog" every time. But if it failed to identify cats at all, it would be of little value.
Use the following code to produce a confusion matrix for your model:
```python
from sklearn.metrics import confusion_matrix
confusion_matrix(test_y, predicted)
```
The first row in the output represents flights that were on time. The first column in that row shows how many flights were correctly predicted to be on time, while the second column reveals how many flights were predicted as delayed but were not. From this, the model appears to be very adept at predicting that a flight will be on time.
But look at the second row, which represents flights that were delayed. The first column shows how many delayed flights were incorrectly predicted to be on time. The second column shows how many flights were correctly predicted to be delayed. Clearly, the model isn't nearly as adept at predicting that a flight will be delayed as it is at predicting that a flight will arrive on time. What you want in a confusion matrix is big numbers in the upper-left and lower-right corners, and small numbers (preferably zeros) in the upper-right and lower-left corners.
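If the raw matrix is hard to read, one option is to wrap it in a labeled DataFrame, much as the Titanic notebooks later in this document do (the row and column labels below are my own, assuming ARR_DEL15 = 0 means on time and 1 means delayed):
```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# rows are the true classes, columns are the predicted classes
cm = confusion_matrix(test_y, predicted)
pd.DataFrame(cm,
             index=['True on time', 'True delayed'],
             columns=['Predicted on time', 'Predicted delayed'])
```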
Other measures of accuracy for a classification model include precision and recall. Suppose the model was presented with three on-time arrivals and three delayed arrivals, and that it correctly predicted two of the on-time arrivals, but incorrectly predicted that two of the delayed arrivals would be on time. In this case, the precision would be 50% (two of the four flights it classified as being on time actually were on time), while its recall would be 67% (it correctly identified two of the three on-time arrivals). You can learn more about precision and recall from https://en.wikipedia.org/wiki/Precision_and_recall
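To make the toy example above concrete, here is a quick check with scikit-learn, treating "on time" as the positive class (this tiny hand-made dataset is illustrative only):
```python
from sklearn.metrics import precision_score, recall_score

# 1 = on time, 0 = delayed: three on-time and three delayed flights
y_true = [1, 1, 1, 0, 0, 0]
# the model catches two of the on-time arrivals but also calls two delayed flights on time
y_pred = [1, 1, 0, 1, 1, 0]

print(precision_score(y_true, y_pred))  # 2 / 4 = 0.5
print(recall_score(y_true, y_pred))     # 2 / 3 = 0.67
```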
Scikit-learn contains a handy method named precision_score for computing precision. To quantify the precision of your model, execute the following statements:
```python
from sklearn.metrics import precision_score
train_predictions = model.predict(train_x)
precision_score(train_y, train_predictions)
```
Scikit-learn also contains a method named recall_score for computing recall. To measure your model's recall, execute the following statements:
```python
from sklearn.metrics import recall_score
recall_score(train_y, train_predictions)
```
## Visualize
Now that you that have trained a machine-learning model to perform predictive analytics, it's time to put it to work. In this lab, the final one in the series, you will write a function that uses the machine-learning model you built in the previous lab to predict whether a flight will arrive on time or late. And you will use Matplotlib, the popular plotting and charting library for Python, to visualize the results.
The first statement is one of several magic commands supported by the Python kernel that you selected when you created the notebook. It enables Jupyter to render Matplotlib output in a notebook without making repeated calls to show. And it must appear before any references to Matplotlib itself. The final statement configures Seaborn to enhance the output from Matplotlib.
Execute the following code. Ignore any warning messages that are displayed related to font caching:
```python
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
```
To see Matplotlib at work, execute the following code in a new cell to plot the ROC curve for the machine-learning model you built in the previous lab:
```python
from sklearn.metrics import roc_curve
fpr, tpr, _ = roc_curve(test_y, probabilities[:, 1])
plt.plot(fpr, tpr)
plt.plot([0, 1], [0, 1], color='grey', lw=1, linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
```
The dotted line in the middle of the graph represents a 50-50 chance of obtaining a correct answer. The blue curve represents the accuracy of your model. More importantly, the fact that this chart appears at all demonstrates that you can use Matplotlib in a Jupyter notebook.
The reason you built a machine-learning model is to predict whether a flight will arrive on time or late. In this exercise, you will write a Python function that calls the machine-learning model you built in the previous lab to compute the likelihood that a flight will be on time. Then you will use the function to analyze several flights.
This function takes as input a date and time, an origin airport code, and a destination airport code, and returns a value between 0.0 and 1.0 indicating the probability that the flight will arrive at its destination on time. It uses the machine-learning model you built in the previous lab to compute the probability. And to call the model, it passes a DataFrame containing the input values to predict_proba. The structure of the DataFrame exactly matches the structure of the DataFrame depicted in previous steps.
```python
def predict_delay(departure_date_time, origin, destination):
    from datetime import datetime

    # Parse the departure date and time (day/month/year hours:minutes:seconds)
    try:
        departure_date_time_parsed = datetime.strptime(departure_date_time, '%d/%m/%Y %H:%M:%S')
    except ValueError as e:
        return 'Error parsing date/time - {}'.format(e)

    month = departure_date_time_parsed.month
    day = departure_date_time_parsed.day
    day_of_week = departure_date_time_parsed.isoweekday()
    hour = departure_date_time_parsed.hour

    origin = origin.upper()
    destination = destination.upper()

    # Build a single-row input whose column names match the ones the model was trained with
    input = [{'MONTH': month,
              'DAY_OF_MONTH': day,
              'DAY_OF_WEEK': day_of_week,
              'CRS_DEP_TIME': hour,
              'ORIGIN_ATL': 1 if origin == 'ATL' else 0,
              'ORIGIN_DTW': 1 if origin == 'DTW' else 0,
              'ORIGIN_JFK': 1 if origin == 'JFK' else 0,
              'ORIGIN_MSP': 1 if origin == 'MSP' else 0,
              'ORIGIN_SEA': 1 if origin == 'SEA' else 0,
              'DEST_ATL': 1 if destination == 'ATL' else 0,
              'DEST_DTW': 1 if destination == 'DTW' else 0,
              'DEST_JFK': 1 if destination == 'JFK' else 0,
              'DEST_MSP': 1 if destination == 'MSP' else 0,
              'DEST_SEA': 1 if destination == 'SEA' else 0}]

    # Probability of class 0 (ARR_DEL15 = 0, i.e., on-time arrival)
    return model.predict_proba(pd.DataFrame(input))[0][0]
```
Use the code below to compute the probability that a flight from New York to Atlanta on the evening of October 1 will arrive on time. Note that the year you enter is irrelevant because it isn't used by the model.
```python
predict_delay('1/10/2018 21:45:00', 'JFK', 'ATL')
```
Modify the code to compute the probability that the same flight a day later will arrive on time:
```python
predict_delay('2/10/2018 21:45:00', 'JFK', 'ATL')
```
How likely is this flight to arrive on time? If your travel plans were flexible, would you consider postponing your trip for one day?
Now modify the code to compute the probability that a morning flight the same day from Atlanta to Seattle will arrive on time:
```python
predict_delay('2/10/2018 10:00:00', 'ATL', 'SEA')
```
In this exercise, you will combine the predict_delay function you created in the previous exercise with Matplotlib to produce side-by-side comparisons of the same flight on consecutive days and flights with the same origin and destination at different times throughout the day.
```python
import numpy as np
labels = ('Oct 1', 'Oct 2', 'Oct 3', 'Oct 4', 'Oct 5', 'Oct 6', 'Oct 7')
values = (predict_delay('1/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('2/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('3/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('4/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('5/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('6/10/2018 21:45:00', 'JFK', 'ATL'),
          predict_delay('7/10/2018 21:45:00', 'JFK', 'ATL'))
alabels = np.arange(len(labels))
plt.bar(alabels, values, align='center', alpha=0.5)
plt.xticks(alabels, labels)
plt.ylabel('Probability of On-Time Arrival')
plt.ylim((0.0, 1.0))
```
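The lab also mentions comparing flights between the same airports at different times of day; a sketch of that comparison (with departure hours chosen arbitrarily for illustration) might look like this:
```python
labels = ('8am', '11am', '2pm', '5pm', '8pm', '11pm')
values = (predict_delay('30/1/2018 08:00:00', 'JFK', 'ATL'),
          predict_delay('30/1/2018 11:00:00', 'JFK', 'ATL'),
          predict_delay('30/1/2018 14:00:00', 'JFK', 'ATL'),
          predict_delay('30/1/2018 17:00:00', 'JFK', 'ATL'),
          predict_delay('30/1/2018 20:00:00', 'JFK', 'ATL'),
          predict_delay('30/1/2018 23:00:00', 'JFK', 'ATL'))

alabels = np.arange(len(labels))
plt.bar(alabels, values, align='center', alpha=0.5)
plt.xticks(alabels, labels)
plt.ylabel('Probability of On-Time Arrival')
plt.ylim((0.0, 1.0))
```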
Referenced from https://github.com/Microsoft/computerscience/tree/master/Labs/Deep%20Learning/200%20-%20Machine%20Learning%20in%20Python, 12/17/2018

Binary data
Data Science 2_Beginners Data Science for Python Developers/.DS_Store vendored Normal file

Binary file not shown.

View file

@ -0,0 +1,477 @@
# Section 1: Introduction to machine learning models
## A quick aside: types of ML
As you get deeper into data science, it might seem like there are a bewildering array of ML algorithms out there. However many you encounter, it can be handy to remember that most ML algorithms fall into three broad categories:
- **Predictive algorithms**: These analyze current and historical facts to make predictions about unknown events, such as the future or customers' choices.
- **Classification algorithms**: These teach a program from a body of data, and the program then uses that learning to classify new observations.
- **Time-series forecasting algorithms**: While it can be argued that these algorithms are a part of predictive algorithms, their techniques are specialized enough that they in many ways function like a separate category. Time-series forecasting is beyond the scope of this course, but we have more than enough to work with in focusing here on prediction and classification.
## Prediction: linear regression
&gt; **Learning goal:** By the end of this subsection, you should be comfortable fitting linear regression models, and you should have some familiarity with interpreting their output.
### Data exploration
**Import Libraries**
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
```
**Dataset Alert**: Housing Dataset Sample (a newer housing dataset than the Boston Housing Dataset used in the last section)
```python
df = pd.read_csv('Data/Housing_Dataset_Sample.csv')
df.head()
```
### Exercise:
```python
# Do you remember the DataFrame method for looking at overall information
# about a DataFrame, such as number of columns and rows? Try it here.
```
```python
df.describe().T
```
**Price Column**
```python
sns.distplot(df['Price'])
```
**House Prices vs Average Area Income**
```python
sns.jointplot(df['Avg. Area Income'],df['Price'])
```
**All Columns**
```python
sns.pairplot(df)
```
**Some observations**
1. Some combinations of columns form blobs rather than clear linear relationships; that's nothing to worry about.
2. The lane-like distortions are a result of discrete data (e.g., no one has 0.3 bedrooms).
### Fitting the model
**Can We Predict Housing Prices?**
```python
X = df.iloc[:,:5] # First 5 Columns
y = df['Price'] # Price Column
```
**Train, Test, Split**
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=54)
```
**Fit to Linear Regression Model**
```python
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
```
```python
reg.fit(X_train,y_train)
```
### Evaluating the model
**Predict**
```python
predictions = reg.predict(X_test)
```
```python
predictions
```
```python
print(reg.intercept_,reg.coef_)
```
**Score**
```python
#Explained variation. A high R2 close to 1 indicates better prediction with less error.
from sklearn.metrics import r2_score
r2_score(y_test,predictions)
```
**Visualize Errors**
```python
sns.distplot([y_test-predictions])
```
**Visualize Predictions**
```python
# Plot outputs
plt.scatter(y_test,predictions, color='blue')
```
### Exercise:
Can you think of a way to refine this visualization to make it clearer, particularly if you were explaining the results to someone?
```python
# Hint: Remember to try the plt.scatter parameter alpha=.
# It takes values between 0 and 1.
```
&gt; **Takeaway:** In this subsection, you performed prediction using linear regression by exploring your data, then fitting your model, and finally evaluating your model's performance.
## Classification: logistic regression
&gt; **Learning goal:** By the end of this subsection, you should know how logistic regression differs from linear regression, be comfortable fitting logistic regression models, and have some familiarity with interpreting their output.
**Dataset Alert**: Fates of RMS Titanic Passengers
The dataset has 12 variables:
- **PassengerId**
- **Survived:** 0 = No, 1 = Yes
- **Pclass:** Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- **Sex**
- **Age**
- **Sibsp:** Number of siblings or spouses aboard the *Titanic*
- **Parch:** Number of parents or children aboard the *Titanic*
- **Ticket:** Passenger ticket number
- **Fare:** Passenger fare
- **Cabin:** Cabin number
- **Embarked:** Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
```python
df = pd.read_csv('Data/train_data_titanic.csv')
df.head()
```
```python
df.info()
```
### Remove extraneous variables
```python
df.drop(['Name','Ticket'],axis=1,inplace=True)
```
### Check for multicollinearity
**Question**: Do any correlations between **Survived** and **Fare** jump out?
```python
sns.pairplot(df[['Survived','Fare']], dropna=True)
```
### Exercise:
```python
# Try running sns.pairplot twice more on some other combinations of columns
# and see if any patterns emerge.
```
We can also use `groupby` to look for patterns. Consider the mean values for the various variables when we group by **Survived**:
```python
df.groupby('Survived').mean()
```
```python
df.head()
```
```python
df['SibSp'].value_counts()
```
```python
df['Parch'].value_counts()
```
```python
df['Sex'].value_counts()
```
### Handle missing values
```python
# missing
df.isnull().sum()>(len(df)/2)
```
```python
df.drop('Cabin',axis=1,inplace=True)
```
```python
df.info()
```
```python
df['Age'].isnull().value_counts()
```
### Correlation Exploration
```python
df.groupby('Sex')['Age'].median().plot(kind='bar')
```
```python
df['Age'] = df.groupby('Sex')['Age'].apply(lambda x: x.fillna(x.median()))
```
```python
df.isnull().sum()
```
```python
df['Embarked'].value_counts()
```
```python
df['Embarked'].fillna(df['Embarked'].value_counts().idxmax(), inplace=True)
df['Embarked'].value_counts()
```
```python
df = pd.get_dummies(data=df, columns=['Sex', 'Embarked'],drop_first=True)
df.head()
```
**Correlation Matrix**
```python
df.corr()
```
**Define X and Y**
```python
X = df.drop(['Survived','Pclass'],axis=1)
y = df['Survived']
```
```python
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=67)
```
### Exercise:
We now need to split the training and test data, which you will do as an exercise:
```python
from sklearn.model_selection import train_test_split
# Look up in the portion above on linear regression and use train_test_split here.
# Set test_size = 0.3 and random_state = 67 to get the same results as below when
# you run through the rest of the code example below.
```
**Use Logistic Regression Model**
```python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
```
```python
lr.fit(X_train,y_train)
```
```python
predictions = lr.predict(X_test)
```
### Evaluate the model
#### Classification report
```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```
The classification report presents the results for both survivors and non-survivors with four scores:
- **Precision:** The number of true positives divided by the sum of true positives and false positives; closer to 1 is better.
- **Recall:** The true-positive rate, the number of true positives divided by the sum of the true positives and the false negatives.
- **F1 score:** The harmonic mean (the average for rates) of precision and recall.
- **Support:** The number of true instances for each label.
```python
print(classification_report(y_test,predictions))
```
#### Confusion matrix
```python
print(confusion_matrix(y_test,predictions))
```
```python
# rows are the true classes (0 = did not survive, 1 = survived);
# columns are the predicted classes
pd.DataFrame(confusion_matrix(y_test, predictions),
             index=['True Not Survived', 'True Survived'],
             columns=['Predicted Not Survived', 'Predicted Survived'])
```
#### Accuracy score
```python
print(accuracy_score(y_test,predictions))
```
&gt; **Takeaway:** In this subsection, you performed classification using logistic regression by removing extraneous variables, checking for multicollinearity, handling missing values, and fitting and evaluating your model.
## Classification: decision trees
&gt; **Learning goal:** By the end of this subsection, you should be comfortable fitting decision-tree models and have some understanding of what they output.
```python
from sklearn import tree
tr = tree.DecisionTreeClassifier()
```
### Exercise:
```python
# Using the same split data as with the logistic regression,
# can you fit the decision tree model?
# Hint: Refer to code snippet for fitting the logistic regression above.
```
**Note**: Using the same Titanic Data
```python
tr.fit(X_train, y_train)
```
```python
tr_predictions = tr.predict(X_test)
```
```python
# rows are the true classes (0 = did not survive, 1 = survived);
# columns are the predicted classes
pd.DataFrame(confusion_matrix(y_test, tr_predictions),
             index=['True Not Survived', 'True Survived'],
             columns=['Predicted Not Survived', 'Predicted Survived'])
```
```python
print(accuracy_score(y_test,tr_predictions))
```
**Visualize tree**
```python
import graphviz
# class 0 = did not survive, class 1 = survived
dot_file = tree.export_graphviz(tr, out_file=None,
                                feature_names=X.columns,
                                class_names=['Did not survive', 'Survived'],
                                filled=True, rounded=True)
graph = graphviz.Source(dot_file)
graph
```
&gt; **Takeaway:** In this subsection, you performed classification on previously cleaned data by fitting and evaluating a decision tree.

View file

@ -0,0 +1,224 @@
# Section 2: Cloud-based machine learning
&gt; **Note:** The `azureml` package presently works only with Python 2. If your notebook is not currently running Python 2, change it in the menu at the top of the notebook by clicking **Kernel &gt; Change kernel &gt; Python 2**.
## Create and connect to an Azure ML Studio workspace
The `azureml` package is installed by default with Azure Notebooks, so we don't have to worry about that. It uses an Azure ML Studio workspace ID and authorization token to connect your notebook to the workspace; you will obtain the ID and token by following these steps:
1. Open [Azure ML Studio](https://studio.azureml.net) in a new browser tab and sign in with a Microsoft account. Azure ML Studio is free and does not require an Azure subscription. Once signed in with your Microsoft account (the same credentials you've used for Azure Notebooks), you're in your “workspace.”
2. On the left pane, click **Settings**.
![Settings button](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-settings.png?raw=true)<br><br>
3. On the **Name** tab, the **Workspace ID** field contains your workspace ID. Copy that ID into the `workspace_id` value in the code cell in Step 5 of the notebook below.
![Location of workspace ID](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-workspace-id.png?raw=true)<br><br>
4. Click the **Authorization Tokens** tab, and then copy either token into the `authorization_token` value in the code cell in Step 5 of the notebook.
![Location of authorization token](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-tokens.png?raw=true)<br><br>
5. Run the code cell below; if it runs without error, you're ready to continue.
```python
from azureml import Workspace
# Replace the values with those from your own Azure ML Studio instance; see Prerequisites
# The workspace_id is a string of hexadecimal characters; the token is a long string of random characters.
workspace_id = 'your_workspace_id'
authorization_token = 'your_auth_token'
ws = Workspace(workspace_id, authorization_token)
```
## Explore forest fire data
Let's look at a meteorological dataset collected by Cortez and Morais for their 2007 study of the burned area of forest fires in the northeast region of Portugal.
&gt; P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data.
In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence,
Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December,
Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
The dataset contains the following features:
- **`X`**: x-axis spatial coordinate within the Montesinho park map: 1 to 9
- **`Y`**: y-axis spatial coordinate within the Montesinho park map: 2 to 9
- **`month`**: month of the year: "1" to "12" jan-dec
- **`day`**: day of the week: "1" to "7" sun-sat
- **`FFMC`**: FFMC index from the FWI system: 18.7 to 96.20
- **`DMC`**: DMC index from the FWI system: 1.1 to 291.3
- **`DC`**: DC index from the FWI system: 7.9 to 860.6
- **`ISI`**: ISI index from the FWI system: 0.0 to 56.10
- **`temp`**: temperature in Celsius degrees: 2.2 to 33.30
- **`RH`**: relative humidity in %: 15.0 to 100
- **`wind`**: wind speed in km/h: 0.40 to 9.40
- **`rain`**: outside rain in mm/m2 : 0.0 to 6.4
- **`area`**: the burned area of the forest (in ha): 0.00 to 1090.84
Let's load the dataset and visualize the area that was burned in relation to the temperature in that region.
```python
import pandas as pd
df = pd.DataFrame(pd.read_csv('Data/forestfires.csv'))
%matplotlib inline
from ggplot import *
ggplot(aes(x='temp', y='area'), data=df) + geom_line() + geom_point()
```
Intuitively, the hotter the weather, the more hectares burned in forest fires.
## Transfer your data to Azure ML Studio
```python
from azureml import DataTypeIds
dataset = ws.datasets.add_from_dataframe(
dataframe=df,
data_type_id=DataTypeIds.GenericCSV,
name='Forest Fire Data',
description='Paulo Cortez and Aníbal Morais (Univ. Minho) @ 2007'
)
```
After running the code above, you can see the dataset listed in the **Datasets** section of the Azure Machine Learning Studio workspace. (**Note**: You might need to switch between browser tabs and refresh the page in order to see the dataset.)
![image.png](attachment:image.png)<br>
**View Azure ML Studio Data in Notebooks**
```python
print('\n'.join([i.name for i in ws.datasets if not i.is_example])) # only list user-created datasets
```
**Interact with Azure ML Studio Data in Notebooks**
```python
# Read some more of the metadata
ds = ws.datasets['Forest Fire Data']
print(ds.name)
print(ds.description)
print(ds.family_id)
print(ds.data_type_id)
print(ds.created_date)
print(ds.size)
# Read the contents
df2 = ds.to_dataframe()
df2.head()
```
## Create your model
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
df[['wind','rain','month','RH']],
df['temp'],
test_size=0.25,
random_state=42
)
```
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)
y_test_predictions = regressor.predict(X_test)
print('R^2 for true vs. predicted test set forest temperature: {:0.2f}'.format(r2_score(y_test, y_test_predictions)))
```
### Exercise:
```python
# Play around with this algorithm.
# Can you get better results changing the variables you select for the training and test data?
# What if you look at different variables for the response?
```
## Deploy your model as a web service
**Access your Model Anywhere**
```python
from azureml import services

@services.publish(workspace_id, authorization_token)
@services.types(wind=float, rain=float, month=int, RH=float)
@services.returns(float)
# The name of your web service is set to this function's name
def forest_fire_predictor(wind, rain, month, RH):
    # scikit-learn expects a 2D array of samples; return the single prediction as a float
    return regressor.predict([[wind, rain, month, RH]])[0]

# Hold onto information about your web service so
# you can call it within the notebook later
service_url = forest_fire_predictor.service.url
api_key = forest_fire_predictor.service.api_key
help_url = forest_fire_predictor.service.help_url
service_id = forest_fire_predictor.service.service_id
```
## Consuming the web service
```python
forest_fire_predictor.service(5.4, 0.2, 9, 22.1)
```
```python
# Note: urllib2 and the "except ... , error" syntax are Python 2 constructs,
# matching the Python 2 kernel this section requires.
import urllib2
import json

data = {"Inputs": {
            "input1": {
                "ColumnNames": ["wind", "rain", "month", "RH"],
                "Values": [["5.4", "0.2", "9", "22.1"]]
            }
        },  # Specified feature values
        "GlobalParameters": {}
       }

body = json.dumps(data)
headers = {'Content-Type': 'application/json', 'Authorization': ('Bearer ' + api_key)}
req = urllib2.Request(service_url, body, headers)

try:
    response = urllib2.urlopen(req)
    result = json.loads(response.read())  # load JSON-formatted string response as dictionary
    print(result['Results']['output1']['value']['Values'][0][0])  # Get the returned prediction
except urllib2.HTTPError, error:
    print("The request failed with status code: " + str(error.code))
    print(error.info())
    print(json.loads(error.read()))
```
### Exercise:
Try this same process of training and hosting a model through Azure ML Studio with the Pima Indians Diabetes dataset (in CSV format in your Data folder). The dataset has nine columns; use any of the eight features you see fit to try and predict the ninth column, Outcome (1 = diabetes, 0 = no diabetes).
```python
```
&gt; **Takeaway**: In this part, you explored fitting a model and deploying it as a web service. You did this by using now-familiar tools in an Azure Notebook to build a model relating variables surrounding forest fires and then posting that as a function in Azure ML Studio. From there, you saw how you and others can access the pre-fitted models to make predictions on new data from anywhere on the web.

View file

@ -0,0 +1,25 @@
# Capstone Project
In this Capstone Project you will be engaging with the NOAA Significant Volcanic Eruption database, which can be found in this notebook's Data folder at:
`Data/noaa_volerup.csv`
## Tasks
Using what you know about Python, NumPy, Pandas, and machine learning, you should:
- Identify requirements for success
- Identify possible risks in the data if this were a real-world scenario
- Prepare the data
- Select features (variables)
- Split the data between training and testing
- Choose algorithms
## Options
If you would prefer to find your own dataset, that is OK, however limit your searching to about 15 minutes. Microsoft has several [Public Datasets](https://docs.microsoft.com/en-us/azure/sql-database/sql-database-public-data-sets) if you want to start there.
You are also encouraged to explore any aspects of the data. Be explicit about your inquiry and about how successfully you can predict its effects on our world.
```python
```

View file

@ -0,0 +1,576 @@
# Section 1: Introduction to machine learning models
You have now made it to the section on machine learning (ML). ML and the branch of computer science in which it resides, artificial intelligence (AI), are so central to data science that ML/AI and data science are synonymous in the minds of many people. However, the preceding sections have hopefully demonstrated that there are a lot of other facets to the discipline of data science apart from the prediction and classification tasks that supply so much value to the world. (Remember, at least 80 percent of the effort in most data-science projects will be composed of cleaning and manipulating the data to prepare it for analysis.)
That said, ML is fun! In this section, and the next one on data science in the cloud, you will get to play around with some of the “magic” of data science and start to put into practice the tools you have spent the last five sections learning. Let's get started!
## A quick aside: types of ML
As you get deeper into data science, it might seem like there are a bewildering array of ML algorithms out there. However many you encounter, it can be handy to remember that most ML algorithms fall into three broad categories:
- **Predictive algorithms**: These analyze current and historical facts to make predictions about unknown events, such as the future or customers' choices.
- **Classification algorithms**: These teach a program from a body of data, and the program then uses that learning to classify new observations.
- **Time-series forecasting algorithms**: While it can be argued that these algorithms are a part of predictive algorithms, their techniques are specialized enough that they in many ways function like a separate category. Time-series forecasting is beyond the scope of this course, but we have more than enough to work with in focusing here on prediction and classification.
## Prediction: linear regression
&gt; **Learning goal:** By the end of this subsection, you should be comfortable fitting linear regression models, and you should have some familiarity with interpreting their output.
Arguably the simplest form of machine learning is to draw a line connecting two points and make predictions about where that trend might lead.
But what if you have more than two points—and those points don't line up neatly? What if you have points in more than two dimensions? This is where linear regression comes in.
Formally, linear regression is used to predict a quantitative *response* (the values on a Y axis) that is dependent on one or more *predictors* (values on one or more axes that are orthogonal to Y, commonly just thought of collectively as X). The working assumption is that the relationship between predictors and response is more or less linear. The goal of linear regression is to fit a straight line in the best possible way to minimize the deviation between our observed responses in the dataset and the responses predicted by our line, the linear approximation. (The most common means of assessing this error is called the **least squares method**; it consists of minimizing the number you get when you square the difference between your predicted value and the actual value and add up all of those squared differences for your entire dataset.)
<img src="../Images/linear_regression.png" style="padding-right: 10px;">
Statistically, we can represent this relationship between response and predictors as:
$Y = B_0 + B_1X + E$
Remember high school geometry? $B_0$ is the intercept of our line and $B_1$ is its slope. We commonly refer to $B_0$ and $B_1$ as coefficients and to $E$ as the *error term*, which represents the margin of error in the model.
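Spelling out the least-squares criterion described above, the fitted coefficients are the ones that minimize the residual sum of squares over the $n$ observations:

$$RSS = \sum_{i=1}^{n} \left( Y_i - (B_0 + B_1 X_i) \right)^2$$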
Let's try this in practice with actual data. (Note: no graph paper will be harmed in the course of these predictions.)
### Data exploration
We'll begin by importing our usual libraries and using our %matplotlib inline magic command:
```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
```
And now for our data. In this case, we'll use a newer housing dataset than the Boston Housing Dataset we used in the last section (with this one storing data on individual houses across the United States).
```python
df = pd.read_csv('../Data/Housing_Dataset_Sample.csv')
df.head()
```
### Exercise:
```python
# Do you remember the DataFrame method for looking at overall information
# about a DataFrame, such as number of columns and rows? Try it here.
```
Let's also use the `describe` method to look at some of the vital statistics about the columns. Note that in cases like this, in which some of the column names are long, it can be helpful to view the transposition of the summary, like so:
```python
df.describe().T
```
Let's look at the data in the **Price** column. (You can disregard the deprecation warning if it appears.)
```python
sns.distplot(df['Price'])
```
As we would hope with this much data, our prices form a nice bell-shaped, normally distributed curve.
Now, let's look at a simple relationship like that between house prices and the average income in a geographic area:
```python
sns.jointplot(df['Avg. Area Income'],df['Price'])
```
As we would expect, there is an intuitive, linear relationship between them. Also good: the jointplot's marginal histograms show that the data in both columns is normally distributed, so we don't have to worry about somehow transforming the data for meaningful analysis.
Let's take a quick look at all of the columns:
```python
sns.pairplot(df)
```
Some observations:
1. Not all of the combinations of columns provide strong linear relationships; some just look like blobs. That's nothing to worry about for our analysis.
2. See the visualizations that look like lanes rather than organic groups? That is the result of the average number of bedrooms in houses being measured in discrete values rather than continuous ones (as no one has 0.3 bedrooms in their house). The number of bedrooms is also the one column whose data is not really normally distributed, though some of this might be distortion caused by the default bin size of the pairplot histogram functionality.
It is now time to make a prediction.
### Fitting the model
Let's feed everything into a linear model (average area income, average area house age, average area number of rooms, average area number of bedrooms, and area population) and see how well knowing those factors can help us predict the price of a home.
To do this, we will make our first five columns the X (our predictors) and the **Price** column the Y (our response):
```python
X = df.iloc[:,:5]
y = df['Price']
```
Now, we could use all of our data to create our model. However, all that would get us is a model that is good at predicting itself. Not only would that leave us with no objective way to measure how good the model is, it would also likely lead to a model that was less accurate when used on new data. Such a model is termed *overfitted*.
To avoid this, data scientists divide their datasets for ML into *training* data (the data used to fit the model) and *test* data (data used to evaluate how accurate the model is). Fortunately, scikit-learn provides a function that enables us to easily divide up our data between training and test sets: `train_test_split`. In this case, we will use 70 percent of our data for training and reserve 30 percent of it for testing. (Note that you will also supply a fourth parameter to the function: `random_state`; `train_test_split` randomly divides up our data between test and training, so this number provides an explicit seed for the random-number generator so that you will get the same result each time you run this code snippet.)
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=54)
```
All that is left now is to import our linear regression algorithm and fit our model based on our training data:
```python
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
```
```python
reg.fit(X_train,y_train)
```
### Evaluating the model
Now, a moment of truth: let's see how our model does making predictions based on the test data:
```python
predictions = reg.predict(X_test)
```
```python
predictions
```
Our predictions are just an array of numbers: these are the house prices predicted by our model. One for every row in our test dataset.
Remember how we mentioned that linear models have the mathematical form of $Y = B_0 + B_1X + E$? Let's look at the actual equation:
```python
print(reg.intercept_,reg.coef_)
```
In algebraic terms, here is our model:
$Y=-2,646,401+0.21587X_1+0.00002X_2+0.00001X_3+0.00279X_4+0.00002X_5$
where:
- $Y=$ Price
- $X_1=$ Average area income
- $X_2=$ Average area house age
- $X_3=$ Average area number of rooms
- $X_4=$ Average area number of bedrooms
- $X_5=$ Area population
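If you want to double-check which fitted coefficient belongs to which predictor, you can pair the column names with `reg.coef_` (a small convenience sketch, not part of the original notebook):
```python
# pair each predictor column with its fitted coefficient
list(zip(X.columns, reg.coef_))
```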
So, just how good is our model? There are many ways to measure the accuracy of ML models. Linear models have a good one: the $R^2$ score (also known as the coefficient of determination). A high $R^2$, close to 1, indicates better prediction with less error.
```python
#Explained variation. A high R2 close to 1 indicates better prediction with less error.
from sklearn.metrics import r2_score
r2_score(y_test,predictions)
```
The $R^2$ score also indicates how much explanatory power a linear model has. In the case of our model, the five predictors we used in the model explain a little more than 92 percent of the price of a house in this dataset.
We can also plot our errors to get a visual sense of how wrong our predictions were:
```python
#plot errors
sns.distplot([y_test-predictions])
```
Do you notice the numbers on the left axis? Whereas a histogram shows the number of things that fall into discrete numeric buckets, a kernel density estimation (KDE, and the histogram that accompanies it in the Seaborn distplot) normalizes those numbers to show what proportion of results lands in each bucket. Essentially, these are all decimal numbers less than 1.0 because the area under the KDE has to add up to 1.
Maybe more gratifying, we can plot the predictions from our model:
```python
# Plot outputs
plt.scatter(y_test,predictions, color='blue')
```
The linear nature of our predicted prices is clear enough, but there are so many of them that it is hard to tell where dots are concentrated. Can you think of a way to refine this visualization to make it clearer, particularly if you were explaining the results to someone?
### Exercise:
```python
# Hint: Remember to try the plt.scatter parameter alpha=.
# It takes values between 0 and 1.
```
&gt; **Takeaway:** In this subsection, you performed prediction using linear regression by exploring your data, then fitting your model, and finally evaluating your model's performance.
## Classification: logistic regression
&gt; **Learning goal:** By the end of this subsection, you should know how logistic regression differs from linear regression, be comfortable fitting logistic regression models, and have some familiarity with interpreting their output.
We'll now pivot to discussing classification. If our simple analogy of predictive analytics was drawing a line through points and extrapolating from that, then classification can be described in its simplest form as drawing lines around groups of points.
While linear regression is used to predict quantitative responses, *logistic* regression is used for classification problems. Formally, logistic regression predicts the categorical response (Y) based on predictors (Xs). Logistic regression goes by several names, and it is also known in the scholarly literature as logit regression, maximum-entropy classification (MaxEnt), and the log-linear classifier. In this algorithm, the probabilities describing the possible outcomes of a single trial are modeled using a sigmoid (S-curve) function. Sigmoid functions take any value and transform it to be between 0 and 1, which can be used as a probability for a class to be predicted, with the goal of predictors mapping to 1 when something belongs in the class and 0 when they do not.
<img src="../Images/logistic_regression.png?" style="padding-right: 10px;">
To show this in action, let's do something a little different and try a historical dataset: the fates of the passengers of the RMS Titanic, which is a popular dataset for classification problems in machine learning. In this case, the class we want to predict is whether a passenger survived the doomed liner's sinking.
The dataset has 12 variables:
- **PassengerId**
- **Survived:** 0 = No, 1 = Yes
- **Pclass:** Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
- **Sex**
- **Age**
- **Sibsp:** Number of siblings or spouses aboard the *Titanic*
- **Parch:** Number of parents or children aboard the *Titanic*
- **Ticket:** Passenger ticket number
- **Fare:** Passenger fare
- **Cabin:** Cabin number
- **Embarked:** Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
```python
df = pd.read_csv('../Data/train_data_titanic.csv')
df.head()
```
```python
df.info()
```
One reason that the Titanic data set is a popular classification set is that it provides opportunities to prepare data for analysis. To prepare this dataset for analysis, we need to perform a number of tasks:
- Remove extraneous variables
- Check for multicollinearity
- Handle missing values
We will touch on each of these steps in turn.
### Remove extraneous variables
The names of individual passengers and their ticket numbers will clearly do nothing to help our model, so we can drop those columns to simplify matters.
```python
df.drop(['Name','Ticket'],axis=1,inplace=True)
```
There are additional variables that will not add classifying power to our model, but to find them we will need to look for correlation between variables.
### Check for multicollinearity
If one or more of our predictors can themselves be predicted from other predictors, it can produce a state of *multicollinearity* in our model. Multicollinearity is a challenge because it can skew the results of regression models (both linear and logistic) and reduce the predictive or classifying power of a model.
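As a quick first check, a correlation heatmap of the numeric columns can make strongly related predictors easy to spot. This is only a sketch; it assumes the `df` and `sns` (Seaborn) objects already in use in this notebook:
```python
# Correlations close to 1 or -1 between two predictors hint at possible multicollinearity
sns.heatmap(df.select_dtypes('number').corr(), annot=True, cmap='coolwarm')
```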
To help combat this problem, we can start to look for some initial patterns. For example, do any correlations between **Survived** and **Fare** jump out?
```python
sns.pairplot(df[['Survived','Fare']], dropna=True)
```
### Exercise:
```python
# Try running sns.pairplot twice more on some other combinations of columns
# and see if any patterns emerge.
```
We can also use `groupby` to look for patterns. Consider the mean values for the various variables when we group by **Survived**:
```python
df.groupby('Survived').mean()
```
Survivors appear to be slightly younger on average and to have paid higher fares.
```python
df.head()
```
Value counts can also help us get a sense of the data before us, such as numbers for siblings and spouses on the *Titanic*, in addition to the sex split of passengers:
```python
df['SibSp'].value_counts()
```
```python
df['Parch'].value_counts()
```
```python
df['Sex'].value_counts()
```
### Handle missing values
We now need to address missing values. First, let's look to see which columns have more than half of their values missing:
```python
#missing
df.isnull().sum()>(len(df)/2)
```
Let's break down the code in the call above just a bit. `df.isnull().sum()` tells pandas to take the sum of all the missing values for each column. `len(df)/2` is just another way of expressing half the number of rows in the `DataFrame`. Taken together with the `&gt;`, this line of code looks for any columns with more than half of their entries missing, and there is one: **Cabin**.
We could try to do something about those missing values. However, if any pattern does emerge in the data that involves **Cabin**, it will be highly cross-correlated with both **Pclass** and **Fare** (as higher-fare, better-class accommodations were grouped together on the *Titanic*). Given that too much cross-correlation can be detrimental to a model, it is probably just better for us to drop **Cabin** from our `DataFrame`:
```python
df.drop('Cabin',axis=1,inplace=True)
```
Let's now run `info` to see if there are columns with just a few null values.
```python
df.info()
```
One note on the data: given that 1,503 people died in the *Titanic* tragedy (and that we know some survived), this dataset clearly does not include every passenger on the ship, and it includes none of the crew. Also remember that **Survived** is a variable that covers both those who survived and those who perished.
Back to missing values. **Age** is missing several values, as is **Embarked**. Let's see how many values are missing from **Age**:
```python
df['Age'].isnull().value_counts()
```
As we saw above, **Age** isn't really correlated with **Fare**, so it is a variable that we want to eventually use in our model. That means that we need to do something with those missing values. But before we decide on a strategy, we should check whether the median age is the same for both sexes.
```python
df.groupby('Sex')['Age'].median().plot(kind='bar')
```
The median ages are different for men and women sailing on the *Titanic*, which means that we should handle the missing values accordingly. A sound strategy is to replace the missing ages for passengers with the median age *for the passengers' sexes*.
```python
df['Age'] = df.groupby('Sex')['Age'].apply(lambda x: x.fillna(x.median()))
```
Any other missing values?
```python
df.isnull().sum()
```
We are missing two values for **Embarked**. Check to see how that variable breaks down:
```python
df['Embarked'].value_counts()
```
The vast majority of passengers embarked on the *Titanic* at Southampton, so we will just fill in those two missing values with the most common value (the mode): Southampton.
```python
df['Embarked'].fillna(df['Embarked'].value_counts().idxmax(), inplace=True)
df['Embarked'].value_counts()
```
Because scikit-learn models need numeric inputs, we also convert the categorical **Sex** and **Embarked** columns into dummy (indicator) variables, dropping the first level of each to avoid introducing collinearity:
```python
df = pd.get_dummies(data=df, columns=['Sex', 'Embarked'],drop_first=True)
df.head()
```
Let's do a final look at the correlation matrix to see if there is anything else we should remove.
```python
df.corr()
```
Because **Pclass** and **Fare** show some correlation, we can probably get rid of one of them. In addition, we need to remove **Survived** from our X `DataFrame` because it will serve as our response, y:
```python
X = df.drop(['Survived','Pclass'],axis=1)
y = df['Survived']
```
```python
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=67)
```
### Exercise:
We now need to split the training and test data, which you will do as an exercise:
```python
from sklearn.model_selection import train_test_split
# Look up in the portion above on linear regression and use train_test_split here.
# Set test_size = 0.3 and random_state = 67 to get the same results as below when
# you run through the rest of the code example below.
```
Now you will import and fit the logistic regression model:
```python
from sklearn.linear_model import LogisticRegression
lr = LogisticRegression()
```
```python
lr.fit(X_train,y_train)
```
```python
predictions = lr.predict(X_test)
```
### Evaluate the model
In contrast to linear regression, logistic regression does not produce an $R^2$ score with which we can assess the accuracy of our model. To evaluate the model, we will use a classification report, a confusion matrix, and the accuracy score.
#### Classification report
```python
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
```
The classification report shows four scores for each class (survivors and non-survivors):
- **Precision:** The number of true positives divided by the sum of true positives and false positives; closer to 1 is better.
- **Recall:** The true-positive rate, the number of true positives divided by the sum of the true positives and the false negatives.
- **F1 score:** The harmonic mean (the average for rates) of precision and recall.
- **Support:** The number of true instances for each label.
Why so many ways of measuring a model's performance? Because success means different things in different contexts. Imagine that we had a model to diagnose an infectious disease. In such a case, we might want to tune our model to maximize recall (and thus minimize our false-negative rate), because even a model with high precision could still miss many infected people. On the other hand, a weather-forecasting model might be tuned to maximize precision because the cost of its false negatives is low. For other uses, striking a balance between precision and recall by maximizing the F1 score might be the best choice. Run the classification report:
```python
print(classification_report(y_test,predictions))
```
#### Confusion matrix
The confusion matrix is another way to present this same information, this time as raw counts. Scikit-learn arranges the matrix with the true condition in the rows (non-survivors first, survivors second) and the predicted condition in the columns. So, the matrix below shows that our model correctly predicted 146 non-survivors (true negatives) and 76 survivors (true positives), while it incorrectly classified 16 non-survivors as survivors (false positives) and 30 survivors as non-survivors (false negatives).
```python
print(confusion_matrix(y_test,predictions))
```
Let's dress up the confusion matrix a bit to make it a little easier to read:
```python
pd.DataFrame(confusion_matrix(y_test, predictions), index=['True Not Survived', 'True Survived'], columns=['Predicted Not Survived', 'Predicted Survived'])
```
#### Accuracy score
Finally, our accuracy score tells us the fraction of correctly classified samples; in this case (146 + 76) / (146 + 76 + 30 + 16).
```python
print(accuracy_score(y_test,predictions))
```
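To connect the raw counts with the scores above, here is a minimal sketch (reusing `y_test` and `predictions`) that recomputes accuracy, precision, and recall for the survived class by hand from the confusion matrix:
```python
# For binary labels ordered [0, 1], ravel() unpacks the matrix as TN, FP, FN, TP
tn, fp, fn, tp = confusion_matrix(y_test, predictions).ravel()

print('Accuracy:  {:.3f}'.format((tp + tn) / float(tn + fp + fn + tp)))
print('Precision: {:.3f}'.format(tp / float(tp + fp)))
print('Recall:    {:.3f}'.format(tp / float(tp + fn)))
```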
Not bad for an off-the-shelf model with no tuning!
&gt; **Takeaway:** In this subsection, you performed classification using logistic regression by removing extraneous variables, checking for multicollinearity, handling missing values, and fitting and evaluating your model.
## Classification: decision trees
&gt; **Learning goal:** By the end of this subsection, you should be comfortable fitting decision-tree models and have some understanding of what they output.
If logistic regression uses observations about variables to swing a metaphorical needle between 0 and 1, classification based on decision trees programmatically builds a series of yes/no decisions to classify items.
<img src="../Images/decision_tree.png" style="padding-right: 10px;">
Let's look at this in practice with the same *Titanic* dataset we used with logistic regression.
```python
from sklearn import tree
```
```python
tr = tree.DecisionTreeClassifier()
```
### Exercise:
```python
# Using the same split data as with the logistic regression,
# can you fit the decision tree model?
# Hint: Refer to code snippet for fitting the logistic regression above.
```
```python
tr.fit(X_train, y_train)
```
Once fitted, we get our predictions just like we did in the logistic regression example above:
```python
tr_predictions = tr.predict(X_test)
```
```python
pd.DataFrame(confusion_matrix(y_test, tr_predictions),
             index=['True Not Survived', 'True Survived'],
             columns=['Predicted Not Survived', 'Predicted Survived'])
```
```python
print(accuracy_score(y_test,tr_predictions))
```
One of the great attractions of decision trees is that the models are readable by humans. Let's visualize the tree to see this in action. (Note: the generated graphic can be quite large, so scroll to the right if it looks blank at first.)
```python
import graphviz
dot_file = tree.export_graphviz(tr, out_file=None,
                                feature_names=X.columns,
                                class_names=['Not Survived', 'Survived'],
                                filled=True, rounded=True)
graph = graphviz.Source(dot_file)
graph
```
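If graphviz is not available in your environment, a plain-text rendering of the same tree is possible with `tree.export_text`, assuming you are running scikit-learn 0.21 or later:
```python
# Text-based view of the fitted tree (requires scikit-learn >= 0.21)
print(tree.export_text(tr, feature_names=list(X.columns)))
```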
There are, of course, myriad other ML models that we could explore. However, you now know some of the most commonly encountered ones, which is great preparation to understand what automated, cloud-based ML and AI services are doing and how to intelligently apply them to data-science problems, the subject of the next section.
&gt; **Takeaway:** In this subsection, you performed classification on previously cleaned data by fitting and evaluating a decision tree.

# Section 2: Cloud-based machine learning
Thus far, we have looked at building and fitting ML models “locally.” True, the notebooks have been located in the cloud themselves, but the models with all of their predictive and classification power are stuck in those notebooks. To use these models, you would have to load data into your notebooks and get the results there.
In practice, we want those models accessible from a number of locations. And while the management of production ML models has a lifecycle all its own, one part of that is making models accessible from the web. One way to do so is to develop them using third-party cloud tools, such as [Microsoft Azure ML Studio](https://studio.azureml.net) (not to be confused with the Microsoft Azure Machine Learning service, which provides end-to-end lifecycle management for ML models).
Alternatively, we can develop and deploy a function that can be accessed by other programs over the web—a web service—that runs within Azure ML Studio, and we can do so entirely from a Python notebook. In this section, we will use the [`azureml`](https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python) package to deploy an Azure ML web service directly from within a Python notebook (or other Python environment).
&gt; **Note:** The `azureml` package presently works only with Python 2. If your notebook is not currently running Python 2, change it in the menu at the top of the notebook by clicking **Kernel &gt; Change kernel &gt; Python 2**.
## Create and connect to an Azure ML Studio workspace
The `azureml` package is installed by default with Azure Notebooks, so we don't have to worry about that. It uses an Azure ML Studio workspace ID and authorization token to connect your notebook to the workspace; you will obtain the ID and token by following these steps:
1. Open [Azure ML Studio](https://studio.azureml.net) in a new browser tab and sign in with a Microsoft account. Azure ML Studio is free and does not require an Azure subscription. Once signed in with your Microsoft account (the same credentials you've used for Azure Notebooks), you're in your “workspace.”
2. On the left pane, click **Settings**.
![Settings button](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-settings.png?raw=true)<br><br>
3. On the **Name** tab, the **Workspace ID** field contains your workspace ID. Copy that ID into the `workspace_id` value in the code cell in Step 5 of the notebook below.
![Location of workspace ID](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-workspace-id.png?raw=true)<br><br>
4. Click the **Authorization Tokens** tab, and then copy either token into the `authorization_token` value in the code cell in Step 5 of the notebook.
![Location of authorization token](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-tokens.png?raw=true)<br><br>
5. Run the code cell below; if it runs without error, you're ready to continue.
```python
from azureml import Workspace
# Replace the values with those from your own Azure ML Studio instance; see Prerequisites
# The workspace_id is a string of hexadecimal characters; the token is a long string of random characters.
workspace_id = 'your_workspace_id'
authorization_token = 'your_auth_token'
ws = Workspace(workspace_id, authorization_token)
```
## Explore forest fire data
Let's look at a meteorological dataset collected by Cortez and Morais (2007) to study the burned area of forest fires in the northeast region of Portugal.
&gt; P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data.
In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence,
Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December,
Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
The dataset contains the following features:
- **`X`**: x-axis spatial coordinate within the Montesinho park map: 1 to 9
- **`Y`**: y-axis spatial coordinate within the Montesinho park map: 2 to 9
- **`month`**: month of the year: "1" to "12" jan-dec
- **`day`**: day of the week: "1" to "7" sun-sat
- **`FFMC`**: FFMC index from the FWI system: 18.7 to 96.20
- **`DMC`**: DMC index from the FWI system: 1.1 to 291.3
- **`DC`**: DC index from the FWI system: 7.9 to 860.6
- **`ISI`**: ISI index from the FWI system: 0.0 to 56.10
- **`temp`**: temperature in Celsius degrees: 2.2 to 33.30
- **`RH`**: relative humidity in %: 15.0 to 100
- **`wind`**: wind speed in km/h: 0.40 to 9.40
- **`rain`**: outside rain in mm/m2 : 0.0 to 6.4
- **`area`**: the burned area of the forest (in ha): 0.00 to 1090.84
Let's load the dataset and visualize the area that was burned in relation to the temperature in that region.
```python
import pandas as pd
df = pd.read_csv('../Data/forestfires.csv')
%matplotlib inline
from ggplot import *
ggplot(aes(x='temp', y='area'), data=df) + geom_line() + geom_point()
```
Intuitively, the hotter the weather, the more hectares burned in forest fires.
## Transfer your data to Azure ML Studio
We have our data, but how do we get it into Azure ML Studio in order to use it there? That is where the `azureml` package comes in. It enables us to load data and models into Azure ML Studio from an Azure Notebook (or any Python environment).
The first code cell of this notebook is what establishes the connection with *your* Azure ML Studio account.
Now that you have your notebook talking to Azure ML Studio, you can export your data to it:
```python
from azureml import DataTypeIds
dataset = ws.datasets.add_from_dataframe(
dataframe=df,
data_type_id=DataTypeIds.GenericCSV,
name='Forest Fire Data',
description='Paulo Cortez and Aníbal Morais (Univ. Minho) @ 2007'
)
```
After running the code above, you can see the dataset listed in the **Datasets** section of the Azure Machine Learning Studio workspace. (**Note**: You might need to switch between browser tabs and refresh the page in order to see the dataset.)
It is also straightforward to list the datasets available in the workspace and transfer datasets from the workspace to the notebook:
```python
print('\n'.join([i.name for i in ws.datasets if not i.is_example])) # only list user-created datasets
```
You can also interact with and examine the dataset in Azure ML Studio directly from your notebook:
```python
# Read some more of the metadata
ds = ws.datasets['Forest Fire Data']
print(ds.name)
print(ds.description)
print(ds.family_id)
print(ds.data_type_id)
print(ds.created_date)
print(ds.size)
# Read the contents
df2 = ds.to_dataframe()
df2.head()
```
## Create your model
We're now back into familiar territory: prepping data for the model and fitting the model. To keep it interesting, we'll use the scikit-learn `train_test_split()` function with a slight change of parameters to select 75 percent of the data points for training and 25 percent for validation (testing).
```python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
df[['wind','rain','month','RH']],
df['temp'],
test_size=0.25,
random_state=42
)
```
Did you see what we did there? Rather than select all of the variables for the model, we were more selective and chose just wind speed, rainfall, month, and relative humidity to predict temperature.
Fit scikit-learn's `DecisionTreeRegressor` model using the training data. This algorithm combines the regression task you performed with linear regression and the tree-based structure of the decision tree classifier you worked with earlier.
```python
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score
regressor = DecisionTreeRegressor(random_state=42)
regressor.fit(X_train, y_train)
y_test_predictions = regressor.predict(X_test)
print('R^2 for true vs. predicted test set forest temperature: {:0.2f}'.format(r2_score(y_test, y_test_predictions)))
```
```python
# Play around with this algorithm.
# Can you get better results changing the variables you select for the training and test data?
# What if you look at different variables for the response?
```
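For example, one variation to try is swapping in the fire-weather index columns as predictors. This is only a sketch and reuses the imports and `df` from above:
```python
# Try a different set of predictors for temperature and compare the R^2 score
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    df[['FFMC', 'DMC', 'DC', 'ISI']],
    df['temp'],
    test_size=0.25,
    random_state=42
)

regressor2 = DecisionTreeRegressor(random_state=42)
regressor2.fit(X_train2, y_train2)
print('R^2: {:0.2f}'.format(r2_score(y_test2, regressor2.predict(X_test2))))
```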
## Deploy your model as a web service
This is the important part. Once deployed as a web service, your model can be accessed from anywhere. This means that rather than refit a model every time you need a new prediction for a business or humanitarian use case, you can send the data to the pre-fitted model and get back a prediction.
First, deploy the model as a predictive web service. To do so, create a wrapper function that takes input data as an argument and calls `predict()` with your trained model and this input data, returning the results.
```python
from azureml import services
@services.publish(workspace_id, authorization_token)
@services.types(wind=float, rain=float, month=int, RH=float)
@services.returns(float)
# The name of your web service is set to this function's name
def forest_fire_predictor(wind, rain, month, RH):
    # predict() expects a 2D array of samples; return the single predicted value
    return regressor.predict([[wind, rain, month, RH]])[0]
# Hold onto information about your web service so
# you can call it within the notebook later
service_url = forest_fire_predictor.service.url
api_key = forest_fire_predictor.service.api_key
help_url = forest_fire_predictor.service.help_url
service_id = forest_fire_predictor.service.service_id
```
You can also go to the **Web Services** section of your Azure ML Studio workspace to see the predictive web service running there.
## Consuming the web service
Next, consume the web service. To see if this works, try it here from the notebook session in which the web service was created. Just call the predictor directly:
```python
forest_fire_predictor.service(5.4, 0.2, 9, 22.1)
```
At any later time, you can use the stored API key and service URL to call the service. In the example below, data can be packaged in JavaScript Object Notation (JSON) format and sent to the web service.
```python
import urllib2
import json
data = {"Inputs": {
"input1": {
"ColumnNames": [ "wind", "rain", "month", "RH"],
"Values": [["5.4", "0.2", "9", "22.1"]]
}
}, # Specified feature values
"GlobalParameters": {}
}
body = json.dumps(data)
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}
req = urllib2.Request(service_url, body, headers)
try:
response = urllib2.urlopen(req)
result = json.loads(response.read()) # load JSON-formatted string response as dictionary
print(result['Results']['output1']['value']['Values'][0][0]) # Get the returned prediction
except urllib2.HTTPError, error:
print("The request failed with status code: " + str(error.code))
print(error.info())
print(json.loads(error.read()))
```
### Exercise:
Try this same process of training and hosting a model through Azure ML Studio with the Pima Indians Diabetes dataset (in CSV format in your data folder). The dataset has nine columns; use any of the eight features you see fit to try and predict the ninth column, Outcome (1 = diabetes, 0 = no diabetes).
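To get started, here is a minimal local sketch; the file name `diabetes.csv` is an assumption, so adjust it to match the actual CSV in your Data folder. The rest mirrors the logistic regression workflow from the previous section.
```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Hypothetical path; change it to the actual file name in your Data folder
df_diabetes = pd.read_csv('../Data/diabetes.csv')

# Outcome is the label (1 = diabetes, 0 = no diabetes); the other columns are features
X_d = df_diabetes.drop('Outcome', axis=1)
y_d = df_diabetes['Outcome']

X_train_d, X_test_d, y_train_d, y_test_d = train_test_split(X_d, y_d, test_size=0.25, random_state=42)

model = LogisticRegression()
model.fit(X_train_d, y_train_d)
print('Accuracy: {:0.2f}'.format(accuracy_score(y_test_d, model.predict(X_test_d))))
```
Once the model fits locally, you can wrap its prediction call in a function decorated with `@services.publish`, as shown above, to host it in Azure ML Studio.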
&gt; **Takeaway**: In this part, you explored fitting a model and deploying it as a web service. You did this by using now-familiar tools in an Azure Notebook to build a model relating variables surrounding forest fires and then posting that as a function in Azure ML Studio. From there, you saw how you and others can access the pre-fitted models to make predictions on new data from anywhere on the web.
You have now created your own ML web service. Let's now see how you can also interact with existing ML web services for even more sophisticated applications.

# Section 3: Azure Cognitive Services
Just as you created a web service that could consume data and return predictions, so there are many AI software-as-a-service (SaaS) offerings on the web that will return predictions or classifications based on data you supply to them. One family of these is Microsoft Azure Cognitive Services.
The advantage of using cloud-based services is that they provide cutting-edge models that you can access without having to train them. This can help accelerate both your exploration and use of ML.
Azure provides Cognitive Services APIs that can be consumed using Python to conduct image recognition, speech recognition, and text recognition, just to name a few. For the purposes of this notebook, we're going to look at using the Computer Vision API and the Text Analytics API.
First, we'll start by obtaining a Cognitive Services API key. Note that you can get a free key for seven days; after that, you'll be required to pay.
To learn more about pricing for Cognitive Services, see https://azure.microsoft.com/en-us/pricing/details/cognitive-services/
Browse to **Try Azure Cognitive Services** at https://azure.microsoft.com/en-us/try/cognitive-services/
1. Select **Vision API**.
2. Select **Computer Vision**.
3. Click **Get API key**.
4. If prompted for credentials, select **Free 7-day trial**.
Complete the above steps to also retrieve a Text Analytics API key from the Language APIs category. (You can also do this by scrolling down on the page with your API keys and clicking **Add** under the appropriate service.)
Once you have your API keys in hand, you're ready to start.
&gt; **Learning goal:** By the end of this part, you should have a basic comfort with accessing cloud-based cognitive services by API from a Python environment.
## Azure Cognitive Services Computer Vision
Computer vision is a hot topic in academic AI research and in business, medical, government, and environmental applications. We will explore it here by seeing firsthand how computers can tag and identify images.
The first step in using the Cognitive Services Computer Vision API is to create a client object using the `ComputerVisionClient` class.
Replace **ACCOUNT_ENDPOINT** with the account endpoint provided by the free trial, and replace **ACCOUNT_KEY** with the account key provided by the free trial.
```python
!pip install azure-cognitiveservices-vision-computervision
```
```python
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
from msrest.authentication import CognitiveServicesCredentials
# Set the endpoint and key from your Computer Vision free trial
endpoint = 'ACCOUNT_ENDPOINT'
# Example: endpoint = 'https://westcentralus.api.cognitive.microsoft.com'
key = 'ACCOUNT_KEY'
# Example: key = '1234567890abcdefghijklmnopqrstuv'
# Set credentials
credentials = CognitiveServicesCredentials(key)
# Create client
client = ComputerVisionClient(endpoint, credentials)
```
Now that we have a client object to work with, let's see what we can do.
Using `analyze_image`, we can see the properties of the image with `VisualFeatureTypes.tags`.
```python
url = 'https://cdn.pixabay.com/photo/2014/05/02/23/54/times-square-336508_960_720.jpg'
image_analysis = client.analyze_image(url,visual_features=[VisualFeatureTypes.tags])
for tag in image_analysis.tags:
print(tag)
```
### Exercise:
```python
# How can you use the code above to also see the description using VisualFeatureTypes property?
```
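One possible approach, sketched with the `client` and `url` from above; it assumes the analysis result exposes the generated captions through its `description` attribute:
```python
# Request both tags and an auto-generated description of the image
image_analysis = client.analyze_image(
    url,
    visual_features=[VisualFeatureTypes.tags, VisualFeatureTypes.description]
)

for caption in image_analysis.description.captions:
    print(caption.text)
    print(caption.confidence)
```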
Now let's look at the subject domain of the image. An example of a domain is celebrities.
As of now, the `analyze_image_by_domain` method supports only the celebrities and landmarks domain-specific models.
```python
# This will list the available subject domains
models = client.list_models()
for x in models.models_property:
print(x)
```
Let's analyze an image by domain:
```python
# Type of prediction
domain = "landmarks"
# Public-domain image of Seattle
url = "https://images.pexels.com/photos/37350/space-needle-seattle-washington-cityscape.jpg"
# English-language response
language = "en"
analysis = client.analyze_image_by_domain(domain, url, language)
for landmark in analysis.result["landmarks"]:
print(landmark["name"])
print(landmark["confidence"])
```
### Exercise:
```python
# How can you use the code above to predict an image of a celebrity?
# Using this image, https://images.pexels.com/photos/270968/pexels-photo-270968.jpeg?
# Remember that the domains were printed out earlier.
```
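A possible sketch for the celebrity exercise; the `celebrities` domain name comes from the model list printed earlier, and the `"celebrities"` result key is an assumption that mirrors the landmarks example above:
```python
# Analyze a portrait using the celebrities domain-specific model
domain = "celebrities"
url = "https://images.pexels.com/photos/270968/pexels-photo-270968.jpeg"
language = "en"

analysis = client.analyze_image_by_domain(domain, url, language)

# "celebrities" is the assumed key in the result dictionary
for celebrity in analysis.result["celebrities"]:
    print(celebrity["name"])
    print(celebrity["confidence"])
```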
Let's see how we can get a text description of an image using the `describe_image` method. Use `max_descriptions` to specify how many descriptions of the image the API service should return.
```python
domain = "landmarks"
url = "https://images.pexels.com/photos/726484/pexels-photo-726484.jpeg"
language = "en"
max_descriptions = 3
analysis = client.describe_image(url, max_descriptions, language)
for caption in analysis.captions:
print(caption.text)
print(caption.confidence)
```
### Exercise:
```python
# What other descriptions can be found with other images?
# What happens if you change the count of descriptions to output?
```
Let's say that the images contain text. How do we retrieve that information? Two methods are needed for this type of call: `batch_read_file` and `get_read_operation_result`. `TextOperationStatusCodes` is used to ensure that the `batch_read_file` call has completed before the text is read from the image.
```python
# import models
from azure.cognitiveservices.vision.computervision.models import TextRecognitionMode
from azure.cognitiveservices.vision.computervision.models import TextOperationStatusCodes
import time
url = "https://images.pexels.com/photos/6375/quote-chalk-think-words.jpg"
mode = TextRecognitionMode.handwritten
raw = True
custom_headers = None
numberOfCharsInOperationId = 36
# Async SDK call
rawHttpResponse = client.batch_read_file(url, mode, custom_headers, raw)
# Get ID from returned headers
operationLocation = rawHttpResponse.headers["Operation-Location"]
idLocation = len(operationLocation) - numberOfCharsInOperationId
operationId = operationLocation[idLocation:]
# SDK call
while True:
result = client.get_read_operation_result(operationId)
if result.status not in ['NotStarted', 'Running']:
break
time.sleep(1)
# Get data
if result.status == TextOperationStatusCodes.succeeded:
for textResult in result.recognition_results:
for line in textResult.lines:
print(line.text)
print(line.bounding_box)
```
### Exercise:
```python
# What other images with words can be analyzed?
```
You can find additional Cognitive Services demonstrations at the following URLs:
- https://aidemos.microsoft.com/
- https://github.com/microsoft/computerscience/blob/master/Events%20and%20Hacks/Student%20Hacks/hackmit/cogservices_demos/
- https://azure.microsoft.com/en-us/services/cognitive-services/directory/
Images come in varying sizes, and there might be cases where you want to create a thumbnail of the image. For this, we need to install the Pillow library, which you can learn about at https://python-pillow.org/. Pillow is a fork of PIL, the Python Imaging Library, and provides image-processing capabilities.
```python
# Install Pillow
!pip install Pillow
```
Now that the Pillow library is installed, we will import the Image module and create a thumbnail from a provided image. (Once generated, you can find the thumbnail image in your project folder on Azure Notebooks.)
```python
# Pillow package
from PIL import Image
# IO package to create local image
import io
width = 50
height = 50
url = "https://images.pexels.com/photos/37350/space-needle-seattle-washington-cityscape.jpg"
thumbnail = client.generate_thumbnail(width, height, url)
for x in thumbnail:
image = Image.open(io.BytesIO(x))
image.save('thumbnail.jpg')
```
&gt; **Takeaway:** In this subsection, you explored how to access computer-vision cognitive services by API. Specifically, you used tools to analyze and describe images that you submitted to these services.
## Azure Cognitive Services Text Analytics
Another area where cloud-based AI shines is text analytics. Like computer vision, identifying and pulling meaning from natural human languages is really the intersection of a lot of specialized disciplines, so using cloud services for it provides an economical means of tapping a lot of cognitive horsepower.
To prepare to use the Cognitive Services Text Analytics API, the requests library must be imported, along with the ability to print out JSON formats.
```python
import requests
# pprint is pretty print (formats the JSON)
from pprint import pprint
from IPython.display import HTML
```
Replace `'ACCOUNT_KEY'` with the API key you created when you signed up for the seven-day free trial account.
```python
subscription_key = 'ACCOUNT_KEY'
assert subscription_key
# If using a Free Trial account, this URL does not need to be updated.
# If using a paid account, verify that it matches the region where the
# Text Analytics Service was setup.
text_analytics_base_url = "https://westcentralus.api.cognitive.microsoft.com/text/analytics/v2.1/"
```
### Text Analytics API
Now it's time to start detecting the languages of some text.
To verify the URL endpoint for `text_analytics_base_url`, run the following:
```python
language_api_url = text_analytics_base_url + "languages"
print(language_api_url)
```
The API requires that the payload be formatted in the form of documents containing `id` and `text` attributes:
```python
documents = { 'documents': [
{ 'id': '1', 'text': 'This is a document written in English.' },
{ 'id': '2', 'text': 'Este es un documento escrito en Español.' },
{ 'id': '3', 'text': '这是一个用中文写的文件' },
{ 'id': '4', 'text': 'Ez egy magyar nyelvű dokumentum.' },
{ 'id': '5', 'text': 'Dette er et dokument skrevet på dansk.' },
{ 'id': '6', 'text': 'これは日本語で書かれた文書です。' }
]}
```
The next lines of code call the API service using the requests library to determine the languages that were passed in from the documents:
```python
headers = {"Ocp-Apim-Subscription-Key": subscription_key}
response = requests.post(language_api_url, headers=headers, json=documents)
languages = response.json()
pprint(languages)
```
The next line of code outputs the documents in a table format with the language information for each document:
```python
table = []
for document in languages["documents"]:
text = next(filter(lambda d: d["id"] == document["id"], documents["documents"]))["text"]
langs = ", ".join(["{0}({1})".format(lang["name"], lang["score"]) for lang in document["detectedLanguages"]])
table.append("<tr><td>{0}</td><td>{1}</td>".format(text, langs))
HTML("<table><tr><th>Text</th><th>Detected languages(scores)</th></tr>{0}</table>".format("\n".join(table)))
```
The service did a pretty good job of identifying the languages. It did, however, confidently identify the Danish phrase as Norwegian, but in fairness, even linguists argue over whether Danish and Norwegian constitute distinct languages or are dialects of the same language. (**Note:** Danes and Norwegians have no doubts on the subject.)
### Exercise:
```python
# Create another document set of text and use the text analytics API to detect the language for the text.
```
### Sentiment Analysis API
Now that we know how to use the Text Analytics API to detect the language, let's use it for sentiment analysis. Basically, the computers at the other end of the API connection will judge the sentiments of written phrases (anywhere on the spectrum of positive to negative) based solely on the context clues provided by the text.
```python
# Verify the API URL source for the Sentiment Analysis API
sentiment_api_url = text_analytics_base_url + "sentiment"
print(sentiment_api_url)
```
As above, the Sentiment Analysis API requires the language to be passed in as documents with `id` and `text` attributes.
```python
documents = {'documents' : [
{'id': '1', 'language': 'en', 'text': 'I had a wonderful experience! The rooms were wonderful and the staff was helpful.'},
{'id': '2', 'language': 'en', 'text': 'I had a terrible time at the hotel. The staff was rude and the food was awful.'},
{'id': '3', 'language': 'es', 'text': 'Los caminos que llevan hasta Monte Rainier son espectaculares y hermosos.'},
{'id': '4', 'language': 'es', 'text': 'La carretera estaba atascada. Había mucho tráfico el día de ayer.'}
]}
```
Let's analyze the text using the Sentiment Analysis API to output a sentiment analysis score:
```python
headers = {"Ocp-Apim-Subscription-Key": subscription_key}
response = requests.post(sentiment_api_url, headers=headers, json=documents)
sentiments = response.json()
pprint(sentiments)
```
### Exercise:
```python
# Create another document set with varying degree of sentiment and use the Sentiment Analysis API to detect what
# the sentiment is
```
### Key Phrases API
We've detected the language type using the Text Analytics API and the sentiment using the Sentiment Analysis API. What if we want to detect key phrases in the text? We can use the Key Phrases API.
```python
# As with the other services, set up the Key Phrases API with the following parameters
key_phrase_api_url = text_analytics_base_url + "keyPhrases"
print(key_phrase_api_url)
```
Create the documents needed to pass to the Key Phrases API with the `id` and `text` attributes.
```python
documents = {'documents' : [
{'id': '1', 'language': 'en', 'text': 'I had a wonderful experience! The rooms were wonderful and the staff was helpful.'},
{'id': '2', 'language': 'en', 'text': 'I had a terrible time at the hotel. The staff was rude and the food was awful.'},
{'id': '3', 'language': 'es', 'text': 'Los caminos que llevan hasta Monte Rainier son espectaculares y hermosos.'},
{'id': '4', 'language': 'es', 'text': 'La carretera estaba atascada. Había mucho tráfico el día de ayer.'}
]}
```
Now, call the Key Phrases API with the formatted documents to retrieve the key phrases.
```python
headers = {'Ocp-Apim-Subscription-Key': subscription_key}
response = requests.post(key_phrase_api_url, headers=headers, json=documents)
key_phrases = response.json()
pprint(key_phrases)
```
We can make this easier to read by outputting the documents in an HTML table format.
```python
table = []
for document in key_phrases["documents"]:
text = next(filter(lambda d: d["id"] == document["id"], documents["documents"]))["text"]
phrases = ",".join(document["keyPhrases"])
table.append("<tr><td>{0}</td><td>{1}</td>".format(text, phrases))
HTML("<table><tr><th>Text</th><th>Key phrases</th></tr>{0}</table>".format("\n".join(table)))
```
### Exercise:
```python
# What other key phrases can you come up with for analysis?
```
### Entities API
The final API we will use in the Text Analytics API service is the Entities API, which identifies well-known entities in the documents provided to the API service.
```python
# Configure the Entities URI
entity_linking_api_url = text_analytics_base_url + "entities"
print(entity_linking_api_url)
```
The next step is creating a document with `id` and `text` attributes to pass to the Entities API.
```python
documents = {'documents' : [
    {'id': '1', 'text': 'Microsoft is an IT company.'}
]}
```
Finally, call the service using the REST call below to retrieve the entities identified in the `text` attribute.
```python
headers = {"Ocp-Apim-Subscription-Key": subscription_key}
response = requests.post(entity_linking_api_url, headers=headers, json=documents)
entities = response.json()
entities
```
### Exercise:
```python
# What other entities can be retrieved with the API?
# Create a document setup and use the Text Analytics, Sentiment Analysis,
# Key Phrase, and Entities API services to retrieve the data.
```
&gt; **Takeaway:** In this subsection, you explored text analytics in the cloud. Specifically, you used a variety of different APIs to extract different information from text: language, sentiment, key phrases, and entities.
That's it for the instructional portion of this course. In these eight sections, you've seen the range of tools that go into preparing data for analysis and performing ML and AI analysis on data. In the next, concluding section, you will bring these skills together in a final project.