Mirror of https://github.com/microsoft/Reactors.git
Merge pull request #4 from microsoft/ContentMove
Having Readable Markdown AND Azure Notebook Versions
This commit is contained in:
Commit 3000e1a5ba
Binary file not shown.
Binary file not shown.
@@ -0,0 +1,858 @@
# Introduction to Python
|
||||
|
||||
## Comments
|
||||
|
||||
|
||||
```python
|
||||
# this is the first comment
|
||||
spam = 1 # and this is the second comment
|
||||
# ... and now a third!
|
||||
text = "# This is not a comment because it's inside quotes."
|
||||
print(text)
|
||||
```
|
||||
|
||||
## Python basics
|
||||
|
||||
### Arithmetic and numeric types
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable with using numeric types in Python arithmetic.
|
||||
|
||||
#### Python numeric operators
|
||||
|
||||
|
||||
```python
|
||||
2 + 3
|
||||
```
|
||||
|
||||
**Share**: What is the answer? Why?
|
||||
|
||||
|
||||
```python
|
||||
30 - 4 * 5
|
||||
```
|
||||
|
||||
**Share**: What is the answer? Why?
|
||||
|
||||
|
||||
```python
|
||||
7 / 5
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
3 * 3.5
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
7.0 / 5
|
||||
```
|
||||
|
||||
**Floor Division**
|
||||
|
||||
|
||||
```python
|
||||
7 // 5
|
||||
```
|
||||
|
||||
**Remainder (modulo)**
|
||||
|
||||
|
||||
```python
|
||||
7 % 5
|
||||
```
|
||||
|
||||
**Exponents**
|
||||
|
||||
|
||||
```python
|
||||
5 ** 2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
2 ** 5
|
||||
```
|
||||
|
||||
**Share**: What is the answer? Why?
|
||||
|
||||
|
||||
```python
|
||||
-5 ** 2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
(-5) ** 2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
(30 - 4) * 5
|
||||
```
|
||||
|
||||
### Variables
|
||||
|
||||
**Share**: What is the answer? Why?
|
||||
|
||||
|
||||
```python
|
||||
length = 15
|
||||
width = 3 * 5
|
||||
length * width
|
||||
```
|
||||
|
||||
**Variables don't need types**
|
||||
|
||||
|
||||
```python
|
||||
length = 15
|
||||
length
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
length = 15.0
|
||||
length
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
length = 'fifteen'
|
||||
length
|
||||
```
|
||||
|
||||
**Share**: What will happen? Why?
|
||||
|
||||
|
||||
```python
|
||||
n
|
||||
```
|
||||
|
||||
**Previous Output**
|
||||
|
||||
|
||||
```python
|
||||
tax = 11.3 / 100
|
||||
price = 19.95
|
||||
price * tax
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
price + _
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
round(_, 2)
|
||||
```
|
||||
|
||||
**Multiple Variable Assignment**
|
||||
|
||||
|
||||
```python
|
||||
a, b, c = 3.2, 1, 6
|
||||
a, b, c
|
||||
```
|
||||
|
||||
### Expressions
|
||||
|
||||
**Share**: What is the answer? Why?
|
||||
|
||||
|
||||
```python
|
||||
2 < 5
|
||||
```
|
||||
|
||||
*(Run after learners have shared the above)*
|
||||
**Python Comparison Operators**:
|
||||
![all of the comparison operators](https://notebooks.azure.com/sguthals/projects/data-science-1-instructor/raw/Images%2FScreen%20Shot%202019-09-10%20at%207.15.49%20AM.png)
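
If the linked image doesn't render (it points at the retired Azure Notebooks service), the same comparison operators can be summarized in a quick cell. The sample values below are an added illustration, not part of the original image:

```python
print(2 == 2)   # equal to                 -> True
print(2 != 3)   # not equal to             -> True
print(2 < 3)    # less than                -> True
print(2 <= 2)   # less than or equal to    -> True
print(3 > 2)    # greater than             -> True
print(2 >= 3)   # greater than or equal to -> False
```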
|
||||
|
||||
**Complex Expressions**
|
||||
|
||||
|
||||
```python
|
||||
a, b, c = 1, 2, 3
|
||||
a < b < c
|
||||
```
|
||||
|
||||
**Built-In Functions**
|
||||
|
||||
|
||||
```python
|
||||
min(3, 2.4, 5)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
max(3, 2.4, 5)
|
||||
```
|
||||
|
||||
**Compound Expressions**
|
||||
|
||||
|
||||
```python
|
||||
1 < 2 and 2 < 3
|
||||
```
|
||||
|
||||
|
||||
|
||||
### Exercise:
|
||||
|
||||
**Think, Pair, Share**
|
||||
1. Quietly think about what would happen if you flipped one of the `<` to a `>`.
|
||||
2. Share with the person next to you what you think will happen.
|
||||
3. Try it out in the code cell below.
|
||||
4. Share anything you thought was surprising.
|
||||
|
||||
|
||||
```python
|
||||
# Now flip around one of the simple expressions and see if the output matches your expectations:
|
||||
|
||||
```
|
||||
|
||||
**Or and Not**
|
||||
**Share**: What is the answer? Why?
|
||||
|
||||
|
||||
```python
|
||||
1 < 2 or 1 > 2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
not (2 < 3)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
**Think, Pair, Share**
|
||||
1. Quietly think about what the results would be. *Tip: Use paper!*
|
||||
2. Share with the person next to you what you think will happen.
|
||||
3. Try it out in the code cell below.
|
||||
4. Share anything you thought was surprising.
|
||||
5. Instructor Demo
|
||||
|
||||
|
||||
```python
|
||||
# Play around with compound expressions.
|
||||
# Set i to different values to see what results this complex compound expression returns:
|
||||
i = 7
|
||||
(i == 2) or not (i % 2 != 0 and 1 < i < 5)
|
||||
```
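
One way to reason through the default case (`i = 7`), added here as a worked sketch rather than part of the original exercise:

```python
i = 7
is_odd = i % 2 != 0                  # True: 7 is odd
in_range = 1 < i < 5                 # False: 7 is not between 1 and 5
negated = not (is_odd and in_range)  # not (True and False) -> True
(i == 2) or negated                  # False or True -> True
```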
|
||||
|
||||
> **Takeaway:** Arithmetic operations on numeric data form the foundation of data science work in Python. Even sophisticated numeric operations are predicated on these basics, so mastering them is essential to doing data science.
|
||||
|
||||
## Strings
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable working with strings at a basic level in Python.
|
||||
|
||||
|
||||
```python
|
||||
'spam eggs' # Single quotes.
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
'doesn\'t' # Use \' to escape the single quote...
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
"doesn't" # ...or use double quotes instead.
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
'"Isn\'t," she said.'
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print('"Isn\'t," she said.')
|
||||
```
|
||||
|
||||
**Pause**
|
||||
Notice the difference between the previous two code cells when they are run.
|
||||
|
||||
|
||||
```python
|
||||
print('C:\some\name') # Here \n means newline!
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(r'C:\some\name') # Note the r before the quote.
|
||||
```
|
||||
|
||||
### String literals
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
3 * 'un' + 'ium'
|
||||
```
|
||||
|
||||
### Concatenating strings
|
||||
|
||||
|
||||
```python
|
||||
'Py' 'thon'
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
prefix = 'Py'
|
||||
prefix + 'thon'
|
||||
```
|
||||
|
||||
### String indexes
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
word = 'Python'
|
||||
word[0]
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
word[5]
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
word[-1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[-2]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[-6]
|
||||
```
|
||||
|
||||
### Slicing strings
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
word[0:2]
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
word[2:5]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[:2]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[4:]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[-2:]
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
word[:2] + word[2:]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[:4] + word[4:]
|
||||
```
|
||||
|
||||
**TIP**
```
 +---+---+---+---+---+---+
 | P | y | t | h | o | n |
 +---+---+---+---+---+---+
 0   1   2   3   4   5   6
-6  -5  -4  -3  -2  -1
```
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
word[42] # The word only has 6 characters.
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
word[4:42]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[42:]
|
||||
```
|
||||
|
||||
**Strings are Immutable**
|
||||
|
||||
|
||||
```python
|
||||
word[0] = 'J'
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[2:] = 'py'
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
'J' + word[1:]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
word[:2] + 'Py'
|
||||
```
|
||||
|
||||
**Built-In Function: len**
|
||||
|
||||
|
||||
```python
|
||||
s = 'supercalifragilisticexpialidocious'
|
||||
len(s)
|
||||
```
|
||||
|
||||
**Built-In Function: str**
|
||||
|
||||
|
||||
```python
|
||||
str(2)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
str(2.5)
|
||||
```
|
||||
|
||||
## Other data types
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should have a basic understanding of the remaining fundamental data types in Python and an idea of how and when to use them.
|
||||
|
||||
### Lists
|
||||
|
||||
|
||||
```python
|
||||
squares = [1, 4, 9, 16, 25]
|
||||
squares
|
||||
```
|
||||
|
||||
**Indexing and Slicing Work the Same Way as with Strings**
|
||||
|
||||
|
||||
```python
|
||||
squares[0]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
squares[-1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
squares[-3:]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
squares[:]
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
squares + [36, 49, 64, 81, 100]
|
||||
```
|
||||
|
||||
**Lists are Mutable**
|
||||
|
||||
|
||||
```python
|
||||
cubes = [1, 8, 27, 65, 125]
|
||||
4 ** 3
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
# Replace the wrong value.
|
||||
cubes
|
||||
```
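
One possible fix, shown as a sketch and relying on the `cubes` list defined above (the exercise intends learners to find it themselves):

```python
cubes[3] = 4 ** 3   # 4 ** 3 is 64; index 3 held the incorrect 65
cubes
```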
|
||||
|
||||
**Replace Many Values**
|
||||
|
||||
|
||||
```python
|
||||
letters = ['a', 'b', 'c', 'd', 'e', 'f', 'g']
|
||||
letters
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
letters[2:5] = ['C', 'D', 'E']
|
||||
letters
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
letters[2:5] = []
|
||||
letters
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
letters[:] = []
|
||||
letters
|
||||
```
|
||||
|
||||
**Built-In Function: len**
|
||||
|
||||
|
||||
```python
|
||||
letters = ['a', 'b', 'c', 'd']
|
||||
len(letters)
|
||||
```
|
||||
|
||||
**Nesting**
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
a = ['a', 'b', 'c']
|
||||
n = [1, 2, 3]
|
||||
x = [a, n]
|
||||
x
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
x[0]
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
x[0][0]
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Nested lists come up a lot in programming, so it pays to practice.
|
||||
# Which indices would you include after x to get ‘c’?
|
||||
# How about to get 3?
|
||||
|
||||
```
|
||||
|
||||
### List object methods
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
beatles = ['John', 'Paul']
|
||||
beatles.append('George')
|
||||
beatles
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
beatles2 = ['John', 'Paul', 'George']
|
||||
beatles2.append(['Stuart', 'Pete'])
|
||||
beatles2
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
beatles.extend(['Stuart', 'Pete'])
|
||||
beatles
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
beatles.index('George')
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
beatles.count('John')
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
beatles.remove('Stuart')
|
||||
beatles
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
beatles.pop()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
beatles.insert(1, 'Ringo')
|
||||
beatles
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
beatles.reverse()
|
||||
beatles
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
beatles.sort()
|
||||
beatles
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you run beatles.extend(beatles)?
|
||||
# How about beatles.append(beatles)?
|
||||
|
||||
```
|
||||
|
||||
### Tuples
|
||||
|
||||
|
||||
```python
|
||||
t = (1, 2, 3)
|
||||
t
|
||||
```
|
||||
|
||||
**Tuples are Immutable**
|
||||
|
||||
|
||||
```python
|
||||
t[1] = 2.0
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
t[1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
t[:2]
|
||||
```
|
||||
|
||||
**Lists <-> Tuples**
|
||||
|
||||
|
||||
```python
|
||||
l = ['baked', 'beans', 'spam']
|
||||
l = tuple(l)
|
||||
l
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
l = list(l)
|
||||
l
|
||||
```
|
||||
|
||||
### Membership testing
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
tup = ('a', 'b', 'c')
|
||||
'b' in tup
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
lis = ['a', 'b', 'c']
|
||||
'a' not in lis
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you run lis in lis?
|
||||
# Is that the behavior you expected?
|
||||
# If not, think back to the nested lists we’ve already encountered.
|
||||
|
||||
```
|
||||
|
||||
### Dictionaries
|
||||
|
||||
|
||||
```python
|
||||
capitals = {'France': ('Paris', 2140526)}
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
capitals['Nigeria'] = ('Lagos', 6048430)
|
||||
capitals
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Now try adding another country (or something else) to the capitals dictionary
|
||||
```
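
A sketch of one possible answer; the country and population figure below are illustrative additions, not from the original material:

```python
capitals['Japan'] = ('Tokyo', 13929286)  # illustrative population figure
capitals
```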
|
||||
|
||||
**Interacting with Dictionaries**
|
||||
|
||||
|
||||
```python
|
||||
capitals['France']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
capitals['Nigeria'] = ('Abuja', 1235880)
|
||||
capitals
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
len(capitals)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
capitals.popitem()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
capitals
|
||||
```
|
||||
|
||||
> **Takeaway:** Regardless of how complex and voluminous the data you will work with, these basic data structures will repeatedly be your means for handling and manipulating it. Comfort with these basic data structures is essential to being able to understand and use Python code written by others.
|
||||
|
||||
### List comprehensions
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should understand how to economically and programmatically create lists.
|
||||
|
||||
|
||||
```python
|
||||
for x in range(1,11):
|
||||
print(x)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
numbers = [x for x in range(1,11)] # Remember to create a range 1 more than the number you actually want.
|
||||
numbers
|
||||
|
||||
numbers = [x for x in range(1,11)]
|
||||
numbers = [x for x in [1,2,3,4,5,6,7,8,9,10]]
|
||||
numbers = [1,2,3,4,5,6,7,8,9,10]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
for x in range(1,11):
|
||||
print(x*x)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
squares = [x*x for x in range(1,11)]
squares

# The comprehension above unrolls to the equivalent steps below:
# squares = [x*x for x in [1,2,3,4,5,6,7,8,9,10]]
# squares = [1*1, 2*2, 3*3, ..., 10*10]
# squares = [1, 4, 9, 16, 25, 36, 49, 64, 81, 100]
```
|
||||
|
||||
**Demo**
|
||||
|
||||
|
||||
```python
|
||||
odd_squares = [x*x for x in range(1,11) if x % 2 != 0]
|
||||
odd_squares
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Now use a list comprehension to generate a list of odd cubes
|
||||
# from 1 to 2,197
|
||||
|
||||
```
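
One possible solution, shown as a sketch: `13 ** 3 == 2197`, so the range needs to run up to 14.

```python
odd_cubes = [x ** 3 for x in range(1, 14) if x % 2 != 0]
odd_cubes  # [1, 27, 125, 343, 729, 1331, 2197]
```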
|
||||
|
||||
> **Takeaway:** List comprehensions are a popular tool in Python because they enable the rapid, programmatic generation of lists. The economy and ease of use therefore make them an essential tool for you (in addition to a necessary topic to understand as you try to understand Python code written by others).
|
||||
|
||||
### Importing modules
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable importing modules in Python.
|
||||
|
||||
|
||||
```python
|
||||
factorial(5)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
import math
|
||||
math.factorial(5)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
from math import factorial
|
||||
factorial(5)
|
||||
```
|
||||
|
||||
|
||||
> **Takeaway:** There are several Python modules that you will regularly use in conducting data science in Python, so understanding how to import them will be essential (especially in this training).
|
@@ -0,0 +1,731 @@
# Introduction to NumPy
|
||||
|
||||
**Library Alias**
|
||||
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
```
|
||||
|
||||
## Built-In Help
|
||||
|
||||
|
||||
### Exercise
|
||||
|
||||
|
||||
```python
|
||||
# Place your cursor after the period and press <TAB>:
|
||||
np.
|
||||
```
|
||||
|
||||
|
||||
|
||||
### Exercise
|
||||
|
||||
|
||||
```python
|
||||
# Replace 'add' below with a few different NumPy function names and look over the documentation:
|
||||
np.add?
|
||||
```
|
||||
|
||||
## NumPy arrays: a specialized data structure for analysis
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should have a basic understanding of what NumPy arrays are and how they differ from the other Python data structures you have studied thus far.
|
||||
|
||||
### Lists in Python
|
||||
|
||||
|
||||
|
||||
```python
|
||||
myList = list(range(10))
|
||||
myList
|
||||
```
|
||||
|
||||
**List Comprehension with Types**
|
||||
|
||||
|
||||
```python
|
||||
[type(item) for item in myList]
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
myList2 = [True, "2", 3.0, 4]
|
||||
[type(item) for item in myList2]
|
||||
```
|
||||
|
||||
### Fixed-type arrays in Python
|
||||
|
||||
#### Creating NumPy arrays method 1: using Python lists
|
||||
|
||||
|
||||
```python
|
||||
# Create an integer array:
|
||||
np.array([1, 4, 2, 5, 3])
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
np.array([3.14, 4, 2, 3])
|
||||
```
|
||||
|
||||
### Exercise
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you construct an array using a list that contains a combination of integers, floats, and strings?
|
||||
|
||||
```
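
A sketch of one way to explore the question above; NumPy falls back to a common type, and with a string in the mix that common type is a string dtype:

```python
mixed = np.array([1, 2.5, 'three'])
print(mixed)        # every element has been converted to a string
print(mixed.dtype)  # a Unicode string dtype such as <U32
```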
|
||||
|
||||
**Explicit Typing**
|
||||
|
||||
|
||||
```python
|
||||
np.array([1, 2, 3, 4], dtype='float32')
|
||||
```
|
||||
|
||||
### Exercise
|
||||
|
||||
|
||||
```python
|
||||
# Try this using a different dtype.
|
||||
# Remember that you can always refer to the documentation with the command np.array?
|
||||
|
||||
```
|
||||
|
||||
**Multi-Dimensional Array**
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
# nested lists result in multi-dimensional arrays
|
||||
np.array([range(i, i + 3) for i in [2, 4, 6]])
|
||||
```
|
||||
|
||||
#### Creating NumPy arrays method 2: building from scratch
|
||||
|
||||
|
||||
```python
|
||||
np.zeros(10, dtype=int)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.ones((3, 5), dtype=float)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.full((3, 5), 3.14)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.arange(0, 20, 2)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.linspace(0, 1, 5)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.random.random((3, 3))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.random.normal(0, 1, (3, 3))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.random.randint(0, 10, (3, 3))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.eye(3)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.empty(3)
|
||||
```
|
||||
|
||||
> **Takeaway:** NumPy arrays are a data structure similar to Python lists that provide high performance when storing and working on large amounts of homogeneous data—precisely the kind of data that you will encounter frequently in doing data science. NumPy arrays support many data types beyond those discussed in this course. With all of that said, however, don’t worry about memorizing all of the NumPy dtypes. **It’s often just necessary to care about the general kind of data you’re dealing with: floating point, integer, Boolean, string, or general Python object.**
|
||||
|
||||
## Working with NumPy arrays: the basics
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable working with NumPy arrays in basic ways.
|
||||
|
||||
**Similar to Lists:**
|
||||
- **Array attributes**: Assessing the size, shape, and data types of arrays
|
||||
- **Indexing arrays**: Getting and setting the value of individual array elements
|
||||
- **Slicing arrays**: Getting and setting smaller subarrays within a larger array
|
||||
- **Reshaping arrays**: Changing the shape of a given array
|
||||
- **Joining and splitting arrays**: Combining multiple arrays into one and splitting one array into multiple arrays
|
||||
|
||||
### Array attributes
|
||||
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
np.random.seed(0) # seed for reproducibility
|
||||
|
||||
a1 = np.random.randint(10, size=6) # One-dimensional array
|
||||
a2 = np.random.randint(10, size=(3, 4)) # Two-dimensional array
|
||||
a3 = np.random.randint(10, size=(3, 4, 5)) # Three-dimensional array
|
||||
```
|
||||
|
||||
**Array Types**
|
||||
|
||||
|
||||
```python
|
||||
print("dtype:", a3.dtype)
|
||||
```
|
||||
|
||||
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Change the values in this code snippet to look at the attributes for a1, a2, and a3:
|
||||
print("a3 ndim: ", a3.ndim)
|
||||
print("a3 shape:", a3.shape)
|
||||
print("a3 size: ", a3.size)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Explore the dtype for the other arrays.
|
||||
# What dtypes do you predict them to have?
|
||||
print("dtype:", a3.dtype)
|
||||
```
|
||||
|
||||
### Indexing arrays
|
||||
|
||||
**Quick Review**
|
||||
|
||||
|
||||
```python
|
||||
a1
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a1[0]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a1[4]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a1[-1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a1[-2]
|
||||
```
|
||||
|
||||
**Multi-Dimensional Arrays**
|
||||
|
||||
|
||||
```python
|
||||
a2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2[0, 0]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2[2, 0]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2[2, -1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2[0, 0] = 12
|
||||
a2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a1[0] = 3.14159
|
||||
a1
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you try to insert a string into a1?
|
||||
# Hint: try both a string like '3' and one like 'three'
|
||||
|
||||
```
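
A sketch of what to expect, assuming the `a1` array from the cells above: a numeric string is cast to the array's integer dtype, while a non-numeric string raises a `ValueError`.

```python
a1[1] = '3'        # works: '3' is converted to the array's integer dtype
print(a1)
try:
    a1[2] = 'three'
except ValueError as err:
    print("ValueError:", err)   # 'three' cannot be converted to an integer
```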
|
||||
|
||||
### Slicing arrays
|
||||
|
||||
#### One-dimensional slices
|
||||
|
||||
|
||||
```python
|
||||
a = np.arange(10)
|
||||
a
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a[:5]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a[5:]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a[4:7]
|
||||
```
|
||||
|
||||
**Slicing With Index**
|
||||
|
||||
|
||||
```python
|
||||
a[::2]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a[1::2]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a[::-1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a[5::-2]
|
||||
```
|
||||
|
||||
#### Multidimensional slices
|
||||
|
||||
|
||||
```python
|
||||
a2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2[:2, :3]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2[:3, ::2]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2[::-1, ::-1]
|
||||
```
|
||||
|
||||
#### Accessing array rows and columns
|
||||
|
||||
|
||||
```python
|
||||
print(a2[:, 0])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(a2[0, :])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(a2[0])
|
||||
```
|
||||
|
||||
#### Slices are no-copy views
|
||||
|
||||
|
||||
```python
|
||||
print(a2)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2_sub = a2[:2, :2]
|
||||
print(a2_sub)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2_sub[0, 0] = 99
|
||||
print(a2_sub)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(a2)
|
||||
```
|
||||
|
||||
#### Copying arrays
|
||||
|
||||
|
||||
|
||||
```python
|
||||
a2_sub_copy = a2[:2, :2].copy()
|
||||
print(a2_sub_copy)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a2_sub_copy[0, 0] = 42
|
||||
print(a2_sub_copy)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(a2)
|
||||
```
|
||||
|
||||
### Joining and splitting arrays
|
||||
|
||||
#### Joining arrays
|
||||
|
||||
|
||||
```python
|
||||
a = np.array([1, 2, 3])
|
||||
b = np.array([3, 2, 1])
|
||||
np.concatenate([a, b])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
c = [99, 99, 99]
|
||||
print(np.concatenate([a, b, c]))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
grid = np.array([[1, 2, 3],
|
||||
[4, 5, 6]])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.concatenate([grid, grid])
|
||||
```
|
||||
|
||||
#### Splitting arrays
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
a = [1, 2, 3, 99, 99, 3, 2, 1]
|
||||
a1, a2, a3 = np.split(a, [3, 5])
|
||||
print(a1, a2, a3)
|
||||
```
|
||||
|
||||
> **Takeaway:** Manipulating datasets is a fundamental part of preparing data for analysis. The skills you learned and practiced here will form the building blocks for the more sophisticated data manipulation you will learn in later sections of this course.
|
||||
|
||||
## Sorting arrays
|
||||
|
||||
|
||||
```python
|
||||
a = np.array([2, 1, 4, 3, 5])
|
||||
np.sort(a)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(a)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a.sort()
|
||||
print(a)
|
||||
```
|
||||
|
||||
### Sorting along rows or columns
|
||||
|
||||
|
||||
```python
|
||||
rand = np.random.RandomState(42)
|
||||
table = rand.randint(0, 10, (4, 6))
|
||||
print(table)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.sort(table, axis=0)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.sort(table, axis=1)
|
||||
```
|
||||
|
||||
### NumPy Functions vs Python Built-In Functions
|
||||
|
||||
| Operator | Equivalent ufunc | Description |
|
||||
|:--------------|:--------------------|:--------------------------------------|
|
||||
|``+`` |``np.add`` |Addition (e.g., ``1 + 1 = 2``) |
|
||||
|``-`` |``np.subtract`` |Subtraction (e.g., ``3 - 2 = 1``) |
|
||||
|``-`` |``np.negative`` |Unary negation (e.g., ``-2``) |
|
||||
|``*`` |``np.multiply`` |Multiplication (e.g., ``2 * 3 = 6``) |
|
||||
|``/`` |``np.divide`` |Division (e.g., ``3 / 2 = 1.5``) |
|
||||
|``//`` |``np.floor_divide`` |Floor division (e.g., ``3 // 2 = 1``) |
|
||||
|``**`` |``np.power`` |Exponentiation (e.g., ``2 ** 3 = 8``) |
|
||||
|``%`` |``np.mod`` |Modulus/remainder (e.g., ``9 % 4 = 1``)|
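
A quick added check that the operator syntax and the named ufuncs return identical results:

```python
x = np.arange(5)
print(x + 2)            # [2 3 4 5 6]
print(np.add(x, 2))     # [2 3 4 5 6]
print(x ** 2)           # [ 0  1  4  9 16]
print(np.power(x, 2))   # [ 0  1  4  9 16]
```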
|
||||
|
||||
#### Exponents and logarithms
|
||||
|
||||
|
||||
```python
|
||||
a = [1, 2, 3]
|
||||
print("a =", a)
|
||||
print("e^a =", np.exp(a))
|
||||
print("2^a =", np.exp2(a))
|
||||
print("3^a =", np.power(3, a))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a = [1, 2, 4, 10]
|
||||
print("a =", a)
|
||||
print("ln(a) =", np.log(a))
|
||||
print("log2(a) =", np.log2(a))
|
||||
print("log10(a) =", np.log10(a))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
a = [0, 0.001, 0.01, 0.1]
|
||||
print("exp(a) - 1 =", np.expm1(a))
|
||||
print("log(1 + a) =", np.log1p(a))
|
||||
```
|
||||
|
||||
#### Specialized Functions
|
||||
|
||||
|
||||
```python
|
||||
from scipy import special
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Gamma functions (generalized factorials) and related functions
|
||||
a = [1, 5, 10]
|
||||
print("gamma(a) =", special.gamma(a))
|
||||
print("ln|gamma(a)| =", special.gammaln(a))
|
||||
print("beta(a, 2) =", special.beta(a, 2))
|
||||
```
|
||||
|
||||
> **Takeaway:** Universal functions in NumPy provide you with computational functions that are faster than regular Python functions, particularly when working on large datasets that are common in data science. This speed is important because it can make you more efficient as a data scientist and it makes a broader range of inquiries into your data tractable in terms of time and computational resources.
|
||||
|
||||
## Aggregations
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable aggregating data in NumPy.
|
||||
|
||||
### Summing the values of an array
|
||||
|
||||
|
||||
```python
|
||||
myList = np.random.random(100)
|
||||
np.sum(myList)
|
||||
```
|
||||
|
||||
**NumPy vs Python Functions**
|
||||
|
||||
|
||||
```python
|
||||
large_array = np.random.rand(1000000)
|
||||
%timeit sum(large_array)
|
||||
%timeit np.sum(large_array)
|
||||
```
|
||||
|
||||
### Minimum and maximum
|
||||
|
||||
|
||||
```python
|
||||
np.min(large_array), np.max(large_array)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(large_array.min(), large_array.max(), large_array.sum())
|
||||
```
|
||||
|
||||
## Computation on arrays with broadcasting
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should have a basic understanding of how broadcasting works in NumPy (and why NumPy uses it).
|
||||
|
||||
|
||||
```python
|
||||
first_array = np.array([3, 6, 8, 1])
|
||||
second_array = np.array([4, 5, 7, 2])
|
||||
first_array + second_array
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
first_array + 5
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
one_dim_array = np.ones((1))
|
||||
one_dim_array
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
two_dim_array = np.ones((2, 2))
|
||||
two_dim_array
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
one_dim_array + two_dim_array
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
horizontal_array = np.arange(3)
|
||||
vertical_array = np.arange(3)[:, np.newaxis]
|
||||
|
||||
print(horizontal_array)
|
||||
print(vertical_array)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
horizontal_array + vertical_array
|
||||
```
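
A short added note on what happened: the `(3,)` row vector and the `(3, 1)` column vector are both stretched to a common `(3, 3)` shape before the addition.

```python
result = horizontal_array + vertical_array
print(horizontal_array.shape, vertical_array.shape, result.shape)  # (3,) (3, 1) (3, 3)
```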
|
||||
|
||||
## Comparisons, masks, and Boolean logic in NumPy
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable with and understand how to use Boolean masking in NumPy in order to answer basic questions about your data.
|
||||
|
||||
### Example: Counting Rainy Days
|
||||
|
||||
Let's see masking in practice by examining the monthly rainfall statistics for Seattle. The data is in a CSV file from data.gov. To load the data, we will use pandas, which we will formally introduce in Section 4.
|
||||
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
import pandas as pd
|
||||
|
||||
# Use pandas to extract rainfall as a NumPy array
|
||||
rainfall_2003 = pd.read_csv('Data/Observed_Monthly_Rain_Gauge_Accumulations_-_Oct_2002_to_May_2017.csv')['RG01'][2:14].values
|
||||
rainfall_2003
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
%matplotlib inline
|
||||
import matplotlib.pyplot as plt
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
plt.bar(np.arange(1, len(rainfall_2003) + 1), rainfall_2003)
|
||||
```
|
||||
|
||||
### Boolean operators
|
||||
|
||||
|
||||
```python
|
||||
np.sum((rainfall_2003 > 0.5) & (rainfall_2003 < 1))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
rainfall_2003 > (0.5 & rainfall_2003) < 1
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.sum(~((rainfall_2003 <= 0.5) | (rainfall_2003 >= 1)))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print("Number of months without rain:", np.sum(rainfall_2003 == 0))
|
||||
print("Number of months with rain: ", np.sum(rainfall_2003 != 0))
|
||||
print("Months with more than 1 inch: ", np.sum(rainfall_2003 > 1))
|
||||
print("Rainy months with < 1 inch: ", np.sum((rainfall_2003 > 0) &
|
||||
(rainfall_2003 < 1)))
|
||||
```
|
||||
|
||||
## Boolean arrays as masks
|
||||
|
||||
|
||||
```python
|
||||
rand = np.random.RandomState(0)
|
||||
two_dim_array = rand.randint(10, size=(3, 4))
|
||||
two_dim_array
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
two_dim_array < 5
|
||||
```
|
||||
|
||||
**Masking**
|
||||
|
||||
|
||||
```python
|
||||
two_dim_array[two_dim_array < 5]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Construct a mask of all rainy months
|
||||
rainy = (rainfall_2003 > 0)
|
||||
|
||||
# Construct a mask of all summer months (June through September)
|
||||
months = np.arange(1, 13)
|
||||
summer = (months > 5) & (months < 10)
|
||||
|
||||
print("Median precip in rainy months in 2003 (inches): ",
|
||||
np.median(rainfall_2003[rainy]))
|
||||
print("Median precip in summer months in 2003 (inches): ",
|
||||
np.median(rainfall_2003[summer]))
|
||||
print("Maximum precip in summer months in 2003 (inches): ",
|
||||
np.max(rainfall_2003[summer]))
|
||||
print("Median precip in non-summer rainy months (inches):",
|
||||
np.median(rainfall_2003[rainy & ~summer]))
|
||||
```
|
||||
|
||||
> **Takeaway:** By combining Boolean operations, masking operations, and aggregates, you can quickly answer questions like those we posed about the Seattle rainfall data for any dataset. Operations like these will form the basis for the data exploration and preparation for analysis that will be our primary concerns in Sections 4 and 5.
|
@@ -0,0 +1,492 @@
# Introduction to Pandas
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
```
|
||||
|
||||
## Fundamental pandas data structures
|
||||
|
||||
### `Series` objects in pandas
|
||||
|
||||
|
||||
```python
|
||||
series_example = pd.Series([-0.5, 0.75, 1.0, -2])
|
||||
series_example
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example.values
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example.index
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example[1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example[1:3]
|
||||
```
|
||||
|
||||
### Explicit Indices
|
||||
|
||||
|
||||
```python
|
||||
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
|
||||
series_example2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2['b']
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Do explicit Series indices work *exactly* the way you might expect?
|
||||
# Try slicing series_example2 using its explicit index and find out.
|
||||
|
||||
```
|
||||
|
||||
### Series vs Dictionary
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
population_dict = {'France': 65429495,
|
||||
'Germany': 82408706,
|
||||
'Russia': 143910127,
|
||||
'Japan': 126922333}
|
||||
population_dict
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
population = pd.Series(population_dict)
|
||||
population
|
||||
```
|
||||
|
||||
### Interacting with Series
|
||||
|
||||
|
||||
```python
|
||||
population['Russia']
|
||||
```
|
||||
|
||||
### Exercise
|
||||
|
||||
|
||||
```python
|
||||
# Try slicing on the population Series on your own.
|
||||
# Would slicing be possible if Series keys were not ordered?
|
||||
population['Germany':'Russia']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Try running population['Albania'] = 2937590 (or another country of your choice)
|
||||
# What order do the keys appear in when you run population? Is it what you expected?
|
||||
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
population
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pop2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pop2 = pd.Series({'Spain': 46432074, 'France': 102321, 'Albania': 50532})
|
||||
population + pop2
|
||||
```
|
||||
|
||||
### `DataFrame` object in pandas
|
||||
|
||||
|
||||
```python
|
||||
area_dict = {'Albania': 28748,
|
||||
'France': 643801,
|
||||
'Germany': 357386,
|
||||
'Japan': 377972,
|
||||
'Russia': 17125200}
|
||||
area = pd.Series(area_dict)
|
||||
area
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries = pd.DataFrame({'Population': population, 'Area': area})
|
||||
countries
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries['Capital'] = ['Tirana', 'Paris', 'Berlin', 'Tokyo', 'Moscow']
|
||||
countries
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries = countries[['Capital', 'Area', 'Population']]
|
||||
countries
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries['Population Density'] = countries['Population'] / countries['Area']
|
||||
countries
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries['Area']
|
||||
```
|
||||
|
||||
### Exercise
|
||||
|
||||
|
||||
```python
|
||||
# Now try accessing row data with a command like countries['Japan']
|
||||
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
countries.loc['Japan']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries.loc['Japan']['Area']
|
||||
```
|
||||
|
||||
### Exercise
|
||||
|
||||
|
||||
```python
|
||||
# Can you think of a way to return the area of Japan without using .iloc?
|
||||
# Hint: Try putting the column index first.
|
||||
# Can you slice along these indices as well?
|
||||
|
||||
```
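
A couple of possible answers, shown as a sketch; both rely on the `countries` DataFrame defined above:

```python
print(countries['Area']['Japan'])      # column first, then the row label
print(countries.loc['Japan', 'Area'])  # a single .loc lookup also avoids .iloc
countries['Area']['France':'Japan']    # label-based slicing works along the row index too
```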
|
||||
|
||||
### Creating new `DataFrame` columns
|
||||
|
||||
|
||||
```python
|
||||
countries['Debt-to-GDP Ratio'] = np.nan
|
||||
countries
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
debt = pd.Series([0.19, 2.36], index=['Russia', 'Japan'])
|
||||
countries['Debt-to-GDP Ratio'] = debt
|
||||
countries
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
del countries['Capital']
|
||||
countries
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries.T
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.DataFrame(np.random.rand(3, 2),
|
||||
columns=['foo', 'bar'],
|
||||
index=['a', 'b', 'c'])
|
||||
```
|
||||
|
||||
## Manipulating data in pandas
|
||||
|
||||
### Index objects in pandas
|
||||
|
||||
|
||||
```python
|
||||
series_example = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
|
||||
ind = series_example.index
|
||||
ind
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
ind[1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
ind[::2]
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
ind[1] = 0
|
||||
```
|
||||
|
||||
### Set Properties
|
||||
|
||||
|
||||
```python
|
||||
ind_odd = pd.Index([1, 3, 5, 7, 9])
|
||||
ind_prime = pd.Index([2, 3, 5, 7, 11])
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
In the code cell below, try out the intersection (`ind_odd & ind_prime`), union (`ind_odd | ind_prime`), and the symmetric difference (`ind_odd ^ ind_prime`) of `ind_odd` and `ind_prime`.
|
||||
|
||||
|
||||
```python
|
||||
|
||||
```
|
||||
|
||||
### Data Selection in Series
|
||||
|
||||
|
||||
```python
|
||||
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
|
||||
series_example2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2['b']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
'a' in series_example2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2.keys()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
list(series_example2.items())
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2['e'] = 1.25
|
||||
series_example2
|
||||
```
|
||||
|
||||
### Indexers: `loc` and `iloc`
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
series_example2.loc['a']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2.loc['a':'c']
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
series_example2.iloc[0]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2.iloc[0:2]
|
||||
```
|
||||
|
||||
### Data Selection in DataFrames
|
||||
|
||||
|
||||
```python
|
||||
area = pd.Series({'Albania': 28748,
|
||||
'France': 643801,
|
||||
'Germany': 357386,
|
||||
'Japan': 377972,
|
||||
'Russia': 17125200})
|
||||
population = pd.Series ({'Albania': 2937590,
|
||||
'France': 65429495,
|
||||
'Germany': 82408706,
|
||||
'Russia': 143910127,
|
||||
'Japan': 126922333})
|
||||
countries = pd.DataFrame({'Area': area, 'Population': population})
|
||||
countries
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries['Area']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries['Population Density'] = countries['Population'] / countries['Area']
|
||||
countries
|
||||
```
|
||||
|
||||
### DataFrame as two-dimensional array
|
||||
|
||||
|
||||
```python
|
||||
countries.values
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries.T
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries.iloc[:3, :2]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
countries.loc[:'Germany', :'Population']
|
||||
```
|
||||
|
||||
### Exercise
|
||||
|
||||
|
||||
```python
|
||||
# Can you think of how to combine masking and fancy indexing in one line?
# Your masking could be something like countries['Population Density'] > 200
# Your fancy indexing could be something like ['Population', 'Population Density']
# Be sure to put the masking and fancy indexing inside the square brackets: countries.loc[]
|
||||
|
||||
```
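
One possible one-liner, shown as a sketch:

```python
countries.loc[countries['Population Density'] > 200, ['Population', 'Population Density']]
```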
|
||||
|
||||
# Operating on Data in Pandas
|
||||
|
||||
**Think, Pair, Share** for each of the sections that follow.
|
||||
|
||||
## Index alignment with Series
|
||||
|
||||
For our first example, suppose we are combining two different data sources and find only the top five countries by *area* and the top five countries by *population*:
|
||||
|
||||
|
||||
```python
|
||||
area = pd.Series({'Russia': 17075400, 'Canada': 9984670,
|
||||
'USA': 9826675, 'China': 9598094,
|
||||
'Brazil': 8514877}, name='area')
|
||||
population = pd.Series({'China': 1409517397, 'India': 1339180127,
|
||||
'USA': 324459463, 'Indonesia': 322179605,
|
||||
'Brazil': 207652865}, name='population')
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Now divide these to compute the population density
|
||||
pop_density = population / area
|
||||
pop_density
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series1 = pd.Series([2, 4, 6], index=[0, 1, 2])
|
||||
series2 = pd.Series([3, 5, 7], index=[1, 2, 3])
|
||||
series1 + series2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series1.add(series2, fill_value=0)
|
||||
```
|
||||
|
||||
Much better!
|
||||
|
||||
## Index alignment with DataFrames
|
||||
|
||||
|
||||
```python
|
||||
rng = np.random.RandomState(42)
|
||||
df1 = pd.DataFrame(rng.randint(0, 20, (2, 2)),
|
||||
columns=list('AB'))
|
||||
df1
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df2 = pd.DataFrame(rng.randint(0, 10, (3, 3)),
|
||||
columns=list('BAC'))
|
||||
df2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Add df1 and df2. Is the output what you expected?
|
||||
df1 + df2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
fill = df1.stack().mean()
|
||||
df1.add(df2, fill_value=fill)
|
||||
```
|
||||
|
||||
## Operations between DataFrames and Series
|
||||
|
||||
Index and column alignment gets maintained in operations between a `DataFrame` and a `Series` as well. To see this, consider a common operation in data science, wherein we find the difference of a `DataFrame` and one of its rows. Because pandas inherits ufuncs from NumPy, pandas will compute the difference row-wise by default:
|
||||
|
||||
|
||||
```python
|
||||
df3 = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('WXYZ'))
|
||||
df3
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df3 - df3.iloc[0]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df3.subtract(df3['X'], axis=0)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
halfrow = df3.iloc[0, ::2]
|
||||
halfrow
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df3 - halfrow
|
||||
```
|
@@ -0,0 +1,622 @@
# Manipulating and Cleaning Data
|
||||
|
||||
|
||||
## Exploring `DataFrame` information
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.
|
||||
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
from sklearn.datasets import load_iris
|
||||
|
||||
iris = load_iris()
|
||||
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
|
||||
```
|
||||
|
||||
### `DataFrame.info`
|
||||
**Dataset Alert**: Iris Data about Flowers
|
||||
|
||||
|
||||
```python
|
||||
iris_df.info()
|
||||
```
|
||||
|
||||
### `DataFrame.head`
|
||||
|
||||
|
||||
```python
|
||||
iris_df.head()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?
|
||||
|
||||
|
||||
```python
|
||||
# Hint: Consult the documentation by using iris_df.head?
|
||||
|
||||
```
|
||||
|
||||
### `DataFrame.tail`
|
||||
|
||||
|
||||
```python
|
||||
iris_df.tail()
|
||||
```
|
||||
|
||||
|
||||
|
||||
> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with.
|
||||
|
||||
## Dealing with missing data
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.
|
||||
|
||||
**None vs NaN**
|
||||
|
||||
### `None`: non-float missing data
|
||||
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
example1 = np.array([2, None, 6, 8])
|
||||
example1
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
example1.sum()
|
||||
```
|
||||
|
||||
**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them.
|
||||
|
||||
### `NaN`: missing float values
|
||||
|
||||
|
||||
|
||||
```python
|
||||
np.nan + 1
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.nan * 0
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
example2 = np.array([2, np.nan, 6, 8])
|
||||
example2.sum(), example2.min(), example2.max()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you add np.nan and None together?
|
||||
|
||||
```
|
||||
|
||||
### `NaN` and `None`: null values in pandas
|
||||
|
||||
|
||||
```python
|
||||
int_series = pd.Series([1, 2, 3], dtype=int)
|
||||
int_series
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Now set an element of int_series equal to None.
|
||||
# How does that element show up in the Series?
|
||||
# What is the dtype of the Series?
|
||||
|
||||
```
|
||||
|
||||
### Detecting null values
|
||||
`isnull()` and `notnull()`
|
||||
|
||||
|
||||
```python
|
||||
example3 = pd.Series([0, np.nan, '', None])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example3.isnull()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Try running example3[example3.notnull()].
|
||||
# Before you do so, what do you expect to see?
|
||||
|
||||
```
|
||||
|
||||
**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data.
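
A small added sketch of the same two methods applied to a `DataFrame`; the example frame here is illustrative:

```python
df_nulls = pd.DataFrame({'x': [1, np.nan], 'y': [None, 'text']})
print(df_nulls.isnull())    # True marks the missing entries
print(df_nulls.notnull())   # the element-wise complement
```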
|
||||
|
||||
### Dropping null values
|
||||
|
||||
|
||||
```python
|
||||
example3 = example3.dropna()
|
||||
example3
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example4 = pd.DataFrame([[1, np.nan, 7],
|
||||
[2, 5, 8],
|
||||
[np.nan, 6, 9]])
|
||||
example4
|
||||
```
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
example4.dropna()
|
||||
```
|
||||
|
||||
**Drop from Columns**
|
||||
|
||||
|
||||
```python
|
||||
example4.dropna(axis='columns')
|
||||
```
|
||||
|
||||
`how='all'` will drop only rows or columns that contain all null values.
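
A minimal added sketch of `how='all'`; the small frame below is illustrative and not part of the original flow:

```python
df_how = pd.DataFrame([[1, np.nan], [np.nan, np.nan]])
df_how.dropna(how='all')   # drops only the second row, where every value is null
```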
|
||||
|
||||
**Tip**: run `example4.dropna?`
|
||||
|
||||
|
||||
```python
|
||||
example4[3] = np.nan
|
||||
example4
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# How might you go about dropping just column 3?
|
||||
# Hint: remember that you will need to supply both the axis parameter and the how parameter.
|
||||
|
||||
```
|
||||
|
||||
The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept.
|
||||
|
||||
**Think, Pair, Share**
|
||||
|
||||
|
||||
```python
|
||||
example4.dropna(axis='rows', thresh=3)
|
||||
```
|
||||
|
||||
### Filling null values
|
||||
|
||||
|
||||
```python
|
||||
example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
|
||||
example5
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example5.fillna(0)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you try to fill null values with a string, like ''?
|
||||
|
||||
```
|
||||
|
||||
**Forward-fill**
|
||||
|
||||
|
||||
```python
|
||||
example5.fillna(method='ffill')
|
||||
```
|
||||
|
||||
**Back-fill**
|
||||
|
||||
|
||||
```python
|
||||
example5.fillna(method='bfill')
|
||||
```
|
||||
|
||||
**Specify Axis**
|
||||
|
||||
|
||||
```python
|
||||
example4
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example4.fillna(method='ffill', axis=1)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What output does example4.fillna(method='bfill', axis=1) produce?
|
||||
# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?
|
||||
# Can you think of a longer code snippet to write that can fill all of the null values in example4?
|
||||
|
||||
```
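
One way to fill every null value in `example4`, shown as a sketch: forward-fill along each row, then back-fill along each row to catch the cells a single pass misses.

```python
filled = example4.fillna(method='ffill', axis=1).fillna(method='bfill', axis=1)
filled   # no null values remain in this particular frame
```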
|
||||
|
||||
**Fill with Logical Data**
|
||||
|
||||
|
||||
```python
|
||||
example4.fillna(example4.mean())
|
||||
```
|
||||
|
||||
|
||||
|
||||
> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.
|
||||
|
||||
## Removing duplicate data
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.
|
||||
|
||||
|
||||
### Identifying duplicates: `duplicated`
|
||||
|
||||
|
||||
```python
|
||||
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
|
||||
'numbers': [1, 2, 1, 3, 3]})
|
||||
example6
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example6.duplicated()
|
||||
```
|
||||
|
||||
### Dropping duplicates: `drop_duplicates`
|
||||
|
||||
|
||||
```python
|
||||
example6.drop_duplicates()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example6.drop_duplicates(['letters'])
|
||||
```
|
||||
|
||||
> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you spurious results!
|
||||
|
||||
## Combining datasets: merge and join
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should have a general knowledge of the various ways to combine `DataFrame`s.
|
||||
|
||||
### Categories of joins
|
||||
|
||||
`merge` carries out several types of joins: *one-to-one*, *many-to-one*, and *many-to-many*.
|
||||
|
||||
#### One-to-one joins
|
||||
|
||||
Consider combining two `DataFrame`s that contain different information on the same employees in a company:
|
||||
|
||||
|
||||
```python
|
||||
df1 = pd.DataFrame({'employee': ['Gary', 'Stu', 'Mary', 'Sue'],
|
||||
'group': ['Accounting', 'Marketing', 'Marketing', 'HR']})
|
||||
df1
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df2 = pd.DataFrame({'employee': ['Mary', 'Stu', 'Gary', 'Sue'],
|
||||
'hire_date': [2008, 2012, 2017, 2018]})
|
||||
df2
|
||||
```
|
||||
|
||||
Combine this information into a single `DataFrame` using the `merge` function:
|
||||
|
||||
|
||||
```python
|
||||
df3 = pd.merge(df1, df2)
|
||||
df3
|
||||
```
|
||||
|
||||
#### Many-to-one joins
|
||||
|
||||
|
||||
```python
|
||||
df4 = pd.DataFrame({'group': ['Accounting', 'Marketing', 'HR'],
|
||||
'supervisor': ['Carlos', 'Giada', 'Stephanie']})
|
||||
df4
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df3, df4)
|
||||
```
|
||||
|
||||
**Specify Key**
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df3, df4, on='group')
|
||||
```
|
||||
|
||||
#### Many-to-many joins
|
||||
|
||||
|
||||
```python
|
||||
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
|
||||
'Marketing', 'Marketing', 'HR', 'HR'],
|
||||
'core_skills': ['math', 'spreadsheets', 'writing', 'communication',
|
||||
'spreadsheets', 'organization']})
|
||||
df5
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df5, on='group')
|
||||
```
|
||||
|
||||
#### `left_on` and `right_on` keywords
|
||||
|
||||
|
||||
```python
|
||||
df6 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
|
||||
'salary': [70000, 80000, 120000, 90000]})
|
||||
df6
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df6, left_on="employee", right_on="name")
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Using the documentation, can you figure out how to use .drop() to get rid of the 'name' column?
|
||||
# Hint: You will need to supply two parameters to .drop()
|
||||
|
||||
```
|
||||
|
||||
#### `left_index` and `right_index` keywords
|
||||
|
||||
|
||||
```python
|
||||
df1a = df1.set_index('employee')
|
||||
df1a
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df2a = df2.set_index('employee')
|
||||
df2a
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1a, df2a, left_index=True, right_index=True)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you specify only left_index or right_index?
|
||||
|
||||
```
|
||||
|
||||
**`join` for `DataFrame`s**
|
||||
|
||||
|
||||
```python
|
||||
df1a.join(df2a)
|
||||
```
|
||||
|
||||
**Mix and Match**: `left_index`/`right_index` with `right_on`/`left_on`
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1a, df6, left_index=True, right_on='name')
|
||||
```
|
||||
|
||||
#### Set arithmetic for joins
|
||||
|
||||
|
||||
```python
|
||||
df5 = pd.DataFrame({'group': ['Engineering', 'Marketing', 'Sales'],
|
||||
'core_skills': ['math', 'writing', 'communication']})
|
||||
df5
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df5, on='group')
|
||||
```
|
||||
|
||||
**`intersection` for merge**
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df5, on='group', how='inner')
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# The keyword for performing an outer join is how='outer'. How would you perform it?
|
||||
# What do you expect the output of an outer join of df1 and df5 to be?
|
||||
|
||||
```
|
||||
|
||||
**Share**
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df5, how='left')
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Now run the right merge between df1 and df5.
|
||||
# What do you expect to see?
|
||||
|
||||
```
|
||||
|
||||
#### `suffixes` keyword: dealing with conflicting column names
|
||||
|
||||
|
||||
```python
|
||||
df7 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
|
||||
'rank': [1, 2, 3, 4]})
|
||||
df7
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df8 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
|
||||
'rank': [3, 1, 4, 2]})
|
||||
df8
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df7, df8, on='name')
|
||||
```
|
||||
|
||||
**Using `suffixes` to distinguish identical column names**
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df7, df8, on='name', suffixes=['_left', '_right'])
|
||||
```
|
||||
|
||||
### Concatenation in NumPy
|
||||
**One-dimensional arrays**
|
||||
|
||||
|
||||
```python
|
||||
x = [1, 2, 3]
|
||||
y = [4, 5, 6]
|
||||
z = [7, 8, 9]
|
||||
np.concatenate([x, y, z])
|
||||
```
|
||||
|
||||
**Two-dimensional arrays**
|
||||
|
||||
|
||||
```python
|
||||
x = [[1, 2],
|
||||
[3, 4]]
|
||||
np.concatenate([x, x], axis=1)
|
||||
```
|
||||
|
||||
### Concatenation in pandas
|
||||
|
||||
**Series**
|
||||
|
||||
|
||||
```python
|
||||
ser1 = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
|
||||
ser2 = pd.Series(['d', 'e', 'f'], index=[4, 5, 6])
|
||||
pd.concat([ser1, ser2])
|
||||
```
|
||||
|
||||
**DataFrames**
|
||||
|
||||
|
||||
```python
|
||||
df9 = pd.DataFrame({'A': ['a', 'c'],
|
||||
'B': ['b', 'd']})
|
||||
df9
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df9, df9])
|
||||
```
|
||||
|
||||
**Re-indexing**
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df9, df9], ignore_index=True)
|
||||
```
|
||||
|
||||
**Changing Axis**
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df9, df9], axis=1)
|
||||
```
|
||||
|
||||
> Note that while pandas will display this without error, you will get an error message if you try to assign this result as a new `DataFrame`. Column names in `DataFrame`s must be unique.
|
||||
|
||||
### Concatenation with joins
|
||||
|
||||
|
||||
```python
|
||||
df10 = pd.DataFrame({'A': ['a', 'd'],
|
||||
'B': ['b', 'e'],
|
||||
'C': ['c', 'f']})
|
||||
df10
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df11 = pd.DataFrame({'B': ['u', 'x'],
|
||||
'C': ['v', 'y'],
|
||||
'D': ['w', 'z']})
|
||||
df11
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df10, df11])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df10, df11], join='inner')
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# join_axes was removed in newer versions of pandas; reindexing gives the same result
pd.concat([df10, df11]).reindex(columns=df10.columns)
|
||||
```
|
||||
|
||||
#### `append()`
|
||||
|
||||
|
||||
```python
|
||||
df9.append(df9)
|
||||
```
|
||||
|
||||
**Important point**: Unlike the `append()` and `extend()` methods of Python lists, the `append()` method in pandas does not modify the original object. It instead creates a new object with the combined data.
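A minimal sketch of that point (note that `append` has since been deprecated in newer pandas releases, where `pd.concat` is the replacement):

```python
# append returns a new DataFrame; the original df9 is left untouched
appended = df9.append(df9)
print(len(df9), len(appended))  # 2 4

# On pandas >= 2.0, where DataFrame.append has been removed, use concat instead:
pd.concat([df9, df9], ignore_index=True)
```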
|
||||
|
||||
> **Takeaway:** A large part of the value you can provide as a data scientist comes from connecting multiple, often disparate datasets to find new insights. Learning how to join and merge data is thus an essential part of your skill set.
|
|
@ -0,0 +1,252 @@
|
|||
|
||||
# Project
|
||||
|
||||
> **Learning goal:** By the end of this Capstone, you should be familiar with some of the ways to visually explore the data stored in `DataFrame`s.
|
||||
|
||||
Often when probing a new data set, it is invaluable to get high-level information about what the dataset holds. Earlier in this section we discussed using methods such as `DataFrame.info`, `DataFrame.head`, and `DataFrame.tail` to examine some aspects of a `DataFrame`. While these methods are critical, they are on their own often insufficient to get enough information to know how to approach a new dataset. This is where exploratory statistics and visualizations for datasets come in.
|
||||
|
||||
To see what we mean in terms of gaining exploratory insight (both visually and numerically), let's dig into one of the datasets that ships with the scikit-learn library, the Boston Housing Dataset (though you will load it from a CSV file):
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
df = pd.read_csv('Data/housing_dataset.csv')
|
||||
df.head()
|
||||
```
|
||||
|
||||
This dataset contains information collected from the U.S. Census Bureau concerning housing in the area of Boston, Massachusetts and was first published in 1978. The version used here has 13 columns:
|
||||
- **CRIM**: Per-capita crime rate by town
|
||||
- **ZN**: Proportion of residential land zoned for lots over 25,000 square feet
|
||||
- **INDUS**: Proportion of non-retail business acres per town
|
||||
- **CHAS**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
|
||||
- **NOX**: Nitric oxides concentration (parts per 10 million)
|
||||
- **RM**: Average number of rooms per dwelling
|
||||
- **AGE**: Proportion of owner-occupied units built prior to 1940
|
||||
- **DIS**: Weighted distances to five Boston employment centres
|
||||
- **RAD**: Index of accessibility to radial highways
|
||||
- **TAX**: Full-value property-tax rate per \$10,000
|
||||
- **PTRATIO**: Pupil-teacher ratio by town
|
||||
- **LSTAT**: Percent of lower-status portion of the population
|
||||
- **MEDV**: Median value of owner-occupied homes in \$1,000s
|
||||
|
||||
One of the first methods we can use to better understand this dataset is `DataFrame.shape`:
|
||||
|
||||
|
||||
```python
|
||||
df.shape
|
||||
```
|
||||
|
||||
The dataset has 506 rows and 13 columns.
|
||||
|
||||
To get a better idea of the contents of each column we can use `DataFrame.describe`, which returns the count, mean, standard deviation, minimum, and maximum of the numeric values in each column, in addition to the quartiles for each column:
|
||||
|
||||
|
||||
```python
|
||||
df.describe()
|
||||
```
|
||||
|
||||
Because datasets can have so many columns, it is often useful to transpose the results of `DataFrame.describe` to make them easier to read.
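For example, a transposed summary puts one column of the dataset on each row of the output:

```python
# Transpose the summary statistics so each dataset column becomes a row
df.describe().T
```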
|
||||
Note that you can also examine specific descriptive statistics for columns without having to invoke `DataFrame.describe`:
|
||||
|
||||
|
||||
```python
|
||||
df['MEDV'].mean()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['MEDV'].max()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['AGE'].median()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Now find the maximum value in df['AGE'].
|
||||
|
||||
```
|
||||
|
||||
Other information that you will often want to see is the relationship between different columns. You do this with the `DataFrame.groupby` method. For example, you could examine the average MEDV (median value of owner-occupied homes) for each value of AGE (proportion of owner-occupied units built prior to 1940):
|
||||
|
||||
|
||||
```python
|
||||
df.groupby(['AGE'])['MEDV'].mean()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Now try to find the median value for AGE for each value of MEDV.
|
||||
|
||||
```
|
||||
|
||||
You can also apply a lambda function to each element of a `DataFrame` column by using the `apply` method. For example, say you wanted to create a new column that flags a row if more than 50 percent of owner-occupied homes were built before 1940:
|
||||
|
||||
|
||||
```python
|
||||
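# Flag rows where more than 50 percent of owner-occupied homes were built before 1940
# (an equivalent vectorized form would be: df['AGE_50'] = df['AGE'] > 50)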
df['AGE_50'] = df['AGE'].apply(lambda x: x>50)
|
||||
```
|
||||
|
||||
Once applied, you can also see how many values returned `True` and how many returned `False` by using the `value_counts` method:
|
||||
|
||||
|
||||
```python
|
||||
df['AGE_50'].value_counts()
|
||||
```
|
||||
|
||||
You can also use the new column in a `groupby` statement, similar to the one you created earlier:
|
||||
|
||||
|
||||
```python
|
||||
df.groupby(['AGE_50'])['MEDV'].mean()
|
||||
```
|
||||
|
||||
You can also group by more than one variable, such as AGE_50 (the one you just created), CHAS (whether a town is on the Charles River), and RAD (an index measuring access to the Boston-area radial highways), and then evaluate each group for the average median home price in that group:
|
||||
|
||||
|
||||
```python
|
||||
groupby_twovar = df.groupby(['AGE_50', 'RAD', 'CHAS'])['MEDV'].mean()
|
||||
```
|
||||
|
||||
You can then see what values are in this stacked group of variables:
|
||||
|
||||
|
||||
```python
|
||||
groupby_twovar
|
||||
```
|
||||
|
||||
Let's take a moment to analyze these results in a little depth. The first row reports that communities with less than half of their houses built before 1940, a highway-access index of 1, and no border on the Charles River have a mean house price of \$24,667 (in 1970s dollars); the next row shows that otherwise similar communities that do border the Charles River have a mean house price of \$50,000.
|
||||
|
||||
One insight that pops out from continuing down this output is that, all else being equal, being located next to the Charles River can significantly increase the value of newer housing stock. The story is more ambiguous for communities dominated by older houses: proximity to the Charles significantly increases home prices in one such community (presumably one farther from the city); for all the others, being situated on the river either provides a modest increase in value or actually decreases mean home prices.
|
||||
|
||||
While groupings like this can be a great way to begin to interrogate your data, you might not care for the 'tall' format it comes in. In that case, you can unstack the data into a "wide" format:
|
||||
|
||||
|
||||
```python
|
||||
groupby_twovar.unstack()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# How could you use groupby to get a sense of the proportion
|
||||
# of residential land zoned for lots over 25,000 sq.ft.,
|
||||
# the proportion of non-retail business acres per town,
|
||||
# and the distance of towns from employment centers in Boston?
|
||||
|
||||
```
|
||||
|
||||
It is also often valuable to know how many unique values a column has in it with the `nunique` method:
|
||||
|
||||
|
||||
```python
|
||||
df['CHAS'].nunique()
|
||||
```
|
||||
|
||||
Complementary to that, you will also likely want to know what those unique values are, which is where the `unique` method helps:
|
||||
|
||||
|
||||
```python
|
||||
df['CHAS'].unique()
|
||||
```
|
||||
|
||||
You can use the `value_counts` method to see how many of each unique value there are in a column:
|
||||
|
||||
|
||||
```python
|
||||
df['CHAS'].value_counts()
|
||||
```
|
||||
|
||||
Or you can easily plot a bar graph to visually see the breakdown:
|
||||
|
||||
|
||||
```python
|
||||
%matplotlib inline
|
||||
df['CHAS'].value_counts().plot(kind='bar')
|
||||
```
|
||||
|
||||
Note that the IPython magic command `%matplotlib inline` enables you to view the chart inline.
|
||||
|
||||
Let's pull back to the dataset as a whole for a moment. Two major things that you will look for in almost any dataset are trends and relationships. A typical relationship between variables to explore is the Pearson correlation, or the extent to which two variables are linearly related. The `corr` method will show this in table format for all of the columns in a `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
df.corr(method='pearson')
|
||||
```
|
||||
|
||||
Suppose you just wanted to look at the correlations between all of the columns and just one variable? Let's examine just the correlation between all other variables and the percentage of owner-occupied houses built before 1940 (AGE). We will do this by accessing the column by index number:
|
||||
|
||||
|
||||
```python
|
||||
corr = df.corr(method='pearson')
|
||||
corr_with_homevalue = corr.iloc[-1]
|
||||
corr_with_homevalue.sort_values(ascending=False)
|
||||
```
|
||||
|
||||
With the correlations arranged in descending order, it's easy to start to see some patterns. Correlating AGE with a variable we created from AGE is a trivial correlation. However, it is interesting to note that the percentage of older housing stock in communities strongly correlates with air pollution (NOX) and the proportion of non-retail business acres per town (INDUS); at least in 1978 metro Boston, older towns are more industrial.
|
||||
|
||||
Graphically, we can see the correlations using a heatmap from the Seaborn library:
|
||||
|
||||
|
||||
```python
|
||||
import seaborn as sns
|
||||
sns.heatmap(df.corr(), cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
|
||||
```
|
||||
|
||||
Histograms are another valuable tool for investigating your data. For example, what is the overall distribution of prices of owner-occupied houses in the Boston area?
|
||||
|
||||
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
plt.hist(df['MEDV'])
|
||||
```
|
||||
|
||||
The default bin count for the matplotlib histogram (essentially, how wide each bucket of values in the histogram bars is) is fairly coarse and might mask smaller details. To get a finer-grained view of the MEDV column, you can manually increase the number of bins in the histogram:
|
||||
|
||||
|
||||
```python
|
||||
plt.hist(df['MEDV'], bins=50)
|
||||
```
|
||||
|
||||
Seaborn has a somewhat more attractive version of the standard matplotlib histogram: the distribution plot. This is a combination histogram and kernel density estimate (KDE) plot (essentially a smoothed histogram):
|
||||
|
||||
|
||||
```python
|
||||
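# Note: distplot is deprecated in recent seaborn releases;
# sns.histplot(df['MEDV'], kde=True) is the closest modern equivalent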
sns.distplot(df['MEDV'])
|
||||
```
|
||||
|
||||
Another commonly used plot is the Seaborn jointplot, which combines histograms for two columns along with a scatterplot:
|
||||
|
||||
|
||||
```python
|
||||
sns.jointplot(df['RM'], df['MEDV'], kind='scatter')
|
||||
```
|
||||
|
||||
Unfortunately, many of the dots print over each other. You can help address this by adding some alpha blending, a setting that makes the dots semi-transparent so that concentrations of them drawn over one another become apparent:
|
||||
|
||||
|
||||
```python
|
||||
sns.jointplot(df['RM'], df['MEDV'], kind='scatter', alpha=0.3)
|
||||
```
|
||||
|
||||
Another way to see patterns in your data is with a two-dimensional KDE plot. Darker colors here represent a higher concentration of data points:
|
||||
|
||||
|
||||
```python
|
||||
sns.kdeplot(df['RM'], df['MEDV'], shade=True)
|
||||
```
|
||||
|
||||
Note that while the KDE plot is very good at showing concentrations of data points, finer structures like linear relationships (such as the clear relationship between the number of rooms in homes and the house price) are lost in the KDE plot.
|
||||
|
||||
Finally, the pairplot in Seaborn allows you to see scatterplots and histograms for several columns in one table. Here we have played with some of the keywords to produce a more sophisticated and easier-to-read pairplot that incorporates both alpha blending and linear regression lines for the scatterplots.
|
||||
|
||||
|
||||
```python
|
||||
sns.pairplot(df[['RM', 'AGE', 'LSTAT', 'DIS', 'MEDV']], kind="reg", plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})
|
||||
```
|
||||
|
||||
Visualization is the start of the really cool, fun part of data science. So play around with these visualization tools and see what you can learn from the data!
|
||||
|
||||
> **Takeaway:** An old joke goes: “What does a data scientist see when they look at a dataset? A bunch of numbers.” There is more than a little truth in that joke. Visualization is often the key to finding patterns and correlations in your data. While visualization cannot often deliver precise results, it can point you in the right direction to ask better questions and efficiently find value in the data.
|
|
@ -0,0 +1,746 @@
|
|||
|
||||
# Introduction to Pandas
|
||||
|
||||
Having explored NumPy, it is time to get to know the other workhorse of data science in Python: pandas. The pandas library in Python really does a lot to make working with data--and importing, cleaning, and organizing it--so much easier that it is hard to imagine doing data science in Python without it.
|
||||
|
||||
But it was not always this way. Wes McKinney developed the library out of necessity in 2008 while at AQR Capital Management in order to have a better tool for dealing with data analysis. The library has since taken off as an open-source software project that has become a mature and integral part of the data science ecosystem. (In fact, some examples in this section will be drawn from McKinney's book, *Python for Data Analysis*.)
|
||||
|
||||
The name 'pandas' actually has nothing to do with Chinese bears but rather comes from the term *panel data*, a form of multi-dimensional data involving measurements over time that comes out of the econometrics and statistics community. Ironically, while panel data is a usable data structure in pandas, it is not generally used today and we will not examine it in this course. Instead, we will focus on the two most widely used data structures in pandas: `Series` and `DataFrame`s.
|
||||
|
||||
## Reminders about importing and documentation
|
||||
|
||||
Just as you imported NumPy under the alias ``np``, we will import pandas under the alias ``pd``:
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
```
|
||||
|
||||
As with the NumPy convention, `pd` is an important and widely used convention in the data science world; we will use it here and we advise you to use it in your own coding.
|
||||
|
||||
As we progress through Section 5, don't forget that IPython provides a tab-completion feature and function documentation with the ``?`` character. If you don't understand something about a function you see in this section, take a moment and read the documentation; it can help a great deal. As a reminder, to display the built-in pandas documentation, use this code:
|
||||
|
||||
```ipython
|
||||
In [4]: pd?
|
||||
```
|
||||
|
||||
Because it can be useful to think of `Series` and `DataFrame`s in pandas as extensions of `ndarray`s in NumPy, go ahead and also import NumPy; you will want it for some of the examples later on:
|
||||
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
```
|
||||
|
||||
Now, on to pandas!
|
||||
|
||||
## Fundamental pandas data structures
|
||||
|
||||
Both `Series` and `DataFrame`s are a lot like the `ndarray`s you encountered in the last section. They provide clean, efficient data storage and handling at the scales necessary for data science. What both of them provide that `ndarray`s lack, however, are essential data-science features like flexibility when dealing with missing data and the ability to label data. These capabilities (along with others) help make `Series` and `DataFrame`s essential to the "data munging" that makes up so much of data science.
|
||||
|
||||
### `Series` objects in pandas
|
||||
|
||||
A pandas `Series` is a lot like an `ndarray` in NumPy: a one-dimensional array of indexed data.
|
||||
You can create a simple Series from an array of data like this:
|
||||
|
||||
|
||||
```python
|
||||
series_example = pd.Series([-0.5, 0.75, 1.0, -2])
|
||||
series_example
|
||||
```
|
||||
|
||||
Similar to an `ndarray`, a `Series` upcasts entries to be of the same type of data (that `-2` integer in the original array became a `-2.00` float in the `Series`).
|
||||
|
||||
What is different from an `ndarray` is that the ``Series`` automatically wraps both a sequence of values and a sequence of indices. These are two separate objects within the `Series` object that you can access with the ``values`` and ``index`` attributes.
|
||||
|
||||
Try accessing the ``values`` first; they are just a familiar NumPy array:
|
||||
|
||||
|
||||
```python
|
||||
series_example.values
|
||||
```
|
||||
|
||||
The ``index`` is also an array-like object:
|
||||
|
||||
|
||||
```python
|
||||
series_example.index
|
||||
```
|
||||
|
||||
Just as with `ndarray`s, you can access specific data elements in a `Series` via the familiar Python square-bracket index notation and slicing:
|
||||
|
||||
|
||||
```python
|
||||
series_example[1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example[1:3]
|
||||
```
|
||||
|
||||
Despite a lot of similarities, pandas `Series` have an important distinction from NumPy `ndarrays`: whereas `ndarrays` have *implicitly defined* integer indices (as do Python lists), pandas `Series` have *explicitly defined* indices. The best part is that you can set the index:
|
||||
|
||||
|
||||
```python
|
||||
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
|
||||
series_example2
|
||||
```
|
||||
|
||||
These explicit indices work exactly the way you would expect them to:
|
||||
|
||||
|
||||
```python
|
||||
series_example2['b']
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Do explicit Series indices work *exactly* the way you might expect?
|
||||
# Try slicing series_example2 using its explicit index and find out.
|
||||
|
||||
```
|
||||
|
||||
With explicit indices in the mix, a `Series` is basically a fixed-length, ordered dictionary in that it maps arbitrarily typed index values to arbitrarily typed data values. But like `ndarray`s, these data are all of the same type, which is important. Just as the type-specific compiled code behind `ndarray` makes it more efficient than Python lists for certain operations, the type information of a pandas ``Series`` makes it much more efficient than Python dictionaries for certain operations.
|
||||
|
||||
But the connection between `Series` and dictionaries is nevertheless very real: you can construct a ``Series`` object directly from a Python dictionary:
|
||||
|
||||
|
||||
```python
|
||||
population_dict = {'France': 65429495,
|
||||
'Germany': 82408706,
|
||||
'Russia': 143910127,
|
||||
'Japan': 126922333}
|
||||
population = pd.Series(population_dict)
|
||||
population
|
||||
```
|
||||
|
||||
Did you see what happened there? Depending on your pandas version, the keys `Russia` and `Japan` may have switched places between the order in which they were entered in `population_dict` and how they ended up in the `population` `Series` object: older versions of pandas sorted dictionary keys when building a `Series`, while newer versions (like Python 3.7+ dictionaries themselves) preserve insertion order. Either way, `Series` keys are stored in a definite order that you can rely on and manipulate.
|
||||
|
||||
So, at one level, you can interact with `Series` as you would with dictionaries:
|
||||
|
||||
|
||||
```python
|
||||
population['Russia']
|
||||
```
|
||||
|
||||
But you can also do powerful array-like operations with `Series` like slicing:
|
||||
|
||||
|
||||
```python
|
||||
# Try slicing on the population Series on your own.
|
||||
# Would slicing be possible if Series keys were not ordered?
|
||||
|
||||
```
|
||||
|
||||
You can also add elements to a `Series` the way that you would to an `ndarray`. Try it in the code cell below:
|
||||
|
||||
|
||||
```python
|
||||
# Try running population['Albania'] = 2937590 (or another country of your choice)
|
||||
# What order do the keys appear in when you run population? Is it what you expected?
|
||||
|
||||
```
|
||||
|
||||
Another useful `Series` feature (and definitely a difference from dictionaries) is that `Series` automatically aligns differently indexed data in arithmetic operations:
|
||||
|
||||
|
||||
```python
|
||||
pop2 = pd.Series({'Spain': 46432074, 'France': 102321, 'Albania': 50532})
|
||||
population + pop2
|
||||
```
|
||||
|
||||
Notice that in the case of Germany, Japan, Russia, and Spain (and Albania, depending on what you did in the previous exercise), the addition operation produced `NaN` (not a number) values. pandas does not treat missing values as `0`, but as `NaN` (and it can be helpful to think of arithmetic operations involving `NaN` as essentially `NaN + x = NaN`).
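If you would rather have the missing entries treated as zero, the `add` method accepts a `fill_value` argument; a minimal sketch (this idea is covered in more depth later in the section):

```python
# Countries missing from either Series are treated as 0 instead of producing NaN
population.add(pop2, fill_value=0)
```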
|
||||
|
||||
### `DataFrame` object in pandas
|
||||
|
||||
The other crucial data structure in pandas to get to know for data science is the `DataFrame`.
|
||||
Like the ``Series`` object, ``DataFrame``s can be thought of either as generalizations of `ndarray`s or as specializations of Python dictionaries.
|
||||
|
||||
Just as a ``Series`` is like a one-dimensional array with flexible indices, a ``DataFrame`` is like a two-dimensional array with both flexible row indices and flexible column names. Essentially, a `DataFrame` represents a rectangular table of data and contains an ordered collection of labeled columns, each of which can be a different value type (`string`, `int`, `float`, etc.).
|
||||
The DataFrame has both a row and column index; in this way you can think of it as a dictionary of `Series`, all of which share the same index.
|
||||
|
||||
Let's take a look at how this works in practice. We will start by creating a `Series` called `area`:
|
||||
|
||||
|
||||
```python
|
||||
area_dict = {'Albania': 28748,
|
||||
'France': 643801,
|
||||
'Germany': 357386,
|
||||
'Japan': 377972,
|
||||
'Russia': 17125200}
|
||||
area = pd.Series(area_dict)
|
||||
area
|
||||
```
|
||||
|
||||
Now you can combine this with the `population` `Series` you created earlier by using a dictionary to construct a single two-dimensional table containing data from both `Series`:
|
||||
|
||||
|
||||
```python
|
||||
countries = pd.DataFrame({'Population': population, 'Area': area})
|
||||
countries
|
||||
```
|
||||
|
||||
As with `Series`, note that `DataFrame`s also keep their indices in a well-defined order (in this case, the column indices `Area` and `Population`; older pandas releases sort column labels alphabetically, while newer ones preserve the order in which they were passed).
|
||||
|
||||
So far we have combined dictionaries together to compose a `DataFrame` (which has given our `DataFrame` a row-centric feel), but you can also create `DataFrame`s in a column-wise fashion. Consider adding a `Capital` column using our reliable old array-analog, a list:
|
||||
|
||||
|
||||
```python
|
||||
countries['Capital'] = ['Tirana', 'Paris', 'Berlin', 'Tokyo', 'Moscow']
|
||||
countries
|
||||
```
|
||||
|
||||
As with `Series`, even though initial indices are ordered in `DataFrame`s, subsequent additions to a `DataFrame` stay in the order added. However, you can explicitly change the order of `DataFrame` column indices this way:
|
||||
|
||||
|
||||
```python
|
||||
countries = countries[['Capital', 'Area', 'Population']]
|
||||
countries
|
||||
```
|
||||
|
||||
Commonly in a data science context, it is necessary to generate new columns of data from existing data sets. Because `DataFrame` columns behave like `Series`, you can do this by performing operations on them as you would with `Series`:
|
||||
|
||||
|
||||
```python
|
||||
countries['Population Density'] = countries['Population'] / countries['Area']
|
||||
countries
|
||||
```
|
||||
|
||||
Note: don't worry if pandas gives you a warning over this. The warning is pandas trying to be a little too helpful. The new column you created is an actual part of the `DataFrame` and not a copy of a slice.
|
||||
|
||||
We have stated before that `DataFrame`s are like dictionaries, and it's true. You can retrieve the contents of a column just as you would the value for a specific key in an ordinary dictionary:
|
||||
|
||||
|
||||
```python
|
||||
countries['Area']
|
||||
```
|
||||
|
||||
What about using the row indices?
|
||||
|
||||
|
||||
```python
|
||||
# Now try accessing row data with a command like countries['Japan']
|
||||
|
||||
```
|
||||
|
||||
This returns an error: `DataFrame`s are dictionaries of `Series`, which are the columns. `DataFrame` rows often have heterogeneous data types, so different methods are necessary to access row data. For that, we use the `.loc` method:
|
||||
|
||||
|
||||
```python
|
||||
countries.loc['Japan']
|
||||
```
|
||||
|
||||
Note that what `.loc` returns is an indexed object in its own right and you can access elements within it using familiar index syntax:
|
||||
|
||||
|
||||
```python
|
||||
countries.loc['Japan']['Area']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Can you think of a way to return the area of Japan without using .loc?
|
||||
# Hint: Try putting the column index first.
|
||||
# Can you slice along these indices as well?
|
||||
|
||||
```
|
||||
|
||||
Sometimes it is helpful in data science projects to add a column to a `DataFrame` without assigning values to it:
|
||||
|
||||
|
||||
```python
|
||||
countries['Debt-to-GDP Ratio'] = np.nan
|
||||
countries
|
||||
```
|
||||
|
||||
Again, you can disregard the warning (if it triggers) about adding the column this way.
|
||||
|
||||
You can also add columns to a `DataFrame` that do not have the same number of rows as the `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
debt = pd.Series([0.19, 2.36], index=['Russia', 'Japan'])
|
||||
countries['Debt-to-GDP Ratio'] = debt
|
||||
countries
|
||||
```
|
||||
|
||||
You can use the `del` command to delete a column from a `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
del countries['Capital']
|
||||
countries
|
||||
```
|
||||
|
||||
In addition to their dictionary-like behavior, `DataFrames` also behave like two-dimensional arrays. For example, it can be useful at times when working with a `DataFrame` to transpose it:
|
||||
|
||||
|
||||
```python
|
||||
countries.T
|
||||
```
|
||||
|
||||
Again, note that `DataFrame` columns are `Series`, and thus the data types must be consistent, hence the upcasting to floating-point numbers. **If there had been strings in this `DataFrame`, everything would have been upcast to strings.** Use caution when transposing `DataFrame`s.
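A minimal sketch of that caution, using a small, hypothetical `DataFrame` that mixes strings and numbers:

```python
# Each transposed column now mixes a string and a number, so both end up as object dtype
mixed = pd.DataFrame({'name': ['Gary', 'Stu'], 'rank': [1, 2]})
mixed.T.dtypes
```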
|
||||
|
||||
#### From a two-dimensional NumPy array
|
||||
|
||||
Given a two-dimensional array of data, we can create a ``DataFrame`` with any specified column and index names.
|
||||
If omitted, an integer index will be used for each:
|
||||
|
||||
|
||||
```python
|
||||
pd.DataFrame(np.random.rand(3, 2),
|
||||
columns=['foo', 'bar'],
|
||||
index=['a', 'b', 'c'])
|
||||
```
|
||||
|
||||
## Manipulating data in pandas
|
||||
|
||||
A huge part of data science is manipulating data in order to analyze it. (One rule of thumb is that 80% of any data science project will be concerned with cleaning and organizing the data for the project.) So it makes sense to learn the tools that pandas provides for handling data in `Series` and especially `DataFrame`s. Because both of those data structures are ordered, let's first start by taking a closer look at what gives them their structure: the `Index`.
|
||||
|
||||
### Index objects in pandas
|
||||
|
||||
Both ``Series`` and ``DataFrame``s in pandas have explicit indices that enable you to reference and modify data in them. These indices are actually objects themselves. The ``Index`` object can be thought of as both an immutable array and as a fixed-size set.
|
||||
|
||||
It's worth the time to get to know the properties of the `Index` object. Let's return to an example from earlier in the section to examine these properties.
|
||||
|
||||
|
||||
```python
|
||||
series_example = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
|
||||
ind = series_example.index
|
||||
ind
|
||||
```
|
||||
|
||||
The ``Index`` works a lot like an array. We have already seen how to use standard Python indexing notation to retrieve values or slices:
|
||||
|
||||
|
||||
```python
|
||||
ind[1]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
ind[::2]
|
||||
```
|
||||
|
||||
But ``Index`` objects are immutable; they cannot be modified via the normal means:
|
||||
|
||||
|
||||
```python
|
||||
ind[1] = 0
|
||||
```
|
||||
|
||||
This immutability is a good thing: it makes it safer to share indices between multiple ``Series`` or ``DataFrame``s without the potential for problems arising from inadvertent index modification.
|
||||
|
||||
In addition to being array-like, an `Index` also behaves like a fixed-size set, including following many of the conventions used by Python's built-in ``set`` data structure, so that unions, intersections, differences, and other combinations can be computed in a familiar way. Let's play around with this to see it in action.
|
||||
|
||||
|
||||
```python
|
||||
ind_odd = pd.Index([1, 3, 5, 7, 9])
|
||||
ind_prime = pd.Index([2, 3, 5, 7, 11])
|
||||
```
|
||||
|
||||
In the code cell below, try out the intersection (`ind_odd & ind_prime`), union (`ind_odd | ind_prime`), and the symmetric difference (`ind_odd ^ ind_prime`) of `ind_odd` and `ind_prime`. (Note that recent versions of pandas deprecated these operators as set operations; the equivalent methods, shown below, are now preferred.)
|
||||
|
||||
|
||||
```python
|
||||
|
||||
```
|
||||
|
||||
These operations may also be accessed via object methods, for example ``ind_odd.intersection(ind_prime)``. Below is a table listing some useful `Index` methods and properties.
|
||||
|
||||
| **Method** | **Description** |
|
||||
|:---------------|:------------------------------------------------------------------------------------------|
|
||||
| [`append`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) | Concatenate with additional `Index` objects, producing a new `Index` |
|
||||
| [`diff`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.diff.html) | Compute set difference as an Index |
|
||||
| [`drop`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) | Compute new `Index` by deleting passed values |
|
||||
| [`insert`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.insert.html) | Compute new `Index` by inserting element at index `i` |
|
||||
| [`is_monotonic`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.is_monotonic.html) | Returns `True` if each element is greater than or equal to the previous element |
|
||||
| [`is_unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.is_unique.html) | Returns `True` if the Index has no duplicate values |
|
||||
| [`isin`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.isin.html) | Compute boolean array indicating whether each value is contained in the passed collection |
|
||||
| [`unique`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.unique.html) | Compute the array of unique values in order of appearance |
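For example, the method forms of the set operations look like this (using the `ind_odd` and `ind_prime` indices defined above):

```python
ind_odd.intersection(ind_prime)          # values in both indices: 3, 5, 7
ind_odd.union(ind_prime)                 # values in either index
ind_odd.symmetric_difference(ind_prime)  # values in exactly one of the two
```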
|
||||
|
||||
### Data Selection in Series
|
||||
|
||||
As a refresher, a ``Series`` object acts in many ways like both a one-dimensional `ndarray` and a standard Python dictionary.
|
||||
|
||||
Like a dictionary, the ``Series`` object provides a mapping from a collection of arbitrary keys to a collection of arbitrary values. Back to an old example:
|
||||
|
||||
|
||||
```python
|
||||
series_example2 = pd.Series([-0.5, 0.75, 1.0, -2], index=['a', 'b', 'c', 'd'])
|
||||
series_example2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2['b']
|
||||
```
|
||||
|
||||
You can also examine the keys/indices and values using dictionary-like Python tools:
|
||||
|
||||
|
||||
```python
|
||||
'a' in series_example2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2.keys()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
list(series_example2.items())
|
||||
```
|
||||
|
||||
Just as you can extend a dictionary by assigning to a new key, you can extend a ``Series`` by assigning to a new index value:
|
||||
|
||||
|
||||
```python
|
||||
series_example2['e'] = 1.25
|
||||
series_example2
|
||||
```
|
||||
|
||||
#### Series as one-dimensional array
|
||||
|
||||
Because ``Series`` also provide array-style functionality, you can use the NumPy techniques we looked at in Section 3 like slices, masking, and fancy indexing:
|
||||
|
||||
|
||||
```python
|
||||
# Slicing using the explicit index
|
||||
series_example2['a':'c']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Slicing using the implicit integer index
|
||||
series_example2[0:2]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Masking
|
||||
series_example2[(series_example2 > -1) & (series_example2 < 0.8)]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Fancy indexing
|
||||
series_example2[['a', 'e']]
|
||||
```
|
||||
|
||||
One note to avoid confusion. When slicing with an explicit index (i.e., ``series_example2['a':'c']``), the final index is **included** in the slice; when slicing with an implicit index (i.e., ``series_example2[0:2]``), the final index is **excluded** from the slice.
|
||||
|
||||
#### Indexers: `loc` and `iloc`
|
||||
|
||||
A great thing about pandas is that you can use a lot of different things for your explicit indices. A potentially confusing thing about pandas is that you can use a lot of different things for your explicit indices, including integers. To avoid confusion between integer indices that you might supply and the implicit integer indices that pandas generates, pandas provides special *indexer* attributes that explicitly expose certain indexing schemes.
|
||||
|
||||
(A technical note: These are not functional methods; they are attributes that expose a particular slicing interface to the data in the ``Series``.)
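To see the potential confusion, here is a minimal sketch using a `Series` whose explicit index is itself made of integers:

```python
# Explicit integer index that does not match the implicit positions
ambiguous = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
ambiguous[1]    # scalar access uses the explicit index -> 'a'
ambiguous[1:3]  # slicing currently uses the implicit positions -> 'b' and 'c'
```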
|
||||
|
||||
The ``loc`` attribute allows indexing and slicing that always references the explicit index:
|
||||
|
||||
|
||||
```python
|
||||
series_example2.loc['a']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2.loc['a':'c']
|
||||
```
|
||||
|
||||
The ``iloc`` attribute enables indexing and slicing using the implicit, Python-style index:
|
||||
|
||||
|
||||
```python
|
||||
series_example2.iloc[0]
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
series_example2.iloc[0:2]
|
||||
```
|
||||
|
||||
A guiding principle of the Python language is the idea that "explicit is better than implicit." Professional code will generally use explicit indexing with ``loc`` and ``iloc`` and you should as well in order to make your code clean and readable.
|
||||
|
||||
### Data selection in DataFrames
|
||||
|
||||
``DataFrame``s also exhibit dual behavior, acting both like a two-dimensional `ndarray` and like a dictionary of ``Series`` sharing the same index.
|
||||
|
||||
#### DataFrame as dictionary of Series
|
||||
|
||||
Let's return to our earlier example of countries' areas and populations in order to examine `DataFrame`s as a dictionary of `Series`.
|
||||
|
||||
|
||||
```python
|
||||
area = pd.Series({'Albania': 28748,
|
||||
'France': 643801,
|
||||
'Germany': 357386,
|
||||
'Japan': 377972,
|
||||
'Russia': 17125200})
|
||||
population = pd.Series({'Albania': 2937590,
|
||||
'France': 65429495,
|
||||
'Germany': 82408706,
|
||||
'Russia': 143910127,
|
||||
'Japan': 126922333})
|
||||
countries = pd.DataFrame({'Area': area, 'Population': population})
|
||||
countries
|
||||
```
|
||||
|
||||
You can access the individual ``Series`` that make up the columns of a ``DataFrame`` via dictionary-style indexing of the column name:
|
||||
|
||||
|
||||
```python
|
||||
countries['Area']
|
||||
```
|
||||
|
||||
And dictionary-style syntax can also be used to modify `DataFrame`s, such as by adding a new column:
|
||||
|
||||
|
||||
```python
|
||||
countries['Population Density'] = countries['Population'] / countries['Area']
|
||||
countries
|
||||
```
|
||||
|
||||
#### DataFrame as two-dimensional array
|
||||
|
||||
You can also think of ``DataFrame``s as two-dimensional arrays. You can examine the raw data in the `DataFrame`/data array using the ``values`` attribute:
|
||||
|
||||
|
||||
```python
|
||||
countries.values
|
||||
```
|
||||
|
||||
Viewed this way, it makes sense that we can transpose the rows and columns of a `DataFrame` the same way we would an array:
|
||||
|
||||
|
||||
```python
|
||||
countries.T
|
||||
```
|
||||
|
||||
`DataFrame`s also use the ``loc`` and ``iloc`` indexers. With ``iloc``, you can index the underlying array as if it were an `ndarray` but with the ``DataFrame`` index and column labels maintained in the result:
|
||||
|
||||
|
||||
```python
|
||||
countries.iloc[:3, :2]
|
||||
```
|
||||
|
||||
``loc`` also permits array-like slicing but using the explicit index and column names:
|
||||
|
||||
|
||||
```python
|
||||
countries.loc[:'Germany', :'Population']
|
||||
```
|
||||
|
||||
You can also use array-like techniques such as masking and fancy indexing with `loc`.
|
||||
|
||||
|
||||
```python
|
||||
# Can you think of how to combine masking and fancy indexing in one line?
|
||||
# Your masking could be something like countries['Population Density'] > 200
|
||||
# Your fancy indexing could be something like ['Population', 'Population Density']
|
||||
# Be sure to put the masking and fancy indexing inside the square brackets: countries.loc[]
|
||||
|
||||
```
|
||||
|
||||
#### Indexing conventions
|
||||
|
||||
In practice in the world of data science (and pandas more generally), *indexing* refers to columns while *slicing* refers to rows:
|
||||
|
||||
|
||||
```python
|
||||
countries['France':'Japan']
|
||||
```
|
||||
|
||||
Such slices can also refer to rows by number rather than by index:
|
||||
|
||||
|
||||
```python
|
||||
countries[1:3]
|
||||
```
|
||||
|
||||
Similarly, direct masking operations are also interpreted row-wise rather than column-wise:
|
||||
|
||||
|
||||
```python
|
||||
countries[countries['Population Density'] > 200]
|
||||
```
|
||||
|
||||
These two conventions are syntactically similar to those on a NumPy array, and while these may not precisely fit the mold of the Pandas conventions, they are nevertheless quite useful in practice.
|
||||
|
||||
# Operating on Data in Pandas
|
||||
|
||||
As you begin to work in data science, operating on data is imperative. It is the very heart of data science. Another aspect of pandas that makes it a compelling tool for many data scientists is pandas' capability to perform efficient element-wise operations on data. pandas builds on ufuncs from NumPy to supply these capabilities and then extends them to provide additional power for data manipulation:
|
||||
- For unary operations (such as negation and trigonometric functions), ufuncs in pandas **preserve index and column labels** in the output.
|
||||
- For binary operations (such as addition and multiplication), pandas automatically **aligns indices** when passing objects to ufuncs.
|
||||
|
||||
These critical features of ufuncs in pandas mean that data retains its context when operated on and, more important still, they drastically reduce the chance of errors when you combine data from multiple sources.
|
||||
|
||||
## Index Preservation
|
||||
|
||||
pandas is explicitly designed to work with NumPy. As a result, all NumPy ufuncs will work on pandas ``Series`` and ``DataFrame`` objects.
|
||||
|
||||
We can see this more clearly if we create a simple ``Series`` and ``DataFrame`` of random numbers on which to operate.
|
||||
|
||||
|
||||
```python
|
||||
rng = np.random.RandomState(42)
|
||||
ser_example = pd.Series(rng.randint(0, 10, 4))
|
||||
ser_example
|
||||
```
|
||||
|
||||
Did you notice the NumPy function we used with the variable `rng`? By specifying a seed for the random-number generator, you get the same result each time. This can be a useful trick when you need to produce pseudo-random output that also needs to be reproducible by others. (Go ahead and re-run the code cell above a couple of times to convince yourself that it produces the same output each time.)
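A minimal sketch of why seeding matters: two generators created with the same seed produce identical output.

```python
rng_a = np.random.RandomState(42)
rng_b = np.random.RandomState(42)
(rng_a.randint(0, 10, 4) == rng_b.randint(0, 10, 4)).all()  # True
```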
|
||||
|
||||
|
||||
```python
|
||||
df_example = pd.DataFrame(rng.randint(0, 10, (3, 4)),
|
||||
columns=['A', 'B', 'C', 'D'])
|
||||
df_example
|
||||
```
|
||||
|
||||
Let's apply a ufunc to our example `Series`:
|
||||
|
||||
|
||||
```python
|
||||
np.exp(ser_example)
|
||||
```
|
||||
|
||||
The same thing happens with a slightly more complex operation on our example `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
np.cos(df_example * np.pi / 4)
|
||||
```
|
||||
|
||||
Note that you can use all of the ufuncs we discussed in Section 3 the same way.
|
||||
|
||||
## Index alignment
|
||||
|
||||
As mentioned above, when you perform a binary operation on two ``Series`` or ``DataFrame`` objects, pandas will align indices in the process of performing the operation. This is essential when working with incomplete data (and data is usually incomplete), but it is helpful to see this in action to better understand it.
|
||||
|
||||
### Index alignment with Series
|
||||
|
||||
For our first example, suppose we are combining two different data sources and find only the top five countries by *area* and the top five countries by *population*:
|
||||
|
||||
|
||||
```python
|
||||
area = pd.Series({'Russia': 17075400, 'Canada': 9984670,
|
||||
'USA': 9826675, 'China': 9598094,
|
||||
'Brazil': 8514877}, name='area')
|
||||
population = pd.Series({'China': 1409517397, 'India': 1339180127,
|
||||
'USA': 324459463, 'Indonesia': 322179605,
|
||||
'Brazil': 207652865}, name='population')
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Now divide these to compute the population density
|
||||
|
||||
```
|
||||
|
||||
Your resulting array contains the **union** of indices of the two input arrays: seven countries in total. All of the countries in the array without an entry (because they lacked either area data or population data) are marked with the now familiar ``NaN``, or "Not a Number," designation.
|
||||
|
||||
Index matching works the same way for built-in Python arithmetic expressions, and any missing values are filled in with `NaN`s. You can see this clearly by adding two `Series` that are slightly misaligned in their indices:
|
||||
|
||||
|
||||
```python
|
||||
series1 = pd.Series([2, 4, 6], index=[0, 1, 2])
|
||||
series2 = pd.Series([3, 5, 7], index=[1, 2, 3])
|
||||
series1 + series2
|
||||
```
|
||||
|
||||
`NaN` values are not always convenient to work with; `NaN` combined with any other value results in `NaN`, which can be a pain, particularly if you are combining multiple data sources with missing values. To help with this, pandas allows you to specify a default value to use for missing values in the operation. For example, calling `series1.add(series2)` is equivalent to calling `series1 + series2`, but you can also supply a fill value:
|
||||
|
||||
|
||||
```python
|
||||
series1.add(series2, fill_value=0)
|
||||
```
|
||||
|
||||
Much better!
|
||||
|
||||
### Index alignment with DataFrames
|
||||
|
||||
The same kind of alignment takes place in both dimensions (columns and indices) when you perform operations on ``DataFrame``s.
|
||||
|
||||
|
||||
```python
|
||||
df1 = pd.DataFrame(rng.randint(0, 20, (2, 2)),
|
||||
columns=list('AB'))
|
||||
df1
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df2 = pd.DataFrame(rng.randint(0, 10, (3, 3)),
|
||||
columns=list('BAC'))
|
||||
df2
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Add df1 and df2. Is the output what you expected?
|
||||
```
|
||||
|
||||
Even though we passed the columns in a different order in `df2` than in `df1`, the indices were aligned correctly, and the columns appear sorted in the resulting union of columns.
|
||||
|
||||
You can also use fill values for missing values with `DataFrame`s. In this example, let's fill the missing values with the mean of all values in `df1` (computed by first stacking the rows of `df1`):
|
||||
|
||||
|
||||
```python
|
||||
fill = df1.stack().mean()
|
||||
df1.add(df2, fill_value=fill)
|
||||
```
|
||||
|
||||
This table lists Python operators and their equivalent pandas object methods:
|
||||
|
||||
| Python Operator | Pandas Method(s) |
|
||||
|-----------------|---------------------------------------|
|
||||
| ``+`` | ``add()`` |
|
||||
| ``-`` | ``sub()``, ``subtract()`` |
|
||||
| ``*`` | ``mul()``, ``multiply()`` |
|
||||
| ``/`` | ``truediv()``, ``div()``, ``divide()``|
|
||||
| ``//`` | ``floordiv()`` |
|
||||
| ``%`` | ``mod()`` |
|
||||
| ``**`` | ``pow()`` |
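For example, the method forms accept extra arguments (such as `fill_value`) that the bare operators cannot; a quick sketch using `df1` and `df2` from above:

```python
df1 + df2                   # operator form
df1.add(df2)                # method form; identical result
df1.sub(df2, fill_value=0)  # method form, treating entries missing from one frame as 0
```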
|
||||
|
||||
|
||||
## Operations between DataFrames and Series
|
||||
|
||||
Index and column alignment is maintained in operations between a `DataFrame` and a `Series` as well. To see this, consider a common operation in data science, wherein we find the difference of a `DataFrame` and one of its rows. Because pandas inherits ufuncs from NumPy, pandas will compute the difference row-wise by default:
|
||||
|
||||
|
||||
```python
|
||||
df3 = pd.DataFrame(rng.randint(10, size=(3, 4)), columns=list('WXYZ'))
|
||||
df3
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df3 - df3.iloc[0]
|
||||
```
|
||||
|
||||
But what if you need to operate column-wise? You can do this by using object methods and specifying the ``axis`` keyword:
|
||||
|
||||
|
||||
```python
|
||||
df3.subtract(df3['X'], axis=0)
|
||||
```
|
||||
|
||||
And when you do operations between `DataFrame`s and `Series`, you still get automatic index alignment:
|
||||
|
||||
|
||||
```python
|
||||
halfrow = df3.iloc[0, ::2]
|
||||
halfrow
|
||||
```
|
||||
|
||||
Note that `halfrow` is displayed as a column even though it is a row of `df3`; a `Series` always prints vertically. When you subtract it from the `DataFrame`, pandas aligns it against the column labels and fills the unmatched columns with `NaN`:
|
||||
|
||||
|
||||
```python
|
||||
df3 - halfrow
|
||||
```
|
||||
|
||||
Remember, pandas preserves and aligns indices and columns to preserve data context. This will be of huge help to you in our next section when we look at data cleaning and preparation.
|
|
@ -0,0 +1,978 @@
|
|||
|
||||
# Manipulating and Cleaning Data
|
||||
|
||||
This section marks a subtle change. Up until now, we have been introducing ideas and techniques in order to prepare you with a toolbox of techniques to deal with real-world situations. We are now going to start using some of those tools while also giving you some ideas about how and when to use them in your own work with data.
|
||||
|
||||
Real-world data is messy. You will likely need to combine several data sources to get the data you actually want. The data from those sources will be incomplete. And it will likely not be formatted in exactly the way you want in order to perform your analysis. It's for these reasons that most data scientists will tell you that about 80 percent of any project is spent just getting the data into a form ready for analysis.
|
||||
|
||||
## Exploring `DataFrame` information
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable finding general information about the data stored in pandas DataFrames.
|
||||
|
||||
Once you have loaded your data into pandas, it will more likely than not be in a `DataFrame`. However, if the data set in your `DataFrame` has 60,000 rows and 400 columns, how do you even begin to get a sense of what you're working with? Fortunately, pandas provides some convenient tools to quickly look at overall information about a `DataFrame` in addition to the first few and last few rows.
|
||||
|
||||
In order to explore this functionality, we will import the Python scikit-learn library and use an iconic dataset that every data scientist has seen hundreds of times: British biologist Ronald Fisher's *Iris* data set used in his 1936 paper "The use of multiple measurements in taxonomic problems":
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
from sklearn.datasets import load_iris
|
||||
|
||||
iris = load_iris()
|
||||
iris_df = pd.DataFrame(data=iris['data'], columns=iris['feature_names'])
|
||||
```
|
||||
|
||||
### `DataFrame.info`
|
||||
Let's take a look at this dataset to see what we have:
|
||||
|
||||
|
||||
```python
|
||||
iris_df.info()
|
||||
```
|
||||
|
||||
From this, we know that the *Iris* dataset has 150 entries in four columns. All of the data is stored as 64-bit floating-point numbers.
|
||||
|
||||
### `DataFrame.head`
|
||||
Next, let's see what the first few rows of our `DataFrame` look like:
|
||||
|
||||
|
||||
```python
|
||||
iris_df.head()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
By default, `DataFrame.head` returns the first five rows of a `DataFrame`. In the code cell below, can you figure out how to get it to show more?
|
||||
|
||||
|
||||
```python
|
||||
# Hint: Consult the documentation by using iris_df.head?
|
||||
|
||||
```
|
||||
|
||||
### `DataFrame.tail`
|
||||
The flipside of `DataFrame.head` is `DataFrame.tail`, which returns the last five rows of a `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
iris_df.tail()
|
||||
```
|
||||
|
||||
In practice, it is useful to be able to easily examine the first few rows or the last few rows of a `DataFrame`, particularly when you are looking for outliers in ordered datasets.
|
||||
|
||||
> **Takeaway:** Even just by looking at the metadata about the information in a DataFrame or the first and last few values in one, you can get an immediate idea about the size, shape, and content of the data you are dealing with.
|
||||
|
||||
## Dealing with missing data
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should know how to replace or remove null values from DataFrames.
|
||||
|
||||
Most of the time the datasets you want to use (or have to use) have missing values in them. How missing data is handled carries with it subtle tradeoffs that can affect your final analysis and real-world outcomes.
|
||||
|
||||
Pandas handles missing values in two ways. The first you've seen before in previous sections: `NaN`, or Not a Number. This is actually a special value that is part of the IEEE floating-point specification, and it is used only to indicate missing floating-point values.
|
||||
|
||||
For missing values apart from floats, pandas uses the Python `None` object. While it might seem confusing that you will encounter two different kinds of values that say essentially the same thing, there are sound programmatic reasons for this design choice and, in practice, going this route enables pandas to deliver a good compromise for the vast majority of cases. Notwithstanding this, both `None` and `NaN` carry restrictions that you need to be mindful of with regards to how they can be used.
|
||||
|
||||
### `None`: non-float missing data
|
||||
Because `None` comes from Python, it cannot be used in NumPy and pandas arrays that are not of data type `'object'`. Remember, NumPy arrays (and the data structures in pandas) can contain only one type of data. This is what gives them their tremendous power for large-scale data and computational work, but it also limits their flexibility. Such arrays have to upcast to the “lowest common denominator,” the data type that will encompass everything in the array. When `None` is in the array, it means you are working with Python objects.
|
||||
|
||||
To see this in action, consider the following example array (note the `dtype` for it):
|
||||
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
example1 = np.array([2, None, 6, 8])
|
||||
example1
|
||||
```
|
||||
|
||||
The reality of upcast data types carries two side effects with it. First, operations will be carried out at the level of interpreted Python code rather than compiled NumPy code. Essentially, this means that any operations involving `Series` or `DataFrame`s with `None` in them will be slower. While you would probably not notice this performance hit on small datasets, for large datasets it might become an issue.
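A rough sketch of that cost, if you want to see it for yourself (exact timings will vary by machine):

```python
# Summing one million integers: compiled NumPy path vs. object (Python-level) path
vals_int = np.arange(1_000_000, dtype=int)
vals_obj = np.arange(1_000_000, dtype=object)
%timeit vals_int.sum()
%timeit vals_obj.sum()
```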
|
||||
|
||||
The second side effect stems from the first. Because `None` essentially drags `Series` or `DataFrame`s back into the world of vanilla Python, using NumPy/pandas aggregations like `sum()` or `min()` on arrays that contain a ``None`` value will generally produce an error:
|
||||
|
||||
|
||||
```python
|
||||
example1.sum()
|
||||
```
|
||||
|
||||
**Key takeaway**: Addition (and other operations) between integers and `None` values is undefined, which can limit what you can do with datasets that contain them.
|
||||
|
||||
### `NaN`: missing float values
|
||||
|
||||
In contrast to `None`, NumPy (and therefore pandas) supports `NaN` for its fast, vectorized operations and ufuncs. The bad news is that any arithmetic performed on `NaN` always results in `NaN`. For example:
|
||||
|
||||
|
||||
```python
|
||||
np.nan + 1
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
np.nan * 0
|
||||
```
|
||||
|
||||
The good news: aggregations run on arrays with `NaN` in them don't pop errors. The bad news: the results are not uniformly useful:
|
||||
|
||||
|
||||
```python
|
||||
example2 = np.array([2, np.nan, 6, 8])
|
||||
example2.sum(), example2.min(), example2.max()
|
||||
```
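If you want aggregations that simply skip the missing entries, NumPy also provides NaN-aware variants; a minimal sketch:

```python
# NaN-aware aggregations ignore the missing value instead of propagating it
np.nansum(example2), np.nanmin(example2), np.nanmax(example2)
```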
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you add np.nan and None together?
|
||||
|
||||
```
|
||||
|
||||
Remember: `NaN` is just for missing floating-point values; there is no `NaN` equivalent for integers, strings, or Booleans.
|
||||
|
||||
### `NaN` and `None`: null values in pandas
|
||||
|
||||
Even though `NaN` and `None` can behave somewhat differently, pandas is nevertheless built to handle them interchangeably. To see what we mean, consider a `Series` of integers:
|
||||
|
||||
|
||||
```python
|
||||
int_series = pd.Series([1, 2, 3], dtype=int)
|
||||
int_series
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Now set an element of int_series equal to None.
|
||||
# How does that element show up in the Series?
|
||||
# What is the dtype of the Series?
|
||||
|
||||
```
|
||||
|
||||
In the process of upcasting data types to establish data homogeneity in `Series` and `DataFrame`s, pandas will willingly switch missing values between `None` and `NaN`. Because of this design feature, it can be helpful to think of `None` and `NaN` as two different flavors of "null" in pandas. Indeed, some of the core methods you will use to deal with missing values in pandas reflect this idea in their names:
|
||||
|
||||
- `isnull()`: Generates a Boolean mask indicating missing values
|
||||
- `notnull()`: Opposite of `isnull()`
|
||||
- `dropna()`: Returns a filtered version of the data
|
||||
- `fillna()`: Returns a copy of the data with missing values filled or imputed
|
||||
|
||||
These are important methods to master and get comfortable with, so let's go over them each in some depth.
|
||||
|
||||
### Detecting null values
|
||||
Both `isnull()` and `notnull()` are your primary methods for detecting null data. Both return Boolean masks over your data.
|
||||
|
||||
|
||||
```python
|
||||
example3 = pd.Series([0, np.nan, '', None])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example3.isnull()
|
||||
```
|
||||
|
||||
Look closely at the output. Does any of it surprise you? While `0` is an arithmetic null, it's nevertheless a perfectly good integer and pandas treats it as such. `''` is a little more subtle. While we used it in Section 1 to represent an empty string value, it is nevertheless a string object and not a representation of null as far as pandas is concerned.
|
||||
|
||||
Now, let's turn this around and use these methods in a manner more like you will use them in practice. You can use Boolean masks directly as a ``Series`` or ``DataFrame`` index, which can be useful when trying to work with isolated missing (or present) values.
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Try running example3[example3.notnull()].
|
||||
# Before you do so, what do you expect to see?
|
||||
|
||||
```
|
||||
|
||||
**Key takeaway**: Both the `isnull()` and `notnull()` methods produce similar results when you use them in `DataFrame`s: they show the results and the index of those results, which will help you enormously as you wrestle with your data.
|
||||
|
||||
### Dropping null values
|
||||
|
||||
Beyond identifying missing values, pandas provides a convenient means to remove null values from `Series` and `DataFrame`s. (Particularly on large data sets, it is often more advisable to simply remove missing [NA] values from your analysis than deal with them in other ways.) To see this in action, let's return to `example3`:
|
||||
|
||||
|
||||
```python
|
||||
example3 = example3.dropna()
|
||||
example3
|
||||
```
|
||||
|
||||
Note that this should look like your output from `example3[example3.notnull()]`. The difference here is that, rather than just indexing on the masked values, `dropna` has removed those missing values from the `Series` `example3`.
|
||||
|
||||
Because `DataFrame`s have two dimensions, they afford more options for dropping data.
|
||||
|
||||
|
||||
```python
|
||||
example4 = pd.DataFrame([[1, np.nan, 7],
|
||||
[2, 5, 8],
|
||||
[np.nan, 6, 9]])
|
||||
example4
|
||||
```
|
||||
|
||||
(Did you notice that pandas upcast two of the columns to floats to accommodate the `NaN`s?)
|
||||
|
||||
You cannot drop a single value from a `DataFrame`, so you have to drop full rows or columns. Depending on what you are doing, you might want to do one or the other, and so pandas gives you options for both. Because in data science, columns generally represent variables and rows represent observations, you are more likely to drop rows of data; the default setting for `dropna()` is to drop all rows that contain any null values:
|
||||
|
||||
|
||||
```python
|
||||
example4.dropna()
|
||||
```
|
||||
|
||||
If necessary, you can drop NA values from columns instead. Use `axis=1` (or, equivalently, `axis='columns'`) to do so:
|
||||
|
||||
|
||||
```python
|
||||
example4.dropna(axis='columns')
|
||||
```
|
||||
|
||||
Notice that this can drop a lot of data that you might want to keep, particularly in smaller datasets. What if you just want to drop rows or columns that contain several or even just all null values? You specify those settings in `dropna` with the `how` and `thresh` parameters.
|
||||
|
||||
By default, `how='any'` (if you would like to check for yourself or see what other parameters the method has, run `example4.dropna?` in a code cell). You could alternatively specify `how='all'` so as to drop only rows or columns that contain all null values. Let's expand our example `DataFrame` to see this in action.
|
||||
|
||||
|
||||
```python
|
||||
example4[3] = np.nan
|
||||
example4
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# How might you go about dropping just column 3?
|
||||
# Hint: remember that you will need to supply both the axis parameter and the how parameter.
|
||||
|
||||
```
|
||||
|
||||
The `thresh` parameter gives you finer-grained control: you set the number of *non-null* values that a row or column needs to have in order to be kept:
|
||||
|
||||
|
||||
```python
|
||||
example4.dropna(axis='rows', thresh=3)
|
||||
```
|
||||
|
||||
Here, the first and last rows have been dropped because they contain only two non-null values.
|
||||
|
||||
### Filling null values
|
||||
|
||||
Depending on your dataset, it can sometimes make more sense to fill null values with valid ones rather than drop them. You could use `isnull` to do this in place, but that can be laborious, particularly if you have a lot of values to fill. Because this is such a common task in data science, pandas provides `fillna`, which returns a copy of the `Series` or `DataFrame` with the missing values replaced with one of your choosing. Let's create another example `Series` to see how this works in practice.
|
||||
|
||||
|
||||
```python
|
||||
example5 = pd.Series([1, np.nan, 2, None, 3], index=list('abcde'))
|
||||
example5
|
||||
```
|
||||
|
||||
You can fill all of the null entries with a single value, such as `0`:
|
||||
|
||||
|
||||
```python
|
||||
example5.fillna(0)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you try to fill null values with a string, like ''?
|
||||
|
||||
```
|
||||
|
||||
You can **forward-fill** null values, which is to use the last valid value to fill a null:
|
||||
|
||||
|
||||
```python
|
||||
example5.fillna(method='ffill')
|
||||
```
|
||||
|
||||
You can also **back-fill** to propagate the next valid value backward to fill a null:
|
||||
|
||||
|
||||
```python
|
||||
example5.fillna(method='bfill')
|
||||
```
|
||||
|
||||
As you might guess, this works the same with `DataFrame`s, but you can also specify an `axis` along which to fill null values:
|
||||
|
||||
|
||||
```python
|
||||
example4
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example4.fillna(method='ffill', axis=1)
|
||||
```
|
||||
|
||||
Notice that when a previous value is not available for forward-filling, the null value remains.
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What output does example4.fillna(method='bfill', axis=1) produce?
|
||||
# What about example4.fillna(method='ffill') or example4.fillna(method='bfill')?
|
||||
# Can you think of a longer code snippet to write that can fill all of the null values in example4?
|
||||
|
||||
```
|
||||
|
||||
You can be creative about how you use `fillna`. For example, let's look at `example4` again, but this time let's fill the missing values with the average of all of the values in the `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
example4.fillna(example4.mean())
|
||||
```
|
||||
|
||||
Notice that column 3 is still valueless: `example4.mean()` computes a mean per column, and because every entry in column 3 is missing, its mean is itself `NaN`, leaving nothing to fill with.
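If you do want to fill even an all-null column like this one, one option (a sketch, not part of the original lab) is to use a single global fill value, such as the mean of every value in the `DataFrame`:


```python
# stack() drops the NaNs and flattens the DataFrame into one Series,
# so its mean is a single number that can fill every missing cell
example4.fillna(example4.stack().mean())
```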
|
||||
|
||||
> **Takeaway:** There are multiple ways to deal with missing values in your datasets. The specific strategy you use (removing them, replacing them, or even how you replace them) should be dictated by the particulars of that data. You will develop a better sense of how to deal with missing values the more you handle and interact with datasets.
|
||||
|
||||
## Removing duplicate data
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable identifying and removing duplicate values from DataFrames.
|
||||
|
||||
In addition to missing data, you will often encounter duplicated data in real-world datasets. Fortunately, pandas provides an easy means of detecting and removing duplicate entries.
|
||||
|
||||
### Identifying duplicates: `duplicated`
|
||||
|
||||
You can easily spot duplicate values using the `duplicated` method in pandas, which returns a Boolean mask indicating whether an entry in a `DataFrame` is a duplicate of an earlier one. Let's create another example `DataFrame` to see this in action.
|
||||
|
||||
|
||||
```python
|
||||
example6 = pd.DataFrame({'letters': ['A','B'] * 2 + ['B'],
|
||||
'numbers': [1, 2, 1, 3, 3]})
|
||||
example6
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
example6.duplicated()
|
||||
```
|
||||
|
||||
### Dropping duplicates: `drop_duplicates`
|
||||
`drop_duplicates` simply returns a copy of the data for which all of the `duplicated` values are `False`:
|
||||
|
||||
|
||||
```python
|
||||
example6.drop_duplicates()
|
||||
```
|
||||
|
||||
Both `duplicated` and `drop_duplicates` default to considering all columns, but you can specify that they examine only a subset of columns in your `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
example6.drop_duplicates(['letters'])
|
||||
```
|
||||
|
||||
> **Takeaway:** Removing duplicate data is an essential part of almost every data-science project. Duplicate data can change the results of your analyses and give you spurious results!
|
||||
|
||||
## Combining datasets: merge and join
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should have a general knowledge of the various ways to combine `DataFrame`s.
|
||||
|
||||
Your most interesting analyses will often come from data melded together from more than one source. Because of this, pandas provides several methods of merging and joining datasets to make this necessary job easier:
|
||||
- **`pandas.merge`** connects rows in `DataFrame`s based on one or more keys.
|
||||
- **`pandas.concat`** concatenates or “stacks” together objects along an axis.
|
||||
- The **`combine_first`** instance method enables you to splice together overlapping data to fill in missing values in one object with values from another.
|
||||
|
||||
Let's examine merging data first, because it will be the most familiar to course attendees who are already familiar with SQL or other relational databases.
|
||||
|
||||
### Categories of joins
|
||||
|
||||
`merge` carries out several types of joins: *one-to-one*, *many-to-one*, and *many-to-many*. You use the same basic function call to implement all of them, and we will examine all three (because you will need all three at some point in your data delving, depending on the data). We will start with one-to-one joins because they are generally the simplest example.
|
||||
|
||||
#### One-to-one joins
|
||||
|
||||
Consider combining two `DataFrame`s that contain different information on the same employees in a company:
|
||||
|
||||
|
||||
```python
|
||||
df1 = pd.DataFrame({'employee': ['Gary', 'Stu', 'Mary', 'Sue'],
|
||||
'group': ['Accounting', 'Marketing', 'Marketing', 'HR']})
|
||||
df1
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df2 = pd.DataFrame({'employee': ['Mary', 'Stu', 'Gary', 'Sue'],
|
||||
'hire_date': [2008, 2012, 2017, 2018]})
|
||||
df2
|
||||
```
|
||||
|
||||
Combine this information into a single `DataFrame` using the `merge` function:
|
||||
|
||||
|
||||
```python
|
||||
df3 = pd.merge(df1, df2)
|
||||
df3
|
||||
```
|
||||
|
||||
Pandas joined on the `employee` column because it was the only column common to both `df1` and `df2`. (Note also that the original indices of `df1` and `df2` were discarded by `merge`; this is generally the case with merges unless you conduct them by index, which we will discuss later on.)
|
||||
|
||||
#### Many-to-one joins
|
||||
|
||||
A many-to-one join is like a one-to-one join except that one of the two key columns contains duplicate entries. The `DataFrame` resulting from such a join will preserve those duplicate entries as appropriate:
|
||||
|
||||
|
||||
```python
|
||||
df4 = pd.DataFrame({'group': ['Accounting', 'Marketing', 'HR'],
|
||||
'supervisor': ['Carlos', 'Giada', 'Stephanie']})
|
||||
df4
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df3, df4)
|
||||
```
|
||||
|
||||
The resulting `DataFrame` has an additional `supervisor` column; that column has an extra occurrence of 'Giada' that did not occur in `df4` because more than one employee in the merged `DataFrame` works in the 'Marketing' group.
|
||||
|
||||
Note that we didn’t specify which column to join on. When you don't specify that information, `merge` uses the overlapping column names as the keys. However, that can be ambiguous; several columns might meet that condition. For that reason, it is a good practice to explicitly specify on which key to join. You can do this with the `on` parameter:
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df3, df4, on='group')
|
||||
```
|
||||
|
||||
#### Many-to-many joins
|
||||
What happens if the key columns in both of the `DataFrame`s you are joining contain duplicates? That gives you a many-to-many join:
|
||||
|
||||
|
||||
```python
|
||||
df5 = pd.DataFrame({'group': ['Accounting', 'Accounting',
|
||||
'Marketing', 'Marketing', 'HR', 'HR'],
|
||||
'core_skills': ['math', 'spreadsheets', 'writing', 'communication',
|
||||
'spreadsheets', 'organization']})
|
||||
df5
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df5, on='group')
|
||||
```
|
||||
|
||||
Again, in order to avoid ambiguity as to which column to join on, it is a good idea to explicitly tell `merge` which one to use with the `on` parameter.
|
||||
|
||||
#### `left_on` and `right_on` keywords
|
||||
What if you need to merge two datasets with no shared column names? For example, what if you are using a dataset in which the employee name is labeled as 'name' rather than 'employee'? In such cases, you will need to use the `left_on` and `right_on` keywords in order to specify the column names on which to join:
|
||||
|
||||
|
||||
```python
|
||||
df6 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
|
||||
'salary': [70000, 80000, 120000, 90000]})
|
||||
df6
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df6, left_on="employee", right_on="name")
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Using the documentation, can you figure out how to use .drop() to get rid of the 'name' column?
|
||||
# Hint: You will need to supply two parameters to .drop()
|
||||
|
||||
```
|
||||
|
||||
#### `left_index` and `right_index` keywords
|
||||
|
||||
Sometimes it can be more advantageous to merge on an index rather than on a column. The `left_index` and `right_index` keywords make it possible to join by index. Let's revisit some of our earlier example `DataFrame`s to see what this looks like in action.
|
||||
|
||||
|
||||
```python
|
||||
df1a = df1.set_index('employee')
|
||||
df1a
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df2a = df2.set_index('employee')
|
||||
df2a
|
||||
```
|
||||
|
||||
To merge on the index, specify the `left_index` and `right_index` parameters in `merge`:
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1a, df2a, left_index=True, right_index=True)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What happens if you specify only left_index or right_index?
|
||||
|
||||
```
|
||||
|
||||
You can also use the `join` method for `DataFrame`s, which produces the same effect but merges on indices by default:
|
||||
|
||||
|
||||
```python
|
||||
df1a.join(df2a)
|
||||
```
|
||||
|
||||
You can also mix and match `left_index`/`right_index` with `right_on`/`left_on`:
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1a, df6, left_index=True, right_on='name')
|
||||
```
|
||||
|
||||
#### Set arithmetic for joins
|
||||
|
||||
Let's return to many-to-many joins for a moment. A consideration that is unique to them is the *arithmetic* of the join, specifically the set arithmetic we use for the join. To illustrate what we mean by this, let's restructure an old example `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
df5 = pd.DataFrame({'group': ['Engineering', 'Marketing', 'Sales'],
|
||||
'core_skills': ['math', 'writing', 'communication']})
|
||||
df5
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df5, on='group')
|
||||
```
|
||||
|
||||
Notice that after we have restructured `df5` and then re-run the merge with `df1`, we have only two entries in the result. This is because we merged on `group` and 'Marketing' was the only entry that appeared in the `group` column of both `DataFrame`s.
|
||||
|
||||
In effect, what we have gotten is the *intersection* of both `DataFrame`s. This is known as the inner join in the database world, and it is the default setting for `merge`, although we can certainly specify it explicitly:
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df5, on='group', how='inner')
|
||||
```
|
||||
|
||||
The complement of the inner join is the outer join, which returns the *union* of the two `DataFrame`s.
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# The keyword for performing an outer join is how='outer'. How would you perform it?
|
||||
# What do you expect the output of an outer join of df1 and df5 to be?
|
||||
|
||||
```
|
||||
|
||||
Notice in your resulting `DataFrame` that not every row in `df1` and `df5` had a value that corresponds to the union of the key values (the 'group' column). Pandas fills in these missing values with `NaN`s.
|
||||
|
||||
Inner and outer joins are not your only options. A *left join* returns all of the rows in the first (left-side) `DataFrame` supplied to `merge`, along with rows from the other `DataFrame` that match up with the left-side key values (filling in `NaN` wherever the right-side `DataFrame` has no matching values):
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df1, df5, how='left')
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Now run the right merge between df1 and df5.
|
||||
# What do you expect to see?
|
||||
|
||||
```
|
||||
|
||||
#### `suffixes` keyword: dealing with conflicting column names
|
||||
Because you can join datasets, you will eventually join two with conflicting column names. Let's look at another example to see what we mean:
|
||||
|
||||
|
||||
```python
|
||||
df7 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
|
||||
'rank': [1, 2, 3, 4]})
|
||||
df7
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df8 = pd.DataFrame({'name': ['Gary', 'Stu', 'Mary', 'Sue'],
|
||||
'rank': [3, 1, 4, 2]})
|
||||
df8
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df7, df8, on='name')
|
||||
```
|
||||
|
||||
Each column name in a `DataFrame` must be unique, so in cases where two joined `DataFrame`s share column names (aside from the column serving as the key), the `merge` function automatically appends the suffix `_x` or `_y` to the conflicting column names in order to make them unique. In cases where it is best to control your column names, you can specify a custom suffix for `merge` to append through the `suffixes` keyword:
|
||||
|
||||
|
||||
```python
|
||||
pd.merge(df7, df8, on='name', suffixes=['_left', '_right'])
|
||||
```
|
||||
|
||||
Note that these suffixes are applied to every conflicting column if more than one column overlaps.
|
||||
|
||||
### Concatenation in NumPy
|
||||
Concatenation in pandas is built off of the concatenation functionality for NumPy arrays. Here is what NumPy concatenation looks like:
|
||||
- For one-dimensional arrays:
|
||||
|
||||
|
||||
```python
|
||||
x = [1, 2, 3]
|
||||
y = [4, 5, 6]
|
||||
z = [7, 8, 9]
|
||||
np.concatenate([x, y, z])
|
||||
```
|
||||
|
||||
- For two-dimensional arrays:
|
||||
|
||||
|
||||
```python
|
||||
x = [[1, 2],
|
||||
[3, 4]]
|
||||
np.concatenate([x, x], axis=1)
|
||||
```
|
||||
|
||||
Notice that the `axis=1` parameter makes the concatenation occur along columns rather than rows. Concatenation in pandas looks similar to this.
|
||||
|
||||
### Concatenation in pandas
|
||||
|
||||
Pandas has a function, `pd.concat()`, that can be used for a simple concatenation of `Series` or `DataFrame` objects, in a similar manner to `np.concatenate()` with ndarrays.
|
||||
|
||||
|
||||
```python
|
||||
ser1 = pd.Series(['a', 'b', 'c'], index=[1, 2, 3])
|
||||
ser2 = pd.Series(['d', 'e', 'f'], index=[4, 5, 6])
|
||||
pd.concat([ser1, ser2])
|
||||
```
|
||||
|
||||
It also concatenates higher-dimensional objects, such as ``DataFrame``s:
|
||||
|
||||
|
||||
```python
|
||||
df9 = pd.DataFrame({'A': ['a', 'c'],
|
||||
'B': ['b', 'd']})
|
||||
df9
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df9, df9])
|
||||
```
|
||||
|
||||
Notice that `pd.concat` has preserved the indexing even though that means that it has been duplicated. You can have the results re-indexed (and avoid potential confusion down the road) like so:
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df9, df9], ignore_index=True)
|
||||
```
|
||||
|
||||
By default, `pd.concat` concatenates row-wise within the `DataFrame` (that is, `axis=0` by default). You can specify the axis along which to concatenate:
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df9, df9], axis=1)
|
||||
```
|
||||
|
||||
Note that pandas displays this result without complaint even though the column names are now duplicated. Duplicate column names can cause confusing behavior later (for example, selecting column `'A'` would return both columns), so it is usually best to rename or re-index the columns after a concatenation like this.
|
||||
|
||||
### Concatenation with joins
|
||||
Just as you did with merge above, you can use inner and outer joins when concatenating `DataFrame`s with different sets of column names.
|
||||
|
||||
|
||||
```python
|
||||
df10 = pd.DataFrame({'A': ['a', 'd'],
|
||||
'B': ['b', 'e'],
|
||||
'C': ['c', 'f']})
|
||||
df10
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df11 = pd.DataFrame({'B': ['u', 'x'],
|
||||
'C': ['v', 'y'],
|
||||
'D': ['w', 'z']})
|
||||
df11
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df10, df11])
|
||||
```
|
||||
|
||||
As we saw earlier, the default join for this is an outer join and entries for which no data is available are filled with `NaN` values. You can also do an inner join:
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df10, df11], join='inner')
|
||||
```
|
||||
|
||||
Another option is to directly specify the index of the remaining columns using the `join_axes` argument, which takes a list of index objects. Here, we will specify that the returned columns should be the same as those of the first input (`df10`):
|
||||
|
||||
|
||||
```python
|
||||
pd.concat([df10, df11], join_axes=[df10.columns])
|
||||
```
|
||||
|
||||
#### `append()`
|
||||
|
||||
Because direct array concatenation is so common, ``Series`` and ``DataFrame`` objects have an ``append`` method that can accomplish the same thing in fewer keystrokes. For example, rather than calling ``pd.concat([df9, df9])``, you can simply call ``df9.append(df9)``:
|
||||
|
||||
|
||||
```python
|
||||
df9.append(df9)
|
||||
```
|
||||
|
||||
**Important point**: Unlike the `append()` and `extend()` methods of Python lists, the `append()` method in pandas does not modify the original object. It instead creates a new object with the combined data.
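A quick way to confirm this: after the call, the original `df9` still has its two rows, while the returned object holds four. A small sketch:


```python
combined = df9.append(df9)
len(df9), len(combined)  # (2, 4): df9 is unchanged; the new object holds the appended rows
```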
|
||||
|
||||
> **Takeaway:** A large part of the value you can provide as a data scientist comes from connecting multiple, often disparate datasets to find new insights. Learning how to join and merge data is thus an essential part of your skill set.
|
||||
|
||||
## Exploratory statistics and visualization
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be familiar with some of the ways to visually explore the data stored in `DataFrame`s.
|
||||
|
||||
Often when probing a new data set, it is invaluable to get high-level information about what the dataset holds. Earlier in this section we discussed using methods such as `DataFrame.info`, `DataFrame.head`, and `DataFrame.tail` to examine some aspects of a `DataFrame`. While these methods are critical, they are on their own often insufficient to get enough information to know how to approach a new dataset. This is where exploratory statistics and visualizations for datasets come in.
|
||||
|
||||
To see what we mean in terms of gaining exploratory insight (both visually and numerically), let's dig into one of the datasets that come with the scikit-learn library, the Boston Housing Dataset (though you will load it from a CSV file):
|
||||
|
||||
|
||||
```python
|
||||
df = pd.read_csv('Data/housing_dataset.csv')
|
||||
df.head()
|
||||
```
|
||||
|
||||
This dataset contains information collected by the U.S. Census Bureau concerning housing in the area of Boston, Massachusetts, and was first published in 1978. The dataset has 13 columns:
|
||||
- **CRIM**: Per-capita crime rate by town
|
||||
- **ZN**: Proportion of residential land zoned for lots over 25,000 square feet
|
||||
- **INDUS**: Proportion of non-retail business acres per town
|
||||
- **CHAS**: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
|
||||
- **NOX**: Nitric oxides concentration (parts per 10 million)
|
||||
- **RM**: Average number of rooms per dwelling
|
||||
- **AGE**: Proportion of owner-occupied units built prior to 1940
|
||||
- **DIS**: Weighted distances to five Boston employment centres
|
||||
- **RAD**: Index of accessibility to radial highways
|
||||
- **TAX**: Full-value property-tax rate per \$10,000
|
||||
- **PTRATIO**: Pupil-teacher ratio by town
|
||||
- **LSTAT**: Percent of lower-status portion of the population
|
||||
- **MEDV**: Median value of owner-occupied homes in \$1,000s
|
||||
|
||||
One of the first attributes we can use to better understand this dataset is `DataFrame.shape`:
|
||||
|
||||
|
||||
```python
|
||||
df.shape
|
||||
```
|
||||
|
||||
The dataset has 506 rows and 13 columns.
|
||||
|
||||
To get a better idea of the contents of each column, we can use `DataFrame.describe`, which returns the maximum value, minimum value, mean, and standard deviation of numeric values in each column, in addition to the quartiles for each column:
|
||||
|
||||
|
||||
```python
|
||||
df.describe()
|
||||
```
|
||||
|
||||
Because datasets can have so many columns, it can often be useful to transpose the results of `DataFrame.describe` to make better use of them.
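A transposed view takes just one line, using the `T` attribute of the returned `DataFrame`:


```python
# Transpose describe() so each row corresponds to a column of the original DataFrame
df.describe().T
```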
|
||||
|
||||
Note that you can also examine specific descriptive statistics for columns without having to invoke `DataFrame.describe`:
|
||||
|
||||
|
||||
```python
|
||||
df['MEDV'].mean()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['MEDV'].max()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['AGE'].median()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Now find the maximum value in df['AGE'].
|
||||
|
||||
```
|
||||
|
||||
Other information that you will often want to see is the relationship between different columns. You do this with the `DataFrame.groupby` method. For example, you could examine the average MEDV (median value of owner-occupied homes) for each value of AGE (proportion of owner-occupied units built prior to 1940):
|
||||
|
||||
|
||||
```python
|
||||
df.groupby(['AGE'])['MEDV'].mean()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Now try to find the median value for AGE for each value of MEDV.
|
||||
|
||||
```
|
||||
|
||||
You can also apply a lambda function to each element of a `DataFrame` column by using the `apply` method. For example, say you wanted to create a new column that flagged a row if more than 50 percent of owner-occupied homes were built before 1940:
|
||||
|
||||
|
||||
```python
|
||||
df['AGE_50'] = df['AGE'].apply(lambda x: x>50)
|
||||
```
|
||||
|
||||
Once applied, you can also see how many values returned `True` and how many `False` by using the `value_counts` method:
|
||||
|
||||
|
||||
```python
|
||||
df['AGE_50'].value_counts()
|
||||
```
|
||||
|
||||
You can also examine figures from the groupby statement you created earlier:
|
||||
|
||||
|
||||
```python
|
||||
df.groupby(['AGE_50'])['MEDV'].mean()
|
||||
```
|
||||
|
||||
You can also group by more than one variable, such as AGE_50 (the one you just created), CHAS (whether a town is on the Charles River), and RAD (an index measuring access to the Boston-area radial highways), and then evaluate each group for the average median home price in that group:
|
||||
|
||||
|
||||
```python
|
||||
groupby_twovar=df.groupby(['AGE_50','RAD','CHAS'])['MEDV'].mean()
|
||||
```
|
||||
|
||||
You can then see what values are in this stacked group of variables:
|
||||
|
||||
|
||||
```python
|
||||
groupby_twovar
|
||||
```
|
||||
|
||||
Let's take a moment to analyze these results in a little depth. The first row reports that communities with less than half of their houses built before 1940, with a highway-access index of 1, and that are not situated on the Charles River have a mean house price of \$24,667 (in 1970s dollars); the next row shows that communities that are similar except for being located on the Charles River have a mean house price of \$50,000.
|
||||
|
||||
One insight that pops out from continuing down this list is that, all else being equal, being located next to the Charles River can significantly increase the value of newer housing stock. The story is more ambiguous for communities dominated by older houses: proximity to the Charles significantly increases home prices in one community (presumably one farther away from the city); for all others, being situated on the river either provides a modest increase in value or actually decreases mean home prices.
|
||||
|
||||
While groupings like this can be a great way to begin to interrogate your data, you might not care for the 'tall' format it comes in. In that case, you can unstack the data into a "wide" format:
|
||||
|
||||
|
||||
```python
|
||||
groupby_twovar.unstack()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# How could you use groupby to get a sense of the proportion
|
||||
# of residential land zoned for lots over 25,000 sq.ft.,
|
||||
# the proportion of non-retail business acres per town,
|
||||
# and the distance of towns from employment centers in Boston?
|
||||
|
||||
```
|
||||
|
||||
It is also often valuable to know how many unique values a column has in it with the `nunique` method:
|
||||
|
||||
|
||||
```python
|
||||
df['CHAS'].nunique()
|
||||
```
|
||||
|
||||
Complementary to that, you will also likely want to know what those unique values are, which is where the `unique` method helps:
|
||||
|
||||
|
||||
```python
|
||||
df['CHAS'].unique()
|
||||
```
|
||||
|
||||
You can use the `value_counts` method to see how many of each unique value there are in a column:
|
||||
|
||||
|
||||
```python
|
||||
df['CHAS'].value_counts()
|
||||
```
|
||||
|
||||
Or you can easily plot a bar graph to visually see the breakdown:
|
||||
|
||||
|
||||
```python
|
||||
%matplotlib inline
|
||||
df['CHAS'].value_counts().plot(kind='bar')
|
||||
```
|
||||
|
||||
Note that the IPython magic command `%matplotlib inline` enables you to view the chart inline.
|
||||
|
||||
Let's pull back to the dataset as a whole for a moment. Two major things that you will look for in almost any dataset are trends and relationships. A typical relationship between variables to explore is the Pearson correlation, or the extent to which two variables are linearly related. The `corr` method will show this in table format for all of the columns in a `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
df.corr(method='pearson')
|
||||
```
|
||||
|
||||
What if you just wanted to look at the correlations between all of the columns and just one variable? Let's examine just the correlation between all the other variables and the percentage of owner-occupied houses built before 1940 (AGE). We will do this by accessing the column by index number:
|
||||
|
||||
|
||||
```python
|
||||
corr = df.corr(method='pearson')
|
||||
corr_with_homevalue = corr.iloc[-1]
|
||||
corr_with_homevalue[corr_with_homevalue.argsort()[::-1]]
|
||||
```
|
||||
|
||||
With the correlations arranged in descending order, it's easy to start to see some patterns. Correlating AGE with a variable we created from AGE is a trivial correlation. However, it is interesting to note that the percentage of older housing stock in communities strongly correlates with air pollution (NOX) and the proportion of non-retail business acres per town (INDUS); at least in 1978 metro Boston, older towns are more industrial.
|
||||
|
||||
Graphically, we can see the correlations using a heatmap from the Seaborn library:
|
||||
|
||||
|
||||
```python
|
||||
import seaborn as sns
|
||||
sns.heatmap(df.corr(),cmap=sns.cubehelix_palette(20, light=0.95, dark=0.15))
|
||||
```
|
||||
|
||||
Histograms are another valuable tool for investigating your data. For example, what is the overall distribution of prices of owner-occupied houses in the Boston area?
|
||||
|
||||
|
||||
```python
|
||||
import matplotlib.pyplot as plt
|
||||
plt.hist(df['MEDV'])
|
||||
```
|
||||
|
||||
The default bin size for the matplotlib histogram (essentially, how wide the buckets of values included in each histogram bar are) is pretty large and might mask smaller details. To get a finer-grained view of the MEDV column, you can manually increase the number of bins in the histogram:
|
||||
|
||||
|
||||
```python
|
||||
plt.hist(df['MEDV'],bins=50)
|
||||
```
|
||||
|
||||
Seaborn has a somewhat more attractive version of the standard matplotlib histogram: the distribution plot. This is a combination histogram and kernel density estimate (KDE) plot (essentially a smoothed histogram):
|
||||
|
||||
|
||||
```python
|
||||
sns.distplot(df['MEDV'])
|
||||
```
|
||||
|
||||
Another commonly used plot is the Seaborn jointplot, which combines histograms for two columns along with a scatterplot:
|
||||
|
||||
|
||||
```python
|
||||
sns.jointplot(df['RM'], df['MEDV'], kind='scatter')
|
||||
```
|
||||
|
||||
Unfortunately, many of the dots print over each other. You can help address this by adding some alpha blending, a setting that makes the dots partially transparent so that concentrations of them drawing over one another become apparent:
|
||||
|
||||
|
||||
```python
|
||||
sns.jointplot(df['RM'], df['MEDV'], kind='scatter', alpha=0.3)
|
||||
```
|
||||
|
||||
Another way to see patterns in your data is with a two-dimensional KDE plot. Darker colors here represent a higher concentration of data points:
|
||||
|
||||
|
||||
```python
|
||||
sns.kdeplot(df['RM'], df['MEDV'], shade=True)
|
||||
```
|
||||
|
||||
Note that while the KDE plot is very good at showing concentrations of data points, finer structures like linear relationships (such as the clear relationship between the number of rooms in homes and the house price) are lost in the KDE plot.
|
||||
|
||||
Finally, the pairplot in Seaborn allows you to see scatterplots and histograms for several columns in one table. Here we have played with some of the keywords to produce a more sophisticated and easier to read pairplot that incorporates both alpha blending and linear regression lines for the scatterplots.
|
||||
|
||||
|
||||
```python
|
||||
sns.pairplot(df[['RM', 'AGE', 'LSTAT', 'DIS', 'MEDV']], kind="reg", plot_kws={'line_kws':{'color':'red'}, 'scatter_kws': {'alpha': 0.1}})
|
||||
```
|
||||
|
||||
Visualization is the start of the really cool, fun part of data science. So play around with these visualization tools and see what you can learn from the data!
|
||||
|
||||
> **Takeaway:** An old joke goes: “What does a data scientist see when they look at a dataset? A bunch of numbers.” There is more than a little truth in that joke. Visualization is often the key to finding patterns and correlations in your data. While visualization cannot often deliver precise results, it can point you in the right direction to ask better questions and efficiently find value in the data.
|
|
|
|||
|
||||
# Machine Learning in Python
|
||||
The content for this notebook was copied from [The Deep Learning Machine Learning in Python lab](https://github.com/Microsoft/computerscience/tree/master/Labs/Deep%20Learning/200%20-%20Machine%20Learning%20in%20Python).
|
||||
This demo shows prediction of flight delays between airport pairs based on the day of the month using a random forest.
|
||||
The demo concludes by visualizing the probability of on-time arrival between JFK and Atlanta Hartsfield-Jackson over consecutive days.
|
||||
|
||||
In this exercise, you will import a dataset from Azure blob storage and load it into the notebook. Jupyter notebooks are highly interactive, and since they can include executable code, they provide the perfect platform for manipulating data and building predictive models from it.
|
||||
|
||||
## Ingest
|
||||
|
||||
cURL is a command-line tool for transferring data to or from servers using protocols such as HTTP, HTTPS, FTP, and FTPS.
|
||||
In the code cell below, cURL is used to download the flight data from public blob storage to the working directory.
|
||||
|
||||
|
||||
```python
|
||||
!curl https://topcs.blob.core.windows.net/public/FlightData.csv -o flightdata.csv
|
||||
```
|
||||
|
||||
Pandas will be used here to create a data frame in which the data will be manipulated and massaged for enhanced analysis.
|
||||
|
||||
Import the data, create a pandas DataFrame from it, and display the first five rows.
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
|
||||
df = pd.read_csv('flightdata.csv')
|
||||
df.head()
|
||||
```
|
||||
|
||||
The DataFrame that you created contains on-time arrival information for a major U.S. airline. It has more than 11,000 rows and 26 columns. (The output says "5 rows" because DataFrame's head function only returns the first five rows.) Each row represents one flight and contains information such as the origin, the destination, the scheduled departure time, and whether the flight arrived on time or late. You will learn more about the data, including its content and structure, in the next lab.
|
||||
|
||||
## Process
|
||||
|
||||
In the real world, few datasets can be used as is to train machine-learning models. It is not uncommon for data scientists to spend 80% or more of their time on a project cleaning, preparing, and shaping the data — a process sometimes referred to as data wrangling. Typical actions include removing duplicate rows, removing rows or columns with missing values or algorithmically replacing the missing values, normalizing data, and selecting feature columns. A machine-learning model is only as good as the data it is trained with. Preparing the data is arguably the most crucial step in the machine-learning process.
|
||||
|
||||
Before you can prepare a dataset, you need to understand its content and structure. In the previous steps, you imported a dataset containing on-time arrival information for a major U.S. airline. That data included 26 columns and thousands of rows, with each row representing one flight and containing information such as the flight's origin, destination, and scheduled departure time. You also loaded the data into the Jupyter notebook and used a simple Python script to create a pandas DataFrame from it.
|
||||
|
||||
To get a count of rows and columns, run the following code:
|
||||
|
||||
|
||||
```python
|
||||
df.shape
|
||||
```
|
||||
|
||||
Now take a moment to examine the 26 columns in the dataset. They contain important information such as the date that the flight took place (YEAR, MONTH, and DAY_OF_MONTH), the origin and destination (ORIGIN and DEST), the scheduled departure and arrival times (CRS_DEP_TIME and CRS_ARR_TIME), the difference between the scheduled arrival time and the actual arrival time in minutes (ARR_DELAY), and whether the flight was late by 15 minutes or more (ARR_DEL15).
|
||||
|
||||
Here is a complete list of the columns in the dataset. Times are expressed in 24-hour military time. For example, 1130 equals 11:30 a.m. and 1500 equals 3:00 p.m.
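You can also print the column names directly from the `DataFrame` to see them all at once:


```python
# List every column name in the flight data
list(df.columns)
```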
|
||||
|
||||
One of the first things data scientists typically look for in a dataset is missing values. There's an easy way to check for missing values in pandas. To demonstrate, execute the following code:
|
||||
|
||||
|
||||
```python
|
||||
df.isnull().values.any()
|
||||
```
|
||||
|
||||
The next step is to find out where the missing values are. To do so, execute the following code:
|
||||
|
||||
|
||||
```python
|
||||
df.isnull().sum()
|
||||
```
|
||||
|
||||
Curiously, the 26th column ("Unnamed: 25") contains 11,231 missing values, which equals the number of rows in the dataset. This column was mistakenly created because the CSV file that you imported contains a comma at the end of each line. To eliminate that column, execute the following code:
|
||||
|
||||
|
||||
```python
|
||||
df = df.drop('Unnamed: 25', axis=1)
|
||||
df.isnull().sum()
|
||||
```
|
||||
|
||||
The DataFrame still contains a lot of missing values, but some of them are irrelevant because the columns containing them are not germane to the model that you are building. The goal of that model is to predict whether a flight you are considering booking is likely to arrive on time. If you know that the flight is likely to be late, you might choose to book another flight.
|
||||
|
||||
The next step, therefore, is to filter the dataset to eliminate columns that aren't relevant to a predictive model. For example, the aircraft's tail number probably has little bearing on whether a flight will arrive on time, and at the time you book a ticket, you have no way of knowing whether a flight will be cancelled, diverted, or delayed. By contrast, the scheduled departure time could have a lot to do with on-time arrivals. Because of the hub-and-spoke system used by most airlines, morning flights tend to be on time more often than afternoon or evening flights. And at some major airports, traffic stacks up during the day, increasing the likelihood that later flights will be delayed.
|
||||
|
||||
Pandas provides an easy way to filter out columns you don't want. Execute the following code:
|
||||
|
||||
|
||||
```python
|
||||
df = df[["MONTH", "DAY_OF_MONTH", "DAY_OF_WEEK", "ORIGIN", "DEST", "CRS_DEP_TIME", "ARR_DEL15"]]
|
||||
df.isnull().sum()
|
||||
```
|
||||
|
||||
The only column that now contains missing values is the ARR_DEL15 column, which uses 0s to identify flights that arrived on time and 1s for flights that didn't. Use the following code to show the first five rows with missing values:
|
||||
|
||||
|
||||
```python
|
||||
df[df.isnull().values.any(axis=1)].head()
|
||||
```
|
||||
|
||||
The reason these rows are missing ARR_DEL15 values is that they all correspond to flights that were canceled or diverted. You could call dropna on the DataFrame to remove these rows. But since a flight that is canceled or diverted to another airport could be considered "late," let's use the fillna method to replace the missing values with 1s.
|
||||
|
||||
Use the following code to replace missing values in the ARR_DEL15 column with 1s and display rows 177 through 184:
|
||||
|
||||
|
||||
```python
|
||||
df = df.fillna({'ARR_DEL15': 1})
|
||||
df.iloc[177:185]
|
||||
```
|
||||
|
||||
Use the following code to display the first five rows of the DataFrame:
|
||||
|
||||
|
||||
```python
|
||||
df.head()
|
||||
```
|
||||
|
||||
The CRS_DEP_TIME column of the dataset you are using represents scheduled departure times. The granularity of the numbers in this column — it contains more than 500 unique values — could have a negative impact on accuracy in a machine-learning model. This can be resolved using a technique called binning or quantization. What if you divided each number in this column by 100 and rounded down to the nearest integer? 1030 would become 10, 1925 would become 19, and so on, and you would be left with a maximum of 24 discrete values in this column. Intuitively, it makes sense, because it probably doesn't matter much whether a flight leaves at 10:30 a.m. or 10:40 a.m. It matters a great deal whether it leaves at 10:30 a.m. or 5:30 p.m.
|
||||
|
||||
In addition, the dataset's ORIGIN and DEST columns contain airport codes that represent categorical machine-learning values. These columns need to be converted into discrete columns containing indicator variables, sometimes known as "dummy" variables. In other words, the ORIGIN column, which contains five airport codes, needs to be converted into five columns, one per airport, with each column containing 1s and 0s indicating whether a flight originated at the airport that the column represents. The DEST column needs to be handled in a similar manner.
|
||||
|
||||
In this portion of the exercise, you will "bin" the departure times in the CRS_DEP_TIME column and use pandas' get_dummies method to create indicator columns from the ORIGIN and DEST columns.
|
||||
|
||||
Use the following code to bin the departure times:
|
||||
|
||||
|
||||
```python
|
||||
import math
|
||||
|
||||
for index, row in df.iterrows():
|
||||
df.loc[index, 'CRS_DEP_TIME'] = math.floor(row['CRS_DEP_TIME'] / 100)
|
||||
df.head()
|
||||
```
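The loop above works, but iterating over rows is slow on larger datasets. A vectorized sketch that performs the same binning (run it *instead of* the loop above, not after it, since the times would otherwise be divided twice):


```python
# Integer-divide each scheduled departure time by 100 to bin it into an hour-of-day bucket
df['CRS_DEP_TIME'] = (df['CRS_DEP_TIME'] // 100).astype(int)
df.head()
```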
|
||||
|
||||
Now use the following statements to generate indicator columns from the ORIGIN and DEST columns, while dropping the ORIGIN and DEST columns themselves:
|
||||
|
||||
|
||||
```python
|
||||
df = pd.get_dummies(df, columns=['ORIGIN', 'DEST'])
|
||||
df.head()
|
||||
```
|
||||
|
||||
## Predict
|
||||
|
||||
Machine learning, which facilitates predictive analytics using large volumes of data by employing algorithms that iteratively learn from that data, is one of the fastest growing areas of data science.
|
||||
|
||||
One of the most popular tools for building machine-learning models is Scikit-learn, a free and open-source toolkit for Python programmers. It has built-in support for popular regression, classification, and clustering algorithms and works with other Python libraries such as NumPy and SciPy. With Scikit-learn, a simple method call can replace hundreds of lines of hand-written code. Scikit-learn allows you to focus on building, training, tuning, and testing machine-learning models without getting bogged down coding algorithms.
|
||||
|
||||
In this lab, the third of four in a series, you will use Scikit-learn to build a machine-learning model utilizing on-time arrival data for a major U.S. airline. The goal is to create a model that might be useful in the real world for predicting whether a flight is likely to arrive on time. It is precisely the kind of problem that machine learning is commonly used to solve. And it's a great way to increase your machine-learning chops while getting acquainted with Scikit-learn.
|
||||
|
||||
The first statement imports Scikit-learn's train_test_split helper function. The second line uses the function to split the DataFrame into a training set containing 80% of the original data, and a test set containing the remaining 20%. The random_state parameter seeds the random-number generator used to do the splitting, while the first and second parameters are DataFrames containing the feature columns and the label column.
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
train_x, test_x, train_y, test_y = train_test_split(df.drop('ARR_DEL15', axis=1), df['ARR_DEL15'], test_size=0.2, random_state=42)
|
||||
```
|
||||
|
||||
train_test_split returns four DataFrames. Use the following command to display the number of rows and columns in the DataFrame containing the feature columns used for training:
|
||||
|
||||
|
||||
```python
|
||||
train_x.shape
|
||||
```
|
||||
|
||||
Now use this command to display the number of rows and columns in the DataFrame containing the feature columns used for testing:
|
||||
|
||||
|
||||
```python
|
||||
test_x.shape
|
||||
```
|
||||
|
||||
You will train a classification model, which seeks to resolve a set of inputs into one of a set of known outputs.
|
||||
|
||||
Scikit-learn includes a variety of classes for implementing common machine-learning models. One of them is RandomForestClassifier, which fits multiple decision trees to the data and uses averaging to boost the overall accuracy and limit overfitting.
|
||||
|
||||
Execute the following code to create a RandomForestClassifier object and train it by calling the fit method.
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.ensemble import RandomForestClassifier
|
||||
|
||||
model = RandomForestClassifier(random_state=13)
|
||||
model.fit(train_x, train_y)
|
||||
```
|
||||
|
||||
The output shows the parameters used in the classifier, including n_estimators, which specifies the number of trees in each decision-tree forest, and max_depth, which specifies the maximum depth of the decision trees. The values shown are the defaults, but you can override any of them when creating the RandomForestClassifier object.
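For instance, here is a hypothetical variant (the specific values are illustrative, not recommendations from the lab) that overrides two of those defaults:


```python
from sklearn.ensemble import RandomForestClassifier

# Override two defaults purely for illustration: 50 trees, tree depth capped at 10
tuned_model = RandomForestClassifier(n_estimators=50, max_depth=10, random_state=13)
tuned_model.fit(train_x, train_y)
```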
|
||||
|
||||
Now call the predict method to test the model using the values in test_x, followed by the score method to determine the mean accuracy of the model:
|
||||
|
||||
|
||||
```python
|
||||
predicted = model.predict(test_x)
|
||||
model.score(test_x, test_y)
|
||||
```
|
||||
|
||||
There are several ways to measure the accuracy of a classification model. One of the best overall measures for a binary classification model is Area Under Receiver Operating Characteristic Curve (sometimes referred to as "ROC AUC"), which essentially quantifies how often the model will make a correct prediction regardless of the outcome. In this exercise, you will compute an ROC AUC score for the model you built in the previous exercise and learn about some of the reasons why that score is lower than the mean accuracy output by the score method. You will also learn about other ways to gauge the accuracy of the model.
|
||||
|
||||
Before you compute the ROC AUC, you must generate prediction probabilities for the test set. These probabilities are estimates for each of the classes, or answers, the model can predict. For example, [0.88199435, 0.11800565] means that there's an 88% chance that a flight will arrive on time (ARR_DEL15 = 0) and a 12% chance that it won't (ARR_DEL15 = 1). The sum of the two probabilities adds up to 100%.
|
||||
|
||||
Run the following code to generate a set of prediction probabilities from the test data:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.metrics import roc_auc_score
|
||||
probabilities = model.predict_proba(test_x)
|
||||
```
|
||||
|
||||
Now use the following statement to generate an ROC AUC score from the probabilities using Scikit-learn's roc_auc_score method:
|
||||
|
||||
|
||||
```python
|
||||
roc_auc_score(test_y, probabilities[:, 1])
|
||||
```
|
||||
|
||||
Why is the AUC score lower than the mean accuracy computed in the previous exercise?
|
||||
|
||||
The output from the score method reflects how many of the items in the test set the model predicted correctly. This score is skewed by the fact that the dataset the model was trained and tested with contains many more rows representing on-time arrivals than rows representing late arrivals. Because of this imbalance in the data, you are more likely to be correct if you predict that a flight will be on time than if you predict that a flight will be late.
|
||||
|
||||
ROC AUC takes this into account and provides a more accurate indication of how likely it is that a prediction of on-time or late will be correct.
|
||||
|
||||
You can learn more about the model's behavior by generating a confusion matrix, also known as an error matrix. The confusion matrix quantifies the number of times each answer was classified correctly or incorrectly. Specifically, it quantifies the number of false positives, false negatives, true positives, and true negatives. This is important, because if a binary classification model trained to recognize cats and dogs is tested with a dataset that is 95% dogs, it could score 95% simply by guessing "dog" every time. But if it failed to identify cats at all, it would be of little value.
|
||||
|
||||
Use the following code to produce a confusion matrix for your model:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.metrics import confusion_matrix
|
||||
confusion_matrix(test_y, predicted)
|
||||
```
|
||||
|
||||
The first row in the output represents flights that were on time. The first column in that row shows how many flights were correctly predicted to be on time, while the second column reveals how many flights were predicted as delayed but were not. From this, the model appears to be very adept at predicting that a flight will be on time.
|
||||
|
||||
But look at the second row, which represents flights that were delayed. The first column shows how many delayed flights were incorrectly predicted to be on time. The second column shows how many flights were correctly predicted to be delayed. Clearly, the model isn't nearly as adept at predicting that a flight will be delayed as it is at predicting that a flight will arrive on time. What you want in a confusion matrix is big numbers in the upper-left and lower-right corners, and small numbers (preferably zeros) in the upper-right and lower-left corners.
|
||||
|
||||
Other measures of accuracy for a classification model include precision and recall. Suppose the model was presented with three on-time arrivals and three delayed arrivals, and that it correctly predicted two of the on-time arrivals, but incorrectly predicted that two of the delayed arrivals would be on time. In this case, the precision would be 50% (two of the four flights it classified as being on time actually were on time), while its recall would be 67% (it correctly identified two of the three on-time arrivals). You can learn more about precision and recall from https://en.wikipedia.org/wiki/Precision_and_recall
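To make that arithmetic concrete, here is a small sketch (separate from the flight model) that treats "on time" as the positive class and reproduces the 50% precision and 67% recall figures:


```python
from sklearn.metrics import precision_score, recall_score

y_true = [1, 1, 1, 0, 0, 0]  # three actual on-time arrivals (1) and three delays (0)
y_pred = [1, 1, 0, 1, 1, 0]  # two on-time flights predicted correctly; two delays predicted as on time
precision_score(y_true, y_pred), recall_score(y_true, y_pred)  # (0.5, 0.666...)
```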
|
||||
|
||||
Scikit-learn contains a handy method named precision_score for computing precision. To quantify the precision of your model, execute the following statements:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.metrics import precision_score
|
||||
|
||||
train_predictions = model.predict(train_x)
|
||||
precision_score(train_y, train_predictions)
|
||||
```
|
||||
|
||||
Scikit-learn also contains a method named recall_score for computing recall. To measure your model's recall, execute the following statements:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.metrics import recall_score
|
||||
|
||||
recall_score(train_y, train_predictions)
|
||||
```
|
||||
|
||||
## Visualize
|
||||
|
||||
Now that you have trained a machine-learning model to perform predictive analytics, it's time to put it to work. In this lab, the final one in the series, you will write a function that uses the machine-learning model you built in the previous lab to predict whether a flight will arrive on time or late. And you will use Matplotlib, the popular plotting and charting library for Python, to visualize the results.
|
||||
|
||||
The first statement is one of several magic commands supported by the Python kernel that you selected when you created the notebook. It enables Jupyter to render Matplotlib output in a notebook without making repeated calls to show. And it must appear before any references to Matplotlib itself. The final statement configures Seaborn to enhance the output from Matplotlib.
|
||||
|
||||
Execute the following code. Ignore any warning messages that are displayed related to font caching:
|
||||
|
||||
|
||||
```python
|
||||
%matplotlib inline
|
||||
import matplotlib.pyplot as plt
|
||||
import seaborn as sns
|
||||
|
||||
sns.set()
|
||||
```
|
||||
|
||||
|
||||
|
||||
To see Matplotlib at work, execute the following code in a new cell to plot the ROC curve for the machine-learning model you built in the previous lab:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.metrics import roc_curve
|
||||
|
||||
fpr, tpr, _ = roc_curve(test_y, probabilities[:, 1])
|
||||
plt.plot(fpr, tpr)
|
||||
plt.plot([0, 1], [0, 1], color='grey', lw=1, linestyle='--')
|
||||
plt.xlabel('False Positive Rate')
|
||||
plt.ylabel('True Positive Rate')
|
||||
```
|
||||
|
||||
The dotted line in the middle of the graph represents a 50-50 chance of obtaining a correct answer. The blue curve represents the accuracy of your model. More importantly, the fact that this chart appears at all demonstrates that you can use Matplotlib in a Jupyter notebook.
|
||||
|
||||
The reason you built a machine-learning model is to predict whether a flight will arrive on time or late. In this exercise, you will write a Python function that calls the machine-learning model you built in the previous lab to compute the likelihood that a flight will be on time. Then you will use the function to analyze several flights.
|
||||
|
||||
This function takes as input a date and time, an origin airport code, and a destination airport code, and returns a value between 0.0 and 1.0 indicating the probability that the flight will arrive at its destination on time. It uses the machine-learning model you built in the previous lab to compute the probability. And to call the model, it passes a DataFrame containing the input values to predict_proba. The structure of the DataFrame exactly matches the structure of the DataFrame depicted in previous steps.
|
||||
|
||||
|
||||
```python
|
||||
def predict_delay(departure_date_time, origin, destination):
|
||||
from datetime import datetime
|
||||
|
||||
try:
|
||||
departure_date_time_parsed = datetime.strptime(departure_date_time, '%d/%m/%Y %H:%M:%S')
|
||||
except ValueError as e:
|
||||
return 'Error parsing date/time - {}'.format(e)
|
||||
|
||||
month = departure_date_time_parsed.month
|
||||
day = departure_date_time_parsed.day
|
||||
day_of_week = departure_date_time_parsed.isoweekday()
|
||||
hour = departure_date_time_parsed.hour
|
||||
|
||||
origin = origin.upper()
|
||||
destination = destination.upper()
|
||||
|
||||
input = [{'MONTH': month,
|
||||
'DAY': day,
|
||||
'DAY_OF_WEEK': day_of_week,
|
||||
'CRS_DEP_TIME': hour,
|
||||
'ORIGIN_ATL': 1 if origin == 'ATL' else 0,
|
||||
'ORIGIN_DTW': 1 if origin == 'DTW' else 0,
|
||||
'ORIGIN_JFK': 1 if origin == 'JFK' else 0,
|
||||
'ORIGIN_MSP': 1 if origin == 'MSP' else 0,
|
||||
'ORIGIN_SEA': 1 if origin == 'SEA' else 0,
|
||||
'DEST_ATL': 1 if destination == 'ATL' else 0,
|
||||
'DEST_DTW': 1 if destination == 'DTW' else 0,
|
||||
'DEST_JFK': 1 if destination == 'JFK' else 0,
|
||||
'DEST_MSP': 1 if destination == 'MSP' else 0,
|
||||
'DEST_SEA': 1 if destination == 'SEA' else 0 }]
|
||||
|
||||
return model.predict_proba(pd.DataFrame(input))[0][0]
|
||||
```
|
||||
|
||||
Use the code below to compute the probability that a flight from New York to Atlanta on the evening of October 1 will arrive on time. Note that the year you enter is irrelevant because it isn't used by the model.
|
||||
|
||||
|
||||
```python
|
||||
predict_delay('1/10/2018 21:45:00', 'JFK', 'ATL')
|
||||
```
|
||||
|
||||
Modify the code to compute the probability that the same flight a day later will arrive on time:
|
||||
|
||||
|
||||
```python
|
||||
predict_delay('2/10/2018 21:45:00', 'JFK', 'ATL')
|
||||
```
|
||||
|
||||
How likely is this flight to arrive on time? If your travel plans were flexible, would you consider postponing your trip for one day?
|
||||
|
||||
Now modify the code to compute the probability that a morning flight the same day from Atlanta to Seattle will arrive on time:
|
||||
|
||||
|
||||
```python
|
||||
predict_delay('2/10/2018 10:00:00', 'ATL', 'SEA')
|
||||
```
|
||||
|
||||
In this exercise, you will combine the predict_delay function you created in the previous exercise with Matplotlib to produce side-by-side comparisons of the same flight on consecutive days and flights with the same origin and destination at different times throughout the day.
|
||||
|
||||
|
||||
```python
|
||||
import numpy as np
|
||||
|
||||
labels = ('Oct 1', 'Oct 2', 'Oct 3', 'Oct 4', 'Oct 5', 'Oct 6', 'Oct 7')
|
||||
values = (predict_delay('1/10/2018 21:45:00', 'JFK', 'ATL'),
|
||||
predict_delay('2/10/2018 21:45:00', 'JFK', 'ATL'),
|
||||
predict_delay('3/10/2018 21:45:00', 'JFK', 'ATL'),
|
||||
predict_delay('4/10/2018 21:45:00', 'JFK', 'ATL'),
|
||||
predict_delay('5/10/2018 21:45:00', 'JFK', 'ATL'),
|
||||
predict_delay('6/10/2018 21:45:00', 'JFK', 'ATL'),
|
||||
predict_delay('7/10/2018 21:45:00', 'JFK', 'ATL'))
|
||||
alabels = np.arange(len(labels))
|
||||
|
||||
plt.bar(alabels, values, align='center', alpha=0.5)
|
||||
plt.xticks(alabels, labels)
|
||||
plt.ylabel('Probability of On-Time Arrival')
|
||||
plt.ylim((0.0, 1.0))
|
||||
```
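The second comparison described above — the same origin and destination at different times throughout the day — follows the same pattern. A minimal sketch, reusing `predict_delay` and the plotting code from the cell above (the departure times chosen here are only illustrative):

```python
labels = ('6:00', '10:00', '14:00', '18:00', '22:00')
values = (predict_delay('2/10/2018 06:00:00', 'JFK', 'ATL'),
          predict_delay('2/10/2018 10:00:00', 'JFK', 'ATL'),
          predict_delay('2/10/2018 14:00:00', 'JFK', 'ATL'),
          predict_delay('2/10/2018 18:00:00', 'JFK', 'ATL'),
          predict_delay('2/10/2018 22:00:00', 'JFK', 'ATL'))
alabels = np.arange(len(labels))

plt.bar(alabels, values, align='center', alpha=0.5)
plt.xticks(alabels, labels)
plt.ylabel('Probability of On-Time Arrival')
plt.ylim((0.0, 1.0))
```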
|
||||
|
||||
Referenced from https://github.com/Microsoft/computerscience/tree/master/Labs/Deep%20Learning/200%20-%20Machine%20Learning%20in%20Python, 12/17/2018
|
Binary data
Data Science 2_Beginners Data Science for Python Developers/.DS_Store
vendored
Normal file
Binary file not shown.
|
@ -0,0 +1,477 @@
|
|||
|
||||
# Section 1: Introduction to machine learning models
|
||||
|
||||
|
||||
## A quick aside: types of ML
|
||||
|
||||
As you get deeper into data science, it might seem like there are a bewildering array of ML algorithms out there. However many you encounter, it can be handy to remember that most ML algorithms fall into three broad categories:
|
||||
- **Predictive algorithms**: These analyze current and historical facts to make predictions about unknown events, such as the future or customers’ choices.
|
||||
- **Classification algorithms**: These teach a program from a body of data, and the program then uses that learning to classify new observations.
|
||||
- **Time-series forecasting algorithms**: While it can be argued that these algorithms are a subset of predictive algorithms, their techniques are specialized enough that they function in many ways like a separate category. Time-series forecasting is beyond the scope of this course; we have more than enough to work on here with prediction and classification.
|
||||
|
||||
## Prediction: linear regression
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable fitting linear regression models, and you should have some familiarity with interpreting their output.
|
||||
|
||||
### Data exploration
|
||||
|
||||
**Import Libraries**
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
%matplotlib inline
|
||||
import seaborn as sns
|
||||
```
|
||||
|
||||
**Dataset Alert**: US Housing Dataset (`Housing_Dataset_Sample.csv`)
|
||||
|
||||
|
||||
```python
|
||||
df = pd.read_csv('Data/Housing_Dataset_Sample.csv')
|
||||
df.head()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Do you remember the DataFrame method for looking at overall information
|
||||
# about a DataFrame, such as number of columns and rows? Try it here.
|
||||
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df.describe().T
|
||||
```
|
||||
|
||||
**Price Column**
|
||||
|
||||
|
||||
```python
|
||||
sns.distplot(df['Price'])
|
||||
```
|
||||
|
||||
**House Prices vs Average Area Income**
|
||||
|
||||
|
||||
```python
|
||||
sns.jointplot(df['Avg. Area Income'],df['Price'])
|
||||
```
|
||||
|
||||
**All Columns**
|
||||
|
||||
|
||||
```python
|
||||
sns.pairplot(df)
|
||||
```
|
||||
|
||||
**Some observations**
|
||||
1. Several column pairs just look like blobs, with no clear linear relationship
|
||||
2. The lane-like distortions are an artifact of discrete data (e.g., no one has 0.3 bedrooms)
|
||||
|
||||
|
||||
### Fitting the model
|
||||
|
||||
**Can We Predict Housing Prices?**
|
||||
|
||||
|
||||
```python
|
||||
X = df.iloc[:,:5] # First 5 Columns
|
||||
y = df['Price'] # Price Column
|
||||
```
|
||||
|
||||
**Train, Test, Split**
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=54)
|
||||
```
|
||||
|
||||
**Fit to Linear Regression Model**
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.linear_model import LinearRegression
|
||||
|
||||
reg = LinearRegression()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
reg.fit(X_train,y_train)
|
||||
```
|
||||
|
||||
### Evaluating the model
|
||||
|
||||
**Predict**
|
||||
|
||||
|
||||
```python
|
||||
predictions = reg.predict(X_test)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
predictions
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(reg.intercept_,reg.coef_)
|
||||
```
|
||||
|
||||
**Score**
|
||||
|
||||
|
||||
```python
|
||||
#Explained variation. A high R2 close to 1 indicates better prediction with less error.
|
||||
from sklearn.metrics import r2_score
|
||||
|
||||
r2_score(y_test,predictions)
|
||||
```
|
||||
|
||||
**Visualize Errors**
|
||||
|
||||
|
||||
```python
|
||||
sns.distplot([y_test-predictions])
|
||||
```
|
||||
|
||||
**Visualize Predictions**
|
||||
|
||||
|
||||
```python
|
||||
# Plot outputs
|
||||
plt.scatter(y_test,predictions, color='blue')
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
Can you think of a way to refine this visualization to make it clearer, particularly if you were explaining the results to someone?
|
||||
|
||||
|
||||
```python
|
||||
# Hint: Remember to try the plt.scatter parameter alpha=.
|
||||
# It takes values between 0 and 1.
|
||||
|
||||
```
|
||||
|
||||
> **Takeaway:** In this subsection, you performed prediction using linear regression by exploring your data, then fitting your model, and finally evaluating your model’s performance.
|
||||
|
||||
## Classification: logistic regression
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should know how logistic regression differs from linear regression, be comfortable fitting logistic regression models, and have some familiarity with interpreting their output.
|
||||
|
||||
**Dataset Alert**: Fates of RMS Titanic Passengers
|
||||
|
||||
The dataset has 12 variables:
|
||||
- **PassengerId**
|
||||
- **Survived:** 0 = No, 1 = Yes
|
||||
- **Pclass:** Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
|
||||
- **Sex**
|
||||
- **Age**
|
||||
- **Sibsp:** Number of siblings or spouses aboard the *Titanic*
|
||||
- **Parch:** Number of parents or children aboard the *Titanic*
|
||||
- **Ticket:** Passenger ticket number
|
||||
- **Fare:** Passenger fare
|
||||
- **Cabin:** Cabin number
|
||||
- **Embarked:** Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
|
||||
|
||||
|
||||
```python
|
||||
df = pd.read_csv('Data/train_data_titanic.csv')
|
||||
df.head()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df.info()
|
||||
```
|
||||
|
||||
|
||||
|
||||
### Remove extraneous variables
|
||||
|
||||
|
||||
```python
|
||||
df.drop(['Name','Ticket'],axis=1,inplace=True)
|
||||
```
|
||||
|
||||
|
||||
|
||||
### Check for multicollinearity
|
||||
|
||||
**Question**: Do any correlations between **Survived** and **Fare** jump out?
|
||||
|
||||
|
||||
```python
|
||||
sns.pairplot(df[['Survived','Fare']], dropna=True)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Try running sns.pairplot twice more on some other combinations of columns
|
||||
# and see if any patterns emerge.
|
||||
|
||||
```
|
||||
|
||||
We can also use `groupby` to look for patterns. Consider the mean values for the various variables when we group by **Survived**:
|
||||
|
||||
|
||||
```python
|
||||
df.groupby('Survived').mean()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df.head()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['SibSp'].value_counts()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['Parch'].value_counts()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['Sex'].value_counts()
|
||||
```
|
||||
|
||||
### Handle missing values
|
||||
|
||||
|
||||
|
||||
```python
|
||||
# missing
|
||||
df.isnull().sum()>(len(df)/2)
|
||||
```
|
||||
|
||||
|
||||
|
||||
|
||||
|
||||
```python
|
||||
df.drop('Cabin',axis=1,inplace=True)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df.info()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['Age'].isnull().value_counts()
|
||||
```
|
||||
|
||||
### Correlation Exploration
|
||||
|
||||
|
||||
```python
|
||||
df.groupby('Sex')['Age'].median().plot(kind='bar')
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['Age'] = df.groupby('Sex')['Age'].apply(lambda x: x.fillna(x.median()))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df.isnull().sum()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['Embarked'].value_counts()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['Embarked'].fillna(df['Embarked'].value_counts().idxmax(), inplace=True)
|
||||
df['Embarked'].value_counts()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df = pd.get_dummies(data=df, columns=['Sex', 'Embarked'],drop_first=True)
|
||||
df.head()
|
||||
```
|
||||
|
||||
**Correlation Matrix**
|
||||
|
||||
|
||||
```python
|
||||
df.corr()
|
||||
```
|
||||
|
||||
**Define X and Y**
|
||||
|
||||
|
||||
```python
|
||||
X = df.drop(['Survived','Pclass'],axis=1)
|
||||
y = df['Survived']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=67)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
We now need to split the training and test data, which you will do as an exercise:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
# Look up in the portion above on linear regression and use train_test_split here.
|
||||
# Set test_size = 0.3 and random_state = 67 to get the same results as below when
|
||||
# you run through the rest of the code example below.
|
||||
|
||||
```
|
||||
|
||||
**Use Logistic Regression Model**
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
lr = LogisticRegression()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
lr.fit(X_train,y_train)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
predictions = lr.predict(X_test)
|
||||
```
|
||||
|
||||
### Evaluate the model
|
||||
|
||||
|
||||
#### Classification report
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
|
||||
```
|
||||
|
||||
The classification report shows the proportions of both survivors and non-survivors with four scores:
|
||||
- **Precision:** The number of true positives divided by the sum of true positives and false positives; closer to 1 is better.
|
||||
- **Recall:** The true-positive rate, the number of true positives divided by the sum of the true positives and the false negatives.
|
||||
- **F1 score:** The harmonic mean (the average for rates) of precision and recall.
|
||||
- **Support:** The number of true instances for each label.
|
||||
|
||||
|
||||
```python
|
||||
print(classification_report(y_test,predictions))
|
||||
```
|
||||
|
||||
#### Confusion matrix
|
||||
|
||||
|
||||
```python
|
||||
print(confusion_matrix(y_test,predictions))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.DataFrame(confusion_matrix(y_test, predictions), columns=['Predicted Not Survived', 'Predicted Survived'], index=['True Not Survived', 'True Survived'])
|
||||
```
|
||||
|
||||
#### Accuracy score
|
||||
|
||||
|
||||
```python
|
||||
print(accuracy_score(y_test,predictions))
|
||||
```
|
||||
|
||||
> **Takeaway:** In this subsection, you performed classification using logistic regression by removing extraneous variables, checking for multicollinearity, handling missing values, and fitting and evaluating your model.
|
||||
|
||||
## Classification: decision trees
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable fitting decision-tree models and have some understanding of what they output.
|
||||
|
||||
|
||||
```python
|
||||
from sklearn import tree
|
||||
tr = tree.DecisionTreeClassifier()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Using the same split data as with the logistic regression,
|
||||
# can you fit the decision tree model?
|
||||
# Hint: Refer to code snippet for fitting the logistic regression above.
|
||||
|
||||
```
|
||||
|
||||
**Note**: Using the same Titanic Data
|
||||
|
||||
|
||||
```python
|
||||
tr.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
tr_predictions = tr.predict(X_test)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.DataFrame(confusion_matrix(y_test, tr_predictions),
|
||||
             columns=['Predicted Not Survived', 'Predicted Survived'],
|
||||
             index=['True Not Survived', 'True Survived'])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(accuracy_score(y_test,tr_predictions))
|
||||
```
|
||||
|
||||
**Visualize tree**
|
||||
|
||||
|
||||
```python
|
||||
import graphviz
|
||||
|
||||
dot_file = tree.export_graphviz(tr, out_file=None,
|
||||
feature_names=X.columns,
|
||||
class_names=['Not Survived', 'Survived'],
|
||||
filled=True,rounded=True)
|
||||
graph = graphviz.Source(dot_file)
|
||||
graph
|
||||
```
|
||||
|
||||
> **Takeaway:** In this subsection, you performed classification on previously cleaned data by fitting and evaluating a decision tree.
|
|
@ -0,0 +1,224 @@
|
|||
|
||||
# Section 2: Cloud-based machine learning
|
||||
|
||||
> **Note:** The `azureml` package presently works only with Python 2. If your notebook is not currently running Python 2, change it in the menu at the top of the notebook by clicking **Kernel > Change kernel > Python 2**.
|
||||
|
||||
## Create and connect to an Azure ML Studio workspace
|
||||
|
||||
The `azureml` package is installed by default with Azure Notebooks, so we don't have to worry about that. It uses an Azure ML Studio workspace ID and authorization token to connect your notebook to the workspace; you will obtain the ID and token by following these steps:
|
||||
|
||||
1. Open [Azure ML Studio](https://studio.azureml.net) in a new browser tab and sign in with a Microsoft account. Azure ML Studio is free and does not require an Azure subscription. Once signed in with your Microsoft account (the same credentials you’ve used for Azure Notebooks), you're in your “workspace.”
|
||||
|
||||
2. On the left pane, click **Settings**.
|
||||
|
||||
![Settings button](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-settings.png?raw=true)<br><br>
|
||||
|
||||
3. On the **Name** tab, the **Workspace ID** field contains your workspace ID. Copy that ID into the `workspace_id` value in the code cell in Step 5 of the notebook below.
|
||||
|
||||
![Location of workspace ID](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-workspace-id.png?raw=true)<br><br>
|
||||
|
||||
4. Click the **Authorization Tokens** tab, and then copy either token into the `authorization_token` value in the code cell in Step 5 of the notebook.
|
||||
|
||||
![Location of authorization token](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-tokens.png?raw=true)<br><br>
|
||||
|
||||
5. Run the code cell below; if it runs without error, you're ready to continue.
|
||||
|
||||
|
||||
```python
|
||||
from azureml import Workspace
|
||||
|
||||
# Replace the values with those from your own Azure ML Studio instance; see Prerequisites
|
||||
# The workspace_id is a string of hexadecimal characters; the token is a long string of random characters.
|
||||
workspace_id = 'your_workspace_id'
|
||||
authorization_token = 'your_auth_token'
|
||||
|
||||
ws = Workspace(workspace_id, authorization_token)
|
||||
```
|
||||
|
||||
## Explore forest fire data
|
||||
|
||||
Let’s look at a meteorological dataset collected by Cortez and Morais for 2007 to study the burned area of forest fires in the northeast region of Portugal.
|
||||
|
||||
> P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data.
|
||||
In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence,
|
||||
Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December,
|
||||
Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
|
||||
|
||||
The dataset contains the following features:
|
||||
|
||||
- **`X`**: x-axis spatial coordinate within the Montesinho park map: 1 to 9
|
||||
- **`Y`**: y-axis spatial coordinate within the Montesinho park map: 2 to 9
|
||||
- **`month`**: month of the year: "1" to "12" jan-dec
|
||||
- **`day`**: day of the week: "1" to "7" sun-sat
|
||||
- **`FFMC`**: FFMC index from the FWI system: 18.7 to 96.20
|
||||
- **`DMC`**: DMC index from the FWI system: 1.1 to 291.3
|
||||
- **`DC`**: DC index from the FWI system: 7.9 to 860.6
|
||||
- **`ISI`**: ISI index from the FWI system: 0.0 to 56.10
|
||||
- **`temp`**: temperature in Celsius degrees: 2.2 to 33.30
|
||||
- **`RH`**: relative humidity in %: 15.0 to 100
|
||||
- **`wind`**: wind speed in km/h: 0.40 to 9.40
|
||||
- **`rain`**: outside rain in mm/m2 : 0.0 to 6.4
|
||||
- **`area`**: the burned area of the forest (in ha): 0.00 to 1090.84
|
||||
|
||||
|
||||
Let's load the dataset and visualize the area that was burned in relation to the temperature in that region.
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
df = pd.DataFrame(pd.read_csv('Data/forestfires.csv'))
|
||||
%matplotlib inline
|
||||
from ggplot import *
|
||||
ggplot(aes(x='temp', y='area'), data=df) + geom_line() + geom_point()
|
||||
```
|
||||
|
||||
Intuitively, the hotter the weather, the more hectares burned in forest fires.
|
||||
|
||||
## Transfer your data to Azure ML Studio
|
||||
|
||||
|
||||
```python
|
||||
from azureml import DataTypeIds
|
||||
|
||||
dataset = ws.datasets.add_from_dataframe(
|
||||
dataframe=df,
|
||||
data_type_id=DataTypeIds.GenericCSV,
|
||||
name='Forest Fire Data',
|
||||
description='Paulo Cortez and Aníbal Morais (Univ. Minho) @ 2007'
|
||||
)
|
||||
```
|
||||
|
||||
After running the code above, you can see the dataset listed in the **Datasets** section of the Azure Machine Learning Studio workspace. (**Note**: You might need to switch between browser tabs and refresh the page in order to see the dataset.)
|
||||
|
||||
![image.png](attachment:image.png)<br>
|
||||
|
||||
**View Azure ML Studio Data in Notebooks**
|
||||
|
||||
|
||||
```python
|
||||
print('\n'.join([i.name for i in ws.datasets if not i.is_example])) # only list user-created datasets
|
||||
```
|
||||
|
||||
**Interact with Azure ML Studio Data in Notebooks**
|
||||
|
||||
|
||||
```python
|
||||
# Read some more of the metadata
|
||||
ds = ws.datasets['Forest Fire Data']
|
||||
print(ds.name)
|
||||
print(ds.description)
|
||||
print(ds.family_id)
|
||||
print(ds.data_type_id)
|
||||
print(ds.created_date)
|
||||
print(ds.size)
|
||||
|
||||
# Read the contents
|
||||
df2 = ds.to_dataframe()
|
||||
df2.head()
|
||||
```
|
||||
|
||||
## Create your model
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
df[['wind','rain','month','RH']],
|
||||
df['temp'],
|
||||
test_size=0.25,
|
||||
random_state=42
|
||||
)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.tree import DecisionTreeRegressor
|
||||
from sklearn.metrics import r2_score
|
||||
regressor = DecisionTreeRegressor(random_state=42)
|
||||
regressor.fit(X_train, y_train)
|
||||
y_test_predictions = regressor.predict(X_test)
|
||||
print('R^2 for true vs. predicted test set forest temperature: {:0.2f}'.format(r2_score(y_test, y_test_predictions)))
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Play around with this algorithm.
|
||||
# Can you get better results changing the variables you select for the training and test data?
|
||||
# What if you look at different variables for the response?
|
||||
|
||||
```
|
||||
|
||||
## Deploy your model as a web service
|
||||
|
||||
**Access your Model Anywhere**
|
||||
|
||||
|
||||
```python
|
||||
from azureml import services
|
||||
|
||||
@services.publish(workspace_id, authorization_token)
|
||||
@services.types(wind=float, rain=float, month=int, RH=float)
|
||||
@services.returns(float)
|
||||
|
||||
# The name of your web service is set to this function's name
|
||||
def forest_fire_predictor(wind, rain, month, RH):
|
||||
return regressor.predict([wind, rain, month, RH])
|
||||
|
||||
# Hold onto information about your web service so
|
||||
# you can call it within the notebook later
|
||||
service_url = forest_fire_predictor.service.url
|
||||
api_key = forest_fire_predictor.service.api_key
|
||||
help_url = forest_fire_predictor.service.help_url
|
||||
service_id = forest_fire_predictor.service.service_id
|
||||
```
|
||||
|
||||
## Consuming the web service
|
||||
|
||||
|
||||
```python
|
||||
forest_fire_predictor.service(5.4, 0.2, 9, 22.1)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
import urllib2
|
||||
import json
|
||||
|
||||
data = {"Inputs": {
|
||||
"input1": {
|
||||
"ColumnNames": [ "wind", "rain", "month", "RH"],
|
||||
"Values": [["5.4", "0.2", "9", "22.1"]]
|
||||
}
|
||||
}, # Specified feature values
|
||||
|
||||
"GlobalParameters": {}
|
||||
}
|
||||
|
||||
body = json.dumps(data)
|
||||
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}
|
||||
req = urllib2.Request(service_url, body, headers)
|
||||
|
||||
try:
|
||||
response = urllib2.urlopen(req)
|
||||
result = json.loads(response.read()) # load JSON-formatted string response as dictionary
|
||||
print(result['Results']['output1']['value']['Values'][0][0]) # Get the returned prediction
|
||||
|
||||
except urllib2.HTTPError, error:
|
||||
print("The request failed with status code: " + str(error.code))
|
||||
print(error.info())
|
||||
print(json.loads(error.read()))
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
Try this same process of training and hosting a model through Azure ML Studio with the Pima Indians Diabetes dataset (in CSV format in your Data folder). The dataset has nine columns; use any of the eight features you see fit to try and predict the ninth column, Outcome (1 = diabetes, 0 = no diabetes).
|
||||
|
||||
|
||||
```python
|
||||
|
||||
```
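If you want a starting point, here is a rough, local sketch of the training half of that process (splitting, fitting, and scoring). The file name `Data/pima_diabetes.csv` is an assumption — substitute the actual name of the diabetes CSV in your Data folder — and deploying the fitted model would then follow the same `@services.publish` pattern shown above:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# NOTE: assumed file name -- point this at the actual diabetes CSV in your Data folder
diabetes = pd.read_csv('Data/pima_diabetes.csv')

X = diabetes.drop('Outcome', axis=1)   # any or all of the eight feature columns
y = diabetes['Outcome']                # 1 = diabetes, 0 = no diabetes

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

clf = LogisticRegression()
clf.fit(X_train, y_train)
print(accuracy_score(y_test, clf.predict(X_test)))
```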
|
||||
|
||||
> **Takeaway**: In this part, you explored fitting a model and deploying it as a web service. You did this by using now-familiar tools in an Azure Notebook to build a model relating variables surrounding forest fires and then posting that as a function in Azure ML Studio. From there, you saw how you and others can access the pre-fitted models to make predictions on new data from anywhere on the web.
|
|
@ -0,0 +1,25 @@
|
|||
|
||||
# Capstone Project
|
||||
|
||||
In this Capstone Project, you will engage with the NOAA Significant Volcanic Eruption database, which can be found in this notebook's Data folder at:
|
||||
`Data/noaa_volerup.csv`
|
||||
|
||||
|
||||
## Tasks
|
||||
Using what you know about Python, NumPy, Pandas, and Machine Learning, you should:
|
||||
- Identify requirements for success
|
||||
- Identify possible risks in the data if this were a real-world scenario
|
||||
- Prepare the data
|
||||
- Select features (variables)
|
||||
- Split the data between training and testing
|
||||
- Choose algorithms
|
||||
|
||||
## Options
|
||||
If you would prefer to find your own dataset, that is OK; however, limit your searching to about 15 minutes. Microsoft has several [Public Datasets](https://docs.microsoft.com/en-us/azure/sql-database/sql-database-public-data-sets) if you want to start there.
|
||||
|
||||
You are also encouraged to explore any aspects of the data. Be explicit about your inquiry and about your success in predicting its effects on our world.
|
||||
|
||||
|
||||
```python
|
||||
|
||||
```
|
|
@ -0,0 +1,576 @@
|
|||
|
||||
# Section 1: Introduction to machine learning models
|
||||
|
||||
You have now made it to the section on machine learning (ML). ML and the branch of computer science in which it resides, artificial intelligence (AI), are so central to data science that ML/AI and data science are synonymous in the minds of many people. However, the preceding sections have hopefully demonstrated that there are a lot of other facets to the discipline of data science apart from the prediction and classification tasks that supply so much value to the world. (Remember, at least 80 percent of the effort in most data-science projects will be composed of cleaning and manipulating the data to prepare it for analysis.)
|
||||
|
||||
That said, ML is fun! In this section, and the next one on data science in the cloud, you will get to play around with some of the “magic” of data science and start to put into practice the tools you have spent the last five sections learning. Let's get started!
|
||||
|
||||
## A quick aside: types of ML
|
||||
|
||||
As you get deeper into data science, it might seem like there are a bewildering array of ML algorithms out there. However many you encounter, it can be handy to remember that most ML algorithms fall into three broad categories:
|
||||
- **Predictive algorithms**: These analyze current and historical facts to make predictions about unknown events, such as the future or customers’ choices.
|
||||
- **Classification algorithms**: These teach a program from a body of data, and the program then uses that learning to classify new observations.
|
||||
- **Time-series forecasting algorithms**: While it can be argued that these algorithms are a subset of predictive algorithms, their techniques are specialized enough that they function in many ways like a separate category. Time-series forecasting is beyond the scope of this course; we have more than enough to work on here with prediction and classification.
|
||||
|
||||
## Prediction: linear regression
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable fitting linear regression models, and you should have some familiarity with interpreting their output.
|
||||
|
||||
Arguably the simplest form of machine learning is to draw a line connecting two points and make predictions about where that trend might lead.
|
||||
|
||||
But what if you have more than two points—and those points don't line up neatly? What if you have points in more than two dimensions? This is where linear regression comes in.
|
||||
|
||||
Formally, linear regression is used to predict a quantitative *response* (the values on a Y axis) that is dependent on one or more *predictors* (values on one or more axes that are orthogonal to Y, commonly just thought of collectively as X). The working assumption is that the relationship between predictors and response is more or less linear. The goal of linear regression is to fit a straight line in the best possible way to minimize the deviation between our observed responses in the dataset and the responses predicted by our line, the linear approximation. (The most common means of assessing this error is called the **least squares method**; it consists of minimizing the number you get when you square the difference between your predicted value and the actual value and add up all of those squared differences for your entire dataset.)
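Written out, the quantity that least squares minimizes (often called the residual sum of squares) is $\mathrm{RSS} = \sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2$, where $y_i$ is the observed response for the $i$th observation and $\hat{y}_i$ is the response our line predicts for it.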
|
||||
|
||||
<img src="../Images/linear_regression.png" style="padding-right: 10px;">
|
||||
|
||||
|
||||
Statistically, we can represent this relationship between response and predictors as:
|
||||
|
||||
$Y = B_0 + B_1X + E$
|
||||
|
||||
Remember high school geometry? $B_0$ is the intercept of our line and $B_1$ is its slope. We commonly refer to $B_0$ and $B_1$ as coefficients and to $E$ as the *error term*, which represents the margin of error in the model.
|
||||
|
||||
Let's try this in practice with actual data. (Note: no graph paper will be harmed in the course of these predictions.)
|
||||
|
||||
### Data exploration
|
||||
|
||||
We'll begin by importing our usual libraries and using our %matplotlib inline magic command:
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
import numpy as np
|
||||
import matplotlib.pyplot as plt
|
||||
%matplotlib inline
|
||||
import seaborn as sns
|
||||
```
|
||||
|
||||
And now for our data. In this case, we’ll use a newer housing dataset than the Boston Housing Dataset we used in the last section (with this one storing data on individual houses across the United States).
|
||||
|
||||
|
||||
```python
|
||||
df = pd.read_csv('../Data/Housing_Dataset_Sample.csv')
|
||||
df.head()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Do you remember the DataFrame method for looking at overall information
|
||||
# about a DataFrame, such as number of columns and rows? Try it here.
|
||||
|
||||
```
|
||||
|
||||
Let's also use the `describe` method to look at some of the vital statistics about the columns. Note that in cases like this, in which some of the column names are long, it can be helpful to view the transposition of the summary, like so:
|
||||
|
||||
|
||||
```python
|
||||
df.describe().T
|
||||
```
|
||||
|
||||
Let's look at the data in the **Price** column. (You can disregard the deprecation warning if it appears.)
|
||||
|
||||
|
||||
```python
|
||||
sns.distplot(df['Price'])
|
||||
```
|
||||
|
||||
As we would hope with this much data, our prices form a nice bell-shaped, normally distributed curve.
|
||||
|
||||
Now, let's look at a simple relationship like that between house prices and the average income in a geographic area:
|
||||
|
||||
|
||||
```python
|
||||
sns.jointplot(df['Avg. Area Income'],df['Price'])
|
||||
```
|
||||
|
||||
As we would expect, there is an intuitive, linear relationship between them. Also good: the pairplot shows that the data in both columns is normally distributed, so we don't have to worry about somehow transforming the data for meaningful analysis.
|
||||
|
||||
Let's take a quick look at all of the columns:
|
||||
|
||||
|
||||
```python
|
||||
sns.pairplot(df)
|
||||
```
|
||||
|
||||
Some observations:
|
||||
1. Not all of the combinations of columns provide strong linear relationships; some just look like blobs. That's nothing to worry about for our analysis.
|
||||
2. See the visualizations that look like lanes rather than organic groups? That is the result of the average number of bedrooms in houses being measured in discrete values rather than continuous ones (as no one has 0.3 bedrooms in their house). The number of bathrooms is also the one column whose data is not really normally distributed, though some of this might be distortion caused by the default bin size of the pairplot histogram functionality.
|
||||
|
||||
It is now time to make a prediction.
|
||||
|
||||
### Fitting the model
|
||||
|
||||
Let's make a prediction. Let's feed everything into a linear model (average area income, average area house age, average area number of rooms, average area number of bedrooms, and area population) and see how well knowing those factors can help us predict the price of a home.
|
||||
|
||||
To do this, we will make our first five columns the X (our predictors) and the **Price** column the Y (our response):
|
||||
|
||||
|
||||
```python
|
||||
X = df.iloc[:,:5]
|
||||
y = df['Price']
|
||||
```
|
||||
|
||||
Now, we could use all of our data to create our model. However, all that would get us is a model that is good at predicting itself. Not only would that leave us with no objective way to measure how good the model is, it would also likely lead to a model that was less accurate when used on new data. Such a model is termed *overfitted*.
|
||||
|
||||
To avoid this, data scientists divide their datasets for ML into *training* data (the data used to fit the model) and *test* data (data used to evaluate how accurate the model is). Fortunately, scikit-learn provides a function that enables us to easily divide up our data between training and test sets: `train_test_split`. In this case, we will use 70 percent of our data for training and reserve 30 percent of it for testing. (Note that you will also supply a fourth parameter to the function: `random_state`; `train_test_split` randomly divides up our data between test and training, so this number provides an explicit seed for the random-number generator so that you will get the same result each time you run this code snippet.)
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=54)
|
||||
```
|
||||
|
||||
All that is left now is to import our linear regression algorithm and fit our model based on our training data:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.linear_model import LinearRegression
|
||||
|
||||
reg = LinearRegression()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
reg.fit(X_train,y_train)
|
||||
```
|
||||
|
||||
### Evaluating the model
|
||||
|
||||
Now, a moment of truth: let's see how our model does making predictions based on the test data:
|
||||
|
||||
|
||||
```python
|
||||
predictions = reg.predict(X_test)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
predictions
|
||||
```
|
||||
|
||||
Our predictions are just an array of numbers: these are the house prices predicted by our model, one for every row in our test dataset.
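A quick way to eyeball how close those numbers are to reality is to line a few of them up against the actual prices; a minimal sketch using the `y_test` and `predictions` objects we already have:

```python
# Compare the first few predicted prices with the actual test-set prices
pd.DataFrame({'Actual price': y_test.values[:5], 'Predicted price': predictions[:5]})
```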
|
||||
|
||||
Remember how we mentioned that linear models have the mathematical form of $Y = B_0 + B_1X + E$? Let’s look at the actual equation:
|
||||
|
||||
|
||||
```python
|
||||
print(reg.intercept_,reg.coef_)
|
||||
```
|
||||
|
||||
In algebraic terms, here is our model:
|
||||
|
||||
$Y=-2,646,401+0.21587X_1+0.00002X_2+0.00001X_3+0.00279X_4+0.00002X_5$
|
||||
|
||||
where:
|
||||
- $Y=$ Price
|
||||
- $X_1=$ Average area income
|
||||
- $X_2=$ Average area house age
|
||||
- $X_3=$ Average area number of rooms
|
||||
- $X_4=$ Average area number of bedrooms
|
||||
- $X_5=$ Area population
|
||||
|
||||
So, just how good is our model? There are many ways to measure the accuracy of ML models. Linear models have a good one: the $R^2$ score (also known as the coefficient of determination). A high $R^2$, close to 1, indicates better prediction with less error.
|
||||
|
||||
|
||||
```python
|
||||
#Explained variation. A high R2 close to 1 indicates better prediction with less error.
|
||||
from sklearn.metrics import r2_score
|
||||
|
||||
r2_score(y_test,predictions)
|
||||
```
|
||||
|
||||
The $R^2$ score also indicates how much explanatory power a linear model has. In the case of our model, the five predictors we used in the model explain a little more than 92 percent of the price of a house in this dataset.
|
||||
|
||||
We can also plot our errors to get a visual sense of how wrong our predictions were:
|
||||
|
||||
|
||||
```python
|
||||
#plot errors
|
||||
sns.distplot([y_test-predictions])
|
||||
```
|
||||
|
||||
Do you notice the numbers on the left axis? Whereas a histogram shows the number of things that fall into discrete numeric buckets, a kernel density estimation (KDE, and the histogram that accompanies it in the Seaborn distplot) normalizes those numbers to show what proportion of results lands in each bucket. Essentially, the values are all decimals less than 1.0 because the area under the KDE has to add up to 1.
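If you want to check that claim, you can bin the residuals as densities yourself and confirm that the total area comes out to about 1; a small sketch using NumPy (already imported as `np`):

```python
# Density-normalized histogram of the residuals; summing height * bin width should give ~1.0
counts, edges = np.histogram(y_test - predictions, bins=30, density=True)
(counts * np.diff(edges)).sum()
```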
|
||||
|
||||
Maybe more gratifying, we can plot the predictions from our model:
|
||||
|
||||
|
||||
```python
|
||||
# Plot outputs
|
||||
plt.scatter(y_test,predictions, color='blue')
|
||||
```
|
||||
|
||||
The linear nature of our predicted prices is clear enough, but there are so many of them that it is hard to tell where dots are concentrated. Can you think of a way to refine this visualization to make it clearer, particularly if you were explaining the results to someone?
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Hint: Remember to try the plt.scatter parameter alpha=.
|
||||
# It takes values between 0 and 1.
|
||||
|
||||
```
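One possible refinement (there are others) is simply to make the points translucent so that dense regions read as darker; a minimal sketch using the `alpha` parameter mentioned in the hint:

```python
# Translucent points make the dense band of accurate predictions stand out
plt.scatter(y_test, predictions, color='blue', alpha=0.1)
plt.xlabel('Actual price')
plt.ylabel('Predicted price')
```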
|
||||
|
||||
> **Takeaway:** In this subsection, you performed prediction using linear regression by exploring your data, then fitting your model, and finally evaluating your model’s performance.
|
||||
|
||||
## Classification: logistic regression
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should know how logistic regression differs from linear regression, be comfortable fitting logistic regression models, and have some familiarity with interpreting their output.
|
||||
|
||||
We'll now pivot to discussing classification. If our simple analogy of predictive analytics was drawing a line through points and extrapolating from that, then classification can be described in its simplest form as drawing lines around groups of points.
|
||||
|
||||
While linear regression is used to predict quantitative responses, *logistic* regression is used for classification problems. Formally, logistic regression predicts the categorical response (Y) based on predictors (Xs). Logistic regression goes by several names, and it is also known in the scholarly literature as logit regression, maximum-entropy classification (MaxEnt), and the log-linear classifier. In this algorithm, the probabilities describing the possible outcomes of a single trial are modeled using a sigmoid (S-curve) function. Sigmoid functions take any value and transform it to be between 0 and 1, which can be used as a probability for a class to be predicted, with the goal of predictors mapping to 1 when something belongs in the class and 0 when they do not.
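For reference, the standard logistic (sigmoid) function has the closed form $\sigma(z) = \frac{1}{1 + e^{-z}}$, where $z$ is a linear combination of the predictors; it approaches 0 for large negative $z$ and 1 for large positive $z$, which is what lets its output be read as a probability.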
|
||||
|
||||
<img src="../Images/logistic_regression.png?" style="padding-right: 10px;">
|
||||
|
||||
To show this in action, let's do something a little different and try a historical dataset: the fates of the passengers of the RMS Titanic, which is a popular dataset for classification problems in machine learning. In this case, the class we want to predict is whether a passenger survived the doomed liner's sinking.
|
||||
|
||||
The dataset has 12 variables:
|
||||
|
||||
- **PassengerId**
|
||||
- **Survived:** 0 = No, 1 = Yes
|
||||
- **Pclass:** Ticket class 1 = 1st, 2 = 2nd, 3 = 3rd
|
||||
- **Sex**
|
||||
- **Age**
|
||||
- **Sibsp:** Number of siblings or spouses aboard the *Titanic*
|
||||
- **Parch:** Number of parents or children aboard the *Titanic*
|
||||
- **Ticket:** Passenger ticket number
|
||||
- **Fare:** Passenger fare
|
||||
- **Cabin:** Cabin number
|
||||
- **Embarked:** Port of embarkation; C = Cherbourg, Q = Queenstown, S = Southampton
|
||||
|
||||
|
||||
```python
|
||||
df = pd.read_csv('../Data/train_data_titanic.csv')
|
||||
df.head()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df.info()
|
||||
```
|
||||
|
||||
One reason that the Titanic data set is a popular classification set is that it provides opportunities to prepare data for analysis. To prepare this dataset for analysis, we need to perform a number of tasks:
|
||||
- Remove extraneous variables
|
||||
- Check for multicollinearity
|
||||
- Handle missing values
|
||||
|
||||
We will touch on each of these steps in turn.
|
||||
|
||||
### Remove extraneous variables
|
||||
|
||||
The name of individual passengers and their ticket numbers will clearly do nothing to help our model, so we can drop those columns to simplify matters.
|
||||
|
||||
|
||||
```python
|
||||
df.drop(['Name','Ticket'],axis=1,inplace=True)
|
||||
```
|
||||
|
||||
There are additional variables that will not add classifying power to our model, but to find them we will need to look for correlation between variables.
|
||||
|
||||
### Check for multicollinearity
|
||||
|
||||
If one or more of our predictors can themselves be predicted from other predictors, it can produce a state of *multicollinearity* in our model. Multicollinearity is a challenge because it can skew the results of regression models (both linear and logistic) and reduce the predictive or classifying power of a model.
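One quick, informal way to scan for such relationships is a heatmap of the pairwise correlations between the numeric columns; a minimal sketch with Seaborn (we return to the full correlation matrix with `df.corr()` further below):

```python
# Pairwise correlations between the numeric Titanic columns, drawn as a heatmap
sns.heatmap(df[['Survived', 'Pclass', 'Age', 'SibSp', 'Parch', 'Fare']].corr(),
            annot=True, cmap='coolwarm')
```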
|
||||
|
||||
To help combat this problem, we can start to look for some initial patterns. For example, do any correlations between **Survived** and **Fare** jump out?
|
||||
|
||||
|
||||
```python
|
||||
sns.pairplot(df[['Survived','Fare']], dropna=True)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Try running sns.pairplot twice more on some other combinations of columns
|
||||
# and see if any patterns emerge.
|
||||
|
||||
```
|
||||
|
||||
We can also use `groupby` to look for patterns. Consider the mean values for the various variables when we group by **Survived**:
|
||||
|
||||
|
||||
```python
|
||||
df.groupby('Survived').mean()
|
||||
```
|
||||
|
||||
Survivors appear to be slightly younger on average with higher-cost fare.
|
||||
|
||||
|
||||
```python
|
||||
df.head()
|
||||
```
|
||||
|
||||
Value counts can also help us get a sense of the data before us, such as numbers for siblings and spouses on the *Titanic*, in addition to the sex split of passengers:
|
||||
|
||||
|
||||
```python
|
||||
df['SibSp'].value_counts()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['Parch'].value_counts()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df['Sex'].value_counts()
|
||||
```
|
||||
|
||||
### Handle missing values
|
||||
|
||||
We now need to address missing values. First, let’s look to see which columns have more than half of their values missing:
|
||||
|
||||
|
||||
```python
|
||||
#missing
|
||||
df.isnull().sum()>(len(df)/2)
|
||||
```
|
||||
|
||||
Let's break down the code in the call above just a bit. `df.isnull().sum()` tells pandas to take the sum of all of the missing values for each column. `len(df)/2` is just another way of expressing half the number of rows in the `DataFrame`. Taken together with the `>`, this line of code is looking for any columns with more than half of its entries missing, and there is one: **Cabin**.
|
||||
|
||||
We could try to do something about those missing values. However, if any pattern does emerge in the data that involves **Cabin**, it will be highly cross-correlated with both **Pclass** and **Fare** (as higher-fare, better-class accommodations were grouped together on the *Titanic*). Given that too much cross-correlation can be detrimental to a model, it is probably just better for us to drop **Cabin** from our `DataFrame`:
|
||||
|
||||
|
||||
```python
|
||||
df.drop('Cabin',axis=1,inplace=True)
|
||||
```
|
||||
|
||||
Let's now run `info` to see if there are columns with just a few null values.
|
||||
|
||||
|
||||
```python
|
||||
df.info()
|
||||
```
|
||||
|
||||
One note on the data: given that 1,503 people died in the *Titanic* tragedy (and that we know some passengers in this dataset survived), this dataset clearly does not include every passenger on the ship, and it includes none of the crew. Also remember that **Survived** is a variable that covers both outcomes: those who survived and those who perished.
|
||||
|
||||
Back to missing values. **Age** is missing several values, as is **Embarked**. Let's see how many values are missing from **Age**:
|
||||
|
||||
|
||||
```python
|
||||
df['Age'].isnull().value_counts()
|
||||
```
|
||||
|
||||
As we saw above, **Age** isn't really correlated with **Fare**, so it is a variable that we want to eventually use in our model. That means that we need to do something with those missing values. But before we decide on a strategy, we should check to see if our median age is the same for both sexes.
|
||||
|
||||
|
||||
```python
|
||||
df.groupby('Sex')['Age'].median().plot(kind='bar')
|
||||
```
|
||||
|
||||
The median ages are different for men and women sailing on the *Titanic*, which means that we should handle the missing values accordingly. A sound strategy is to replace the missing ages for passengers with the median age *for the passengers' sexes*.
|
||||
|
||||
|
||||
```python
|
||||
df['Age'] = df.groupby('Sex')['Age'].apply(lambda x: x.fillna(x.median()))
|
||||
```
|
||||
|
||||
Any other missing values?
|
||||
|
||||
|
||||
```python
|
||||
df.isnull().sum()
|
||||
```
|
||||
|
||||
We are missing two values for **Embarked**. Check to see how that variable breaks down:
|
||||
|
||||
|
||||
```python
|
||||
df['Embarked'].value_counts()
|
||||
```
|
||||
|
||||
The vast majority of passengers embarked on the *Titanic* from Southampton, so we will just fill in those two missing values with the most frequent value (the mode): Southampton.
|
||||
|
||||
|
||||
```python
|
||||
df['Embarked'].fillna(df['Embarked'].value_counts().idxmax(), inplace=True)
|
||||
df['Embarked'].value_counts()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
df = pd.get_dummies(data=df, columns=['Sex', 'Embarked'],drop_first=True)
|
||||
df.head()
|
||||
```
|
||||
|
||||
Let's do a final look at the correlation matrix to see if there is anything else we should remove.
|
||||
|
||||
|
||||
```python
|
||||
df.corr()
|
||||
```
|
||||
|
||||
**Pclass** and **Fare** have some amount of correlation, so we can probably get rid of one of them. In addition, we need to remove **Survived** from our X `DataFrame` because it will be our response `DataFrame`, Y:
|
||||
|
||||
|
||||
```python
|
||||
X = df.drop(['Survived','Pclass'],axis=1)
|
||||
y = df['Survived']
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
X_train, X_test, y_train, y_test = train_test_split(X,y,test_size=0.3,random_state=67)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
We now need to split the training and test data, which you will do as an exercise:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
# Look up in the portion above on linear regression and use train_test_split here.
|
||||
# Set test_size = 0.3 and random_state = 67 to get the same results as below when
|
||||
# you run through the rest of the code example below.
|
||||
|
||||
```
|
||||
|
||||
Now you will import and fit the logistic regression model:
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.linear_model import LogisticRegression
|
||||
|
||||
lr = LogisticRegression()
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
lr.fit(X_train,y_train)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
predictions = lr.predict(X_test)
|
||||
```
|
||||
|
||||
### Evaluate the model
|
||||
|
||||
In contrast to linear regression, logistic regression does not produce an $R^2$ score by which we can assess the accuracy of our model. In order to evaluate that, we will use a classification report, a confusion matrix, and the accuracy score.
|
||||
|
||||
#### Classification report
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
|
||||
```
|
||||
|
||||
The classification report shows the proportions of both survivors and non-survivors with four scores:
|
||||
- **Precision:** The number of true positives divided by the sum of true positives and false positives; closer to 1 is better.
|
||||
- **Recall:** The true-positive rate, the number of true positives divided by the sum of the true positives and the false negatives.
|
||||
- **F1 score:** The harmonic mean (the average appropriate for rates) of precision and recall; the formula is written out just after this list.
|
||||
- **Support:** The number of true instances for each label.
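For reference, with precision $P$ and recall $R$, the F1 score is $F_1 = 2\cdot\frac{P \times R}{P + R}$; it only approaches 1 when both precision and recall are high.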
|
||||
|
||||
Why so many ways of measuring accuracy for a model? Well, success means different things in different contexts. Imagine that we had a model to diagnose infectious disease. In such a case we might want to tune our model to maximize recall (and thus minimize our false-negative rate): even high precision might miss a lot of infected people. On the other hand, a weather-forecasting model might be interested in maximizing precision because the cost of false negatives is so low. For other uses, striking a balance between precision and recall by maximizing the F1 score might be the best choice. Run the classification report:
|
||||
|
||||
|
||||
```python
|
||||
print(classification_report(y_test,predictions))
|
||||
```
|
||||
|
||||
#### Confusion matrix
|
||||
|
||||
The confusion matrix is another way to present this same information, this time with raw counts. In scikit-learn's layout, the rows show the true condition (non-survivors on top, survivors on the bottom), and the columns show the predicted condition (non-survivors on the left, survivors on the right). So, the matrix below shows that our model correctly identified 146 non-survivors (true negatives) but misclassified 16 non-survivors as survivors (false positives); it also correctly identified 76 survivors (true positives) while misclassifying 30 survivors as non-survivors (false negatives).
|
||||
|
||||
|
||||
```python
|
||||
print(confusion_matrix(y_test,predictions))
|
||||
```
|
||||
|
||||
Let's dress up the confusion matrix a bit to make it a little easier to read:
|
||||
|
||||
|
||||
```python
|
||||
pd.DataFrame(confusion_matrix(y_test, predictions), columns=['Predicted Not Survived', 'Predicted Survived'], index=['True Not Survived', 'True Survived'])
|
||||
```
|
||||
|
||||
#### Accuracy score
|
||||
|
||||
Finally, our accuracy score tells us the fraction of correctly classified samples; in this case (146 + 76) / (146 + 76 + 30 + 16).
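Working through that arithmetic with the counts above, the model classified $(146 + 76) = 222$ of the $268$ test passengers correctly, or roughly $0.83$, which is the neighborhood the score printed below should land in.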
|
||||
|
||||
|
||||
```python
|
||||
print(accuracy_score(y_test,predictions))
|
||||
```
|
||||
|
||||
Not bad for an off-the-shelf model with no tuning!
|
||||
|
||||
> **Takeaway:** In this subsection, you performed classification using logistic regression by removing extraneous variables, checking for multicollinearity, handling missing values, and fitting and evaluating your model.
|
||||
|
||||
## Classification: decision trees
|
||||
|
||||
> **Learning goal:** By the end of this subsection, you should be comfortable fitting decision-tree models and have some understanding of what they output.
|
||||
|
||||
If logistic regression uses observations about variables to swing a metaphorical needle between 0 and 1, classification based on decision trees programmatically builds a Yes/No decision to classify items.
|
||||
|
||||
<img src="../Images/decision_tree.png" style="padding-right: 10px;">
|
||||
|
||||
Let's look at this in practice with the same *Titanic* dataset we used with logistic regression.
|
||||
|
||||
|
||||
```python
|
||||
from sklearn import tree
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
tr = tree.DecisionTreeClassifier()
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Using the same split data as with the logistic regression,
|
||||
# can you fit the decision tree model?
|
||||
# Hint: Refer to code snippet for fitting the logistic regression above.
|
||||
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
tr.fit(X_train, y_train)
|
||||
```
|
||||
|
||||
Once fitted, we get our predictions just like we did in the logistic regression example above:
|
||||
|
||||
|
||||
```python
|
||||
tr_predictions = tr.predict(X_test)
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
pd.DataFrame(confusion_matrix(y_test, tr_predictions),
|
||||
             columns=['Predicted Not Survived', 'Predicted Survived'],
|
||||
             index=['True Not Survived', 'True Survived'])
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
print(accuracy_score(y_test,tr_predictions))
|
||||
```
|
||||
|
||||
One of the great attractions of decision trees is that the models are readable by humans. Let's visualize the tree to see this in action. (Note: the generated graphic can be quite large, so scroll to the right if it just looks blank at first.)
|
||||
|
||||
|
||||
```python
|
||||
import graphviz
|
||||
|
||||
dot_file = tree.export_graphviz(tr, out_file=None,
|
||||
feature_names=X.columns,
|
||||
class_names=['Not Survived', 'Survived'],
|
||||
filled=True,rounded=True)
|
||||
graph = graphviz.Source(dot_file)
|
||||
graph
|
||||
```
|
||||
|
||||
There are, of course, myriad other ML models that we could explore. However, you now know some of the most commonly encountered ones, which is great preparation to understand what automated, cloud-based ML and AI services are doing and how to intelligently apply them to data-science problems, the subject of the next section.
|
||||
|
||||
> **Takeaway:** In this subsection, you performed classification on previously cleaned data by fitting and evaluating a decision tree.
|
|
@ -0,0 +1,245 @@
|
|||
|
||||
# Section 2: Cloud-based machine learning
|
||||
|
||||
Thus far, we have looked at building and fitting ML models “locally.” True, the notebooks have been located in the cloud themselves, but the models with all of their predictive and classification power are stuck in those notebooks. To use these models, you would have to load data into your notebooks and get the results there.
|
||||
|
||||
In practice, we want those models accessible from a number of locations. And while the management of production ML models has a lifecycle all its own, one part of that is making models accessible from the web. One way to do so is to develop them using third-party cloud tools, such as [Microsoft Azure ML Studio](https://studio.azureml.net) (not to be confused with Microsoft Azure Machine Learning Service, which provides end-to-end lifecycle management for ML models).
|
||||
|
||||
Alternatively, we can develop and deploy a function that can be accessed by other programs over the web—a web service—that runs within Azure ML Studio, and we can do so entirely from a Python notebook. In this section, we will use the [`azureml`](https://github.com/Azure/Azure-MachineLearning-ClientLibrary-Python) package to deploy an Azure ML web service directly from within a Python notebook (or other Python environment).
|
||||
|
||||
> **Note:** The `azureml` package presently works only with Python 2. If your notebook is not currently running Python 2, change it in the menu at the top of the notebook by clicking **Kernel > Change kernel > Python 2**.
|
||||
|
||||
## Create and connect to an Azure ML Studio workspace
|
||||
|
||||
The `azureml` package is installed by default with Azure Notebooks, so we don't have to worry about that. It uses an Azure ML Studio workspace ID and authorization token to connect your notebook to the workspace; you will obtain the ID and token by following these steps:
|
||||
|
||||
1. Open [Azure ML Studio](https://studio.azureml.net) in a new browser tab and sign in with a Microsoft account. Azure ML Studio is free and does not require an Azure subscription. Once signed in with your Microsoft account (the same credentials you’ve used for Azure Notebooks), you're in your “workspace.”
|
||||
|
||||
2. On the left pane, click **Settings**.
|
||||
|
||||
![Settings button](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-settings.png?raw=true)<br><br>
|
||||
|
||||
3. On the **Name** tab, the **Workspace ID** field contains your workspace ID. Copy that ID into the `workspace_id` value in the code cell in Step 5 of the notebook below.
|
||||
|
||||
![Location of workspace ID](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-workspace-id.png?raw=true)<br><br>
|
||||
|
||||
4. Click the **Authorization Tokens** tab, and then copy either token into the `authorization_token` value in the code cell in Step 5 of the notebook.
|
||||
|
||||
![Location of authorization token](https://github.com/Microsoft/AzureNotebooks/blob/master/Samples/images/azure-ml-studio-tokens.png?raw=true)<br><br>
|
||||
|
||||
5. Run the code cell below; if it runs without error, you're ready to continue.
|
||||
|
||||
|
||||
```python
|
||||
from azureml import Workspace
|
||||
|
||||
# Replace the values with those from your own Azure ML Studio instance; see Prerequisites
|
||||
# The workspace_id is a string of hexadecimal characters; the token is a long string of random characters.
|
||||
workspace_id = 'your_workspace_id'
|
||||
authorization_token = 'your_auth_token'
|
||||
|
||||
ws = Workspace(workspace_id, authorization_token)
|
||||
```
|
||||
|
||||
## Explore forest fire data
|
||||
|
||||
Let’s look at a meteorological dataset collected by Cortez and Morais (2007) to study the burned area of forest fires in the northeast region of Portugal.
|
||||
|
||||
> P. Cortez and A. Morais. A Data Mining Approach to Predict Forest Fires using Meteorological Data.
|
||||
In J. Neves, M. F. Santos and J. Machado Eds., New Trends in Artificial Intelligence,
|
||||
Proceedings of the 13th EPIA 2007 - Portuguese Conference on Artificial Intelligence, December,
|
||||
Guimaraes, Portugal, pp. 512-523, 2007. APPIA, ISBN-13 978-989-95618-0-9.
|
||||
|
||||
The dataset contains the following features:
|
||||
|
||||
- **`X`**: x-axis spatial coordinate within the Montesinho park map: 1 to 9
|
||||
- **`Y`**: y-axis spatial coordinate within the Montesinho park map: 2 to 9
|
||||
- **`month`**: month of the year: "1" to "12" jan-dec
|
||||
- **`day`**: day of the week: "1" to "7" sun-sat
|
||||
- **`FFMC`**: FFMC index from the FWI system: 18.7 to 96.20
|
||||
- **`DMC`**: DMC index from the FWI system: 1.1 to 291.3
|
||||
- **`DC`**: DC index from the FWI system: 7.9 to 860.6
|
||||
- **`ISI`**: ISI index from the FWI system: 0.0 to 56.10
|
||||
- **`temp`**: temperature in Celsius degrees: 2.2 to 33.30
|
||||
- **`RH`**: relative humidity in %: 15.0 to 100
|
||||
- **`wind`**: wind speed in km/h: 0.40 to 9.40
|
||||
- **`rain`**: outside rain in mm/m2 : 0.0 to 6.4
|
||||
- **`area`**: the burned area of the forest (in ha): 0.00 to 1090.84
|
||||
|
||||
|
||||
Let's load the dataset and visualize the area that was burned in relation to the temperature in that region.
|
||||
|
||||
|
||||
```python
|
||||
import pandas as pd
|
||||
df = pd.read_csv('../Data/forestfires.csv')  # read_csv already returns a DataFrame
|
||||
%matplotlib inline
|
||||
from ggplot import *
|
||||
ggplot(aes(x='temp', y='area'), data=df) + geom_line() + geom_point()
|
||||
```
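
If the legacy `ggplot` package won't install or import in your environment (it has not kept pace with newer versions of pandas), a plain matplotlib sketch conveys the same relationship. This is only an alternative view and assumes `df` has already been loaded as above.


```python
# Scatter plot of burned area against temperature using matplotlib
import matplotlib.pyplot as plt

plt.scatter(df['temp'], df['area'], alpha=0.6)
plt.xlabel('Temperature (Celsius)')
plt.ylabel('Burned area (ha)')
plt.title('Burned area vs. temperature')
plt.show()
```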
|
||||
|
||||
Intuitively, we might expect that the hotter the weather, the more hectares burned in forest fires.
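
One quick way to check that intuition is to look at the correlation between the two columns before modeling; this is only a sketch and assumes `df` is the DataFrame loaded above.


```python
# Pearson correlation between temperature and burned area;
# a weak value would suggest temperature alone is not a strong predictor
print(df[['temp', 'area']].corr())
```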
|
||||
|
||||
## Transfer your data to Azure ML Studio
|
||||
|
||||
We have our data, but how do we get it into Azure ML Studio in order to use it there? That is where the `azureml` package comes in. It enables us to load data and models into Azure ML Studio from an Azure Notebook (or any Python environment).
|
||||
|
||||
The first code cell of this notebook is what establishes the connection with *your* Azure ML Studio account.
|
||||
|
||||
Now that you have your notebook talking to Azure ML Studio, you can export your data to it:
|
||||
|
||||
|
||||
```python
|
||||
from azureml import DataTypeIds
|
||||
|
||||
dataset = ws.datasets.add_from_dataframe(
|
||||
dataframe=df,
|
||||
data_type_id=DataTypeIds.GenericCSV,
|
||||
name='Forest Fire Data',
|
||||
description='Paulo Cortez and Aníbal Morais (Univ. Minho) @ 2007'
|
||||
)
|
||||
```
|
||||
|
||||
After running the code above, you can see the dataset listed in the **Datasets** section of the Azure Machine Learning Studio workspace. (**Note**: You might need to switch between browser tabs and refresh the page in order to see the dataset.)
|
||||
|
||||
|
||||
|
||||
It is also straightforward to list the datasets available in the workspace and transfer datasets from the workspace to the notebook:
|
||||
|
||||
|
||||
```python
|
||||
print('\n'.join([i.name for i in ws.datasets if not i.is_example])) # only list user-created datasets
|
||||
```
|
||||
|
||||
You can also interact with and examine the dataset in Azure ML Studio directly from your notebook:
|
||||
|
||||
|
||||
```python
|
||||
# Read some more of the metadata
|
||||
ds = ws.datasets['Forest Fire Data']
|
||||
print(ds.name)
|
||||
print(ds.description)
|
||||
print(ds.family_id)
|
||||
print(ds.data_type_id)
|
||||
print(ds.created_date)
|
||||
print(ds.size)
|
||||
|
||||
# Read the contents
|
||||
df2 = ds.to_dataframe()
|
||||
df2.head()
|
||||
```
|
||||
|
||||
## Create your model
|
||||
|
||||
We're now back in familiar territory: prepping data for the model and fitting the model. To keep it interesting, we'll use the scikit-learn `train_test_split()` function with a slight change of parameters to select 75 percent of the data points for training and 25 percent for validation (testing).
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.model_selection import train_test_split
|
||||
|
||||
X_train, X_test, y_train, y_test = train_test_split(
|
||||
df[['wind','rain','month','RH']],
|
||||
df['temp'],
|
||||
test_size=0.25,
|
||||
random_state=42
|
||||
)
|
||||
```
|
||||
|
||||
Did you see what we did there? Rather than select all of the variables for the model, we were more selective and chose just wind speed, rainfall, month, and relative humidity in order to predict temperature.
|
||||
|
||||
Fit scikit-learn's `DecisionTreeRegressor` model using the training data. This algorithm is a combination of the linear regression and decision tree classification that you worked with in Section 6.
|
||||
|
||||
|
||||
```python
|
||||
from sklearn.tree import DecisionTreeRegressor
|
||||
from sklearn.metrics import r2_score
|
||||
regressor = DecisionTreeRegressor(random_state=42)
|
||||
regressor.fit(X_train, y_train)
|
||||
y_test_predictions = regressor.predict(X_test)
|
||||
print('R^2 for true vs. predicted test set forest temperature: {:0.2f}'.format(r2_score(y_test, y_test_predictions)))
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
# Play around with this algorithm.
|
||||
# Can you get better results changing the variables you select for the training and test data?
|
||||
# What if you look at different variables for the response?
|
||||
|
||||
```
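
One possible variation, purely as a sketch: swap in a different (arbitrarily chosen) feature set and compare the resulting R^2 score against the baseline above. The feature names come from the dataset description earlier in this section.


```python
# Sketch: refit the regressor on a different feature set and compare R^2
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    df[['FFMC', 'DMC', 'DC', 'ISI', 'RH', 'wind']],
    df['temp'],
    test_size=0.25,
    random_state=42
)

regressor2 = DecisionTreeRegressor(random_state=42)
regressor2.fit(X_train2, y_train2)
y_test2_predictions = regressor2.predict(X_test2)
print('R^2 with FWI-based features: {:0.2f}'.format(r2_score(y_test2, y_test2_predictions)))
```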
|
||||
|
||||
## Deploy your model as a web service
|
||||
|
||||
This is the important part. Once deployed as a web service, your model can be accessed from anywhere. This means that rather than refit a model every time you need a new prediction for a business or humanitarian use case, you can send the data to the pre-fitted model and get back a prediction.
|
||||
|
||||
First, deploy the model as a predictive web service. To do so, create a wrapper function that takes input data as an argument and calls `predict()` with your trained model and this input data, returning the results.
|
||||
|
||||
|
||||
```python
|
||||
from azureml import services
|
||||
|
||||
@services.publish(workspace_id, authorization_token)
|
||||
@services.types(wind=float, rain=float, month=int, RH=float)
|
||||
@services.returns(float)
|
||||
|
||||
# The name of your web service is set to this function's name
|
||||
def forest_fire_predictor(wind, rain, month, RH):
    # predict() expects a 2-D array of samples; return the single predicted value
    return regressor.predict([[wind, rain, month, RH]])[0]
|
||||
|
||||
# Hold onto information about your web service so
|
||||
# you can call it within the notebook later
|
||||
service_url = forest_fire_predictor.service.url
|
||||
api_key = forest_fire_predictor.service.api_key
|
||||
help_url = forest_fire_predictor.service.help_url
|
||||
service_id = forest_fire_predictor.service.service_id
|
||||
```
|
||||
|
||||
You can also go to the **Web Services** section of your Azure ML Studio workspace to see the predictive web service running there.
|
||||
|
||||
## Consume the web service
|
||||
|
||||
Next, consume the web service. To see if this works, try it here from the notebook session in which the web service was created. Just call the predictor directly:
|
||||
|
||||
|
||||
```python
|
||||
forest_fire_predictor.service(5.4, 0.2, 9, 22.1)
|
||||
```
|
||||
|
||||
At any later time, you can use the stored API key and service URL to call the service. In the example below, data can be packaged in JavaScript Object Notation (JSON) format and sent to the web service.
|
||||
|
||||
|
||||
```python
|
||||
import urllib2
|
||||
import json
|
||||
|
||||
data = {"Inputs": {
|
||||
"input1": {
|
||||
"ColumnNames": [ "wind", "rain", "month", "RH"],
|
||||
"Values": [["5.4", "0.2", "9", "22.1"]]
|
||||
}
|
||||
}, # Specified feature values
|
||||
|
||||
"GlobalParameters": {}
|
||||
}
|
||||
|
||||
body = json.dumps(data)
|
||||
headers = {'Content-Type':'application/json', 'Authorization':('Bearer '+ api_key)}
|
||||
req = urllib2.Request(service_url, body, headers)
|
||||
|
||||
try:
|
||||
response = urllib2.urlopen(req)
|
||||
result = json.loads(response.read()) # load JSON-formatted string response as dictionary
|
||||
print(result['Results']['output1']['value']['Values'][0][0]) # Get the returned prediction
|
||||
|
||||
except urllib2.HTTPError, error:
|
||||
print("The request failed with status code: " + str(error.code))
|
||||
print(error.info())
|
||||
print(json.loads(error.read()))
|
||||
```
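
The cell above targets Python 2, matching the `azureml` requirement noted earlier. If you are only calling the finished web service from a Python 3 environment, a sketch using the `requests` library might look like the following; it assumes you have the same `service_url` and `api_key` values available (for example, copied from the cell where the service was published).


```python
# Python 3 sketch for calling the deployed web service with requests
import requests

data = {
    "Inputs": {
        "input1": {
            "ColumnNames": ["wind", "rain", "month", "RH"],
            "Values": [["5.4", "0.2", "9", "22.1"]]
        }
    },
    "GlobalParameters": {}
}

headers = {"Content-Type": "application/json",
           "Authorization": "Bearer " + api_key}

response = requests.post(service_url, json=data, headers=headers)
if response.ok:
    result = response.json()
    print(result["Results"]["output1"]["value"]["Values"][0][0])  # the returned prediction
else:
    print("The request failed with status code: " + str(response.status_code))
    print(response.text)
```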
|
||||
|
||||
### Exercise:
|
||||
|
||||
Try this same process of training and hosting a model through Azure ML Studio with the Pima Indians Diabetes dataset (in CSV format in your data folder). The dataset has nine columns; use any of the eight features you see fit to try and predict the ninth column, Outcome (1 = diabetes, 0 = no diabetes).
|
||||
|
||||
> **Takeaway**: In this part, you explored fitting a model and deploying it as a web service. You did this by using now-familiar tools in an Azure Notebook to build a model relating variables surrounding forest fires and then posting that as a function in Azure ML Studio. From there, you saw how you and others can access the pre-fitted models to make predictions on new data from anywhere on the web.
|
||||
|
||||
You have now created your own ML web service. Let's now see how you can also interact with existing ML web services for even more sophisticated applications.
|
|
|
||||
# Section 3: Azure Cognitive Services
|
||||
|
||||
Just as you created a web service that could consume data and return predictions, so there are many AI software-as-a-service (SaaS) offerings on the web that will return predictions or classifications based on data you supply to them. One family of these is Microsoft Azure Cognitive Services.
|
||||
|
||||
The advantage of using cloud-based services is that they provide cutting-edge models that you can access without having to train them. This can help accelerate both your exploration and use of ML.
|
||||
|
||||
Azure provides Cognitive Services APIs that can be consumed using Python to conduct image recognition, speech recognition, and text recognition, just to name a few. For the purposes of this notebook, we're going to look at using the Computer Vision API and the Text Analytics API.
|
||||
|
||||
First, we’ll start by obtaining a Cognitive Services API key. Note that you can get a free key for seven days, and then you'll be required to pay.
|
||||
|
||||
To learn more about pricing for Cognitive Services, see https://azure.microsoft.com/en-us/pricing/details/cognitive-services/
|
||||
|
||||
Browse to **Try Azure Cognitive Services** at https://azure.microsoft.com/en-us/try/cognitive-services/
|
||||
|
||||
1. Select **Vision API**.
|
||||
2. Select **Computer Vision**.
|
||||
3. Click **Get API key**.
|
||||
4. If prompted for credentials, select **Free 7-day trial**.
|
||||
|
||||
Complete the above steps to also retrieve a Text Analytics API key from the Language APIs category. (You can also do this by scrolling down on the page with your API keys and clicking **Add** under the appropriate service.)
|
||||
|
||||
Once you have your API keys in hand, you're ready to start.
|
||||
|
||||
> **Learning goal:** By the end of this part, you should have a basic comfort with accessing cloud-based cognitive services by API from a Python environment.
|
||||
|
||||
## Azure Cognitive Services Computer Vision
|
||||
|
||||
Computer vision is a hot topic in academic AI research and in business, medical, government, and environmental applications. We will explore it here by seeing firsthand how computers can tag and identify images.
|
||||
|
||||
The first step in using the Cognitive Services Computer Vision API is to create a client object using the ComputerVisionClient class.
|
||||
|
||||
Replace **ACCOUNT_ENDPOINT** with the account endpoint provided from the free trial. Replace **ACCOUNT_KEY** with the account key provided from the free trial.
|
||||
|
||||
|
||||
```python
|
||||
!pip install azure-cognitiveservices-vision-computervision
|
||||
```
|
||||
|
||||
|
||||
```python
|
||||
from azure.cognitiveservices.vision.computervision import ComputerVisionClient
|
||||
from azure.cognitiveservices.vision.computervision.models import VisualFeatureTypes
|
||||
from msrest.authentication import CognitiveServicesCredentials
|
||||
|
||||
# Set your endpoint and key from the Cognitive Services free trial
endpoint = 'ACCOUNT_ENDPOINT'
# Example: endpoint = 'https://westcentralus.api.cognitive.microsoft.com'
key = 'ACCOUNT_KEY'
# Example: key = '1234567890abcdefghijklmnopqrstuv'
|
||||
|
||||
# Set credentials
|
||||
credentials = CognitiveServicesCredentials(key)
|
||||
|
||||
# Create client
|
||||
client = ComputerVisionClient(endpoint, credentials)
|
||||
```
|
||||
|
||||
Now that we have a client object to work with, let's see what we can do.
|
||||
|
||||
Using analyze_image, we can see the properties of the image with VisualFeatureTypes.tags.
|
||||
|
||||
|
||||
```python
|
||||
url = 'https://cdn.pixabay.com/photo/2014/05/02/23/54/times-square-336508_960_720.jpg'
|
||||
|
||||
image_analysis = client.analyze_image(url,visual_features=[VisualFeatureTypes.tags])
|
||||
|
||||
for tag in image_analysis.tags:
|
||||
print(tag)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# How can you use the code above to also see the description using VisualFeatureTypes property?
|
||||
```
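
One way to approach this exercise, sketched under the assumption that the `client` object created above is still available: request both `tags` and `description` in `visual_features` and read the generated captions. The attribute names follow the SDK version installed above.


```python
# Sketch: request tags and a description for the same image in one call
image_analysis = client.analyze_image(
    url,
    visual_features=[VisualFeatureTypes.tags, VisualFeatureTypes.description]
)

for caption in image_analysis.description.captions:
    print(caption.text)
    print(caption.confidence)
```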
|
||||
|
||||
Now let's look at the subject domain of the image. An example of a domain is celebrity.
|
||||
As of now, the analyze_image_by_domain method only supports celebrities and landmarks domain-specific models.
|
||||
|
||||
|
||||
```python
|
||||
# This will list the available subject domains
|
||||
models = client.list_models()
|
||||
|
||||
for x in models.models_property:
|
||||
print(x)
|
||||
```
|
||||
|
||||
Let's analyze an image by domain:
|
||||
|
||||
|
||||
```python
|
||||
# Type of prediction
|
||||
domain = "landmarks"
|
||||
|
||||
# Public-domain image of Seattle
|
||||
url = "https://images.pexels.com/photos/37350/space-needle-seattle-washington-cityscape.jpg"
|
||||
|
||||
# English-language response
|
||||
language = "en"
|
||||
|
||||
analysis = client.analyze_image_by_domain(domain, url, language)
|
||||
|
||||
for landmark in analysis.result["landmarks"]:
|
||||
print(landmark["name"])
|
||||
print(landmark["confidence"])
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# How can you use the code above to predict an image of a celebrity?
|
||||
# Using this image, https://images.pexels.com/photos/270968/pexels-photo-270968.jpeg?
|
||||
# Remember that the domains were printed out earlier.
|
||||
```
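
One possible approach, as a sketch: switch the domain to `celebrities` (one of the models listed earlier) and point the same call at the suggested image. The `celebrities` key used below follows the domain-model response format.


```python
# Sketch: analyze an image with the celebrities domain-specific model
domain = "celebrities"
url = "https://images.pexels.com/photos/270968/pexels-photo-270968.jpeg"
language = "en"

analysis = client.analyze_image_by_domain(domain, url, language)

for celebrity in analysis.result["celebrities"]:
    print(celebrity["name"])
    print(celebrity["confidence"])
```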
|
||||
|
||||
Let's see how we can get a text description of an image using the describe_image method. Use max_descriptions to specify how many candidate descriptions the API service should return.
|
||||
|
||||
|
||||
```python
|
||||
domain = "landmarks"
|
||||
url = "https://images.pexels.com/photos/726484/pexels-photo-726484.jpeg"
|
||||
language = "en"
|
||||
max_descriptions = 3
|
||||
|
||||
analysis = client.describe_image(url, max_descriptions, language)
|
||||
|
||||
for caption in analysis.captions:
|
||||
print(caption.text)
|
||||
print(caption.confidence)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What other descriptions can be found with other images?
|
||||
# What happens if you change the count of descriptions to output?
|
||||
|
||||
```
|
||||
|
||||
Let's say that the images contain text. How do we retrieve that information? Two methods are needed for this type of call: batch_read_file and get_read_operation_result. TextOperationStatusCodes is used to ensure that the batch_read_file call has completed before the text is read from the image.
|
||||
|
||||
|
||||
```python
|
||||
# import models
|
||||
from azure.cognitiveservices.vision.computervision.models import TextRecognitionMode
|
||||
from azure.cognitiveservices.vision.computervision.models import TextOperationStatusCodes
|
||||
import time
|
||||
|
||||
url = "https://images.pexels.com/photos/6375/quote-chalk-think-words.jpg"
|
||||
mode = TextRecognitionMode.handwritten
|
||||
raw = True
|
||||
custom_headers = None
|
||||
numberOfCharsInOperationId = 36
|
||||
|
||||
# Async SDK call
|
||||
rawHttpResponse = client.batch_read_file(url, mode, custom_headers, raw)
|
||||
|
||||
# Get ID from returned headers
|
||||
operationLocation = rawHttpResponse.headers["Operation-Location"]
|
||||
idLocation = len(operationLocation) - numberOfCharsInOperationId
|
||||
operationId = operationLocation[idLocation:]
|
||||
|
||||
# SDK call
|
||||
while True:
|
||||
result = client.get_read_operation_result(operationId)
|
||||
if result.status not in ['NotStarted', 'Running']:
|
||||
break
|
||||
time.sleep(1)
|
||||
|
||||
# Get data
|
||||
if result.status == TextOperationStatusCodes.succeeded:
|
||||
for textResult in result.recognition_results:
|
||||
for line in textResult.lines:
|
||||
print(line.text)
|
||||
print(line.bounding_box)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What other images with words can be analyzed?
|
||||
```
|
||||
|
||||
You can find additional Cognitive Services demonstrations at the following URLs:
|
||||
- https://aidemos.microsoft.com/
|
||||
- https://github.com/microsoft/computerscience/blob/master/Events%20and%20Hacks/Student%20Hacks/hackmit/cogservices_demos/
|
||||
- https://azure.microsoft.com/en-us/services/cognitive-services/directory/
|
||||
|
||||
Images come in varying sizes, and there might be cases where you want to create a thumbnail of the image. For this, we need to install the Pillow library, which you can learn about at https://python-pillow.org/. Pillow is a fork of PIL, the Python Imaging Library, and provides image-processing capabilities.
|
||||
|
||||
|
||||
```python
|
||||
# Install Pillow
|
||||
!pip install Pillow
|
||||
```
|
||||
|
||||
Now that the Pillow library is installed, we will import the Image module and create a thumbnail from a provided image. (Once generated, you can find the thumbnail image in your project folder on Azure Notebooks.)
|
||||
|
||||
|
||||
```python
|
||||
# Pillow package
|
||||
from PIL import Image
|
||||
|
||||
# IO package to create local image
|
||||
import io
|
||||
|
||||
width = 50
|
||||
height = 50
|
||||
url = "https://images.pexels.com/photos/37350/space-needle-seattle-washington-cityscape.jpg"
|
||||
|
||||
thumbnail = client.generate_thumbnail(width, height, url)
|
||||
|
||||
# The SDK returns the thumbnail as a stream of byte chunks; join them before opening
image = Image.open(io.BytesIO(b"".join(thumbnail)))

image.save('thumbnail.jpg')
|
||||
```
|
||||
|
||||
> **Takeaway:** In this subsection, you explored how to access computer-vision cognitive services by API. Specifically, you used tools to analyze and describe images that you submitted to these services.
|
||||
|
||||
## Azure Cognitive Services Text Analytics
|
||||
|
||||
Another area where cloud-based AI shines is text analytics. Like computer vision, identifying and pulling meaning from natural human languages is really the intersection of a lot of specialized disciplines, so using cloud services for it provides an economical means of tapping a lot of cognitive horsepower.
|
||||
|
||||
To prepare to use the Cognitive Services Text Analytics API, import the requests library, along with helpers to pretty-print JSON and display HTML.
|
||||
|
||||
|
||||
```python
|
||||
import requests
|
||||
# pprint is pretty print (formats the JSON)
|
||||
from pprint import pprint
|
||||
from IPython.display import HTML
|
||||
```
|
||||
|
||||
Replace 'ACCOUNT_KEY' with the API key that you obtained when you created the seven-day free trial account.
|
||||
|
||||
|
||||
```python
|
||||
subscription_key = 'ACCOUNT_KEY'
|
||||
assert subscription_key
|
||||
|
||||
# If using a Free Trial account, this URL does not need to be updated.
|
||||
# If using a paid account, verify that it matches the region where the
|
||||
# Text Analytics Service was setup.
|
||||
text_analytics_base_url = "https://westcentralus.api.cognitive.microsoft.com/text/analytics/v2.1/"
|
||||
```
|
||||
|
||||
### Text Analytics API
|
||||
|
||||
Now it's time to start processing some text and detecting its language.
|
||||
|
||||
To verify the URL endpoint for text_analytics_base_url, run the following:
|
||||
|
||||
|
||||
```python
|
||||
language_api_url = text_analytics_base_url + "languages"
|
||||
print(language_api_url)
|
||||
```
|
||||
|
||||
The API requires that the payload be formatted in the form of documents containing `id` and `text` attributes:
|
||||
|
||||
|
||||
```python
|
||||
documents = { 'documents': [
|
||||
{ 'id': '1', 'text': 'This is a document written in English.' },
|
||||
{ 'id': '2', 'text': 'Este es un documento escrito en Español.' },
|
||||
{ 'id': '3', 'text': '这是一个用中文写的文件' },
|
||||
{ 'id': '4', 'text': 'Ez egy magyar nyelvű dokumentum.' },
|
||||
{ 'id': '5', 'text': 'Dette er et dokument skrevet på dansk.' },
|
||||
{ 'id': '6', 'text': 'これは日本語で書かれた文書です。' }
|
||||
]}
|
||||
```
|
||||
|
||||
The next lines of code call the API service using the requests library to determine the languages that were passed in from the documents:
|
||||
|
||||
|
||||
```python
|
||||
headers = {"Ocp-Apim-Subscription-Key": subscription_key}
|
||||
response = requests.post(language_api_url, headers=headers, json=documents)
|
||||
languages = response.json()
|
||||
pprint(languages)
|
||||
```
|
||||
|
||||
The next code cell outputs the documents in a table format with the language information for each document:
|
||||
|
||||
|
||||
```python
|
||||
table = []
|
||||
for document in languages["documents"]:
|
||||
text = next(filter(lambda d: d["id"] == document["id"], documents["documents"]))["text"]
|
||||
langs = ", ".join(["{0}({1})".format(lang["name"], lang["score"]) for lang in document["detectedLanguages"]])
|
||||
table.append("<tr><td>{0}</td><td>{1}</td>".format(text, langs))
|
||||
HTML("<table><tr><th>Text</th><th>Detected languages(scores)</th></tr>{0}</table>".format("\n".join(table)))
|
||||
```
|
||||
|
||||
The service did a pretty good job of identifying the languages. It did, however, confidently identify the Danish phrase as Norwegian, but in fairness, even linguists argue as to whether Danish and Norwegian constitute distinct languages or are dialects of the same language. (**Note:** Danes and Norwegians have no doubts on the subject.)
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Create another document set of text and use the text analytics API to detect the language for the text.
|
||||
```
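
As a sketch of what a solution might look like (the phrases below are only illustrative), you can reuse the `language_api_url` and `headers` defined above with a new document set:


```python
# Sketch: detect the language of a few additional phrases
more_documents = {'documents': [
    {'id': '1', 'text': 'Ceci est un document écrit en français.'},
    {'id': '2', 'text': 'Dies ist ein auf Deutsch verfasstes Dokument.'}
]}

response = requests.post(language_api_url, headers=headers, json=more_documents)
pprint(response.json())
```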
|
||||
|
||||
### Sentiment Analysis API
|
||||
|
||||
Now that we know how to use the Text Analytics API to detect the language, let's use it for sentiment analysis. Basically, the computers at the other end of the API connection will judge the sentiments of written phrases (anywhere on the spectrum of positive to negative) based solely on the context clues provided by the text.
|
||||
|
||||
|
||||
```python
|
||||
# Verify the API URL for the Sentiment Analysis API
|
||||
sentiment_api_url = text_analytics_base_url + "sentiment"
|
||||
print(sentiment_api_url)
|
||||
```
|
||||
|
||||
As above, the Sentiment Analysis API requires the input to be passed in as documents with `id`, `language`, and `text` attributes.
|
||||
|
||||
|
||||
```python
|
||||
documents = {'documents' : [
|
||||
{'id': '1', 'language': 'en', 'text': 'I had a wonderful experience! The rooms were wonderful and the staff was helpful.'},
|
||||
{'id': '2', 'language': 'en', 'text': 'I had a terrible time at the hotel. The staff was rude and the food was awful.'},
|
||||
{'id': '3', 'language': 'es', 'text': 'Los caminos que llevan hasta Monte Rainier son espectaculares y hermosos.'},
|
||||
{'id': '4', 'language': 'es', 'text': 'La carretera estaba atascada. Había mucho tráfico el día de ayer.'}
|
||||
]}
|
||||
```
|
||||
|
||||
Let's analyze the text using the Sentiment Analysis API to output a sentiment analysis score:
|
||||
|
||||
|
||||
```python
|
||||
headers = {"Ocp-Apim-Subscription-Key": subscription_key}
|
||||
response = requests.post(sentiment_api_url, headers=headers, json=documents)
|
||||
sentiments = response.json()
|
||||
pprint(sentiments)
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# Create another document set with varying degree of sentiment and use the Sentiment Analysis API to detect what
|
||||
# the sentiment is
|
||||
```
|
||||
|
||||
### Key Phrases API
|
||||
|
||||
We've detected the language type using the Text Analytics API and the sentiment using the Sentiment Analysis API. What if we want to detect key phrases in the text? We can use the Key Phrase API.
|
||||
|
||||
|
||||
```python
|
||||
# As with the other services, setup the Key Phrases API with the following parameters
|
||||
key_phrase_api_url = text_analytics_base_url + "keyPhrases"
|
||||
print(key_phrase_api_url)
|
||||
```
|
||||
|
||||
Create the documents to pass to the Key Phrases API, with `id`, `language`, and `text` attributes.
|
||||
|
||||
|
||||
```python
|
||||
documents = {'documents' : [
|
||||
{'id': '1', 'language': 'en', 'text': 'I had a wonderful experience! The rooms were wonderful and the staff was helpful.'},
|
||||
{'id': '2', 'language': 'en', 'text': 'I had a terrible time at the hotel. The staff was rude and the food was awful.'},
|
||||
{'id': '3', 'language': 'es', 'text': 'Los caminos que llevan hasta Monte Rainier son espectaculares y hermosos.'},
|
||||
{'id': '4', 'language': 'es', 'text': 'La carretera estaba atascada. Había mucho tráfico el día de ayer.'}
|
||||
]}
|
||||
```
|
||||
|
||||
Now, call the Key Phrases API with the formatted documents to retrieve the key phrases.
|
||||
|
||||
|
||||
```python
|
||||
headers = {'Ocp-Apim-Subscription-Key': subscription_key}
|
||||
response = requests.post(key_phrase_api_url, headers=headers, json=documents)
|
||||
key_phrases = response.json()
|
||||
pprint(key_phrases)
|
||||
```
|
||||
|
||||
We can make this easier to read by outputting the documents in an HTML table format.
|
||||
|
||||
|
||||
```python
|
||||
table = []
|
||||
for document in key_phrases["documents"]:
|
||||
text = next(filter(lambda d: d["id"] == document["id"], documents["documents"]))["text"]
|
||||
phrases = ",".join(document["keyPhrases"])
|
||||
table.append("<tr><td>{0}</td><td>{1}</td>".format(text, phrases))
|
||||
HTML("<table><tr><th>Text</th><th>Key phrases</th></tr>{0}</table>".format("\n".join(table)))
|
||||
```
|
||||
|
||||
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What other key phrases can you come up with for analysis?
|
||||
```
|
||||
|
||||
### Entities API
|
||||
|
||||
The final API we will use in the Text Analytics API service is the Entities API. It identifies well-known entities (such as people, places, and organizations) in the documents provided to the API service.
|
||||
|
||||
|
||||
```python
|
||||
# Configure the Entities URI
|
||||
entity_linking_api_url = text_analytics_base_url + "entities"
|
||||
print(entity_linking_api_url)
|
||||
```
|
||||
|
||||
The next step is creating a document with id and text attributes to pass on to the Entities API.
|
||||
|
||||
|
||||
```python
|
||||
documents = {'documents' : [
|
||||
{'id': '1', 'text': 'Microsoft is an IT company.'}
|
||||
]}
|
||||
```
|
||||
|
||||
Finally, call the service using the REST call below to retrieve the entities identified in the text attribute.
|
||||
|
||||
|
||||
```python
|
||||
headers = {"Ocp-Apim-Subscription-Key": subscription_key}
|
||||
response = requests.post(entity_linking_api_url, headers=headers, json=documents)
|
||||
entities = response.json()
|
||||
entities
|
||||
```
|
||||
|
||||
### Exercise:
|
||||
|
||||
|
||||
```python
|
||||
# What other entities can be retrieved with the API?
|
||||
# Create a document setup and use the Text Analytics, Sentiment Analysis,
|
||||
# Key Phrase, and Entities API services to retrieve the data.
|
||||
|
||||
```
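
As one possible starting point for this exercise (a sketch that reuses `text_analytics_base_url`, `subscription_key`, `requests`, and `pprint` from the cells above), you can push a single document set through several of the endpoints in turn; the sample text is only illustrative.


```python
# Sketch: run one document set through the sentiment, key-phrase, and entity endpoints
headers = {"Ocp-Apim-Subscription-Key": subscription_key}

documents = {'documents': [
    {'id': '1', 'language': 'en',
     'text': 'Azure Cognitive Services makes it easy to add AI to applications.'}
]}

for suffix in ["sentiment", "keyPhrases", "entities"]:
    response = requests.post(text_analytics_base_url + suffix,
                             headers=headers, json=documents)
    print(suffix)
    pprint(response.json())
```

The language-detection endpoint can be called the same way, using documents that contain only `id` and `text` attributes.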
|
||||
|
||||
> **Takeaway:** In this subsection, you explored text analytics in the cloud. Specifically, you used a variety of different APIs to extract different information from text: language, sentiment, key phrases, and entities.
|
||||
|
||||
That's it for the instructional portion of this course. In these eight sections, you've now seen the range of tools that go into preparing data for analysis and performing ML and AI analysis on data. In the next, concluding section, you will bring these skills together in a final project.
|