This commit is contained in:
Christopher Harrison 2020-04-27 16:49:00 -07:00
Parent 3bdb05adcf
Commit 311866d09e
236 changed files with 2107024 additions and 50 deletions

View file

@ -1,14 +0,0 @@
# Calling APIs
You can call functions written by others and hosted on web servers from your own code; these functions are exposed as APIs.
[Microsoft Azure Cognitive Services](https://docs.microsoft.com/en-ca/azure/cognitive-services/) contains a number of APIs you can call from your code to add intelligence to your apps and websites.
In the code example you call the [Analyze Image](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fa0) function of the [Computer Vision](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/) service.
Calling the API requires (see the sketch after this list):
- [API Key](https://azure.microsoft.com/en-ca/try/cognitive-services/) to give you permission to call the API
- Address or Endpoint of the service
- Function name of the method to call as listed in the [API documentation](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fa)
- Function parameters as listed in the [API documentation](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fa)
- HTTP Headers as listed in the [API documentation](https://westus.dev.cognitive.microsoft.com/docs/services/5adf991815e1060e6355ad44/operations/56f91f2e778daf14a499e1fa)
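Below is a minimal sketch of such a call using the `requests` package. The key, endpoint, API version, and image URL are placeholders (assumptions, not values from this repository); substitute the ones from your own Azure subscription and the API documentation.

```python
import requests

subscription_key = '<your-api-key>'                        # API Key granting permission to call the API (placeholder)
endpoint = 'https://westus.api.cognitive.microsoft.com'    # Address/Endpoint of the service (placeholder)
analyze_url = endpoint + '/vision/v2.0/analyze'            # function name of the method, per the API documentation

headers = {'Ocp-Apim-Subscription-Key': subscription_key}  # HTTP headers
params = {'visualFeatures': 'Description,Tags'}            # function parameters
body = {'url': 'https://example.com/picture.jpg'}          # hypothetical image to analyze

# POST the request and print the JSON analysis returned by the service
response = requests.post(analyze_url, headers=headers, params=params, json=body)
response.raise_for_status()
print(response.json())
```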

View file

@ -1,9 +1,9 @@
# Microsoft Open Source Code of Conduct
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/?WT.mc_id=python-c9-niner).
Resources:
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with questions or concerns
- [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/?WT.mc_id=python-c9-niner)
- [Microsoft Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/?WT.mc_id=python-c9-niner)
- Contact [opencode@microsoft.com](mailto:opencode@microsoft.com?WT.mc_id=python-c9-niner) with questions or concerns

View file

@ -2,47 +2,51 @@
## Overview
The series of videos on Channel 9 is designed to help get you up to speed on Python. If you're a beginning developer who's looking to add Python to your quiver of languages, or trying to get started on a data science or web project, these videos can help teach you the foundation necessary to walk through a quick start or other tutorial.
These three series on Channel 9 and YouTube are designed to help get you up to speed on Python. If you're a beginning developer looking to add Python to your quiver of languages, or trying to get started on a data science or web project which uses Python, these videos are here to help show you the foundations necessary to walk through a tutorial or other quick start.
We do assume you are familiar with another programming language, and some core programming concepts. For example, we highlight the syntax for boolean expressions and creating classes, but we don't dig into what a [boolean](https://en.wikipedia.org/wiki/Boolean_data_type) is or [object oriented design](https://en.wikipedia.org/wiki/Object-oriented_design). We show you how to perform the tasks you're familiar with in other languages in Python.
### What you'll learn
- The basics of Python
- Starting a project
- Common syntax
- Package management
### What we don't cover
- Class design and inheritance
- Asynchronous programming
- Basics of programming
- Popular packages
## Prerequisites
- Light experience with another programming language, such as [JavaScript](https://www.edx.org/course/javascript-introduction), [Java](https://www.java.com) or [C#](https://docs.microsoft.com/dotnet/csharp/)
- [An understanding of Git](https://git-scm.com/book/en/v1/Getting-Started)
- Light experience with another programming language, such as [JavaScript](https://www.edx.org/course/javascript-introduction)
## Courses
### Getting started
[Python for beginners](https://aka.ms/pythonbeginnerseries) is the perfect place to get started. No Python experience is required! We'll show you how to set up [Visual Studio Code](https://code.visualstudio.com?WT.mc_id=python-c9-niner) as your code editor, and start creating Python code. You'll see how to create, structure and run your code, how to manage packages, and even make [REST calls](https://en.wikipedia.org/wiki/Representational_state_transfer).
### Dig a little deeper
[More Python for beginners](https://aka.ms/morepython) digs deeper into Python syntax. You'll explore how to create classes and mixins in Python, how to work with the file system, and get an introduction to `async/await`. This is the perfect next step if you're looking to see a bit more of what Python can do.
### Peek at data science tools
[Even more Python for beginners](https://aka.ms/evenmorepython) is a practical exploration of a couple of the most common packages and tools you'll use when working with data and machine learning. While we won't dig into why you choose particular machine learning models (that's another course), you will get hands on with Jupyter Notebooks, and create and test models using scikit-learn and pandas.
## Next steps
As the goal of this course is to help get you up to speed on Python so you can work through a quick start, the next step after completing the videos is to follow a tutorial! Here's a few of our favorites:
As the goal of these courses is to help get you up to speed on Python so you can work through a quick start, the next step after completing the videos is to follow a tutorial! Here are a few of our favorites:
- [Quickstart: Detect faces in an image using the Face REST API and Python](https://docs.microsoft.com/en-us/azure/cognitive-services/face/QuickStarts/Python?WT.mc_id=python-c9-niner)
- [Quickstart: Analyze a local image using the Computer Vision REST API and Python](https://docs.microsoft.com/en-us/azure/cognitive-services/computer-vision/quickstarts/python-disk?WT.mc_id=python-c9-niner)
- [Quickstart: Using the Python REST API to call the Text Analytics Cognitive Service](https://docs.microsoft.com/en-us/azure/cognitive-services/Text-Analytics/quickstarts/python?WT.mc_id=python-c9-niner)
- [Tutorial: Build a Flask app with Azure Cognitive Services](https://docs.microsoft.com/en-us/azure/cognitive-services/translator/tutorial-build-flask-app-translation-synthesis)
- [Quickstart: Detect faces in an image using the Face REST API and Python](https://docs.microsoft.com/azure/cognitive-services/face/QuickStarts/Python?WT.mc_id=python-c9-niner)
- [Quickstart: Analyze a local image using the Computer Vision REST API and Python](https://docs.microsoft.com/azure/cognitive-services/computer-vision/quickstarts/python-disk?WT.mc_id=python-c9-niner)
- [Quickstart: Using the Python REST API to call the Text Analytics Cognitive Service](https://docs.microsoft.com/azure/cognitive-services/Text-Analytics/quickstarts/python?WT.mc_id=python-c9-niner)
- [Tutorial: Build a Flask app with Azure Cognitive Services](https://docs.microsoft.com/azure/cognitive-services/translator/tutorial-build-flask-app-translation-synthesis?WT.mc_id=python-c9-niner)
- [Flask tutorial in Visual Studio Code](https://code.visualstudio.com/docs/python/tutorial-flask?WT.mc_id=python-c9-niner)
- [Django tutorial in Visual Studio Code](https://code.visualstudio.com/docs/python/tutorial-django?WT.mc_id=python-c9-niner)
- [Predict flight delays by creating a machine learning model in Python](https://docs.microsoft.com/learn/modules/predict-flight-delays-with-python?WT.mc_id=python-c9-niner)
## Contributing
This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.
When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/). For more information see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.

View file

@ -2,19 +2,19 @@
## Security
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [many more](https://opensource.microsoft.com/).
Microsoft takes the security of our software products and services seriously, which includes all source code repositories managed through our GitHub organizations, which include [Microsoft](https://github.com/Microsoft), [Azure](https://github.com/Azure), [DotNet](https://github.com/dotnet), [AspNet](https://github.com/aspnet), [Xamarin](https://github.com/xamarin), and [many more](https://opensource.microsoft.com/?WT.mc_id=python-c9-niner).
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets Microsoft's [definition](https://docs.microsoft.com/en-us/previous-versions/tn-archive/cc751383(v=technet.10)) of a security vulnerability, please report it to us as described below.
If you believe you have found a security vulnerability in any Microsoft-owned repository that meets Microsoft's [definition](https://docs.microsoft.com/previous-versions/tn-archive/cc751383(v=technet.10)?WT.mc_id=python-c9-niner) of a security vulnerability, please report it to us as described below.
## Reporting Security Issues
**Please do not report security vulnerabilities through public GitHub issues.**
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report).
Instead, please report them to the Microsoft Security Response Center (MSRC) at [https://msrc.microsoft.com/create-report](https://msrc.microsoft.com/create-report?WT.mc_id=python-c9-niner).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/en-us/msrc/pgp-key-msrc).
If you prefer to submit without logging in, send email to [secure@microsoft.com](mailto:secure@microsoft.com). If possible, encrypt your message with our PGP key; please download it from the [Microsoft Security Response Center PGP Key page](https://www.microsoft.com/msrc/pgp-key-msrc?WT.mc_id=python-c9-niner).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc).
You should receive a response within 24 hours. If for some reason you do not, please follow up via email to ensure we received your original message. Additional information can be found at [microsoft.com/msrc](https://www.microsoft.com/msrc?WT.mc_id=python-c9-niner).
Please include the requested information listed below (as much as you can provide) to help us better understand the nature and scope of the possible issue:
@ -28,7 +28,7 @@ Please include the requested information listed below (as much as you can provid
This information will help us triage your report more quickly.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty) page for more details about our active programs.
If you are reporting for a bug bounty, more complete reports can contribute to a higher bounty award. Please visit our [Microsoft Bug Bounty Program](https://microsoft.com/msrc/bounty?WT.mc_id=python-c9-niner) page for more details about our active programs.
## Preferred Languages
@ -36,6 +36,6 @@ We prefer all communications to be in English.
## Policy
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/en-us/msrc/cvd).
Microsoft follows the principle of [Coordinated Vulnerability Disclosure](https://www.microsoft.com/msrc/cvd?WT.mc_id=python-c9-niner).
<!-- END MICROSOFT SECURITY.MD BLOCK -->

View file

@ -0,0 +1,18 @@
# Jupyter Notebooks
Jupyter Notebook is an open source web application that allows you to create and share Python code. Notebooks are frequently used for data science. The code samples in this course are completed using Jupyter Notebooks, which have a .ipynb file extension.
## Documentation
- [Jupyter](https://jupyter.org/) to install Jupyter so you can run Jupyter Notebooks locally on your computer
- [Jupyter Notebook viewer](https://nbviewer.jupyter.org/) to view Jupyter Notebooks in this GitHub repository without installing Jupyter
- [Azure Notebooks](https://notebooks.azure.com/) to create a free Azure Notebooks account to run Notebooks in the cloud
- [Create and run a notebook](https://docs.microsoft.com/azure/notebooks/tutorial-create-run-jupyter-notebook?WT.mc_id=python-c9-niner) is a tutorial that walks you through the process of using Azure Notebooks to create a complete Jupyter Notebook that demonstrates linear regression
- [How to create and clone projects](https://docs.microsoft.com/azure/notebooks/create-clone-jupyter-notebooks?WT.mc_id=python-c9-niner) to create a project
- [Manage and configure projects in Azure Notebooks](https://docs.microsoft.com/azure/notebooks/configure-manage-azure-notebooks-projects?WT.mc_id=python-c9-niner) to upload Notebooks to your project
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,14 @@
# Anaconda
[Anaconda](https://www.anaconda.com/) is an open source distribution of Python and R for data science. It includes more than 1500 packages, a graphical interface called Anaconda Navigator, a command line interface called Anaconda Prompt, and a tool called Conda.
## Conda
Python code often relies on external libraries stored in packages. Conda is an open source package management system and environment management system. Conda helps you manage environments and install packages for Jupyter Notebooks.
## Documentation
- [Conda home page](https://docs.conda.io/)
- [Managing Conda environments](https://docs.conda.io/projects/conda/latest/user-guide/tasks/manage-environments.html) to find links and instructions for creating Conda environments, activating, and de-activating Conda environments
- [Managing packages](https://docs.conda.io/projects/conda/latest/user-guide/getting-started.html#managing-packages) to learn how to install packages in a Conda environment
- [Conda cheat sheet](https://docs.conda.io/projects/conda/latest/user-guide/cheatsheet.html) is a handy quick reference of common Conda commands

View file

@ -0,0 +1,390 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# pandas Series and DataFrame"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## pandas\n",
"**pandas** is an open source library providing data structures and data analysis tools for Python programmers"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Series\n",
"The pandas **Series** is a one dimensional array, similar to a Python list"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Seattle-Tacoma\n",
"1 Dulles\n",
"2 London Heathrow\n",
"3 Schiphol\n",
"4 Changi\n",
"5 Pearson\n",
"6 Narita\n",
"dtype: object"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports = pd.Series([\n",
" 'Seattle-Tacoma', \n",
" 'Dulles', \n",
" 'London Heathrow', \n",
" 'Schiphol', \n",
" 'Changi', \n",
" 'Pearson', \n",
" 'Narita'\n",
" ])\n",
"\n",
"# When using a notebook, you can use the print statement\n",
"# print(airports) to examine the contents of a variable\n",
"# or you can print a value on the screen by just typing the object name\n",
"airports"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can reference an individual value in a Series using it's index"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'London Heathrow'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports[2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can use a loop to iterate through all the values in a Series"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Seattle-Tacoma\n",
"Dulles\n",
"London Heathrow\n",
"Schiphol\n",
"Changi\n",
"Pearson\n",
"Narita\n"
]
}
],
"source": [
"for value in airports:\n",
" print(value) "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## DataFrame\n",
"Most of the time when we are working with pandas we are dealing with two-dimensional arrays\n",
"\n",
"The pandas **DataFrame** can store two dimensional arrays"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 Seatte-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 London Heathrow London United Kingdom\n",
"3 Schiphol Amsterdam Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports = pd.DataFrame([\n",
" ['Seatte-Tacoma', 'Seattle', 'USA'],\n",
" ['Dulles', 'Washington', 'USA'],\n",
" ['London Heathrow', 'London', 'United Kingdom'],\n",
" ['Schiphol', 'Amsterdam', 'Netherlands'],\n",
" ['Changi', 'Singapore', 'Singapore'],\n",
" ['Pearson', 'Toronto', 'Canada'],\n",
" ['Narita', 'Tokyo', 'Japan']\n",
" ])\n",
"\n",
"airports"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the **columns** parameter to specify names for the columns when you create the DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seatte-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 London Heathrow London United Kingdom\n",
"3 Schiphol Amsterdam Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports = pd.DataFrame([\n",
" ['Seatte-Tacoma', 'Seattle', 'USA'],\n",
" ['Dulles', 'Washington', 'USA'],\n",
" ['London Heathrow', 'London', 'United Kingdom'],\n",
" ['Schiphol', 'Amsterdam', 'Netherlands'],\n",
" ['Changi', 'Singapore', 'Singapore'],\n",
" ['Pearson', 'Toronto', 'Canada'],\n",
" ['Narita', 'Tokyo', 'Japan']\n",
" ],\n",
" columns = ['Name', 'City', 'Country']\n",
" )\n",
"\n",
"airports "
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,14 @@
# pandas
[pandas](https://pandas.pydata.org) is an open source Python library that contains a number of high performance data structures and tools for data analysis.
## Documentation
- [Series](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.html) stores one-dimensional arrays
- [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/frame.html) stores two-dimensional arrays whose columns can contain different datatypes (see the sketch below)
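A minimal sketch of both structures (assuming pandas is installed and imported as `pd`):

```python
import pandas as pd

# A Series is a one-dimensional array
airports = pd.Series(['Seattle-Tacoma', 'Dulles', 'Heathrow'])
print(airports[0])          # 'Seattle-Tacoma'

# A DataFrame is a two-dimensional array with named columns
airports_df = pd.DataFrame(
    [['Seattle-Tacoma', 'Seattle', 'USA'],
     ['Dulles', 'Washington', 'USA'],
     ['Heathrow', 'London', 'United Kingdom']],
    columns=['Name', 'City', 'Country'])
print(airports_df.dtypes)   # each column can hold its own datatype
```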
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,381 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Examining pandas DataFrame contents\n",
"It's useful to be able to quickly examine the contents of a DataFrame. \n",
"\n",
"Let's start by importing the pandas library and creating a DataFrame populated with information about airports"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seatte-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Heathrow London United Kingdom\n",
"3 Schiphol Amsterdam Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports = pd.DataFrame([\n",
" ['Seatte-Tacoma', 'Seattle', 'USA'],\n",
" ['Dulles', 'Washington', 'USA'],\n",
" ['Heathrow', 'London', 'United Kingdom'],\n",
" ['Schiphol', 'Amsterdam', 'Netherlands'],\n",
" ['Changi', 'Singapore', 'Singapore'],\n",
" ['Pearson', 'Toronto', 'Canada'],\n",
" ['Narita', 'Tokyo', 'Japan']\n",
" ],\n",
" columns = ['Name', 'City', 'Country']\n",
" )\n",
"\n",
"airports "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Returning first *n* rows\n",
"If you have thousands of rows, you might just want to look at the first few rows\n",
"\n",
"* **head**(*n*) returns the top *n* rows "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seatte-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Heathrow London United Kingdom"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports.head(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Returning last *n* rows\n",
"Looking at the last rows in a DataFrame can be a good way to check that all your data loaded correctly\n",
"* **tail**(*n*) returns the last *n* rows"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports.tail(3)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Checkign number of rows and columns in DataFrame\n",
"Sometimes you just need to know how much data you have in the DataFrame\n",
"\n",
"* **shape** returns the number of rows and columns"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(7, 3)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Getting mroe detailed information about DataFrame contents\n",
"\n",
"* **info**() returns more detailed information about the DataFrame\n",
"\n",
"Information returned includes:\n",
"* The number of rows, and the range of index values\n",
"* The number of columns\n",
"* For each column: column name, number of non-null values, the datatype\n"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 7 entries, 0 to 6\n",
"Data columns (total 3 columns):\n",
"Name 7 non-null object\n",
"City 7 non-null object\n",
"Country 7 non-null object\n",
"dtypes: object(3)\n",
"memory usage: 148.0+ bytes\n"
]
}
],
"source": [
"airports.info()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,10 @@
# Examining pandas DataFrame contents
The pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) is a structure for storing two-dimensional tabular data.
## Common functions
- [head](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.head.html) returns the first *n* rows from the DataFrame
- [tail](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.tail.html) returns the last *n* rows from the DataFrame
- [shape](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.shape.html) returns the dimensions of the DataFrame (i.e. the number of rows and columns)
- [info](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.info.html) provides a summary of the DataFrame content including column names, their datatypes, and the number of rows containing non-null values (see the sketch below)
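A minimal sketch of these calls, assuming a small DataFrame like the one built in the accompanying notebook:

```python
import pandas as pd

airports_df = pd.DataFrame(
    [['Seattle-Tacoma', 'Seattle', 'USA'],
     ['Dulles', 'Washington', 'USA'],
     ['Heathrow', 'London', 'United Kingdom']],
    columns=['Name', 'City', 'Country'])

print(airports_df.head(2))   # first 2 rows
print(airports_df.tail(2))   # last 2 rows
print(airports_df.shape)     # (3, 3) -- (rows, columns)
airports_df.info()           # column names, non-null counts, and datatypes
```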

View file

@ -0,0 +1,812 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Query a pandas DataFrame \n",
"\n",
"Returning a portion of the data in a DataFrame is called slicing or dicing the data\n",
"\n",
"There are many different ways to query a pandas DataFrame, here are a few to get you started"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seatte-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 London Heathrow London United Kingdom\n",
"3 Schiphol Amsterdam Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports = pd.DataFrame([\n",
" ['Seatte-Tacoma', 'Seattle', 'USA'],\n",
" ['Dulles', 'Washington', 'USA'],\n",
" ['London Heathrow', 'London', 'United Kingdom'],\n",
" ['Schiphol', 'Amsterdam', 'Netherlands'],\n",
" ['Changi', 'Singapore', 'Singapore'],\n",
" ['Pearson', 'Toronto', 'Canada'],\n",
" ['Narita', 'Tokyo', 'Japan']\n",
" ],\n",
" columns = ['Name', 'City', 'Country']\n",
" )\n",
"airports "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Return one column\n",
"Specify the name of the column you want to return\n",
"* *DataFrameName*['*columnName*']\n"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 Seattle\n",
"1 Washington\n",
"2 London\n",
"3 Amsterdam\n",
"4 Singapore\n",
"5 Toronto\n",
"6 Tokyo\n",
"Name: City, dtype: object"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports['City']"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Return multiple columns\n",
"Provide a list of the columns you want to return\n",
"* *DataFrameName*[['*FirstColumnName*','*SecondColumnName*',...]]"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London Heathrow</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Country\n",
"0 Seatte-Tacoma USA\n",
"1 Dulles USA\n",
"2 London Heathrow United Kingdom\n",
"3 Schiphol Netherlands\n",
"4 Changi Singapore\n",
"5 Pearson Canada\n",
"6 Narita Japan"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports[['Name', 'Country']]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using *iloc* to specify rows and columns to return\n",
"**iloc**[*rows*,*columns*] allows you to access a group of rows or columns by row and column index positions."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You specify the specific row and column you want returned\n",
"* First row is row 0\n",
"* First column is column 0"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'Seatte-Tacoma'"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Return the value in the first row, first column\n",
"airports.iloc[0,0]"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'United Kingdom'"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Return the value in the third row, third column\n",
"airports.iloc[2,2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A value of *:* returns all rows or all columns"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seatte-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 London Heathrow London United Kingdom\n",
"3 Schiphol Amsterdam Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports.iloc[:,:]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can request a range of rows or a range of columns\n",
"* [x:y] will return rows or columns x through y"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seatte-Tacoma Seattle USA\n",
"1 Dulles Washington USA"
]
},
"execution_count": 10,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Return the first two rows and display all columns \n",
"airports.iloc[0:2,:]"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>Seattle</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London Heathrow</td>\n",
" <td>London</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City\n",
"0 Seatte-Tacoma Seattle\n",
"1 Dulles Washington\n",
"2 London Heathrow London\n",
"3 Schiphol Amsterdam\n",
"4 Changi Singapore\n",
"5 Pearson Toronto\n",
"6 Narita Tokyo"
]
},
"execution_count": 11,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Return all rows and display the first two columns\n",
"airports.iloc[:,0:2]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"You can request a list of rows or a list of columns\n",
"* [x,y,z] will return rows or columns x,y, and z"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London Heathrow</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Country\n",
"0 Seatte-Tacoma USA\n",
"1 Dulles USA\n",
"2 London Heathrow United Kingdom\n",
"3 Schiphol Netherlands\n",
"4 Changi Singapore\n",
"5 Pearson Canada\n",
"6 Narita Japan"
]
},
"execution_count": 12,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports.iloc[:,[0,2]]"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Using *loc* to specify columns by name\n",
"If you want to list the column names instead of the column positions use **loc** instead of **iloc**"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seatte-Tacoma</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>London Heathrow</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name Country\n",
"0 Seatte-Tacoma USA\n",
"1 Dulles USA\n",
"2 London Heathrow United Kingdom\n",
"3 Schiphol Netherlands\n",
"4 Changi Singapore\n",
"5 Pearson Canada\n",
"6 Narita Japan"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports.loc[:,['Name', 'Country']]"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,14 @@
# Query a pandas DataFrame
The pandas [DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) is a structure for storing two-dimensional tabular data.
## Common properties
- [loc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.loc.html) returns specific rows and columns by specifying row and column labels (names)
- [iloc](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.iloc.html) returns specific rows and columns by specifying integer positions (see the sketch below)
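A minimal sketch of both, assuming the same small `airports` DataFrame used in the accompanying notebook:

```python
import pandas as pd

airports = pd.DataFrame(
    [['Seattle-Tacoma', 'Seattle', 'USA'],
     ['Dulles', 'Washington', 'USA'],
     ['Heathrow', 'London', 'United Kingdom']],
    columns=['Name', 'City', 'Country'])

# loc selects by label: all rows, just the Name and Country columns
print(airports.loc[:, ['Name', 'Country']])

# iloc selects by integer position: first two rows, first two columns
print(airports.iloc[0:2, 0:2])
```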
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,9 @@
# CSV Files and Jupyter Notebooks
CSV (comma-separated values) files are plain-text files frequently used to store data. In order to access the data in a CSV file from a Jupyter Notebook, you must first upload the file; the sketch below shows one way to load it once uploaded.
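Once the file is uploaded, a minimal sketch of loading it with pandas (assuming the file sits in a `Data` folder, as in the accompanying notebook):

```python
import pandas as pd

# Hypothetical path -- adjust to wherever you uploaded the file
airports_df = pd.read_csv('Data/airports.csv')
print(airports_df.head())
```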
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,8 @@
Name,City,Country
Seattle-Tacoma,Seattle,USA
Dulles,Washington,USA
Heathrow,London,United Kingdom
Schiphol,Amsterdam,Netherlands
Changi,Singapore,Singapore
Pearson,Toronto,Canada
Narita,Tokyo,Japan

View file

@ -0,0 +1,903 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Read and write CSV files with pandas DataFrames\n",
"\n",
"You can load data from a CSV file directly into a pandas DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Reading a CSV file into a pandas DataFrame\n",
"**read_csv** allows you to read the contents of a csv file into a DataFrame\n",
"\n",
"airports.csv contains the following: \n",
"\n",
"Name,City,Country \n",
"Seattle-Tacoma,Seattle,USA \n",
"Dulles,Washington,USA \n",
"Heathrow,London,United Kingdom \n",
"Schiphol,Amsterdam,Netherlands \n",
"Changi,Singapore,Singapore \n",
"Pearson,Toronto,Canada \n",
"Narita,Tokyo,Japan"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seattle-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seattle-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Heathrow London United Kingdom\n",
"3 Schiphol Amsterdam Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df = pd.read_csv('Data/airports.csv')\n",
"airports_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Handling rows with errors\n",
"By default rows with an extra , or other issues cause an error\n",
"\n",
"Note the extra , in the row for Heathrow London in airportsInvalidRows.csv: \n",
"\n",
"Name,City,Country \n",
"Seattle-Tacoma,Seattle,USA \n",
"Dulles,Washington,USA \n",
"Heathrow,London,,United Kingdom \n",
"Schiphol,Amsterdam,Netherlands \n",
"Changi,Singapore,Singapore \n",
"Pearson,Toronto,Canada \n",
"Narita,Tokyo,Japan "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"ename": "ParserError",
"evalue": "Error tokenizing data. C error: Expected 3 fields in line 4, saw 4\n",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mParserError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-3-73bdf61a29e1>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0mairports_df\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mpd\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread_csv\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m'Data/airportsInvalidRows.csv'\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2\u001b[0m \u001b[0mairports_df\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36mparser_f\u001b[1;34m(filepath_or_buffer, sep, delimiter, header, names, index_col, usecols, squeeze, prefix, mangle_dupe_cols, dtype, engine, converters, true_values, false_values, skipinitialspace, skiprows, skipfooter, nrows, na_values, keep_default_na, na_filter, verbose, skip_blank_lines, parse_dates, infer_datetime_format, keep_date_col, date_parser, dayfirst, cache_dates, iterator, chunksize, compression, thousands, decimal, lineterminator, quotechar, quoting, doublequote, escapechar, comment, encoding, dialect, error_bad_lines, warn_bad_lines, delim_whitespace, low_memory, memory_map, float_precision)\u001b[0m\n\u001b[0;32m 683\u001b[0m )\n\u001b[0;32m 684\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 685\u001b[1;33m \u001b[1;32mreturn\u001b[0m \u001b[0m_read\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mfilepath_or_buffer\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mkwds\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 686\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 687\u001b[0m \u001b[0mparser_f\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m__name__\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mname\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36m_read\u001b[1;34m(filepath_or_buffer, kwds)\u001b[0m\n\u001b[0;32m 461\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 462\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m--> 463\u001b[1;33m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mparser\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 464\u001b[0m \u001b[1;32mfinally\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 465\u001b[0m \u001b[0mparser\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mclose\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36mread\u001b[1;34m(self, nrows)\u001b[0m\n\u001b[0;32m 1152\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnrows\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1153\u001b[0m \u001b[0mnrows\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0m_validate_integer\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;34m\"nrows\"\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnrows\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 1154\u001b[1;33m \u001b[0mret\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_engine\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 1155\u001b[0m \u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 1156\u001b[0m \u001b[1;31m# May alter columns / col_dict\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32m~\\Anaconda3\\lib\\site-packages\\pandas\\io\\parsers.py\u001b[0m in \u001b[0;36mread\u001b[1;34m(self, nrows)\u001b[0m\n\u001b[0;32m 2046\u001b[0m \u001b[1;32mdef\u001b[0m \u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mself\u001b[0m\u001b[1;33m,\u001b[0m \u001b[0mnrows\u001b[0m\u001b[1;33m=\u001b[0m\u001b[1;32mNone\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2047\u001b[0m \u001b[1;32mtry\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[1;32m-> 2048\u001b[1;33m \u001b[0mdata\u001b[0m \u001b[1;33m=\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_reader\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mread\u001b[0m\u001b[1;33m(\u001b[0m\u001b[0mnrows\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m\u001b[0;32m 2049\u001b[0m \u001b[1;32mexcept\u001b[0m \u001b[0mStopIteration\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0;32m 2050\u001b[0m \u001b[1;32mif\u001b[0m \u001b[0mself\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0m_first_chunk\u001b[0m\u001b[1;33m:\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n",
"\u001b[1;32mpandas\\_libs\\parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader.read\u001b[1;34m()\u001b[0m\n",
"\u001b[1;32mpandas\\_libs\\parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._read_low_memory\u001b[1;34m()\u001b[0m\n",
"\u001b[1;32mpandas\\_libs\\parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._read_rows\u001b[1;34m()\u001b[0m\n",
"\u001b[1;32mpandas\\_libs\\parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.TextReader._tokenize_rows\u001b[1;34m()\u001b[0m\n",
"\u001b[1;32mpandas\\_libs\\parsers.pyx\u001b[0m in \u001b[0;36mpandas._libs.parsers.raise_parser_error\u001b[1;34m()\u001b[0m\n",
"\u001b[1;31mParserError\u001b[0m: Error tokenizing data. C error: Expected 3 fields in line 4, saw 4\n"
]
}
],
"source": [
"airports_df = pd.read_csv('Data/airportsInvalidRows.csv')\n",
"airports_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify **error_bad_lines=False** to skip any rows with errors"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"b'Skipping line 4: expected 3 fields, saw 4\\n'\n"
]
},
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seattle-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seattle-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Schiphol Amsterdam Netherlands\n",
"3 Changi Singapore Singapore\n",
"4 Pearson Toronto Canada\n",
"5 Narita Tokyo Japan"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df = pd.read_csv(\n",
" 'Data/airportsInvalidRows.csv', \n",
" error_bad_lines=False\n",
" )\n",
"airports_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Handling files which do not contain column headers\n",
"If your file does not have the column headers in the first row by default, the first row of data is treated as headers\n",
"\n",
"airportsNoHeaderRows.csv contains airport data but does not have a row specifying the column headers:\n",
"\n",
"Seattle-Tacoma,Seattle,USA \n",
"Dulles,Washington,USA \n",
"Heathrow,London,United Kingdom \n",
"Schiphol,Amsterdam,Netherlands \n",
"Changi,Singapore,Singapore \n",
"Pearson,Toronto,Canada \n",
"Narita,Tokyo,Japan "
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Seattle-Tacoma</th>\n",
" <th>Seattle</th>\n",
" <th>USA</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Seattle-Tacoma Seattle USA\n",
"0 Dulles Washington USA\n",
"1 Heathrow London United Kingdom\n",
"2 Schiphol Amsterdam Netherlands\n",
"3 Changi Singapore Singapore\n",
"4 Pearson Toronto Canada\n",
"5 Narita Tokyo Japan"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df = pd.read_csv('Data/airportsNoHeaderRows.csv')\n",
"airports_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Specify **header=None** if you do not have a Header row to avoid having the first row of data treated as a header row"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" <th>1</th>\n",
" <th>2</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seattle-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0 1 2\n",
"0 Seattle-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Heathrow London United Kingdom\n",
"3 Schiphol Amsterdam Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df = pd.read_csv(\n",
" 'Data/airportsNoHeaderRows.csv', \n",
" header=None\n",
" )\n",
"airports_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you do not have a header row you can use the **names** parameter to specify column names when data is loaded"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seattle-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seattle-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Heathrow London United Kingdom\n",
"3 Schiphol Amsterdam Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df = pd.read_csv(\n",
" 'Data/airportsNoHeaderRows.csv', \n",
" header=None, \n",
" names=['Name', 'City', 'Country']\n",
" )\n",
"airports_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Missing values in Data files\n",
"Missing values appear in DataFrames as **NaN**\n",
"\n",
"There is no city listed for Schiphol airport in airportsBlankValues.csv :\n",
"\n",
"Name,City,Country \n",
"Seattle-Tacoma,Seattle,USA \n",
"Dulles,Washington,USA \n",
"Heathrow,London,United Kingdom \n",
"Schiphol,,Netherlands \n",
"Changi,Singapore,Singapore \n",
"Pearson,Toronto,Canada \n",
"Narita,Tokyo,Japan"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seattle-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>NaN</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seattle-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Heathrow London United Kingdom\n",
"3 Schiphol NaN Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df = pd.read_csv('Data/airportsBlankValues.csv')\n",
"airports_df"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Writing DataFrame contents to a CSV file\n",
"**to_csv** will write the contents of a pandas DataFrame to a CSV file"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seattle-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Schiphol</td>\n",
" <td>NaN</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seattle-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Heathrow London United Kingdom\n",
"3 Schiphol NaN Netherlands\n",
"4 Changi Singapore Singapore\n",
"5 Pearson Toronto Canada\n",
"6 Narita Tokyo Japan"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [],
"source": [
"airports_df.to_csv('Data/MyNewCSVFile.csv')"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The index column is written to the csv file\n",
"\n",
"Specify **index=False** if you do not want the index column to be included in the csv file"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [],
"source": [
"airports_df.to_csv(\n",
" 'Data/MyNewCSVFileNoIndex.csv', \n",
" index=False\n",
" )"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,15 @@
# Read and write CSV files from pandas DataFrames
You can populate a DataFrame with the data in a CSV file.
## Common functions and properties
- [read_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html) reads a comma-separated values file into a DataFrame
- [to_csv](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.to_csv.html) writes contents of a DataFrame to a comma-separated values file
- [NaN](https://pandas.pydata.org/pandas-docs/stable/user_guide/missing_data.html) is the default representation of missing values
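For quick reference, here is a minimal sketch of how these pieces fit together. The file paths are only assumed examples (they mirror the sample files used in the notebook), not required names.

```python
import pandas as pd

# Read a CSV file into a DataFrame; missing fields show up as NaN
airports_df = pd.read_csv('Data/airports.csv')  # assumed example path

# Read a file with no header row and supply the column names yourself
no_header_df = pd.read_csv(
    'Data/airportsNoHeaderRows.csv',  # assumed example path
    header=None,
    names=['Name', 'City', 'Country']
)

# Write a DataFrame back out, leaving the index column out of the file
no_header_df.to_csv('Data/MyNewCSVFile.csv', index=False)
```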
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,8 @@
Name,City,Country
Seattle-Tacoma,Seattle,USA
Dulles,Washington,USA
Heathrow,London,United Kingdom
Schiphol,Amsterdam,Netherlands
Changi,Singapore,Singapore
Pearson,Toronto,Canada
Narita,Tokyo,Japan

View file

@ -0,0 +1,8 @@
Name,City,Country
Seattle-Tacoma,Seattle,USA
Dulles,Washington,USA
Heathrow,London,United Kingdom
Schiphol,,Netherlands
Changi,Singapore,Singapore
Pearson,Toronto,Canada
Narita,Tokyo,Japan

View file

@ -0,0 +1,8 @@
Name,City,Country
Seattle-Tacoma,Seattle,USA
Dulles,Washington,USA
Heathrow,London,,United Kingdom
Schiphol,Amsterdam,Netherlands
Changi,Singapore,Singapore
Pearson,Toronto,Canada
Narita,Tokyo,Japan

View file

@ -0,0 +1,7 @@
Seattle-Tacoma,Seattle,USA
Dulles,Washington,USA
Heathrow,London,United Kingdom
Schiphol,Amsterdam,Netherlands
Changi,Singapore,Singapore
Pearson,Toronto,Canada
Narita,Tokyo,Japan

View file

@ -0,0 +1,634 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Removing and splitting pandas DataFrame columns"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When you are preparing to train machine learning models, you often need to delete specific columns, or split certain columns from your DataFrame into a new DataFrame.\n",
"\n",
"We need the pandas library and a DataFrame to explore"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load a bigger csv file with more columns, **flight_delays.csv** provides information about flights and flight delays"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>FL_DATE</th>\n",
" <th>OP_UNIQUE_CARRIER</th>\n",
" <th>TAIL_NUM</th>\n",
" <th>OP_CARRIER_FL_NUM</th>\n",
" <th>ORIGIN</th>\n",
" <th>DEST</th>\n",
" <th>CRS_DEP_TIME</th>\n",
" <th>DEP_TIME</th>\n",
" <th>DEP_DELAY</th>\n",
" <th>CRS_ARR_TIME</th>\n",
" <th>ARR_TIME</th>\n",
" <th>ARR_DELAY</th>\n",
" <th>CRS_ELAPSED_TIME</th>\n",
" <th>ACTUAL_ELAPSED_TIME</th>\n",
" <th>AIR_TIME</th>\n",
" <th>DISTANCE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N221WN</td>\n",
" <td>802</td>\n",
" <td>ABQ</td>\n",
" <td>BWI</td>\n",
" <td>905</td>\n",
" <td>903</td>\n",
" <td>-2</td>\n",
" <td>1450</td>\n",
" <td>1433</td>\n",
" <td>-17</td>\n",
" <td>225</td>\n",
" <td>210</td>\n",
" <td>197</td>\n",
" <td>1670</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N8329B</td>\n",
" <td>3744</td>\n",
" <td>ABQ</td>\n",
" <td>BWI</td>\n",
" <td>1500</td>\n",
" <td>1458</td>\n",
" <td>-2</td>\n",
" <td>2045</td>\n",
" <td>2020</td>\n",
" <td>-25</td>\n",
" <td>225</td>\n",
" <td>202</td>\n",
" <td>191</td>\n",
" <td>1670</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N920WN</td>\n",
" <td>1019</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>1800</td>\n",
" <td>1802</td>\n",
" <td>2</td>\n",
" <td>2045</td>\n",
" <td>2032</td>\n",
" <td>-13</td>\n",
" <td>105</td>\n",
" <td>90</td>\n",
" <td>80</td>\n",
" <td>580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N480WN</td>\n",
" <td>1499</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>950</td>\n",
" <td>947</td>\n",
" <td>-3</td>\n",
" <td>1235</td>\n",
" <td>1223</td>\n",
" <td>-12</td>\n",
" <td>105</td>\n",
" <td>96</td>\n",
" <td>81</td>\n",
" <td>580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N227WN</td>\n",
" <td>3635</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>1150</td>\n",
" <td>1151</td>\n",
" <td>1</td>\n",
" <td>1430</td>\n",
" <td>1423</td>\n",
" <td>-7</td>\n",
" <td>100</td>\n",
" <td>92</td>\n",
" <td>80</td>\n",
" <td>580</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" FL_DATE OP_UNIQUE_CARRIER TAIL_NUM OP_CARRIER_FL_NUM ORIGIN DEST \\\n",
"0 2018-10-01 WN N221WN 802 ABQ BWI \n",
"1 2018-10-01 WN N8329B 3744 ABQ BWI \n",
"2 2018-10-01 WN N920WN 1019 ABQ DAL \n",
"3 2018-10-01 WN N480WN 1499 ABQ DAL \n",
"4 2018-10-01 WN N227WN 3635 ABQ DAL \n",
"\n",
" CRS_DEP_TIME DEP_TIME DEP_DELAY CRS_ARR_TIME ARR_TIME ARR_DELAY \\\n",
"0 905 903 -2 1450 1433 -17 \n",
"1 1500 1458 -2 2045 2020 -25 \n",
"2 1800 1802 2 2045 2032 -13 \n",
"3 950 947 -3 1235 1223 -12 \n",
"4 1150 1151 1 1430 1423 -7 \n",
"\n",
" CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME DISTANCE \n",
"0 225 210 197 1670 \n",
"1 225 202 191 1670 \n",
"2 105 90 80 580 \n",
"3 105 96 81 580 \n",
"4 100 92 80 580 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"delays_df = pd.read_csv('Data/flight_delays.csv')\n",
"delays_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Removing a column from a DataFrame.\n",
"\n",
"When you are preparing your data for machine learning, you may need to delete specific columns from the DataFrame before training the model.\n",
"\n",
"For example:\n",
"Imagine you are training a model to predict how many minutes late a flight will be (ARR_DELAY)\n",
"\n",
"If the model knew the scheduled arrival time (CRS_ARR_TIME) and the actual arrival time (ARR_TIME), the model would quickly figure out ARR_DELAY = ARR_TIME - CRS_ARR_TIME\n",
"\n",
"When we predict arrival times for future flights, we won't have a value for arrival time (ARR_TIME). So we should remove this column from the DataFrame so it is not used as a feature when training the model to predict ARR_DELAY. "
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>FL_DATE</th>\n",
" <th>OP_UNIQUE_CARRIER</th>\n",
" <th>TAIL_NUM</th>\n",
" <th>OP_CARRIER_FL_NUM</th>\n",
" <th>ORIGIN</th>\n",
" <th>DEST</th>\n",
" <th>CRS_DEP_TIME</th>\n",
" <th>DEP_TIME</th>\n",
" <th>DEP_DELAY</th>\n",
" <th>CRS_ARR_TIME</th>\n",
" <th>ARR_DELAY</th>\n",
" <th>CRS_ELAPSED_TIME</th>\n",
" <th>ACTUAL_ELAPSED_TIME</th>\n",
" <th>AIR_TIME</th>\n",
" <th>DISTANCE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N221WN</td>\n",
" <td>802</td>\n",
" <td>ABQ</td>\n",
" <td>BWI</td>\n",
" <td>905</td>\n",
" <td>903</td>\n",
" <td>-2</td>\n",
" <td>1450</td>\n",
" <td>-17</td>\n",
" <td>225</td>\n",
" <td>210</td>\n",
" <td>197</td>\n",
" <td>1670</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N8329B</td>\n",
" <td>3744</td>\n",
" <td>ABQ</td>\n",
" <td>BWI</td>\n",
" <td>1500</td>\n",
" <td>1458</td>\n",
" <td>-2</td>\n",
" <td>2045</td>\n",
" <td>-25</td>\n",
" <td>225</td>\n",
" <td>202</td>\n",
" <td>191</td>\n",
" <td>1670</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N920WN</td>\n",
" <td>1019</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>1800</td>\n",
" <td>1802</td>\n",
" <td>2</td>\n",
" <td>2045</td>\n",
" <td>-13</td>\n",
" <td>105</td>\n",
" <td>90</td>\n",
" <td>80</td>\n",
" <td>580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N480WN</td>\n",
" <td>1499</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>950</td>\n",
" <td>947</td>\n",
" <td>-3</td>\n",
" <td>1235</td>\n",
" <td>-12</td>\n",
" <td>105</td>\n",
" <td>96</td>\n",
" <td>81</td>\n",
" <td>580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N227WN</td>\n",
" <td>3635</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>1150</td>\n",
" <td>1151</td>\n",
" <td>1</td>\n",
" <td>1430</td>\n",
" <td>-7</td>\n",
" <td>100</td>\n",
" <td>92</td>\n",
" <td>80</td>\n",
" <td>580</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" FL_DATE OP_UNIQUE_CARRIER TAIL_NUM OP_CARRIER_FL_NUM ORIGIN DEST \\\n",
"0 2018-10-01 WN N221WN 802 ABQ BWI \n",
"1 2018-10-01 WN N8329B 3744 ABQ BWI \n",
"2 2018-10-01 WN N920WN 1019 ABQ DAL \n",
"3 2018-10-01 WN N480WN 1499 ABQ DAL \n",
"4 2018-10-01 WN N227WN 3635 ABQ DAL \n",
"\n",
" CRS_DEP_TIME DEP_TIME DEP_DELAY CRS_ARR_TIME ARR_DELAY \\\n",
"0 905 903 -2 1450 -17 \n",
"1 1500 1458 -2 2045 -25 \n",
"2 1800 1802 2 2045 -13 \n",
"3 950 947 -3 1235 -12 \n",
"4 1150 1151 1 1430 -7 \n",
"\n",
" CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME DISTANCE \n",
"0 225 210 197 1670 \n",
"1 225 202 191 1670 \n",
"2 105 90 80 580 \n",
"3 105 96 81 580 \n",
"4 100 92 80 580 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Remove the column ARR_TIME from the DataFrane delays_df\n",
"\n",
"#delays_df = delays_df.drop(['ARR_TIME'],axis=1)\n",
"new_df = delays_df.drop(columns=['ARR_TIME'])\n",
"new_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the **inplace** parameter to specify you want to drop the column from the original DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>FL_DATE</th>\n",
" <th>OP_UNIQUE_CARRIER</th>\n",
" <th>TAIL_NUM</th>\n",
" <th>OP_CARRIER_FL_NUM</th>\n",
" <th>ORIGIN</th>\n",
" <th>DEST</th>\n",
" <th>CRS_DEP_TIME</th>\n",
" <th>DEP_TIME</th>\n",
" <th>DEP_DELAY</th>\n",
" <th>CRS_ARR_TIME</th>\n",
" <th>ARR_DELAY</th>\n",
" <th>CRS_ELAPSED_TIME</th>\n",
" <th>ACTUAL_ELAPSED_TIME</th>\n",
" <th>AIR_TIME</th>\n",
" <th>DISTANCE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N221WN</td>\n",
" <td>802</td>\n",
" <td>ABQ</td>\n",
" <td>BWI</td>\n",
" <td>905</td>\n",
" <td>903</td>\n",
" <td>-2</td>\n",
" <td>1450</td>\n",
" <td>-17</td>\n",
" <td>225</td>\n",
" <td>210</td>\n",
" <td>197</td>\n",
" <td>1670</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N8329B</td>\n",
" <td>3744</td>\n",
" <td>ABQ</td>\n",
" <td>BWI</td>\n",
" <td>1500</td>\n",
" <td>1458</td>\n",
" <td>-2</td>\n",
" <td>2045</td>\n",
" <td>-25</td>\n",
" <td>225</td>\n",
" <td>202</td>\n",
" <td>191</td>\n",
" <td>1670</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N920WN</td>\n",
" <td>1019</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>1800</td>\n",
" <td>1802</td>\n",
" <td>2</td>\n",
" <td>2045</td>\n",
" <td>-13</td>\n",
" <td>105</td>\n",
" <td>90</td>\n",
" <td>80</td>\n",
" <td>580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N480WN</td>\n",
" <td>1499</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>950</td>\n",
" <td>947</td>\n",
" <td>-3</td>\n",
" <td>1235</td>\n",
" <td>-12</td>\n",
" <td>105</td>\n",
" <td>96</td>\n",
" <td>81</td>\n",
" <td>580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N227WN</td>\n",
" <td>3635</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>1150</td>\n",
" <td>1151</td>\n",
" <td>1</td>\n",
" <td>1430</td>\n",
" <td>-7</td>\n",
" <td>100</td>\n",
" <td>92</td>\n",
" <td>80</td>\n",
" <td>580</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" FL_DATE OP_UNIQUE_CARRIER TAIL_NUM OP_CARRIER_FL_NUM ORIGIN DEST \\\n",
"0 2018-10-01 WN N221WN 802 ABQ BWI \n",
"1 2018-10-01 WN N8329B 3744 ABQ BWI \n",
"2 2018-10-01 WN N920WN 1019 ABQ DAL \n",
"3 2018-10-01 WN N480WN 1499 ABQ DAL \n",
"4 2018-10-01 WN N227WN 3635 ABQ DAL \n",
"\n",
" CRS_DEP_TIME DEP_TIME DEP_DELAY CRS_ARR_TIME ARR_DELAY \\\n",
"0 905 903 -2 1450 -17 \n",
"1 1500 1458 -2 2045 -25 \n",
"2 1800 1802 2 2045 -13 \n",
"3 950 947 -3 1235 -12 \n",
"4 1150 1151 1 1430 -7 \n",
"\n",
" CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME DISTANCE \n",
"0 225 210 197 1670 \n",
"1 225 202 191 1670 \n",
"2 105 90 80 580 \n",
"3 105 96 81 580 \n",
"4 100 92 80 580 "
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Remove the column ARR_TIME from the DataFrame delays_df\n",
"\n",
"#delays_df = delays_df.drop(['ARR_TIME'],axis=1)\n",
"delays_df.drop(columns=['ARR_TIME'], inplace=True)\n",
"delays_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We use different techniques to predict based on quantititative values which are usually numeric values (e.g. distance, number of minutes, weight) and qualitative (descriptive) values which may not be numeric (e.g. what airport a flight left from, what airline operated the flight)\n",
"\n",
"Quantitative data may be moved into a separate DataFrame before training a model.\n",
"\n",
"You also need to put the value you want to predict, called the label (ARR_DELAY) in a separate DataFrame from the values you think can help you make the prediction, called the features\n",
"\n",
"We need to be able to create a new dataframe from the columns in an existing dataframe"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"# Create a new DataFrame called desc_df\n",
"# include all rows\n",
"# include the columns ORIGIN, DEST, OP_CARRIER_FL_NUM, OP_UNIQUE_CARRIER, TAIL_NUM\n",
"\n",
"desc_df = delays_df.loc[:,['ORIGIN', 'DEST', 'OP_CARRIER_FL_NUM', 'OP_UNIQUE_CARRIER', 'TAIL_NUM']]\n",
"desc_df.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

View file

@ -0,0 +1,13 @@
# Removing and splitting DataFrame columns
When preparing data for machine learning, you may need to remove specific columns from the DataFrame.
## Common functions
- [drop](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop.html) deletes specified columns from a DataFrame
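As a quick sketch of the operations covered in the notebook (the file path and column names below come from the flight delays sample data and are assumptions if you use your own file):

```python
import pandas as pd

# Assumed example path; the notebook loads Data/flight_delays.csv
delays_df = pd.read_csv('Data/flight_delays.csv')

# Drop a column into a new DataFrame, leaving the original untouched
new_df = delays_df.drop(columns=['ARR_TIME'])

# Or drop it from the original DataFrame in place
delays_df.drop(columns=['ARR_TIME'], inplace=True)

# Split selected columns into a separate DataFrame with loc
desc_df = delays_df.loc[:, ['ORIGIN', 'DEST', 'OP_UNIQUE_CARRIER', 'TAIL_NUM']]
```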
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,100 @@
FL_DATE,OP_UNIQUE_CARRIER,TAIL_NUM,OP_CARRIER_FL_NUM,ORIGIN,DEST,CRS_DEP_TIME,DEP_TIME,DEP_DELAY,CRS_ARR_TIME,ARR_TIME,ARR_DELAY,CRS_ELAPSED_TIME,ACTUAL_ELAPSED_TIME,AIR_TIME,DISTANCE
2018-10-01,WN,N221WN,802,ABQ,BWI,905,903,-2,1450,1433,-17,225,210,197,1670
2018-10-01,WN,N8329B,3744,ABQ,BWI,1500,1458,-2,2045,2020,-25,225,202,191,1670
2018-10-01,WN,N920WN,1019,ABQ,DAL,1800,1802,2,2045,2032,-13,105,90,80,580
2018-10-01,WN,N480WN,1499,ABQ,DAL,950,947,-3,1235,1223,-12,105,96,81,580
2018-10-01,WN,N227WN,3635,ABQ,DAL,1150,1151,1,1430,1423,-7,100,92,80,580
2018-10-01,WN,N243WN,3998,ABQ,DAL,655,652,-3,940,924,-16,105,92,83,580
2018-10-01,WN,N485WN,5432,ABQ,DAL,1340,1354,14,1625,1631,6,105,97,81,580
2018-10-01,WN,N229WN,4596,ABQ,DEN,1420,1444,24,1540,1552,12,80,68,55,349
2018-10-01,WN,N934WN,6013,ABQ,DEN,910,907,-3,1025,1027,2,75,80,52,349
2018-10-01,WN,N934WN,6015,ABQ,DEN,1735,1742,7,1845,1854,9,70,72,58,349
2018-10-01,WN,N8615E,2885,ABQ,HOU,1240,1239,-1,1540,1539,-1,120,120,108,759
2018-10-01,WN,N965WN,3939,ABQ,HOU,640,640,0,940,938,-2,120,118,103,759
2018-10-01,WN,N408WN,4025,ABQ,HOU,1555,1610,15,1850,1906,16,115,116,103,759
2018-10-01,WN,N913WN,1642,ABQ,LAS,1040,1037,-3,1115,1057,-18,95,80,69,486
2018-10-01,WN,N927WN,3271,ABQ,LAS,1615,1614,-1,1645,1646,1,90,92,75,486
2018-10-01,WN,N732SW,4816,ABQ,LAS,605,601,-4,635,628,-7,90,87,73,486
2018-10-01,WN,N496WN,6095,ABQ,LAS,2130,2123,-7,2155,2146,-9,85,83,70,486
2018-10-01,WN,N468WN,555,ABQ,LAX,1710,1708,-2,1815,1805,-10,125,117,92,677
2018-10-01,WN,N7751A,3858,ABQ,LAX,545,541,-4,645,638,-7,120,117,102,677
2018-10-01,WN,N435WN,5757,ABQ,MCI,1720,2119,239,2005,2357,232,105,98,87,718
2018-10-01,WN,N556WN,538,ABQ,MDW,1705,1756,51,2040,2114,34,155,138,129,1121
2018-10-01,WN,N410WN,4837,ABQ,MDW,705,708,3,1045,1032,-13,160,144,127,1121
2018-10-01,WN,N8726H,792,ABQ,OAK,815,809,-6,940,928,-12,145,139,123,889
2018-10-01,WN,N956WN,5673,ABQ,OAK,1125,1221,56,1250,1337,47,145,136,121,889
2018-10-01,WN,N739GB,5753,ABQ,OAK,1915,1915,0,2035,2029,-6,140,134,121,889
2018-10-01,WN,N7723E,5516,ABQ,PDX,1020,1017,-3,1215,1204,-11,175,167,157,1111
2018-10-01,WN,N770SA,1415,ABQ,PHX,945,944,-1,1005,949,-16,80,65,54,328
2018-10-01,WN,N730SW,2782,ABQ,PHX,1410,1424,14,1430,1431,1,80,67,55,328
2018-10-01,WN,N7725A,2863,ABQ,PHX,700,702,2,720,720,0,80,78,56,328
2018-10-01,WN,N450WN,4114,ABQ,PHX,1935,1931,-4,1950,1959,9,75,88,58,328
2018-10-01,WN,N8673F,5500,ABQ,PHX,1625,1630,5,1640,1636,-4,75,66,58,328
2018-10-01,WN,N948WN,6315,ABQ,PHX,1120,1126,6,1140,1138,-2,80,72,57,328
2018-10-01,WN,N566WN,19,ABQ,SAN,1505,1551,46,1555,1631,36,110,100,87,628
2018-10-01,WN,N957WN,4832,ABQ,SAN,610,616,6,700,658,-2,110,102,91,628
2018-10-01,WN,N8704Q,824,ALB,BWI,805,801,-4,920,911,-9,75,70,54,289
2018-10-01,WN,N903WN,1758,ALB,BWI,605,601,-4,720,706,-14,75,65,55,289
2018-10-01,WN,N8572X,2790,ALB,BWI,925,928,3,1040,1031,-9,75,63,53,289
2018-10-01,WN,N7701B,3292,ALB,BWI,1315,1308,-7,1435,1417,-18,80,69,58,289
2018-10-01,WN,N295WN,3376,ALB,BWI,1105,1101,-4,1220,1206,-14,75,65,53,289
2018-10-01,WN,N716SW,4898,ALB,BWI,1710,1707,-3,1825,1815,-10,75,68,56,289
2018-10-01,WN,N8674B,5153,ALB,DEN,1850,1849,-1,2050,2045,-5,240,236,223,1610
2018-10-01,WN,N8643A,390,ALB,MCO,705,705,0,955,954,-1,170,169,149,1073
2018-10-01,WN,N730SW,2776,ALB,MDW,630,625,-5,735,744,9,125,139,120,717
2018-10-01,WN,N798SW,4197,ALB,MDW,1655,1652,-3,1805,1815,10,130,143,112,717
2018-10-01,WN,N729SW,988,AMA,DAL,1605,1615,10,1720,1713,-7,75,58,48,323
2018-10-01,WN,N933WN,1913,AMA,DAL,605,603,-2,720,705,-15,75,62,50,323
2018-10-01,WN,N7706A,5226,AMA,DAL,1045,1047,2,1155,1156,1,70,69,52,323
2018-10-01,WN,N755SA,6984,AMA,DAL,1830,1825,-5,1940,1921,-19,70,56,48,323
2018-10-01,WN,N211WN,6822,AMA,LAS,1425,1429,4,1425,1438,13,120,129,107,758
2018-10-01,WN,N928WN,4261,ATL,AUS,1015,1011,-4,1140,1137,-3,145,146,123,813
2018-10-01,WN,N8581Z,4701,ATL,AUS,2030,2024,-6,2150,2133,-17,140,129,109,813
2018-10-01,WN,N950WN,5615,ATL,AUS,1645,1647,2,1810,1805,-5,145,138,112,813
2018-10-01,WN,N932WN,106,ATL,BNA,2215,2211,-4,2215,2205,-10,60,54,39,214
2018-10-01,WN,N739GB,2583,ATL,BNA,800,756,-4,755,752,-3,55,56,42,214
2018-10-01,WN,N454WN,3766,ATL,BNA,1955,1951,-4,2000,1948,-12,65,57,39,214
2018-10-01,WN,N7716A,4165,ATL,BNA,1225,1226,1,1235,1226,-9,70,60,41,214
2018-10-01,WN,N7822A,4501,ATL,BNA,1750,1745,-5,1745,1742,-3,55,57,40,214
2018-10-01,WN,N8324A,3360,ATL,BOS,1330,1500,90,1605,1737,92,155,157,126,946
2018-10-01,WN,N444WN,3987,ATL,BOS,2210,2204,-6,50,24,-26,160,140,118,946
2018-10-01,WN,N472WN,1031,ATL,BWI,1120,1119,-1,1310,1303,-7,110,104,81,577
2018-10-01,WN,N758SW,1526,ATL,BWI,800,757,-3,945,934,-11,105,97,81,577
2018-10-01,WN,N8642E,1922,ATL,BWI,1700,1656,-4,1850,1840,-10,110,104,88,577
2018-10-01,WN,N7838A,3991,ATL,BWI,2115,2202,47,2305,2339,34,110,97,80,577
2018-10-01,WN,N7839A,4436,ATL,BWI,1905,1904,-1,2100,2044,-16,115,100,85,577
2018-10-01,WN,N8509U,5150,ATL,BWI,1340,1416,36,1530,1548,18,110,92,79,577
2018-10-01,WN,N242WN,2574,ATL,CLE,835,833,-2,1015,1016,1,100,103,79,554
2018-10-01,WN,N961WN,5133,ATL,CLE,2200,2155,-5,2335,2334,-1,95,99,80,554
2018-10-01,WN,N8503A,2571,ATL,CMH,1540,1540,0,1710,1711,1,90,91,65,447
2018-10-01,WN,N282WN,6348,ATL,CMH,835,834,-1,1005,1005,0,90,91,67,447
2018-10-01,WN,N293WN,6661,ATL,CMH,2200,2208,8,2325,2332,7,85,84,67,447
2018-10-01,WN,N954WN,63,ATL,DAL,2010,2010,0,2125,2118,-7,135,128,101,721
2018-10-01,WN,N764SW,2838,ATL,DAL,720,719,-1,825,816,-9,125,117,103,721
2018-10-01,WN,N8549Z,3845,ATL,DAL,1740,1738,-2,1845,1838,-7,125,120,102,721
2018-10-01,WN,N8620H,5577,ATL,DAL,1045,1102,17,1200,1208,8,135,126,103,721
2018-10-01,WN,N8317M,6768,ATL,DAL,1330,1331,1,1440,1431,-9,130,120,101,721
2018-10-01,WN,N550WN,1347,ATL,DCA,1525,1542,17,1710,1724,14,105,102,79,547
2018-10-01,WN,N8648A,2600,ATL,DCA,715,716,1,855,851,-4,100,95,82,547
2018-10-01,WN,N726SW,4747,ATL,DCA,2025,2027,2,2210,2209,-1,105,102,83,547
2018-10-01,WN,N8665D,5207,ATL,DCA,1030,1105,35,1215,1241,26,105,96,80,547
2018-10-01,WN,N930WN,208,ATL,DEN,1835,1846,11,1945,1948,3,190,182,167,1199
2018-10-01,WN,N400WN,4133,ATL,DEN,610,607,-3,715,711,-4,185,184,170,1199
2018-10-01,WN,N7817J,4139,ATL,DEN,1430,1432,2,1540,1539,-1,190,187,167,1199
2018-10-01,WN,N8647A,5960,ATL,DEN,955,1008,13,1110,1112,2,195,184,162,1199
2018-10-01,WN,N212WN,115,ATL,DTW,835,832,-3,1025,1024,-1,110,112,86,594
2018-10-01,WN,N256WN,1896,ATL,DTW,2200,2200,0,2345,2349,4,105,109,86,594
2018-10-01,WN,N8556Z,5388,ATL,DTW,1530,1531,1,1730,1727,-3,120,116,86,594
2018-10-01,WN,N279WN,2775,ATL,FLL,1110,1111,1,1310,1250,-20,120,99,82,581
2018-10-01,WN,N945WN,3088,ATL,FLL,1340,1351,11,1535,1529,-6,115,98,82,581
2018-10-01,WN,N8699A,5459,ATL,FLL,650,645,-5,835,820,-15,105,95,80,581
2018-10-01,WN,N8691A,6191,ATL,FLL,1950,1958,8,2140,2146,6,110,108,83,581
2018-10-01,WN,N8548P,964,ATL,GSP,1540,1539,-1,1630,1628,-2,50,49,27,153
2018-10-01,WN,N258WN,5417,ATL,GSP,2205,2240,35,2255,2322,27,50,42,27,153
2018-10-01,WN,N8660A,6185,ATL,GSP,1105,1101,-4,1200,1145,-15,55,44,26,153
2018-10-01,WN,N274WN,343,ATL,HOU,1810,1808,-2,1920,1901,-19,130,113,100,696
2018-10-01,WN,N230WN,1176,ATL,HOU,1955,1955,0,2105,2057,-8,130,122,101,696
2018-10-01,WN,N786SW,1433,ATL,HOU,1130,1308,98,1235,1443,128,125,155,140,696
2018-10-01,WN,N452WN,2847,ATL,HOU,605,601,-4,710,659,-11,125,118,99,696
2018-10-01,WN,N8619F,5161,ATL,HOU,1340,1333,-7,1440,1503,23,120,150,136,696
2018-10-01,WN,N8513F,812,ATL,IAD,1535,1535,0,1725,1727,2,110,112,78,534

View file

@ -0,0 +1,618 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Handling duplicate rows and rows with missing values\n",
"\n",
"Most machine learning algorithms will return an error if they encounter a missing value. So, you often have to remove rows with missing values from your DataFrame.\n",
"\n",
"To learn how, we need to create a pandas DataFrame and load it with data."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The flight delays data set contains information about flights and flight delays"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>FL_DATE</th>\n",
" <th>OP_UNIQUE_CARRIER</th>\n",
" <th>TAIL_NUM</th>\n",
" <th>OP_CARRIER_FL_NUM</th>\n",
" <th>ORIGIN</th>\n",
" <th>DEST</th>\n",
" <th>CRS_DEP_TIME</th>\n",
" <th>DEP_TIME</th>\n",
" <th>DEP_DELAY</th>\n",
" <th>CRS_ARR_TIME</th>\n",
" <th>ARR_TIME</th>\n",
" <th>ARR_DELAY</th>\n",
" <th>CRS_ELAPSED_TIME</th>\n",
" <th>ACTUAL_ELAPSED_TIME</th>\n",
" <th>AIR_TIME</th>\n",
" <th>DISTANCE</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N221WN</td>\n",
" <td>802</td>\n",
" <td>ABQ</td>\n",
" <td>BWI</td>\n",
" <td>905</td>\n",
" <td>903.0</td>\n",
" <td>-2.0</td>\n",
" <td>1450</td>\n",
" <td>1433.0</td>\n",
" <td>-17.0</td>\n",
" <td>225</td>\n",
" <td>210.0</td>\n",
" <td>197.0</td>\n",
" <td>1670</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N8329B</td>\n",
" <td>3744</td>\n",
" <td>ABQ</td>\n",
" <td>BWI</td>\n",
" <td>1500</td>\n",
" <td>1458.0</td>\n",
" <td>-2.0</td>\n",
" <td>2045</td>\n",
" <td>2020.0</td>\n",
" <td>-25.0</td>\n",
" <td>225</td>\n",
" <td>202.0</td>\n",
" <td>191.0</td>\n",
" <td>1670</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N920WN</td>\n",
" <td>1019</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>1800</td>\n",
" <td>1802.0</td>\n",
" <td>2.0</td>\n",
" <td>2045</td>\n",
" <td>2032.0</td>\n",
" <td>-13.0</td>\n",
" <td>105</td>\n",
" <td>90.0</td>\n",
" <td>80.0</td>\n",
" <td>580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N480WN</td>\n",
" <td>1499</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>950</td>\n",
" <td>947.0</td>\n",
" <td>-3.0</td>\n",
" <td>1235</td>\n",
" <td>1223.0</td>\n",
" <td>-12.0</td>\n",
" <td>105</td>\n",
" <td>96.0</td>\n",
" <td>81.0</td>\n",
" <td>580</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2018-10-01</td>\n",
" <td>WN</td>\n",
" <td>N227WN</td>\n",
" <td>3635</td>\n",
" <td>ABQ</td>\n",
" <td>DAL</td>\n",
" <td>1150</td>\n",
" <td>1151.0</td>\n",
" <td>1.0</td>\n",
" <td>1430</td>\n",
" <td>1423.0</td>\n",
" <td>-7.0</td>\n",
" <td>100</td>\n",
" <td>92.0</td>\n",
" <td>80.0</td>\n",
" <td>580</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" FL_DATE OP_UNIQUE_CARRIER TAIL_NUM OP_CARRIER_FL_NUM ORIGIN DEST \\\n",
"0 2018-10-01 WN N221WN 802 ABQ BWI \n",
"1 2018-10-01 WN N8329B 3744 ABQ BWI \n",
"2 2018-10-01 WN N920WN 1019 ABQ DAL \n",
"3 2018-10-01 WN N480WN 1499 ABQ DAL \n",
"4 2018-10-01 WN N227WN 3635 ABQ DAL \n",
"\n",
" CRS_DEP_TIME DEP_TIME DEP_DELAY CRS_ARR_TIME ARR_TIME ARR_DELAY \\\n",
"0 905 903.0 -2.0 1450 1433.0 -17.0 \n",
"1 1500 1458.0 -2.0 2045 2020.0 -25.0 \n",
"2 1800 1802.0 2.0 2045 2032.0 -13.0 \n",
"3 950 947.0 -3.0 1235 1223.0 -12.0 \n",
"4 1150 1151.0 1.0 1430 1423.0 -7.0 \n",
"\n",
" CRS_ELAPSED_TIME ACTUAL_ELAPSED_TIME AIR_TIME DISTANCE \n",
"0 225 210.0 197.0 1670 \n",
"1 225 202.0 191.0 1670 \n",
"2 105 90.0 80.0 580 \n",
"3 105 96.0 81.0 580 \n",
"4 100 92.0 80.0 580 "
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"delays_df = pd.read_csv('Data/Lots_of_flight_data.csv')\n",
"delays_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**info** will tell us how many rows are in the DataFrame and for each column how many of those rows contain non-null values. From this we can determine which columns (if any) contain null/missing values"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"RangeIndex: 300000 entries, 0 to 299999\n",
"Data columns (total 16 columns):\n",
"FL_DATE 300000 non-null object\n",
"OP_UNIQUE_CARRIER 300000 non-null object\n",
"TAIL_NUM 299660 non-null object\n",
"OP_CARRIER_FL_NUM 300000 non-null int64\n",
"ORIGIN 300000 non-null object\n",
"DEST 300000 non-null object\n",
"CRS_DEP_TIME 300000 non-null int64\n",
"DEP_TIME 296825 non-null float64\n",
"DEP_DELAY 296825 non-null float64\n",
"CRS_ARR_TIME 300000 non-null int64\n",
"ARR_TIME 296574 non-null float64\n",
"ARR_DELAY 295832 non-null float64\n",
"CRS_ELAPSED_TIME 300000 non-null int64\n",
"ACTUAL_ELAPSED_TIME 295832 non-null float64\n",
"AIR_TIME 295832 non-null float64\n",
"DISTANCE 300000 non-null int64\n",
"dtypes: float64(6), int64(5), object(5)\n",
"memory usage: 30.9+ MB\n"
]
}
],
"source": [
"delays_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"TAIL_NUM, DEP_TIME, DEP_DELAY, ARR_TIME, ARR_DELAY, ACTUAL_ELAPSED_TIME, and AIR_TIME all have rows with missing values."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"There are many techniques to deal with missing values, the simplest is to delete the rows with missing values.\n",
"\n",
"**dropna** will delete rows containing null/missing values"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 295832 entries, 0 to 299999\n",
"Data columns (total 16 columns):\n",
"FL_DATE 295832 non-null object\n",
"OP_UNIQUE_CARRIER 295832 non-null object\n",
"TAIL_NUM 295832 non-null object\n",
"OP_CARRIER_FL_NUM 295832 non-null int64\n",
"ORIGIN 295832 non-null object\n",
"DEST 295832 non-null object\n",
"CRS_DEP_TIME 295832 non-null int64\n",
"DEP_TIME 295832 non-null float64\n",
"DEP_DELAY 295832 non-null float64\n",
"CRS_ARR_TIME 295832 non-null int64\n",
"ARR_TIME 295832 non-null float64\n",
"ARR_DELAY 295832 non-null float64\n",
"CRS_ELAPSED_TIME 295832 non-null int64\n",
"ACTUAL_ELAPSED_TIME 295832 non-null float64\n",
"AIR_TIME 295832 non-null float64\n",
"DISTANCE 295832 non-null int64\n",
"dtypes: float64(6), int64(5), object(5)\n",
"memory usage: 32.7+ MB\n"
]
}
],
"source": [
"delay_no_nulls_df = delays_df.dropna() # Delete the rows with missing values\n",
"delay_no_nulls_df.info() # Check the number of rows and number of rows with non-null values to confirm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you don't need to keep the original DataFrame, you can just delete the rows within the existing DataFrame instead of creating a new one\n",
"\n",
"**inplace=*True*** indicates you want to drop the rows in the specified DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"<class 'pandas.core.frame.DataFrame'>\n",
"Int64Index: 295832 entries, 0 to 299999\n",
"Data columns (total 16 columns):\n",
"FL_DATE 295832 non-null object\n",
"OP_UNIQUE_CARRIER 295832 non-null object\n",
"TAIL_NUM 295832 non-null object\n",
"OP_CARRIER_FL_NUM 295832 non-null int64\n",
"ORIGIN 295832 non-null object\n",
"DEST 295832 non-null object\n",
"CRS_DEP_TIME 295832 non-null int64\n",
"DEP_TIME 295832 non-null float64\n",
"DEP_DELAY 295832 non-null float64\n",
"CRS_ARR_TIME 295832 non-null int64\n",
"ARR_TIME 295832 non-null float64\n",
"ARR_DELAY 295832 non-null float64\n",
"CRS_ELAPSED_TIME 295832 non-null int64\n",
"ACTUAL_ELAPSED_TIME 295832 non-null float64\n",
"AIR_TIME 295832 non-null float64\n",
"DISTANCE 295832 non-null int64\n",
"dtypes: float64(6), int64(5), object(5)\n",
"memory usage: 32.7+ MB\n"
]
}
],
"source": [
"delays_df.dropna(inplace=True)\n",
"delays_df.info()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When data is loaded from multiple data sources you sometimes end up with duplicate records. "
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seattle-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seattle-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"2 Dulles Washington USA\n",
"3 Heathrow London United Kingdom\n",
"4 Schiphol Amsterdam Netherlands"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df = pd.read_csv('Data/airportsDuplicateRows.csv')\n",
"airports_df.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"use **duplicates** to find the duplicate rows.\n",
"\n",
"If a row is a duplicate of a previous row it returns **True**"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0 False\n",
"1 False\n",
"2 True\n",
"3 False\n",
"4 False\n",
"5 False\n",
"6 False\n",
"7 False\n",
"dtype: bool"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df.duplicated()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**drop_duplicates** will delete the duplicate rows"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>Name</th>\n",
" <th>City</th>\n",
" <th>Country</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Seattle-Tacoma</td>\n",
" <td>Seattle</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>Dulles</td>\n",
" <td>Washington</td>\n",
" <td>USA</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Heathrow</td>\n",
" <td>London</td>\n",
" <td>United Kingdom</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Schiphol</td>\n",
" <td>Amsterdam</td>\n",
" <td>Netherlands</td>\n",
" </tr>\n",
" <tr>\n",
" <th>5</th>\n",
" <td>Changi</td>\n",
" <td>Singapore</td>\n",
" <td>Singapore</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6</th>\n",
" <td>Pearson</td>\n",
" <td>Toronto</td>\n",
" <td>Canada</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7</th>\n",
" <td>Narita</td>\n",
" <td>Tokyo</td>\n",
" <td>Japan</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" Name City Country\n",
"0 Seattle-Tacoma Seattle USA\n",
"1 Dulles Washington USA\n",
"3 Heathrow London United Kingdom\n",
"4 Schiphol Amsterdam Netherlands\n",
"5 Changi Singapore Singapore\n",
"6 Pearson Toronto Canada\n",
"7 Narita Tokyo Japan"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"airports_df.drop_duplicates(inplace=True)\n",
"airports_df"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

The file diff is not shown because it is too large. Load diff

View file

@ -0,0 +1,15 @@
# Handling duplicates and rows with missing values
When preparing data for machine learning, you need to remove duplicate rows and you may need to delete rows with missing values.
## Common functions
- [dropna](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.dropna.html) removes rows with missing values
- [duplicated](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.duplicated.html) returns a True or False to indicate if a row is a duplicate of a previous row
- [drop_duplicates](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.drop_duplicates.html) returns a DataFrame with duplicate rows removed
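A minimal sketch of how these functions fit together, using the same CSV files as the notebooks in this folder (the file paths are assumed to match your local copy of the repo):

```python
import pandas as pd

# Flight data that contains some rows with missing values
delays_df = pd.read_csv('Data/Lots_of_flight_data.csv')
delays_df.dropna(inplace=True)          # drop rows containing null values, modifying delays_df in place

# Airport data known to contain duplicate rows
airports_df = pd.read_csv('Data/airportsDuplicateRows.csv')
print(airports_df.duplicated())         # True for each row that repeats an earlier row
airports_df = airports_df.drop_duplicates()  # keep only the first occurrence of each row
```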
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,9 @@
Name,City,Country
Seattle-Tacoma,Seattle,USA
Dulles,Washington,USA
Dulles,Washington,USA
Heathrow,London,United Kingdom
Schiphol,Amsterdam,Netherlands
Changi,Singapore,Singapore
Pearson,Toronto,Canada
Narita,Tokyo,Japan

View file

@ -0,0 +1,578 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Splitting test and training data\n",
"When you train a data model you may need to split up your data into test and training data sets\n",
"\n",
"To accomplish this task we will use the [scikit-learn](https://scikit-learn.org/stable/) library\n",
"\n",
"scikit-learn is an open source, BSD licensed library for data science for preprocessing and training models."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Before we can split our data test and training data, we need to do some data preparation"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's load our csv file with information about flights and flight delays\n",
"\n",
"Use **shape** to find out how many rows and columns are in the original DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(300000, 16)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"delays_df = pd.read_csv('Data/Lots_of_flight_data.csv')\n",
"delays_df.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split data into features and labels\n",
"Create a DataFrame called X containing only the features we want to use to train our model.\n",
"\n",
"**Note** You can only use numeric values as features, if you have non-numeric values you must apply different techniques such as Hot Encoding to convert these into numeric values before using them as features to train a model. Check out Data Science courses for more information on these techniques!"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>DISTANCE</th>\n",
" <th>CRS_ELAPSED_TIME</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>1670</td>\n",
" <td>225</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>1670</td>\n",
" <td>225</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>580</td>\n",
" <td>105</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>580</td>\n",
" <td>105</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>580</td>\n",
" <td>100</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" DISTANCE CRS_ELAPSED_TIME\n",
"0 1670 225\n",
"1 1670 225\n",
"2 580 105\n",
"3 580 105\n",
"4 580 100"
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X = delays_df.loc[:,['DISTANCE', 'CRS_ELAPSED_TIME']]\n",
"X.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Create a DataFrame called y containing only the value we want to predict with our model. \n",
"\n",
"In our case we want to predict how many minutes late a flight will arrive. This information is in the ARR_DELAY column. "
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ARR_DELAY</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>-17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>-25.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>-13.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>-7.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ARR_DELAY\n",
"0 -17.0\n",
"1 -25.0\n",
"2 -13.0\n",
"3 -12.0\n",
"4 -7.0"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y = delays_df.loc[:,['ARR_DELAY']]\n",
"y.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Split into test and training data\n",
"Use **scikitlearn train_test_split** to move 30% of the rows into Test DataFrames\n",
"\n",
"The other 70% of the rows into DataFrames we can use to train our model\n",
"\n",
"NOTE: by specifying a value for *random_state* we ensure that if we run the code again the same rows will be moved into the test DataFrame. This makes our results repeatable."
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"from sklearn.model_selection import train_test_split\n",
"\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, \n",
" y, \n",
" test_size=0.3, \n",
" random_state=42\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We now have a DataFrame **X_train** which contains 70% of the rows\n",
"\n",
"We will use this DataFrame to train our model"
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(210000, 2)"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The DataFrame **X_test** contains the remaining 30% of the rows\n",
"\n",
"We will use this DataFrame to test our trained model, so we can check it's accuracy"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(90000, 2)"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**X_train** and **X_test** contain our features\n",
"\n",
"The features are the columns we think can help us predict how late a flight will arrive: **DISTANCE** and **CRS_ELAPSED_TIME**"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>DISTANCE</th>\n",
" <th>CRS_ELAPSED_TIME</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>186295</th>\n",
" <td>237</td>\n",
" <td>60</td>\n",
" </tr>\n",
" <tr>\n",
" <th>127847</th>\n",
" <td>411</td>\n",
" <td>111</td>\n",
" </tr>\n",
" <tr>\n",
" <th>274740</th>\n",
" <td>342</td>\n",
" <td>85</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74908</th>\n",
" <td>1005</td>\n",
" <td>164</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11630</th>\n",
" <td>484</td>\n",
" <td>100</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" DISTANCE CRS_ELAPSED_TIME\n",
"186295 237 60\n",
"127847 411 111\n",
"274740 342 85\n",
"74908 1005 164\n",
"11630 484 100"
]
},
"execution_count": 9,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"X_train.head()"
]
},
{
"cell_type": "markdown",
"metadata": {
"scrolled": true
},
"source": [
"The DataFrame **y_train** contains 70% of the rows\n",
"\n",
"We will use this DataFrame to train our model"
]
},
{
"cell_type": "code",
"execution_count": 27,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(210000, 1)"
]
},
"execution_count": 27,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The DataFrame **y_test** contains the remaining 30% of the rows\n",
"\n",
"We will use this DataFrame to test our trained model, so we can check it's accuracy"
]
},
{
"cell_type": "code",
"execution_count": 28,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"(90000, 1)"
]
},
"execution_count": 28,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_test.shape"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"**y_train** and **y_test** contain our label\n",
"\n",
"The label is the columns we want to predict with our trained model: **ARR_DELAY**\n",
"\n",
"**NOTE:** a negative value for ARR_DELAY indicates a flight arrived early"
]
},
{
"cell_type": "code",
"execution_count": 29,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ARR_DELAY</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>186295</th>\n",
" <td>-7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>127847</th>\n",
" <td>-16.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>274740</th>\n",
" <td>-10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>74908</th>\n",
" <td>-19.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11630</th>\n",
" <td>-13.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" ARR_DELAY\n",
"186295 -7.0\n",
"127847 -16.0\n",
"274740 -10.0\n",
"74908 -19.0\n",
"11630 -13.0"
]
},
"execution_count": 29,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_train.head()"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

The file diff is not shown because it is too large. Load diff

View file

@ -0,0 +1,13 @@
# Splitting test and training data with scikit-learn
[scikit-learn](https://scikit-learn.org/) is a library of tools for predictive data analysis, which will allow you to prepare your data for machine learning and create models.
## Common functions
- [train_test_split](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html) splits arrays into random train and test subsets
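A minimal sketch of a 70/30 split, assuming the feature and label columns used in this folder's notebooks:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Load the flight data and drop rows with missing values
delays_df = pd.read_csv('Data/Lots_of_flight_data.csv').dropna()

X = delays_df.loc[:, ['DISTANCE', 'CRS_ELAPSED_TIME']]  # features
y = delays_df.loc[:, ['ARR_DELAY']]                      # label

# Hold back 30% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

print(X_train.shape, X_test.shape)  # 70% of the rows vs. 30% of the rows
```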
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,116 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train a linear regression model\n",
"When you have your data prepared you can train a model.\n",
"\n",
"There are multiple libraries and methods you can call to train models. In this notebook we will use the **LinearRegression** model in the **scikit-learn** library"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We need our DataFrame, with data loaded, all the rows with null values removed, and the features and labels split into the separate training and test data. So, we'll start by just rerunning the commands from the previous notebooks."
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Load our data from the csv file\n",
"delays_df = pd.read_csv('Data/Lots_of_flight_data.csv') \n",
"\n",
"# Remove rows with null values since those will crash our linear regression model training\n",
"delays_df.dropna(inplace=True)\n",
"\n",
"# Move our features into the X DataFrame\n",
"X = delays_df.loc[:,['DISTANCE', 'CRS_ELAPSED_TIME']]\n",
"\n",
"# Move our labels into the y DataFrame\n",
"y = delays_df.loc[:,['ARR_DELAY']] \n",
"\n",
"# Split our data into test and training DataFrames\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, \n",
" y, \n",
" test_size=0.3, \n",
" random_state=42\n",
" )"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use **Scikitlearn LinearRegression** *fit* method to train a linear regression model based on the training data stored in X_train and y_train"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"from sklearn.linear_model import LinearRegression\n",
"\n",
"regressor = LinearRegression() # Create a scikit learn LinearRegression object\n",
"regressor.fit(X_train, y_train) # Use the fit method to train the model using your training data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"The *regressor* object now contains your trained Linear Regression model"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

The file diff is not shown because it is too large. Load diff

View file

@ -0,0 +1,14 @@
# Train a linear regression model with scikit-learn
[Linear regression](https://en.wikipedia.org/wiki/Linear_regression) is a common algorithm for predicting values based on a given dataset.
## Common classes and functions
- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) fits a linear model
- [LinearRegression.fit](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression.fit) is used to fit the linear model based on training data
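A minimal sketch of training the model, following the same preparation steps used in this folder's notebooks:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Load, clean, and split the flight data as in the earlier notebooks
delays_df = pd.read_csv('Data/Lots_of_flight_data.csv').dropna()
X = delays_df.loc[:, ['DISTANCE', 'CRS_ELAPSED_TIME']]  # features
y = delays_df.loc[:, ['ARR_DELAY']]                      # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor = LinearRegression()   # create the model object
regressor.fit(X_train, y_train)  # learn coefficients from the training data
```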
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,459 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Test a trained model\n",
"Once you have trained a model, you can test it with the test data you put aside"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"We will start by rerunning the code from the previous notebook to create a trained model"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"LinearRegression(copy_X=True, fit_intercept=True, n_jobs=None, normalize=False)"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Load our data from the csv file\n",
"delays_df = pd.read_csv('Data/Lots_of_flight_data.csv') \n",
"\n",
"# Remove rows with null values since those will crash our linear regression model training\n",
"delays_df.dropna(inplace=True)\n",
"\n",
"# Move our features into the X DataFrame\n",
"X = delays_df.loc[:,['DISTANCE', 'CRS_ELAPSED_TIME']]\n",
"\n",
"# Move our labels into the y DataFrame\n",
"y = delays_df.loc[:,['ARR_DELAY']] \n",
"\n",
"# Split our data into test and training DataFrames\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, \n",
" y, \n",
" test_size=0.3, \n",
" random_state=42\n",
" )\n",
"regressor = LinearRegression() # Create a scikit learn LinearRegression object\n",
"regressor.fit(X_train, y_train) # Use the fit method to train the model using your training data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test the model\n",
"Use **Scikitlearn LinearRegression predict** to have our trained model predict values for our test data\n",
"\n",
"We stored our test data in X_Test\n",
"\n",
"We will store the predicted results in y_pred"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"y_pred = regressor.predict(X_test)"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[3.47739078],\n",
" [5.89055919],\n",
" [4.33288464],\n",
" ...,\n",
" [5.84678979],\n",
" [6.05195889],\n",
" [5.66255414]])"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"When we split our data into training and test data we stored the actual values for each row of test data in the DataFrame y_test\n",
"\n",
"We can compare the values in y_pred to the value in y_test to get a sense of how accurately our mdoel predicted arrival delays"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ARR_DELAY</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>291483</th>\n",
" <td>-5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98997</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23454</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110802</th>\n",
" <td>-14.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49449</th>\n",
" <td>-20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94944</th>\n",
" <td>14.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>160885</th>\n",
" <td>-17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47572</th>\n",
" <td>-20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164800</th>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62578</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196742</th>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91166</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>171564</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60706</th>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>240773</th>\n",
" <td>-6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32695</th>\n",
" <td>-13.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98399</th>\n",
" <td>-23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>167341</th>\n",
" <td>-11.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>126191</th>\n",
" <td>-4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>188715</th>\n",
" <td>131.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>258610</th>\n",
" <td>-5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>215751</th>\n",
" <td>-20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41210</th>\n",
" <td>-15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68090</th>\n",
" <td>-19.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140794</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>178840</th>\n",
" <td>-14.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>248071</th>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12770</th>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95948</th>\n",
" <td>40.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>172913</th>\n",
" <td>-13.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200797</th>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36199</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70402</th>\n",
" <td>-37.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>285308</th>\n",
" <td>152.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>201508</th>\n",
" <td>-2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>154671</th>\n",
" <td>-5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>238535</th>\n",
" <td>-5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>133567</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3349</th>\n",
" <td>-8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>257254</th>\n",
" <td>-28.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106572</th>\n",
" <td>-19.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73023</th>\n",
" <td>-25.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>214699</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>274435</th>\n",
" <td>-7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67089</th>\n",
" <td>-10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>269917</th>\n",
" <td>-4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164966</th>\n",
" <td>70.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>275120</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139292</th>\n",
" <td>-8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31106</th>\n",
" <td>-25.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>277799</th>\n",
" <td>17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>293749</th>\n",
" <td>-7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>231114</th>\n",
" <td>35.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11645</th>\n",
" <td>-15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>252520</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>209898</th>\n",
" <td>-20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22210</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>165727</th>\n",
" <td>-6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>260838</th>\n",
" <td>-33.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192546</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>88750 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" ARR_DELAY\n",
"291483 -5.0\n",
"98997 -12.0\n",
"23454 -9.0\n",
"110802 -14.0\n",
"49449 -20.0\n",
"... ...\n",
"209898 -20.0\n",
"22210 -9.0\n",
"165727 -6.0\n",
"260838 -33.0\n",
"192546 0.0\n",
"\n",
"[88750 rows x 1 columns]"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_test"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

The file diff is not shown because it is too large. Load diff

View file

@ -0,0 +1,14 @@
# Testing a model
Once a model is built it can be used to predict values. You can provide new values to see where they fall on the spectrum, and test the generated model.
## Common classes and functions
- [LinearRegression](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html) fits a linear model
- [LinearRegression.predict](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html?highlight=linearregression#sklearn.linear_model.LinearRegression.predict) is used to predict outcomes for new data based on the trained linear model
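A minimal sketch of predicting with the trained model and comparing against the held-back test rows, reusing the preparation steps from this folder's notebooks:

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Recreate the trained model from the earlier notebooks
delays_df = pd.read_csv('Data/Lots_of_flight_data.csv').dropna()
X = delays_df.loc[:, ['DISTANCE', 'CRS_ELAPSED_TIME']]
y = delays_df.loc[:, ['ARR_DELAY']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

regressor = LinearRegression()
regressor.fit(X_train, y_train)

y_pred = regressor.predict(X_test)  # predicted arrival delays for the test rows
print(y_pred[:5])                   # compare these against the actual values in y_test.head()
```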
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,202 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Evaluating accuracy of a model using calculations\n",
"After you train a model, you need to get a sense of it's accuracy. The accuracy of a model gives you an idea of how much confidence you can put it predictions made by the model.\n",
"\n",
"The **scitkit-learn** and **numpy** libraries are both helpful for measuring model accuracy"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Let's start by recreating our trained linear regression model from the last lesson"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"# Load our data from the csv file\n",
"delays_df = pd.read_csv('Data/Lots_of_flight_data.csv') \n",
"\n",
"# Remove rows with null values since those will crash our linear regression model training\n",
"delays_df.dropna(inplace=True)\n",
"\n",
"# Move our features into the X DataFrame\n",
"X = delays_df.loc[:,['DISTANCE', 'CRS_ELAPSED_TIME']]\n",
"\n",
"# Move our labels into the y DataFrame\n",
"y = delays_df.loc[:,['ARR_DELAY']] \n",
"\n",
"# Split our data into test and training DataFrames\n",
"X_train, X_test, y_train, y_test = train_test_split(\n",
" X, \n",
" y, \n",
" test_size=0.3, \n",
" random_state=42\n",
" )\n",
"regressor = LinearRegression() # Create a scikit learn LinearRegression object\n",
"regressor.fit(X_train, y_train) # Use the fit method to train the model using your training data\n",
"\n",
"y_pred = regressor.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Measuring accuracy\n",
"Now that we have a trained model there are a number of metrics you can use to check the accuracy of the model. \n",
"\n",
"All these metrics are based on mathematical calculations, the key take-away here is you don't have to calculate everything yourself. Scikit-learn and numpy will do most of the work and provide good performance."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Mean Squared Error (MSE)\n",
"The MSE is the average error performed by the model when predicting the outcome for an observation. \n",
"The lower the MSE, the better the model.\n",
"\n",
"MSE is the average squared difference between the observed actual outome values and the values predicted by the model.\n",
"\n",
"MSE = mean((actuals - predicteds)^2) \n",
"\n",
"We could write code to loop through our records comparing actual and predicated values to perform this calculation, but we don't have to! Just use **mean_squared_error** from the **scikit-learn** library"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Mean Squared Error: 2250.4445141530855\n"
]
}
],
"source": [
"from sklearn import metrics\n",
"print('Mean Squared Error:', metrics.mean_squared_error(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Root Mean Squared Error (RMSE)\n",
"RMSE is the average error performed by the model when predicting the outcome for an observation. \n",
"The lower the RMSE, the better the model.\n",
"\n",
"Mathematically, the RMSE is the square root of the mean squared error \n",
"\n",
"RMSE = sqrt(MSE)\n",
"\n",
"Skikit learn does not have a function for RMSE, but since it's just the square root of MSE, we can use the numpy library which contains lots of mathematical functions to calculate the square root of the MSE"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"print('Root Mean Squared Error:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Mean Absolute Error (MAE)\n",
"MAE measures the prediction error. The lower the MAE the better the model\n",
"\n",
"Mathematically, it is the average absolute difference between observed and predicted outcomes\n",
"\n",
"MAE = mean(abs(actuals - predicteds)). \n",
"\n",
"MAE is less sensitive to outliers compared to RMSE. Calculate RMSE using **mean_absolute_error** in the **scikit-learn** library"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('Mean absolute error: ',metrics.mean_absolute_error(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# R^2 or R-Squared\n",
"\n",
"R squared is the proportion of variation in the outcome that is explained by the predictor variables. It is an indication of how much the values passed to the model influence the predicted value. \n",
"\n",
"The Higher the R-squared, the better the model. Calculate R-Squared using **r2_score** in the **scikit-learn** library."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"print('R^2: ',metrics.r2_score(y_test, y_pred))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Different models have different ways to measure accuracy. Fortunately **scikit-learn** and **numpy** provide a wide variety of functions to help measure accuracy."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

The file diff is not shown because it is too large. Load diff

View file

@ -0,0 +1,23 @@
# Evaluating accuracy of a model using calculations
Playing with individual values isn't the best way to test a model. Fortunately, [scikit-learn](https://scikit-learn.org/) provides tools for automated testing and analysis.
## Common functions
- [metrics](https://scikit-learn.org/stable/modules/classes.html?highlight=metrics#module-sklearn.metrics) includes functions and metrics that can be used for data science including measuring accuracy of models
- [mean_squared_error](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.mean_squared_error.html#sklearn.metrics.mean_squared_error) returns the mean squared error, a measure used to measure accuracy of linear regression models
- [r2_score](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html#sklearn.metrics.r2_score) returns the R^2 regression score, a measure used to measure accuracy of linear regression models
## NumPy
[NumPy](https://numpy.org/) is a package for scientific computing with Python
### Common functions
- [sqrt](https://numpy.org/doc/1.18/reference/generated/numpy.sqrt.html?highlight=sqrt#numpy.sqrt) returns the square root of a value
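A minimal sketch of computing MSE, RMSE, MAE, and R^2 for the trained model, recreating the predictions the same way as in this folder's notebooks:

```python
import numpy as np
import pandas as pd
from sklearn import metrics
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Recreate the trained model and predictions from the earlier notebooks
delays_df = pd.read_csv('Data/Lots_of_flight_data.csv').dropna()
X = delays_df.loc[:, ['DISTANCE', 'CRS_ELAPSED_TIME']]
y = delays_df.loc[:, ['ARR_DELAY']]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

print('MSE :', metrics.mean_squared_error(y_test, y_pred))           # mean squared error
print('RMSE:', np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  # root mean squared error
print('MAE :', metrics.mean_absolute_error(y_test, y_pred))          # mean absolute error
print('R^2 :', metrics.r2_score(y_test, y_pred))                     # R-squared score
```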
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,697 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# Moving data from numpy arrays to pandas DataFrames\n",
"In our last notebook we trained a model and compared our actual and predicted results\n",
"\n",
"What may not have been evident was when we did this we were working with two different objects: a **numpy array** and a **pandas DataFrame**"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To explore further let's rerun the code from the previous notebook to create a trained model and get predicted values for our test data"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd\n",
"from sklearn.model_selection import train_test_split\n",
"from sklearn.linear_model import LinearRegression"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"# Load our data from the csv file\n",
"delays_df = pd.read_csv('Data/Lots_of_flight_data.csv') \n",
"\n",
"# Remove rows with null values since those will crash our linear regression model training\n",
"delays_df.dropna(inplace=True)\n",
"\n",
"# Move our features into the X DataFrame\n",
"X = delays_df.loc[:,['DISTANCE','CRS_ELAPSED_TIME']]\n",
"\n",
"# Move our labels into the y DataFrame\n",
"y = delays_df.loc[:,['ARR_DELAY']] \n",
"\n",
"# Split our data into test and training DataFrames\n",
"X_train, X_test, y_train, y_test = train_test_split(X, \n",
" y, \n",
" test_size=0.3, \n",
" random_state=42)\n",
"regressor = LinearRegression() # Create a scikit learn LinearRegression object\n",
"regressor.fit(X_train, y_train) # Use the fit method to train the model using your training data\n",
"\n",
"y_pred = regressor.predict(X_test) # Generate predicted values for our test data"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"In the last Notebook, you might have noticed the output displays differently when you display the contents of the predicted values in y_pred and the actual values in y_test"
]
},
{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"array([[3.47739078],\n",
" [5.89055919],\n",
" [4.33288464],\n",
" ...,\n",
" [5.84678979],\n",
" [6.05195889],\n",
" [5.66255414]])"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_pred"
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>ARR_DELAY</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>291483</th>\n",
" <td>-5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98997</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>23454</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>110802</th>\n",
" <td>-14.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>49449</th>\n",
" <td>-20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>94944</th>\n",
" <td>14.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>160885</th>\n",
" <td>-17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>47572</th>\n",
" <td>-20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164800</th>\n",
" <td>20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>62578</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>196742</th>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>91166</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>171564</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>60706</th>\n",
" <td>6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>240773</th>\n",
" <td>-6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>32695</th>\n",
" <td>-13.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>98399</th>\n",
" <td>-23.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>167341</th>\n",
" <td>-11.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>126191</th>\n",
" <td>-4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>188715</th>\n",
" <td>131.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>258610</th>\n",
" <td>-5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>215751</th>\n",
" <td>-20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>41210</th>\n",
" <td>-15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>68090</th>\n",
" <td>-19.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>140794</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>178840</th>\n",
" <td>-14.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>248071</th>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>12770</th>\n",
" <td>5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>95948</th>\n",
" <td>40.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>172913</th>\n",
" <td>-13.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>200797</th>\n",
" <td>21.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>36199</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>70402</th>\n",
" <td>-37.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>285308</th>\n",
" <td>152.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>201508</th>\n",
" <td>-2.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>154671</th>\n",
" <td>-5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>238535</th>\n",
" <td>-5.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>133567</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3349</th>\n",
" <td>-8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>257254</th>\n",
" <td>-28.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>106572</th>\n",
" <td>-19.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>73023</th>\n",
" <td>-25.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>214699</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>274435</th>\n",
" <td>-7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>67089</th>\n",
" <td>-10.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>269917</th>\n",
" <td>-4.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>164966</th>\n",
" <td>70.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>275120</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>139292</th>\n",
" <td>-8.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>31106</th>\n",
" <td>-25.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>277799</th>\n",
" <td>17.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>293749</th>\n",
" <td>-7.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>231114</th>\n",
" <td>35.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>11645</th>\n",
" <td>-15.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>252520</th>\n",
" <td>-12.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>209898</th>\n",
" <td>-20.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>22210</th>\n",
" <td>-9.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>165727</th>\n",
" <td>-6.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>260838</th>\n",
" <td>-33.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>192546</th>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>88750 rows × 1 columns</p>\n",
"</div>"
],
"text/plain": [
" ARR_DELAY\n",
"291483 -5.0\n",
"98997 -12.0\n",
"23454 -9.0\n",
"110802 -14.0\n",
"49449 -20.0\n",
"... ...\n",
"209898 -20.0\n",
"22210 -9.0\n",
"165727 -6.0\n",
"260838 -33.0\n",
"192546 0.0\n",
"\n",
"[88750 rows x 1 columns]"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"y_test"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use **type()** to check the datatype of an object."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"numpy.ndarray"
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(y_pred)"
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"pandas.core.frame.DataFrame"
]
},
"execution_count": 8,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"type(y_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"* **y_pred** is a numpy array\n",
"* **y_test** is a pandas DataFrame\n",
"\n",
"Another way you might discover this is if you try to use the **head** method on **y_pred**. \n",
"\n",
"This will return an error, because **head** is a method of the DataFrame class it is not a method of numpy arrays"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"ename": "AttributeError",
"evalue": "'numpy.ndarray' object has no attribute 'head'",
"output_type": "error",
"traceback": [
"\u001b[1;31m---------------------------------------------------------------------------\u001b[0m",
"\u001b[1;31mAttributeError\u001b[0m Traceback (most recent call last)",
"\u001b[1;32m<ipython-input-9-05146ec42336>\u001b[0m in \u001b[0;36m<module>\u001b[1;34m\u001b[0m\n\u001b[1;32m----> 1\u001b[1;33m \u001b[0my_pred\u001b[0m\u001b[1;33m.\u001b[0m\u001b[0mhead\u001b[0m\u001b[1;33m(\u001b[0m\u001b[1;33m)\u001b[0m\u001b[1;33m\u001b[0m\u001b[1;33m\u001b[0m\u001b[0m\n\u001b[0m",
"\u001b[1;31mAttributeError\u001b[0m: 'numpy.ndarray' object has no attribute 'head'"
]
}
],
"source": [
"y_pred.head()"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A one dimensional numpy array is similar to a pandas Series\n"
]
},
{
"cell_type": "code",
"execution_count": 10,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"['Pearson' 'Changi' 'Narita']\n",
"Narita\n"
]
}
],
"source": [
"import numpy as np\n",
"airports_array = np.array(['Pearson','Changi','Narita'])\n",
"print(airports_array)\n",
"print(airports_array[2])"
]
},
{
"cell_type": "code",
"execution_count": 11,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"0 Pearson\n",
"1 Changi\n",
"2 Narita\n",
"dtype: object\n",
"Narita\n"
]
}
],
"source": [
"airports_series = pd.Series(['Pearson','Changi','Narita'])\n",
"print(airports_series)\n",
"print(airports_series[2])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"A two dimensional numpy array is similar to a pandas DataFrame"
]
},
{
"cell_type": "code",
"execution_count": 12,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"[['YYZ' 'Pearson']\n",
" ['SIN' 'Changi']\n",
" ['NRT' 'Narita']]\n",
"YYZ\n"
]
}
],
"source": [
"airports_array = np.array([\n",
" ['YYZ','Pearson'],\n",
" ['SIN','Changi'],\n",
" ['NRT','Narita']])\n",
"print(airports_array)\n",
"print(airports_array[0,0])"
]
},
{
"cell_type": "code",
"execution_count": 13,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
" 0 1\n",
"0 YYZ Pearson\n",
"1 SIN Changi\n",
"2 NRT Narita\n",
"YYZ\n"
]
}
],
"source": [
"airports_df = pd.DataFrame([['YYZ','Pearson'],['SIN','Changi'],['NRT','Narita']])\n",
"print(airports_df)\n",
"print(airports_df.iloc[0,0])"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"If you need the functionality of a DataFrame, you can move data from numpy objects to pandas objects and vice-versa.\n",
"\n",
"In the example below we use the DataFrame constructor to read the contents of the numpy array *y_pred* into a DataFrame called *predicted_df*\n",
"\n",
"Then we can use the functionality of the DataFrame object"
]
},
{
"cell_type": "code",
"execution_count": 14,
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>0</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>3.477391</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>5.890559</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>4.332885</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>3.447476</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>5.072394</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" 0\n",
"0 3.477391\n",
"1 5.890559\n",
"2 4.332885\n",
"3 3.447476\n",
"4 5.072394"
]
},
"execution_count": 14,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"predicted_df = pd.DataFrame(y_pred)\n",
"predicted_df.head()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.6.9"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

The file diff is not shown because it is too large. Load diff

View file

@ -0,0 +1,28 @@
# NumPy vs pandas
There are numerous libraries available to data scientists. NumPy and pandas are two of the most common.
Some operations may return different data types. You can use the Python function [type](https://docs.python.org/3/library/functions.html#type) to determine the type of an object.
## NumPy
[NumPy](https://numpy.org/) is a Python package for scientific computing that includes an N-dimensional array object and tools for data analysis.
### Common object
- [array](https://numpy.org/doc/1.18/reference/generated/numpy.array.html?highlight=array#numpy.array) creates an N-dimensional array object
## pandas
[pandas](https://pandas.pydata.org/) is a Python package for data analysis that includes one-dimensional and two-dimensional array objects
### Common objects
- [Series](https://pandas.pydata.org/docs/reference/api/pandas.Series.html) stores a one dimensional array
- [DataFrame](https://pandas.pydata.org/docs/reference/frame.html) stores a two-dimensional array
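A minimal sketch showing the parallel objects and how type reveals which one you are working with, using the airport examples from the notebook in this folder:

```python
import numpy as np
import pandas as pd

# One-dimensional: NumPy array vs. pandas Series
airports_array = np.array(['Pearson', 'Changi', 'Narita'])
airports_series = pd.Series(['Pearson', 'Changi', 'Narita'])
print(type(airports_array))   # <class 'numpy.ndarray'>
print(type(airports_series))  # <class 'pandas.core.series.Series'>

# Two-dimensional: NumPy array vs. pandas DataFrame
codes_array = np.array([['YYZ', 'Pearson'], ['SIN', 'Changi'], ['NRT', 'Narita']])
codes_df = pd.DataFrame(codes_array)  # wrap the array to get DataFrame methods such as head()
print(codes_df.head())
```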
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,150 @@
{
"cells": [
{
"cell_type": "markdown",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Visualizing data with matplotlib"
]
},
{
"cell_type": "markdown",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"Somtimes graphs provide the best way to visualize data\n",
"\n",
"The **matplotlib** library allows you to draw graphs to help with visualization\n",
"\n",
"If we want to visualize data, we will need to load some data into a DataFrame"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import pandas as pd"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"# Load our data from the csv file\n",
"delays_df = pd.read_csv('Data/Lots_of_flight_data.csv') "
]
},
{
"cell_type": "markdown",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"In order to display plots we need to import the **matplotlib** library"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import matplotlib.pyplot as plt"
]
},
{
"cell_type": "markdown",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"A common plot used in data science is the scatter plot for checking the relationship between two columns\n",
"If you see dots scattered everywhere, there is no correlation between the two columns\n",
"If you see somethign resembling a line, there is a correlation between the two columns\n",
"\n",
"You can use the plot method of the DataFrame to draw the scatter plot\n",
"* kind - the type of graph to draw\n",
"* x - value to plot as x\n",
"* y - value to plot as y\n",
"* color - color to use for the graph points\n",
"* alpha - opacity - useful to show density of points in a scatter plot\n",
"* title - title of the graph"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"scrolled": true
},
"outputs": [],
"source": [
"#Check if there is a relationship between the distance of a flight and how late the flight arrives\n",
"delays_df.plot(\n",
" kind='scatter',\n",
" x='DISTANCE',\n",
" y='ARR_DELAY',\n",
" color='blue',\n",
" alpha=0.3,\n",
" title='Correlation of arrival and distance'\n",
" )\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"#Check if there is a relationship between the how late the flight leaves and how late the flight arrives\n",
"delays_df.plot(\n",
" kind='scatter',\n",
" x='DEP_DELAY',\n",
" y='ARR_DELAY',\n",
" color='blue',\n",
" alpha=0.3,\n",
" title='Correlation of arrival and departure delay'\n",
" )\n",
"plt.show()"
]
},
{
"cell_type": "markdown",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"The scatter plot allows us to see there is no correlation between distance and arrival delay but there is a strong correlation between departure delay and arrival delay.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.3"
}
},
"nbformat": 4,
"nbformat_minor": 2
}

The file diff is not shown because it is too large. Load diff

View file

@ -0,0 +1,16 @@
# Visualizing data with Matplotlib
[Matplotlib](https://matplotlib.org/) gives you the ability to draw charts which can be used to visualize data.
## Common tools and functions
- [pyplot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.html?highlight=pyplot#module-matplotlib.pyplot) provides the ability to draw plots similar to the MATLAB tool
- [pyplot.plot](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.plot.html#matplotlib.pyplot.plot) plots a graph
- [pyplot.show](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.show.html#matplotlib.pyplot.show) displays figures such as a graph
- [pyplot.scatter](https://matplotlib.org/api/_as_gen/matplotlib.pyplot.scatter.html?highlight=scatter%20plot#matplotlib.pyplot.scatter) is used to draw scatter plots, a diagram that shows the relationship between two sets of data
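A minimal sketch of the scatter plot drawn in this folder's notebook, checking whether departure delay correlates with arrival delay (the CSV path and column names follow the notebook):

```python
import pandas as pd
import matplotlib.pyplot as plt

delays_df = pd.read_csv('Data/Lots_of_flight_data.csv')

# Dots that form a rough line suggest a correlation between the two columns
delays_df.plot(
    kind='scatter',
    x='DEP_DELAY',
    y='ARR_DELAY',
    color='blue',
    alpha=0.3,                 # low opacity shows the density of overlapping points
    title='Correlation of arrival and departure delay')
plt.show()
```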
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Intro to machine learning with Python and Azure Notebooks](https://docs.microsoft.com/learn/paths/intro-to-ml-with-python/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,36 @@
# Even more Python for beginners - data tools
## Overview
Data science and machine learning are among the most popular fields today, and Python is one of the most popular languages in both. As you might expect, there are several libraries and tools available to you. As you begin your journey into this field, it will help to be familiar with the most common frameworks and techniques. This is what we're here to help you with!
We're going to introduce [Jupyter notebooks](https://jupyter.org/), a common tool for data scientists. We're also going to show off [pandas](https://pandas.pydata.org/) which is used to help manage and explore data, and [scikit-learn](https://scikit-learn.org/) for incorporating machine learning. You'll see how to bring everything together and walk through a common scenario of loading data and running it through a particular algorithm.
Our goal is to help show you the tools you'll be using as you dig deeper into data science and machine learning. While we won't highlight the decision points of algorithms or collecting the data (there are other courses available for those topics), you will explore the techniques and libraries.
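As a quick taste of how these tools fit together, here's a minimal sketch; the file name and column names are placeholders, and the notebooks in this repository walk through a complete example.

``` python
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Load a hypothetical CSV of flight data into a DataFrame
delays_df = pd.read_csv('flight_delays.csv')

# Predict arrival delay (ARR_DELAY) from flight distance (DISTANCE)
X = delays_df[['DISTANCE']]
y = delays_df['ARR_DELAY']

# Hold back some rows for testing, train a model, and score it
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
print(model.score(X_test, y_test))
```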
### What you'll learn
- Jupyter notebooks
- pandas DataFrame for managing data
- NumPy for arrays
- scikit-learn for machine learning
### What we don't cover
- Theory behind machine learning
- Algorithm selection
- Managing big data
## Prerequisites
- [An understanding of Git](https://git-scm.com/book/en/v1/Getting-Started)
- [An understanding of Python](https://aka.ms/pythonbeginnerseries)
## Next steps
As the goal of this course is to help get you up to speed so you can work through a quick start, the next step after completing the videos is to follow a tutorial! Here are a few of our favorites:
- [Predict flight delays by creating a machine learning model in Python](https://docs.microsoft.com/learn/modules/predict-flight-delays-with-python?WT.mc_id=python-c9-niner)
- [Train a machine learning model with Azure Machine Learning](https://docs.microsoft.com/learn/modules/train-local-model-with-azure-mls?WT.mc_id=python-c9-niner)
- [Analyze climate data](https://docs.microsoft.com/learn/modules/analyze-climate-data-with-azure-notebooks?WT.mc_id=python-c9-niner)
- [Use unsupervised learning to analyze unlabeled data](https://docs.microsoft.com/learn/modules/introduction-to-unsupervised-learning?WT.mc_id=python-c9-niner)

15 binary files are not shown.

134
more-python-for-beginners/.gitignore vendored Normal file
View file

@ -0,0 +1,134 @@
# Created by https://www.gitignore.io/api/python,virtualenv,visualstudiocode
# Edit at https://www.gitignore.io/?templates=python,virtualenv,visualstudiocode
### Python ###
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
pip-wheel-metadata/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
# Usually these files are written by a python script from a template
# before PyInstaller builds the exe, so as to inject date/other infos into it.
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
.hypothesis/
.pytest_cache/
# Translations
*.mo
*.pot
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
target/
# pyenv
.python-version
# pipenv
# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
# However, in case of collaboration, if having platform-specific dependencies or dependencies
# having no cross-platform support, pipenv may install dependencies that don't work, or not
# install all needed dependencies.
#Pipfile.lock
# celery beat schedule file
celerybeat-schedule
# SageMath parsed files
*.sage.py
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# Mr Developer
.mr.developer.cfg
.project
.pydevproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
### VirtualEnv ###
# Virtualenv
# http://iamzed.com/2009/05/07/a-primer-on-virtualenv/
pyvenv.cfg
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
pip-selfcheck.json
### VisualStudioCode ###
.vscode/*
!.vscode/settings.json
!.vscode/tasks.json
!.vscode/launch.json
!.vscode/extensions.json
### VisualStudioCode Patch ###
# Ignore all local history of files
.history
# End of https://www.gitignore.io/api/python,virtualenv,visualstudiocode

View file

@ -0,0 +1,8 @@
{
"python.linting.pylintEnabled": true,
"python.linting.enabled": true,
"python.linting.flake8Enabled": false,
"python.linting.banditEnabled": false,
"python.linting.pycodestyleEnabled": false,
"python.linting.mypyEnabled": false
}

View file

@ -0,0 +1,24 @@
# Style Guidelines
## Formatting
Formatting makes code readable and easier to debug.
## Documentation
- [PEP 8](https://pep8.org/) is a set of coding conventions for Python code
- [Docstring](https://www.python.org/dev/peps/pep-0257/) is the standard for documenting a module, function, class or method definition
## Linting
Linting helps you identify formatting and convention issues in your Python code.
- [Pylint](https://www.pylint.org/) is a linter for Python that helps enforce coding standards and checks for errors in your code
- [Linting Python in Visual Studio Code](https://code.visualstudio.com/docs/python/linting) shows you how to enable linters in VS Code
- [Type hints](https://docs.python.org/3/library/typing.html) allow some interactive development environments and linters to enforce types
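Here's a small sketch showing what these pieces look like together: PEP 8 formatting, a docstring, and type hints. The function itself is just an example.

``` python
def get_greeting(name: str) -> str:
    """Return a greeting for the named person.

    Parameters:
        name (str): The name of the person to greet

    Returns:
        str: The greeting
    """
    return 'Hello, ' + name
```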
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Set up your Python beginner development environment with Visual Studio Code](https://docs.microsoft.com/learn/languages/python-install-vscode/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,11 @@
x = 12
if x == 24:
print('Is valid')
else:
print("Not valid")
def helper(name='sample'):
pass
def another(name = 'sample'):
pass

View file

@ -0,0 +1,10 @@
def print_hello(name: str) -> str:
    """
    Greets the user by name

    Parameters:
        name (str): The name of the user

    Returns:
        str: The greeting
    """
    greeting = 'Hello, ' + name
    print(greeting)
    return greeting

View file

@ -0,0 +1,9 @@
# Lambdas
A [lambda](https://www.w3schools.com/python/python_lambda.asp) function is a small anonymous function. It can take any number of arguments but can only execute one expression.
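Here's a minimal sketch; the sorting example mirrors the accompanying sample code.

``` python
# A lambda assigned to a name, equivalent to a one-line function
double = lambda value: value * 2
print(double(21))  # 42

# Lambdas are most useful when passed to another function,
# for example as the key used when sorting
presenters = ['Christopher', 'Susan']
presenters.sort(key=lambda name: len(name))
print(presenters)  # ['Susan', 'Christopher']
```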
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Create reusable functionality with functions in Python](https://docs.microsoft.com/learn/languages/python-functions/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,9 @@
# This code will return an error because the sort method does not know
# which presenter field to use when sorting
presenters = [
{'name': 'Susan', 'age': 50},
{'name': 'Christopher', 'age': 47}
]
presenters.sort()
print(presenters)

View file

@ -0,0 +1,14 @@
# Sort alphabetically
presenters = [
{'name': 'Susan', 'age': 50},
{'name': 'Christopher', 'age': 47}
]
presenters.sort(key=lambda item: item['name'])
print('-- alphabetically --')
print(presenters)
# Sort by length of name (shortest to longest)
presenters.sort(key=lambda item: len(item['name']))
print('-- length --')
print(presenters)

View file

@ -0,0 +1,10 @@
def sorter(item):
return item['name']
presenters = [
{'name': 'Susan', 'age': 50},
{'name': 'Christopher', 'age': 47}
]
presenters.sort(key=sorter)
print(presenters)

View file

@ -0,0 +1,9 @@
# Classes
[Classes](https://docs.python.org/3/tutorial/classes.html) define data structures and behavior. Classes allow you to group data and functionality together.
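Here's a minimal sketch; the `Presenter` class mirrors the accompanying sample code.

``` python
class Presenter:
    def __init__(self, name):
        # Constructor: store data on the instance
        self.name = name

    def say_hello(self):
        # Method: behavior that uses the instance's data
        print('Hello, ' + self.name)

presenter = Presenter('Christopher')
presenter.say_hello()
```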
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Object-oriented programming in Python](https://docs.microsoft.com/learn/modules/python-object-oriented-programming/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,11 @@
class Presenter():
def __init__(self, name):
# Constructor
self.name = name
def say_hello(self):
# method
print('Hello, ' + self.name)
presenter = Presenter('Chris')
presenter.name = 'Christopher'
presenter.say_hello()

View file

@ -0,0 +1,18 @@
class Presenter():
def __init__(self, name):
# Constructor
self.name = name
@property
def name(self):
print('Retrieving name...')
return self.__name
@name.setter
def name(self, value):
# cool validation here
print('Validating name...')
self.__name = value
presenter = Presenter('Chris')
presenter.name = 'Christopher'
print(presenter.name)

View file

@ -0,0 +1,9 @@
# Inheritance
[Inheritance](https://docs.python.org/3/tutorial/classes.html#inheritance) allows you to define a class that inherits all the methods and properties from another class. The parent or base class is the class being inherited from. The child or derived class is the class that inherits from another class.
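Here's a minimal sketch based on the accompanying sample code.

``` python
class Person:
    def __init__(self, name):
        self.name = name

    def say_hello(self):
        print('Hello, ' + self.name)

# Student inherits the constructor and say_hello from Person
class Student(Person):
    def __init__(self, name, school):
        super().__init__(name)
        self.school = school

student = Student('Christopher', 'UVM')
student.say_hello()                 # inherited behavior
print(isinstance(student, Person))  # True
```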
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Object-oriented programming in Python](https://docs.microsoft.com/learn/modules/python-object-oriented-programming/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,20 @@
class Person:
def __init__(self, name):
self.name = name
def say_hello(self):
print('Hello, ' + self.name)
class Student(Person):
def __init__(self, name, school):
super().__init__(name)
self.school = school
def sing_school_song(self):
print('Ode to ' + self.school)
student = Student('Christopher', 'UVM')
student.say_hello()
student.sing_school_song()
# What are you?
print(isinstance(student, Student))
print(isinstance(student, Person))
print(issubclass(Student, Person))

View file

@ -0,0 +1,12 @@
# Mixins (multiple inheritance)
Python allows you to inherit from multiple classes. While the technical term for this is multiple inheritance, many developers refer to the use of more than one base class as adding a mixin. Mixins are commonly used in frameworks such as [Django](https://www.djangoproject.com); a condensed sketch follows the links below.
- [Multiple Inheritance](https://docs.python.org/3/tutorial/classes.html#multiple-inheritance)
- [super](https://docs.python.org/3/library/functions.html#super) is used to give access to methods and properties of a parent class
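Here's a condensed sketch of the pattern shown in the accompanying sample code.

``` python
class Loggable:
    def log(self):
        print('Log message from ' + type(self).__name__)

class Connection:
    def connect(self):
        print('Connecting to the database...')

# SqlDatabase picks up behavior from both base classes
class SqlDatabase(Connection, Loggable):
    pass

db = SqlDatabase()
db.connect()
db.log()
```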
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).
- [Object-oriented programming in Python](https://docs.microsoft.com/learn/modules/python-object-oriented-programming/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,27 @@
class Loggable:
def __init__(self):
self.title = ''
def log(self):
print('Log message from ' + self.title)
class Connection:
def __init__(self):
self.server = ''
def connect(self):
print('Connecting to database on ' + self.server)
class SqlDatabase(Connection, Loggable):
def __init__(self):
super().__init__()
self.title = 'Sql Connection Demo'
self.server = 'Some_Server'
def framework(item):
if isinstance(item, Connection):
item.connect()
if isinstance(item, Loggable):
item.log()
sql_connection = SqlDatabase()
framework(sql_connection)

View file

@ -0,0 +1,7 @@
# Managing the file system
Python's [pathlib](https://docs.python.org/3/library/pathlib.html) provides operations and classes to access files and directories in the file system.
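Here's a minimal sketch of the kinds of operations demonstrated in the accompanying samples.

``` python
from pathlib import Path

# The current working directory as a Path object
cwd = Path.cwd()
print('Current directory: ' + str(cwd))

# Build a path and ask questions about it
demo_file = cwd.joinpath('demo.txt')
print('Exists? ' + str(demo_file.exists()))
print('Parent folder: ' + demo_file.parent.name)

# List the subdirectories of the current directory
for child in cwd.iterdir():
    if child.is_dir():
        print(child)
```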
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1 @@
Lorem ipsum

View file

@ -0,0 +1,17 @@
from pathlib import Path
cwd = Path.cwd()
# Get the parent directory
parent = cwd.parent
# Is this a directory?
print('\nIs this a directory? ' + str(parent.is_dir()))
# Is this a file?
print('\nIs this a file? ' + str(parent.is_file()))
# List child directories
print('\n-----directory contents-----')
for child in parent.iterdir():
if child.is_dir():
print(child)

View file

@ -0,0 +1,16 @@
from pathlib import Path
cwd = Path.cwd()
demo_file = Path(Path.joinpath(cwd, 'demo.txt'))
# Get the file name
print('\nfile name: ' + demo_file.name)
# Get the extension
print('\nfile suffix: ' + demo_file.suffix)
# Get the folder
print('\nfile folder: ' + demo_file.parent.name)
# Get the size
print('\nfile size: ' + str(demo_file.stat().st_size) + '\n')

View file

@ -0,0 +1,14 @@
# Python 3.6 or higher
# Grab the library
from pathlib import Path
# What is the current working directory?
cwd = Path.cwd()
print('\nCurrent working directory:\n' + str(cwd))
# Create full path name by joining path and filename
new_file = Path.joinpath(cwd, 'new_file.txt')
print('\nFull path:\n' + str(new_file))
# Check if file exists
print('\nDoes that file exist? ' + str(new_file.exists()) + '\n')

View file

@ -0,0 +1,7 @@
# Working with files
Python allows you to read from and write to files. [io](https://docs.python.org/3/library/io.html) is the module that provides Python's capabilities for input/output (I/O), including text I/O with files.
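Here's a minimal sketch of writing a text file and reading it back; the file name is just an example.

``` python
# Open output.txt for writing text, write a line, and close the stream
stream = open('output.txt', 'wt')
stream.write('Hello, world\n')
stream.close()

# Re-open the same file for reading and print its contents
stream = open('output.txt', 'rt')
print(stream.read())
stream.close()
```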
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,3 @@
This is the first line of the file
And this is the second line of the file
This is the last line of the file

View file

@ -0,0 +1,17 @@
# Open manage.txt file to write text
stream = open('manage.txt', 'wt')
#Write the word demo to the file stream
stream.write('demo!')
# Move back to the start of the file stream
stream.seek(0)
# Write the word cool to the file stream
# Because we moved back to the start, this overwrites 'demo', leaving 'cool!'
stream.write('cool')
#Flush the file stream contents to the file buffer
stream.flush()
# Flush the file stream and close the file
stream.close()

View file

@ -0,0 +1,6 @@
# Open file demo.txt and read the contents
stream = open('./demo.txt', 'rt')
print('\nIs this readable? ' + str(stream.readable()))
print('\nRead one character : ' + stream.read(1))
print('\nRead to end of line : ' + stream.readline())
print('\nRead all lines to end of file :\n' + str(stream.readlines())+ '\n')

View file

@ -0,0 +1,16 @@
# Open output.txt as a text file for writing
stream = open('output.txt', 'wt')
print('\nCan I write to this file? ' + str(stream.writable()) + '\n')
stream.write('H') # Write a single string
stream.writelines(['ello',' ','world']) # Write one or more strings
stream.write('\n') # Write a new line
names = ['Susan','Christopher']
stream.writelines(names)
# Here's a neat trick to insert a new line between items in the list
stream.write('\n') # Write a new line
stream.writelines('\n'.join(names))
stream.close() #Flush stream and close

View file

@ -0,0 +1,7 @@
# with
The [with](https://docs.python.org/3/reference/compound_stmts.html#with) statement allows you to simplify code that would otherwise need [try](https://docs.python.org/3/reference/compound_stmts.html#the-try-statement)/finally statements. It's considered best practice to use `with` for any operation that supports it.
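Here's a minimal sketch, equivalent to the try/finally sample in this folder; the stream is closed automatically when the block exits, even if an error is raised.

``` python
with open('output.txt', 'wt') as stream:
    stream.write('Lorem ipsum dolar')
# stream is already closed here
```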
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner)

View file

@ -0,0 +1,8 @@
try:
stream = open('output.txt', 'wt')
stream.write('Lorem ipsum dolar')
finally:
stream.close() # THIS IS REALLY IMPORTANT!!
# with open('output.txt', 'wt') as stream:
# stream.write('Lorem ipsum dolar')

View file

@ -0,0 +1 @@
Lorem ipsum dolar

View file

@ -0,0 +1,7 @@
# Asynchronous operations
Python offers several options for managing long-running operations asynchronously. [asyncio](https://docs.python.org/3/library/asyncio.html) is the core library for supporting asynchronous operations, including [async](https://docs.python.org/3/reference/compound_stmts.html#async-def)/[await](https://docs.python.org/3/reference/expressions.html#await).
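Here's a minimal sketch using only the standard library; the accompanying samples use aiohttp to make real HTTP calls.

``` python
import asyncio

async def say_after(delay, message):
    # Pause this coroutine without blocking the event loop
    await asyncio.sleep(delay)
    print(message)

async def main():
    # Start both coroutines as tasks, then wait for them to finish
    task_one = asyncio.create_task(say_after(1, 'first'))
    task_two = asyncio.create_task(say_after(2, 'second'))
    await task_one
    await task_two

asyncio.run(main())
```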
## Microsoft Learn Resources
Explore related tutorials on [Microsoft Learn](https://learn.microsoft.com/?WT.mc_id=python-c9-niner).

View file

@ -0,0 +1,34 @@
from timeit import default_timer
import aiohttp
import asyncio
async def load_data(session, delay):
print(f'Starting {delay} second timer')
async with session.get(f'http://httpbin.org/delay/{delay}') as resp:
text = await resp.text()
print(f'Completed {delay} second timer')
return text
async def main():
# Start the timer
start_time = default_timer()
# Creating a single session
async with aiohttp.ClientSession() as session:
# Setup our tasks and get them running
two_task = asyncio.create_task(load_data(session, 2))
three_task = asyncio.create_task(load_data(session, 3))
# Simulate other processing
await asyncio.sleep(1)
print('Doing other work')
# Let's go get our values
two_result = await two_task
three_result = await three_task
# Print our results
elapsed_time = default_timer() - start_time
print(f'The operation took {elapsed_time:.2} seconds')
asyncio.run(main())

View file

@ -0,0 +1,21 @@
from timeit import default_timer
import requests
def load_data(delay):
    print(f'Starting {delay} second timer')
    text = requests.get(f'http://httpbin.org/delay/{delay}').text
    print(f'Completed {delay} second timer')
    return text
def run_demo():
start_time = default_timer()
two_data = load_data(2)
three_data = load_data(3)
elapsed_time = default_timer() - start_time
print(f'The operation took {elapsed_time:.2} seconds')
def main():
run_demo()
main()

View file

@ -0,0 +1,55 @@
# More Python for beginners
## Overview
When we created [Python for beginners](https://aka.ms/pythonbeginnerseries) we knew we wouldn't be able to cover everything in Python. We focused on the features which are core to getting started with the language. But, of course, we left some items off the list. Well, we're back for more! We created another set of videos to highlight more features, including a couple of "cutting edge" items like `async/await`. These skills will allow you to continue to grow as a Python developer.
### What you'll learn
- Creating classes and objects
- Asynchronous development
- Working with the filesystem
### What we don't cover
- Programming concepts like [object-oriented design](https://en.wikipedia.org/wiki/Object-oriented_design)
- Database access
## Prerequisites
- [An understanding of git](https://git-scm.com/book/en/v2)
- [An understanding of Python](https://aka.ms/pythonbeginnerseries)
- [Visual Studio Code](https://code.visualstudio.com?WT.mc_id=python-c9-niner) or another code editor
### Setup steps
- [Create a virtual environment](https://docs.python.org/3/tutorial/venv.html)
``` bash
# Windows
python -m venv venv
.\venv\Scripts\activate
# Linux or macOS
python3 -m venv venv
. ./venv/bin/activate
```
- Install the packages for Async/Await
``` bash
# Windows
pip install -r requirements.txt
# Linux or macOS
pip3 install -r requirements.txt
```
## Next steps
If you're looking to continue building, here are a couple of courses and quickstarts you might find interesting:
- [Object-oriented programming in Python](https://docs.microsoft.com/learn/modules/python-object-oriented-programming?WT.mc_id=python-c9-niner)
- [Build an AI web app using Python and Flask](https://docs.microsoft.com/learn/modules/python-flask-build-ai-web-app?WT.mc_id=python-c9-niner)
- [Build Python Django apps with Microsoft Graph](https://docs.microsoft.com/graph/tutorials/python?WT.mc_id=python-c9-niner)
- [Create a Python app in Azure App Service on Linux](https://docs.microsoft.com/azure/app-service/containers/quickstart-python?WT.mc_id=python-c9-niner)

Binary data
more-python-for-beginners/Slides/01 - Formatting and linting.pptx Normal file

Binary file not shown.

Binary data
more-python-for-beginners/Slides/02 - Lambdas.pptx Normal file

Binary file not shown.

Binary data
more-python-for-beginners/Slides/03 - Classes.pptx Normal file

Binary file not shown.

Binary data
more-python-for-beginners/Slides/04 - Inhheritance.pptx Normal file

Binary file not shown.

Binary file not shown.

Binary file not shown.

Some files were not shown because too many files changed in this diff.