Initial commit
Parent: 19ceb583d0
Commit: 6ce0f25eeb
@@ -0,0 +1,21 @@
MIT License

Copyright (c) Microsoft Corporation. All rights reserved.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
@@ -0,0 +1,34 @@
# python-sklearn-regression-cookiecutter

A [Cookiecutter](http://cookiecutter.readthedocs.io/) template for a
[Python](https://www.python.org/) application that demonstrates the use of
[scikit-learn](http://scikit-learn.org/) regression learners.

## Using this template

1. [Install Cookiecutter](http://cookiecutter.readthedocs.io/en/latest/installation.html)
2. `cookiecutter gh:Microsoft/python-sklearn-regression-cookiecutter`
   (or `cookiecutter https://github.com/Microsoft/python-sklearn-regression-cookiecutter.git`
   if you prefer)
3. Fill in the Cookiecutter items (see below for what each item represents)
4. Install the required Python packages as needed (these will vary based on which parts of the code you enable).

### Cookiecutter items

- `app_name`: the name of the folder/project to create
- `create_vs_project`: `y` to create a Visual Studio project file (.pyproj)

## Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a
Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us
the rights to use your contribution. For details, visit https://cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide
a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions
provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the [Microsoft Open Source Code of Conduct](https://opensource.microsoft.com/codeofconduct/).
For more information, see the [Code of Conduct FAQ](https://opensource.microsoft.com/codeofconduct/faq/) or
contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.
@@ -0,0 +1,19 @@
{
  "app_name": "regression",
  "create_vs_project": "y",
  "_visual_studio": {
    "app_name": {
      "value_source": "ProjectName",
      "visible": false
    },
    "create_vs_project": {
      "value_source": "IsNewProject",
      "visible": false
    }
  },
  "_visual_studio_post_cmds": [
    {
      "name": "File.OpenProject",
      "args": "{{cookiecutter._output_folder_path}}\\{{cookiecutter.app_name}}.pyproj"
    }
  ]
}
@@ -0,0 +1,9 @@
#!/usr/bin/env python
import os


def delete_file(filepath):
    os.remove(os.path.join(os.path.realpath(os.path.curdir), filepath))


if __name__ == '__main__':
    if '{{cookiecutter.create_vs_project}}'.lower() != 'y':
        delete_file('{{cookiecutter.app_name}}.pyproj')
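The post-generation hook above deletes the generated .pyproj when a Visual Studio project was not requested; Cookiecutter substitutes the `{{...}}` placeholders before the hook runs. The same logic can be exercised outside Cookiecutter with a small sketch — the directory and answer values here are hypothetical stand-ins for the rendered template:

```python
import os
import tempfile

def delete_file_in(dirpath, filename):
    """Remove `filename` from `dirpath`, mirroring the hook's delete_file()."""
    os.remove(os.path.join(os.path.realpath(dirpath), filename))

# Simulate a rendered project directory containing the generated .pyproj
tmpdir = tempfile.mkdtemp()
pyproj = os.path.join(tmpdir, 'regression.pyproj')
open(pyproj, 'w').close()

# 'n' stands in for the rendered value of cookiecutter.create_vs_project
create_vs_project = 'n'
if create_vs_project.lower() != 'y':
    delete_file_in(tmpdir, 'regression.pyproj')

print(os.path.exists(pyproj))  # → False: the project file was removed
```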
@@ -0,0 +1,303 @@
'''
This script performs the basic process for applying a machine learning
algorithm to a dataset using Python libraries.

The four steps are:
   1. Download a dataset (using pandas)
   2. Process the numeric data (using numpy)
   3. Train and evaluate learners (using scikit-learn)
   4. Plot and compare results (using matplotlib)


The data is downloaded from URL, which is defined below. As is normal
for machine learning problems, the nature of the source data affects
the entire solution. When you change URL to refer to your own data, you
will need to review the data processing steps to ensure they remain
correct.

============
Example Data
============
The example is from http://mldata.org/repository/data/viewslug/stockvalues/
It contains stock prices and the values of three indices for each day
over a five year period. See the linked page for more details about
this data set.

This script uses regression learners to predict the stock price for
the second half of this period based on the values of the indices. This
is a naive approach, and a more robust method would use each prediction
as an input for the next, and would predict relative rather than
absolute values.
'''

# Remember to update the script for the new data when you change this URL
URL = "http://mldata.org/repository/data/download/csv/stockvalues/"

# This is the column of the sample data to predict.
# Try changing it to other integers between 1 and 155.
TARGET_COLUMN = 32

# Uncomment this call when using matplotlib to generate images
# rather than displaying interactive UI.
#import matplotlib
#matplotlib.use('Agg')

from pandas import read_table
import numpy as np
import matplotlib.pyplot as plt

try:
    # [OPTIONAL] Seaborn makes plots nicer
    import seaborn
except ImportError:
    pass


# =====================================================================

def download_data():
    '''
    Downloads the data for this script into a pandas DataFrame.
    '''

    # If your data is in an Excel file, install 'xlrd' and use
    # pandas.read_excel instead of read_table
    #from pandas import read_excel
    #frame = read_excel(URL)

    # If your data is in a private Azure blob, install 'azure-storage' and use
    # BlockBlobService.get_blob_to_path() with read_table() or read_excel()
    #from azure.storage.blob import BlockBlobService
    #service = BlockBlobService(ACCOUNT_NAME, ACCOUNT_KEY)
    #service.get_blob_to_path(container_name, blob_name, 'my_data.csv')
    #frame = read_table('my_data.csv', ...

    frame = read_table(
        URL,

        # Uncomment if the file needs to be decompressed
        #compression='gzip',
        #compression='bz2',

        # Specify the file encoding
        # Latin-1 is common for data from US sources
        encoding='latin-1',
        #encoding='utf-8',  # UTF-8 is also common

        # Specify the separator in the data
        sep=',',            # comma separated values
        #sep='\t',          # tab separated values
        #sep=' ',           # space separated values

        # Ignore spaces after the separator
        skipinitialspace=True,

        # Generate row labels from each row number
        index_col=None,
        #index_col=0,       # use the first column as row labels
        #index_col=-1,      # use the last column as row labels

        # Generate column headers row from each column number
        header=None,
        #header=0,          # use the first line as headers

        # Use manual headers and skip the first row in the file
        #header=0,
        #names=['col1', 'col2', ...],
    )

    # Return the entire frame
    #return frame

    # Return a subset of the columns
    return frame[[156, 157, 158, TARGET_COLUMN]]


# =====================================================================


def get_features_and_labels(frame):
    '''
    Transforms and scales the input data and returns numpy arrays for
    training and testing inputs and targets.
    '''

    # Replace missing values with 0.0
    # or we can use scikit-learn to calculate missing values below
    #frame[frame.isnull()] = 0.0

    # Convert values to floats
    arr = np.array(frame, dtype=float)

    # Normalize the entire data set
    from sklearn.preprocessing import StandardScaler, MinMaxScaler
    arr = MinMaxScaler().fit_transform(arr)

    # Use the last column as the target value
    X, y = arr[:, :-1], arr[:, -1]
    # To use the first column instead, change the index value
    #X, y = arr[:, 1:], arr[:, 0]

    # Use 50% of the data for training, but we will test against the
    # entire set
    from sklearn.model_selection import train_test_split
    X_train, _, y_train, _ = train_test_split(X, y, test_size=0.5)
    X_test, y_test = X, y

    # If values are missing we could impute them from the training data
    #from sklearn.impute import SimpleImputer
    #imputer = SimpleImputer(strategy='mean')
    #imputer.fit(X_train)
    #X_train = imputer.transform(X_train)
    #X_test = imputer.transform(X_test)

    # Normalize the attribute values to mean=0 and variance=1
    from sklearn.preprocessing import StandardScaler
    scaler = StandardScaler()
    # To scale to a specified range, use MinMaxScaler
    #from sklearn.preprocessing import MinMaxScaler
    #scaler = MinMaxScaler(feature_range=(0, 1))

    # Fit the scaler based on the training data, then apply the same
    # scaling to both training and test sets.
    scaler.fit(X_train)
    X_train = scaler.transform(X_train)
    X_test = scaler.transform(X_test)

    # Return the training and test sets
    return X_train, X_test, y_train, y_test

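The `MinMaxScaler` used above rescales each column to the [0, 1] range via x' = (x - min) / (max - min). A dependency-free sketch of that transform for a single column, as a minimal illustration rather than a substitute for scikit-learn:

```python
def min_max_scale(values):
    """Rescale a list of numbers to [0, 1], as min-max scaling does per column."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant column carries no information; map it all to 0.0
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]

print(min_max_scale([10.0, 15.0, 20.0]))  # → [0.0, 0.5, 1.0]
```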

# =====================================================================


def evaluate_learner(X_train, X_test, y_train, y_test):
    '''
    Run multiple times with different algorithms to get an idea of the
    relative performance of each configuration.

    Returns a sequence of tuples containing:
        (title, expected values, actual values)
    for each learner.
    '''

    # Use a support vector machine for regression
    from sklearn.svm import SVR

    # Train using a radial basis function
    svr = SVR(kernel='rbf', gamma=0.1)
    svr.fit(X_train, y_train)
    y_pred = svr.predict(X_test)
    r_2 = svr.score(X_test, y_test)
    yield 'RBF Model ($R^2={:.3f}$)'.format(r_2), y_test, y_pred

    # Train using a linear kernel
    svr = SVR(kernel='linear')
    svr.fit(X_train, y_train)
    y_pred = svr.predict(X_test)
    r_2 = svr.score(X_test, y_test)
    yield 'Linear Model ($R^2={:.3f}$)'.format(r_2), y_test, y_pred

    # Train using a polynomial kernel
    svr = SVR(kernel='poly', degree=2)
    svr.fit(X_train, y_train)
    y_pred = svr.predict(X_test)
    r_2 = svr.score(X_test, y_test)
    yield 'Polynomial Model ($R^2={:.3f}$)'.format(r_2), y_test, y_pred

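Each `score` call above reports the coefficient of determination, R² = 1 - SS_res / SS_tot. A dependency-free sketch of that computation (the sample numbers are made up for illustration):

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    # Total sum of squares: variability of the targets around their mean
    ss_tot = sum((y - mean) ** 2 for y in y_true)
    # Residual sum of squares: variability left unexplained by the model
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot

print(round(r2_score([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]), 3))  # → 0.5
```

A perfect fit gives R² = 1.0; a model no better than predicting the mean gives 0.0, and worse models go negative.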

# =====================================================================


def plot(results):
    '''
    Create a plot comparing multiple learners.

    `results` is a list of tuples containing:
        (title, expected values, actual values)

    All the elements in results will be plotted.
    '''

    # Using subplots to display the results on the same X axis
    fig, plts = plt.subplots(nrows=len(results), figsize=(8, 8))
    fig.canvas.manager.set_window_title('Predicting data from ' + URL)

    # Show each element in the plots returned from plt.subplots()
    for subplot, (title, y, y_pred) in zip(plts, results):
        # Configure each subplot to have no tick marks
        # (these are meaningless for the sample dataset)
        subplot.set_xticklabels(())
        subplot.set_yticklabels(())

        # Label the vertical axis
        subplot.set_ylabel('stock price')

        # Set the title for the subplot
        subplot.set_title(title)

        # Plot the actual data and the prediction
        subplot.plot(y, 'b', label='actual')
        subplot.plot(y_pred, 'r', label='predicted')

        # Shade the area between the predicted and the actual values
        subplot.fill_between(
            # Generate X values [0, 1, 2, ..., len(y)-2, len(y)-1]
            np.arange(0, len(y), 1),
            y,
            y_pred,
            color='r',
            alpha=0.2
        )

        # Mark the extent of the training data
        subplot.axvline(len(y) // 2, linestyle='--', color='0', alpha=0.2)

        # Include a legend in each subplot
        subplot.legend()

    # Let matplotlib handle the subplot layout
    fig.tight_layout()

    # ==================================
    # Display the plot in interactive UI
    plt.show()

    # To save the plot to an image file, use savefig()
    #plt.savefig('plot.png')

    # Open the image file with the default image viewer
    #import subprocess
    #subprocess.Popen('plot.png', shell=True)

    # To save the plot to an image in memory, use BytesIO and savefig()
    # This can then be written to any stream-like object, such as a
    # file or HTTP response.
    #from io import BytesIO
    #img_stream = BytesIO()
    #plt.savefig(img_stream, format='png')
    #img_bytes = img_stream.getvalue()
    #print('Image is {} bytes - {!r}'.format(len(img_bytes), img_bytes[:8] + b'...'))

    # Closing the figure allows matplotlib to release the memory used.
    plt.close()


# =====================================================================


if __name__ == '__main__':
    # Download the data set from URL
    print("Downloading data from {}".format(URL))
    frame = download_data()

    # Process data into feature and label arrays
    print("Processing {} samples with {} attributes".format(len(frame.index), len(frame.columns)))
    X_train, X_test, y_train, y_test = get_features_and_labels(frame)

    # Evaluate multiple regression learners on the data
    print("Evaluating regression learners")
    results = list(evaluate_learner(X_train, X_test, y_train, y_test))

    # Display the results
    print("Plotting the results")
    plot(results)
@@ -0,0 +1,34 @@
<?xml version="1.0" encoding="utf-8"?>
<Project DefaultTargets="Build" xmlns="http://schemas.microsoft.com/developer/msbuild/2003" ToolsVersion="4.0">
  <PropertyGroup>
    <Configuration Condition=" '$(Configuration)' == '' ">Debug</Configuration>
    <SchemaVersion>2.0</SchemaVersion>
    <ProjectTypeGuids>{6c0efafa-1a04-41b6-a6d7-511b90951b5b};{888888a0-9f3d-457c-b088-3a5042f75d52}</ProjectTypeGuids>
    <ProjectHome>.</ProjectHome>
    <StartupFile>regression.py</StartupFile>
    <SearchPath>
    </SearchPath>
    <WorkingDirectory>.</WorkingDirectory>
    <OutputPath>.</OutputPath>
  </PropertyGroup>
  <PropertyGroup Condition=" '$(Configuration)' == 'Debug' ">
    <DebugSymbols>true</DebugSymbols>
    <EnableUnmanagedDebugging>false</EnableUnmanagedDebugging>
  </PropertyGroup>
  <PropertyGroup Condition=" '$(Configuration)' == 'Release' ">
    <DebugSymbols>true</DebugSymbols>
    <EnableUnmanagedDebugging>false</EnableUnmanagedDebugging>
  </PropertyGroup>
  <ItemGroup>
    <Compile Include="regression.py" />
  </ItemGroup>
  <Import Project="$(MSBuildExtensionsPath32)\Microsoft\VisualStudio\v$(VisualStudioVersion)\Python Tools\Microsoft.PythonTools.targets" />
  <!-- Uncomment the CoreCompile target to enable the Build command in
       Visual Studio and specify your pre- and post-build commands in
       the BeforeBuild and AfterBuild targets below. -->
  <!--<Target Name="CoreCompile" />-->
  <Target Name="BeforeBuild">
  </Target>
  <Target Name="AfterBuild">
  </Target>
</Project>