Thilo Will 2016-05-04 16:36:42 +02:00
Parent aa383f486b
Commit fd7bced277
1 changed file: 31 additions and 0 deletions


@ -284,3 +284,34 @@ that the minibatch layout for the labels and the data with dynamic axes is compa
For the full explanation of how ``lstm_layer()`` is defined, please see the full example in the
Examples section.
How to pass Python data as train/test data
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The Python CNTK API lets you pass training and testing data either by specifying external input files or by handing Python data directly to CNTK.
The second alternative, using in-memory Python data, is especially useful if you want to do some quick experimentation with small synthetic data sets.
The following explains the structure in which this data has to be provided.
Let us start with a scenario coming from one of our code examples (`logreg_numpy.py <https://github.com/Microsoft/CNTK/tree/master/contrib/Python/cntk/examples/LogReg/logreg_numpy.py>`_).
In this example we want to classify a 250-dimensional feature vector into one of two classes. In this case we have two *inputs*:
- The feature values for each training item. In the example these are 500 vectors, each of dimension 250.
- The expected class. In this example the class is encoded with a two-dimensional vector where the element for the expected class is set to 1 and the other to 0.
For each of these inputs we have to provide one data structure containing all training instances.
You might notice that this is conceptually different from the case where we provide the data from external files using the CNTKTextReader.
In the input file for CNTKTextReader we provide the data for the different *inputs* of one instance on the same line, so the data from different inputs are much more intertwined.
In Python the feature data are represented by a NumPy array of dimension ``number_of_instances X dimension_of_feature_space``, so in our example it is a NumPy array of dimension ``500 X 250``.
Likewise the expected output is represented by another NumPy array of dimension ``500 X 2``.
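As a minimal sketch of the two arrays described above, the following builds synthetic data with the shapes from the text using NumPy alone. The random features, the labelling rule and the variable names are assumptions made for illustration; they are not taken from ``logreg_numpy.py``, and the step of handing the arrays to CNTK is intentionally left out.

```python
import numpy as np

# Sizes taken from the text: 500 training instances, 250 features, 2 classes.
num_instances, feature_dim, num_classes = 500, 250, 2

# Fixed seed so the synthetic data is reproducible.
rng = np.random.default_rng(0)

# Feature input: one row per training instance, shape (500, 250).
features = rng.standard_normal((num_instances, feature_dim)).astype(np.float32)

# Hypothetical labelling rule, for illustration only:
# class 1 if the sum of the features is positive, class 0 otherwise.
class_ids = (features.sum(axis=1) > 0).astype(int)

# Label input: one-hot rows, shape (500, 2). Row i has a 1 in the
# column of the expected class of instance i and a 0 in the other column.
labels = np.eye(num_classes, dtype=np.float32)[class_ids]

print(features.shape, labels.shape)
```

Note that row ``i`` of ``features`` and row ``i`` of ``labels`` belong to the same training instance, so the two arrays must keep the same order.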
Passing sequence data from Python
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
CNTK can handle sequences with arbitrary maximal length. This feature is also called *dynamic-axis*.
To represent an input with a dynamic-axis in Python you have to provide each sequence as a NumPy array whose first axis has a dimension equal to the sequence length.
The complete dataset is then just a normal one-dimensional NumPy array of these sequences.
Take as an artificial example a sentence classification problem. Each sentence has a different number of words, i.e. it is a *sequence* of words. The individual words might each be represented by some latent vector.
So each sentence is represented by a NumPy array of dimension ``sequence_length X embedding_dimension``. The whole set of instances (sentences) is then represented by putting them into a one-dimensional array with the size equal to the number of instances.
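The layout described above can be sketched with NumPy alone. The sentence lengths, the embedding dimension and the random word vectors below are hypothetical values chosen for illustration, and the sketch only shows how the data is arranged, not how it is passed to CNTK.

```python
import numpy as np

embedding_dim = 4             # hypothetical size of each word vector
sentence_lengths = [3, 5, 2]  # hypothetical number of words per sentence

rng = np.random.default_rng(0)

# One array per sentence, each of shape (sequence_length, embedding_dim).
sequences = [
    rng.standard_normal((length, embedding_dim)).astype(np.float32)
    for length in sentence_lengths
]

# One-dimensional container with one entry per sentence. It is allocated
# with dtype=object and filled element by element so that NumPy does not
# try to stack the differently sized sequences into one rectangular array.
dataset = np.empty(len(sequences), dtype=object)
for i, seq in enumerate(sequences):
    dataset[i] = seq

print(dataset.shape)
print([seq.shape for seq in dataset])
```

Because the sequences have different lengths, only the outer array has a fixed size; each entry keeps its own ``sequence_length X embedding_dimension`` shape.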