This commit is contained in:
d1ana 2012-03-29 13:03:33 -04:00
Parent 6707f2c745
Commit d935982359
2 changed files: 35 additions and 11 deletions

View file

@@ -1,7 +1,7 @@
 Example 1 - Count Last Names
 ============================
 
-The canonical map/reduce example: count the occurrences of words in a
+The canonical map/reduce example: **count** the occurrences of words in a
 document. In this case, we'll count the occurrences of last names in a data
 file containing lines of json.
@@ -41,7 +41,7 @@ In this case, we'll be tagging our data file as **example:chunk:users**.
    :scale: 75 %
    :alt: tag_name -> [blob1, blob2, blob3]
 
-Make sure `disco <http://discoproject.org/>`_ is running::
+Make sure `Disco <http://discoproject.org/>`_ is running::
 
    diana@ubuntu:~$ disco start
    Master ubuntu:8989 started
@@ -77,9 +77,9 @@ the next.
 The input step of an Inferno map/reduce job is responsible for parsing and
 readying the input data for the map step.
 
-If you're using Inferno's built in keyset map/reduce functionality, this
-step mostly amounts to transforming your CSV or JSON input into python
-dictionaries.
+If you're using Inferno's built in **keyset** map/reduce functionality,
+this step mostly amounts to transforming your CSV or JSON input into
+python dictionaries.
 
 The default Inferno input reader is **chunk_csv_keyset_stream**, which is
 intended for CSV data that was placed in DDFS using the ``ddfs chunk``
@@ -89,6 +89,18 @@ the next.
 **map_input_stream** to use the **chunk_json_keyset_stream** reader in
 your Inferno rule instead.
 
+.. code-block:: python
+    :emphasize-lines: 3,4
+
+    InfernoRule(
+        name='last_names_json',
+        source_tags=['example:chunk:users'],
+        map_input_stream=chunk_json_keyset_stream,
+        parts_preprocess=[count],
+        key_parts=['last'],
+        value_parts=['count'],
+    )
+
 Example data transition during the **input** step:
 
 .. image:: input.png
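For intuition, the per-line work of a JSON keyset input reader amounts to roughly the following; this is a minimal sketch, not Inferno's actual reader, and the ``first``/``last`` fields are only illustrative:

.. code-block:: python

    import json

    # One line of the input file: a single JSON object.
    line = '{"first": "Willow", "last": "Harvey"}'

    # The input step turns it into a python dictionary that the
    # map step can pull key and value parts from.
    record = json.loads(line)
    # record == {'first': 'Willow', 'last': 'Harvey'}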
@@ -109,6 +121,18 @@ the next.
 relevant key and value parts by declaring **key_parts** and **value_parts**
 in your Inferno rule.
 
+.. code-block:: python
+    :emphasize-lines: 6,7
+
+    InfernoRule(
+        name='last_names_json',
+        source_tags=['example:chunk:users'],
+        map_input_stream=chunk_json_keyset_stream,
+        parts_preprocess=[count],
+        key_parts=['last'],
+        value_parts=['count'],
+    )
+
 Example data transition during the **map** step:
 
 .. image:: map.png
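The map step's use of those declarations can be modeled as pulling the named parts out of each dictionary and yielding them as a key/value pair. A rough sketch of the idea (``keyset_map_sketch`` is a hypothetical name, not Inferno's implementation), assuming a **parts_preprocess** step has already attached a ``count`` of 1 to each record:

.. code-block:: python

    def keyset_map_sketch(record, key_parts, value_parts):
        # Yield the declared key parts and value parts as a pair.
        key = tuple(record[part] for part in key_parts)
        value = tuple(record[part] for part in value_parts)
        yield key, value

    record = {'last': 'Harvey', 'count': 1}
    print(list(keyset_map_sketch(record, ['last'], ['count'])))
    # [(('Harvey',), (1,))]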
@@ -126,7 +150,7 @@ the next.
 Inferno's default **reduce_function** is the **keyset_reduce**. It will sum
 the value parts yielded by the map step, grouped by the key parts.
 
-In this example, we're only summing one value: the ``count``. You can
+In this example, we're only summing one value (the ``count``). You can
 define and sum many value parts, as you'll see :doc:`here </election>` in
 the next example.
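The summing itself is easy to model: group the mapped pairs by key and add up the value parts. A toy version of that behavior (not the real **keyset_reduce** code, and the names are illustrative):

.. code-block:: python

    from collections import defaultdict

    # Pairs as the map step might yield them: last name -> a count of 1.
    mapped = [('Harvey', 1), ('Lee', 1), ('Harvey', 1)]

    totals = defaultdict(int)
    for last_name, count in mapped:
        totals[last_name] += count

    print(dict(totals))  # {'Harvey': 2, 'Lee': 1}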
@@ -159,10 +183,10 @@ the next.
    :scale: 60 %
    :alt: reduce -> output
 
-Example Rule
+Inferno Rule
 ------------
 
-The inferno map/reduce rule (inferno/example_rules/names.py)::
+The Inferno map/reduce rule (``inferno/example_rules/names.py``)::
 
    from inferno.lib.rule import chunk_json_keyset_stream
    from inferno.lib.rule import InfernoRule

View file

@@ -4,14 +4,14 @@ Example 2 - Campaign Finance
 Rule
 ----
 
-The inferno map/reduce rule (inferno/example_rules/election.py):
+The Inferno map/reduce rule (``inferno/example_rules/election.py``):
 
 .. literalinclude:: ../inferno/example_rules/election.py
 
 Input
 -----
 
-Make sure `disco <http://discoproject.org/>`_ is running::
+Make sure `Disco <http://discoproject.org/>`_ is running::
 
    diana@ubuntu:~$ disco start
    Master ubuntu:8989 started
@@ -29,7 +29,7 @@ Place the input data in `disco's distributed filesystem <http://discoproject.org
 
    diana@ubuntu:~$ ddfs chunk gov:chunk:presidential_campaign_finance:2012-03-19 ./P00000001-ALL.txt
    created: disco://localhost/ddfs/vol0/blob/1c/P00000001-ALL_txt-0$533-86a6d-ec842
 
 Verify that the data is in DDFS::
 
    diana@ubuntu:~$ ddfs xcat gov:chunk:presidential_campaign_finance:2012-03-19 | head -3
    C00410118,"P20002978","Bachmann, Michelle","HARVEY, WILLIAM","MOBILE","AL","366010290","RETIRED","RETIRED",250...
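Note that fields such as ``"Bachmann, Michelle"`` contain embedded commas, which is why this data needs a real CSV parser rather than a naive ``split(',')``. A quick sanity check in python (only the first few fields of the record are shown):

.. code-block:: python

    import csv

    line = 'C00410118,"P20002978","Bachmann, Michelle","HARVEY, WILLIAM","MOBILE","AL"'
    row = next(csv.reader([line]))
    print(row)
    # ['C00410118', 'P20002978', 'Bachmann, Michelle', 'HARVEY, WILLIAM',
    #  'MOBILE', 'AL']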