===================================================
Mozilla InvestiGator Concepts & Internal Components
===================================================
:Author: Julien Vehent <jvehent@mozilla.com>
.. sectnum::
.. contents:: Table of Contents
MIG is a platform to perform investigative surgery on remote endpoints.
It enables investigators to obtain information from large numbers of systems
in parallel, thus accelerating investigation of incidents.
Besides scalability, MIG is designed to provide strong security primitives:
* **Access control** is ensured by requiring GPG signatures on all actions. Sensitive
actions can also request signatures from multiple investigators. An attacker
who takes over the central server will be able to read non-sensitive data,
but will not be able to send actions to agents. The GPG keys are securely
kept by their investigators.
* **Privacy** is respected by never retrieving raw data from endpoints. When MIG is
run on laptops or phones, end-users can request reports on the operations
performed on their devices. The 2-man rule for sensitive actions also protects
against rogue investigators invading privacy.
* **Reliability** is built in. No component is critical. If an agent crashes, it
will attempt to recover and reconnect to the platform indefinitely. If the
platform crashes, a new platform can be rebuilt rapidly without backups.
MIG favors a model where requesting information from endpoints is fast and
simple. It does not attempt to record everything all the time. Instead, it
assumes that when a piece of information is needed, it will be easy to retrieve it.
It's an army of Sherlock Holmes, ready to interrogate your network within
milliseconds.
Terminology:
* **Investigators**: humans who use clients to investigate things on agents
* **Agent**: a small program that runs on a remote endpoint. It receives commands
from the scheduler through the relays, executes those commands using modules,
and sends the results back to the relays.
* **Module**: single feature Go program that does stuff, like inspecting a file
system, listing connected IP addresses, creating user accounts or adding
firewall rules
* **Scheduler**: a messaging daemon that routes actions and commands to and from
agents.
* **Relay**: a RabbitMQ server that queues messages between schedulers and agents.
* **Database**: a storage backend used by the scheduler and the API
* **API**: a REST API that exposes the MIG platform to clients
* **Client**: a program used by an investigator to interface with MIG (like the
MIG Console, or the action generator)
An investigator uses a client (such as the MIG Console) to communicate with
the API. The API interfaces with the Database and the Scheduler.
When an action is created by an investigator, the API receives it and writes
it into the spool of the scheduler (they share it via NFS). The scheduler picks
it up, creates one command per target agent, and sends those commands to the
relays (running RabbitMQ). Each agent is listening on its own queue on the relay.
The agents execute their commands, and return the results through the same
relays (same exchange, different queues). The scheduler writes the results into
the database, where the investigator can access them through the API.
The agents also use the relays to send heartbeats at regular intervals, such that
the scheduler always knows how many agents are alive at a given time.
The end-to-end workflow is:
::
    {investigator} -https-> {API} -nfs-> {Scheduler} -amqps-> {Relays} -amqps-> {Agents}
                               \            /
                             sql\          /sql
                                 {DATABASE}
Below is a high-level view of the different components:
::
     ( )                signed actions
     \|/  +------+ ----------------------->  +-------+
      |   |client|        responses          | A P I |
     / \  +------+ <-----------------------  +----+--+     +--------+
 investigator                                     +------->|  data  |
                                                  |        |        |
                                   action/command |--------|        |
                                                  |        |        |
                                                  +------->|  base  |
                                                  |        |        |
                             signed commands +----+------+ +--------+
                                             |           |
                    +++++--------------------| SCHEDULER |
                    |||||                    |           |
                    vvvvv                    +-----------+
                  +-------+                     ^^^^^
                  |       |                     |||||
                  |message|+--------------------+++++
                  |-------|   command responses
                  |broker |
                  |       |
                  +-------+
                    ^^ ^ ^
                    || | |
       +------------+| | +--------------------+
       |          +--+ +-----+                |
       |          |          |                |
    +--+--+    +--+--+    +--+--+          +--+--+
    |agent|    |agent|    |agent|  .....   |agent|
    +-----+    +-----+    +-----+          +-----+
Actions and Commands
--------------------
Actions
~~~~~~~
Actions are JSON files created by investigators to perform tasks on agents.
For example, an investigator who wants to verify that root passwords are hashed
and salted on Linux systems would use the following action:
.. code:: json
{
"name": "Compliance check for Auditd",
"description": {
"author": "Julien Vehent",
"email": "ulfr@mozilla.com",
"url": "https://some_example_url/with_details",
"revision": 201402071200
},
"target": "agents.environment->>'ident' ILIKE '%ubuntu%' AND agents.name LIKE '%dc1.example.net'",
"threat": {
"level": "info",
"family": "compliance",
"ref": "syslowaudit1"
},
"operations": [
{
"module": "filechecker",
"parameters": {
"/etc/shadow": {
"regex": {
"root password strongly hashed and salted": [
"root:\\$(2a|5|6)\\$"
]
}
}
}
}
],
"syntaxversion": 2
}
The parameters are:
* **name**: a string that represents the action.
* **target**: a search string used by the scheduler to find agents to run the
action on. The target format uses Postgresql's WHERE condition format against
the `agents`_ table of the database. This method allows for complex target
queries, like running an action against a specific operating system, or
against an endpoint that has a given public IP, etc...
The simplest query that targets all agents is `name like '%'` (the `%`
character is a wildcard in SQL pattern matching). Targeting by OS family can
be done on the `os` parameter, such as `os='linux'` or `os='darwin'`.
Combining conditions is also trivial: `version='201409171023+c4d6f50.prod'
and heartbeattime > NOW() - interval '1 minute'` will only target agents that
run a specific version and have sent a heartbeat during the last minute.
Complex queries are also possible.
For example: imagine an action with ID 1 launched against 10,000 endpoints,
which returned 300 endpoints with positive results. We want to launch action
2 on those 300 endpoints only. It can be accomplished with the following
`target` condition. (note: you can reuse this condition by simply changing
the value of `actionid`)
.. code:: sql
id IN (select agentid from commands, json_array_elements(commands.results) as r where actionid=1 and r#>>'{foundanything}' = 'true')
.. _`agents`: data.rst.html#entity-relationship-diagram
* **description** and **threat**: additional fields to describe the action
* **operations**: an array of operations, each operation calls a module with a set
of parameters. The parameter syntax is specific to each module.
* **syntaxversion**: indicator of the action format used. Should be set to 2
Upon generation, additional fields are appended to the action:
* **pgpsignatures**: all of the parameters above are concatenated into a string and
signed with the investigator's private GPG key. The signature is part of the
action, and used by agents to verify that an action comes from a trusted
investigator. `PGPSignatures` is an array that contains one or more signature
from authorized investigators.
* **validfrom** and **expireafter**: two dates that constrain the validity of the
action to a UTC time window.
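As a rough illustration of the verification step, the sketch below checks one
armored, detached signature against a keyring of trusted investigator public
keys using Go's openpgp package. The function name and the exact composition of
the signed string are assumptions for the example; this is not MIG's actual
implementation.

.. code:: go

    package agent

    import (
        "fmt"
        "os"
        "strings"

        "golang.org/x/crypto/openpgp"
    )

    // verifyActionSignature checks one armored, detached PGP signature against
    // a keyring of trusted investigator public keys. The "signed" string is the
    // concatenation of action fields described above.
    // (Illustrative sketch only, not MIG's actual code.)
    func verifyActionSignature(keyringPath, signed, armoredSig string) (string, error) {
        f, err := os.Open(keyringPath)
        if err != nil {
            return "", err
        }
        defer f.Close()

        keyring, err := openpgp.ReadArmoredKeyRing(f)
        if err != nil {
            return "", err
        }

        signer, err := openpgp.CheckArmoredDetachedSignature(
            keyring,
            strings.NewReader(signed),
            strings.NewReader(armoredSig),
        )
        if err != nil {
            return "", fmt.Errorf("signature verification failed: %v", err)
        }
        // Return the signer's fingerprint so it can later be matched against ACLs.
        return fmt.Sprintf("%X", signer.PrimaryKey.Fingerprint), nil
    }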
Action files are submitted to the API or the Scheduler directly. The PGP
Signatures are always verified by the agents, and can optionally be verified by
other components along the way.
Additional attributes are added to the action by the scheduler. Those are
defined in the database schema and are used to track the action status.
Commands
~~~~~~~~
Upon processing of an Action, the scheduler will retrieve a list of agents to
send the action to. One action is then derived into Commands. A command contains an
action plus additional parameters that are specific to the target agent, such as
command processing timestamps, name of the agent queue on the message broker,
Action and Command unique IDs, status and results of the command. Below is an
example of the previous action ran against the agent named
'myserver1234.test.example.net'.
.. code:: json
{
"action": { ... signed copy of action ... }
"agentname": "myserver1234.test.example.net",
"agentqueueloc": "linux.myserver1234.test.example.net.55tjippis7s4t",
"finishtime": "2014-02-10T15:28:34.687949847Z",
"id": 5978792535962156489,
"results": [
{
"elements": {
"/etc/shadow": {
"regex": {
"root password strongly hashed and salted": {
"root:\\$(2a|5|6)\\$": {
"Filecount": 1,
"Files": {},
"Matchcount": 0
}
}
}
}
},
"extra": {
"statistics": {
"checkcount": 1,
"checksmatch": 0,
"exectime": "183.237us",
"filescount": 1,
"openfailed": 0,
"totalhits": 0,
"uniquefiles": 0
}
},
"foundanything": false
}
],
"starttime": "2014-02-10T15:28:34.118926659Z",
"status": "succeeded"
}
The results of the command show that the file '/etc/shadow' has not matched,
and thus "FoundAnything" returned "false.
While the result is negative, the command itself has succeeded. Had a failure
happened on the agent, the scheduler would have been notified and the status
would be one of "failed", "timeout" or "cancelled".
Action/Commands workflow
~~~~~~~~~~~~~~~~~~~~~~~~
The diagram below represents the full workflow from the launch of an action by
an investigation, to the retrieval of results from the database. The steps are
explained in the legend of the diagram, and map to various components of MIG.
View `full size diagram`_.
.. _`full size diagram`: .files/action_command_flow.svg
.. image:: .files/action_command_flow.svg
Access Control Lists
--------------------
Not all keys can perform all actions. The scheduler, for example, sometimes needs
to issue specific actions to agents (such as during the upgrade protocol) but
shouldn't be able to perform more dangerous actions. This is enforced by
an Access Control List, or ACL, stored on the agents. An ACL describes who can
access what function of which module. It can be used to require multiple
signatures on specific actions, and limit the list of investigators allowed to
perform an action.
An ACL is composed of permissions, which are JSON documents hardwired into
the agent configuration. In the future, MIG will dynamically ship permissions
to agents.
Below is an example of a permission for the `filechecker` module:
.. code:: json
{
"filechecker": {
"minimumweight": 2,
"investigators": {
"Bob Kelso": {
"fingerprint": "E60892BB9BD...",
"weight": 2
},
"John Smith": {
"fingerprint": "9F759A1A0A3...",
"weight": 1
}
}
}
}
`investigators` contains a list of users with their PGP fingerprints, and their
weight, an integer that represents their access level.
When an agent receives an action that calls the filechecker module, it will
first verify the signatures of the action, and then validate that the signers
are authorized to perform the action. This is done by summing up the weights of
the signatures, and verifying that they equal or exceed the minimum required
weight.
Thus, in the example above, investigator John Smith cannot issue a filechecker
action alone. His weight of 1 doesn't satisfy the minimum weight of 2 required
by the filechecker permission. Therefore, John will need to ask investigator Bob
Kelso to sign his action as well. The weights of both investigators are then
added, giving a total of 3, which satisfies the minimum weight of 2.
This method gives ample flexibility to require multiple signatures on modules,
and ensure that one investigator cannot perform sensitive actions on remote
endpoints without the approval of other investigators.
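The weight check itself is straightforward. The sketch below, using hypothetical
type and field names modeled on the permission shown above, sums the weights of
the signers and compares the total against the module's minimum.

.. code:: go

    package acl

    import "fmt"

    // Investigator and Permission mirror the JSON permission above: a minimum
    // weight and a map of authorized investigators keyed by name.
    // (Field names are illustrative, not MIG's exact types.)
    type Investigator struct {
        Fingerprint string
        Weight      int
    }

    type Permission struct {
        MinimumWeight int
        Investigators map[string]Investigator
    }

    // verifyWeights sums the weights of the signers (identified by fingerprint)
    // and returns an error if the total does not reach the required minimum.
    func verifyWeights(perm Permission, signerFingerprints []string) error {
        total := 0
        for _, fp := range signerFingerprints {
            for _, inv := range perm.Investigators {
                if inv.Fingerprint == fp {
                    total += inv.Weight
                    break
                }
            }
        }
        if total < perm.MinimumWeight {
            return fmt.Errorf("ACL not satisfied: weight %d below minimum %d",
                total, perm.MinimumWeight)
        }
        return nil
    }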
The default permission `default` can be used as a default for all modules. It
has the following syntax:
.. code:: json
{
    "default": {
        "minimumweight": 2,
        "investigators": { ... }
    }
}
The `default` permission is overridden by module specific permissions.
The ACL is currently applied to modules. In the future, ACL will have finer
control to authorize access to specific functions of modules. For example, an
investigator could be authorized to call the `regex` function of filechecker
module, but only in `/etc`. This functionality is not implemented yet.
Extracting PGP fingerprints from public keys
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
On Linux, the `gpg` command can easily display the fingerprint of a key using
`gpg --fingerprint <key id>`. For example:
.. code:: bash
$ gpg --fingerprint jvehent@mozilla.com
pub 2048R/3B763E8F 2013-04-30
Key fingerprint = E608 92BB 9BD8 9A69 F759 A1A0 A3D6 5217 3B76 3E8F
uid Julien Vehent (personal) <julien@linuxwall.info>
uid Julien Vehent (ulfr) <jvehent@mozilla.com>
sub 2048R/8026F39F 2013-04-30
You should always verify the trustworthiness of a key before using it:
.. code:: bash
$ gpg --list-sigs jvehent@mozilla.com
pub 2048R/3B763E8F 2013-04-30
uid Julien Vehent (personal) <julien@linuxwall.info>
sig 3 3B763E8F 2013-06-23 Julien Vehent (personal) <julien@linuxwall.info>
sig 3 28A860CE 2013-10-04 Curtis Koenig <ckoenig@mozilla.com>
.....
We want to extract the fingerprint, and obtain a 40-character hexadecimal
string that can be used in permissions.
.. code:: bash
$ gpg --fingerprint --with-colons jvehent@mozilla.com | grep '^fpr' | cut -f 10 -d ':'
E60892BB9BD89A69F759A1A0A3D652173B763E8F
Agent initialization process
----------------------------
The agent tries to be as autonomous as possible. One of the goals is to ship
agents without requiring external provisioning tools, such as Chef or Puppet.
Therefore, the agent attempts to install itself as a service, and also supports
a builtin upgrade protocol (described in the next section).
As a portable binary, the agent needs to detect the type of operating system
and init method that is used by an endpoint. Depending on the endpoint,
different initialization methods are used. The diagram below explains the
decision process followed by the agent.
.. image:: .files/mig-agent-initialization-process.png
Go does not provide support for running programs in the background. On endpoints
that run upstart, systemd (linux) or launchd (darwin), this is not an issue
because the init daemon takes care of running the agent in the background,
rerouting its file descriptors and restarting on crash. On Windows and System-V,
however, the agent daemonizes by forking itself into `foreground` mode, and
re-forking itself on error (such as loss of connectivity to the relay).
On Windows and System-V, if the agent is killed, it will not be restarted
automatically.
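For illustration, the re-forking behavior can be approximated by re-executing
the agent's own binary and detaching from the child, as sketched below. The
`-f` foreground flag is an assumption for the example, not necessarily the
agent's real command line.

.. code:: go

    package agent

    import (
        "os"
        "os/exec"
    )

    // respawn launches a copy of the current binary in foreground mode and
    // detaches from it, approximating a daemonization step on platforms that
    // lack an init daemon to do it. (Illustrative sketch only.)
    func respawn() error {
        cmd := exec.Command(os.Args[0], "-f")
        // Leave stdin/stdout/stderr nil so the child is connected to the null
        // device and does not hold the parent's terminal.
        if err := cmd.Start(); err != nil {
            return err
        }
        // Release the child so the parent can exit without waiting for it.
        return cmd.Process.Release()
    }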
Registration process
~~~~~~~~~~~~~~~~~~~~
The initialization process goes through several environment detection steps
which are used to select the proper init method. Once started, the agent will
send a heartbeat to the public relay, and also store that heartbeat in its
`run` directory. The location of the `run` directory is platform specific.
* windows: C:\Windows\
* darwin: /Library/Preferences/mig/
* linux: /var/run/mig/
Below is a sample heartbeat message from a linux agent stored in
`/var/run/mig/mig-agent.ok`.
.. code:: json
{
"destructiontime": "0001-01-01T00:00:00Z",
"environment": {
"arch": "amd64",
"ident": "Red Hat Enterprise Linux Server release 6.5 (Santiago)",
"init": "upstart"
},
"heartbeatts": "2014-07-31T14:00:20.00442837-07:00",
"name": "someserver.example.net",
"os": "linux",
"pid": 26256,
"queueloc": "linux.someserver.example.net.5hsa811oda",
"starttime": "2014-07-30T21:34:48.525449401-07:00",
"version": "201407310027+bcbdd94.prod"
}
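For reference, the fields of this message map directly onto a Go structure. The
sketch below shows one way the heartbeat could be decoded; the type is derived
from the JSON above and is not necessarily MIG's exact definition.

.. code:: go

    package agent

    import (
        "encoding/json"
        "time"
    )

    // Heartbeat mirrors the fields of the heartbeat message shown above.
    type Heartbeat struct {
        Name            string    `json:"name"`
        QueueLoc        string    `json:"queueloc"`
        OS              string    `json:"os"`
        PID             int       `json:"pid"`
        Version         string    `json:"version"`
        StartTime       time.Time `json:"starttime"`
        HeartbeatTS     time.Time `json:"heartbeatts"`
        DestructionTime time.Time `json:"destructiontime"`
        Environment     struct {
            Arch  string `json:"arch"`
            Ident string `json:"ident"`
            Init  string `json:"init"`
        } `json:"environment"`
    }

    // parseHeartbeat decodes a heartbeat message, for example the contents of
    // /var/run/mig/mig-agent.ok on a linux agent.
    func parseHeartbeat(data []byte) (Heartbeat, error) {
        var hb Heartbeat
        err := json.Unmarshal(data, &hb)
        return hb, err
    }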
Check-In mode
~~~~~~~~~~~~~
In infrastructure where running the agent as a permanent process is not
acceptable, it is possible to run the agent as a cron job. By starting the
agent with the flag **-m agent-checkin**, the agent will connect to the
configured relay, retrieve and run outstanding commands, and exit after 10
seconds of inactivity.
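A rough sketch of that idle-exit behavior is shown below: the loop processes
queued commands and terminates once nothing has arrived for 10 seconds. The
channel and handler types are hypothetical, not the agent's real interfaces.

.. code:: go

    package agent

    import "time"

    // runCheckIn processes queued commands and returns after 10 seconds
    // without receiving a new one. (Illustrative sketch of check-in mode.)
    func runCheckIn(commands <-chan []byte, handle func([]byte)) {
        idle := time.NewTimer(10 * time.Second)
        defer idle.Stop()
        for {
            select {
            case cmd, ok := <-commands:
                if !ok {
                    return
                }
                handle(cmd)
                // A command arrived: reset the inactivity timer.
                if !idle.Stop() {
                    <-idle.C
                }
                idle.Reset(10 * time.Second)
            case <-idle.C:
                // No activity for 10 seconds: exit.
                return
            }
        }
    }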
Agent upgrade process
---------------------
MIG supports upgrading agents in the wild. The upgrade protocol is designed with
security in mind. The flow diagram below presents a high-level view:
::
   Investigator          Scheduler             Agent             NewAgent           FileServer
   +-----------+         +-------+             +---+             +------+           +--------+
         |                   |                   |                   |                   |
         |   1.initiate      |                   |                   |                   |
         |------------------>|                   |                   |                   |
         |                   |   2.send command  |                   |                   |
         |                   |------------------>|  3.verify         |                   |
         |                   |                   |--------+          |                   |
         |                   |                   |        |          |                   |
         |                   |                   |        |          |                   |
         |                   |                   |<-------+          |                   |
         |                   |                   |                   |                   |
         |                   |                   |   4.download      |                   |
         |                   |                   |-------------------------------------->|
         |                   |                   |                   |                   |
         |                   |                   |   5.checksum      |                   |
         |                   |                   |--------+          |                   |
         |                   |                   |        |          |                   |
         |                   |                   |        |          |                   |
         |                   |                   |<-------+          |                   |
         |                   |                   |                   |                   |
         |                   |                   |   6.exec          |                   |
         |                   |                   |------------------>|                   |
         |                   |  7.return own PID |                   |                   |
         |                   |<------------------|                   |                   |
         |                   |                   |                   |                   |
         |                   |------+  8.mark    |                   |                   |
         |                   |      |  agent as  |                   |                   |
         |                   |      |  upgraded  |                   |                   |
         |                   |<-----+            |                   |                   |
         |                   |                   |                   |                   |
         |                   |   9.register      |                   |                   |
         |                   |<--------------------------------------|                   |
         |                   |                   |                   |                   |
         |                   |------+ 10.find dup|                   |                   |
         |                   |      | agents in  |                   |                   |
         |                   |      | registrations                  |                   |
         |                   |<-----+            |                   |                   |
         |                   |                   |                   |                   |
         |                   | 11.send command to kill PID old agt   |                   |
         |                   |-------------------------------------->|                   |
         |                   |                   |                   |                   |
         |                   |   12.acknowledge  |                   |                   |
         |                   |<--------------------------------------|                   |
All upgrade operations are initiated by an investigator (1). The upgrade is
triggered by an action to the upgrade module with the following parameters:
.. code:: json
"Operations": [
{
"Module": "upgrade",
"Parameters": {
"linux/amd64": {
"to_version": "16eb58b-201404021544",
"location": "http://localhost/mig/bin/linux/amd64/mig-agent",
"checksum": "31fccc576635a29e0a27bbf7416d4f32a0ebaee892475e14708641c0a3620b03"
}
}
}
],
* Each OS family and architecture has its own set of parameters (ex: "linux/amd64",
"darwin/amd64", "windows/386", ...). Then, in each OS/Arch group, we have:
* to_version is the version an agent should upgrade to
* location points to an HTTPS address that hosts the agent binary
* checksum is a SHA256 hash of the agent binary, verified after download
The parameters above are signed using a standard PGP action signature.
The upgrade action is forwarded to agents (2) like any other action. The action
signature is verified by the agent (3), and the upgrade module is called. The
module downloads the new binary (4), verifies the version and checksum (5) and
installs itself on the system.
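Steps 4 and 5 boil down to fetching the binary over HTTPS and comparing its
SHA256 digest with the checksum carried in the signed parameters. Below is a
minimal sketch of that logic; the function name is hypothetical and this is not
the upgrade module's actual code.

.. code:: go

    package upgrade

    import (
        "crypto/sha256"
        "fmt"
        "io"
        "net/http"
        "os"
    )

    // downloadAndVerify fetches the new agent binary from location, writes it
    // to dest, and checks its SHA256 digest against the checksum taken from
    // the signed upgrade parameters. (Illustrative sketch only.)
    func downloadAndVerify(location, checksum, dest string) error {
        resp, err := http.Get(location)
        if err != nil {
            return err
        }
        defer resp.Body.Close()

        out, err := os.OpenFile(dest, os.O_CREATE|os.O_WRONLY|os.O_TRUNC, 0755)
        if err != nil {
            return err
        }
        defer out.Close()

        // Hash the binary while writing it to disk.
        h := sha256.New()
        if _, err := io.Copy(io.MultiWriter(out, h), resp.Body); err != nil {
            return err
        }

        if got := fmt.Sprintf("%x", h.Sum(nil)); got != checksum {
            return fmt.Errorf("checksum mismatch: got %s, want %s", got, checksum)
        }
        return nil
    }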
Assuming everything checks out, the old agent executes the binary of the new
agent (6). At that point, two agents are running on the same machine, and the
rest of the protocol is designed to shut down the old agent, and clean up.
After executing the new agent, the old agent returns a successful result to the
scheduler, and includes its own PID in the results (7). The scheduler then marks
the old agent as upgraded (8).
The new agent starts by registering with the scheduler (9). This tells the
scheduler that two agents are running on the same node (10), and one of them
must terminate. The scheduler sends a kill action to both agents with the PID of
the old agent (11). The kill action may be executed twice, but that doesn't
matter, and its result is acknowledged back to the scheduler (12).
The scheduler then sends a new action to check for `mig-agent` processes. Only
one should be found in the results, and if that is the case, the scheduler tells
the agent to remove the binary of the old agent. When the agent returns, the
upgrade protocol is done.
If the PID of the old agent lingers on the system, an error is logged for the
investigator to decide what to do next. The scheduler does not attempt to clean
up the situation.
Command execution flow in Agent and Modules
-------------------------------------------
An agent receives a command from the scheduler on its personal AMQP queue (1).
It parses the command (2) and extracts all of the operations to perform.
Operations are passed to modules and executed asynchronously (3). Rather than
maintaining a state of the running command, the agent creates a goroutine and a
channel tasked with receiving the results from the modules. Each module
publishes its results into that channel (4). The result parsing goroutine
receives them, and when it has received all of them, builds a response (5)
that is sent back to the scheduler (6).
When the agent is done running the command, both the channel and the goroutine
are destroyed.
::
          +-------+  [ - - - - - - A G E N T - - - - - - - - - - - - ]
          |command|+---->(listener)
          +-------+         |(2)
              ^             V
              |(1)       (parser)
              |             +    [ m o d u l e s ]
   +-----+    |          (3)|----------> op1 +----------------+
   |SCHED|+---+             |------------> op2 +--------------|
   | ULER|<---+             |--------------> op3 +------------|
   +-----+    |             +----------------> op4 +----------+
              |                                               V(4)
              |(6)                                       (receiver)
              |                                               |
              |                                               V(5)
              +                                          (publisher)
          +-------+                                         /
          |results|<---------------------------------------'
          +-------+
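A simplified sketch of steps (3) to (5) above: each operation runs in its own
goroutine, every module publishes its result into a shared channel, and a
collector assembles the response once one result per operation has been
received. The type and function names are illustrative, not MIG's actual code.

.. code:: go

    package agent

    // Operation and Result are placeholders for the real command structures.
    type Operation struct {
        Module     string
        Parameters interface{}
    }

    type Result struct {
        FoundAnything bool
        Elements      interface{}
    }

    // runOperations executes each operation asynchronously and collects one
    // result per operation before building the response. runModule stands in
    // for the real module invocation. (Illustrative sketch of steps 3 to 5.)
    func runOperations(ops []Operation, runModule func(Operation) Result) []Result {
        resultsChan := make(chan Result, len(ops))

        // (3) dispatch each operation to its module in a separate goroutine.
        for _, op := range ops {
            go func(op Operation) {
                // (4) the module publishes its result into the channel.
                resultsChan <- runModule(op)
            }(op)
        }

        // (5) the receiver collects all results and builds the response.
        results := make([]Result, 0, len(ops))
        for range ops {
            results = append(results, <-resultsChan)
        }
        return results
    }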
Threat Model
------------
Running an agent as root on a large number of endpoints means that Mozilla
InvestiGator is a target of choice to compromise an infrastructure.
Without proper protections, a vulnerability in the agent or in the platform
could lead to a compromise of the endpoints.
The architectural choices made in MIG reduce the endpoints' exposure to
compromise. While the risk cannot be eliminated entirely, an attacker would need
direct control of an investigator's key material, or root access to the
infrastructure, in order to take control of MIG.
MIG's security controls include:
* Strong GPG security model
* Infrastructure resiliency
* No port listening
* Protection of connections to the relays
* Randomization of the queue names
* Whitelisting of agents
* Limit data extraction to a minimum
Strong GPG security model
~~~~~~~~~~~~~~~~~~~~~~~~~
All actions that are passed to the MIG platform and to the agents require
valid GPG signatures from one or more trusted investigators. The public keys of
trusted investigators are hardcoded in the agents, making it almost impossible
to override without root access to the endpoints, or access to an investigator's
private key. The GPG private keys are never seen by the MIG platform (API,
Scheduler, Database or Relays). A compromise of the platform would not lead to
an attacker taking control of the agents and compromising the endpoints.
Infrastructure resiliency
~~~~~~~~~~~~~~~~~~~~~~~~~
One of the design goals of MIG is to make each component as stateless as
possible. The database is used as the primary data store, and the schedulers and
relays keep data in transit in their respective caches. But any of these
components can go down and be rebuilt without compromising the resiliency of
the platform. As a matter of fact, it is strongly recommended to rebuild each
platform component from scratch on a regular basis, and only keep the
database as persistent storage.
Unlike other systems that require constant network connectivity between the
agents and the platform, MIG is designed to work with intermittent or unreliable
connectivity with the agents. The rabbitmq relays will cache commands that are
not consumed immediately by offline agents. These agents can connect to the
relay whenever they choose to, and pick up outstanding tasks.
If the relays go down for any period of time, the agents will continuously
attempt to reconnect at regular intervals. It is trivial to rebuild
a fresh rabbitmq cluster, even on a new IP space, as long as the FQDN of the
cluster, and the TLS cert/key and credentials of the AMQPS access point
remain the same.
No port listening
~~~~~~~~~~~~~~~~~
The agents do not accept incoming connections. There is no listening port that
an attacker could use to exploit a vulnerability in the agent. Instead, the
agent connects to the platform by establishing an outbound connection to the
relays. The connection uses TLS, making it theoretically impossible for an
attacker to MITM the traffic without access to the PKI and DNS, neither of
which is part of the MIG platform.
Protection of connections to the relays
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The rabbitmq relay of a MIG infrastructure may very well be listening on the
public internet. This is used when MIG agents are distributed into various
environments, as opposed to concentrated on a single network location. RabbitMQ
and Erlang provide a stable network stack, but are not shielded from a network
attack that would take down the cluster. To reduce the exposure of the AMQP
endpoints, the relays use AMQP over TLS and require the agents to present a
client certificate before accepting the connection.
The client certificate is shared across all the agents. **It is not used as an
authentication mechanism.** Its sole purpose is to limit the exposure of a public
AMQP endpoint. Consider it a network filter.
Once the TLS connection between the agent and the relay is established, the
agent will present a username and password to open the AMQP connection. Again,
these credentials are shared across all agents, and are not used to authenticate
individual agents. Their role is to assign an ACL to the agent.
The ACL limits the AMQP actions an agent can perform on the cluster.
See `rabbitmq configuration`_ for more information.
.. _`rabbitmq configuration`: configuration.rst
Randomization of the queue names
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
The protections above limit the exposure of the AMQP endpoint, but since the
secrets are shared across all agents, the possibility still exists that an
attacker gains access to the secrets and establishes a connection to the relays.
Such access would have very limited capabilities. It cannot be used to publish
commands to the agents, because publication is ACL-limited to the scheduler.
It can be used to publish fake results to the scheduler, or listen on the
agent queue for incoming commands.
Both are made difficult by appending a random value to the name of an agent
queue. An agent queue is named using the following scheme:
`mig.agt.<OS family>.<Hostname>.<uid>`
The OS and hostname of a given agent are easy to guess, but the uid isn't.
The uid is a 64-bit integer composed of a nanosecond timestamp and a random
32-bit integer, chosen by the agent on first start. It is specific to an endpoint.
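For illustration, a queue identifier of that shape could be derived roughly as
follows; the exact construction used by MIG may differ.

.. code:: go

    package agent

    import (
        "crypto/rand"
        "encoding/binary"
        "fmt"
        "strconv"
        "time"
    )

    // newQueueLoc builds a queue identifier of the form
    // mig.agt.<OS family>.<Hostname>.<uid>, where the uid mixes a nanosecond
    // timestamp with a random 32-bit value chosen at first start.
    // (Illustrative sketch; MIG's exact construction may differ.)
    func newQueueLoc(osFamily, hostname string) (string, error) {
        var buf [4]byte
        if _, err := rand.Read(buf[:]); err != nil {
            return "", err
        }
        r := binary.BigEndian.Uint32(buf[:])
        uid := uint64(time.Now().UnixNano()) ^ uint64(r)
        return fmt.Sprintf("mig.agt.%s.%s.%s", osFamily, hostname,
            strconv.FormatUint(uid, 36)), nil
    }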
Whitelisting of agents
~~~~~~~~~~~~~~~~~~~~~~
At the moment, MIG does not provide a strong mechanism to authenticate agents.
It is a work in progress, but for now agents are whitelisted in the scheduler
using the hostnames that are advertised in the heartbeat messages. While easy to
spoof, it provides a basic filtering mechanism. The long term goal is to allow
the scheduler to call an external database to authorize agents. In AWS, the
scheduler could call the AWS API to verify that a given agent does indeed exist
in the infrastructure. In a traditional datacenter, this could be an inventory
database.
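As a sketch of what this basic filtering can look like, the function below
checks the hostname advertised in a heartbeat against a flat whitelist file,
one hostname per line. The file format and function name are assumptions for
the example, not the scheduler's actual implementation.

.. code:: go

    package scheduler

    import (
        "bufio"
        "os"
        "strings"
    )

    // isWhitelisted returns true if the hostname advertised in an agent's
    // heartbeat appears in the whitelist file, one hostname per line.
    // (Illustrative sketch; the real scheduler may store this differently.)
    func isWhitelisted(whitelistPath, hostname string) (bool, error) {
        f, err := os.Open(whitelistPath)
        if err != nil {
            return false, err
        }
        defer f.Close()

        scanner := bufio.NewScanner(f)
        for scanner.Scan() {
            if strings.TrimSpace(scanner.Text()) == hostname {
                return true, nil
            }
        }
        return false, scanner.Err()
    }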
Limit data extraction to a minimum
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Agents are not `meant` to retrieve raw data from their endpoints. This is more
of a good practice than a technical limitation. The modules shipped with
the agent are meant to return boolean answers of the type "match" or "no match".
It could be argued that answering "match" on sensitive requests is similar to
extracting data from the agents. MIG does not solve this issue. It is the
responsibility of the investigators to limit the scope of their queries (i.e., do
not search for a root password by sending an action with the password in the
regex).
The goal here is to prevent a rogue investigator from dumping large amounts of
data from an endpoint. MIG could trigger a memory dump of a process, but
retrieving that data would require direct access to the endpoint.
Note that MIG's database keeps records of all actions, commands and results. If
sensitive data were to be collected by MIG, that data would be available in the
database.