Processing Example

The example directory contains an example directory layout and minimal configuration for dataset processing with the Cyber Range Kyoushi Dataset CLI tool. The tool itself does not enforce any specific directory or file layout, other than configuring the processing and parsing pipeline through the process.yaml file, but we recommend following the structure presented here as it separates processing into multiple steps, making each step smaller and easier to understand.

Note

The layout presented here is also reflected by many of the default values of configuration options.

Layout

Note

* These files and directories are automatically generated by the Logstash setup processor.

Note

** By default the logstash/data and logstash/log directories are used as Logstash runtime directories for storing current state, temporary files and generated logs.

process.yaml

The processing and parsing configuration.

config

The config directory should be used to store any static configuration files to be used or loaded as part of the dataset processing pipeline. We also highly recommend making this directory, or a subdirectory of it, the target location for any config file that is dynamically rendered as part of the pipeline.

<name>_queries.yaml

During the post-processing phase it is likely that some of the configuration file or processor templates will have to query the Elasticsearch database. This can be done through either an EQL or a DSL query, and both syntax types can result in long query definitions. To keep template files and processor configurations brief, we recommend putting Elasticsearch queries into separate files and loading the query definitions, when needed, as part of the template or processor context.
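
For illustration, such a query file could simply collect named query definitions that a template or processor then loads into its context. The names, fields, and values below are placeholders, not a fixed schema:

# illustrative query definitions -- names, fields, and values are placeholders
attacker_http_requests:
  # Elasticsearch DSL query body
  query:
    term:
      source.ip: "192.0.2.10"
failed_then_successful_login:
  # EQL sequence query string
  query: >
    sequence by host.name
      [authentication where event.outcome == "failure"]
      [authentication where event.outcome == "success"]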

logs.yaml

The logs.yaml file is used to define which of the gathered log files should be parsed and what meta information should be added to them. The file is structured into two main fields, servers and groups, mirroring the Ansible group and host vars structure. Log file configurations can either be assigned to a host (server) directly or indirectly via the groups it belongs to. See the Logstash Log Config model for details on the configuration format for a single log file. While it is possible to integrate this information into the testbed's TIM group and host vars, we recommend using the logs.yaml file, because it ensures that all dataset processing configuration can be viewed and accessed in a single location. Host and group vars could only be inspected in the servers' fact files (which are very large and hard to read) or after pre-processing, once the facts have been curated.
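
A minimal sketch of such a file is given below. The top-level servers and groups fields follow the description above, while the group/host names and the fields of the individual log entries are only placeholders; refer to the Logstash Log Config model for the authoritative options:

# minimal sketch -- group/host names and per-log fields are placeholders,
# see the Logstash Log Config model for the real options
groups:
  webservers:
    - path: logs/apache2/access.log                # hypothetical gathered log file
      save_parsed: true                            # keep a parsed copy via the file output
      add_field:
        "[@metadata][pipeline]": "apache-access"   # hypothetical ingest pipeline id
servers:
  intranet_server:
    - path: logs/auth.log                          # hypothetical host-specific log file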

logstash

The logstash directory is used both for storing the Logstash parser configuration and for files generated at runtime (e.g., Logstash logs).

conf.d

The conf.d directory contains all Logstash parsing configuration files that should be loaded, i.e., input, output, and filter definitions. Note that if you use the Logstash setup processor, the input and output configurations are generated automatically and you only have to define the filter definitions used to parse raw logs into structured data.

0000_pre_process.conf*

A special Logstash filter configuration file containing filters used to bootstrap log events for parsing. For example, patching line numbers into log events is done by a Ruby filter configured in this file.

input.conf*

Logstash input configuration file containing all log file inputs and their input settings.

output.conf*

Logstash output configuration containing the Elasticsearch output settings and an optional file output used to store parsed versions of the log files (only used for logs configured with save_parsed: true).

<log type>.conf

To configure how log data is parsed, it is recommended to create a Logstash filter configuration file for each log type. This makes the parsing configuration easier to edit, update, and read, as each file is smaller and less complicated than a single file containing filters for all log data.
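
For example, a filter file for Apache access logs might look roughly like the sketch below. The [type] condition assumes that the corresponding input or log configuration tags events with a matching type; adjust it to however your inputs distinguish log types:

# apache_access.conf -- illustrative filter, the [type] value is an assumption
filter {
  if [type] == "apache-access" {
    grok {
      match => { "message" => "%{COMBINEDAPACHELOG}" }
    }
    date {
      match => [ "timestamp", "dd/MMM/yyyy:HH:mm:ss Z" ]
    }
  }
}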

data**

The Logstash runtime directory used to store state and plugin data, such as pointers to the currently processed files.

log**

The Logstash runtime logs directory.

jvm.options*

Options file for the Java Virtual Machine (JVM) used to run Logstash. These options can be used to, e.g., change the amount of RAM used by Logstash.
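
For instance, the JVM heap size can be raised by adjusting the -Xms/-Xmx entries; the values below are only an example:

# example heap settings -- choose values that fit your machine
-Xms4g
-Xmx4g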

log4j2.properties*

Logstash runtime logger configuration file.

logstash.yml*

The Logstash main configuration file used to set the data and log runtime directories.
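
Under the layout described here, the generated file essentially points Logstash at the runtime directories mentioned above, roughly like:

# sketch of the generated settings -- the generated file may contain additional options
path.data: logstash/data
path.logs: logstash/log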

pipelines.yml*

The Logstash pipeline configuration used to configure the processing mode and options. This file is generated by the Logstash setup processor, and the config is set up so that Logstash runs in single-threaded mode to preserve the log line order.
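
A generated single-threaded pipeline configuration might resemble the following sketch; the pipeline id and config path are placeholders, and pipeline.workers: 1 together with pipeline.ordered keeps events in order:

# illustrative sketch -- pipeline id and config path are placeholders
- pipeline.id: main
  path.config: "logstash/conf.d/*.conf"
  pipeline.workers: 1      # single worker thread
  pipeline.ordered: true   # preserve event (log line) order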

<name>-index-template.json

By default, Elasticsearch will try to automatically create correct field mappings (i.e., type definitions) for the data fields created through the parsing process, but this is not very efficient if there are many different fields, or fields for which Elasticsearch would produce an incorrect mapping. Also, when using Eql Sequence Rules, all referenced fields must have a defined field mapping, otherwise the underlying EQL query will result in an error. This can be problematic if one of the referenced fields is optional (i.e., might not occur in all dataset instances), since then no automatic field mapping would be created for it. Thus it is recommended to pre-define the index field mappings using index templates. Each of the created index template files must be imported using the Index Template processor during the pre-processing phase.
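
As a sketch, assuming the composable index template format of current Elasticsearch versions, such a file could pre-define mappings for a few parsed fields (the index pattern and field names are placeholders):

{
  "index_patterns": ["<dataset name>-apache-access-*"],
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" },
        "source": { "properties": { "ip": { "type": "ip" } } },
        "http": {
          "properties": {
            "response": { "properties": { "status_code": { "type": "short" } } }
          }
        }
      }
    }
  }
}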

<name>-ingest.yaml

As an alternative to configuring Logstash filters for parsing log data, it is also possible to define Ingest Pipelines on the Elasticsearch server. This way the log data processing is handled by the Elasticsearch server upon receiving the log event. Each of the created ingest pipeline configurations must be imported into the Elasticsearch server during the pre-processing phase using the Ingest Pipeline processor. Logs can then be marked for processing with a specific pipeline by using the add_field configuration option:

add_field:
    "[@metadata][pipeline]": "<pipeline id>"

templates

The templates directory should be used to store all template files used in the processing pipeline.

rules

The rules directory should contain all labeling rule templates to be rendered during the post-processing phase.

<rule class>.yaml.j2

We recommend splitting labeling rule templates into multiple files based on the log class/file they are written for. For example, if you write labeling rules for Apache server logs you might want to put them in a file called apache.yaml.j2. During the labeling phase, rules are applied in lexicographical order, so you might want to use numerical prefixes to ensure that a certain rule file is applied before another.
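
Purely as an illustration of the templating pattern (the rule fields shown are placeholders, not the actual labeling rule schema), an apache.yaml.j2 file might render values prepared in the attacker context described below:

# placeholder structure -- only the Jinja2 templating pattern is the point here,
# consult the labeling rule documentation for the real rule schema
- id: attacker.apache.access          # hypothetical rule name
  labels: [attacker_http]             # hypothetical label
  query:
    term:
      source.ip: "{{ attacker.ip }}"  # rendered from a previously prepared context file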

attacker

One of the main inputs for labeling rule templates is the attacker facts and logs; because of this, we recommend keeping attacker-related config templates in a separate directory. This makes it easier for other people to read your processing pipeline configuration.

attacker.yaml.j2

Further, we recommend defining a general attacker information template for rendering basic attacker data (e.g., IP addresses, users, etc.). Should you have more than one attacker, you might want to render the file multiple times.
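
A minimal sketch of such a template is shown below; the context variable name (attacker_facts) and the exact fact keys available are assumptions of this sketch:

# minimal sketch -- the `attacker_facts` context variable and fact keys are assumptions
attacker:
  ip: "{{ attacker_facts.ansible_default_ipv4.address }}"
  user: "{{ attacker_facts.ansible_user_id }}"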

<phase name>.yaml.j2

It is also recommended to prepare a configuration template for each of the defined attack phases for rendering phase-specific information, such as execution timestamps (parsed from the attacker logs) or dynamically chosen execution parameters (e.g., executed CLI commands).

groups.json.j2

While the host group information can also be read directly from the fact files, we recommend curating this information into a simple groups.json file using a template processor. This makes the data easier to use in subsequent processors.
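
Assuming a groups mapping (group name to list of hosts) is available in the template context, such a template could be as simple as dumping that mapping to JSON:

{# sketch -- assumes a `groups` mapping of group names to host lists in the context #}
{{ groups | tojson(indent=2) }}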

servers.json.j2

Similarly, we recommend curating the server facts into a file containing only the information necessary for dataset processing or documentation. This makes the information easier to access and easier for a human inspecting the resulting dataset to understand.