
Censys-ML (Beta)


Current release: 0.0.2

License: MIT

Censys-ML is a data mapping tool that transforms the Censys data model and other similar sources into one more suited for data analysis.

Installation

Python Dependencies

All of the dependencies and requirements are listed in the Pipfile.

Use the package manager pipenv to install all the dependencies. Visit the official site to install pipenv.

Run the following from a terminal opened in the root project directory to install the dependencies.

pipenv install

Vector

Vector is one of the key components of this project. It is used as the engine that transforms the original dataset.

To install Vector, run the setup.sh bash script in the scripts folder.

$ bash scripts/setup.sh

Alternatively, Vector can be installed using curl:

curl --proto '=https' --tlsv1.2 -sSf https://sh.vector.dev | sh

If these methods do not work, take a look at the other Vector installation methods.

Setup

Handling Datasets

The datasets need to be structured in JSON format, with one JSON object per line (see the note under Supported Sources below).
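
For example, a dataset exported as a pretty-printed JSON array can be flattened into one object per line with a tool such as jq (the file names here are only illustrative):

jq -c '.[]' censys_export.json > data/censys_export.json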

Configurations

Change the input and output configurations to match the target environment.

~/censys-ml $ cd scripts
~/censys-ml/scripts $ vim main.sh
export INPUT="file"     <-- input configuration
export OUTPUT="console" <-- output configuration

There are additional sinks and sources in beta on the Vector website that might not have been included here.

Additional configuration changes might need to be made in the source_configs (./vector/source_configs/) and sink_configs (./vector/sink_configs/) directories to satisfy source- or sink-specific requirements, e.g. topics for Kafka or indices for Elasticsearch.

Enabling TLS

TLS is not used by default. To set up a connection with TLS, options such as the CA certificate file, certificate file, and certificate key file must be specified in the source/sink configuration.
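
As a sketch only, assuming the HTTP sink is in use and that its configuration lives in ./vector/sink_configs/http.toml (the file name is an assumption), the TLS options could be appended under the sink's table; the tls.* option names follow the Vector documentation.

# Illustrative only: the target file and certificate paths are assumptions.
cat >> vector/sink_configs/http.toml <<'EOF'
tls.enabled  = true
tls.ca_file  = "/etc/ssl/certs/ca.pem"        # CA certificate file
tls.crt_file = "/etc/ssl/certs/client.pem"    # certificate file
tls.key_file = "/etc/ssl/private/client.key"  # certificate key file
EOF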

Transforming Datasets

After setting up the correct source and sink, execute the main.sh script in the scripts directory to start the transform.

~/censys-ml $ cd scripts
~/censys-ml/scripts $ bash main.sh

Press CTRL+c to stop the transform process.

(Optional) Updating the Model definition

Note: Make sure that the correct service account file is placed in the auth folder inside the main config directory before proceeding. This service account needs to have access to the Censys BigQuery dataset as described here.

The schema set by these sources is subject to change as more services are found and ports are scanned. These changes can in turn affect the transformation process and the resulting dataset, which means that the model definition needs to be updated from time to time.

This is OPTIONAL because the model definition is frequently updated by our team.

To update the model definition, run the following:

cd censys_ml
pipenv run python update_model.py

Alternatively, once a pipenv shell is running, navigate to the censys_ml directory and execute the update_model.py module directly:

cd censys_ml
python update_model.py

(Optional) Generating the Lua Scripts

Before getting into the transformation phase, the Lua scripts must first be generated. These scripts are dynamically generated based on the current Censys model definition.

THIS IS OPTIONAL because the Lua scripts are frequently updated by our team.

To do this, simply execute the generate_lua_transforms module in the censys_ml directory:

cd censys_ml
pipenv run python generate_lua_transforms.py

Supported Sources

Note: All sources must supply data that is structured as standard single-line JSON.

Standard Input (stdin)

A straightforward data source is the console itself. Although inserting large amounts of data this way can be tedious, it is still an option, and it can come in handy when self-analyzed data needs to be included in the report. Inputs still need to be single-line JSON objects. Every input line is treated as a separate event, and a separate output JSON object is generated for it.
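
For instance, a single event could be piped in like this (a minimal sketch, assuming main.sh reads from standard input when INPUT is set to "stdin"; the event shown is made up):

cd scripts
export INPUT="stdin"   # assumed value for selecting the stdin source
echo '{"ip": "8.8.8.8", "protocols": ["443/https"]}' | bash main.sh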

File

One of the supported sources for this mapping tool is a file input.
The default pattern for recognition is any file ending with a *.json extension. Once the transformation process has begun, new files can still be added to the input directory and they will be picked up automatically. Compressed files (gzip, etc.) are also decompressed for reading, but this is not a reliable method. Once a file has been read and a checkpoint has been set for it, it WILL NOT be read again.

Apache Kafka

This source is meant for larger datasets that flow in continuously. Kafka streaming is also an option with Kafka version >= 0.8.

Supply the bootstrap server address as a string of an IP followed by a port, separated by a colon, like "127.0.0.1:9092". Data can also be ingested from multiple bootstrap servers, not just one; if several bootstrap servers act as data sources, set the bootstrap servers variable to a string of addresses separated by commas, like "127.0.0.1:9092,10.0.0.2:9092". Once the servers are set, the group ID for the consumer group should be specified. Next comes the message_key; this key is the field that holds the message in the output log event. Finally, the topics need to be set. These are the Kafka topics to read events from; simply supply the names separated by commas, like "topic1,topic2".
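
A minimal sketch of what this might look like in main.sh; apart from INPUT, the variable names below are assumptions used for illustration:

export INPUT="kafka"                                      # assumed value for selecting the kafka source
export BOOTSTRAP_SERVERS="127.0.0.1:9092,10.0.0.2:9092"   # hypothetical variable: bootstrap servers
export GROUP_ID="censys-ml-consumers"                     # hypothetical variable: consumer group ID
export MESSAGE_KEY="message"                              # hypothetical variable: field holding the message
export TOPICS="topic1,topic2"                             # hypothetical variable: Kafka topics to read from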

HTTP

Data can also be retrieved from HTTP requests.
To use this source, set the source to http. This component exposes a configured port, so make sure your network allows access to it. Set the ADDRESS variable to a string containing the IP and port to listen on for connections, such as "0.0.0.0:80".
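
For example (ADDRESS is the variable named above; selecting the source through INPUT is an assumption based on the Configurations section):

export INPUT="http"          # assumed value for selecting the http source
export ADDRESS="0.0.0.0:80"  # IP and port to listen on for connections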

For other advanced options, check out the Vector HTTP source documentation.

Splunk HEC Source

This source ingests data through the Splunk HTTP Event Collector (HEC) protocol. To use this source, set the source to splunk_hec. A valid address, meaning an IP followed by a port number, should be set. A token is used to authorize the connection.
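
A sketch with assumed variable names (ADDRESS follows the HTTP source above; TOKEN is a hypothetical variable for the HEC authorization token):

export INPUT="splunk_hec"        # assumed value for selecting the splunk_hec source
export ADDRESS="0.0.0.0:8088"    # IP and port to listen on (8088 is the conventional HEC port)
export TOKEN="<your-hec-token>"  # hypothetical variable holding the authorization token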

Heroku Logplex Source

This source ingests data through the Heroku Logplex HTTP Drain protocol. To use this source, set the source to logplex. A valid address, meaning an IP followed by a port number, should be set.

Socket

Custom socket connections are also an option.
To use this source set the source to socket . Set the MODE env variable to either 'udp', ' tcp' or 'unix' based on your socket connection type. In a tcp connection mode TLS can be enabled. If the connection mode is set to 'unix', the path to the absolute unix socket should be set. For advanced settings, take a look at the Socket section in docs.

Vector

If there is already a Vector instance that acts as an upstream, it can be used as a data source for this instance.
To use this source, set the source to vector. The dataset from the upstream instance is ingested over a socket connection. Since only the JSON format is supported, make sure that the data coming through this source was valid JSON prior to encoding.

Docker

The Docker engine daemon can also act as a source. Docker API >= 1.24 and the json-file logging driver are required, so they should be installed first. To use this source, set the source to docker. Please note that a connection to the Docker daemon is only automated if the active user has the privilege to run docker ps.
This plugin allows the consumption of data from a container or an image. To supply the container or image, set the container or image field to the name of the desired container or image.

Syslog

This source ingests data through the syslog protocol. To use this source, set the source to syslog. The syslog protocol uses a socket connection, so the same steps as in the Socket section can be followed.

Journald

Data can also be ingested through systemd's journald utility. The journalctl binary is required; this is the interface Vector uses to retrieve journald logs. The user must also be part of the systemd-journal group in order to execute the journalctl binary. Checkpoints are set for every batch that is read.

If needed, entries from alternate boots can be included. The full path of the journalctl executable must also be specified. Units are monitored once the process has begun. To select which units to monitor, simply specify their names separated by commas, such as "unit1,unit2". Any unit lacking a "." will have ".service" appended to it to make it a valid service unit name.

Source comparisons


Each source has a function of either 'collect' or 'receive'. Support for multiple operating systems, OpenSSL over TLS, and guaranteed data delivery in all cases* varies by source; see the Vector documentation for per-source details.

Sources          Source function
File             collect
Stdin            receive
Docker           collect
Socket           receive
HTTP             receive
Kafka            collect
Vector           receive
Journald         collect
Heroku Logplex   receive
Splunk HEC       receive
Syslog           receive

* Where a source does not guarantee data delivery in all cases, this DOES NOT mean that it lacks modes that guarantee delivery; rather, in some modes these sources make a best-effort delivery guarantee and, in rare cases, can lose data.

Supported Sinks

Console (stdout)

Output the result JSON directly to stdout. More info

Elasticsearch

Output the result JSON to an elasticsearch index/indices. More info

File

Output the result JSON to a file with an optional time based pattern. More info

HTTP

Output the result JSON via an HTTP request. More info

Kafka

Output the result JSON to a Kafka topic. More info

Splunk HEC

Output the result JSON to a Splunk HEC endpoint. More info

Running Tests

Run the following from the root directory to run the unit tests.

pipenv run python run_tests.py     

The code coverage report will be generated in the ./coverage/ directory.

Visit ./coverage/index.html to view the report.

Development tools

  • PyCharm

Resources

  • https://pypi.org/project/pipenv/
  • https://censys.io/ipv4
  • https://vector.dev/
  • Lua programming language
  • https://cloud.google.com/bigquery
