3 Steps to Improve the Data Quality of a Data Lake

Written by Irina Stolbolva

From Customising Logs in the Code to Monitoring in Kibana

This article shares an approach for making the data ingestion flow in a data lake more transparent by monitoring custom logs in Kibana.

The previous project I worked on was dedicated to building a data lake. Its purpose was to ingest gigabytes of data from various sources and make them available to multiple users within the organization. As it turned out, it was not always easy to verify that all the data had been inserted successfully, and even when a problem was evident, it took hours to identify its cause. The system clearly needed fixing.

From a technical perspective, the solution we proposed can be divided into three main blocks:

logging the necessary information in the code,
indexing logs in Elasticsearch using Logstash,
visualizing logs in custom dashboards on Kibana.

In software development, logging is a means of opening up the black box of a running application. As an app grows in complexity, it becomes harder to figure out what is going on inside, and this is where logs become more valuable. Who benefits from them? Both developers and software users! Thanks to logs, a developer can retrace the path the program took and get a hint about where a bug may be located, while a user can obtain the necessary information about the program and its output, such as the execution time, data about the processed files, etc.

To improve the robustness of the application, we wanted the logs to meet two standards. First, they should be customized so that they contain only the data we are interested in. It is therefore important to think about what really matters in the application: it may be the name of a script or an environment, the time of execution, the name of the file containing an error, etc. Second, they should be human-readable so that a problem can be detected as quickly as possible, regardless of the volume of data processed.

Step 1: Logging essential information in the code.

The first sub-goal is to prepare logs that can be easily parsed by Logstash and Elasticsearch. For that reason, we keep the log messages as JSON, one object per line, containing the information we would like to display: log message, timestamp, script name, environment (prod or dev), log level (debug, info, warning, error), and stack trace.

The code below shows how to create customized logs in JSON format for a mock application made of the following parts: the application body is written in main.py, the logger object is defined in logging_service.py, and its parameters are described in logging_configuration.yml. To add specific fields to each log statement, we wrote a CustomJsonFormatter class that overrides the add_fields method of its superclass, imported from the pythonjsonlogger package. The get_logger function from logging_service.py returns a new logger with the desired configuration. Note: the best practice is to define the logger at the top of every module of your application.

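	# logging_service.py
	import datetime
	from logging import config, getLogger

	import yaml
	from pythonjsonlogger import jsonlogger


	class CustomJsonFormatter(jsonlogger.JsonFormatter):
	    """JSON formatter that adds an upper-case level and a UTC timestamp."""

	    def add_fields(self, log_record, record, message_dict):
	        # Let the parent class fill in the standard fields first,
	        # so that the checks below see what is already present.
	        super().add_fields(log_record, record, message_dict)

	        if log_record.get('level'):
	            log_record['level'] = log_record['level'].upper()
	        else:
	            log_record['level'] = record.levelname

	        if not log_record.get('timestamp'):
	            # Use UTC so that the trailing 'Z' in the format is accurate.
	            log_record['timestamp'] = datetime.datetime.now(datetime.timezone.utc)\
	                .strftime('%Y-%m-%dT%H:%M:%S.%fZ')


	def get_logger():
	    """Configure logging from the YAML file and return the root logger."""
	    # The path is relative to the directory the script is run from.
	    with open('logging_configuration.yml') as logging_configuration_file:
	        # safe_load parses plain YAML without building arbitrary Python objects.
	        dict_config = yaml.safe_load(logging_configuration_file)
	    config.dictConfig(dict_config)
	    logger = getLogger()
	    return logger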
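To see the shape of a single log line without installing anything, the same idea can be sketched with the standard library alone. This is only an illustration, not the formatter used in the project: the class name StdlibJsonFormatter, the helper format_example, and the field names script and exc_info are assumptions made for the example.

```python
import json
import logging
from datetime import datetime, timezone


class StdlibJsonFormatter(logging.Formatter):
    """Hypothetical stand-in for CustomJsonFormatter using only the stdlib."""

    def format(self, record):
        log_record = {
            "message": record.getMessage(),
            "level": record.levelname,
            # Timestamp taken from the record itself, expressed in UTC.
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc)
            .strftime("%Y-%m-%dT%H:%M:%S.%fZ"),
            # Field name chosen for this sketch: the file that emitted the log.
            "script": record.filename,
        }
        if record.exc_info:
            log_record["exc_info"] = self.formatException(record.exc_info)
        # json.dumps without indent keeps every record on a single line.
        return json.dumps(log_record)


def format_example():
    """Format one hand-built record and return the resulting JSON line."""
    record = logging.LogRecord(
        name="demo", level=logging.ERROR, pathname="main.py", lineno=1,
        msg="The division %s by %s is impossible", args=(5, 0), exc_info=None,
    )
    return StdlibJsonFormatter().format(record)


if __name__ == "__main__":
    print(format_example())
```

Each record ends up as one JSON object on one line, which is the layout a line-oriented JSON parser expects.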

	# logging_configuration.yml
	version: 1
	disable_existing_loggers: true
	root:
	  handlers: [file_handler]
	handlers:
	  file_handler:
	    backupCount: 3
	    class: logging.handlers.RotatingFileHandler
	    filename: /path/to/logs/file.log
	    formatter: json_formatter
	    maxBytes: 5000
	formatters:
	  json_formatter:
	    (): logging_service.CustomJsonFormatter
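The maxBytes and backupCount settings of the RotatingFileHandler bound how much disk space the log can take: once the current file would exceed maxBytes, it is renamed with a numeric suffix, and at most backupCount old files are kept. The sketch below is a way to observe this locally; the function name demo_rotation, the logger name, and the small sizes are assumptions chosen so that rotation happens quickly.

```python
import logging
import logging.handlers
import os
import tempfile


def demo_rotation(log_dir, max_bytes=200, backup_count=3, n_messages=100):
    """Write n_messages through a RotatingFileHandler and list the files left."""
    log_path = os.path.join(log_dir, "file.log")
    handler = logging.handlers.RotatingFileHandler(
        log_path, maxBytes=max_bytes, backupCount=backup_count
    )
    logger = logging.getLogger("rotation_demo")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the demo output out of the root logger
    logger.addHandler(handler)
    try:
        for i in range(n_messages):
            logger.info("message number %d with some padding text", i)
    finally:
        # Detach and close the handler so the files can be inspected or deleted.
        logger.removeHandler(handler)
        handler.close()
    return sorted(os.listdir(log_dir))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print(demo_rotation(tmp))
```

With the values from the YAML file (maxBytes 5000, backupCount 3), the same mechanism keeps the current file plus at most three rotated ones.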

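	# main.py
	from logging_service import get_logger

	# Define the logger once, at the top of the module.
	logger = get_logger()


	def division(a, b):
	    """Return a / b, or log an error and return None when b is zero."""
	    try:
	        return a / b
	    except ZeroDivisionError:
	        error_message = 'The division {a} by {b} is impossible'.format(a=a, b=b)
	        # exc_info=True adds the stack trace to the log record.
	        logger.error(error_message, exc_info=True)
	        return None


	if __name__ == '__main__':
	    division(5, 0)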

To create file.log, place the three files above in the same folder and run the following command from your terminal:

cd {project_root} && python3 main.py

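Before handing the file over to Logstash, it can be useful to check that every line really is a standalone JSON object. The helper below is a minimal sketch of such a check; read_log_records is a hypothetical name, and the path in the usage part is the placeholder from logging_configuration.yml.

```python
import json


def read_log_records(path):
    """Return (records, bad_lines): parsed JSON objects and unparseable lines."""
    records, bad_lines = [], []
    with open(path, encoding="utf-8") as log_file:
        for line in log_file:
            line = line.strip()
            if not line:
                continue  # ignore blank lines
            try:
                parsed = json.loads(line)
            except json.JSONDecodeError:
                bad_lines.append(line)
                continue
            # Only JSON objects count as log records.
            if isinstance(parsed, dict):
                records.append(parsed)
            else:
                bad_lines.append(line)
    return records, bad_lines


if __name__ == "__main__":
    # Placeholder path: use the filename set in logging_configuration.yml.
    records, bad_lines = read_log_records("/path/to/logs/file.log")
    print(len(records), "valid records,", len(bad_lines), "unparseable lines")
```

Any line reported as unparseable here is a line that a JSON parser downstream would also fail on.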

Step 2: Indexing logs in Elasticsearch using Logstash.

To make the logs easier to read, we used the ELK stack: a combination of three open-source projects, Elasticsearch, Logstash, and Kibana. Many articles explain what it is and discuss its pros and cons.

The main steps we followed when installing ELK:

Install Logstash on the machine where the logs are located. Start a Logstash container.
Install Elasticsearch and Kibana. In our case, they were located on a separate machine.
Configure Logstash so that it points to Elasticsearch and watches the right files. Notes:
1) The glob pattern folder/**/*.log* looks for .log files in all subfolders of folder.
2) The extra * at the end also matches rotated files.
3) Lines that fail to parse, which Logstash tags with _jsonparsefailure, are dropped so that they are not indexed in Elasticsearch.
4) To avoid mixing data coming from different environments, you can give each environment its own index, which lets Elasticsearch group the data.

	# logstash.conf
	input {
	  file {
	    codec => "json"
	    # Glob pattern: .log files in all subfolders, rotated files included.
	    path => ["/path/to/logs/**/*.log*"]
	    start_position => "beginning"
	    discover_interval => 3
	  }
	}

	filter {
	  date {
	    # The Python formatter writes ISO 8601 timestamps (e.g. 2020-01-31T12:00:00.000000Z).
	    match => ["timestamp", "ISO8601"]
	  }
	  # Do not index lines that failed JSON parsing.
	  if "_jsonparsefailure" in [tags] {
	    drop { }
	  }
	  mutate {
	    add_field => { "source_environment" => "dev" }
	  }
	}

	output {
	  elasticsearch {
	    hosts => ["elasticsearch_kibana_machine_address:9200"]
	    index => "logstash-dev-%{+YYYY.MM.dd}"
	    ssl => true
	    # Certificate verification is disabled here; enable it outside of test setups.
	    ssl_certificate_verification => false
	  }
	}

Start the Elasticsearch and Kibana containers on a dedicated machine. Make sure their versions match the Logstash version; otherwise their APIs may be incompatible.
Configure a service that automatically relaunches Logstash.
Logstash, Elasticsearch, and Kibana produce their own logs. If storage space on your machine is limited, this can quickly become a problem. To avoid it, configure the logging of your ELK components, for example by specifying the maximum size of the log files and the number of backups.
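Returning to the file pattern used in the Logstash configuration, its effect can be tried out locally. The sketch below uses Python's pathlib, whose glob rules are close to, but not guaranteed identical to, those applied by Logstash; the function names and the demo directory tree are made up for the example.

```python
import tempfile
from pathlib import Path


def find_log_files(folder):
    """Return paths under folder matching **/*.log*, relative to folder."""
    root = Path(folder)
    return sorted(
        path.relative_to(root).as_posix()
        for path in root.glob("**/*.log*")
        if path.is_file()
    )


def build_demo_tree(folder):
    """Create a small, made-up directory tree to run the pattern against."""
    root = Path(folder)
    names = [
        "app.log",
        "jobs/daily/app.log",
        "jobs/daily/app.log.1",  # a rotated file
        "jobs/notes.txt",        # should not match
    ]
    for name in names:
        target = root / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("{}\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        build_demo_tree(tmp)
        print(find_log_files(tmp))
```

The rotated file app.log.1 is picked up thanks to the trailing *, while notes.txt is ignored.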

Step 3: Visualization with Kibana.

Build Kibana dashboards that contain all your visualizations. They let you grasp the big picture and choose your angle of analysis. The main benefit of dashboards is the ability to drill down along any available dimension (in our dashboard, the source environment or the log level). They also display the details of each log: the table below the charts shows every log that matches the current filters.
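The kind of breakdown such a dashboard shows can be approximated offline on parsed log records, which helps when deciding which fields are worth logging. This is a rough sketch and not how Kibana computes its charts; count_by and the sample records are hypothetical, and the field names follow the ones used in this article.

```python
from collections import Counter


def count_by(records, field):
    """Count records per value of field; missing values are grouped as 'unknown'."""
    return Counter(record.get(field, "unknown") for record in records)


if __name__ == "__main__":
    # Made-up records with the fields used in this article.
    sample = [
        {"level": "ERROR", "source_environment": "dev"},
        {"level": "INFO", "source_environment": "dev"},
        {"level": "ERROR", "source_environment": "prod"},
    ]
    print(count_by(sample, "level"))
    print(count_by(sample, "source_environment"))
```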

[Image: snapshot of the Kibana dashboard of our final application]

I hope you enjoyed this article, and I wish you happy logging in your software development adventures!

Thanks to Pierre Marcenac, Nicolas Jean, and Alexandre Sapet.

