March 4, 2024 • 6 min read

How to master LightGBM to efficiently make predictions

Written by Agathe Minaro

If you don’t know what LightGBM is or how it works, you can first read this article, which explains the basics. After a little theory, let’s learn by doing!

We will dive into:

  1. Preparing your data
  2. Fine-tuning your model
  3. Interpreting and analysing your model's predictions
  4. Enhancing the runtime efficiency of your model

Along the way, we'll include practical code examples to illustrate these concepts.

Get your Data Ready

Before diving into model training, preparing your dataset is crucial (cleaning, feature engineering...). LightGBM stands out because it efficiently handles various data types without extensive preprocessing.

Compared to other algorithms, LightGBM does a lot of things on its own. For instance, it can process categorical data natively, which eliminates the need for the one-hot encoding required by models such as Random Forest. This reduces memory usage and speeds up training, giving you a straightforward way to handle categorical types and labels in your dataset.

Moreover, the model can handle missing values natively: it uses a special representation for missing data and learns, at each split, which side of the split missing values should be sent to. However, keep in mind that you can sometimes further enhance performance by imputing missing values or using techniques like mean encoding.
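To make this concrete, here is a minimal sketch, with hypothetical column names, of passing a DataFrame containing a categorical column and missing values directly to LightGBM:

```python
import numpy as np
import pandas as pd
import lightgbm as lgb

# Hypothetical toy dataset: one categorical column, one numeric column with NaNs
df = pd.DataFrame({
    "shop_id": pd.Series(["A", "B", "A", "C", "B", "A", "C", "B"], dtype="category"),
    "price": [10.0, np.nan, 12.5, 8.0, np.nan, 11.0, 9.5, 10.5],
    "target": [1, 0, 1, 0, 1, 1, 0, 0],
})

train_set = lgb.Dataset(
    df[["shop_id", "price"]],
    label=df["target"],
    categorical_feature=["shop_id"],  # no one-hot encoding needed
)

# Missing values in "price" are handled natively by the trees
model = lgb.train({"objective": "binary", "verbosity": -1}, train_set, num_boost_round=10)
```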

For many algorithms, such as distance-based models (KNN, SVM...), feature scaling is essential. As a tree-based algorithm, however, LightGBM does not need feature scaling to perform well.

Here are the remaining steps to get your data ready:

  1. Feature Selection. Imagine your data is like a backpack. Some things in it are super important, and some are just extra weight. LightGBM likes it when you keep only what matters and leave the rest behind. Identifying and excluding non-informative features can drastically reduce noise and computational complexity, and therefore improve model performance. To do so, you can use Boruta, a feature selection method that automatically classifies features based on their usefulness to the task at hand (see the sketch after this list).
  2. Feature Engineering. This step involves generating new features from existing data, selecting the most relevant attributes, and transforming variables to better capture the underlying patterns. It is an important step for any machine learning model that is not a deep learning one. By carefully creating features that reflect the complexities of your data, you provide LightGBM with a rich, informative dataset primed for learning. For time series, for instance, you can create lag features, which hold the values of previous time steps. You can also merge your dataset with external data, which can bring valuable signal to your model.
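As an illustration of both ideas, here is a minimal sketch, assuming a pandas DataFrame df indexed by date with a sales target column (hypothetical names) and the boruta package installed:

```python
import lightgbm as lgb
from boruta import BorutaPy

# Feature engineering: lag features carrying the values of previous time steps
df["sales_lag_1"] = df["sales"].shift(1)
df["sales_lag_7"] = df["sales"].shift(7)
df = df.dropna()

X = df.drop(columns=["sales"])
y = df["sales"]

# Feature selection: Boruta wraps a tree-based estimator and flags useful features
estimator = lgb.LGBMRegressor(n_estimators=100)
selector = BorutaPy(estimator, n_estimators="auto", random_state=42)
selector.fit(X.values, y.values)

selected_features = X.columns[selector.support_].tolist()
print(selected_features)
```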

By doing these simple things, you lay a solid foundation for building a robust LightGBM model. They are critical for ensuring the success of the subsequent stages of model development.

Fine-tune your LightGBM

Tweaking the model's settings can boost its performance. The goal of this part is to find the optimal set of hyperparameters instead of relying on LightGBM's default values, so that you get the best results on your dataset. Even though you will rarely do this fine-tuning by hand, and many algorithms can help you do it, understanding the core concepts lets you use those tools better: choosing the right hyperparameters to optimise for your problem and setting a sensible search space. Here are the most important ones:

  • Learning rate. (learning_rate) Determines the step size at each iteration while moving toward a minimum of the loss function; this idea of descending the loss function is explained in the first article. A smaller value makes the model learn more slowly but potentially leads to better generalisation, while a larger value speeds up learning at the risk of overshooting the minimum. A standard value is between 0.1 and 0.3.
  • Maximum depth of trees. (max_depth) Controls the depth of the trees constructed during the learning process, which impacts the model's ability to capture interactions among features. Shallow trees may generalise better. You can begin with a non-restrictive value such as 10.
  • Maximum number of leaves per tree. (num_leaves) Influences the complexity of the tree model. A higher number of leaves allows the model to capture more information, but at the risk of overfitting. It should be smaller than 2^max_depth to avoid overfitting.
  • Minimum data in leaf. (min_data_in_leaf) Trees can become very specific about the samples they describe. This parameter sets a threshold for the minimum number of samples in a leaf, which helps prevent overfitting by ensuring that leaves represent a sufficient number of instances. In practice, setting it to a few hundred is enough for a large dataset.
  • Feature Fraction. (feature_fraction) Dictates the fraction of features considered for splitting at each node, encouraging feature diversity in the model's learning process. A good initial value is around 0.8 for medium to large datasets. For smaller datasets, where overfitting is a greater concern, starting with a lower value such as 0.6 or 0.7 can be beneficial.

However, as we said, in practice you should use automated hyperparameter optimisation tools. They streamline the process of finding the optimal settings for these parameters, enhancing model accuracy and efficiency. Optuna is a smart tool that helps your model pick the right options: it efficiently searches the hyperparameter space using strategies such as the Tree-structured Parzen Estimator (TPE) and other Bayesian optimisation methods to identify the set of parameters that gives the best performance according to a predefined metric. This library can be used with many machine-learning models; here is how you can use it with LightGBM:

Use Optuna with LightGBM
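Here is a minimal sketch of such a study, assuming you already have X_train, y_train, X_valid, and y_valid splits for a regression task (illustrative names), with search ranges that follow the guidelines above:

```python
import lightgbm as lgb
import numpy as np
import optuna
from sklearn.metrics import mean_squared_error

def objective(trial):
    # Hyperparameter search space
    params = {
        "objective": "regression",
        "metric": "rmse",
        "verbosity": -1,
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3, log=True),
        "max_depth": trial.suggest_int("max_depth", 3, 12),
        "num_leaves": trial.suggest_int("num_leaves", 16, 256),
        "min_data_in_leaf": trial.suggest_int("min_data_in_leaf", 20, 500),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.5, 1.0),
    }
    train_set = lgb.Dataset(X_train, label=y_train)
    valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)
    model = lgb.train(
        params,
        train_set,
        num_boost_round=1000,
        valid_sets=[valid_set],
        callbacks=[lgb.early_stopping(stopping_rounds=50, verbose=False)],
    )
    preds = model.predict(X_valid, num_iteration=model.best_iteration)
    return np.sqrt(mean_squared_error(y_valid, preds))

study = optuna.create_study(direction="minimize")
study.optimize(objective, n_trials=100)
print(study.best_params)
```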

Interpret and Analyse your LightGBM

Model interpretation is essential for gaining insights into the factors driving your model's predictions. This stage involves understanding how different features influence the output of LightGBM and identifying areas for improvement. It helps not only to improve your model but also to extract valuable business insights, thanks to the interpretability of LightGBM. Here are different methods you can use:

1. Analyse Feature Importance Scores. This can reveal which features most significantly affect the model's predictions, offering guidance on feature selection and engineering. This type of importance is available for any tree-based model.

Plot LightGBM Feature Importances
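A minimal sketch, assuming model is a trained LightGBM model:

```python
import lightgbm as lgb
import matplotlib.pyplot as plt

# "gain" ranks features by the total loss reduction they bring,
# which is usually more informative than the default "split" count
lgb.plot_importance(model, importance_type="gain", max_num_features=20)
plt.tight_layout()
plt.show()
```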

2. Analyse SHAP Values. They provide a detailed breakdown of the contribution of each feature to individual predictions, facilitating a deeper understanding of the model's decisions. SHAP uses a game-theoretic approach to explain the output of any model, which is very useful with “black-box” algorithms.

Shap Values for LightGBM
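A minimal sketch, again assuming a trained model and a validation feature matrix X_valid:

```python
import shap

# TreeExplainer is optimised for tree ensembles such as LightGBM
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_valid)

# Global view: which features push predictions up or down, and how strongly
shap.summary_plot(shap_values, X_valid)
```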

In the picture below, you can see an example of the graphic you can obtain for a survival model. The use of colour helps illustrate the impact of changes in a feature's value: for example, an elevated white blood cell count correlates with an increased mortality risk.

Shap Value for Survival Model - Source: SHAP documentation

3. Analyse the Errors. This step is very important to uncover patterns in the data that the model struggles with. To analyse your model errors effectively, you have to collect and quantify incorrect predictions and segment them to uncover patterns or trends. You can then perform a detailed root cause analysis focusing on data characteristics, feature contributions, and external insights. This process involves comparing erroneous predictions against correct ones to identify discrepancies and leveraging domain knowledge for deeper insights. It highlights opportunities for further tuning or additional feature engineering.
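For instance, a simple way to start is to segment the validation errors by a feature of interest (the segmentation column below is hypothetical):

```python
import numpy as np
import pandas as pd

# Assuming `model`, `X_valid` (a DataFrame) and `y_valid` from the previous steps
preds = model.predict(X_valid)

errors = pd.DataFrame({
    "abs_error": np.abs(np.asarray(y_valid) - preds),
    "segment": X_valid["shop_id"].values,  # hypothetical feature to segment errors by
})

# Which segments does the model struggle with the most?
print(
    errors.groupby("segment")["abs_error"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
```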

This analytical process not only demystifies the model's operation but also empowers you to make informed decisions to refine and improve its predictive performance.

Be aware that data preparation, training, hyperparameter optimisation, and analysis are integral components of an iterative process, where each step informs and refines the others to continuously improve model performance.

Enhance your LightGBM Efficiency

Now that you have a well-optimised model, you want it to be as efficient as possible. Performance enhancement involves leveraging LightGBM's advanced capabilities to minimise training and inference time.

During training, you should use LightGBM's early-stopping callback, which halts training when improvements on the validation set diminish. You can use it with the best parameters found earlier, as follows:

Train LightGBM with Early Stopping
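A minimal sketch, assuming best_params comes from the Optuna study above and the same train and validation splits:

```python
import lightgbm as lgb

train_set = lgb.Dataset(X_train, label=y_train)
valid_set = lgb.Dataset(X_valid, label=y_valid, reference=train_set)

model = lgb.train(
    {**best_params, "objective": "regression", "metric": "rmse", "verbosity": -1},
    train_set,
    num_boost_round=5000,  # a large cap; early stopping decides the actual number
    valid_sets=[valid_set],
    callbacks=[
        lgb.early_stopping(stopping_rounds=100),  # stop if no improvement for 100 rounds
        lgb.log_evaluation(period=100),           # log the metric every 100 rounds
    ],
)
print(model.best_iteration)
```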

Moreover, do not forget to take advantage of LightGBM's support for multi-core processing and distributed computing. It lets you handle large datasets more efficiently by reducing training time without sacrificing performance. The relevant parameter is num_threads, the number of parallel threads to use. You should set it to the number of physical cores of your CPU, which is typically half the number of logical threads when hyper-threading is enabled.
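One way to do this, assuming the optional psutil package is available to count physical cores and reusing the objects from the earlier sketches:

```python
import lightgbm as lgb
import psutil

# Physical cores only (logical=False excludes hyper-threaded siblings)
physical_cores = psutil.cpu_count(logical=False)

params = {**best_params, "num_threads": physical_cores}
model = lgb.train(params, train_set, valid_sets=[valid_set])
```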

Finally, once the model is optimised, you can export it to reuse it later or deploy it in production environments. This is a crucial step in operationalising your predictions. Here is an easy way to do it:

Export and Load LightGBM Model
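A minimal sketch, with an illustrative file name:

```python
import lightgbm as lgb

# Save the trained Booster as a plain-text file
model.save_model("lightgbm_model.txt")

# Later, or in a production service, reload it and predict
loaded_model = lgb.Booster(model_file="lightgbm_model.txt")
predictions = loaded_model.predict(X_valid)
```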

LightGBM's ability to save the trained model as a plain-text file is very useful, as you can open the file and inspect the trees the model is composed of.

Conclusion

Mastering LightGBM encapsulates a journey from data preparation to model optimisation, interpretation, and performance enhancement. However, it's important to recognise that LightGBM is just one tool among a wide range of machine-learning models. Alternatives like XGBoost, CatBoost, and various deep learning frameworks have their own advantages and can be better suited to specific types of data or prediction tasks. New models appear every day, so it's important to stay alert, but the main steps to improve a model will always be more or less the same.

Are you looking for machine learning experts? Don’t hesitate to contact us!

This article was written by

Agathe Minaro