April 8, 2022 • 3 min read

How to Implement a Fast Custom KNN in Sklearn Using Cython

Written by Reda Boumahdi

Scikit-learn is a great library for machine learning, but it can be quite slow on some problems, especially when you extend it with custom code such as custom metrics. k-NN, or KNN, is an intuitive algorithm for classification or regression. I will mainly use it for classification, but the same principle applies to regression and to other algorithms that accept custom metrics.

Let’s dive into how you can implement a fast custom KNN in Scikit-learn.

A quick taste of Cython

The fundamental nature of Cython can be summed up as follows: Cython is Python with C data types.

Cython is essentially Python code that is compiled to a C file and built into a native library. Calls into this library are much faster than calls to plain Python code. How fast? See for yourself!

First, let’s write a simple loop in Python:

# python_loop.py
def python_loop():
    n = 1000
    for i in range(n):
        i**5

Then, let’s do the same in Cython:

# cython_loop.pyx
def cython_loop():
    cdef int n = 1000
    cdef int i
    for i in range(n):
        i**5

To compile it into an importable module, we also need a small setup.py:

# setup.py
from distutils.core import setup
from Cython.Build import cythonize

setup(name="cython_loop", ext_modules=cythonize("cython_loop.pyx"))

To build the Cython library, run:

python setup.py build_ext --inplace

Then we execute a main file that times both versions:

# main.py
from time import time

from cython_loop import cython_loop
from python_loop import python_loop

t1 = time()
python_loop()
t2 = time()
cython_loop()
t3 = time()
print("looping with python takes : " + str(t2 - t1))
print("looping with cython takes : " + str(t3 - t2))

Surprise… Cython is 1000 times faster!

What is KNN?

KNN is a great tool for interpreting outputs without having to deal with black boxes. The idea behind this classification algorithm is to compare a new point (the green circle) to the K most similar points in the data set (the closest ones), and to give it the most represented label among them (square or triangle).

[Figure: K-NN classification — a new point (green circle) among squares and triangles]

In the case shown above, if K = 3 the algorithm predicts a triangle; if K = 5 it predicts a square. Thus, the choice of K is quite important.
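As a quick illustration (a toy sketch, not from the original post), the choice of K is simply the n_neighbors parameter in Scikit-learn:

from sklearn.neighbors import KNeighborsClassifier

# Toy data mirroring the figure: label 0 = square, 1 = triangle.
X = [[0, 0], [0, 1], [1, 0], [1, 1], [2, 2], [2.2, 2.2]]
y = [0, 0, 0, 0, 1, 1]

for k in (3, 5):
    clf = KNeighborsClassifier(n_neighbors=k).fit(X, y)
    print(k, clf.predict([[1.8, 1.8]]))  # K=3 -> triangle (1), K=5 -> square (0)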

The main computational cost is calculating all the distances between the current point and all the remaining points. Let’s see how we can apply what we learned about Cython in the first part of this tutorial to K-NN.

Why use a custom distance?

When it comes to data, the more features you can use, the better. Often, you will face continuous and discrete features at the same time. In this case, a classic distance may fail to give good results, and a custom distance needs to be chosen. There are infinitely many possible distances, and you can combine them to create your own for better predictive power… Just like the Avengers, they are better combined!

Here is a list of some common metrics implemented in Scikit-learn:

  • Euclidean : sqrt(sum((x - y)^2))
  • Manhattan : sum(|x - y|)
  • Chebyshev : max(|x - y|)
  • Minkowski with parameter p : sum(|x - y|^p)^(1/p)
  • W-Minkowski with parameters p,w : sum(w * |x - y|^p)^(1/p)
  • S-Euclidean with parameter V : sqrt(sum((x - y)^2 / V))
  • Mahalanobis with parameter V : sqrt((x - y)' V^-1 (x - y))

See Scikit-learn’s DistanceMetric documentation to learn more about using these distances.
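For instance, a minimal sketch of selecting a built-in metric by name:

from sklearn.neighbors import KNeighborsClassifier

# Chebyshev distance, selected by name.
clf_cheb = KNeighborsClassifier(n_neighbors=5, metric="chebyshev")

# Minkowski distance with p=3 (p is a direct parameter of the classifier).
clf_mink = KNeighborsClassifier(n_neighbors=5, metric="minkowski", p=3)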

As an example, custom metrics are useful for recommendation. Imagine Booking wanted to suggest hostels based on what you have booked in the past, using the price (price_i), the location (geoloc_i) and the stars (stars_i) as features. To apply K-NN, we must be careful to compare features that live on comparable scales. One way to overcome this difficulty is to normalize; another is to use the following distance:

α‖geoloc_1-geoloc_2‖² + β‖price_1-price_2‖² + γ‖stars_1-stars_2‖²

And to choose α, β and γ so that the predictive performance is better, for instance via cross-validation.
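As a rough sketch (not from the original post), such a weighted distance could be written as a plain Python metric; the feature layout and weight values below are illustrative placeholders:

import numpy as np

# Illustrative placeholder weights; tune them, e.g. by cross-validation.
ALPHA, BETA, GAMMA = 1.0, 0.5, 2.0

def booking_dist(a, b):
    # Hypothetical feature layout per row: [lat, lon, price, stars].
    geoloc = np.sum((a[:2] - b[:2]) ** 2)
    price = (a[2] - b[2]) ** 2
    stars = (a[3] - b[3]) ** 2
    return ALPHA * geoloc + BETA * price + GAMMA * stars

Such a function can then be passed to KNeighborsClassifier through its metric argument, exactly as we do with the Cython version below.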

Custom distance syntax

The first step is to define our custom distance. We write the metric in a file called cython_metric.pyx. Our custom distance takes two arrays and returns a double.

# cython_metric.pyx
def mydist(double[:] x, double[:] y):
    # Squared Euclidean distance between two 1-D arrays (memoryviews).
    cdef int n = x.shape[0]
    cdef int i
    cdef double res = 0
    for i in range(n):
        res += (x[i] - y[i])**2
    return res

In simple Python, it would look like this.

import numpy as np

def mydist(x, y):
    return np.sum((y - x)**2)
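Note that, as with cython_loop.pyx earlier, the Cython version must be compiled before it can be imported. A minimal setup.py along the same lines, built with the same build_ext command:

from distutils.core import setup
from Cython.Build import cythonize

# Same pattern as the setup.py above, pointing at cython_metric.pyx.
setup(name="cython_metric", ext_modules=cythonize("cython_metric.pyx"))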

Then in another file, we load the data and call KNeighborsClassifier from Scikit-learn, and give it our custom distance.

import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

from cython_metric import mydist

# Example data (a stand-in): n points with labels in {-1, 1};
# the precision formula below assumes such labels.
n = 1000
x = np.random.randn(n, 2)
y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)

classifier = KNeighborsClassifier(metric=mydist)
t1 = time.time()
classifier.fit(x, y)
res = classifier.predict(x)
t2 = time.time()
print(t2 - t1)
print("precision is : " + str(1 - np.abs(res - y).sum() * 1. / (2 * n)))

We output the time taken by the algorithm and the precision as well.

Performance benchmark

For the same distance (Euclidean), we compare the performance of the Python code versus the Cython code by running the previous code several times for different numbers of observations. We can then plot the following graph.

[Figure: running time of the Python vs. Cython custom metric as the number of observations grows]
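The benchmark loop itself is not shown here; a sketch of how it could look, assuming the pure-Python mydist from above is saved in a hypothetical python_metric.py:

import time

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

from cython_metric import mydist as cython_dist
from python_metric import mydist as python_dist  # hypothetical module name

for n in (100, 500, 1000, 5000):
    x = np.random.randn(n, 2)
    y = np.where(x[:, 0] + x[:, 1] > 0, 1, -1)
    for name, dist in (("python", python_dist), ("cython", cython_dist)):
        clf = KNeighborsClassifier(metric=dist)
        t0 = time.time()
        clf.fit(x, y)
        clf.predict(x)
        print(name, n, time.time() - t0)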

Conclusion

Python is a great language for solving many kinds of problems. If you are interested in performance and want to speed up part of your code, you can move it into a Cython module. As we did with the distance calculation, your code will run much, much faster! I hope this tutorial helps your algorithms learn blazing fast!

Thanks to Flavian Hautbois, Adil Baaj, Adrien Lina, and Charles Bochet.

If you are looking for Machine Learning experts, don't hesitate to contact us.

This article was written by Reda Boumahdi