This article shows how to perform fraud detection with Graph Analysis.
Fraud detection is a major field of interest for data science. As fraud is a rare event, the main challenge is to find a way to bring to light abnormal behavior. That is why graph analysis is a useful approach to perform fraud detection. Many algorithms exist to extract information from graphs. In this article, we will study one of them: the Personalized Page Rank algorithm. To manipulate our graphs and compute this algorithm we will use the python package Networkx.
Page Rank Algorithm
Page Rank is a well-known algorithm developed by Larry Page and Sergey Brin in 1996. They were studying at Standford University and it was part of a research project to develop a new kind of search engine. They then successfully founded Google Inc.
This algorithm assigns a numerical weighting to every node of a connected network. This measure represents the relative importance of a node within the graph (its rank).
To compute Page Rank a random walk is performed. This random walk is defined as follow:
- The walker starts at a random node in the graph.
- At each iteration, the walker follows an outgoing edge to one of the next nodes with a probability α or jumps to another random node with a probability 1-α.
The PageRank theory holds that the imaginary walker who is randomly walking on links will eventually stop. The probability, at any step, that the person will continue is called the damping factor α.
As an example, for Google, the network is composed of websites that point to each other through links. The page rank measure of each web page is then the probability that a person randomly surfing on the internet would finally arrive on this specific page.
In mathematical terms, The Page Rank of a node is the stationary measure of the Markov Chain described by the random walk.
On the animation below you can visualize a Random Walk performed on a connected graph with a damping factor set to 0.85.
On the above example, one would predict that the node ‘c’ is the one with the higher rank. This is the most central node. On the contrary, the node ‘h’ is more likely to have a low rank.
How to compute Page Rank with Python and Networkx
With the python script below, thanks to Networkx, we will first generate a random graph and then apply pagerank function.
Personalized Page Rank Algorithm
We have seen that the Page Rank is a representation of the importance of nodes within a network. Personalized Page Rank gives the possibility to bring out nodes in a graph that are central from the perspective of a set of specific nodes. Those specific nodes could be the acknowledged fraudsters for example.
To do so, the random walk is biased:
- The walker can only start from one of the nodes of the featured set.
- The walker can only jump to one of the nodes of the featured set.
As a result, the nodes with the higher rank will be the ones that are the most central from the perspective of featured nodes.
On the example below, you can see a personalized random walk. The nodes ‘a’ and ‘e’ are the featured nodes.
The Personalized Page Rank of the graph presented above is then:
The nodes ‘c’ and ‘d’ are then the most linked to the feature in question. So they are nodes we could be interested to study.
How to compute Personalized Page Rank with Python and Networkx
With Networkx it is possible to compute personalized page rank using the same function than the one used to compute page rank: pagerank. You just have to pass an extra parameter: personalization. A vector, that once normalized, gives for each node the probability to be chosen as the source vertex.
With the below python script, we are going to build a random graph of 7 individuals. Two of them are known as fraudsters. Thanks to Networkx and Personalized Page Rank we will be able to detect “at-risk” individuals. They are those with a high rank.
In this example, the two acknowledged fraudsters are assigned a personalization.
Examples of Implementation for Fraud detection
Implementation for fraud detection within a community
As a first example of implementation of Personalized Page Rank for fraud detection, let’s consider a population in which a small percentage of fraudsters are identified. Given the postulate “People knowing fraudsters are more likely to be themselves fraudsters”, you may be interested in studying a graph of relations between individuals (just pay attention to ethical questions that could be raised). You could apply Personalized Page Rank algorithm to detect those individuals that have a higher probability to be fraudsters. In this case, fraud detection is supervised, some individuals are already labeled as fraudsters.
For example in a community of N people, with M fraudsters you can assign a personalization weight of M/N to known fraudsters and 0 to others. After running the Personalized Page Rank algorithm, the individuals (not yet acknowledged as crooks) that have the higher page rank are the one you want to mind.
If you want to train on this business case you could use this dataset. This is a graph of Facebook relations between students. Select 4% of individuals and appoint them as fraudsters. Find the students that are the more likely to misbehave computing Personalized Page Rank thanks to Networkx package.
Implementation for fraud detection in Medicare system
The article Using PageRank to Detect Anomalies and Fraud in Healthcare describes another implementation of Personalized Page Rank algorithm for fraud detection. From the public dataset published by the American government, it gives a solution to detect abuse from doctors in healthcare.
The dataset gives, for more than 880,000 American medical providers, the number and nature of procedures declared to the United States Center for Medicare and Medicaid Services. As the specialty of those practicians is given, the idea of researchers was to detect doctors that provide healings inconsistent with their specialty.
From the dataset, they built a graph with practicians as nodes. In this network two practicians are linked by an edge if they are closed enough in term of shared CPT (medical procedures) codes. We could expect that providers sharing the same specialty would be linked to one another in the graph.
For every specialty among the 89 in medical field, researchers computed a Personalized Page Rank. At each iteration, they used practicians of the specialty as “source” nodes. From the computed rank, it is possible to detect potential fraudulent providers. They are those who don’t practice the specialty but that have a high rank. That would be the case of a cardiologist that provide a lot of common procedures with dermatologists. In this case, fraud detection is unsupervised, no one has been identified as fraudster before the experience.
With this method, the researchers detected some anomalous practicians. For example, they spotted an otolaryngologist that billed usual plastic surgery procedures: skin tissue rearrangement, biopsy skin lesion. They might have caught him/her red-handed.
Wanna go further?
Personalized Page Rank is appropriate for fraud detection, but could also be applied to many other business cases. It is in fact successfully used for personalized search or recommendation systems.
Many other algorithms are implemented in the package Networkx: Djikstra algorithm, communities detection algorithm… They could help you to stand out information from your graph, and then lead to detect fraudulent behavior.
On the website Network Repository, you will find thousands of graph datasets. You could train on it to perform graph analysis. I advise you to have a look at it!
Thanks to Florian Carra, Nicolas Jean, Louis Nicolle, Alexandre Sapet, and Antoine Toubhans.