Abnormal Pattern Prediction: Detecting Fraudulent Insurance Property Claims with Semi-Supervised Machine-Learning

Open Access | Jul 2019

Figures & Tables

Table 1

The 20 Data Bottles and their descriptions extracted from a Data Lake created for this particular case study.

| Bottles | Descriptions |
| --- | --- |
| ID | Identifiers for claims, policies, persons, etc. |
| CUSTOMER | Policyholder's attributes embodied in insurance policies: name, sex, age, address, etc. |
| CUSTOMER_PROPERTY | Customer data related to the insured property. |
| DATES | Dates of claims, policies, visits, etc. |
| GUARANTEES | Coverage and guarantees of the subscribed policy. |
| ASSISTANCE | Call center claim assistance. |
| PROPERTY | Data related to the insured object. |
| PAYMENTS | Policy payments made by the insured. |
| POLICY | Policy contract data, including changes, duration, etc. |
| LOSS ADJUSTER | Information about the investigation process and about the loss adjuster. |
| CLAIM | Brief, partial information about the claim, including date and location. |
| INTERMEDIARY | Information about the policies' intermediaries. |
| CUSTOMER_OBJECT_RESERVE | The coverage and guarantees involved in the claim. |
| HISTORICAL_CLAIM | Historical movements associated with the reference claim. |
| HISTORICAL_POLICY | Historical movements associated with the reference policy (the policy involved in the claim). |
| HISTORICAL_OTHER_POLICIES | Historical movements of any other policy (property or otherwise) related to the reference policy. |
| HISTORICAL_OTHER_CLAIM | Historical claims associated with the reference policy (excluding the claim analyzed). |
| HISTORICAL_OTHER_POL_CLAIM | Other claims associated with policies other than the reference policy (but related to the customer). |
| BLACK_LIST | Every participant involved in a fraudulent claim (insured, loss adjuster, intermediary, other professionals, etc.). |
| CROSS VARIABLES | Several variables constructed from interactions between the bottles. |

Figure 1

Possible clusters. (a) shows a separable and compact cluster of the abnormal points. By contrast, (b) shows abnormal and normal cases uniformly distributed.

Figure 2

Schematic representation of the desired threshold which is expected to split high fraud probability cases from low fraud probability cases.

Figure 3

Cluster Example Output. (a) shows an example of a cluster algorithm output over a sample of data points. (b) shows how the Cluster Score chooses the points that are relabeled as fraud cases (points inside the dotted line).

Algorithm 1

Unsupervised algorithm

Data: Load transformed data-set. Oversample the fraud cases in order to have the same amount as the number of unknown cases.
1   for $k \in K = \{model_1, model_2, \ldots\}$, where $K$ is a set of unsupervised models, do
2       for $i \in I$, where $I$ is a matrix of parameter vectors containing all possible combinations of the parameters in model $k$, do
3           We fit the model $k$ with the parameters $i$ to the oversampled data-set.
4           We get the $J$ clusters $\{C_1, C_2, \ldots, C_J\}$ for the combination $\{k, i\}$, i.e., $C_{k,i} = \{C_{k,i}^1, C_{k,i}^2, \ldots, C_{k,i}^J\}$.
5           For $C_{k,i}$ we calculate C1 Score and C2 Score and we obtain the cluster score $CS_{k,i}$, based on the acceptance threshold $t^*$.
6           Save the cluster score result $CS_{k,i} \in CS_{K,I}$, where $CS_{K,I}$ is the cluster score vector for each pair $\{k, i\}$.
7       end
8   end
9   Choose the optimal $CS^*$, where $CS^* = \max\{CS_{K,I}\}$.
10  Relabel the fraud variable using the optimal clustering model derived from $CS^*$. Each unknown case in a fraud cluster is now equal to 1, known fraud cases are equal to 1 and remaining cases are equal to 0.

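
The relabeling in the last step of Algorithm 1 can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function name `relabel_by_cluster`, the `threshold` parameter, and the rule "a cluster is a fraud cluster when its share of known fraud reaches the threshold" are assumptions made for illustration, and the toy data are invented.

```python
from collections import defaultdict

def relabel_by_cluster(clusters, fraud, threshold=0.5):
    """Relabel unknown cases (0) as fraud (1) when their cluster is fraud-dominated.

    clusters:  cluster id of each case, taken from the chosen clustering model
    fraud:     1 for a known fraud case, 0 for an unknown case
    threshold: assumed minimum fraud share for a cluster to count as a fraud cluster
    """
    totals = defaultdict(int)
    frauds = defaultdict(int)
    for c, f in zip(clusters, fraud):
        totals[c] += 1
        frauds[c] += f
    # Clusters whose share of known fraud reaches the (assumed) threshold.
    fraud_clusters = {c for c in totals if frauds[c] / totals[c] >= threshold}
    # Known fraud stays 1; unknown cases inside a fraud cluster become 1; the rest stay 0.
    return [1 if (f == 1 or c in fraud_clusters) else 0
            for c, f in zip(clusters, fraud)]

# Toy data: cluster 0 is 2/3 fraud, cluster 1 is 1/4 fraud.
clusters = [0, 0, 0, 1, 1, 1, 1]
fraud = [1, 1, 0, 0, 0, 0, 1]
new_labels = relabel_by_cluster(clusters, fraud, threshold=0.6)
print(new_labels)  # [1, 1, 1, 0, 0, 0, 1]
```

In the toy run, the single unknown case in cluster 0 is relabeled as fraud, while the known fraud case in cluster 1 keeps its label even though its cluster is not selected.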
Algorithm 2

Supervised algorithm.

Data: Load relabeled data-set.
1   for $model_i \in M' = \{M, S\}$, where $M$ is the set of supervised individual models and $S$ the set of stacking models from $M$, do
2       for $\{train_k, test_k\}$ folds in the Stratified k-Folds do
3           We apply PCA to fold $train_k$ and save the weights/parameters.
4           if Oversample == True then $train_k = oversample(train_k)$, where oversampling is applied to a 50/50 class balance using the ADASYN method.
5           else $train_k = train_k$ and the balanced subsampling option is activated.
6           Fit $model_i$ on $train_k$, where $model_i \in M' = \{M, S\}$.
7           Transform $test_k$ with PCA's weights/parameters and get predicted probabilities $p_k$ of $test_k$ using $model_i$.
8           Save the probabilities $p_k$ in $P_i$, where $P_i$ is the concatenation of $model_i$'s probabilities.
9       end
10      for $t_i \in [0, 1]$, where $t_i$ is a probability threshold of $model_i$ to consider a case as fraudulent, do
11          if $P_i \geq t_i$ then $P_i = 1$
12          else $P_i = 0$
13          Using $P_i$, where now $P_i$ is a binary list, we calculate
            $$FScore_{i,t} = \frac{(1 + \beta^2) \cdot \mathrm{precision} \cdot \mathrm{recall}}{\mathrm{recall} + \beta^2 \cdot \mathrm{precision}}, \quad \text{with } \beta = 2.$$
14          Save $FScore_{i,t}$ in $FScore_i$, a list of vectors of $model_i$ with FScore results for each $t$.
15      end
16      We get $FScore_i^* = \max\{FScore_i(t)\}$.
17  end

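
The threshold search in lines 10 to 16 of Algorithm 2 can be sketched in plain Python as follows. This is a hedged illustration, not the author's code: the names `f_beta` and `best_threshold`, the use of `>=` as the comparison, the threshold grid, and the toy probabilities are all assumptions.

```python
def f_beta(precision, recall, beta=2.0):
    """F-score as written in Algorithm 2; beta = 2 weights recall more than precision."""
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (recall + b2 * precision)

def best_threshold(probs, labels, thresholds, beta=2.0):
    """Return the (threshold, F-score) pair that maximizes the F-score."""
    best_t, best_score = None, -1.0
    for t in thresholds:
        # Binarize the out-of-fold probabilities at threshold t (assumed: >=).
        preds = [1 if p >= t else 0 for p in probs]
        tp = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 1)
        fp = sum(1 for y, yhat in zip(labels, preds) if y == 0 and yhat == 1)
        fn = sum(1 for y, yhat in zip(labels, preds) if y == 1 and yhat == 0)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        score = f_beta(precision, recall, beta)
        if score > best_score:  # keep the first threshold that reaches the maximum
            best_t, best_score = t, score
    return best_t, best_score

# Toy example: concatenated probabilities P_i and the relabeled fraud variable.
probs = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 1, 0, 0, 0]
thresholds = [i / 10 for i in range(11)]
t_star, f_star = best_threshold(probs, labels, thresholds)
print(t_star, f_star)  # 0.4 1.0
```

On this toy input the classes are perfectly separable, so the search settles on the lowest threshold that excludes every non-fraud case.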
Table 2

Unsupervised model results.

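| Model | n Clusters | C1 | C2 | CS (α = 2) |
| --- | --- | --- | --- | --- |
| Mini-Batch K-Means | 4 | 96.6% | 96.6% | 96.6% |
| Isolation Forest | 2 | 51.5% | 51.1% | 51.4% |
| DBSCAN | 2 | 50.2% | 49.8% | 50.1% |
| Gaussian Mixture | 5 | 95.0% | 95.0% | 96.3% |
| Bayesian Mixture | 6 | 96.5% | 96.4% | 96.5% |
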
Table 3

Oversampled Unsupervised Mini-Batch K-Means.

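| Clusters | Fraud | Percentage |
| --- | --- | --- |
| 0 | 0 | 2% |
| 0 | 1 | 98% |
| 1 | 0 | 99% |
| 1 | 1 | 1% |
| 2 | 0 | 100% |
| 2 | 1 | 0% |
| 3 | 0 | 1% |
| 3 | 1 | 99% |
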
Table 4

Supervised model results.

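| Model | Cluster Recall | Original Recall | Precision | F-Score |
| --- | --- | --- | --- | --- |
| ERT-ss | 0.9734 | 0.9840 | 0.6718 | 0.8932 |
| ERT-os | 0.9647 | 0.9819 | 0.6937 | 0.8948 |
| GB | 0.9092 | 0.9376 | 0.6350 | 0.8369 |
| LXGB | 0.8901 | 0.9249 | 0.7484 | 0.8576 |
| Stacked-ERT | 0.8901 | 0.9283 | 0.7524 | 0.8587 |
| Stacked-GB | 0.8947 | 0.9287 | 0.7630 | 0.8649 |
| Stacked-LXGB | 0.9180 | 0.9464 | 0.6825 | 0.8588 |
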
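
As a quick arithmetic check, the F-Score column of Table 4 can be recomputed from the Precision and Cluster Recall columns with the β = 2 formula of Algorithm 2. The pairing with Cluster Recall (rather than Original Recall) is an inference from the numbers, not something stated in the table, and the names `f_beta`, `rows`, and `max_gap` are illustrative only.

```python
def f_beta(precision, recall, beta=2.0):
    """F-score from Algorithm 2 with beta = 2."""
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (recall + b2 * precision)

# (cluster recall, precision, reported F-score), copied from Table 4.
rows = {
    "ERT-ss": (0.9734, 0.6718, 0.8932),
    "ERT-os": (0.9647, 0.6937, 0.8948),
    "GB": (0.9092, 0.6350, 0.8369),
    "LXGB": (0.8901, 0.7484, 0.8576),
    "Stacked-ERT": (0.8901, 0.7524, 0.8587),
    "Stacked-GB": (0.8947, 0.7630, 0.8649),
    "Stacked-LXGB": (0.9180, 0.6825, 0.8588),
}

# Largest absolute difference between recomputed and reported F-scores.
max_gap = max(abs(f_beta(p, r) - reported) for r, p, reported in rows.values())

for name, (recall, precision, reported) in rows.items():
    print(f"{name}: recomputed {f_beta(precision, recall):.4f}, reported {reported:.4f}")
```

The recomputed values agree with the reported ones to within rounding of the four-decimal inputs.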
Table 5

Model Robustness Check.

| Original Value | Prediction | Cases |
| --- | --- | --- |
| Non-Investigated | Non-Fraud | 29,631 |
| Fraud | Non-Fraud | 0 |
| Non-Investigated | Fraud | 415 |
| Fraud | Fraud | 271 |

(a) ERT-ss Robustness Check

| Original Value | Prediction | Cases |
| --- | --- | --- |
| Non-Investigated | Non-Fraud | 29,656 |
| Fraud | Non-Fraud | 8 |
| Non-Investigated | Fraud | 390 |
| Fraud | Fraud | 263 |

(b) ERT-os Robustness Check

Table 6

Base Model Final Results.

| Original Value | Prediction | Cases |
| --- | --- | --- |
| Non-Investigated | Non-Fraud | 29,631 |
| Fraud | Non-Fraud | 0 |
| Non-Fraud | Fraud | (415 - 333) = 82 |
| Fraud | Fraud | (271 + 333) = 604 |

Table 7

Oversampled Unsupervised Mini-Batch K-Means.

| Clusters | Fraud | Percentage |
| --- | --- | --- |
| 0 | 0 | 99.4% |
| 0 | 1 | 0.6% |
| 1 | 0 | 0.7% |
| 1 | 1 | 99.3% |
| 2 | 0 | 2.6% |
| 2 | 1 | 97.4% |

Table 8

Base Model with the machine-learning process applied.

| Period | Jan 15–Jan 17 | Jan 15–Jan 18 |
| --- | --- | --- |
| Claims | 303,166 | 519,921 |
| Observed Fraud | 2,641 | 4,623 |
| Cluster Score | 96.59% | 96.89% |
| Recall Score ERT-ss | 97.34% | 96.31% |
| Precision Score ERT-ss | 67.18% | 89.35% |
| F-Score ERT-ss | 89.32% | 94.84% |
| Recall Score ERT-os | 96.47% | 96.44% |
| Precision Score ERT-os | 69.37% | 92.18% |
| F-Score ERT-os | 89.48% | 95.56% |

Language: English
Submitted on: Dec 6, 2018
Accepted on: Jul 3, 2019
Published on: Jul 17, 2019
Published by: Ubiquity Press
In partnership with: Paradigm Publishing Services
Publication frequency: 1 issue per year

© 2019 Sebastián M. Palacio, published by Ubiquity Press
This work is licensed under the Creative Commons Attribution 4.0 License.