Tax fraud, characterized by the deliberate manipulation of information in tax returns to reduce tax liabilities, poses a substantial challenge for governments globally. The resultant annual financial losses are immense, emphasizing the critical need for effective fraud detection measures. Tax authorities worldwide are turning to machine learning strategies to enhance their capabilities in identifying and preventing fraudulent activities, marking a crucial step in safeguarding government revenues.
Current detection strategies rely primarily on either supervised models, which are trained on previously audited tax returns, or unsupervised models, which analyze the entire dataset without labels distinguishing fraudulent from non-fraudulent returns. Both approaches have limitations. Supervised models suffer from sample selection bias because only a small percentage of returns is audited and therefore labeled, while unsupervised models struggle to detect fraud effectively on their own.
To address these issues, a recently published paper from King Saud University, Riyadh, introduces a novel machine learning framework for tax fraud detection. The framework integrates supervised and unsupervised models through ensemble learning and adds newly engineered features. The authors test it on tax returns provided by the Saudi tax authority. The approach aims to overcome the shortcomings of traditional strategies and offer a more comprehensive and accurate method for detecting tax fraud.
In more detail, the approach comprises four modules as follows:
- Supervised Module: Utilizes an Extreme Gradient Boosting (XGBoost) model to assign each tax return to a set of groups using tree-based classification. The model generates a matrix representing the tax return’s assignment to leaf nodes in each tree, forming the input for the prediction module.
- Unsupervised Module: Applies autoencoders on the original data to identify anomaly features. Autoencoders encode input data to a lower dimension and attempt to regenerate the input, detecting anomalies based on the regeneration error. The resulting matrix and anomaly scores serve as input for the prediction module.
- Behavioral Module: Measures a compliance score for each taxpayer, considering audit outcomes and time. The score ranges from -1 to 1, reflecting compliance or non-compliance over time. This module outputs a list of scores for each taxpayer, serving as input for the prediction module.
- Prediction Module: The final step combines all engineered features to predict tax fraud. Its input is a matrix that joins the supervised module outputs, the unsupervised module results, and the behavioral module scores. Two classifiers, an Artificial Neural Network (ANN) and a Support Vector Machine (SVM), are used to test how well the engineered features predict tax fraud.
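To make the data flow between the modules concrete, here is a minimal sketch of how the prediction module's input matrix could be assembled. It is not the paper's implementation: the one-hot encoding of leaf assignments, the use of mean squared error as the regeneration error, the function names, and all toy values are assumptions made for illustration. In practice the leaf indices would come from a trained tree ensemble and the reconstructions from a trained autoencoder.

```python
from typing import List, Sequence


def one_hot_leaves(leaf_ids: Sequence[Sequence[int]], leaves_per_tree: int) -> List[List[int]]:
    """One-hot encode the leaf reached in each tree, one block of columns per tree."""
    rows = []
    for sample in leaf_ids:
        row: List[int] = []
        for leaf in sample:
            block = [0] * leaves_per_tree
            block[leaf] = 1
            row.extend(block)
        rows.append(row)
    return rows


def reconstruction_error(original: Sequence[float], rebuilt: Sequence[float]) -> float:
    """Mean squared error between an input row and its reconstruction."""
    return sum((o - r) ** 2 for o, r in zip(original, rebuilt)) / len(original)


def build_prediction_input(leaf_ids, leaves_per_tree, originals, rebuilds, compliance):
    """Concatenate, per tax return, the features from the three upstream modules."""
    leaf_matrix = one_hot_leaves(leaf_ids, leaves_per_tree)
    features = []
    for leaves, x, x_hat, score in zip(leaf_matrix, originals, rebuilds, compliance):
        features.append(leaves + [reconstruction_error(x, x_hat), score])
    return features


# Toy values, invented for illustration only.
leaf_ids = [[0, 2], [1, 2]]          # leaf reached in each of 2 trees (3 leaves per tree)
originals = [[1.0, 0.0], [0.5, 0.5]]  # original feature rows
rebuilds = [[1.0, 0.0], [0.0, 1.0]]   # hypothetical autoencoder outputs
compliance = [0.8, -0.4]              # behavioral scores in [-1, 1]

matrix = build_prediction_input(leaf_ids, 3, originals, rebuilds, compliance)
for row in matrix:
    print(row)
```

Each row holds six leaf-indicator columns, one anomaly score, and one compliance score. The second return is reconstructed poorly, so its anomaly score is higher than the first. A classifier such as an ANN or SVM would then be trained on rows of this shape.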
The evaluation study assessed the proposed approach using data from the Saudi Zakat, Tax, and Customs Authority. Four algorithms were employed: XGBoost, autoencoders, ANN, and SVM. Precision was the primary metric, with additional metrics such as recall, F1 score, and accuracy considered.
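For readers less familiar with these metrics, the sketch below computes them per class from a list of true and predicted labels. The labels are invented toy data, not results from the paper, and the helper names are hypothetical. Precision for the “fraud” class is the share of returns flagged as fraud that really are fraudulent, which is why it matters when every flagged return may trigger a costly audit.

```python
def class_metrics(y_true, y_pred, positive):
    """Precision, recall, and F1 for one class treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Share of all predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# Toy labels, invented for illustration: 4 fraudulent and 6 non-fraudulent returns.
F, N = "fraud", "not fraud"
y_true = [F, F, F, F, N, N, N, N, N, N]
y_pred = [F, F, N, N, N, N, N, N, N, F]

precision, recall, f1 = class_metrics(y_true, y_pred, positive=F)
print(f"fraud precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
print(f"accuracy={accuracy(y_true, y_pred):.3f}")
```

In this toy case, two of the three returns flagged as fraud are truly fraudulent, while only two of the four fraudulent returns are caught, so precision is higher than recall.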
Results indicated that the ANN slightly outperformed the SVM in predicting the “fraud” class, achieving high precision. The proposed framework outperformed models trained only on the original data, except for recall on the “not fraud” class using SVM. Experiments with other hyperparameter settings for the ANN and SVM yielded performance slightly below that of the best-performing model.
Taxpayers’ compliance scores were incorporated into the framework, where they support coverage assessment and an audit selection strategy. Despite the promising results, the authors acknowledge limitations, including the assumption of homogeneous behavior within sectors and business sizes, and compliance scores close to zero for many taxpayers.
In conclusion, the framework, which combines supervised and unsupervised models with behavioral compliance scores, showed promising results in the evaluation on Saudi tax data, with the ANN slightly ahead of the SVM on the “fraud” class. Although it outperformed models trained only on the original data, it relies on the assumption of homogeneous behavior within sectors. Nevertheless, the approach could strengthen tax authorities’ ability to detect fraud and may be applicable beyond Saudi Arabia.
Check out the Paper. All credit for this research goes to the researchers of this project.
Mahmoud is a PhD researcher in machine learning. He also holds a bachelor’s degree in physical science and a master’s degree in telecommunications and networking systems. His current areas of research concern computer vision, stock market prediction and deep learning. He produced several scientific articles about person re-identification and the study of the robustness and stability of deep networks.