Entity Resolution at Scale: Advanced Fuzzy Matching Techniques for Company and Project Data

Sravan Kumar Kunadi

Abstract

Scalable entity resolution has become a major challenge in modern data management, particularly when consolidating large volumes of disparate company and project data from different sources. Inconsistent naming schemes, typographical errors, omissions, missing fields, and formatting inconsistencies tend to produce duplicate, incomplete, or ambiguous records, which reduce data quality and limit the credibility of analysis. This paper explores advanced fuzzy matching methods for large-scale entity resolution, with an emphasis on improving the identification and unification of company and project entries in complex enterprise datasets. It examines a hybrid system that combines string similarity, phonetic encoding, token-based comparison, rule-based standardization, and machine-learning-based matching to improve the precision and recall of entity linkage. Special attention is paid to processing strategies that scale to high-volume data environments while preserving computational efficiency and accuracy. The proposed approach is evaluated on datasets of real-world-style structured and semi-structured data with noisy and incomplete attributes. The findings show that advanced fuzzy matching is considerably more effective than exact matching: it reduces false negatives and detects duplicate records much better when records are inconsistent. The findings further demonstrate the value of preprocessing, feature engineering, threshold tuning, and blocking strategies in achieving robust performance at scale.
This study contributes a practical and adaptable method for organizations that want to improve master data quality, reduce redundancy, and support sound decision-making through more accurate entity resolution in company and project information systems.
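
To make the components named in the abstract concrete, the following is a minimal sketch of how rule-based standardization, character-level string similarity, token-based comparison, blocking, and a match threshold might fit together. It is not the paper's implementation: the suffix list, the equal weighting, the first-letter blocking key, the 0.7 threshold, and all function names are illustrative assumptions, and the phonetic encoding and machine-learning stages are omitted for brevity.

```python
import re
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical suffix list for illustration; real data would need a fuller one.
LEGAL_SUFFIXES = {"inc", "incorporated", "corp", "corporation",
                  "ltd", "limited", "llc", "co", "company"}


def normalize(name):
    """Rule-based standardization: lowercase, drop punctuation and legal suffixes."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    kept = [t for t in tokens if t not in LEGAL_SUFFIXES]
    # Fall back to all tokens if the name consisted only of suffixes.
    return " ".join(kept or tokens)


def token_jaccard(a, b):
    """Token-based comparison: Jaccard overlap of whitespace-separated tokens."""
    ta, tb = set(a.split()), set(b.split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def hybrid_score(a, b, w_char=0.5):
    """Weighted mix of character-level and token-level similarity (weights assumed)."""
    na, nb = normalize(a), normalize(b)
    char_sim = SequenceMatcher(None, na, nb).ratio()
    return w_char * char_sim + (1 - w_char) * token_jaccard(na, nb)


def block_key(name):
    """Assumed blocking key: first character of the normalized name."""
    return normalize(name)[:1]


def find_duplicates(names, threshold=0.7):
    """Compare only names that share a block, keeping pairs above the threshold."""
    blocks = defaultdict(list)
    for name in names:
        blocks[block_key(name)].append(name)
    matches = []
    for members in blocks.values():
        for a, b in combinations(members, 2):
            score = hybrid_score(a, b)
            if score >= threshold:
                matches.append((a, b, round(score, 3)))
    return matches


if __name__ == "__main__":
    sample = ["Acme Corp.", "ACME Corporation", "Globex Ltd", "Apex Systems"]
    for pair in find_duplicates(sample):
        print(pair)
```

Blocking restricts comparisons to records that share a key, which avoids scoring every possible pair. A key this coarse will miss duplicates whose first character differs, so a practical system would likely combine several keys.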

How to Cite

Entity Resolution at Scale: Advanced Fuzzy Matching Techniques for Company and Project Data. (2023). International Journal of Research Publications in Engineering, Technology and Management (IJRPETM), 6(1), 8014-8022. https://doi.org/10.15662/IJRPETM.2023.0601003
