Abstract
In the era of big data, it is widely accepted that business departments can unlock the value of the business data they have accumulated. However, because data standards are not unified across business systems, problems such as disorganized metadata, data silos, and low-quality data keep emerging; they hinder the effective use of data and make governance necessary. Data lineage analysis is one of the key tasks of metadata management and is of great significance for data traceability and data governance. Traditional methods for constructing data lineage, however, often suffer from high computational complexity, poor accuracy, and high execution cost. To overcome these problems, a data lineage construction method based on data table similarity calculation is proposed: the three elements of a data table (its name, table structure, and data fields) are represented as text features, TF-IDF is used to calculate the similarity between data tables, and an improved Jaro-Winkler distance algorithm is then used to verify field overlap and table name similarity in order to construct the lineage relationships between data tables. The results show that the algorithm is effective at constructing data table lineage and facilitates data governance work.
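The pipeline outlined in the abstract can be illustrated with a minimal sketch. It rests on several assumptions that do not come from the paper: tables are plain dictionaries with a `name` and a list of `fields`, tokens are obtained by splitting on underscores, the TF-IDF weighting uses a smoothed IDF, the name check uses the standard Jaro-Winkler similarity (the paper's improved variant is not specified in the abstract), field overlap is measured against the smaller table, and the three thresholds are hypothetical values chosen only for illustration.

```python
import math
from collections import Counter


def tokens(table):
    """Flatten a table's name and field names into lowercase tokens (assumed scheme)."""
    words = table["name"].split("_")
    for field in table["fields"]:
        words.extend(field.split("_"))
    return [w.lower() for w in words if w]


def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (a dict) per token list, using smoothed IDF."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)
        vectors.append({
            t: (c / len(doc)) * (math.log((1 + n) / (1 + df[t])) + 1)
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0


def jaro(s, t):
    """Standard Jaro similarity in [0, 1]."""
    if s == t:
        return 1.0
    if not s or not t:
        return 0.0
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_hit, t_hit = [False] * len(s), [False] * len(t)
    matches = 0
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_hit[j] and t[j] == ch:
                s_hit[i] = t_hit[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    s_matched = [c for c, hit in zip(s, s_hit) if hit]
    t_matched = [c for c, hit in zip(t, t_hit) if hit]
    transpositions = sum(a != b for a, b in zip(s_matched, t_matched)) // 2
    return (matches / len(s) + matches / len(t) + (matches - transpositions) / matches) / 3


def jaro_winkler(s, t, p=0.1):
    """Standard Jaro-Winkler similarity: Jaro plus a bonus for a shared prefix (max 4 chars)."""
    j = jaro(s, t)
    prefix = 0
    for a, b in zip(s[:4], t[:4]):
        if a != b:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def field_overlap(a, b):
    """Share of the smaller table's fields that also appear in the other table."""
    fa, fb = set(a["fields"]), set(b["fields"])
    return len(fa & fb) / min(len(fa), len(fb)) if fa and fb else 0.0


def lineage_candidates(tables, sim_min=0.3, overlap_min=0.5, name_min=0.7):
    """Return table-name pairs that pass all three checks (thresholds are hypothetical)."""
    vectors = tfidf_vectors([tokens(t) for t in tables])
    pairs = []
    for i in range(len(tables)):
        for j in range(i + 1, len(tables)):
            if cosine(vectors[i], vectors[j]) < sim_min:
                continue  # coarse TF-IDF filter
            a, b = tables[i], tables[j]
            if field_overlap(a, b) >= overlap_min and jaro_winkler(a["name"], b["name"]) >= name_min:
                pairs.append((a["name"], b["name"]))
    return pairs
```

Calling `lineage_candidates` on a list of table dictionaries yields undirected candidate pairs only; deciding which table is upstream would need additional information that this sketch does not model.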
Key words
data lineage /
data governance /
metadata /
table similarity
潘奇, 蔡斯博, 魏芳芳.
基于数据表相似度计算的数据血缘构建方法[J]. 电脑与电信. 2024, 1(6): 11 https://doi.org/10.15966/j.cnki.dnydx.2024.06.015
PAN Qi, CAI Si-bo, WEI Fang-fang.
Building Method for Data Lineage Based on Data Table Similarity Calculation [J]. Computer & Telecommunication. 2024, 1(6): 11 https://doi.org/10.15966/j.cnki.dnydx.2024.06.015