Semantic Data Annotations to Support AI-enabled Data Processing; Text in English
Last updated: 14 Apr 2025
Abstract
A good understanding of a data set is a prerequisite for evaluating it correctly, e.g., in the context of operational decision-making. Various procedures, some of them standardized, exist for documenting the properties of a data set and thus enabling the necessary understanding of the data for each new use:
– Metadata documents the context of a data set, e.g., its creator or creation date.
– Syntactic annotations document requirements for data types and formatting.
– Content annotations document the correct interpretation of individual rows for the purpose of machine learning.

However, a major gap in correct data interpretation, especially if it is to be (partially) automated, is the semantic and structural understanding of a data set. This includes questions such as:
– Is a field a database key (primary/secondary key)?
– Does an empty field in a column represent the absence of data (“null”) or no countable activity (0)?
– Is it mathematically/substantively correct to sum the values of a column (example: sales yes, prices no)?

These questions are currently left to the interpretation of the user. The resulting ambiguity can lead to errors in manual evaluation and greatly complicates machine evaluation, e.g., using artificial intelligence. Standardized semantic and structural annotations of a data set can avoid errors in manual data evaluation and enable machine data evaluation.

© 2024 DIN German Institute for Standardization
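To make the three example questions from the abstract concrete, the following is a minimal sketch of what column-level semantic annotations could look like in Python. It is not the annotation format defined by this specification: the names `KeyRole`, `EmptyMeaning`, `ColumnAnnotation`, and `column_total` are hypothetical and chosen only for illustration. The sketch assumes each row is a dictionary, and that an automated evaluation should refuse to sum a column unless it is annotated as summable.

```python
from dataclasses import dataclass
from enum import Enum


class KeyRole(Enum):
    """Hypothetical annotation: is the column a database key?"""
    NONE = "none"
    PRIMARY = "primary"
    SECONDARY = "secondary"


class EmptyMeaning(Enum):
    """Hypothetical annotation: how to interpret an empty field."""
    MISSING = "missing"  # absence of data ("null")
    ZERO = "zero"        # no countable activity (0)


@dataclass(frozen=True)
class ColumnAnnotation:
    """Illustrative semantic annotation for a single column."""
    name: str
    key_role: KeyRole = KeyRole.NONE
    empty_means: EmptyMeaning = EmptyMeaning.MISSING
    summable: bool = False  # e.g. sales: True, prices: False


def column_total(rows, annotation):
    """Sum a column only if its annotation allows it.

    Returns (total, missing_count). Empty fields annotated as ZERO
    contribute 0 to the total; empty fields annotated as MISSING are
    skipped and counted so the caller can see the total is incomplete.
    """
    if not annotation.summable:
        raise ValueError(
            f"column {annotation.name!r} is not annotated as summable"
        )
    total = 0.0
    missing_count = 0
    for row in rows:
        value = row.get(annotation.name)
        if value is None or value == "":
            if annotation.empty_means is EmptyMeaning.MISSING:
                missing_count += 1
            continue
        total += float(value)
    return total, missing_count
```

With such annotations, calling `column_total` on a column annotated like sales (summable, empty means zero) yields a total, while calling it on a column annotated like prices (not summable) raises an error instead of silently producing a meaningless sum.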