How do you assess your AI’s training data?

    Generative AI tools are very much in the limelight at the moment. If your social media feed is anything like mine, you’ve been seeing a lot of ChatGPT, Stable Diffusion, Midjourney, DALL-E, and others, alongside the usual talk of it being a revolutionary new tool in a number of sectors.

    To work, these tools train on vast amounts of data pulled from the web, and this training data has led to some controversy, including a graphic artist objecting to her copyrighted original works being used as training data and the misogynistic portraits often produced by Lensa.

    Having good-quality training data for an AI system is one of the key challenges mentioned by stakeholders in BSI’s research. One interviewee told us that, especially when buying in an AI system, one must understand exactly what data has gone into building it. Without this understanding, the resulting risks could compromise the safety and robustness of the output.

    If you have experience of training or procuring AI, how do you assess the quality of the data used to train your AI system? What criteria do you use, and how much of a concern are the potential biases and/or legal and ethical risks it may contain?
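    As a starting point for the kind of assessment the question describes, a minimal sketch in Python might check a dataset for missing values, exact duplicates, and label imbalance before training. The function name, field names, and sample records below are all illustrative assumptions, not a prescribed method:

    ```python
    from collections import Counter

    def audit_training_data(rows, label_key):
        """Run basic quality checks on a list of dict records (illustrative only)."""
        total = len(rows)
        # Records with any missing (None or empty-string) field values
        missing = sum(1 for r in rows if any(v in (None, "") for v in r.values()))
        # Exact duplicate records (identical key/value pairs)
        seen = Counter(tuple(sorted(r.items())) for r in rows)
        duplicates = sum(count - 1 for count in seen.values())
        # Label distribution, to surface potential class imbalance
        labels = Counter(r[label_key] for r in rows)
        return {"total": total, "missing": missing,
                "duplicates": duplicates, "labels": dict(labels)}

    # Hypothetical sample data
    sample = [
        {"text": "good product", "label": "pos"},
        {"text": "good product", "label": "pos"},   # exact duplicate
        {"text": "", "label": "neg"},               # missing field
        {"text": "terrible", "label": "neg"},
    ]
    report = audit_training_data(sample, "label")
    ```

    Checks like these only cover surface-level quality; biases and legal or ethical risks in the data require deeper review of its provenance and content.
    
    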

    If you’ve used AI in your work—or indeed one of the generative AI tools mentioned above—how did you find it? Was the quality of the training data a consideration?
