There are few things worse than putting time, effort, and resources into building something only to find out you can’t use it. It’s worse still if you realise that you can’t use it because you overlooked a crucial step that should have been baked in from the beginning.
When it comes to building AI systems, you’ve got to take data compliance considerations into account from day one; otherwise, your project will be finished before it even begins.
Compliance regulations exist for good reason. They ensure that companies, governments, and researchers handle data responsibly and ethically. However, developing machine learning models that derive meaningful information from imagery is a challenging task in and of itself, and when designing these systems for production AI, compliance regulations can create additional headaches.
Production models run in the real world on out-of-sample data. They evaluate never-before-seen data to make predictions and generate outcomes, and they can only make predictions based on their previous training. They can’t reason when they encounter new information for which they have no frame of reference.
For the best performance, these models need to train on a vast amount and variety of data. However, different regulatory frameworks govern data in different ways. When building and training a model, the data used must comply with the regulatory framework of the place where the data originated, even if the model is being built or deployed elsewhere. For instance, some jurisdictions have stricter laws than others for protecting citizens' identifiable information. Models trained on data collected in these jurisdictions might not be able to be shipped elsewhere. Similarly, healthcare AI systems trained on US data must often meet HIPAA regulations, which have unique criteria for patients’ medical data and therefore create constraints around where the model can be deployed.
Machine learning engineers must successfully navigate the inherent tension between acquiring as much data as possible and abiding by compliance regulations. With that in mind, here are three compliance considerations to take into account when building production AI.
Partitioning Training Data
To follow best practices for data-centric AI, you should train the model on an abundance of data that is diverse and high-quality; however, you can’t just mix and match data as needed to fill out your training data set. You’ve got to be sure that the data you're using complies with the regulatory requirements of its origin. Within each country of origin, different institutions and governing bodies may also have different requirements for handling data.
For instance, let’s say you’re building a computer vision model for medical imaging. You’ve obtained a million images from patient data to train the model. However, one third of the images originated in the US, so that data is subject to HIPAA regulations, while another third originated in the EU, so it’s subject to GDPR regulations. Meanwhile, the last third is freely licensed.
Unfortunately, it would be difficult to train one model on all these images and have it still be compliant. You’d be better off partitioning the data into separate buckets and building three distinct models so that each one is compliant with the appropriate regulatory framework as determined by the data’s origins.
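As a rough illustration of that partitioning step, here is a minimal Python sketch. The `ImageRecord` type, the origin tags, and the `FRAMEWORK_BY_ORIGIN` mapping are all hypothetical names invented for this example; a real pipeline would take the mapping from legal review rather than a hard-coded dictionary.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ImageRecord:
    image_id: str
    origin: str  # hypothetical origin tag, e.g. "US", "EU", "OPEN"


# Hypothetical mapping from data origin to governing framework.
# Illustrative only: real mappings need legal review.
FRAMEWORK_BY_ORIGIN = {"US": "HIPAA", "EU": "GDPR", "OPEN": "open-licence"}


def partition_by_framework(records):
    """Group records into one bucket per regulatory framework."""
    buckets = defaultdict(list)
    for record in records:
        if record.origin not in FRAMEWORK_BY_ORIGIN:
            # Refuse to guess: data of unknown origin goes in no bucket.
            raise ValueError(
                f"Unknown origin for {record.image_id}: {record.origin}"
            )
        buckets[FRAMEWORK_BY_ORIGIN[record.origin]].append(record)
    return dict(buckets)


records = [
    ImageRecord("img-001", "US"),
    ImageRecord("img-002", "EU"),
    ImageRecord("img-003", "OPEN"),
    ImageRecord("img-004", "US"),
]
buckets = partition_by_framework(records)
for framework, items in buckets.items():
    print(framework, [r.image_id for r in items])
```

Each bucket can then feed its own training run. Raising an error on an unknown origin is a deliberate choice in this sketch: it is safer to stop than to let untagged data slip into a model.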
You’ll also need to be able to show your work and prove that you followed the respective compliance rules from the ground up, so keep a record of the training data used for each model. Traceability can create a significant challenge from an engineering perspective. It’s a cumbersome and difficult task, but it’s also a serious consideration when building production AI. If you spend resources building a model only to realise later on that one piece of data in the training dataset wasn’t compliant, you’ll have to scrap that model and go through the entire building process again, retraining without the noncompliant data.
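One possible way to keep such a record is a small manifest per model, sketched below in Python. The function names, field names, and model names are assumptions made for this example, not part of any particular tool. The manifest stores the sorted image IDs plus a SHA-256 digest of that list, so you can later show exactly which data a model saw and find every model touched by a problematic item.

```python
import hashlib
import json


def build_manifest(model_name, framework, image_ids):
    """Record which images a model was trained on (hypothetical schema)."""
    ids = sorted(image_ids)
    # Digest of the sorted ID list: the same set of IDs always gives the same hash.
    digest = hashlib.sha256(json.dumps(ids).encode("utf-8")).hexdigest()
    return {
        "model": model_name,
        "framework": framework,
        "image_ids": ids,
        "sha256": digest,
    }


def models_affected_by(manifests, image_id):
    """Return the names of all models whose training set includes image_id."""
    return [m["model"] for m in manifests if image_id in m["image_ids"]]


manifests = [
    build_manifest("model-us", "HIPAA", ["img-001", "img-004"]),
    build_manifest("model-eu", "GDPR", ["img-002"]),
]

# If img-002 turns out to be noncompliant, find what needs retraining.
print(models_affected_by(manifests, "img-002"))
```

In practice the manifest would be written to durable storage alongside the model artefact. The in-memory list here only keeps the sketch short.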
Auditability for Data Annotations
When putting AI into production, you’ve got to consider the auditability of the data, not just the models. Make sure there’s an exact audit trail of how each individual piece of training data and its label was generated because both the labels and data must be compliant with the process for which you’re trying to optimise.
For example, when it comes to developing medical AI, some regulatory bodies have implemented an approval process for algorithms, which requires independent expert reviews. These procedures are in place to ensure that the model learns to make predictions from training data that has either been labelled or reviewed by a certified professional.
As such, when medical companies build production AI, a designated number of medical specialists must review the labelled training data before the company is allowed to use it in downstream model building applications. They must also keep a record of how each piece of data was labelled, who it was reviewed by, and how many times it was reviewed.
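A minimal sketch of such a record in Python might look like the following. The `LabelAudit` and `Review` classes, their fields, and the threshold of two reviews are hypothetical; the number and type of reviews actually required depend on the regulator and the approval process in question.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Review:
    reviewer_id: str
    certified: bool  # whether the reviewer is a certified professional
    reviewed_at: datetime


@dataclass
class LabelAudit:
    """Audit trail for one labelled image (hypothetical schema)."""
    image_id: str
    label: str
    labelled_by: str
    reviews: list = field(default_factory=list)

    def add_review(self, reviewer_id, certified):
        # Append-only: every review is kept with a UTC timestamp.
        self.reviews.append(
            Review(reviewer_id, certified, datetime.now(timezone.utc))
        )

    def is_approved(self, required_reviews):
        # Count distinct certified reviewers, so one person reviewing
        # twice does not satisfy a two-reviewer requirement.
        certified_reviewers = {
            r.reviewer_id for r in self.reviews if r.certified
        }
        return len(certified_reviewers) >= required_reviews


audit = LabelAudit("img-001", "fracture", labelled_by="annotator-7")
audit.add_review("dr-a", certified=True)
audit.add_review("dr-a", certified=True)  # repeat review by the same person
audit.add_review("dr-b", certified=True)

print(len(audit.reviews), audit.is_approved(required_reviews=2))
```

The record keeps all three facts the text calls for: how the item was labelled, who reviewed it, and how many times it was reviewed.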
The Release Lifecycle: From Annotation to Deployment
Before building the model, it’s wise to consider the localities that will be involved in each stage of the production cycle. Ask yourself: Where is the model being trained? Is it being trained in the same jurisdiction as where the labels and training data were generated? Where is the model being deployed after training?
From a production standpoint, the answers to these questions are important for preventing issues down the road. For instance, if your training data lives in the US, but your model training infrastructure is set up in the UK, you need to know if you’re allowed to process that data by sending it to the UK. Even if you have no intention of storing data in the UK, you still have to establish whether you’re allowed to process that data there (e.g. to train the model and run various experiments on it).
The practical implication for AI companies is that they either have to have model infrastructure deployed in different jurisdictions so that they can process data locally, or they have to ensure that they have data processing agreements in place with customers, which clearly state whether and where they intend to process the data.
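The distinction between storing and processing can be made explicit in code. The sketch below assumes a hypothetical `ProcessingAgreement` record that lists, per customer, the regions where storage and processing are permitted; the names and regions are illustrative, and the real terms would come from the signed agreement rather than from code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProcessingAgreement:
    """Regions a customer has agreed to (hypothetical schema)."""
    customer: str
    storage_regions: frozenset
    processing_regions: frozenset


def can_store(agreement, region):
    return region in agreement.storage_regions


def can_process(agreement, region):
    return region in agreement.processing_regions


def check_training_job(agreement, training_region):
    """Raise before any data moves if processing is not permitted there."""
    if not can_process(agreement, training_region):
        raise PermissionError(
            f"{agreement.customer}: processing not permitted in {training_region}"
        )
    return True


# Example: data stays stored in the US, but may be processed in the UK too.
agreement = ProcessingAgreement(
    customer="acme-health",
    storage_regions=frozenset({"US"}),
    processing_regions=frozenset({"US", "UK"}),
)

print(can_process(agreement, "UK"), can_store(agreement, "UK"))
```

Running a check like `check_training_job` at the start of a pipeline means a misconfigured job fails before any data has been transferred.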
Some jurisdictions have much more stringent rules around data processing and storage than others, and it’s important to know the regulations around data collection, usage, processing, and storage for all the relevant jurisdictions.
Compliance regulations can create headaches for building production AI because they add operational overhead when making the model work in practice. However, it’s best to know the rules from the start and decrease your risk of having to abandon a model for falling afoul of compliance regulations.
At Encord, we’ve worked with multiple customers in different jurisdictions and with different data requirements. With our user-friendly, computer-vision-first platform and in-house expertise, we help companies develop their training data pipeline while removing their compliance headaches.
Machine learning and data operations teams of all sizes use Encord’s collaborative applications, automation features, and APIs to build models & annotate, manage, and evaluate their datasets. Check us out here.