If you’ve been evaluating new security tools, you’ve undoubtedly heard machine learning (ML) touted many times. It is fast becoming the backbone of all modern software, security systems included. Thus, it appears that resistance is futile, as some version of Skynet is likely inevitable—although which version ultimately manifests depends entirely on how it was trained. But that’s the future and we are in the Land of the Here and Now.
In the interest of protecting our charges, be they companies or individuals or mankind at large, let’s focus on dutifully and correctly training the machine. On that note, here are three tips to training ML responsibly now.
Garbage in, monster out. The old programming adage “garbage in, garbage out” still applies, but it’s magnified in machine learning. The quality of the training data is so vital that “garbage in, monster out” is a likely outcome. Consider, for example, that exposure to Twitter taught Microsoft’s Tay chatbot to be racist. It could have been worse; Twitter could have taught it to be a terrorist. In either case, it’s hardly the makings of a great security AI system.
Pay a lot of attention to the quality of data used in ML training. Whether your company or a third party does the actual training, check and recheck data quality. If you accidentally teach it the wrong thing, you’ll end up with a monster you may or may not be able to control.
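As a minimal sketch of what "check and recheck" might look like in practice, the Python below audits a small set of labeled rows for three common problems: missing values, exact duplicates, and identical features carrying conflicting labels. The function name `audit_training_data`, the row layout, and the sample network records are all hypothetical, chosen only for illustration; real checks depend on your data.

```python
from collections import Counter, defaultdict

def audit_training_data(rows):
    """Return simple quality findings for (features, label) training rows."""
    # Rows with a missing label or any missing feature value.
    missing = sum(
        1 for features, label in rows
        if label is None or any(value is None for value in features)
    )

    # Exact duplicates: every copy beyond the first counts once.
    counts = Counter((tuple(features), label) for features, label in rows)
    duplicates = sum(n - 1 for n in counts.values())

    # Conflicts: the same features appearing with more than one label.
    labels_by_features = defaultdict(set)
    for features, label in rows:
        labels_by_features[tuple(features)].add(label)
    conflicts = sum(1 for labels in labels_by_features.values() if len(labels) > 1)

    return {"missing": missing, "duplicates": duplicates, "conflicts": conflicts}

# Hypothetical records: (port, protocol, bytes transferred), label.
sample = [
    ((443, "tcp", 1200), "benign"),
    ((443, "tcp", 1200), "benign"),   # exact duplicate
    ((22, "tcp", 90000), "attack"),
    ((22, "tcp", 90000), "benign"),   # same features, conflicting label
    ((53, "udp", None), "benign"),    # missing value
]
report = audit_training_data(sample)
print(report)
```

Checks like these only catch mechanical problems. They will not tell you whether the labels themselves are right, which still needs human review.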
Teach only what is known, but teach it continuously. Make sure the training data contains lots of examples of known attacks. Why? Because the machine will only learn what it is taught, and it will also only work within the parameters of what it learned. Make sure everything it needs to learn is in that initial training data set.
That also means that machine learning is terrible at predicting new attacks. Predictive analytics can predict known types of attacks based on early activity that humans may miss, but it cannot predict the rise of unknown attacks. This means that training the machine will always remain an ongoing exercise to keep it up to date with known attacks.
Testing ML drives the teacher mad. Understand that no matter how much you or your team know about app development and testing, precious little of that applies to machine learning. Testing machine learning code is maddening because you can’t assume two runs will produce the same output, ever.
You read that right. It’s damn near impossible to get the same output from two identical runs. This is known as the machine learning reproducibility crisis.
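One partial mitigation is to pin the random seed and to test against a tolerance rather than an exact value. The sketch below is a toy stand-in, not a real training pipeline: `train_toy_model` is a hypothetical function that fits a single weight with a random starting point and shuffled updates. Seeding only controls the randomness drawn from that one generator; real frameworks may have other sources of variation (parallel hardware, library versions) that a seed alone does not remove.

```python
import random

def train_toy_model(data, seed=None):
    """Hypothetical stand-in for training: random init plus shuffled updates."""
    rng = random.Random(seed)          # private generator, so the seed is explicit
    weight = rng.uniform(-1.0, 1.0)    # random starting point
    examples = list(data)
    for _ in range(20):                # 20 passes over the data
        rng.shuffle(examples)          # random visiting order
        for x, y in examples:
            weight -= 0.01 * (weight * x - y) * x   # gradient step on squared error
    return weight

data = [(1.0, 2.0), (2.0, 4.1), (3.0, 5.9)]

run_a = train_toy_model(data, seed=42)
run_b = train_toy_model(data, seed=42)   # same seed as run_a
run_c = train_toy_model(data, seed=7)    # different seed

print(run_a == run_b)                    # identical seed, identical result here
print(abs(run_a - run_c))                # different seeds usually differ slightly
```

A test written as "the learned weight is within 0.5 of 2.0" survives a change of seed, whereas a test that expects one exact number does not.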
Things get dicier when anyone makes changes to the machine learning code or the training data.
The reproducibility problem is also stifling for research experimentation. Because changes to code or training data can be hard to roll back, trying different variations is a lot riskier, just as coding without source control raises the cost of experimenting with changes.
Be hyperaware of this problem going in so you can plan testing and future changes accordingly.
Teams that are serious about using models in production put a great deal of time and effort into ensuring their training can be reproduced, but the problem is that it’s still a very manual process. There’s no equivalent to source control, or even agreed-upon best practices, for archiving a training process so that it can be successfully re-run in the future.
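In the absence of a standard, one possible manual convention is to write a small manifest alongside every training run that records a hash of the training data, the settings used, and the seed. The sketch below is an assumption about what such a record could contain, not an established practice; `fingerprint`, `build_run_manifest`, and the field names are hypothetical.

```python
import hashlib
import json
import platform

def fingerprint(obj):
    """SHA-256 of a JSON-serializable object, with keys sorted for stability."""
    payload = json.dumps(obj, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

def build_run_manifest(training_rows, hyperparameters, seed):
    """Collect the details needed to recognize, and later re-run, this training run."""
    return {
        "data_sha256": fingerprint(training_rows),   # changes if any row changes
        "row_count": len(training_rows),
        "hyperparameters": hyperparameters,
        "seed": seed,
        "python_version": platform.python_version(),
    }

# Hypothetical training rows and settings.
rows = [[443, "tcp", "benign"], [22, "tcp", "attack"]]
manifest = build_run_manifest(rows, {"learning_rate": 0.01, "epochs": 20}, seed=42)
print(json.dumps(manifest, indent=2))
```

Storing the manifest next to the trained model at least makes it possible to tell whether two runs used the same data and settings, even if it cannot guarantee identical results.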
The point is that testing is a difficult problem and you need to plan ahead for how you’re going to handle it.