The course I’m currently working through (Scalable Machine Learning) and other recent online courses (including Introduction to Big Data with Apache Spark) are among the many high-quality courses available on edX, Coursera, Udacity and so on. Prestigious universities are offering free online courses so that anyone can learn about advanced topics, not just in IT but in other subjects as well.
There is one disappointment with the current module, though. It looks at click-through rates and notes that the source data has been anonymised, and that makes the gap obvious: the technical side is covered in good depth (the mathematical foundations, basic implementations, the process for scaling to a distributed setup and the functions to use), but there is far less coverage of the ethical and legal aspects. This is perhaps to be expected from such a short course (it should take around 20-25 hours of study). However, a little more discussion would be good, if only enough to raise awareness of the issues.
On the legal side, it may be easiest to set up moderately large-scale data processing in the cloud initially, and it’s generally cheaper to run the servers in the US than in Europe. However, if you transfer personally identifiable data from Europe to the US, you risk falling foul of data protection legislation. What is still largely untested is what liability you might have if supposedly depersonalised data proves specific enough to identify individuals. Unless a course touches on the basic ethical considerations around data privacy, students may not be encouraged to find out their obligations or to consider the rights and wrongs of this.
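As a rough illustration of how "depersonalised" data can still single people out, here is a minimal Python sketch that counts how many records share each combination of indirect attributes (often called quasi-identifiers; the smallest group size is the "k" in k-anonymity). The field names, the sample records and the `smallest_group_size` helper are all invented for this example and are not taken from the course data. It is a sanity check, not a legal test of anonymity.

```python
from collections import Counter


def smallest_group_size(records, quasi_identifiers):
    """Return the size of the smallest group of records that share the same
    combination of quasi-identifier values. A result of 1 means at least
    one record is unique on those fields."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values()) if groups else 0


# Hypothetical "anonymised" click log: no names or email addresses,
# but the remaining fields may still narrow things down to one person.
clicks = [
    {"postcode_area": "LS1", "age_band": "30-39", "device": "mobile"},
    {"postcode_area": "LS1", "age_band": "30-39", "device": "mobile"},
    {"postcode_area": "YO7", "age_band": "40-49", "device": "desktop"},
    {"postcode_area": "YO7", "age_band": "40-49", "device": "tablet"},
]

k_coarse = smallest_group_size(clicks, ["postcode_area", "age_band"])
k_fine = smallest_group_size(clicks, ["postcode_area", "age_band", "device"])
print(k_coarse, k_fine)
```

In this made-up sample, adding one more field is enough to make some records unique, which is the kind of thing worth checking before treating a data set as anonymous.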
There is a point where personalisation could potentially become creepy or intrusive. If you searched for a pregnancy test, you might not want to receive other pregnancy-related results, particularly on a shared computer where they might make your condition apparent to others whom you might not wish to tell yet. Similarly, targeting that implies or reveals that someone has a stigmatised condition such as HIV, or that leads to questions about their sexual, religious or other views, could in certain circumstances cause significant harm.
In the pursuit of better targeting, better classification and better modelling, we need to make sure that what we do is not just commercially justifiable but ethically and legally justifiable as well. Those of us who are registered professionals (members of BCS or another recognised professional group) have an obligation under our professional code of conduct to operate in an ethical and legal manner. This means that we should raise awareness of these issues if they arise and, if necessary, refuse to perform work which is in breach of those standards.
I don’t take this view lightly, and have sought to raise legal concerns which have arisen in previous contracts. It’s not always easy. As a contractor without the protections of a permanent employee, disagreeing with decision-makers may mean risking your income. However, much like the expectation to perform some pro bono work as a service back to your community, our professional obligations aren’t just about what is best for us personally, but about what is honest and right. Many IT managers haven’t previously studied the legal and ethical concerns, so there is generally scope to raise these issues.
As we make more and more use of the data we have available, we have to consider whether correct permission has been given for the use of the data in this way, whether it is being held in the correct locations, whether the correct access restrictions are in place, and what the correct response would be to a Freedom of Information request. These are just a few of the considerations. They apply not just when gathering the data, but also to how it’s used. If you look carefully, various advertising and personalisation systems offer an option to receive only more generic results and turn off the finer-grained personalisation. The same option should be available on your own site if you’re looking at something like a recommendation system based on past purchases or item views.
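To make the opt-out idea concrete, here is a minimal Python sketch of a recommender that falls back to generic results when personalisation is switched off. Everything in it is hypothetical: the `recommend` function, the `personalisation_enabled` flag, the `viewed` history and the sample items are assumptions for illustration, not the API of any real system.

```python
from collections import Counter


def recommend(user, popular_items, co_viewed, limit=3):
    """Return up to `limit` item ids, using only generic popular items
    when the user has not enabled personalisation."""
    if not user.get("personalisation_enabled", False):
        # Opted out, or no recorded choice: ignore the viewing history.
        return popular_items[:limit]

    viewed = user.get("viewed", [])
    scores = Counter()
    for item in viewed:
        for related in co_viewed.get(item, []):
            if related not in viewed:
                scores[related] += 1

    personalised = [item for item, _ in scores.most_common(limit)]
    # Top up with popular items if the history yields too few suggestions.
    for item in popular_items:
        if len(personalised) >= limit:
            break
        if item not in personalised:
            personalised.append(item)
    return personalised


# Hypothetical catalogue data.
popular = ["kettle", "toaster", "mug"]
co_viewed = {"tent": ["sleeping_bag", "stove"], "stove": ["sleeping_bag"]}

opted_in = {"personalisation_enabled": True, "viewed": ["tent", "stove"]}
opted_out = {"personalisation_enabled": False, "viewed": ["tent", "stove"]}

print(recommend(opted_in, popular, co_viewed))
print(recommend(opted_out, popular, co_viewed))
```

Note that the flag defaults to off when it is missing, so a user with no recorded preference gets generic results. Whether personalisation should be opt-in or opt-out is a policy decision for your own site; the sketch simply picks the more cautious default.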
It’s a complex area, and the legal details go beyond my limited understanding of the legislation (I’m not a lawyer), but perhaps the ethical questions can be as much of a guide as the legal ones? Is this personal? Would I want my data used in this way? Am I acting for or against the best interests of the person or people whose data I’m analysing? Those are perhaps easier questions to answer without studying endless legislation, but no less important if you want to do things right.