It’s 10 PM. Do you know what your model is doing?

“Customers like you have also …”  This concept appears, explicitly or implicitly, at many points in the web-of-our-lives, aka the Internet. Specific corporations and aggregate operations are building increasingly sophisticated models of individuals.  Not just “like you”, but “you”! Prof. Pedro Domingos of UW, in his book “The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World”, suggests that this model of you may become a key factor in your ‘public’ interactions.

Examples include having LinkedIn add a “find me a job” button that will conduct interviews for relevant open positions and provide you with a list of the best.  Or perhaps locating a house, a car, a spouse … well, maybe some things are better done face-to-face.

Apparently an Asian firm, “Deep Knowledge”, has appointed a virtual director to its board. In this case it is a construct designed to detect trends that the human directors might miss.  However, one suspects that Apple might want a model of Steve Jobs around for occasional consultation, if not back in control again.

Predictive Analytics – Rhinos, Elephants, Donkeys and Minority Report

The IEEE Computer Society published “Saving Rhinos with Predictive Analytics” in both IEEE Intelligent Systems and the more widely distributed ‘Computing Edge’ (a compendium of interesting papers taken from 13 of the CS publications and provided to members and technologists at no cost).  The article describes how data-based analysis of both rhino and poacher activity, in concert with AI algorithms, can focus enforcement activities in terms of timing and location, and hopefully save rhinos.

For those outside of the U.S.: the largest populations of elephants (Republicans) and donkeys (Democrats) are in the U.S., these animals being the symbols of the respective political parties. Now, on the brink of the 2016 presidential primaries, these critters are being aggressively hunted (OK, actually sought after for their votes).  Not surprisingly, the same tools are used to locate, identify and predict the behaviour of these persons.   When I was young (1964) I read a book called The 480, which described the capabilities of that timeframe for computer-based political analysis and the targeting of “groups” required to win an election. (480 was the number of groups into which the 68 million voters of 1960 were divided, in order to identify which groups you needed to attract to win the election.)   21st century analytics are a bit more sophisticated, with as many as 235 million groups, or one per potential voter (and over 130 million of them likely to vote).

A recent kerfuffle between the Sanders and Clinton campaigns over “ownership/access” to voter records stored on a computer system operated by the Democratic National Committee reflects the importance of this data.  By cross-connecting (data mining) registered voter information with external sources such as web searches, credit card purchases, etc., the candidates can mine this data for cash (donations) and later for votes.  A change of a few percentage points in delivering voters to the polls (both figuratively, and by providing rides where needed) in key states can affect the outcome. So knowing each individual is a significant benefit.

Predictive Analytics is saving rhinos, and affecting the leadership of superpowers. But wait, there’s more.  Remember the movie “Minority Report” (2002)? On the surface, the movie starts with what appears to be computer technology able to predict future crimes by specific individuals, who are then arrested to prevent the crimes.  (Spoiler alert) the movie actually posits that a group of psychics was the real source of insight.  This is consistent with the original story (Philip K. Dick) from 1956, prior to The 480 and the emergence of the computer as a key predictive device.  Here’s the catch: we don’t need the psychics, just the data and the computers.  Just as a probability can be assigned to a specific individual voting for a specific candidate, or to a specific rhino getting poached in a specific territory, we are reaching the point where aspects of the ‘Minority Report’ predictions can be realized.

Oddly, in the U.S., governmental collection and use of this level of Big Data is difficult due to privacy illusions, and probably bureaucratic stovepipes and fiefdoms.   These problems do not exist in the private sector.  Widespread data collection on everybody at every opportunity is the norm, and the only limitation on sharing is determining the price.  The result is that your bank or insurance company is more likely than the government to be able to predict your likelihood of being a criminal, a terrorist, or even a victim of a crime.  Big Data superpowers like Google, Amazon, Facebook and Acxiom have even more at their virtual fingertips.

Let’s assume that sufficient data can be obtained, and robust AI techniques applied, to identify a specific individual with a high probability of a problematic event, such as committing or being the victim of a crime in the next week.  And suppose this data is implicitly or even explicitly in the hands of some corporate entity.  Now what?  What actions should said corporation take? What probability is needed to trigger such actions? What liability exists (or should exist) for failure to take such actions?
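
One reason the “what probability?” question is hard: when the event is rare, even a very accurate model flags mostly innocent people. Here is a back-of-the-envelope sketch in Python, with every number assumed purely for illustration (a 1-in-1,000 base rate, a model that catches 99% of true cases and wrongly flags 1% of everyone else). The function name and parameters are mine, not from any real system.

```python
def positive_predictive_value(base_rate, sensitivity, false_positive_rate):
    """Fraction of flagged individuals who really are cases (Bayes' rule)."""
    true_positives = base_rate * sensitivity
    false_positives = (1 - base_rate) * false_positive_rate
    return true_positives / (true_positives + false_positives)


# All three inputs are illustrative assumptions, not measured values.
ppv = positive_predictive_value(
    base_rate=0.001,           # assumed: 1 person in 1,000 is a true case
    sensitivity=0.99,          # assumed: model flags 99% of true cases
    false_positive_rate=0.01,  # assumed: model flags 1% of everyone else
)
print(f"{ppv:.1%}")  # prints 9.0%
```

Under these assumed numbers, only about 9% of the people flagged would actually be involved in the event, which is worth keeping in mind when deciding what action a flag should trigger.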

These are issues that the elephants and donkeys will need to consider over the next few years; we can’t expect the rhinos to do the work for us.  We technologists may also have a significant part to play.

Amazon vs Hachette – Tech Consolidation Impact on Emerging Authors

The dispute between Amazon and book publisher Hachette reached a settlement in November.  The Authors United group, formed by a number of top-selling authors including Stephen King, sent a letter to the Amazon Board of Directors expressing their concern with “sanctions” directed at Hachette authors, including “refusing pre-orders, delaying shipping, reducing discounting, and using pop-up windows to cover authors’ pages and redirect buyers to non-Hachette books”.  This group has not yet resolved their concerns about the impact of this applied technology. There are financial and career implications from the loss of Amazon as a channel for sales, even for the months of this dispute.  These include reduced sales for proven best-selling authors; for first-time authors, reduced sales can be the end of their career.

The Bangor Daily News indicates this group is pressuring the Federal government and exploring a lawsuit to address some of these damages.

A key question is the monopolistic potential of having a single major channel for selling a class of products.  Amazon is reported in this article as being the source of 41% of new book sales in the U.S., and is reported by some best-selling authors as having “disappeared” them, with searches for their names on Amazon yielding no results.

Data mining makes it possible to associate authors with publishers, and to manipulate their visibility via online sales channels.  There are legal and ethical issues here that extend beyond the immediate Hachette case.  Apple is continuing its e-book antitrust battle, claiming a “David vs Goliath” position where Amazon holds 90%+ of e-book sales.

Both Apple and Amazon hold significant control over critical channels that authors (of books, software, etc.) need not only to sell their products, but even to become visible to potential readers/users/consumers. Both are for-profit companies that apply their market power and technology to maximize their profits (which is what capitalism and stockholders expect).  The creative individuals producing indie or even traditional-channel creations, who might be expected to benefit from the global access of the Internet, can get trampled when these mammoths charge towards their goals.

Is the Internet creating new opportunities, or consolidating to create concentrated bastions of power?  (Or both?)   Oddly this comes around to parallel issues with “net neutrality” and how the entertainment industry is relating to Internet channels — perhaps there is a broader set of principles involved.


US States use Big Data to Catch Big Thieves

Various states are using big data tools, such as the LexisNexis database, to identify folks who are filing false tax returns.  A recent posting at the Pew Trusts indicates that “Indiana spotted 74,782 returns filed with stolen or manufactured identities as of the end of last month with its new identity-matching effort. Without it, the Department of Revenue caught just 1,500 cases of identity theft out of more than 3 million returns filed in all of 2013.”

The article goes on to outline other ways big data is being used by the states.  This can include the focused use of third-party data sets (e.g. for tax refund validation), or ways to span state data sets to surface “exceptions”.  A state can cross-check driver’s license records with car registrations, property tax records, court records, etc. … to ultimately identify wrongdoers.

This harkens back to my own family’s experience, when my daughter was working for a catalog sales company.  She was assigned the task of following up on ‘invalid credit cards’ to get valid entries so that the items could ship.  She discovered, via her own memory of contact data, that a number of invalid credit cards, used with a variety of names, were going to a single address.  She contacted the credit card companies to point out this likely source of fraud, only to find that they treated credit fraud as a cost of doing business and were not interested in pursuing an apparent abuser.  Big data, appropriate queries and a willingness to pursue abuse could yield much greater results than the coincidental awareness of an alert employee.
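
The pattern my daughter noticed from memory is the kind of thing a simple query can surface. Here is a minimal sketch in Python, assuming a hypothetical list of declined orders with `name` and `address` fields. The field names, the sample records and the threshold of three names are all invented for illustration, not taken from any real system.

```python
from collections import defaultdict


def flag_suspicious_addresses(declined_orders, min_names=3):
    """Return addresses that received declined orders under many different names."""
    names_by_address = defaultdict(set)
    for order in declined_orders:
        # Normalize case and spacing so trivial differences do not hide a match.
        address = " ".join(order["address"].lower().split())
        names_by_address[address].add(order["name"].strip().lower())
    return {
        address: sorted(names)
        for address, names in names_by_address.items()
        if len(names) >= min_names
    }


# Hypothetical records, invented for illustration only.
orders = [
    {"name": "A. Smith", "address": "12 Elm St"},
    {"name": "B. Jones", "address": "12  elm st"},
    {"name": "C. Brown", "address": "12 Elm St "},
    {"name": "D. White", "address": "99 Oak Ave"},
]
print(flag_suspicious_addresses(orders))
# prints {'12 elm st': ['a. smith', 'b. jones', 'c. brown']}
```

A real system would need far better address matching than lowercasing and collapsing spaces, and a flag like this is only a lead for a human to follow up on, not proof of fraud.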

So … here are the questions that come to my mind:

  1. What are the significant opportunities for pursuing ne’er-do-wells with big data, either by governments or by industry?
  2. What are the potential abuses that may emerge from similar approaches being applied in less desirable ways (or with more controversial definitions of ne’er-do-well)?

Genomics, Big Data and Google

Google is offering cloud storage and genomics-specific services for genome databases.  It is unclear (to this blogger) what levels of anonymity can be assured with such data.  Presumably a full sequencing (perhaps 100 GB of data) is unique to a given person (or set of identical twins, since this does not, yet, include epigenetic data), providing a specific personal identifier even if it lacks a name or social security number. Researchers can share data sets with team members, colleagues or the public.  The National Cancer Institute has moved thousands of patient datasets to both Google and Amazon cloud storage.

So here are some difficult questions:

If the police have a DNA sample from a “perp”, and search the public genome records, and find a match, or parent, or … how does this relate to U.S. (or other jurisdiction) legal rights?  Can Google (or the researcher) be forced to identify the related individual?

Who “owns” your DNA dataset? The lab that analyses it, the researcher, you?  And what can these various interests do with that data?  In the U.S. there are laws that prohibit discrimination in health insurance based on this data, but not in long-term care insurance, life insurance or employment decisions.

Presumably for a cost of $1000 or so I can have any DNA sample sequenced: off of a glass from a restaurant, or from some other source that was “left behind”.  Now what rights, limits, etc. are implicit in this collection and the resulting dataset?  Did you leave a coffee cup at that last staff meeting?

The technology is running well ahead of our understanding of the implications here — it will be interesting.

Enslaved by Technology?

A recent “formal” debate in Australia, “We are Becoming Enslaved by our Technology”, addresses this question (90 min).  It is a look at the upside and downside of technological advances, with three experts addressing both sides of the question.

One key point made by some of the speakers is the lopsided impact that technology may have towards government abuse.  One example is captured in the quote “a cell phone is a surveillance device that also provides communications” (quoted by Bernard Keane).  In this case the question is who benefits from continuous location, connectivity, app and search presence.

Much of the discussion focuses on the term “enslave” … as opposed to “control”.  And also on the question of choice … to what degree do we have “choice”, or are we perhaps trying to absolve ourselves of responsibility by putting the blame on technology?

Perhaps the key issue is the catchall “technology”.  There are examples of technology, vaccines for example, where the objectives and ‘obvious’ uses are beneficial (though one can envision abuse by corporations/countries creating vaccines). And then there are the variations in weapons, eavesdropping, big-data-analysis vs privacy, etc.  Much of technology is double-edged, with impacts both “pro and con” (and of course individuals have different views of what a good impact is).

A few things are not debatable (IMHO):
1. the technology is advancing rapidly on all fronts
2. the driving interests tend to be corporate profit, government agendas and in some cases inventor curiosity and perhaps at times altruistic benefits for humanity.
3. there exists no coherent way to anticipate the unintended consequences much less predict the abuses or discuss them in advance.

So, are we enslaved? …. YOU WILL RESPOND TO THIS QUESTION! (Oh, excuse me…)


Your Data in the Cloud, like it or not!

Various suppliers of products & services are now integrating cloud operations as necessary aspects of their offerings.  This has raised questions in an article in Scientific American, “The Curse of the Cloud” (hard copy) and related online entry “We are Forced to Use Cloud Services” about who is in control of your data. Before we disparage this situation, we (technologists) need to consider why it is happening.

From a user perspective, having a “common” file structure between devices can be a real advantage.  My music is there, and so are the files I need for this meeting, when I’m on the road. For those of us using online email services (Gmail, Hotmail, etc.) this is a common concept. Once upon a time (like 3 years ago) I’d email myself a file just so it was stored in the implicit cloud. Google has made that explicit with Google Drive and the apps environment.  And specific services like iTunes/iCloud, Microsoft with Windows 8, and Chrome/Android have also been structured to preserve “context” in the cloud that can span multiple systems.  This includes the automated password completion data your browser so kindly provides.  In short, your bank account access is now only as safe as your Gmail or other auto-login product’s protection. Ditto your online health records, etc.

Why?  The technological answer is simple: single devices fail, so automate backup. Also, apps that run in the cloud can be automatically maintained, with less user fuss and more security. But there is a dark cloud above this silver lining called “vendor lock-in”.   Microsoft has constantly been challenged with “how can we sell you another copy of xyz?” (DOS, Word, etc.) Historically this has been accomplished by periodic updates and eventually making an old version obsolete in terms of support, security, file formats, etc. Today the solution is Office 365, where you subscribe to the use of the software, and must connect every 30 days or “lose it”.

“Your Data Are Ours”, and of course the data you store in these locations has limited practical portability to a competing environment.  Typically you can export, or “save as”, a file in a common format and move it, but with some data files (music) built-in DRM may prevent that.  (An interesting example was Amazon’s removal of Orwell’s 1984 from all of the Kindles that had downloaded it when they found they were in violation of copyright. There is some irony in giving Big Brother that capability.)

As the Scientific American article points out, participation in these services is no longer optional. It is the only way to sync your calendar, etc. with the new iOS, and is strongly encouraged by Microsoft Windows 8.  Needless to say, a Chromebook is of marginal use “For those rare times when you aren’t connected to the web”.

As you may have noticed, some of these suppliers are also selling access to your content, or are compelled to provide it to the “authorities”. The “authorities” vary by country, but include the NSA and such.  If you are engaged in a business that spans international borders, and tries to keep trade secrets from foreign competitors, you may want to think twice, or even a dozen times, before committing critical data to the cloud.  Consider the liability of an “innocent” memo stored in the cloud (email, or even just a draft, perhaps an early draft given version rollback) … that reveals just a bit too much about some key secret.  Back in the last millennium, when lawsuit “discovery” entailed providing access to your file cabinets to an army of paralegals, one of them tripped over this comment in one company’s files: “XYZ corp is eating our lunch, we need to buy them out or burn them down” … written prior to the fire that destroyed XYZ corp’s facilities.  Such a cute turn of phrase would be far easier to find with a good search engine and authorized (or even unauthorized) access to your files.

There is a solution which I think might work: buy your cloud services from the NSA. Consider this. The NSA has the world’s leading experts on cyber-security and, while an obvious target for attack, is probably one of the best defended. They have data centers that can handle the load (if they can ever get the generators in Utah working), and they probably have more restrictions on their abuse of the data than any other entity (with increasing restrictions every day). And best of all, they already have a copy. (I think I’m kidding here.)

What you post may be used against you

The Jan 9 Wall St Journal points out that credit analysts are starting to use your Facebook, LinkedIn and eBay activities to evaluate you.   For example, do your job history and status on these sites correspond with what you submitted in an application?  What are buyers saying about you on eBay (assuming you are selling stuff there)?  And so on.  In short, your “rep” (as in reputation) is being tracked as it spans social media.

This adds to the “75% of employers check your social media presence before pursuing an interview” (feedback from an HR friend of mine), universities that use your presence as part of their acceptance process (are you really sure you want those party pictures online?), and even schools that have expelled students for violations admitted on their social media sites.

Scott McNealy asserted “You have no privacy anyway, get over it“, and it appears the NSA may concur.  However, it is not clear this is a situation we should take lying down …. anyone want to stand up?