Federated Analytics: What Is It, and How Does It Work?
What you need to know about federated analytics, in a few short minutes.
What is federated analytics?
Federated analytics is an approach to user data analysis that does not collect raw data from individual devices.
The idea has been circulating for a few years, but Google has introduced federated analytics to a wider audience.
They define it as “Collaborative data science without data collection”.
Where ‘traditional’ data science brings lots of information into one central data lake, federated analytics combines information from distributed datasets without gathering it in one central location.
Federated analytics relates to federated learning (clue’s in the name there), but it doesn’t do the learning part (again, see the name).
Federated learning (introduced in 2017) is a way to train centralised machine learning models on decentralised data.
That is to say, Google (or similar) can make its algorithms smarter by aggregating device data (we’ll explain that process later), but the user data stays on phones.
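To make that a little less abstract, here is a minimal sketch of how a shared model can improve while the training data stays put, loosely in the spirit of federated averaging. Everything in it is an assumption for illustration: the one-parameter model, the toy data, and the `local_update` and `federated_average` names are hypothetical and are not taken from Google's implementation.

```python
# Toy sketch: each "phone" trains on its own data; only model updates are shared.
# Assumption: a one-parameter model y ≈ w * x, purely for illustration.

def local_update(w: float, data: list[tuple[float, float]], lr: float = 0.1) -> tuple[float, int]:
    """Runs on the device: one gradient step on private (x, y) pairs."""
    grad = sum(2 * (w * x - y) * x for x, y in data) / len(data)
    return w - lr * grad, len(data)


def federated_average(updates: list[tuple[float, int]]) -> float:
    """Runs on the server: average the device models, weighted by example count."""
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total


# Hypothetical private datasets, one per device (all follow y = 2x).
device_data = [
    [(1.0, 2.0), (2.0, 4.0)],
    [(3.0, 6.0)],
    [(1.0, 2.0), (0.5, 1.0), (2.0, 4.0)],
]

w = 0.0  # the shared model starts out knowing nothing
for _ in range(50):
    updates = [local_update(w, data) for data in device_data]
    w = federated_average(updates)

print(round(w, 4))
```

The server only ever sees each device's updated `w` and an example count, never the `(x, y)` pairs themselves.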
In essence, federated analytics offers a way to measure and improve the performance of federated learning models.
We can imagine how useful it might be for health data, for example, where the need for privacy and accuracy is heightened significantly.
Google made a comic about federated learning, which seemed silly to me until I started reading the academic papers on the topic.
The comic, at this point, became a sturdy reference.
There we go. Everything is so much simpler in a comic.
The Utopian ideal is to learn from everyone, without learning about anyone.
It’s an emerging field, but it offers a new way of thinking about data collection.
Why is it needed?
Federated analytics is a response to that seemingly insoluble paradox:
- Modern systems need lots of data to function optimally.
- Namely, they need to know how people interact with products in order to improve them.
- So, they need user data or they can’t keep improving.
- Users want better products, which means sharing their data.
- Users are less inclined to share personal data with tech companies today.
- Governments are starting to clamp down on rapacious data capture by Big Tech.
As these privacy regulations take hold and the tech companies fall under more scrutiny, something has to give.
The economic imperative to find a solution could hardly be greater.
People are inventive when faced with such circumstances.
And lo, we have federated analytics and its like.
Federated analytics can potentially solve other challenges, too.
Recently, we have seen instances where machine learning models trained on one centralised data repository do not generalise well when faced with new scenarios.
For example, a model that learns to spot specific issues in brain scans may then miss issues with marginally different tumour boundaries in new scans.
Early trials (and they are very early) have led to performance improvements through federated learning.
Federated analytics is also fast, takes place on the “edge”, and can work offline.
How does federated analytics work?
Ok, I might need that comic again here.
If we revisit that Google definition, “Collaborative data science without data collection”, we haven’t really touched on the “collaborative” bit yet.
I think we understand that raw data is not collected, but how can multiple sources collaborate on the data if it is never sent to a central server?
We can think of it this way: Instead of moving data to the algorithm, the algorithm is brought to the data.
The “work” happens on the device and ONLY the “insights” are sent back to the central server.
From here, product managers and engineers can work on improvements. The individual data points are secure and only the general patterns are revealed.
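As a rough illustration of bringing the algorithm to the data, the sketch below has each device boil its raw log down to a tiny summary, and the server only ever combines those summaries. The metric (how often a suggestion was accepted) and the names `LocalReport`, `local_analysis` and `aggregate` are assumptions made up for this example, not a real API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LocalReport:
    """The only thing that leaves the device: two counts, no raw events."""
    shown: int
    accepted: int


def local_analysis(events: list[bool]) -> LocalReport:
    """Runs on the device. Each event is True if a suggestion was accepted."""
    return LocalReport(shown=len(events), accepted=sum(events))


def aggregate(reports: list[LocalReport]) -> float:
    """Runs on the server. Combines summaries into a population-level rate."""
    total_shown = sum(r.shown for r in reports)
    total_accepted = sum(r.accepted for r in reports)
    return total_accepted / total_shown if total_shown else 0.0


# Hypothetical raw logs; in practice these would never be gathered in one place.
device_logs = [
    [True, False, True],
    [False, False],
    [True, True, True, False, True],
]

reports = [local_analysis(log) for log in device_logs]
acceptance_rate = aggregate(reports)
print(f"{acceptance_rate:.0%} of suggestions accepted")
```

Note that the per-device counts are still visible to the server in this version. Hiding those as well is a separate step.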
So, how does that happen?
Google offers an example from Gboard, its keyboard app, which is used in messaging apps. It wants to assess how effective its next-word prediction models are. You’ll see these pop up in Gmail, if you have Smart Reply set up.
It could send everyone’s data back to the server and run analysis on the aggregated results. This is unnecessarily unwieldy and runs the risk of putting all of our messages in a hackable location. Not good.
Instead, it can send a model to individual phones to run the analysis on the device. The devices perform calculations locally and upload only the resulting metrics, which are then aggregated using a secure aggregation protocol. In essence, they can “collaboratively compute the sum of the values without revealing the values themselves.”
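The idea of summing values without revealing them can be sketched with pairwise masks that cancel out. This is a deliberately simplified toy and not the actual protocol: here a single seeded random generator stands in for the secrets that pairs of devices would agree on between themselves, and nothing handles phones dropping out mid-round. The names `mask_values` and `server_sum` are hypothetical.

```python
import random

MODULUS = 2**32  # all arithmetic wraps around, so a masked value looks random


def mask_values(values: list[int], seed: int = 0) -> list[int]:
    """Return one masked upload per device.

    For every pair (i, j) with i < j, a shared random mask is added to
    device i's value and subtracted from device j's, so all masks cancel
    when the uploads are summed.
    """
    rng = random.Random(seed)  # assumption: stands in for pairwise shared secrets
    masked = [v % MODULUS for v in values]
    for i in range(len(values)):
        for j in range(i + 1, len(values)):
            mask = rng.randrange(MODULUS)
            masked[i] = (masked[i] + mask) % MODULUS
            masked[j] = (masked[j] - mask) % MODULUS
    return masked


def server_sum(uploads: list[int]) -> int:
    """The server adds up the masked uploads; the masks cancel in the total."""
    return sum(uploads) % MODULUS


values = [12, 7, 30, 5]        # hypothetical per-device metrics
uploads = mask_values(values)  # what the server actually receives
total = server_sum(uploads)
print(total)
```

The server recovers the correct total (54 for these inputs) while each individual upload is scrambled by masks it cannot remove on its own.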
Google can see how well the service is working, at a population level, and identify improvements. These can then be sent out to the devices and analysed again in future.
The process happens when the phone is idle and plugged in. It joins a “round” with hundreds of other phones.
The models remain static on phones until they join another round and receive updates.
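A minimal sketch of how a round might be filled, assuming only the two conditions mentioned above (idle and plugged in). The `DeviceState` fields, the `select_round` function and the random sampling are all assumptions for illustration; real schedulers will have their own criteria.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class DeviceState:
    device_id: str
    idle: bool
    charging: bool


def is_eligible(device: DeviceState) -> bool:
    """A phone only takes part when it is idle and plugged in."""
    return device.idle and device.charging


def select_round(devices: list[DeviceState], cohort_size: int, seed: int = 0) -> list[DeviceState]:
    """Pick up to cohort_size eligible phones at random for this round."""
    eligible = [d for d in devices if is_eligible(d)]
    rng = random.Random(seed)  # seeded only so the example is repeatable
    return rng.sample(eligible, min(cohort_size, len(eligible)))


# Hypothetical fleet: every third phone is in use, every fourth is unplugged.
fleet = [
    DeviceState(f"phone-{i}", idle=(i % 3 != 0), charging=(i % 4 != 0))
    for i in range(1000)
]

cohort = select_round(fleet, cohort_size=300)
print(len(cohort), "phones joined this round")
```

Phones that are not picked, or not eligible, simply keep the model they already have until a later round.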
This is great for voice assistants, where we want our individual data to remain private but we do want the assistant to improve over time.
Apple is big on privacy (at least publicly, ironically enough) in this regard, but that approach also accounts for Siri’s lack of development. It does the local analysis part, but not the collaborative, federated part, where devices “learn” from each other’s performance.
As a result, Apple has been caught recording individual data and sending it to a third party for analysis.
Federated analytics is a step towards reconciling the competing needs for privacy and innovation.
Well, Google also wants to add services to solidify this new way of working. It has just introduced Private Service Connect, which keeps data away from the public Internet and inside the Google network.
And it is building cables all over the show to power its Cloud services.
Privacy is big business and it all depends on who you trust with that precious currency.
Google is “leaning in” to this new reality, rather than fighting it. Federated analytics still needs a lot of testing and development, but it is an encouraging step in the right direction.
Federated analytics: Decentralised analysis of the raw data stored on user devices. Used for basic computations about user behaviour that do not need machine learning. Federated analytics can assess the effectiveness of federated learning models.
Federated learning: Machine learning while keeping data secure on devices. The phones (and we are dealing with phones here, not desktops) collaboratively work on a model while storing the training data on the device. Used for more complex tasks that do require machine learning, without sharing sensitive data in a central location.
Secure aggregation: An important part of the federated analytics ecosystem. Secure aggregation protocols ensure that when devices collaborate by sharing their data, the individual data values are not revealed.