About Clean Insights

Data collection has become the default for most companies. But, like the fossil fuels that keep your car running, the data that keeps a company running can have harmful side effects, leading some to call data a toxic element. The most privacy-conscious companies have proposed a simple solution to the problem: stop collecting data altogether. But ending all data collection means companies are “flying blind,” an unacceptable trade-off for most organizations. How, then, can campaign managers and product developers strike the right balance between promoting user privacy and promoting product success?

Clean Insights gives developers a way to plug into a secure, private measurement platform. It focuses on answering key questions about app usage patterns, not on enabling invasive surveillance of all user habits. Clean Insights (CI) provides controls that can be adjusted to suit specific use cases and privacy needs. CI also provides methods for user interactions that are empowering instead of alienating.


Clean Insights originated during the Berkman-Klein Assembly 2017 program at Harvard University, and is currently cared for by a project team that includes Guardian Project and Okthanks.

Practical Techniques

The Clean Insights project uses practical techniques, distributed as a new kind of measurement SDK for connected devices and server infrastructure, that follow these core tenets:

  • Data minimization: Only the minimum amount of usage and behavioral data should be gathered to answer a predetermined set of questions. The frequency, range, and level of detail of measurements should be as small as possible.
  • Source aggregation: Potentially identifying data should not be held in any part of the system longer than necessary, and should be aggregated at the source as early as possible.
  • Randomization: Remove the link between the data and the individual by adding noise to the data, so that it is uncertain enough that it can no longer be tied to a specific individual.
  • Generalization: Dilute the attributes of data subjects by modifying the respective scale or order of magnitude (e.g. a region rather than a city, a month rather than a week).
  • Transparency: Always get consent, and make the scope of the data collection and the algorithms used publicly available and well explained.
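
As a rough illustration of three of these tenets (source aggregation, randomization, and generalization), the following Python sketch may help. It is not the Clean Insights SDK or its API: every function name here is hypothetical, and the randomization step uses classic randomized response, chosen for illustration as one well-known way to add noise to a yes/no measurement.

```python
import random
from collections import Counter
from datetime import date


def generalize_date(day: date) -> str:
    """Generalization: keep only the year and month, dropping the day."""
    return f"{day.year:04d}-{day.month:02d}"


def aggregate_at_source(events: list[tuple[str, date]]) -> Counter:
    """Source aggregation: reduce raw events to per-month counts on the device.

    Only these counts would be reported; the raw events can then be discarded.
    """
    return Counter((name, generalize_date(day)) for name, day in events)


def randomized_response(truth: bool, p_truth: float, rng: random.Random) -> bool:
    """Randomization: report the true answer with probability p_truth,
    otherwise report the result of a fair coin flip."""
    if rng.random() < p_truth:
        return truth
    return rng.random() < 0.5


def estimate_true_rate(reports: list[bool], p_truth: float) -> float:
    """Estimate the population rate from noisy reports.

    Uses E[observed] = p_truth * rate + (1 - p_truth) * 0.5, solved for rate.
    """
    observed = sum(reports) / len(reports)
    return (observed - (1 - p_truth) * 0.5) / p_truth
```

In this sketch, no single report reveals whether a given user really performed an action, yet the aggregate rate can still be estimated across many users. A lower `p_truth` gives each individual more deniability at the cost of a noisier estimate.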

Our approach defends against a variety of tracking and deanonymization attacks:

  • Singling out, which is the possibility of isolating some or all of the records that identify an individual in the dataset.
  • Linkability, which is the ability to link at least two records concerning the same data subject or a group of data subjects.
  • Inference, which is the possibility of deducing, with significant probability, the value of an attribute from the values of a set of other attributes.
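
To make the singling-out risk concrete, one simple check is to count how many records share each combination of generalized attributes; a combination that appears only once isolates one individual. The sketch below is an illustrative k-anonymity-style check, which is not something the text above attributes to Clean Insights. The function name and the record fields (`region`, `month`) are hypothetical.

```python
from collections import Counter


def smallest_group_size(records: list[dict], attributes: list[str]) -> int:
    """Return the size of the smallest group of records that share the same
    values for the given attributes (0 if there are no records).

    A result of 1 means at least one record can be singled out.
    """
    groups = Counter(tuple(r[a] for a in attributes) for r in records)
    return min(groups.values(), default=0)


records = [
    {"region": "North", "month": "2021-03"},
    {"region": "North", "month": "2021-03"},
    {"region": "South", "month": "2021-03"},
]

# The lone "South" record forms a group of one, so it can be singled out.
print(smallest_group_size(records, ["region", "month"]))
# Generalizing further (month only) puts all three records in one group.
print(smallest_group_size(records, ["month"]))
```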

Clean Insights also aims to improve the security of the measurement process by defending against the following threats:

  • Surveillance & Censorship (DoS, Information Disclosure, Repudiation): Traffic analysis can identify users of a specific app based on the inclusion of a certain instance of an analytics package, connections to a specific endpoint, or traffic fingerprinting.
  • Increased Vulnerabilities (Spoofing, Tampering, Elevation of Privilege): Inclusion of unvetted, insecure third-party analytics libraries can lead to backdoors in software, man-in-the-middle attacks through weak network security, and a generally broader set of attack vectors for someone seeking unauthorized access to data and services.
  • Weaponization of Users (Tampering, DoS): Traffic from a set of users can be redirected and effectively weaponized to perform a DDoS attack against any desired target.