Search Among Sensitive Content

ECIR 2021 Tutorial

Organisers:

Graham McDonald, University of Glasgow, UK, graham.mcdonald@glasgow.ac.uk

Douglas W. Oard, University of Maryland, USA, oard@umd.edu

Resources:

  • Tutorial Slides are available here
  • A reading list is available here

About The Tutorial

Information retrieval (IR) systems provide access to large amounts of content, some of which may be personal, confidential, or otherwise sensitive. While some information protection is legislated, e.g., through the European Union’s General Data Protection Regulation (GDPR) or disclosure exemptions in Freedom of Information Acts; other cases are regulated only by social expectations. Information retrieval research has traditionally focused on finding indexed content. However, the increased intermixing of sensitive content with content that can properly be disclosed now motivates research on systems that can balance multiple interests: serving the searcher’s interest in finding content while serving other stakeholders’ interests in appropriately protecting their sensitive information.

If the content requiring protection were marked, protecting it would be straightforward. There are, however, many cases in which sensitive content must be discovered before it can be protected. Discovering such sensitivities ranges in complexity from detection of personally identifiable information (PII), to automated text classification for sensitive content, to human-in-the-loop techniques for identifying sensitivities that result from the context in which the information was produced.

Once discovered, IR systems can use the results of sensitivity decisions to fil- ter search results, or such systems can be designed to balance the risks of missing relevant content with the risks of disclosing sensitive content. Optimising sensitivity-aware IR systems’ performance depends on how well the sensitivity classification works, and requires development of new evaluation measures. Moreover, the evaluation of such IR systems requires new test collections that contain (actual or simulated) sensitive content, and when actual sensitive content is used secure ways of evaluating retrieval algorithms (e.g., algorithm deposit or trusted online evaluation) are needed. Where untrusted data centres provide services, encrypted search may also be needed. Many of these components rely on algorithmic privacy guarantees, such as those provided by k-anonymity, L-diversity, t-closeness or differential privacy.

This tutorial will introduce these challenges, review work to date on each aspect of the problem, describe current best practices, and identify open research questions.