The ever-increasing adoption of privacy-enhancing technologies (PETs) provides new layers of privacy in areas such as secure data analytics and machine learning. The first step in many analytics pipelines is the matching of records from multiple data sources. Doing so in a privacy-preserving manner requires the adoption of sophisticated cryptographic techniques. One such approach designed for this task is Private Matching for Compute (PMC), developed here at Meta.
Our private matching protocols enable the matching of records across databases, each owned by a separate entity, while preserving the privacy of each dataset. PMC introduced two protocols for this task, Private-ID and PS3I. The Private-ID protocol allows two parties to compute a secure mapping between their respective databases. The two parties then use this mapping, together with additional (secret-shared) data, in a secure multi-party computation (MPC) to perform further downstream tasks, such as private analytics or MPC-based machine learning.
There is a plethora of applications that may leverage private matching protocols and securely compute analytics. In one example, medical researchers can calculate the risk of a health condition by merging information held by a large health-care provider with a database of a doctor’s office. In another example, an ad publisher (designated “Company” in the graphics below) holding user-provided information might want to measure advertising efficacy and offer personalization by merging with the data held by an advertiser (designated “Partner”) while still preserving user privacy. The recent acceleration of improvements in PETs, due to R&D from across academia and industry, is helping bring ever-improving privacy layers to applications such as these.
In the context of such applications, we are presented with a multitude of opportunities for enhancements, including how to increase the number of matches and how to make it easier to adopt such technologies.
Multi-key matching. One shortcoming of the previous private matching protocols is that they consider only matching that’s based on a single type of identifier (e.g., email addresses). This significantly restricts the number of possible matched records across the databases of the two parties. Most parties have more information about their users than email addresses alone. A natural extension of our previous protocols would be to consider “multi-key private matching for compute” to improve the match rate by considering any type (e.g., email addresses, phone numbers) and any number of identifiers. Note: the terms identifier and key are used interchangeably in this post.
Delegation of computation. Our next observation is that private matching protocols require both parties (company and partner) to actively participate in both the matching and in the downstream MPC computation, which restricts the ease of deployment of such protocols. To enable wide-scale adoption of our multi-key PMC protocol, we extend it to a delegated setting where the Partner performs only lightweight computations to encrypt their datasets and then outsources both the matching and the downstream MPC phases to a powerful server, called Delegate. In this case, after the Partner goes offline, the Delegate and the Company perform the matching and the MPC phases.
Delegation of computation from multiple parties. Finally, we observe that our delegated setting can easily facilitate multiple partners at the same time and even combine their results. The Delegate performs the matching phase after it gathers all encrypted databases from the partners. In this case, the downstream MPC phase runs over all matches.
Let’s delve deeper into some new protocols which 1) enable multi-key matching to improve match rates, and 2) ease adoption through the introduction of delegation in both single and multiple partner settings.
As discussed, a natural extension to Private-ID is to consider a setting in which each party has a database with multiple identifiers, or keys (email address, phone number, name, etc.), per row. Allowing matching on multiple keys significantly improves the match rate since contributing parties might have different types of identifiers (e.g., some might have only phone numbers). For instance, one record may be indexed by both an email address and a phone number in one dataset, whereas in the other dataset the combination of identifiers may differ (e.g., name, email address). To make things more interesting, we assume that each party may have a different number of identifiers for each record and neither party wants to share their notion of a record with other parties (i.e., the order of their identifiers or identity graph). As shown in the image below, the Partner’s first record only includes an email address, whereas the second record includes both an email address and a phone number. Each party orders the identifiers for each of their records based on the party’s belief about the importance of each identifier.
Joining these “multi-key” datasets inherently results in many-to-many relationships between the records. In the image above, Company’s record is matched with both the first and the second record from Partner’s dataset. There are various use-cases in which this can happen, such as individuals having multiple email addresses or phone numbers, spouses sharing a phone number, etc.
We extend Private-ID with the Multi-key Private-ID protocol that considers multi-key databases to improve our matching rate. There are multiple ways in which these many-to-many relationships can be resolved. In our Multi-key Private-ID, we leverage a ranked deterministic join logic to collapse many-to-many connections and achieve a one-to-one mapping. The main idea is that each identifier has a predefined weight (i.e., a matching priority) determined by each party (e.g., Company and Partner) and matching is first performed on all the records based on the first identifier (as in the Private-ID protocol). Next, the protocol proceeds to a second round where records that were unmatched from the previous round will be matched based on the second identifier. This “waterfall” process continues for all the identifiers present within the records. For instance, in the image above, the first match could be based on the email (annelopez82@example.net), while the next could be based on the phone number. Note that this is just one form of matching strategy that we could follow. More generally, we could view many-to-many matches between the two (masked) sets as a bipartite matching graph problem, and we could solve for a globally optimal matching within the protocol.
Let’s continue where we left off with Company and Partner and add a few more keys to each dataset.
In the multi-key Private-ID formulation, the two parties want to generate a single set of keys for all the records without revealing the records themselves, like the following:
As in the Private-ID output, neither party knows whether its record matched with the other party’s, but if it did the records would have the same ID.
The Multi-key Private-ID protocol follows similar steps as the Private-ID protocol, except that the matching logic is different. More specifically:
Step 1. Encrypt and exchange records: The Company and the Partner generate secret keys, which they use to encrypt their records component-wise after they shuffle them. Each party then encrypts the received set with their key and both parties end up with doubly encrypted datasets.
Step 2. Calculate set difference: The records that are encrypted with two keys can be used to calculate a symmetric set difference. The Partner sends its doubly encrypted records to the Company after shuffling. The Company calculates a symmetric set difference, which would allow each party to get identifiers for records that it doesn’t have. If many-to-one (Company to Partner) connections arise, they are resolved using iterative matching, leveraging predefined identifier ranking defined by the Company. If one-to-many connections arise, they are resolved randomly.
Step 3. Generate mapping from ID to records: The parties can now generate the ID spine by exchanging the encrypted records, undoing their shuffling, and then appending them to the records generated from the symmetric set difference. In the end, both parties receive the ID spine and each party learns a mapping from its users to the ID spine.
We extend the ideas of Multi-key Private-ID to support matching between multiple distributed databases with a new protocol called Delegated Private Matching for Compute (DPMC). Our observation is that we can delegate the heavy computation of Partner to an intermediate — or delegate — Party D and run both the matching and the downstream MPC computation between parties C and D. Using an intermediate party will scale private matching to multiple input parties P, as each party only submits their data under an encrypted form and then goes offline.
Recall that Private-ID has two steps: 1) the private matching step, and 2) the downstream secure computation step. As both parties C and P participate in both these phases, they can feed secret-shared data for all the matched records after the matching has been established. Unfortunately, we cannot do that in DPMC, as we assume that the input parties P will disconnect from the protocol after they submit their data. In DPMC, we need to carry along the features from multiple parties P while we perform the matching: an important requirement is that the delegate Party D does not learn anything about the data of the input parties.
For that reason, our DPMC protocols focus on the left-join of C’s and all parties’ P data. Parties C and D learn secret shares of the parties’ P features for the matched records, or secret shares of NULL for the rows of C that did not match any of P’s data. Our motivation for performing a left-join (compared to a union or an intersection) is that The Company learns a mapping from all their users into the secret shared features, which allows them to input additional associated data without re-executing the matching protocol (akin to Private-ID). Notably, our protocols can be extended to compute union. These associated data can either be labels (in the clear) that could be used to filter the secret shared values (e.g., in a GROUP BY fashion), or they can be additional secret shares for the downstream MPC computation.
Next, we describe two protocols that allow multi-key matching between C and multiple parties P:
1) Delegated Private Matching for Compute (DPMC) utilizes one delegate server and computes a series of left joins with associated data; each left join is between The Company and each partner P.
2) Delegated Private Matching for Compute with Secure Shuffling (DSPMC) utilizes two delegate servers and achieves stronger security guarantees by computing a single left join with associated data between The Company and all partners P.
We break down the DPMC algorithm in six steps. While DPMC performs matching between multiple parties P, for simplicity, we show one Partner throughout each step below.
Step 1. Key setup: The Company has a public and a private key pair as well as a private key. Each Partner has two secret keys as well as the Company’s public key. The Delegate (Party D) has a secret key which they share with each Partner (different for each Partner).
Step 2. Encrypt Partner’s records and features: Each Partner shuffles and encrypts their identifiers component-wise. Each Partner also generates some random masks and encrypts both their features and the key used to encrypt the identifiers. Finally, each Partner sends these encrypted messages and disconnects from the protocol.
Step 3. Encrypt Company’s records: The Company shuffles and encrypts their identifiers component-wise and after it receives all encrypted messages from each Partner, it applies one more layer of encryption on the identifiers and forwards all the messages to the Delegate.
Step 4. Prepare for matching: The Delegate removes one layer of encryption from the encrypted features and recovers each Partner’s secret key as well as the encrypted features and masks. Finally, the Delegate uses the shared key of each Partner to remove one layer of encryption from the encrypted identifiers.
Step 5. Calculate set difference: The Delegate now has encrypted records that can be used to calculate a symmetric set difference as in Multi-key Private-ID. If many-to-one (Company-to-Partner) connections arise, they are resolved using iterative matching leveraging a predefined identifier ranking defined by the Company. If one-to-many (Company-to-Partner) connections arise, the Delegate will retain these in the output up to some maximum number of matches per Company row. Note that one-to-many connections may arise since one record of the Company may appear in multiple Partner’s datasets.
Step 6. Prepare secret shares: For each record that is matched, the Delegate sends the encrypted feature masks to the Company as their secret share and keeps the masks as its own share. For each record that did not match, Party D generates an encrypted mask of zero and sends the encrypted feature masks to the Company as their secret share and keeps the same masks as its own share.
Step 7. Company decryption: Finally, the Company decrypts the encrypted masks and receives the secret shares of the features. If a row is in the intersection of Company’s database and one of the Partners’ databases, the Company and the Delegate end up with secret shares of features for this identifier. Otherwise (if a row is not in the intersection), the Company and the Delegate end up with secret shares of zero.
The DPMC protocol supports multiple Partners and can either create a left-join with associated data for each Partner or a left-join with associated data up to a threshold number of partners, which minimizes the secret shares of zeros. DPMC can also create inner-joins with one-to-many (Company-to-Partner) outputs in a similar way. One thing to note about DPMC is that as in both Private-ID, and Multi-key Private-ID, the party that is performing the join (the Delegate in this case) learns the structure of the encrypted matching graph since it sees the encrypted identifiers of all parties. Next, we introduce a protocol that minimizes this leakage.
Finally, we extend DPMC to DSPMC, a protocol that uses two delegates (the Delegate and the Shuffler) to perform a secure shuffling protocol and achieve stronger security guarantees. Assuming no collusion between parties, DsPMC limits DPMC’s leakages and reveals only one intersection size between the Company’s database and the union of the databases of all Partners. Additionally, in the presence of an adversary that has corrupted multiple parties, DSPMC remains secure as long as the adversary corrupts at most one of the Company, Delegate, and Shuffler, and all but one of the Partners.
Most DSPMC steps are similar to those of DPMC, except for the following two:
After the secure shuffling step, the Company and the Delegate proceed as in DPMC and compute the left join and the secret shares of the Partners’ features.
Note that DSPMC the Delegate only learns an intersection size between datasets X and Y, where the latter is combined over all Partners. We argue that this leakage is minimal compared with that of previous protocols, especially as the number of Partners increases. We include the full security definitions of our protocols in this paper, which also discusses security in the presence of an adversary that corrupts multiple parties.
We implemented these protocols in Rust and open-sourced them at the Private-ID GitHub repository. We chose Rust for its superior safety features and ease of writing multithreaded code. We use Curve 22519 from the Dalek library as our elliptic curve.
We implemented multi-key Private-ID, DPMC, DSPMC, as well as PS3I, and compared them for an increasing number of records and identifiers per row. We use multi-key Private-ID as a baseline since it focuses only on the matching and does not consider any features. Recall that in Private-ID and in multi-key Private-ID, the two parties establish a secure join and then, as they are both online, they can provide associated data. Conversely, in our delegated protocols, the Partner goes offline and thus they must encrypt their associated data and send them with the encrypted identifiers. Next, all the parties must carry both the encrypted identifiers and the encrypted associated data throughout the whole computation, which imposes an overhead to the computation. In the image below, we show the total wall clock runtime for each protocol (i.e., the computation time for all parties). Each Partner goes offline in the delegated protocols after they send their encrypted data, incurring minimal computational overhead compared with multi-key Private-ID and PS3I.
Our experimental results show that all protocols scale linearly to the number of identifiers (i.e., input size) and that DPMC almost matches the performance of Private-ID, although it includes associated data. We observe that DSPMC is approximately one order of magnitude slower than DPMC because of the extra communication cost of the additional delegate party and the secure shuffling protocol, but it achieves stronger security guarantees. Both our delegated protocols are at least an order of magnitude faster than PS3I.