Searchable Encryption

Searchable Encryption

The goal is full-text search on encrypted documents where the database only sees ciphertext. All documents are encrypted within the context of a workspace key.

We seek to defend against the “stolen database” scenario, where the attacker has obtained a static copy of the database. We are not trying to prevent analysis by an attacker who can observe over a long period of time the queries that are made to the database.

The Basic Scheme

The basic scheme tokenizes each document at word boundaries, randomizes token order, and removes duplicates so each unique word appears at most once per document. Each unique token is then deterministically encrypted (this encryption does not need to be reversible, e.g. it could just be a one way digest).

When the application wants to search the documents, it performs the same operations on the search terms and then hands the database the deterministically encrypted search terms to match against the deterministically encrypted search index.

Deterministic encryption of unique tokens per document is essentially a bag-of-words model with deterministically encrypted terms. This means that all the statistical properties of normal inverted indexes (e.g. full text search) are preserved in the ciphertext such as:

Document frequency: The number of documents containing the ciphertext word is the exact same as the number of documents containing the cleartext word.
Zipfian distribution: The distribution of ciphertext words matches exactly the distribution of natural language.
Co-occurance patterns: Two ciphertexts that appear together in many documents preserve the correlation of their plaintext words.

In short, this scheme is highly vulnerable to frequency analysis but can be greatly strengthened with a few modifications.

Problem: Frequency Analysis

Although frequency analysis is not possible for a single document, if an attacker has access to all the documents for a workspace the attacker could compute the frequency that a token appears in a document or not.

Common words like “the” appear in nearly every document, while rare words like proper names appear in very few (in practice, “the” is eliminated from the search document because it is not useful term to search on).

For a large enough set of documents, the resulting document‑frequency histogram closely mirrors the Zipfian distribution of English, allowing the attacker to identify high‑frequency common words (low‑value targets) and, more critically, low‑frequency singletons that likely contain sensitive information such as names, medical terms, or locations.

The attacker needs to map frequency a token appears in all the documents to a rank within the known distribution of words in the given language.

C = normalization constant (approximately 0.1 for English)
D = total number of documents
k = number of documents the encrypted token appears in.

Then the rank r represents the unencrypted token’s rank in the language: r = C * D / k.

But this is only an expectation. The actual rank has a confidence internal due to sampling variance (e.g. Our set of documents does not in fact represent a perfect sample of the language).

The average probability that a word of rank r will appear in the average document is:

P(appears) = 1 - (1 - (C/r))^L

C = normalization constant
r = the word’s frequency rank
L = average number of unique tokens per document

Note: C/r is simplification of Zipf’s law for the probability of encountering a specific word. It slightly overestimates probabilities for rarer words. This formula treats L as constant across documents. In reality, document length varies. Using the average L is a simplification that loses information about the distribution’s shape.

The attacker wants to determine the real rank of a word they observe in the documents. Suppose the real rank is 500, unknown to the attacker. Suppose L = 300, typical 500-800 word documents after deduplication, and C = 0.2 (a generous overestimate). P(appears) is then approx 0.133.

This is a simplification of the worst case scenario. In practice, many documents will be only include a title, and many will contain tens of thousands of unique words. A successful frequency attack would have to account for the fact that titles for tasks or events (for example) will have a very different distribution of words than most written language. Large documents contribute little to the attack because they contain nearly every common word exactly once.

Number of documents `D`	Expected number of documents containing word (`k = D * P(appears)`)	95% confident interval of `k`	Range of implied rank	Interpretation
100	11.3	[5, 19]	[260, 1200]	no information
500	56.5	[46, 68]	[370, 680]	rank within ~ 40%
1,000	113	[96, 132]	[380, 650]	rank within ~ 35%
5,000	565	[535, 596]	[420, 560]	rank within ~ 15%
10,000	1,130	[1090, 1171]	[430, 540]	rank within ~ 10%
50,000	5,650	[5540, 5761]	[435, 520]	rank within ~ 8%
100,000	11,300	[11170, 11431]	[440, 510]	rank within ~ 7%

The confidence interval of k is calculated using the binomial proportions confidence interval since k follows a binomial distribution.

So, a rare word is hard to decipher with high confidence. However, lets look at a common word, with rank r = 20.

Number of documents `D`	Expected number of documents containing word (`k = D * P(appears)`)	95% confident interval of `k`	Range of implied rank	Interpretation
50	47.6	[42, 50]	[15, 28]	-
100	95.1	[89, 99]	[16, 26]	-
500	475.5	[465, 486]	[18, 23]	-
1,000	951	[942, 960]	[18, 22]	~95-99% word is identified correctly
5,000	4,755	[4,735, 4,775]	[19, 21]	~99% word is identified correctly
10,000	9,510	[9,490, 9,530]	[19, 21]	Effectively 100%

Common words (r=20) are trivially identifiable with even 100 documents. But this is low-value information since the attacker already knows that “have,” “with,” or “this” appear in most documents. The privacy risk remains on rare words (r=500+), which require orders of magnitude more documents to identify precisely. However, the attacker doesn’t need to identify rare words precisely to decipher some meaning in short documents. Identifying common words alone can provide some scaffolding to extract meaning.

Defense: Left and Right Tokens

To counter frequency analysis, Raft employs left and right tokens.

During the document encryption phase, each cleartext token encountered generates a “left” encrypted token and a “right” encrypted token. The digest of the cleartext of the token together with the encryption key determine a percentage chance that a token will be left or right encrypted.

There are then three outcomes for producing the encrypted search document:

The search document gets the left token
The search document gets the right token
The search document gets both tokens

The goal is to split the frequency of a token in a non-predictable way. Now the token for “water” appears in the encrypted corpus as two different tokens with different frequency ranks than the real word “water”. Perhaps the encryption key + digest results in 60% left tokens and 70% right tokens. Now “water” appears twice, onces as a word with a rank 60% of “water” and one with a rank 70% of “water”. We sometimes include both the left and right tokens so that the attacker cannot link them by detecting they never appear together.

In order to search for the word “water” the system must search for the left token OR the right token. This complicates the search query. Suppose the user wants to search for water AND boat. The expanded logic now becomes (left water OR right water) AND (left boat OR right boat).

Discussion:

t cleartext token
Lt left encryption of token t
Rt right encryption token t
pL probability that a given encrypted search document will contain Lt
pR probability that a given encrypted search document will contain Rt
pL + pR > 1 the encrypted tokens will often co-occur.
N total number of documents
f the true frequency of word t in the documents.

The attacker needs to know f. To do so, they need:

cL the count of documents were Lt appeared (N * f * pL)
cR the count of documents where Rt appeared (N * f * pR)
cLR the count of documents where both Lt and Rt appeared (N * f * pL * pR)
N * f is the total number of documents where t appears. (N * f = cL * cR / cLR)

Solving for f: f = (cL * cR) / (N * cLR)

However, this only works if the attacker can correctly identify which Lt is linked to which Rt, which they can only derive if they already know f.

Defense: Decoy Tokens

Another possible strategy to defend against frequency analysis is this: the system injects decoy tokens into each document’s unique token set. A decoy token appears as a real encrypted token but does not decrypt to an actual word. The decoy tokens are drawn from a global pool and reused across documents with carefully controlled frequencies to mimic the natural Zipf distribution of real English words.

This means some decoy tokens appear in many documents (mimicking “the”), others appear in moderate numbers, and many appear in only one document (mimicking rare but real words).

When the distribution of decoy token frequencies statistically matches the distribution of real token frequencies, an attacker cannot distinguish genuine sensitive terms from decoys.

The number of decoy tokens required depends on the desired security level and the number of documents in the corpus.

The decoy tokens themselves must be cryptographically random or otherwise indistinguishable from real encrypted tokens, and their reuse pattern must avoid detectable signatures.