Indexing Encrypted Data
Why bother encrypting?
As a SQL Server DBA I am primarily responsible for making sure my client’s data is protected and available amongst other things. In many cases I have gone to clients and projects to find that sensitive data is not protected as well as it could be. Most times development is still underway and the security is a minor issue at that phase, but it is critically important. By bringing the security of the data in the picture sooner than later steps can be taken to help make everything easier as it progresses.By planning ahead, you can even accomplish the not so easy task of safely indexing encrypted data.
There have been many cases over the years of sensitive data being compromised where just a few simple steps up front could have minimized, or even prevented the data being compromised.
In December of 2006 TJX companies (TJMax, Homegoods, etc) had a security breach of data that resulted in 94 million, yes million, credit card numbers and authorizations being stolen.
Why? The best I could find in my research indicated a lack of firewalls and weak encryption methods used.
94 million is one of those hard to comprehend numbers. To put it into perspective, image everyone (I mean everyone, kids, elderly, etc) in the USA was shopping at TJX on the day of the data breach and they all used their personal credit card. That is 1 in 3 people having had their credit card number stolen along with the authorization information. Better security and encryption probably could have saved this from being as large as it was. Things of course have changed a lot since then, but there are still issues happening out there.
Just last year (2012), in October, the South Carolina Department Of Revenue (SCDOR) had a security breach. It was reported that 4.7 million Social Security Numbers, of tax payers and their dependents were exposed, along with 5,000 credit card numbers and 3.3 million bank account numbers. The shocking part of this is that SCDOR was in full compliance with the IRS requirement for protecting data at rest.
Encryption of data is seen as affecting performance too much. Focus on data in transit, not data at rest. Per the IRS 1075 form detailing Federal Tax Information storage:
“While encryption of data at rest is an effective defense-in-depth technique, encryption is not currently required for FTI (Federal Tax Information) while it resides on a system (e.g., in files or in a database) that is dedicated to receiving, processing, storing or transmitting FTI, is configured in accordance with the IRS Safeguards Computer Security Evaluation Matrix (SCSEM) recommendations and is physically secure restricted area behind two locked barriers. This type of encryption is being evaluated by the IRS as a potential policy update in the next revision of the Publication 1075.”
Data in Transit vs. Data at Rest
Why does this type of regulation exist? In general the various regulations and guidelines are put in place for data that is in transit, nit at rest. The assumption that as long as the data is “staying put” and locked up behind barriers, it is safe. I view this a different way and I am going to paraphrase what was noted in the book “Expert SQL Server 2008 Encryption”
Think of your database or company. How much sensitive information is held within it. Think about things like: social security numbers, credit cards, bank accounts, company financials, customer information, medical information.
This data is like gold. There are a lot of people that would like to get a hold of it.
Picture a large fast flowing river with all sorts of prospectors in it panning for little nuggets of gold. These are the hackers in your network and on the internet trying to find and decipher the packets of data to get whatever they can.
That is your data in transit.
Now picture a large safe, filled with bricks of solid gold perfectly stacked up. That is your database. All this sensitive information condensed into one place. By not encrypting and protecting that data, you are leaving the safe door wide open.
If you were a hacker, where would you rather go? To the river with everyone else, or find the mother lode and just take some of the bricks, or even the the entire safe ?
This is not to say the only security measure to take is encrypting your data, but as a DBA, you control the last line of defense against the hackers if all the other measures fail or are breached. Protecting the data at rest is a key component of a complete security plan, and when omitted leaves a serious gap in the plan’s completeness.
Change is coming
As in the quote from the IRS above, and other articles I have read, change is coming. There is going to be a requirement to secure data at rest in and out of your databases.
Steps are going to have to be taken sooner than later and it is good to start to determine how you can protect your data.
What options are there to use to protect data in your SQL Server database.
The main three are Transparent Data Encryption, Hashing and Encryption.
Each have their purposes and uses, and can even be combined in some cases, but each also have their faults. The key is to determine which will work best for your situation.
I was approached by a client once that told me a SQL Server had to have a column in a table protected. I had the criteria that the data had to be able to be retrieved, but it also could not be visible to the users of the database in its plain text format. I started to look into each option in more detail to determine which would work best in my situation.
Transparent Data Encryption (TDE)
TDE performs real-time I/O encryption and decryption of the data and log files. This is a great option to provide a minimal level of protection at the database level. Basically a key is used to protect the data files and as the data is accessed it is decrypted and presented to the user or application. When you log into a database and look at the data directly, you will not see any difference. This is a popular solution to protect the data in older applications that can, or will, not be updated to use the other methods. There are some good things this method provides, as well as some drawbacks:
- enrypt the “data at rest”, the actual files themselves are encrypted.
- Encrypt the databse files (data and log) and transaction log records for log shipping and mirrored databases
- Affect other databases on the server not using TDE since the TEMPDB is encrypted. Since the TEMPDB is used by all databases on the instance, TDE has to encrypt it and you will probably start to see performance hits against other databases that are not using TDE.
TDE Does not:
- Encrypt at the object level – no single column or table, etc.
- Support instant file initialization for database files – growth of files may take longer due to TDE being turned on
- interact well with backup compression. I heard of a case where a 12GB DB was backing up with compression to < 3 GB. Once TDE was used, the backup size was 11.9GB with compression.
- Encrypt read only file groups
- Encrypt databases used in replication topologies
- Encrypt file stream data.
In my case, this would not help with what I had to accomplish for my client.
Hashing is a method of taking a value, salting it, and then passing it through an algorithm to get what looks like a bunch of gibberish to people.
Hashing works a lot like baking bread.
You start with some ingredients, add salt, put it in the oven (through the hashing algorithm) and you get bread. Hashing is also determinate, which means you will always get the same result. You will always get the same bread when you use the same ingredients and salt and bake it the same way. The problem is that you can’t ‘unbake’ the bread. The same goes for hashing a value. you cannot ‘unhash’ the value to get the original result.
There are some good points and bad points to hashing:
- Create a determinate value. Every time the value is processed through the algorithm the value is the same.
- Provide different algorithms, up to 512 bit encryption.
- Index. You can index the hashed value since it is determinate.
Hashing Does not:
- Un-Hash – Once hashed the value cannot be retrieved unless ‘hacked’, possibly with a rainbow table.
- Guarantee a unique value – Hashing can be prone to collisions. This is when 2 different values, when passed through the algorithm result in the same hashed value.
Because I would not be able to get the value back out with hashing, I can’t unbake the bread, this was not an option for me as well.
Encryption is very similar to hashing, but also very different. Instead of bread, encryption is more like a safe deposit box. So long as you have the correct keys you can get the original data back.
- Require keys, certificates and other components to be configured on the server. These also need to be securely backed up off the server.
- Work in 2 directions. I can encrypt and decrypt the values using the right keys. It is like a safe deposit box; As long as I have the right keys I can get back exactly what I placed in it.
- Use more resources. Not a lot, but certainly more power and memory are used to encrypt and decrypt the values due to the various ‘moving’ parts (keys, certificates, etc)
Encryption Does not:
- Have a determinate value. Every time you encrypt a value, using the same algorithm and keys, you will get a different value.
- Support indexing. You can index an encrypted column, but since encryption generates a non-determinate value every time you encrypt it, you can’t look up the previous value to compare. You must decrypt the value to then compare to the original value being handed in.
After going through the various options I decided that encrypting the column would be my best solution.
: For demo purposes I created a table with 5 million entries in the format of a social security number (9 integers in a row from 0 – 9 in value) and added a clustered index on the SSN column. from here on out the numbers I reference and the screenshots are based on that data, not the actual client data, though very similar in nature. I used Redgate’s SQL Data generator 2.0 to generate the data for the sample.
And here’s the problem I encountered
Before I encrypted the column, I figured I would do a quick check of just how long it took to execute the original query against the plain-text data that needed to be encrypted.
declare @ssn varchar(9)
set @ssn = '068814962'
SELECT cust_name ,
WHERE ssn = @ssn;
After executing the query I took a quick look at the time taken and the execution plan to see what I was up against.
Because this was a very simple query to return the table values based on the SSN value, a cluster index seek (as expected) was used and the results came back very quickly as you can see.
Just 1 ms for the results. Not going to be easy to add encryption and keep a speed like that, but let’s see what happened.
This time I had encrypted the value and placed it next to the old SSN plain text value. Because I would not be referencing the plain-text value anymore, the index really was no longer applicable. Since the encrypted column really can’t be indexed I expect a decent sized performance hit. I rewrote the query to open the appropriate keys and to decrypt the SSN value to return.
OPEN SYMMETRIC KEY HGssnkey
DECRYPTION BY CERTIFICATE SSN_Encrypt
DECLARE @ssn VARCHAR(9)
SET @ssn = '068814962'
SELECT cust_name ,
WHERE CONVERT(VARCHAR(9), DECRYPTBYKEY(ssn_enc)) = @ssn
I got the same result set, as was expected, but the time it took was a shock!
OUCH! Over 12 seconds. This is because I was forced to do an Index/Table scan and decrypt every value to then compare to the incoming parameter. All 5 million rows in the table needed to be decrypted since the query was looking for the records that matched.
This was not good. I couldn’t go forward with a process that went from 1 ms to run to over 12 seconds.
Something else had to be done.
The Partial Plain-Text Value (PPV) solution
This is usually asked for a couple of reasons, first it is all that the person on the phone can see, the rest of the number is hidden from them (as it should be), but the last 4 numbers are exposed in the encrypted database to make looking up the required information faster.
To solve the issue I was encountering I could add the last 4 digits of the SSN to a new column, and create an index on those 4 digits. Then I would look for the subset of rows that matched that last 4 numbers and then decrypt each of those until I found the match. It would certainly go a LOT faster since I’d be working on a much smaller set of data, but I have also given a hint to any potential hacker as to what the ciphered value is. In the case of the last 4 numbers of a SSN, I just guaranteed that you know 45% of the value that is stored.
With that much information, it becomes a LOT easier for the hacker to figure out my algorithms and crack the ciphered text.
That’s not a good thing.
An alternative approach to indexing encrypted data
I continued to think about the solution and came up with a hashed value grouping concept. This concept basically consists of hashing the value using a hashing algorithm, then converting that to an integer and then performing a modulo to get the remainder amount from a division. This method yields a value that has no relationship to the stored value. I created a function to perform this action.
CREATE FUNCTION [dbo].[ufn_GroupData]
/* based on value being hashed, generates a number within the modulo range to be stored and indexed rather than using an actual hashed value. */
DECLARE @Result INT;
SET @Result = HASHBYTES('SHA2_256', @String);
/*The SHA-2 algorithm is the current required hashing algorithm in most U.S. Gov't applications */
IF (@Divisor > 0)
SET @Result = @Result % @Divisor;
Using this function I was then able to create a grouping number that I could place in a column and index. In my case I set the @divisor value to 10,000. This gave me 20,000 groups with an average of 250 records in each grouping. SQL Server can handle going through and decrypting a couple hundred rows much faster than it could go through 5,000,000.
Since the old clustered index was removed when the original SSN column was removed I added a new one on the newly added SSN_HG column I created to hold the hash grouping number. Then all I had to do was add a single line to the where statement of the query so that the index could be used to narrow down the results.
OPEN SYMMETRIC KEY HGssnkey
DECRYPTION BY CERTIFICATE SSN_Encrypt
DECLARE @SSN VARCHAR(9)
SET @SSN = '068814962'
SELECT cust_name ,
CONVERT(VARCHAR(9), DECRYPTBYKEY(ssn_enc)) ,
WHERE CONVERT(VARCHAR(9), DECRYPTBYKEY(ssn_enc)) = @SSN
-- line added to use the new ssn_hg clustered index to look for the value
AND dbo.ufn_GroupData(@SSN, 10000) = SSN_HG
This time when I executed the query I got much better results.
The results were even better than I expected. The query was down to 3ms and using an index seek again!
Sure, it is slower than the original. Three times slower in fact, but my data is now protected and I am not suffering from having to perform full table/index scans. In my books that is a great success.The increase in the time to process the query is primarily due to the decryption and matching process to make sure the correct record is coming back. I feel that this small sacrifice in performance is acceptable. The reduction of the risk in having data leaked is well worth the small sacrifice in performance. Using these methods, no need ever see the sensitive data in the database. That is just another thing that makes me feel far more secure that my data at rest is protected.
Give me the code!
If you are interested in using any of the code I used, or the sample data you can download the various components directly using the following links.
You will need to create your own encryption keys and sample data. The Demo script does create keys, so you can reference it for syntax if needed.Get the Demo Script Get the Function Script