30 July 2008

On Proving your Identity

I purchased a new Dell Latitude D630 laptop today. I'm excited, but I was also quite cautious, since this is more than I'd ever spent on a laptop before. Nonetheless, my old laptop is effectively dead, and I need one that can last me through the next five years of grad school.

About an hour after I placed the order, I received a call from some 800 number. I had sort of a goofy conversation with them. Here's a summary of my conversation with him, and a later conversation I had with my bank:
Them: Hi this is Mr. Foo calling from Bar, Inc. We contract with your bank or credit union to provide fraud monitoring and prevention. We noticed some questionable transactions, and wanted to verify them with you.

Me: (unimpressed by the phrase "bank or credit union") Umm... Thanks.

Them: Could you please tell me your name and the last four of your social security number?

Me: I'd love to, but first, can you prove to me that you really are affiliated with my bank?

Them: Unfortunately, we cannot do that until you verify with us first. Besides, it's just the last four digits of your social security number---what harm is there in that?

Me: (trying not to laugh) Well, I use those last four digits to prove my identity to my bank-or-credit-union, my cell phone company, my health insurance company, and even this morning to prove my identity to my University. Those last four digits have become quite an important secret, and I don't give them out readily (any more).

Furthermore, because of the structure of social security numbers, the first five digits aren't very random. The first three digits, for instance, can be looked up as a function of the year and state of your birth, so I must assume they are public knowledge. If I tell you my last four, then you know everything except the middle two; you would be able to narrow my social down to 1 in 100.

No disrespect intended, Sir, but even if I assume that you are who you say you are, and that you have a legitimate relationship with my bank-or-credit-union, I have no assurance that your company has any decent standards for data protection, or that everyone who works at your company has good intentions.

Instead, I'll have to call my bank-or-credit-union directly. At least then, I'll know who I am speaking with.
I wasn't really trying to give this guy a hard time, but we have to protect ourselves. Given the time-coincidence between the purchase and the call, I was pretty sure that Mr. Foo was indeed who he claimed to be. Nonetheless, I opted to call my bank-or-credit-union to be sure, and had the whole situation cleared up.

My main point here is to call attention to the failings of our everyday methods of proving our identities. In nearly every context, an applicant claims and identity, and then provides a piece of knowledge that is known only to the applicant and the service provider. This could be your username/password on a website, your name/last-4 to your phone company, or your name/credit card/expiration date/billing address to a vendor. Variations of this scheme go on forever, and none of them is secure.

And the failing of all of these methods is: to prove your identity, you give away your identity. If the service provider on the other end (or even one bad employee at that service provider) decides to, he may impersonate you. If someone set up a website y0urbank.com that looked exactly like yourbank.com,
you may accidentally give them your identity. It's even possible that your service provider acts in good faith, but all of your identifying information is stolen from their databases by some malicious agent.

There are alternatives; I'll present two cryptographic solutions below. Neither of these is my invention. In this modern day and age---and considering that credit card companies are liable for fraud---I don't know why they aren't in common use.

While I'm on the topic, I should also announce that I am creating an open-source RF keyfob that performs the first authenication scheme below. That's why I developed SHA-1 for the PIC16 late last month). The device is still in the works, and has been delayed by the death of my laptop, but will hit the scene before I start grad school.

The first solution is relatively simple. Presume for a second that both the applicant (me) and the service provider (my bank-or-credit-union) share some secret S_me. You can think of this as a password, or more generally as a really long number. Instead of giving S_me away (and as a result, giving away my identity), I can prove that I know S_me. In order to authenticate:
  1. I tell my bank-or-credit-union my purported identity "Nick Johnson."
  2. My bank creates a nonce N_bank or in other words a long random number that is different each time.
  3. I compare N_bank to all of the nonces I have seen in the past. If I have seen it before, I reject it and start from the beginning.
  4. I create a second nonce N_me.
  5. I can now calculate a magic number by concatenating the digits of N_bank, S_me, and N_me. Please note that only the bank-or-credit-union and I could have produced this magic number, since only we know S_me.
  6. I compute the cryptographic hash H of the magic number.
  7. I send N_me and H to my bank-or-credit-union.
  8. My bank-or-credit-union compares N_me to all of the nonces I have sent in the past. If my bank-or-credit-union has ever seen this before, he rejects the whole transaction, and tells me to start from the beginning.
  9. My bank-or-credit-union, knowing N_bank, S_me, and N_me can now also compute H.
  10. My bank-or-credit-union compares Hs, and if they match, can be confident that I know the secret.
This way, even if someone overhears the entire conversation, they neither learn any information about my secret (because of properties of the hash function), nor do they have a dialog that they can replay (because the nonces must change each time). By performing this protocol again in reverse, my bank-or-credit-union can also prove it's identity to me.

The problem with the first scheme is the shared secret. The two parties must agree upon some secret, and communicate one to the other, hoping that no one overhears. Also, although best practice is to store passwords as hash values, this scheme requires both parties to store the passwords as cleartext, making data-theft a possible issue.

There is another algorithm to perform this sort of exchange---the classic zero-knowledge proof. It is more complicated conceptually, as well as in computation-time and -space, but doesn't require the service provider to store any secrets, and is very sexy. I would love to employ it in my RF keyfob project, but I don't think I can make it fit into a small, embedded microcontroller. Let me explain it using an analogy.
There's one slice of Pizza left, and we both want it. Neither of us trust the other to divide it evenly. So, instead, we agree that one of us will cut the slice in two, and the other has first pick from the halves. The pareto-optimal strategy for each player yields an equal half-slice; the first doesn't know what the second will choose, and so he must be prepared for either choice.

So, what if instead of knowing a secret, suppose instead that I knew the solution to a complicated problem---so complicated that it would take years to solve it. This knowledge can be used represent my identity, and I can prove that I know it without giving away the solution:
  1. Both my bank-or-credit-union and I agree upon a complicated problem to which only I know the solution.
  2. We repeat these steps until the bank is satisfied of my identity:
    1. I create a derived problem which is equally complicated to solve, but which I can solve easily since I know the solution to the original. It is important that knowing the solution to the derived problem gives no hint to the solution of the original.
    2. I tell my bank-or-credit-union about this derived problem. I have committed to this problem, and in doing so have claimed that I can demonstrate both that it is equivalent to the original problem and that I know an answer to it.
    3. I ask my bank-or-credit-union: Which would you like to see---the equivalence, or the solution? (Which half would you like to eat?)
    4. My bank-or-credit-union selects one, and I provide the requested information.
    5. My bank-or-credit-union verifies my response. If it's wrong, it rejects me. If it's right, it only knows that I wasn't wrong. In the worst case I predicted its choice in [4], and my bank-or-credit-union is now twice as confident that I know the solution to the original problem.
This aforementioned protocol, of course, requires a satisfactory problem to perform. The classic example is the problem of finding the Hamiltonian Cycle on an undirected graph. Since this problem is NP-Complete, we can make the problem arbitrarily hard by using a large enough graph. However, it is very easy to compute the Hamiltonian Cycle while you build such a graph (or perhaps, more accurately, it is easy to add a Hamiltonian Cycle to a totally random graph).

With the example above, I would generate a large graph and cycle when I open an new account with the bank, and I would tell the bank the graph, but I would keep the cycle as my secret. The secret doesn't need to travel, and so we can keep closer tabs on it.

Additionally, to perform the above protocol, I need some method by which I can create a derived problem. With the Hamiltonian Cycle problem, this is also easy: I choose a new random name for each vertex in the graph, thus producing a new problem, and I perform the same substitution on the cycle, thus producing the solution to the new problem. I have created a new problem, equally as large (and presumably, equally as difficult to solve), which I can prove equivalent to the original (by disclosing the random substitution), or which I can solve on demand (by applying the random substitution to the solution to the original). Even if I disclose the derived problem and its solution, my bank-or-credit-union gains no knowledge of the solution to the original. To do so, it would need to solve the graph isomorphism problem to find the relationship between the two problems. Graph isomorphism is in NP.

The major problem with both of these authentication protocols is that they require a lot of computation---more than any human would be able to do in his head. As a result, the human needs a piece of trusted hardware---analogous to a key---which can serve as his proxy in this exchange. Such hardware exists, often in the form of smart cards. Each implementation is a black box; no one is allowed to audit them to see if they really work the way they claim. Again, sit tight and you can play with my RF keyfob project soon.

However, like a key, anything you have can be stolen. To use this device effectively, it should be combined with something that can't be stolen, for instance something you know. This project is not the ultimate solution, just a step in the right direction.


Dennis Ferron said...

According to the Wikipedia page on NP-complete, the Boolean Satisfiability Problem (http://en.wikipedia.org/wiki/Boolean_satisfiability_problem) is also NP-complete. This could be computed efficiently by a microcontroller because you could use AND, OR, and NOT instructions to compute boolean functions on 8 or 16 bits in parallel, and the data representation would take up less RAM. Maybe you could use that instead of a directed graph to implement your second method?

Nick Johnson said...

Hi Dennis,

I like the way you're thinking, but you're tackling the wrong problem. The algorithms to generate a random graph with a known Hamiltonian cycle and to create a related problem via permutation are both simple, and each could fit within the microcontroller. The limitation of uCs is that they don't have very much ram/flash/eeprom to hold the problem and its secret solution.

Even if a problem (satisfiability, hamiltonian cycle, whatevs) is in NP, that doesn't necessarily mean that it will take a computer a long time to solve it; it simply means that the worst case time grows very quickly with the size of the problem. I did some preliminary investigation, and to establish a reasonable best case time, you need a really large graph (c. 1,000,000 vertices), and among graphs that large you can still find degenerates that are easy to solve.

For a completely random undirected graph, the space-complexity is O(N**2) for N vertices. The solution--a cycle--has space-complexity O(N) for N vertices, as does the permutation from the master problem to the derived problem. You might be able to cut the space complexity of the graph down to O(N) if you use a particularly sparse graph, but then it's even easier to find the hamiltonian cycle.

Believe it or not, to use a directed graph, you need more memory. It's still O(N**2), but with double the constant factor, because you cannot assume it to be symmetric.

A circuit in Satisfiability could be represented more efficiently. There are nice linearized representations of binary trees, that exploit implicit edges to cut the representation down to small-factor O(N) for N binary gates. Creating the derived problem would be a matter of renaming the inputs and possibly folding the tree. Perhaps I could shoe-horn that into a uC. Hmmm... perhaps...

Another alternative would be to store the problem on external memory. If you did that, you could make the problem as large as you want.

But then you run into different problems: an attacker could measure the signals between the uC and memory, and so you can't assume anything stored there will remain secret. At the other extreme, the EEPROM on a PIC16 features "code protection." I haven't seen any evaluation of the strength of code protection, but from what I understand, an attacker would need to open the chip and clear a fuse by manipulating the silicon wafer. That's good enough for my paranoia.

With the protocol as outlined, only the solution to the graph is considered secret. If one could fit the solution within the code protected EEPROM on the uC, and store the graph on external memory, that would work.

On a lighter note: how have you been? What's up with your blog?

Nick Johnson said...

Actually, upon reading your comment again, I think I need to give you a lot more credit than I did a second ago.

Using this linearized representation of a binary tree, you could represent each gate the SAT circuit as either an input or as a NAND gate, then you could get it down to 1 bit/gate, plus an ordered list of inputs. Though the ordered list would grow at O(D**2) for a circuit of average depth D. The correct solution is takes N-bits exactly.

Conversely, if you represented each gate as an 8-bit value, then you could say 0==AND, 1==OR, 2==NAND, 3==NOR and 4--255 represent any of up to 252 inputs, then you could represent a circuit of N gates as N bytes. I opt for NAND/NOR instead of NOT since the NOT has only one child, and is necessarily wasteful in space with the aforementioned b-tree representation. The solution still takes N-bits exactly.

Hmmm... Now I just need to examine how large a SAT problem needs to be before it's difficult in the best case.

Dennis Ferron said...
This comment has been removed by the author.
Dennis Ferron said...

So I'm thinking, Minesweeper Authentication FTW, right?!

Minesweeper (Clay Mathematics Institute)

@lighter note: I'm still around; haven't had much time to blog lately, but I made a post since you reminded me. :)