How To: Encrypt Large Files with Python and PyNacl

One limitation of pynacl's concise API is its lack of support for buffered reading. When it comes to large files we can't always load all data to memory in one chunk. This is how I dealt with the problem in a recent project.

The Mechanism

Pynacl can encrypt and authenticate short blocks (<=16KB is the recommended size). With larger files we'll want to read the file in chunks, encrypt and authenticate each chunk using pynacl's secret box, and then compute an HMAC over the entire encrypted data.

These are the steps taken:

  1. Create a secret encryption key and another secret sign key. Both keys must be kept secret. The example code below creates both keys in a subdirectory called "keys".

  2. It's ok to encrypt many files with the generated key, but since we're using a stream cipher we need a different IV for each file AND a different IV for each block.

  3. Therefore before encrypting a file we'll randomize a nonce base for that file. Then before encrypting each block we'll increment that nonce by the block index. This guarantees a unique IV per block and leaves a very low chance of IV reuse between files.

  4. After the entire file is encrypted we need to sign the encrypted data. Nacl's secret box already signs each block, but that's not enough. An adversary could change block order or remove blocks from the ciphered data, and we should protect against such attacks as well.

  5. Before decrypting the file we'll need to verify its signature and then decrypt each block.
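
The nonce scheme from steps 2 and 3 can be sketched with the standard library alone. This is a minimal illustration rather than the project code: NONCE_SIZE = 24 is assumed to match nacl.secret.SecretBox.NONCE_SIZE, os.urandom stands in for nacl.utils.random, and the wrap-around on overflow is my own addition.

```python
import os

NONCE_SIZE = 24  # assumed to equal nacl.secret.SecretBox.NONCE_SIZE


def new_nonce_base():
    """Pick a fresh random nonce base for one file."""
    return os.urandom(NONCE_SIZE)


def chunk_nonce(base, index):
    """Derive the nonce for block `index` by adding the index to the base.

    The sum wraps around modulo 2**(8 * NONCE_SIZE) so it always fits in
    NONCE_SIZE bytes, even when the random base is close to the maximum.
    """
    value = (int.from_bytes(base, byteorder='big') + index) % (1 << (8 * NONCE_SIZE))
    return value.to_bytes(NONCE_SIZE, byteorder='big')


base = new_nonce_base()
nonces = [chunk_nonce(base, i) for i in range(1000)]
# Every block of the file gets its own nonce.
print(len(set(nonces)) == len(nonces))
```

Within one file the nonces are distinct by construction. Across files, uniqueness rests on the random base being drawn from a 192-bit space.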

Now for the code.

Generate Keys

The following python program generates both secret keys: one for nacl's secret box and one for the HMAC:

import nacl.secret
import nacl.utils
import os

# Symmetric key for the secret box.
key = nacl.utils.random(nacl.secret.SecretBox.KEY_SIZE)
os.makedirs('keys', exist_ok=True)
os.chdir('keys')

with open('symkey.bin', 'wb') as f:
    f.write(key)

# Separate 64-byte key for the HMAC over the encrypted file.
auth_key = nacl.utils.random(size=64)
with open('authkey.bin', 'wb') as f:
    f.write(auth_key)

Keys are saved as keys/symkey.bin and keys/authkey.bin.

Encrypting A File

The following python program reads a file in chunks and encrypts each chunk. Note that the plaintext chunk size is 40 bytes less than 16KB. Nacl uses those 40 bytes to write the 24-byte nonce and the 16-byte block signature, so every encrypted block on disk is exactly 16KB (the last one may be shorter).

import nacl.secret
import nacl.utils
import sys

if len(sys.argv) != 4:
    sys.exit("Usage: {} <key> <input_file_name> <output_encrypted_file_name>".format(sys.argv[0]))

(_, keyfile, input_file, output_file) = sys.argv

def chunk_nonce(base, index):
    size = nacl.secret.SecretBox.NONCE_SIZE
    # Wrap around on overflow so the result always fits in `size` bytes.
    value = (int.from_bytes(base, byteorder='big') + index) % (1 << (8 * size))
    return value.to_bytes(size, byteorder='big')

def read_in_chunks(file_object, chunk_size=16 * 1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 16k."""
    index = 0
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield (data, index)
        index += 1

with open(keyfile, 'rb') as f:
    key = f.read()

box = nacl.secret.SecretBox(key)

nonce = nacl.utils.random(nacl.secret.SecretBox.NONCE_SIZE)
with open(output_file, 'wb') as fout:
    with open(input_file, 'rb') as fin:
        for chunk, index in read_in_chunks(fin, chunk_size=16*1024 - 40):
            enc = box.encrypt(chunk, chunk_nonce(nonce, index))
            fout.write(enc)

The program takes as input a key file, input file name and an output file name.
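
The chunk size arithmetic can be checked with a few lines of plain Python. This is only a sketch: the constants 24 and 16 are assumed to match nacl.secret.SecretBox.NONCE_SIZE and nacl.secret.SecretBox.MACBYTES, and encrypted_size is a hypothetical helper that is not part of the program above.

```python
NONCE_SIZE = 24                            # assumed: nacl.secret.SecretBox.NONCE_SIZE
MAC_SIZE = 16                              # assumed: nacl.secret.SecretBox.MACBYTES
BLOCK_SIZE = 16 * 1024                     # size of a full encrypted block on disk
OVERHEAD = NONCE_SIZE + MAC_SIZE           # 40 bytes added to every block
PLAIN_CHUNK_SIZE = BLOCK_SIZE - OVERHEAD   # plaintext bytes read per block


def encrypted_size(plain_size):
    """Return the expected size of the encrypted file for a given input size."""
    # Number of blocks, rounding up; an empty input produces no blocks.
    blocks = -(-plain_size // PLAIN_CHUNK_SIZE)
    return plain_size + blocks * OVERHEAD


print(encrypted_size(PLAIN_CHUNK_SIZE))      # exactly one full 16KB block
print(encrypted_size(PLAIN_CHUNK_SIZE + 1))  # one full block plus a short tail block
```

Comparing this number with the size of the output file is a cheap way to confirm that every block was written.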

Signing The Encrypted Data

The following python code uses HMAC to securely sign a given ciphered file:

import nacl.encoding
import sys
import binascii
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes, hmac

def read_in_chunks(file_object, chunk_size=16 * 1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 16k."""
    index = 0
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield (data, index)
        index += 1

if len(sys.argv) != 3:
    sys.exit("Usage: {} <auth_key> <input_file>".format(sys.argv[0]))

(_, key_file, input_file) = sys.argv

with open(key_file, 'rb') as f:
    auth_key = f.read()

with open(input_file, 'rb') as f:
    h = hmac.HMAC(auth_key, hashes.SHA512(), backend=default_backend())
    for chunk, _ in read_in_chunks(f):
        h.update(chunk)

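    # Decode so the hex digest prints without the b'...' wrapper and can be
    # passed as-is to the verification program.
    print(binascii.hexlify(h.finalize()).decode('ascii'))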

Since pynacl has no support for incremental HMAC, I had to use another library for this operation (pyca/cryptography).
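
Python's standard library hmac module also supports incremental updates, so the same tag could be computed without an extra dependency. The sketch below is an alternative rather than the code used in the project. hmac_file is a hypothetical helper name, and the in-memory io.BytesIO object stands in for an encrypted file.

```python
import hashlib
import hmac
import io


def hmac_file(auth_key, file_object, chunk_size=16 * 1024):
    """Compute HMAC-SHA512 over a file object, reading it chunk by chunk."""
    h = hmac.new(auth_key, digestmod=hashlib.sha512)
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        h.update(data)
    return h.hexdigest()


# Demo values standing in for the key file and the encrypted file.
demo_key = b'k' * 64
demo_data = b'x' * 50000
tag = hmac_file(demo_key, io.BytesIO(demo_data))
# Reading in chunks gives the same tag as hashing everything at once.
print(tag == hmac.new(demo_key, demo_data, hashlib.sha512).hexdigest())
```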

Verifying Signature Before Decrypting

Before decrypting the file we should verify its signature. The following program takes the auth key file, the encrypted file and its hex signature as command line arguments and prints "Valid" if the signature is valid (otherwise h.verify raises an InvalidSignature exception):

import nacl.encoding
import sys
import binascii
from cryptography.hazmat.backends import default_backend
from cryptography.hazmat.primitives import hashes, hmac

def read_in_chunks(file_object, chunk_size=16 * 1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 16k."""
    index = 0
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield (data, index)
        index += 1

if len(sys.argv) != 4:
    sys.exit("Usage: {} <key_file> <input_file> <sig>".format(sys.argv[0]))

(_, key_file, input_file, sig) = sys.argv

with open(key_file, 'rb') as f:
    auth_key = f.read()

sig_bytes = binascii.unhexlify(sig)

with open(input_file, 'rb') as f:
    h = hmac.HMAC(auth_key, hashes.SHA512(), backend=default_backend())
    for chunk, _ in read_in_chunks(f):
        h.update(chunk)

    h.verify(sig_bytes)

print("Valid")
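
To see why the whole-file HMAC matters, here is a small standard-library sketch with made-up block contents (my own illustration, not the project code). It shows that swapping or dropping blocks changes the tag even though each block on its own is untouched. file_tag and is_valid are hypothetical helper names. hmac.compare_digest performs a constant-time comparison.

```python
import hashlib
import hmac


def file_tag(auth_key, blocks):
    """HMAC-SHA512 over a sequence of encrypted blocks, in order."""
    h = hmac.new(auth_key, digestmod=hashlib.sha512)
    for block in blocks:
        h.update(block)
    return h.digest()


def is_valid(auth_key, blocks, expected_tag):
    """Return True only if the blocks match the tag, compared in constant time."""
    return hmac.compare_digest(file_tag(auth_key, blocks), expected_tag)


demo_key = b'k' * 64
blocks = [b'block-0', b'block-1', b'block-2']  # stand-ins for encrypted blocks
tag = file_tag(demo_key, blocks)

reordered = [blocks[1], blocks[0], blocks[2]]  # attacker swaps two blocks
truncated = blocks[:2]                         # attacker drops the last block

print(is_valid(demo_key, blocks, tag))     # True
print(is_valid(demo_key, reordered, tag))  # False
print(is_valid(demo_key, truncated, tag))  # False
```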

Decrypting The File

Finally we can safely decrypt the file and write the result:

import nacl.secret
import nacl.utils
import sys

if len(sys.argv) != 4:
    sys.exit("Usage: {} <key> <input_file_name> <output_decrypted_file_name>".format(sys.argv[0]))

(_, keyfile, input_file, output_file) = sys.argv

# No nonce helper is needed here: box.decrypt reads the nonce from the
# first 24 bytes of each encrypted chunk.

def read_in_chunks(file_object, chunk_size=16 * 1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 16k."""
    index = 0
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield (data, index)
        index += 1

with open(keyfile, 'rb') as f:
    key = f.read()

box = nacl.secret.SecretBox(key)

print("Decrypting {} to {}".format(input_file, output_file))

with open(output_file, 'wb') as fout:
    with open(input_file, 'rb') as fin:
        # Each chunk holds nonce + signature + ciphertext, as written by box.encrypt.
        for chunk, index in read_in_chunks(fin):
            plain = box.decrypt(chunk)
            fout.write(plain)

Note that the chunk size is now a full 16KB. The 40 bytes that store the nonce and signature must be passed to box.decrypt as part of the chunk.
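
Because each encrypted chunk starts with its nonce, the block order can also be sanity-checked before decrypting. The following sketch is an optional extra, not part of the original programs. It assumes the 24-byte nonce prefix layout and the base-plus-index scheme described earlier. split_chunk and nonces_are_sequential are hypothetical helper names, and the sample chunks are fabricated bytes rather than real ciphertext.

```python
NONCE_SIZE = 24  # assumed to equal nacl.secret.SecretBox.NONCE_SIZE


def split_chunk(chunk):
    """Split an encrypted chunk into its nonce prefix and the remaining bytes."""
    return chunk[:NONCE_SIZE], chunk[NONCE_SIZE:]


def nonces_are_sequential(chunks):
    """Check that chunk i carries the nonce base + i (wrapping on overflow)."""
    modulus = 1 << (8 * NONCE_SIZE)
    base = None
    for index, chunk in enumerate(chunks):
        nonce, _ = split_chunk(chunk)
        value = int.from_bytes(nonce, byteorder='big')
        if base is None:
            base = value  # the first chunk defines the base
        elif value != (base + index) % modulus:
            return False
    return True


def fake_chunk(nonce_value):
    """Build a fabricated chunk: a nonce followed by placeholder bytes."""
    return nonce_value.to_bytes(NONCE_SIZE, byteorder='big') + b'payload'


in_order = [fake_chunk(1000 + i) for i in range(4)]
swapped = [in_order[0], in_order[2], in_order[1], in_order[3]]

print(nonces_are_sequential(in_order))  # True
print(nonces_are_sequential(swapped))   # False
```

This does not replace the HMAC check. It cannot detect blocks removed from the end of the file, for example.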

Final Thoughts

Cryptography is hard. Most often when writing code it's easy to see if it's broken, so we can fix it. Crypto, on the other hand, requires skill and understanding just to see how broken a solution is.

As far as I can tell the code suggested above is secure, but I'd be more than happy to be proven wrong here (instead of in the real world). So if you can see anything broken, do share your thoughts in the comments below. And as always, if you can suggest other solutions, feel free to share them too.