Link to GitHub, Documentation for the impatient.
Introduction
DynamoDB is a powerful, proprietary NoSQL database service provided by Amazon. It allows you to pay for dedicated throughput, with predictable performance for 'any level of request traffic'. Scalability is handled for you, and table data is replicated across multiple availability zones. All that is left is for you to take your data and its access patterns, and make them work in the denormalized world of NoSQL.
Data Structure
The single most important part of using DynamoDB begins before you ever put data into it: designing the table(s) and keys. Keys (Amazon calls them primary keys) can be composed of a single attribute, called a hash key, or a compound key made up of a hash key and a range key. This key is used to uniquely identify an item in a table. The choice of the primary key is particularly important because of the way that Amazon accesses the data. Amazon shards (partitions) your data internally, based on this key. When you pay for provisioned throughput, that throughput is divided across those shards. If you choose a key with too little entropy, causing too many items to hash to the same shard, then you are limiting your own throughput.
Suppose that you wanted to create a table to represent stories submitted to Hacker News. You could use the following table:
| link (hash key) | title | points | days_old |
|---|---|---|---|
| http://goo.gl/LNp4u8 | 'The rise and rise of dogecoin' | 21 | 1 |
In this table we are storing the link, its title, the number of points it has, and how many days old it is. We're using the link as the hash key here, and it uniquely identifies one item. This allows you to retrieve the item in order to update its vote count. Actually, Amazon provides an atomic counter update via the UpdateItem operation, so you don't even have to retrieve the item first. You can even provide a mapping of attributes and their 'expected' values in order to apply a conditional update.
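To make that concrete, here is a rough sketch of what the request payload for such an UpdateItem call could look like against the table above, written as a Python dict (the key and the values shown are just illustrative):
# A sketch of an UpdateItem request payload: an atomic, conditional increment.
update_request = {
    'TableName': 'HackerNews',
    'Key': {'link': {'S': 'http://goo.gl/LNp4u8'}},
    # ADD performs an atomic increment on the 'points' attribute
    'AttributeUpdates': {
        'points': {'Action': 'ADD', 'Value': {'N': '1'}}
    },
    # The update only succeeds if 'points' currently has the expected value
    'Expected': {
        'points': {'Value': {'N': '21'}}
    }
}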
Indexes
So far we've chosen our hash key, the link, and we can update the vote count atomically. You may realize that in order to display a list of links, we need to be able to retrieve them without relying solely on the link attribute. In fact, we want to retrieve them based on the points attribute. This is what indexes are designed for: alternative access patterns. What we can do is define a global index on the days_old and points attributes. This would allow us to issue a query using the index, where we could get the highest voted links over a given set of days (yes, it's an overly simplistic model, but it works for this example).
This is known as a global secondary index by Amazon. It's global because it applies to the entire table, and secondary because the first real index is the primary hash key. In contrast, local secondary indexes are said to be local to a specific hash key. In that case you would have multiple items with the same hash key, but different range keys, and you could query those items using only the hash key.
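As a sketch of what such a query looks like at the API level, here is roughly the request payload you would send to the Query operation using the index, written with DynamoDB's KeyConditions syntax (the index name days_index is an assumption that matches the example later in this post):
# A sketch of a Query request against the global secondary index,
# fetching the highest-voted links that are one day old.
query_request = {
    'TableName': 'HackerNews',
    'IndexName': 'days_index',       # assumed index name
    'KeyConditions': {
        'days_old': {
            'ComparisonOperator': 'EQ',
            'AttributeValueList': [{'N': '1'}]
        }
    },
    'ScanIndexForward': False,       # descending order on the index range key (points)
    'Limit': 25
}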
Python
I actually had a use case in mind for DynamoDB when I set out to learn it. After studying the concepts I was ready to try it out. My particular use case needed secondary indexes in order to work. Python being my normal language of choice, I went Googling for a Python interface to DynamoDB. I found dynamodb-mapper, which looked very promising. It has many nice features, such as schema validation and attribute type mapping. I got my model defined quickly, and then got ready to create my indexes... only to discover that they aren't supported. Darn. OK, no problem, it's open source - I can fix it. I dug into the source, and found that it uses the widely used boto library. I thought to myself, 'Great! I know boto, I'll have this working in no time!'. That's when I discovered that boto has two versions of its DynamoDB library. Apparently, DynamoDB has changed enough that the authors of boto decided to start over. After looking into the dynamodb-mapper code, I realized it would be no small feat to port it from boto.dynamodb to boto.dynamodb2. I even found a pull request where someone attempted it, but the pull request wasn't merged.
That's when I decided to write my own. I started with the syntax I wanted, and then worked backward. I also started with Python 3, and supported Python 2 as an afterthought (and so should you!). Because I know boto, I assumed it wouldn't be too hard. But... boto doesn't support Python 3. Actually, boto is the most popular Python package to not support Python 3 (which is 6 years old!).
I couldn't use boto, but all was not lost. There is another library, botocore, written by the same people. It's a much smaller library, providing a minimalist layer on top of Amazon's web API, but it was enough. I sacrificed an entire weekend, but now it's done.
PynamoDB
PynamoDB is an attempt at a Pythonic interface to DynamoDB that supports all of DynamoDB's powerful features in both Python 3 and Python 2. This includes properly handling unicode and binary attributes, local secondary indexes, and global secondary indexes. Other features include:
- Sets for Binary, Number, and Unicode attributes (see the sketch after this list)
- Automatic pagination for bulk operations
- Iterators for Scan, Query, BatchGet operations
- Context managers for batch operations
- Automatic paging (in progress)
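As a quick sketch of those set attributes, here is how they might be declared on a model (the Story model and its attributes are made up for illustration, and follow the same style as the example below):
from pynamodb.models import Model
from pynamodb.attributes import UnicodeAttribute, UnicodeSetAttribute, NumberSetAttribute

class Story(Model):
    """
    A hypothetical model using set attributes
    """
    table_name = 'Stories'
    link = UnicodeAttribute(hash_key=True)
    tags = UnicodeSetAttribute(null=True)      # a set of unicode strings
    voter_ids = NumberSetAttribute(null=True)  # a set of numbers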
Example
Here is how you can create a table, with indexes, using PynamoDB.
from pynamodb.models import Model
from pynamodb.indexes import GlobalSecondaryIndex, AllProjection
from pynamodb.attributes import UnicodeAttribute, NumberAttribute

class DaysIndex(GlobalSecondaryIndex):
    """
    This class represents a global secondary index
    """
    read_capacity_units = 2
    write_capacity_units = 1
    projection = AllProjection()
    days_old = NumberAttribute(hash_key=True)

class HackerNewsLinks(Model):
    """
    A test model that uses a global secondary index
    """
    table_name = 'HackerNews'
    link = UnicodeAttribute(hash_key=True)
    title = UnicodeAttribute()
    days_index = DaysIndex()
    days_old = NumberAttribute(default=0)

if not HackerNewsLinks.exists():
    HackerNewsLinks.create_table(read_capacity_units=1, write_capacity_units=1)

# Indexes can be queried easily using the index's hash key
for item in HackerNewsLinks.days_index.query(1):
    print("Item queried from index: {0}".format(item))
Here is the table creation again (without the query), but using botocore instead:
from botocore.session import get_session

kwargs = {
    'read_capacity_units': 1,
    'write_capacity_units': 1,
    'attribute_definitions': [
        {
            'attribute_type': 'STRING',
            'attribute_name': 'link'
        },
        {
            'attribute_type': 'NUMBER',
            'attribute_name': 'days_old'
        }
    ],
    'key_schema': [
        {
            'key_type': 'HASH',
            'attribute_name': 'link'
        },
    ],
    'global_secondary_indexes': [
        {
            'index_name': 'days_index',
            'key_schema': [
                {
                    'KeyType': 'HASH',
                    'AttributeName': 'days_old'
                }
            ],
            'projection': {
                'ProjectionType': 'ALL'
            },
            'provisioned_throughput': {
                'ReadCapacityUnits': 1,
                'WriteCapacityUnits': 1,
            }
        }
    ],
}

session = get_session()
service = session.get_service('dynamodb')
endpoint = service.get_endpoint('us-east-1')
operation = service.get_operation('CreateTable')
operation.call(endpoint, **kwargs)