581d-databases: Python oriented databases

543 days ago by wstein

Math 581d: Nov 29, 2010:  Persistent Python Databases for Sage -- beyond pickle

NOTE: Be aware of trac 10352, which has an spkg I installed before writing this lecture, which updates ZODB to the latest version.  

In this lecture we'll talk about:

  • Python's dbm modules
  • Python's shelve module
  • The ZODB Zope Object Database.

We assume the reader understands Python's pickle system, as explained in the previous lecture. 

Our running sample database problem is: make a database of factorizations of interesting integers.   This is a good example because:

  • Computation of the data can take a huge amount of time (but verification of correctness is fast)
  • There are many queries you could imagine making on the database (e.g., numbers with big factors only, which numbers have only 2 factors, special forms of factorizations).
  • The data structure isn't completely trivial: a factorization is a list of pairs (p,e), the primes can be too big to fit in a machine word, describing special structure of input e.g. "2^p-1", etc.
  • It is easy to make simple versions of the database quickly using Sage.  
  • For testing purposes, it is also easy to make a massive database of integer factorizations, just by factoring a huge number of boring integers (or just multiplying them), so this is good for testing scalability.
  • It's realistic: there are such databases out there that are the subject of active research right now. See, e.g., the Cunningham project

 

 

 
       
 
       

dbm -- "Simple Database Interface"

Official Python docs: http://docs.python.org/library/anydbm.html

Python comes with something called anydbm, which implements a very simple, but potentially very powerful database.   What you get is a persistent (to disk) object that works much like a Python dictionary, except the keys and values must both be strings.   The neat thing is that anydbm uses the best implementation of a dbm available when you build Sage from source: one of several possible fast C implementations, or possibly a really slow stupid Python one if none of those was available.  So it just works.  

Let's try it out. 

First, create a database.

file = DATA + 'anydbm.db' if os.path.exists(file): os.unlink(file) import anydbm D = anydbm.open(file, 'c'); D 
       
<dbm.dbm object at 0x10ca74070>
<dbm.dbm object at 0x10ca74070>

For us, it turned out this is using dbm, which is the standard UNIX dbm interface.    I think the OS X implementation of this is "ndbm".

print anydbm._defaultmod 
       
<module 'dbm' from
'/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/lib-dyn\
load/dbm.so'>
<module 'dbm' from '/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/lib-dynload/dbm.so'>
D['6'] = '[(2,1), (3,1)]' D['12'] = '[(2,2), (3,1)]' 
       

You can only store strings, and the keys must be strings.

D['1'] = [] 
       
Traceback (click to the left of this block for traceback)
...
TypeError: dbm mappings have string elements only
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_sage_input_5.py", line 10, in <module>
    exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("RFsnMSddID0gW10="),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
  File "", line 1, in <module>
    
  File "/private/var/folders/7y/7y-O1iZOGTmMUMnLq7otq++++TI/-Tmp-/tmpw07GFU/___code___.py", line 2, in <module>
    exec compile(u"D['1'] = []" + '\n', '', 'single')
  File "", line 1, in <module>
    
TypeError: dbm mappings have string elements only
D[1] = '[]' 
       
Traceback (click to the left of this block for traceback)
...
TypeError: dbm mappings have string indices only
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_sage_input_6.py", line 10, in <module>
    exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("RFsxXSA9ICdbXSc="),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
  File "", line 1, in <module>
    
  File "/private/var/folders/7y/7y-O1iZOGTmMUMnLq7otq++++TI/-Tmp-/tmp39_Pae/___code___.py", line 3, in <module>
    exec compile(u"D[_sage_const_1 ] = '[]'" + '\n', '', 'single')
  File "", line 1, in <module>
    
TypeError: dbm mappings have string indices only

Now we close the database, then reopen and see that it really stored our data.  If you don't explicitly close the database, then the data might not get stored to disk.

D.close() 
       
D.keys() 
       
Traceback (click to the left of this block for traceback)
...
dbm.error: DBM object has already been closed
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_sage_input_8.py", line 10, in <module>
    exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("RC5rZXlzKCk="),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
  File "", line 1, in <module>
    
  File "/private/var/folders/7y/7y-O1iZOGTmMUMnLq7otq++++TI/-Tmp-/tmpoVX_D9/___code___.py", line 2, in <module>
    exec compile(u'D.keys()
  File "", line 1, in <module>
    
dbm.error: DBM object has already been closed
import anydbm D = anydbm.open(DATA + 'anydbm', 'c') D.keys() 
       
['6', '12']
['6', '12']
D['6'] 
       
'[(2,1), (3,1)]'
'[(2,1), (3,1)]'
D['12'] 
       
'[(2,2), (3,1)]'
'[(2,2), (3,1)]'
eval(D['12']) 
       
[(2, 2), (3, 1)]
[(2, 2), (3, 1)]
del D['6'] D.keys() 
       
['12']
['12']
D.close() 
       
D = anydbm.open(DATA + 'anydbm', 'c') D.keys() 
       
['12']
['12']
 
       
 
       

Key Point: This is dramatically different than just storing a pickle and loading it, because if there are billions of key/values, only the ones you actually use get loaded into memory.   Databases are all about scalability!

According to Python and other docs, there are various limitations on anydbm, depending on which database is actually used (which depends on how/where Python/Sage was built!).  On my OS X laptop, NDBM is used, which has severe limits, though I had trouble finding out what they are -- one page says "there is a limit on the total size of the key/content pairs that can be stored (this ranges from 1018 bytes to 4096 bytes)".

Let's break stuff.

file = DATA + 'bigdbm.db' if os.path.exists(file): os.unlink(file) import anydbm bigdb = anydbm.open(DATA + 'bigdbm', 'c'); bigdb 
       
<dbm.dbm object at 0x10ca74070>
<dbm.dbm object at 0x10ca74070>
s = 'x'*(20 * 10^6) # a 20-megabyte string. bigdb['1'] = s 
       
bigdb.close() print os.system('ls -lh %s'%DATA) 
       
total 39336
-rw-r--r--  1 wstein  staff    16K Nov 28 15:38 anydbm.db
-rw-r--r--  1 wstein  staff    19M Nov 28 15:41 bigdbm.db
-rw-r--r--  1 wstein  staff   168B Nov 28 14:08 zodb.fs
-rw-r--r--  1 wstein  staff   279B Nov 28 14:12 zodb.fs.index
-rw-r--r--  1 wstein  staff     6B Nov 28 14:13 zodb.fs.lock
-rw-r--r--  1 wstein  staff     0B Nov 28 14:13 zodb.fs.tmp
0
total 39336
-rw-r--r--  1 wstein  staff    16K Nov 28 15:38 anydbm.db
-rw-r--r--  1 wstein  staff    19M Nov 28 15:41 bigdbm.db
-rw-r--r--  1 wstein  staff   168B Nov 28 14:08 zodb.fs
-rw-r--r--  1 wstein  staff   279B Nov 28 14:12 zodb.fs.index
-rw-r--r--  1 wstein  staff     6B Nov 28 14:13 zodb.fs.lock
-rw-r--r--  1 wstein  staff     0B Nov 28 14:13 zodb.fs.tmp
0

Now try to reopen it.

bigdb = anydbm.open(DATA + 'bigdbm', 'c') 
       
bigdb 
       
<dbm.dbm object at 0x10ca740f0>
<dbm.dbm object at 0x10ca740f0>

It seems to work fine. So, I guess we didn't hit a size limit. It's worrisome though.

len(bigdb['1']) 
       
20000000
20000000
del bigdb['1'] bigdb.close() 
       
bigdb = anydbm.open(DATA + 'bigdbm', 'c') len(bigdb['1']) 
       
Traceback (click to the left of this block for traceback)
...
KeyError: '1'
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_sage_input_46.py", line 10, in <module>
    exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("YmlnZGIgPSBhbnlkYm0ub3BlbihEQVRBICsgJ2JpZ2RibScsICdjJykKbGVuKGJpZ2RiWycxJ10p"),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
  File "", line 1, in <module>
    
  File "/private/var/folders/7y/7y-O1iZOGTmMUMnLq7otq++++TI/-Tmp-/tmpifnMje/___code___.py", line 3, in <module>
    exec compile(u"len(bigdb['1'])" + '\n', '', 'single')
  File "", line 1, in <module>
    
KeyError: '1'

But the database file is still big:

print os.system('ls -lh %s'%DATA) 
       
total 39336
-rw-r--r--  1 wstein  staff    16K Nov 28 15:38 anydbm.db
-rw-r--r--  1 wstein  staff    19M Nov 28 15:42 bigdbm.db
-rw-r--r--  1 wstein  staff   168B Nov 28 14:08 zodb.fs
-rw-r--r--  1 wstein  staff   279B Nov 28 14:12 zodb.fs.index
-rw-r--r--  1 wstein  staff     6B Nov 28 14:13 zodb.fs.lock
-rw-r--r--  1 wstein  staff     0B Nov 28 14:13 zodb.fs.tmp
0
total 39336
-rw-r--r--  1 wstein  staff    16K Nov 28 15:38 anydbm.db
-rw-r--r--  1 wstein  staff    19M Nov 28 15:42 bigdbm.db
-rw-r--r--  1 wstein  staff   168B Nov 28 14:08 zodb.fs
-rw-r--r--  1 wstein  staff   279B Nov 28 14:12 zodb.fs.index
-rw-r--r--  1 wstein  staff     6B Nov 28 14:13 zodb.fs.lock
-rw-r--r--  1 wstein  staff     0B Nov 28 14:13 zodb.fs.tmp
0
 
       
 
       

Pros:

  • Really simple to learn and understand.
  • You can store a huge number of records efficiently.

Cons:

  • You can only store strings.
  • The database file stays large even when you delete keys and close.
  • Very unclear  what the maximum record size is, or even what database implementation you'll be using...  The Python docs say the following scary thing: "The database is also (unfortunately) subject to the limitations of dbm, if it is used — this means that (the pickled representation of) the objects stored in the database should be fairly small, and in rare cases key collisions may cause the database to refuse updates."
 
       
 
       
 
       

shelve

Shelve is a standard Python module [Python docs] that provides a disk-persistent dictionary like object.     It is built on top of anydbm.  They main point is that the values of in the dictionary do not have to be strings.  This is done by simply pickling the objects you store, then storing the pickles.  You can read the complete source code to shelve in a few minutes -- it's just a light wrapper on anydbm. 

Let's take shelve for a spin.

 

file = DATA + 'shelve.db' if os.path.exists(file): os.unlink(file) import shelve D = shelve.open(file, 'c'); D 
       
{}
{}
D[6] = factor(6) 
       
Traceback (click to the left of this block for traceback)
...
TypeError: can't pickle SageObject objects
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_sage_input_52.py", line 10, in <module>
    exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("RFs2XSA9IGZhY3Rvcig2KQ=="),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
  File "", line 1, in <module>
    
  File "/private/var/folders/7y/7y-O1iZOGTmMUMnLq7otq++++TI/-Tmp-/tmp5OH57B/___code___.py", line 3, in <module>
    exec compile(u'D[_sage_const_6 ] = factor(_sage_const_6 )
  File "", line 1, in <module>
    
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python/shelve.py", line 132, in __setitem__
    p.dump(value)
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle SageObject objects

But no worries, there is a protocol option to shelve, which specifies the pickle protocol to use. Sage objects are Cython classes, and require protocol 2.

file = DATA + 'shelve' if os.path.exists(file+'.db'): os.unlink(file+'.db') import shelve D = shelve.open(file, 'c', protocol=2); D 
       
{}
{}
D['6'] = factor(6) D['12'] = factor(12) 
       
D['12'] 
       
2^2 * 3
2^2 * 3

Write it to disk, then reload and see what we got.

D.close() os.system('ls -lh %s/shelve*'%DATA) 
       
-rw-r--r--  1 wstein  staff    16K Nov 28 16:07
/Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0
-rw-r--r--  1 wstein  staff    16K Nov 28 16:07 /Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0
D['6'] 
       
Traceback (click to the left of this block for traceback)
...
ValueError: invalid operation on closed shelf
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_sage_input_97.py", line 10, in <module>
    exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("RFsnNidd"),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
  File "", line 1, in <module>
    
  File "/private/var/folders/7y/7y-O1iZOGTmMUMnLq7otq++++TI/-Tmp-/tmpQGizJZ/___code___.py", line 2, in <module>
    exec compile(u"D['6']" + '\n', '', 'single')
  File "", line 1, in <module>
    
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python/shelve.py", line 121, in __getitem__
    f = StringIO(self.dict[key])
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python/shelve.py", line 79, in closed
    raise ValueError('invalid operation on closed shelf')
ValueError: invalid operation on closed shelf
D = shelve.open(file, 'c', protocol=2); D 
       
{'12': 2^2 * 3, '6': 2 * 3}
{'12': 2^2 * 3, '6': 2 * 3}
D['6'] 
       
2 * 3
2 * 3

Great!  Let's try putting a big object in there, say a 10-million digit integer.  Watch out -- sadly printing of shelve objects is totally massively braindead, and will try to load the whole database, so print out the whole damn 20-million digit number, so do not ever try to print D once you have some nontrivial data in it!

D.__repr__?? 
       

File: /Users/wstein/purple/install/psage-10.10.26/local/lib/python/UserDict.py

Source Code (starting at line 170):

def __repr__(self):
    return repr(dict(self.iteritems()))

File: /Users/wstein/purple/install/psage-10.10.26/local/lib/python/UserDict.py

Source Code (starting at line 170):

def __repr__(self):
    return repr(dict(self.iteritems()))
D['test'] = 11^(10^7) os.system('ls -lh %s/shelve*'%DATA) 
       
-rw-r--r--  1 wstein  staff   6.6M Nov 28 16:07
/Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0
-rw-r--r--  1 wstein  staff   6.6M Nov 28 16:07 /Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0

As you can see above, the file suddenly got way bigger, so the database got written to immediately this time, for some reason...

The following loads a new copy of the integer from disk into memory.

time n = D['test'] 
       
Time: CPU 0.05 s, Wall: 0.05 s
Time: CPU 0.05 s, Wall: 0.05 s
n.ndigits() 
       
10413927
10413927
D.close() 
       

Gotcha: the D.keys() method doesn't work as you might think until a query is made.  If you want to keep track of what you put into the database, you should store that as a special entry as well.

D = shelve.open(DATA+'shelve', 'c', protocol=2) D.keys() 
       
[]
[]
D.has_key('test') 
       
True
True
time n = D['test'] 
       
Time: CPU 0.05 s, Wall: 0.05 s
Time: CPU 0.05 s, Wall: 0.05 s
n.ndigits() 
       
10413927
10413927
D.keys() 
       
['test', '12', '6']
['test', '12', '6']
 
       

So we'll store which numbers we've factored in a set.

D['numbers'] = set([6, 12]) 
       
# the following has an intentional bug! for p in primes(150): n = 2^p-1 f = factor(n) D[str(n)] = f D['numbers'].add(n) 
       
D['numbers'] 
       
set([12, 6])
set([12, 6])

Hey, what happened?  Each time we do D['numbers'] a new copy of the set is loaded from disk, then appended to, then forgotten (since we don't save it).  There is a writeback=True option we could pass to open, but it has a bunch of its own problems.  So we instead do the following:

# the works fine, but has scalability issues. for p in primes(150): n = 2^p-1 f = factor(n) D[str(n)] = f nums = D['numbers'] nums.add(n) D['numbers'] = nums 
       
D['numbers'] 
       
set([3, 147573952589676412927, 7, 8191, 12, 2147483647, 131071,
576460752303423487, 524287, 31, 8388607, 137438953471, 6, 2047,
10141204801825835211973625643007, 140737488355327,
2722258935367507707706996859454145691647, 604462909807314587353087,
158456325028528675187087900671, 2361183241434822606847, 8796093022207,
2535301200456458802993406410751, 9444732965739290427391, 2199023255551,
9671406556917033397649407, 170141183460469231731687303715884105727,
162259276829213363391578010288127, 127, 618970019642690137449562111,
2305843009213693951, 696898287454081973172991196020261297061887,
10384593717069655257060992658440191, 9007199254740991,
649037107316853453566312041152511, 536870911,
174224571863520493293247799005065324265471,
713623846352979940529142984724747568191373311])
set([3, 147573952589676412927, 7, 8191, 12, 2147483647, 131071, 576460752303423487, 524287, 31, 8388607, 137438953471, 6, 2047, 10141204801825835211973625643007, 140737488355327, 2722258935367507707706996859454145691647, 604462909807314587353087, 158456325028528675187087900671, 2361183241434822606847, 8796093022207, 2535301200456458802993406410751, 9444732965739290427391, 2199023255551, 9671406556917033397649407, 170141183460469231731687303715884105727, 162259276829213363391578010288127, 127, 618970019642690137449562111, 2305843009213693951, 696898287454081973172991196020261297061887, 10384593717069655257060992658440191, 9007199254740991, 649037107316853453566312041152511, 536870911, 174224571863520493293247799005065324265471, 713623846352979940529142984724747568191373311])

We have our database, and it works!

D.close() D = shelve.open(DATA+'shelve', 'c', protocol=2) D[str(2^97-1)] 
       
11447 * 13842607235828485645766393
11447 * 13842607235828485645766393

Let's make a table:

for n in sorted(D['numbers']): if n > 10^30: break print '%-30s\t%s'%(n, D[str(n)]) 
       
3                             	3
6                             	2 * 3
7                             	7
12                            	2^2 * 3
31                            	31
127                           	127
2047                          	23 * 89
8191                          	8191
131071                        	131071
524287                        	524287
8388607                       	47 * 178481
536870911                     	233 * 1103 * 2089
2147483647                    	2147483647
137438953471                  	223 * 616318177
2199023255551                 	13367 * 164511353
8796093022207                 	431 * 9719 * 2099863
140737488355327               	2351 * 4513 * 13264529
9007199254740991              	6361 * 69431 * 20394401
576460752303423487            	179951 * 3203431780337
2305843009213693951           	2305843009213693951
147573952589676412927         	193707721 * 761838257287
2361183241434822606847        	228479 * 48544121 * 212885833
9444732965739290427391        	439 * 2298041 * 9361973132609
604462909807314587353087      	2687 * 202029703 * 1113491139767
9671406556917033397649407     	167 * 57912614113275649087721
618970019642690137449562111   	618970019642690137449562111
158456325028528675187087900671	11447 * 13842607235828485645766393
3                             	3
6                             	2 * 3
7                             	7
12                            	2^2 * 3
31                            	31
127                           	127
2047                          	23 * 89
8191                          	8191
131071                        	131071
524287                        	524287
8388607                       	47 * 178481
536870911                     	233 * 1103 * 2089
2147483647                    	2147483647
137438953471                  	223 * 616318177
2199023255551                 	13367 * 164511353
8796093022207                 	431 * 9719 * 2099863
140737488355327               	2351 * 4513 * 13264529
9007199254740991              	6361 * 69431 * 20394401
576460752303423487            	179951 * 3203431780337
2305843009213693951           	2305843009213693951
147573952589676412927         	193707721 * 761838257287
2361183241434822606847        	228479 * 48544121 * 212885833
9444732965739290427391        	439 * 2298041 * 9361973132609
604462909807314587353087      	2687 * 202029703 * 1113491139767
9671406556917033397649407     	167 * 57912614113275649087721
618970019642690137449562111   	618970019642690137449562111
158456325028528675187087900671	11447 * 13842607235828485645766393
 
       

Note: deleting that big entry doesn't make the database smaller, unfortunately.    This is the same problem we had with anydbm above.  Your mileage may vary.  For most research applications, this may be fine.

del D['test'] D.close() os.system('ls -lh %s/shelve*'%DATA) 
       
-rw-r--r--  1 wstein  staff   6.6M Nov 28 16:17
/Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0
-rw-r--r--  1 wstein  staff   6.6M Nov 28 16:17 /Users/wstein/.sage/psage_notebook.sagenb/home/admin/31/data//shelve.db
0
 
       

Let's store a hundred thousand factorizations in the database, close it, then reopen it and see how long it takes to load a factorization.

%time nums = D['numbers'] for n in range(1,10^5): f = factor(n) D[str(n)] = f nums.add(n) D['numbers'] = nums 
       
CPU time: 22.23 s,  Wall time: 22.35 s
CPU time: 22.23 s,  Wall time: 22.35 s
D.close() 
       
time D = shelve.open(DATA+'shelve', 'c', protocol=2) time k = D['2010'] 
       
Time: CPU 0.00 s, Wall: 0.00 s
Time: CPU 0.00 s, Wall: 0.00 s
Time: CPU 0.00 s, Wall: 0.00 s
Time: CPU 0.00 s, Wall: 0.00 s

Compare doing the same with one huge Python dictionary that is pickled, and you begin to see the scalability differences!

%time D2 = {} for n in range(1,10^5): f = factor(n) D2[str(n)] = f 
       
CPU time: 15.03 s,  Wall time: 15.04 s
CPU time: 15.03 s,  Wall time: 15.04 s
time save(D2, DATA+'pickle1.sobj') 
       
Time: CPU 9.18 s, Wall: 9.19 s
Time: CPU 9.18 s, Wall: 9.19 s
time D2 = load(DATA + 'pickle1.sobj') time k = D['2010'] 
       
Time: CPU 3.30 s, Wall: 3.30 s
Time: CPU 0.00 s, Wall: 0.00 s
Time: CPU 3.30 s, Wall: 3.30 s
Time: CPU 0.00 s, Wall: 0.00 s

I hope the above example gives you a sense of scalability of different approaches.  In particular, the database will easily scale out to millions or billions of numbers.  The dict will fail miserably, with load times taking hours, and memory usage skyrocketing.     Making a billion files on your filesystem (one for each integer), obviously won't work either. It works fine for a few numbers, and fails miserably for a lot of them (as I've learned the hard way...).

And so far we aren't using much, just some very simple Python code built on top of dbm, which is almost from the original UNIX days of the 1970s...  oh, and pickle, of course.

What we have done so far in this worksheet is the bare minimum you need to know about in order to use Sage to make a genuinely scalable research oriented code.  The cool thing is that you can mostly understand everything deeply -- it isn't too complicated.   (That said, I once took a semester course at Harvard by the co-author of berkeley db, which is a dbm implementation, and the implementations of programs that provide the dbm api get potentially pretty sophisticated.)

 
       

Scalability: Size isn't the only thing that matters -- concurrent writes.

Another key part of scalability is having support for more than one process simultaneously write to a single database.  You might think this doesn't matter for you, and maybe it doesn't.  But for research math applications, it often does.  Imagine you have a cluster of computers with 120 cores total, and you want to launch 120 processes that all compute some integer factorizations, and store their results in the database.  Their results might be a few hundred million integer factorizations (say).  If you try to do this using the above approach, you'll run into serious hell:

  1. The database file has to be stored somewhere that all the processes have read/write access to, and every read/write will involve updating that file.
  2. If multiple processes are all trying to read/write, they will just mess each other up, and the database file will soon become totally corrupted.  Game over.

For some applications, using shelve works. If you only need to scale up in size, or will mainly be using the database for read-only use, e.g., you compute a few gigabytes of data stored in files all over, and want to put it all in a database so you can work with it more easily (e.g., Cremona's tables, which are read only).  Then you just make one quick pass through the files, build up a database, and work with that. 

 
       

Shelve pros and cons.

Pros:

  • Much easier to use than anydbm, since it uses pickle.  (In Sage, make sure to use protocol=2!).
  • Scales up size wise.
  • Fast, assuming anydbm turns out to be fast on your system.

Cons:

  • Total failure to scale up for multiple simultaneous writes.
  • Performance depends on what anydbm you have (so even the underlying database file need not be portable!). 
  • No way to do interesting queries on your data. 
  • No undo (transaction support).
 
       
 
       

ZODB

ZODB is an object oriented Python database.  It's pretty much the canonical one for this, and has been around for a looooong time.    It's kind of like SQLite in that the entire database is usually stored in a single file (so if your database will be at large, make sure your filesystem supports files that are bigger than 2GB!). 

Some Pros:

  • ZODB is actively maintained and there are new bugfix releases (and minor feature improvements) regularly.  It's definitely not dead.
  • ZODB is mature and stable.
  • ZODB supports transactions: "Full ACID-compatible transaction support (including savepoints); History/undo ability. 
  • In some cases, a perfect intermediate between going with a full database and just writing files to disk.   Can be a good part of a math research workflow.
  • Included in every copy of Sage, and fairly easy to easy_install into any Python.
  • ZODB does have support for simultaneous writes but you have to use something called ZEO, which can have very serious performance and security implications. 

Cons: 

  • Your data is tied to Python and your Python code (via pickles).  It's not language agnostic like sqlite or mongodb.   
  • Indexing and querying of data is severely limited compared to what is possible with SQL or noSQL databases.
  • I can't get their test suite to work.
  • I can't get it to pickle SageObject, so it's usefulness is severely crippled in practice.
  • An optional install in Python, with several dependencies.
 
       

Here's a math-oriented tutorial to wet your appetite.  We're in the Sage notebook, so there is a variable DATA that is a scratch directory where we can create a file.  This is where we'll put our database.

from ZODB.FileStorage import FileStorage from ZODB.DB import DB import transaction 
       
file = DATA + 'zodb.fs' os.system('rm -f %s/zodb.fs*'%DATA) 
       
0
0
storage = FileStorage(file) db = DB(storage) connection = db.open() root = connection.root() 
       
 
       

To store data in the database:  just use root like a dictionary and store objects.  It's as easy as that.

To actually permanently save them, you use transaction.commit(), or to abort (and not store them), do transaction.abort(). 

Sample application: making a database of factorizations of integers.

for p in primes(100,150): n = 2^p - 1 print p, time F = factor(n) root[n] = F 
       
101 Time: CPU 0.03 s, Wall: 0.03 s
103 Time: CPU 0.03 s, Wall: 0.03 s
107 Time: CPU 0.00 s, Wall: 0.00 s
109 Time: CPU 0.04 s, Wall: 0.04 s
113 Time: CPU 0.00 s, Wall: 0.00 s
127 Time: CPU 0.00 s, Wall: 0.00 s
131 Time: CPU 0.01 s, Wall: 0.01 s
137 Time: CPU 0.22 s, Wall: 0.24 s
139 Time: CPU 0.25 s, Wall: 0.27 s
149 Time: CPU 0.50 s, Wall: 0.53 s
101 Time: CPU 0.03 s, Wall: 0.03 s
103 Time: CPU 0.03 s, Wall: 0.03 s
107 Time: CPU 0.00 s, Wall: 0.00 s
109 Time: CPU 0.04 s, Wall: 0.04 s
113 Time: CPU 0.00 s, Wall: 0.00 s
127 Time: CPU 0.00 s, Wall: 0.00 s
131 Time: CPU 0.01 s, Wall: 0.01 s
137 Time: CPU 0.22 s, Wall: 0.24 s
139 Time: CPU 0.25 s, Wall: 0.27 s
149 Time: CPU 0.50 s, Wall: 0.53 s
root 
       
{713623846352979940529142984724747568191373311: 86656268566282183151 *
8235109336690846723986161, 10141204801825835211973625643007: 2550183799
* 3976656429941438590393, 2722258935367507707706996859454145691647: 263
* 10350794431055162386718619237468234569,
162259276829213363391578010288127: 162259276829213363391578010288127,
2535301200456458802993406410751: 7432339208719 * 341117531003194129,
10384593717069655257060992658440191: 3391 * 23279 * 65993 * 1868569 *
1066818132868207, 696898287454081973172991196020261297061887:
5625767248687 * 123876132205208335762278423601,
170141183460469231731687303715884105727:
170141183460469231731687303715884105727,
174224571863520493293247799005065324265471: 32032215596496435569 *
5439042183600204290159, 649037107316853453566312041152511: 745988807 *
870035986098720987332873}
{713623846352979940529142984724747568191373311: 86656268566282183151 * 8235109336690846723986161, 10141204801825835211973625643007: 2550183799 * 3976656429941438590393, 2722258935367507707706996859454145691647: 263 * 10350794431055162386718619237468234569, 162259276829213363391578010288127: 162259276829213363391578010288127, 2535301200456458802993406410751: 7432339208719 * 341117531003194129, 10384593717069655257060992658440191: 3391 * 23279 * 65993 * 1868569 * 1066818132868207, 696898287454081973172991196020261297061887: 5625767248687 * 123876132205208335762278423601, 170141183460469231731687303715884105727: 170141183460469231731687303715884105727, 174224571863520493293247799005065324265471: 32032215596496435569 * 5439042183600204290159, 649037107316853453566312041152511: 745988807 * 870035986098720987332873}
transaction.commit() 
       
Traceback (click to the left of this block for traceback)
...
TypeError: can't pickle SageObject objects
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "_sage_input_6.py", line 10, in <module>
    exec compile(u'open("___code___.py","w").write("# -*- coding: utf-8 -*-\\n" + _support_.preparse_worksheet_cell(base64.b64decode("dHJhbnNhY3Rpb24uY29tbWl0KCk="),globals())+"\\n"); execfile(os.path.abspath("___code___.py"))
  File "", line 1, in <module>
    
  File "/private/var/folders/7y/7y-O1iZOGTmMUMnLq7otq++++TI/-Tmp-/tmpIb0kcP/___code___.py", line 2, in <module>
    exec compile(u'transaction.commit()
  File "", line 1, in <module>
    
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/site-packages/ZODB3-3.7.0-py2.6-macosx-10.6-i386.egg/transaction/_manager.py", line 96, in commit
    return self.get().commit(sub, deprecation_wng=False)
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/site-packages/ZODB3-3.7.0-py2.6-macosx-10.6-i386.egg/transaction/_transaction.py", line 395, in commit
    self._commitResources()
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/site-packages/ZODB3-3.7.0-py2.6-macosx-10.6-i386.egg/transaction/_transaction.py", line 495, in _commitResources
    rm.commit(self)
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/site-packages/ZODB3-3.7.0-py2.6-macosx-10.6-i386.egg/ZODB/Connection.py", line 498, in commit
    self._commit(transaction)
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/site-packages/ZODB3-3.7.0-py2.6-macosx-10.6-i386.egg/ZODB/Connection.py", line 543, in _commit
    self._store_objects(ObjectWriter(obj), transaction)
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/site-packages/ZODB3-3.7.0-py2.6-macosx-10.6-i386.egg/ZODB/Connection.py", line 570, in _store_objects
    p = writer.serialize(obj)  # This calls __getstate__ of obj
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/site-packages/ZODB3-3.7.0-py2.6-macosx-10.6-i386.egg/ZODB/serialize.py", line 407, in serialize
    return self._dump(meta, obj.__getstate__())
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python2.6/site-packages/ZODB3-3.7.0-py2.6-macosx-10.6-i386.egg/ZODB/serialize.py", line 416, in _dump
    self._p.dump(state)
  File "/Users/wstein/purple/install/psage-10.10.26/local/lib/python/copy_reg.py", line 70, in _reduce_ex
    raise TypeError, "can't pickle %s objects" % base.__name__
TypeError: can't pickle SageObject objects

Woops, what just happened?   The default pickle protocol is 1 in ZODB, and it is not user configurable (so far as I can tell).  According to this recent discussion on the ZODB devel list there is some worry about this...

I absolutely can't figure out how to deal with this nicely, beyond making a bunch of changes to ZODB's code itself.   So...

Next we'll talk about SQLalchemy and SQLite/PostgreSQL, which provide a very different approach to this whole problem....