
Pickling items that don't fit in memory

I really like using Python's generators. Watching and reading David Beazley's talk on generators has made me try to use them wherever practical.

Recently I've had to deal with streaming and capturing massive amounts of data: recording all traffic from a high baud rate CAN bus, as it happens. I have packets of unknown type and length coming in, and I want to store the sequence in such a way that I can recreate the full Python objects on another computer.

A quick solution is to serialize the data and store it in a file. The built-in module for this is pickle. Unfortunately, even with 12GB of RAM, my work PC couldn't hold the entire sequence in memory while waiting to do a single serialization dump with pickle. After a bit of research I found a Python 2 implementation of streaming-pickle by Philip Guo.

His solution didn't support bytearray objects and also suffered from a content boundary separation problem - multiple newlines within the pickled data could be picked up as record delimiters. I ported that solution to Python 3, added support for all the binary packing that I required, and added base64 encoding of the pickled data to get around the content boundary problem.

Usage differs slightly from the standard library pickle: you can either dump an iterable in one hit, or pass single elements to s_dump_elt (streaming dump element), which pickles, encodes, then appends the element to a file.
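To give a feel for the idea, here is a minimal sketch of the approach rather than the code from the gist. Only the name s_dump_elt comes from the description above; s_dump, s_load, the one-record-per-line framing and the packets generator are assumptions made for illustration. Each element is pickled, base64-encoded, and written as a single line. Base64 output never contains a newline, so a newline is a safe record delimiter.

```python
import base64
import io
import pickle


def s_dump_elt(elt, file_obj):
    """Pickle one element, base64-encode it, and append it as one line."""
    pickled = pickle.dumps(elt, protocol=pickle.HIGHEST_PROTOCOL)
    # b64encode never emits newlines, so b"\n" can safely end the record.
    file_obj.write(base64.b64encode(pickled) + b"\n")


def s_dump(iterable, file_obj):
    """Dump every element of an iterable, one at a time (assumed helper name)."""
    for elt in iterable:
        s_dump_elt(elt, file_obj)


def s_load(file_obj):
    """Lazily yield the original objects back, one record per line (assumed helper name)."""
    for line in file_obj:
        line = line.strip()
        if line:
            yield pickle.loads(base64.b64decode(line))


def packets():
    """Hypothetical stand-in for a stream of captured packets."""
    yield bytearray(b"\x00\n\n\xff")      # raw bytes containing newlines
    yield {"id": 0x123, "data": b"\n\n"}  # mixed Python object
    yield "text"


# An in-memory buffer stands in for a file opened in binary mode.
buffer = io.BytesIO()
s_dump(packets(), buffer)

buffer.seek(0)
restored = list(s_load(buffer))
```

With a real capture you would open the file in binary append mode and call s_dump_elt as each packet arrives, so only one element is held in memory at a time. Because s_load is a generator, reading the file back is just as lazy. As with any pickle data, only load files you trust.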

I imagine I'll refer back to this myself someday, but hopefully it is useful for someone else as well.

gist.github.com/hardbyte/5955010
