Advanced TextAdapter

Gzip Support

IOPro can decompress gzip data on the fly, like so:

>>> adapter = iopro.text_adapter('data.gz', compression='gzip')
>>> array = adapter[:]

Aside from the obvious advantage of being able to store and work with your compressed data without having to decompress it first, you sacrifice very little performance in doing so. For example, with a 419 MB csv file of numerical data, and a 105 MB file of the same data compressed with gzip, the following are the “best of three” run times for loading the entire contents of each file into a NumPy array:

uncompressed: 13.38 sec
gzip compressed: 14.54 sec

The compressed file takes slightly longer, but consider having to decompress the file to disk before loading with IOPro:

uncompressed: 13.38 sec
gzip compressed: 14.54 sec
gzip compressed (decompress to disk, then load): 21.56 sec
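As a rough sketch of how such a timing can be reproduced (the file name is the one used above; any timing harness works equally well):

>>> import time
>>> import iopro
>>> start = time.time()
>>> array = iopro.text_adapter('data.gz', parser='csv', compression='gzip')[:]
>>> print('gzip compressed: %.2f sec' % (time.time() - start))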

Indexing CSV Data

One of the most useful features of IOPro is the ability to index data to allow for fast random lookup.

For example, to retrieve the last record of the compressed 105 MB dataset we used above:

>>> adapter = iopro.text_adapter('data.gz', parser='csv', compression='gzip')
>>> array = adapter[-1]

Retrieving the last record into a NumPy array takes 14.82 sec. This is about as long as reading the entire file, because the entire dataset has to be parsed to get to the last record.

To make seeking faster, we can build an index:

>>> adapter.create_index('index_file')

The above method creates an index in memory and saves it to disk, taking 9.48 sec. Now when seeking to and reading the last record again, it takes a mere 0.02 sec.
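With the index in place, that lookup is just the same indexing expression as before:

>>> array = adapter[-1]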

Reloading the index only takes 0.18 sec. Build an index once, and get near instant random access to your data forever:

>>> adapter = iopro.text_adapter('data.gz', parser='csv', compression='gzip', index_name='index_file')
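Any record or slice can then be fetched without parsing from the start of the file. The record numbers here are purely illustrative:

>>> record = adapter[100]
>>> chunk = adapter[1000000:1000010]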

Advanced Regular Expressions

IOPro supports using regular expressions to help parse messy data. Take, for example, the following snippet of actual NASDAQ stock data found on the Internet:

Apple,AAPL,NasdaqNM,363.32 - 705.07
Google,GOOG,NasdaqNM,523.20 - 774.38
Microsoft,MSFT,NasdaqNM,24.30 - 32.95

The first three fields are easy enough: name, symbol, and exchange. The fourth field presents a bit of a problem. Let’s try IOPro’s regular expression based parser:

>>> regex_string = r'([A-Za-z]+),([A-Z]{1,4}),([A-Za-z]+),([0-9]+\.[0-9]{2})\s*-\s*([0-9]+\.[0-9]{2})'
>>> adapter = iopro.text_adapter('data.csv', parser='regex', regex_string=regex_string)
>>> array = adapter[:]

Regular expressions can admittedly get pretty ugly, but they can also be very powerful. By using the above regular expression with the grouping operators ‘(’ and ‘)’, we can define exactly how each record should be parsed into fields. Let’s break it down into individual fields:

([A-Za-z]+) defines the first field (company name) in our output array,

([A-Z]{1,4}) defines the second (stock symbol),

([A-Za-z]+) defines the third (exchange),

([0-9]+\.[0-9]{2}) defines the fourth field (low price), and

([0-9]+\.[0-9]{2}) defines the fifth field (high price)

The output array contains five fields: three string fields and two float fields. Exactly what we want.
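As a quick sanity check, the first record of the snippet above should come back as five typed fields, something like this (output shown for illustration):

>>> array[0]
('Apple', 'AAPL', 'NasdaqNM', 363.32, 705.07)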

Numba Integration

IOPro comes with experimental integration with NumbaPro, the amazing NumPy-aware Python compiler also available in Anaconda. Previously, when parsing messy csv data, you had to either use a very slow custom Python converter function to convert the string data to the target data type, or use a complex regular expression to define the fields in each record string. The regular expression feature of IOPro will certainly still be a useful and valid option for certain types of data, but it would be nice if custom Python converter functions weren’t so slow as to be almost unusable. Numba solves this problem by compiling your converter functions on the fly, without any action on your part. Simply set the converter function with a call to the set_converter method as before, and IOPro + NumbaPro will handle the rest.

To illustrate, I’ll show a trivial example using the sdss data set again. Take the following converter function, which converts the input string to a floating point value and rounds it to the nearest integer, returning the integer value:

>>> def convert_value(input_str):
...     float_value = float(input_str)
...     return int(round(float_value))

We’ll use it to convert field 1 from the sdss dataset to an integer. By calling the set_converter method with the use_numba parameter set to either True or False (the default is True), we can test the converter function being called both as interpreted Python and as Numba-compiled LLVM bytecode. In this case, compiling the converter function with NumbaPro gives us a 5x improvement in run time performance.

To put that in perspective, the Numba-compiled converter function takes about the same time as converting field 1 to a float value using IOPro’s built-in, C-compiled float converter. That isn’t quite an “apples to apples” comparison, but it does show that NumbaPro enables user-defined Python converter functions to achieve speeds in the same league as compiled C code.
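Putting it together, usage might look like the following sketch. The file name here is hypothetical, and the exact set_converter signature should be checked against the IOPro docs; the field number and use_numba flag follow the description above:

>>> adapter = iopro.text_adapter('sdss.csv', parser='csv')    # hypothetical file name
>>> adapter.set_converter(1, convert_value, use_numba=True)   # Numba-compiled converter
>>> array = adapter[:]
>>> adapter.set_converter(1, convert_value, use_numba=False)  # interpreted Python, for comparison
>>> array = adapter[:]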

S3 Support

IOPro can also parse csv data stored in Amazon’s S3 cloud storage service. The S3 text adapter constructor looks slightly different from the normal text adapter constructor:

>>> adapter = iopro.s3_text_adapter(aws_access_key, aws_secret_key, 'dev-wakari-public', 'FEC/FEC_ALL.csv')

The first two parameters are your AWS access key and secret key, followed by the S3 bucket name and key name. The S3 csv data is downloaded in 128K chunks and parsed directly from memory, bypassing the need to save the entire S3 data set to disk first.

IOPro can also build an index for S3 data, just as with disk-based csv data, and use the index for fast random-access lookup. If an index file is created with IOPro and stored with the S3 dataset in the cloud, IOPro can use this remote index to download and parse only the subset of records requested. This allows you to generate an index file once and share it on the cloud along with the data set, without requiring others to download the entire index file to use it.
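For example, building an index for an S3 dataset and later reusing it might look like the sketch below. This assumes the s3_text_adapter supports the same create_index method and index_name parameter as the regular text adapter; the index key name is hypothetical:

>>> adapter = iopro.s3_text_adapter(aws_access_key, aws_secret_key, 'dev-wakari-public', 'FEC/FEC_ALL.csv')
>>> adapter.create_index('FEC_ALL.index')  # build once, then upload next to the data set
>>> # later, pointing at the index stored in S3:
>>> adapter = iopro.s3_text_adapter(aws_access_key, aws_secret_key, 'dev-wakari-public', 'FEC/FEC_ALL.csv', index_name='FEC/FEC_ALL.index')
>>> record = adapter[-1]  # downloads and parses only the chunks needed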