TextAdapter First Steps

Basic Usage

Create TextAdapter object for data source:

>>> import iopro
>>> adapter = iopro.text_adapter('data.csv', parser='csv')

Define field dtypes (example: set field 0 to unsigned int and field 4 to float):

>>> adapter.set_field_types({0: 'u4', 4:'f4'})

Parse text and store records in NumPy array using slicing notation:

>>> # read all records
>>> array = adapter[:]

>>> # read first ten records
>>> array = adapter[0:10]

>>> # read last record
>>> array = adapter[-1]

>>> # read every other record
>>> array = adapter[::2]
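
The slices above follow ordinary Python slice semantics. As a rough illustration that does not involve IOPro at all, the built-in slice.indices method shows which record numbers each slice would select. The record count of 10 and the helper name selected_records are assumptions made for this sketch only.

```python
def selected_records(s, num_records):
    """Return the record numbers that slice s selects out of num_records."""
    start, stop, step = s.indices(num_records)
    return list(range(start, stop, step))

num_records = 10  # hypothetical number of records in the data source

all_records = selected_records(slice(None), num_records)           # like adapter[:]
first_three = selected_records(slice(0, 3), num_records)           # like adapter[0:3]
every_other = selected_records(slice(None, None, 2), num_records)  # like adapter[::2]
last_record = range(num_records)[-1]                               # like adapter[-1]

print(all_records)
print(first_three)
print(every_other)
print(last_record)
```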

JSON Support

Text data in JSON format can be parsed by specifying 'json' for the parser argument:

>>> adapter = iopro.text_adapter('data.json', parser='json')

Currently, each JSON object at the root level is interpreted as a single NumPy record. The objects can either be elements of one JSON array or be separated by newlines. The following examples show valid JSON documents that IOPro can parse, together with the resulting NumPy arrays:

>>> import iopro
>>> from io import StringIO

>>> # Single JSON object
>>> data = StringIO('{"id":123, "name":"xxx"}')
>>> iopro.text_adapter(data, parser='json')[:]
array([(123L, 'xxx')],
      dtype=[('f0', 'u8'), ('f1', 'O')])
>>> # Array of two JSON objects
>>> data = StringIO('[{"id":123, "name":"xxx"}, {"id":456, "name":"yyy"}]')
>>> iopro.text_adapter(data, parser='json')[:]
array([(123L, 'xxx'), (456L, 'yyy')],
      dtype=[('f0', 'u8'), ('f1', 'O')])
>>> # Two JSON objects separated by newline
>>> data = StringIO('{"id":123, "name":"xxx"}\n{"id":456, "name":"yyy"}')
>>> iopro.text_adapter(data, parser='json')[:]
array([(123L, 'xxx'), (456L, 'yyy')],
      dtype=[('f0', 'u8'), ('f1', 'O')])
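
To make the three accepted layouts concrete, here is a minimal sketch that uses only the standard json module. It is not IOPro's parser; the function name iter_json_records is hypothetical, and the sketch only mirrors the rule stated above (one record per root-level object, whether the objects sit in an array or on separate lines).

```python
import json

def iter_json_records(text):
    """Yield one dict per root-level JSON object.

    Accepts a single object, an array of objects, or objects separated
    by newlines. Plain-Python illustration, not IOPro's implementation.
    """
    text = text.strip()
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        # Not one JSON document: treat every non-empty line as its own object.
        for line in text.splitlines():
            if line.strip():
                yield json.loads(line)
        return
    if isinstance(doc, list):
        for obj in doc:
            yield obj
    else:
        yield doc

single = '{"id":123, "name":"xxx"}'
as_array = '[{"id":123, "name":"xxx"}, {"id":456, "name":"yyy"}]'
by_line = '{"id":123, "name":"xxx"}\n{"id":456, "name":"yyy"}'

# Turn each object into a tuple of its values, in field order.
single_records = [tuple(r.values()) for r in iter_json_records(single)]
array_records = [tuple(r.values()) for r in iter_json_records(as_array)]
line_records = [tuple(r.values()) for r in iter_json_records(by_line)]
```

With this sketch, the array layout and the newline-separated layout produce the same list of records.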

Future versions of IOPro will have support for selecting specific JSON fields, using a query language similar to XPath for XML.

Advanced Usage

Use a user-defined converter function for field 0:

>>> import iopro
>>> import io

>>> data = '1, abc, 3.3\n2, xxx, 9.9'
>>> adapter = iopro.text_adapter(io.StringIO(data), parser='csv', field_names=False)

>>> # Override default converter for first field
>>> adapter.set_converter(0, lambda x: int(x)*2)
>>> adapter[:]
array([(2L, ' abc', 3.3), (4L, ' xxx', 9.9)],
      dtype=[('f0', '<u8'), ('f1', 'S4'), ('f2', '<f8')])
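
Conceptually, a converter is a callable that receives the raw text of one field and returns the value to store. The sketch below reproduces that idea with the standard csv module only. The converters dictionary and its layout are assumptions made for illustration, not IOPro internals.

```python
import csv
import io

data = '1, abc, 3.3\n2, xxx, 9.9'

# Hypothetical converter table: field number -> callable applied to the raw text.
# Fields without an entry are kept as strings.
converters = {0: lambda x: int(x) * 2, 2: float}

records = []
for row in csv.reader(io.StringIO(data)):
    record = tuple(converters.get(i, str)(token) for i, token in enumerate(row))
    records.append(record)

print(records)
```

As in the adapter output above, the string field keeps its leading space because nothing strips whitespace from the raw token.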

Override the default missing and fill values:

>>> import iopro
>>> import io

>>> data = '1,abc,inf\n2,NA,9.9'
>>> adapter = iopro.text_adapter(io.StringIO(data), parser='csv', field_names=False)
>>> adapter.set_field_types({1:'S3', 2:'f4'})

>>> # Define list of strings for each field that represent missing values
>>> adapter.set_missing_values({1:['NA'], 2:['inf']})

>>> # Set fill value for missing values in each field
>>> adapter.set_fill_values({1:'xxx', 2:999.999})
>>> adapter[:]
array([(' abc', 999.9990234375), ('xxx', 9.899999618530273)],
      dtype=[('f0', 'S4'), ('f1', '<f4')])
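
The interplay of missing values and fill values can be sketched in plain Python: a token listed as missing for its field is replaced by that field's fill value, and every other token goes through the normal conversion. The dictionaries and the convert helper below are hypothetical stand-ins, and how IOPro implements this internally is not shown here.

```python
import csv
import io

data = '1,abc,inf\n2,NA,9.9'

missing_values = {1: ['NA'], 2: ['inf']}  # tokens treated as missing, per field
fill_values = {1: 'xxx', 2: 999.999}      # replacement value, per field
converters = {0: int, 1: str, 2: float}   # normal conversion, per field

def convert(field, token):
    """Return the fill value for a missing token, else the converted token."""
    if token in missing_values.get(field, []):
        return fill_values[field]
    return converters[field](token)

records = [
    tuple(convert(i, token) for i, token in enumerate(row))
    for row in csv.reader(io.StringIO(data))
]

print(records)
```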

Create and save a tuple of index arrays for a gzip file, then reload the indices:

>>> import iopro
>>> adapter = iopro.text_adapter('data.gz', parser='csv', compression='gzip')

>>> # build index of records and save index to NumPy array
>>> adapter.create_index('index_file')

>>> # reload index
>>> adapter = iopro.text_adapter('data.gz', parser='csv', compression='gzip', index_name='index_file')

>>> # Read last record
>>> adapter[-1]
array([(100, 101, 102)],
      dtype=[('f0', '<u4'), ('f1', '<u4'), ('f2', '<u4')])
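
The idea behind a record index is to remember where each record starts so that a later read can jump straight to it. The sketch below shows this with the standard gzip module and in-memory data. It is an assumption-laden illustration: build_index and read_record are hypothetical names, the index is a plain list of uncompressed byte offsets, and the format of IOPro's own index is not shown here. Note that seeking in a gzip stream with the standard library still decompresses from the beginning.

```python
import gzip
import io
import json

raw = b'1,2,3\n4,5,6\n100,101,102\n'
compressed = gzip.compress(raw)  # stands in for the contents of a .gz file

def build_index(gz_bytes):
    """Return the uncompressed byte offset at which each record starts."""
    offsets = []
    position = 0
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as f:
        for line in f:
            offsets.append(position)
            position += len(line)
    return offsets

def read_record(gz_bytes, offsets, n):
    """Read record n (negative n counts from the end) using the index."""
    with gzip.GzipFile(fileobj=io.BytesIO(gz_bytes)) as f:
        f.seek(offsets[n])
        return f.readline().decode('ascii').rstrip('\n')

index = build_index(compressed)
saved = json.dumps(index)     # serialise the index for later use
reloaded = json.loads(saved)  # reload it, as one would from a saved file
last = read_record(compressed, reloaded, -1)

print(index)
print(last)
```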

Use a regular expression for finer control over data extraction:

>>> import iopro
>>> import io

>>> # Define regular expression to extract dollar amount, percentage, and month.
>>> # Each set of parentheses defines a field.
>>> data = '$2.56, 50%, September 20 1978\n$1.23, 23%, April 5 1981'
>>> regex_string = r'([0-9]\.[0-9][0-9]+),\s([0-9]+)%,\s([A-Za-z]+)'
>>> adapter = iopro.text_adapter(io.StringIO(data), parser='regex', regex_string=regex_string, field_names=False, infer_types=False)

>>> # set dtypes of the three fields: float, unsigned int, and string
>>> adapter.set_field_types({0:'f4', 1:'u4', 2:'S10'})
>>> adapter[:]
array([(2.56, 50L, 'September'), (1.23, 23L, 'April')],
      dtype=[('f0', '<f8'), ('f1', '<u8'), ('f2', 'S9')])
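
A pattern can be checked against sample lines with the standard re module before handing it to the adapter. The sketch below assumes that each parenthesised group becomes one field and that the pattern may match anywhere in the line. Whether IOPro anchors the pattern at the start of a record is not stated here, so treat the use of search as an assumption.

```python
import re

data = '$2.56, 50%, September 20 1978\n$1.23, 23%, April 5 1981'

# Three groups: dollar amount, percentage, month name.
pattern = re.compile(r'([0-9]\.[0-9][0-9]+),\s([0-9]+)%,\s([A-Za-z]+)')

records = []
for line in data.splitlines():
    match = pattern.search(line)
    if match is None:
        continue  # skip lines the pattern does not match
    amount, percent, month = match.groups()
    records.append((float(amount), int(percent), month))

print(records)
```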