Parquet Support

Over the summer, we added support for importing and exporting the Parquet file format. Parquet is an open source columnar data format that helps with data interoperability across different systems. Each Parquet file represents a single table; it cannot contain multiple tables.

Importing

Using irdbImport, which is the standard way of importing data into InMemory.net, you can now specify a Parquet Datasource as follows:

Datasource P1 = PARQUET 'path=c:\path_to_your_parquet_files'

To import a table stored as MyTable.parquet in this directory, you can then specify:

Import MyTable = p1.[MyTable.parquet]

You can also issue a SELECT statement against the Parquet datasource to restrict the columns brought back, e.g.

Import MyTable2 = p1.{select column1,column2 from [MyTable.parquet]}

When importing, we automatically convert ByteArrays to strings. This works for many Parquet files, but not all, and you may not always want this conversion. To disable it, add ;treatByteArrayAsString=false to the datasource command, e.g.

Datasource P1 = PARQUET 'path=c:\path_to_your_parquet_files;treatByteArrayAsString=false'

The initial release has two limitations. First, we don't currently support nested objects or the list and map types in the Parquet format. Second, Parquet support is only available in the .NET 6, 8, and 9 builds. It is not supported in the .NET Framework 4.x version.

Exporting

In irdbImport, you can save a table in the Parquet format by using the save table command. Just put the Parquet keyword before the file name:

SAVE TABLE 'MYTable' to Parquet 'c:\my_output_Directory\MyTable.Parquet'
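
Putting the import and export commands together, a minimal irdbImport script might look like the sketch below. It uses only the commands shown in this post. The directory paths, the column names, and the output file name MyTable_subset.parquet are placeholders, and any other statements a complete script may need are not covered here.

```text
Datasource P1 = PARQUET 'path=c:\path_to_your_parquet_files'
Import MyTable = P1.{select column1,column2 from [MyTable.parquet]}
SAVE TABLE 'MyTable' to Parquet 'c:\my_output_Directory\MyTable_subset.parquet'
```

This reads two columns from one Parquet file into a table and writes that table back out as a new, narrower Parquet file.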

Dot Net API

For importing data, the ParquetIRDBConn class in the irdb.parquet package provides the following method:

public InMemoryTable generateInMemoryTable(String tableName, bool compile, out string errors, string[] specificColumnsToRead)

tableName is the file name including its directory, compile specifies whether you want the InMemoryTable compiled, and specificColumnsToRead is the list of columns to read. Setting specificColumnsToRead to null reads all columns.

For saving files, the following method in the irdb.parquet.GenerateIRDBTable class can be called to save an InMemoryTable in Parquet format:

public static void saveInMemoryTableAsParquetFile(string fileName, InMemoryTable table, bool debug)