LucidDbPdiBulkLoad
From Eigenpedia
Contents |
Introduction
This page describes experimental support for bulk loading data into LucidDB from the Pentaho Data Integration ETL tool (PDI, formerly Kettle). This is an alternative to the LucidDB SQL-only approach described in LucidDbEtlTutorial, and can be used when a UI-based ETL tool is more appropriate, and/or when PDI provides some connectivity or transformation capability that LucidDB currently lacks.
Special thanks go to Matt Casters for putting together the first version of the PDI step implementation.
(Note: this page supersedes the older page LucidDbStreamingLoad.)
Setup
These instructions currently only work on Linux. Windows support via named pipes may be developed in the future.
The bulk loader is based on fifo support in LucidDB's flatfile reader. This is available since release 0.8.0, so set up a copy of that before continuing (see LucidDbGettingStarted if this is your first time setting up a LucidDB instance).
Create a directory luciddb/fifo; this is where the named pipe will be created, along with some other related files.
Using sqllineClient:
create schema pdi; create table pdi.territories(name varchar(128) primary key);
Download a copy of PDI and install it.
Next, set up PDI so that it will be able to connect to LucidDB.
- Copy luciddb/plugin/LucidDbClient.jar to the pdi/libext directory.
- Download luciddb-bulk-load-pdi-0.8.0.tar.bz2 and unpack it under pdi/plugins/steps (this should create a new subdirectory pdi/plugins/steps/LucidDbBulkLoadPlugin)
- Launch PDI, create a new database connection, and fill in the database connection info as follows:
- Connection Type: Generic database
- Access: Native (JDBC)
- Custom Connection URL: jdbc:luciddb:rmi://localhost
- Custom Driver Class Name: com.lucidera.jdbc.LucidDbRmiDriver
- User Name: sa
- Test the connection
Note: a LucidDB-specific connection type is coming in the 3.2 release of PDI.
Sample Data Source
For a sample data source, we'll use one of the flatfiles which ships with PDI as an example.
In PDI, create a new transformation, and add a new CSV file input step, filling in pdi/samples/transformations/files/Territories.txt for the filename. Click Get Fields and then Preview and make sure you see five rows in the result.
Define Bulk Load Step
Next, instantiate a step for the LucidDB Bulk Loader Plugin and create a hop into it from the CSV file input. Fill in the bulk loader properties:
- Connection: select the LucidDB connection you created earlier
- Target schema: PDI
- Target table: TERRITORIES
- Maximum errors: leave as default of 0 for now
- FIFO file path: /path/to/luciddb/fifo (the directory you created earlier)
- FIFO foreign server name: ffsrv
Click Get Fields to map in the Territory field from the CSV file, and then click OK.
Now if you did everything right, you should be able to launch the transformation to load the data, and then verify it by querying the table via sqllineClient (or PDI's database explorer).
Error Handling
By default, with the max errors setting 0, LucidDB will fail the load if anything goes wrong and rollback.
You can test this by re-running the transformation. Since the table was defined with a primary key, the second execution should fail due to uniqueness violations. The Step Metrics tab at the bottom should show red, and the details are in the Logging tab.
However, as described in LucidDbErrorHandling, it's possible to tell LucidDB to keep going when it hits an error and just reject the bad rows. This is enabled in the PDI bulk loader by setting max errors to some number greater than 0. If fewer than this many rows are rejected, then the load succeeds and the update is committed; otherwise, if there are still too many, it fails and rolls back.
Edit the step to bulk load set a number like 100 for max errors, and then re-run the transformation. This time, there should be no red, but the rejected row count in the Step Metrics tab should be 5, since ALL of the rows got rejected due to uniqueness violation. Again, the details will be in the Logging tab:
(WARNING) Statement execution encountered recoverable exceptions; errors: 5; warnings: 0; see /path/to/luciddb-0.8.0/trace/Summary.log for details
Foreign Server
Under the covers, one of the steps taken by the bulk load step is to create a flat file foreign server which allows LucidDB to read rows being transmitted to it via the fifo. The step uses CREATE OR REPLACE so that if the named server is already defined, it is overwritten.
This is fine for ad hoc usage; however, if run repeatedly, or if multiple loads are run in parallel, it would be preferable to be able to create the foreign server only once and then reference it from each step instance; this would avoid potential concurrency issues and redundancy. This is not currently possible until the step is enhanced with an option (maybe a choice between always recreate server, recreate only if undefined, or never create).


