LucidDbPDIStreamingLoader

From Eigenpedia

Jump to: navigation, search

Contents

PDI Streaming Loader

A PDI step that provides well performing DML operations from a remote source in LucidDb. This plugin supercedes and replaces the PDI step LucidDbPdiBulkLoad and the methods put forth in LucidDbStreamingLoad.

Specification

A new PDI 4.0 plugin that will stream rows over the wire to LucidDbAppLib_REMOTE_ROWS and execute a DML statement to read the rows from that stream.

The PDI plugin opens two connections to LucidDb. One to execute a very specific (and specialized) bit of DML, and the other to push serialized Java objects over the wire. This allows us to get a set based, bulk DML statement that will perform well given the nature of the I/O required in a column store database.

Overview Diagram

Image:luciddb streaming loader overview.png

 - PDI connects to the LucidDb database specified on Database.
 - PDI prepares the DML statement (according to the rules outline below) (1) in overview diagram
 - PDI fires a new thread to execute the DML (since it will block run for the entire time) (1) in overview diagram
 - LucidDb starts the DML, and [[LucidDbAppLib_REMOTE_ROWS]] begins listening on port 9099 (defined in DML) (2) in overview diagram
 - PDI serializes and sends rows over the wire to LucidDb on port speceified in DML and step configuration (3) in overview diagram
 - LucidDb [[LucidDbAppLib_REMOTE_ROWS]] deserializes and outputs records from PDI as a cursor UDX
 - LucidDb returns the JDBC call when REMOTE_ROWS has processed all rows on the wire.
 - PDI interrogates LucidDb (num warnings, errors, etc) (4) in overview diagram

PDI Screens and Behavior

The following diagram shows the main UI screen when configuring the LucidDbPDIStreamingLoader step in PDI. It consists of the main screen (an SWT built as specified in the PDI plugin packaging) that have several tabs. The three TABS (KEYS, FIELDS, CUSTOM) are covered in more detail, along with the behavior of the plugin.

Image:Luciddb streaming loader1.png

  • Step Name: The name of the step. This is consistent amoungst all PDI steps. It's the unique logical name for this configured step on the canvas.
  • Connection: The JDBC connection to LucidDb. Use the existing UI framework objects that allow for editing or making new connections.
  • Schema: The schema name that contains the table to operate upon.
  • Table: The DML target table.
  • Operation: Enumeration to pick one of the following (MERGE, INSERT, UPDATE, CUSTOM)

KEYS TAB

Pictured in Main UI picture above.

Specifies the keys that will be used to match for DML operations that require matching. The Table Field will be the names of the columns in the target table, and the stream field will be the names of the value to use on the incoming rows (from incoming hop). STRETCH GOAL: Make both text fields, but add a drop down based on the incoming fields (for stream field) or table columns (from database connection))

FIELDS TAB

Image:Luciddb streaming loader2.png

This tab defines which rows will be part of the DML operation. It specifies what the column names are in the table (Table Field), the names of the fields in the stream (Stream Field) and if in the case where the DML operation has the choice (ie, MERGE) whether or not to update the or only insert the column.

CUSTOM TAB

Image:Luciddb streaming loader3.png

In the case where the DML operations presented are not sufficient, this allows a custom DML statement to be written.

Operations

The following section defines the operations available. It gives a description of the behavior of the plugin, and any UI specifics associated with the operation selected.

OPERATION : MERGE

  • The -FORMAT- of rows sent over the wire are the union of both the KEYS and FIELDS tabs.
  • The fields on the FIELDS TAB create two separate field lists. One as the -UPDATE- fields, the other as -INSERT-. All fields are present in the -INSERT- list, but only fields with UPDATE=Y on the UI are placed into the -UPDATE- list.
  • The fields on the KEYS TAB are used to create the -MATCHING- field list
  • The DML statement takes the format of:

Main DML Statement:

MERGE INTO 
schema.table                                        -- derived from the main UI schema.table
as tgt
USING 
<<cursor>> 
src ON
src.key1 = tgt.key1indb AND src.key2 = tgt.key2indb -- derived from the -MATCHING- field list
WHEN MATCHED THEN
UPDATE SET col1 = src.stream1                       -- derived from the -UPDATE- field list
WHEN NOT MATCHED THEN INSERT
(col1, col2) values (src.stream1, src.stream2)      -- derived from the -INSERT- field list


Cursor Statement:

select table ( applib.remote_rows (
9099                                                -- from main UI screen port
,
cursor (
select 
  cast(null as INT) as "key1"                       -- datatype and aliases derived from -FORMAT-
  ,cast(null as VARCHAR(128)) as "key2"             -- datatype and aliases derived from -FORMAT-
  ,cast(null as VARCHAR) as "stream1"               -- datatype and aliases derived from -FORMAT-
  ,cast(null as NUMERIC) as "stream2"               -- datatype and aliases derived from -FORMAT-
from
(values(0)) 
) -- end cursor
)) -- end applib / table

UI Considerations:

  • Custom Tab is greyed out so it can't be edited

OPERATION : INSERT

  • The -FORMAT- of rows sent over the wire are fields on the FIELDS tabs.
  • The fields on the FIELDS TAB represent a single -INSERT- list.
  • The DML statement takes the format of:

Main DML Statement:

INSERT INTO 
schema.table                                        -- derived from the main UI schema.table
(col1, col2)                                        -- derived from the -INSERT- field list
<<Cursor>>

Cursor Statement:

select table ( applib.remote_rows (
9099                                                -- from main UI screen port
,
cursor (
select 
  cast(null as VARCHAR) as "stream1"                -- datatype and aliases derived from -FORMAT-
  ,cast(null as NUMERIC) as "stream2"               -- datatype and aliases derived from -FORMAT-
from
(values(0)) 
) -- end cursor
)) -- end applib / table

UI Considerations:

  • Keys and Custom tab is greyed out.
  • The "UPDATE Y/N" column in the Fields tab is also greyed out


OPERATION : UPDATE

Uses the same specification for MERGE except REMOVEs the following part of DML


WHEN NOT MATCHED THEN INSERT
(col1, col2) values (src.stream1, src.stream2)      -- derived from the -INSERT- field list

UI Considerations:

  • Fields tab the Update Y/N is greyed out (and assumed to be Y for all rows)

OPERATION : CUSTOM

--Ngoodman 15:39, 13 June 2010 (EDT) Not implemented in PDI 4.0, in progress.

  • The -FORMAT- of rows sent over the wire are the union of both the KEYS and FIELDS tabs.
  • The DML statement is the one entered in on the CUSTOM tab with the CURSOR being replaced for the ?

Main DML Statement:

from Custom tab with the ? replaced with <<cursor>>  For example:

insert into mytable 
?

Cursor Statement:

select table ( applib.remote_rows (
9099                                                -- from main UI screen port
,
cursor (
select 
  cast(null as INT) as "key1"                       -- datatype and aliases derived from -FORMAT-
  ,cast(null as VARCHAR(128)) as "key2"             -- datatype and aliases derived from -FORMAT-
  ,cast(null as VARCHAR) as "stream1"               -- datatype and aliases derived from -FORMAT-
  ,cast(null as NUMERIC) as "stream2"               -- datatype and aliases derived from -FORMAT-
from
(values(0)) 
) -- end cursor
)) -- end applib / table

UI Considerations:

* None.

Future Work

  • Enable compression (GZIP) on the stream? Is it I/O or CPU bound ?
  • Batching and async I/O on both sides?
  • Set Max Number of Errors on connection?
  • Multiple sockets?

See Also

Personal tools