SampleDatasetSubstitution
From Eigenpedia
Sample dataset substitution is a technique which is useful when developing queries against a large or otherwise difficult schema.
Some examples:
- Suppose that you are designing a query to detect an unusual condition, such as fraudulent behavior in banking transactions. This condition is unusual, so your dataset does not contain an example of one.
- You are writing a set of views, and queries to test them, against a data warehouse. The data warehouse has not been populated, because the ETL process has not been completed, but you want to write the queries now.
In both of these examples, it is helpful to be able to develop the queries against a sample dataset.
SQL syntax
We extend the TABLESAMPLE SQL syntax used for table sampling with the SUBSTITUTE keyword:
DDL:
CREATE TABLE tableName ( columnSpecification [, ...] ) SAMPLE ( tableName AS sampleName [, ...]); -- not yet implemented ALTER TABLE tableName DROP SAMPLE ( sampleName [ , ...] ); ALTER TABLE tableName ADD SAMPLE ( tableName AS sampleName [, ...] ); CREATE [ OR REPLACE ] VIEW viewName SAMPLE ( tableName AS sampleName [ , ...] ) AS ...;
Query/DML:
tableReference TABLESAMPLE SUBSTITUTE('sampleName')
Examples
Here we define a table EMP with two sample datasets: Small held in table EmpSmall, and Medium held in table EmpMedium.
CREATE TABLE EmpSmall (empno INTEGER, ename VARCHAR(10)); INSERT INTO EmpSmall VALUES (10, 'Fred'), (20, 'Bill'); CREATE TABLE EmpMedium (empno INTEGER, ename VARCHAR(10)); INSERT INTO EmpMedium VALUES (10, 'Fred'), (30, 'Wilma'), (50, 'Barney'); CREATE TABLE Emp (empno INTEGER, ename VARCHAR(10)) SAMPLE (EmpSmall AS SMALL, EmpMedium AS Medium);
To use a sample, reference it directly against the table reference, or in an enclosing query:
SELECT * FROM Emp TABLESAMPLE SUBSTITUTE('SMALL');
SELECT * FROM (
SELECT * FROM Emp
JOIN Dept USING deptno
) TABLESAMPLE SUBSTITUTE('MEDIUM')
If the Dept table has a sample called 'MEDIUM', the second query will substitute both Emp and Dept with sample datasets; otherwise, it substitute only Emp and use the real contents of the Dept table.
Functional specification
- Samples can be tables, views, or any other relation; and can live in other schemas.
- Sample names behave like other SQL identifiers. If you don't quote them, they can only consist of alphanumerics, and implicitly become uppercase.
- If the query contains nested sub-queries, the sample name is inherited from outer queries. Inner queries can override outer samples, as you'd expect. (REVIEW: Or would you expect the inner query to get a list of candidate sample names?)
- Samples are applied to all base tables, including those referenced via a view (even multiple levels of views). If a sample matches a sample on a view definition, the view definition is replaced by the sample, and sample substitution takes place on the sample.
- It is an error sample names within a table are not unique.
- Give error if a sample S for table T does not contain a column for each column in T.
- Give error if a column CS in sample S for table T does not have the same datatype as the corresponding column CT in T. Implicit coercions are allowed. Also nullability.
- If table T has a sample S, user cannot drop S. With 'DROP s CASCADE' keyword, the sample is removed from T but T is not dropped.
- Include samples in generated DDL. Also, add the sample relationship to the partial order, so that sample tables are exported before the tables which depend on them.
- Detect cycles in sample tables (E.g. x samples from y samples from x) and issue a specific error. Doesn't matter whether error is at DDL time or at runtime -- runtime is probably easier.
- (When we implement access control.) Access to sample tables is subject to access control. If not, there would be an easy exploit: to read table T which you don't have access to, create another table U with the same columns, make T a sample for it, then run 'select * from U tablesample substitute('T')'.
This feature is incomplete. The remaining tasks are listed in eigenjira:FRG-152.
REVIEW jvs 25-June-2006: Regarding identifiers, there's a discrepancy between how the name of a sample is defined and referenced. When it is defined, it is as an SQL identifier, but when it is referenced, it is as a literal string. Since we've taken the route of an explicit SQL clause instead of a UDX, we have the option of removing this discrepancy by making the reference an identifier instead of a literal string. This would be an improvement in user-friendliness, but maybe you want to keep it this way to indicate that the referenced sample doesn't have to actually exist at all levels? The same discrepancy exists in some other places, like in set schema vs. create schema; note that for set schema the SQL standard actually says that the referencing string expression is evaluated and then reparsed as an identifier, so set schema 's' and set schema 'S' are equivalent, whereas set schema '"s"' is different. If we want to keep the discrepancy, I would say follow the create/set schema rules for consistency. See DdlSetContextStmt for where the string evaluation and reparsing happens. If we follow this and allow an expression to be evaluated instead of just a literal, it's possible to give different users different sample results via reference expressions like SUBSTITUTE(CURRENT_USER||'_SMALL').
jhyde replies, 4-December-2006: Yes, something similar to indicating "that the referenced sample doesn't have to actually exist at all levels". Sample dataset matches should be fuzzy, although I'm not yet sure what the best matching algorithm should be. As you say, sample datasets are currently inherited. It might also be useful for the sample name to be interpreted as a regexp. I'd like to keep identifiers as string expressions for now, and nail down the precise semantics when we have more experience using these in the field.

