MondrianCacheOverview
From Eigenpedia
Contents |
Overview of Caches
For high-level background on caching in Mondrian, see the Cache Control documentation.
Additional references are the Mondrian components page and the Mondrian configuration page.
The table below describes the various kinds of caches maintained internally by Mondrian.
| Cached Object | Cache Scope | Lookup Key | Victimization Policy | Code Location |
|---|---|---|---|---|
| Mondrian Schema | Global | catalogUrl + connectionKey + jdbcUser + dataSource + connectInfo | SoftReference | mondrian.rolap.RolapSchema.Pool |
| JDBC Connection | Global | JDBC connect string | DBCP GenericObjectPool | mondrian.rolap.RolapConnectionPool |
| Aggregated measure value | Per-schema | cell coordinates (tuple of members) | SoftReference | mondrian.rolap.agg.AggregationManager |
| Member info | Per-schema | parent member + name of child member | SoftReference | mondrian.rolap.SmartMemberReader |
| Member children | Per-schema | parent member | SoftReference | mondrian.rolap.SmartMemberReader |
| RolapStar.Column Cardinality Cache | Per-schema | table alias + expression | SoftReference | mondrian.rolap.RolapStar.Column |
| Non-empty tuple set | Per-schema | SQL constraint(incl. referenced base cubes) + inputs | SoftReference | mondrian.rolap.RolapNativeSet |
| Expression value | Per-query | expression + evaluation context | none | mondrian.rolap.RolapEvaluator |
| Named set value | Per-query | name of named set | none | mondrian.rolap.RolapResult |
Additional info which needs to be added:
- synchronization granularity
- cache invalidation mechanisms and granularity (explicit API, implicit via data source change listeners)
Schema Cache
The schema cache holds instances of Mondrian schema definitions after they have been loaded, parsed, and validated. Schema instances also hold references to other caches (those marked as per-schema in the table above), so flushing a schema from the cache also flushes cached data associated with it.
TBD: schema lookup key, list of associated data structures in cache
Aggregation Cache
When people talk about Mondrian's cache, this is usually what they are thinking of, since it
- provides the most value (saving the work of summing up fact table rows over and over)
- eats up the most memory
As such, the Mondrian cache control documentation focuses on this topic the most; it alludes to and sketches the concept of "segment" used to fetch and contain aggregations at granularity higher than that of individual cells. It also covers how to selectively flush (or at least invalidate) portions of this cache.
Let's look at an example of incremental cache loading. If the following MDX is run against a cold cache:
SELECT
{[Customers].[All Customers].[USA].[CA],
[Customers].[All Customers].[USA].[WA]} ON ROWS,
{[Measures].[Unit Sales]} ON COLUMNS
FROM [Sales]
The SQL below will be executed to load up the measures for the two states requested:
Segment.load: executing sql [
select "time_by_day"."the_year" as "c0", "customer"."state_province" as "c1", sum("sales_fact_1997"."unit_sales") as "m0"
from "time_by_day" as "time_by_day", "sales_fact_1997" as "sales_fact_1997", "customer" as "customer"
where "sales_fact_1997"."time_id" = "time_by_day"."time_id"
and "time_by_day"."the_year" = 1997 and
"sales_fact_1997"."customer_id" = "customer"."customer_id"
and "customer"."state_province" in ('CA', 'WA')
group by "time_by_day"."the_year", "customer"."state_province"],
exec 2360 ms, exec+fetch 2362 ms, 2 rows
Now suppose the following MDX is executed against the warmed-up cache:
SELECT
[Customers].[All Customers].[USA].children ON ROWS,
{[Measures].[Unit Sales]} ON COLUMNS
FROM [Sales]
The SQL below will be executed to fetch the measure for the remaining child state of USA (Oregon); Mondrian will cache that result and return it together with the already-cached results for the other two states. This behavior can be a bit surprising at first, because Oregon is never mentioned in either of the MDX queries. Mondrian figures it out by reading the members for the dimension separately from the facts (as described in the next section on member caching).
Segment.load: executing sql [
select "time_by_day"."the_year" as "c0", "customer"."state_province" as "c1", sum("sales_fact_1997"."unit_sales") as "m0"
from "time_by_day" as "time_by_day", "sales_fact_1997" as "sales_fact_1997", "customer" as "customer"
where "sales_fact_1997"."time_id" = "time_by_day"."time_id"
and "time_by_day"."the_year" = 1997
and "sales_fact_1997"."customer_id" = "customer"."customer_id"
and "customer"."state_province" = 'OR'
group by "time_by_day"."the_year", "customer"."state_province"],
exec 680 ms, exec+fetch 680 ms, 1 rows
TBD:
- mapping to RolapStar
- details of segment representation, lookup, and manipulation
- interaction with aggregate tables; rollups
- predicate optimization; MondrianMaxConstraintsProperty
- call patterns from query axis evaluation
- batching
- explicit references to measure values, e.g. for Order
- implicit references to measure values, e.g. for NonEmptyCrossJoin
- call patterns from query result cell evaluation
- how segments are temporarily pinned for the scope of a query
- interaction with property mondrian.rolap.evaluate.MaxEvalDepth
- interaction with properties mondrian.rolap.SparseSegmentValueThreshold and SparseSegmentDensityThreshold
- interaction with property mondrian.rolap.star.disableCaching
See also MondrianDistinctCountAggregateImprovement.
Member Cache
This cache maps the structure and content of a dimension. As the table at the top of this page indicates, it actually consists of two caches: one which stores the properties of individual members, and one which maps out hierarchies by listing the children of each parent. Most member caching is subject to garbage collection via soft references, but certain special members are always "pinned" by the containing hierarchy (in the schema cache):
- all member
- default member
- root members
- null member
- schema calculated members
- members reference by formulas in the schema
Let's look at how a cold member cache gets loaded as part of executing the following MDX:
SELECT [Product].[All Products].[Drink].CHILDREN ON ROWS,
{[Measures].[Unit Sales]} ON COLUMNS
FROM [Sales];
Here's the SQL which gets executed to validate the existence of the Drink member and bring it into cache:
SqlMemberSource.getMemberChildren: executing sql [ select "product_class"."product_family" from "product" as "product", "product_class" as "product_class" where "product"."product_class_id" = "product_class"."product_class_id" and "product_class"."product_family" = 'Drink' group by "product_class"."product_family" order by "product_class"."product_family" ASC], exec 46 ms, exec+fetch 46 ms, 1 rows
And here's the SQL which retrieves its children into cache:
SqlMemberSource.getMemberChildren: executing sql [ select "product_class"."product_department" from "product" as "product", "product_class" as "product_class" where "product"."product_class_id" = "product_class"."product_class_id" and "product_class"."product_family" = 'Drink' group by "product_class"."product_department" order by "product_class"."product_department" ASC], exec 18 ms, exec+fetch 18 ms, 3 rows
Finally, the child members show up enumerated in the SQL used to retrieve the corresponding fact data to fill in the report cells:
Segment.load: executing sql [
select "time_by_day"."the_year" as "c0", "product_class"."product_family" as "c1", "product_class"."product_department" as "c2",
sum("sales_fact_1997"."unit_sales") as "m0"
from "time_by_day" as "time_by_day", "sales_fact_1997" as "sales_fact_1997", "product_class" as "product_class", "product" as "product"
where "sales_fact_1997"."time_id" = "time_by_day"."time_id"
and "time_by_day"."the_year" = 1997
and "sales_fact_1997"."product_id" = "product"."product_id"
and "product"."product_class_id" = "product_class"."product_class_id"
and "product_class"."product_family" = 'Drink'
and "product_class"."product_department" in ('Alcoholic Beverages', 'Beverages', 'Dairy')
group by "time_by_day"."the_year", "product_class"."product_family", "product_class"."product_department"],
exec 788 ms, exec+fetch 790 ms, 3 rows
TBD:
- member cache representation, lookup, and manipulation
- call patterns from schema validation
- call patterns from axis evaluation
Non-Empty Tuple Set Cache
When evaluating the "non empty" condition for various expressions, Mondrian is capable of executing SQL queries against fact+dimension table joins to determine which tuples (combinations of members) correspond to non-empty cells. When the same or equivalent MDX expression is executed multiple times as part of different queries, Mondrian is able to cache the results so that it only needs to execute SQL for the first request. The results are maintained in the non-empty tuple set cache.
Jvs 20:10, 13 May 2007 (PDT): Does this cache get flushed when the aggregation cache is flushed? It ought to. Currently the only way to flush this cache is to flush the containing schema.
Rchen 19:00, 30 May 2007 (PDT): This cache is not "connected" to the aggregation cache. It remains valid even when the aggregation cache is flushed. To verify this in cmdRunner, issue the command "cube Sales caching=false" between issuing twice the example MDX query below. The trace output does not contain a second occurance of the corresponding SQL that natively evaluates the non-empty cross join.
Here's an MDX query which uses native SQL (if enabled) to evaluate a non-empty crossjoin:
SELECT
NONEMPTYCROSSJOIN(
[Product].[All Products].[Drink].CHILDREN,
{[Gender].[M]}
) ON ROWS,
{[Measures].[Unit Sales]} ON COLUMNS
FROM [Sales];
And here's the SQL generated to evaluate the NonEmptyCrossjoin expression:
SqlTupleReader.readTuples [[Product].[Product Department], [Gender].[Gender]]: executing sql [ select "product_class"."product_family", "product_class"."product_department", "customer"."gender" from "product" as "product", "product_class" as "product_class", "sales_fact_1997" as "sales_fact_1997", "customer" as "customer" where "product"."product_class_id" = "product_class"."product_class_id" and "sales_fact_1997"."product_id" = "product"."product_id" and "sales_fact_1997"."customer_id" = "customer"."customer_id" and "product_class"."product_family" = 'Drink' and "customer"."gender" = 'M' group by "product_class"."product_family", "product_class"."product_department", "customer"."gender" order by "product_class"."product_family" ASC, "product_class"."product_department" ASC, "customer"."gender" ASC], exec 1091 ms, exec+fetch 1092 ms, 1 rows
Notice this MDX uses NonEmptyCrossJoin(NECJ) which does not specify which measure(s) should not be empty for the cells in the result set. The SQL statement simply checks if, in the fact table referenced by the query, a fact row exists for a each cell from the cross product of the two inputs. In effect, this NECJ is defined by both the inputs, and the fact table referenced by the query. This means that two NECJs with the same inputs but used in queries referencing different fact tables should not share the same cached result. It might not be very obvious to tell if two NECJs can share the same cached result from looking just at the MDX queries. For example, when the two queries reference the same virtual cube:
SELECT
NONEMPTYCROSSJOIN(
[Product].[Product Family].MEMBERS,
[Store].[Store Country].MEMBERS)
) ON ROWS,
{[Measures].[Store Sales]} ON COLUMNS
FROM [Warehouse and Sales];
SELECT
NONEMPTYCROSSJOIN(
[Product].[Product Family].MEMBERs,
[Store].[Store Country].MEMBERs)
) ON ROWS,
{[Measures].[Warehouse Sales]} ON COLUMNS
FROM [Warehouse and Sales];
However, they actually reference different fact tables because the measures [Store Sales] and [Warehouse Sales] come from different base cubes("Sales" cube and "Warehouse" cube respectively). As a result, these two NECJs should not share the same cached result even though they have the same definition.
This ambiguity can be avoided by explicitly naming the measures for which cells in the result set should have values, e.g.
SELECT
FILTER(
CROSSJOIN(
[Product].[Product Family].MEMBERS,
[Store].[Store Country].MEMBERS),
NOT ISEMPTY([Measures].[Store Sales])
) ON ROWS
{[Measures].[Store Sales]} ON COLUMNS
FROM [Warehouse and Sales];
SELECT
FILTER(
CROSSJOIN(
[Product].[Product Family].MEMBERS,
[Store].[Store Country].MEMBERS),
NOT ISEMPTY([Measures].[Warehouse Sales])
) ON ROWS
{[Measures].[Warehouse Sales]} ON COLUMNS
FROM [Warehouse and Sales];
There is a slight difference however between the generated SQL for this FILTER and the one for the NONEMPTYCROSSJOIN. Measures with null values are also excluded in the SQL for FILTER, but not for NONEMPTYCROSSJOIN.
Expression Cache
Within the execution of a query, the same MDX expression may be evaluated over and over as part of axis evaluation. To the extent to which the expression evaluation is context-independent, the result can be cached to accelerate evaluation. The expression cache keeps track of context-dependent values so that this can be done safely. This is only used on a case-by-case basis (currently only aggregate and RANK input sets).
Here's an MDX example where the cached expression is completely context-independent (the rank of each education level is computed against the set of all education levels):
WITH MEMBER [Gender].[RANK1] AS
'RANK([Education Level].currentmember,
ORDER (
{[Education Level].[Education Level].Members},
[Gender].[All Gender].[F], ASC))'
SELECT
{[Gender].[All Gender].[F], [Gender].[RANK1]} ON COLUMNS,
[Education Level].[Education Level].Members ON ROWS
FROM [Sales]
WHERE ([Measures].[Store Sales])
By crossjoining in another dimension, we get an example where the cached expression is context-dependent (the education level ranking is dependent on beverage):
WITH MEMBER [Gender].[RANK1] AS
'RANK([Education Level].currentmember,
ORDER (
{[Education Level].[Education Level].Members},
[Gender].[All Gender].[F], ASC))'
SELECT
{[Gender].[All Gender].[F], [Gender].[RANK1]} ON COLUMNS,
CROSSJOIN(
[Product].[All Products].[Drink].children,
[Education Level].[Education Level].Members)
ON ROWS
FROM [Sales]
WHERE ([Measures].[Store Sales])
The expression cache uses hard references, and these are not released until the end of query evaluation, so expressions which result in very large sets can cause Mondrian to run out of memory. (The exception is that expression cache entries are invalidated automatically after aggregation cache misses, since the expression value may have been dependent on a measure which hasn't been loaded yet.) Property mondrian.expCache.enable can be used to enable or disable expression caching.
Rchen 14:12, 5 June 2007 (PDT) Note that only the invalid result in the expression cache is cleared after aggregation cache misses. Expression values computed with missing aggregates are invalid. Caching such values avoids recomputing them. The valid results in the expression cache are not dependent on future aggregation cache loads. There's no need to invalidate this part of the expression cache after aggregation cache misses.
Jvs 11:43, 8 June 2007 (PDT): For more on this topic, see http://lists.pentaho.org/pipermail/mondrian/2007-June/000533.html
Named Set Cache
Similar to expression caching, named set caching is done per-query, and cached information is held pinned until end of query (except when it is discarded as invalid due to a cache miss on a measure reference needed to evaluate the named set's expression).
In the following MDX query, the named set cp is computed only once and reused for both axes:
WITH
SET [cp] AS
NONEMPTYCROSSJOIN(
[Product].[All Products].[Drink].children,
{[Gender].[M]})
SELECT
GENERATE([cp],{[Product].CURRENTMEMBER}) ON ROWS,
GENERATE([cp],{[Gender].CURRENTMEMBER}) ON COLUMNS
FROM [Sales]
Jvs 20:48, 13 May 2007 (PDT): Named sets could also be cached per-schema. For named sets defined within a schema, this should always be possible (with lookup on set name), since their evaluation is supposed to be query-independent (although this is currently not the case due to a bug in Mondrian). For named sets defined within a query, slicer context would need to be taken into account in cache lookup (which would be by expression instead of name); though I guess it would make more sense to do this as part of a more generic global expression caching mechanism.
JDBC Connection Pool
Mondrian uses the standard Apache DBCP connection pool for managing its JDBC connections to ROLAP data sources. The code excerpt below shows the parameters used to govern this pool:
connectionPool = new GenericObjectPool(
null, // PoolableObjectFactory, can be null
50, // max active
GenericObjectPool.WHEN_EXHAUSTED_BLOCK, // action when exhausted
3000, // max wait (milli seconds)
10, // max idle
false, // test on borrow
false, // test on return
60000, // time between eviction runs (millis)
5, // number to test on eviction run
30000, // min evictable idle time (millis)
true // test while idle
);
// create a PoolableConnectionFactory
AbandonedConfig abandonedConfig = new AbandonedConfig();
// flag to remove abandoned connections from pool
abandonedConfig.setRemoveAbandoned(true);
// timeout (seconds) before removing abandoned connections
abandonedConfig.setRemoveAbandonedTimeout(300);
// Flag to log stack traces for application code which abandoned a
// Statement or Connection
abandonedConfig.setLogAbandoned(true);
PoolableConnectionFactory poolableConnectionFactory
= new PoolableConnectionFactory(
// the connection factory
connectionFactory,
// the object pool
connectionPool,
// statement pool factory for pooling prepared statements,
// or null for no pooling
null,
// validation query (must return at least 1 row e.g. Oracle:
// select count(*) from dual) to test connection, can be
// null
null,
// default "read only" setting for borrowed connections
false,
// default "auto commit" setting for returned connections
true,
// AbandonedConfig object configures how to handle abandoned
// connections
abandonedConfig
);

