Data Virtualization Cache Notes

February 20, 2014 2-minute read

Software Things

When to use caching:

Query performance
Load optimization – to minimize load on the underlying system.
Consistent reporting
Source availability
Complex transformations
Security – to allow users to see aggregate data and still keep them from the ability to access detailed data.

Where to keep the cache:

Memory – gives you the fastest access, purged when the server restarts(?), more limited that storage in most envs
Database – works well for large data sets, creates and manages tables with indexes
File(s) – slower than memory, persistent across restarts, works well for small data sets

Refreshing the cache:

Construction refresh – refresh when its built, only once
Manual refresh – admin has to initiate a refresh
Scheduled refresh
Query-driven refresh – refreshed when queried if the cache has expired
Event-driven refresh – an event can initiate a refresh. an event could be something over an ESB, a file drop, API call, etc.

How to refresh the cache:

Incremental refresh – refresh only adds new data to the cache
Live refresh – basically CDC/instant refreshing
Full refresh – invalidate the cache and cache the data again

Online vs. Offline refreshing:

Offline refreshing – the table cannot be accessed during the refresh period
Online refreshing – queries CAN be executed against a table during a refresh. this data is obviously not complete though depending on the virtualization server’s implementation.

Cache replication is useful when:

the number of queries/users accessing the table is high
geographically dispersed consumers can have data cached physically closer to them for faster access
cache replication makes the table more available. even if one cache is down, you still have another cache to query against.