Earlier this year, some participants in the financial industry agreed to implement a “data repository” for credit, interest-rate and equity derivatives to facilitate the market-wide recording of trade data and thereby improve the transparency and disclosure of the OTC derivatives market. However, the term “data repository” is widely misunderstood, and its design and functionality remain unclear. Industry consultant Simon Hopper provides some much-needed clarification of its main functions and of how it should be designed and operated to best support the need for greater understanding.
Introduction
Following the market fallout of the last 15 months, there has been increased focus in the news on data repositories and data warehouses. Coverage has centred on the need for greater access to information about derivative markets and participants’ trading activities, to improve awareness and therefore regulatory oversight.
The Depository Trust & Clearing Corporation in the US operates a “Trade Information Warehouse” covering Credit Default Swaps; ISDA has, in the last three months, run two bid processes in London for global reporting repositories covering interest-rate and equity derivatives transactions; and the Committee of European Securities Regulators has issued a consultative paper seeking stakeholders’ views on trade repositories, including their functions, data and transparency requirements, their location and the legal framework.
But before embarking on a number of industry-wide reporting initiatives, it’s important that the drivers are clear and understood, and that focus isn’t allowed to drift. There is a need in many markets for operational improvements, but an industry-wide information system should not be asked to address these without accepting that doing so will significantly impair, or even eliminate, any reporting and analytical capability. To help explain the rationale behind this statement, some basic questions need to be considered:
– What are data repositories?
– Is a data repository the same thing as a data warehouse?
– Will this flurry of activity deliver any meaningful benefits given the likely costs?
This article is intended neither as a definitive technical research document nor as a detailed response to these questions, but as a business perspective from an industry expert who has spent the last 18 years explaining this very topic.
Background
Data warehousing was first developed as a concept by IBM in the late 1980s. The original concept was to provide a decision support system by extracting operational transaction data, cleaning it up and restructuring it in a separate system to provide management information (MI) and analysis. This MI couldn’t be delivered by operational systems because reporting and analysis activity interfered with the processing capability of those systems.
Over the last 20 years the concept has been developed, most notably by Bill Inmon and Ralph Kimball, and is widely used in many industries (financial services, telecoms, retail, pharmaceuticals, the public sector and many more) for many purposes (MI, data mining, client reporting, risk quantification and management, supply chain management, business intelligence and marketing, corporate scorecards, etc.).
There have been many public failures, such as Sainsbury’s write-off of £260m in 2004, and much continuing publicity, such as the issues in the public sector surrounding the NHS data warehouse and, more recently, the National DNA Database. However, the general consensus is that data warehouses can and do deliver invaluable insight. For example, much of both WalMart’s and Tesco’s success is widely attributed to their investments in this field – WalMart has the world’s largest data warehouse at over 4 petabytes (that’s over 4,096 terabytes).
Data, or information, repositories are not quite the same thing as data warehouses. A repository is a secondary form of data storage that addresses data proliferation by providing an archive for data that is no longer required to be in the primary storage system but from which the data can be retrieved if necessary. Data repositories therefore tend to mirror the structure of the operational database whose data proliferation they address, so that the data they contain can be retrieved quickly and in context if required. Data warehouses, by contrast, are structured to facilitate reporting and data analysis and, as outlined below, that structure is likely to be considerably different; data repositories, in their true definition, are therefore not an appropriate tool for reporting and analysis.
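To make the archive pattern concrete, here is a minimal sketch using Python’s built-in sqlite3 module. The trades and trades_archive tables, their columns and the status values are illustrative assumptions rather than any particular repository’s design; the point is simply that the archive mirrors the operational table’s structure and receives transactions that have completed.

```python
import sqlite3

con = sqlite3.connect(":memory:")          # stand-in for the operational database
cur = con.cursor()

# A hypothetical operational trade table (names and columns are illustrative).
cur.execute("""CREATE TABLE trades (
    trade_id  INTEGER PRIMARY KEY,
    product   TEXT,
    notional  REAL,
    status    TEXT)""")
cur.executemany("INSERT INTO trades VALUES (?, ?, ?, ?)",
                [(1, "CDS", 10e6, "COMPLETE"),
                 (2, "IRS", 25e6, "LIVE")])

# The repository mirrors the operational structure exactly
# (CREATE TABLE ... AS SELECT with a false predicate copies the columns only).
cur.execute("CREATE TABLE trades_archive AS SELECT * FROM trades WHERE 0")

# Archive completed transactions, then remove them from primary storage.
cur.execute("INSERT INTO trades_archive SELECT * FROM trades WHERE status = 'COMPLETE'")
cur.execute("DELETE FROM trades WHERE status = 'COMPLETE'")
con.commit()

# The archived trade can still be retrieved, in its original shape, if required.
print(cur.execute("SELECT * FROM trades_archive").fetchall())
```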
Operational Processing and Information Reporting Systems
Processing and reporting systems are fundamentally different in terms of their drivers, design paradigms and other considerations:
                      | Processing                                                                 | Reporting
Requirements          | market specific and driven by operational needs                            | cross-market and driven by analytical needs
Data design focus     | transaction throughput                                                     | information retrieval
Data flows            | rigid and streamlined                                                      | loose and matrix-like
Data model            | usually partly normalised                                                  | denormalised
Online data retention | transactions retained only until complete, to minimise data proliferation | transactions retained for 7 to 10 years to facilitate historical / cyclical analysis
User access           | very limited and tightly controlled                                        | open and flexible
Change management     | tightly managed, with periodic updates subjected to extensive testing     | flexible and frequent, with variable levels of testing (dependent on data change or reporting change)
Security applied at   | system, application and user levels                                        | system, data and user levels
SLAs / DR             | real-time / instantaneous                                                  | daily / 48-hour
Database Basics
Both operational processing and information reporting systems will contain a database to store information; however, the design of these data stores is diametrically opposed. It is important to understand the makeup of a database to fully appreciate the fundamental differences between operational processing systems and information reporting systems:
• Tables – these are used to store data. Operational databases perform better when these tables are small (i.e. contain fewer columns and rows of data) so amendments / additions / deletions can be written to the table quickly. As a result of the need for smaller tables, operational databases tend to have lots of tables, potentially hundreds.
• Joins – these link tables together. Joins serve a number of purposes but the two main reasons for consideration here are data integrity and query impact.
– Data Integrity – by joining two tables together it is possible to restrict data entry in one based on the data specified in another. For example, if a table of client information includes a country field, data in the field could include “UK”, “uk”, “u.k.”, “U.K.”, “gb”, “g.b.”, “G.B.” or any other mixture of characters that the data source determined was appropriate to represent the United Kingdom. By creating a separate table of country codes and joining this to the country code field in the client table, data entered in the client table can be restricted to a single code. Without this join, a query run against the client table using UK would not provide the correct information from a business perspective as clients with GB in the field would be ignored.
– Query impact – by linking tables, data in different tables can be brought together and more complex queries can be run.
Whilst joins bring data integrity, which is needed for analysis and/or reporting, they dramatically slow the processing of data – if a table includes joins then whenever data is added, deleted or updated the data contained in the field against which the join exists has to be validated through the join. Ultimately too many tables and joins will also slow reporting as the underlying queries will have to trace data through the joins to respond with the correct data set.
• Indexes – these are constructed within or across tables on multiple key data facts. The primary purpose of an index is to speed up the searching and retrieval of data from tables – if an index consists of two columns of data then, when searching for entries that meet two criteria, one in each column, the data only has to be “read” once; if there were no index it would have to be read twice. Operational systems tend to keep indexes to a minimum, as indexes slow the processing of changes / additions / deletions – whenever data is added, updated or removed, any affected indexes must also be maintained, and for bulk changes are often dropped, the data changes applied and the index rebuilt. The short sketch after this list illustrates both joins and indexes.
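The country-code example and the index behaviour described above can be illustrated with a short sqlite3 sketch. The client and country tables, their columns and the index name are hypothetical; the sketch simply shows a join enforcing data integrity, a cross-table query, and a composite index.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")    # SQLite only enforces joins-as-constraints when asked
cur = con.cursor()

# Reference table of valid country codes.
cur.execute("CREATE TABLE country (code TEXT PRIMARY KEY, name TEXT)")
cur.execute("INSERT INTO country VALUES ('GB', 'United Kingdom')")

# Client table joined to the country table: the foreign key is the "join"
# that restricts what can be entered in the country field.
cur.execute("""CREATE TABLE client (
    client_id    INTEGER PRIMARY KEY,
    client_name  TEXT,
    country_code TEXT REFERENCES country(code))""")

cur.execute("INSERT INTO client VALUES (1, 'Acme Bank', 'GB')")        # accepted
try:
    cur.execute("INSERT INTO client VALUES (2, 'Beta Bank', 'u.k.')")  # rejected
except sqlite3.IntegrityError as exc:
    print("Rejected inconsistent country code:", exc)

# Query impact: linking tables lets one query answer a cross-table question.
rows = cur.execute("""SELECT client_name, name
                        FROM client JOIN country
                          ON client.country_code = country.code""").fetchall()
print(rows)

# A composite index on the columns a reporting query filters on speeds retrieval,
# at the cost of extra work on every insert, update or delete.
cur.execute("CREATE INDEX idx_client_country ON client (country_code, client_name)")
```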
Database Design
When considering the appropriate design of a database:
1. The primary focus for an operational system is processing speed and accuracy – to capture and process transactions as quickly as possible. Databases supporting operational systems therefore have large numbers of tables, very few joins linking the data across those tables, and minimal indexing. From a reporting perspective, the absence of joins between tables means the data is unlikely to be consistent, so reports will likely be flawed, and it will be impossible to run meaningful queries against the data because the tables cannot be linked together in a meaningful way.
2. To address these shortcomings, the next step would be to “normalise” the data to introduce data integrity / consistency and enable meaningful queries to be run. Without going into the complexities of the different forms of normalisation, and ignoring data redundancy issues, normalising a database effectively introduces joins between the tables to enforce data integrity. However, in an operational processing system, this needs to be done judiciously as lots of joins between the various tables would slow throughput considerably.
3. As databases get larger, normalisation will not only impair processing performance but also query performance, as queries will be run across larger numbers of tables with potentially many joins. To address this, larger databases, or data warehouses, will denormalise data – consolidating data from multiple tables into as few tables as possible to reduce the number of joins and improve query performance. This design approach is known as a star schema and invariably consists of a few, maybe even just one, central “fact” tables containing pertinent transaction data, joined to a number of “dimensional” tables. One downside of this is a possible threat to data integrity, so this is invariably addressed through a layered approach within the data warehouse – a staging layer, an ETL (extract, transform and load) layer and a published data layer. The staging layer holds data in its original source format, whilst the ETL layer extracts data from it, imposes data integrity through coded rules and then populates the published layer, where reports and analysis are conducted. A simplified sketch of a star schema follows below.
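The sketch below, again using Python’s sqlite3 module, shows the shape of a star schema. The fact_trade, dim_party, dim_product and dim_date tables, their columns and the sample data are illustrative assumptions only; in a full warehouse these tables would sit in the published layer, populated by the ETL layer from staged source data.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Dimension tables carry the descriptive attributes, stored once each.
cur.execute("CREATE TABLE dim_party   (party_id   INTEGER PRIMARY KEY, party_name  TEXT)")
cur.execute("CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, asset_class TEXT, product TEXT)")
cur.execute("CREATE TABLE dim_date    (date_id    INTEGER PRIMARY KEY, trade_date  TEXT)")

# One central fact table holds the transaction measures, joined only to the dimensions.
cur.execute("""CREATE TABLE fact_trade (
    trade_id   INTEGER PRIMARY KEY,
    party_id   INTEGER REFERENCES dim_party(party_id),
    product_id INTEGER REFERENCES dim_product(product_id),
    date_id    INTEGER REFERENCES dim_date(date_id),
    notional   REAL)""")

cur.executemany("INSERT INTO dim_party VALUES (?, ?)", [(1, "Bank A"), (2, "Bank B")])
cur.executemany("INSERT INTO dim_product VALUES (?, ?, ?)",
                [(1, "Credit", "CDS"), (2, "Rates", "IRS")])
cur.execute("INSERT INTO dim_date VALUES (1, '2009-06-30')")
cur.executemany("INSERT INTO fact_trade VALUES (?, ?, ?, ?, ?)",
                [(1, 1, 1, 1, 10e6), (2, 2, 2, 1, 25e6)])

# A typical analytical query touches the fact table and only a couple of dimensions.
rows = cur.execute("""SELECT p.asset_class, SUM(f.notional)
                        FROM fact_trade f
                        JOIN dim_product p ON f.product_id = p.product_id
                       GROUP BY p.asset_class""").fetchall()
print(rows)
```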
In addition to the fundamental system design differences there are also a number of other notable differences between operational and reporting systems:
• Data retention – operational systems are focussed on supporting the life cycle of a transaction and therefore to minimise the amount of data contained within tables, transactions are usually archived once complete. Information reporting systems are designed to store huge volumes of data with many years of history to facilitate wider forms of analyses including cyclical trends.
• Data timeliness – operational systems in financial services need to be real-time and need to have real-time failover. Information reporting systems can invariably be slightly more relaxed and are usually updated daily with failover usually based on a recovery from back-up (maybe 24/48 hours).
• Data processing – operational systems have rigid processing workflows to ensure high throughput, accuracy and resilience. Analytical systems have to be flexible to facilitate the myriad possible data paths associated with the likely reporting requirements. It is also worth noting here that running analytical capability from an operational database can significantly slow the operational processes, as queries will inevitably interfere with data processing.
Of course in practice the above design paradigms are just that, and invariably operational systems do have some level of data normalisation and do produce reports, albeit that these are invariably run when the central processing operations are offline or from a replicated system. It is important to note here that running reporting queries across a large number of tables in a live operational database can cause “contention” (where more than one process is trying to access the same data) which could in turn lock the database and prevent both the operational process and reporting process from completing successfully.
Designing a database is not as black and white as the above paradigms would suggest, but as data volumes grow there is usually a tendency to increase levels of normalisation, which will slow the performance of the operational system; eventually the sheer number of tables and joins will be such that queries to support reporting will also slow, as the system approaches the point where denormalisation is required. Whilst this lessens the importance of the design paradigms, it actually increases the need to understand the differences between operational processing and analytical reporting systems, as it is possible not only to compromise one driver for the other, but to end up with a platform that cannot support either.
An excellent example of the different focus of the two types of system can be found in the discussion of the need to establish a sole “official legal record” (so-called golden copy). This is an operational requirement: it enables the two parties to a transaction to agree a single representation of that transaction, which then allows it to be processed and completed. For reporting purposes it would be useful to understand the status of a trade, as that could imply certain things about a party’s STP capability and operational risk profile, but it would be preferable for the reporting system to include all trades regardless of their status, to ensure the most complete set of trade data possible.
For example, in fulfilment of the recent ISDA RFPs for asset-class Trade Reporting Repositories, there are likely to be some parties who do not submit their trades. Parties that execute trades with a non-submitting party should nonetheless be encouraged to submit their entire book (as opposed to just trades against other submitting parties); these “alleged” trades would provide the opportunity to analyse not only the fuller trade set of the submitting party but also the partial set of trades of the non-submitting party. Attempting to use these trade repositories to establish the golden copy would not be possible in this instance, so, if understanding what’s happening in the market is the priority, then a fuller trade set should be of more value.
Coming Next – Part 2 “Data Warehouses for the Financial Markets’ Oversight”
Part 2 of this series will review the key characteristics of a data warehouse reporting solution which differ from an operational processing system as applied for use in the financial markets.