– Part 2 of “When is a Data Repository Not…"
In a previous article, Simon Hopper outlined the specific design, functionality and operation of a data repository. In part two of the series, he provides further analysis on the use of data warehouses for the purpose of financial markets’ oversight. This article clarifies the main differences between operational processing and information reporting systems and explains the consequences if these differences if they are not considered today as when market works towards implementing a data warehouse for OTC derivatives.
Data Warehouse Reporting vs. Operational Processing System
Apart from the underlying design, there are a number of other key characteristics of a data warehouse reporting solution which differ from an operational processing system:
1. Counterparty / Regulatory structure
This is key as it sets out the legal structure of the counterparties within the system, and should at the very least contain a two tier structure covering legal entity and parent group, although it may be necessary to include branches of legal entities. This needs to be established as a matrix against the relevant regulatory body as it may be that any given institution may determine that only the appropriate regulatory entity may view a given group, legal entity and/or branches trading activity. Without this it will not be possible to disclose the correct level of information to any given user, whether regulatory, the public or users within the said institution(s).
2. Data Security
In addition to typical system/application security such as firewalls to prevent unauthorised access, the data warehouse should also include encryption of the data at the atomic level to ensure that authorised users can only access data that has been authorised. For example, institutions submitting data would only be able to see their own data, and regulators would only be able to see data for institutions for which they have regulatory responsibility despite all having access to the same report.
3. Data Transformation Mappings
To ensure that a full audit trail is maintained and data can be validated back to its source, it is imperative that a Source to Target Mapping is maintained. This should show the source of the data, how it has been transformed in the ETL layer and how it is stored in the published layer. This is vital where data is likely to be sourced from multiple systems and the data is to be used for regulatory analysis – if this is not maintained then it may prove impossible to validate the data produced in a report back to its source.
4. Data Dictionary and Naming Convention
Without a dictionary explaining the data contained within the data warehouse and a consistent business-based data naming convention, the ability for end users to define and build reports will be non-existent and a large highly specialised support function with significant key–resource risk will evolve within the operator. Key-resource risk arises because the understanding of the technical solution, the data contained within this and how to deliver reports from these invariably tends to be consolidated in those few key individuals who participated in the design and build of the system – a full data dictionary and business-based naming convention captures this knowledge within the solution and can therefore empower users to conduct their own analysis.
5. Information Accessibility
Access to the data warehouse should be via a secure web portal that allows users to run reports, slice and dice, and /or drill the results for greater understanding, and, ultimately, develop their own reports. This avoids a significant resource build up at the operator and therefore minimises the commercial costs and dependency / resource prioritisation issues that would otherwise arise.
6. Data Sources
To achieve the widest and fullest data set possible, data should ultimately be sourced from all market participants including infrastructure suppliers. As no infrastructure suppliers have complete coverage of any given market, the data must initially be sourced from the institutions themselves, and a phased approach taken to ensure that only additional sources are introduced where the additional data provided is of sufficient benefit to justify the cost of integration.
7. Data Validation
Data validation is an issue at three levels:
• Within the data warehouse, data can be validated from report to raw data received through the data transformation mapping
• From the data warehouse to the supplier source, through traditional checks such as file footers
• Completeness and accuracy of source data files can best be addressed by generating reports back to the supplier summarising the data received, however responsibility for the accuracy and completeness will always rest with the supplier as, short of auditing the supply process, extending validation beyond this is not practical
8. Trade Matching / Linking
Within operational processing matching is key to achieve settlement and closure. From an analytics perspective the status of a trade offers valuable insight but focus is more on how to identify the two sides of the same trade submitted by the two parties to the trade and, potentially, duplicates being submitted by third parties if, for example, market infrastructure suppliers also provide data. “De-duping” is not an uncommon issue within data warehousing where data is sourced from multiple providers – the hardware / software available is more than capable of processing very large volumes of data so the crux of the process is in defining the criteria to effect the de-duping necessary through the ETL processing of the transaction data. Linking the two sides would normally be achieved through matching counterparty code identifiers, instrument and currency codes, values, dates and counterparty references but can be extended to all data fields provided for the trade, however this could render some inconsistencies so it may be necessary to agree a data priority list to ensure fields are matched in the correct order allowing the definition of matched to be adjusted as analysis deems necessary.
Data Reporting and Analysis
With all this transactional data available for querying, what information is likely to be required? Having worked on numerous data warehousing programmes, a different approach is needed than that used in normal IT deliveries. For reporting systems it is more important to deliver a capability than a set of hard reports – many projects have focussed on a pre-defined list of reports, extracted the data to meet these from various operational systems and built the required reports. The data provided in these reports invariably raises more questions that require more reports, so more data has to be fed from the source operational systems and the original data warehouse design expanded.
This likelihood should be recognised at the outset, and when considering what data should be sourced from operational systems, the focus should be on what should be excluded (for example, user login or other security related data) rather than what is required for the specified reports. If all data is bought into the data warehouses staging layer then data required for future reports is already available and there’s no requirement to go back to the operational systems IT team for another change (which would not rank very high in their list of priorities compared to operational changes).
With all the data in the staging layer, data can be processed through the ETL layer into the published layer to meet the specific reporting requirements. If a good business data dictionary and a consistent, meaningful business naming convention has been adopted, users can be given access to build their own reports – for super users, this access can be extended into the staging layer so they can see what further data is available for them to query if needed.
By taking this approach, delivery timescales for new reports / information can be reduced to weeks and even days. If a new report required further data to be sourced form an operational system, the new report may well take months !
For the reports themselves, whilst it would ultimately be for the various user groupings to define their requirements, some anticipation should be possible based upon a knowledge of the business, the data and the technical possibilities.
For the recent regulatory reporting RFPs from ISDA, and the CESR consultative paper on reporting repositories, it would be anticipated that:
• the general public would be able to view fixed reports containing aggregated data at market level without any analytical capability
• regulators would be able to see reports aggregated at market and counterparty level covering those parties for which they are responsible for regulating (unless a wider coverage is agreed by that party), and would be able to analyse these reports through slicing/dicing/drilling down to individual trade level
• parties submitting data would have full access to reports containing their own data
It would also be anticipated that these reports would initially be aggregated at the submitting party’s group level with the capability to drill to a sub-report based on both the legal entity and its activities by asset class, or by asset class across the party at the group level. From asset class, the reports would be capable of drilling to individual assets and then the individual underlying trades. These reports would likely be simple aggregations of directional positions, however the introduction of additional pricing feeds into the data warehouse could also allow for a variety of pricing analyses, such as the identification of trades that are struck significantly in or out of the money.
It would also be anticipated that reports based on asset classes, or individual assets / asset pairings, would be required to identify the market participants (at the legal entity level) with the largest “position”. Whilst of obvious use to the relevant regulator these may also be of use to market participants although the positioning of competitor activity would need to be desensitised. For example, a participant might see that its current position on a given equity, or to a particular currency pair ranked it 5th with a given value and percentage of the total and, depending on what level of disclosure market participants are prepared to allow, it might be able to see the values and percentages of other participants but without knowing the names of those participants.
Also, the inclusion of trades that have not been matched, confirmed or otherwise corroborated allows for some analyses of the operational performance of a given market and/or party. Apart from analysis of parties not submitting their activity to the TR through the inclusion of trades alleged against them by others parties that are submitting their activity, if trades are submitted from execution through to completion and the status of the trade is tracked as it changes, then timelines can be determined (for that market, the asset class, the individual asset, the counterparty group, the counterparty legal entity and/or the counterparty branch level). Counterparties, assets etc can then be compared against each other to understand the relative efficiency of these plus individual parties or even trades that diverge from the established timeline can then be analysed to understand cause and effect.
As stated above, it is not uncommon at the outset of a reporting data warehouse initiative for there to be a lack of clarity on the reports and information required as invariably access to information inevitably then raises queries about that information that require further forms of analyses. This is another reason why the design of a reporting platform is so significantly different to that of an operational processing solution – the latter requires clear defined processing paths to ensure throughput (speed and volume) whilst the former needs a more open matrix set of paths to avoid preventing future development.
In all of the above possible forms of reporting, as stated under characteristics before, controls must be established to ensure
• supervisors are only able to obtain data directly relating to the entities that they themselves supervise or are otherwise authorised to view
• market participants should have no access to data other than their own aggregates, trades and positions
• no information should be made available to the public from which any inference could be drawn about any participant’s market positions.
Conclusion
The main point of this article was to provide some clarity on data repositories and data warehouses, and raise awareness of the key differences between operational processing and information reporting systems. Hopefully the above has accomplished this – data warehouse projects are notoriously complex and challenging, with more failing than succeeding, so focussing on the provision of a useful information capability is key. Operational improvements are needed in some markets more than others, but that is not the driver here and if the two are confused the chances of success are further reduced. If operational enhancements are to be the driver then the impact on informative reporting needs to be fully understood as it will be severely compromised.
Read Part 1 –"When a Data Repository is Not"