Everything Connected, Nothing Changed
Traditional methods of connecting address-centric datasets inevitably limit the usable output. This is because the unifying keys, or ‘universal’ identifiers, adopted to date are themselves limited. Each separate feed is therefore only partially matched, irrespective of the quality of the matching process. When multiple feeds are combined, this reduction in data utilisation compounds.
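To illustrate how the loss compounds, here is a minimal sketch. The match rates are hypothetical, and the calculation assumes that match failures in each feed are independent of one another, which real feeds will only approximate.

```python
from math import prod


def combined_match_rate(match_rates):
    """Share of records usable across every feed at once.

    Assumes match failures are independent between feeds (an
    illustrative simplification, not a measured property).
    """
    return prod(match_rates)


# Hypothetical match rates for three feeds against one shared identifier.
rates = [0.90, 0.85, 0.80]
print(f"{combined_match_rate(rates):.3f}")
```

Under these assumptions each feed looks well matched on its own, yet fewer than two thirds of records survive once all three are combined.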
Data Fabric promotes full data consumption by connecting everything (pretty much) and changing nothing (pretty much). The connections are achieved through a proprietary grid of positions, called HPIDs, against which the data is placed. The HPIDs sit underneath all other keys, so to speak, providing the deepest level of connection without changing the data itself. So previously siloed data both within and across providers can be viewed in full and as part of the whole. At Harness we feel this is compelling for data owners and end-users alike.
This ‘already connected’, unlimited view of the data provides an open pathway for Data Fabric subscribers to define and create data journeys. A journey is the application of limits (cleansing, business logic, discrimination, segmentation etc.) to satisfy a given use case. One such journey is the latest Harness price per square metre (PPSM) release, which connects data from four Open Government datasets:
- HM Land Registry Price Paid (price and transaction date)
- MHCLG Non-Domestic EPC (floor area)
- MHCLG Domestic EPC (floor area)
- VOA Rating List (floor area)
This results in circa 16.7 million usable PPSM records from circa 25.9 million Land Registry sales connected in Data Fabric, to the end of February 2021, for residential and commercial real estate in England and Wales.
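The shape of this journey can be sketched as follows. This is an illustration only, not the Harness implementation: the `hpid` field, the `Sale` record and the sample values are all hypothetical, and the only business logic applied is that a sale needs a positive floor area to produce a PPSM figure.

```python
from dataclasses import dataclass


@dataclass
class Sale:
    hpid: str     # hypothetical position key shared by all feeds
    price: float  # price paid, in GBP
    date: str     # transaction date, ISO format


def ppsm_records(sales, floor_area_by_hpid):
    """Return (hpid, date, ppsm) for each sale with a usable floor area."""
    records = []
    for sale in sales:
        area = floor_area_by_hpid.get(sale.hpid)
        if area is None or area <= 0:
            # The sale stays connected but cannot yield a PPSM figure.
            continue
        records.append((sale.hpid, sale.date, round(sale.price / area, 2)))
    return records


# Made-up sample data: two sales at one position, one sale at another.
sales = [
    Sale("H1", 250_000, "2020-06-01"),
    Sale("H1", 300_000, "2021-01-15"),
    Sale("H2", 180_000, "2019-03-20"),
]
floor_area = {"H1": 100.0}  # square metres; H2 has no floor area on file

print(ppsm_records(sales, floor_area))
```

In this sketch the sale at `H2` is dropped from the PPSM output but not from the underlying data, which mirrors the gap between connected sales and usable PPSM records described above.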
These connections form a small part of the wider Data Fabric platform. Looking across from the keys in the PPSM data to datasets that are not part of the release shows, for example, how the output could be extended to incorporate other atomic data elements from any source in the system. Analysing across sources gives the following counts, which equate to the number of unique keys from each connected feed. The total number of connecting keys for each source is included for completeness and to show the degree of intersection.
| Supplier | Dataset | Key name | Connecting keys in/to PPSM | Total system connecting keys |
|---|---|---|---:|---:|
| Ordnance Survey | AddressBase Premium (in PPSM) | UPRN | 8,235,705 | 38,189,195 |
| HMLR | Price Paid (in PPSM) | GUID | 16,707,658 | 25,914,817 |
| MHCLG | Domestic EPC (in PPSM) | LmkKey | 9,448,930 | 20,125,562 |
| MHCLG | Non-Domestic EPC (in PPSM) | LmkKey | 90,017 | 923,599 |
| VOA | VOA Rating List (in PPSM) | UARN | 8,307 | 3,153,853 |
| Companies House | Basic Company Data | CompanyNumber | 1,524,190 | 6,639,210 |
| Companies House | Persons With Significant Control | ETag | 1,905,913 | 9,602,530 |
| HMLR | Land Registry Titles+ | TitleNumber | 8,730,970 | 26,091,791 |
| HMLR | Land Registry Title Owners+ | CompanyId | 704,059 | 813,329 |
| Food Standards Agency | Food Hygiene Ratings Scheme | FHRSID | 33,316 | 603,560 |
The table presents the connection profile from the perspective of the PPSM data journey. As might be expected in this case, the HMLR Price Paid GUIDs are the most numerous, with one for each record. The ratio of GUIDs to HPIDs gives an average of 1.9 transactions per property object. Title numbers are known for the majority because the platform compiles the data into a single view rather than in part and across multiple feeds. At the system level, HPIDs currently provide 66.8m address positions to which data can be pinned.
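The degree of intersection can be read off the table by dividing each feed's PPSM key count by its system total. The sketch below does this for the four feeds in the PPSM release; the counts are copied from the table, while the `intersection_share` helper and the labels are ours, added for illustration.

```python
# (connecting keys in/to PPSM, total system connecting keys), from the table.
counts = {
    "HMLR Price Paid (GUID)": (16_707_658, 25_914_817),
    "MHCLG Domestic EPC (LmkKey)": (9_448_930, 20_125_562),
    "MHCLG Non-Domestic EPC (LmkKey)": (90_017, 923_599),
    "VOA Rating List (UARN)": (8_307, 3_153_853),
}


def intersection_share(in_ppsm, total):
    """Fraction of a feed's system-wide keys that appear in the PPSM data."""
    return in_ppsm / total


for name, (in_ppsm, total) in counts.items():
    print(f"{name}: {intersection_share(in_ppsm, total):.1%}")
```

For Price Paid this gives roughly 64%, which corresponds to the circa 16.7 million usable PPSM records out of circa 25.9 million connected sales.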
About Data Fabric
Data Fabric gives the most upstream view of the data possible, or maximum data utilisation. That is, almost all the data is surfaced as a connected whole because the grid of HPIDs is not limited by existing perspectives. Most of the connected data is sourced under the Open Government Licence v3.0, though any address-centric feed can be added because the platform is completely modular. For example, the Experian Shop*Point data has recently been included. This is exciting because it represents the first inclusion of commercially available data beyond the essential AddressBase Premium.
This approach delivers a double effect on data utilisation rates. That is:
- there are more positions (enabled by HPIDs) to which data can be pinned, and
- the understanding of connections surfaces more existing identifiers to match against.
Data Fabric allows consumption of all these datasets without spending large amounts of time, resource and money ingesting and connecting the data. The sources are ready to go and connected at the most atomic level possible.