Address Similarity Guide

Lets get started!

The intention of this walkthrough is to provide guidance around methods that can be implemented inside Neo4j to help you discover similarities inside your data. These similarities can be in a wide variety of data; below are a few examples are:

People e.g. KYC, System Reconsiliation

Addresses

Merchants

Vehicles

Policies

Basically anything you have inside of your business that has dirty data, and there are duplicates!

What does the data look like?

Lets review the schema:

// Show database schema
CALL db.schema.visualization();

Environment Clean Up

// Clean Up
MATCH (n:LinkedAddress)
DETACH DELETE n;

MATCH (n:RoyalMailAddress)
DETACH DELETE n;

MATCH ()-[r:SIMILAR_ADDRESS]-()
DETACH DELETE r;

Query "Albyn Place" Addresses

The below query will provide an example of dirty data in this dataset.

// Query "Albyn Place" Addresses
MATCH (a:Address)
WHERE a.FullAddress CONTAINS "37 ALBYN PLACE"
RETURN a;

Similarity Scoring

In the following query, we are going to execute a similarity score that executes in parallel as it takes all addresses whose postcodes start with "A". You can see this by reviewing the following code snippet:

WHERE a.RegAddressPostCode STARTS WITH 'A'

We then further segment all the postcodes by bucketing the postcode whose first 3 charactors are the same:

WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) as postcodes

This provides enough data to ensure there is enough overlap to ensure addresses with slightly different postcodes is still compared together.

// Parallel Similarity Scoring Version
MATCH (a:Address)
WHERE a.RegAddressPostCode STARTS WITH 'A'
WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) AS postcodes
CALL apoc.cypher.mapParallel2("
	MATCH (a:Address), (b:Address)
	
		WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH _ AND b.RegAddressPostCode STARTS WITH _
	
		// Pass Variables
		WITH a, b,
	
		// Build similarity scores
		apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
		apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
		apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
		apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
		apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
		apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim
	
		WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation
	
		// Selection logic //

		// Limit the similarity of the full address
		WHERE full_address_sim > 0.6

			// Postcodes can not be too far apart
			AND post_sim > 0.7
			
			// Looks at address who have prefixes e.g. 37 ALBYN PLACE vs KPMG 37 ALBYN PLACE
			// This addition pushes the address into Line 2
			AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
			AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)

		RETURN id(a) as a_id, a.FullAddress as a_FullAddress,id(b) as b_id, b.FullAddress as b_FullAddress, full_address_sim;
	",
	{parallel:True, batchSize:1000, concurrency:6}, postcodes, 6) YIELD value

RETURN value.a_id AS a_id, value.a_FullAddress AS a_full_address, value.b_id AS b_id, value.b_FullAddress AS b_full_address, value.full_address_sim AS full_address_similarity;

Create Similarity Relationship

Here we are using the same logic as in the previous query, but this time we are creating a relationship between all nodes that we deem are "similar".

// Create Similarity Relationship
MATCH (a:Address), (b:Address)

// Using 'AB1' for demo/performance purposes
WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH 'AB1' AND b.RegAddressPostCode STARTS WITH 'AB1'

// Pass Variables
WITH a, b,

// Build similarity scores
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim

WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation

// Selection logic

	// Limit the similarity of the full address
WHERE full_address_sim > 0.6

	// Postcodes can not be too far apart
	AND post_sim > 0.7

	// Looks at addresses that have prefixes e.g. 37 ALBYN PLACE vs KPMG 37 ALBYN PLACE
	// This addition pushes the address into Line 2
	AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)

	AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)

CREATE (a)-[:SIMILAR_ADDRESS {full_address_sim: full_address_sim, post_sim: post_sim, line_2_sim: line_2_sim, line_1_sim: line_1_sim}]->(b);

Query ALL Similar Addresses

Here we will look at all the address clusters we have deemed to be similar enough to link them together.

// Query Similar Addresses
MATCH path=(a)-[:SIMILAR_ADDRESS]->()
WHERE a.FullAddress CONTAINS "AB101JB"
RETURN path;

Write Linked Address Relationship

Now we are going to create a unique LinkedAddress node that will be connected to all addresses with the same address. The reason for single node to be connected to all other is for a few reasons:

It ensures that every node is only one hop away from every other node, no matter how the SIMILAR_ADDRESS connects them.

This LinkedAddress can hold core information that has been validated that all other nodes should use. For example, the LinkedAddress hold the validated address that should be used in all correspondence.

// Write Linked Address Relationship
:auto MATCH (a:Address)-[:SIMILAR_ADDRESS]-(b:Address)
// WHERE id(a) = nodes[0] AND id(b) = nodes[1]

CALL {
	WITH a, b
	OPTIONAL MATCH (a)-[r1:LINKED_TO]->(c1:LinkedAddress)
	OPTIONAL MATCH (b)-[r2:LINKED_TO]->(c2:LinkedAddress)


	FOREACH(ignoreMe IN CASE WHEN c1 IS NULL AND c2 IS NULL THEN [1] ELSE [] END | 
		CREATE (a)-[:LINKED_TO]->(:LinkedAddress)<-[:LINKED_TO]-(b)
	)

	FOREACH(ignoreMe IN CASE WHEN c1 IS NOT NULL AND c2 IS NULL THEN [1] ELSE [] END | 
		CREATE (b)-[:LINKED_TO]->(c1)
	)

	FOREACH(ignoreMe IN CASE WHEN c1 IS NULL AND c2 IS NOT NULL THEN [1] ELSE [] END | 
		CREATE (a)-[:LINKED_TO]->(c2)
	)

	FOREACH(ignoreMe IN CASE WHEN c1 IS NOT NULL AND c2 IS NOT NULL AND ID(c1) <> ID(c2) THEN [1] ELSE [] END | 
		DETACH DELETE c2
		CREATE (b)-[:LINKED_TO]->(c1)
	)
} IN TRANSACTIONS OF 1 ROWS

Query ALL Similar Addresses and LinkedAddress Nodes

Now we are going to bring back all "37 ALBYN PLACE" nodes along with any LinkedAddress nodes

// Query Similar Addresses and LinkedAddress Nodes
MATCH path=(a)-[:SIMILAR_ADDRESS|LINKED_TO]->()
WHERE a.FullAddress CONTAINS "AB101JB"
RETURN path;

What is missing from the following output?

**Hint: Think back to the first output when I first showed you all the "37 ALBYN PLACE" addresses.** Similarity Demo

Has anyone worked it out?

Lets take a look:

// Query Similar Addresses and LinkedAddress Nodes
MATCH path=(a)-[:SIMILAR_ADDRESS|LINKED_TO]->()
WHERE a.FullAddress CONTAINS "37 ALBYN PLACE"
RETURN path;

Enirch data with 3rd party information

It could also enrich this dataset with purchased data, providing validated data that could be used inside of applications and APIs.

// Create Royal Mail Address
MERGE (a:RoyalMailAddress {
	AddressId: "royal-mail-123",
	RegAddressAddressLine1: "37 ALBYN PLACE",
	RegAddressAddressLine2: "ALBYN PLACE",
	RegAddressPostTown: "ABERDEEN",
	RegAddressPostCode: "AB101JB",
	latitude: 57.142641,
	longitude: -2.124153,
	FullAddress: "37 ALBYN PLACE ALBYN PLACE ABERDEEN AB101JB"
});

MATCH (a:Address)-[r:LINKED_TO]->(l:LinkedAddress)
WHERE a.FullAddress CONTAINS "37 ALBYN PLACE"
WITH r, l
MATCH (p:RoyalMailAddress)
WITH r, l, p
MERGE (l)-[:ROYAL_MAIL_ADDRESS]->(p);

Query Royal Main Address Node

Lets bring back our newly created RoyalMailAddress linked to "37 ALBYN PLACE"

// Royal Mail Address Node
MATCH path=(a:Address)-[:LINKED_TO]->(:LinkedAddress)-[:ROYAL_MAIL_ADDRESS]->(x)
WHERE a.FullAddress CONTAINS "37 ALBYN PLACE"
RETURN path

A few little extras:

GraphQL API

Command Line Validation App