Similarity Demo

Address Similarity Guide

Lets get started!

The intention of this walkthrough is to provide guidance around methods that can be implemented inside Neo4j to help you discover similarities inside your data. These similarities can be in a wide variety of data; below are a few examples are:

  • People e.g. KYC, System Reconsiliation
  • Addresses
  • Merchants
  • Vehicles
  • Policies

  • Basically anything you have inside of your business that has dirty data, and there are duplicates!


    What does the data look like?

    Similarity Demo

    Lets review the schema:

    // Show database schema
    CALL db.schema.visualization();
    					

    Environment Clean Up

    // Clean Up
    MATCH (n:LinkedAddress)
    DETACH DELETE n;
    
    MATCH (n:RoyalMailAddress)
    DETACH DELETE n;
    
    MATCH ()-[r:SIMILAR_ADDRESS]-()
    DETACH DELETE r;
    				

    Query "Albyn Place" Addresses

    The below query will provide an example of dirty data in this dataset.

    // Query "Albyn Place" Addresses
    MATCH (a:Address)
    WHERE a.FullAddress CONTAINS "37 ALBYN PLACE"
    RETURN a;
    				

    Similarity Scoring

    In the following query, we are going to execute a similarity score that executes in parallel as it takes all addresses whose postcodes start with "A". You can see this by reviewing the following code snippet:

    WHERE a.RegAddressPostCode STARTS WITH 'A'

    We then further segment all the postcodes by bucketing the postcode whose first 3 charactors are the same:

    WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) as postcodes

    This provides enough data to ensure there is enough overlap to ensure addresses with slightly different postcodes is still compared together.

    // Parallel Similarity Scoring Version
    MATCH (a:Address)
    WHERE a.RegAddressPostCode STARTS WITH 'A'
    WITH COLLECT(DISTINCT(left(a.RegAddressPostCode, 3))) AS postcodes
    CALL apoc.cypher.mapParallel2("
    	MATCH (a:Address), (b:Address)
    	
    		WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH _ AND b.RegAddressPostCode STARTS WITH _
    	
    		// Pass Variables
    		WITH a, b,
    	
    		// Build similarity scores
    		apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
    		apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
    		apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
    		apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
    		apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
    		apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim
    	
    		WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation
    	
    		// Selection logic //
    
    		// Limit the similarity of the full address
    		WHERE full_address_sim > 0.6
    
    			// Postcodes can not be too far apart
    			AND post_sim > 0.7
    			
    			// Looks at address who have prefixes e.g. 37 ALBYN PLACE vs KPMG 37 ALBYN PLACE
    			// This addition pushes the address into Line 2
    			AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
    			AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)
    
    		RETURN id(a) as a_id, a.FullAddress as a_FullAddress,id(b) as b_id, b.FullAddress as b_FullAddress, full_address_sim;
    	",
    	{parallel:True, batchSize:1000, concurrency:6}, postcodes, 6) YIELD value
    
    RETURN value.a_id AS a_id, value.a_FullAddress AS a_full_address, value.b_id AS b_id, value.b_FullAddress AS b_full_address, value.full_address_sim AS full_address_similarity;		
    				

    Create Similarity Relationship

    Here we are using the same logic as in the previous query, but this time we are creating a relationship between all nodes that we deem are "similar".

    // Create Similarity Relationship
    MATCH (a:Address), (b:Address)
    
    // Using 'AB1' for demo/performance purposes
    WHERE id(a) > id(b) AND a.RegAddressPostCode STARTS WITH 'AB1' AND b.RegAddressPostCode STARTS WITH 'AB1'
    
    // Pass Variables
    WITH a, b,
    
    // Build similarity scores
    apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine1) AS line_1_sim,
    apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine2) AS line_2_sim,
    apoc.text.levenshteinSimilarity(a.RegAddressAddressLine1, b.RegAddressAddressLine2) AS a_b_line_1,
    apoc.text.levenshteinSimilarity(a.RegAddressAddressLine2, b.RegAddressAddressLine1) AS b_a_line_1,
    apoc.text.levenshteinSimilarity(a.RegAddressPostCode, b.RegAddressPostCode) AS post_sim,
    apoc.text.levenshteinSimilarity(a.FullAddress, b.FullAddress) AS full_address_sim
    
    WITH a, b, line_1_sim, line_2_sim, a_b_line_1, b_a_line_1, post_sim, full_address_sim, ((line_1_sim + line_2_sim) / 2) as add_1_2_calculation
    
    // Selection logic
    
    	// Limit the similarity of the full address
    WHERE full_address_sim > 0.6
    
    	// Postcodes can not be too far apart
    	AND post_sim > 0.7
    
    	// Looks at addresses that have prefixes e.g. 37 ALBYN PLACE vs KPMG 37 ALBYN PLACE
    	// This addition pushes the address into Line 2
    	AND ((line_1_sim = 1 OR a_b_line_1 = 1 OR b_a_line_1 = 1) AND post_sim > 0.85)
    
    	AND NOT (add_1_2_calculation > 0.6 AND full_address_sim > 0.91 AND post_sim > 0.9)
    
    CREATE (a)-[:SIMILAR_ADDRESS {full_address_sim: full_address_sim, post_sim: post_sim, line_2_sim: line_2_sim, line_1_sim: line_1_sim}]->(b);
    				

    Query ALL Similar Addresses

    Here we will look at all the address clusters we have deemed to be similar enough to link them together.

    // Query Similar Addresses
    MATCH path=(a)-[:SIMILAR_ADDRESS]->()
    WHERE a.FullAddress CONTAINS "AB101JB"
    RETURN path;
    				

    Write Linked Address Relationship

    Now we are going to create a unique LinkedAddress node that will be connected to all addresses with the same address. The reason for single node to be connected to all other is for a few reasons:

  • It ensures that every node is only one hop away from every other node, no matter how the SIMILAR_ADDRESS connects them.
  • This LinkedAddress can hold core information that has been validated that all other nodes should use. For example, the LinkedAddress hold the validated address that should be used in all correspondence.

  • // Write Linked Address Relationship
    :auto MATCH (a:Address)-[:SIMILAR_ADDRESS]-(b:Address)
    // WHERE id(a) = nodes[0] AND id(b) = nodes[1]
    
    CALL {
    	WITH a, b
    	OPTIONAL MATCH (a)-[r1:LINKED_TO]->(c1:LinkedAddress)
    	OPTIONAL MATCH (b)-[r2:LINKED_TO]->(c2:LinkedAddress)
    
    
    	FOREACH(ignoreMe IN CASE WHEN c1 IS NULL AND c2 IS NULL THEN [1] ELSE [] END | 
    		CREATE (a)-[:LINKED_TO]->(:LinkedAddress)<-[:LINKED_TO]-(b)
    	)
    
    	FOREACH(ignoreMe IN CASE WHEN c1 IS NOT NULL AND c2 IS NULL THEN [1] ELSE [] END | 
    		CREATE (b)-[:LINKED_TO]->(c1)
    	)
    
    	FOREACH(ignoreMe IN CASE WHEN c1 IS NULL AND c2 IS NOT NULL THEN [1] ELSE [] END | 
    		CREATE (a)-[:LINKED_TO]->(c2)
    	)
    
    	FOREACH(ignoreMe IN CASE WHEN c1 IS NOT NULL AND c2 IS NOT NULL AND ID(c1) <> ID(c2) THEN [1] ELSE [] END | 
    		DETACH DELETE c2
    		CREATE (b)-[:LINKED_TO]->(c1)
    	)
    } IN TRANSACTIONS OF 1 ROWS
    				

    Query ALL Similar Addresses and LinkedAddress Nodes

    Now we are going to bring back all "37 ALBYN PLACE" nodes along with any LinkedAddress nodes

    // Query Similar Addresses and LinkedAddress Nodes
    MATCH path=(a)-[:SIMILAR_ADDRESS|LINKED_TO]->()
    WHERE a.FullAddress CONTAINS "AB101JB"
    RETURN path;
    				

    What is missing from the following output?

    **Hint: Think back to the first output when I first showed you all the "37 ALBYN PLACE" addresses.**
    Similarity Demo

    Has anyone worked it out?

    Lets take a look:

    // Query Similar Addresses and LinkedAddress Nodes
    MATCH path=(a)-[:SIMILAR_ADDRESS|LINKED_TO]->()
    WHERE a.FullAddress CONTAINS "37 ALBYN PLACE"
    RETURN path;
    				

    Enirch data with 3rd party information

    It could also enrich this dataset with purchased data, providing validated data that could be used inside of applications and APIs.

    // Create Royal Mail Address
    MERGE (a:RoyalMailAddress {
    	AddressId: "royal-mail-123",
    	RegAddressAddressLine1: "37 ALBYN PLACE",
    	RegAddressAddressLine2: "ALBYN PLACE",
    	RegAddressPostTown: "ABERDEEN",
    	RegAddressPostCode: "AB101JB",
    	latitude: 57.142641,
    	longitude: -2.124153,
    	FullAddress: "37 ALBYN PLACE ALBYN PLACE ABERDEEN AB101JB"
    });
    
    MATCH (a:Address)-[r:LINKED_TO]->(l:LinkedAddress)
    WHERE a.FullAddress CONTAINS "37 ALBYN PLACE"
    WITH r, l
    MATCH (p:RoyalMailAddress)
    WITH r, l, p
    MERGE (l)-[:ROYAL_MAIL_ADDRESS]->(p);
    				

    Query Royal Main Address Node

    Lets bring back our newly created RoyalMailAddress linked to "37 ALBYN PLACE"

    // Royal Mail Address Node
    MATCH path=(a:Address)-[:LINKED_TO]->(:LinkedAddress)-[:ROYAL_MAIL_ADDRESS]->(x)
    WHERE a.FullAddress CONTAINS "37 ALBYN PLACE"
    RETURN path