Similarity Demo

Quote Fraud Guide

Insurance quote fraud refers to the deceptive practice of providing false or misleading information during the process of obtaining an insurance quote. Individuals or organisations engaged in this fraudulent activity deliberately manipulate data such as their personal details, assets, or claims history to secure lower insurance premiums.

“Research Reveals Half of U.K. Consumers Think It’s Fine to Fib”


So how can Neo4j help?


Full example here: https://neo4j.com/developer/industry-use-cases/insurance/quote-fraud/

How Can Graph Databases Help?

  1. Real-Time Fraud Detection: Neo4j's real-time data handling helps insurance companies detect and prevent fraud by quickly identifying anomalies and suspicious patterns in quotes, policies, and claims.

  2. Graph Data Modeling: Neo4j helps insurance companies detect and prevent fraud more accurately by modelling data as a graph, which allows for identifying hidden relationships and patterns among entities like policyholders, claims, agents, and fraud indicators.

  3. Network Analysis: Neo4j's graph algorithms and traversal capabilities can help insurers identify fraudulent networks and patterns involving multiple policies, claimants, or agents.

Clean Up Environment.

// WARNING: deletes EVERYTHING in the database; run only in a scratch environment
MATCH (n) DETACH DELETE n;

Load Demo Data.

// Create quote nodes
CREATE (q1:Quote {firstname: "Micheal", surname: "Down", dob: date("1988-02-02"), postcode: "YO30 7DW", longitude: -1.0927426, latitude: 53.96372145, passport: 584699531, created_date: datetime()-duration({years: 1, months: 1, minutes: 9}), change_info: "first quote"})
CREATE (q2:Quote {firstname: "Michael", surname: "Down", dob: date("1988-02-02"), postcode: "YO30 7DW", longitude: -1.0927426, latitude: 53.96372145, passport: 584699531, created_date: datetime()-duration({years: 1, months: 1, minutes: 4}), change_info: "name change ea to ae"})
CREATE (q3:Quote {firstname: "Michael", surname: "Down", dob: date("1988-02-02"), postcode: "YO30 7DW", longitude: -1.0927426, latitude: 53.96372145, passport: 584699531, created_date: datetime()-duration({years: 1, months: 1, minutes: 3}), change_info: "postcode_change"})
CREATE (q4:Quote {firstname: "Michael", surname: "Down", dob: date("1988-02-02"), postcode: "PA62 6AA", longitude: -5.851487, latitude: 56.359258, passport: 584699530, created_date: datetime()-duration({years: 1, months: 1}), change_info: "passport number"})
CREATE (q5:Quote {firstname: "Michael", surname: "Down", dob: date("1988-02-02"), postcode: "PA62 6AA", longitude: -5.851487, latitude: 56.359258, passport: 584699530, created_date: datetime()-duration({months: 1}), change_info: "quote 1yr later"})
CREATE (q6:Quote {firstname: "Michael", surname: "Down", dob: date("1988-02-02"), postcode: "PA62 6AA", longitude: -5.851487, latitude: 56.359258, passport: 584699530, created_date: datetime(), change_info: "quote 1m later"})


// Create all relationships
CREATE (q1)-[:NEXT_QUOTE {diff_seconds: duration.inSeconds(q1.created_date, q2.created_date).seconds}]->(q2)
CREATE (q2)-[:NEXT_QUOTE {diff_seconds: duration.inSeconds(q2.created_date, q3.created_date).seconds}]->(q3)
CREATE (q3)-[:NEXT_QUOTE {diff_seconds: duration.inSeconds(q3.created_date, q4.created_date).seconds}]->(q4)
CREATE (q4)-[:NEXT_QUOTE {diff_seconds: duration.inSeconds(q4.created_date, q5.created_date).seconds}]->(q5)
CREATE (q5)-[:NEXT_QUOTE {diff_seconds: duration.inSeconds(q5.created_date, q6.created_date).seconds}]->(q6)

Data Model


Let's review the schema:

// Show database schema
CALL db.schema.visualization();

Quick look at the raw data.

First, let's have a quick look at the raw data that is stored inside Neo4j.

// View all quotes
MATCH path=()-[r:NEXT_QUOTE]->() 
RETURN path;

Split the quote chains based on time.

When analysing quotes, the time between them is a really important factor. Imagine the following scenario.

Buying Car Insurance:

Most people buy car insurance roughly once a year. A policy usually lasts 12 months, so comparing last year's quote with this year's would reveal some obvious differences:

  • No claims bonus - would (hopefully) be one year greater than the previous year.
  • Car age - the car would be one year older.
  • Mileage - we would expect this to be greater, by an amount that depends on factors such as the person's age, job, and address.

  This means that to look for discrepancies between quotes, we need to group them into small buckets of time. I think of these buckets as similar to web sessions.

    In the query below, we split the chain whenever the time difference between consecutive quotes is greater than 3600 seconds (one hour).

    // Split Quote Chain
    MATCH path=()-[rel:NEXT_QUOTE]->()
    WHERE rel.diff_seconds < 3600
    RETURN path;	
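As an aside, this sessioning idea can be illustrated outside of Cypher. The Python sketch below (illustrative only; not part of the demo) groups sorted quote timestamps into buckets, starting a new bucket whenever the gap between consecutive quotes reaches 3600 seconds:

```python
# Illustrative sketch: split a chain of quote timestamps into "sessions",
# mirroring the rel.diff_seconds < 3600 filter in the Cypher query above.

def split_sessions(timestamps, max_gap=3600):
    """Group sorted epoch-second timestamps; start a new session when
    the gap to the previous quote is max_gap seconds or more."""
    sessions = []
    current = []
    for ts in timestamps:
        if current and ts - current[-1] >= max_gap:
            sessions.append(current)
            current = []
        current.append(ts)
    if current:
        sessions.append(current)
    return sessions

quotes = [0, 300, 360, 7200, 7300]   # seconds; >1h gap between 360 and 7200
print(split_sessions(quotes))        # [[0, 300, 360], [7200, 7300]]
```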
    				

    Return Single Record

    The same outcome, but a more complicated query that returns each full chain as a single record.

    MATCH path=(firstQ)-[r:NEXT_QUOTE*..1000]->(lastQ)
    WHERE
        // Path termination condition (first)
        (NOT EXISTS{ (firstQ)<-[:NEXT_QUOTE]-() } OR EXISTS{ (firstQ)<-[x:NEXT_QUOTE]-() WHERE x.diff_seconds >= 3600 } )
        AND

        // Path termination condition (last)
        (NOT EXISTS{ (lastQ)-[:NEXT_QUOTE]->() } OR EXISTS{ (lastQ)-[x:NEXT_QUOTE]->() WHERE x.diff_seconds >= 3600 } )
        AND

        // No gaps condition (if you remove this condition, gaps are allowed and you get spurious longer chains that satisfy the end-of-path conditions but not the max-diff condition)
        ALL(x IN relationships(path) WHERE x.diff_seconds < 3600 )
        AND

        // Filter based on quotes in the last N days
        firstQ.created_date > datetime() - duration({days: 1000})
        AND

        // Where there is more than one quote in the chain, otherwise there is nothing to compare against
        length(path) > 1

    RETURN path

    Create Similarity Relationship

    Here we are using the same logic as in the previous query, but this time we are creating a relationship between all nodes that we deem "similar".

    // Create Similarity Relationship
    MATCH path=(firstQ)-[r:NEXT_QUOTE*..1000]->(lastQ)
    WHERE
    
        // Path termination condition (first)
        (NOT EXISTS{ (firstQ)<-[:NEXT_QUOTE]-() } OR EXISTS{ (firstQ)<-[x:NEXT_QUOTE]-() WHERE x.diff_seconds >= 3600 } )
        AND
    
        // Path termination condition (last)
        (NOT EXISTS{ (lastQ)-[:NEXT_QUOTE]->() } OR EXISTS{ (lastQ)-[x:NEXT_QUOTE]->() WHERE x.diff_seconds >= 3600 } )
        AND
    
        // No gaps condition (if you remove this condition, gaps are allowed and you get spurious longer chains that satisfy the end-of-path conditions but not the max-diff condition)
        ALL(x IN relationships(path) WHERE x.diff_seconds < 3600 )
        AND
    
        // Filter based on quote in the last N days
        firstQ.created_date > datetime() - duration({days: 1000})
        AND
    
        // Where there is more than one quote in the chain, otherwise there is nothing to compare against
        length(path)> 1
    
    WITH nodes(path) as nodes
    
    // To iterate over the list in chain order, create an index array [0, 1, 2, ..., length - 2]
    UNWIND range(0,size(nodes)-2) as index
    
    // For each position (index) in the list take the node at that position (current) and the rest
    WITH nodes[index] as current, nodes[index+1..size(nodes)] as rest
    
    // Iterate over the rest keeping current to get all pairs of nodes without repetitions
    UNWIND rest as subsequent
    
    WITH current, subsequent,
    
    // Build up similarity scores for all properties
    // Strings
    apoc.text.levenshteinSimilarity(current.firstname, subsequent.firstname) AS firstname,
    apoc.text.levenshteinSimilarity(current.surname, subsequent.surname) AS surname,
    apoc.text.levenshteinSimilarity(current.postcode, subsequent.postcode) AS postcode,
    
    // Numbers
    (current.passport - subsequent.passport) AS passport_number,
    apoc.text.levenshteinSimilarity(toString(current.passport), toString(subsequent.passport)) AS passport_similarity,
    
    // Dates
    duration.inDays(current.dob, subsequent.dob).days AS dob,
    
    // Location
    toInteger(point.distance(point({longitude: current.longitude, latitude: current.latitude}), point({longitude: subsequent.longitude, latitude: subsequent.latitude}))) AS location
    
    // Create :SIMILARITY Relationship
    MERGE (current)-[:SIMILARITY {
        // Add change string for simplicity
        change: subsequent.change_info,
    
        // Strings
        firstname: firstname,
        surname: surname,
        postcode: postcode,
    
        // Numbers
        passport_number: passport_number,
        passport_similarity: passport_similarity,
    
        // Dates
        dob: dob,
    
        // Location
        location: location,
    
        // Calculated Similarity Score
        similarity_score: (firstname + surname + postcode + passport_similarity ) / 4
    }]->(subsequent)
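To make these scores concrete, here is a plain-Python sketch of the same arithmetic (assuming APOC's usual definition of levenshteinSimilarity: 1 minus the edit distance divided by the length of the longer string), applied to the first two demo quotes:

```python
# Sketch of apoc.text.levenshteinSimilarity and the composite score,
# assuming the definition: 1 - editDistance / length of the longer string.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def levenshtein_similarity(a: str, b: str) -> float:
    if not a and not b:
        return 1.0
    return 1.0 - levenshtein(a, b) / max(len(a), len(b))

# "Micheal" vs "Michael": two substitutions over seven characters
firstname = levenshtein_similarity("Micheal", "Michael")    # ~0.714
surname = levenshtein_similarity("Down", "Down")            # 1.0
postcode = levenshtein_similarity("YO30 7DW", "YO30 7DW")   # 1.0
passport = levenshtein_similarity("584699531", "584699531") # 1.0

# The composite score stored on the :SIMILARITY relationship
similarity_score = (firstname + surname + postcode + passport) / 4
print(round(similarity_score, 3))  # 0.929
```

So even with the "ea"/"ae" typo, the first two quotes score roughly 93% similar.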

    Query ALL Quotes with a Similarity relationship

    Let's bring back all the quotes with their SIMILARITY relationships.

    // Query all quotes with their similarity relationship
    MATCH path=()-[:SIMILARITY]->()
    RETURN path;

    Calculate static Fraud Score

    Based on all the Quote nodes and SIMILARITY relationships in our graph, we can work out a similarity score that indicates how similar quotes are to each other.

    The Fraud Level is based on the logic that:

  • LOW: greater than 70% similar
  • MEDIUM: between 50% and 70% similar
  • HIGH: less than 50% similar

    // Calculate static Fraud Score
    MATCH path=(a)-[r:SIMILARITY]->(b)
    WHERE a.created_date > datetime() - duration({days: 1000})
    RETURN avg(r.similarity_score) AS Similarity,
    CASE
        WHEN COUNT(r) = 0 THEN 'Additional Quote Needs Adding'
        WHEN toInteger(avg(r.similarity_score) * 100) > 70 THEN 'LOW'
        WHEN toInteger(avg(r.similarity_score) * 100) >= 50 THEN 'MEDIUM'
        ELSE 'HIGH'
    END AS Fraud_Level
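As a sanity check on the thresholds, here is the same mapping as a small Python function (interpreting the boundaries so that a score of exactly 50-70% counts as MEDIUM, which the prose above leaves ambiguous):

```python
# Sketch of the fraud-level CASE expression: map an average similarity
# score (0.0-1.0) across a quote chain to a fraud level.

def fraud_level(avg_similarity: float) -> str:
    pct = int(avg_similarity * 100)
    if pct > 70:
        return "LOW"      # quotes are highly similar
    if pct >= 50:
        return "MEDIUM"
    return "HIGH"         # quotes differ substantially

print(fraud_level(0.93))  # LOW
print(fraud_level(0.60))  # MEDIUM
print(fraud_level(0.40))  # HIGH
```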

    Let's put it all together!

    In this example, I will create a new Quote node, score it against the whole quote journey, and return a fraud score. The steps in this process are:

  • Create new Quote node
  • Only create SIMILARITY relationships between the new node and all others in its community. We DO NOT re-score all nodes, just the new one.
  • Recalculate Fraud Score and pass it back to the client.
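The three steps can also be sketched end to end in Python (a rough illustration with made-up helper names; exact field matching stands in for the per-field Levenshtein scores used in the Cypher):

```python
# Rough end-to-end sketch of the realtime scoring flow: compare a new
# quote against the quotes already in its session, average the scores,
# and map the average to a fraud level. Field names mirror the demo's
# Quote properties; exact matching is a crude stand-in for Levenshtein.

def similarity(a, b):
    """Fraction of fields that match exactly between two quotes."""
    fields = ["firstname", "surname", "postcode", "passport"]
    return sum(a[f] == b[f] for f in fields) / len(fields)

def score_new_quote(session, new_quote):
    scores = [similarity(new_quote, q) for q in session]
    avg = sum(scores) / len(scores)
    if avg * 100 > 70:
        return avg, "LOW"
    if avg * 100 >= 50:
        return avg, "MEDIUM"
    return avg, "HIGH"

session = [
    {"firstname": "Micheal", "surname": "Down", "postcode": "YO30 7DW", "passport": 584699531},
    {"firstname": "Michael", "surname": "Down", "postcode": "YO30 7DW", "passport": 584699531},
]
new = {"firstname": "Michael", "surname": "Down", "postcode": "PA62 6AA", "passport": 584699530}
avg, level = score_new_quote(session, new)
print(avg, level)  # 0.375 HIGH
```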

    // Realtime Quote Score
    
    // Get last `Quote` node in quote chain
    MATCH (last:Quote)
    WITH last
    ORDER BY last.created_date DESC
    LIMIT 1
    WITH last
    // Create new quote node
    MERGE (current:Quote {
        change_info: "changed dob",
        created_date: datetime(),
        dob: date("1978-11-30"),
        firstname: "Michael",
        surname: "Down",
        latitude: 56.359258,
        longitude: -5.851487,
        passport: 584699530,
        postcode: "PA62 6AA"
    })
    WITH last, current, duration.inSeconds(datetime(last.created_date), datetime(current.created_date)) AS time
    // Create relationship
    CREATE (last)-[:NEXT_QUOTE {diff_seconds: time.seconds}]->(current)
    
    WITH current
    
    // Find the session chain that ends at the new quote
    MATCH path=(firstQ)-[r:NEXT_QUOTE*0..100]->(current)
    WHERE
    
        // Path termination condition (first)
        (NOT EXISTS{ (firstQ)<-[:NEXT_QUOTE]-() } OR EXISTS{ (firstQ)<-[x:NEXT_QUOTE]-() WHERE x.diff_seconds >= 3600 } )
        AND
    
        // Path termination condition (last)
        (NOT EXISTS{ (current)-[:NEXT_QUOTE]->() } OR EXISTS{ (current)-[x:NEXT_QUOTE]->() WHERE x.diff_seconds >= 3600 } )
        AND
    
        // No gaps condition (if you remove this condition, gaps are allowed and you get spurious longer chains that satisfy the end-of-path conditions but not the max-diff condition)
        ALL(x IN relationships(path) WHERE x.diff_seconds < 3600 )
        AND
    
        // Filter based on quote in the last N days
        firstQ.created_date > datetime() - duration({days: 1000})
        AND
    
        // Where there is more than one quote in the chain, otherwise there is nothing to compare against
        length(path)> 1
    
    // Keep every node in the chain except the new quote itself
    UNWIND nodes(path)[0..-1] AS subsequent
    
    WITH current, subsequent,
    
    // Build up similarity scores for all properties
    // Strings
    apoc.text.levenshteinSimilarity(current.firstname, subsequent.firstname) AS firstname,
    apoc.text.levenshteinSimilarity(current.surname, subsequent.surname) AS surname,
    apoc.text.levenshteinSimilarity(current.postcode, subsequent.postcode) AS postcode,
    
    // Numbers
    (current.passport - subsequent.passport) AS passport_number,
    apoc.text.levenshteinSimilarity(toString(current.passport), toString(subsequent.passport)) AS passport_similarity,
    
    // Dates
    duration.inDays(current.dob, subsequent.dob).days AS dob,
    
    // Location
    toInteger(point.distance(point({longitude: current.longitude, latitude: current.latitude}), point({longitude: subsequent.longitude, latitude: subsequent.latitude}))) AS location
    
    // Create :SIMILARITY Relationship
    CREATE (current)-[:SIMILARITY {
        // Add change string for simplicity
        change: subsequent.change_info,
    
        // Strings
        firstname: firstname,
        surname: surname,
        postcode: postcode,
    
        // Numbers
        passport_number: passport_number,
        passport_similarity: passport_similarity,
    
        // Dates
        dob: dob,
    
        // Location
        location: location,
    
        // Calculated Similarity Score
        similarity_score: (firstname + surname + postcode + passport_similarity ) / 4
    }]->(subsequent)
    
    WITH *
    
    // Quote - 3 - Calculate Fraud Score
    MATCH p=(a)-[r:SIMILARITY]->(b)
    WHERE a.created_date > datetime() - duration({days: 1000})
    RETURN avg(r.similarity_score) AS Similarity,
    CASE
        WHEN COUNT(r) = 0 THEN 'Run Again'
        WHEN toInteger(avg(r.similarity_score) * 100) > 70 THEN 'LOW'
        WHEN toInteger(avg(r.similarity_score) * 100) >= 50 THEN 'MEDIUM'
        ELSE 'HIGH'
    END AS Fraud_Level;