Why is probabilistic matching better than deterministic matching? What are the characteristics that make Initiate’s probabilistic matching superior?
Choosing a Matching System
Though accuracy has long been viewed as the cornerstone of any successful master data management (MDM) or customer data integration (CDI) installation, deciding which method to use to ensure precise automated data matching can be difficult.
In MDM or CDI, inaccuracies are expressed as “false positives” and “false negatives.” False positives occur when the system mistakenly links records that should not be matched (mismatches); false negatives result when the system fails to link two records that should have been matched (missed matches). The rates of both kinds of error can vary greatly depending on the matching method being used.
Understanding Matching Methods
In today’s MDM and CDI industry there are essentially two methods available for matching and retrieving data: probabilistic and deterministic.
Deterministic matching
Deterministic matching systems use a combination of algorithms and business rules to determine when two or more records match (the rule “determines” the result). In a deterministic matching system, for example, one rule might instruct the system to match two records with different names if the Social Security number and address fields coincide. Algorithms catch simple common errors such as typos, phonetic variations and transpositions. The result is an either/or outcome: Either records match the requirements of the business rule or they don’t.
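The Social Security number and address rule described above can be sketched in a few lines of Python. This is only an illustration: the `Record` fields, the `normalize` helper and the sample values are assumptions made for the example, not part of any particular product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    # Hypothetical record layout, chosen for illustration only.
    name: str
    ssn: str
    address: str


def normalize(value: str) -> str:
    """Lowercase the value and keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def rule_ssn_and_address(a: Record, b: Record) -> bool:
    """Match when SSN and address agree, even if the names differ."""
    return (
        normalize(a.ssn) == normalize(b.ssn)
        and normalize(a.address) == normalize(b.address)
    )


a = Record("Chuck L. Jones", "123-45-6789", "12 Oak St.")
b = Record("Charles Jones", "123456789", "12 oak st")
c = Record("Chuck L. Jones", "123-45-6780", "12 Oak St.")

print(rule_ssn_and_address(a, b))  # True: SSN and address agree
print(rule_ssn_and_address(a, c))  # False: one SSN digit differs
```

The second comparison shows the either/or character of the approach: a single mistyped digit turns a likely match into a non-match unless another rule is written to cover that case.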
Deterministic matching systems are generally less accurate than probabilistic matching systems. They are best suited for applications where the number of records is relatively small (fewer than two million), there are few data attributes, and errors carry no great consequence. One such application is mailing list processing: if the system matches a name to an incorrect address, the mailing goes to the wrong person and the company that sent it wastes the postage.
Deterministic systems do allow organizations to leverage their in-house IT staff for system implementation and to develop matching rules. When the number of data attributes and rules required is small, this can make implementation faster and less expensive. However, the more attributes involved and the larger the data sets, the more complex the rules-based matching routines become. Implementation can then involve many hours of development and testing and longer deployment times than probabilistic systems require. Nor do deterministic approaches have a speed advantage over probabilistic methods, which can now perform lookups in real time.
In addition, deterministic systems lack scalability. When databases grow beyond a few hundred thousand records, companies with deterministic matching systems typically require expensive customization and business-rule revision. Adding a single attribute to a data set can double the number of rules the system requires, which is labor intensive and can hurt system scalability and performance. Both factors push the maintenance costs and total cost of ownership of deterministic systems far higher than those of a probabilistic matching solution.
Probabilistic matching
Probabilistic matching uses likelihood ratio theory to assign each comparison outcome to the correct, or at least the more likely, decision. Because this method draws on statistical theory and data analysis, it can establish more accurate links than deterministic systems between records that have more complex typographical errors and error patterns.
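One common way to turn likelihood ratios into a match score is the textbook Fellegi-Sunter style formulation, sketched below. It is offered as an illustration of the general idea, not as the algorithm of any specific vendor. The field names and the m and u probabilities are made-up values, and the fields are assumed to be conditionally independent.

```python
import math
from typing import Mapping

# Hypothetical per-field probabilities (illustrative values only):
#   m = P(field agrees | the two records are a true match)
#   u = P(field agrees | the two records are not a match)
FIELD_PROBS = {
    "ssn": (0.95, 0.001),
    "last_name": (0.90, 0.02),
    "zip": (0.85, 0.05),
}


def field_weight(m: float, u: float, agrees: bool) -> float:
    """Log2 likelihood ratio contributed by one field comparison."""
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))


def match_score(agreement: Mapping[str, bool]) -> float:
    """Sum the field weights, assuming conditional independence."""
    return sum(
        field_weight(*FIELD_PROBS[field], agrees)
        for field, agrees in agreement.items()
    )


all_agree = match_score({"ssn": True, "last_name": True, "zip": True})
ssn_differs = match_score({"ssn": False, "last_name": True, "zip": True})
print(all_agree > ssn_differs)  # True
```

Agreement on a highly discriminating field such as the Social Security number adds a large positive weight, while disagreement subtracts from the score instead of vetoing the match outright, which is what lets a scored approach tolerate isolated errors.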
Typically, probabilistic systems assign a percentage (such as 75 percent) indicating the probability of a match. Because these systems pinpoint variation and nuances to a much finer degree than a deterministic approach, they are better suited for businesses that have complex data systems with multiple databases. Due to the size of these data systems, the potential for duplicates, human error and discrepancies is far greater, making a system designed to establish links between records with complex error patterns much more effective.
In a probabilistic matching system, algorithms weigh the frequency and uniqueness of data values. Regional differences are factored into the equation, so that a match on the name Jose Rodriguez, for example, scores differently in Los Angeles than it would in St. Louis. Probabilistic systems check all possible name alternatives and consider variables such as nicknames, phonetics (Gerald versus Jerold), transposed last and first names, and use of initials (Chuck L. Jones versus Chuck Lawrence Jones or C. Lawrence Jones). If programmed to do so, probabilistic algorithms can also match across international languages and dialects.
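The effect of value frequency can be illustrated with a small sketch. It assumes, as a simplification, that the chance of two unrelated records agreeing on a value is roughly that value's relative frequency in the local population, so that rarer values earn a larger agreement weight. The surname counts are invented for the example.

```python
import math
from collections import Counter


def agreement_weight(value: str, population: Counter) -> float:
    """Weight for agreeing on `value`; rarer values score higher.

    Assumes `value` occurs at least once in `population`.
    """
    total = sum(population.values())
    frequency = population[value] / total
    return math.log2(1 / frequency)


# Made-up surname counts for two regions (illustration only).
los_angeles = Counter({"rodriguez": 400, "smith": 300, "jones": 300})
st_louis = Counter({"rodriguez": 20, "smith": 500, "jones": 480})

la_weight = agreement_weight("rodriguez", los_angeles)
stl_weight = agreement_weight("rodriguez", st_louis)
print(stl_weight > la_weight)  # True
```

With these counts, two records that agree on the surname Rodriguez provide stronger evidence of a match in the region where the name is rare than in the region where it is common.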
Probabilistic systems adapt to the data to which they are being applied and do not require much manual tuning to implement and maintain. Probabilistic algorithms can also easily accommodate a growing number of files and databases with no sacrifice in speed or accuracy.
For situations where data set sizes and numbers of attributes are large, and where high levels of accuracy and low total cost of ownership are important, organizations should select a probabilistic system. When data sets are smaller and have fewer attributes, and inaccuracy doesn’t cost an organization much in terms of risks or consequences, a deterministic approach may be preferable.
Not All Probabilistic Systems Are Created Equal
Not all probabilistic algorithms applied to the same set of circumstances yield results with the same degree of accuracy. Initiate Systems’ probabilistic algorithms have been continuously improved in a quality feedback loop, based on actual field data, for the past 20 years. Over that time the comparison routines have been refined and expanded to deal with new data types and with “noisy,” error-prone data. No other probabilistic algorithm has been field-tested more often, or for longer, than the Initiate Systems algorithm.
Businesses should understand the features and capabilities necessary for accurate and effective automatic probabilistic data matching. The following are some of the most critical:
- Dual thresholds - Most systems based on probabilistic algorithms can be tuned to achieve specific false positive and false negative rates. However, Initiate Systems provides the ability to set multiple thresholds for each search. This “dual threshold” capability is critical for organizations that need to ensure a very high degree of accuracy for certain matches and a lesser degree for others. An example would be a state voter registration system where you would want a small false positive rate in looking for duplicate registrations but also require a low false negative rate when matching against felon lists.
- Real-time response - Initiate Systems can produce results in real time. Organizations that require real-time capability should avoid solutions that offload matching to batch processing with no emphasis on performance. Instead, they should look for a system that can scale to support millions, even billions, of records for on-demand record lookups. The ability to operate in real time is essential both for providing an up-to-the-moment view of customer data across the enterprise and for preventing duplicate records from entering the system on an ongoing basis.
- Adaptability - Organizations concerned with high accuracy should also look for a highly adaptive system – one that adjusts according to the data contained in individual files. With Initiate, customers can adjust the matching algorithm as data quality changes in the underlying file.
- Extensibility - To ensure very high accuracy, organizations must be able to include search parameters specific to their business or industry. Initiate’s probabilistic engine allows easy addition of new data fields without extensive business-rule revision, thus enabling organizations to add new parameters in accordance with changing search requirements. This allows certain types of businesses and organizations (such as law enforcement agencies) that rely on a high degree of detail for identification to add more fields than the default number. For example, police departments could add fields to search by physical characteristic (e.g., tattoos or hair color).
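The dual-threshold idea in the list above can be sketched as follows. This is one plausible reading, not a description of Initiate's implementation: each search carries an upper threshold for automatic linking and a lower threshold for manual review. The `Thresholds` class, the profile names and the numeric scores are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    auto_link: float  # at or above this score: link automatically
    review: float     # at or above this score (below auto_link): manual review


def classify(score: float, t: Thresholds) -> str:
    """Map a match score to one of three outcomes."""
    if score >= t.auto_link:
        return "link"
    if score >= t.review:
        return "review"
    return "no-link"


# Hypothetical profiles for the voter registration example:
# duplicate detection favors few false positives, while the
# felon-list search favors few false negatives.
DUPLICATE_REGISTRATION = Thresholds(auto_link=12.0, review=9.0)
FELON_LIST = Thresholds(auto_link=12.0, review=5.0)

score = 7.0
print(classify(score, DUPLICATE_REGISTRATION))  # no-link
print(classify(score, FELON_LIST))              # review
```

The same score leads to different actions depending on the search, which is how an organization can trade review effort against the risk of missed matches on a per-search basis.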
Additional Considerations
Organizations in the process of evaluating probabilistic matching systems should assess whether the solutions meet all IT requirements. The Initiate solution:
- Uses proven probabilistic algorithms that score and match data across a variety of attributes using likelihood ratio theory for the highest levels of accuracy
- Does not require programming code to be added to source systems or modification or standardization of source data
- Supports core data from sources in original form and maintains complete historical versioning
- Supports multiple matching-score thresholds to help manage quality versus cost tradeoffs
- Includes a complete task model to prioritize and resolve data errors or ambiguous linkages
- Offers role-based security access down to the attribute level
Picking the Right Solution
Although it is less accurate, deterministic matching can be an acceptable lower-cost matching option for organizations that have smaller data sets and fewer attributes and that require less complex rules. Others may favor probabilistic matching for its flexibility and its ability to check potential matches against a larger number of variables and against very large and growing databases. Probabilistic solutions can provide enormous value for businesses in many industries, including healthcare, financial services, hospitality and public safety, and for many applications, including customer relationship management and business intelligence. Businesses that require the highest levels of accuracy in real time, regardless of data volume, can fulfill their needs with a probabilistic solution, enabling them to limit the number of inaccurate matches and reduce the costs and consequences associated with making them.