Documentation needed for normalization, component-wise deduping, fuzzy matching and normalization #659

biegehydra · 2024-03-27T18:42:26Z

Release v1.1 introduced component-wise deduping, near-duplicate hashing, and fuzzy deduping for names. These features are absolutely incredible! Truly, hats off to all the amazing work! However, no real documentation with code examples was released for these features, nor were tests added that show how they work and integrate with the rest of the library.

Here are my findings and questions

Findings

Component-Deduping

Header Functions

LIBPOSTAL_EXPORT libpostal_duplicate_options_t libpostal_get_default_duplicate_options(void);
LIBPOSTAL_EXPORT libpostal_duplicate_options_t libpostal_get_duplicate_options_with_languages(size_t num_languages, char **languages);

LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_name_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_street_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_house_number_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_po_box_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_unit_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_floor_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_postal_code_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_toponym_duplicate(size_t num_components1, char **labels1, char **values1, size_t num_components2, char **labels2, char **values2, libpostal_duplicate_options_t options);

Presumed Usage

libpostal_duplicate_options_t defaultOptions = libpostal_get_default_duplicate_options(); // or the _with_languages variant
libpostal_duplicate_status_t status = libpostal_is_street_duplicate("S Sumter Street", "South Sumter St", defaultOptions); // same pattern for the other functions; not sure how the toponym one would work

Questions

It appears that these functions are meant to be used with the results of libpostal_parse_address, but it's not clear how. Some components have corresponding dedupe functions, such as unit, PO box, and postal code. However, some do not have corresponding dedupe functions.

For example, some addresses parse into a HouseNumber and Road while others just parse into a House. Ex:

404 Maple Drive, Bldg A, Suite 100

HouseNumber: 404
Road: maple drive
Unit: bldg a suite 100

404 Maple Dr, Building A, Ste 100

House: 404 maple dr building a
Unit: ste 100

Ignoring the fact that "bldg a" gets put in the unit for one but not the other, it's hard to do component-wise matching when the two results have different components. My idea was to construct a street line for each address and compare them like so:

using System.Collections.Generic;
using System.Text;

// Join the HouseNumber, House, and Road components into a single street line.
// Returns null when the address has none of those components.
string? BuildStreetLine(Dictionary<AddressLabel, string> parts)
{
    StringBuilder? sb = null;
    foreach (AddressLabel label in parts.Keys)
    {
        if (label is AddressLabel.HouseNumber or AddressLabel.House or AddressLabel.Road)
        {
            if (sb == null)
            {
                sb = new StringBuilder();
            }
            else
            {
                sb.Append(' ');
            }
            sb.Append(parts[label]);
        }
    }
    return sb?.ToString();
}

Dictionary<AddressLabel, string> parts1 = _libPostalBinding.ParseAddress(address1);
Dictionary<AddressLabel, string> parts2 = _libPostalBinding.ParseAddress(address2);
string? streetLine1 = BuildStreetLine(parts1);
string? streetLine2 = BuildStreetLine(parts2);
bool streetsEqual = _libPostalBinding.IsStreetDuplicate(streetLine1, streetLine2);

This wouldn't work for the previous example because "building a" would be in the street line for one but not the other, but for most cases where you have House vs HouseNumber and Road, this would work, I think. Open to hearing how others do it.
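For anyone working against the C API directly rather than through a binding, the same street-line idea can be sketched as below. This is only a rough sketch under my own assumptions: build_street_line and is_street_label are hypothetical helpers, not libpostal functions, and I'm assuming the parser's labels arrive as the strings "house_number", "house", and "road" in parallel arrays of labels and values.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns 1 if the label is one of the components that make up a street line.
 * Assumption: labels use the parser's string names shown here. */
int is_street_label(const char *label) {
    return strcmp(label, "house_number") == 0 ||
           strcmp(label, "house") == 0 ||
           strcmp(label, "road") == 0;
}

/* Hypothetical helper (not part of libpostal): copies the street-related
 * values, in the order given, into `out`, separated by single spaces.
 * Returns the number of characters written, or -1 if `out` is too small. */
int build_street_line(size_t n, const char *const *labels,
                      const char *const *values,
                      char *out, size_t out_size) {
    size_t used = 0;
    assert(out != NULL && out_size > 0);
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (!is_street_label(labels[i])) continue;
        size_t len = strlen(values[i]);
        size_t sep = used > 0 ? 1 : 0;   /* space before every value but the first */
        if (used + sep + len + 1 > out_size) return -1;
        if (sep) out[used++] = ' ';
        memcpy(out + used, values[i], len);
        used += len;
        out[used] = '\0';
    }
    return (int)used;
}
```

With the first example above (house_number "404", road "maple drive", unit "bldg a suite 100"), this yields "404 maple drive"; the two resulting street lines could then be passed to libpostal_is_street_duplicate. It has the same limitation as the C# version when "building a" lands in House for one address and in Unit for the other.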

Near-Dupe Hashing

Headers

LIBPOSTAL_EXPORT libpostal_near_dupe_hash_options_t libpostal_get_near_dupe_hash_default_options(void);
LIBPOSTAL_EXPORT char **libpostal_near_dupe_name_hashes(char *name, libpostal_normalize_options_t normalize_options, size_t *num_hashes);
LIBPOSTAL_EXPORT char **libpostal_near_dupe_hashes(size_t num_components, char **labels, char **values, libpostal_near_dupe_hash_options_t options, size_t *num_hashes);
LIBPOSTAL_EXPORT char **libpostal_near_dupe_hashes_languages(size_t num_components, char **labels, char **values, libpostal_near_dupe_hash_options_t options, size_t num_languages, char **languages, size_t *num_hashes);

I believe you aren't meant to put the result from parse address into the near dupe hash functions.

Fuzzy duplicates

Headers

LIBPOSTAL_EXPORT libpostal_fuzzy_duplicate_options_t libpostal_get_default_fuzzy_duplicate_options(void);
LIBPOSTAL_EXPORT libpostal_fuzzy_duplicate_options_t libpostal_get_default_fuzzy_duplicate_options_with_languages(size_t num_languages, char **languages);

LIBPOSTAL_EXPORT libpostal_fuzzy_duplicate_status_t libpostal_is_name_duplicate_fuzzy(size_t num_tokens1, char **tokens1, double *token_scores1, size_t num_tokens2, char **tokens2, double *token_scores2, libpostal_fuzzy_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_fuzzy_duplicate_status_t libpostal_is_street_duplicate_fuzzy(size_t num_tokens1, char **tokens1, double *token_scores1, size_t num_tokens2, char **tokens2, double *token_scores2, libpostal_fuzzy_duplicate_options_t options);

I have no idea how to use any of these functions. Where do the tokens come from? Where do the token scores come from?

Normalization

Headers

LIBPOSTAL_EXPORT char *libpostal_normalize_string_languages(char *input, uint64_t options, size_t num_languages, char **languages);
LIBPOSTAL_EXPORT char *libpostal_normalize_string(char *input, uint64_t options);
LIBPOSTAL_EXPORT libpostal_normalized_token_t *libpostal_normalized_tokens(char *input, uint64_t string_options, uint64_t token_options, bool whitespace, size_t *n);
LIBPOSTAL_EXPORT libpostal_normalized_token_t *libpostal_normalized_tokens_languages(char *input, uint64_t string_options, uint64_t token_options, bool whitespace, size_t num_languages, char **languages, size_t *n);

These functions seem simple enough, and I have tested them. My only question is how to free the char * returned by libpostal_normalize_string and libpostal_normalize_string_languages. Expand and parse have destroy functions for their results, but there is no libpostal_normalize_response_destroy. Am I just supposed to call free when I'm done with it? If so, could a function be added that does that, so bindings in other programming languages can free it without having to create a DLL just for free?
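To make the request concrete, something like the following is roughly what I have in mind. It is a minimal sketch, not an existing API: example_string_destroy is a name I made up, and I'm assuming the normalized string is allocated with the C runtime's malloc family, which I haven't confirmed from the source.

```c
#include <stdlib.h>

/* Hypothetical export (not in libpostal.h): releases a string that the
 * library returned. Assumes the string came from malloc/calloc/realloc
 * inside the same library build. Passing NULL is a no-op, as with free(). */
void example_string_destroy(char *str) {
    free(str);
}
```

The point of having the library export this, rather than each binding calling free itself, is that the memory is then released by the same C runtime that allocated it, which matters on platforms where the binding and the library may link against different runtimes.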

Here's what I think could be improved

Documentation should be added for each of these features. It should explain the following:

How the feature integrates with the rest of the library
What the inputs to each function should be
What the outputs of the function will be

Here's how we want to use libpostal

Our main use cases are clustering similar addresses, deduping exact/likely match addresses, and near-dupe hashing to make the dedupe process more efficient.

My country is US

biegehydra changed the title ~~Documentation needed for dedupe and fuzzy matching~~ Documentation needed for normalization, component-wise deduping, fuzzy matching and normalization Mar 27, 2024
