Documentation needed for normalization, component-wise deduping, fuzzy matching and normalization #659

Open
biegehydra opened this issue Mar 27, 2024 · 0 comments

biegehydra commented Mar 27, 2024

The release v1.1 introduced component-wise deduping, near duplicate hashing, and fuzzy deduping for names. These features are absolutely incredible! Truly, hats off to all the amazing work! However, no real documentation with code examples was released for these features, nor were tests created that show how these new features work and integrate with the rest of the library.

Here are my findings and questions

Findings


Component-Deduping

Header Functions

LIBPOSTAL_EXPORT libpostal_duplicate_options_t libpostal_get_default_duplicate_options(void);
LIBPOSTAL_EXPORT libpostal_duplicate_options_t libpostal_get_duplicate_options_with_languages(size_t num_languages, char **languages);

LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_name_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_street_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_house_number_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_po_box_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_unit_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_floor_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_postal_code_duplicate(char *value1, char *value2, libpostal_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_duplicate_status_t libpostal_is_toponym_duplicate(size_t num_components1, char **labels1, char **values1, size_t num_components2, char **labels2, char **values2, libpostal_duplicate_options_t options);

Presumed Usage

libpostal_duplicate_options_t options = libpostal_get_default_duplicate_options(); // or the with_languages variant
libpostal_duplicate_status_t status = libpostal_is_street_duplicate("S Sumter Street", "South Sumter St", options); // presumably the same pattern for the other components; not sure how the toponym one would work

Questions

It appears that these functions are meant to be used with the results of libpostal_parse_address, but it's not clear how. Some components have corresponding dedupe functions, such as unit, PO box, and postal code. However, some do not have corresponding dedupe functions.

For example, some addresses parse into a HouseNumber and Road while others just parse into a House. Ex:

404 Maple Drive, Bldg A, Suite 100

HouseNumber: 404
Road: maple drive
Unit: bldg a suite 100

404 Maple Dr, Building A, Ste 100

House: 404 maple dr building a
Unit: ste 100

Ignoring the fact that "bldg a" gets put in the unit for one but not the other, it's hard to do component-wise matching when the two parses have different components. My idea was to construct a street line for each address and compare them like so:

// Joins the house number, house, and road components into one street line.
static string? BuildStreetLine(Dictionary<AddressLabel, string> parts)
{
    StringBuilder? sb = null;
    foreach (AddressLabel label in parts.Keys)
    {
        if (label is AddressLabel.HouseNumber or AddressLabel.House or AddressLabel.Road)
        {
            if (sb == null)
            {
                sb = new();
            }
            else
            {
                sb.Append(' ');
            }
            sb.Append(parts[label]);
        }
    }
    return sb?.ToString();
}

Dictionary<AddressLabel, string> parts1 = _libPostalBinding.ParseAddress(address1);
Dictionary<AddressLabel, string> parts2 = _libPostalBinding.ParseAddress(address2);
string? streetLine1 = BuildStreetLine(parts1);
string? streetLine2 = BuildStreetLine(parts2);
bool streetsEqual = _libPostalBinding.IsStreetDuplicate(streetLine1, streetLine2);

This wouldn't work for the previous example because "building a" would be in the street line for one but not the other, but for most cases where you have House vs HouseNumber and Road, this would work, I think. Open to hearing how others do it.
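
For anyone working against the C API directly, the same street-line idea can be sketched in plain C over parallel label/value arrays. This is only an illustration under assumptions: the helper names build_street_line and is_street_line_label are hypothetical (not part of libpostal), and the label strings "house_number", "house", and "road" are assumed to match what the parser emits. The resulting buffer would then be what gets passed to libpostal_is_street_duplicate.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the label is one of the components
   folded into the street line, 0 otherwise. */
int is_street_line_label(const char *label) {
    return strcmp(label, "house_number") == 0 ||
           strcmp(label, "house") == 0 ||
           strcmp(label, "road") == 0;
}

/* Hypothetical helper: joins the values of the matching components with
   single spaces into out, in the order they appear in the arrays.
   Returns the number of characters written (excluding the terminator),
   or -1 if out is too small. */
int build_street_line(size_t n, const char **labels, const char **values,
                      char *out, size_t out_size) {
    size_t used = 0;
    if (out_size == 0) return -1;
    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        if (!is_street_line_label(labels[i])) continue;
        size_t len = strlen(values[i]);
        size_t sep = used > 0 ? 1 : 0;   /* space before every value but the first */
        if (used + sep + len + 1 > out_size) return -1;
        if (sep) out[used++] = ' ';
        memcpy(out + used, values[i], len);
        used += len;
        out[used] = '\0';
    }
    return (int)used;
}
```

With the first example above (house_number "404", road "maple drive", unit "bldg a suite 100"), the helper skips the unit and produces "404 maple drive". It does nothing about "building a" landing in House for the second address, so that mismatch remains.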


Near-Dupe Hashing

Headers

LIBPOSTAL_EXPORT libpostal_near_dupe_hash_options_t libpostal_get_near_dupe_hash_default_options(void);
LIBPOSTAL_EXPORT char **libpostal_near_dupe_name_hashes(char *name, libpostal_normalize_options_t normalize_options, size_t *num_hashes);
LIBPOSTAL_EXPORT char **libpostal_near_dupe_hashes(size_t num_components, char **labels, char **values, libpostal_near_dupe_hash_options_t options, size_t *num_hashes);
LIBPOSTAL_EXPORT char **libpostal_near_dupe_hashes_languages(size_t num_components, char **labels, char **values, libpostal_near_dupe_hash_options_t options, size_t num_languages, char **languages, size_t *num_hashes);

I believe you aren't meant to put the result from parse address into the near dupe hash functions.


Fuzzy duplicates

Headers

LIBPOSTAL_EXPORT libpostal_fuzzy_duplicate_options_t libpostal_get_default_fuzzy_duplicate_options(void);
LIBPOSTAL_EXPORT libpostal_fuzzy_duplicate_options_t libpostal_get_default_fuzzy_duplicate_options_with_languages(size_t num_languages, char **languages);

LIBPOSTAL_EXPORT libpostal_fuzzy_duplicate_status_t libpostal_is_name_duplicate_fuzzy(size_t num_tokens1, char **tokens1, double *token_scores1, size_t num_tokens2, char **tokens2, double *token_scores2, libpostal_fuzzy_duplicate_options_t options);
LIBPOSTAL_EXPORT libpostal_fuzzy_duplicate_status_t libpostal_is_street_duplicate_fuzzy(size_t num_tokens1, char **tokens1, double *token_scores1, size_t num_tokens2, char **tokens2, double *token_scores2, libpostal_fuzzy_duplicate_options_t options);

I have no idea how to use any of these functions. Where do the tokens come from? Where do the token scores come from?


Normalization

Headers

LIBPOSTAL_EXPORT char *libpostal_normalize_string_languages(char *input, uint64_t options, size_t num_languages, char **languages);
LIBPOSTAL_EXPORT char *libpostal_normalize_string(char *input, uint64_t options);
LIBPOSTAL_EXPORT libpostal_normalized_token_t *libpostal_normalized_tokens(char *input, uint64_t string_options, uint64_t token_options, bool whitespace, size_t *n);
LIBPOSTAL_EXPORT libpostal_normalized_token_t *libpostal_normalized_tokens_languages(char *input, uint64_t string_options, uint64_t token_options, bool whitespace, size_t num_languages, char **languages, size_t *n);

These functions seem simple enough and I have tested them. My only question is how to free the char* returned by the normalize functions. Expand and parse have destroy functions for their results, but there is no libpostal_normalize_response_destroy. Am I just supposed to call free when I'm done with it? In that case, could a function be added that does that, so bindings in other programming languages can free it without having to create a DLL just for free?


Here's what I think could be improved

Documentation should be added for each of these features. It should explain the following:

  • How the feature integrates with the rest of the library
  • What the inputs to each function should be
  • What the outputs of the function will be

Here's how we want to use libpostal

Our main use cases are clustering similar addresses, deduping exact/likely match addresses, and near-dupe hashing to make the dedupe process more efficient.
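
To make the "more efficient" part concrete, here is a rough sketch of how I imagine the hashes being used for blocking: each record carries a list of hash keys, and only records that share at least one key are sent on to the pairwise duplicate checks. This assumes the keys are the strings returned by the near-dupe hash functions; the helper name shares_hash_key is hypothetical and the comparison is plain string equality.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 if the two lists of hash keys have at
   least one key in common, 0 otherwise. Records that share no key would
   be skipped instead of being compared pairwise. */
int shares_hash_key(const char **a, size_t na, const char **b, size_t nb) {
    for (size_t i = 0; i < na; i++) {
        for (size_t j = 0; j < nb; j++) {
            if (strcmp(a[i], b[j]) == 0) return 1;
        }
    }
    return 0;
}
```

For a large data set the keys would more likely go into a hash table or a database index so that candidates are found by lookup rather than by comparing every pair of lists, but the candidate test itself is the same.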


My country is US

biegehydra changed the title from "Documentation needed for dedupe and fuzzy matching" to "Documentation needed for normalization, component-wise deduping, fuzzy matching and normalization" on Mar 27, 2024