OSM-POI: Include brand property #25

Open
mattigrthr opened this issue May 31, 2021 · 3 comments · May be fixed by #69
Labels
enhancement New feature or request pipeline/osm-poi Issues related to the osm-poi pipeline pipeline Issues related to pipelines

Comments

@mattigrthr
Contributor

OSM objects may include a brand or operator tag, which you can use to derive the brand of a POI.

The problem is that the values of those tags are spelled inconsistently across entities (e.g., "McDonalds", "Mc Donald's", or "McDonald's").

There is a repo that tries to unify the spelling across OSM: https://github.com/osmlab/name-suggestion-index/.

Alternatively, we could find a clean list of worldwide brand names and use string distance measures to match each POI to a brand.
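To illustrate the string-distance idea, here is a minimal sketch using only Python's standard library (`difflib.SequenceMatcher`). The brand list, the `normalize` and `closest_brand` helpers, and the 0.8 threshold are all assumptions made for the example, not part of the pipeline. The normalization keeps only ASCII letters and digits, so it would need rethinking for names in non-Latin scripts.

```python
import re
from difflib import SequenceMatcher

# Hypothetical clean brand list; a real one would be far larger.
BRANDS = ["McDonald's", "Burger King", "Starbucks"]


def normalize(name):
    """Lowercase the name and keep only ASCII letters and digits."""
    return re.sub(r"[^a-z0-9]", "", name.lower())


def closest_brand(raw, brands=BRANDS, threshold=0.8):
    """Return the most similar brand, or None if nothing reaches the threshold."""
    target = normalize(raw)
    best, best_score = None, 0.0
    for brand in brands:
        # ratio() is a similarity in [0, 1]; 1.0 means identical strings.
        score = SequenceMatcher(None, target, normalize(brand)).ratio()
        if score > best_score:
            best, best_score = brand, score
    return best if best_score >= threshold else None


for spelling in ["McDonalds", "Mc Donald's", "McDonald's"]:
    print(spelling, "->", closest_brand(spelling))
```

After normalization, all three spellings from the example above collapse to the same string, so they resolve to the same brand. The threshold would likely need tuning against real OSM data.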

@mattigrthr mattigrthr added enhancement New feature or request pipeline Issues related to pipelines pipeline/osm-poi Issues related to the osm-poi pipeline labels May 31, 2021
@mattigrthr mattigrthr added this to To do in Kuwala via automation May 31, 2021
@mattigrthr
Contributor Author

mattigrthr commented Dec 22, 2021

@IritaSee, we already parse the operator and brand tags from the OSM objects and include them in the Parquet files.

This is the file where we process the Parquet files after running the osm-parquetizer pipeline, which transforms the pbf files into Parquet files:
https://github.com/kuwala-io/kuwala/blob/master/kuwala/pipelines/osm-poi/src/Processor.py

The idea would be to find the best match of a brand or operator in the name-suggestion-index.

These are the necessary steps:

  • Compile a list of all brand names and operator names from the name-suggestion-index as a CSV with the columns id, display_name, and wiki_data, and store it in the tmp folder with the other OSM files.
    Deadline: 04.01.2022
  • Create a PySpark UDF similar to the ones in osm-poi/src/Processor.py that takes a string and returns the best match against the name-suggestion-index. I recommend using fuzzywuzzy.
    Deadline: 06.01.2022
  • Apply that function to the brand and operator columns to create two new columns, brand_matched and operator_matched. The best place to apply the function is after the data frames for nodes, ways, and relations have been merged into one, which is currently at line 353 in Processor.py.
    Deadline: 07.01.2022
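The three steps above could be sketched roughly as follows. This is not the code in Processor.py: the helper names, the sample entries, and the 0.8 cutoff are hypothetical, and the matcher uses the standard library's `difflib.get_close_matches` as a stand-in for fuzzywuzzy so that the example runs without extra dependencies. The PySpark part is imported lazily and assumes a data frame with `brand` and `operator` columns.

```python
import csv
import io
from difflib import get_close_matches


def write_name_index(entries, handle):
    """Step 1: write the entries as CSV with the columns id, display_name, wiki_data."""
    writer = csv.DictWriter(handle, fieldnames=["id", "display_name", "wiki_data"])
    writer.writeheader()
    writer.writerows(entries)


def load_display_names(handle):
    """Read the display_name column back from the CSV."""
    return [row["display_name"] for row in csv.DictReader(handle)]


def best_match(value, names, cutoff=0.8):
    """Step 2: return the closest display name (case-insensitive), or None."""
    if not value:
        return None
    lowered = {name.lower(): name for name in names}
    hits = get_close_matches(value.lower(), list(lowered), n=1, cutoff=cutoff)
    return lowered[hits[0]] if hits else None


def add_matched_columns(df, names):
    """Step 3: add brand_matched and operator_matched to a PySpark data frame."""
    # Imported lazily so the helpers above work without PySpark installed.
    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    match_udf = F.udf(lambda value: best_match(value, names), StringType())
    return (
        df.withColumn("brand_matched", match_udf(F.col("brand")))
        .withColumn("operator_matched", match_udf(F.col("operator")))
    )


# Placeholder entries; the real ids and Wikidata references would come
# from the name-suggestion-index.
entries = [
    {"id": "example-1", "display_name": "McDonald's", "wiki_data": "Q-EXAMPLE-1"},
    {"id": "example-2", "display_name": "Burger King", "wiki_data": "Q-EXAMPLE-2"},
]
buffer = io.StringIO()  # stands in for a file in the tmp folder
write_name_index(entries, buffer)
buffer.seek(0)
names = load_display_names(buffer)
print(best_match("Mc Donalds", names))
```

A plain Python UDF like this is evaluated row by row, and the list of names is shipped along with the function, so for a large index it may be worth broadcasting the list or restricting the set of rows first.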

@IritaSee IritaSee linked a pull request Jan 3, 2022 that will close this issue
3 tasks
@mattigrthr
Contributor Author

More details about how Spark UDFs are used are in the PR discussion: #69 (comment)

@mattigrthr mattigrthr moved this from To do to In progress in Kuwala Jan 11, 2022
@mattigrthr
Contributor Author

Initial tests with the brand and operator name matching showed that running the matching directly in the OSM-POI pipeline would increase the runtime significantly. Therefore, we have decided to store the consolidated list of brand and operator names in a separate table in Postgres. The list can then be used later in transformation blocks, e.g., on a filtered set of POIs, which drastically reduces the runtime.
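As a rough sketch of why matching on a filtered set is cheaper: every POI has to be compared against the whole name list, so filtering first shrinks the number of comparisons. The example below uses an in-memory SQLite database purely as a runnable stand-in for Postgres. The table names (`brand_names`, `pois`), their columns, the sample rows, and the `match_filtered_pois` helper are all hypothetical.

```python
import sqlite3
from difflib import get_close_matches

# In-memory SQLite as a stand-in for the Postgres database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE brand_names (id TEXT, display_name TEXT)")
conn.executemany(
    "INSERT INTO brand_names VALUES (?, ?)",
    [("example-1", "McDonald's"), ("example-2", "Starbucks")],
)
conn.execute("CREATE TABLE pois (name TEXT, category TEXT, brand TEXT)")
conn.executemany(
    "INSERT INTO pois VALUES (?, ?, ?)",
    [
        ("Cafe A", "cafe", "Starbuck"),
        ("Diner B", "restaurant", "McDonalds"),
        ("Park C", "park", None),
    ],
)


def match_filtered_pois(conn, category, cutoff=0.8):
    """Match brands only for the POIs of one category."""
    names = [row[0] for row in conn.execute("SELECT display_name FROM brand_names")]
    rows = conn.execute(
        "SELECT name, brand FROM pois WHERE category = ? AND brand IS NOT NULL",
        (category,),
    )
    matched = {}
    for name, brand in rows:
        hits = get_close_matches(brand, names, n=1, cutoff=cutoff)
        matched[name] = hits[0] if hits else None
    return matched


print(match_filtered_pois(conn, "restaurant"))
```

Only the rows that survive the filter are compared against the name list, which is the effect described above.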

Since the canvas development currently has a higher priority for the core team, this issue is up for grabs again.

@mattigrthr mattigrthr removed this from In progress in Kuwala Feb 6, 2022