Matching PlugIn for OpenJUMP

Matching PlugIn for OpenJUMP Version 0.5.6 (2012-06-29) by Michaël Michaud

1 History

Version 0.5.6

• Small improvements in the way the dialog box responds to user actions before final validation.

Version 0.5.5

• Bug fix: a NullPointerException was thrown with geometry + attribute + cardinality constraints, due to a severe bug in how the attribute matching score was added to the previous geometry score. The defined order of the MatchMap structure no longer respected its contract.

• Improvement: when setting a limit of "n" for the Levenshtein or Damerau-Levenshtein distance, a distance of n is now accepted and a distance of n+1 is excluded.

Version 0.5.4

• Add X_MIN_SCORE attribute to reference dataset and fix link layer name.

2 Presentation

The Matching PlugIn makes it possible to match geographic features. This operation is also known as a spatial join or an attribute join, but the Matching PlugIn can also perform fuzzy matching. Its main capabilities are:

• finding features of a source layer matching features of a target layer, based on spatial and/or attribute criteria.

• finding duplicates in a single layer, based on spatial and/or attribute criteria.

• spatial criteria include not only several kinds of equality, more or less strict, but also predicates (intersects, overlaps...) and distances (minimum distance, Hausdorff distance...).

• attribute criteria also offer many options, with different kinds of equality (case insensitive, accent insensitive), distances (Levenshtein, Damerau-Levenshtein) and, for the most difficult cases, the ability to pre-process strings with regular expressions.

• cardinality constraints can be added if, for example, matching several source features with a single target feature is not desired.

• transferring attributes from the source layer to the target layer. In case of multiple matches for a single target, an aggregation function can be chosen.

Known limitations:

• Aggregation functions are chosen per attribute type, not per attribute.

• Attribute matching only works for character strings at the moment.

Michaël Michaud on 2012-06-29 ([email protected])


3 Input Options

The Matching PlugIn processes two layers:

The source layer (or candidate) is the layer compared to the target layer. Attributes can be transferred from this layer to the target layer, not the other way round.

The target layer (or reference) is the reference layer. It can receive attributes from the source layer.

It is possible to match a layer with itself, which can help to find duplicates using a fuzzy comparator. In this case, a feature cannot match itself, even though comparing a feature with itself would generally fulfill all the matching criteria.

4 Output Options

Output options can be chosen among the following:

• Create a layer with matching source features

• Create a layer with non-matching source features

• Create a layer showing the links between source and target features

This last option creates a layer containing lines linking source features with matched target features. The layer also contains three attributes:

• the source id (OpenJUMP volatile id)

• the target id (OpenJUMP volatile id)

• the score of the match (1 for perfect, 0 for null)

Duplication of target layer and attribute transfer

The target (or reference) layer is always duplicated. For example, if your target layer is named MyLayer, you'll get a new layer called MyLayer (2) with 2 attributes added by default:

• x_count (number of source features matching this particular target)

• x_max_score (score of the best match for this target feature)

Transferring attributes of the matched source features to the target features is also an output option; it is detailed in chapter 9.

5 Geometry matchers

5.1 Match All Matcher

Deactivates the geometry matcher. This is used to do a simple attribute join. Example: transfer attributes from Layer A features to Layer B features when both features have the same zip code. There is no spatial criterion; the only criterion is a semantic one, the zip code.

5.2 Equals Exact (Geom3d) Matcher

Compares each coordinate of both geometries. Coordinates must be the same, in the same order, and must have the same z values (two NaN z values are considered equal).

Examples: only the lines with the same 3D coordinate sequence match (in green).

Use case: Quality assurance. Check that 3D geometries are exactly the same before and after applying a process.
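The comparison rule above can be sketched in a few lines of Python. This is only an illustration of the rule as described (same coordinates, same order, NaN z values treated as equal), not the plugin's code; the function names `same_z` and `equals_exact_3d` are hypothetical, and geometries are reduced to plain lists of (x, y, z) tuples.

```python
import math

def same_z(z1, z2):
    """Two z values are equal if both are NaN or if they compare equal."""
    if math.isnan(z1) and math.isnan(z2):
        return True
    return z1 == z2

def equals_exact_3d(coords_a, coords_b):
    """Compare two coordinate sequences point by point, in order."""
    if len(coords_a) != len(coords_b):
        return False
    return all(
        xa == xb and ya == yb and same_z(za, zb)
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )

nan = float("nan")
line = [(0.0, 0.0, nan), (10.0, 0.0, 5.0)]
same_line = [(0.0, 0.0, nan), (10.0, 0.0, 5.0)]
reversed_line = list(reversed(line))          # same points, opposite direction
other_z = [(0.0, 0.0, nan), (10.0, 0.0, 6.0)]  # same x/y, different z

print(equals_exact_3d(line, same_line))
print(equals_exact_3d(line, reversed_line))
print(equals_exact_3d(line, other_z))
```

The explicit NaN test is needed because NaN never compares equal to itself with `==`.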

5.3 Equals Normalized (Geom3d) Matcher

Compares each coordinate of both geometries after geometry normalization. Normalization describes the sequence of coordinates making up the geometry in a unique way, by fixing the starting point and the orientation according to a deterministic algorithm. With this matcher, two polygons differing only by their orientation, their starting point, or the order of their holes will match. As in the previous case, matching points must have the same z, all NaN z values being considered equal.

Examples: with the same data as in the previous case, the fourth line also matches, because normalization hides the difference in orientation.

Use case: Quality assurance. Check that 3D geometries are exactly the same before and after applying a process where line or polygon orientation may have changed.
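To make the idea of normalization concrete, here is a minimal Python sketch. It uses a simplified, hypothetical scheme (pick the orientation and starting point whose coordinate sequence sorts first); the actual algorithm used by the plugin is not described in this document and may differ. Coordinates are 2D tuples for brevity, and rings are assumed to have no repeated vertex other than the closing point.

```python
def normalize_line(coords):
    """Pick the orientation whose coordinate sequence sorts first."""
    forward = list(coords)
    backward = forward[::-1]
    return min(forward, backward)

def normalize_ring(coords):
    """Normalize a closed ring (first point repeated at the end)."""
    ring = list(coords[:-1])                  # drop the closing point
    candidates = []
    for oriented in (ring, ring[::-1]):       # both orientations
        start = oriented.index(min(oriented)) # start at the smallest point
        candidates.append(oriented[start:] + oriented[:start])
    best = min(candidates)
    return best + [best[0]]                   # close the ring again

square = [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)]
same_square = [(1, 1), (1, 0), (0, 0), (0, 1), (1, 1)]  # other start, other orientation

print(normalize_ring(square) == normalize_ring(same_square))
print(normalize_line([(0, 0), (5, 5)]) == normalize_line([(5, 5), (0, 0)]))
```

Once both geometries are normalized, the comparison itself is the same point-by-point test as for the exact matcher.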

5.4 Equals Exact (Geom2d) Matcher

Same as EqualsExactGeom3dMatcher, but ignoring z values.

Examples: in this case, direction matters, but the z coordinate does not.

Use case: Quality assurance. Check that 2D geometries are exactly the same before and after applying a process.

5.5 Equals Normalized (Geom2d) Matcher

Same as EqualsNormalizedGeom3dMatcher, but ignoring z values.

Examples: with EqualsNormalizedGeom2d, all the lines match except the last one, because the source feature has three points while the target has only two. They do not match even though, in this very particular case, they describe mathematically the same infinite set of points.

Use case: Quality assurance. Check that 2D geometries are exactly the same before and after applying a process where line or polygon orientation may have changed.

5.6 Equals Topological Matcher

Topological equality is a bit more tolerant than EqualsNormalizedGeom2d. It can match geometries having a different number of points if the additional points lie exactly on the other geometry (which is only possible in rare cases, with horizontal or vertical lines, or when there are identical consecutive points).

Examples: this comparator is more tolerant than the previous ones, as it recognizes that the two-segment red line topologically equals the one-segment green line.

Warning: in this case, the lines are topologically equal because the point inserted in the red line lies EXACTLY on the green line. If the lines were not horizontal, the added vertex would have had almost no chance to lie exactly on the green line, because of the limited precision of floating point coordinates.

Use case: Quality assurance. Check that 2D geometries have not been "moved". This comparison matches equal geometries, but also geometries with 0-length segments and cleaned geometries where 0-length segments have been removed.

5.7 Equals With Coordinate Tolerance Matcher

This matcher is the same as EqualsNormalizedGeom2dMatcher, but introduces a distance tolerance on each coordinate. This is useful to compare features coming from the same source, when one of them has been through a process or a software which has slightly modified its coordinates. This may happen when transforming a geometry from a coordinate system A to a coordinate system B and then back to A, or when importing features from a floating-precision based shapefile into a fixed-precision based software.

Parameter: tolerance distance

Example (tolerance = 0.01): POINT(10 10) matches POINT(9.995 10.005)

Use case: This matcher can be useful to check a dataset's integrity after a double coordinate transformation [1], or after a format change (for example, converting a dataset from shapefile to GeoConcept or to mif-mid will probably change the coordinate values slightly, because the latter formats are based on fixed-point or integer geometry models instead of floating point for the shapefile).

[1] In a floating point geometry model, a coordinate transformation from CRS A to CRS B and back will hardly produce exactly the same coordinates as the original ones.
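The tolerance test and the floating point effect mentioned in the note can be sketched as follows. This is an assumption-laden illustration: the tolerance is applied to each coordinate separately (as the description suggests), geometries are plain lists of (x, y) tuples assumed to be already normalized, and `equals_with_tolerance` is a hypothetical name.

```python
def equals_with_tolerance(coords_a, coords_b, tolerance):
    """Same number of points, each x and y differing by at most the tolerance."""
    if len(coords_a) != len(coords_b):
        return False
    return all(
        abs(xa - xb) <= tolerance and abs(ya - yb) <= tolerance
        for (xa, ya), (xb, yb) in zip(coords_a, coords_b)
    )

# The example from the text: POINT(10 10) against POINT(9.995 10.005).
print(equals_with_tolerance([(10.0, 10.0)], [(9.995, 10.005)], 0.01))

# A toy "round trip": shift a coordinate and shift it back.
x = 0.1
back = (x + 0.2) - 0.2
print(back == x)                    # exact comparison is fragile
print(abs(back - x) <= 1e-9)        # comparison with a tolerance is robust
```

The toy round trip stands in for a real coordinate transformation, which involves far more arithmetic and therefore more rounding.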


5.8 IsWithin Matcher

Geometry A matches geometry B iff A.isWithin(B).

Use case: Find every house within a parcel, every school within a district... You can then aggregate attributes of the small features located inside the big one onto the latter (see also the aggregate extension).

5.9 Overlaps Matcher

Geometry A matches geometry B iff A.overlaps(B) AND area of A.intersection(B) / area of B is more than the user-defined threshold (in percent). This matcher is not symmetric (the intersection of A and B may represent more than 50% of the area of A and less than 50% of the area of B). This matcher makes sense for areal geometries only.

Parameter: overlapping threshold (in percent)

Use case: Less restrictive than IsWithin but more restrictive than Intersects, Overlaps can be used to match polygons of two datasets where geometries differ, when a source feature is made of several target features or when the target feature extends beyond the source features. For example, match every building of layer A overlapping more than 50% of a building in layer B.
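The asymmetry of the overlap ratio is easy to see with axis-aligned rectangles. The sketch below is not the plugin's implementation: rectangles are hypothetical (xmin, ymin, xmax, ymax) tuples standing in for polygons, and `overlaps_matcher` only reproduces the ratio test described above (intersection area divided by the area of B).

```python
def area(rect):
    """Area of an (xmin, ymin, xmax, ymax) rectangle, 0 if it is empty."""
    xmin, ymin, xmax, ymax = rect
    return max(0.0, xmax - xmin) * max(0.0, ymax - ymin)

def intersection_area(a, b):
    """Area shared by two axis-aligned rectangles."""
    shared = (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))
    return area(shared)

def overlaps_matcher(a, b, threshold_percent):
    """A matches B if their intersection covers more than the threshold of B."""
    return 100.0 * intersection_area(a, b) / area(b) > threshold_percent

small = (0.0, 0.0, 2.0, 2.0)     # area 4
large = (1.0, 0.0, 11.0, 2.0)    # area 20, shares an area of 2 with `small`

print(overlaps_matcher(large, small, 40))   # 2 / 4 = 50% of `small` is covered
print(overlaps_matcher(small, large, 40))   # 2 / 20 = 10% of `large` is covered
```

Swapping the roles of the two rectangles changes the denominator, hence the different results for the same pair.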

5.10 OverlappedBy Matcher

Geometry A matches geometry B iff A.overlaps(B) AND area of A.intersection(B) / area of A is more than the overlapping threshold (in percent). This matcher is not symmetric (the intersection of A and B may represent more than 50% of the area of A and less than 50% of the area of B). This matcher makes sense for areal geometries only.

Parameter: overlapping threshold (in percent)

Use case: Less restrictive than IsWithin but more restrictive than Intersects, OverlappedBy can be used to match polygons of two datasets where geometries differ and where several source features may match the same target feature. For example, match every building of layer A lying within a building of layer B or having more than 50% of its area within a building of B.

5.11 Intersects Matcher

Geometry A matches geometry B iff A.intersects(B).

Use case: The Intersects matcher may be useful to separate isolated features from aggregated features in a single layer. If you match a layer with itself, features which do not match are isolated, while matching features are connected.

5.12 Intersects (0D intersection) Matcher

Geometry A matches geometry B iff A.intersects(B) and their intersection is punctual (a finite number of points).

Use case: In some situations, it may be useful to differentiate polygons touching each other at a single point (and consider them as distinct polygons) from polygons having a common edge or a common area (and consider merging them).

5.13 Intersects (1D intersection) Matcher

Geometry A matches geometry B iff A.intersects(B) and their intersection is 1D (a line or a set of lines).

Use case: In a quality assurance process, you may want to find overlapping lines, which are generally not wanted. You may also have a kind of coverage where some parts are covered by polygons (e.g. woods) and some other parts are just crossed by edges. In this case, you may want to find edges that penetrate inside the wood instead of stopping on its boundary, which is generally not wanted.

5.14 Intersects (2D intersection) Matcher

Geometry A matches geometry B iff A.intersects(B) and their intersection has a non-null area.

Use case: In a coverage, polygons can touch each other at a point or along a line, but must not share a surface. This matcher can help you find overlapping polygons which should not overlap.
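The difference between the 0D, 1D and 2D variants of the Intersects matcher can be illustrated with axis-aligned rectangles, for which the dimension of the intersection is easy to compute. This is a simplified sketch with hypothetical names; real geometries need a geometry library, and the rectangles are assumed to be non-degenerate (xmin < xmax, ymin < ymax).

```python
def intersection_dimension(a, b):
    """Dimension of the intersection of two (xmin, ymin, xmax, ymax) rectangles.

    Returns 2 for a shared area, 1 for a shared edge portion,
    0 for a single shared point, and None if the rectangles are disjoint.
    """
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    if width < 0 or height < 0:
        return None
    if width > 0 and height > 0:
        return 2
    if width > 0 or height > 0:
        return 1
    return 0

parcel = (0, 0, 2, 2)
print(intersection_dimension(parcel, (2, 2, 4, 4)))   # touches at one corner
print(intersection_dimension(parcel, (2, 0, 4, 2)))   # shares an edge
print(intersection_dimension(parcel, (1, 1, 3, 3)))   # shares an area
print(intersection_dimension(parcel, (5, 5, 6, 6)))   # disjoint
```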

5.15 Centroid Distance Matcher

Checks that the distance between the features' centroids does not exceed the threshold. The centroid is a point close to the center of gravity; although it is not guaranteed to lie inside the polygon, it is quite stable under small changes and across representations at different scales. This matcher is suited to matching features at different scales.

Parameter: tolerance distance

Examples:

- In the first schema of the first line, the brown shape matches the green shape despite important shape differences, because these differences have a small impact on the centroid.

- In the first schema of the third line, the left brown rectangles do not match the green one because each of their centroids is too far from the green one. Here, setting the cardinality to N:M does not help, as source features are processed one after the other. Instead, you can use the overlap matcher, or try to match the other way round (see the right case on the third line).

- In the second schema of the third line, the large brown rectangle matches the green rectangles despite an important centroid distance, because the cardinality has been set to N:M, so that after trying to match each of the green rectangles, the matcher tries to match the union of the green rectangles considered as a whole.

Use cases: If features are represented as closed linestrings on one layer and as points on the other layer (e.g. ruins), it may happen that points do not intersect linestrings, but that centroids are very close to each other. In this case, the centroid distance matcher may be a good choice. This matcher can also match geometries with quite different boundaries (one described with all its small islands and fjords, the other described in a very schematic way). Both representations should end up with quite similar centroids.
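As an illustration of the centroid criterion, the sketch below computes the area-weighted centroid of a simple polygon with the shoelace formula and compares two centroids against a tolerance. It is a minimal example under stated assumptions: rings are closed lists of (x, y) tuples without holes, and the function names are hypothetical, not the plugin's API.

```python
import math

def polygon_centroid(ring):
    """Area-weighted centroid of a simple closed ring (shoelace formula)."""
    twice_area = 0.0
    cx = cy = 0.0
    for (x1, y1), (x2, y2) in zip(ring, ring[1:]):
        cross = x1 * y2 - x2 * y1
        twice_area += cross
        cx += (x1 + x2) * cross
        cy += (y1 + y2) * cross
    # Cx = sum / (6 * area) and area = twice_area / 2
    return cx / (3.0 * twice_area), cy / (3.0 * twice_area)

def centroid_distance_matches(ring_a, ring_b, tolerance):
    """True if the distance between centroids does not exceed the tolerance."""
    ax, ay = polygon_centroid(ring_a)
    bx, by = polygon_centroid(ring_b)
    return math.hypot(ax - bx, ay - by) <= tolerance

large_scale = [(0, 0), (4, 0), (4, 4), (0, 4), (0, 0)]
small_scale = [(1, 1), (3, 1), (3, 3.2), (1, 3.2), (1, 1)]   # smaller, slightly shifted

print(polygon_centroid(large_scale))
print(centroid_distance_matches(large_scale, small_scale, 0.5))
print(centroid_distance_matches(large_scale, small_scale, 0.05))
```

The two shapes have very different boundaries, yet their centroids are only about 0.1 apart, which is why the first test succeeds and the second, stricter one fails.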

5.16 Minimum Distance Matcher

Matches features if their minimum distance is less than the threshold. Note that for point layers, this matcher is the same as the previous one.

Example: every brown shape matches, as the minimum distance matcher matches every source geometry whose minimum distance to a target geometry is less than the threshold. Note that this matcher is not well suited to polygon or linestring matching, but it is a good choice for point matching.

Use cases: The minimum distance matcher is a natural choice to match small features (e.g. points). The closer the features are, the more likely they represent the same geographic entity.

5.17 Hausdorff Distance Matcher

HausdorffDistanceMatcher matches two geometries if the maximum distance between a point of one geometry and the nearest point of the other geometry is less than the user-defined threshold. This also means that geometry A is entirely included in a buffer of geometry B whose radius is the threshold, and vice versa.

Note: the Hausdorff distance is approximated by computing the distance between each pair of coordinates. To get closer to the actual Hausdorff distance, geometries are densified so that the longest segment cannot be more than 1.5 times the tolerance distance.

Examples: the Hausdorff distance measures the maximum distance between a point of one feature and the nearest point of the other feature.

- It is not well adapted to the case of the first figure, as the brown shape has some asperities which are quite far from the green shape. In this particular case, a matcher based on centroid or on overlapping is more appropriate.

- Hausdorff distance is often used for linear matching, but in the cases presented hereafter, one should use the Semi-Hausdorff matcher.

- On the third line, the first case (two brown rectangles for one green one) has not matched, because source features are examined one after the other, but the second case has matched, because the brown feature matched the union of the green ones considered as a whole.

Use case: Hausdorff distance can match geometries which have been slightly modified. For example, after a noding process, geometries are generally no longer the same as the original ones. Inserting points using a floating point geometry model warps the original geometries in such a way that even EqualsTopological will not recognize them. Instead, you can use HausdorffDistanceMatcher with a very small tolerance.
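The approximation described in the note (densify, then compare coordinates pairwise) can be sketched as follows. This is an illustrative reading of that note, not the plugin's code: geometries are polylines given as lists of (x, y) tuples, and `densify`, `directed_distance` and `hausdorff_matches` are hypothetical names.

```python
import math

def densify(coords, max_segment):
    """Insert points so that no segment is longer than max_segment."""
    result = [coords[0]]
    for (x1, y1), (x2, y2) in zip(coords, coords[1:]):
        pieces = max(1, math.ceil(math.hypot(x2 - x1, y2 - y1) / max_segment))
        for i in range(1, pieces + 1):
            t = i / pieces
            result.append((x1 + t * (x2 - x1), y1 + t * (y2 - y1)))
    return result

def directed_distance(points_a, points_b):
    """Largest distance from a point of A to its nearest point of B."""
    return max(
        min(math.hypot(ax - bx, ay - by) for bx, by in points_b)
        for ax, ay in points_a
    )

def hausdorff_matches(coords_a, coords_b, tolerance):
    """Discrete Hausdorff test on geometries densified to 1.5 x tolerance."""
    a = densify(coords_a, 1.5 * tolerance)
    b = densify(coords_b, 1.5 * tolerance)
    return max(directed_distance(a, b), directed_distance(b, a)) < tolerance

original = [(0.0, 0.0), (10.0, 0.0)]
modified = [(0.0, 0.2), (5.0, 0.3), (10.0, 0.2)]   # slightly displaced copy
far_away = [(0.0, 3.0), (10.0, 3.0)]

print(hausdorff_matches(original, modified, 1.0))
print(hausdorff_matches(original, far_away, 1.0))
```

Because only coordinates are compared, the result depends on where the densified points fall, which is why the text calls it an approximation.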

5.18 Semi-Hausdorff Distance Matcher

SemiHausdorffDistanceMatcher matches A and B if the maximum distance from geometry A to geometry B is less than the distance parameter. Using SemiHausdorffDistanceMatcher is equivalent to checking that A is entirely included in a buffer of the maximum distance around B.

Note: the Hausdorff distance is approximated by computing the distance between each pair of coordinates. To get closer to the actual Hausdorff distance, geometries are densified so that the longest segment cannot be more than 1.5 times the tolerance distance.

Example: the Semi-Hausdorff distance is well adapted to network matching, and succeeds in both cases: one source feature corresponding to four green targets, and six source features corresponding to one green target.

Use case: The Semi-Hausdorff distance is the best choice to match networks. It takes each single linestring in the source dataset and tries to match it with the union of all target features. As it computes a semi-Hausdorff distance, it does not matter if the target features extend beyond the source one; it just checks that the source feature has no point too far from the target union. In a second loop, it assigns an individual score to each pair of features by comparing how their buffers overlap. This second process helps to determine how to transfer attributes.
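The one-directional nature of the semi-Hausdorff distance can be shown with a short source line lying along a long target line. The sketch below deliberately swaps in a different technique from the coordinate-pair approximation mentioned in the note: it measures the distance from each vertex of A to the segments of B, which is still an approximation since only the vertices of A are tested. All names are hypothetical.

```python
import math

def point_segment_distance(p, a, b):
    """Distance from point p to the segment [a, b]."""
    px, py = p
    ax, ay = a
    bx, by = b
    dx, dy = bx - ax, by - ay
    length_sq = dx * dx + dy * dy
    if length_sq == 0.0:                       # degenerate segment
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p on the segment, clamped to [0, 1]
    t = max(0.0, min(1.0, ((px - ax) * dx + (py - ay) * dy) / length_sq))
    return math.hypot(px - (ax + t * dx), py - (ay + t * dy))

def semi_hausdorff(coords_a, coords_b):
    """Largest distance from a vertex of A to the polyline B."""
    return max(
        min(point_segment_distance(p, s, e) for s, e in zip(coords_b, coords_b[1:]))
        for p in coords_a
    )

source = [(2.0, 0.1), (4.0, 0.1)]     # short road segment
target = [(0.0, 0.0), (10.0, 0.0)]    # long road it runs along

print(semi_hausdorff(source, target))  # small: source stays close to target
print(semi_hausdorff(target, source))  # large: target extends far beyond source
```

With a tolerance of 1, the source matches the target in the first direction only, which is the behaviour wanted for network matching.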

5.19 ShapeMatcher

Experimental.

6 Matching Cardinality

The default matching cardinality is N:M (actually 0...N : 0...M, as it is always possible that a feature has no match at all). This means that

• several source features can match one target feature, and

• one source feature can match several target features.

This cardinality can be restricted to 1:N, N:1 or 1:1 with two check boxes located beside the layer combo boxes:

• one single source feature for one target feature

• one single target feature for one source feature

If both boxes are checked, each source feature matches at most one target feature, and vice versa.


6.1 Simple N:M matching

The main matching algorithm loops through each source feature and tries to match it with each target feature (this operation is greatly optimized by the use of indexes). This process can produce several matches per source feature and several matches per target feature. For example, we want to get street names on houses located less than 10 meters from the street. Most streets will match many houses, most houses will match a single street, and some houses will match two or three streets. This is an N:M relation, and it may be interesting to limit it to 1:N (one single source) in order to report the closest street name only.
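The street and house example can be sketched as a small post-processing step on a list of matches. This is only one plausible way to enforce "one single source per target" (keep the best-scoring match); how the plugin actually resolves the constraint is not described here, and the data and the function name `best_source_per_target` are made up for the illustration.

```python
def best_source_per_target(matches):
    """Keep, for each target, only the match with the highest score.

    `matches` is a list of (source_id, target_id, score) tuples.
    """
    best = {}
    for source_id, target_id, score in matches:
        current = best.get(target_id)
        if current is None or score > current[2]:
            best[target_id] = (source_id, target_id, score)
    return sorted(best.values(), key=lambda match: match[1])

# Streets are the source features, houses the target features.
matches = [
    ("Main St", "house 1", 0.9),
    ("Oak St", "house 1", 0.4),    # house 1 is also near Oak St
    ("Oak St", "house 2", 0.7),
]

for match in best_source_per_target(matches):
    print(match)
```

Each street can still appear for several houses, but each house now keeps a single street.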

6.2 Advanced N:M matching

There is also an (automatic) N:M matching mechanism which tries to match one source feature to several target features considered as a whole. The best example is network matching. To test whether a source road matches a target road, we generally use a Hausdorff distance or a Semi-Hausdorff distance metric (see above for definitions). But if we have one source road for several target roads, we quickly notice that no match happens, because the source road is always too long when compared with each of the target candidates. In this case, the source road is compared to all candidates considered as a whole. If the source road is close enough to this target "superroad" (in this case, the semi-Hausdorff distance is used to decide), another comparison is done on each source/target pair to give a score to each individual match using another matcher (this second processing is always performed with an overlapping matcher applied to geometry buffers).

Playing with cardinality options (figure panels):

• N:M matches.

• A maximum of one source feature (brown) for each target (green).

• A maximum of one target feature (green) for each source (brown).

• One source for one target.

7 Attribute matchers

An attribute matcher can be used

1) as a complement to a geometry matcher,

2) alone (in this case, set the geometry matcher to Match All Matcher).

Pure attribute matching is optimized by indexing the target features' attributes before starting the matching process. Even in the case of fuzzy matching, a special index (BKTree) is used to speed up the search for matching strings.

Here are some settings you can use to perform String matching.


Here is the dataset, with a reference name (Mer Méditerranée) and many different ways to write it in the source layer.

Strict string matching will only match "Mer Méditerranée".

Case insensitive matching also matches lower case, upper case or mixed case names, provided they have correct accentuation.

Case and accent insensitive matcher will match all names having the correct letters, case and accents making no difference.

Levenshtein matcher with max distance = 1 can match "Mer Méditerrannée" despite the typo (double "n"), but it is case and accent sensitive.

To make the previous matcher case and accent insensitive, attribute values have to be pre-processed. TO_ASCII38 transforms strings by replacing every lower case letter and accented letter with an upper case ASCII character.

The difference between the Levenshtein and Damerau-Levenshtein distances is that Damerau-Levenshtein also counts a swap of two adjacent characters (here IT -> TI) as a single edit operation.
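The two distances can be compared with a short dynamic programming sketch. It is an independent illustration, not the plugin's implementation: `edit_distance` is a hypothetical function, and its transposition variant is the "optimal string alignment" form of Damerau-Levenshtein, which may differ from the variant the plugin uses.

```python
def edit_distance(a, b, transpositions=False):
    """Levenshtein distance between a and b.

    With transpositions=True, a swap of two adjacent characters counts
    as one edit (optimal string alignment variant of Damerau-Levenshtein).
    """
    rows = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        rows[i][0] = i                       # delete i characters
    for j in range(len(b) + 1):
        rows[0][j] = j                       # insert j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            rows[i][j] = min(
                rows[i - 1][j] + 1,          # deletion
                rows[i][j - 1] + 1,          # insertion
                rows[i - 1][j - 1] + cost,   # substitution
            )
            if (transpositions and i > 1 and j > 1
                    and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]):
                rows[i][j] = min(rows[i][j], rows[i - 2][j - 2] + 1)
    return rows[len(a)][len(b)]

reference = "Mer Méditerranée"
double_n = "Mer Méditerrannée"      # one extra "n"
swapped = "Mer Médtierranée"        # "IT" typed as "TI"

print(edit_distance(reference, double_n))
print(edit_distance(reference, swapped))
print(edit_distance(reference, swapped, transpositions=True))
```

With a maximum distance of 1, the swapped spelling is accepted only when transpositions are counted as a single edit.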

8 More about pre-processing string attributes

If you want to pre-process names before comparison, you can use predefined rules, or you can write your own rules in the ext/Rules folder.

Predefined Rules

TO_LOWERCASE: Change string characters to lower case.

NORMALIZE_SPACES: Remove initial spaces and final spaces, and change double spaces to a single space.

REMOVE_ARTICLE: Remove the article from a name (French only!). Example: "la Seine" -> "Seine".

REMOVE_PARENTHESES: Remove parentheses and their content. Example: "la Seine (fleuve)" -> "la Seine".

TO_ASCII: Replace every non-ASCII character with an equivalent ASCII character (ASCII characters include digits, lower case and upper case characters, main punctuation characters, whitespace and dollar, but exclude accented characters, other monetary characters, characters from non-Latin alphabets, symbols...).

TO_ASCII38: Transform every character into one of [A-Z], [0-9], whitespace or slash (/).

MOVE_ARTICLE_BEFORE: (no description given)

TO_UPPERCASE: Change string characters to upper case.

MOVE_ARTICLE_BEHIND: (no description given)

NEUTRAL: Do nothing.
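To give a feel for what such rules do, here are rough Python approximations of three of them. They are guesses based only on the one-line descriptions above, not the plugin's implementations; in particular, replacing characters outside the 38-character set with a space is an assumption, and the function names are hypothetical.

```python
import re
import unicodedata

def normalize_spaces(text):
    """Trim the string and collapse runs of whitespace to one space."""
    return re.sub(r"\s+", " ", text).strip()

def remove_parentheses(text):
    """Remove parentheses, their content and the whitespace before them."""
    return re.sub(r"\s*\([^)]*\)", "", text)

def to_ascii38(text):
    """Upper-case, strip accents, keep only A-Z, 0-9, space and slash.

    Assumption: any other character is replaced by a space.
    """
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return re.sub(r"[^A-Z0-9/ ]", " ", stripped.upper())

print(normalize_spaces("  la   Seine "))
print(remove_parentheses("la Seine (fleuve)"))
print(to_ascii38("Mer Méditerranée"))
print(to_ascii38("mer mediterranee") == to_ascii38("Mer Méditerranée"))
```

After such a pre-processing step, a strict or Levenshtein comparison no longer distinguishes spellings that differ only by case or accents.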

Add a Pattern Rule Set (.prs file)

You can add a .prs file in the ext/Rules folder. A .prs file is a small text file defining replacement rules based on regular expressions.

• Add comments with lines beginning with '#'.

• Write one rule per line.

• The order of rules is important: each rule is applied to the result of the previous one, in the order they appear in the file.

A single rule is made of

• a regular expression between quotes

• the equals sign

• the replacement string

Example:

    # Beginning of Pattern Rule Set
    # Remove all characters but the last word
    ".*(?=[-\p{L}]+)$" = ""
    # End of Pattern Rule Set

Add a .nrs rule (nrs is for Named Rule Set)

A named rule set is a file containing rules already defined. These rules can be one of TO_LOWERCASE, TO_UPPERCASE, TO_ASCII38... or one of the .prs files you have already defined.

Example:

    # Beginning of Named Rule Set
    # Rule applying successively MotDirecteur.prs and the TO_ASCII38 named rule.
    MotDirecteur.prs
    TO_ASCII38
    # End of Named Rule Set


9 Attribute transfer

If you decide to transfer attributes from the source layer to the target layer, attributes are transferred onto a copy of the target layer. This copy has the same name as the target layer, followed by a number in parentheses. It has already been introduced in chapter 4, along with the 2 attributes x_count and x_max_score. If you choose to transfer source attributes, you have the following options:

Transfer best match only

"Transfer best match only" is somewhat like setting the matching cardinality to [0...1] : [0...N] (one single source feature per match), except that it does not change the matching process, only the transfer process. If the option is checked, the program examines all matches and transfers the one with the best score (generally, the "nearest").

Aggregation options

If "Transfer best match only" is not checked, you can choose how matching attribute values are aggregated onto target features:

Attribute Type      Aggregator             Definition
String              Concatenate            Strings are concatenated with the pipe character ('|') as separator.
String              Concatenate (unique)   Strings are concatenated with the pipe character ('|') as separator. Identical values are not repeated.
String              MostFrequent           Keep the value matched the greatest number of times.
Double or Integer   Sum                    Sum of all values
Double or Integer   Mean                   Mean of all non-null values
Double or Integer   Maximum                Maximum of all values
Double or Integer   Minimum                Minimum of all values
Date                Mean                   Mean of all non-null values
Date                Maximum                Maximum of all values
Date                Minimum                Minimum of all values
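The aggregators listed in the table can be illustrated with a few small functions. These are sketches of the definitions given above, with hypothetical names; details such as the order of concatenated values or the handling of ties and null strings are assumptions.

```python
from collections import Counter

def concatenate(values):
    """Join all values with the pipe character."""
    return "|".join(values)

def concatenate_unique(values):
    """Join values with the pipe character, keeping each value once."""
    return "|".join(dict.fromkeys(values))    # preserves first-seen order

def most_frequent(values):
    """Value matched the greatest number of times."""
    return Counter(values).most_common(1)[0][0]

def mean(values):
    """Mean of all non-null values, or None if there is none."""
    present = [v for v in values if v is not None]
    return sum(present) / len(present) if present else None

street_names = ["Main St", "Oak St", "Main St"]
populations = [10, None, 20]

print(concatenate(street_names))
print(concatenate_unique(street_names))
print(most_frequent(street_names))
print(mean(populations))
```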

10 Performance test

Comparison of network matching with the previous version.

Source layer: 11 501 features
Target layer: 381 171 features
Memory footprint before process: 275 Mb

                                    Previous plugin                          New plugin
Criterion                           distance max to reference layer < 20     Semi-Hausdorff distance < 20
Time                                27 sec                                   16 sec
Result                              133 features                             133 features
Memory footprint after processing   292 Mb (small memory leak?)              277 Mb

11 More to do

Performing N:M matching can be a very complex task. Here, we take a simple approach where each source feature is compared to each target feature. Cardinality is checked on this basis. Moreover, if "one source / several targets" is accepted, the software can try to match a single source with several targets considered as a whole (unioned). But in this case, the set of unioned features tested is somewhat arbitrary (generally, the features intersecting the envelope of the source feature expanded by the tolerance parameter). This candidate will generally be too large, so that some matchers, like overlap or semi-Hausdorff distance, will behave properly, while stricter ones (like equals) will rarely return the right result.
