Skip to content

mapper.matchers.fuzzy_matcher

Fuzzy matcher for matching Excel and geographical data.

This module provides a matcher that uses fuzzy string matching to find similar values between Excel and geographical data columns. It uses the RapidFuzz library to find the best match for each Excel value in the geographical data, subject to a minimum similarity threshold.

FuzzyMatcher

Bases: BaseMatcher, Ui_FuzzyMatcher

Matcher for fuzzy string matching between Excel and geographical data.

Behavior
  • Uses RapidFuzz to extract the best match for each Excel value from geo values.
  • Only accepts matches above a configurable similarity threshold.
  • Ensures one-to-one matching (no duplicate matches on geo side).
Signals

updated: Emitted when the matcher configuration changes. removed: Emitted when the matcher is removed.

Source code in src/mapper/matchers/fuzzy_matcher.py
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
class FuzzyMatcher(BaseMatcher, Ui_FuzzyMatcher):
    """
    Matcher for fuzzy string matching between Excel and geographical data.

    Behavior:
      - Uses RapidFuzz to extract the best match for each Excel value from geo values.
      - Only accepts matches above a configurable similarity threshold.
      - Ensures one-to-one matching (no duplicate matches on geo side).

    Signals:
      updated: Emitted when the matcher configuration changes.
      removed: Emitted when the matcher is removed.
    """

    # --------------------------------------------------------------------------
    #  Constructor
    # --------------------------------------------------------------------------
    def __init__(self, nr: int, excel_cols: List[str], geo_cols: List[str]) -> None:
        """
        Initialize the fuzzy matcher UI and internal state.

        Steps:
          1. Call BaseMatcher constructor to store identifier and column lists.
          2. Call setupUi(self) to create UI controls from the .ui file.
          3. Populate the Excel and geo combo boxes with available columns.
          4. Connect UI signals to notify when configuration changes or removal is requested.
        """
        super().__init__(nr, excel_cols, geo_cols)
        # Build UI elements from the designer file
        self.setupUi(self)

        # Populate dropdowns with available columns
        self.comboExcel.addItems(excel_cols)
        self.comboGeo.addItems(geo_cols)

        # Connect dropdown and spinbox changes to the updated signal
        self.comboExcel.currentIndexChanged.connect(self.updated)
        self.comboGeo.currentIndexChanged.connect(self.updated)
        self.spinThreshold.valueChanged.connect(self.updated)
        # Connect remove button to emit the removed signal when clicked
        self.buttonRemove.clicked.connect(self.removed.emit)

    # --------------------------------------------------------------------------
    #  Matching logic
    # --------------------------------------------------------------------------
    def match(self, excel_df: pd.DataFrame, geo_df: pd.DataFrame) -> Tuple[Optional[pd.DataFrame], Optional[List[int]], Optional[List[int]]]:
        """
        Match records between Excel and geographical data using fuzzy string matching.

        Steps:
          1. Retrieve selected columns and threshold from UI.
          2. If threshold ≤ 1, scale it to percentage (multiply by 100).
          3. If either combo is empty, reset stats display and return no matches.
          4. Convert both column values to strings for comparison.
          5. Use RapidFuzz `extractOne` to find the best geo match for each Excel value,
             scoring with WRatio and discarding choices below threshold.
          6. Skip matches where the geo value was already used to enforce 1:1 mapping.
          7. For each accepted match, build a result row with a label containing the score.
          8. Concatenate all matched parts into one DataFrame.
          9. Update the stats label with the number of matched rows.
        """
        # Retrieve column names and threshold
        ex_col = self.comboExcel.currentText()
        ge_col = self.comboGeo.currentText()
        threshold = self.spinThreshold.value()
        if threshold <= 1:
            threshold *= 100

        # Bail out if no column is selected
        if not ex_col or not ge_col:
            self.set_stats(0)
            return None, None, None

        # Convert selected column values to string lists
        ex_values = excel_df[ex_col].astype(str).tolist()
        ge_values = geo_df[ge_col].astype(str).tolist()

        # Track used indices to enforce 1:1 mapping
        ex_used_labels: List[int] = []
        ge_used_labels: set = set()
        mapped_parts: List[pd.DataFrame] = []

        # Iterate each Excel value and attempt to find a fuzzy match in geo values
        for ex_pos, ex_val in enumerate(ex_values):
            # Find best match above threshold, scoring with WRatio
            choice = process.extractOne(ex_val, ge_values, scorer=fuzz.WRatio, score_cutoff=threshold)  # type: ignore[attr-defined]
            if not choice:
                continue  # Skip if no valid match found

            match_str, score, ge_pos = choice
            ge_label = geo_df.index[ge_pos]
            # Skip if that geo value is already matched
            if ge_label in ge_used_labels:
                continue

            # Build descriptive label including match score
            label = f"{self.description()}@{score:.1f}"
            # Build result DataFrame for this match
            part = self.build_result(excel_df.iloc[[ex_pos]], geo_df.iloc[[ge_pos]], label)
            mapped_parts.append(part)
            # Record the actual row indices used
            ex_used_labels.append(excel_df.index[ex_pos])
            ge_used_labels.add(ge_label)

        # If no parts were matched, indicate zero and return
        if not mapped_parts:
            self.set_stats(0)
            return None, None, None

        # Concatenate all matched DataFrames into one
        mapped_df = pd.concat(mapped_parts, ignore_index=True)
        # Update the UI to show how many mappings occurred
        self.set_stats(len(mapped_df))
        return mapped_df, ex_used_labels, list(ge_used_labels)

    # --------------------------------------------------------------------------
    #  Column update helpers
    # --------------------------------------------------------------------------
    # def update_excel_columns(self, cols: List[str]) -> None:
    #     """
    #     Update available Excel columns when upstream data changes.
    #
    #     Steps:
    #       1. Remember the currently selected column (if any).
    #       2. Clear the combo box and repopulate with new `cols`.
    #       3. If the previously selected column still exists, reselect it.
    #     """
    #     cur = self.comboExcel.currentText()
    #     self.comboExcel.clear()
    #     self.comboExcel.addItems(cols)
    #     if cur in cols:
    #         self.comboExcel.setCurrentText(cur)

    # def update_geo_columns(self, cols: List[str]) -> None:
    #     """
    #     Update available geo columns when upstream data changes.
    #
    #     Steps:
    #       1. Remember the currently selected column (if any).
    #       2. Clear the combo box and repopulate with new `cols`.
    #       3. If the previously selected column still exists, reselect it.
    #     """
    #     cur = self.comboGeo.currentText()
    #     self.comboGeo.clear()
    #     self.comboGeo.addItems(cols)
    #     if cur in cols:
    #         self.comboGeo.setCurrentText(cur)

    # --------------------------------------------------------------------------
    #  Stats display helper
    # --------------------------------------------------------------------------
    def set_stats(self, n: int) -> None:
        """
        Update the statistics label to reflect the number of matched rows.

        Steps:
          1. Set the text of `labelStats` to show "Mappings: n".
        """
        self.labelStats.setText(f"Mappings: {n}")

    # --------------------------------------------------------------------------
    #  Description provider
    # --------------------------------------------------------------------------
    def description(self) -> str:
        """
        Return a description of this matcher’s configuration.

        Steps:
          1. Combine the matcher ID, selected Excel column, and selected geo column.
          2. Format as "FUZZ#<nr>:<excel>→<geo>".
        """
        return f"FUZZ#{self._nr}:{self.comboExcel.currentText()}{self.comboGeo.currentText()}"

__init__(nr, excel_cols, geo_cols)

Initialize the fuzzy matcher UI and internal state.

Steps
  1. Call BaseMatcher constructor to store identifier and column lists.
  2. Call setupUi(self) to create UI controls from the .ui file.
  3. Populate the Excel and geo combo boxes with available columns.
  4. Connect UI signals to notify when configuration changes or removal is requested.
Source code in src/mapper/matchers/fuzzy_matcher.py
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
def __init__(self, nr: int, excel_cols: List[str], geo_cols: List[str]) -> None:
    """
    Initialize the fuzzy matcher UI and internal state.

    Steps:
      1. Call BaseMatcher constructor to store identifier and column lists.
      2. Call setupUi(self) to create UI controls from the .ui file.
      3. Populate the Excel and geo combo boxes with available columns.
      4. Connect UI signals to notify when configuration changes or removal is requested.
    """
    super().__init__(nr, excel_cols, geo_cols)
    # Build UI elements from the designer file
    self.setupUi(self)

    # Populate dropdowns with available columns
    self.comboExcel.addItems(excel_cols)
    self.comboGeo.addItems(geo_cols)

    # Connect dropdown and spinbox changes to the updated signal
    self.comboExcel.currentIndexChanged.connect(self.updated)
    self.comboGeo.currentIndexChanged.connect(self.updated)
    self.spinThreshold.valueChanged.connect(self.updated)
    # Connect remove button to emit the removed signal when clicked
    self.buttonRemove.clicked.connect(self.removed.emit)

description()

Return a description of this matcher’s configuration.

Steps
  1. Combine the matcher ID, selected Excel column, and selected geo column.
  2. Format as "FUZZ#:".
Source code in src/mapper/matchers/fuzzy_matcher.py
187
188
189
190
191
192
193
194
195
def description(self) -> str:
    """
    Return a description of this matcher’s configuration.

    Steps:
      1. Combine the matcher ID, selected Excel column, and selected geo column.
      2. Format as "FUZZ#<nr>:<excel>→<geo>".
    """
    return f"FUZZ#{self._nr}:{self.comboExcel.currentText()}{self.comboGeo.currentText()}"

match(excel_df, geo_df)

Match records between Excel and geographical data using fuzzy string matching.

Steps
  1. Retrieve selected columns and threshold from UI.
  2. If threshold ≤ 1, scale it to percentage (multiply by 100).
  3. If either combo is empty, reset stats display and return no matches.
  4. Convert both column values to strings for comparison.
  5. Use RapidFuzz extractOne to find the best geo match for each Excel value, scoring with WRatio and discarding choices below threshold.
  6. Skip matches where the geo value was already used to enforce 1:1 mapping.
  7. For each accepted match, build a result row with a label containing the score.
  8. Concatenate all matched parts into one DataFrame.
  9. Update the stats label with the number of matched rows.
Source code in src/mapper/matchers/fuzzy_matcher.py
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
def match(self, excel_df: pd.DataFrame, geo_df: pd.DataFrame) -> Tuple[Optional[pd.DataFrame], Optional[List[int]], Optional[List[int]]]:
    """
    Match records between Excel and geographical data using fuzzy string matching.

    Steps:
      1. Retrieve selected columns and threshold from UI.
      2. If threshold ≤ 1, scale it to percentage (multiply by 100).
      3. If either combo is empty, reset stats display and return no matches.
      4. Convert both column values to strings for comparison.
      5. Use RapidFuzz `extractOne` to find the best geo match for each Excel value,
         scoring with WRatio and discarding choices below threshold.
      6. Skip matches where the geo value was already used to enforce 1:1 mapping.
      7. For each accepted match, build a result row with a label containing the score.
      8. Concatenate all matched parts into one DataFrame.
      9. Update the stats label with the number of matched rows.
    """
    # Retrieve column names and threshold
    ex_col = self.comboExcel.currentText()
    ge_col = self.comboGeo.currentText()
    threshold = self.spinThreshold.value()
    if threshold <= 1:
        threshold *= 100

    # Bail out if no column is selected
    if not ex_col or not ge_col:
        self.set_stats(0)
        return None, None, None

    # Convert selected column values to string lists
    ex_values = excel_df[ex_col].astype(str).tolist()
    ge_values = geo_df[ge_col].astype(str).tolist()

    # Track used indices to enforce 1:1 mapping
    ex_used_labels: List[int] = []
    ge_used_labels: set = set()
    mapped_parts: List[pd.DataFrame] = []

    # Iterate each Excel value and attempt to find a fuzzy match in geo values
    for ex_pos, ex_val in enumerate(ex_values):
        # Find best match above threshold, scoring with WRatio
        choice = process.extractOne(ex_val, ge_values, scorer=fuzz.WRatio, score_cutoff=threshold)  # type: ignore[attr-defined]
        if not choice:
            continue  # Skip if no valid match found

        match_str, score, ge_pos = choice
        ge_label = geo_df.index[ge_pos]
        # Skip if that geo value is already matched
        if ge_label in ge_used_labels:
            continue

        # Build descriptive label including match score
        label = f"{self.description()}@{score:.1f}"
        # Build result DataFrame for this match
        part = self.build_result(excel_df.iloc[[ex_pos]], geo_df.iloc[[ge_pos]], label)
        mapped_parts.append(part)
        # Record the actual row indices used
        ex_used_labels.append(excel_df.index[ex_pos])
        ge_used_labels.add(ge_label)

    # If no parts were matched, indicate zero and return
    if not mapped_parts:
        self.set_stats(0)
        return None, None, None

    # Concatenate all matched DataFrames into one
    mapped_df = pd.concat(mapped_parts, ignore_index=True)
    # Update the UI to show how many mappings occurred
    self.set_stats(len(mapped_df))
    return mapped_df, ex_used_labels, list(ge_used_labels)

set_stats(n)

Update the statistics label to reflect the number of matched rows.

Steps
  1. Set the text of labelStats to show "Mappings: n".
Source code in src/mapper/matchers/fuzzy_matcher.py
175
176
177
178
179
180
181
182
def set_stats(self, n: int) -> None:
    """
    Update the statistics label to reflect the number of matched rows.

    Steps:
      1. Set the text of `labelStats` to show "Mappings: n".
    """
    self.labelStats.setText(f"Mappings: {n}")