Module: Utilities::DarwinCore::Compact

Defined in:
lib/utilities/darwin_core/compact.rb

Overview

Methods for compacting DarwinCore tables by merging rows that share a catalogNumber.

Author:

  • Claude (>50% of code)

Constant Summary

MALE_STRINGS =
/\Amale/i
FEMALE_STRINGS =
/\Afemale/i
ADULT_STRINGS =
/\Aadult/i
EXUVIA_STRINGS =
/\Aexuvia/i
NYMPH_STRINGS =
/\Anymph/i
COMPACT_DELIMITER =
'|'
APPENDED_COLUMNS =
%w[lifeStage sex otherCatalogNumbers associatedMedia].freeze
SUMMED_COLUMNS =
%w[individualCount].freeze
DERIVED_COLUMNS =
%w[adultMale adultFemale immatureNymph exuvia].freeze
SKIP_VALIDATION_COLUMNS =

Columns excluded from the differing-values validation check. These are housekeeping or per-row identity fields that are expected to differ across rows sharing a catalogNumber.

%w[
  id
  occurrenceID
  dwc_occurrence_object_id
  dwc_occurrence_object_type
].freeze
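
The sex and life-stage patterns above are anchored only at the start of the string (`\A`) and are case-insensitive, so they act as prefix matches. A minimal sketch of what that implies, using hypothetical sample values that are not taken from the source:

```ruby
# Patterns copied from the constant summary above.
MALE_STRINGS   = /\Amale/i
FEMALE_STRINGS = /\Afemale/i

# Hypothetical sample values, for illustration only.
samples = ['male', 'Female', 'males', 'unknown male', 'M']

# Record which pattern, if any, each sample matches.
results = samples.to_h do |value|
  [value, { male: value.match?(MALE_STRINGS), female: value.match?(FEMALE_STRINGS) }]
end

results.each { |value, flags| puts "#{value.inspect}: #{flags}" }
```

Under these patterns 'males' counts as male because only the prefix is tested, while 'unknown male' and the abbreviation 'M' match neither pattern.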

Class Method Details

.add_derived_columns(row) ⇒ void (private)

This method returns an undefined value.

Add derived columns to a single (non-grouped) row.

Parameters:

  • row (Hash)


# File 'lib/utilities/darwin_core/compact.rb', line 182

def self.add_derived_columns(row)
  count = row['individualCount'].to_i
  sex_value = row['sex'].to_s.strip
  life_stage_value = row['lifeStage'].to_s.strip

  row['adultMale'] = (sex_value.match?(MALE_STRINGS) ? count : 0).to_s
  row['adultFemale'] = (sex_value.match?(FEMALE_STRINGS) ? count : 0).to_s
  row['immatureNymph'] = (life_stage_value.match?(NYMPH_STRINGS) ? count : 0).to_s
  row['exuvia'] = (life_stage_value.match?(EXUVIA_STRINGS) ? count : 0).to_s
end
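
Because the method is private, it cannot be called directly from outside the module. The following standalone sketch restates the same logic under a hypothetical module name (`DerivedColumnsSketch`) to show what the derived columns look like for a few made-up rows; unlike the original, the sketch returns the row for convenience.

```ruby
# Standalone restatement of the logic shown above, for illustration only.
# DerivedColumnsSketch is a hypothetical name, not part of the library.
module DerivedColumnsSketch
  MALE_STRINGS   = /\Amale/i
  FEMALE_STRINGS = /\Afemale/i
  EXUVIA_STRINGS = /\Aexuvia/i
  NYMPH_STRINGS  = /\Anymph/i

  def self.add_derived_columns(row)
    count = row['individualCount'].to_i
    sex = row['sex'].to_s.strip
    life_stage = row['lifeStage'].to_s.strip

    row['adultMale']     = (sex.match?(MALE_STRINGS) ? count : 0).to_s
    row['adultFemale']   = (sex.match?(FEMALE_STRINGS) ? count : 0).to_s
    row['immatureNymph'] = (life_stage.match?(NYMPH_STRINGS) ? count : 0).to_s
    row['exuvia']        = (life_stage.match?(EXUVIA_STRINGS) ? count : 0).to_s
    row # returned only so the examples below can be chained
  end
end

# Hypothetical rows.
adult = DerivedColumnsSketch.add_derived_columns(
  { 'individualCount' => '3', 'sex' => 'Male', 'lifeStage' => 'adult' }
)
nymph = DerivedColumnsSketch.add_derived_columns(
  { 'individualCount' => '2', 'sex' => 'male', 'lifeStage' => 'nymph' }
)
blank = DerivedColumnsSketch.add_derived_columns({ 'sex' => 'female' })

puts adult, nymph, blank
```

Two behaviors are visible here: adultMale and adultFemale are derived from sex alone (lifeStage is not consulted, so the male nymph row counts toward both adultMale and immatureNymph), and a missing individualCount is treated as 0.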

.add_derived_columns_from_group(merged, rows) ⇒ void (private)

This method returns an undefined value.

Add derived columns from a group of pre-merge rows.

Parameters:

  • merged (Hash)

    the target merged row

  • rows (Array<Hash>)

    original rows



# File 'lib/utilities/darwin_core/compact.rb', line 155

def self.add_derived_columns_from_group(merged, rows)
  adult_male_count = 0
  adult_female_count = 0
  immature_nymph_count = 0
  exuvia_count = 0

  rows.each do |row|
    count = row['individualCount'].to_i
    sex_value = row['sex'].to_s.strip
    life_stage_value = row['lifeStage'].to_s.strip

    adult_male_count += count if sex_value.match?(MALE_STRINGS)
    adult_female_count += count if sex_value.match?(FEMALE_STRINGS)
    immature_nymph_count += count if life_stage_value.match?(NYMPH_STRINGS)
    exuvia_count += count if life_stage_value.match?(EXUVIA_STRINGS)
  end

  merged['adultMale'] = adult_male_count.to_s
  merged['adultFemale'] = adult_female_count.to_s
  merged['immatureNymph'] = immature_nymph_count.to_s
  merged['exuvia'] = exuvia_count.to_s
end
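
One reason to tally from the pre-merge rows, rather than from the merged row, is that merging collapses sex and lifeStage into delimited strings and individualCount into a single total. A small sketch with hypothetical rows illustrates the difference:

```ruby
MALE_STRINGS  = /\Amale/i
NYMPH_STRINGS = /\Anymph/i

# Hypothetical pre-merge rows sharing one catalogNumber.
rows = [
  { 'individualCount' => '2', 'sex' => 'male',   'lifeStage' => 'adult' },
  { 'individualCount' => '5', 'sex' => 'female', 'lifeStage' => 'adult' },
  { 'individualCount' => '1', 'sex' => '',       'lifeStage' => 'nymph' }
]

# After merging, sex is a delimited string and individualCount is a total,
# so per-category counts cannot be recovered from the merged values alone.
merged_sex   = rows.map { |r| r['sex'] }.reject(&:empty?).uniq.join('|')
merged_count = rows.sum { |r| r['individualCount'].to_i }

# Tallying from the original rows keeps each count tied to its own sex/lifeStage.
adult_male     = rows.sum { |r| r['sex'].match?(MALE_STRINGS) ? r['individualCount'].to_i : 0 }
immature_nymph = rows.sum { |r| r['lifeStage'].match?(NYMPH_STRINGS) ? r['individualCount'].to_i : 0 }

puts "merged sex=#{merged_sex} count=#{merged_count}"
puts "adultMale=#{adult_male} immatureNymph=#{immature_nymph}"
```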

.by_catalog_number(table, preview: false) ⇒ void

This method returns an undefined value.

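Merge rows with identical catalogNumber values. Rows without a catalogNumber are excluded from compaction but tracked in table.skipped_rows. When preview is true, groups are still validated (and table.skipped_rows is still set), but the table's rows and headers are left unchanged.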

Parameters:



# File 'lib/utilities/darwin_core/compact.rb', line 36

def self.by_catalog_number(table, preview: false)
  with_catalog_number, without_catalog_number = table.rows.partition { |row|
    row['catalogNumber'].to_s.strip.present?
  }

  table.skipped_rows = without_catalog_number

  grouped = with_catalog_number.group_by { |row| row['catalogNumber'] }

  merged_rows = []

  grouped.each do |catalog_number, rows_in_group|
    if rows_in_group.size == 1
      row = rows_in_group.first
      unless preview
        add_derived_columns(row)
        merged_rows << row
      end
      next
    end

    validate_group(table, catalog_number, rows_in_group)

    unless preview
      merged = merge_group(table, catalog_number, rows_in_group)
      merged_rows << merged
    end
  end

  unless preview
    ensure_derived_headers(table)
    table.instance_variable_set(:@rows, merged_rows)
  end
end
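
The partition-then-group step at the top of this method can be tried in isolation. The sketch below uses hypothetical rows and plain Ruby in place of ActiveSupport's `present?`; it is not the library's code, only an illustration of how rows are split and keyed.

```ruby
# Hypothetical rows; only catalogNumber matters for this step.
rows = [
  { 'catalogNumber' => 'A1',  'sex' => 'male' },
  { 'catalogNumber' => 'A1',  'sex' => 'female' },
  { 'catalogNumber' => 'A1 ', 'sex' => 'male' },   # trailing space
  { 'catalogNumber' => '   ', 'sex' => 'male' },   # whitespace only
  { 'catalogNumber' => nil,   'sex' => 'female' }
]

# Plain-Ruby stand-in for `present?` on a stripped string.
with_number, without_number = rows.partition do |row|
  !row['catalogNumber'].to_s.strip.empty?
end

# As in the source, the grouping key is the raw (unstripped) value.
grouped = with_number.group_by { |row| row['catalogNumber'] }

puts "skipped: #{without_number.size}"
grouped.each { |key, group| puts "#{key.inspect}: #{group.size} row(s)" }
```

Note that the blank check strips whitespace but the grouping key does not, so 'A1' and 'A1 ' end up in separate groups in this sketch.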

.ensure_derived_headers(table) ⇒ void (private)

This method returns an undefined value.

Ensure derived column headers are present in the table.

Parameters:



# File 'lib/utilities/darwin_core/compact.rb', line 197

def self.ensure_derived_headers(table)
  DERIVED_COLUMNS.each do |col|
    table.headers << col unless table.headers.include?(col)
  end
end
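
The `include?` guard makes the header update idempotent. A brief sketch, assuming headers is a plain Array of strings (the actual type of table.headers is not shown in this page):

```ruby
DERIVED_COLUMNS = %w[adultMale adultFemale immatureNymph exuvia].freeze

# Hypothetical header list that already contains one derived column.
headers = %w[catalogNumber sex adultMale]

# Running the same loop twice must not duplicate any header.
2.times do
  DERIVED_COLUMNS.each do |col|
    headers << col unless headers.include?(col)
  end
end

puts headers.inspect
```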

.merge_group(table, catalog_number, rows) ⇒ Hash (private)

Merge a group of rows into a single row.

Parameters:

Returns:

  • (Hash)

    the merged row



# File 'lib/utilities/darwin_core/compact.rb', line 131

def self.merge_group(table, catalog_number, rows)
  merged = rows.first.dup

  # Sum individualCount
  SUMMED_COLUMNS.each do |col|
    merged[col] = rows.sum { |r| r[col].to_i }.to_s
  end

  # Append unique values with delimiter
  APPENDED_COLUMNS.each do |col|
    unique_values = rows.map { |r| r[col].to_s.strip }.reject(&:empty?).uniq
    merged[col] = unique_values.join(COMPACT_DELIMITER)
  end

  add_derived_columns_from_group(merged, rows)

  merged
end
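
As a worked example of the sum and append steps, the following sketch applies the same two loops to hypothetical rows (the locality column and all values are made up for illustration). The derived-column step is omitted here.

```ruby
COMPACT_DELIMITER = '|'
APPENDED_COLUMNS  = %w[lifeStage sex otherCatalogNumbers associatedMedia].freeze
SUMMED_COLUMNS    = %w[individualCount].freeze

# Hypothetical rows sharing a catalogNumber.
rows = [
  { 'catalogNumber' => 'A1', 'individualCount' => '2',
    'sex' => 'male', 'lifeStage' => 'adult', 'locality' => 'Site 1' },
  { 'catalogNumber' => 'A1', 'individualCount' => '3',
    'sex' => 'female', 'lifeStage' => 'adult', 'locality' => 'Site 1' },
  { 'catalogNumber' => 'A1', 'individualCount' => nil,
    'sex' => ' male ', 'lifeStage' => '', 'locality' => 'Site 1' }
]

# Start from a shallow copy of the first row so the original is not modified.
merged = rows.first.dup

# Sum numeric columns; nil and blank values count as 0.
SUMMED_COLUMNS.each { |col| merged[col] = rows.sum { |r| r[col].to_i }.to_s }

# Join the unique, non-empty, stripped values of each appended column.
APPENDED_COLUMNS.each do |col|
  values = rows.map { |r| r[col].to_s.strip }.reject(&:empty?).uniq
  merged[col] = values.join(COMPACT_DELIMITER)
end

puts merged
```

All other columns (here, locality) keep the value from the first row of the group. Appended columns that are absent from every row, such as otherCatalogNumbers in this sketch, are set to an empty string.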

.validate_group(table, catalog_number, rows) ⇒ void (private)

This method returns an undefined value.

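Validate a group of rows sharing a catalogNumber. Appends an :error entry to table.errors for each header column whose values differ across the group; columns in APPENDED_COLUMNS, SUMMED_COLUMNS, and SKIP_VALIDATION_COLUMNS are not checked. Appends a :warning entry when sex is present but is neither male nor female and lifeStage is not adult, and when lifeStage is present but is not adult, nymph, or exuvia.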

Parameters:



# File 'lib/utilities/darwin_core/compact.rb', line 79

def self.validate_group(table, catalog_number, rows)
  operated_columns = APPENDED_COLUMNS + SUMMED_COLUMNS + SKIP_VALIDATION_COLUMNS

  (table.headers - operated_columns).each do |column|
    values = rows.map { |r| r[column] }.uniq
    if values.size > 1
      table.errors << {
        type: :error,
        catalog_number:,
        column:,
        message: "Differing values in '#{column}'",
        values:
      }
    end
  end

  rows.each do |row|
    sex_value = row['sex'].to_s.strip
    life_stage_value = row['lifeStage'].to_s.strip

    if sex_value.present? && !sex_value.match?(MALE_STRINGS) && !sex_value.match?(FEMALE_STRINGS)
      if !life_stage_value.match?(ADULT_STRINGS)
        table.errors << {
          type: :warning,
          catalog_number:,
          column: 'sex',
          message: "Non-adult/non-standard sex '#{sex_value}' with lifeStage '#{life_stage_value}'",
          values: [sex_value, life_stage_value]
        }
      end
    end

    if life_stage_value.present? && !life_stage_value.match?(ADULT_STRINGS)
      unless life_stage_value.match?(NYMPH_STRINGS) || life_stage_value.match?(EXUVIA_STRINGS)
        table.errors << {
          type: :warning,
          catalog_number:,
          column: 'lifeStage',
          message: "Non-adult lifeStage '#{life_stage_value}'",
          values: [life_stage_value]
        }
      end
    end
  end
end
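
The differing-values check can be illustrated on its own. This sketch assumes headers is a plain Array of strings and collects entries in a local `errors` array rather than table.errors; the rows and the locality column are hypothetical.

```ruby
APPENDED_COLUMNS        = %w[lifeStage sex otherCatalogNumbers associatedMedia].freeze
SUMMED_COLUMNS          = %w[individualCount].freeze
SKIP_VALIDATION_COLUMNS = %w[id occurrenceID dwc_occurrence_object_id dwc_occurrence_object_type].freeze

# Hypothetical headers and rows for one catalogNumber group.
headers = %w[id catalogNumber individualCount sex locality]
rows = [
  { 'id' => '1', 'catalogNumber' => 'A1', 'individualCount' => '2',
    'sex' => 'male', 'locality' => 'Site 1' },
  { 'id' => '2', 'catalogNumber' => 'A1', 'individualCount' => '3',
    'sex' => 'female', 'locality' => 'Site 2' }
]

# Columns that are merged, summed, or expected to differ are not checked.
excluded = APPENDED_COLUMNS + SUMMED_COLUMNS + SKIP_VALIDATION_COLUMNS
errors = []

(headers - excluded).each do |column|
  values = rows.map { |r| r[column] }.uniq
  next unless values.size > 1

  errors << { type: :error, catalog_number: 'A1', column: column,
              message: "Differing values in '#{column}'", values: values }
end

errors.each { |e| puts e }
```

Here id, individualCount, and sex differ between the two rows but are excluded from the check, so only locality produces an error entry.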