Class: Export::Dwca::Checklist::OccurrenceNormalizer

Inherits:
Object
  • Object
Defined in:
lib/export/dwca/checklist/occurrence_normalizer.rb

Overview

Service object that normalizes an occurrence-based CSV into a deduplicated, taxon-based CSV with OTU UUID taxonIDs and parent/child relationships.

Handles:

  • Extracting unique taxa from occurrence data
  • Building taxonomic hierarchy from rank columns
  • Assigning OTU UUID taxonIDs
  • Creating parentNameUsageID relationships
  • Handling synonyms in accepted_name_usage_id mode

Constant Summary

ORDERED_RANKS =

Array of rank strings in hierarchical order (highest to lowest).

Returns:

  • (Array)

    rank strings in hierarchical order (highest to lowest)

Data::ORDERED_RANKS
PASSTHROUGH_FIELDS =

DwC Taxon term column names allowed in the final output row. Derived from the checklist taxon extension definition values (post-header conversion names, e.g. 'class' not 'dwcClass'); anything not listed here (internal bookkeeping, future DwcOccurrence columns, etc.) is automatically excluded at finalization.

Data::CHECKLIST_TAXON_EXTENSION_COLUMNS.values.map(&:to_s).freeze
HIGHER_RANK_COLUMNS =

DwcOccurrence column names that hold rank-named classification values (as opposed to epithet columns like specificEpithet).

%w[kingdom phylum class order superfamily family subfamily tribe subtribe genus subgenus].freeze
TAXON_NAME_METADATA_FIELDS =

Fields written by store_taxon_name_metadata. Cleared for extracted parent taxa (which have no occurrence data of their own) and stripped from all rows during finalization.

%w[
  taxon_name_cached
  taxon_name_cached_is_valid
  taxon_name_cached_valid_taxon_name_id
  taxon_name_gbif_taxonomic_status
].freeze


Constructor Details

#initialize(raw_csv:, accepted_name_mode:, otu_to_taxon_name_data:, occurrence_to_otu:) ⇒ OccurrenceNormalizer

Returns a new instance of OccurrenceNormalizer.

Parameters:

  • raw_csv (String)

    CSV with one row per occurrence

  • accepted_name_mode (String)

    How to handle synonyms

  • otu_to_taxon_name_data (Hash)

    otu_id => { cached:, cached_is_valid:, ... }

  • occurrence_to_otu (Hash)

    "type:id" => otu_id



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 40

def initialize(raw_csv:, accepted_name_mode:, otu_to_taxon_name_data:, occurrence_to_otu:)
  @raw_csv = raw_csv
  @accepted_name_mode = accepted_name_mode
  @otu_to_taxon_name_data = otu_to_taxon_name_data
  @occurrence_to_otu = occurrence_to_otu
end

Instance Attribute Details

#accepted_name_mode ⇒ Object (readonly, private)

Returns the value of attribute accepted_name_mode.



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 88

def accepted_name_mode
  @accepted_name_mode
end

#occurrence_to_otu ⇒ Object (readonly, private)

Returns the value of attribute occurrence_to_otu.



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 88

def occurrence_to_otu
  @occurrence_to_otu
end

#otu_to_taxon_name_data ⇒ Object (readonly, private)

Returns the value of attribute otu_to_taxon_name_data.



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 88

def otu_to_taxon_name_data
  @otu_to_taxon_name_data
end

#raw_csv ⇒ Object (readonly, private)

Returns the value of attribute raw_csv.



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 88

def raw_csv
  @raw_csv
end

Class Method Details

.combine_scientific_name(cached, cached_author_year) ⇒ Object

Joins the cached name and the cached author/year with a single space, skipping blank parts. Returns nil when both parts are blank.



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 890

def self.combine_scientific_name(cached, cached_author_year)
  [cached, cached_author_year].compact_blank.join(' ').presence
end
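
The method above depends on ActiveSupport (compact_blank and presence). As a rough illustration of the same behavior in plain Ruby, here is a minimal sketch; the helper name combine_name_sketch is hypothetical, and it assumes "blank" means nil or a whitespace-only string.

```ruby
# Hypothetical stand-in for combine_scientific_name that avoids ActiveSupport.
# Assumption: "blank" means nil or a whitespace-only string.
def combine_name_sketch(cached, cached_author_year)
  parts = [cached, cached_author_year].reject { |part| part.nil? || part.to_s.strip.empty? }
  combined = parts.join(' ')
  combined.empty? ? nil : combined
end

p combine_name_sketch('Aus bus', 'Smith, 1900') # => "Aus bus Smith, 1900"
p combine_name_sketch('Aus bus', nil)           # => "Aus bus"
p combine_name_sketch(nil, ' ')                 # => nil
```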

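.infraspecific_rank_names ⇒ Object

Returns all infraspecific rank names (the ranks that follow species) across the ICZN, ICN, and ICNP species groups. The result is memoized.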



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 876

def self.infraspecific_rank_names
  @infraspecific_rank_names ||= begin
    [
      ::NomenclaturalRank::Iczn::SpeciesGroup,
      ::NomenclaturalRank::Icn::SpeciesAndInfraspeciesGroup,
      ::NomenclaturalRank::Icnp::SpeciesGroup
    ].flat_map { |group|
      ranks = group.ordered_ranks.map(&:rank_name)
      species_idx = ranks.index('species')
      species_idx ? ranks[(species_idx + 1)..] : []
    }.uniq
  end
end
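
The slicing logic can be tried without the NomenclaturalRank classes. The sketch below uses made-up rank lists (RANK_GROUPS_SKETCH is an assumption, not the real contents of any rank group) to show how everything after 'species' is collected and deduplicated.

```ruby
# Hypothetical rank lists standing in for group.ordered_ranks.map(&:rank_name).
RANK_GROUPS_SKETCH = [
  %w[superspecies species subspecies],
  %w[species subspecies variety form],
  %w[genus subgenus] # no 'species' entry, so this group contributes nothing
].freeze

def infraspecific_rank_names_sketch(groups)
  groups.flat_map { |ranks|
    species_idx = ranks.index('species')
    # Keep only the ranks listed after 'species'.
    species_idx ? ranks[(species_idx + 1)..] : []
  }.uniq
end

p infraspecific_rank_names_sketch(RANK_GROUPS_SKETCH)
# => ["subspecies", "variety", "form"]
```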

Instance Method Details

#add_terminal_taxon(row, terminal_tn_id, terminal_rank, all_taxa, ancestor_lookup, taxon_name_info = {}) ⇒ Object (private)

Add terminal taxon to all_taxa if not already present.

Parameters:

  • row (CSV::Row)

    occurrence row

  • terminal_tn_id (Integer)

    terminal taxon_name_id

  • terminal_rank (String)

    rank of terminal taxon

  • all_taxa (Hash)

    hash of taxon_name_id => taxon data

  • ancestor_lookup (Hash)

    precomputed ancestor lookup



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 289

def add_terminal_taxon(
  row, terminal_tn_id, terminal_rank, all_taxa, ancestor_lookup, taxon_name_info = {}
)
  return if all_taxa[terminal_tn_id]

  taxon = row.to_h
  taxon['taxon_name_id'] = terminal_tn_id

  # In accepted_name_usage_id mode, use the original full name if available.
  if row['taxon_name_cached'].present?
    taxon['scientificName'] = self.class.combine_scientific_name(
      row['taxon_name_cached'],
      taxon_name_info.dig(terminal_tn_id, :scientific_name_authorship)
    )
  end

  normalize_occurrence_taxon(
    taxon,
    terminal_rank,
    terminal_rank,
    taxon_name_info: taxon_name_info[terminal_tn_id]
  )

  all_taxa[terminal_tn_id] = taxon

  # Extract parent species for infraspecific taxa.
  if self.class.infraspecific_rank_names.include?(terminal_rank)
    extract_parent_species_for_taxon(
      row, terminal_rank, terminal_tn_id, ancestor_lookup, all_taxa, taxon_name_info
    )
  end
end

#assign_taxon_ids_and_build_hierarchy(all_taxa, taxon_name_info) ⇒ Array (private)

Assign OTU UUID taxonIDs and parentNameUsageIDs to all taxa.

Parameters:

  • all_taxa (Hash)

    hash of taxon_name_id => taxon data

  • taxon_name_info (Hash)

    taxon_name_id => { rank:, parent_id:, scientific_name_authorship: }

Returns:

  • (Array)

    [processed_taxa, taxon_name_id_to_taxon_id]



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 532

def assign_taxon_ids_and_build_hierarchy(all_taxa, taxon_name_info)
  taxa_with_ids, taxon_name_id_to_taxon_id =
    assign_taxon_uuids(all_taxa, taxon_name_info)
  processed_taxa = build_processed_taxa(
    taxa_with_ids, taxon_name_info, taxon_name_id_to_taxon_id
  )

  [processed_taxa, taxon_name_id_to_taxon_id]
end

#assign_taxon_uuids(all_taxa, taxon_name_info) ⇒ Array (private)

Assign OTU UUID taxonIDs to all taxa, grouped by rank. Taxa without an OTU UUID identifier are excluded from the export.

Parameters:

  • all_taxa (Hash)

    taxon_name_id => taxon data

  • taxon_name_info (Hash)

    taxon_name_id => { rank:, parent_id: }

Returns:

  • (Array)

    [taxa_with_ids, taxon_name_id_to_taxon_id mapping]



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 547

def assign_taxon_uuids(all_taxa, taxon_name_info)
  uuid_map = taxon_name_id_to_otu_uuid(all_taxa.keys)
  taxon_name_id_to_taxon_id = {}
  taxa_with_ids = []

  # Orderings here determine the final CSV row ordering.
  ORDERED_RANKS.each do |rank|
    rank_taxa = all_taxa.select { |tn_id, taxon|
      taxon_name_info[tn_id]&.[](:rank) == rank
    }.sort_by { |tn_id, taxon| taxon['scientificName'] || '' }

    rank_taxa.each do |tn_id, taxon|
      next if taxon_name_id_to_taxon_id[tn_id]

      uuid = uuid_map[tn_id]
      next unless uuid

      taxon_name_id_to_taxon_id[tn_id] = uuid

      taxa_with_ids << {
        taxon: taxon, taxon_id: uuid, taxon_name_id: tn_id, rank: rank
      }
    end
  end

  [taxa_with_ids, taxon_name_id_to_taxon_id]
end
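
To make the row ordering concrete, the following sketch reproduces the rank-then-name ordering and the UUID filter on plain hashes. RANKS_FOR_ORDERING_SKETCH and the sample data are assumptions for illustration; the real method obtains UUIDs from taxon_name_id_to_otu_uuid, which is not shown here.

```ruby
# Hypothetical subset of ORDERED_RANKS, highest to lowest.
RANKS_FOR_ORDERING_SKETCH = %w[family genus species].freeze

def order_taxa_sketch(all_taxa, taxon_name_info, uuid_map)
  RANKS_FOR_ORDERING_SKETCH.flat_map do |rank|
    all_taxa
      .select { |tn_id, _taxon| taxon_name_info.dig(tn_id, :rank) == rank }
      .sort_by { |_tn_id, taxon| taxon['scientificName'] || '' }
      .filter_map do |tn_id, _taxon|
        uuid = uuid_map[tn_id]
        next unless uuid # taxa without a UUID are left out of the export

        { taxon_id: uuid, taxon_name_id: tn_id, rank: rank }
      end
  end
end

all_taxa = {
  1 => { 'scientificName' => 'Bus' },
  2 => { 'scientificName' => 'Aus' },
  3 => { 'scientificName' => 'Aus bus' },
  4 => { 'scientificName' => 'Aidae' }
}
info = { 1 => { rank: 'genus' }, 2 => { rank: 'genus' },
         3 => { rank: 'species' }, 4 => { rank: 'family' } }
uuids = { 1 => 'uuid-1', 2 => 'uuid-2', 4 => 'uuid-4' } # taxon 3 has no UUID

p order_taxa_sketch(all_taxa, info, uuids).map { |t| t[:taxon_name_id] }
# => [4, 2, 1]
```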

#build_ancestor_lookup(terminal_taxon_name_ids) ⇒ Hash (private)

Build a lookup hash for ancestor taxon_name_ids from terminal taxon_name_ids.

Parameters:

  • terminal_taxon_name_ids (Array<Integer>)

    IDs of terminal TaxonNames

Returns:

  • (Hash)

    "terminal_tn_id:rank" => ancestor_tn_id, i.e. the taxon_name_id of the ancestor of terminal_tn_id at the given rank



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 163

def build_ancestor_lookup(terminal_taxon_name_ids)
  return {} if terminal_taxon_name_ids.empty?

  lookup = {}

  # Query taxon_name_hierarchies WITH join to get rank_class in single query
  hierarchy_relationships = TaxonNameHierarchy
    .joins('JOIN taxon_names ON taxon_names.id = taxon_name_hierarchies.ancestor_id')
    .where(descendant_id: terminal_taxon_name_ids)
    .where.not('ancestor_id = descendant_id') # exclude self-references
    .pluck('taxon_name_hierarchies.descendant_id',
           'taxon_name_hierarchies.ancestor_id',
           'taxon_names.rank_class')

  # Build lookup hash: "terminal_id:rank" => ancestor_id
  hierarchy_relationships.each do |descendant_id, ancestor_id, rank_class|
    next unless rank_class

    rank = rank_class_to_name[rank_class]
    next unless rank

    key = "#{descendant_id}:#{rank}"
    lookup[key] = ancestor_id
  end

  lookup
end
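
The shape of the returned lookup is easiest to see on sample data. In this sketch the tuples mimic what pluck returns, and RANK_CLASS_TO_NAME_SKETCH is a hypothetical stand-in for rank_class_to_name (whose real keys are not shown in this page).

```ruby
# Hypothetical mapping standing in for rank_class_to_name.
RANK_CLASS_TO_NAME_SKETCH = {
  'Rank::Family' => 'family',
  'Rank::Genus' => 'genus'
}.freeze

# rows: [descendant_id, ancestor_id, rank_class] tuples
def ancestor_lookup_sketch(rows)
  rows.each_with_object({}) do |(descendant_id, ancestor_id, rank_class), lookup|
    rank = RANK_CLASS_TO_NAME_SKETCH[rank_class]
    next unless rank # skip rows with a missing or unmapped rank_class

    lookup["#{descendant_id}:#{rank}"] = ancestor_id
  end
end

rows = [
  [10, 2, 'Rank::Genus'],
  [10, 1, 'Rank::Family'],
  [10, 99, nil],
  [11, 2, 'Rank::Genus']
]
p ancestor_lookup_sketch(rows)
# => {"10:genus"=>2, "10:family"=>1, "11:genus"=>2}
```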

#build_final_taxon(taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Hash (private)

Build a single processed taxon with all relationships.

Parameters:

  • taxon (Hash)

    source taxon data

  • taxon_id (String)

    assigned taxonID (an OTU UUID)

  • taxon_name_id (Integer)

    source taxon_name_id

  • taxon_name_info (Hash)

    taxon_name_id => { rank:, parent_id: }

  • taxon_name_id_to_taxon_id (Hash)

    taxon_name_id => taxonID

Returns:

  • (Hash)

    processed taxon



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 620

def build_final_taxon(
  taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id
)
  if accepted_name_mode == 'accepted_name_usage_id'
    accepted_name_usage_id, taxonomic_status, accepted_name_usage = determine_accepted_name_usage(
      taxon,
      taxon_id,
      taxon_name_id,
      taxon_name_info,
      taxon_name_id_to_taxon_id
    )
  end

  # GBIF checklist guidance requires acceptedNameUsageID on synonym rows to
  # point at an existing record in the dataset. If the accepted name could
  # not be included in this export, omit the synonym row instead of
  # emitting an invalid reference.
  if accepted_name_mode == 'accepted_name_usage_id' &&
     taxonomic_status.present? &&
     taxonomic_status != 'accepted' &&
     accepted_name_usage_id.nil?
    return nil
  end

  parent_id = nil
  # Synonyms don't participate in parent hierarchy.
  if accepted_name_mode == 'replace_with_accepted_name' ||
     taxonomic_status == 'accepted'
    # Find parent via TaxonName parent_id, walking up hierarchy if needed.
    current_parent_id = taxon_name_info[taxon_name_id]&.[](:parent_id)

    while current_parent_id
      # Check if this parent is in the export
      if taxon_name_id_to_taxon_id[current_parent_id]
        parent_id = taxon_name_id_to_taxon_id[current_parent_id]
        break
      end

      # Parent not in export, walk up to its parent
      current_parent_id = taxon_name_info[current_parent_id]&.[](:parent_id)
    end
  end

  processed_taxon = {
    'id' => taxon_id,
    'taxonID' => taxon_id,
    'parentNameUsageID' => parent_id
  }

  if accepted_name_mode == 'accepted_name_usage_id'
    processed_taxon['acceptedNameUsageID'] = accepted_name_usage_id
    processed_taxon['acceptedNameUsage'] = accepted_name_usage
    processed_taxon['taxonomicStatus'] = taxonomic_status
  end

  # keep processed_taxon value during the merge when both processed_taxon
  # and taxon have a value for a key.
  processed_taxon.merge(taxon.slice(*PASSTHROUGH_FIELDS)) { |_key, processed_taxon_value, _taxon_value| processed_taxon_value }
end
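
Two details of this method are worth isolating: the walk up the parent chain until an exported ancestor is found, and the merge that lets already-assigned keys win. The sketch below shows both on plain hashes; the helper name and sample IDs are hypothetical.

```ruby
# Walk parent_id links until a taxon that is part of the export is found.
def nearest_exported_parent_sketch(taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id)
  current = taxon_name_info.dig(taxon_name_id, :parent_id)
  while current
    taxon_id = taxon_name_id_to_taxon_id[current]
    return taxon_id if taxon_id

    current = taxon_name_info.dig(current, :parent_id)
  end
  nil
end

info = {
  3 => { parent_id: 2 },  # species
  2 => { parent_id: 1 },  # genus, not in the export
  1 => { parent_id: nil } # family
}
exported = { 1 => 'uuid-family', 3 => 'uuid-species' }

p nearest_exported_parent_sketch(3, info, exported) # => "uuid-family"
p nearest_exported_parent_sketch(1, info, exported) # => nil

# Hash#merge with a block: the left-hand (processed) value wins on conflicts.
processed = { 'taxonID' => 'uuid-species' }
source = { 'taxonID' => 'stale', 'genus' => 'Aus' }
p processed.merge(source) { |_key, left, _right| left }
# => {"taxonID"=>"uuid-species", "genus"=>"Aus"}
```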

#build_processed_taxa(taxa_with_ids, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Array<Hash> (private)

Build final processed taxa with parent/accepted relationships.

Parameters:

  • taxa_with_ids (Array<Hash>)

    taxa with assigned IDs

  • taxon_name_info (Hash)

    taxon_name_id => { rank:, parent_id: }

  • taxon_name_id_to_taxon_id (Hash)

    taxon_name_id => taxonID

Returns:

  • (Array<Hash>)

    processed taxa ready for export



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 599

def build_processed_taxa(
  taxa_with_ids, taxon_name_info, taxon_name_id_to_taxon_id
)
  taxa_with_ids.filter_map do |item|
    build_final_taxon(
      item[:taxon],
      item[:taxon_id],
      item[:taxon_name_id],
      taxon_name_info,
      taxon_name_id_to_taxon_id
    )
  end
end

#collect_terminal_ids_for_batch(batch) ⇒ Array<Integer> (private)

Collect unique terminal taxon_name_ids from a batch of occurrence rows.

Parameters:

  • batch (Array<CSV::Row>)

    batch of occurrence rows

Returns:

  • (Array<Integer>)

    unique terminal taxon_name_ids



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 143

def collect_terminal_ids_for_batch(batch)
  batch.map { |row|
    occurrence_key = "#{row['dwc_occurrence_object_type']}:#{row['dwc_occurrence_object_id']}"

    tn_data = otu_to_taxon_name_data[occurrence_to_otu[occurrence_key]]
    next unless tn_data

    if accepted_name_mode == 'replace_with_accepted_name'
      tn_data[:cached_valid_taxon_name_id] || tn_data[:id]
    else
      tn_data[:id]
    end
  }.compact.uniq
end
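
The two constructor hashes are chained here: an occurrence key resolves to an OTU id, which resolves to TaxonName data. A minimal sketch of that chain with plain hashes follows; the sample keys and ids are made up, and the second mode string is only used as "anything other than replace_with_accepted_name".

```ruby
def terminal_ids_sketch(rows, occurrence_to_otu, otu_to_taxon_name_data, accepted_name_mode)
  rows.filter_map { |row|
    key = "#{row['dwc_occurrence_object_type']}:#{row['dwc_occurrence_object_id']}"
    tn_data = otu_to_taxon_name_data[occurrence_to_otu[key]]
    next unless tn_data # occurrence has no resolvable OTU / TaxonName

    if accepted_name_mode == 'replace_with_accepted_name'
      tn_data[:cached_valid_taxon_name_id] || tn_data[:id]
    else
      tn_data[:id]
    end
  }.uniq
end

rows = %w[1 2 3].map do |id|
  { 'dwc_occurrence_object_type' => 'CollectionObject', 'dwc_occurrence_object_id' => id }
end
occurrence_to_otu = { 'CollectionObject:1' => 100, 'CollectionObject:2' => 101 }
otu_data = {
  100 => { id: 7, cached_valid_taxon_name_id: 5 }, # synonym of name 5
  101 => { id: 5, cached_valid_taxon_name_id: 5 }  # valid name
}

p terminal_ids_sketch(rows, occurrence_to_otu, otu_data, 'replace_with_accepted_name') # => [5]
p terminal_ids_sketch(rows, occurrence_to_otu, otu_data, 'accepted_name_usage_id')     # => [7, 5]
```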

#determine_accepted_name_usage(taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Array (private)

Determine acceptedNameUsageID and taxonomicStatus for a taxon.

Parameters:

  • taxon (Hash)

    taxon data

  • taxon_id (String)

    assigned taxonID (an OTU UUID)

  • taxon_name_id (Integer)

    taxon name id for this taxon

  • taxon_name_info (Hash)

    taxon metadata keyed by taxon_name_id

  • taxon_name_id_to_taxon_id (Hash)

    taxon_name_id => taxonID

Returns:

  • (Array)

    [acceptedNameUsageID (a taxonID), taxonomicStatus, acceptedNameUsage]



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 705

def determine_accepted_name_usage(
  taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id
)
  return [nil, nil, nil] unless accepted_name_mode == 'accepted_name_usage_id'

  is_valid = taxon['taxon_name_cached_is_valid']

  if !is_valid.nil?
    return [taxon_id, 'accepted', taxon['scientificName']] if is_valid == true

    # This taxon is marked as invalid (synonym).
    valid_taxon_name_id = taxon['taxon_name_cached_valid_taxon_name_id']
    if valid_taxon_name_id.present?
      if valid_taxon_name_id == taxon_name_id &&
         taxon['taxon_name_gbif_taxonomic_status'].blank?
        return [taxon_id, 'accepted', taxon['scientificName']]
      end

      accepted_id = taxon_name_id_to_taxon_id[valid_taxon_name_id]
      accepted_name = taxon_name_info[valid_taxon_name_id]&.[](:scientific_name)
      # NOTE: accepted_id may be nil when the valid name has no OTU UUID
      # in this export - technically this is bad DwC checklist behavior:
      # https://ipt.gbif.org/manual/en/ipt/latest/best-practices-checklists#publishing-synonymy
      # "An dwc:acceptedNameUsageID must point to an existing record in
      # the dataset"
      status = taxon['taxon_name_gbif_taxonomic_status'] || 'synonym'
      [accepted_id, status, accepted_name]
    else
      [nil, nil, nil]
    end
  else
    # No validity data - this is an extracted higher taxon from rank columns.
    [taxon_id, 'accepted', taxon['scientificName']]
  end
end

#ensure_valid_names_for_synonyms(all_taxa, taxon_name_info = {}) ⇒ Hash (private)

Ensure valid names exist for all synonyms.

Parameters:

  • all_taxa (Hash)

    hash of taxon_name_id => taxon data

Returns:

  • (Hash)

    updated all_taxa with valid names added



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 463

def ensure_valid_names_for_synonyms(all_taxa, taxon_name_info = {})
  # Build lookup of missing valid names => a synonym template.
  valid_id_to_synonym = {}
  all_taxa.each_value do |taxon|
    next unless taxon['taxon_name_cached_is_valid'] == false
    valid_id = taxon['taxon_name_cached_valid_taxon_name_id']
    next unless valid_id.present? && !all_taxa[valid_id]
    # Yes, we're just picking out any one synonym here (cf. below):
    valid_id_to_synonym[valid_id] ||= taxon
  end

  return all_taxa if valid_id_to_synonym.empty?

  valid_ids = valid_id_to_synonym.keys
  valid_ancestor_lookup = build_ancestor_lookup(valid_ids)
  merge_taxon_name_info!(taxon_name_info, valid_ids, valid_ancestor_lookup)

  ::TaxonName.where(id: valid_ids).each do |valid_tn|
    rank = valid_tn.rank&.downcase
    next unless rank

    template_taxon = valid_id_to_synonym[valid_tn.id]
    next unless template_taxon

    # The template's rank columns (genus, family, higherClassification, etc.)
    # already reflect the valid name's classification, not the synonym's -
    # this is why it didn't matter *which* synonym of the valid name we
    # selected above.
    valid_taxon = template_taxon.dup
    valid_taxon['taxon_name_id'] = valid_tn.id
    valid_taxon['scientificName'] =
      taxon_name_info.dig(valid_tn.id, :scientific_name) || valid_tn.cached
    valid_taxon['taxonRank'] = rank
    valid_taxon['taxon_name_cached'] = valid_tn.cached
    valid_taxon['taxon_name_cached_is_valid'] = true
    valid_taxon['taxon_name_cached_valid_taxon_name_id'] = valid_tn.id

    taxon_name_info[valid_tn.id] = {
      rank: rank,
      parent_id: valid_tn.parent_id,
      scientific_name: self.class.combine_scientific_name(valid_tn.cached, valid_tn.cached_author_year),
      scientific_name_authorship: valid_tn.cached_author_year
    }

    normalize_accepted_name_usage_taxon(
      valid_taxon,
      rank,
      template_taxon['taxonRank']&.downcase,
      taxon_name_info: taxon_name_info[valid_tn.id]
    )

    all_taxa[valid_tn.id] = valid_taxon
    extract_accepted_name_usage_ancestor_taxa(
      valid_taxon,
      valid_tn.id,
      rank,
      valid_ancestor_lookup,
      all_taxa,
      taxon_name_info
    )
  end

  all_taxa
end

#extract_accepted_name_usage_ancestor_taxa(row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object (private)

Extract and add ancestor taxa for an accepted name added by ensure_valid_names_for_synonyms. Follows the same walk as extract_ancestor_taxa, but normalizes each ancestor with normalize_accepted_name_usage_taxon.



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 840

def extract_accepted_name_usage_ancestor_taxa(
  row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info = {}
)
  terminal_rank_index = ORDERED_RANKS.index(terminal_rank)
  return unless terminal_rank_index && terminal_rank_index > 0

  (0...terminal_rank_index).reverse_each do |i|
    higher_rank = ORDERED_RANKS[i]
    rank_taxon_name = row[higher_rank]
    next if rank_taxon_name.blank?

    ancestor_tn_id = ancestor_lookup["#{terminal_tn_id}:#{higher_rank}"]
    next unless ancestor_tn_id

    # All higher ancestors are in all_taxa if this one is.
    break if all_taxa[ancestor_tn_id]

    ancestor_taxon = row.to_h.dup
    ancestor_taxon['taxon_name_id'] = ancestor_tn_id
    ancestor_taxon['scientificName'] =
      taxon_name_info.dig(ancestor_tn_id, :scientific_name) || rank_taxon_name
    ancestor_taxon['taxonRank'] = higher_rank
    normalize_accepted_name_usage_taxon(
      ancestor_taxon,
      higher_rank,
      terminal_rank,
      taxon_name_info: taxon_name_info[ancestor_tn_id]
    )

    TAXON_NAME_METADATA_FIELDS.each { |f| ancestor_taxon[f] = nil }

    all_taxa[ancestor_tn_id] = ancestor_taxon
  end
end

#extract_ancestor_taxa(row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object (private)

Extract and add ancestor taxa from terminal taxon up to root.

Parameters:

  • row (CSV::Row)

    occurrence row

  • terminal_tn_id (Integer)

    terminal taxon_name_id

  • terminal_rank (String)

    rank of terminal taxon

  • ancestor_lookup (Hash)

    precomputed ancestor lookup

  • all_taxa (Hash)

    hash of taxon_name_id => taxon data



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 365

def extract_ancestor_taxa(
  row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info = {}
)
  # Synonyms have no parentNameUsageID and their hierarchy columns are
  # corrected from their own ancestry in fix_synonym_rank_columns. Skipping
  # here avoids creating ancestor rows with values from the valid name's row.
  return if row['taxon_name_cached_is_valid'] == false

  terminal_rank_index = ORDERED_RANKS.index(terminal_rank)
  return unless terminal_rank_index && terminal_rank_index > 0

  (0...terminal_rank_index).reverse_each do |i|
    higher_rank = ORDERED_RANKS[i]
    rank_taxon_name = row[higher_rank]
    next if rank_taxon_name.blank?

    ancestor_tn_id = ancestor_lookup["#{terminal_tn_id}:#{higher_rank}"]
    next unless ancestor_tn_id

    # Early termination: if this ancestor already exists, all higher ones do too.
    break if all_taxa[ancestor_tn_id]

    ancestor_taxon = row.to_h.dup
    ancestor_taxon['taxon_name_id'] = ancestor_tn_id
    ancestor_taxon['scientificName'] =
      taxon_name_info.dig(ancestor_tn_id, :scientific_name) || rank_taxon_name
    ancestor_taxon['taxonRank'] = higher_rank
    normalize_occurrence_taxon(
      ancestor_taxon,
      higher_rank,
      terminal_rank,
      taxon_name_info: taxon_name_info[ancestor_tn_id]
    )

    # Clear taxon_name_ metadata for extracted ancestors.
    TAXON_NAME_METADATA_FIELDS.each { |f| ancestor_taxon[f] = nil }

    all_taxa[ancestor_tn_id] = ancestor_taxon
  end
end
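
The walk from the terminal rank upward, including the early break, can be sketched on plain hashes. RANKS_FOR_ANCESTORS_SKETCH and the sample rows are assumptions; the sketch stores only name and rank and leaves out the normalization and metadata-clearing steps.

```ruby
# Hypothetical subset of ORDERED_RANKS, highest to lowest.
RANKS_FOR_ANCESTORS_SKETCH = %w[kingdom family genus species].freeze

def extract_ancestors_sketch(row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa)
  idx = RANKS_FOR_ANCESTORS_SKETCH.index(terminal_rank)
  return unless idx && idx > 0

  # Visit ranks from just above the terminal rank up to the highest rank.
  (0...idx).reverse_each do |i|
    rank = RANKS_FOR_ANCESTORS_SKETCH[i]
    name = row[rank]
    next if name.nil? || name.empty?

    ancestor_id = ancestor_lookup["#{terminal_tn_id}:#{rank}"]
    next unless ancestor_id

    # If this ancestor is known, the ones above it were added with it.
    break if all_taxa[ancestor_id]

    all_taxa[ancestor_id] = { 'scientificName' => name, 'taxonRank' => rank }
  end
end

lookup = { '30:genus' => 20, '30:family' => 10, '31:genus' => 21, '31:family' => 10 }
all_taxa = {}
extract_ancestors_sketch({ 'kingdom' => 'Animalia', 'family' => 'Aidae', 'genus' => 'Aus' },
                         30, 'species', lookup, all_taxa)
extract_ancestors_sketch({ 'kingdom' => 'Animalia', 'family' => 'Aidae', 'genus' => 'Bus' },
                         31, 'species', lookup, all_taxa)
p all_taxa.keys # => [20, 10, 21]
```

In the sample, kingdom is skipped because the lookup has no entry for it, and the second call stops at the family because it was already added by the first.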

#extract_parent_species_for_taxon(row, rank, terminal_tn_id, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object (private)

Extract parent species for infraspecific taxa.

Parameters:

  • row (CSV::Row)

    the occurrence row

  • rank (String)

    the rank of the infraspecific taxon

  • terminal_tn_id (Integer)

    the taxon_name_id of the infraspecific taxon

  • ancestor_lookup (Hash)

    the ancestor lookup hash

  • all_taxa (Hash)

    hash of taxon_name_id => taxon data (modified in place)



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 328

def extract_parent_species_for_taxon(
  row, rank, terminal_tn_id, ancestor_lookup, all_taxa, taxon_name_info = {}
)
  genus = row['genus']
  specific_epithet = row['specificEpithet']

  if genus.present? && specific_epithet.present?
    species_tn_id = ancestor_lookup["#{terminal_tn_id}:species"]
    return unless species_tn_id
    return if all_taxa[species_tn_id] # Already extracted

    # Create species taxon
    species_taxon = row.to_h.dup
    species_taxon['taxon_name_id'] = species_tn_id
    species_taxon['scientificName'] =
      taxon_name_info.dig(species_tn_id, :scientific_name) || "#{genus} #{specific_epithet}"
    species_taxon['taxonRank'] = 'species'
    normalize_occurrence_taxon(
      species_taxon,
      'species',
      rank,
      taxon_name_info: taxon_name_info[species_tn_id]
    )

    # Clear taxon_name_ metadata since this is an extracted parent.
    TAXON_NAME_METADATA_FIELDS.each { |f| species_taxon[f] = nil }

    all_taxa[species_tn_id] = species_taxon
  end
end

#extract_unique_taxa(parsed) ⇒ Array<Hash, Hash> (private)

Extract all unique taxa from occurrence data. Uses taxon_name_id as the key to handle homonyms correctly.

Parameters:

  • parsed (CSV::Table)

    parsed occurrence data

Returns:

  • (Array<Hash, Hash>)

    all_taxa and taxon_name_info hashes



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 122

def extract_unique_taxa(parsed)
  all_taxa = {}
  taxon_name_info = {}
  batch_size = 25_000

  parsed.each_slice(batch_size) do |batch|
    terminal_ids = collect_terminal_ids_for_batch(batch)
    ancestor_lookup = build_ancestor_lookup(terminal_ids)
    merge_taxon_name_info!(taxon_name_info, terminal_ids, ancestor_lookup)

    batch.each do |row|
      process_occurrence_row(row, all_taxa, ancestor_lookup, taxon_name_info)
    end
  end

  [all_taxa, taxon_name_info]
end

#fix_synonym_rank_columns(all_taxa, taxon_name_info = {}) ⇒ Hash (private)

Overwrite rank and classification columns for synonym rows using the synonym's own taxon_name_hierarchies ancestry, not the valid name's. DwcOccurrence stores the valid name's classification (via current_valid_taxon_name), so synonym rows otherwise inherit the wrong genus, family, taxonRank, etc.

All hierarchy columns (genus through kingdom, higherClassification) are corrected from the synonym's own ancestry — useful for understanding the historical classification the synonym was published under. parentNameUsageID is left empty for synonyms (handled in build_final_taxon), since synonyms do not participate in tree navigation.

Parameters:

  • all_taxa (Hash)

    hash of taxon_name_id => taxon data

Returns:

  • (Hash)

    updated all_taxa



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 419

def fix_synonym_rank_columns(all_taxa, taxon_name_info = {})
  synonym_tn_ids = all_taxa.each_value.filter_map { |taxon|
    taxon['taxon_name_id'] if taxon['taxon_name_cached_is_valid'] == false
  }
  return all_taxa if synonym_tn_ids.empty?

  # Query ancestors including self (self gives epithet and own rank).
  synonym_ancestors = Hash.new { |h, k| h[k] = {} }
  TaxonNameHierarchy
    .joins('JOIN taxon_names ON taxon_names.id = taxon_name_hierarchies.ancestor_id')
    .where(descendant_id: synonym_tn_ids)
    .pluck('taxon_name_hierarchies.descendant_id', 'taxon_names.rank_class', 'taxon_names.name')
    .each do |descendant_id, rank_class, name|
      rank = rank_class_to_name[rank_class]
      next unless rank
      synonym_ancestors[descendant_id][rank] = name
    end

  infraspecific_ranks = self.class.infraspecific_rank_names.to_set

  all_taxa.each_value do |taxon|
    next unless taxon['taxon_name_cached_is_valid'] == false
    ancestors = synonym_ancestors[taxon['taxon_name_id']]
    next if ancestors.empty?

    # Overwrite all rank columns from synonym's own hierarchy.
    HIGHER_RANK_COLUMNS.each { |col| taxon[col] = ancestors[col] }
    taxon['specificEpithet'] = ancestors['species']

    synonym_rank = (ORDERED_RANKS & ancestors.keys).last
    taxon['taxonRank'] = synonym_rank if synonym_rank
    taxon['infraspecificEpithet'] = infraspecific_ranks.include?(synonym_rank) ? ancestors[synonym_rank] : nil
    normalize_accepted_name_usage_taxon(
      taxon,
      taxon_name_info: taxon_name_info[taxon['taxon_name_id']]
    )
  end

  all_taxa
end
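
The expression (ORDERED_RANKS & ancestors.keys).last relies on Array#& preserving the order of its left operand. A small sketch with a hypothetical rank list shows why the result is the lowest rank present in the synonym's ancestry.

```ruby
# Hypothetical subset of ORDERED_RANKS, highest to lowest.
RANKS_HIGH_TO_LOW_SKETCH = %w[kingdom family genus species subspecies].freeze

# Array#& keeps the order of the left operand, so the last shared element
# is the lowest rank that appears among the ancestry keys.
def lowest_rank_sketch(ancestors)
  (RANKS_HIGH_TO_LOW_SKETCH & ancestors.keys).last
end

ancestors = { 'species' => 'bus', 'genus' => 'Aus', 'family' => 'Aidae' }
p lowest_rank_sketch(ancestors) # => "species"
p lowest_rank_sketch({})        # => nil
```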

#merge_taxon_name_info!(taxon_name_info, terminal_taxon_name_ids, ancestor_lookup) ⇒ Object (private)

Preload TaxonName metadata needed during normalization and final assembly.

Parameters:

  • taxon_name_info (Hash)

    hash to merge metadata into

  • terminal_taxon_name_ids (Array<Integer>)
  • ancestor_lookup (Hash)


# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 195

def merge_taxon_name_info!(taxon_name_info, terminal_taxon_name_ids, ancestor_lookup)
  ids = (terminal_taxon_name_ids + ancestor_lookup.values).uniq - taxon_name_info.keys
  merge_taxon_name_info_for_ids!(taxon_name_info, ids)
end
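
The id arithmetic here decides which TaxonName records still need loading. A minimal sketch with made-up ids (the helper name is hypothetical):

```ruby
# Terminal ids plus every ancestor id in the lookup, minus ids already loaded.
def ids_to_load_sketch(taxon_name_info, terminal_ids, ancestor_lookup)
  (terminal_ids + ancestor_lookup.values).uniq - taxon_name_info.keys
end

already_loaded = { 1 => { rank: 'family' } }
lookup = { '3:genus' => 2, '3:family' => 1, '4:genus' => 2 }

p ids_to_load_sketch(already_loaded, [3, 4], lookup) # => [3, 4, 2]
```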

#merge_taxon_name_info_for_ids!(taxon_name_info, ids) ⇒ Object (private)

Loads rank, parent_id, scientific name, and authorship for the given TaxonName ids (in slices of 25,000) and stores them in taxon_name_info. Ids already present in taxon_name_info are skipped.



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 200

def merge_taxon_name_info_for_ids!(taxon_name_info, ids)
  ids = ids.uniq - taxon_name_info.keys
  return if ids.empty?

  ids.each_slice(25_000) do |batch|
    ::TaxonName.where(id: batch)
      .pluck(:id, :rank_class, :parent_id, :cached, :cached_author_year)
      .each do |id, rank_class, parent_id, cached, cached_author_year|
        rank = rank_class_to_name[rank_class]&.downcase
        taxon_name_info[id] = {
          rank: rank,
          parent_id: parent_id,
          scientific_name: self.class.combine_scientific_name(cached, cached_author_year),
          scientific_name_authorship: cached_author_year
        }
      end
  end
end

#normalize ⇒ String, Hash

Main entry point. Parses the tab-separated occurrence CSV and returns a deduplicated, tab-separated taxon CSV.

Returns:

  • (String, Hash)

    normalized CSV (tab-separated) and the taxon_name_id => taxonID mapping



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 49

def normalize
  parsed = CSV.parse(@raw_csv, headers: true, col_sep: "\t")
  return ["\n", {}] if parsed.empty?

  all_taxa, taxon_name_info = extract_unique_taxa(parsed)

  if @accepted_name_mode == 'accepted_name_usage_id'
    all_taxa = ensure_valid_names_for_synonyms(all_taxa, taxon_name_info)
    all_taxa = fix_synonym_rank_columns(all_taxa, taxon_name_info)
  end

  # Build hierarchy and assign taxonIDs
  processed_taxa, taxon_name_id_to_taxon_id =
    assign_taxon_ids_and_build_hierarchy(all_taxa, taxon_name_info)

  all_taxa = nil # release memory

  processed_taxa = remove_empty_columns(processed_taxa)

  # Collect headers from all rows so taxa with extra columns (e.g. a
  # synonym row that gained infraspecificEpithet from its own hierarchy
  # while the first row never had that key) are not misaligned when written.
  output_headers = processed_taxa.each_with_object([]) do |taxon, headers|
    taxon.each_key { |k| headers << k unless headers.include?(k) }
  end

  csv_output = CSV.generate(col_sep: "\t") do |csv|
    csv << output_headers

    processed_taxa.each do |taxon|
      csv << output_headers.map { |h| taxon[h] }
    end
  end

  [csv_output, taxon_name_id_to_taxon_id]
end
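
The header handling at the end of normalize can be exercised on its own. This sketch builds the union of keys across all rows in first-seen order and writes tab-separated output with Ruby's CSV library; the helper name and sample rows are hypothetical.

```ruby
require 'csv'

def rows_to_tsv_sketch(rows)
  # Union of keys across all rows, in first-seen order.
  headers = rows.each_with_object([]) do |row, acc|
    row.each_key { |key| acc << key unless acc.include?(key) }
  end

  CSV.generate(col_sep: "\t") do |csv|
    csv << headers
    # Rows missing a header get an empty field, so columns stay aligned.
    rows.each { |row| csv << headers.map { |h| row[h] } }
  end
end

rows = [
  { 'taxonID' => 'uuid-1', 'scientificName' => 'Aus bus' },
  { 'taxonID' => 'uuid-2', 'scientificName' => 'Aus bus cus', 'infraspecificEpithet' => 'cus' }
]
puts rows_to_tsv_sketch(rows)
```

Taking the headers from the first row alone would drop infraspecificEpithet here, which is the misalignment the comment in normalize describes.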

#normalize_accepted_name_usage_taxon(taxon, current_rank = nil, original_rank = nil, taxon_name_info: nil) ⇒ Object (private)

Normalizes a taxon row introduced or rewritten during accepted-name-usage handling. This covers both corrected synonym rows and accepted rows synthesized from those synonyms.

Like occurrence-stage normalization, a rank transition clears lower-rank columns and taxon-specific fields that do not apply to the resulting rank, then recomputes the normalized fields from the row's own taxon metadata. Callers may omit the ranks to use the row's current taxonRank unchanged.

Parameters:

  • taxon (Hash)

    the taxon data hash to modify

  • current_rank (String, nil) (defaults to: nil)

    the rank represented by the row after accepted-name-usage normalization; defaults from taxon

  • original_rank (String, nil) (defaults to: nil)

    the source rank before any accepted-name rewrite; defaults to current_rank when omitted

  • taxon_name_info (Hash, nil) (defaults to: nil)

    the row's single TaxonName metadata entry (for example { scientific_name_authorship: ... }) used to repopulate normalized fields such as authorship



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 788

def normalize_accepted_name_usage_taxon(
  taxon, current_rank = nil, original_rank = nil, taxon_name_info: nil
)
  current_rank ||= taxon['taxonRank']&.downcase
  original_rank ||= current_rank
  normalize_taxon_for_rank_transition(
    taxon,
    current_rank,
    original_rank,
    taxon_name_info: taxon_name_info
  )
end
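
The rank defaulting described above can be sketched on its own. The helper name `resolve_ranks` is hypothetical and exists only for this illustration; it mirrors the two `||=` lines of the method and returns the pair that would be passed on.

```ruby
# Hypothetical helper: resolves the (current_rank, original_rank) pair
# the same way the two `||=` assignments above do.
def resolve_ranks(taxon, current_rank = nil, original_rank = nil)
  current_rank ||= taxon['taxonRank']&.downcase
  original_rank ||= current_rank
  [current_rank, original_rank]
end

resolve_ranks({ 'taxonRank' => 'Species' })                   # both default from the row
resolve_ranks({ 'taxonRank' => 'Species' }, 'genus')          # original follows current
resolve_ranks({ 'taxonRank' => 'Species' }, 'genus', 'species')
```

When both ranks are omitted they resolve to the same value, so no rank transition is triggered and only the normalized fields are recomputed.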

#normalize_occurrence_taxon(taxon, current_rank, original_rank = nil, taxon_name_info: nil) ⇒ Object (private)

Normalizes a taxon row produced during the occurrence-driven stage. This covers both terminal occurrence-backed rows and taxa extracted from those rows (for example parent species or higher ancestors).

When current_rank differs from original_rank, it:

  • clears rank columns below current_rank
  • clears taxon-specific fields not applicable to the extracted rank

It always recomputes normalized fields from the taxon's own metadata and preserves all other fields as-is. In particular, it does not rewrite row identity fields such as taxonRank or scientificName; callers are expected to set those before calling this method.

Parameters:

  • taxon (Hash)

    the taxon data hash to modify

  • current_rank (String)

    the rank represented by the row after occurrence-stage extraction/normalization

  • original_rank (String, nil) (defaults to: nil)

    the source rank before any extraction; use the same value as current_rank for terminal rows

  • taxon_name_info (Hash, nil) (defaults to: nil)

    the row's single TaxonName metadata entry (for example { scientific_name_authorship: ... }) used to repopulate normalized fields such as authorship



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 761

def normalize_occurrence_taxon(
  taxon, current_rank, original_rank = nil, taxon_name_info: nil
)
  normalize_taxon_for_rank_transition(
    taxon,
    current_rank,
    original_rank,
    taxon_name_info: taxon_name_info
  )
end

#normalize_taxon_for_rank_transition(taxon, current_rank, original_rank = nil, taxon_name_info: nil) ⇒ Object (private)



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 801

def normalize_taxon_for_rank_transition(
  taxon, current_rank, original_rank = nil, taxon_name_info: nil
)
  current_id = ORDERED_RANKS.index(current_rank)
  return unless current_id

  if current_rank == original_rank
    populate_normalized_taxon_fields(taxon, taxon_name_info)
    return
  end

  # Clear lower rank columns
  ORDERED_RANKS[(current_id + 1)..-1].each do |lower_rank|
    taxon[lower_rank] = nil
  end

  # Fields to keep for extracted taxa
  rank_columns = ORDERED_RANKS.map(&:to_s)
  fields_to_keep =
    rank_columns + ['scientificName', 'taxonRank', 'nomenclaturalCode']

  # Add epithet fields based on rank
  if current_rank == 'species'
    fields_to_keep << 'specificEpithet'
  elsif self.class.infraspecific_rank_names.include?(current_rank)
    fields_to_keep << 'specificEpithet'
    fields_to_keep << 'infraspecificEpithet'
  end

  # Clear all other taxon-specific fields
  Data::CHECKLIST_TAXON_EXTENSION_COLUMNS.keys.each do |field|
    field_str = field.to_s
    field_str = 'class' if field == :dwcClass
    taxon[field_str] = nil unless fields_to_keep.include?(field_str)
  end

  populate_normalized_taxon_fields(taxon, taxon_name_info)
end
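
The lower-rank clearing step can be shown with a small standalone sketch. The rank list below is a shortened, hypothetical stand-in for ORDERED_RANKS, and `clear_below` is an illustrative name, not a method of this class.

```ruby
# Clears every rank column below current_rank, as the transition step does.
# `ranks` is passed in explicitly; here it is a hypothetical shortened list.
def clear_below(taxon, current_rank, ranks:)
  idx = ranks.index(current_rank)
  return taxon unless idx # unknown rank: leave the row untouched

  ranks[(idx + 1)..-1].each { |lower_rank| taxon[lower_rank] = nil }
  taxon
end

ranks = %w[kingdom family genus species]
row = { 'kingdom' => 'Animalia', 'family' => 'Aidae', 'genus' => 'Aus', 'species' => 'Aus bus' }

# Extracting the genus-level parent from a species-level row.
parent = clear_below(row.dup, 'genus', ranks: ranks)
```

The row is duplicated before clearing so the terminal row keeps its own lower-rank values.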

#populate_normalized_taxon_fields(taxon, taxon_name_info = nil) ⇒ Object (private)



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 680

def populate_normalized_taxon_fields(taxon, taxon_name_info = nil)
  taxon['scientificNameAuthorship'] = taxon_name_info&.[](:scientific_name_authorship)
  # The original higherClassification may include more ranks than the
  # checklist does, so just always recompute it.
  taxon['higherClassification'] = recompute_higher_classification(taxon)
end

#process_occurrence_row(row, all_taxa, ancestor_lookup, taxon_name_info = {}) ⇒ Object (private)

Process a single occurrence row and extract its taxa.

Parameters:

  • row (CSV::Row)

    occurrence row

  • all_taxa (Hash)

    hash of taxon_name_id => taxon data

  • ancestor_lookup (Hash)

    precomputed ancestor lookup



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 245

def process_occurrence_row(row, all_taxa, ancestor_lookup, taxon_name_info = {})
  occurrence_key = "#{row['dwc_occurrence_object_type']}:#{row['dwc_occurrence_object_id']}"

  tn_data = otu_to_taxon_name_data[occurrence_to_otu[occurrence_key]]
  return unless tn_data

  terminal_tn_id = if accepted_name_mode == 'replace_with_accepted_name'
    tn_data[:cached_valid_taxon_name_id] || tn_data[:id]
  else
    tn_data[:id]
  end
  return unless terminal_tn_id

  store_taxon_name_metadata(row, tn_data) if accepted_name_mode == 'accepted_name_usage_id'

  terminal_rank = row['taxonRank']&.downcase
  return unless terminal_rank.present? && row['scientificName'].present?

  add_terminal_taxon(
    row, terminal_tn_id, terminal_rank, all_taxa, ancestor_lookup, taxon_name_info
  )
  extract_ancestor_taxa(
    row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info
  )
end
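
How the terminal taxon name id depends on the accepted-name mode can be sketched separately. The helper name and the sample ids below are hypothetical; the branch itself follows the method above.

```ruby
# Hypothetical helper mirroring the terminal_tn_id selection above.
def terminal_taxon_name_id(tn_data, accepted_name_mode)
  if accepted_name_mode == 'replace_with_accepted_name'
    # Prefer the valid (accepted) name; fall back to the name itself.
    tn_data[:cached_valid_taxon_name_id] || tn_data[:id]
  else
    tn_data[:id]
  end
end

# A synonym (id 10) whose accepted name has id 7; both ids are made up.
synonym = { id: 10, cached_valid_taxon_name_id: 7 }

replaced = terminal_taxon_name_id(synonym, 'replace_with_accepted_name')
kept     = terminal_taxon_name_id(synonym, 'accepted_name_usage_id')
```

In `replace_with_accepted_name` mode the synonym collapses onto its accepted name, while in `accepted_name_usage_id` mode it keeps its own row.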

#rank_class_to_name ⇒ Hash (private)

Cached mapping of rank_class to rank_name.

Returns:

  • (Hash)

    rank_class string => rank_name string



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 221

def rank_class_to_name
  @rank_class_to_name ||= begin
    mapping = {}

    # Get all NomenclaturalRank classes from all codes
    [
      NomenclaturalRank::Iczn,
      NomenclaturalRank::Icn,
      NomenclaturalRank::Icnp,
      NomenclaturalRank::Icvcn
    ].each do |code_module|
      code_module.ordered_ranks.each do |rank_class|
        mapping[rank_class.name] = rank_class.rank_name
      end
    end

    mapping
  end
end

#recompute_higher_classification(taxon) ⇒ Object (private)



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 687

def recompute_higher_classification(taxon)
  rank = taxon['taxonRank']&.downcase
  rank_index = ORDERED_RANKS.index(rank)
  return taxon['higherClassification'] unless rank_index

  classification_parts = ORDERED_RANKS[0...rank_index]
    .filter_map { |r| HIGHER_RANK_COLUMNS.include?(r) ? taxon[r].presence : nil }

  classification_parts.empty? ? nil : classification_parts.join(Export::Dwca::DELIMITER)
end
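
A plain-Ruby sketch of the same recomputation follows. It makes several assumptions: the rank lists are shortened stand-ins for ORDERED_RANKS and HIGHER_RANK_COLUMNS, the `' | '` delimiter is a guess at Export::Dwca::DELIMITER, and a manual blank check replaces ActiveSupport's `presence`.

```ruby
# Rebuilds higherClassification from the rank columns above the row's own rank.
# All keyword arguments are illustrative stand-ins, not the class's constants.
def higher_classification(taxon, ordered_ranks:, higher_rank_columns:, delimiter: ' | ')
  rank_index = ordered_ranks.index(taxon['taxonRank']&.downcase)
  # Unknown rank: keep whatever value the row already had.
  return taxon['higherClassification'] unless rank_index

  parts = ordered_ranks[0...rank_index].filter_map do |rank|
    next unless higher_rank_columns.include?(rank)

    value = taxon[rank]
    value unless value.nil? || value.strip.empty?
  end

  parts.empty? ? nil : parts.join(delimiter)
end

ranks  = %w[kingdom family genus species]
higher = %w[kingdom family genus]
taxon  = { 'taxonRank' => 'Species', 'kingdom' => 'Animalia', 'family' => '',
           'genus' => 'Aus', 'species' => 'Aus bus' }

classification = higher_classification(taxon, ordered_ranks: ranks, higher_rank_columns: higher)
```

Blank rank columns are skipped rather than emitted as empty segments, and the row's own rank is never part of its higher classification.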

#remove_empty_columns(taxa) ⇒ Array<Hash> (private)

Remove columns that are completely empty across all taxa

Parameters:

  • taxa (Array<Hash>)

    array of taxon hashes

Returns:

  • (Array<Hash>)

    taxa with empty columns removed



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 93

def remove_empty_columns(taxa)
  return taxa if taxa.empty?

  # Required columns that should never be removed, even if empty
  required_columns = %w[id taxonID scientificName taxonRank].to_set

  # Find which columns have at least one non-empty value
  columns_with_data = Set.new

  taxa.each do |taxon|
    taxon.each do |key, value|
      next if columns_with_data.include?(key)

      if required_columns.include?(key) || value.present?
        columns_with_data << key
      end
    end
  end

  # Filter each taxon to only include columns with data
  taxa.map do |taxon|
    taxon.select { |key, _| columns_with_data.include?(key) }
  end
end
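
The column-pruning behavior can be reproduced without Rails. This is a sketch under the assumption that "empty" means nil or whitespace-only (a stand-in for ActiveSupport's `present?`); `drop_empty_columns` is a hypothetical name.

```ruby
require 'set'

# Keeps required columns plus any column with at least one non-blank value.
def drop_empty_columns(rows, required: %w[id taxonID scientificName taxonRank])
  keep = Set.new(required)

  rows.each do |row|
    row.each { |key, value| keep << key unless value.nil? || value.to_s.strip.empty? }
  end

  rows.map { |row| row.select { |key, _| keep.include?(key) } }
end

rows = [
  { 'taxonID' => 'a', 'tribe' => nil, 'genus' => 'Aus', 'taxonRank' => '' },
  { 'taxonID' => 'b', 'tribe' => '',  'genus' => nil,   'taxonRank' => '' }
]

pruned = drop_empty_columns(rows)
```

Here `tribe` is dropped because no row has a value for it, `genus` survives because one row does, and `taxonRank` survives despite being blank because it is required.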

#store_taxon_name_metadata(row, tn_data) ⇒ Object (private)

Store TaxonName metadata in row for accepted_name_usage_id mode

Parameters:

  • row (CSV::Row)

    occurrence row to modify

  • tn_data (Hash)

    taxon name data



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 274

def store_taxon_name_metadata(row, tn_data)
  return unless tn_data[:cached].present?

  row['taxon_name_cached'] = tn_data[:cached]
  row['taxon_name_cached_is_valid'] = tn_data[:cached_is_valid]
  row['taxon_name_cached_valid_taxon_name_id'] = tn_data[:cached_valid_taxon_name_id]
  row['taxon_name_gbif_taxonomic_status'] = tn_data[:gbif_taxonomic_status]
end

#taxon_name_id_to_otu_uuid(taxon_name_ids) ⇒ Hash (private)

Build a mapping of taxon_name_id => OTU UUID for the given taxon_name_ids. Only includes taxa that have an OTU with a Uuid identifier.

Parameters:

  • taxon_name_ids (Array<Integer>)

Returns:

  • (Hash)

    taxon_name_id => uuid string



# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 579

def taxon_name_id_to_otu_uuid(taxon_name_ids)
  return {} if taxon_name_ids.empty?

  taxon_name_ids.each_slice(25_000).each_with_object({}) do |batch, result|
    ::Otu
      .joins("JOIN identifiers ON identifiers.identifier_object_id = otus.id
                AND identifiers.identifier_object_type = 'Otu'
                AND identifiers.type LIKE 'Identifier::Global::Uuid%'
                AND identifiers.position = 1")
      .where(taxon_name_id: batch)
      .pluck('otus.taxon_name_id', 'identifiers.cached')
      .each { |tn_id, uuid| result[tn_id] = uuid }
  end
end
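
The batching pattern used above can be sketched without a database. In this illustration `batched_lookup` is a hypothetical helper and the block stands in for the ActiveRecord query, returning `[id, uuid]` pairs for one batch; the sample ids and UUID strings are made up.

```ruby
# Splits ids into slices and merges the [id, uuid] pairs returned per slice.
# The block is a stand-in for the per-batch database query.
def batched_lookup(ids, batch_size: 25_000)
  return {} if ids.empty?

  ids.each_slice(batch_size).each_with_object({}) do |batch, result|
    yield(batch).each { |id, uuid| result[id] = uuid }
  end
end

fake_store = { 1 => 'uuid-1', 3 => 'uuid-3' } # id 2 has no UUID
calls = []

mapping = batched_lookup([1, 2, 3], batch_size: 2) do |batch|
  calls << batch
  batch.filter_map { |id| [id, fake_store[id]] if fake_store.key?(id) }
end
```

Ids without a matching identifier are simply absent from the result, which matches the documented behavior that only taxa with a UUID are included.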