Class: Export::Dwca::Checklist::OccurrenceNormalizer
- Inherits:
- Object
- Defined in:
- lib/export/dwca/checklist/occurrence_normalizer.rb
Overview
Service object that normalizes an occurrence-based CSV into a deduplicated, taxon-based CSV with OTU UUID taxonIDs and parent/child relationships.
Handles:
- Extracting unique taxa from occurrence data
- Building taxonomic hierarchy from rank columns
- Assigning OTU UUID taxonIDs
- Creating parentNameUsageID relationships
- Handling synonyms in accepted_name_usage_id mode
Constant Summary
- ORDERED_RANKS =
Array of rank strings in hierarchical order (highest to lowest).
Data::ORDERED_RANKS
- PASSTHROUGH_FIELDS =
DwC Taxon term column names allowed in the final output row. Derived from the checklist taxon extension definition values (post-header conversion names, e.g. 'class' not 'dwcClass'); anything not listed here (internal bookkeeping, future DwcOccurrence columns, etc.) is automatically excluded at finalization.
Data::CHECKLIST_TAXON_EXTENSION_COLUMNS.values.map(&:to_s).freeze
- HIGHER_RANK_COLUMNS =
DwcOccurrence column names that hold rank-named classification values (as opposed to epithet columns like specificEpithet).
%w[kingdom phylum class order superfamily family subfamily tribe subtribe genus subgenus].freeze
- TAXON_NAME_METADATA_FIELDS =
Fields written by store_taxon_name_metadata. Cleared for extracted parent taxa (which have no occurrence data of their own) and stripped from all rows during finalization.
%w[ taxon_name_cached taxon_name_cached_is_valid taxon_name_cached_valid_taxon_name_id taxon_name_gbif_taxonomic_status ].freeze
Instance Attribute Summary
-
#accepted_name_mode ⇒ Object
readonly
private
Returns the value of attribute accepted_name_mode.
-
#occurrence_to_otu ⇒ Object
readonly
private
Returns the value of attribute occurrence_to_otu.
-
#otu_to_taxon_name_data ⇒ Object
readonly
private
Returns the value of attribute otu_to_taxon_name_data.
-
#raw_csv ⇒ Object
readonly
private
Returns the value of attribute raw_csv.
Class Method Summary
- .combine_scientific_name(cached, cached_author_year) ⇒ Object
-
.infraspecific_rank_names ⇒ Object
Get all infraspecific rank names.
Instance Method Summary
-
#add_terminal_taxon(row, terminal_tn_id, terminal_rank, all_taxa, ancestor_lookup, taxon_name_info = {}) ⇒ Object
private
Add terminal taxon to all_taxa if not already present.
-
#assign_taxon_ids_and_build_hierarchy(all_taxa, taxon_name_info) ⇒ Array
private
Assign OTU UUID taxonIDs and parentNameUsageIDs to all taxa.
-
#assign_taxon_uuids(all_taxa, taxon_name_info) ⇒ Array
private
Assign OTU UUID taxonIDs to all taxa, grouped by rank.
-
#build_ancestor_lookup(terminal_taxon_name_ids) ⇒ Hash
private
Build a lookup hash for ancestor taxon_name_ids from terminal taxon_name_ids.
-
#build_final_taxon(taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Hash
private
Build a single processed taxon with all relationships.
-
#build_processed_taxa(taxa_with_ids, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Array<Hash>
private
Build final processed taxa with parent/accepted relationships.
-
#collect_terminal_ids_for_batch(batch) ⇒ Array<Integer>
private
Collect unique terminal taxon_name_ids from a batch of occurrence rows.
-
#determine_accepted_name_usage(taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Array
private
Determine acceptedNameUsageID and taxonomicStatus for a taxon.
-
#ensure_valid_names_for_synonyms(all_taxa, taxon_name_info = {}) ⇒ Hash
private
Ensure valid names exist for all synonyms.
- #extract_accepted_name_usage_ancestor_taxa(row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object private
-
#extract_ancestor_taxa(row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object
private
Extract and add ancestor taxa from terminal taxon up to root.
-
#extract_parent_species_for_taxon(row, rank, terminal_tn_id, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object
private
Extract parent species for infraspecific taxa.
-
#extract_unique_taxa(parsed) ⇒ Array<Hash, Hash>
private
Extract all unique taxa from occurrence data.
-
#fix_synonym_rank_columns(all_taxa, taxon_name_info = {}) ⇒ Hash
private
Overwrite rank and classification columns for synonym rows using the synonym's own taxon_name_hierarchies ancestry, not the valid name's.
-
#initialize(raw_csv:, accepted_name_mode:, otu_to_taxon_name_data:, occurrence_to_otu:) ⇒ OccurrenceNormalizer
constructor
A new instance of OccurrenceNormalizer.
-
#merge_taxon_name_info!(taxon_name_info, terminal_taxon_name_ids, ancestor_lookup) ⇒ Object
private
Preload TaxonName metadata needed during normalization and final assembly.
- #merge_taxon_name_info_for_ids!(taxon_name_info, ids) ⇒ Object private
-
#normalize ⇒ String, Hash
Main entry point - normalizes taxonomy CSV.
-
#normalize_accepted_name_usage_taxon(taxon, current_rank = nil, original_rank = nil, taxon_name_info: nil) ⇒ Object
private
Normalizes a taxon row introduced or rewritten during accepted-name-usage handling.
-
#normalize_occurrence_taxon(taxon, current_rank, original_rank = nil, taxon_name_info: nil) ⇒ Object
private
Normalizes a taxon row produced during the occurrence-driven stage.
- #normalize_taxon_for_rank_transition(taxon, current_rank, original_rank = nil, taxon_name_info: nil) ⇒ Object private
- #populate_normalized_taxon_fields(taxon, taxon_name_info = nil) ⇒ Object private
-
#process_occurrence_row(row, all_taxa, ancestor_lookup, taxon_name_info = {}) ⇒ Object
private
Process a single occurrence row and extract its taxa.
-
#rank_class_to_name ⇒ Hash
private
Cached mapping of rank_class to rank_name.
- #recompute_higher_classification(taxon) ⇒ Object private
-
#remove_empty_columns(taxa) ⇒ Array<Hash>
private
Remove columns that are completely empty across all taxa.
-
#store_taxon_name_metadata(row, tn_data) ⇒ Object
private
Store TaxonName metadata in row for accepted_name_usage_id mode.
-
#taxon_name_id_to_otu_uuid(taxon_name_ids) ⇒ Hash
private
Build a mapping of taxon_name_id => OTU UUID for the given taxon_name_ids.
Constructor Details
#initialize(raw_csv:, accepted_name_mode:, otu_to_taxon_name_data:, occurrence_to_otu:) ⇒ OccurrenceNormalizer
Returns a new instance of OccurrenceNormalizer.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 40
def initialize(raw_csv:, accepted_name_mode:, otu_to_taxon_name_data:, occurrence_to_otu:)
  @raw_csv = raw_csv
  @accepted_name_mode = accepted_name_mode
  @otu_to_taxon_name_data = otu_to_taxon_name_data
  @occurrence_to_otu = occurrence_to_otu
end
Instance Attribute Details
#accepted_name_mode ⇒ Object (readonly, private)
Returns the value of attribute accepted_name_mode.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 88
def accepted_name_mode
  @accepted_name_mode
end
#occurrence_to_otu ⇒ Object (readonly, private)
Returns the value of attribute occurrence_to_otu.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 88
def occurrence_to_otu
  @occurrence_to_otu
end
#otu_to_taxon_name_data ⇒ Object (readonly, private)
Returns the value of attribute otu_to_taxon_name_data.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 88
def otu_to_taxon_name_data
  @otu_to_taxon_name_data
end
#raw_csv ⇒ Object (readonly, private)
Returns the value of attribute raw_csv.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 88
def raw_csv
  @raw_csv
end
Class Method Details
.combine_scientific_name(cached, cached_author_year) ⇒ Object
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 890
def self.combine_scientific_name(cached, cached_author_year)
  [cached, cached_author_year].compact_blank.join(' ').presence
end
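The method depends on ActiveSupport's `compact_blank` and `presence`. As a rough illustration of the resulting behavior, here is a plain-Ruby approximation; the helper name `combine_name` is hypothetical, and "blank" is assumed to mean nil or whitespace-only:

```ruby
# Plain-Ruby approximation of combine_scientific_name (no ActiveSupport).
# `combine_name` is a hypothetical name used only for this illustration.
def combine_name(cached, cached_author_year)
  # Drop nil and whitespace-only parts, then join what remains.
  parts = [cached, cached_author_year].reject { |s| s.nil? || s.strip.empty? }
  joined = parts.join(' ')
  joined.empty? ? nil : joined
end

full_name  = combine_name('Aus bus', '(Smith, 1900)')
name_only  = combine_name('Aus bus', nil)
no_name    = combine_name(nil, '')
```

Under these assumptions the name and authorship are joined with a single space, a missing authorship leaves the bare name, and two blank inputs give nil.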
.infraspecific_rank_names ⇒ Object
Get all infraspecific rank names
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 876
def self.infraspecific_rank_names
  @infraspecific_rank_names ||= begin
    [
      ::NomenclaturalRank::Iczn::SpeciesGroup,
      ::NomenclaturalRank::Icn::SpeciesAndInfraspeciesGroup,
      ::NomenclaturalRank::Icnp::SpeciesGroup
    ].flat_map { |group|
      ranks = group.ordered_ranks.map(&:rank_name)
      species_idx = ranks.index('species')
      species_idx ? ranks[(species_idx + 1)..] : []
    }.uniq
  end
end
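The slicing logic can be tried in isolation. In this sketch the rank lists are hypothetical stand-ins for whatever `ordered_ranks.map(&:rank_name)` returns in the application; only the "everything after 'species'" step mirrors the method:

```ruby
# Hypothetical per-code rank orderings (NOT the real NomenclaturalRank data).
groups = [
  %w[species subspecies],
  %w[species subspecies variety form],
  %w[genus] # a group without 'species' contributes nothing
]

infraspecific = groups.flat_map { |ranks|
  idx = ranks.index('species')
  idx ? ranks[(idx + 1)..] : []
}.uniq
```

With these sample lists the result is the deduplicated set of ranks found below species, in first-seen order.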
Instance Method Details
#add_terminal_taxon(row, terminal_tn_id, terminal_rank, all_taxa, ancestor_lookup, taxon_name_info = {}) ⇒ Object (private)
Add terminal taxon to all_taxa if not already present.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 289
def add_terminal_taxon(
  row, terminal_tn_id, terminal_rank, all_taxa, ancestor_lookup,
  taxon_name_info = {}
)
  return if all_taxa[terminal_tn_id]

  taxon = row.to_h
  taxon['taxon_name_id'] = terminal_tn_id

  # In accepted_name_usage_id mode, use the original full name if available.
  if row['taxon_name_cached'].present?
    taxon['scientificName'] = self.class.combine_scientific_name(
      row['taxon_name_cached'],
      taxon_name_info.dig(terminal_tn_id, :scientific_name_authorship)
    )
  end

  normalize_occurrence_taxon(
    taxon, terminal_rank, terminal_rank,
    taxon_name_info: taxon_name_info[terminal_tn_id]
  )

  all_taxa[terminal_tn_id] = taxon

  # Extract parent species for infraspecific taxa.
  if self.class.infraspecific_rank_names.include?(terminal_rank)
    extract_parent_species_for_taxon(
      row, terminal_rank, terminal_tn_id, ancestor_lookup, all_taxa, taxon_name_info
    )
  end
end
#assign_taxon_ids_and_build_hierarchy(all_taxa, taxon_name_info) ⇒ Array (private)
Assign OTU UUID taxonIDs and parentNameUsageIDs to all taxa.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 532
def assign_taxon_ids_and_build_hierarchy(all_taxa, taxon_name_info)
  taxa_with_ids, taxon_name_id_to_taxon_id =
    assign_taxon_uuids(all_taxa, taxon_name_info)

  processed_taxa = build_processed_taxa(
    taxa_with_ids, taxon_name_info, taxon_name_id_to_taxon_id
  )

  [processed_taxa, taxon_name_id_to_taxon_id]
end
#assign_taxon_uuids(all_taxa, taxon_name_info) ⇒ Array (private)
Assign OTU UUID taxonIDs to all taxa, grouped by rank. Taxa without an OTU UUID identifier are excluded from the export.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 547
def assign_taxon_uuids(all_taxa, taxon_name_info)
  uuid_map = taxon_name_id_to_otu_uuid(all_taxa.keys)

  taxon_name_id_to_taxon_id = {}
  taxa_with_ids = []

  # Orderings here determine the final CSV row ordering.
  ORDERED_RANKS.each do |rank|
    rank_taxa = all_taxa.select { |tn_id, taxon|
      taxon_name_info[tn_id]&.[](:rank) == rank
    }.sort_by { |tn_id, taxon| taxon['scientificName'] || '' }

    rank_taxa.each do |tn_id, taxon|
      next if taxon_name_id_to_taxon_id[tn_id]

      uuid = uuid_map[tn_id]
      next unless uuid

      taxon_name_id_to_taxon_id[tn_id] = uuid
      taxa_with_ids << {
        taxon: taxon,
        taxon_id: uuid,
        taxon_name_id: tn_id,
        rank: rank
      }
    end
  end

  [taxa_with_ids, taxon_name_id_to_taxon_id]
end
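The ordering and exclusion rules (rank order first, then scientificName, dropping taxa without a UUID) can be illustrated with a small standalone sketch. The rank subset, IDs, names, and UUID strings below are all made up for the example:

```ruby
ordered_ranks = %w[kingdom family genus species] # illustrative subset only

all_taxa = {
  3 => { 'scientificName' => 'Bus cus' },
  1 => { 'scientificName' => 'Aidae' },
  2 => { 'scientificName' => 'Bus aus' },
  4 => { 'scientificName' => 'Bus' }
}
taxon_name_info = {
  1 => { rank: 'family' }, 2 => { rank: 'species' },
  3 => { rank: 'species' }, 4 => { rank: 'genus' }
}
# Taxon 3 has no OTU UUID, so it should be excluded from the output.
uuid_map = { 1 => 'uuid-1', 2 => 'uuid-2', 4 => 'uuid-4' }

rows = ordered_ranks.flat_map do |rank|
  all_taxa
    .select { |id, _| taxon_name_info.dig(id, :rank) == rank }
    .sort_by { |_, taxon| taxon['scientificName'] || '' }
    .filter_map { |id, taxon| [uuid_map[id], taxon['scientificName']] if uuid_map[id] }
end
```

Here the family row comes first, then the genus, then the one species that has a UUID; the species without a UUID never reaches the output.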
#build_ancestor_lookup(terminal_taxon_name_ids) ⇒ Hash (private)
Build a lookup hash for ancestor taxon_name_ids from terminal taxon_name_ids.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 163
def build_ancestor_lookup(terminal_taxon_name_ids)
  return {} if terminal_taxon_name_ids.empty?

  lookup = {}

  # Query taxon_name_hierarchies WITH join to get rank_class in single query
  hierarchy_relationships = TaxonNameHierarchy
    .joins('JOIN taxon_names ON taxon_names.id = taxon_name_hierarchies.ancestor_id')
    .where(descendant_id: terminal_taxon_name_ids)
    .where.not('ancestor_id = descendant_id') # exclude self-references
    .pluck('taxon_name_hierarchies.descendant_id',
           'taxon_name_hierarchies.ancestor_id',
           'taxon_names.rank_class')

  # Build lookup hash: "terminal_id:rank" => ancestor_id
  hierarchy_relationships.each do |descendant_id, ancestor_id, rank_class|
    next unless rank_class

    rank = rank_class_to_name[rank_class]
    next unless rank

    key = "#{descendant_id}:#{rank}"
    lookup[key] = ancestor_id
  end

  lookup
end
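To make the shape of the lookup concrete, the following sketch builds the same "terminal_id:rank" keys from in-memory tuples instead of a database query. The tuples and the `rank_class_to_name` mapping are hypothetical; real rank_class values in the application will look different:

```ruby
# Hypothetical (descendant_id, ancestor_id, rank_class) tuples standing in
# for the rows returned by the pluck call.
rows = [
  [10, 7, 'Genus'],
  [10, 3, 'Family'],
  [10, 1, nil], # ancestor without a rank_class is skipped
  [11, 7, 'Genus']
]
rank_class_to_name = { 'Genus' => 'genus', 'Family' => 'family' } # assumed mapping

lookup = rows.each_with_object({}) do |(descendant_id, ancestor_id, rank_class), acc|
  rank = rank_class_to_name[rank_class]
  next unless rank

  acc["#{descendant_id}:#{rank}"] = ancestor_id
end
```

Callers can then resolve, say, the genus of terminal taxon 10 with a single hash read on the key "10:genus".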
#build_final_taxon(taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Hash (private)
Build a single processed taxon with all relationships.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 620
def build_final_taxon(
  taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id
)
  if accepted_name_mode == 'accepted_name_usage_id'
    accepted_name_usage_id, taxonomic_status, accepted_name_usage =
      determine_accepted_name_usage(
        taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id
      )
  end

  # GBIF checklist guidance requires acceptedNameUsageID on synonym rows to
  # point at an existing record in the dataset. If the accepted name could
  # not be included in this export, omit the synonym row instead of
  # emitting an invalid reference.
  if accepted_name_mode == 'accepted_name_usage_id' &&
     taxonomic_status.present? &&
     taxonomic_status != 'accepted' &&
     accepted_name_usage_id.nil?
    return nil
  end

  parent_id = nil

  # Synonyms don't participate in parent hierarchy.
  if accepted_name_mode == 'replace_with_accepted_name' || taxonomic_status == 'accepted'
    # Find parent via TaxonName parent_id, walking up hierarchy if needed.
    current_parent_id = taxon_name_info[taxon_name_id]&.[](:parent_id)

    while current_parent_id
      # Check if this parent is in the export
      if taxon_name_id_to_taxon_id[current_parent_id]
        parent_id = taxon_name_id_to_taxon_id[current_parent_id]
        break
      end

      # Parent not in export, walk up to its parent
      current_parent_id = taxon_name_info[current_parent_id]&.[](:parent_id)
    end
  end

  processed_taxon = {
    'id' => taxon_id,
    'taxonID' => taxon_id,
    'parentNameUsageID' => parent_id
  }

  if accepted_name_mode == 'accepted_name_usage_id'
    processed_taxon['acceptedNameUsageID'] = accepted_name_usage_id
    processed_taxon['acceptedNameUsage'] = accepted_name_usage
    processed_taxon['taxonomicStatus'] = taxonomic_status
  end

  # keep processed_taxon value during the merge when both processed_taxon
  # and taxon have a value for a key.
  processed_taxon.merge(taxon.slice(*PASSTHROUGH_FIELDS)) { |_key, processed_taxon_value, _taxon_value|
    processed_taxon_value
  }
end
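Two details of this method are easy to check in isolation: the walk up the parent chain to the nearest exported ancestor, and the `merge` block that prefers already-computed values. The sketch below uses made-up IDs and a hypothetical helper name, `nearest_exported_parent`:

```ruby
# Hypothetical data: species 30 -> genus 20 (not exported) -> family 10.
taxon_name_info = {
  30 => { parent_id: 20 },
  20 => { parent_id: 10 },
  10 => { parent_id: nil }
}
exported = { 30 => 'uuid-species', 10 => 'uuid-family' }

# Illustrative helper, not part of the class.
def nearest_exported_parent(id, info, exported)
  current = info.dig(id, :parent_id)
  while current
    return exported[current] if exported[current]

    current = info.dig(current, :parent_id) # skip ancestors absent from the export
  end
  nil
end

species_parent = nearest_exported_parent(30, taxon_name_info, exported)
family_parent  = nearest_exported_parent(10, taxon_name_info, exported)

# Hash#merge with a block: on key collisions the receiver's value wins.
processed   = { 'taxonID' => 'uuid-species', 'parentNameUsageID' => species_parent }
passthrough = { 'taxonID' => 'stale', 'genus' => 'Bus' }
merged = processed.merge(passthrough) { |_key, kept, _other| kept }
```

The species skips its unexported genus and attaches to the family, the root has no parent, and the colliding `taxonID` keeps the computed value.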
#build_processed_taxa(taxa_with_ids, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Array<Hash> (private)
Build final processed taxa with parent/accepted relationships.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 599
def build_processed_taxa(
  taxa_with_ids, taxon_name_info, taxon_name_id_to_taxon_id
)
  taxa_with_ids.filter_map do |item|
    build_final_taxon(
      item[:taxon],
      item[:taxon_id],
      item[:taxon_name_id],
      taxon_name_info,
      taxon_name_id_to_taxon_id
    )
  end
end
#collect_terminal_ids_for_batch(batch) ⇒ Array<Integer> (private)
Collect unique terminal taxon_name_ids from a batch of occurrence rows.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 143
def collect_terminal_ids_for_batch(batch)
  batch.map { |row|
    occurrence_key = "#{row['dwc_occurrence_object_type']}:#{row['dwc_occurrence_object_id']}"
    tn_data = otu_to_taxon_name_data[occurrence_to_otu[occurrence_key]]
    next unless tn_data

    if accepted_name_mode == 'replace_with_accepted_name'
      tn_data[:cached_valid_taxon_name_id] || tn_data[:id]
    else
      tn_data[:id]
    end
  }.compact.uniq
end
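The effect of the two modes on the collected IDs is easiest to see with sample data. The hash shapes below are inferred from the method body and the values are invented; `terminal_ids` is a hypothetical standalone version of the method:

```ruby
occurrence_to_otu = {
  'CollectionObject:1' => 100,
  'CollectionObject:2' => 101,
  'CollectionObject:3' => 100
}
otu_to_taxon_name_data = {
  100 => { id: 5, cached_valid_taxon_name_id: 9 }, # name 5 is a synonym of 9
  101 => { id: 9, cached_valid_taxon_name_id: 9 }
}
batch = [1, 2, 3, 4].map { |n|
  { 'dwc_occurrence_object_type' => 'CollectionObject', 'dwc_occurrence_object_id' => n }
}

# Illustrative standalone version; row 4 has no OTU mapping and is dropped.
def terminal_ids(batch, mode, occurrence_to_otu, otu_data)
  batch.filter_map { |row|
    key = "#{row['dwc_occurrence_object_type']}:#{row['dwc_occurrence_object_id']}"
    tn = otu_data[occurrence_to_otu[key]]
    next unless tn

    mode == 'replace_with_accepted_name' ? (tn[:cached_valid_taxon_name_id] || tn[:id]) : tn[:id]
  }.uniq
end

replaced = terminal_ids(batch, 'replace_with_accepted_name', occurrence_to_otu, otu_to_taxon_name_data)
kept     = terminal_ids(batch, 'accepted_name_usage_id', occurrence_to_otu, otu_to_taxon_name_data)
```

In replace mode the synonym collapses into its valid name, so only one ID remains; in accepted_name_usage_id mode both the synonym and the valid name are kept as terminals.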
#determine_accepted_name_usage(taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id) ⇒ Array (private)
Determine acceptedNameUsageID and taxonomicStatus for a taxon.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 705
def determine_accepted_name_usage(
  taxon, taxon_id, taxon_name_id, taxon_name_info, taxon_name_id_to_taxon_id
)
  return [nil, nil, nil] unless accepted_name_mode == 'accepted_name_usage_id'

  is_valid = taxon['taxon_name_cached_is_valid']

  if !is_valid.nil?
    return [taxon_id, 'accepted', taxon['scientificName']] if is_valid == true

    # This taxon is marked as invalid (synonym).
    valid_taxon_name_id = taxon['taxon_name_cached_valid_taxon_name_id']

    if valid_taxon_name_id.present?
      if valid_taxon_name_id == taxon_name_id &&
         taxon['taxon_name_gbif_taxonomic_status'].blank?
        return [taxon_id, 'accepted', taxon['scientificName']]
      end

      accepted_id = taxon_name_id_to_taxon_id[valid_taxon_name_id]
      accepted_name = taxon_name_info[valid_taxon_name_id]&.[](:scientific_name)
      # NOTE: accepted_id may be nil when the valid name has no OTU UUID
      # in this export - technically this is bad DwC checklist behavior:
      # https://ipt.gbif.org/manual/en/ipt/latest/best-practices-checklists#publishing-synonymy
      # "An dwc:acceptedNameUsageID must point to an existing record in
      # the dataset"
      status = taxon['taxon_name_gbif_taxonomic_status'] || 'synonym'
      [accepted_id, status, accepted_name]
    else
      [nil, nil, nil]
    end
  else
    # No validity data - this is an extracted higher taxon from rank columns.
    [taxon_id, 'accepted', taxon['scientificName']]
  end
end
#ensure_valid_names_for_synonyms(all_taxa, taxon_name_info = {}) ⇒ Hash (private)
Ensure valid names exist for all synonyms.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 463
def ensure_valid_names_for_synonyms(all_taxa, taxon_name_info = {})
  # Build lookup of missing valid names => a synonym template.
  valid_id_to_synonym = {}

  all_taxa.each_value do |taxon|
    next unless taxon['taxon_name_cached_is_valid'] == false

    valid_id = taxon['taxon_name_cached_valid_taxon_name_id']
    next unless valid_id.present? && !all_taxa[valid_id]

    # Yes, we're just picking out any one synonym here (cf. below):
    valid_id_to_synonym[valid_id] ||= taxon
  end

  return all_taxa if valid_id_to_synonym.empty?

  valid_ids = valid_id_to_synonym.keys
  valid_ancestor_lookup = build_ancestor_lookup(valid_ids)
  merge_taxon_name_info!(taxon_name_info, valid_ids, valid_ancestor_lookup)

  ::TaxonName.where(id: valid_ids).each do |valid_tn|
    rank = valid_tn.rank&.downcase
    next unless rank

    template_taxon = valid_id_to_synonym[valid_tn.id]
    next unless template_taxon

    # The template's rank columns (genus, family, higherClassification, etc.)
    # already reflect the valid name's classification, not the synonym's -
    # this is why it didn't matter *which* synonym of the valid name we
    # selected above.
    valid_taxon = template_taxon.dup
    valid_taxon['taxon_name_id'] = valid_tn.id
    valid_taxon['scientificName'] =
      taxon_name_info.dig(valid_tn.id, :scientific_name) || valid_tn.cached
    valid_taxon['taxonRank'] = rank
    valid_taxon['taxon_name_cached'] = valid_tn.cached
    valid_taxon['taxon_name_cached_is_valid'] = true
    valid_taxon['taxon_name_cached_valid_taxon_name_id'] = valid_tn.id

    taxon_name_info[valid_tn.id] = {
      rank: rank,
      parent_id: valid_tn.parent_id,
      scientific_name: self.class.combine_scientific_name(valid_tn.cached, valid_tn.cached_author_year),
      scientific_name_authorship: valid_tn.cached_author_year
    }

    normalize_accepted_name_usage_taxon(
      valid_taxon, rank, template_taxon['taxonRank']&.downcase,
      taxon_name_info: taxon_name_info[valid_tn.id]
    )

    all_taxa[valid_tn.id] = valid_taxon

    extract_accepted_name_usage_ancestor_taxa(
      valid_taxon, valid_tn.id, rank, valid_ancestor_lookup, all_taxa, taxon_name_info
    )
  end

  all_taxa
end
#extract_accepted_name_usage_ancestor_taxa(row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object (private)
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 840
def extract_accepted_name_usage_ancestor_taxa(
  row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa,
  taxon_name_info = {}
)
  terminal_rank_index = ORDERED_RANKS.index(terminal_rank)
  return unless terminal_rank_index && terminal_rank_index > 0

  (0...terminal_rank_index).reverse_each do |i|
    higher_rank = ORDERED_RANKS[i]
    rank_taxon_name = row[higher_rank]

    next if rank_taxon_name.blank?

    ancestor_tn_id = ancestor_lookup["#{terminal_tn_id}:#{higher_rank}"]
    next unless ancestor_tn_id

    # All higher ancestors are in all_taxa if this one is.
    break if all_taxa[ancestor_tn_id]

    ancestor_taxon = row.to_h.dup
    ancestor_taxon['taxon_name_id'] = ancestor_tn_id
    ancestor_taxon['scientificName'] =
      taxon_name_info.dig(ancestor_tn_id, :scientific_name) || rank_taxon_name
    ancestor_taxon['taxonRank'] = higher_rank

    normalize_accepted_name_usage_taxon(
      ancestor_taxon, higher_rank, terminal_rank,
      taxon_name_info: taxon_name_info[ancestor_tn_id]
    )

    TAXON_NAME_METADATA_FIELDS.each { |f| ancestor_taxon[f] = nil }

    all_taxa[ancestor_tn_id] = ancestor_taxon
  end
end
#extract_ancestor_taxa(row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object (private)
Extract and add ancestor taxa from terminal taxon up to root.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 365
def extract_ancestor_taxa(
  row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa,
  taxon_name_info = {}
)
  # Synonyms have no parentNameUsageID and their hierarchy columns are
  # corrected from their own ancestry in fix_synonym_rank_columns. Skipping
  # here avoids creating ancestor rows with values from the valid name's row.
  return if row['taxon_name_cached_is_valid'] == false

  terminal_rank_index = ORDERED_RANKS.index(terminal_rank)
  return unless terminal_rank_index && terminal_rank_index > 0

  (0...terminal_rank_index).reverse_each do |i|
    higher_rank = ORDERED_RANKS[i]
    rank_taxon_name = row[higher_rank]

    next if rank_taxon_name.blank?

    ancestor_tn_id = ancestor_lookup["#{terminal_tn_id}:#{higher_rank}"]
    next unless ancestor_tn_id

    # Early termination: if this ancestor already exists, all higher ones do too.
    break if all_taxa[ancestor_tn_id]

    ancestor_taxon = row.to_h.dup
    ancestor_taxon['taxon_name_id'] = ancestor_tn_id
    ancestor_taxon['scientificName'] =
      taxon_name_info.dig(ancestor_tn_id, :scientific_name) || rank_taxon_name
    ancestor_taxon['taxonRank'] = higher_rank

    normalize_occurrence_taxon(
      ancestor_taxon, higher_rank, terminal_rank,
      taxon_name_info: taxon_name_info[ancestor_tn_id]
    )

    # Clear taxon_name_ metadata for extracted ancestors.
    TAXON_NAME_METADATA_FIELDS.each { |f| ancestor_taxon[f] = nil }

    all_taxa[ancestor_tn_id] = ancestor_taxon
  end
end
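The early-termination walk can be sketched without the surrounding class. This example assumes a four-rank subset and invented IDs; it only demonstrates that iteration runs from the rank just above the terminal upward and stops at the first ancestor already present:

```ruby
ordered_ranks = %w[kingdom family genus species] # illustrative subset only
row = { 'kingdom' => 'Animalia', 'family' => 'Aidae', 'genus' => 'Aus' }
ancestor_lookup = { '30:genus' => 20, '30:family' => 10, '30:kingdom' => 1 }
all_taxa = { 10 => { 'scientificName' => 'Aidae' } } # family already extracted

added = []
terminal_index = ordered_ranks.index('species')

(0...terminal_index).reverse_each do |i|
  rank = ordered_ranks[i]
  next if row[rank].to_s.empty?

  ancestor_id = ancestor_lookup["30:#{rank}"]
  next unless ancestor_id

  # An existing ancestor implies its own ancestors were handled earlier.
  break if all_taxa[ancestor_id]

  all_taxa[ancestor_id] = { 'scientificName' => row[rank], 'taxonRank' => rank }
  added << rank
end
```

Only the genus is added here: the walk reaches the family, finds it present, and stops before visiting the kingdom.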
#extract_parent_species_for_taxon(row, rank, terminal_tn_id, ancestor_lookup, all_taxa, taxon_name_info = {}) ⇒ Object (private)
Extract parent species for infraspecific taxa.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 328
def extract_parent_species_for_taxon(
  row, rank, terminal_tn_id, ancestor_lookup, all_taxa,
  taxon_name_info = {}
)
  genus = row['genus']
  specific_epithet = row['specificEpithet']

  if genus.present? && specific_epithet.present?
    species_tn_id = ancestor_lookup["#{terminal_tn_id}:species"]
    return unless species_tn_id
    return if all_taxa[species_tn_id] # Already extracted

    # Create species taxon
    species_taxon = row.to_h.dup
    species_taxon['taxon_name_id'] = species_tn_id
    species_taxon['scientificName'] =
      taxon_name_info.dig(species_tn_id, :scientific_name) || "#{genus} #{specific_epithet}"
    species_taxon['taxonRank'] = 'species'

    normalize_occurrence_taxon(
      species_taxon, 'species', rank,
      taxon_name_info: taxon_name_info[species_tn_id]
    )

    # Clear taxon_name_ metadata since this is an extracted parent.
    TAXON_NAME_METADATA_FIELDS.each { |f| species_taxon[f] = nil }

    all_taxa[species_tn_id] = species_taxon
  end
end
#extract_unique_taxa(parsed) ⇒ Array<Hash, Hash> (private)
Extract all unique taxa from occurrence data. Uses taxon_name_id as the key to handle homonyms correctly.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 122
def extract_unique_taxa(parsed)
  all_taxa = {}
  taxon_name_info = {}
  batch_size = 25_000

  parsed.each_slice(batch_size) do |batch|
    terminal_ids = collect_terminal_ids_for_batch(batch)
    ancestor_lookup = build_ancestor_lookup(terminal_ids)
    merge_taxon_name_info!(taxon_name_info, terminal_ids, ancestor_lookup)

    batch.each do |row|
      process_occurrence_row(row, all_taxa, ancestor_lookup, taxon_name_info)
    end
  end

  [all_taxa, taxon_name_info]
end
#fix_synonym_rank_columns(all_taxa, taxon_name_info = {}) ⇒ Hash (private)
Overwrite rank and classification columns for synonym rows using the synonym's own taxon_name_hierarchies ancestry, not the valid name's. DwcOccurrence stores the valid name's classification (via current_valid_taxon_name), so synonym rows otherwise inherit the wrong genus, family, taxonRank, etc.
All hierarchy columns (genus through kingdom, higherClassification) are corrected from the synonym's own ancestry — useful for understanding the historical classification the synonym was published under. parentNameUsageID is left empty for synonyms (handled in build_final_taxon), since synonyms do not participate in tree navigation.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 419
def fix_synonym_rank_columns(all_taxa, taxon_name_info = {})
  synonym_tn_ids = all_taxa.each_value.filter_map { |taxon|
    taxon['taxon_name_id'] if taxon['taxon_name_cached_is_valid'] == false
  }
  return all_taxa if synonym_tn_ids.empty?

  # Query ancestors including self (self gives epithet and own rank).
  synonym_ancestors = Hash.new { |h, k| h[k] = {} }
  TaxonNameHierarchy
    .joins('JOIN taxon_names ON taxon_names.id = taxon_name_hierarchies.ancestor_id')
    .where(descendant_id: synonym_tn_ids)
    .pluck('taxon_name_hierarchies.descendant_id',
           'taxon_names.rank_class',
           'taxon_names.name')
    .each do |descendant_id, rank_class, name|
      rank = rank_class_to_name[rank_class]
      next unless rank

      synonym_ancestors[descendant_id][rank] = name
    end

  infraspecific_ranks = self.class.infraspecific_rank_names.to_set

  all_taxa.each_value do |taxon|
    next unless taxon['taxon_name_cached_is_valid'] == false

    ancestors = synonym_ancestors[taxon['taxon_name_id']]
    next if ancestors.empty?

    # Overwrite all rank columns from synonym's own hierarchy.
    HIGHER_RANK_COLUMNS.each { |col| taxon[col] = ancestors[col] }
    taxon['specificEpithet'] = ancestors['species']

    synonym_rank = (ORDERED_RANKS & ancestors.keys).last
    taxon['taxonRank'] = synonym_rank if synonym_rank
    taxon['infraspecificEpithet'] =
      infraspecific_ranks.include?(synonym_rank) ? ancestors[synonym_rank] : nil

    normalize_accepted_name_usage_taxon(
      taxon,
      taxon_name_info: taxon_name_info[taxon['taxon_name_id']]
    )
  end

  all_taxa
end
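The expression `(ORDERED_RANKS & ancestors.keys).last` works because `Array#&` preserves the order of the receiver, so the last shared element is the lowest rank present in the synonym's own ancestry. A minimal sketch with an assumed rank subset and invented names:

```ruby
ordered_ranks = %w[kingdom family genus species subspecies] # illustrative subset only

# Ancestry of a hypothetical synonym, keyed by rank name in arbitrary order.
ancestors = { 'species' => 'bus', 'genus' => 'Aus', 'family' => 'Aidae' }

shared = ordered_ranks & ancestors.keys # ordered as in ordered_ranks
synonym_rank = shared.last
```

The insertion order of `ancestors` does not matter; the intersection is always sorted highest to lowest rank.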
#merge_taxon_name_info!(taxon_name_info, terminal_taxon_name_ids, ancestor_lookup) ⇒ Object (private)
Preload TaxonName metadata needed during normalization and final assembly.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 195
def merge_taxon_name_info!(taxon_name_info, terminal_taxon_name_ids, ancestor_lookup)
  ids = (terminal_taxon_name_ids + ancestor_lookup.values).uniq - taxon_name_info.keys
  merge_taxon_name_info_for_ids!(taxon_name_info, ids)
end
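The ID computation is plain array arithmetic and can be checked with sample values. All IDs below are invented; the point is that anything already present in `taxon_name_info` is not queried again:

```ruby
taxon_name_info = { 7 => { rank: 'genus' } } # loaded by an earlier batch
terminal_ids = [10, 11]
ancestor_lookup = { '10:genus' => 7, '10:family' => 3, '11:genus' => 7 }

# Terminals plus all their ancestors, deduplicated, minus what is already known.
ids_to_load = (terminal_ids + ancestor_lookup.values).uniq - taxon_name_info.keys
```

Because batches of occurrences tend to share higher taxa, this subtraction is what keeps later batches from reloading the same ancestors.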
#merge_taxon_name_info_for_ids!(taxon_name_info, ids) ⇒ Object (private)
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 200
def merge_taxon_name_info_for_ids!(taxon_name_info, ids)
  ids = ids.uniq - taxon_name_info.keys
  return if ids.empty?

  ids.each_slice(25_000) do |batch|
    ::TaxonName.where(id: batch)
      .pluck(:id, :rank_class, :parent_id, :cached, :cached_author_year)
      .each do |id, rank_class, parent_id, cached, cached_author_year|
        rank = rank_class_to_name[rank_class]&.downcase
        taxon_name_info[id] = {
          rank: rank,
          parent_id: parent_id,
          scientific_name: self.class.combine_scientific_name(cached, cached_author_year),
          scientific_name_authorship: cached_author_year
        }
      end
  end
end
#normalize ⇒ String, Hash
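Main entry point. Parses the tab-separated occurrence CSV and returns the normalized taxon CSV string together with the taxon_name_id => taxonID hash. Empty input returns ["\n", {}].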
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 49
def normalize
  parsed = CSV.parse(@raw_csv, headers: true, col_sep: "\t")

  return ["\n", {}] if parsed.empty?

  all_taxa, taxon_name_info = extract_unique_taxa(parsed)

  if @accepted_name_mode == 'accepted_name_usage_id'
    all_taxa = ensure_valid_names_for_synonyms(all_taxa, taxon_name_info)
    all_taxa = fix_synonym_rank_columns(all_taxa, taxon_name_info)
  end

  # Build hierarchy and assign taxonIDs
  processed_taxa, taxon_name_id_to_taxon_id =
    assign_taxon_ids_and_build_hierarchy(all_taxa, taxon_name_info)
  all_taxa = nil # release memory

  processed_taxa = remove_empty_columns(processed_taxa)

  # Collect headers from all rows so taxa with extra columns (e.g. a
  # synonym row that gained infraspecificEpithet from its own hierarchy
  # while the first row never had that key) are not misaligned when written.
  output_headers = processed_taxa.each_with_object([]) do |taxon, headers|
    taxon.each_key { |k| headers << k unless headers.include?(k) }
  end

  csv_output = CSV.generate(col_sep: "\t") do |csv|
    csv << output_headers

    processed_taxa.each do |taxon|
      csv << output_headers.map { |h| taxon[h] }
    end
  end

  [csv_output, taxon_name_id_to_taxon_id]
end
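The header-union step matters when rows do not all carry the same keys. The following sketch uses Ruby's standard `csv` library and invented rows to show that a key present only on a later row still gets its own column, with earlier rows left empty there:

```ruby
require 'csv'

# Invented rows: only the second one has infraspecificEpithet.
rows = [
  { 'taxonID' => 'uuid-1', 'scientificName' => 'Aus bus' },
  { 'taxonID' => 'uuid-2', 'scientificName' => 'Aus bus cus', 'infraspecificEpithet' => 'cus' }
]

# Union of keys across all rows, in first-seen order.
headers = rows.each_with_object([]) do |row, acc|
  row.each_key { |k| acc << k unless acc.include?(k) }
end

tsv = CSV.generate(col_sep: "\t") do |csv|
  csv << headers
  rows.each { |row| csv << headers.map { |h| row[h] } }
end

reparsed = CSV.parse(tsv, headers: true, col_sep: "\t")
```

Taking headers from the first row alone would drop the `infraspecificEpithet` column here; mapping every row through the full header list keeps the columns aligned.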
#normalize_accepted_name_usage_taxon(taxon, current_rank = nil, original_rank = nil, taxon_name_info: nil) ⇒ Object (private)
Normalizes a taxon row introduced or rewritten during accepted-name-usage handling. This covers both corrected synonym rows and accepted rows synthesized from those synonyms.
Like occurrence-stage normalization, a rank transition clears lower-rank columns and taxon-specific fields that do not apply to the resulting rank, then recomputes the normalized fields from the row's own taxon metadata. Callers may omit the ranks to use the row's current taxonRank unchanged.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 788
def normalize_accepted_name_usage_taxon(
  taxon, current_rank = nil, original_rank = nil, taxon_name_info: nil
)
  current_rank ||= taxon['taxonRank']&.downcase
  original_rank ||= current_rank

  normalize_taxon_for_rank_transition(
    taxon,
    current_rank,
    original_rank,
    taxon_name_info: taxon_name_info
  )
end
#normalize_occurrence_taxon(taxon, current_rank, original_rank = nil, taxon_name_info: nil) ⇒ Object (private)
Normalizes a taxon row produced during the occurrence-driven stage. This covers both terminal occurrence-backed rows and taxa extracted from those rows (for example parent species or higher ancestors).
When current_rank differs from original_rank, it:
- clears rank columns below current_rank
- clears taxon-specific fields not applicable to the extracted rank
It always recomputes normalized fields from the taxon's own metadata and preserves all other fields as-is. In particular, it does not rewrite row identity fields such as taxonRank or scientificName; callers are expected to set those before calling this method.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 761

def normalize_occurrence_taxon(
  taxon, current_rank, original_rank = nil, taxon_name_info: nil
)
  normalize_taxon_for_rank_transition(
    taxon,
    current_rank,
    original_rank,
    taxon_name_info: taxon_name_info
  )
end
#normalize_taxon_for_rank_transition(taxon, current_rank, original_rank = nil, taxon_name_info: nil) ⇒ Object (private)
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 801

def normalize_taxon_for_rank_transition(
  taxon, current_rank, original_rank = nil, taxon_name_info: nil
)
  current_id = ORDERED_RANKS.index(current_rank)
  return unless current_id

  if current_rank == original_rank
    populate_normalized_taxon_fields(taxon, taxon_name_info)
    return
  end

  # Clear lower rank columns
  ORDERED_RANKS[(current_id + 1)..-1].each do |lower_rank|
    taxon[lower_rank] = nil
  end

  # Fields to keep for extracted taxa
  rank_columns = ORDERED_RANKS.map(&:to_s)
  fields_to_keep = rank_columns + ['scientificName', 'taxonRank', 'nomenclaturalCode']

  # Add epithet fields based on rank
  if current_rank == 'species'
    fields_to_keep << 'specificEpithet'
  elsif self.class.infraspecific_rank_names.include?(current_rank)
    fields_to_keep << 'specificEpithet'
    fields_to_keep << 'infraspecificEpithet'
  end

  # Clear all other taxon-specific fields
  Data::CHECKLIST_TAXON_EXTENSION_COLUMNS.keys.each do |field|
    field_str = field.to_s
    field_str = 'class' if field == :dwcClass
    taxon[field_str] = nil unless fields_to_keep.include?(field_str)
  end

  populate_normalized_taxon_fields(taxon, taxon_name_info)
end
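The lower-rank clearing performed on a rank transition can be illustrated on its own. This is a sketch only: RANKS is a shortened, hypothetical stand-in for ORDERED_RANKS (whose real contents come from Data::ORDERED_RANKS), clear_lower_ranks is an illustrative helper name, and unlike the method above it returns a copy rather than mutating the row in place.

```ruby
# Hypothetical, shortened rank list; the real list is Data::ORDERED_RANKS.
RANKS = %w[kingdom family genus species subspecies].freeze

# Returns a copy of the row with every rank column below current_rank set to nil.
# Unknown ranks leave the row untouched, mirroring the early return above.
def clear_lower_ranks(row, current_rank)
  index = RANKS.index(current_rank)
  return row unless index

  cleared = row.dup
  RANKS[(index + 1)..-1].each { |lower_rank| cleared[lower_rank] = nil }
  cleared
end

# Illustrative row for a subspecies, from which a genus-level row is derived.
subspecies_row = {
  'kingdom' => 'Animalia', 'family' => 'Aidae', 'genus' => 'Aus',
  'species' => 'bus', 'subspecies' => 'cus'
}
genus_row = clear_lower_ranks(subspecies_row, 'genus')
```

The derived genus row keeps kingdom, family, and genus, while the species and subspecies columns are nil.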
#populate_normalized_taxon_fields(taxon, taxon_name_info = nil) ⇒ Object (private)
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 680

def populate_normalized_taxon_fields(taxon, taxon_name_info = nil)
  taxon['scientificNameAuthorship'] = taxon_name_info&.[](:scientific_name_authorship)
  # The original higherClassification may include more ranks than the
  # checklist does, so just always recompute it.
  taxon['higherClassification'] = recompute_higher_classification(taxon)
end
#process_occurrence_row(row, all_taxa, ancestor_lookup, taxon_name_info = {}) ⇒ Object (private)
Process a single occurrence row and extract its taxa.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 245

def process_occurrence_row(row, all_taxa, ancestor_lookup, taxon_name_info = {})
  occurrence_key = "#{row['dwc_occurrence_object_type']}:#{row['dwc_occurrence_object_id']}"
  tn_data = otu_to_taxon_name_data[occurrence_to_otu[occurrence_key]]
  return unless tn_data

  terminal_tn_id = if accepted_name_mode == 'replace_with_accepted_name'
    tn_data[:cached_valid_taxon_name_id] || tn_data[:id]
  else
    tn_data[:id]
  end
  return unless terminal_tn_id

  store_taxon_name_metadata(row, tn_data) if accepted_name_mode == 'accepted_name_usage_id'

  terminal_rank = row['taxonRank']&.downcase
  return unless terminal_rank.present? && row['scientificName'].present?

  add_terminal_taxon(
    row, terminal_tn_id, terminal_rank, all_taxa, ancestor_lookup, taxon_name_info
  )

  extract_ancestor_taxa(
    row, terminal_tn_id, terminal_rank, ancestor_lookup, all_taxa, taxon_name_info
  )
end
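How the terminal taxon name ID depends on the accepted-name mode can be seen in a small standalone sketch. The helper name terminal_taxon_name_id and the sample IDs below are hypothetical; only the two mode strings and the hash keys are taken from the listing above.

```ruby
# Picks the taxon name ID that the terminal row should be keyed on.
# In 'replace_with_accepted_name' mode a synonym is swapped for its valid name
# when one is known; in every other mode the name's own ID is used.
def terminal_taxon_name_id(tn_data, mode)
  if mode == 'replace_with_accepted_name'
    tn_data[:cached_valid_taxon_name_id] || tn_data[:id]
  else
    tn_data[:id]
  end
end

# Illustrative data: taxon name 10 is a synonym whose valid name is 7.
synonym = { id: 10, cached_valid_taxon_name_id: 7 }

replaced = terminal_taxon_name_id(synonym, 'replace_with_accepted_name')
kept     = terminal_taxon_name_id(synonym, 'accepted_name_usage_id')
fallback = terminal_taxon_name_id({ id: 10 }, 'replace_with_accepted_name')
```

With this data, replaced is 7, kept is 10, and fallback is 10 because no valid name ID is cached.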
#rank_class_to_name ⇒ Hash (private)
Cached mapping of rank_class to rank_name.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 221

def rank_class_to_name
  @rank_class_to_name ||= begin
    mapping = {}

    # Get all NomenclaturalRank classes from all codes
    [
      NomenclaturalRank::Iczn,
      NomenclaturalRank::Icn,
      NomenclaturalRank::Icnp,
      NomenclaturalRank::Icvcn
    ].each do |code_module|
      code_module.ordered_ranks.each do |rank_class|
        mapping[rank_class.name] = rank_class.rank_name
      end
    end

    mapping
  end
end
#recompute_higher_classification(taxon) ⇒ Object (private)
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 687

def recompute_higher_classification(taxon)
  rank = taxon['taxonRank']&.downcase
  rank_index = ORDERED_RANKS.index(rank)
  return taxon['higherClassification'] unless rank_index

  classification_parts = ORDERED_RANKS[0...rank_index]
    .filter_map { |r| HIGHER_RANK_COLUMNS.include?(r) ? taxon[r].presence : nil }

  classification_parts.empty? ? nil : classification_parts.join(Export::Dwca::DELIMITER)
end
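The recomputation of higherClassification can be sketched without Rails. The constants below are assumptions made for illustration: RANK_ORDER and RANK_COLUMNS are shortened stand-ins for ORDERED_RANKS and HIGHER_RANK_COLUMNS, and SEPARATOR is a guessed value, since the actual Export::Dwca::DELIMITER is not shown in this section. The blank check replaces ActiveSupport's presence.

```ruby
# Hypothetical stand-ins for ORDERED_RANKS and HIGHER_RANK_COLUMNS.
RANK_ORDER   = %w[kingdom phylum class order family genus species].freeze
RANK_COLUMNS = %w[kingdom phylum class order family genus].freeze
SEPARATOR    = ' | ' # assumed value; the real one is Export::Dwca::DELIMITER

# Joins the non-blank rank columns above the row's own rank.
# Rows with an unrecognized rank keep their existing higherClassification.
def higher_classification(row)
  index = RANK_ORDER.index(row['taxonRank']&.downcase)
  return row['higherClassification'] unless index

  parts = RANK_ORDER[0...index].filter_map do |rank|
    value = row[rank]
    value if RANK_COLUMNS.include?(rank) && value && !value.strip.empty?
  end

  parts.empty? ? nil : parts.join(SEPARATOR)
end

# Illustrative genus row with a blank phylum and no class or order.
genus_row = {
  'taxonRank' => 'Genus', 'kingdom' => 'Animalia', 'phylum' => '',
  'family' => 'Aidae', 'genus' => 'Aus'
}
classification = higher_classification(genus_row)
```

Only ranks strictly above the row's own rank contribute, so the genus value itself is left out and blank columns are skipped.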
#remove_empty_columns(taxa) ⇒ Array<Hash> (private)
Removes columns that are empty across all taxa. The required columns id, taxonID, scientificName, and taxonRank are always kept, even when empty.

# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 93

def remove_empty_columns(taxa)
  return taxa if taxa.empty?

  # Required columns that should never be removed, even if empty
  required_columns = %w[id taxonID scientificName taxonRank].to_set

  # Find which columns have at least one non-empty value
  columns_with_data = Set.new
  taxa.each do |taxon|
    taxon.each do |key, value|
      next if columns_with_data.include?(key)

      if required_columns.include?(key) || value.present?
        columns_with_data << key
      end
    end
  end

  # Filter each taxon to only include columns with data
  taxa.map do |taxon|
    taxon.select { |key, _| columns_with_data.include?(key) }
  end
end
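The column-pruning rule can be exercised outside Rails with a short sketch. The names drop_empty_columns, blank_value?, and REQUIRED are illustrative, and blank_value? is a simplified substitute for ActiveSupport's present? that only treats nil and whitespace-only strings as empty.

```ruby
require 'set'

# Columns that survive even when every value is empty (from the listing above).
REQUIRED = %w[id taxonID scientificName taxonRank].to_set

# Simplified blank check: nil or a whitespace-only string.
def blank_value?(value)
  value.nil? || value.to_s.strip.empty?
end

# Keeps a column if it is required or if at least one row has a value for it.
def drop_empty_columns(rows)
  return rows if rows.empty?

  keep = rows.each_with_object(Set.new) do |row, seen|
    row.each { |key, value| seen << key if REQUIRED.include?(key) || !blank_value?(value) }
  end

  rows.map { |row| row.select { |key, _| keep.include?(key) } }
end

# Illustrative rows: family is empty everywhere, taxonRank is empty but required.
rows = [
  { 'taxonID' => 'a', 'taxonRank' => nil, 'family' => nil, 'genus' => 'Aus' },
  { 'taxonID' => 'b', 'taxonRank' => nil, 'family' => '',  'genus' => nil }
]
pruned = drop_empty_columns(rows)
```

The family column disappears from both rows, taxonRank stays because it is required, and genus stays in the second row even though its value there is nil.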
#store_taxon_name_metadata(row, tn_data) ⇒ Object (private)
Stores TaxonName metadata in the row for accepted_name_usage_id mode. Does nothing when tn_data[:cached] is blank.

# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 274

def store_taxon_name_metadata(row, tn_data)
  return unless tn_data[:cached].present?

  row['taxon_name_cached'] = tn_data[:cached]
  row['taxon_name_cached_is_valid'] = tn_data[:cached_is_valid]
  row['taxon_name_cached_valid_taxon_name_id'] = tn_data[:cached_valid_taxon_name_id]
  row['taxon_name_gbif_taxonomic_status'] = tn_data[:gbif_taxonomic_status]
end
#taxon_name_id_to_otu_uuid(taxon_name_ids) ⇒ Hash (private)
Build a mapping of taxon_name_id => OTU UUID for the given taxon_name_ids. Only includes taxa that have an OTU with a Uuid identifier.
# File 'lib/export/dwca/checklist/occurrence_normalizer.rb', line 579

def taxon_name_id_to_otu_uuid(taxon_name_ids)
  return {} if taxon_name_ids.empty?

  taxon_name_ids.each_slice(25_000).each_with_object({}) do |batch, result|
    ::Otu
      .joins("JOIN identifiers ON identifiers.identifier_object_id = otus.id
              AND identifiers.identifier_object_type = 'Otu'
              AND identifiers.type LIKE 'Identifier::Global::Uuid%'
              AND identifiers.position = 1")
      .where(taxon_name_id: batch)
      .pluck('otus.taxon_name_id', 'identifiers.cached')
      .each { |tn_id, uuid| result[tn_id] = uuid }
  end
end
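The batching pattern used here (each_slice feeding each_with_object) can be tried without a database. In this sketch, lookup_batch and FAKE_UUIDS are hypothetical stand-ins for the Otu query, and the batch size is shrunk from 25_000 to 2 so the slicing is visible with a handful of IDs.

```ruby
# Hypothetical stand-in for the identifiers table: taxon_name_id => UUID.
FAKE_UUIDS = { 1 => 'uuid-1', 3 => 'uuid-3', 5 => 'uuid-5' }.freeze

# Hypothetical stand-in for the query: returns [id, uuid] pairs for one batch,
# omitting IDs that have no UUID (as an inner join would).
def lookup_batch(ids)
  ids.filter_map { |id| [id, FAKE_UUIDS[id]] if FAKE_UUIDS.key?(id) }
end

# Resolves IDs in fixed-size batches and merges the results into one hash.
def uuid_map(ids, batch_size: 2)
  return {} if ids.empty?

  ids.each_slice(batch_size).each_with_object({}) do |batch, result|
    lookup_batch(batch).each { |id, uuid| result[id] = uuid }
  end
end

mapping = uuid_map([1, 2, 3, 4, 5])
```

IDs without a UUID simply never appear in the result, which matches the documented behaviour of only including taxa that have an OTU with a Uuid identifier.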