Module: Vendor::Colrapi

Defined in:
lib/vendor/colrapi.rb

Overview

A middle-layer wrapper between Colrapi and TaxonWorks

Constant Summary

DATASETS =
{
  col: '3LR', # The human-edited compilation
  col_extended: '3LXR' # Human-edited compilation plus algorithmic extensions
}.freeze

Class Method Details

.align_classification(taxonworks_object, colrapi_result) ⇒ Array

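Facilitates a two-row alignment of a TaxonWorks classification against a CoL classification. Currently a stub that only initializes an empty Array.

Returns:

  • (Array)

    hashes of the form { rank: 'species', col: 'name', taxonworks: 'name', rank_origin: ... }, where rank_origin is one of :col, :taxonworks, or :both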
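
Since the method body is still a stub, the following is only a hypothetical sketch of how rows in the documented shape could be produced. The helper `align_rows`, the `RANK_ORDER` list, and the rank-keyed input hashes are assumptions made for illustration and are not part of Vendor::Colrapi.

```ruby
# Hypothetical sketch, not the TaxonWorks implementation.
# Inputs are plain hashes of rank => name for each source.
def align_rows(col_by_rank, taxonworks_by_rank, rank_order)
  rank_order.filter_map do |rank|
    col = col_by_rank[rank]
    tw  = taxonworks_by_rank[rank]
    next if col.nil? && tw.nil? # rank absent from both sources, no row

    origin = if col && tw then :both
             elsif col then :col
             else :taxonworks
             end

    { rank: rank, col: col, taxonworks: tw, rank_origin: origin }
  end
end

# Assumed, abbreviated rank order (kingdom-first).
RANK_ORDER = %w[kingdom family genus species].freeze

ROWS = align_rows(
  { 'family' => 'Hominidae', 'genus' => 'Homo', 'species' => 'Homo sapiens' },
  { 'kingdom' => 'Animalia', 'genus' => 'Homo', 'species' => 'Homo sapiens' },
  RANK_ORDER
)
ROWS.each { |row| puts row.inspect }
```

Each row records which side supplied the rank, so a UI could render the two classifications as parallel rows with gaps where one source has no entry.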



# File 'lib/vendor/colrapi.rb', line 27

def self.align_classification(taxonworks_object, colrapi_result)
  r = []
end

.ancestors(taxon_id) ⇒ Array<Hash>

Returns the ancestor classification chain for a CoL taxon.

Uses Colrapi.taxon with subresource: 'classification'. Response is an Array of hashes with keys: 'id', 'name' (String, not hash), 'authorship', 'rank', 'label', 'labelHtml'.

Only valid for backbone datasets (DATASETS[:col], DATASETS[:col_extended]). For external datasets use ancestors_via_parent_id instead.

Parameters:

  • taxon_id (String)

    CoL taxon ID (e.g. '6MB3T')

Returns:

  • (Array<Hash>)


# File 'lib/vendor/colrapi.rb', line 150

def self.ancestors(taxon_id)
  ::Colrapi.taxon(DATASETS[:col], taxon_id: taxon_id, subresource: 'classification')
rescue => e
  Rails.logger.warn "Vendor::Colrapi.ancestors error: #{e.message}"
  []
end

.ancestors_via_parent_id(dataset_id, taxon_id, max_depth: 20) ⇒ Array<Hash>

Builds an ancestor chain for external/denormed datasets by following the parentId field of successive taxon records.

External datasets (like the Mammal Diversity Database) are ingested into ChecklistBank without a pre-built classification subresource. Instead, each nameusage record carries a parentId pointing to the immediate parent within the same dataset.

Returns entries in the same format as the classification subresource used by ancestors():

{ 'id', 'name' (String uninomial), 'rank', 'authorship', 'label', 'labelHtml' }

Order is proximal-first (immediate parent first) matching ancestors() behavior. The starting taxon itself is NOT included; only its ancestors are.

Parameters:

  • dataset_id (String)

    the external dataset to query

  • taxon_id (String)

    ID of the starting taxon

  • max_depth (Integer) (defaults to: 20)

    circuit-breaker against malformed/cyclic data

Returns:

  • (Array<Hash>)
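
The traversal described above can be sketched without network access by swapping the per-record API lookup for an in-memory table. The `RECORDS` hash, its IDs, and the helper name `parent_chain` are assumptions made for illustration; the real method fetches each record through Colrapi.taxon.

```ruby
require 'set'

# Hypothetical stand-in for per-record lookups: id => record.
# 'fam1' points back at 'gen1' on purpose to exercise the cycle guard.
RECORDS = {
  'sp1'  => { 'parentId' => 'gen1', 'name' => 'sapiens',   'rank' => 'species' },
  'gen1' => { 'parentId' => 'fam1', 'name' => 'Homo',      'rank' => 'genus' },
  'fam1' => { 'parentId' => 'gen1', 'name' => 'Hominidae', 'rank' => 'family' }
}.freeze

def parent_chain(records, start_id, max_depth: 20)
  chain   = []
  visited = Set.new
  current_id = records.dig(start_id, 'parentId') # nil when the start record is unknown

  max_depth.times do
    # Stop at the root (no parent) or when an ID repeats (cyclic data).
    break if current_id.nil? || visited.include?(current_id)
    visited << current_id

    record = records[current_id]
    break if record.nil?

    chain << { 'id' => current_id, 'name' => record['name'], 'rank' => record['rank'] }
    current_id = record['parentId']
  end

  chain # proximal-first: immediate parent first, starting taxon excluded
end

puts parent_chain(RECORDS, 'sp1').map { |a| a['id'] }.inspect
```

Both guards matter: the visited set stops exact cycles early, while max_depth bounds the number of lookups even when every ID is distinct.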


# File 'lib/vendor/colrapi.rb', line 184

def self.ancestors_via_parent_id(dataset_id, taxon_id, max_depth: 20)
  chain   = []
  visited = Set.new

  initial = ::Colrapi.taxon(dataset_id, taxon_id: taxon_id)
  return chain if initial.blank?

  current_id = initial['parentId']

  max_depth.times do
    break if current_id.blank? || visited.include?(current_id)
    visited << current_id

    taxon = ::Colrapi.taxon(dataset_id, taxon_id: current_id)

    break if taxon.blank?

    chain << {
      'id'         => current_id,
      'name'       => uninomial_name(taxon['name']).to_s,
      'rank'       => taxon.dig('name', 'rank'),
      'authorship' => taxon.dig('name', 'authorship'),
      'label'      => taxon.fetch('label', '').to_s,
      'labelHtml'  => taxon.fetch('labelHtml', '').to_s
    }

    current_id = taxon['parentId']
  end

  # Return proximal-first (immediate parent first), the same order as the classification
  # subresource returned by ancestors(); build_extension reverses the chain to kingdom-first.
  chain
rescue => e
  Rails.logger.warn "Vendor::Colrapi.ancestors_via_parent_id error: #{e.message}"
  []
end

.build_extension(col_result, project_id, dataset_id: nil) ⇒ Hash

Builds an alignment hash comparing a CoL nameusage result against TaxonNames in the project.

col_result is a flat nameusage hash as returned by search (no 'usage' wrapper):

{ 'id' => '6MB3T', 'status' => 'accepted',
'name' => { 'scientificName' => 'Homo sapiens', 'rank' => 'species',
            'authorship' => 'Linnaeus, 1758',
            'combinationAuthorship' => { 'authors' => [...], 'year' => '1758' } },
'label' => 'Homo sapiens Linnaeus, 1758', … }

Classification entries from ancestors() have:

{ 'id' => '636X2', 'name' => 'Homo', 'rank' => 'genus', 'label' => 'Homo', … }

Note: in classification entries 'name' is a plain String, not a hash.

Parameters:

  • col_result (Hash)

    a single entry from search

  • project_id (Integer, nil)
  • dataset_id (String, nil) (defaults to: nil)

    the dataset that was searched; falls back to DATASETS[:col]

Returns:

  • (Hash)

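    extension hash with :col_key, :col_name, :col_status, :col_authorship, :col_year, :col_rank, :col_code, :col_dataset_id, and :alignment (Array of ancestor hashes, kingdom-first, each with :rank, :col_name, :col_id, :dataset_id, :col_authorship, :taxonworks_id, :taxonworks_name, and :match)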
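
The ancestor pruning that this method applies before building :alignment can be sketched in isolation. The rank table `SKETCH_RANK_SORT`, the helper `prune_ancestors`, and the sample chain (with made-up names and IDs) are assumptions for illustration; the real method resolves ranks through col_rank_sort and fetches the chain from CoL.

```ruby
# Hypothetical rank ordering; higher index = more specific rank.
SKETCH_RANK_SORT = {
  'domain' => 0, 'kingdom' => 1, 'family' => 2,
  'genus' => 3, 'subgenus' => 4, 'species' => 5
}.freeze

# chain is proximal-first, as returned by the classification lookup.
def prune_ancestors(chain, target_rank:, non_accepted:, accepted_id: nil)
  # The accepted name itself is not an ancestor of its synonym.
  chain = chain.reject { |a| a['id'] == accepted_id } if accepted_id

  target = SKETCH_RANK_SORT[target_rank]
  if non_accepted && target
    # Drop entries at or below the queried name's own rank.
    chain = chain.reject do |a|
      sort = SKETCH_RANK_SORT[a['rank']]
      sort && sort >= target
    end
  end

  # Drop suprakingdom ranks, then flip to kingdom-first.
  chain.reject { |a| a['rank'] == 'domain' }.reverse
end

# Made-up chain for the accepted name (a subgenus) of a genus-rank synonym.
CHAIN = [
  { 'id' => 'A', 'name' => 'Xus (Yus)', 'rank' => 'subgenus' },
  { 'id' => 'B', 'name' => 'Xus',       'rank' => 'genus' },
  { 'id' => 'C', 'name' => 'Xidae',     'rank' => 'family' },
  { 'id' => 'D', 'name' => 'Animalia',  'rank' => 'kingdom' },
  { 'id' => 'E', 'name' => 'Eukaryota', 'rank' => 'domain' }
].freeze

PRUNED = prune_ancestors(CHAIN, target_rank: 'genus', non_accepted: true, accepted_id: 'A')
puts PRUNED.map { |a| a['name'] }.inspect
```

In this example the genus entry is removed as well as the subgenus, because a genus cannot be a parent of a genus-rank synonym.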



# File 'lib/vendor/colrapi.rb', line 239

def self.build_extension(col_result, project_id, dataset_id: nil)
  col_key        = col_result['id']
  col_name       = uninomial_name(col_result['name'])
  col_rank       = col_result.dig('name', 'rank')&.downcase
  col_name       = extract_subgenus_name(col_name) if col_rank == 'subgenus'
  col_status     = col_result['status']
  col_authorship = col_result.dig('name', 'authorship')
  col_year       = col_result.dig('name', 'combinationAuthorship', 'year') ||
                   col_result.dig('name', 'basionymOrCombinationAuthorship', 'year')

  # CoL nomenclatural code: 'zoological', 'botanical', 'bacterial', 'viral'
  col_code       = col_result.dig('name', 'code')

  # Dataset used for the search (target row).
  col_dataset_id = dataset_id.presence || DATASETS[:col]

  # Backbone datasets (main CoL, extended CoL) expose a classification subresource.
  # External/denormed datasets must be traversed via iterative parentId lookups.
  # Ancestor records carry the dataset_id of whichever source they came from.
  #
  # For synonyms with a single accepted target, CoL attaches the synonym under its accepted
  # name in the tree, so fetching classification via the synonym's own ID returns the accepted
  # name in the chain.  Use the accepted name's ID for the lookup instead, then strip it out
  # by ID (CoL's classification endpoint includes the queried taxon itself as the most
  # proximal entry).
  # Ambiguous synonyms have no single accepted target, so we look up via their own ID.
  non_accepted    = col_status.present? && col_status != 'accepted'
  accepted_id     = col_status == 'synonym' ? col_result.dig('accepted', 'id').presence : nil
  ancestor_lookup_key = accepted_id || col_key

  ancestor_chain, ancestor_dataset_id =
    if ancestor_lookup_key.present?
      if col_backbone_dataset?(col_dataset_id)
        [ancestors(ancestor_lookup_key), DATASETS[:col]]
      else
        [ancestors_via_parent_id(col_dataset_id, ancestor_lookup_key), col_dataset_id]
      end
    else
      [[], col_dataset_id]
    end

  ancestor_chain = ancestor_chain.reject { |a| a['id'] == accepted_id } if accepted_id

  # For any non-accepted name, strip ancestors at or below the name's own rank.
  # CoL places non-accepted names under their accepted name, so the accepted name's
  # classification chain can include same- or lower-ranked entries that are not valid
  # parents of the queried name (e.g. a genus synonym whose accepted name is a subgenus).
  if non_accepted && col_rank.present?
    target_sort = col_rank_sort(col_rank, col_code)
    if target_sort
      ancestor_chain = ancestor_chain.reject { |a|
        anc_sort = col_rank_sort(a['rank']&.downcase, col_code)
        anc_sort && anc_sort >= target_sort
      }
    end
  end

  # Drop suprakingdom ranks (e.g. 'domain') that have no equivalent in TaxonWorks
  # nomenclatural codes.  Kingdom is the highest rank we include.
  # CoL classification returns proximal→distal (immediate parent first); reverse to kingdom-first.
  ancestor_chain = ancestor_chain.reject { |a| a['rank']&.downcase == 'domain' }.reverse

  alignment = ancestor_chain.map do |ancestor|
    rank     = ancestor['rank']&.downcase
    # In classification entries 'name' is a plain String (the uninomial name)
    anc_name = ancestor['name'].is_a?(String) ? ancestor['name'] : ancestor.dig('name', 'scientificName')
    anc_name = extract_subgenus_name(anc_name) if rank == 'subgenus'
    col_id   = ancestor['id']

    scope = ::TaxonName.where(cached: anc_name) # !!!
    scope = scope.where(project_id:) if project_id.present?
    tw_record = scope.first

    {
      rank:,
      col_name:        anc_name,
      col_id:,
      dataset_id:      ancestor_dataset_id,
      col_authorship:  ancestor['authorship'].presence,
      taxonworks_id:   tw_record&.id,
      taxonworks_name: tw_record&.cached,
      match:           tw_record ? 'exact' : 'none'
    }
  end

  { col_key:, col_name:, col_status:, col_authorship:, col_year:, col_rank:, col_code:, col_dataset_id:, alignment: }
end

.col_backbone_dataset?(dataset_id) ⇒ Boolean

Returns true when dataset_id refers to one of the CoL backbone datasets that support the classification subresource for ancestor retrieval. External/denormed datasets (e.g. Mammal Diversity Database, dataset 9802) do not have this subresource and require iterative parentId traversal instead.

Parameters:

  • dataset_id (String)

Returns:

  • (Boolean)
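
As a rough illustration of how this predicate selects the ancestor-retrieval path, the sketch below copies the two dataset keys from DATASETS into a local constant. The names `SKETCH_DATASETS`, `backbone?`, and `ancestor_strategy` are hypothetical and exist only for this example.

```ruby
# Local copy of the documented backbone dataset keys.
SKETCH_DATASETS = { col: '3LR', col_extended: '3LXR' }.freeze

def backbone?(dataset_id)
  # to_s lets callers pass Integer or nil IDs without raising.
  SKETCH_DATASETS.values.include?(dataset_id.to_s)
end

# Hypothetical helper naming the retrieval path for a dataset.
def ancestor_strategy(dataset_id)
  backbone?(dataset_id) ? :classification_subresource : :parent_id_traversal
end

puts ancestor_strategy('3LXR')
puts ancestor_strategy(9802)
```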


# File 'lib/vendor/colrapi.rb', line 164

def self.col_backbone_dataset?(dataset_id)
  DATASETS.values.include?(dataset_id.to_s)
end

.col_rank_sort(rank_name, col_code) ⇒ Object

Maps a CoL rank name ('genus', 'family', …) and CoL nomenclatural code ('zoological', 'botanical', 'bacterial', 'viral') to the TaxonWorks RANK_SORT index. Higher index = more specific rank. Returns nil when unresolvable.
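
A minimal sketch of the two-step lookup (rank name to rank class, rank class to sort index), assuming small stand-in tables. `ZOO_LOOKUP`, `BOT_LOOKUP`, `SORT_INDEX`, the class-name strings, and the index values are all invented for illustration and do not reflect the real TaxonWorks constants.

```ruby
# Hypothetical stand-ins for the per-code lookup tables.
ZOO_LOOKUP = { 'family' => 'Zoo::Family', 'genus' => 'Zoo::Genus', 'species' => 'Zoo::Species' }.freeze
BOT_LOOKUP = { 'family' => 'Bot::Family', 'genus' => 'Bot::Genus', 'species' => 'Bot::Species' }.freeze

# Hypothetical sort index: higher value = more specific rank.
SORT_INDEX = {
  'Zoo::Family' => 10, 'Zoo::Genus' => 20, 'Zoo::Species' => 30,
  'Bot::Family' => 10, 'Bot::Genus' => 20, 'Bot::Species' => 30
}.freeze

def rank_sort(rank_name, code)
  return nil if rank_name.nil? || rank_name.empty?

  # Unknown or missing codes fall back to the zoological table.
  lookup = code == 'botanical' ? BOT_LOOKUP : ZOO_LOOKUP
  rank_class = lookup[rank_name]
  rank_class && SORT_INDEX[rank_class]
end

puts rank_sort('genus', 'zoological').inspect
puts rank_sort('cohort', 'botanical').inspect
```

Returning nil for unresolvable ranks lets callers skip the comparison instead of pruning on a guess.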



# File 'lib/vendor/colrapi.rb', line 352

def self.col_rank_sort(rank_name, col_code)
  return nil if rank_name.blank?
  lookup = case col_code
           when 'zoological' then ::ICZN_LOOKUP
           when 'botanical'  then ::ICN_LOOKUP
           when 'bacterial'  then ::ICNP_LOOKUP
           when 'viral'      then ::ICVCN_LOOKUP
           else ::ICZN_LOOKUP
           end
  rank_class = lookup[rank_name]
  rank_class ? ::RANK_SORT[rank_class] : nil
end

.collection_object_scientific_name(collection_object) ⇒ Object

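Returns a text-only scientific name for the first determination by position (the most recent one): the taxon name's cached value when the OTU has a taxon name, otherwise the OTU name. Returns nil when there is no collection object or no determination. Open question: extend to buffered determinations with GNA in the middle layer?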



# File 'lib/vendor/colrapi.rb', line 368

def self.collection_object_scientific_name(collection_object)
  return nil if collection_object.nil?
  if a = collection_object.taxon_determinations.order(:position)&.first
    if a.otu.taxon_name
      a.otu.taxon_name.cached
    else
      a.otu.name
    end
  else
    nil
  end
end

.datasets(q:, limit: 20) ⇒ Array<Hash>

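Searches CoL datasets by name string. Because q may itself be a dataset key (e.g. '3LXR'), a direct lookup by ID is also attempted; a direct hit is listed first and duplicate IDs are dropped.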

Returns an array of dataset summaries, each containing at least 'id', 'title', and 'alias'. Used by the preferences UI to let users pick a target dataset.

Parameters:

  • q (String)

    dataset name search string

  • limit (Integer) (defaults to: 20)

Returns:

  • (Array<Hash>)
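
The merge of a direct-ID hit with text-search results can be sketched as below. This version uses Array#uniq with a block rather than a hand-maintained seen-hash; the helper name `merge_dataset_hits` and the sample titles are made up for illustration.

```ruby
# Put the direct hit first, drop nils, keep the first entry per 'id'.
def merge_dataset_hits(direct_hit, text_results)
  [direct_hit, *text_results].compact.uniq { |d| d['id'] }
end

# Made-up summaries in the documented { 'id', 'title', 'alias' } shape.
DIRECT = { 'id' => '3LXR', 'title' => 'Example dataset A', 'alias' => 'A' }.freeze
TEXT   = [
  { 'id' => '9802', 'title' => 'Example dataset B', 'alias' => 'B' },
  { 'id' => '3LXR', 'title' => 'Example dataset A', 'alias' => 'A' }
].freeze

MERGED = merge_dataset_hits(DIRECT, TEXT)
puts MERGED.map { |d| d['id'] }.inspect
```

Array#uniq keeps the first occurrence of each key, so the direct hit wins over a duplicate found by text search.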


# File 'lib/vendor/colrapi.rb', line 112

def self.datasets(q:, limit: 20)
  # Text search by title/name.
  text_results = begin
    result = ::Colrapi.dataset(q: q, limit: limit)
    (result['result'] || []).map { |d| { 'id' => d['key'].to_s, 'title' => d['title'], 'alias' => d['alias'] } }
  rescue => e
    Rails.logger.warn "Vendor::Colrapi.datasets text search error: #{e.message}"
    []
  end

  # Direct lookup by dataset ID — `q` may itself be a key like '3LXR'.
  # Colrapi.dataset(dataset_id:) returns a single hash, not a paged result.
  direct_hit = begin
    d = ::Colrapi.dataset(dataset_id: q)
    d.is_a?(Hash) && d['key'].present? ? { 'id' => d['key'].to_s, 'title' => d['title'], 'alias' => d['alias'] } : nil
  rescue
    nil
  end

  seen = {}
  [direct_hit, *text_results].compact.each_with_object([]) do |d, arr|
    next if seen[d['id']]
    seen[d['id']] = true
    arr << d
  end
end

.extract_subgenus_name(name) ⇒ Object

Subgenus names in CoL classification arrive as "Genus (Subgenus)" combinations. Extract just the subgenus epithet from inside the parentheses when present.
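
A standalone sketch of the extraction, using the same regular expression on a few inputs. The helper is renamed `subgenus_epithet` here only to keep the example independent of the module.

```ruby
def subgenus_epithet(name)
  return name if name.nil?

  # Capture the text between the first pair of parentheses, if any;
  # otherwise return the name unchanged.
  name[/\(([^)]+)\)/, 1] || name
end

puts subgenus_epithet('Aedes (Stegomyia)')
puts subgenus_epithet('Homo')
```

Names without parentheses pass through untouched, so the method is safe to call on any classification entry at subgenus rank.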



# File 'lib/vendor/colrapi.rb', line 344

def self.extract_subgenus_name(name)
  return name if name.nil?
  name[/\(([^)]+)\)/, 1] || name
end

.name_status(taxonworks_object, colrapi_result) ⇒ Object

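Summarizes the CoL usages returned for the scientific name of a TaxonWorks collection object and assigns a provisional status. The status defaults to :accepted, becomes :synonym when a usage with status 'synonym' has the same name as the TaxonWorks name, and is :undeterminable when the result is empty.

Returns:

  • (Hash)

    { taxonworks_name: name, col_usages: [ { usage: { name:, status: }, accepted: { name:, status: } } ], provisional_status: }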
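
The status rule can be sketched on a hand-written response. The helper `provisional_status` and the sample names ('Aus bus', 'Aus cus') are placeholders invented for this example; the response shape follows the 'usage' wrapper that this method reads.

```ruby
# Returns :undeterminable, :synonym, or :accepted for a name and response.
def provisional_status(taxonworks_name, response)
  return :undeterminable if response['total'] == 0

  synonym_hit = response['result'].any? do |entry|
    usage = entry['usage']
    usage['status'] == 'synonym' &&
      usage.dig('name', 'scientificName') == taxonworks_name
  end

  synonym_hit ? :synonym : :accepted
end

# Made-up response with one accepted usage and one synonym.
SAMPLE_RESPONSE = {
  'total' => 2,
  'result' => [
    { 'usage' => { 'status' => 'accepted', 'name' => { 'scientificName' => 'Aus bus' } } },
    { 'usage' => { 'status' => 'synonym', 'name' => { 'scientificName' => 'Aus cus' },
                   'accepted' => { 'status' => 'accepted',
                                   'name' => { 'scientificName' => 'Aus bus' } } } }
  ]
}.freeze

puts provisional_status('Aus cus', SAMPLE_RESPONSE)
```

Note that :accepted is the default rather than a confirmed match: a name that appears in no usage at all still yields :accepted as long as the result is non-empty.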



# File 'lib/vendor/colrapi.rb', line 46

def self.name_status(taxonworks_object, colrapi_result)
  o = taxonworks_object

  r = {
      taxonworks_name: collection_object_scientific_name(o),
      col_usages: [],
      provisional_status: :accepted,
  }

  if colrapi_result.dig('total') == 0
    r[:provisional_status] = :undeterminable
    return r
  end

  colrapi_result['result'].each do |u|
    i = u['usage']

    d = {
      usage: {},
      accepted: {}
    }

    d[:usage][:name] = i.dig('name', 'scientificName')
    d[:usage][:status] = i['status']

    if i['accepted']
      d[:accepted][:name] = i.dig('accepted', 'name', 'scientificName')
      d[:accepted][:status] = i.dig('accepted', 'status')
    end

    if d[:usage][:status] == 'synonym' && (d[:usage][:name] == r[:taxonworks_name])
      r[:provisional_status] = :synonym
    end

    r[:col_usages].push d
  end
  r
end

.search(name_string, dataset_id: nil) ⇒ Hash

Searches the Catalogue of Life by name string.

The Colrapi gem takes dataset_id as its first positional argument. Response structure: { 'total' => Integer, 'result' => Array }. Each result entry is a flat nameusage hash with keys:

'id', 'status', 'name' (hash with 'scientificName', 'rank', 'authorship', …),
'label', 'labelHtml', 'parentId', etc.

Parameters:

  • name_string (String)
  • dataset_id (String, nil) (defaults to: nil)

    CoL dataset ID; falls back to DATASETS[:col] when nil

Returns:

  • (Hash)

    raw Colrapi nameusage response (keys 'total', 'result')
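
Because errors are rescued into an empty response of the same shape, callers can treat success and failure uniformly. The sketch below consumes a hand-written response built from the fields documented above; the helper `result_labels` and the constants are hypothetical and no network call is made.

```ruby
# Same shape as the fallback returned when the API call fails.
EMPTY_RESPONSE = { 'total' => 0, 'result' => [] }.freeze

# Hand-written response using the documented flat nameusage keys.
SAMPLE_SEARCH = {
  'total' => 1,
  'result' => [
    { 'id' => '6MB3T', 'status' => 'accepted',
      'name' => { 'scientificName' => 'Homo sapiens', 'rank' => 'species' },
      'label' => 'Homo sapiens Linnaeus, 1758' }
  ]
}.freeze

# Build one display string per result entry.
def result_labels(response)
  response.fetch('result', []).map do |r|
    "#{r['id']}: #{r['label']} (#{r['status']})"
  end
end

puts result_labels(SAMPLE_SEARCH).inspect
puts result_labels(EMPTY_RESPONSE).inspect
```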



# File 'lib/vendor/colrapi.rb', line 96

def self.search(name_string, dataset_id: nil)
  target = dataset_id.presence || DATASETS[:col]
  ::Colrapi.nameusage(target, q: name_string, limit: 20)
rescue => e
  Rails.logger.warn "Vendor::Colrapi.search error: #{e.message}"
  { 'total' => 0, 'result' => [] }
end

.uninomial_name(name_hash) ⇒ String?

Returns the single-word name component suitable for storing as a TaxonWorks Protonym name. CoL's scientificName is the full combination (e.g. "Homo sapiens"), but TaxonWorks Protonym requires just the uninomial or epithet. Priority: infraspecificEpithet (infraspecific ranks) > specificEpithet (species) > uninomial (higher ranks) > scientificName fallback. The infraspecific epithet is checked first because a subspecies record carries both epithets.

Parameters:

  • name_hash (Hash, nil)

    the 'name' sub-hash from a CoL nameusage result

Returns:

  • (String, nil)
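
The fallback order can be exercised without Rails by replacing `presence` with a small lambda. The helper name `name_component` and the sample hashes are assumptions for illustration; only the key names come from the CoL name hash described above.

```ruby
def name_component(name_hash)
  return nil if name_hash.nil?

  # Plain-Ruby stand-in for ActiveSupport's `presence`.
  present = ->(v) { v.nil? || v.strip.empty? ? nil : v }

  present.call(name_hash['infraspecificEpithet']) ||
    present.call(name_hash['specificEpithet']) ||
    present.call(name_hash['uninomial']) ||
    name_hash['scientificName']
end

SUBSPECIES = { 'scientificName' => 'Pan troglodytes verus',
               'specificEpithet' => 'troglodytes',
               'infraspecificEpithet' => 'verus' }.freeze
SPECIES    = { 'scientificName' => 'Homo sapiens', 'specificEpithet' => 'sapiens' }.freeze
GENUS      = { 'scientificName' => 'Homo', 'uninomial' => 'Homo' }.freeze

puts name_component(SUBSPECIES)
puts name_component(SPECIES)
puts name_component(GENUS)
```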


# File 'lib/vendor/colrapi.rb', line 334

def self.uninomial_name(name_hash)
  return nil if name_hash.nil?
  name_hash['infraspecificEpithet'].presence ||
    name_hash['specificEpithet'].presence ||
    name_hash['uninomial'].presence ||
    name_hash['scientificName']
end