Module: Export::Dwca

Defined in:
lib/export/dwca.rb

Defined Under Namespace

Modules: Eml, GbifProfile, Occurrence

Constant Summary

INDEX_VERSION =

Each version is a timestamp marking a change to the indexing significant enough that all or most of the index should be regenerated. To add a version, generate a timestamp with `Time.now` in IRB.

[
  '2021-10-12 17:00:00.000000 -0500',    # First major refactor
  '2021-10-15 17:00:00.000000 -0500',    # Minor  Excludes footprintWKT, and references to GeographicArea in gazetteer; new form of media links
  '2021-11-04 17:00:00.000000 -0500',    # Minor  Removes '|', fixes some mappings
  '2021-11-08 13:00:00.000000 -0500',    # PENDING: Minor  Adds depth mappings
  '2021-11-30 13:00:00.000000 -0500',    # Fix inverted long,lat
  '2022-01-21 16:30:00.000000 -0500',    # basisOfRecord can now be FossilSpecimen; occurrenceId exporting; adds redundant time fields
  '2022-03-31 16:30:00.000000 -0500',    # collectionCode, occurrenceRemarks and various small fixes
  '2022-04-28 16:30:00.000000 -0500',    # add dwcOccurrenceStatus
  '2022-09-28 16:30:00.000000 -0500',    # add phylum, class, order, higherClassification
  '2023-04-03 16:30:00.000000 -0500',    # add associatedTaxa; updating InternalAttributes is now reflected in index
  '2023-12-14 16:30:00.000000 -0500',    # add verbatimLabel
  '2023-12-21 11:00:00.000000 -0500',    # add caste (via biocuration), identificationRemarks

  '2024-09-13 11:00:00.000000 -0500'     # enable collectionCode, object and collecting event related IDs
].freeze
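
How the application consumes this constant is not shown on this page. As a minimal sketch, assuming the newest entry is compared against the time a record was last indexed, a staleness check could look like the following. `INDEX_VERSION_EXCERPT` and `stale_index?` are hypothetical names used only for illustration.

```ruby
require 'time'

# Excerpt of INDEX_VERSION (last two entries only), copied for illustration.
INDEX_VERSION_EXCERPT = [
  '2023-12-21 11:00:00.000000 -0500',
  '2024-09-13 11:00:00.000000 -0500'
].freeze

# Hypothetical helper: true when a record was last indexed before the
# newest version timestamp and so would need to be re-indexed.
def stale_index?(indexed_at, versions = INDEX_VERSION_EXCERPT)
  indexed_at < Time.parse(versions.last)
end
```

Because entries are appended in chronological order, only the last one needs to be parsed.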

Class Method Summary

Class Method Details

.build_index_async(klass, record_scope, predicate_extensions: {}) ⇒ Object

Re-indexing a large set of records runs in a background job. To determine when the job is done, the caller polls on the last record due to be indexed.

Parameters:

  • klass (Class)
    ActiveRecord class

    e.g. CollectionObject

  • record_scope (ActiveRecord::Relation)
    An ActiveRecord scope

Returns:

  • (Hash)

    • total: total records to expect

    • start_time: the time indexing started

    • sample: Array of object global ids spread across 10 (or fewer) intervals of the recordset



# File 'lib/export/dwca.rb', line 93

def self.build_index_async(klass, record_scope, predicate_extensions: {})
  s = record_scope.order(:id)
  # The job receives the scope as a SQL string
  ::DwcaCreateIndexJob.perform_later(klass.to_s, sql_scope: s.to_sql)
  index_metadata(klass, s)
end
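
The polling described above could be sketched as follows. This is an illustration under stated assumptions, not the application's own code: `indexed_at` is a hypothetical callable that maps a global id String to the Time its record was last indexed (or nil), and both method names are invented here.

```ruby
# metadata:   the Hash returned by build_index_async (:total, :start_time, :sample).
# indexed_at: hypothetical callable, global id String -> Time or nil.

# True once the last sampled record carries an index time at or after the
# time indexing started.
def indexing_finished?(metadata, indexed_at)
  last_gid = metadata[:sample].last
  return true if last_gid.nil? # empty scope: nothing to wait for

  stamp = indexed_at.call(last_gid)
  !stamp.nil? && stamp >= metadata[:start_time]
end

# Fraction of the sampled records indexed since start_time, from 0.0 to 1.0.
def indexing_progress(metadata, indexed_at)
  sample = metadata[:sample]
  return 1.0 if sample.empty?

  done = sample.count do |gid|
    stamp = indexed_at.call(gid)
    stamp && stamp >= metadata[:start_time]
  end
  done.to_f / sample.size
end
```

Since the scope is ordered by id before it is handed to the job, the intermediate entries in the sample give a rough progress estimate while the last entry signals completion.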

.download_async(record_scope, request = nil, extension_scopes: {}, predicate_extensions: {}, taxonworks_extensions: {}, project_id: nil) ⇒ Download

Creates a DwC-A download asynchronously by enqueuing a job.

Parameters:

  • record_scope (ActiveRecord::Relation)

    A relation that returns DwcOccurrence records.

  • request (String, nil) (defaults to: nil)

    Optional. The request URI path this download was generated from.

  • extension_scopes (Hash) (defaults to: {})

    Optional extensions to include:

    • :biological_associations [Hash] with keys :core_params and :collection_objects_query

    • :media [Hash] with keys :collection_objects (query string) and :field_occurrences (query string)

  • predicate_extensions (Hash) (defaults to: {})

    Predicate IDs to include:

    • :collection_object_predicate_id [Array<Integer>]

    • :collecting_event_predicate_id [Array<Integer>]

  • taxonworks_extensions (Array<Symbol>) (defaults to: {})

    TaxonWorks-specific fields to export (e.g., [:otu_name, :elevation_precision]).

  • project_id (Integer) (defaults to: nil)

    Required. The project ID for scoping queries.

Returns:

  • (Download)

    The Download object that will contain the archive when ready.

Raises:

  • (TaxonWorks::Error)

    If project_id is nil.


# File 'lib/export/dwca.rb', line 54

def self.download_async(record_scope, request = nil, extension_scopes: {}, predicate_extensions: {}, taxonworks_extensions: {}, project_id: nil)
  raise TaxonWorks::Error, 'project_id is required in Export::Dwca::download_async!' if project_id.nil?

  name = "dwc-a_#{DateTime.now}.zip"

  # TODO: move fixed attributes to model
  download = ::Download::DwcArchive.create!(
    name: "DwC Archive generated at #{Time.now.utc}.",
    description: 'A Darwin Core archive.',
    filename: name,
    request:,
    expires: 2.days.from_now,
    total_records: record_scope.size # Was having problems with count(). TODO: increment afterwards when extensions are allowed.
  )

  # Note we pass a string with the record scope
  ::DwcaCreateDownloadJob.perform_later(
    download.id,
    core_scope: record_scope.to_sql,
    extension_scopes:,
    predicate_extensions:,
    taxonworks_extensions:,
    project_id:
  )

  download
end
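
To make the argument shapes documented above concrete, the snippet below assembles one possible set of keyword arguments. All values are placeholders chosen for illustration (the ids, the query strings and the contents of :core_params are assumptions), and the call itself is left as a comment because it needs a running TaxonWorks environment.

```ruby
# Placeholder values only; real ids and query strings come from the project.
options = {
  extension_scopes: {
    biological_associations: {
      core_params: {},                                   # placeholder
      collection_objects_query: 'placeholder query string'
    },
    media: {
      collection_objects: 'placeholder query string',
      field_occurrences: 'placeholder query string'
    }
  },
  predicate_extensions: {
    collection_object_predicate_id: [1, 2],
    collecting_event_predicate_id: [3]
  },
  taxonworks_extensions: [:otu_name, :elevation_precision],
  project_id: 1 # required; the method raises TaxonWorks::Error when nil
}

# In an application console the call would then take this shape:
#   scope = DwcOccurrence.where(project_id: options[:project_id])
#   download = Export::Dwca.download_async(scope, nil, **options)
```

The returned Download is created before the job runs, so the archive is only available once the job has finished.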

.index_metadata(klass, record_scope) ⇒ Hash{Symbol=>Integer, Time, Array}

Returns:

  • (Hash{Symbol=>Integer, Time, Array})

    Keys :total, :start_time and :sample, as described for .build_index_async.


# File 'lib/export/dwca.rb', line 100

def self.index_metadata(klass, record_scope)
  a = record_scope.first&.to_global_id&.to_s  # TODO: this should be UUID?
  b = record_scope.last&.to_global_id&.to_s # TODO: this should be UUID?

  t = record_scope.size # was having problems with count

  metadata = {
    total: t,
    start_time: Time.zone.now,
    sample: [a, b].compact
  }

  if b && (t > 2)
    max = 9
    max = t if t < 9

    # Keep every (count / max)-th row, at most max rows in total
    ids = klass
      .select('*')
      .from("(select id, type, ROW_NUMBER() OVER (ORDER BY id ASC) rn from (#{record_scope.to_sql}) b ) a")
      .where("a.rn % ((SELECT COUNT(*) FROM (#{record_scope.to_sql}) c) / #{max}) = 0")
      .limit(max)
      .collect{|o| o.to_global_id.to_s}

    # Place the sampled ids between the first and last record
    metadata[:sample].insert(1, *ids)
  end

  metadata[:sample].uniq!
  metadata
end
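
The SQL in this method keeps every row whose row number is a multiple of count / max, capped at max rows, and places those ids between the first and last record. A plain-Ruby sketch of the same selection over an array of integer ids is given below. It assumes the scope is ordered by id, as in .build_index_async, and that the database returns the filtered rows in row-number order. `sample_ids` is a hypothetical helper, not part of Export::Dwca, and it returns integers rather than global id strings.

```ruby
# Hypothetical stand-in for the sampling done in SQL by index_metadata.
def sample_ids(ids, max = 9)
  sorted = ids.sort
  t = sorted.size
  sample = [sorted.first, sorted.last].compact

  if t > 2
    max = t if t < max
    step = t / max # integer division, always >= 1 here

    # Row numbers are 1-based, matching ROW_NUMBER() in the SQL.
    picked = sorted
      .each_with_index
      .select { |_id, i| ((i + 1) % step).zero? }
      .map(&:first)
      .first(max)

    sample.insert(1, *picked)
  end

  sample.uniq
end
```

For ids 1 through 20 the step is 2, so the sketch returns the first id, the nine sampled ids 2 through 18, and the last id: eleven points marking ten intervals.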