Class: Export::Dwca::Occurrence::Data

Inherits:
Object
Includes:
PostgresqlFunctions, SqlFragments
Defined in:
lib/export/dwca/occurrence/data.rb

Overview

Warning: This export does not currently support AssertedDistribution data. Although those data are indexed, if they are in the core_scope they will almost certainly cause problems or be ignored.

Wrapper to build DwC-A zip files for a specific project. See tasks/accessions/report/dwc_controller.rb for use.

With help from thinkingeek.com/2013/11/15/create-temporary-zip-file-send-response-rails/

Usage:

begin
  data = Export::Dwca::Occurrence::Data.new(
    core_scope: DwcOccurrence.where(project_id: sessions_current_project_id)
  )
  # ... use data.zipfile here ...
ensure
  data&.cleanup # data is nil if the constructor raised
end

Always use the ensure/data.cleanup pattern!

Instance Method Summary

Methods included from PostgresqlFunctions

#create_api_link_for_model_id_function, #create_authorship_sentence_function, #create_csv_sanitize_function, #create_image_file_url_function, #create_image_metadata_url_function, #create_image_url_functions, #create_sled_image_file_url_function

Methods included from SqlFragments

#copyright_label_from_temp_sql, #image_occurrence_resolution_joins_sql, #media_identifier_joins_sql, #media_identifier_sql, #sound_occurrence_resolution_joins_sql

Constructor Details

#initialize(core_scope: nil, extension_scopes: {}, predicate_extensions: {}, eml_data: {}, taxonworks_extensions: []) ⇒ Data

Initializes a new DwC-A export data builder.

Parameters:

  • core_scope (String, ActiveRecord::Relation) (defaults to: nil)

    Required. DwcOccurrence scope (SQL string or ActiveRecord::Relation).

  • extension_scopes (Hash) (defaults to: {})

    Optional extensions to include:

    • :biological_associations [Hash] with keys :core_params and :collection_objects_query

    • :media [Hash] with keys :collection_objects (query string) and :field_occurrences (query string)

  • predicate_extensions (Hash) (defaults to: {})

    Predicate IDs to include:

    • :collection_object_predicate_id [Array<Integer>]

    • :collecting_event_predicate_id [Array<Integer>]

  • eml_data (Hash) (defaults to: {})

    EML metadata configuration:

    • :dataset [String] XML string for dataset metadata

    • :additional_metadata [String] XML string for additional metadata

  • taxonworks_extensions (Array<Symbol>) (defaults to: [])

    TaxonWorks-specific fields to export (e.g., [:otu_name, :elevation_precision]).

Raises:

  • (ArgumentError)


# File 'lib/export/dwca/occurrence/data.rb', line 115

def initialize(
  core_scope: nil, extension_scopes: {}, predicate_extensions: {},
  eml_data: {}, taxonworks_extensions: []
)
  raise ArgumentError, 'must pass a core_scope' if core_scope.nil?

  @core_scope = core_scope

  @biological_associations_extension = extension_scopes[:biological_associations]
  @media_extension = extension_scopes[:media]

  @data_predicate_ids = { collection_object_predicate_id: [], collecting_event_predicate_id: [] }.merge(predicate_extensions)

  @eml_data = eml_data

  # Normalize and sort extensions into a fixed, canonical order.
  extensions = Array(taxonworks_extensions).map(&:to_sym)
  canonical  = ::CollectionObject::DwcExtensions::TaxonworksExtensions::EXTENSION_FIELDS

  @taxonworks_extension_methods = canonical & extensions
end
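
The normalization at the end of the constructor can be illustrated in isolation. The following is a minimal sketch, assuming a two-element stand-in (CANONICAL_FIELDS) for EXTENSION_FIELDS; the helper names normalize_extensions and normalize_predicate_ids are hypothetical and are not part of the class.

```ruby
# Hypothetical stand-in for
# ::CollectionObject::DwcExtensions::TaxonworksExtensions::EXTENSION_FIELDS;
# the real constant's contents and order may differ.
CANONICAL_FIELDS = %i[otu_name elevation_precision].freeze

DEFAULT_PREDICATE_IDS = {
  collection_object_predicate_id: [],
  collecting_event_predicate_id: []
}.freeze

# Keep only known fields, in canonical order, regardless of request order.
# Array#& preserves the order of the receiver, here the canonical list.
def normalize_extensions(requested, canonical = CANONICAL_FIELDS)
  canonical & Array(requested).map(&:to_sym)
end

# Fill in any predicate key the caller left out.
def normalize_predicate_ids(predicate_extensions)
  DEFAULT_PREDICATE_IDS.merge(predicate_extensions)
end

extensions = normalize_extensions(['elevation_precision', :otu_name, :not_a_field])
predicate_ids = normalize_predicate_ids(collection_object_predicate_id: [12, 34])

p extensions
p predicate_ids
```

Unknown names are dropped silently rather than raising, and strings are accepted as well as symbols, so callers can pass request parameters through without reordering them first.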

Instance Attribute Details

#all_data ⇒ Tempfile

Generates and caches the combined data file by joining core, predicate, and extension data horizontally.

Returns:

  • (Tempfile)

    Combined TSV file with all data joined side-by-side.



# File 'lib/export/dwca/occurrence/data.rb', line 92

def all_data
  @all_data
end
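
What "joined horizontally" means for tab-separated files can be sketched as follows. This is not the exporter's implementation, which is not shown in this section; it is a small illustration that assumes every input file has the same number of lines in the same row order. The helpers tsv_tempfile and join_horizontally, and the habitat column, are hypothetical.

```ruby
require 'tempfile'

# Writes content to a new tempfile and returns it, flushed and ready to read.
def tsv_tempfile(name, content)
  file = Tempfile.new(name)
  file.write(content)
  file.flush
  file
end

# Joins tab-separated files side by side, line by line (like Unix `paste`).
# Assumes every input has the same number of lines in the same row order.
def join_horizontally(inputs, output)
  columns = inputs.map { |file| File.readlines(file.path, chomp: true) }
  columns.first.zip(*columns.drop(1)).each do |cells|
    output.puts(cells.join("\t"))
  end
  output.flush
  output.rewind
  output
end

core = tsv_tempfile('core.tsv', "id\tcountry\nuuid-1\tPeru\n")
extra = tsv_tempfile('extra.tsv', "habitat\nforest\n") # hypothetical column
combined = join_horizontally([core, extra], Tempfile.new('all_data.tsv'))
joined_text = combined.read
puts joined_text

# Same discipline as the class itself: close and delete every tempfile.
[core, extra, combined].each do |file|
  file.close
  file.unlink
end
```

This sketch reads each input fully into memory, which is fine for an illustration but would not suit large exports; a streaming approach would be preferable there.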

#core_scope ⇒ ActiveRecord::Relation

Normalizes and returns the core scope as an ordered ActiveRecord::Relation.

Returns:

  • (ActiveRecord::Relation)

    DwcOccurrence scope ordered by id



# File 'lib/export/dwca/occurrence/data.rb', line 60

def core_scope
  @core_scope
end

#data ⇒ Tempfile

Generates and caches the core occurrence data as TSV.

Returns:

  • (Tempfile)

    The core occurrence data, tab-separated, as a tempfile.



# File 'lib/export/dwca/occurrence/data.rb', line 38

def data
  @data
end

#data_predicate_ids ⇒ Hash

Returns the predicate IDs to include: { collection_object_predicate_id: [], collecting_event_predicate_id: [] }.

Returns:

  • (Hash)

    Predicate IDs to include: { collection_object_predicate_id: [], collecting_event_predicate_id: [] }



# File 'lib/export/dwca/occurrence/data.rb', line 77

def data_predicate_ids
  @data_predicate_ids
end

#eml ⇒ Tempfile

Note:

This is a stub implementation; users may prefer to use IPT.

Generates and caches the eml.xml file. TODO: reference biological_resource_extension.csv

Returns:

  • (Tempfile)

    The EML metadata file (uses stub if no eml_data provided)



# File 'lib/export/dwca/occurrence/data.rb', line 47

def eml
  @eml
end

#eml_data ⇒ Hash

Returns the input configuration containing :dataset and :additional_metadata as XML strings, for use in constructing the EML file.

Returns:

  • (Hash)

    Input configuration containing :dataset and :additional_metadata as xml strings, for use in construction of the eml file.



# File 'lib/export/dwca/occurrence/data.rb', line 43

def eml_data
  @eml_data
end

#filename ⇒ String (readonly)

Generates and caches the filename for the zip archive.

Returns:

  • (String)

    The filename with timestamp.



# File 'lib/export/dwca/occurrence/data.rb', line 68

def filename
  @filename
end

#meta ⇒ Tempfile

Generates and caches the meta.xml file describing the DwC-A structure.

Returns:

  • (Tempfile)

    The meta.xml file with core and extension definitions.



# File 'lib/export/dwca/occurrence/data.rb', line 51

def meta
  @meta
end

#predicate_data ⇒ Tempfile

Generates and caches predicate data as TSV.

Returns:

  • (Tempfile)

    TSV file with predicate data columns.



# File 'lib/export/dwca/occurrence/data.rb', line 72

def predicate_data
  @predicate_data
end

#taxonworks_extension_data ⇒ Tempfile

Generates and caches TaxonWorks extension data as TSV.

Returns:

  • (Tempfile)

    TSV file with TaxonWorks-specific extension fields.



# File 'lib/export/dwca/occurrence/data.rb', line 81

def taxonworks_extension_data
  @taxonworks_extension_data
end

#taxonworks_extension_methods ⇒ Array<Symbol>

Returns the list of TaxonWorks-specific field names (e.g., :otu_name, :elevation_precision) to export as additional columns (a subset of EXTENSION_FIELDS).

Returns:

  • (Array<Symbol>)

    List of TaxonWorks-specific field names (e.g., :otu_name, :elevation_precision) to export as additional columns (subset of EXTENSION_FIELDS).



# File 'lib/export/dwca/occurrence/data.rb', line 87

def taxonworks_extension_methods
  @taxonworks_extension_methods
end

#total ⇒ Integer

Returns the total number of records in the core scope.

Returns:

  • (Integer)

    Total number of records in the core scope.



# File 'lib/export/dwca/occurrence/data.rb', line 64

def total
  @total
end

#zipfile ⇒ Tempfile

Generates and caches the final DwC-A zip archive.

Returns:

  • (Tempfile)

    The complete DwC-A zip file.



# File 'lib/export/dwca/occurrence/data.rb', line 55

def zipfile
  @zipfile
end

Instance Method Details

#biological_association_relations_to_core ⇒ Object



# File 'lib/export/dwca/occurrence/data.rb', line 428

def biological_association_relations_to_core
  core_params = {
    dwc_occurrence_query: @biological_associations_extension[:core_params]
  }

  subject_biological_associations =
    ::Queries::BiologicalAssociation::Filter.new(
      collection_object_query: core_params,
      collection_object_as_subject_or_as_object: :subject
    ).all

  object_biological_associations =
    ::Queries::BiologicalAssociation::Filter.new(
      collection_object_query: core_params,
      collection_object_as_subject_or_as_object: :object
    ).all

  {
    subject: Set.new(subject_biological_associations.pluck(:id)),
    object: Set.new(object_biological_associations.pluck(:id))
  }
end

#biological_associations_resource_relationship_tmp ⇒ Object



# File 'lib/export/dwca/occurrence/data.rb', line 451

def biological_associations_resource_relationship_tmp
  return nil if @biological_associations_extension.nil?

  @biological_associations_resource_relationship_tmp = Tempfile.new('biological_resource_relationship.tsv')

  if no_records?
    @biological_associations_resource_relationship_tmp.write("\n")
  else
    Export::CSV::Dwc::Extension::BiologicalAssociations.csv(
      biological_associations_scope,
      biological_association_relations_to_core,
      output_file: @biological_associations_resource_relationship_tmp
    )
  end

  @biological_associations_resource_relationship_tmp.flush
  @biological_associations_resource_relationship_tmp.rewind
  @biological_associations_resource_relationship_tmp
end

#biological_associations_scope ⇒ ActiveRecord::Relation?

Transforms the biological associations config into an AR scope.

Returns:

  • (ActiveRecord::Relation, nil)

    BiologicalAssociation scope with biological_association_index, or nil if not configured.



# File 'lib/export/dwca/occurrence/data.rb', line 162

def biological_associations_scope
  return nil unless @biological_associations_extension.present?

  q = @biological_associations_extension[:collection_objects_query]
  scope = if q.kind_of?(String)
    ::BiologicalAssociation.from('(' + q + ') as biological_associations')
  elsif q.kind_of?(ActiveRecord::Relation)
    q
  else
    raise ArgumentError, 'Biological associations scope is not an SQL string or ActiveRecord::Relation'
  end

  scope
    .joins(:biological_association_index)
    .select('biological_associations.id')
    .includes(:biological_association_index)
end

#build_zip ⇒ Object



# File 'lib/export/dwca/occurrence/data.rb', line 570

def build_zip
  t = Tempfile.new(filename)

  Zip::OutputStream.open(t) { |zos| }

  Zip::File.open(t.path, create: true) do |zip|
    zip.add('data.tsv', all_data.path)

    zip.add('media.tsv', media_tmp.path) if @media_extension
    zip.add('resource_relationships.tsv', biological_associations_resource_relationship_tmp.path) if @biological_associations_extension

    zip.add('meta.xml', meta.path)
    zip.add('eml.xml', eml.path)
  end
  t
end

#cleanup ⇒ True

Closes and deletes all temporary files that have been created.

Returns:

  • (True)

    Always true, once the created temporary files have been closed and deleted.



# File 'lib/export/dwca/occurrence/data.rb', line 605

def cleanup

  Rails.logger.debug 'dwca_export: cleanup start'

  # Only cleanup files that were actually created (materialized).
  # This prevents lazy-loading during cleanup.
  if defined?(@zipfile) && @zipfile
    @zipfile.close
    @zipfile.unlink
  end

  if defined?(@meta) && @meta
    @meta.close
    @meta.unlink
  end

  if defined?(@eml) && @eml
    @eml.close
    @eml.unlink
  end

  if defined?(@data) && @data
    @data.close
    @data.unlink
  end

  if @biological_associations_extension && defined?(@biological_associations_resource_relationship_tmp) && @biological_associations_resource_relationship_tmp
    @biological_associations_resource_relationship_tmp.close
    @biological_associations_resource_relationship_tmp.unlink
  end

  if @media_extension && defined?(@media_tmp) && @media_tmp
    @media_tmp.close
    @media_tmp.unlink
  end

  if predicate_options_present? && defined?(@predicate_data) && @predicate_data
    @predicate_data.close
    @predicate_data.unlink
  end

  if taxonworks_options_present? && defined?(@taxonworks_extension_data) && @taxonworks_extension_data
    @taxonworks_extension_data.close
    @taxonworks_extension_data.unlink
  end

  if defined?(@all_data) && @all_data
    @all_data.close
    @all_data.unlink
  end

  Rails.logger.debug 'dwca_export: cleanup end'

  true
end
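
The guard used throughout cleanup, defined?(@ivar) && @ivar, checks the instance variable directly instead of calling the reader, so a file that was never built is not generated just to be deleted. The following is a minimal sketch of that pattern, assuming a hypothetical LazyExport class with a single lazily built tempfile; it is not part of the exporter.

```ruby
require 'tempfile'

# Hypothetical miniature of the exporter: one lazily built tempfile.
class LazyExport
  # Builds the tempfile on first access and caches it.
  def report
    @report ||= Tempfile.new('report.tsv')
  end

  # Checks the instance variable rather than calling #report, so cleanup
  # never builds a file only to delete it.
  def cleanup
    if defined?(@report) && @report
      @report.close
      @report.unlink
    end
    true
  end

  # True only if #report has been called at least once.
  def materialized?
    !!(defined?(@report) && @report)
  end
end

untouched = LazyExport.new
untouched.cleanup # nothing was built, so nothing is created or removed

built = LazyExport.new
path = built.report.path # materializes the tempfile
cleanup_result = built.cleanup

puts untouched.materialized?
puts File.exist?(path)
puts cleanup_result
```

Calling the reader inside cleanup instead would trigger the lazy build, which in the real class could mean running the whole export during teardown.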

#collecting_event_predicate_ids ⇒ Object



# File 'lib/export/dwca/occurrence/data.rb', line 155

def collecting_event_predicate_ids
  @data_predicate_ids[:collecting_event_predicate_id]
end

#collection_object_predicate_ids ⇒ Object



# File 'lib/export/dwca/occurrence/data.rb', line 151

def collection_object_predicate_ids
  @data_predicate_ids[:collection_object_predicate_id]
end

#columns_with_data(columns) ⇒ Object

Find which columns in the dwc_occurrence table have non-NULL, non-empty values. This implements the trim_columns: true behavior. Note: We check for non-empty AFTER sanitization.



# File 'lib/export/dwca/occurrence/data.rb', line 264

def columns_with_data(columns)
  return [] if columns.empty?

  conn = ActiveRecord::Base.connection

  column_types = ::DwcOccurrence.columns_hash

  checks = columns.map.with_index do |col, idx|
    quoted = conn.quote_column_name(col)
    col_str = col.to_s
    column_info = column_types[col_str]

    is_string_column = column_info && [:string, :text].include?(column_info.type)
    if is_string_column
      # String columns - check if any non-empty values after sanitization.
      sanitized = "pg_temp.sanitize_csv(#{quoted})"
      "CASE WHEN COUNT(CASE WHEN #{sanitized} IS NOT NULL AND #{sanitized} != '' THEN 1 END) > 0 THEN #{conn.quote(col)} ELSE NULL END AS check_#{idx}"
    else
      # Non-string columns - just check if not NULL.
      "CASE WHEN COUNT(#{quoted}) > 0 THEN #{conn.quote(col_str)} ELSE NULL END AS check_#{idx}"
    end
  end

  sql = "SELECT #{checks.join(', ')} FROM (#{core_scope.to_sql}) AS data"
  result = conn.execute(sql).first
  result.values.compact
end
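
The same column-trimming rule can be shown on in-memory rows. This is a sketch only: the body of pg_temp.sanitize_csv is not shown in this section, so the sanitize helper below assumes it replaces tabs and newlines with spaces and trims the result. The helper names and sample rows are hypothetical.

```ruby
# In-memory stand-in for pg_temp.sanitize_csv. Assumption: tabs and newlines
# become spaces and surrounding whitespace is trimmed; the real SQL function
# may behave differently.
def sanitize(value)
  value.is_a?(String) ? value.gsub(/[\t\r\n]/, ' ').strip : value
end

# Returns the columns that have at least one non-nil, non-empty value after
# sanitization, preserving the order of the requested columns.
def columns_with_values(rows, columns)
  columns.select do |col|
    rows.any? do |row|
      value = sanitize(row[col])
      !value.nil? && value != ''
    end
  end
end

rows = [
  { 'country' => 'Peru', 'county' => "\n", 'individualCount' => nil },
  { 'country' => nil,    'county' => '',   'individualCount' => 0 }
]

kept = columns_with_values(rows, %w[country county individualCount])
p kept
```

Here county is dropped because its only values are empty once sanitized, while individualCount is kept because a numeric 0 is a real value. The SQL version does the equivalent work in a single aggregate query instead of loading rows into Ruby.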

#csv(output_file:) ⇒ Object

Streams tab-delimited CSV data from PostgreSQL directly to output_file.

Parameters:

  • output_file (File, Tempfile)

    File to write to directly



# File 'lib/export/dwca/occurrence/data.rb', line 194

def csv(output_file:)
  conn = ActiveRecord::Base.connection

  create_csv_sanitize_function

  target_cols = ::DwcOccurrence.target_columns
  excluded = ::DwcOccurrence.excluded_columns

  cols_to_export = target_cols - excluded

  cols_with_data = columns_with_data(cols_to_export)

  column_order = (::CollectionObject::DWC_OCCURRENCE_MAP.keys +
    ::CollectionObject::EXTENSION_FIELDS).map(&:to_s)
  ordered_cols = order_columns(cols_with_data, column_order)

  column_types = ::DwcOccurrence.columns_hash

  # Build SELECT list with proper column names and aliases.
  # Sanitize string columns by replacing newlines and tabs with spaces
  # (matching Utilities::Strings.sanitize_for_csv behavior).
  select_list = ordered_cols.map do |col|
    if col == 'id'
      # DwCA requires the <id> column specified in meta.xml to be named "id"
      # (not "occurrenceID") for extension records to join correctly (see
      # commit 444262503d).
      # We copy the occurrenceID column to the id so that id's are proper
      # UUIDs (not db ids), then also include occurrenceID as a proper DwC
      # term field. This means both columns contain the same values - it
      # seems to be required in this case.
      '"occurrenceID" AS "id"'
    elsif col == 'dwcClass'
      # Header converter: dwcClass -> class (aliased as-is; this branch does not call sanitize_csv)
      '"dwcClass" AS "class"'
    else
      column_info = column_types[col]
      is_string_column = column_info && [:string, :text].include?(column_info.type)

      if is_string_column
        # String columns - sanitize by replacing newlines and tabs.
        "pg_temp.sanitize_csv(#{conn.quote_column_name(col)}) AS #{conn.quote_column_name(col)}"
      else
        # Non-string columns (integer, decimal, date, etc.) - no
        # sanitization needed.
        conn.quote_column_name(col)
      end
    end
  end.join(', ')

  copy_sql = <<-SQL
    COPY (
      SELECT #{select_list}
      FROM (#{core_scope.to_sql}) AS dwc_data
    ) TO STDOUT WITH (FORMAT CSV, DELIMITER E'\\t', HEADER, NULL '')
  SQL

  # Stream directly from PostgreSQL to file
  conn.raw_connection.copy_data(copy_sql) do
    while row = conn.raw_connection.get_copy_data
      output_file.write(row.force_encoding(Encoding::UTF_8))
    end
  end

  Rails.logger.debug 'dwca_export: csv data generated'
end
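
To make the shape of the generated SELECT list concrete, the branching above can be reproduced without a database connection. This is a sketch under stated assumptions: quote_ident is a simplified, hypothetical replacement for the connection's quote_column_name, and the set of string columns is passed in by hand rather than read from DwcOccurrence.columns_hash.

```ruby
# Minimal identifier quoting for this sketch only; real code should use the
# connection's quote_column_name, as the method above does.
def quote_ident(name)
  '"' + name.gsub('"', '""') + '"'
end

# Builds the SELECT list: id is an alias of occurrenceID, dwcClass is renamed
# to class, string columns are wrapped in the sanitize function, and all other
# columns are selected as-is.
def build_select_list(ordered_cols, string_columns)
  ordered_cols.map do |col|
    case col
    when 'id' then '"occurrenceID" AS "id"'
    when 'dwcClass' then '"dwcClass" AS "class"'
    else
      quoted = quote_ident(col)
      if string_columns.include?(col)
        "pg_temp.sanitize_csv(#{quoted}) AS #{quoted}"
      else
        quoted
      end
    end
  end.join(', ')
end

select_list = build_select_list(
  %w[id occurrenceID dwcClass country individualCount],
  %w[occurrenceID country]
)
puts select_list
```

The id and occurrenceID entries both read from the occurrenceID column, which is why the exported file carries the same UUID in two columns.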

#media_tmp ⇒ Object



# File 'lib/export/dwca/occurrence/data.rb', line 471

def media_tmp
  return nil if @media_extension.nil?

  @media_tmp = Tempfile.new('media.tsv')

  if no_records?
    @media_tmp.write("\n")
  else
    exporter = Export::Dwca::Occurrence::MediaExporter.new(
      media_extension: @media_extension,
    )
    exporter.export_to(@media_tmp)
  end

  @media_tmp.flush
  @media_tmp.rewind
  @media_tmp
end

#meta_fields ⇒ Array

Non-standard DwC columns are handled elsewhere.

Returns:

  • (Array)

    Headers read from the temporarily written, refined data file, minus the leading id column, for use when writing meta.xml.



# File 'lib/export/dwca/occurrence/data.rb', line 493

def meta_fields
  return [] if no_records?
  h = File.open(all_data, &:gets)&.strip&.split("\t")
  # Remove "id" column from field list since it's declared separately as
  # <id> in meta.xml.
  # The remaining fields become <field> elements.
  h&.shift
  h || []
end
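
The header-reading step can be tried on a small file. The sketch below assumes a hypothetical helper, header_fields, that takes a path instead of relying on all_data and no_records?; the sample data is made up.

```ruby
require 'tempfile'

# Reads only the first line of a tab-separated file and returns its headers
# without the leading "id" column. Returns [] for an empty file.
def header_fields(path)
  header = File.open(path, &:gets)&.strip&.split("\t")
  header&.shift # drop "id", which meta.xml declares separately as <id>
  header || []
end

file = Tempfile.new('all_data.tsv')
file.write("id\toccurrenceID\tcountry\nuuid-1\tuuid-1\tPeru\n")
file.flush

empty = Tempfile.new('empty.tsv')

fields = header_fields(file.path)
empty_fields = header_fields(empty.path)
p fields
p empty_fields

[file, empty].each do |f|
  f.close
  f.unlink
end
```

Because only the first line is read, the cost does not grow with the size of the export. The safe-navigation calls cover the empty-file case, where gets returns nil.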

#no_records? ⇒ Boolean

Returns true if the provided core_scope returns no records.

Returns:

  • (Boolean)

    true if provided core_scope returns no records.



# File 'lib/export/dwca/occurrence/data.rb', line 311

def no_records?
  total == 0
end

#order_columns(columns, column_order) ⇒ Object

Order columns according to column_order, with unordered columns first. This matches the behavior of Export::CSV.sort_column_headers.



# File 'lib/export/dwca/occurrence/data.rb', line 294

def order_columns(columns, column_order)
  sorted = []
  unsorted = []

  columns.each do |col|
    if pos = column_order.index(col)
      sorted[pos] = col
    else
      unsorted.push col
    end
  end

  unsorted + sorted.compact
end
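
A short worked example may make the ordering rule easier to see. The method body is copied from the listing above (with parentheses added around the assignment in the condition); the column names and the column_order list are made up for illustration.

```ruby
# Same logic as the method above: columns found in column_order are placed at
# their position in that list; all others keep their input order and go first.
def order_columns(columns, column_order)
  sorted = []
  unsorted = []

  columns.each do |col|
    if (pos = column_order.index(col))
      sorted[pos] = col
    else
      unsorted.push(col)
    end
  end

  unsorted + sorted.compact
end

column_order = %w[id occurrenceID country] # hypothetical canonical order
ordered = order_columns(%w[country custom_field id], column_order)
p ordered
```

Here custom_field is not in column_order, so it is emitted first, followed by id and country in canonical order. Gaps left in the sorted array by absent columns, such as occurrenceID in this example, are removed by compact.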

#package_download(download) ⇒ Object

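Parameters:

  • download

    Object updated via update!(source_file_path: ...) with the path of the generated zipfile.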



# File 'lib/export/dwca/occurrence/data.rb', line 662

def package_download(download)
  p = zipfile.path

  # This doesn't touch the db (source_file_path is an instance var).
  download.update!(source_file_path: p)
end

#predicate_options_present? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/export/dwca/occurrence/data.rb', line 180

def predicate_options_present?
  data_predicate_ids[:collection_object_predicate_id].present? || data_predicate_ids[:collecting_event_predicate_id].present?
end

#taxonworks_options_present? ⇒ Boolean

Returns:

  • (Boolean)


# File 'lib/export/dwca/occurrence/data.rb', line 184

def taxonworks_options_present?
  taxonworks_extension_methods.present?
end