Class: Export::Dwca::Occurrence::Data
- Inherits:
-
Object
- Object
- Export::Dwca::Occurrence::Data
- Includes:
- PostgresqlFunctions, SqlFragments
- Defined in:
- lib/export/dwca/occurrence/data.rb
Overview
Warning: this export does not currently support AssertedDistribution data. While those data are indexed, if they are in the `core_scope` they will almost certainly cause problems or be ignored.
Wrapper to build DwC-A zip files for a specific project. See tasks/accessions/report/dwc_controller.rb for use.
With help from thinkingeek.com/2013/11/15/create-temporary-zip-file-send-response-rails/
Usage:
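begin
  # The constructor takes keyword arguments; core_scope is required.
  data = Export::Dwca::Occurrence::Data.new(
    core_scope: DwcOccurrence.where(project_id: sessions_current_project_id)
  )
ensure
  # Safe navigation covers the case where the constructor itself raised.
  data&.cleanup
end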
Always use the ensure/data.cleanup pattern!
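The pattern matters because the export materializes Tempfiles that are only removed when cleanup is called. The following is a minimal sketch of the same shape, not the TaxonWorks implementation: FakeExport, with_export and failing_export_path are hypothetical names introduced only for illustration.

```ruby
require 'tempfile'

# Hypothetical stand-in for an export object that owns a temporary file.
class FakeExport
  attr_reader :file

  def initialize
    @file = Tempfile.new('fake_export')
  end

  # Mirrors the documented contract of cleanup: close, delete, return true.
  def cleanup
    @file.close
    @file.unlink
    true
  end
end

# Yields a fresh export and always cleans it up, even if the block raises.
def with_export
  export = FakeExport.new
  yield export
ensure
  # `&.` guards against the constructor itself having raised.
  export&.cleanup
end

# Simulates a failure part-way through a build and returns the path the
# temporary file had, so a caller can check that it was removed.
def failing_export_path
  path = nil
  with_export do |export|
    path = export.file.path
    raise 'simulated failure while building the archive'
  end
rescue RuntimeError
  path
end
```

Calling failing_export_path returns a path that no longer exists on disk, which is the property the ensure block is there to guarantee.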
Instance Attribute Summary
-
#all_data ⇒ Tempfile
Generates and caches the combined data file by joining core, predicate, and extension data horizontally.
-
#core_scope ⇒ ActiveRecord::Relation
Normalizes and returns the core scope as an ordered ActiveRecord::Relation.
-
#data ⇒ Tempfile
Generates and caches the core occurrence data as TSV.
-
#data_predicate_ids ⇒ Hash
Predicate IDs to include: { collection_object_predicate_id: [], collecting_event_predicate_id: [] }.
-
#eml ⇒ Tempfile
Generates and caches the eml.xml file.
-
#eml_data ⇒ Hash
Input configuration containing :dataset and :additional_metadata as xml strings, for use in construction of the eml file.
-
#filename ⇒ String
readonly
Generates and caches the filename for the zip archive.
-
#meta ⇒ Tempfile
Generates and caches the meta.xml file describing the DwC-A structure.
-
#predicate_data ⇒ Tempfile
Generates and caches predicate data as TSV.
-
#taxonworks_extension_data ⇒ Tempfile
Generates and caches TaxonWorks extension data as TSV.
-
#taxonworks_extension_methods ⇒ Array<Symbol>
List of TaxonWorks-specific field names (e.g., :otu_name, :elevation_precision) to export as additional columns (subset of EXTENSION_FIELDS).
-
#total ⇒ Integer
Total number of records in the core scope.
-
#zipfile ⇒ Tempfile
Generates and caches the final DwC-A zip archive.
Instance Method Summary
- #biological_association_relations_to_core ⇒ Object
- #biological_associations_resource_relationship_tmp ⇒ Object
-
#biological_associations_scope ⇒ ActiveRecord::Relation?
Transforms the biological associations config into an AR scope.
- #build_zip ⇒ Object
-
#cleanup ⇒ True
Close and delete all temporary files.
- #collecting_event_predicate_ids ⇒ Object
- #collection_object_predicate_ids ⇒ Object
-
#columns_with_data(columns) ⇒ Object
Find which columns in the dwc_occurrence table have non-NULL, non-empty values.
-
#csv(output_file:) ⇒ Object
Streams CSV data from PostgreSQL directly to output_file.
-
#initialize(core_scope: nil, extension_scopes: {}, predicate_extensions: {}, eml_data: {}, taxonworks_extensions: []) ⇒ Data
constructor
Initializes a new DwC-A export data builder.
- #media_tmp ⇒ Object
-
#meta_fields ⇒ Array
Non-standard DwC columns are handled elsewhere.
-
#no_records? ⇒ Boolean
True if provided core_scope returns no records.
-
#order_columns(columns, column_order) ⇒ Object
Order columns according to column_order, with unordered columns first.
- #package_download(download) ⇒ Object
- #predicate_options_present? ⇒ Boolean
- #taxonworks_options_present? ⇒ Boolean
Methods included from PostgresqlFunctions
#create_api_link_for_model_id_function, #create_authorship_sentence_function, #create_csv_sanitize_function, #create_image_file_url_function, #create_image_metadata_url_function, #create_image_url_functions, #create_sled_image_file_url_function
Methods included from SqlFragments
#copyright_label_from_temp_sql, #image_occurrence_resolution_joins_sql, #media_identifier_joins_sql, #media_identifier_sql, #sound_occurrence_resolution_joins_sql
Constructor Details
#initialize(core_scope: nil, extension_scopes: {}, predicate_extensions: {}, eml_data: {}, taxonworks_extensions: []) ⇒ Data
Initializes a new DwC-A export data builder.
# File 'lib/export/dwca/occurrence/data.rb', line 115
def initialize(
  core_scope: nil,
  extension_scopes: {},
  predicate_extensions: {},
  eml_data: {},
  taxonworks_extensions: []
)
  raise ArgumentError, 'must pass a core_scope' if core_scope.nil?

  @core_scope = core_scope
  @biological_associations_extension = extension_scopes[:biological_associations]
  @media_extension = extension_scopes[:media]
  @data_predicate_ids = {
    collection_object_predicate_id: [],
    collecting_event_predicate_id: []
  }.merge(predicate_extensions)
  @eml_data = eml_data

  # Normalize and sort extensions into a fixed, canonical order.
  extensions = Array(taxonworks_extensions).map(&:to_sym)
  canonical = ::CollectionObject::DwcExtensions::TaxonworksExtensions::EXTENSION_FIELDS
  @taxonworks_extension_methods = canonical & extensions
end
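The last step of the constructor relies on Array#&, which keeps the order of its left operand and removes duplicates. A small sketch of that normalization follows. CANONICAL_FIELDS is a hypothetical stand-in for EXTENSION_FIELDS (its real contents are not shown here, and :example_field is invented), and normalize_extensions is an illustrative helper name.

```ruby
# Hypothetical stand-in for EXTENSION_FIELDS; only :otu_name and
# :elevation_precision are names taken from the documentation above.
CANONICAL_FIELDS = %i[otu_name elevation_precision example_field].freeze

# Returns the requested fields as symbols in canonical order, dropping
# duplicates and anything that is not a known extension field.
def normalize_extensions(requested)
  extensions = Array(requested).map(&:to_sym)
  CANONICAL_FIELDS & extensions
end

normalize_extensions(['elevation_precision', :otu_name, :unknown])
# => [:otu_name, :elevation_precision]
```

Because the canonical list is the left operand, the caller's ordering never leaks into the export, and passing nil yields an empty list.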
Instance Attribute Details
#all_data ⇒ Tempfile
Generates and caches the combined data file by joining core, predicate, and extension data horizontally.
# File 'lib/export/dwca/occurrence/data.rb', line 92
def all_data
  @all_data
end
#core_scope ⇒ ActiveRecord::Relation
Normalizes and returns the core scope as an ordered ActiveRecord::Relation.
# File 'lib/export/dwca/occurrence/data.rb', line 60
def core_scope
  @core_scope
end
#data ⇒ Tempfile
Generates and caches the core occurrence data as TSV.
# File 'lib/export/dwca/occurrence/data.rb', line 38
def data
  @data
end
#data_predicate_ids ⇒ Hash
Returns the predicate IDs to include: { collection_object_predicate_id: [], collecting_event_predicate_id: [] }.
# File 'lib/export/dwca/occurrence/data.rb', line 77
def data_predicate_ids
  @data_predicate_ids
end
#eml ⇒ Tempfile
This is a stub implementation; users may prefer to use IPT.
Generates and caches the eml.xml file. TODO: reference biological_resource_extension.csv
# File 'lib/export/dwca/occurrence/data.rb', line 47
def eml
  @eml
end
#eml_data ⇒ Hash
Returns the input configuration containing :dataset and :additional_metadata as XML strings, used to construct the EML file.
# File 'lib/export/dwca/occurrence/data.rb', line 43
def eml_data
  @eml_data
end
#filename ⇒ String (readonly)
Generates and caches the filename for the zip archive.
# File 'lib/export/dwca/occurrence/data.rb', line 68
def filename
  @filename
end
#meta ⇒ Tempfile
Generates and caches the meta.xml file describing the DwC-A structure.
# File 'lib/export/dwca/occurrence/data.rb', line 51
def meta
  @meta
end
#predicate_data ⇒ Tempfile
Generates and caches predicate data as TSV.
# File 'lib/export/dwca/occurrence/data.rb', line 72
def predicate_data
  @predicate_data
end
#taxonworks_extension_data ⇒ Tempfile
Generates and caches TaxonWorks extension data as TSV.
# File 'lib/export/dwca/occurrence/data.rb', line 81
def taxonworks_extension_data
  @taxonworks_extension_data
end
#taxonworks_extension_methods ⇒ Array<Symbol>
Returns a list of TaxonWorks-specific field names (e.g., :otu_name, :elevation_precision) to export as additional columns (subset of EXTENSION_FIELDS).
# File 'lib/export/dwca/occurrence/data.rb', line 87
def taxonworks_extension_methods
  @taxonworks_extension_methods
end
#total ⇒ Integer
Returns the total number of records in the core scope.
# File 'lib/export/dwca/occurrence/data.rb', line 64
def total
  @total
end
#zipfile ⇒ Tempfile
Generates and caches the final DwC-A zip archive.
# File 'lib/export/dwca/occurrence/data.rb', line 55
def zipfile
  @zipfile
end
Instance Method Details
#biological_association_relations_to_core ⇒ Object
# File 'lib/export/dwca/occurrence/data.rb', line 428
def biological_association_relations_to_core
  core_params = {
    dwc_occurrence_query: @biological_associations_extension[:core_params]
  }

  subject_biological_associations = ::Queries::BiologicalAssociation::Filter.new(
    collection_object_query: core_params,
    collection_object_as_subject_or_as_object: :subject
  ).all

  object_biological_associations = ::Queries::BiologicalAssociation::Filter.new(
    collection_object_query: core_params,
    collection_object_as_subject_or_as_object: :object
  ).all

  {
    subject: Set.new(subject_biological_associations.pluck(:id)),
    object: Set.new(object_biological_associations.pluck(:id))
  }
end
#biological_associations_resource_relationship_tmp ⇒ Object
# File 'lib/export/dwca/occurrence/data.rb', line 451
def biological_associations_resource_relationship_tmp
  return nil if @biological_associations_extension.nil?

  @biological_associations_resource_relationship_tmp = Tempfile.new('biological_resource_relationship.tsv')

  if no_records?
    @biological_associations_resource_relationship_tmp.write("\n")
  else
    Export::CSV::Dwc::Extension::BiologicalAssociations.csv(
      biological_associations_scope,
      biological_association_relations_to_core,
      output_file: @biological_associations_resource_relationship_tmp
    )
  end

  @biological_associations_resource_relationship_tmp.flush
  @biological_associations_resource_relationship_tmp.rewind
  @biological_associations_resource_relationship_tmp
end
#biological_associations_scope ⇒ ActiveRecord::Relation?
Transforms the biological associations config into an AR scope.
# File 'lib/export/dwca/occurrence/data.rb', line 162
def biological_associations_scope
  return nil unless @biological_associations_extension.present?

  q = @biological_associations_extension[:collection_objects_query]
  scope =
    if q.kind_of?(String)
      ::BiologicalAssociation.from('(' + q + ') as biological_associations')
    elsif q.kind_of?(ActiveRecord::Relation)
      q
    else
      raise ArgumentError, 'Biological associations scope is not an SQL string or ActiveRecord::Relation'
    end

  scope
    .joins(:biological_association_index)
    .select('biological_associations.id')
    .includes(:biological_association_index)
end
#build_zip ⇒ Object
# File 'lib/export/dwca/occurrence/data.rb', line 570
def build_zip
  t = Tempfile.new(filename)

  # Write an empty archive first so that Zip::File can open the tempfile.
  Zip::OutputStream.open(t) { |zos| }

  Zip::File.open(t.path, create: true) do |zip|
    zip.add('data.tsv', all_data.path)

    zip.add('media.tsv', media_tmp.path) if @media_extension
    zip.add('resource_relationships.tsv', biological_associations_resource_relationship_tmp.path) if @biological_associations_extension

    zip.add('meta.xml', meta.path)
    zip.add('eml.xml', eml.path)
  end

  t
end
#cleanup ⇒ True
Closes and deletes all temporary files.
# File 'lib/export/dwca/occurrence/data.rb', line 605
def cleanup
  Rails.logger.debug 'dwca_export: cleanup start'

  # Only cleanup files that were actually created (materialized).
  # This prevents lazy-loading during cleanup.

  if defined?(@zipfile) && @zipfile
    @zipfile.close
    @zipfile.unlink
  end

  if defined?(@meta) && @meta
    @meta.close
    @meta.unlink
  end

  if defined?(@eml) && @eml
    @eml.close
    @eml.unlink
  end

  if defined?(@data) && @data
    @data.close
    @data.unlink
  end

  if @biological_associations_extension && defined?(@biological_associations_resource_relationship_tmp) && @biological_associations_resource_relationship_tmp
    @biological_associations_resource_relationship_tmp.close
    @biological_associations_resource_relationship_tmp.unlink
  end

  if @media_extension && defined?(@media_tmp) && @media_tmp
    @media_tmp.close
    @media_tmp.unlink
  end

  if predicate_options_present? && defined?(@predicate_data) && @predicate_data
    @predicate_data.close
    @predicate_data.unlink
  end

  if taxonworks_options_present? && defined?(@taxonworks_extension_data) && @taxonworks_extension_data
    @taxonworks_extension_data.close
    @taxonworks_extension_data.unlink
  end

  if defined?(@all_data) && @all_data
    @all_data.close
    @all_data.unlink
  end

  Rails.logger.debug 'dwca_export: cleanup end'

  true
end
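The defined?(@ivar) && @ivar guard is what keeps cleanup from building a file only to delete it. The sketch below illustrates the idea with a hypothetical LazyFiles class; the memoized reader is an assumption about how a lazily generated file might look, not a copy of the accessors in this class.

```ruby
require 'tempfile'

# Hypothetical class with one lazily built, memoized Tempfile.
class LazyFiles
  # Builds the file on first access and caches it.
  def report
    @report ||= Tempfile.new('report')
  end

  # True once the lazy reader has been called at least once.
  def report_built?
    !!(defined?(@report) && @report)
  end

  def cleanup
    # Inspect the instance variable, not the reader: calling `report` here
    # would create a file just so it could be deleted.
    if defined?(@report) && @report
      @report.close
      @report.unlink
    end
    true
  end
end
```

With this shape, calling cleanup on an object whose report was never requested leaves report_built? false and touches nothing on disk.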
#collecting_event_predicate_ids ⇒ Object
# File 'lib/export/dwca/occurrence/data.rb', line 155
def collecting_event_predicate_ids
  @data_predicate_ids[:collecting_event_predicate_id]
end
#collection_object_predicate_ids ⇒ Object
# File 'lib/export/dwca/occurrence/data.rb', line 151
def collection_object_predicate_ids
  @data_predicate_ids[:collection_object_predicate_id]
end
#columns_with_data(columns) ⇒ Object
Find which columns in the dwc_occurrence table have non-NULL, non-empty values. This implements the trim_columns: true behavior. Note: We check for non-empty AFTER sanitization.
# File 'lib/export/dwca/occurrence/data.rb', line 264
def columns_with_data(columns)
  return [] if columns.empty?

  conn = ActiveRecord::Base.connection
  column_types = ::DwcOccurrence.columns_hash

  checks = columns.map.with_index do |col, idx|
    quoted = conn.quote_column_name(col)
    col_str = col.to_s
    column_info = column_types[col_str]
    is_string_column = column_info && [:string, :text].include?(column_info.type)

    if is_string_column
      # String columns - check if any non-empty values after sanitization.
      sanitized = "pg_temp.sanitize_csv(#{quoted})"
      "CASE WHEN COUNT(CASE WHEN #{sanitized} IS NOT NULL AND #{sanitized} != '' THEN 1 END) > 0 THEN #{conn.quote(col)} ELSE NULL END AS check_#{idx}"
    else
      # Non-string columns - just check if not NULL.
      "CASE WHEN COUNT(#{quoted}) > 0 THEN #{conn.quote(col_str)} ELSE NULL END AS check_#{idx}"
    end
  end

  sql = "SELECT #{checks.join(', ')} FROM (#{core_scope.to_sql}) AS data"
  result = conn.execute(sql).first
  result.values.compact
end
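The SQL above asks, per column, whether any row holds a usable value. An in-memory analogue of that rule may make it easier to follow. It is only a sketch: columns_with_data_in_memory is a hypothetical helper, rows are plain hashes rather than database records, and the default sanitizer (tabs and newlines replaced by spaces) is an assumption about what pg_temp.sanitize_csv does.

```ruby
# Assumed sanitization: tabs, carriage returns and newlines become spaces.
DEFAULT_SANITIZER = ->(s) { s.gsub(/[\t\r\n]/, ' ') }

# Returns the columns that have at least one non-nil value and, for strings,
# at least one value that is still non-empty after sanitization.
def columns_with_data_in_memory(rows, columns, sanitizer: DEFAULT_SANITIZER)
  columns.select do |col|
    rows.any? do |row|
      value = row[col]
      next false if value.nil?
      next true unless value.is_a?(String)

      sanitizer.call(value) != ''
    end
  end
end

EXAMPLE_ROWS = [
  { 'catalogNumber' => 'A1', 'remarks' => '',  'year' => nil,  'depth' => nil },
  { 'catalogNumber' => 'A2', 'remarks' => nil, 'year' => 1999, 'depth' => nil }
].freeze

columns_with_data_in_memory(EXAMPLE_ROWS, %w[catalogNumber remarks year depth])
# => ["catalogNumber", "year"]
```

Columns that are entirely nil or entirely empty strings are trimmed, which is the effect described as trim_columns: true.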
#csv(output_file:) ⇒ Object
Streams CSV data from PostgreSQL directly to output_file.
# File 'lib/export/dwca/occurrence/data.rb', line 194
def csv(output_file:)
  conn = ActiveRecord::Base.connection

  create_csv_sanitize_function

  target_cols = ::DwcOccurrence.target_columns
  excluded = ::DwcOccurrence.excluded_columns

  cols_to_export = target_cols - excluded
  cols_with_data = columns_with_data(cols_to_export)

  column_order = (::CollectionObject::DWC_OCCURRENCE_MAP.keys + ::CollectionObject::EXTENSION_FIELDS).map(&:to_s)
  ordered_cols = order_columns(cols_with_data, column_order)

  column_types = ::DwcOccurrence.columns_hash

  # Build SELECT list with proper column names and aliases.
  # Sanitize string columns by replacing newlines and tabs with spaces
  # (matching Utilities::Strings.sanitize_for_csv behavior).
  select_list = ordered_cols.map do |col|
    if col == 'id'
      # DwCA requires the <id> column specified in meta.xml to be named "id"
      # (not "occurrenceID") for extension records to join correctly (see
      # commit 444262503d).
      # We copy the occurrenceID column to the id so that id's are proper
      # UUIDs (not db ids), then also include occurrenceID as a proper DwC
      # term field. This means both columns contain the same values - it
      # seems to be required in this case.
      '"occurrenceID" AS "id"'
    elsif col == 'dwcClass'
      # Header converter: dwcClass -> class, with sanitization
      '"dwcClass" AS "class"'
    else
      column_info = column_types[col]
      is_string_column = column_info && [:string, :text].include?(column_info.type)

      if is_string_column
        # String columns - sanitize by replacing newlines and tabs.
        "pg_temp.sanitize_csv(#{conn.quote_column_name(col)}) AS #{conn.quote_column_name(col)}"
      else
        # Non-string columns (integer, decimal, date, etc.) - no
        # sanitization needed.
        conn.quote_column_name(col)
      end
    end
  end.join(', ')

  copy_sql = <<-SQL
    COPY (
      SELECT #{select_list}
      FROM (#{core_scope.to_sql}) AS dwc_data
    ) TO STDOUT WITH (FORMAT CSV, DELIMITER E'\\t', HEADER, NULL '')
  SQL

  # Stream directly from PostgreSQL to file
  conn.raw_connection.copy_data(copy_sql) do
    while (row = conn.raw_connection.get_copy_data)
      output_file.write(row.force_encoding(Encoding::UTF_8))
    end
  end

  Rails.logger.debug 'dwca_export: csv data generated'
end
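The sanitization step exists because a raw tab or newline inside a value would break the one-row-per-line, tab-separated layout. A pure-Ruby sketch of the same idea is shown below. sanitize_cell and write_tsv are hypothetical helpers, the exact replacement rules of pg_temp.sanitize_csv are assumed rather than known, and unlike PostgreSQL's FORMAT CSV this sketch does no quote handling.

```ruby
require 'stringio'

# Replaces tabs, carriage returns and newlines in strings with spaces;
# other values pass through unchanged. (Assumed rule, see lead-in.)
def sanitize_cell(value)
  value.is_a?(String) ? value.gsub(/[\t\r\n]/, ' ') : value
end

# Writes a header line plus one tab-separated line per row to io.
# nil is written as an empty cell, mirroring NULL '' in the COPY options.
def write_tsv(rows, headers, io)
  io.write(headers.join("\t") + "\n")
  rows.each do |row|
    cells = headers.map { |h| sanitize_cell(row[h]).to_s }
    io.write(cells.join("\t") + "\n")
  end
  io
end

EXAMPLE_TSV = write_tsv(
  [{ 'id' => 'abc', 'locality' => "Line one\nline\ttwo", 'year' => nil }],
  %w[id locality year],
  StringIO.new
).string
# => "id\tlocality\tyear\nabc\tLine one line two\t\n"
```

The real method avoids building rows in Ruby at all by streaming COPY output straight to the file, which is the main reason it sanitizes inside the SELECT list.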
#media_tmp ⇒ Object
# File 'lib/export/dwca/occurrence/data.rb', line 471
def media_tmp
  return nil if @media_extension.nil?

  @media_tmp = Tempfile.new('media.tsv')

  if no_records?
    @media_tmp.write("\n")
  else
    exporter = Export::Dwca::Occurrence::MediaExporter.new(
      media_extension: @media_extension,
    )
    exporter.export_to(@media_tmp)
  end

  @media_tmp.flush
  @media_tmp.rewind
  @media_tmp
end
#meta_fields ⇒ Array
Non-standard DwC columns are handled elsewhere.
# File 'lib/export/dwca/occurrence/data.rb', line 493
def meta_fields
  return [] if no_records?

  h = File.open(all_data, &:gets)&.strip&.split("\t")
  # Remove "id" column from field list since it's declared separately as
  # <id> in meta.xml.
  # The remaining fields become <field> elements.
  h&.shift
  h || []
end
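The header-reading logic can be tried in isolation. The sketch below applies the same steps to any TSV file; header_fields is a hypothetical helper name and the sample headers are illustrative only.

```ruby
require 'tempfile'

# Reads the first line of a TSV file and returns every header except the
# leading "id" column. Returns [] for an empty file.
def header_fields(path)
  headers = File.open(path, &:gets)&.strip&.split("\t")
  headers&.shift # drop "id"; it is declared separately in meta.xml
  headers || []
end

# Small helper for trying it out: writes content to a Tempfile and
# returns the still-open Tempfile.
def tsv_tempfile(content)
  file = Tempfile.new('header_fields')
  file.write(content)
  file.flush
  file
end
```

For a file whose first line is "id", "occurrenceID", "basisOfRecord" separated by tabs, header_fields returns ["occurrenceID", "basisOfRecord"]; the safe-navigation calls are what make the empty-file case return [] instead of raising.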
#no_records? ⇒ Boolean
Returns true if the provided core_scope returns no records.
# File 'lib/export/dwca/occurrence/data.rb', line 311
def no_records?
  total == 0
end
#order_columns(columns, column_order) ⇒ Object
Order columns according to column_order, with unordered columns first. This matches the behavior of Export::CSV.sort_column_headers.
# File 'lib/export/dwca/occurrence/data.rb', line 294
def order_columns(columns, column_order)
  sorted = []
  unsorted = []

  columns.each do |col|
    if (pos = column_order.index(col))
      sorted[pos] = col
    else
      unsorted.push col
    end
  end

  unsorted + sorted.compact
end
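A short worked call shows what "unordered columns first" means in practice. The method body is reproduced as a standalone function so the example runs without Rails; the column names and EXAMPLE_ORDER are illustrative and are not the real DWC_OCCURRENCE_MAP ordering.

```ruby
# Standalone copy of the ordering logic, for experimentation only.
def order_columns(columns, column_order)
  sorted = []
  unsorted = []

  columns.each do |col|
    if (pos = column_order.index(col))
      # Place the column at its position in the reference order.
      sorted[pos] = col
    else
      unsorted.push(col)
    end
  end

  # Gaps left by absent columns are nil and removed by compact.
  unsorted + sorted.compact
end

EXAMPLE_ORDER = %w[id catalogNumber basisOfRecord].freeze

order_columns(%w[zeta basisOfRecord id], EXAMPLE_ORDER)
# => ["zeta", "id", "basisOfRecord"]
```

Here "zeta" has no position in the reference order, so it leads the result, while "id" and "basisOfRecord" follow in reference order even though catalogNumber is absent.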
#package_download(download) ⇒ Object
# File 'lib/export/dwca/occurrence/data.rb', line 662
def package_download(download)
  p = zipfile.path

  # This doesn't touch the db (source_file_path is an instance var).
  download.update!(source_file_path: p)
end
#predicate_options_present? ⇒ Boolean
# File 'lib/export/dwca/occurrence/data.rb', line 180
def predicate_options_present?
  data_predicate_ids[:collection_object_predicate_id].present? || data_predicate_ids[:collecting_event_predicate_id].present?
end
#taxonworks_options_present? ⇒ Boolean
# File 'lib/export/dwca/occurrence/data.rb', line 184
def taxonworks_options_present?
  taxonworks_extension_methods.present?
end