Workflow

Operational Product Development Timeline

Over the course of the year, the survey team is developing a variety of different data products. Planning and preparation for surveys happens in the late winter and spring, surveys occur in the summer, data validation takes place over the course of the survey and after the survey, and data products are produced through fall and late winter.

Operational product development timeline.
	January	February	March	April	May	June	July	August	September	October	November	December
Surveys					1	1	1	1
Planning	1	1	1						1	1	1	1
Development	1	1	1	1	1					1	1	1
Deployment (survey deliverables)									1	1	1	1
Deployment (survey operations)					1	1	1	1
Triage (fixing bugs and errors)	1	1	1	1	1	1	1	1	1	1	1	1
User feedback and brainstorming	1	1	1	1	1	1	1	1	1	1	1	1

Data workflow from boat to production

Organisms first need to be collected aboard the vessel before data can be entered into tablets.

flowchart 
  A1[Net and catch\ntotal weight/volume] --> A2
  A1 --> |Catch tablet|BB
  A2[Fish are collected and\ndumped on the table] --> A3[Initial on deck\nsampling decisions\n and sort by\nspecies and mix]
  P1[Specimen bag and tag] --> P5[Observer collections]
  P1 --> P6[Museum collections]
  P1 --> P7[Research samples]
  P2[SAT/PAT fish and crab tagging]
  P3[Species condition]
  P4[Genetics]
  A3 --> C1[Crab:\nSex, weight]
  A3 --> C2[Fish, size varying:\nWeight, sex, length]
  A3 --> |Catch tablet|C3[Fish, small:\nWeight, sex]
  A3 --> C4[Invertebrates, colonial:\nWeight]
  A3 --> C5[Invertebrates, other:\nWeight, count]
  C1 --> |Crab tablet|S1[Length, width, weight, \nshell condition]
  C2 --> |Specimen tablet|S2[Otolith, length, width, \nweight, sex]
  C2 --> |Length tablet|S3[Length, sex]
  C2 --> |Stomach tablet|S4[Diet/stomach samples]
  V1[Fish collections in formalin 10%]
  V2[Invertebrate collections in ethanol]
  subgraph AA[Net on deck]
  A1
  A2
  A3
  subgraph BB[Benthic bag]
  B[Identify,\npresence/absence]
  end
  end
  subgraph CC[Bag composition data]
  C1
  C2
  C3
  C4
  C5
  subgraph VV[Voucher collections]
  V1
  V2
  end
  end
  subgraph SS[Individual specimen data]
  S1
  S2
  S3
  S4
  subgraph PP[Non-core science project requests]
  P1
  P2
  P3
  P4
  P5
  P6
  P7
  end
  end
  style AA fill:white,::
  style SS fill:beige,::
  style BB fill:beige,::
  style CC fill:beige,::
  style PP fill:white,::
  style VV fill:white,::

Figure 2.1: Simplified boat deck processing workflow.

The objective of this process is to take raw data, QA/QC and clean these data, curate standard data products for these survey. Please note, through this process we are not providing “data” (what we consider lower level data material; see the data levels section below) but “data products”, which is intended to facilitate the most fool-proof standard interpretation of the data. These data products only use data from standard and validated hauls, and has undergone careful review.

Once survey data collected on the vessel has been checked and validated, the gap_products/code/run.R script is used to orchestrate a sequence of programs that calculate the standard data products resulting from the NOAA AFSC GAP bottom trawl surveys. Standard data products are the CPUE, BIOMASS, SIZECOMP, and AGECOMP tables in the GAP_PRODUCTS Oracle schema. The tables are slated to be updated twice a year: once after the survey season following finalization of that summer’s bottom trawl survey data to incorporate the new catch, size, and effort data and once prior to an upcoming survey to incorporate new age data that were processed after the prior summer’s survey season ended. This second pre-survey production run will also incorporate changes in the data due to the specimen voucher process as well as other post-hoc changes in the survey data.

The data from these surveys constitute a living data set so we can continue to provide the best available data to all partners, stakeholders, and fellow scientists.

flowchart LR
  A([Catch data\ndeck tablet]) --> B1[METIS Biological\ndata editing\nsoftware]
  A1([Length data\ndeck tablets]) --> B1
  A2([Specimen data\ndeck tablet]) --> B1
  A3([Haul performance w/ Marport sensors]) --> B2[Wheelhouse &\nCalyposo \nhaul data] 
  A4([CTD]) --> B2
  A5([sea state\nobersvations]) --> B2
  A6([HOBO bottom\ncontact sensor]) --> B2
  A7{{navmaps\nR package\nfor navigation}} --> B2
  B1 --> C[GIDES &\nRACE_EDIT\nOracle schema\n& whole-survey\nreview &\ndata checking]
  B2 --> C
  C --> D[RACEBASE\nand RACE_DATA\nOracle schemata]
  D1{{gapindex R package}} --> F[Public Data Product:\ntables in\nGAP_PRODUCTS\nOracle schema]
  D2{{gap_products\nR scripts}} --> F
  D --> D1
  D1 --> D2
  D --> F
  subgraph AA[Data Level 0: Raw]
  A
  A1
  A2
  A3
  A4
  A5
  A6
  A7
  end
  subgraph BB[Data Level 1:\nQA/QC'ed data]
  B1
  B2
  end
  subgraph CC[Data Level 2:\nAnalysis ready product for internal use]
  C
  D
  end
  subgraph FF[Data Level 3:\nAnalysis ready product\nfor external/public use]
  F
  end
  style AA fill:beige,::
  style BB fill:white,::
  style CC fill:beige,::
  style FF fill:white,::

Figure 2.2: Simplified data workflow from boat to production.

During each data product run cycle:

Versions of the tables in GAP_PRODUCTS are locally imported within the gap_products repository to compare with the updated production tables. Any changes to a production table will be compared and checked to make sure those changes are intentional and documented.
Use the gapindex R package to calculate the four major standard data products: CPUE, BIOMASS, SIZECOMP, AGECOMP. These tables are compared and checked to their respective locally saved copies and any changes to the tables are vetted and documented. These tables are then uploaded to the GAP_PRODUCTS Oracle schema.
Calculate the various materialized views for AKFIN and FOSS purposes. Since these are derivative of the tables in GAP_PRODUCTS as well as other base tables in RACEBASE and RACE_DATA, it is not necessary to check these views in addition to the data checks done in the previous steps.

flowchart
  subgraph AA[Data Level 4]
  A[GAP_PRODUCTS\nOracle data\nproduct tables] --> C[Data process\nreports &\npresentations]
  A --> G3{{esrindex R package}}
  G3 --> G1[ESP: Ecosystem\nStatus Reports]
  G3 --> G2[ESR: Ecosystem\nand Socioeconomic\nProfiles]
  G3 --> C
  G3 --> G4[Stock assessment\nrisk assessment\ntables]
  A --> D[Outreach]
  D --> E[Plan Team\nPresentations]
  D --> F[Community\nHighlight\nDocuments]
  D --> H[Survey progress\n& temperature maps]
  A --> I[Data Platforms]
  I --> J[AKFIN: Alaska\nFisheries\nInformation\nNetwork]
  I --> K[FOSS: Fisheries One\nStop Shop]
  I --> L{{afscgap\nPython package}}
  A --> M[Model-Based\nIndices]
  M --> N{{VAST/tinyVAST}}
  M --> O{{sdmTMB}}
  M --> P[EFH: Essential\nFish Habitat]
  F
  end
  style AA fill:beige,::

Figure 2.3: Major end-users of the GAP data product tables.

Data levels

GAP produces numerous data products that are subjected to different levels of processing, ranging from raw to highly-derived. The suitability of these data products for analysis varies and there is ambiguity about which data products can be used for which purpose. This ambiguity can create challenges in communicating about data products and potentially lead to misunderstanding and misuse of data. One approach to communicating about the level of processing applied to data products and their suitability for analysis is to describe data products using a Data Processing Level system. Data Processing Level systems are widely used in earth system sciences to characterize the extent of processing that has been applied to data products. For example, the NOAA National Centers for Environmental Information (NCEI) Satellite Program uses a Data Processing Level system to describe data on a scale of 0-4, where Level 0 is raw data and Level 4 is model output or results from analysis. Example of how NASA remote sensing data products are shared through a public data portal with levels of data processing and documentation.

For more information, see Sean Rohan’s October 2022 SCRUGS presentation on the topic.

Level 0: Raw and unprocessed data. Ex: Data on the G drive, some tables in RACE_DATA
Level 1: Data products with QA/QC applied that may or may not be expanded to analysis units, but either not georeferenced or does not include full metadata. Ex: Some tables in RACE_DATA and RACEBASE
Level 2: Analysis-ready data products that are derived for a standardized extent and account for zeros and missing/bad data. Ex: CPUE tables, some data products in public-facing archives and repositories
Level 3: Data products that are synthesized across a standardized extent, often inputs in a higher-level analytical product. Ex: Abundance indices, some data products in public-facing archives and repositories
Level 4: Analytically generated data products that are derived from lower-level data, often to inform management. Ex: Biological reference points from stock assessments, Essential Fish Habitat layers, indicators in Ecosystem Status Reports and Ecosystem and Socioeconomic Profiles