{ "cells": [ { "cell_type": "markdown", "id": "ec9ded46-b872-4a33-a98f-5afa2e9d1499", "metadata": {}, "source": [ "# Handling Corrupt SEG-Y Files\n", "\n", "```{article-info}\n", ":author: Altay Sansal\n", ":date: \"{sub-ref}`today`\"\n", ":read-time: \"{sub-ref}`wordcount-minutes` min read\"\n", ":class-container: sd-p-0 sd-outline-muted sd-rounded-3 sd-font-weight-light\n", "```\n", "\n", "In this tutorial, we will demonstrate how to handle some of the most common SEG-Y file issues that can\n", "occur during ingestion. To illustrate these problems and their solutions, we'll start by creating some\n", "intentionally malformed files using the [`TGSAI/segy`][tgsai-segy] library. Let's begin by importing the\n", "modules we'll be using throughout this tutorial.\n", "\n", "[tgsai-segy]: https://github.com/TGSAI/segy" ] }, { "cell_type": "code", "execution_count": 2, "id": "7ca46249-f607-4549-8466-20257ae7cfb5", "metadata": { "ExecuteTime": { "end_time": "2025-10-04T19:51:30.075646Z", "start_time": "2025-10-04T19:51:28.848987Z" } }, "outputs": [], "source": [ "from pathlib import Path\n", "\n", "import numpy as np\n", "from segy import SegyFactory\n", "from segy.config import SegyHeaderOverrides\n", "from segy.standards import get_segy_standard\n", "\n", "from mdio import open_mdio\n", "from mdio import segy_to_mdio\n", "from mdio.builder.template_registry import get_template" ] }, { "cell_type": "markdown", "id": "7735a63c74432eb9", "metadata": {}, "source": [ "## Fixing Coordinate Scalar Issues\n", "\n", "One of the most common issues in SEG-Y files is an invalid or missing coordinate scalar value. Let's start by\n", "creating a SEG-Y file with an intentionally incorrect coordinate scalar. We'll create a simple toy 2D stack dataset\n", "that contains CDP (Common Depth Point) numbers and dummy CDP-X/Y coordinates in the trace headers.\n", "\n", "To generate this example file, we will follow these steps:\n", "1. 
Create an empty SEG-Y factory with the appropriate specification.\n", "2. Populate the file headers (textual and binary headers).\n", "3. Generate 10 traces with headers and fill them with dummy sample values." ] }, { "cell_type": "code", "execution_count": 3, "id": "18d7694ee93c0d98", "metadata": { "jupyter": { "is_executing": true } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Wrote temporary SEG-Y file successfully.\n" ] } ], "source": [ "n_traces = 10\n", "spec = get_segy_standard(1.0)\n", "factory = SegyFactory(spec=spec, sample_interval=4000, samples_per_trace=1201)\n", "\n", "txt_header = factory.create_textual_header() # default textual header\n", "bin_header = factory.create_binary_header() # default binary header\n", "\n", "# header template is all zeros except samples_per_trace and sample_interval\n", "headers = factory.create_trace_header_template(n_traces)\n", "samples = factory.create_trace_sample_template(n_traces) # all zeros by default\n", "\n", "rng = np.random.default_rng(seed=42)\n", "headers[\"cdp\"] = np.arange(n_traces) # sequential CDP numbers 0..9\n", "headers[\"coordinate_scalar\"] = 0 # intentionally invalid scalar\n", "headers[\"cdp_x\"] = np.arange(n_traces) * 1000\n", "headers[\"cdp_y\"] = np.arange(n_traces) * 10000\n", "samples[:] = rng.normal(size=samples.shape).astype(\"float16\")\n", "\n", "# encode traces to a SEG-Y buffer and write the file\n", "with Path(\"tmp.sgy\").open(mode=\"wb\") as fp:\n", "    fp.write(txt_header)\n", "    fp.write(bin_header)\n", "    fp.write(factory.create_traces(headers, samples))\n", "\n", "print(\"Wrote temporary SEG-Y file successfully.\")" ] }, { "cell_type": "markdown", "id": "efdf0c533c6b5589", "metadata": {}, "source": [ "As mentioned earlier, this file has a zero value in the coordinate scalar field. According to the SEG-Y standard\n", "(both Revision 0 and Revision 1), a coordinate scalar of zero is invalid and should not be used.\n", "\n", "Starting with MDIO v1, we extract X/Y coordinates (such as CDP-X/Y, Shot-X/Y, etc.) 
as dedicated MDIO variables\n", "for easier access and manipulation. For these coordinates to be extracted correctly, the coordinate scalar must be\n", "valid. If we attempt to ingest the file with an invalid coordinate scalar, MDIO will raise an error. Let's try to\n", "ingest the file and catch the resulting error to demonstrate this issue." ] }, { "cell_type": "code", "execution_count": 4, "id": "5537acb5a0ef370d", "metadata": { "ExecuteTime": { "end_time": "2025-10-04T12:41:50.216247Z", "start_time": "2025-10-04T12:41:48.958802Z" } }, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Scanning SEG-Y for geometry attributes: 100%|██████████| 1/1 [00:01<00:00, 1.12s/block]\n", "Unexpected value in coordinate unit (measurement_system_code) header: 0. Can't extract coordinate unit and will ingest without coordinate units.\n" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Ingestion failed with error: Invalid coordinate scalar: 0 for file revision SegyStandard.REV1.\n" ] } ], "source": [ "mdio_template = get_template(\"PostStack2DTime\")\n", "\n", "ingestion_kwargs = {\n", "    \"segy_spec\": spec,\n", "    \"mdio_template\": mdio_template,\n", "    \"input_path\": \"tmp.sgy\",\n", "    \"output_path\": \"tmp.mdio\",\n", "    \"overwrite\": True,\n", "}\n", "try:\n", "    segy_to_mdio(**ingestion_kwargs)\n", "    print(\"Ingestion successful.\")\n", "except ValueError as e:\n", "    print(f\"Ingestion failed with error: {e}\")" ] }, { "cell_type": "markdown", "id": "de52bcf4-9eb9-4f19-8aca-2ade664b1649", "metadata": {}, "source": [ "### Fixing the Coordinate Scalar\n", "\n", "To ingest this file, we can use the `SegyHeaderOverrides` option to override the existing value at runtime,\n", "so that the corrected value is also stored in the final MDIO file. In SEG-Y, a negative coordinate scalar acts\n", "as a divisor, so with the value `-100` we expect the coordinates to be divided by 100."
] }, { "cell_type": "code", "execution_count": 5, "id": "72ba0a22-6e14-400b-8227-3ec6e93fbc52", "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Scanning SEG-Y for geometry attributes: 100%|██████████| 1/1 [00:01<00:00, 1.21s/block]\n", "Unexpected value in coordinate unit (measurement_system_code) header: 0. Can't extract coordinate unit and will ingest without coordinate units.\n", "Ingesting traces: 100%|██████████| 1/1 [00:01<00:00, 1.51s/block]" ] }, { "name": "stdout", "output_type": "stream", "text": [ "Ingestion successful.\n" ] }, { "name": "stderr", "output_type": "stream", "text": [ "\n" ] } ], "source": [ "overrides = SegyHeaderOverrides(trace_header={\"coordinate_scalar\": -100})\n", "\n", "segy_to_mdio(**ingestion_kwargs, segy_header_overrides=overrides)\n", "print(\"Ingestion successful.\")" ] }, { "cell_type": "markdown", "id": "ebccf97c-390e-4e12-8192-2db5a1a3612d", "metadata": {}, "source": [ "Now that the ingestion has completed successfully, we can open the MDIO file and inspect its contents to verify\n", "that everything was processed correctly." ] }, { "cell_type": "code", "execution_count": 6, "id": "7894292d-ac08-4c19-bdaa-3518bd112c78", "metadata": {}, "outputs": [ { "data": { "text/html": [ "
<xarray.Dataset> Size: 55kB\n",
"Dimensions: (cdp: 10, time: 1201)\n",
"Coordinates:\n",
" cdp_x (cdp) float64 80B ...\n",
" * time (time) int32 5kB 0 4 8 12 16 20 ... 4784 4788 4792 4796 4800\n",
" cdp_y (cdp) float64 80B ...\n",
" * cdp (cdp) int32 40B 0 1 2 3 4 5 6 7 8 9\n",
"Data variables:\n",
" headers (cdp) [('trace_seq_num_line', '<i4'), ('trace_seq_num_reel', '<i4'), ('orig_field_record_num', '<i4'), ('trace_num_orig_record', '<i4'), ('energy_source_point_num', '<i4'), ('ensemble_num', '<i4'), ('trace_num_ensemble', '<i4'), ('trace_id_code', '<i2'), ('vertically_summed_traces', '<i2'), ('horizontally_stacked_traces', '<i2'), ('data_use', '<i2'), ('source_to_receiver_distance', '<i4'), ('receiver_group_elevation', '<i4'), ('source_surface_elevation', '<i4'), ('source_depth_below_surface', '<i4'), ('receiver_datum_elevation', '<i4'), ('source_datum_elevation', '<i4'), ('source_water_depth', '<i4'), ('receiver_water_depth', '<i4'), ('elevation_depth_scalar', '<i2'), ('coordinate_scalar', '<i2'), ('source_coord_x', '<i4'), ('source_coord_y', '<i4'), ('group_coord_x', '<i4'), ('group_coord_y', '<i4'), ('coordinate_unit', '<i2'), ('weathering_velocity', '<i2'), ('subweathering_velocity', '<i2'), ('source_uphole_time', '<i2'), ('group_uphole_time', '<i2'), ('source_static_correction', '<i2'), ('receiver_static_correction', '<i2'), ('total_static_applied', '<i2'), ('lag_time_a', '<i2'), ('lag_time_b', '<i2'), ('delay_recording_time', '<i2'), ('mute_time_start', '<i2'), ('mute_time_end', '<i2'), ('samples_per_trace', '<i2'), ('sample_interval', '<i2'), ('instrument_gain_type', '<i2'), ('instrument_gain_const', '<i2'), ('instrument_gain_initial', '<i2'), ('correlated_data', '<i2'), ('sweep_freq_start', '<i2'), ('sweep_freq_end', '<i2'), ('sweep_length', '<i2'), ('sweep_type', '<i2'), ('sweep_taper_start', '<i2'), ('sweep_taper_end', '<i2'), ('taper_type', '<i2'), ('alias_filter_freq', '<i2'), ('alias_filter_slope', '<i2'), ('notch_filter_freq', '<i2'), ('notch_filter_slope', '<i2'), ('low_cut_freq', '<i2'), ('high_cut_freq', '<i2'), ('low_cut_slope', '<i2'), ('high_cut_slope', '<i2'), ('year_recorded', '<i2'), ('day_of_year', '<i2'), ('hour_of_day', '<i2'), ('minute_of_hour', '<i2'), ('second_of_minute', '<i2'), ('time_basis_code', '<i2'), 
('trace_weighting_factor', '<i2'), ('group_num_roll_switch', '<i2'), ('group_num_first_trace', '<i2'), ('group_num_last_trace', '<i2'), ('gap_size', '<i2'), ('taper_overtravel', '<i2'), ('cdp_x', '<i4'), ('cdp_y', '<i4'), ('inline', '<i4'), ('crossline', '<i4'), ('shot_point', '<i4'), ('shot_point_scalar', '<i2'), ('trace_value_unit', '<i2'), ('transduction_const_mantissa', '<i4'), ('transduction_const_exponent', '<i2'), ('transduction_unit', '<i2'), ('device_trace_id', '<i2'), ('times_scalar', '<i2'), ('source_type_orientation', '<i2'), ('source_energy_dir_mantissa', '<i4'), ('source_energy_dir_exponent', '<i2'), ('source_measurement_mantissa', '<i4'), ('source_measurement_exponent', '<i2'), ('source_measurement_unit', '<i2')] 2kB ...\n",
" trace_mask (cdp) bool 10B ...\n",
" amplitude (cdp, time) float32 48kB ...\n",
"Attributes:\n",
" apiVersion: 1.0.4\n",
" createdOn: 2025-10-04 19:51:42.144206+00:00\n",
" name: PostStack2DTime\n",
" attributes: {'surveyType': '2D', 'gatherType': 'stacked', 'defaultVariab...\n",
"<xarray.Dataset> Size: 200B\n",
"Dimensions: (cdp: 10)\n",
"Coordinates:\n",
" cdp_x (cdp) float64 80B 0.0 10.0 20.0 30.0 40.0 50.0 60.0 70.0 80.0 90.0\n",
" cdp_y (cdp) float64 80B 0.0 100.0 200.0 300.0 ... 600.0 700.0 800.0 900.0\n",
" * cdp (cdp) int32 40B 0 1 2 3 4 5 6 7 8 9\n",
"Data variables:\n",
" *empty*\n",
"Attributes:\n",
" apiVersion: 1.0.4\n",
" createdOn: 2025-10-04 19:51:42.144206+00:00\n",
" name: PostStack2DTime\n",
" attributes: {'surveyType': '2D', 'gatherType': 'stacked', 'defaultVariab..."
],
"text/plain": [
"