# Supported Data Formats

The deterministic test generator reads output files produced by pipeline steps and generates integrity, qualitative, and quantitative tests for each file. Format detection is automatic based on file extension.

## Format Table

| Format | Extensions | Library Required | Domain |
|--------|-----------|-----------------|--------|
| NumPy array | `.npy` | numpy | General |
| NumPy archive | `.npz` | numpy | General |
| JSON | `.json` | (stdlib) | General |
| JSON Lines | `.jsonl`, `.ndjson` | (stdlib) | General |
| CSV | `.csv` | (stdlib) | General |
| HDF5 | `.h5`, `.hdf5` | h5py | General |
| Whitespace-delimited | `.dat`, `.txt` | (stdlib) | General |
| Key-value text | (via `sFormat` override) | (stdlib) | General |
| Fixed-width text | (via `sFormat` override) | (stdlib) | General |
| Multi-table text | (via `sFormat` override) | (stdlib) | General |
| Excel | `.xlsx`, `.xls` | openpyxl | General |
| Parquet | `.parquet` | pyarrow | Data Science |
| Image | `.png`, `.jpg`, `.jpeg`, `.tiff`, `.tif` | Pillow | General |
| FITS | `.fits`, `.fit` | astropy | Astronomy |
| VOTable | `.vot` | astropy | Astronomy |
| IPAC table | `.ipac` | astropy | Astronomy |
| MATLAB | `.mat` | scipy | Engineering |
| FORTRAN binary | `.unf` | scipy | Engineering |
| VTK mesh | `.vtk`, `.vtu` | pyvista | Engineering |
| CGNS | `.cgns` | h5py | Engineering |
| FASTA | `.fasta`, `.fa` | (stdlib) | Biology |
| FASTQ | `.fastq`, `.fq` | (stdlib) | Biology |
| VCF | `.vcf` | (stdlib) | Biology |
| BED | `.bed` | (stdlib) | Biology |
| GFF/GTF | `.gff`, `.gtf`, `.gff3` | (stdlib) | Biology |
| SAM | `.sam` | (stdlib) | Biology |
| BAM | `.bam` | pysam | Biology |
| SPSS | `.sav` | pyreadstat | Social Science |
| Stata | `.dta` | pyreadstat | Social Science |
| SAS | `.sas7bdat` | pyreadstat | Social Science |
| R data | `.rds`, `.RData`, `.rda` | pyreadr | Social Science |
| Safetensors | `.safetensors` | safetensors | AI/ML |
| TFRecord | `.tfrecord` | tfrecord | AI/ML |
| Syslog | `.log` | (stdlib) | Security |
| CEF | `.cef` | (stdlib) | Security |
| PCAP | `.pcap`, `.pcapng` | scapy | Security |

**Total: 36 format names, 50 file extensions.**

## How Format Detection Works

1. The file extension is matched against the format map above.
2. For `.txt` and `.dat` files, the system checks whether a majority of non-blank, non-comment lines contain `=` signs. If so, the file is treated as key-value format rather than whitespace-delimited.
3. For unknown extensions, the first 4 bytes are read. If any byte exceeds ASCII range (value > 127), the file is reported as an unsupported binary format and skipped. Otherwise it is treated as whitespace-delimited text.

## Optional Libraries

Formats marked `(stdlib)` require no additional packages. All other libraries are imported with `try/except ImportError`, so a missing library does not break the test generator: the file is reported with an error message indicating which package to install, and the remaining files are processed normally.

## Overriding Format Detection

When a file extension is ambiguous (for example, a `.txt` file that uses fixed-width columns rather than whitespace delimiters), set the `sFormat` field in the quantitative standards JSON to override extension-based detection:

```json
{
    "sName": "fTemperature",
    "sDataFile": "results.txt",
    "sAccessPath": "column:TGlobal,index:-1",
    "sFormat": "fixedwidth",
    "fValue": 288.15
}
```

Valid `sFormat` values are the format names in the table above.

## Access Path Syntax

Each quantitative benchmark specifies an access path that tells the test runner how to locate a value within a file. The syntax depends on the format:

| Format Family | Access Path Example | Meaning |
|--------------|-------------------|---------|
| CSV, whitespace, Excel, SPSS, Stata, SAS, VOTable, IPAC | `column:Temperature,index:-1` | Last row of Temperature column |
| CSV, whitespace | `column:Temperature,index:mean` | Mean of Temperature column |
| NumPy, MATLAB, safetensors | `key:arrayName,index:0` | First element of named array |
| NumPy, MATLAB, safetensors | `key:arrayName,index:mean` | Mean of named array |
| NumPy (.npy) | `index:0` | First element (flat) |
| JSON | `key:path.to.field` | Nested key traversal |
| JSON | `key:daMedians,index:0` | First element of JSON array |
| JSON | `key:daMedians,index:mean` | Mean of JSON array |
| HDF5, CGNS | `dataset:/group/name,index:0` | First element of dataset |
| FITS | `hdu:1,column:flux,index:0` | First row of flux column in HDU 1 |
| FITS | `hdu:0,index:mean` | Mean of image data in HDU 0 |
| FASTA, FASTQ | `index:mean` | Mean sequence length |
| VCF, BED, GFF, SAM | `column:POS,index:0` | First value in POS column |
| Key-value | `key:parameterName` | Value associated with key |
| PCAP | `index:mean` | Mean packet length |
| Syslog, CEF | `index:0` | Line count |
| Multi-table | `section:0,column:X,index:0` | First value in column X of first table |

## Security

- All `np.load()` calls use `allow_pickle=False` to prevent arbitrary code execution via malicious NumPy files.
- PyTorch checkpoint files (`.pt`, `.pth`) are intentionally unsupported because they use pickle deserialization internally. Use safetensors instead.
- File paths are validated with `os.path.realpath()` to prevent path traversal attacks.
- Files larger than 500 MB are skipped to prevent memory exhaustion.
- JSON traversal is limited to 10 levels of nesting depth.
- Each file generates at most 250 benchmark entries to prevent test explosion on wide datasets.

## Unsupported Files

Files with unrecognized extensions are handled as follows:

1. The first 4 bytes are inspected. If any byte is non-ASCII (> 127), the file is classified as an unsupported binary format.
2. For binary files, the introspection reports `bLoadable: false` with `sError: "unsupported binary format"`. No benchmarks are generated, but the integrity test still verifies the file exists and is non-empty.
3. For text files with unknown extensions, the system falls back to whitespace-delimited parsing with automatic header detection.

To add support for a new format, add an entry to `_DICT_FORMAT_MAP`, a loader function to the template, a benchmarker to the introspection script, and integrity/no-NaN test generators. All new library imports must use `try/except ImportError` for graceful degradation.
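The detection steps described under "How Format Detection Works" and "Unsupported Files" can be sketched roughly as follows. This is a minimal illustration and not the generator's actual code: the map `DICT_FORMAT_SKETCH`, the function names, the format-name strings `"whitespace"` and `"keyvalue"`, and the `#` comment prefix are all assumptions made for the example.

```python
import os

# Hypothetical subset of the extension map. The real map is _DICT_FORMAT_MAP;
# the format-name strings used here are assumptions for illustration.
DICT_FORMAT_SKETCH = {
    ".npy": "npy",
    ".json": "json",
    ".csv": "csv",
    ".txt": "whitespace",
    ".dat": "whitespace",
}


def looks_like_keyvalue(path, comment_prefix="#"):
    """Return True if a majority of non-blank, non-comment lines contain '='.

    The '#' comment prefix is an assumption; the document does not say
    which comment markers are recognized.
    """
    total = 0
    with_equals = 0
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            stripped = line.strip()
            if not stripped or stripped.startswith(comment_prefix):
                continue
            total += 1
            if "=" in stripped:
                with_equals += 1
    return total > 0 and with_equals * 2 > total


def is_binary(path):
    """Inspect the first 4 bytes and flag any byte value above 127."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    return any(byte > 127 for byte in head)


def detect_format(path):
    """Return a format name, or None for an unsupported binary file."""
    extension = os.path.splitext(path)[1].lower()
    name = DICT_FORMAT_SKETCH.get(extension)
    if name is None:
        # Unknown extension: binary files are skipped, text falls back
        # to whitespace-delimited parsing.
        return None if is_binary(path) else "whitespace"
    if extension in (".txt", ".dat") and looks_like_keyvalue(path):
        return "keyvalue"
    return name
```

Note that lowercasing the extension, as this sketch does, would require lowercase map keys for entries such as `.RData`; how the real map handles case is not stated here.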
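To make the access path syntax from the "Access Path Syntax" table more concrete, the sketch below splits a path into its `name:value` parts and applies an `index` selector to a sequence of numbers. The function names are hypothetical, and the way the 10-level JSON depth limit is enforced is an assumption; the real test runner may parse and validate paths differently.

```python
from statistics import fmean


def parse_access_path(access_path):
    """Split 'column:TGlobal,index:-1' into {'column': 'TGlobal', 'index': '-1'}.

    Each comma-separated token is split at its first colon only, so values
    such as '/group/name' are kept intact.
    """
    parts = {}
    for token in access_path.split(","):
        name, separator, value = token.partition(":")
        if not separator or not value:
            raise ValueError(f"malformed access path token: {token!r}")
        parts[name.strip()] = value.strip()
    return parts


def apply_index(values, selector):
    """Apply an index selector: 'mean' or an integer position such as '-1'."""
    if selector == "mean":
        return fmean(values)
    return values[int(selector)]


def resolve_dotted_key(document, dotted_key, max_depth=10):
    """Follow a dotted key such as 'path.to.field' through nested dicts.

    Rejecting keys deeper than max_depth is one possible reading of the
    10-level nesting limit; it is an assumption, not confirmed behavior.
    """
    keys = dotted_key.split(".")
    if len(keys) > max_depth:
        raise ValueError("key exceeds maximum nesting depth")
    current = document
    for key in keys:
        current = current[key]
    return current
```

For example, `parse_access_path("column:TGlobal,index:-1")` yields the column name and the selector `"-1"`, which `apply_index` then turns into the last element of that column's values.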
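The safeguards listed under "Security" and the optional-import pattern from "Optional Libraries" might be combined along these lines. This is a sketch under stated assumptions: the function names, the error strings other than the `bLoadable` and `sError` keys, the interpretation of 500 MB as binary megabytes, and the idea of confining files to a single root directory are all illustrative choices, not the generator's confirmed behavior.

```python
import os

# Assumes binary megabytes; the document only says "500 MB".
MAX_FILE_BYTES = 500 * 1024 * 1024

# Optional dependency: a missing library must not break the generator.
try:
    import numpy as np
except ImportError:
    np = None


def is_within_root(path, root):
    """Resolve symlinks and '..' segments, then check the path stays under root."""
    real_path = os.path.realpath(path)
    real_root = os.path.realpath(root)
    return os.path.commonpath([real_path, real_root]) == real_root


def check_file(path, root):
    """Return an error string (hypothetical wording), or None if the file is acceptable."""
    if not is_within_root(path, root):
        return "path escapes the allowed directory"
    if not os.path.isfile(path):
        return "file does not exist"
    if os.path.getsize(path) > MAX_FILE_BYTES:
        return "file exceeds size limit"
    return None


def load_npy(path, root):
    """Load a .npy file with pickle disabled, reporting problems instead of raising."""
    error = check_file(path, root)
    if error is not None:
        return {"bLoadable": False, "sError": error}
    if np is None:
        return {"bLoadable": False, "sError": "numpy is not installed"}
    # allow_pickle=False rejects object arrays, which could otherwise
    # execute arbitrary code during deserialization.
    return {"bLoadable": True, "data": np.load(path, allow_pickle=False)}
```

Comparing resolved paths with `os.path.commonpath` is one way to act on the result of `os.path.realpath()`; resolving a path alone does not reject anything, so some comparison against an allowed location is needed.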