Supported Data Formats
The deterministic test generator reads output files produced by pipeline steps and generates integrity, qualitative, and quantitative tests for each file. Format detection is automatic based on file extension.
Format Table
| Format | Extensions | Library Required | Domain |
|---|---|---|---|
| NumPy array | | numpy | General |
| NumPy archive | | numpy | General |
| JSON | | (stdlib) | General |
| JSON Lines | | (stdlib) | General |
| CSV | | (stdlib) | General |
| HDF5 | | h5py | General |
| Whitespace-delimited | | (stdlib) | General |
| Key-value text | (via | (stdlib) | General |
| Fixed-width text | (via | (stdlib) | General |
| Multi-table text | (via | (stdlib) | General |
| Excel | | openpyxl | General |
| Parquet | | pyarrow | Data Science |
| Image | | Pillow | General |
| FITS | | astropy | Astronomy |
| VOTable | | astropy | Astronomy |
| IPAC table | | astropy | Astronomy |
| MATLAB | | scipy | Engineering |
| FORTRAN binary | | scipy | Engineering |
| VTK mesh | | pyvista | Engineering |
| CGNS | | h5py | Engineering |
| FASTA | | (stdlib) | Biology |
| FASTQ | | (stdlib) | Biology |
| VCF | | (stdlib) | Biology |
| BED | | (stdlib) | Biology |
| GFF/GTF | | (stdlib) | Biology |
| SAM | | (stdlib) | Biology |
| BAM | | pysam | Biology |
| SPSS | | pyreadstat | Social Science |
| Stata | | pyreadstat | Social Science |
| SAS | | pyreadstat | Social Science |
| R data | | pyreadr | Social Science |
| Safetensors | | safetensors | AI/ML |
| TFRecord | | tfrecord | AI/ML |
| Syslog | | (stdlib) | Security |
| CEF | | (stdlib) | Security |
| PCAP | | scapy | Security |
Total: 36 format names, 50 file extensions.
How Format Detection Works
The file extension is matched against the format map above.
For .txt and .dat files, the system checks whether a majority of non-blank, non-comment lines contain = signs. If so, the file is treated as key-value format rather than whitespace-delimited.
For unknown extensions, the first 4 bytes are read. If any byte exceeds the ASCII range (value > 127), the file is reported as an unsupported binary format and skipped. Otherwise it is treated as whitespace-delimited text.
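The key-value check for .txt and .dat files can be sketched as below. This is a minimal illustration and not the generator's actual code: the function names, the "keyvalue" and "whitespace" labels, the use of # as the comment marker, and the reading of "majority" as strictly more than half are all assumptions.

```python
def looks_like_key_value(lines, comment_prefix="#"):
    """Return True when a majority of non-blank, non-comment lines contain '='."""
    candidates = [
        line for line in lines
        if line.strip() and not line.lstrip().startswith(comment_prefix)
    ]
    if not candidates:
        return False
    with_equals = sum(1 for line in candidates if "=" in line)
    # "Majority" is read here as strictly more than half (an assumption).
    return 2 * with_equals > len(candidates)


def classify_text_file(path):
    """Pick a label for a .txt or .dat file; both labels are hypothetical names."""
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        lines = handle.read().splitlines()
    return "keyvalue" if looks_like_key_value(lines) else "whitespace"
```

A file with two assignment lines and one numeric row would be classified as key-value under this reading, while a mostly numeric table with a single stray = would not.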
Optional Libraries
Formats marked (stdlib) require no additional packages. All other
libraries are imported with try/except ImportError, so a missing library
does not break the test generator – the file is reported with an error
message indicating which package to install, and the remaining files are
processed normally.
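One way to get this behavior is a small guarded-import helper, sketched below. The helper names, the sFile key, and the wording of the install hint are hypothetical; only the try/except ImportError pattern and the bLoadable and sError fields come from this document.

```python
import importlib


def try_import(module_name, package_name=None):
    """Return (module, None) on success, or (None, install hint) if the import fails."""
    try:
        return importlib.import_module(module_name), None
    except ImportError:
        # The pip package can differ from the module name (Pillow is imported as PIL).
        package = package_name or module_name
        return None, f"missing optional package '{package}'; install with: pip install {package}"


def load_with_optional_library(path, module_name, loader, package_name=None):
    """Run loader(module, path), or report the missing library without raising."""
    module, error = try_import(module_name, package_name)
    if module is None:
        return {"sFile": path, "bLoadable": False, "sError": error}
    return {"sFile": path, "bLoadable": True, "data": loader(module, path)}
```

Because the failure is returned as data rather than raised, a caller looping over many files can record the error for one file and continue with the next.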
Overriding Format Detection
When a file extension is ambiguous (for example, a .txt file that uses
fixed-width columns rather than whitespace delimiters), set the sFormat
field in the quantitative standards JSON to override extension-based
detection:
{
    "sName": "fTemperature",
    "sDataFile": "results.txt",
    "sAccessPath": "column:TGlobal,index:-1",
    "sFormat": "fixedwidth",
    "fValue": 288.15
}
Valid sFormat values are the format names in the table above.
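The precedence of sFormat over the extension can be pictured with the short sketch below. The function name and the two-entry extension map are hypothetical, and the "whitespace" label is an assumed format name; only the field names and the "fixedwidth" value are taken from the example above.

```python
import os

# Hypothetical, heavily reduced extension map used only for this illustration.
EXAMPLE_EXTENSION_MAP = {".txt": "whitespace", ".csv": "csv"}


def resolve_format(standard, extension_map=EXAMPLE_EXTENSION_MAP):
    """Prefer an explicit sFormat; otherwise fall back to the file extension."""
    explicit = standard.get("sFormat")
    if explicit:
        return explicit
    extension = os.path.splitext(standard["sDataFile"])[1].lower()
    return extension_map.get(extension)


standard = {
    "sName": "fTemperature",
    "sDataFile": "results.txt",
    "sAccessPath": "column:TGlobal,index:-1",
    "sFormat": "fixedwidth",
    "fValue": 288.15,
}
print(resolve_format(standard))
```

Removing the sFormat key from the entry would make the sketch fall back to whatever the map holds for .txt.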
Access Path Syntax
Each quantitative benchmark specifies an access path that tells the test runner how to locate a value within a file. The syntax depends on the format:
| Format Family | Access Path Example | Meaning |
|---|---|---|
| CSV, whitespace, Excel, SPSS, Stata, SAS, VOTable, IPAC | | Last row of Temperature column |
| CSV, whitespace | | Mean of Temperature column |
| NumPy, MATLAB, safetensors | | First element of named array |
| NumPy, MATLAB, safetensors | | Mean of named array |
| NumPy (.npy) | | First element (flat) |
| JSON | | Nested key traversal |
| JSON | | First element of JSON array |
| JSON | | Mean of JSON array |
| HDF5, CGNS | | First element of dataset |
| FITS | | First row of flux column in HDU 1 |
| FITS | | Mean of image data in HDU 0 |
| FASTA, FASTQ | | Mean sequence length |
| VCF, BED, GFF, SAM | | First value in POS column |
| Key-value | | Value associated with key |
| PCAP | | Mean packet length |
| Syslog, CEF | | Line count |
| Multi-table | | First value in column X of first table |
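As an illustration of how a path such as column:TGlobal,index:-1 (the one form spelled out in this document) might be interpreted, here is a minimal sketch. The parser, its function names, and the dict-of-columns table are hypothetical; the syntax for other families, including the mean and nested-key forms, is not covered.

```python
def parse_access_path(access_path):
    """Split 'name:value,name:value' into a dict, converting integer values."""
    parts = {}
    for token in access_path.split(","):
        name, _, value = token.partition(":")
        name, value = name.strip(), value.strip()
        try:
            parts[name] = int(value)
        except ValueError:
            parts[name] = value  # keep non-numeric values such as column names
    return parts


def read_value(table, access_path):
    """Look up one value in a table given as {column name: list of values}."""
    parts = parse_access_path(access_path)
    return table[parts["column"]][parts["index"]]


table = {"TGlobal": [280.0, 285.2, 288.15]}
print(read_value(table, "column:TGlobal,index:-1"))
```

With index:-1 the lookup relies on Python's negative indexing, so it selects the last row of the column.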
Security
All np.load() calls use allow_pickle=False to prevent arbitrary code execution via malicious NumPy files.
PyTorch checkpoint files (.pt, .pth) are intentionally unsupported because they use pickle deserialization internally. Use safetensors instead.
File paths are validated with os.path.realpath() to prevent path traversal attacks.
Files larger than 500 MB are skipped to prevent memory exhaustion.
JSON traversal is limited to 10 levels of nesting depth.
Each file generates at most 250 benchmark entries to prevent test explosion on wide datasets.
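The path, size, depth, and count limits above could be enforced along the following lines. This is a sketch under stated assumptions, not the tool's implementation: the constant and function names are hypothetical, "500 MB" is read as 500 * 1024 * 1024 bytes, and the JSON walk handles only nested objects.

```python
import itertools
import os

MAX_FILE_BYTES = 500 * 1024 * 1024  # "500 MB" read as MiB here (an assumption)
MAX_JSON_DEPTH = 10
MAX_BENCHMARKS_PER_FILE = 250


def is_inside_directory(path, root):
    """Return True when the resolved path stays under the resolved root."""
    real_root = os.path.realpath(root)
    real_path = os.path.realpath(os.path.join(real_root, path))
    return os.path.commonpath([real_root, real_path]) == real_root


def is_within_size_limit(path):
    """Return True when the file is small enough to load."""
    return os.path.getsize(path) <= MAX_FILE_BYTES


def walk_json(node, prefix="", depth=0):
    """Yield (dotted key path, leaf value) pairs, skipping anything past the depth limit."""
    if depth > MAX_JSON_DEPTH:
        return
    if isinstance(node, dict):
        for key, value in node.items():
            child = f"{prefix}.{key}" if prefix else str(key)
            yield from walk_json(value, child, depth + 1)
    else:
        yield prefix, node


def collect_benchmarks(document):
    """Stop after the per-file cap so wide documents do not explode the test count."""
    return list(itertools.islice(walk_json(document), MAX_BENCHMARKS_PER_FILE))
```

Resolving both the root and the candidate with os.path.realpath() before comparing them means that .. segments and symlinks cannot move the final path outside the root unnoticed.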
Unsupported Files
Files with unrecognized extensions are handled as follows:
The first 4 bytes are inspected. If any byte is non-ASCII (> 127), the file is classified as an unsupported binary format.
For binary files, the introspection reports bLoadable: false with sError: "unsupported binary format". No benchmarks are generated, but the integrity test still verifies the file exists and is non-empty.
For text files with unknown extensions, the system falls back to whitespace-delimited parsing with automatic header detection.
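The handling of unknown extensions can be sketched as follows. The function names and the sFormat key in the text-file report are hypothetical; the bLoadable and sError fields and the four-byte check are taken from the description above.

```python
import os


def introspect_unknown_file(path):
    """Classify a file with an unrecognized extension from its first 4 bytes."""
    with open(path, "rb") as handle:
        head = handle.read(4)
    # Iterating over a bytes object yields integers in the range 0..255.
    if any(byte > 127 for byte in head):
        return {"bLoadable": False, "sError": "unsupported binary format"}
    # Otherwise fall back to whitespace-delimited text (label is an assumed name).
    return {"bLoadable": True, "sFormat": "whitespace"}


def passes_integrity_check(path):
    """The integrity test still applies to binary files: exists and is non-empty."""
    return os.path.isfile(path) and os.path.getsize(path) > 0
```

A four-byte sample is cheap but coarse: a binary file whose first four bytes all happen to fall in the ASCII range would be treated as text under this rule.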
To add support for a new format, add an entry to _DICT_FORMAT_MAP, a
loader function to the template, a benchmarker to the introspection script,
and integrity/no-NaN test generators. All new library imports must use
try/except ImportError for graceful degradation.