Expression matrix format

*The interpretation of the Advanced part is only available in EN.

Gene Expression File (GEF)

Gene expression file (GEF), in HDF5 format, is a data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF file format is a hierarchically structured data model that stores one or more gene expression matrices, each combined at a given bin size. The cellbin GEF file format stores expression information within each cell. Each GEF container organizes a collection of spatial gene expression matrices. It includes two primary data objects: Group and Dataset. A Dataset is a multidimensional array of data elements. A Group is analogous to a file system directory that organizes datasets and other groups in hierarchies.

Bin GEF

The first level of GEF includes four group objects: "geneExp" (required), "wholeExp" (optional), "wholeExpExon" (optional), and "stat" (optional). Group "geneExp" contains groups of gene spatial expression data in one or multiple bin sizes. Group "wholeExp" contains datasets that record expression level and gene type count of each coordinate in one or multiple bin sizes. Group "wholeExpExon" contains datasets that record the exon level of each coordinate in one or multiple bin sizes. Group "stat" saves gene names, total MID count and spatial pattern enrichment score of each gene. "Attributes" of the file record the version of GEF format, software version, and omics information. "Attribute" in each dataset records the key metrics of that dataset. Check the table to get details.

Attributes
File Attributes DataType Example Description
version uint32 2 Gene expression file format version.
geftool_ver uint32[3] 1,1,17 Geftool version. It can be used as an individual tool to manipulate GEF files.
omics S32 b'Transcriptomics' Omics name.
gef_area float32 4.4410855E10 Tissue or labeled tissue area in square nanometers.
bin_type S32 b'bin' Bin type of the GEF file.
sn S32 b'SS200000135TL_D1' Stereo-seq chip SN
/geneExp/binN/expression:Dataset "expression" is a 1D array which stores coordinates and MID counts of each gene in the bin size of N, aggregated by gene name.
Dataset Attributes DataType Example (bin1) Description
minX int32 59820 Minimum x coordinate in bin N.
minY int32 102086 Minimum y coordinate in bin N.
maxX int32 73040 Maximum x coordinate in bin N.
maxY int32 120539 Maximum y coordinate in bin N.
maxExp uint32 28 Maximum MID count in a spot when the bin size is N. Data type for "maxExp" is dynamically changed for each sample.
resolution uint32 500 Physical pitch (nm) between neighbor spots.
Dataset DataType:compound DataType Example (bin1) Description
x int32 71032 x coordinate in bin N.
y int32 103180 y coordinate in bin N.
count uint8/uint16/uint32 1 MID count at (x, y) when bin size is N. Data type for "count" is consistent with "maxExp" in the "Attributes."
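The table states only that the type of "count" follows "maxExp" and varies per sample. One plausible reading is that the narrowest unsigned type able to hold "maxExp" is used. The sketch below illustrates that rule; the function name `smallest_uint_type` is hypothetical and the selection rule itself is an assumption, not a documented geftool behavior.

```python
def smallest_uint_type(max_value):
    """Return the narrowest of uint8/uint16/uint32 that can hold max_value.

    Assumption: the GEF writer picks the narrowest type; the text only says
    that the type of "count" is consistent with "maxExp".
    """
    if max_value < 0:
        raise ValueError("MID counts cannot be negative")
    for name, bits in (("uint8", 8), ("uint16", 16), ("uint32", 32)):
        if max_value < 2 ** bits:
            return name
    raise ValueError("value does not fit in uint32")


# maxExp = 28 in the bin1 example above.
example_type = smallest_uint_type(28)
print(example_type)
```

Under this assumption, a reader that allocates buffers can size them from the "maxExp" attribute before loading the dataset.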
[optional] /geneExp/binN/exon:Dataset "exon" is a 1D array which stores exon expression of each gene in the bin size of N, aggregated by gene name.
Dataset Attributes DataType Example (bin1) Description
maxExon int32 21 Max exon expression in binN.
Dataset DataType:1D array DataType Example (bin1) Description
count uint8/uint16/uint32 0 Exon expression in binN at coordinate (x, y); the index is the same as the index in the "expression" dataset. Data type for "count" is dynamically changed for each sample.
/geneExp/binN/gene:Dataset "gene" is a 1D array which stores the gene names, the starting row indexes in dataset "expression", and row counts.
Dataset DataType:compound DataType Example (bin1) Description
geneID S64 b'ENSMUSG00000000001' Gene ID.
geneName S64 b'Gm16045' Gene name.
offset uint32 21 The starting row index in dataset "expression" for the gene. In this example, the gene expression data for gene "Gm16045" starts from row 21 in the dataset "expression."
count uint32 2 Row count. In this example, expression data for gene "Gm16045" is recorded in rows 21 and 22 (2 rows) in the dataset "expression."
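The "offset" and "count" fields act as a slice into the "expression" dataset. The sketch below reproduces the Gm16045 example with plain Python lists standing in for the HDF5 datasets. The names `GeneRecord` and `rows_for_gene` are hypothetical, and the row values other than the first example row are made up for illustration.

```python
from typing import List, NamedTuple, Tuple


class GeneRecord(NamedTuple):
    gene_id: str
    gene_name: str
    offset: int   # first row of this gene in "expression"
    count: int    # number of consecutive rows for this gene


def rows_for_gene(gene: GeneRecord, expression: List[Tuple[int, int, int]]):
    """Return the (x, y, count) rows of `expression` that belong to `gene`."""
    return expression[gene.offset:gene.offset + gene.count]


# Toy "expression" dataset: 21 placeholder rows, then two rows for Gm16045.
expression = [(0, 0, 1)] * 21 + [(71032, 103180, 1), (71040, 103200, 2)]
gm16045 = GeneRecord("ENSMUSG00000000001", "Gm16045", offset=21, count=2)

for x, y, mid in rows_for_gene(gm16045, expression):
    print(x, y, mid)
```

Because rows are aggregated by gene, a single contiguous slice is enough to recover every spot of one gene.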
[optional] /wholeExp/binN:Dataset "binN" is a 2D array (matrix) which stores the MID count and gene type count at each spot.
Dataset Attributes DataType Example (bin1) Description
number uint64 22879557 Number of non-zero spots in the dense matrix.
minX int32 59820 Minimum x coordinate in bin N.
lenX int32 13221 Length of x.
minY int32 102086 Minimum y coordinate in bin N.
lenY int32 18454 Length of y.
maxMID uint32 2155 Maximum MID count in a spot.
maxGene uint32 846 Maximum gene type count in a spot.
resolution uint32 500 Pitch (nm) between neighbor spots.
Dataset DataType: 2D array (X×Y), compound DataType Example (bin1) Description
MIDcount uint8/uint16/uint32 1 MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample.
genecount uint16 1 Gene count in the spot. The spot coordinate can be identified from "Attributes" and the indexes of the 2D array.
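The coordinate recovery described for "MIDcount" and "genecount" can be sketched as follows. The function name `spot_coordinate` and its `step` parameter are hypothetical. Treating the first matrix index as x follows the "X×Y" shape given in the table, and using a step equal to the bin size for bins larger than 1 is an assumption.

```python
def spot_coordinate(i, j, min_x, min_y, step=1):
    """Map matrix indices (i along x, j along y) to spot coordinates.

    `step` is the coordinate distance between neighbouring matrix cells.
    step=1 matches bin1; using the bin size for larger bins is an assumption.
    """
    return min_x + i * step, min_y + j * step


# Attributes from the bin1 example above.
MIN_X, LEN_X = 59820, 13221
MIN_Y, LEN_Y = 102086, 18454

first_spot = spot_coordinate(0, 0, MIN_X, MIN_Y)
last_spot = spot_coordinate(LEN_X - 1, LEN_Y - 1, MIN_X, MIN_Y)
print(first_spot, last_spot)
```

With the example attributes, the last index pair lands on (73040, 120539), which agrees with the "maxX" and "maxY" values listed for the bin1 "expression" dataset.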
[optional] /wholeExpExon/binN:Dataset "binN" in "/wholeExpExon/" Group is a 2D array (matrix) which stores the exon expression count at each spot.
Dataset Attributes DataType Example (bin1) Description
maxExon uint32 21 Maximum exon expression count in a spot when the bin size is N.
Dataset DataType: 2D array DataType Example (bin1) Description
MIDcount uint8/uint16/uint32 0 MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample.
[optional] /stat/gene:Dataset "gene" is a 1D array which stores the MID count and spatial pattern enrichment score (E10) of each gene. The array is sorted by MID count in descending order.
Dataset Attributes DataType Example Description
maxE10 float32 65.53 Maximum E10 score.
minE10 float32 0. Minimum E10 score.
cutoff float32 0.1 Threshold for filtering spots that will be used for computing E10. In this example, 0.1 means that the spots whose MID count is in the top 10% are used for calculating the spatial enrichment score.
Dataset DataType:compound DataType Example Description
geneID S64 b'ENSMUSG00000000001' Gene ID.
geneName S64 b'Ptgds' Gene name.
MIDcount uint32 229502 MID count for the gene.
E10 float32 65.53 The spatial pattern enrichment score (E10) for the gene.

[optional] /proteinList:Dataset "proteinList" is a 1D array which stores the protein panel information of the sample.

Dataset DataType:compound DataType Example Description
PIDName H5T_STRING CD169 Protein name in the protein panel
GeneName H5T_STRING Siglec1 Protein's marker gene
GeneID H5T_STRING ENSMUSG00000027322 Ensembl gene IDs

Cell Bin GEF

The first layer of Cell Bin GEF contains one required group "cellBin" and multiple optional datasets. The second layer "codedCellBlock" is optional, which stores precomputed data used in the rendering of StereoMap. "Attributes" of the file record the version of GEF format, software version, and omics information. "Attribute" in each dataset records the key metrics of that dataset. Check the table to get more details.

Attributes
File Attributes DataType Example Description
geftool_ver uint32[3] 0,7,11 geftool version. It can be used as an individual tool to manipulate GEF files.
offsetX int32 0 Minimum x coordinate in bin 1.
offsetY int32 0 Minimum y coordinate in bin 1.
omics S32 b'Transcriptomics' Omics name.
resolution uint32 500 Pitch (nm) between neighbor spots.
version uint32 2 Gene expression file format version.
bin_type S32 CellBin Bin type of the GEF file.
sn S32 b'SS200000135TL_D1' Stereo-seq chip SN
/cellBin/cell:Dataset "cell" is a 1D array which stores basic information and indices information of cells and expression.
Dataset Attributes DataType Example Description
averageArea float32 494.666 Average area for cells in pixel.
averageDnbCount float32 194.299 Average number of mRNA-captured DNBs in a cell.
averageExpCount float32 541.715 Average MID count in cell.
averageGeneCount float32 310.157 Average gene count in cell.
maxArea uint16 1925 Maximum area for cells in pixel.
maxDnbCount uint16 883 Maximum number of mRNA-captured DNBs in a cell.
maxExpCount uint16 3018 Maximum MID count in cell.
maxGeneCount uint16 1415 Maximum gene count in cell.
maxX int32 17658 Maximum x coordinate of the cell’s center of mass.
maxY int32 19422 Maximum y coordinate of the cell’s center of mass.
medianArea float32 474. Median area for cells in pixel.
medianDnbCount float32 183. Median number of mRNA-captured DNBs in a cell.
medianExpCount float32 491. Median MID count in cell.
medianGeneCount float32 289. Median gene count in cell.
minArea uint16 2 Minimum area for cells in pixel.
minDnbCount uint16 0 Minimum number of mRNA-captured DNBs in a cell.
minExpCount uint16 0 Minimum MID count in cell.
minGeneCount uint16 0 Minimum gene count in cell.
minX int32 2933 Minimum x coordinate of the cell’s center of mass.
minY int32 5568 Minimum y coordinate of the cell’s center of mass.
Dataset DataType:compound DataType Example Description
id uint32 10 Cell ID index, starting from 0. In the example, 10 represents the 10th cell in the dataset.
x int32 541 The x coordinate of the cell’s center of mass. In the example, the x coordinate of the 10th cell’s center of mass is 541.
y int32 190 The y coordinate of the cell’s center of mass. In the example, the y coordinate of the 10th cell’s center of mass is 190.
offset uint32 494 The start row index of the cell in the "/cellBin/cellExp" dataset. In the example, the gene ID indices and MID counts of the 10th cell start from the 494th row of the "/cellBin/cellExp" dataset.
geneCount uint16 100 Gene count in the cell. In the example, 100 means that 100 rows of "/cellBin/cellExp", from the 494th to the 593rd row, contain the gene ID indices and MID counts of the genes for the 10th cell in the "/cellBin/cell" dataset.
expCount uint16 500 Cell MID count.
dnbCount uint16 200 mRNA-captured DNBs of the cell.
area uint16 474 Cell area in pixel.
cellTypeID uint32 0 Cell type ID.
clusterID uint32 20 Cell cluster ID.
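The link between a row of "cell" and its rows in "/cellBin/cellExp" is again an offset plus a length. The sketch below uses plain Python lists and dicts in place of the HDF5 datasets. The function name `genes_in_cell` and all toy values are hypothetical.

```python
def genes_in_cell(cell, cell_exp, gene_names):
    """List (gene name, MID count) pairs for one row of the "cell" dataset.

    cell       -- dict with the "offset" and "geneCount" fields
    cell_exp   -- list of (geneID index, count) rows, as in "/cellBin/cellExp"
    gene_names -- gene names indexed by geneID index, as in "/cellBin/gene"
    """
    start = cell["offset"]
    stop = start + cell["geneCount"]
    return [(gene_names[gene_idx], count) for gene_idx, count in cell_exp[start:stop]]


# Toy data (values are illustrative, not taken from a real file).
gene_names = ["GeneA", "GeneB", "GeneC"]
cell_exp = [(0, 2), (2, 1), (1, 5), (2, 3)]   # rows 0-1: cell 0, rows 2-3: cell 1
cells = [
    {"id": 0, "offset": 0, "geneCount": 2},
    {"id": 1, "offset": 2, "geneCount": 2},
]

for cell in cells:
    print(cell["id"], genes_in_cell(cell, cell_exp, gene_names))
```

For the table's own example, offset 494 and geneCount 100 select rows 494 through 593 of "/cellBin/cellExp".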
/cellBin/cellBorder:Dataset "cellBorder" is a 3D array which stores the lists of points for the bounding polygons of the cell.
Dataset Attributes DataType Example Description
maxX int32 16127 Maximum x coordinate of the bounding box of the cell.
maxY int32 16663 Maximum y coordinate of the bounding box of the cell.
minX int32 11129 Minimum x coordinate of the bounding box of the cell.
minY int32 12784 Minimum y coordinate of the bounding box of the cell.
Dataset DataType:3D array DataType Example Description


32*(int16,int16) [[-17,-11],[-15,-5]…[32767,32767]] A list of 32 coordinates recording the differences between cell bounding points and the cell’s center of mass (0,0). The real coordinate of cell’s center of mass (x, y) can be obtained from "cell" dataset using cellID.
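Since border points are stored as offsets from the center of mass, absolute coordinates are obtained by adding the center from the "cell" dataset. In the sketch below, treating [32767, 32767] as a filler for unused slots is an assumption based on the example row (32767 is the int16 maximum). The function name `absolute_border` and the third offset are hypothetical.

```python
PAD = (32767, 32767)  # assumed filler for unused slots (int16 maximum)


def absolute_border(center, offsets, pad=PAD):
    """Convert border offsets relative to the cell center into absolute points."""
    cx, cy = center
    points = []
    for dx, dy in offsets:
        if (dx, dy) == pad:
            break  # remaining slots are assumed to be padding
        points.append((cx + dx, cy + dy))
    return points


# Center (541, 190) is the example cell from the "cell" dataset above.
offsets = [(-17, -11), (-15, -5), (4, 9)] + [PAD] * 29   # 32 slots in total
border = absolute_border((541, 190), offsets)
print(border)
```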
/cellBin/cellExp:Dataset "cellExp" is a 1D array which stores the expression information of each cell.
Dataset Attributes DataType Example Description
maxCount uint16 336 Maximum MID count of a gene in a cell.
Dataset DataType:compound DataType Example Description
geneID uint32 1610 Gene IDs of the genes detected in the cell. ID is the index into the "gene" dataset. In the example, 1610 represents the 1610th item in the "gene" dataset, where the name of the gene can be looked up.
count uint16 3 MID count for the gene. In the example (assuming this is the 0th item in the "cellExp" dataset, which the "offset" and "geneCount" fields in the "cell" dataset assign to the cell whose cellID=0), the MID count for the gene (geneID=1610) in the cell (cellID=0) is 3.
[optional] /cellBin/cellExon:Dataset "cellExon" is a 1D array which stores the exon information for each cell.
Dataset Attributes DataType Example Description
maxExon uint16 5793 Maximum exon count of a gene in all cells.
minExon uint16 0 Minimum exon count of a gene in all cells.
Dataset DataType:1D array DataType Example Description


uint16 16 Exon count in a cell; the index of the array is the same as the cellID in the "cell" dataset.
[optional] /cellBin/cellExpExon:Dataset "cellExpExon" is a 1D array which stores exon expression information for each cell.
Dataset Attributes DataType Example Description
maxExon uint16 336 Maximum exon count of a gene in a cell.
Dataset DataType:1D array DataType Example Description


uint16 3 Exon count (MID) for the gene. The index is the same as in the "cellExp" dataset. In the example (assuming this is the 0th item in the "cellExpExon" dataset, which, through the shared index with "cellExp" and the "offset" and "geneCount" fields in the "cell" dataset, belongs to the cell whose cellID=0), the exon count (MID) for the gene (geneID=1610) in the cell (cellID=0) is 3.
/cellBin/cellTypeList:Dataset "cellTypeList" is a 1D array which stores cell types of each cell.
Dataset DataType:1D array DataType Example Description


S32 b'default' Cell type, "default" stands for undefined cell type.
/cellBin/gene:Dataset "gene" is a 1D array which stores the indices of cell and expression information of each gene.
Dataset Attributes DataType Example Description
maxCellCount uint32 5718 Maximum number of cells in which a gene is detected.
maxExpCount uint32 55361 Maximum MID count of a gene.
minCellCount uint32 1 Minimum number of cells in which a gene is detected.
minExpCount uint32 1 Minimum MID count of a gene.
Dataset DataType:compound DataType Example Description
geneID S32 b'ENSMUSG00000000001' Gene ID.
geneName S32 b'AC149090.1' Gene name.
offset uint32 0 The start row index of the gene in the "/cellBin/geneExp" dataset. In the example, 0 means that the cellIDs and MID counts of "AC149090.1" are recorded starting from the 0th item in the "/cellBin/geneExp" dataset.
cellcount uint32 60 Number of cells in which the gene is detected. In the example, 60 means that the 0th to the 59th items record the information of gene "AC149090.1".
expCount uint32 100 Sum of MID count for the gene. In the example, the total MID count of "AC149090.1" is 100.
maxMIDcount uint16 4 Maximum MID count of the gene in a cell. In this case, the maximum MID count of gene "AC149090.1" in a cell is 4.
/cellBin/geneExp:Dataset "geneExp" is a 1D array which stores cell and expression information of each gene.
Dataset Attributes DataType Example Description
maxCount uint16 10 Maximum MID count of a gene.
Dataset DataType:compound DataType Example Description
cellID uint32 1247 ID of a cell in which the gene is detected; the gene is identified through the "gene" dataset. In the example (assuming this is the 0th item in the "geneExp" dataset), 1247 shows that the gene "AC149090.1" appears in the cell whose cellID is 1247.
count uint16 3 MID count of the gene in that cell. In the example, the MID count of gene "AC149090.1" in the cell (cellID=1247) is 3.
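The gene-wise view mirrors the cell-wise one: "offset" and "cellcount" in "/cellBin/gene" select a run of rows in "/cellBin/geneExp". The sketch below slices those rows and checks them against the per-gene summary fields. The function names and the toy values are hypothetical, and the consistency rule (expCount equals the sum of counts, maxMIDcount equals their maximum) is inferred from the field descriptions.

```python
def cells_for_gene(gene, gene_exp):
    """Return the (cellID, count) rows of "geneExp" belonging to one gene."""
    start = gene["offset"]
    return gene_exp[start:start + gene["cellcount"]]


def check_gene_record(gene, gene_exp):
    """Check expCount and maxMIDcount against the rows the gene points to."""
    counts = [count for _, count in cells_for_gene(gene, gene_exp)]
    return (sum(counts) == gene["expCount"]
            and max(counts, default=0) == gene["maxMIDcount"])


# Toy data: three rows for the first gene, one row for another gene.
gene_exp = [(1247, 3), (1300, 4), (2001, 1), (5, 2)]
gene = {"geneName": "AC149090.1", "offset": 0, "cellcount": 3,
        "expCount": 8, "maxMIDcount": 4}

print(cells_for_gene(gene, gene_exp))
print(check_gene_record(gene, gene_exp))
```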
[optional] /cellBin/geneExon:Dataset "geneExon" is a 1D array which stores the exon expression information of each gene.
Dataset Attributes DataType Example Description
maxExon uint32 55361 Maximum exon count of a gene.
minExon uint32 0 Minimum exon count of a gene.
Dataset DataType:1D array DataType Example Description


uint32 97 Total exon count of a gene; the index of the "geneExon" dataset is the same as that of the "gene" dataset. In the example (assuming this is the 0th item in the "geneExon" dataset, and gene "AC149090.1" is the 0th item in the "gene" dataset), the exon count of gene "AC149090.1" is 97.
[optional] /cellBin/geneExpExon:Dataset "geneExpExon" is a 1D array which stores the exon expression information in cells of each gene.
Dataset Attributes DataType Example Description
maxExon uint16 336 Maximum exon expression of a gene in a cell.
Dataset DataType:1D array DataType Example Description


uint16 3 Exon count of a gene in a cell. The index of the "geneExpExon" dataset is the same as that of the "geneExp" dataset. In the example (assuming this is the 0th item in the "geneExpExon" dataset, which, through the shared index with "geneExp" and the "offset" and "cellcount" fields in the "gene" dataset, belongs to the gene "AC149090.1"), the exon count of gene "AC149090.1" in cell 1247 is 3.
/cellBin/blockIndex:Dataset "blockIndex" is a 1D array which stores the matrix block partition information.
Dataset DataType:1D array DataType Example Description


uint32 0 Cell count in each partition block. cnt = blockIndex[i+1] - blockIndex[i]
/cellBin/blockSize:Dataset "blockSize" is a 1D array which stores the block size of the partition.
Dataset DataType:1D array DataType Example Description


uint32 256, 256, 104, 104 4-element array. The 4 items represent the block length in x-axis, block length in y-axis, block count in x-axis, and block count in y-axis, respectively.
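The formula cnt = blockIndex[i+1] - blockIndex[i] suggests that the block index array holds cumulative cell offsets, with one more entry than there are blocks. The sketch below applies the formula and derives the number of blocks from the 4-element block size array. Both function names and the toy index values are hypothetical, and the cumulative-offset reading is an inference from the formula rather than a documented guarantee.

```python
def cells_per_block(block_index):
    """Apply cnt = blockIndex[i+1] - blockIndex[i] to every block."""
    return [block_index[i + 1] - block_index[i] for i in range(len(block_index) - 1)]


def expected_block_count(block_size):
    """Number of blocks implied by [len_x, len_y, count_x, count_y]."""
    _, _, count_x, count_y = block_size
    return count_x * count_y


# Toy cumulative index for four blocks (values are illustrative).
block_index = [0, 3, 3, 7, 12]
print(cells_per_block(block_index))

# Block size array from the example above.
print(expected_block_count([256, 256, 104, 104]))
```

Under this reading, a file with the example block size would carry 104 × 104 blocks and a block index array with one more entry than that.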
[optional] /codedCellBlock:Group "codedCellBlock" stores pre-computed data for rendering in StereoMap.
Group Attributes DataType Example Description

info string {"@type": "neuroglancer_annotations_v1", ...} Metadata of encoded precomputed data in JSON.

[optional] /codedCellBlock/L0/0_1:Dataset "0_1" is an example chunk of encoded pre-computed data, including id, geometry, and so on.
Dataset DataType:Bytes DataType Example Description


H5T_OPAQUE 1F 8B 08 00 ... Bytecode of the chunk.
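The example bytes 1F 8B 08 match the start of a gzip stream (magic number 1F 8B, compression method 08 for deflate). Assuming chunks are gzip-compressed, which the table does not state, a reader could decode them as sketched below. The function name `decode_chunk` is hypothetical and the payload is a stand-in, not real chunk content.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"


def decode_chunk(raw: bytes) -> bytes:
    """Return the decompressed payload if `raw` looks like gzip, else `raw` unchanged."""
    if raw[:2] == GZIP_MAGIC:
        return gzip.decompress(raw)
    return raw


payload = b'{"id": 1}'          # stand-in payload, not real chunk content
chunk = gzip.compress(payload)  # produces a stream starting with 1F 8B 08

print(chunk[:3].hex(" "))
print(decode_chunk(chunk))
```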

[optional] /proteinList:Dataset "proteinList" is a 1D array which stores the protein panel information of the sample.

Dataset DataType:compound DataType Example Description
PIDName H5T_STRING CD169 Protein name in the protein panel
GeneName H5T_STRING Siglec1 Protein's marker gene
GeneID H5T_STRING ENSMUSG00000027322 Ensembl gene IDs

Gene Expression Matrix (GEM)

Gene expression matrix (GEM) is a text file that stores gene spatial expression data. SAW generates multiple gene expression matrix files in the workflow. The basic format requires six columns with a header row that shows the column names. The six columns are gene ID, gene name, x coordinate, y coordinate, MID count and exon count. In a cellbin GEM, a seventh column holds the cell ID. The expression matrix for the maximum-area enclosing rectangle region starts with several annotation rows beginning with "#" before the column header row. The header field names and field types are described in the table.

Fields Data Type Example Description
#FileFormat string GEMv0.2 Gene expression matrix file format version.
#SortedBy string None Gene expression matrix sorting strategy. Valid values: "geneID", "x", "y", "MIDCount", "None".
#BinType string Bin Bin type of the GEM file.
#BinSize string 1 (Please check 1.3 Terminologies and Concepts Bin)
#Omics string Transcriptomics Omics name.
#Stereo-seqChip string SS200000135TL_D1 Stereo-seq Chip T serial number.
#OffsetX uint32 1 X coordinate of the origin before calibration.
#OffsetY uint32 1 Y coordinate of the origin before calibration.
geneID string ENSMUSG00000000001 Gene ID
geneName string Gnai3 Gene name.
x uint32 16809 X coordinate of the spot.
y uint32 8546 Y coordinate of the spot.
MIDCount uint32 1 Number of MIDs at (x, y) for the gene in the corresponding row.
ExonCount uint32 0 [Optional] Exon count at (x, y) for the gene in the corresponding row.
CellID uint32 55892 [Optional] CellID for (x, y).

An example of bin GEM:

#FileFormat=GEMv0.2
#SortedBy=None
#BinType=Bin
#BinSize=1
#Omics=Transcriptomics
#Stereo-seqChip=B03523G1
#OffsetX=0
#OffsetY=0
geneID  geneName        x       y       MIDCount        ExonCount
ENSMUSG00000000001      Gnai3   694     17229   1       1
ENSMUSG00000000001      Gnai3   1428    4994    1       1

An example of cellbin GEM:

#FileFormat=GEMv0.2
#SortedBy=None
#BinType=CellBin
#BinSize=Cell
#Omics=Transcriptomics
#Stereo-seqChip=B03523G1
#OffsetX=0
#OffsetY=0
geneID  geneName        x       y       MIDCount        ExonCount       CellID
ENSMUSG00000047454      Gphn    9325    19972   1       0       192276
ENSMUSG00000030616      Sytl2   9314    19976   1       1       192276
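A GEM file in the layout shown above can be read with a few lines of Python. The sketch below is a minimal parser, assuming annotation rows have the form "#key=value" and columns are separated by whitespace (real files are likely tab-delimited, which this also handles). The function name `parse_gem` is hypothetical.

```python
def parse_gem(text):
    """Split GEM text into (annotations, records).

    annotations -- dict built from the leading "#key=value" rows
    records     -- one dict per data row, keyed by the column header names
    """
    annotations, records, columns = {}, [], None
    int_columns = {"x", "y", "MIDCount", "ExonCount", "CellID"}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            key, _, value = line[1:].partition("=")
            annotations[key] = value
        elif columns is None:
            columns = line.split()          # header row with the column names
        else:
            row = dict(zip(columns, line.split()))
            for name in int_columns.intersection(row):
                row[name] = int(row[name])  # numeric columns become integers
            records.append(row)
    return annotations, records


sample = """#FileFormat=GEMv0.2
#SortedBy=None
#BinType=CellBin
#BinSize=Cell
#Omics=Transcriptomics
#Stereo-seqChip=B03523G1
#OffsetX=0
#OffsetY=0
geneID  geneName        x       y       MIDCount        ExonCount       CellID
ENSMUSG00000047454      Gphn    9325    19972   1       0       192276
ENSMUSG00000030616      Sytl2   9314    19976   1       1       192276
"""

annotations, records = parse_gem(sample)
print(annotations["BinType"], len(records))
```

Because the column names come from the header row, the same function reads a bin GEM, where the CellID column is simply absent.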
© 2025 STOmics Tech. All rights reserved. Modified: 2025-03-07 10:28:04
