Expression matrix format
Gene Expression File (GEF)
Gene Expression File (GEF) is an HDF5-based data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more binned gene expression matrices at various bin sizes. The cellbin GEF format stores the expression information of each cell. Each GEF container organizes a collection of spatial gene expression matrices and includes two primary data objects: Group and Dataset. A Dataset is a multidimensional array of data elements. A Group is analogous to a file system directory and organizes datasets and other groups in a hierarchy.
Bin GEF
The first level of GEF includes four group objects: "geneExp" (required), "wholeExp" (optional), "wholeExpExon" (optional), and "stat" (optional). Group "geneExp" contains groups of gene spatial expression data in one or multiple bin sizes. Group "wholeExp" contains datasets that record expression level and gene type count of each coordinate in one or multiple bin sizes. Group "wholeExpExon" contains datasets that record the exon level of each coordinate in one or multiple bin sizes. Group "stat" saves gene names, total MID count and spatial pattern enrichment score of each gene. "Attributes" of the file record the version of GEF format, software version, and omics information. "Attribute" in each dataset records the key metrics of that dataset. Check the table to get details.
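As a rough illustration of the layout described above, the sketch below takes a list of object paths (as might be collected by walking the file with an HDF5 library) and reports whether the required group is present, which optional groups exist, and which bin sizes are stored under "geneExp". The function name `summarize_layout` and the sample paths are hypothetical; only the group names and the "binN" naming come from the description.

```python
import re

# Group names taken from the description above.
REQUIRED_GROUPS = {"geneExp"}
OPTIONAL_GROUPS = {"wholeExp", "wholeExpExon", "stat"}
BIN_PATTERN = re.compile(r"^/geneExp/bin(\d+)(/|$)")


def summarize_layout(paths):
    """Summarize a bin GEF hierarchy from a list of absolute object paths."""
    top_level = {path.strip("/").split("/")[0] for path in paths}
    bin_sizes = set()
    for path in paths:
        match = BIN_PATTERN.match(path)
        if match:
            bin_sizes.add(int(match.group(1)))
    return {
        "missing_required": sorted(REQUIRED_GROUPS - top_level),
        "optional_present": sorted(OPTIONAL_GROUPS & top_level),
        "bin_sizes": sorted(bin_sizes),
    }


# Hypothetical path listing for a file with two bin sizes.
example_paths = [
    "/geneExp/bin1/expression",
    "/geneExp/bin1/gene",
    "/geneExp/bin50/expression",
    "/geneExp/bin50/gene",
    "/wholeExp/bin1",
    "/stat/gene",
]
layout = summarize_layout(example_paths)
print(layout)
```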
Attributes | |||
---|---|---|---|
File Attributes | DataType | Example | Description |
version | uint32 | 2 | Gene expression file format version. |
geftool_ver | uint32[3] | 1,1,17 | Geftool version. It can be used as an individual tool to manipulate GEF files. |
omics | S32 | b'Transcriptomics' | Omics name. |
gef_area | float32 | 4.4410855E10 | Tissue or labeled tissue area in square nanometers. |
bin_type | S32 | b'bin' | Bin type of the GEF file. |
sn | S32 | b'SS200000135TL_D1' | Stereo-seq chip SN |
/geneExp/binN/expression:Dataset "expression" is a 1D array which stores coordinates and MID counts of each gene in the bin size of N, aggregated by gene name. | |||
Dataset Attributes | DataType | Example (bin1) | Description |
minX | int32 | 59820 | Minimum x coordinate in bin N. |
minY | int32 | 102086 | Minimum y coordinate in bin N. |
maxX | int32 | 73040 | Maximum x coordinate in bin N. |
maxY | int32 | 120539 | Maximum y coordinate in bin N. |
maxExp | uint32 | 28 | Maximum MID count in a spot when the bin size is N. Data type for "maxExp" is dynamically changed for each sample. |
resolution | uint32 | 500 | Physical pitch (nm) between neighbor spots. |
Dataset DataType:compound | DataType | Example (bin1) | Description |
x | int32 | 71032 | x coordinate in bin N. |
y | int32 | 103180 | y coordinate in bin N. |
count | uint8/uint16/uint32 | 1 | MID count at (x, y) when bin size is N. Data type for "count" is consistent with "maxExp" in the "Attributes." |
[optional] /geneExp/binN/exon:Dataset "exon" is a 1D array which stores exon expression of each gene in the bin size of N, aggregated by gene name. | |||
Dataset Attributes | DataType | Example (bin1) | Description |
maxExon | int32 | 21 | Max exon expression in binN. |
Dataset DataType:1D array | DataType | Example (bin1) | Description |
count | uint8/uint16/uint32 | 0 | Exon expression in binN at coordinate (x, y). The index is the same as the index in the "expression" dataset. Data type for "count" is dynamically changed for each sample. |
/geneExp/binN/gene:Dataset "gene" is a 1D array which stores the gene names, the starting row indexes in dataset "expression", and row counts. | |||
Dataset DataType:compound | DataType | Example (bin1) | Description |
geneID | S64 | b'ENSMUSG00000000001' | Gene ID. |
geneName | S64 | b'Gm16045' | Gene name. |
offset | uint32 | 21 | The starting row index in dataset "expression" for the gene. In this example, the gene expression data for gene "Gm16045" starts from row 21 in the dataset "expression." |
count | uint32 | 2 | Row count. In this example, expression data for gene "Gm16045" is recorded in rows 21 and 22 (2 rows) in the dataset "expression." |
[optional] /wholeExp/binN:Dataset "binN" is a 2D array (matrix) which stores the MID count and gene type count at each spot. | |||
Dataset Attributes | DataType | Example (bin1) | Description |
number | uint64 | 22879557 | Number of non-zero spots in the dense matrix. |
minX | int32 | 59820 | Minimum x coordinate in bin N. |
lenX | int32 | 13221 | Length of x. |
minY | int32 | 102086 | Minimum y coordinate in bin N. |
lenY | int32 | 18454 | Length of y. |
maxMID | uint32 | 2155 | Maximum MID count in a spot. |
maxGene | uint32 | 846 | Maximum gene type count in a spot. |
resolution | uint32 | 500 | Pitch (nm) between neighbor spots. |
Dataset DataType: 2D array (XⅹY), compound | DataType | Example (bin1) | Description |
MIDcount | uint8/uint16/uint32 | 1 | MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample. |
genecount | uint16 | 1 | Gene count in the spot. The spot coordinate can be identified from "Attributes" and the indexes of the 2D array. |
[optional] /wholeExpExon/binN:Dataset "binN" in "/wholeExpExon/" Group is a 2D array (matrix) which stores the exon expression count at each spot. | |||
Dataset Attributes | DataType | Example (bin1) | Description |
maxExon | uint32 | 21 | Maximum exon expression count in a spot when the bin size is N. |
Dataset DataType: 2D array | DataType | Example (bin1) | Description |
MIDcount | uint8/uint16/uint32 | 0 | MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample. |
[optional] /stat/gene:Dataset "gene" is a 1D array which stores the MID count and spatial pattern enrichment score (E10) of each gene. The array is sorted by MID count in descending order. | |||
Dataset Attributes | DataType | Example | Description |
maxE10 | float32 | 65.53 | Maximum E10 score. |
minE10 | float32 | 0. | Minimum E10 score. |
cutoff | float32 | 0.1 | Threshold for filtering spots that will be used for computing E10. In this example, 0.1 means that the spots whose MID count is in the top 10% are used for calculating the spatial enrichment score. |
Dataset DataType:compound | DataType | Example | Description |
geneID | S64 | b'ENSMUSG00000000001' | Gene ID. |
geneName | S64 | b'Ptgds' | Gene name. |
MIDcount | uint32 | 229502 | MID count for the gene. |
E10 | float32 | 65.53 | The spatial pattern enrichment score (E10) for the gene. |
[optional] /proteinList:Dataset "proteinList" is a 1D array which stores the protein panel information of the sample. | |||
Dataset DataType:compound | DataType | Example | Description |
PIDName | H5T_STRING | CD169 | Protein name in the protein panel |
GeneName | H5T_STRING | Siglec1 | Protein's marker gene |
GeneID | H5T_STRING | Ensembl gene IDs | Ensembl gene IDs |
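The "gene" dataset works as an index into "expression": each gene record gives a starting row ("offset") and a number of rows ("count"). The sketch below illustrates that lookup, together with the coordinate recovery described for "/wholeExp/binN", using plain Python lists in place of the HDF5 datasets. The function names, gene labels, and row values are hypothetical. The coordinate helper assumes bin1 (one matrix step equals one coordinate unit) and that the first matrix index runs along x.

```python
# Hypothetical in-memory stand-ins for /geneExp/bin1/gene and
# /geneExp/bin1/expression; real files store them as HDF5 compound datasets.
genes = [
    # (geneID, geneName, offset, count)
    ("ID_A", "GeneA", 0, 3),
    ("ID_B", "GeneB", 3, 2),
]
expression = [
    # (x, y, count), grouped by gene in the same order as `genes`
    (71032, 103180, 1),
    (71040, 103200, 2),
    (71100, 103250, 1),
    (60000, 110000, 1),
    (60010, 110020, 4),
]


def rows_for_gene(gene_table, expression_table, gene_name):
    """Return the (x, y, count) rows of one gene using its offset and count."""
    for _gene_id, name, offset, count in gene_table:
        if name == gene_name:
            return expression_table[offset:offset + count]
    raise KeyError(gene_name)


def spot_coordinate(ix, iy, min_x, min_y):
    """Map matrix indices of a bin1 wholeExp matrix to spot coordinates."""
    return min_x + ix, min_y + iy


gene_b_rows = rows_for_gene(genes, expression, "GeneB")
total_mid = sum(count for _x, _y, count in gene_b_rows)

# Using the example attributes from the table: minX=59820, minY=102086,
# lenX=13221, lenY=18454. The last matrix cell maps to (maxX, maxY).
last_spot = spot_coordinate(13221 - 1, 18454 - 1, 59820, 102086)
print(gene_b_rows, total_mid, last_spot)
```

With the example attributes, the last matrix cell maps to (73040, 120539), which agrees with the "maxX" and "maxY" values listed for the "expression" dataset.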
Cell Bin GEF
The first layer of Cell Bin GEF contains one required group "cellBin" and multiple optional datasets. The second layer "codedCellBlock" is optional, which stores precomputed data used in the rendering of StereoMap. "Attributes" of the file record the version of GEF format, software version, and omics information. "Attribute" in each dataset records the key metrics of that dataset. Check the table to get more details.
Attributes | |||
File Attributes | DataType | Example | Description |
geftool_ver | uint32[3] | 0,7,11 | geftool version. It can be used as an individual tool to manipulate GEF files. |
offsetX | int32 | 0 | Minimum x coordinate in bin 1. |
offsetY | int32 | 0 | Minimum y coordinate in bin 1. |
omics | S32 | b'Transcriptomics' | Omics name. |
resolution | uint32 | 500 | Pitch (nm) between neighbor spots. |
version | uint32 | 2 | Gene expression file format version. |
bin_type | S32 | CellBin | Bin type of the GEF file. |
sn | S32 | b'SS200000135TL_D1' | Stereo-seq chip SN |
/cellBin/cell:Dataset "cell" is a 1D array which stores basic information and indices information of cells and expression. | |||
Dataset Attributes | DataType | Example | Description |
averageArea | float32 | 494.666 | Average area for cells in pixel. |
averageDnbCount | float32 | 194.299 | Average number of mRNA-captured DNBs in a cell. |
averageExpCount | float32 | 541.715 | Average MID count in cell. |
averageGeneCount | float32 | 310.157 | Average gene count in cell. |
maxArea | uint16 | 1925 | Maximum area for cells in pixel. |
maxDnbCount | uint16 | 883 | Maximum number of mRNA-captured DNBs in a cell. |
maxExpCount | uint16 | 3018 | Maximum MID count in cell. |
maxGeneCount | uint16 | 1415 | Maximum gene count in cell. |
maxX | int32 | 17658 | Maximum x coordinate of the cell’s center of mass. |
maxY | int32 | 19422 | Maximum y coordinate of the cell’s center of mass. |
medianArea | float32 | 474. | Median area for cells in pixel. |
medianDnbCount | float32 | 183. | Median number of mRNA-captured DNBs in a cell. |
medianExpCount | float32 | 491. | Median MID count in cell. |
medianGeneCount | float32 | 289. | Median gene count in cell. |
minArea | uint16 | 2 | Minimum area for cells in pixel. |
minDnbCount | uint16 | 0 | Minimum number of mRNA-captured DNBs in a cell. |
minExpCount | uint16 | 0 | Minimum MID count in cell. |
minGeneCount | uint16 | 0 | Minimum gene count in cell. |
minX | int32 | 2933 | Minimum x coordinate of the cell’s center of mass. |
minY | int32 | 5568 | Minimum y coordinate of the cell’s center of mass. |
Dataset DataType:compound | DataType | Example | Description |
id | uint32 | 10 | Cell ID index; IDs start at 0. In the example, 10 represents the 10th cell in the dataset. |
x | int32 | 541 | The x coordinate of the cell’s center of mass. In the example, the x coordinate of the 10th cell’s center of mass is 541. |
y | int32 | 190 | The y coordinate of the cell’s center of mass. In the example, the y coordinate of the 10th cell’s center of mass is 190. |
offset | uint32 | 494 | The start row index of the cell in the "/cellBin/cellExp" dataset. In the example, the gene ID indices and MID counts of the 10th cell start from row 494 of the "/cellBin/cellExp" dataset. |
geneCount | uint16 | 100 | Gene count in the cell. In the example, 100 means that 100 rows in "/cellBin/cellExp", from row 494 to row 593, contain the gene ID indices and MID counts of the genes for the 10th cell in the "/cellBin/cell" dataset. |
expCount | uint16 | 500 | Cell MID count. |
dnbCount | uint16 | 200 | mRNA-captured DNBs of the cell. |
area | uint16 | 474 | Cell area in pixel. |
cellTypeID | uint32 | 0 | Cell type ID. |
clusterID | uint32 | 20 | Cell cluster ID. |
/cellBin/cellBorder:Dataset "cellBorder" is a 3D array which stores the lists of points for the bounding polygons of the cell. | |||
Dataset Attributes | DataType | Example | Description |
maxX | int32 | 16127 | Maximum x coordinate of the bounding box of the cell. |
maxY | int32 | 16663 | Maximum y coordinate of the bounding box of the cell. |
minX | int32 | 11129 | Minimum x coordinate of the bounding box of the cell. |
minY | int32 | 12784 | Minimum y coordinate of the bounding box of the cell. |
Dataset DataType:3D array | DataType | Example | Description |
32*(int16,int16) | [[-17,-11],[-15,-5]…[32767,32767]] | A list of 32 coordinates recording the differences between cell bounding points and the cell’s center of mass (0,0). The real coordinate of cell’s center of mass (x, y) can be obtained from "cell" dataset using cellID. | |
/cellBin/cellExp:Dataset "cellExp" is a 1D array which stores the expression information of each cell. | |||
Dataset Attributes | DataType | Example | Description |
maxCount | uint16 | 336 | Maximum MID count of a gene in a cell. |
Dataset DataType:compound | DataType | Example | Description |
geneID | uint32 | 1610 | Gene IDs of the genes detected in the cell. The ID is an index into the "gene" dataset. In the example, 1610 refers to the 1610th item in the "gene" dataset, where the name of the gene can be looked up. |
count | uint16 | 3 | MID count for the gene. In the example, assuming this is the 0th item in the "cellExp" dataset, the "offset" and "geneCount" records in the "cell" dataset show that it belongs to the cell with cellID=0, so the MID count for the gene (geneID=1610) in that cell is 3. |
[optional] /cellBin/cellExon:Dataset "cellExon" is a 1D array which stores the exon information for each cell. | |||
Dataset Attributes | DataType | Example | Description |
maxExon | uint16 | 5793 | Maximum exon count of a gene in all cells. |
minExon | uint16 | 0 | Minimum exon count of a gene in all cells. |
Dataset DataType:1D array | DataType | Example | Description |
uint16 | 16 | Exon count in a cell. The array index is the same as the cellID in the "cell" dataset. | |
[optional] /cellBin/cellExpExon:Dataset "cellExpExon" is a 1D array which stores exon expression information for each cell. | |||
Dataset Attributes | DataType | Example | Description |
maxExon | uint16 | 336 | Maximum exon count of a gene in a cell. |
Dataset DataType:1D array | DataType | Example | Description |
uint16 | 3 | Exon count (MID) for the gene. The index is the same as in the "cellExp" dataset. In the example, assuming this is the 0th item in the "cellExpExon" dataset, the "offset" and "geneCount" records in the "cell" dataset show that it belongs to the cell with cellID=0, so the exon count (MID) for the gene (geneID=1610) in that cell is 3. | |
/cellBin/cellTypeList:Dataset "cellTypeList" is a 1D array which stores cell types of each cell. | |||
Dataset DataType:1D array | DataType | Example | Description |
S32 | b'default' | Cell type, "default" stands for undefined cell type. | |
/cellBin/gene:Dataset "gene" is a 1D array which stores the indices of cell and expression information of each gene. | |||
Dataset Attributes | DataType | Example | Description |
maxCellCount | uint32 | 5718 | Maximum number of cells in which a gene is detected. |
maxExpCount | uint32 | 55361 | Maximum MID count of a gene. |
minCellCount | uint32 | 1 | Minimum number of cells in which a gene is detected. |
minExpCount | uint32 | 1 | Minimum MID count of a gene. |
Dataset DataType:compound | DataType | Example | Description |
geneID | S32 | b'ENSMUSG00000000001' | Gene ID. |
geneName | S32 | b'AC149090.1' | Gene name. |
offset | uint32 | 0 | The start row index of the gene in the "/cellBin/geneExp" dataset. In the example, 0 means that the cellIDs and MID counts of "AC149090.1" are recorded starting from the 0th item in the "/cellBin/geneExp" dataset. |
cellcount | uint32 | 60 | Number of cells in which the gene is detected. In the example, 60 means that items 0 through 59 record the information of gene "AC149090.1". |
expCount | uint32 | 100 | Sum of MID count for the gene. In the example, the total MID count of "AC149090.1" is 100. |
maxMIDcount | uint16 | 4 | Maximum MID count of a gene in a cell. In this case, the maximum MID count of gene "AC149090.1" in a cell is 4. |
/cellBin/geneExp:Dataset "geneExp" is a 1D array which stores cell and expression information of each gene. | |||
Dataset Attributes | DataType | Example | Description |
maxCount | uint16 | 10 | Maximum MID count of a gene. |
Dataset DataType:compound | DataType | Example | Description |
cellID | uint32 | 1247 | ID of a cell in which the gene is detected. The gene that a row belongs to is located through "offset" and "cellcount" in the "gene" dataset. In the example, assuming this is the 0th item in the "geneExp" dataset, 1247 shows that gene "AC149090.1" appears in the cell whose cellID is 1247. |
count | uint16 | 3 | The MID count of the gene in that cell. In the example, the MID count of gene "AC149090.1" in the cell (cellID=1247) is 3. |
[optional] /cellBin/geneExon:Dataset "geneExon" is a 1D array which stores the exon expression information of each gene. | |||
Dataset Attributes | DataType | Example | Description |
maxExon | uint32 | 55361 | Maximum exon count of a gene. |
minExon | uint32 | 0 | Minimum exon count of a gene. |
Dataset DataType:1D array | DataType | Example | Description |
uint32 | 97 | Total exon count of a gene. The index of the "geneExon" dataset is the same as that of the "gene" dataset. In the example, assuming this is the 0th item in the "geneExon" dataset and gene "AC149090.1" is the 0th item in the "gene" dataset, the exon count of gene "AC149090.1" is 97. | |
[optional] /cellBin/geneExpExon:Dataset "geneExpExon" is a 1D array which stores the exon expression information in cells of each gene. | |||
Dataset Attributes | DataType | Example | Description |
maxExon | uint16 | 336 | Maximum exon expression of a gene in a cell. |
Dataset DataType:1D array | DataType | Example | Description |
uint16 | 3 | Exon count of a gene in a cell. The index of the "geneExpExon" dataset is the same as that of the "geneExp" dataset. In the example, assuming this is the 0th item in the "geneExpExon" dataset, the "offset" and "cellcount" records in the "gene" dataset show that it belongs to gene "AC149090.1", so the exon count of gene "AC149090.1" in cell 1247 is 3. | |
/cellBin/blockIndex:Dataset "blockIndex" is a 1D array which stores the matrix block partition information. | |||
Dataset DataType:1D array | DataType | Example | Description |
uint32 | 0 | Cell count in each partition block: cnt = blockIndex[i+1] - blockIndex[i]. | |
/cellBin/blockSize:Dataset "blockSize" is a 1D array which stores the block size of partition. | |||
Dataset DataType:1D array | DataType | Example | Description |
uint32 | 256, 256, 104, 104 | 4-element array. The 4 items represent the block length in x-axis, block length in y-axis, block count in x-axis, and block count in y-axis, respectively. | |
[optional] /codedCellBlock:Group "codedCellBlock" stores pre-computed data for rendering in StereoMap. | |||
Group Attributes | DataType | Example | Description |
info | string | {"@type": "neuroglancer_annotations_v1", ...} | Metadata of encoded precomputed data in JSON. |
[optional] /codedCellBlock/L0/0_1:Dataset "0_1" is an example chunk encoded pre-computed data, including id, geometry, and so on. | |||
Dataset DataType:Bytes | DataType | Example | Description |
H5T_OPAQUE | 1F 8B 08 00 ... | Bytecode of the chunk. | |
[optional] /proteinList:Dataset "proteinList" is a 1D array which stores the protein panel information of the sample. | |||
Dataset DataType:compound | DataType | Example | Description |
PIDName | H5T_STRING | CD169 | Protein name in the protein panel |
GeneName | H5T_STRING | Siglec1 | Protein's marker gene |
GeneID | H5T_STRING | ENSMUSG00000027322 | Ensembl gene IDs |
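The cell-oriented datasets are linked by indices: a "cell" record points into "cellExp" through "offset" and "geneCount", each "cellExp" row points into "gene" through its geneID index, and "cellBorder" stores offsets relative to the cell's center of mass. The sketch below walks through those lookups on small hypothetical lists standing in for the HDF5 datasets. It assumes that a cell's ID equals its row index in "cell" and "cellBorder", that (32767, 32767) marks unused border slots (as the example border list suggests), and that "blockIndex" holds one more entry than there are blocks. Border lists are shortened from 32 points for readability.

```python
# Assumed padding marker for unused border points (see the cellBorder example).
PAD = (32767, 32767)

gene_names = ["GeneA", "GeneB", "GeneC"]  # geneName field of /cellBin/gene
cells = [
    # (id, x, y, offset, geneCount)
    (0, 541, 190, 0, 2),
    (1, 600, 250, 2, 1),
]
cell_exp = [
    # (geneID index into the gene dataset, MID count)
    (2, 3),
    (0, 1),
    (1, 5),
]
cell_border = [
    [(-17, -11), (-15, -5), (10, 12), PAD],
    [(-3, -4), (5, 6), PAD, PAD],
]
block_index = [0, 1, 2]


def expression_of_cell(cell_id):
    """Return {geneName: MID count} for one cell via offset and geneCount."""
    _id, _x, _y, offset, gene_count = cells[cell_id]
    rows = cell_exp[offset:offset + gene_count]
    return {gene_names[gene_idx]: count for gene_idx, count in rows}


def border_of_cell(cell_id):
    """Convert stored border offsets to absolute coordinates, dropping padding."""
    _id, cx, cy, _offset, _gene_count = cells[cell_id]
    return [(cx + dx, cy + dy)
            for dx, dy in cell_border[cell_id]
            if (dx, dy) != PAD]


def cells_per_block(index):
    """Cell count of block i, computed as index[i + 1] - index[i]."""
    return [index[i + 1] - index[i] for i in range(len(index) - 1)]


print(expression_of_cell(0))
print(border_of_cell(0))
print(cells_per_block(block_index))
```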
Gene Expression Matrix (GEM)
Gene expression matrix (GEM) is a text file that stores gene spatial expression data. SAW generates multiple gene expression matrix files in the workflow. The basic format has six columns, with a header row that shows the column names: gene ID, gene name, x coordinate, y coordinate, MID count, and exon count. A cellbin GEM has a seventh column for the cell ID. The expression matrix for the maximum area enclosing rectangle region additionally starts with several annotation rows, each beginning with "#", before the column header row. The header field names and field types are described in the table.
Fields | Data Type | Example | Description |
---|---|---|---|
#FileFormat | string | GEMv0.2 | Gene expression matrix file format version. |
#SortedBy | string | None | Gene expression matrix sorting strategy. Valid values: "geneID", "x", "y", "MIDCount", "None". |
#BinType | string | Bin | Bin type of the GEM file. |
#BinSize | string | 1 | (Please check 1.3 Terminologies and Concepts Bin) |
#Omics | string | Transcriptomics | Omics name. |
#Stereo-seqChip | string | SS200000135TL_D1 | Stereo-seq Chip T serial number. |
#OffsetX | uint32 | 1 | X coordinate of the origin before calibration. |
#OffsetY | uint32 | 1 | Y coordinate of the origin before calibration. |
geneID | string | ENSMUSG00000000001 | Gene ID |
geneName | string | Gnai3 | Gene name. |
x | uint32 | 16809 | X coordinate of the spot. |
y | uint32 | 8546 | Y coordinate of the spot. |
MIDCount | uint32 | 1 | Number of MIDs at (x, y) for the gene in the corresponding row. |
ExonCount | uint32 | 0 | [Optional] Exon count at (x, y) for the gene in the corresponding row. |
CellID | uint32 | 55892 | [Optional] CellID for (x, y). |
An example of bin GEM:
#FileFormat=GEMv0.2
#SortedBy=None
#BinType=Bin
#BinSize=1
#Omics=Transcriptomics
#Stereo-seqChip=B03523G1
#OffsetX=0
#OffsetY=0
geneID geneName x y MIDCount ExonCount
ENSMUSG00000000001 Gnai3 694 17229 1 1
ENSMUSG00000000001 Gnai3 1428 4994 1 1
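A file with this layout can be read in one pass: lines starting with "#" are "key=value" annotations, the first remaining line is the column header, and every later line is a data row. The following is a minimal sketch of such a reader, not SAW's own parser. The function name `read_gem` is hypothetical, and the code splits on any whitespace, which would not work if a gene name contained spaces.

```python
import io


def read_gem(handle):
    """Split a GEM text stream into (annotations dict, list of row dicts)."""
    annotations = {}
    columns = None
    rows = []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Annotation row such as "#BinSize=1".
            key, _, value = line[1:].partition("=")
            annotations[key] = value
        elif columns is None:
            # First non-annotation line holds the column names.
            columns = line.split()
        else:
            row = dict(zip(columns, line.split()))
            # Every column after geneID and geneName is an integer.
            for name in columns[2:]:
                row[name] = int(row[name])
            rows.append(row)
    return annotations, rows


sample = io.StringIO(
    "#FileFormat=GEMv0.2\n"
    "#SortedBy=None\n"
    "#BinType=Bin\n"
    "#BinSize=1\n"
    "#Omics=Transcriptomics\n"
    "#Stereo-seqChip=B03523G1\n"
    "#OffsetX=0\n"
    "#OffsetY=0\n"
    "geneID geneName x y MIDCount ExonCount\n"
    "ENSMUSG00000000001 Gnai3 694 17229 1 1\n"
    "ENSMUSG00000000001 Gnai3 1428 4994 1 1\n"
)
annotations, rows = read_gem(sample)
print(annotations["BinType"], len(rows), rows[0])
```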
An example of cellbin GEM:
#FileFormat=GEMv0.2
#SortedBy=None
#BinType=CellBin
#BinSize=Cell
#Omics=Transcriptomics
#Stereo-seqChip=B03523G1
#OffsetX=0
#OffsetY=0
geneID geneName x y MIDCount ExonCount CellID
ENSMUSG00000047454 Gphn 9325 19972 1 0 192276
ENSMUSG00000030616 Sytl2 9314 19976 1 1 192276
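Because each row of a cellbin GEM carries a CellID, per-cell totals can be obtained by grouping rows on that column. The sketch below sums MIDCount per cell for the two example rows above. The function name `mid_per_cell` is hypothetical, and columns are located by header name rather than position so that the optional columns do not matter.

```python
from collections import defaultdict

cellbin_text = (
    "#FileFormat=GEMv0.2\n"
    "#BinType=CellBin\n"
    "geneID geneName x y MIDCount ExonCount CellID\n"
    "ENSMUSG00000047454 Gphn 9325 19972 1 0 192276\n"
    "ENSMUSG00000030616 Sytl2 9314 19976 1 1 192276\n"
)


def mid_per_cell(text):
    """Sum MIDCount per CellID for cellbin GEM text."""
    lines = [line.split() for line in text.splitlines()
             if line and not line.startswith("#")]
    header, records = lines[0], lines[1:]
    mid_col = header.index("MIDCount")
    cell_col = header.index("CellID")
    totals = defaultdict(int)
    for fields in records:
        totals[int(fields[cell_col])] += int(fields[mid_col])
    return dict(totals)


totals = mid_per_cell(cellbin_text)
print(totals)
```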