Expression matrix format

*The interpretation of the Advanced part is only available in EN.

Gene Expression File (GEF)

Gene expression file (GEF) is an HDF5-based data management and storage format designed to support multidimensional datasets and high computational efficiency. The Stereo-seq analysis workflow generates bin GEF and cellbin GEF files. The bin GEF format is a hierarchically structured data model that stores one or more gene expression matrices, combined at various bin sizes. The cellbin GEF format stores expression information for each cell. Each GEF container organizes a collection of spatial gene expression matrices and includes two primary data objects: Group and Dataset. A dataset is a multidimensional array of data elements. A group is analogous to a file system directory: it organizes datasets and other groups in a hierarchy.

Bin GEF

The first level of a bin GEF includes four group objects: "geneExp" (required), "wholeExp" (optional), "wholeExpExon" (optional), and "stat" (optional). Group "geneExp" contains groups of gene spatial expression data in one or multiple bin sizes. Group "wholeExp" contains datasets that record the expression level and gene type count at each coordinate in one or multiple bin sizes. Group "wholeExpExon" contains datasets that record the exon expression level at each coordinate in one or multiple bin sizes. Group "stat" stores the gene names, total MID count, and spatial pattern enrichment score of each gene. The file attributes record the GEF format version, the software version, and omics information. The attributes of each dataset record the key metrics of that dataset. See the tables below for details.

Attributes
File Attributes	DataType	Example	Description
version	uint32	2	Gene expression file format version.
geftool_ver	uint32[3]	1,1,17	Geftool version. It can be used as an individual tool to manipulate GEF files.
omics	S32	b'Transcriptomics'	Omics name.
gef_area	float32	4.4410855E10	Tissue or labeled tissue area in square nanometers.
bin_type	S32	b'bin'	Bin type of the GEF file.
sn	S32	b'SS200000135TL_D1'	Stereo-seq chip SN
/geneExp/binN/expression:Dataset "expression" is a 1D array which stores coordinates and MID counts of each gene in the bin size of N, aggregated by gene name.
Dataset Attributes	DataType	Example (bin1)	Description
minX	int32	59820	Minimum x coordinate in bin N.
minY	int32	102086	Minimum y coordinate in bin N.
maxX	int32	73040	Maximum x coordinate in bin N.
maxY	int32	120539	Maximum y coordinate in bin N.
maxExp	uint32	28	Maximum MID count in a spot when the bin size is N. Data type for "maxExp" is dynamically changed for each sample.
resolution	uint32	500	Physical pitch (nm) between neighbor spots.
Dataset DataType:compound	DataType	Example (bin1)	Description
x	int32	71032	x coordinate in bin N.
y	int32	103180	y coordinate in bin N.
count	uint8/uint16/uint32	1	MID count at (x, y) when bin size is N. Data type for "count" is consistent with "maxExp" in the "Attributes."
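
The table says that the type of "count" follows "maxExp" but does not spell out the selection rule. One plausible reading, shown here purely as an assumption, is that the narrowest of the three listed unsigned types that can hold the maximum value is used. The helper name `smallest_unsigned_type` is hypothetical and is not part of any GEF tool.

```python
def smallest_unsigned_type(max_value: int) -> str:
    """Return the narrowest of uint8/uint16/uint32 that can hold max_value.

    Assumption: the writer picks the narrowest type; the format description
    only states that the type changes from sample to sample.
    """
    if max_value < 0:
        raise ValueError("MID counts cannot be negative")
    for name, bits in (("uint8", 8), ("uint16", 16), ("uint32", 32)):
        if max_value < 2 ** bits:
            return name
    raise ValueError("value does not fit in uint32")


# maxExp = 28 is the bin1 example value from the attribute table.
print(smallest_unsigned_type(28))
```

In practice a reader should take the type from the dataset itself rather than hard-coding one of the three.
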
[optional] /geneExp/binN/exon:Dataset "exon" is a 1D array which stores exon expression of each gene in the bin size of N, aggregated by gene name.
Dataset Attributes	DataType	Example (bin1)	Description
maxExon	int32	21	Max exon expression in binN.
Dataset DataType:1D array	DataType	Example (bin1)	Description
count	uint8/uint16/uint32	0	Exon expression in binN at coordinate (x, y). Rows share the same index as the "expression" dataset. Data type for "count" is dynamically changed for each sample.
/geneExp/binN/gene:Dataset "gene" is a 1D array which stores the gene names, the starting row indexes in dataset "expression", and row counts.
Dataset DataType:compound	DataType	Example (bin1)	Description
geneID	S64	b'ENSMUSG00000000001'	Gene ID.
geneName	S64	b'Gm16045'	Gene name.
offset	uint32	21	Starting row index in dataset "expression" for the gene. In this example, the expression data for gene "Gm16045" starts at row 21 of dataset "expression".
count	uint32	2	Row count. In this example, the expression data for gene "Gm16045" occupies rows 21 and 22 (2 rows) of dataset "expression".
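
The "offset" and "count" fields act as a slice into the "expression" dataset. The sketch below illustrates that lookup with plain Python lists standing in for the HDF5 compound arrays. The gene names and values are made up for illustration, and the helper names are hypothetical.

```python
# Toy stand-ins for the two datasets (real files store HDF5 compound arrays).
# Each "expression" row is (x, y, count).
expression = [
    (10, 20, 1),  # row 0, belongs to GeneA
    (11, 20, 3),  # row 1, belongs to GeneA
    (10, 21, 2),  # row 2, belongs to GeneB
]
# Each "gene" row is (geneName, offset, count).
gene = [
    ("GeneA", 0, 2),
    ("GeneB", 2, 1),
]


def rows_for_gene(name):
    """Return the expression rows that belong to one gene."""
    for gene_name, offset, count in gene:
        if gene_name == name:
            return expression[offset:offset + count]
    raise KeyError(name)


def total_mid(name):
    """Sum the MID counts over all spots of one gene."""
    return sum(count for _x, _y, count in rows_for_gene(name))


print(rows_for_gene("GeneA"), total_mid("GeneA"))
```
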
[optional] /wholeExp/binN:Dataset "binN" is a 2D array (matrix) which stores the MID count and gene type count at each spot.
Dataset Attributes	DataType	Example (bin1)	Description
number	uint64	22879557	Number of non-zero spots in the dense matrix.
minX	int32	59820	Minimum x coordinate in bin N.
lenX	int32	13221	Length of x.
minY	int32	102086	Minimum y coordinate in bin N.
lenY	int32	18454	Length of y.
maxMID	uint32	2155	Maximum MID count in a spot.
maxGene	uint32	846	Maximum gene type count in a spot.
resolution	uint32	500	Pitch (nm) between neighbor spots.
Dataset DataType: 2D array (X×Y), compound	DataType	Example (bin1)	Description
MIDcount	uint8/uint16/uint32	1	MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample.
genecount	uint16	1	Gene count in the spot. The spot coordinate can be identified from "Attributes" and the indexes of the 2D array.
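
The index-to-coordinate mapping described for "MIDcount" can be sketched as follows. It assumes that the first array index runs along x (the array is described as X×Y) and, for bin sizes larger than 1, that each index step covers one bin. Neither point is stated explicitly in the table, so treat both as assumptions. The function name is hypothetical.

```python
def spot_coordinate(ix, iy, min_x, min_y, bin_size=1):
    """Map a 2D array index (ix, iy) to a chip coordinate.

    Assumption: index 0 corresponds to (minX, minY) and each index step
    advances by bin_size along the matching axis.
    """
    return (min_x + ix * bin_size, min_y + iy * bin_size)


# Attribute values from the bin1 example table.
MIN_X, MIN_Y = 59820, 102086
LEN_X, LEN_Y = 13221, 18454

first = spot_coordinate(0, 0, MIN_X, MIN_Y)
last = spot_coordinate(LEN_X - 1, LEN_Y - 1, MIN_X, MIN_Y)
print(first, last)
```

With the example attributes, the last index maps to (73040, 120539), which agrees with the "maxX" and "maxY" values listed for the bin1 "expression" dataset.
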
[optional] /wholeExpExon/binN:Dataset "binN" in "/wholeExpExon/" Group is a 2D array (matrix) which stores the exon expression count at each spot.
Dataset Attributes	DataType	Example (bin1)	Description
maxExon	uint32	21	Maximum exon expression count in a spot when the bin size is N.
Dataset DataType: 2D array	DataType	Example (bin1)	Description
MIDcount	uint8/uint16/uint32	0	MID count in the spot. The spot coordinate can be identified from the row and column index of the 2D matrix plus the "minX" and "minY" specified in the attributes. Data type for "MIDcount" is dynamically changed for each sample.
[optional] /stat/gene:Dataset "gene" is a 1D array which stores the MID count and spatial pattern enrichment score (E10) of each gene. The array is ordered by MID count in descending order.
Dataset Attributes	DataType	Example	Description
maxE10	float32	65.53	Maximum E10 score.
minE10	float32	0.	Minimum E10 score.
cutoff	float32	0.1	Threshold for filtering the spots used to compute E10. In this example, 0.1 means that the spots whose MID count is in the top 10% are used for calculating the spatial enrichment score.
Dataset DataType:compound	DataType	Example	Description
geneID	S64	b'ENSMUSG00000000001'	Gene ID.
geneName	S64	b'Ptgds'	Gene name.
MIDcount	uint32	229502	MID count for the gene.
E10	float32	65.53	The spatial pattern enrichment score (E10) for the gene.
[optional] /proteinList:Dataset "proteinList" is a 1D array which stores the protein panel information of the sample.
Dataset DataType:compound	DataType	Example	Description
PIDName	H5T_STRING	CD169	Protein name in the protein panel
GeneName	H5T_STRING	Siglec1	Protein's marker gene
GeneID	H5T_STRING	ENSMUSG00000027322	Ensembl gene ID.

Cell Bin GEF

The first layer of a Cell Bin GEF contains one required group, "cellBin", and multiple optional datasets. The second layer, "codedCellBlock", is optional and stores precomputed data used for rendering in StereoMap. The file attributes record the GEF format version, the software version, and omics information. The attributes of each dataset record the key metrics of that dataset. See the tables below for details.


Attributes
File Attributes	DataType	Example	Description
geftool_ver	uint32[3]	0,7,11	geftool version. It can be used as an individual tool to manipulate GEF files.
offsetX	int32	0	Minimum x coordinate in bin 1.
offsetY	int32	0	Minimum y coordinate in bin 1.
omics	S32	b'Transcriptomics'	Omics name.
resolution	uint32	500	Pitch (nm) between neighbor spots.
version	uint32	2	Gene expression file format version.
bin_type	S32	CellBin	Bin type of the GEF file.
sn	S32	b'SS200000135TL_D1'	Stereo-seq chip SN
/cellBin/cell:Dataset "cell" is a 1D array which stores basic information and indices information of cells and expression.
Dataset Attributes	DataType	Example	Description
averageArea	float32	494.666	Average area for cells in pixel.
averageDnbCount	float32	194.299	Average number of mRNA-captured DNBs in a cell.
averageExpCount	float32	541.715	Average MID count in cell.
averageGeneCount	float32	310.157	Average gene count in cell.
maxArea	uint16	1925	Maximum area for cells in pixel.
maxDnbCount	uint16	883	Maximum number of mRNA-captured DNBs in a cell.
maxExpCount	uint16	3018	Maximum MID count in cell.
maxGeneCount	uint16	1415	Maximum gene count in cell.
maxX	int32	17658	Maximum x coordinate of the cell’s center of mass.
maxY	int32	19422	Maximum y coordinate of the cell’s center of mass.
medianArea	float32	474.	Median area for cells in pixel.
medianDnbCount	float32	183.	Median number of mRNA-captured DNBs in a cell.
medianExpCount	float32	491.	Median MID count in cell.
medianGeneCount	float32	289.	Median gene count in cell.
minArea	uint16	2	Minimum area for cells in pixel.
minDnbCount	uint16	0	Minimum number of mRNA-captured DNBs in a cell.
minExpCount	uint16	0	Minimum MID count in cell.
minGeneCount	uint16	0	Minimum gene count in cell.
minX	int32	2933	Minimum x coordinate of the cell’s center of mass.
minY	int32	5568	Minimum y coordinate of the cell’s center of mass.
Dataset DataType:compound	DataType	Example	Description
id	uint32	10	Cell ID index, starting from 0. In the example, 10 represents the 10th cell in the dataset.
x	int32	541	The x coordinate of the cell’s center of mass. In the example, the x coordinate of the 10th cell’s center of mass is 541.
y	int32	190	The y coordinate of the cell’s center of mass. In the example, the y coordinate of the 10th cell’s center of mass is 190.
offset	uint32	494	Starting row index of the cell in the "/cellBin/cellExp" dataset. In the example, the gene ID indices and MID counts of the 10th cell start at row 494 of the "/cellBin/cellExp" dataset.
geneCount	uint16	100	Gene count in the cell. In the example, 100 means that 100 rows of "/cellBin/cellExp", from row 494 to row 593, contain the gene ID indices and MID counts for the 10th cell in the "/cellBin/cell" dataset.
expCount	uint16	500	Cell MID count.
dnbCount	uint16	200	mRNA-captured DNBs of the cell.
area	uint16	474	Cell area in pixel.
cellTypeID	uint32	0	Cell type ID.
clusterID	uint32	20	Cell cluster ID.
/cellBin/cellBorder:Dataset "cellBorder" is a 3D array which stores the lists of points for the bounding polygons of the cell.
Dataset Attributes	DataType	Example	Description
maxX	int32	16127	Maximum x coordinate of the bounding box of the cell.
maxY	int32	16663	Maximum y coordinate of the bounding box of the cell.
minX	int32	11129	Minimum x coordinate of the bounding box of the cell.
minY	int32	12784	Minimum y coordinate of the bounding box of the cell.
Dataset DataType:3D array	DataType	Example	Description
	32*(int16,int16)	[[-17,-11],[-15,-5]…[32767,32767]]	A list of 32 coordinates recording the differences between cell bounding points and the cell’s center of mass (0,0). The real coordinate of cell’s center of mass (x, y) can be obtained from "cell" dataset using cellID.
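
Because the border points are stored relative to the center of mass, absolute polygon coordinates are obtained by adding the cell's (x, y) from the "cell" dataset. The sketch below assumes that (32767, 32767), which closes the example list, is padding for unused slots among the 32 entries. The table does not say so explicitly, so that is an assumption, and the function name is hypothetical.

```python
# Assumed sentinel marking unused slots in the 32-point list.
PADDING = (32767, 32767)


def absolute_border(center, relative_points):
    """Shift relative border points by the cell center, skipping padding."""
    cx, cy = center
    return [
        (cx + dx, cy + dy)
        for dx, dy in relative_points
        if (dx, dy) != PADDING
    ]


# Center taken from the "cell" dataset example (x=541, y=190);
# offsets taken from the first entries of the "cellBorder" example.
center = (541, 190)
relative = [(-17, -11), (-15, -5), (32767, 32767)]
polygon = absolute_border(center, relative)
print(polygon)
```
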
/cellBin/cellExp:Dataset "cellExp" is a 1D array which stores the expression information of each cell.
Dataset Attributes	DataType	Example	Description
maxCount	uint16	336	Maximum MID count of a gene in a cell.
Dataset DataType:compound	DataType	Example	Description
geneID	uint32	1610	Index of a gene detected in the cell, pointing into the "gene" dataset. In the example, 1610 represents the 1610th item in the "gene" dataset, where the gene name can be looked up.
count	uint16	3	MID count for the gene. In the example, assume this is the 0th item in the "cellExp" dataset. The "offset" and "geneCount" records in the "cell" dataset show that the 0th item belongs to the cell with cellID=0, so the MID count of the gene (geneID=1610) in that cell is 3.
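
The cell-to-gene lookup described above chains three datasets: "cell" gives a row range, "cellExp" gives gene indices and counts within that range, and "gene" resolves an index to a name. A minimal sketch with toy lists follows. The values are invented, the row position in "cell" is assumed to equal the cell ID, and the helper name is hypothetical.

```python
# Toy stand-ins for the HDF5 datasets.
# "cell" rows: (id, offset, geneCount); row position is assumed to equal id.
cell = [
    (0, 0, 2),
    (1, 2, 1),
]
# "cellExp" rows: (geneID index, count).
cell_exp = [
    (1, 3),  # cell 0
    (2, 1),  # cell 0
    (0, 5),  # cell 1
]
# "gene" rows reduced to the gene name only.
gene_names = ["GeneA", "GeneB", "GeneC"]


def expression_of_cell(cell_id):
    """Return {gene name: MID count} for one cell."""
    _id, offset, gene_count = cell[cell_id]
    rows = cell_exp[offset:offset + gene_count]
    return {gene_names[gene_index]: count for gene_index, count in rows}


print(expression_of_cell(0))
```
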
[optional] /cellBin/cellExon:Dataset "cellExon" is a 1D array which stores the exon information for each cell.
Dataset Attributes	DataType	Example	Description
maxExon	uint16	5793	Maximum exon count of a gene in all cells.
minExon	uint16	0	Minimum exon count of a gene in all cells.
Dataset DataType:1D array	DataType	Example	Description
	uint16	16	Exon count in a cell. The array index matches the cellID in the "cell" dataset.
[optional] /cellBin/cellExpExon:Dataset "cellExpExon" is a 1D array which stores exon expression information for each cell.
Dataset Attributes	DataType	Example	Description
maxExon	uint16	336	Maximum exon count of a gene in a cell.
Dataset DataType:1D array	DataType	Example	Description
	uint16	3	Exon count (MID) for the gene. Rows share the same index as the "cellExp" dataset. In the example, assume this is the 0th item in the "cellExpExon" dataset. Because the index matches "cellExp", the "offset" and "geneCount" records in the "cell" dataset show that the 0th item belongs to the cell with cellID=0, so the exon count (MID) of the gene (geneID=1610) in that cell is 3.
/cellBin/cellTypeList:Dataset "cellTypeList" is a 1D array which stores cell types of each cell.
Dataset DataType:1D array	DataType	Example	Description
	S32	b'default'	Cell type, "default" stands for undefined cell type.
/cellBin/gene:Dataset "gene" is a 1D array which stores the indices of cell and expression information of each gene.
Dataset Attributes	DataType	Example	Description
maxCellCount	uint32	5718	Maximum number of cells in which a gene is detected.
maxExpCount	uint32	55361	Maximum MID count of a gene.
minCellCount	uint32	1	Minimum number of cells in which a gene is detected.
minExpCount	uint32	1	Minimum MID count of a gene.
Dataset DataType:compound	DataType	Example	Description
geneID	S32	b'ENSMUSG00000000001'	Gene ID.
geneName	S32	b'AC149090.1'	Gene name.
offset	uint32	0	Starting row index of the gene in the "/cellBin/geneExp" dataset. In the example, 0 means that the cellIDs and MID counts of "AC149090.1" start at the 0th item of the "/cellBin/geneExp" dataset.
cellcount	uint32	60	Number of cells in which the gene is detected. In the example, 60 means that items 0 through 59 record the information for gene "AC149090.1".
expCount	uint32	100	Sum of MID counts for the gene. In the example, the total MID count of "AC149090.1" is 100.
maxMIDcount	uint16	4	Maximum MID count of a gene in a cell. In this example, the maximum MID count of gene "AC149090.1" in a cell is 4.
/cellBin/geneExp:Dataset "geneExp" is a 1D array which stores cell and expression information of each gene.
Dataset Attributes	DataType	Example	Description
maxCount	uint16	10	Maximum MID count of a gene.
Dataset DataType:compound	DataType	Example	Description
cellID	uint32	1247	ID of a cell that contains the gene. The gene is the one whose "offset" range in the "gene" dataset covers this row. In the example, assuming this is the 0th item in the "geneExp" dataset, 1247 shows that gene "AC149090.1" appears in the cell whose cellID is 1247.
count	uint16	3	MID count of the gene in that cell. In the example, the MID count of gene "AC149090.1" in the cell (cellID=1247) is 3.
[optional] /cellBin/geneExon:Dataset "geneExon" is a 1D array which stores the exon expression information of each gene.
Dataset Attributes	DataType	Example	Description
maxExon	uint32	55361	Maximum exon count of a gene.
minExon	uint32	0	Minimum exon count of a gene.
Dataset DataType:1D array	DataType	Example	Description
	uint32	97	Total exon count of a gene. The index of the "geneExon" dataset matches that of the "gene" dataset. In the example, assuming this is the 0th item in the "geneExon" dataset and gene "AC149090.1" is the 0th item in the "gene" dataset, the exon count of gene "AC149090.1" is 97.
[optional] /cellBin/geneExpExon:Dataset "geneExpExon" is a 1D array which stores the exon expression information in cells of each gene.
Dataset Attributes	DataType	Example	Description
maxExon	uint16	336	Maximum exon expression of a gene in a cell.
Dataset DataType:1D array	DataType	Example	Description
	uint16	3	Exon count of a gene in a cell. The index of the "geneExpExon" dataset matches that of the "geneExp" dataset. In the example, assume this is the 0th item in the "geneExpExon" dataset. Because the index matches "geneExp", the "offset" and "cellCount" records in the "gene" dataset show that the 0th item belongs to gene "AC149090.1", so the exon count of gene "AC149090.1" in cell 1247 is 3.
/cellBin/blockIndex:Dataset "blockIndex" is a 1D array which stores the matrix block partition information.
Dataset DataType:1D array	DataType	Example	Description
	uint32	0	Cell-count index for the partition blocks. The cell count of block i is cnt = blockIndex[i+1] - blockIndex[i].
/cellBin/blockSize:Dataset "blockSize" is a 1D array which stores the block size of partition.
Dataset DataType:1D array	DataType	Example	Description
	uint32	256, 256, 104, 104	4-element array. The 4 items represent the block length in x-axis, block length in y-axis, block count in x-axis, and block count in y-axis, respectively.
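
The relation cnt = blockIndex[i+1] - blockIndex[i] and the 4-element block size array can be illustrated with a short sketch. The index values below are invented; only the block size tuple (256, 256, 104, 104) comes from the example column. The helper name is hypothetical.

```python
def cells_per_block(block_index):
    """Derive per-block cell counts from consecutive index entries."""
    return [
        block_index[i + 1] - block_index[i]
        for i in range(len(block_index) - 1)
    ]


# Hypothetical index values covering three blocks.
block_index = [0, 3, 3, 7]
counts = cells_per_block(block_index)

# Example value from the table: block length in x, block length in y,
# block count in x, block count in y.
block_len_x, block_len_y, blocks_x, blocks_y = 256, 256, 104, 104
total_blocks = blocks_x * blocks_y

print(counts, total_blocks)
```
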
[optional] /codedCellBlock:Group "codedCellBlock" stores pre-computed data for rendering in StereoMap.
Group Attributes	DataType	Example	Description
info	string	{"@type": "neuroglancer_annotations_v1", ...}	Metadata of encoded precomputed data in JSON.
[optional] /codedCellBlock/L0/0_1:Dataset "0_1" is an example chunk encoded pre-computed data, including id, geometry, and so on.
Dataset DataType:Bytes	DataType	Example	Description
	H5T_OPAQUE	1F 8B 08 00 ...	Bytecode of the chunk.
[optional] /proteinList:Dataset "proteinList" is a 1D array which stores the protein panel information of the sample
Dataset DataType:compound	DataType	Example	Description
PIDName	H5T_STRING	CD169	Protein name in the protein panel
GeneName	H5T_STRING	Siglec1	Protein's marker gene
GeneID	H5T_STRING	ENSMUSG00000027322	Ensembl gene IDs

Gene Expression Matrix (GEM)

Gene expression matrix (GEM) is a text file that stores gene spatial expression data. SAW generates multiple GEM files in the workflow. The basic format has six columns with a header row giving the column names: gene ID, gene name, x coordinate, y coordinate, MID count, and exon count. A cellbin GEM adds a seventh column for the cell ID. In the expression matrix for the maximum enclosing rectangle region, several annotation rows starting with "#" precede the column-name row. The header field names and field types are described in the table.

Fields	Data Type	Example	Description
#FileFormat	string	GEMv0.2	Gene expression matrix file format version.
#SortedBy	string	None	Gene expression matrix sorting strategy. Valid values: "geneID", "x", "y", "MIDCount", "None".
#BinType	string	Bin	Bin type of the GEM file.
#BinSize	string	1	Bin size. See "Bin" in 1.3 Terminologies and Concepts.
#Omics	string	Transcriptomics	Omics name.
#Stereo-seqChip	string	SS200000135TL_D1	Stereo-seq Chip T serial number.
#OffsetX	uint32	1	X coordinate of the origin before calibration.
#OffsetY	uint32	1	Y coordinate of the origin before calibration.
geneID	string	ENSMUSG00000000001	Gene ID
geneName	string	Gnai3	Gene name.
x	uint32	16809	X coordinate of the spot.
y	uint32	8546	Y coordinate of the spot.
MIDCount	uint32	1	Number of MIDs at (x, y) for the gene in the corresponding row.
ExonCount	uint32	0	[Optional] Exon count at (x, y) for the gene in the corresponding row.
CellID	uint32	55892	[Optional] CellID for (x, y).

An example of bin GEM:

#FileFormat=GEMv0.2
#SortedBy=None
#BinType=Bin
#BinSize=1
#Omics=Transcriptomics
#Stereo-seqChip=B03523G1
#OffsetX=0
#OffsetY=0
geneID  geneName        x       y       MIDCount        ExonCount
ENSMUSG00000000001      Gnai3   694     17229   1       1
ENSMUSG00000000001      Gnai3   1428    4994    1       1

An example of cellbin GEM:

#FileFormat=GEMv0.2
#SortedBy=None
#BinType=CellBin
#BinSize=Cell
#Omics=Transcriptomics
#Stereo-seqChip=B03523G1
#OffsetX=0
#OffsetY=0
geneID  geneName        x       y       MIDCount        ExonCount       CellID
ENSMUSG00000047454      Gphn    9325    19972   1       0       192276
ENSMUSG00000030616      Sytl2   9314    19976   1       1       192276
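
The layout shown in the two examples (annotation rows of the form "#Key=Value", one column-name row, then data rows) can be read with a few lines of standard-library Python. This is a minimal sketch, not an official parser: it splits on any whitespace, which assumes that no field contains spaces, and the function name `parse_gem` is hypothetical.

```python
import io

# Columns that hold integers according to the field table.
INT_FIELDS = {"x", "y", "MIDCount", "ExonCount", "CellID"}


def parse_gem(text):
    """Split GEM text into (annotations, records)."""
    annotations, columns, records = {}, None, []
    for raw in io.StringIO(text):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("#"):
            # Annotation row, e.g. "#BinType=CellBin".
            key, _, value = line[1:].partition("=")
            annotations[key] = value
        elif columns is None:
            # First non-annotation row carries the column names.
            columns = line.split()
        else:
            row = dict(zip(columns, line.split()))
            records.append(
                {k: int(v) if k in INT_FIELDS else v for k, v in row.items()}
            )
    return annotations, records


SAMPLE = """\
#FileFormat=GEMv0.2
#BinType=CellBin
#BinSize=Cell
geneID  geneName  x  y  MIDCount  ExonCount  CellID
ENSMUSG00000047454  Gphn  9325  19972  1  0  192276
ENSMUSG00000030616  Sytl2  9314  19976  1  1  192276
"""

annotations, records = parse_gem(SAMPLE)
print(annotations["BinType"], len(records), records[0]["geneName"])
```

Because the column names are taken from the file, the same function handles both the six-column bin GEM and the seven-column cellbin GEM.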
