WJie12

Blog from Jie Wang



NAR Database update 2020

Published 2020-07-25 | Category: other

DNAproDB

DNA–protein complex structure analysis
Paper
Document

  • Update
    • Structure processing upgrades
    • Data update
    • Database and web interface
      • Search
      • Report
      • Integration with Nucleic Acid Database
  • Discussion and conclusion
    • case study

EnhancerAtlas

Enhancers in nine species
Paper
Document

  • Materials and methods
    • Data sources
    • Data processing for individual dataset/track
    • Generation of consensus track
    • Enhancer-gene interactions
    • Implementation of database
  • Results
    • statistics
    • Database search
    • Enhancer browser

GWAS Central

GWAS datasets
Paper
Document

  • Materials and methods
    • Data collection and curation
    • Ontology mapping
    • Database design
  • Results
    • GWAS data searching
    • Data visualization
  • Discussion

MatrisomeDB

Extracellular matrix components
Paper
Document

  • Database infrastructure
    • Data sets included
    • Mass spectrometry data search and processing
  • Database features and functionalities
    • MatrisomeDB query
    • Hierarchically-clustered protein distribution heatmap
    • Peptide coverage map
  • Results
    • Experimental coverage of the in-silico predicted matrisome
    • Building an ECM atlas
  • Future directions
    • Future data extension
    • Extension to include studies from other model organisms
    • Extracting quantitative data from label-free and label-based proteomics studies

MIBiG

Biosynthetic Gene Clusters of Known Function
Paper
Document: No

  • Methods and implementation
    • Manual curation of entries
    • Data quality improvements
    • The new database architecture
  • Results and discussion
    • Data overview
      • BGC diversity
      • Annotation completeness
    • A new online repository

MirGeneDB

Animal miRNA complements
Paper
Document

  • Expansion of MirGeneDB
  • Quality of MirGeneDB annotations
  • Improved web interface of MirGeneDB
  • MiRNA nomenclature
  • Functional classification of MiRNA-seeds
  • Future developments

MSDB

Microsatellites from all sequenced genomes
Paper
Document

  • Materials and methods
    • Data sources
    • Identification of repeats
    • Database design
    • Web interface
  • Database overview and functionality
    • Dashboard view
    • Species selection
    • Modal view
    • Table view
    • Data download
    • Help page
  • Discussion
    • Case study 1: length preference of AGAT repeats
    • Case study 2: enrichment of AGC repeats in ruminants

Ohnologs

Vertebrate ohnologs
Paper
Document

  • Results
    • Data collection and processing
    • Navigating the OHNOLOGS database
    • Summary of the contents of the OHNOLOGS database

PolyASite

RNA polyadenylation sites
Paper
Document

  • Materials and methods
    • Protocol-specific pre-processing of reads
    • Uniform analysis of putative poly(A) sites in individual samples
    • Clustering of closely spaced sites
    • Annotation and quantification of poly(A) sites
  • Results
    • Website roadmap

WALTZ-DB

Amyloidogenic peptide sequences
Paper
Document

  • Expanded content and feature improvements
    • New peptide entries and database statistics
    • WALTZ-DB 2.0 novel features
  • Materials and methods
    • Peptide synthesis
    • Determination of amyloid fibril properties
    • Structural models and energy calculations
  • Links to other databases

ConvexOptimization_Introduction

Published 2020-03-09 | Category: ConvexOptimization

Optimization problem

Three basic elements

  • variables: $\boldsymbol{x}=\left(x_{1}, \dots, x_{n}\right)$
  • constraints: $f_{i}: \mathbb{R}^{n} \rightarrow \mathbb{R}, i = 1,…,m$
  • objective: $f_{0}: \mathbb{R}^{n} \rightarrow \mathbb{R}$

    Goal

    Find an optimal solution $\boldsymbol{x}^{*}$ minimizing $f_{0}$ while satisfying the constraints.
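
    Written out, this is the standard form of an optimization problem (a sketch of the usual notation, assuming the constraints are expressed as inequalities):

    $$\begin{array}{ll}\underset{\boldsymbol{x} \in \mathbb{R}^{n}}{\operatorname{minimize}} & f_{0}(\boldsymbol{x}) \\ \text{subject to} & f_{i}(\boldsymbol{x}) \leq 0, \quad i=1, \ldots, m\end{array}$$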

Data science models

Linear

  • minimize $f(\boldsymbol{x})$ subject to $\boldsymbol{Ax=z}$
  • hope $\hat{\boldsymbol{x}}=\boldsymbol{x}^{\natural}$

    Bilinear

  • find $\boldsymbol{x}, \boldsymbol{h}$ subject to $z_{i}=\boldsymbol{b}_{i}^{*} \boldsymbol{h} \boldsymbol{x}^{*} \boldsymbol{a}_{i}, 1 \leq i \leq m$

    Quadratic

  • find $\boldsymbol{z}$ subject to $y_{r}=\left|\left\langle\boldsymbol{a}_{r}, \boldsymbol{z}\right\rangle\right|^{2}, \quad r=1,2, \ldots, m$

    Low-rank

  • $\underset{M \in \mathbb{C}^{m \times n}}{\operatorname{minimize}} \quad \operatorname{rank}(\boldsymbol{M}) \quad$ subject to $\quad Y_{i, k}=M_{i, k}, \quad(i, k) \in \Omega$

    Deep models

  • $\operatorname{minimize} \quad f(\boldsymbol{\theta}):=\frac{1}{n} \sum_{i=1}^{n} \ell\left(h\left(\boldsymbol{x}_{i}, \boldsymbol{\theta}\right), y_{i}\right)$ subject to $\mathcal{R}(\boldsymbol{\theta}) \leq \tau$

Large-scale optimization: classes of optimization problems

Constrained vs. unconstrained

  • Unconstrained optimization:
    • every point is feasible
    • focus is on minimizing $f(\boldsymbol{x})$
  • constrained:
    • find constraint set $C$
    • minimizing $f(\boldsymbol{x}), \boldsymbol{x} \in C$
    • may be difficult to find point $\boldsymbol{x} \in C$

      Convex vs. nonconvex

  • convex optimization:
    • all local optima are global optima
    • can be solved in polynomial-time

      Deterministic vs. stochastic

  • stochastic gradient

    Solvability vs. scalability

  • Polynomial-time algorithms might be useless in large-scale applications (iteration -)
    • example: Newton’s method
    • a single iteration may last forever (time +)
    • prohibitive storage requirement (space +)
  • Iteration complexity vs. per-iteration cost
    • computational cost = iteration complexity (#iterations) $\times$ cost per iteration
  • Large-scale problems call for methods with ==cheap iterations==
    • First-order methods: exploit only information on function values and (sub)gradients, without Hessian information (see the gradient step sketched after this list)
      • cheap iterations
      • low memory requirements
    • Randomized and approximation methods: project data into subspace, and ==solve reduced dimension problem==
      • optimization for high-dimensional data analysis: polynomial-time algorithms often not fast enough: ==further approximations== are essential
    • Advanced large-scale optimization
      • randomized methods; nonconvex optimization on manifold; learning to optimize; deep reinforcement learning
      • goal: scalable, real-time, parallel, distributed, automatic, etc.
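
As a concrete example of a first-order method with cheap iterations, the plain gradient descent update (standard textbook form, with step size $\eta_{k}$) needs only one gradient evaluation per iteration:

$$\boldsymbol{x}_{k+1}=\boldsymbol{x}_{k}-\eta_{k} \nabla f\left(\boldsymbol{x}_{k}\right)$$

For a constrained problem, each iterate can additionally be projected back onto the constraint set $C$ (projected gradient descent).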

High-dimensional statistics

Convex geometry

Local geometry

Global geometry

Topics and grading

Theoretical foundations

convex sets, convex functions, convex problems, Lagrange duality and KKT conditions, disciplined convex programming

first-order methods

gradient methods, subgradient methods, proximal methods

second-order methods

Newton methods, interior-point methods, quasi-Newton methods

stochastic methods

stochastic gradient methods, stochastic Newton methods, randomized sketching methods, randomized linear algebra

machine learning approaches

applications

Deep Reinforcement Learning in Portfolio Management

Published 2020-01-19 | Category: Other

Deep Generative Model for Auto-annotation in Single-cell Analysis

Published 2020-01-19 | Category: Bioinformatics



Notes for small RNA sequencing analysis

Published 2019-12-08 | Category: Bioinformatics

Introduction

This pipeline is developed for annotating and profiling small RNAs with UMIs attached to both ends. The raw data are reads in FASTQ format. After preprocessing and mapping to the reference libraries, the reads are counted. In particular, reads that align to the genome but not to the RNA sequence databases are filtered out and written to a CSV file. The following figure presents the general workflow of the pipeline.

workflow

Methods

Preprocessing

The preprocessing phase includes three parts: adapter removal, quality control, and UMI extraction. First, adapters at the 3' and 5' ends are removed by Cutadapt, and only reads between 27 and 55 nt long are retained. Then, UMIs at both the 3' and 5' ends are extracted from the reads and appended to the read names with UMI-tools; reads with poor quality are dropped by UMI-tools at the same step. An example invocation is sketched after the parameter table below.

Software and parameters used in this phase:

| Software | Parameter | Effect |
| :--- | :--- | :--- |
| Cutadapt | -j 40 | use 40 CPU cores |
| | -a AGATCGGAAGAGCACACGTC | remove the 3' end adapter AGATCGGAAGAGCACACGTC |
| | -g GTTCAGAGTTCTACAGTCCGACGATC | remove the 5' end adapter GTTCAGAGTTCTACAGTCCGACGATC |
| | --overlap=3 | require at least three matching bases between adapter and read |
| | --error-rate=0.1 | error (mismatch) tolerance when searching for adapter sequences |
| | --times=1 | trim no more than one adapter from each read |
| | --minimum-length=27 --maximum-length=55 | drop reads shorter than 27 nt or longer than 55 nt |
| UMI-tools extract | --bc-pattern=NNNNNNNNN --3prime | extract a 9 nt UMI from the 3' end |
| | --bc-pattern=NNNNNN | extract a 6 nt UMI from the 5' end |
| | --quality-filter-threshold 20 | filter out low-quality reads |
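
Putting the parameters in the table together, a minimal sketch of the preprocessing commands might look like the following (the FASTQ file names are placeholders, the two UMI extractions are shown as two separate umi_tools runs, and the phred33 quality encoding is an assumption):

# Adapter trimming with the Cutadapt parameters listed above
cutadapt -j 40 -a AGATCGGAAGAGCACACGTC -g GTTCAGAGTTCTACAGTCCGACGATC \
  --overlap=3 --error-rate=0.1 --times=1 \
  --minimum-length=27 --maximum-length=55 \
  -o trimmed.fastq raw.fastq

# Extract the 9 nt UMI from the 3' end, then the 6 nt UMI from the 5' end;
# low-quality reads are dropped at this step (--quality-encoding=phred33 is assumed)
umi_tools extract --stdin=trimmed.fastq --stdout=umi3.fastq --bc-pattern=NNNNNNNNN --3prime \
  --quality-filter-threshold 20 --quality-encoding=phred33
umi_tools extract --stdin=umi3.fastq --stdout=preprocessed.fastq --bc-pattern=NNNNNN \
  --quality-filter-threshold 20 --quality-encoding=phred33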

Mapping

The preprocessed reads are still in FASTQ format. They are aligned to the genome and to the RNA databases with Bowtie. The following table lists the reference libraries used for alignment. After alignment, duplicate reads (reads with the same alignment position and UMI) are removed by umi_tools, and the annotated reads are obtained in SAM format. Example commands are sketched after the parameter table below.

| Genome | RNA databases (in mapping order) |
| :--- | :--- |
| hg19 | mature tRNA, tRNA, rRNA, miRNA, piRNA, Rfam |

Software and parameters used in this phase:

| Software | Parameter | Effect |
| :--- | :--- | :--- |
| Bowtie | -q | input format: FASTQ |
| | --threads 40 | run with 40 threads |
| | -v 1 | allow 1 mismatch |
| | -m 1 | each read can only be aligned once |
| | -a | keep all mapping results |
| | --un | write all reads that could not be aligned to a file |
| UMI-tools dedup | --method=unique | do not consider sequencing errors in UMIs |
| | --read-length | duplicate reads must have the same length |
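
For illustration, the alignment and deduplication for one reference might be run roughly as follows (the Bowtie index name and file names are placeholders; the -S flag for SAM output and the samtools sort/index step are assumptions, since umi_tools dedup works on a coordinate-sorted, indexed BAM):

# Align to the genome with the Bowtie parameters from the table; --un collects the unaligned reads
bowtie -q --threads 40 -v 1 -m 1 -a -S --un unaligned.fastq hg19_index preprocessed.fastq aligned.sam

# Sort and index before deduplication
samtools sort -o aligned.sorted.bam aligned.sam
samtools index aligned.sorted.bam

# Drop duplicates: reads with the same alignment position, UMI, and read length
umi_tools dedup --method=unique --read-length --stdin=aligned.sorted.bam --stdout=aligned.dedup.bam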

Analysis

With the SAM files from the mapping phase, we can count the reads. Reads with the same annotation are counted as the same RNA. Pie charts showing the percentages of the different RNA types are plotted from these counts. Besides counting the reads that map to the RNA databases, we also extract the sequences that align to the genome but not to the RNA databases. The results are saved in y_rnadb_y_genome.csv and n_rnadb_y_genome.csv. Counting and visualization are done with Python scripts; a minimal command-line sketch of the counting step is shown below.
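
As a simple illustration (not the pipeline's Python script), per-reference read counts in a SAM file can be tallied with standard command-line tools; annotated.sam and rna_counts.txt are placeholder names:

# SAM column 3 is the reference name; skip header lines ('@') and unmapped reads ('*')
grep -v '^@' annotated.sam | cut -f3 | grep -v '^\*$' | sort | uniq -c | sort -rn > rna_counts.txt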

Implementation

The preprocessing phase and mapping phase are implemented by Shell scripts, and the analysis phase is implemented by Python scripts. The source code and more details of these scripts can be found in SmallRNASeq.

Full analysis results of samples can be found in the attachment.

Results

The following pie graphs show the mapping results of different samples.

Sample 1:

s1_y_rnadb_y_genome_result

Sample 3:

s3_y_rnadb_y_genome_result

Sample 5:

s5_y_rnadb_y_genome_result

The head of the n_rnadb_y_genome.csv for sample 1:

n_rnadb_y_genome

Monocle Pseudotime Analysis

Published 2019-11-13 | Category: Bioinformatics
  1. Load the data

    This analysis builds on the clustering result from the previous Seurat post; preprocessing, dimensionality reduction, and clustering have already been done.

    Load the pbmc object saved in the previous post.

    > load('./output/res0.5.Robj')

    Take two of the subclusters for analysis.

    > sub_pbmc <- subset(pbmc, idents=c(3,4))
    # the data is currently a Seurat object
    > seurat_object <- sub_pbmc

    Select the top 10 marker genes per cluster, to prepare for the Monocle analysis below.

    > markers <- FindAllMarkers(sub_pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
    > top10 <- markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)

    Next, extract from the Seurat object the data needed to create the Monocle object.

    Extract the expression matrix and convert it to a sparse matrix.

    > data <- as(as.matrix(seurat_object@assays$RNA@data), 'sparseMatrix')

    Extract the cell metadata and convert it to an AnnotatedDataFrame.

    > pd <- new('AnnotatedDataFrame', data = seurat_object@meta.data)

    Extract the gene annotation and convert it to an AnnotatedDataFrame.

    > fData <- data.frame(gene_short_name = row.names(data), row.names = row.names(data))
    > fd <- new('AnnotatedDataFrame', data = fData)
  2. Create the Monocle object

    newCellDataSet takes an expression matrix together with the sample (cell) annotations and the gene annotations in AnnotatedDataFrame format.

    #Construct monocle cds
    > cds <- newCellDataSet(data,
    phenoData = pd,
    featureData = fd,
    lowerDetectionLimit = 0.5,
    expressionFamily = negbinomial.size());
  3. Estimate size factors and dispersions

    > cds <- estimateSizeFactors(cds)
    > cds <- estimateDispersions(cds)
  4. Extract expressed genes

    expressed_genes <- row.names(subset(fData(cds)))
  5. Build the trajectory

    Use the 10 markers per cluster selected earlier (top10) to build the trajectory.

    Read top10 into a matrix.

    > ordering_genes<-as.matrix(top10)

    Filter

    > cds <- setOrderingFilter(cds, ordering_genes = ordering_genes)

    Dimensionality reduction

    > cds <- reduceDimension(cds, max_components = 2, method = 'DDRTree',auto_param_selection = F) # take long time

    Order the cells

    > cds <- orderCells(cds)

    Save

    > save(cds,file="./output/orderCells.Robj")
    write.table(pData(cds),file="./output/my_pseudotime.txt")

    Inspect

    > head(pData(cds))
    orig.ident nCount_RNA nFeature_RNA percent.mt RNA_snn_res.0.5 seurat_clusters S.Score G2M.Score
    AAACATACAACCAC pbmc3k 2419 779 3.017776 4 4 0.06500339 -0.065012414
    AAACATTGAGCTAC pbmc3k 4903 1352 3.793596 3 3 -0.01906540 -0.132349057
    AAACCGTGTATGCG pbmc3k 980 521 1.224490 4 4 -0.01086219 0.005361576
    AAACGCTGACCAGT pbmc3k 2175 782 3.816092 4 4 -0.03122878 -0.053972321
    AAACGCTGGTTCTT pbmc3k 2260 790 3.097345 4 4 -0.03190767 -0.011423063
    AAACTTGAAAAACG pbmc3k 3914 1112 2.631579 3 3 0.04395791 -0.080502879
    Phase Size_Factor Pseudotime State
    AAACATACAACCAC S 1.3046257 4.343051 6
    AAACATTGAGCTAC G1 2.6443075 6.530045 8
    AAACCGTGTATGCG G2M 0.5285379 0.000000 1
    AAACGCTGACCAGT G1 1.1730306 3.882721 5
    AAACGCTGGTTCTT G1 1.2188731 1.263559 1
    AAACTTGAAAAACG S 2.1109157 6.543204 8
  6. Plot the trajectory

    > plot_cell_trajectory(cds, color_by = "State")

    trajectory_state

    > plot_cell_trajectory(cds, color_by = "Pseudotime")

    trajectory_psudotime

    > plot_cell_trajectory(cds, color_by = "RNA_snn_res.0.5")

    trajectory_snn_res.0.5

    Expression of a specific gene along the trajectory

    > plot_cell_trajectory(cds, markers="CD79A", cell_size=0.5, use_color_gradient=T) + scale_color_gradient2(low="gray",mid="yellow",high="red")

    trajectory_gene

  7. Heatmap of genes of interest (changing in one direction)

    > my_pseudotime_de <- differentialGeneTest(cds,fullModelFormulaStr = "~sm.ns(Pseudotime)",cores = 5)

    > sig_gene_names <- row.names(subset(my_pseudotime_de, qval < 10^-20))

    > plot_pseudotime_heatmap(cds[sig_gene_names,], num_clusters = 4,cores = 5,use_gene_short_name = TRUE,show_rownames = TRUE)

    pseudotime_heatmap

  8. Pseudotime expression changes of genes of interest (one direction)

    > c<-subset(my_pseudotime_de, qval < 10^-20)

    > to_be_tested <- row.names(subset(fData(cds),gene_short_name %in% c("CD79A", "NKG7", "HLA-DRA", "CCL5")))
    > cds_subset <- cds[to_be_tested,]

    > diff_test_res <- differentialGeneTest(cds_subset,fullModelFormulaStr = "~sm.ns(Pseudotime)")

    > diff_test_res[,c("gene_short_name", "pval", "qval")]

    > plot_genes_in_pseudotime(cds_subset,color_by="RNA_snn_res.0.5")

    Rplot_gene_pseudotime

  9. Plots with branching (two directions)

    Heatmap

    > plot_genes_branched_heatmap(cds[sig_gene_names,], num_clusters = 4,cores = 5,use_gene_short_name = TRUE,show_rownames = TRUE)

    heatmap_2dir

    Expression over pseudotime

    > plot_genes_branched_pseudotime(cds_subset, branch_point = 2,color_by="RNA_snn_res.0.5")

    branched_pseudotime_2dir

Seurat Single-cell Clustering

Published 2019-11-13 | Category: Bioinformatics
  1. Introduction

    Data: the public PBMC dataset provided by 10X, 2,700 peripheral blood mononuclear cells (PBMC) sequenced on an Illumina NextSeq 500.

    Original tutorial and code: Seurat - Guided Clustering Tutorial

    hg19
    |__barcodes.tsv
    |__genes.tsv
    |__matrix.mtx
  2. Read the data

    Read in the pbmc data.

    > pbmc.data <- Read10X(data.dir = "./data/pbmc3k_filtered_gene_bc_matrices/filtered_gene_bc_matrices/hg19/")

    Check the dimensions of the sparse matrix, i.e. the numbers of genes and cells.

    > dim(pbmc.data)
    [1] 32738 2700

    Look at rows 1-5 and columns 1-3 of the matrix.

    A " · " means the gene was not detected, i.e. zero; this representation saves space.

    > pbmc.data[1:5,1:3]
    5 x 3 sparse Matrix of class "dgCMatrix"
    AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC
    MIR1302-10 . . .
    FAM138A . . .
    OR4F5 . . .
    RP11-34P13.7 . . .
    RP11-34P13.8 . . .
  3. Preprocessing: quality control and filtering

    Create the Seurat object

    project is the name of the object and can be used to label different samples.

    min.cells: drop genes detected in fewer than 3 cells.

    min.features: drop cells with fewer than 200 detected genes, since those are likely dead cells.

    > pbmc <- CreateSeuratObject(counts = pbmc.data, project = "pbmc3k", min.cells = 3, min.features = 200)
    An object of class Seurat
    13714 features across 2700 samples within 1 assay
    Active assay: RNA (13714 features)

    Compute the percentage (%) of mitochondrial transcripts in each cell.

    Store it in the metadata with the [[ ]] operator.

    A high mitochondrial fraction suggests the cell may be dead or have low viability.

    > pbmc[["percent.mt"]] <- PercentageFeatureSet(pbmc, pattern = "^MT-")

    Check the metadata

    Show the first three rows of the metadata.

    > head(pbmc@meta.data, 3)
    orig.ident nCount_RNA nFeature_RNA percent.mt
    AAACATACAACCAC pbmc3k 2419 779 3.0177759
    AAACATTGAGCTAC pbmc3k 4903 1352 3.7935958
    AAACATTGATCAGC pbmc3k 3147 1129 0.8897363

    Plot gene counts and mitochondrial percentage

    The plots inform the filtering thresholds: inspect the distributions and filter out the outliers.

    Genes per cell (nFeature_RNA): 200-2500

    Mitochondrial fraction (percent.mt): below 5%

    > VlnPlot(pbmc, features = c("nFeature_RNA", "nCount_RNA", "percent.mt"), ncol = 3)

    RplotVlinPlot_percentage_MT

    Use scatter plots (FeatureScatter) to show the correlation between two features, combined into one figure with CombinePlots.

    Left: the read count and the fraction of mitochondrial reads are usually inversely related; cells with a high mitochondrial fraction typically have few reads from expressed genes.

    Right: the number of genes (features) generally scales with the read count.

    > plot1 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "percent.mt")
    > plot2 <- FeatureScatter(pbmc, feature1 = "nCount_RNA", feature2 = "nFeature_RNA")
    > CombinePlots(plots = list(plot1, plot2))

    Rplot_combinePlots_scatter

    Filter the cells:

    Keep cells with more than 200 and fewer than 2,500 genes, to remove empty GEMs and GEMs containing two or more cells.

    Keep cells in which mitochondrial transcripts make up less than 5%, to filter out dead cells and other low-quality cells.

    > pbmc <- subset(pbmc, subset = nFeature_RNA > 200 & nFeature_RNA < 2500 & percent.mt < 5)

    Normalize the expression data

    The LogNormalize formula: A = log( 1 + ( UMI_A ÷ UMI_total ) × 10000 )

    pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
  4. Differential expression analysis

    Data standardization: normalization

    The LogNormalize formula:

    A = log( 1 + ( UMI_A ÷ UMI_total ) × 10000 )

    The result is stored in pbmc[["RNA"]]@data

    > pbmc <- NormalizeData(pbmc, normalization.method = "LogNormalize", scale.factor = 10000)
    # these are the default arguments, so they can be omitted
    # pbmc <- NormalizeData(pbmc)

    Identify highly variable genes (HVGs, highly variable features),

    comparing across cells and selecting the genes with the largest expression differences (i.e. genes highly expressed in some cells and barely expressed in others).

    The FindVariableFeatures function computes a mean-variance relationship, i.e. the relation between mean expression and variance, and returns the top variable features:

    2,000 genes, used for downstream analysis such as PCA.

    > pbmc <- FindVariableFeatures(pbmc, selection.method = "vst", nfeatures = 2000)

    Extract the 10 most variable genes.

    > top10 <- head(VariableFeatures(pbmc), 10)
    > top10
    [1] "PPBP" "LYZ" "S100A9" "IGLL5" "GNLY" "FTL" "PF4" "FTH1" "GNG11" "S100A8"

    Plot

    Left: all pbmc genes, with the 2,000 HVGs in red.

    Right: the same plot with the 10 most variable genes labeled.

    plot1 <- VariableFeaturePlot(pbmc)
    plot2 <- LabelPoints(plot = plot1, points = top10, repel = TRUE)
    CombinePlots(plots = list(plot1, plot2))

    Rplot_VariableFeaturePlot_Top10HVGs

  5. Dimensionality reduction

    Data standardization: zero-centering

    This applies a linear transformation (scaling) to the data, a standard preprocessing step before dimensionality reduction such as PCA.

    Use the ScaleData() function:

    center: subtract each gene's mean expression across all cells, so that every gene has mean 0.

    scale: divide each centered gene by its standard deviation, so that every gene has variance 1 across cells.

    ## By default ScaleData() only scales the highly variable genes (2,000), which is faster and does not affect PCA or clustering, but does affect heatmap plotting.
    # pbmc <- ScaleData(pbmc,vars.to.regress ="percent.mt")
    ## To scale all genes instead:
    > all.genes <- rownames(pbmc)
    > pbmc <- ScaleData(pbmc, features = all.genes)

    Remove unwanted sources of variation

    Unwanted sources of variation can be regressed out with vars.to.regress.

    This example regresses out the mitochondrial percentage.

    For more advanced correction there is a dedicated function, SCTransform, which also accepts the vars.to.regress option.

    > pbmc <- ScaleData(pbmc, vars.to.regress = "percent.mt")

    Linear dimensionality reduction: PCA

    RunPCA:

    By default the input is the highly variable gene set, but it can be specified with the features argument.

    The default number of principal components is 50.

    The result is stored in the reductions slot.

    > pbmc <- RunPCA(pbmc, features = VariableFeatures(object = pbmc))

    Inspect the PCA result

    Print the first 5 PCs, showing 5 genes each.

    ## RunPCA results are stored in the reductions slot
    # print(pbmc@reductions$pca, dims = 1:3, nfeatures = 5)
    > print(pbmc[["pca"]], dims = 1:5, nfeatures = 5)
    PC_ 1
    Positive: FTL, FTH1, CST3, AIF1, S100A4
    Negative: MALAT1, IL32, LTB, IL7R, CCL5
    PC_ 2
    Positive: S100A8, S100A9, LYZ, CD14, LGALS2
    Negative: B2M, NKG7, GZMA, CTSW, PRF1
    PC_ 3
    Positive: CD74, HLA-DRA, HLA-DPB1, HLA-DQA1, HLA-DQB1
    Negative: PPBP, GNG11, SPARC, PF4, AP001189.4
    PC_ 4
    Positive: CD74, HLA-DQA1, HLA-DQB1, HLA-DQA2, HLA-DRA
    Negative: FCGR3A, FTL, RP11-290F20.3, S100A4, NKG7
    PC_ 5
    Positive: FCGR3A, RP11-290F20.3, MS4A7, MALAT1, CKB
    Negative: NKG7, LYZ, S100A9, S100A8, GNLY

    Show the gene loadings of the principal components

    Show the top gene loadings of the first two PCs.

    > VizDimLoadings(pbmc, dims = 1:2, reduction = "pca")

    VizDimLoading

    Visualize the reduced dimensions

    DimPlot:

    By default it draws a scatter plot of the first two principal components; this can be changed with the dims option.

    Each point in the plot is one cell (sample).

    reduction specifies which dimensionality reduction to use; if unspecified, the default search order is umap -> tsne -> pca.

    Here the PCA result is used.

    The reduced coordinates are stored in cell.embeddings and can be printed.

    > DimPlot(pbmc, reduction = "pca")
    > head(pbmc[['pca']]@cell.embeddings)[1:2,1:2]
    PC_1 PC_2
    AAACATACAACCAC -4.2153796 -0.8006517
    AAACATTGAGCTAC 0.9503998 -3.0434884

    DimPlot

    Heatmap of the first principal component

    By default the top 30 genes are plotted; this can be changed with nfeatures.

    dims: which principal components to plot

    cells: number of cells to plot

    balanced: TRUE plots equal numbers of genes with positive and negative scores

    > DimHeatmap(pbmc, dims = 1, cells = 500, balanced = TRUE)

    DimHeatMap1

    Heatmap of principal components 1 to 15

    DimHeatMap15

    Check how much each principal component contributes

    ElbowPlot draws the elbow plot.

    Focus on the PC at the "elbow", the turning point; the PCs before it represent the overall variation well.

    From the plot below, the first 10 PCs are a reasonable choice.

    When choosing this parameter it is better to err on the high side (to capture more rare subpopulations, "better too many than too few"); some subpopulations are very rare and, without prior knowledge, hard to distinguish from background noise in a dataset of this size.

    ElbowPlot

    Nonlinear dimensionality reduction: UMAP/tSNE

    Linear dimensionality reduction with PCA (1933) tries to preserve as much of the original variance as possible when placing dissimilar high-dimensional points in a lower dimension, so dissimilar points end up far apart while most of the data overlaps.
    tSNE (2008) balances dimensionality reduction with low-dimensional display: dissimilar points can still be placed relatively close after reduction, so points are locally spread out yet globally clustered.
    UMAP (2018) can display even more structure than tSNE: see the reference.

    In Seurat, tSNE requires PCA to be run first. Downstream clustering only needs the PCA result, but the two-dimensional tSNE display looks nicer.

    tSNE reduction and visualization

    > pbmc <- RunTSNE(pbmc, dims = 1:10)
    > head(pbmc[['tsne']]@cell.embeddings)[1:2,1:2]
    tSNE_1 tSNE_2
    AAACATACAACCAC -10.99287 13.52608
    AAACATTGAGCTAC -28.57216 -27.23769
    DimPlot(pbmc,reduction = "tsne",label = TRUE,pt.size = 1.5)

    Rplot_tsne

    UMAP reduction and visualization

    > pbmc <- RunUMAP(pbmc, dims = 1:10)
    > head(pbmc[['umap']]@cell.embeddings)[1:2,1:2]
    UMAP_1 UMAP_2
    AAACATACAACCAC 3.784667 3.790803
    AAACATTGAGCTAC 1.231294 -10.622349
    > DimPlot(pbmc,reduction = "umap",label = TRUE,pt.size = 1.5)

    Rplot_umap

  6. Graph-based clustering

    Compute weights

    FindNeighbors is graph-based.

    The input is the principal components chosen earlier (here the first 10).

    A KNN graph is built from the PCA result, giving small neighborhoods of nearby cells.

    Within these neighborhoods, the shared overlap between each pair of cells gives the weight of the edge (Jaccard similarity).

    > pbmc <- FindNeighbors(pbmc, dims = 1:10)

    Clustering

    FindClusters uses the Louvain algorithm by default.

    The resolution parameter controls how many clusters the downstream analysis produces; for around 3K cells, 0.4-1.2 usually gives good results (per the official documentation), and it should be increased for larger datasets.

    resolution can be thought of as a magnification level: the higher it is, the finer the view and the more clusters you get.

    > pbmc <- FindClusters(pbmc, resolution = 0.5)

    Check the result

    Idents shows which cluster each cell was assigned to.

    The example prints the result for the first five cells.

    AAACATACAACCAC AAACATTGAGCTAC AAACATTGATCAGC AAACCGTGCTTCCG AAACCGTGTATGCG 
    4 3 0 2 4
    Levels: 0 1 2 3 4 5 6 7 8 9

    Visualization

    This example uses the tSNE embedding; different colors indicate different clusters.

    > DimPlot(pbmc,reduction = "tsne",label = TRUE,pt.size = 1.5)

    Rplot_cluster_tsne

  7. Identify differentially expressed genes (cluster biomarkers)

    Identify the marker genes of cluster 1

    FindAllMarkers() compares between clusters and finds the positive and negative biomarkers of each cluster.

    Positive and negative can be read as up- and down-regulated: a positive marker is highly expressed in this cluster and low in the others; a negative marker is low in this cluster and high in the others.

    ident.1: the first cluster

    ident.2: the second cluster; if omitted, all clusters other than the first are used together.

    min.pct: the minimum fraction of cells, in either of the two groups, that must express the gene.

    See the documentation for other optional arguments and the meaning of each result column.

    > cluster1.markers <- FindMarkers(pbmc, ident.1 = 1, min.pct = 0.25)
    > head(cluster1.markers, n = 5)
    p_val avg_logFC pct.1 pct.2 p_val_adj
    CYBA 3.392038e-97 -50.92599 0.613 0.900 4.651841e-93
    CD74 2.543403e-91 -203.51709 0.641 0.890 3.488023e-87
    ACTB 1.818148e-84 -186.51617 0.982 0.995 2.493408e-80
    OAZ1 2.698853e-79 -47.51709 0.777 0.943 3.701207e-75
    FTH1 5.527059e-79 -370.51704 0.980 0.992 7.579808e-75

    Identify the marker genes of every cluster, take the top two per cluster, and save them to disk.

    ## find the markers of every cluster
    > pbmc.markers <- FindAllMarkers(pbmc, only.pos = TRUE, min.pct = 0.25, logfc.threshold = 0.25)
    ## take the top two markers per cluster
    > pbmc.markers.top <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 2, wt = avg_logFC)
    ## save the markers
    > write.table(pbmc.markers,file="./output/allmarker.txt")
    > write.table(pbmc.markers.top,file="./output/topmarker.txt")

    Compare the marker genes of cluster 5 against clusters 0 and 3

    ident.2 = c(0,3) means clusters 0 and 3 are merged and treated as one group.

    > cluster5.markers <- FindMarkers(pbmc, ident.1 = 5, ident.2 = c(0, 3), min.pct = 0.25)
    > head(cluster5.markers, n = 5)
    p_val avg_logFC pct.1 pct.2 p_val_adj
    CFD 1.444184e-208 12.930168 0.956 0.032 1.980554e-204
    FCGR3A 1.905989e-199 20.996601 0.975 0.049 2.613873e-195
    RP11-290F20.3 3.500106e-199 4.916265 0.875 0.018 4.800046e-195
    IFITM3 1.752032e-198 49.819631 0.981 0.052 2.402737e-194
    CD68 1.486609e-196 5.804065 0.950 0.042 2.038735e-192
  8. Visualization

    FeaturePlot

    The most commonly used visualization.

    It projects gene expression onto the dimensionality-reduced clustering result.

    > FeaturePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14", "FCER1A", "FCGR3A", "LYZ", "PPBP", "CD8A"),cols = c("gray", "red"))

    FeaturePlot

    Violin plot: VlnPlot

    Shows the expression of selected genes across all clusters.

    > VlnPlot(pbmc, features = c("MS4A1", "CD79A"))
    > VlnPlot(pbmc, features = c("NKG7", "PF4"), slot = "counts", log = TRUE)

    Rplot_vln

    Rplot_vln2

    Heatmap: DoHeatmap

    ## take the top ten genes of each cluster, 100 genes in total
    > top10 <- pbmc.markers %>% group_by(cluster) %>% top_n(n = 10, wt = avg_logFC)
    > DoHeatmap(pbmc, features = top10$gene) + NoLegend()

    DoHeatMap

    Dot plot: DotPlot

    The y-axis is the cluster.

    Color indicates average expression.

    Size indicates the percentage of cells expressing the gene.

    DotPlot

    Ridge plot: RidgePlot

    The x-axis is expression level; the y-axis is the cluster.

    A high ridge at position (x, y) for feature1 means that many cells in cluster y express feature1 at level x.

    RidgePlot(pbmc, features = c("MS4A1", "GNLY", "CD3E", "CD14"), ncol = 2)

    Ridge_Plot

  9. Determine cell types

    Use the biomarkers found for each cluster to determine its cell type.

    This requires extra background knowledge; the literature and the CellMarker database can be consulted.

    Background knowledge given by the official tutorial:

    | Cluster ID | Markers | Cell Type |
    | :--- | :--- | :--- |
    | 0 | IL7R, CCR7 | Naive CD4+ T |
    | 1 | IL7R, S100A4 | Memory CD4+ |
    | 2 | CD14, LYZ | CD14+ Mono |
    | 3 | MS4A1 | B |
    | 4 | CD8A | CD8+ T |
    | 5 | FCGR3A, MS4A7 | FCGR3A+ Mono |
    | 6 | GNLY, NKG7 | NK |
    | 7 | FCER1A, CST3 | DC |
    | 8 | PPBP | Platelet |

    Add cell type labels to the clusters

    > new.cluster.ids <- c("Naive CD4 T", "Memory CD4 T", "CD14+ Mono", "B", "CD8 T", "FCGR3A+ Mono", "NK", "DC", "Platelet")
    names(new.cluster.ids) <- levels(pbmc)
    > pbmc <- RenameIdents(pbmc, new.cluster.ids)
    > DimPlot(pbmc, reduction = "tsne", label = TRUE, pt.size = 1.5) + NoLegend()

    The code above will throw an error, because my clustering produced 10 clusters while there are only 9 labels.

    The official tutorial got 9 clusters, and the background-knowledge table also lists 9, so the example uses 9 labels.

    Why does the same data, the same code, and the same parameters give a different result? Who knows.

    Since this is not real research data I will not dig into the background; take the code as illustrative.

    The next tutorial will use SingleR, an R package for automatic annotation, to add the labels.

  10. Save the data

    Save the cell classification result.

    > saveRDS(pbmc, file = "./output/pbmc3k_final.rds")

Ref:

[1] Single-cell Seurat package upgrade: analysis of 2,700 PBMCs (part 1)

[2] Single-cell Seurat package upgrade: analysis of 2,700 PBMCs (part 2)

[3] Seurat - Guided Clustering Tutorial

[4] A quick guide to the single-cell sequencing analysis workflow

License Plate Recognition - TensorFlow (1: Clone and Run)

Published 2017-10-30 | Category: License Plate Recognition

Clone the code

Clone the code from the GitHub repository to the local machine.

Changes: adapted to newer Python (print) and TensorFlow (label……) versions, and to the Windows environment (fname.split("\\")[1])

Run and debug

Note: the environment is already set up, with Python 3.5, TensorFlow 1.x, and OpenCV 3.

1. Import background images

Download [SUN397.tar.gz]

python extratbgs.py SUN397.tar.gz

2. Generate training images


3. Train


4. Recognize

License Plate Recognition - TensorFlow (0: Environment Setup)

Published 2017-10-28 | Category: License Plate Recognition

Requirements

Python, TensorFlow, OpenCV, NumPy

Installation

Note: the development environment is set up with Anaconda on Windows.

Prerequisite: Anaconda is already installed; opening cmd and running conda --version should correctly display the version number.

Install Python 3.5

conda create --name python35 python=3.5
# conda info --envs
activate python35
# python --version
# deactivate python35
# conda remove --name python35

# conda list
# conda list -n python35
# conda search numpy
# conda install -n python35 numpy
# conda update -n python35 numpy
# conda remove -n python35 numpy
# conda update conda
conda install anaconda
# conda update anaconda
# conda update python

Install TensorFlow

See the official guide: Installing with Anaconda

CPU version

pip install --ignore-installed --upgrade tensorflow

Install OpenCV

conda install --channel https://conda.anaconda.org/menpo opencv3

HELLO WORLD

Published 2017-10-28 | Category: Other

HELLO WORLD! —-WJ
