Commit: try

MuHeDing committed Nov 10, 2019
1 parent 95fe54e commit c04b31e
Showing 31 changed files with 590 additions and 48 deletions.
Binary file added IR Evaluation/3-evaluation1.png
Binary file added IR Evaluation/3-evaluation2.png
Binary file added IR Evaluation/3-evaluation3.png
Binary file added IR Evaluation/3-evaluation4.png
Binary file added IR Evaluation/3-evaluation5.png
Binary file added IR Evaluation/3-evaluation6.png
Binary file added IR Evaluation/3-evaluation7.png
15 changes: 8 additions & 7 deletions IR Evaluation/readme.md
@@ -12,7 +12,8 @@
1. Evaluate the query results from Experiment 2. First extract the query content: the queries are stored in MB171-225.txt, inside the `<query></query>` tags, as shown in the figure:
<br>

-![3-evaluation1.png](https://i.loli.net/2019/10/20/LBrbTx52g3oMZHO.png)
+![avatar](3-evaluation1.png)

<br>

@@ -49,7 +50,7 @@

<br>

-![3-evaluation2.png](https://i.loli.net/2019/10/20/i89W2LoJ5sVckIP.png)
+![avatar](3-evaluation2.png)


3. Implement the evaluation functions: MAP, MRR, and NDCG
@@ -58,7 +59,7 @@
For MAP, first obtain the tweetids (docids) of the ground-truth results from qrels.txt, then

<br>

-![3-evaluation3.png](https://i.loli.net/2019/10/20/ezcV8qgBDGEiyuo.png)
+![avatar](3-evaluation3.png)


The code is as follows:
@@ -105,7 +106,7 @@ def MAP_eval(qrels_dict, test_dict, k = 100):
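The `MAP_eval` body is collapsed in this diff view. As a hedged sketch (not the repository's exact code), assuming `qrels_dict` maps each query id to its set of relevant docids and `test_dict` maps it to the ranked result list, MAP@k might look like:

```python
# Hypothetical sketch, not the repository's code. Assumes:
#   qrels_dict: query id -> set of relevant docids (from qrels.txt)
#   test_dict:  query id -> ranked list of docids (your results)
def MAP_eval(qrels_dict, test_dict, k=100):
    ap_sum = 0.0
    for qid, ranked in test_dict.items():
        relevant = qrels_dict.get(qid, set())
        hits = 0
        precision_sum = 0.0
        for rank, docid in enumerate(ranked[:k], start=1):
            if docid in relevant:
                hits += 1
                precision_sum += hits / rank   # precision at each relevant hit
        ap_sum += precision_sum / len(relevant) if relevant else 0.0
    return ap_sum / len(test_dict)             # mean over all queries
```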
MRR is similar to MAP but uses reciprocals: first obtain the tweetids (docids) from qrels.txt, i.e. the ground truth, then the tweetids from your own results; apply the formula in the figure below, with 1 in the numerator and the rank of the first relevant document in the denominator, to get RR; finally aggregate the RR values over all queries (MRR is their mean).


-![3-evaluation4.png](https://i.loli.net/2019/10/20/dUpigE2qIeHYJhs.png)
+![avatar](3-evaluation4.png)

The code is as follows:

@@ -148,7 +149,7 @@ def MRR_eval(qrels_dict, test_dict, k = 100):
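The `MRR_eval` body is also collapsed here. A minimal sketch, assuming `qrels_dict` maps query ids to relevant docid sets and `test_dict` to ranked docid lists:

```python
# Hypothetical sketch, not the repository's code. Assumed shapes:
#   qrels_dict: query id -> set of relevant docids
#   test_dict:  query id -> ranked list of docids
def MRR_eval(qrels_dict, test_dict, k=100):
    rr_sum = 0.0
    for qid, ranked in test_dict.items():
        relevant = qrels_dict.get(qid, set())
        for rank, docid in enumerate(ranked[:k], start=1):
            if docid in relevant:
                rr_sum += 1.0 / rank   # reciprocal rank of first relevant doc
                break
    return rr_sum / len(test_dict)     # mean reciprocal rank
```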

NDCG: first compute DCG as the cumulative relevance, each gain divided by the base-2 logarithm of its rank position; then compute IDCG, which is the DCG of the gains sorted in descending order; finally apply the formula in the figure below to obtain NDCG.

-![3-evaluation5.png](https://i.loli.net/2019/10/20/SEyGg43De9I1YMl.png)
+![avatar](3-evaluation5.png)

The code is as follows:

@@ -201,13 +202,13 @@ def NDCG_eval(qrels_dict, test_dict, k = 100):
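The `NDCG_eval` body is collapsed as well. A hedged sketch using the common log2(rank + 1) discount (the repository may discount differently), with `qrels_dict` assumed to map query ids to `{docid: graded relevance}`:

```python
import math

# Hypothetical sketch, not the repository's code. Assumes qrels_dict maps each
# query id to {docid: graded relevance} and uses the log2(rank + 1) discount.
def NDCG_eval(qrels_dict, test_dict, k=100):
    ndcg_sum = 0.0
    for qid, ranked in test_dict.items():
        gains = qrels_dict.get(qid, {})
        dcg = sum(gains.get(d, 0) / math.log2(r + 1)
                  for r, d in enumerate(ranked[:k], start=1))
        ideal = sorted(gains.values(), reverse=True)[:k]   # best possible order
        idcg = sum(g / math.log2(r + 1) for r, g in enumerate(ideal, start=1))
        ndcg_sum += dcg / idcg if idcg > 0 else 0.0
    return ndcg_sum / len(test_dict)
```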

<br>

-![3-evaluation6.png](https://i.loli.net/2019/10/20/KTdi5lgwZ9c6fpE.png)
+![avatar](3-evaluation6.png)

After the final summation:

<br>

-![3-evaluation7.png](https://i.loli.net/2019/10/20/LRelKviBmSaPyfX.png)
+![avatar](3-evaluation7.png)

The scores are slightly lower than the reference answers but very close, so the implemented retrieval is effective.

19 changes: 9 additions & 10 deletions Inverted index/README.md
@@ -115,13 +115,12 @@ def shuffle(dic):

Each term maps to the documents it appears in (here the line number serves as the document id) and its term frequency within each document.

-![index.png](https://i.loli.net/2019/10/20/xH3yQz98OnDhFpN.png)
+![avatar](index.png)
Use the tweetid as the document id and drop the term frequencies to produce the standard output.

-![boolean5.png](https://i.loli.net/2019/10/20/my3FOtflJMTYug7.png)
+![avatar](boolean5.png)
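The index structure described above can be sketched as follows; `build_index` and the sample documents are illustrative names, not the repository's own:

```python
from collections import defaultdict

# Illustrative sketch of the described structure. docs maps tweetid -> text.
def build_index(docs):
    index = defaultdict(dict)                  # term -> {tweetid: tf}
    for tweetid, text in docs.items():
        for term in text.lower().split():
            index[term][tweetid] = index[term].get(tweetid, 0) + 1
    return index

index = build_index({'t1': 'cat dog cat', 't2': 'dog'})
postings = sorted(index['dog'])                # standard output: ids only, no tf
```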

## **Implementation — Boolean queries**

@@ -135,28 +134,28 @@ def shuffle(dic):

`and` and `not` applied separately

-![boolean1.png](https://i.loli.net/2019/10/20/C52wX7hzxiBSReI.png)
+![avatar](boolean1.png)

`and` and `not` applied together


-![boolean2.png](https://i.loli.net/2019/10/20/wmiWQMtyLVedf16.png)
+![avatar](boolean2.png)


`and`, `or`, and `not` applied together

-![boolean4.png](https://i.loli.net/2019/10/20/K2Jz6fQLWrNcsCT.png)
+![avatar](boolean4.png)


## **Improvements**

1. Added precedence handling, supporting `and`/`or`/`not` queries over three terms A, B, C with the precedence not > and > or.

-![boolean6.png](https://i.loli.net/2019/10/20/oUz3uvSZlyiQJ8F.png)
+![avatar](boolean6.png)
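One way to realize the not > and > or precedence is a small recursive-descent parser over posting sets; this is an assumed reconstruction for illustration, not the repository's parser:

```python
# Assumed reconstruction of precedence-aware boolean evaluation
# (not > and > or) over posting sets; not the repository's code.
def evaluate(tokens, postings, universe):
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def term():                       # 'not' binds tightest
        nonlocal pos
        if peek() == 'not':
            pos += 1
            return universe - term()
        word = tokens[pos]
        pos += 1
        return postings.get(word, set())

    def conj():                       # 'and' binds tighter than 'or'
        nonlocal pos
        result = term()
        while peek() == 'and':
            pos += 1
            result = result & term()
        return result

    def disj():                       # 'or' binds loosest
        nonlocal pos
        result = conj()
        while peek() == 'or':
            pos += 1
            result = result | conj()
        return result

    return disj()
```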

2. Studied and used the textblob library: its `words` method for tokenization, plus singular/plural handling for nouns and lemmatization for verbs. The methods studied are shown in the figure below:

-![learn.png](https://i.loli.net/2019/10/20/3mIyoKnO49EqVLj.png)
+![avatar](learn.png)

3. Added term-frequency (tf) and document-frequency (df) statistics, which make it easy for Experiment 2 to compute document/query scores.
Binary file added Inverted index/boolean1.png
Binary file added Inverted index/boolean2.png
Binary file added Inverted index/boolean3.png
Binary file added Inverted index/boolean4.png
Binary file added Inverted index/boolean5.png
Binary file added Inverted index/boolean6.png
Binary file added Inverted index/index.png
Binary file added Inverted index/learn.png
Binary file added Inverted index/youxianji.png
Binary file added Ranked Retrieval Model/2-rank1.png
Binary file added Ranked Retrieval Model/2-rank2.png
Binary file added Ranked Retrieval Model/2-rank3.png
Binary file added Ranked Retrieval Model/2-rank4.png
Binary file added Ranked Retrieval Model/2-rank5.png
12 changes: 6 additions & 6 deletions Ranked Retrieval Model/readme.md
@@ -21,7 +21,7 @@ normalization

1. First, building on the inverted index from Experiment 1, construct the postings list for each term. The result is as shown in Experiment 1:

-![2-rank1.png](https://i.loli.net/2019/10/20/ftVX7Jc9Ciz6Spd.png)
+![avatar](2-rank1.png)



@@ -92,7 +92,8 @@ def shuffle(dic):

The result is shown in the figure below:

-![2-rank2.png](https://i.loli.net/2019/10/20/5AJKknVL9t8jwCI.png)
+![avatar](2-rank2.png)

4. Count each term's total term frequency and document frequency: for every term, the number of tweetids that contain it is the document frequency, and the sum of its per-tweetid frequencies is the total term frequency.

@@ -133,7 +134,7 @@ def process_query(query):

Use the algorithm shown in the figure:

-![2-rank3.png](https://i.loli.net/2019/10/20/MAT8O6gKHv7cq4l.png)
+![avatar](2-rank3.png)
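A hedged sketch of the kind of tf-idf scoring described here, assuming sublinear tf weighting and idf = log10(N / df); the repository's exact weighting and normalization may differ:

```python
import math

# Hedged sketch: sublinear tf (1 + log10 tf) times idf = log10(N / df).
# doc_tf maps term -> frequency in the document, df maps term -> document
# frequency, N is the collection size. Names here are illustrative.
def score(query_terms, doc_tf, df, N):
    s = 0.0
    for t in query_terms:
        tf = doc_tf.get(t, 0)
        if tf > 0 and df.get(t, 0) > 0:
            s += (1 + math.log10(tf)) * math.log10(N / df[t])
    return s
```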



@@ -177,13 +178,12 @@ def do_RankSearch(query,doc,tdic):

The figure shows the result: entering a query produces the corresponding scores.

-![2-rank5.png](https://i.loli.net/2019/10/20/YmgPX2UEhiaCQ4y.png)
+![avatar](2-rank5.png)


To compare the results, I also print the corresponding tweet text, in which the query terms can be seen:


-![2-rank4.png](https://i.loli.net/2019/10/20/lkZz1wU4RIGxJKS.png)
+![avatar](2-rank4.png)



115 changes: 90 additions & 25 deletions cluster/FirstCluster.py
@@ -4,6 +4,7 @@
from sklearn.cluster import AffinityPropagation
import numpy as np
from sklearn import metrics
from time import time
from sklearn.cluster import estimate_bandwidth
from sklearn.cluster import MeanShift
from sklearn.cluster import SpectralClustering
@@ -13,60 +14,118 @@
from sklearn.cluster import OPTICS
from sklearn import mixture
from sklearn.cluster import Birch
from sklearn.decomposition import PCA
# Handwritten digit recognition
digits = datasets.load_digits()
#print(digits.data.shape)  # 1797*64: each image is a 64-pixel vector

data = scale(digits.data)  # newX = (X - mean) / std
#print(data)
n_digits=len(np.unique(digits.target))
sample_size = 300
labels = digits.target
print('init\t\ttime\tinertia\thomo\tcompl\tv-meas\tARI\tAMI\tsilhouette')
def bench_k_means(estimator, name, data):
    # Fit the estimator and report timing, inertia, and clustering metrics.
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%i\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0), estimator.inertia_,
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_,
                                                average_method='arithmetic'),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

def bench_show(estimator, name, data):
    # Like bench_k_means, but without inertia_ (most estimators lack it).
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0),
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_,
                                                average_method='arithmetic'),
             metrics.silhouette_score(data, estimator.labels_,
                                      metric='euclidean',
                                      sample_size=sample_size)))

def bench_show2(estimator, name, data):
    # For estimators without labels_ (e.g. GaussianMixture): use predict(data).
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0),
             metrics.homogeneity_score(labels, estimator.predict(data)),
             metrics.completeness_score(labels, estimator.predict(data)),
             metrics.v_measure_score(labels, estimator.predict(data)),
             metrics.adjusted_rand_score(labels, estimator.predict(data)),
             metrics.adjusted_mutual_info_score(labels, estimator.predict(data),
                                                average_method='arithmetic'),
             metrics.silhouette_score(data, estimator.predict(data),
                                      metric='euclidean',
                                      sample_size=sample_size)))

def bench_show3(estimator, name, data):
    # Like bench_show, but without the silhouette score.
    t0 = time()
    estimator.fit(data)
    print('%-9s\t%.2fs\t%.3f\t%.3f\t%.3f\t%.3f\t%.3f'
          % (name, (time() - t0),
             metrics.homogeneity_score(labels, estimator.labels_),
             metrics.completeness_score(labels, estimator.labels_),
             metrics.v_measure_score(labels, estimator.labels_),
             metrics.adjusted_rand_score(labels, estimator.labels_),
             metrics.adjusted_mutual_info_score(labels, estimator.labels_,
                                                average_method='arithmetic')))
# estimator=KMeans(init='k-means++', n_clusters=n_digits, n_init=10)
# estimator.fit(data)
# #z=estimator.predict(data)
# z=estimator.labels_
# count=0
# for i in range(0,len(digits.data)):
# if digits.target[i]==z[i]:
# # print(digits.target[i])
# # print(z[i])
# count=count+1
# print(count)

# score=metrics.normalized_mutual_info_score(digits.target,z,average_method='arithmetic')
# print(score)
#

bench_k_means(KMeans(init='k-means++', n_clusters=n_digits, n_init=10),
name="k-means++", data=data)
# Affinity Propagation
# af = AffinityPropagation().fit(data)
af = AffinityPropagation().fit(data)
#
# score=metrics.normalized_mutual_info_score(digits.target,af.labels_,average_method='arithmetic')
# print(score)

bench_show(AffinityPropagation(),
name="AffinityPropagation", data=data)
# Mean Shift
# bandwidth = estimate_bandwidth(data,quantile=0.2,n_samples=500)
# ms = MeanShift(bandwidth=bandwidth,bin_seeding=True)
bandwidth = estimate_bandwidth(data,quantile=0.2,n_samples=500)
ms = MeanShift(bandwidth=bandwidth,bin_seeding=True)
# ms.fit(data)
# print(ms.labels_)
# score=metrics.normalized_mutual_info_score(digits.target,ms.labels_,average_method='arithmetic')
# print(score)

bench_show(MeanShift(bandwidth=bandwidth,bin_seeding=True),
name="MeanShift", data=data)
# Spectral clustering
# had problems
#spectral = SpectralClustering(n_clusters=n_digits,eigen_solver='arpack').fit(data)
# sc = SpectralClustering(n_digits, affinity='precomputed', n_init=100).fit_predict(data)
# score=metrics.normalized_mutual_info_score(digits.target,sc.labels_,average_method='arithmetic')
# print(score)

#sc = SpectralClustering(n_digits, affinity='precomputed', n_init=100).fit_predict(data)
pca=PCA(n_components=n_digits).fit_transform(data)
#score=metrics.normalized_mutual_info_score(digits.target,sc.labels_,average_method='arithmetic')
#print(score)
bench_show3(SpectralClustering(n_digits),name="SpectralClustering", data=pca)
# Hierarchical (agglomerative) clustering
# ward = AgglomerativeClustering(n_clusters=n_digits, linkage='ward')
#ward = AgglomerativeClustering(n_clusters=n_digits, linkage='ward')
# ward.fit(data)
# score=metrics.normalized_mutual_info_score(digits.target,ward.labels_,average_method='arithmetic')
# print(score)

bench_show( AgglomerativeClustering(n_clusters=n_digits, linkage='ward'),
name="AgglomerativeClustering", data=data)
# Density-based clustering
# db = DBSCAN().fit(data)

# score=metrics.normalized_mutual_info_score(digits.target,db.labels_,average_method='arithmetic')
# print(score)
bench_show3(DBSCAN(),name="DBSCAN", data=data)


# OPTICS clustering
# clust = OPTICS(min_samples=50, xi=.05, min_cluster_size=.05)
@@ -76,14 +135,20 @@
# score=metrics.normalized_mutual_info_score(digits.target,clust.labels_,average_method='arithmetic')
# print(score)

bench_show(OPTICS(min_samples=50, xi=.05, min_cluster_size=.05),name="OPTICS", data=data)

# Gaussian mixture model
# gmm = mixture.GaussianMixture(n_components=n_digits, covariance_type='full').fit(data)
#gmm = mixture.GaussianMixture(n_components=n_digits, covariance_type='full').fit(data)
# score=metrics.normalized_mutual_info_score(digits.target,gmm.predict(data),average_method='arithmetic')
# print(score)

bench_show2(mixture.GaussianMixture(n_components=n_digits, covariance_type='full'),name="Gaussian", data=data)

# Birch
brc = Birch(branching_factor=50, n_clusters=n_digits, threshold=0.5, compute_labels=True)
brc.fit(data)
score=metrics.normalized_mutual_info_score(digits.target,brc.labels_,average_method='arithmetic')
print(score)
# brc = Birch(branching_factor=50, n_clusters=n_digits, threshold=0.5, compute_labels=True)
# brc.fit(data)
# score=metrics.normalized_mutual_info_score(digits.target,brc.labels_,average_method='arithmetic')
# print(score)

bench_show2(Birch(branching_factor=50, n_clusters=n_digits, threshold=0.5, compute_labels=True),name="Birch", data=data)

Binary file added cluster/dataset.png
Binary file added cluster/evaluate.png