Update OpenAI embedding model for semantic similarity #83

bioerrorlog · 2024-01-31T05:09:30Z

This PR updates the default embedding model for semantic_similarity from text-embedding-ada-002 to text-embedding-3-small.
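For readers unfamiliar with how the model name enters the picture, here is a minimal sketch of requesting embeddings with the model passed as a parameter. It is not the library's actual implementation: `embed_texts` and `DEFAULT_MODEL` are hypothetical names for illustration, and `client` is assumed to be an `openai.OpenAI` instance (openai Python SDK v1 or later), whose `embeddings.create(model=..., input=...)` call returns one embedding per input in `response.data`.

```python
# Hypothetical helper, not code from this PR.
DEFAULT_MODEL = "text-embedding-3-small"  # the new default proposed here


def embed_texts(client, texts, model=DEFAULT_MODEL):
    """Return one embedding (a list of floats) per input text.

    `client` is assumed to expose `embeddings.create(model=..., input=...)`
    the way the openai v1 SDK client does.
    """
    response = client.embeddings.create(model=model, input=texts)
    # Each item in response.data carries the vector for one input text.
    return [item.embedding for item in response.data]
```

With the real SDK this would be called as `embed_texts(OpenAI(), ["メロスは激怒した。"])`, which requires an API key; passing `model="text-embedding-ada-002"` would reproduce the old behaviour for comparison.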

OpenAI announced new embedding models last week, and the new model text-embedding-3-small worked better than the old one text-embedding-ada-002 for my use cases.
(The reduced price is another strong point for the new model.)

In particular, text-embedding-ada-002 had a known issue: its cosine similarities tend to be skewed quite heavily towards higher numbers (as noted in the docs). The new model text-embedding-3-small seems to resolve this issue.
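For context, cosine similarity is the dot product of two vectors divided by the product of their lengths. The sketch below is a plain-Python version (the function name is my own, not the library's), followed by a toy illustration of one way scores can end up skewed high: if every vector shares a large common component, even unrelated vectors score close to 1. This is only an illustration under that assumption, not a claim about why text-embedding-ada-002 behaves this way.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Two orthogonal (unrelated) toy vectors.
u, v = [1.0, 0.0], [0.0, 1.0]
plain = cosine_similarity(u, v)  # 0.0

# The same vectors after adding a shared offset of 3 to every component.
u_shifted = [x + 3.0 for x in u]  # [4.0, 3.0]
v_shifted = [x + 3.0 for x in v]  # [3.0, 4.0]
shifted = cosine_similarity(u_shifted, v_shifted)  # 24 / 25 = 0.96

print(plain, shifted)
```

In the toy case the shared offset lifts the score of two unrelated vectors from 0 to 0.96, which is the kind of compressed range that makes a fixed similarity threshold hard to choose.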

The other new model, text-embedding-3-large, is not a good choice for the default because of its higher price.

Supplemental data

Here are the results of my test cases (in Japanese):

LLM answer (dummy)	Reference	Similarity (Local: paraphrase-multilingual-mpnet-base-v2)	Similarity (OpenAI: text-embedding-ada-002)	Similarity (OpenAI: text-embedding-3-small)
メロスは激怒した。	メロスは激怒した。	1.000000	1.000000	1.000000
メロスは激しく怒った。	メロスは激怒した。	0.980329	0.990486	0.965431
メロスは、激しく、怒った。	メロスは激しく怒った	0.961034	0.976309	0.947357
セリヌンティウスは待っていた。	メロスは激怒した。	0.250888	0.817479	0.237152
単語のベクトル表現は、1960年代における情報検索用のベクトル空間モデルを元に開発された。潜在的意味分析は、特異値分解で次元数を削減することで、1980年代後半に導入された。	単語をベクトルとして表現する手法は、1960年代における情報検索用のベクトル空間モデルの開発が元になっている。特異値分解を使用して次元数を削減することにより、1980年代後半に潜在的意味分析が導入された。	0.979269	0.986894	0.934140
1960年以前に、ベクトル表現は開発された。のちに次元数を増幅することにより、潜在分析が導入された。	単語をベクトルとして表現する手法は、1960年代における情報検索用のベクトル空間モデルの開発が元になっている。特異値分解を使用して次元数を削減することにより、1980年代後半に潜在的意味分析が導入された。	0.649559	0.906195	0.709959
すみません、ITヘルプデスクに電話で問い合わせてください。	大変申し訳ございませんが、弊社情報部門のヘルプデスクにメールでお問い合わせください。	0.768630	0.930722	0.752877
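One rough way to summarize the table is the spread (maximum minus minimum) of each model's scores across the seven test cases. The sketch below uses only the numbers from the table; the spread itself is my own choice of summary, not a metric from the PR.

```python
# Similarity scores copied from the table above, in row order.
scores = {
    "paraphrase-multilingual-mpnet-base-v2": [
        1.000000, 0.980329, 0.961034, 0.250888, 0.979269, 0.649559, 0.768630,
    ],
    "text-embedding-ada-002": [
        1.000000, 0.990486, 0.976309, 0.817479, 0.986894, 0.906195, 0.930722,
    ],
    "text-embedding-3-small": [
        1.000000, 0.965431, 0.947357, 0.237152, 0.934140, 0.709959, 0.752877,
    ],
}


def score_spread(values):
    """Difference between the highest and lowest score in a column."""
    return max(values) - min(values)


spreads = {model: score_spread(values) for model, values in scores.items()}

for model, spread in spreads.items():
    print(f"{model}: {spread:.6f}")
```

On these seven cases the spread is 0.182521 for text-embedding-ada-002, against 0.762848 for text-embedding-3-small and 0.749112 for the local model, which matches the observation that the older model's scores are bunched near the top of the range.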

Test script here, or see my writeup (Japanese).

Thank you.

kennysong · 2024-01-31T05:45:30Z

Awesome investigation and blog post! I think @liwii or @yosukehigashi would be the right person to review this. (Might take a few days though)

yosukehigashi · 2024-01-31T10:48:23Z

Thank you for working on this, @bioerrorlog!!🚀 I'll try and take a look at this tomorrow

yosukehigashi

LGTM! Just made a small change to update the links from the text-embedding-ada-002 doc to the text-embedding-3-small doc.

I'll merge this after tests have passed - thanks for your contribution @bioerrorlog!!

bioerrorlog added 2 commits January 31, 2024 13:16

update semantic_similarity embedding model to text-embedding-3-small

bc579eb

update docs: remove warning notes for OpenAI embedding models

395f5f7

yosukehigashi self-requested a review January 31, 2024 10:48

update links in docstrings

bf5fbdf

yosukehigashi approved these changes Feb 1, 2024


yosukehigashi merged commit f215f67 into citadel-ai:main Feb 1, 2024
2 checks passed

bioerrorlog deleted the update-semantic-similarity-model branch February 3, 2024 12:58
