ABOUT_project
SUMMARY OF THE PROJECT:
In this project we detected fake news with Python. We took a political news dataset, built TF-IDF features with a TfidfVectorizer, initialized a PassiveAggressiveClassifier, and fit the model, obtaining an accuracy of 92.74%.
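The whole pipeline can be sketched in a few lines. This is a minimal, self-contained illustration, not the project's actual script: the tiny in-memory corpus stands in for the real political news dataset, and the hyperparameters (max_df=0.7, max_iter=50, test_size=0.2) are plausible assumptions rather than the project's recorded settings.

```python
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score

# Tiny stand-in corpus; the real project used a political news dataset.
texts = ["the economy is growing fast", "aliens secretly run the senate",
         "new budget passed by congress", "miracle cure hidden by doctors",
         "president signs trade agreement", "moon landing was filmed in a studio"] * 5
labels = ["REAL", "FAKE", "REAL", "FAKE", "REAL", "FAKE"] * 5

x_train, x_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.2, random_state=7)

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
tfidf_train = vectorizer.fit_transform(x_train)  # learn vocabulary and idf
tfidf_test = vectorizer.transform(x_test)        # reuse the learned vocabulary

clf = PassiveAggressiveClassifier(max_iter=50, random_state=7)
clf.fit(tfidf_train, y_train)
score = accuracy_score(y_test, clf.predict(tfidf_test))
print(round(score, 2))
```

Note that transform (not fit_transform) is used on the test split, so the test documents are scored against the vocabulary and idf weights learned from the training data only.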
Some important points regarding the project:
1. class sklearn.feature_extraction.text.TfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)
2. max_df can be set to a value in the range [0.7, 1.0) to automatically detect and filter stop words based on intra corpus document frequency of terms.
3. Stop words are words like “and”, “the”, “him”, which are presumed to be uninformative in representing the content of a text, and which may be removed to avoid them being construed as signal for prediction. Sometimes, however, similar words are useful for prediction, such as in classifying writing style or personality. There are several known issues in our provided ‘english’ stop word list.
4. max_df : float or int, default=1.0 -----> When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If a float in the range [0.0, 1.0], the parameter represents a proportion of documents; if an integer, an absolute document count. This parameter is ignored if vocabulary is not None.
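The effect of max_df on a toy corpus can be seen directly (the three example sentences below are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["python detects fake news",
        "python flags fake stories",
        "python ranks real news"]

# "python" appears in 3/3 documents (df = 1.0 > 0.7), so max_df=0.7
# drops it as a corpus-specific stop word; rarer terms survive.
vec = TfidfVectorizer(max_df=0.7)
vec.fit(docs)
print(sorted(vec.vocabulary_))
```

With max_df=0.7 the ubiquitous term "python" is excluded from the vocabulary, while terms like "fake" (df = 2/3) are kept.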
5. fit_transform(raw_documents, y=None) ----> Learn vocabulary and idf, return document-term matrix.
This is equivalent to fit followed by transform, but more efficiently implemented.
Parameters : raw_documents : iterable ----> An iterable which generates either str, unicode or file objects.
y : None -----> This parameter is ignored.
Returns : X : sparse matrix of (n_samples, n_features)
Tf-idf-weighted document-term matrix.
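A quick sketch of what fit_transform returns (the two sentences are invented examples):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["fake news spreads fast", "real news is verified"]
vec = TfidfVectorizer()
X = vec.fit_transform(docs)   # fit followed by transform in one pass

print(X.shape)                # (n_samples, n_features)
print(type(X).__name__)       # a scipy sparse matrix, not a dense array
```

The result is sparse because each document uses only a small fraction of the full vocabulary, so most entries of the document-term matrix are zero.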
6. TF (Term Frequency): The number of times a word appears in a document is its term frequency. A higher value means the term appears more often in that document, so the document is a better match when the term is part of the search terms.
7. IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur in many other documents, may be irrelevant. IDF is a measure of how significant a term is in the entire corpus.
--> The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features
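The idf part can be computed by hand. Below is a minimal sketch of the smoothed formula scikit-learn uses by default (smooth_idf=True): idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t. The four toy documents are invented for illustration.

```python
import math

# Each document is represented as its set of terms.
docs = [{"fake", "news"}, {"fake", "story"}, {"real", "news"}, {"fake", "claim"}]
n = len(docs)

def idf(term):
    df = sum(term in d for d in docs)           # document frequency of the term
    return math.log((1 + n) / (1 + df)) + 1     # smoothed idf

# "fake" appears in 3 of 4 documents, "real" in only 1,
# so "real" gets the higher idf and carries more weight.
print(round(idf("fake"), 3), round(idf("real"), 3))
```

TfidfVectorizer multiplies each term count by this idf and then (with norm='l2') normalizes each document row to unit length.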
8. transform(raw_documents) -----> Transform documents to document-term matrix.
Uses the vocabulary and document frequencies (df) learned by fit (or fit_transform).
Returns : X : sparse matrix of (n_samples, n_features) = Tf-idf-weighted document-term matrix.
9. What is a PassiveAggressiveClassifier?
Passive Aggressive algorithms are online learning algorithms. Such an algorithm remains passive on a correct classification outcome, and turns aggressive in the event of a misclassification, updating and adjusting its weights. Unlike most other algorithms, it does not converge. Its purpose is to make updates that correct the loss while causing very little change in the norm of the weight vector.
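The online, passive-or-aggressive behaviour is easiest to see with partial_fit, which feeds the classifier one mini-batch at a time. This is a generic sketch on synthetic 2-D data, not part of the fake news project:

```python
import numpy as np
from sklearn.linear_model import PassiveAggressiveClassifier

clf = PassiveAggressiveClassifier(random_state=0)
classes = np.array([0, 1])

rng = np.random.RandomState(0)
for _ in range(20):
    X = rng.randn(8, 2)
    y = (X[:, 0] + X[:, 1] > 0).astype(int)  # linearly separable labelling rule
    # Passive on correctly classified samples; an aggressive
    # weight update only on the misclassified ones.
    clf.partial_fit(X, y, classes=classes)

print(clf.predict([[2.0, 2.0], [-2.0, -2.0]]))
```

After seeing the stream, the model has learned the separating direction, so a point far on the positive side predicts class 1 and a point far on the negative side predicts class 0.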