Project 3: Text Mining

Team members:

Wendy Maria Arango [email protected]
Manuela Carrasco [email protected]

Development Environment

To access to the project in Databricks Community enter here:

Project 3: Text mining

Cluster

In order to create the cluster use Databricks Community with the following configuration:

Databricks Runtime Version 5.2 (includes Apache Spark 2.4.0, Scala 2.11)
Python 3

Data

For testing, use the following news datasets.

News Datasets

Metodology

CRISP-DM (Cross Industry Standard Process for Data Mining)

CRISP-DM [1] is an open standard process model that describes common approaches used by data mining experts. It is the most widely-used analytics model.

Phases

Bussiness understanding: Objectives and requirements from a non-technical perspective.
Data understanding: Become familiar with the data keeping in mind the business objectives.
Data preparation: Get the minable view or dataset.
Modeling: Apply data mining techniques to dataset.
Evaluation: From the previous phase models to determine if they are useful for business needs.
Deployement: Exploit the usefulness of the models, integrating them into the decision-making tasks of the organization.

Business understanding

Mining or text analytics are a set of models, techniques, algorithms and technologies that allow the processing of text of a NON-STRUCTURED nature.

Text mining allows the text to be transformed into a structured form, in such a way that it facilitates a series of applications such as text search, relevance of documents, natural understanding of the language, automatic translation between languages, analysis of feelings, topic detection among many other applications. Perhaps the simplest processing of all, be the wordcount, which is to determine the frequency of the word per document or the entire dataset.

Data Understanding

The work data for the project is a set of news published in kaggle.

Data preparation

The news was pre-processed to prepare the data for analytics, with the following rules:

Remove special characters (.,%()'"...)
Remove stop words
Remove words of longitude 1

Modeling

In this case we used an inverted index where for each word that is entered by keyboard in the Notebook, list in descending order by word frequency in the content "title" of the news, the most relevant news.

<Frec, news_id, title>

Frec: Frequency of the word.
news_id: New's id in the dataset.
title: New's title.

Evaluation

Carry out grouping of news using one of different algorithms and models of clustering or similarity, in such a way that allows any news to identify that other news is similar.

A metric/similarity function was used based on the intersection of the most common common words for each news item, it is suggested to have 10 most frequent words for news whose frequency is greater than 1.

News	Top 10 most frequently words by news
new1	[(word1, f1) ... (wordn, fn)]
...	....
newM	[(word1, f1) ... (wordn, fn)]

Where for each news_id passed by the notebook, list in descending order by similitary the 5 news more related with the id.

References

[1] Wikipedia. (2019). Cross-industry standard process for data mining. [online] Available at: https://en.wikipedia.org/wiki/Cross-industry_standard_process_for_data_mining [Accessed 26 April 2019].

Name		Name	Last commit message	Last commit date
Latest commit History 8 Commits
notebook		notebook
src		src
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Project 3: Text Mining

Team members:

Development Environment

Cluster

Data

Metodology

CRISP-DM (Cross Industry Standard Process for Data Mining)

Phases

Business understanding

Data Understanding

Data preparation

Modeling

Evaluation

References

About

Releases

Packages

Contributors 2

Languages

warango4/BigData

Folders and files

Latest commit

History

Repository files navigation

Project 3: Text Mining

Team members:

Development Environment

Cluster

Data

Metodology

CRISP-DM (Cross Industry Standard Process for Data Mining)

Phases

Business understanding

Data Understanding

Data preparation

Modeling

Evaluation

References

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages