A modern data ecosystem is a network of interconnected and continually evolving entities that includes:
- Data that is available in a host of different formats, structures, and sources.
- Enterprise Data Environment in which raw data is staged so it can be organized, cleaned, and optimized for use by end-users.
- End-users such as business stakeholders, analysts, and programmers who consume data for various purposes.
- Emerging technologies such as Cloud Computing, Machine Learning, and Big Data, which are continually reshaping the data ecosystem and the possibilities it offers.
- Data Engineers, Data Analysts, Data Scientists, Business Analysts, and Business Intelligence Analysts, who all play a vital role in the ecosystem, deriving insights and business results from data.
Primary types of Data Analysis
- Descriptive Analytics, which helps decode “What happened” (see the sketch after this list).
- Diagnostic Analytics, which helps us understand “Why it happened.”
- Predictive Analytics, which analyzes historical data and trends to suggest “What will happen next.”
- Prescriptive Analytics, which prescribes “What should be done next.”
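For the descriptive case, a minimal pandas sketch of “what happened” summaries; the dataset and column names are made up purely for illustration:

```python
import pandas as pd

# Small made-up dataset used purely for illustration
sales = pd.DataFrame({
    "region": ["East", "West", "East", "North"],
    "revenue": [120, 95, 80, 60],
})

# Descriptive analytics: summarize "what happened"
print(sales["revenue"].describe())               # count, mean, std, min, quartiles, max
print(sales.groupby("region")["revenue"].sum())  # total revenue per region
```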
Data Analysis process
- Developing an understanding of the problem and the desired outcome.
- Setting a clear metric for evaluating outcomes.
- Gathering, cleaning, analyzing, and mining data to interpret results.
- Communicating the findings in ways that impact decision-making.
Role of a Data Analyst
- Acquiring data
- Creating queries to extract data
- Filtering, cleaning, standardizing, and reorganizing data
- Using statistical tools to interpret data
- Using techniques to identify patterns and correlations
- Analyzing patterns and interpreting trends
- Preparing reports to communicate trends
- Creating documentation to define and demonstrate the process
Technical Skills
- Sheets/Excel
- Cognos, SPSS, Oracle Visual Analyzer, Power BI, SAS, Tableau
- R, Python, C++, Java, MATLAB
- SQL, NoSQL, RDBMS
- Data marts, warehouses, lakes and pipelines
- Hadoop, Hive, and Spark
Functional skills
- Statistics
- Analytical
- Problem Solving
- Probing (identifying, understanding, and defining the problem)
- Data visualization
- Project management
Soft skills
- Collaboration
- Communication
- Compelling stories
- Curiosity
- Attention to detail
- Intuition
- Decision making
Types of data
- Structured – Rigid schema of rows and columns; databases and spreadsheets; queried with SQL, Python, etc.
- Semi-structured – Has organizational properties but no rigid structure; e.g., email, XML, JSON
- Unstructured – Complex and qualitative; e.g., photos, videos, files, documents; often stored in NoSQL databases
Data formats
- Plain text, XML, HTML, JSON, Media files
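Most of these formats can be loaded directly with pandas; a sketch, assuming the listed files exist locally (hypothetical names; pandas.read_xml needs pandas 1.3+ and lxml):

```python
import pandas as pd

# Hypothetical file names and URL, used only to illustrate the readers
df_csv  = pd.read_csv("data.csv")      # plain-text, comma-separated values
df_json = pd.read_json("data.json")    # JSON records
df_xml  = pd.read_xml("data.xml")      # XML (pandas >= 1.3, requires lxml)
df_html = pd.read_html("https://example.com/table.html")[0]  # first HTML table on the page
```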
What is data
- Facts, observations, perceptions, numbers, characters, symbols, and pictures
File Formats – CSV, XLSX, XML, PDF, JSON
Sources of data
- APIs, web services, social platforms
- Web scraping – Downloading data from websites based on specific parameters (BeautifulSoup, Scrapy, Pandas, and Selenium; see the sketch after this list)
- Data streams – IoT devices, GPS, programs, websites, and social media (Apache Kafka, Spark, and Storm)
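A minimal web-scraping sketch with requests and BeautifulSoup; the URL and the tag structure are assumptions made purely for illustration:

```python
import requests
from bs4 import BeautifulSoup

# Hypothetical URL and page structure used for illustration only
response = requests.get("https://example.com/products")
soup = BeautifulSoup(response.text, "html.parser")

# Extract the text of every <h2> element on the page
product_names = [h2.get_text(strip=True) for h2 in soup.find_all("h2")]
print(product_names)
```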
Languages
- Query – Accessing and manipulating data - SQL
- Programming – Developing apps and controlling app behaviour - Python, R, Java
- Shell and scripting languages – Unix, Linux, Powershell
- Pandas - Data cleaning and analysis
- NumPy and SciPy - Statistical analysis
- BeautifulSoup and Scrapy - Web scraping
- Matplotlib and seaborn - Visualization
Data repositories
- Databases, data warehouses, data marts, data lakes, big data stores
- Data warehouses consolidate data through the ETL process: Extract, Transform, and Load
RDBMS
- IBM DB2, SQL Server, MySQL, ORACLE, PostgreSQL
- Cloud-based - Amazon RDS, Google Cloud SQL, IBM Db2 on Cloud, Oracle Cloud, and Azure SQL
- Use Cases
- Online transaction processing systems
- Data warehouses
- IoT solutions
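A small illustration of querying a relational table, using Python's built-in sqlite3 with an in-memory database and a hypothetical orders table so the sketch is self-contained:

```python
import sqlite3

# In-memory database with a hypothetical orders table
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, region TEXT, amount REAL)")
conn.executemany("INSERT INTO orders VALUES (?, ?, ?)",
                 [(1, "East", 120.0), (2, "West", 80.5), (3, "East", 42.0)])

# Typical analyst query: aggregate amounts per region
for region, total in conn.execute(
        "SELECT region, SUM(amount) FROM orders GROUP BY region"):
    print(region, total)
conn.close()
```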
NoSQL
- Key-value stores (Redis, Memcached, DynamoDB; see the sketch after this list)
- Document-based (MongoDB, DocumentDB, CouchDB, Cloudant)
- Column-based (Cassandra, Apache HBase)
- Graph-based (Neo4j, Cosmos DB)
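As one example of the key-value model, a sketch using the redis-py client; it assumes a Redis server running on localhost and uses hypothetical key names:

```python
import redis

# Assumes a Redis server on localhost:6379; key names are hypothetical
r = redis.Redis(host="localhost", port=6379, decode_responses=True)

r.set("session:1001", "alice")             # store a value under a key
print(r.get("session:1001"))               # retrieve it: "alice"

# A hash stores a small field-value map under one key
r.hset("user:1001", mapping={"name": "alice", "plan": "pro"})
print(r.hgetall("user:1001"))
```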
ETL (Extract, Transform, Load)
- Batch processing – Stitch, Blendo
- Stream processing – Apache Samza, Storm, and Kafka
- Load types – Initial, Incremental, Full refresh
- Load verification
ETL is a subset of the data pipeline (Apache Beam, Dataflow, Kafka), which typically feeds a data lake or application
Data marts are subsections of a data warehouse
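A minimal batch-style ETL sketch in pandas; the file names, column names, and cleaning rules are assumptions for illustration:

```python
import pandas as pd

# Extract: pull raw data from a hypothetical source file
raw = pd.read_csv("raw_orders.csv")

# Transform: clean, standardize, and enrich
transformed = (
    raw.dropna(subset=["order_id"])                                # drop records missing the key
       .assign(country=lambda d: d["country"].str.upper(),         # standardize casing
               total=lambda d: d["quantity"] * d["unit_price"])    # enrich with a derived column
)

# Load: write analysis-ready data to a destination (here, another file)
transformed.to_csv("clean_orders.csv", index=False)
```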
Big Data
- Velocity, Volume, Variety, Veracity, Value
- Velocity – Speed at which data is gathered
- Volume – Scale of data
- Variety – Diversity of data
- Veracity – Quality and origin of data
- Value – Ability to turn data into value
Big Data Processing technologies
- Hadoop – Distributed storage and processing of big data
- Hive – Data warehouse for data query and analysis built on top of Hadoop
- Spark – Distributed analytics framework for complex, real-time data analytics
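A short PySpark sketch of distributed processing; the file name and column names are assumptions, and it requires pyspark installed:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start a local Spark session; in production this would point at a cluster
spark = SparkSession.builder.appName("big-data-sketch").getOrCreate()

# Hypothetical large CSV; Spark distributes both the read and the aggregation
events = spark.read.csv("events.csv", header=True, inferSchema=True)
daily_counts = events.groupBy("event_date").agg(F.count("*").alias("events"))
daily_counts.show()

spark.stop()
```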
Data Repositories
- Databases, which can be relational or non-relational
- Data Lakes, which serve as storage repositories for all types of data
- Big Data Stores, which provide distributed computational and storage infrastructure to store, scale, and process very large data sets.
The ETL (Extract, Transform, and Load) process is an automated process that converts raw data into analysis-ready data by:
- Extracting data from source locations.
- Transforming raw data by cleaning, enriching, standardizing, and validating it.
- Loading the processed data into a destination system or data repository.
Data Pipeline, sometimes used interchangeably with ETL, encompasses the entire journey of moving data from the source to a destination data lake or application, using the ETL process.
Process for identifying data
- Determine the info you want to collect
- Define the data you want to collect and determine your data collection needs
- Make sure data governance and data security are in place
Sources for gathering data
- Databases
- Web
- Social media sites and Interactive platforms
- Sensor data
- Data exchanges (AWS Data Exchange, Crunchbase, Lotame, Snowflake)
- Surveys
- Census
- Interviews
- Observation studies
Talend, Informatica, R, and Python can be used for importing data
Data wrangling
- Discovery (Exploration)
- Transformation (Majority of time is spent here)
- Structuring (Unions and Joins; see the sketch after this list)
- Normalizing and denormalizing
- Cleaning
- Enriching data
- Validation
- Publishing
- Document all considerations and actions
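A pandas sketch of a few of these wrangling steps, joining, normalizing, and enriching; the table and column names are hypothetical:

```python
import pandas as pd

# Hypothetical source tables
orders = pd.DataFrame({"order_id": [1, 2, 3], "customer_id": [10, 11, 10], "amount": [250, 40, 90]})
customers = pd.DataFrame({"customer_id": [10, 11], "segment": ["enterprise", "smb"]})

# Structuring: join the two tables on the shared key
joined = orders.merge(customers, on="customer_id", how="left")

# Normalizing: rescale the amount column to a 0-1 range
joined["amount_scaled"] = (joined["amount"] - joined["amount"].min()) / (
    joined["amount"].max() - joined["amount"].min())

# Enriching: add a derived flag
joined["is_large_order"] = joined["amount"] > 100
print(joined)
```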
Factors for choosing a tool for data wrangling
- Supported data size
- Data structures
- Cleaning and transforming capabilities
- Infrastructure needs
- Ease of use
- Learnability
Data Cleaning
- Subset of the data wrangling process
- Inspection
- Cleaning
- Profiling
- Visualization
- Missing
- Duplicate
- Irrelevant
- Standardize
- Syntax, Typos, Format, Outliers
- Verification (re-inspect results after corrections)
- Documentation (record all changes, the reasons for them, and the resulting data quality)
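A pandas sketch of common cleaning steps, covering missing values, duplicates, formatting, and a simple outlier check; the file and column names are assumptions:

```python
import pandas as pd

df = pd.read_csv("survey.csv")   # hypothetical raw file

df = df.drop_duplicates()                               # remove duplicate records
df = df.dropna(subset=["respondent_id"])                # drop rows missing the key field
df["age"] = df["age"].fillna(df["age"].median())        # impute missing ages
df["country"] = df["country"].str.strip().str.title()   # standardize formatting

# Flag values more than 3 standard deviations from the mean as potential outliers
z = (df["income"] - df["income"].mean()) / df["income"].std()
outliers = df[z.abs() > 3]
print(len(outliers), "potential outliers flagged for review")
```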
Descriptive statistics
- Central tendency
- Dispersion
- Skewness
Inferential Statistics
- Hypothesis testing
- Confidence Intervals
- Regression analysis
Tools – SAS, SPSS, StatSoft
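Beyond these tools, Python's scipy.stats covers many inferential methods; a sketch of a hypothesis test and a confidence interval on synthetic data (illustrative only):

```python
import numpy as np
from scipy import stats

# Synthetic samples for two groups, used purely for illustration
rng = np.random.default_rng(42)
group_a = rng.normal(loc=100, scale=15, size=50)
group_b = rng.normal(loc=108, scale=15, size=50)

# Hypothesis test: do the two groups have different means?
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")

# 95% confidence interval for the mean of group_a
ci = stats.t.interval(0.95, df=len(group_a) - 1,
                      loc=group_a.mean(), scale=stats.sem(group_a))
print("95% CI for group_a mean:", ci)
```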
Data Mining
- Classification
- Clustering
- Anomaly or Outlier detection
- Association rule mining
- Sequential patterns
- Affinity grouping
- Decision trees
- Regression
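A scikit-learn sketch of one of these mining techniques, clustering, on synthetic data; the feature values are fabricated purely for illustration:

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic customer features: [annual_spend, visits_per_month]
X = np.array([[200, 2], [220, 3], [800, 10], [790, 12], [50, 1], [60, 1]])

# Cluster the customers into 3 groups
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
print("Cluster labels:", kmeans.labels_)
print("Cluster centers:\n", kmeans.cluster_centers_)
```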
Data presentation
- Who is my audience
- What is important to them
- What will help them trust me
- Structure your presentation
Data Visualization and choosing the appropriate structures
- What is the question I’m trying to answer
- What should be the key takeaway
- What does my audience want to know
- What is the relationship I’m trying to establish
- Do I need my audience to see the relationship between two values
- Do I want to find anomalies in the data
Visualization graphs
- Bar charts
- Comparing related data sets or parts of a whole
- Column charts
- Compare values side by side, change over time
- Pie charts
- Break down values into subparts and their relational proportion
- Line charts
- Trends, change over time
- Variations in data
Dashboards
- Present findings to your audience
- Present a consolidated findings report of all the data analyzed
- Make live dashboards to have an interactive story session
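A matplotlib sketch of two of the chart types above, a bar chart and a line chart, using made-up values:

```python
import matplotlib.pyplot as plt

# Made-up values, used purely for illustration
regions = ["East", "West", "North"]
revenue = [120, 95, 60]
months = ["Jan", "Feb", "Mar", "Apr"]
trend = [10, 14, 13, 18]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.bar(regions, revenue)               # bar chart: compare related values
ax1.set_title("Revenue by region")
ax2.plot(months, trend, marker="o")     # line chart: trend over time
ax2.set_title("Monthly trend")
plt.tight_layout()
plt.show()
```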