Choosing between HDInsight and third-party Hadoop distributions on IaaS, plus other Hadoop ecosystem decisions.
Note: This guidance is a work in progress.
###Questions to think about:
- When should you strongly consider using HDInsight instead of managing your own Hadoop cluster?
- Are you planning to run your Hadoop cluster in the cloud, on-premises, or in a hybrid of both?
- Will your Hadoop cluster be always-on, or will you use it periodically?
- Where will you store the data (e.g. on-premises, in Azure Blob storage, in another cloud)?
- What are your performance targets (e.g. data volume, number of nodes)?
- What are your security requirements?
- Are you coming from a Windows or Linux background?
- If you want to run and manage your own Hadoop cluster, how do you decide which Hadoop distribution to use (e.g. Cloudera, Hortonworks, or MapR)?
- What are the key differentiators between Hadoop distributions? For example, MapR offers its own file system.
- When should you use Hive, and when should you look into Spark?