A tool I find very useful for getting a quick overview of a pandas DataFrame is the pandas-profiling library. Very often, the first command I run to get a feel for a DataFrame is either df.describe(), to view the summary statistics, or df.info(), to view all of the columns and their types. While both are certainly useful, they do not give a complete overview of the data.
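For reference, here is a minimal sketch of those two first-look commands. The toy DataFrame and its column names are made up for illustration and assume pandas is installed.

```python
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
})

# Summary statistics (count, mean, std, min, quartiles, max).
# By default only the numeric columns are summarised.
summary = df.describe()
print(summary)

# Column names, non-null counts and dtypes, printed to stdout.
df.info()
```

Note that the text column is left out of the default describe() output, which is one reason these two commands alone give only a partial picture.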
If you have ever worked with or learned Hadoop, Spark or other ‘big data’ systems, the very first program you would have encountered is Word Count: counting the number of occurrences of every word in a document. The reason is that this program is the running example in the MapReduce paper from Google.
Before explaining how the ‘Word Count’ program works in a distributed system, I usually demonstrate it in simple Python code.
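A minimal single-machine sketch of Word Count might look like the following. This is an illustrative version rather than the exact code from the post. The function name `word_count` is my own, and it assumes that words are separated by whitespace and that case should be ignored.

```python
from collections import Counter


def word_count(text):
    """Count how often each whitespace-separated word occurs in text."""
    # Assumption: lowercase everything so "The" and "the" count as one word.
    words = text.lower().split()
    return Counter(words)


counts = word_count("the quick brown fox jumps over the lazy dog the end")

# Print each word with its count, most frequent first.
for word, n in counts.most_common():
    print(word, n)
```

Punctuation handling is deliberately left out here. A real document would need some tokenisation step before counting.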
Hi all, I am a Data Scientist by profession and love all things data. I am writing this blog as notes to my future self.
I will be writing about things I have learnt and am still learning. I like to think of myself as a Data Crazy Scientist, so most of it will be about Data, Machine Learning, Deep Learning and Visualizations. I will also showcase some of the projects that I think would interest people.