A tool I find very useful for getting a quick overview of a pandas DataFrame is the pandas-profiling library. Very often, the first commands I run to get a feel for a DataFrame are df.describe(), to view summary statistics, and df.info(), to view all of the columns and their types. While both are certainly useful, neither gives a complete overview of the data.
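To make the contrast concrete, here is a minimal sketch using a small made-up DataFrame: describe() and info() each show only a slice of the picture, while pandas-profiling (shown commented out, since it is an extra dependency) produces a full HTML report with per-column statistics, correlations, and missing-value summaries.

```python
import pandas as pd

# A small hypothetical DataFrame for illustration
df = pd.DataFrame({
    "age": [25, 32, 47, None],
    "city": ["NYC", "SF", "NYC", "LA"],
})

# Summary statistics: numeric columns only, no dtypes or missing-value view
print(df.describe())

# Column names, non-null counts, and dtypes: no distributions or statistics
df.info()

# pandas-profiling covers all of the above in one HTML report:
# from pandas_profiling import ProfileReport
# ProfileReport(df, title="Overview").to_file("report.html")
```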
If you have ever worked with or learned Hadoop, Spark, or another 'big data' system, the first program you likely encountered is Word Count: counting the number of occurrences of each word in a document. The reason is that this program is used as the running example throughout Google's MapReduce paper.
Before explaining how the ‘Word Count’ program works in a distributed system, I usually demonstrate it in simple Python code.
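A simple single-machine version looks something like the sketch below (the tokenization here is naive whitespace splitting with lowercasing; a real demo might handle punctuation as well):

```python
from collections import Counter

def word_count(text):
    # Normalize case and split on whitespace
    words = text.lower().split()
    # Counter tallies occurrences of each word
    return Counter(words)

doc = "the quick brown fox jumps over the lazy dog the end"
counts = word_count(doc)
print(counts["the"])  # "the" occurs 3 times
```

The map and reduce phases of the distributed version correspond directly to the two steps here: splitting the text into words (map) and tallying the occurrences (reduce).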