Why would I blog?

Plus some bonus Python coding

The main reason I’m starting this blog is that I’ve been hesitant to do so. I prefer not to share too much information, so this is a sort of personal challenge. Also, Jeremy and Rachel over at FastAI are adamant about blogging as a means of learning deep learning (or any topic, for that matter). I agree with this perspective, despite maintaining that the idea of blogging is better than actually doing it. I’ll keep tabs on this perspective over time.

I’m going to do things a little differently, in my own style, and will probably break a bunch of blogging guidelines. I’m going to be mixing thoughts and ideas with data science code snippets and projects. While I have some ideas for upcoming posts, I am also likely to interject random tutorials on things that I find useful.

The goal of this isn’t to make money or anything like that. It is simply to get comfortable being uncomfortable in posting content and sharing thoughts, ideas and projects. The personal gain I am seeking will be the product of committing to the habit and practicing it.

So with that, let’s go through a little Python code that I find to be indispensable: comprehensions. I find myself using list and dictionary comprehensions in all manner of code I write in Python, for instance when I need to filter some data based on a condition, manipulate dictionary keys and values, or update Pandas DataFrame column names. I’ll show those examples below, in contrast with some other methods.

Comprehensions

Perform calculations on a list

First we need a list to work with, which is easy enough to create in Python:

list_1 = list(range(25))
print(list_1)

[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24]

Ok, so we have a list. Let’s find the odd numbers using a loop:

odd_nums = []
for item in list_1:
    if item%2 != 0:
        odd_nums.append(item)
print(odd_nums)

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]

Well, that works, but it is kind of long-winded. A list comprehension can shorten it up for us:

odd_nums_lc = [item for item in list_1 if item%2 != 0]
print(odd_nums_lc)

[1, 3, 5, 7, 9, 11, 13, 15, 17, 19, 21, 23]
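
A comprehension can also transform items while it filters them, which is closer to a calculation than the filter-only example above. A minimal sketch, where odd_squares is a name introduced only for this illustration:

```python
list_1 = list(range(25))

# the expression before `for` is applied to every item that passes the `if` filter
odd_squares = [item ** 2 for item in list_1 if item % 2 != 0]
print(odd_squares)
```

The filter still decides which items are kept; the leading expression decides what goes into the new list.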

Similar methods also work for nested lists. It should be noted, however, that comprehensions can be harder to read, so heavily nested statements might be better split up or rewritten with a loop or helper function.
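
As a rough sketch of the nested case, here are two small variations on a made-up nested list (nested, flat and odds_by_row are names assumed just for this example):

```python
# a small nested list, made up for this example
nested = [[1, 2, 3], [4, 5], [6]]

# flatten: the `for` clauses read left to right, like nested loops
flat = [item for row in nested for item in row]
print(flat)

# keep the nesting, but filter inside each row with an inner comprehension
odds_by_row = [[item for item in row if item % 2 != 0] for row in nested]
print(odds_by_row)
```

Two levels like this are still readable; beyond that, a plain loop or a helper function is probably kinder to whoever reads the code next.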

Getting some corpus - body of text - into a dictionary of word counts

Well, we need a dict, so let’s take some text and do a word count. We will use the first 2 paragraphs from the Wikipedia entry for Gandalf:

gandalf = """
Gandalf is a protagonist in J. R. R. Tolkien's novels The Hobbit and The Lord of the Rings. 
He is a wizard, one of the Istari order, and the leader of the Fellowship of the Ring. Tolkien took the 
name "Gandalf" from the Old Norse "Catalogue of Dwarves" (Dvergatal) in the Völuspá.

As a wizard and the bearer of one of the Three Rings, Gandalf has great power, but works mostly by 
encouraging and persuading. He sets out as Gandalf the Grey, possessing great knowledge and 
travelling continually. Gandalf is focused on the mission to counter the Dark Lord Sauron by 
destroying the One Ring. He is associated with fire; his ring of power is Narya, the Ring of 
Fire. As such, he delights in fireworks to entertain the hobbits of the Shire, while in great 
need he uses fire as a weapon. As one of the Maiar, he is an immortal spirit from Valinor, 
but his physical body can be killed.
"""

gandalf

'\nGandalf is a protagonist in J. R. R. Tolkien\'s novels The Hobbit and The Lord of the Rings. \nHe is a wizard, one of the Istari order, and the leader of the Fellowship of the Ring. Tolkien took the \nname "Gandalf" from the Old Norse "Catalogue of Dwarves" (Dvergatal) in the Völuspá.\n\nAs a wizard and the bearer of one of the Three Rings, Gandalf has great power, but works mostly by \nencouraging and persuading. He sets out as Gandalf the Grey, possessing great knowledge and \ntravelling continually. Gandalf is focused on the mission to counter the Dark Lord Sauron by \ndestroying the One Ring. He is associated with fire; his ring of power is Narya, the Ring of \nFire. As such, he delights in fireworks to entertain the hobbits of the Shire, while in great \nneed he uses fire as a weapon. As one of the Maiar, he is an immortal spirit from Valinor, \nbut his physical body can be killed.\n'

Well, as is, we have some work to do on the text to clean it up, as is often required. So let’s do that. We need to strip special characters, quotes, and some other things.

Steps:
- use a regular expression (regex) to strip out punctuation and newlines
- convert everything to lowercase
- split the result into a list of words

import re

gandalf_filtered = re.sub(r'[^\w\s]|\n', '', gandalf).lower().split(' ')
print(gandalf_filtered)

['gandalf', 'is', 'a', 'protagonist', 'in', 'j', 'r', 'r', 'tolkiens', 'novels', 'the', 'hobbit', 'and', 'the', 'lord', 'of', 'the', 'rings', 'he', 'is', 'a', 'wizard', 'one', 'of', 'the', 'istari', 'order', 'and', 'the', 'leader', 'of', 'the', 'fellowship', 'of', 'the', 'ring', 'tolkien', 'took', 'the', 'name', 'gandalf', 'from', 'the', 'old', 'norse', 'catalogue', 'of', 'dwarves', 'dvergatal', 'in', 'the', 'völuspáas', 'a', 'wizard', 'and', 'the', 'bearer', 'of', 'one', 'of', 'the', 'three', 'rings', 'gandalf', 'has', 'great', 'power', 'but', 'works', 'mostly', 'by', 'encouraging', 'and', 'persuading', 'he', 'sets', 'out', 'as', 'gandalf', 'the', 'grey', 'possessing', 'great', 'knowledge', 'and', 'travelling', 'continually', 'gandalf', 'is', 'focused', 'on', 'the', 'mission', 'to', 'counter', 'the', 'dark', 'lord', 'sauron', 'by', 'destroying', 'the', 'one', 'ring', 'he', 'is', 'associated', 'with', 'fire', 'his', 'ring', 'of', 'power', 'is', 'narya', 'the', 'ring', 'of', 'fire', 'as', 'such', 'he', 'delights', 'in', 'fireworks', 'to', 'entertain', 'the', 'hobbits', 'of', 'the', 'shire', 'while', 'in', 'great', 'need', 'he', 'uses', 'fire', 'as', 'a', 'weapon', 'as', 'one', 'of', 'the', 'maiar', 'he', 'is', 'an', 'immortal', 'spirit', 'from', 'valinor', 'but', 'his', 'physical', 'body', 'can', 'be', 'killed']
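
One thing to notice in the output above is 'völuspáas': the paragraph break after the word Völuspá has no space next to it, so deleting the newlines fuses two words together. A possible variation, sketched here on a short excerpt rather than the full text, is to leave whitespace alone in the regex and let split() with no arguments handle it. The names sample and sample_words are assumptions for this sketch only.

```python
import re

# short excerpt around the paragraph break, standing in for the full text
sample = 'in the Völuspá.\n\nAs a wizard and the bearer'

# strip punctuation only; split() with no argument splits on any run of whitespace
sample_words = re.sub(r'[^\w\s]', '', sample).lower().split()
print(sample_words)
```

This prints ['in', 'the', 'völuspá', 'as', 'a', 'wizard', 'and', 'the', 'bearer']. Applied to the full text, the counts for those two words would differ slightly from the ones shown in the rest of this post.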

Ok, so we have this list of words. Now what? Well, to get the word counts we have a few options:
- Cheat and use collections.Counter
- Use a for loop

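In either case we can get our word counts into a dict. A dictionary comprehension isn’t a great fit here: each count depends on what has already been tallied, and a comprehension cannot look up the dict it is still building, so the workaround of re-counting the list for every word isn’t super efficient. An alternative is using collections.defaultdict.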

# method 1: collections.Counter
from collections import Counter

gandalf_wc_v1 = Counter(gandalf_filtered)
print(gandalf_wc_v1)

Counter({'the': 20, 'of': 11, 'is': 6, 'he': 6, 'gandalf': 5, 'and': 5, 'a': 4, 'in': 4, 'one': 4, 'ring': 4, 'as': 4, 'great': 3, 'fire': 3, 'r': 2, 'lord': 2, 'rings': 2, 'wizard': 2, 'from': 2, 'power': 2, 'but': 2, 'by': 2, 'to': 2, 'his': 2, 'protagonist': 1, 'j': 1, 'tolkiens': 1, 'novels': 1, 'hobbit': 1, 'istari': 1, 'order': 1, 'leader': 1, 'fellowship': 1, 'tolkien': 1, 'took': 1, 'name': 1, 'old': 1, 'norse': 1, 'catalogue': 1, 'dwarves': 1, 'dvergatal': 1, 'völuspáas': 1, 'bearer': 1, 'three': 1, 'has': 1, 'works': 1, 'mostly': 1, 'encouraging': 1, 'persuading': 1, 'sets': 1, 'out': 1, 'grey': 1, 'possessing': 1, 'knowledge': 1, 'travelling': 1, 'continually': 1, 'focused': 1, 'on': 1, 'mission': 1, 'counter': 1, 'dark': 1, 'sauron': 1, 'destroying': 1, 'associated': 1, 'with': 1, 'narya': 1, 'such': 1, 'delights': 1, 'fireworks': 1, 'entertain': 1, 'hobbits': 1, 'shire': 1, 'while': 1, 'need': 1, 'uses': 1, 'weapon': 1, 'maiar': 1, 'an': 1, 'immortal': 1, 'spirit': 1, 'valinor': 1, 'physical': 1, 'body': 1, 'can': 1, 'be': 1, 'killed': 1})

# method 2: for loop

gandalf_wc_v2 = {}
# Count number of times each word comes up in list of words (in dictionary)
for w in gandalf_filtered:
    if w not in gandalf_wc_v2:
        gandalf_wc_v2[w] = 1
    else:
        gandalf_wc_v2[w] += 1

print(gandalf_wc_v2)

{'gandalf': 5, 'is': 6, 'a': 4, 'protagonist': 1, 'in': 4, 'j': 1, 'r': 2, 'tolkiens': 1, 'novels': 1, 'the': 20, 'hobbit': 1, 'and': 5, 'lord': 2, 'of': 11, 'rings': 2, 'he': 6, 'wizard': 2, 'one': 4, 'istari': 1, 'order': 1, 'leader': 1, 'fellowship': 1, 'ring': 4, 'tolkien': 1, 'took': 1, 'name': 1, 'from': 2, 'old': 1, 'norse': 1, 'catalogue': 1, 'dwarves': 1, 'dvergatal': 1, 'völuspáas': 1, 'bearer': 1, 'three': 1, 'has': 1, 'great': 3, 'power': 2, 'but': 2, 'works': 1, 'mostly': 1, 'by': 2, 'encouraging': 1, 'persuading': 1, 'sets': 1, 'out': 1, 'as': 4, 'grey': 1, 'possessing': 1, 'knowledge': 1, 'travelling': 1, 'continually': 1, 'focused': 1, 'on': 1, 'mission': 1, 'to': 2, 'counter': 1, 'dark': 1, 'sauron': 1, 'destroying': 1, 'associated': 1, 'with': 1, 'fire': 3, 'his': 2, 'narya': 1, 'such': 1, 'delights': 1, 'fireworks': 1, 'entertain': 1, 'hobbits': 1, 'shire': 1, 'while': 1, 'need': 1, 'uses': 1, 'weapon': 1, 'maiar': 1, 'an': 1, 'immortal': 1, 'spirit': 1, 'valinor': 1, 'physical': 1, 'body': 1, 'can': 1, 'be': 1, 'killed': 1}
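
For comparison, here is a minimal sketch of the collections.defaultdict alternative mentioned earlier, next to the dict comprehension workaround. The short words list is made up for the example and stands in for gandalf_filtered; wc and wc_comp are hypothetical names.

```python
from collections import defaultdict

# a few words standing in for gandalf_filtered
words = ['the', 'ring', 'of', 'fire', 'the', 'ring', 'the']

# int() returns 0, so a missing key starts at zero and the if/else goes away
wc = defaultdict(int)
for w in words:
    wc[w] += 1
print(dict(wc))

# the dict comprehension version works, but rescans the whole list once per distinct word
wc_comp = {w: words.count(w) for w in set(words)}
print(wc_comp == dict(wc))
```

The defaultdict loop makes a single pass over the words, which is why it tends to scale better than the comprehension on a larger corpus.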

I will use either method, but tend to prefer writing less code that I have to maintain. If there is a helper class or function that is in a stable release of a library, it makes life easier to use it.

Because it doesn’t matter much which one we use for our example, we’ll just grab gandalf_wc_v2 and get the top N counts among words of at least a certain length. There are a ton of ways to do this; we will just use plain Python.

With the collections.Counter from method 1, the top counts are only one more line away, although that call does not apply the word-length filter: gandalf_wc_v1.most_common(n=5)

But that is too easy, so let’s do it longhand:

# key-value pairs where the key (word) length is >= 3
gandalf_wl_geq_3 = {k:v for k,v in gandalf_wc_v2.items() if len(k) >= 3}
# top N counts
n = 5
top_n = sorted(gandalf_wl_geq_3.values(), reverse=True)[:n]
#[20, 5, 5, 4, 4]

# finally, a fun use of a dict comp:
{k:v for k,v in gandalf_wl_geq_3.items() if v in top_n}

{'gandalf': 5, 'the': 20, 'and': 5, 'one': 4, 'ring': 4}
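
One caveat with filtering on values like this: if several words tie on a count that appears in top_n, the result can hold more than n entries. It happens to work out to exactly five here. A sketch of one alternative is to sort the (word, count) pairs and slice; the small counts dict below is made up to force a tie and is not taken from the text.

```python
# made-up counts with a three-way tie at 4, standing in for gandalf_wl_geq_3
counts = {'the': 20, 'gandalf': 5, 'and': 5, 'one': 4, 'ring': 4, 'fire': 4}
n = 5

# filtering on values keeps every word tied with one of the top-n counts
top_vals = sorted(counts.values(), reverse=True)[:n]
by_value = {k: v for k, v in counts.items() if v in top_vals}
print(len(by_value))

# sorting the pairs by count and slicing never returns more than n entries
top_items = dict(sorted(counts.items(), key=lambda kv: kv[1], reverse=True)[:n])
print(top_items)
```

Which behavior is right depends on whether tied words should all be kept or the result should be capped at n.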

Last but perhaps one of the best: comprehensions on Pandas DataFrame columns

Let’s create a DataFrame with some sample data. To make it interesting, I’ll use some comprehensions and other native Python capabilities to create a dataset for this example, instead of using Iris or Housing.

import pandas as pd
import numpy as np

yrs = range(2018,2022,1)
cities = 'Bozeman, MT', 'Spokane, WA', 'Bangor, ME', 'White Plains, NY', 'Sedona, AZ'
cat1 = ['wizard', 'ranger', 'elf', 'hutt', 'orc', 'nazgul', 'numenorean', 'deciever']

colnames = 'year','city','role','points'
ds = [[y,c,k, np.random.randint(-10, 10)] for k in cat1 for c in cities for y in yrs]

df = pd.DataFrame(ds, columns=colnames)
display(df.head(3), df.shape)

	year	city	role	points
0	2018	Bozeman, MT	wizard	3
1	2019	Bozeman, MT	wizard	-9
2	2020	Bozeman, MT	wizard	-8

(160, 4)

So we have some meaningless data. Now let’s group by city and role to get the MultiIndex columns that we want to manipulate, which is the whole point of this little exercise.

dfg = df.groupby(by=['city', 'role']).agg({'points': ['mean', 'sum', 'std']}).reset_index()
dfg.head(3)

	city	role	points
			mean	sum	std
0	Bangor, ME	deciever	4.50	18	3.872983
1	Bangor, ME	elf	0.75	3	7.889867
2	Bangor, ME	hutt	-1.25	-5	5.560276

So this isn’t super useful for anything. Normally a dataset for machine learning or whatnot will have a much broader set of columns (a.k.a. features), but the concept is pretty much the same. Just remember to work on manageable chunks and don’t get intimidated by a long chain of transformations.

print(f'Initial column list: {dfg.columns}')

Initial column list: MultiIndex([(  'city',     ''),
            (  'role',     ''),
            ('points', 'mean'),
            ('points',  'sum'),
            ('points',  'std')],
           )

So first I’ll demonstrate updating this with a list comprehension. This is a slightly more complex list comprehension in that it incorporates a conditional '_'.join in the output. Basically, we look at each tuple in the MultiIndex, and if the last item is an empty string, we underscore-join all but the last element; otherwise we join the whole thing.

dfg_copy1 = dfg.copy()
dfg_copy1.columns = ['_'.join(list(x) if len(x[-1]) > 0 else x[:-1]) for x in dfg_copy1.columns]
dfg_copy1.head(2)

	city	role	points_mean	points_sum	points_std
0	Bangor, ME	deciever	4.50	18	3.872983
1	Bangor, ME	elf	0.75	3	7.889867

There is another way, however, that is pretty clever. It is more of a functional style and I’m fairly certain I’ve seen it used in the Fast AI course or notebooks in addition to numerous tutorials on the internet. We will be using the map function over our columns.

dfg_copy2 = dfg.copy()
dfg_copy2.columns = dfg_copy2.columns.map(lambda x: '_'.join(x) if x[-1] != '' else x[0])

dfg_copy2.head(3)

	city	role	points_mean	points_sum	points_std
0	Bangor, ME	deciever	4.50	18	3.872983
1	Bangor, ME	elf	0.75	3	7.889867
2	Bangor, ME	hutt	-1.25	-5	5.560276

This is really quick and easy when there aren’t any weird conditions (e.g. you can just use {dataframe_name}.columns.map('_'.join)). Above, however, only some columns have a second level, so handling the city and role columns differently is useful for readability. So there you have it, a bonus method of map + lambda to achieve the same goal as a list comprehension.
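
To see why the condition matters, here is a small plain-Python sketch that uses ordinary tuples as a stand-in for the MultiIndex entries printed earlier. The names cols, joined_all and joined_clean are assumptions for this illustration, and no pandas is needed to run it.

```python
# plain tuples standing in for the MultiIndex entries shown earlier
cols = [('city', ''), ('role', ''), ('points', 'mean'), ('points', 'sum'), ('points', 'std')]

# an unconditional join leaves a trailing underscore where the second level is empty
joined_all = list(map('_'.join, cols))
print(joined_all)

# one way around it: strip the trailing underscore after joining
joined_clean = ['_'.join(c).rstrip('_') for c in cols]
print(joined_clean)
```

The first print gives 'city_' and 'role_', which is the readability problem the lambda avoids. Note that rstrip('_') would also trim a column name that legitimately ends in an underscore, so it is a shortcut rather than a general fix.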

Well, for now that is plenty of information. Hopefully this helps with a future task of yours, regardless of your proficiency in technical terminology.