Wednesday, April 04, 2018

Pandas snippets

Here are some snippets that come in handy when cleaning data with pandas. I found them useful while completing the coursework for a Python data science course.

Extract a subset of columns from the dataframe based on a regular expression:
Code:
import pandas as pd

# One Series per person; the keys become the column labels.
persona1 = pd.Series({
    'Last Post On': '02/04/2017',
    'Friends-2015': 10,
    'Friends-2016': 20,
    'Friends-2017': 300
})

persona2 = pd.Series({
    'Last Post On': '02/04/2018',
    'Friends-2015': 100,
    'Friends-2016': 240,
    'Friends-2017': 560
})

persona3 = pd.Series({
    'Last Post On': '02/04/2014',
    'Friends-2015': 120,
    'Friends-2016': 120,
    'Friends-2017': 120
})

df = pd.DataFrame([persona1, persona2, persona3],
                  index=['Chris', 'Bella', 'Laura'])

# Keep only the columns whose label contains "Friends-" followed by four digits.
# The raw string stops Python from treating \d as a string escape.
df.filter(regex=r"Friends-\d{4}")

Output:
       Friends-2015  Friends-2016  Friends-2017
Chris            10            20           300
Bella           100           240           560
Laura           120           120           120
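
One detail worth knowing: `filter` matches labels with `re.search`, so the pattern can match anywhere inside a column name. The sketch below shows how anchoring the pattern changes the result. The column `Old-Friends-2015-estimate` is a made-up label that is not part of the data above; it is there only to show the difference.

```python
import pandas as pd

# Hypothetical frame: the last column is invented for illustration.
df = pd.DataFrame(
    {
        'Last Post On': ['02/04/2017', '02/04/2018'],
        'Friends-2015': [10, 100],
        'Friends-2016': [20, 240],
        'Old-Friends-2015-estimate': [7, 90],
    },
    index=['Chris', 'Bella'],
)

# Unanchored: re.search also finds "Friends-2015" inside the longer label.
loose = df.filter(regex=r'Friends-\d{4}')

# Anchored with ^ and $: only labels that are exactly "Friends-YYYY".
strict = df.filter(regex=r'^Friends-\d{4}$')

# The same selection written as a boolean mask over the column labels.
# str.match anchors at the start of each label; $ anchors the end.
masked = df.loc[:, df.columns.str.match(r'Friends-\d{4}$')]

print(list(loose.columns))
print(list(strict.columns))
```

With the column names used in the post the anchors make no difference, but they are a cheap safeguard once a frame picks up similarly named columns.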

Set a column based on the value of both the current row and adjacent rows:

For this example, a gym regular is anyone who went to the gym at least three months in a row last year:
Code:
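import datetime

import pandas as pd

# Twelve months for each of three people; nobody has visited yet.
df = pd.DataFrame({'Month':
                   [datetime.date(2008, i, 1).strftime('%B')
                    for i in range(1, 13)] * 3,
                   'visited': [False] * 36},
                  index=['Alice'] * 12 +
                        ['Bob'] * 12 +
                        ['Bridgett'] * 12)

# Move the names out of the index into a column called 'index'.
df = df.reset_index()

def make_regular(r, name):
    # Mark February through April as visited for the given person.
    r['visited'] = (r['visited'] or
                    ((r['index'] == name) and
                     r['Month'] in ('February', 'March', 'April')))
    return r

df = df.apply(make_regular, axis=1, args=('Alice',))
df = df.apply(make_regular, axis=1, args=('Bob',))

# A row starts a three-month run when it and the next two rows are visited.
# shift(-1) and shift(-2) line the following rows up with the current one;
# the rows shifted in at the end are NaN, so they compare unequal to True.
regular = ((df['visited'] == True) &
           (df['visited'].shift(-1) == True) &
           (df['visited'].shift(-2) == True))
df.loc[regular, 'index'].tolist()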

Output:
['Alice', 'Bob']
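
A caveat with the snippet above: `shift` looks at the next rows in the frame no matter who they belong to, so one person's last months can combine with the next person's first months. That does not happen with the data above, but it can with other visit patterns. Here is a minimal sketch of one way around it, shifting inside each person's group. The data, the `name` column, and the person `Cara` are all made up for illustration.

```python
import pandas as pd

months = ['January', 'February', 'March', 'April', 'May', 'June']

# Hypothetical data: only Alice has three visits in a row. Bob's two visits
# sit at the end of his block, directly before Cara's single visit.
visits = pd.DataFrame({
    'name': ['Alice'] * 6 + ['Bob'] * 6 + ['Cara'] * 6,
    'Month': months * 3,
    'visited': [False, True, True, True, False, False,    # Alice
                False, False, False, False, True, True,   # Bob
                True, False, False, False, False, False], # Cara
})

v = visits['visited']

# Plain shift: the lookahead runs across the boundary between people.
naive = v & v.shift(-1).eq(True) & v.shift(-2).eq(True)

# Grouped shift: the lookahead stays within each person's own rows.
g = visits.groupby('name')['visited']
grouped = v & g.shift(-1).eq(True) & g.shift(-2).eq(True)

naive_names = visits.loc[naive, 'name'].unique().tolist()
grouped_names = visits.loc[grouped, 'name'].unique().tolist()

print(naive_names)    # Bob is flagged by mistake
print(grouped_names)  # only Alice remains
```

The grouped version assumes each person's rows are already in month order, as they are here.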

