Skip to content

Latest commit

 

History

History
 
 

Folders and files

NameName
Last commit message
Last commit date

parent directory

..
 
 
 
 
 
 
 
 
 
 
 
 

README.rst

Working with Data Part 1: Spreadsheets

Python Tuesday: Session 3

Date: 2019-09-24
Author: Gábor Nyers
tags:python
category:python_workshop
summary:Reading and writing data to spreadsheets
licence:CC BY-NC 4.0 https://creativecommons.org/licenses/by-nc/4.0/

Spreadsheets and databases permeate our business lives. Databases usually contain structured data and often we use use dedicated applications to interact with them. Spreadsheets are used to do ad-hoc processing or simply to compensate for missing features of aforementioned applications.

This is Part 1 of a (potentially) series of sessions where we will focus on how Python can be useful when dealing with spreadsheets. Our goal is to live-code a simple Python application to read from- and write data to spreadsheets.

As we will build upon the previous topics, you might want to review the earlier sessions:

Visit this page: https://github.com/gnyers/python-tuesday/tree/master/session3

OR

Check out the Git repo:

cd $PROJECTDIR
git clone git@github.com:gnyers/python-tuesday.git

To follow along with the instructions please open the README.html in your browser:

firefox $PROJECTDIR/python-tuesday/session3/README.html

The code of this workshop has been tested with:

  • Python v3.6 and

  • the following modules that are not part of the Python Standard Library:

    • openpyxl: module to read/write Excel 2010 workbooks (OOXML format)
    • ptpython: a very conveniant Python interactive shell with support for in-line sytax highlighting, command completion and improved code editing

    To install these modules execute:

    pip install --user -r requirements.txt

This chapter will discuss the basics of working with Excel .xlsx files.

Python provides numerous modules to deal with both Excel and LibreCalc spreadsheets. Because LibreOffice support Python as macro language, users have much more advanced capabilities. (See for more: [libreoffice_automation])

Therefore, in this workshop our focus is on dealing with Excel spreadsheets. For this we have different options as wel: (Source: [PythonExcel])

  • openpyxl: The recommended package for reading and writing Excel 2010 files (ie: .xlsx)
  • xlsxwriter: An alternative package for writing data, formatting information and, in particular, charts in the Excel 2010 format (ie: .xlsx)
  • xlrd: This package is for reading data and formatting information from older Excel files (ie: .xls)
  • xlwt: This package is for writing data and formatting information to older Excel files (ie: .xls)

Following the openpyxl Tutorial ([openpyxl_tutorial]) let's create a spreadsheet in Excel 2010 format:

1 >>> import openpyxl
2 >>> wb = openpyxl.Workbook()           # create a new workbook
3 >>> ws1 = wb.active                    # get the active sheet
4 >>> ws1.title = "MySheet"              # rename the current sheet

Create a new worksheet:

5 >>> ws2 = wb.create_sheet('TimeReg', 0)          # create a new sheet
6 >>> ws2.sheet_properties.tabColor = "aa00bb"     # change tab color
7 >>> ws1.sheet_properties.tabColor = "00bbaa"     # change the other sheet's color
8 >>> wb.save('sandbox/test00.xlsx')               # save the workbook to file

The sheet now should look something like this:

A simple workbook created by openpyxl

The spreadsheet created by the openpyxl module. Note the color of the worksheets.

Now let's put some data into the sheet "TimeReg":

 9 ### Get a sheet by its name
10 >>> ws_tr = wb['TimeReg']
11 
12 ### Fill the first row with some data
13 >>> for i in range(1, 11):
14 ...     ws_tr.cell(row=1, column=i, value='Col{}'.format(i))
15 ...
16 ...                                    # doctest: +ELLIPSIS
17 <Cell 'TimeReg'.A1>
18 <Cell 'TimeReg'.B1>
19 <Cell 'TimeReg'.C1>
20 ...
21 
22 ### Save the workbook to file
23 >>> wb.save('sandbox/test01.xlsx')

The sheet now should look something like this:

Workbook with some data

The "TimeReg" sheet now has some data.

Let's load the data from the names.csv file into our sheet. The content of the file is as follows:

The code:

20 ### We'll read from a csv file
21 >>> import csv
22 >>> with open('../session2/names.csv') as fh:
23 ...     data = list(csv.reader(fh, delimiter='|'))
24 
25 ### Add a new sheet 'Names' to load the data into
26 >>> ws3 = wb.create_sheet('Names')
27 
28 ### Activate the new sheet when the workbook loads
29 >>> wb.active = ws3
30 
31 ### Add data to sheet
32 >>> for row in data: ws3.append(row)
33 ...                                    # doctest: +ELLIPSIS
34 
35 ### Save the workbook to file
36 >>> wb.save('sandbox/test02.xlsx')

The result:

Workbook with some data

The 'Names' sheet now contains the CSV data. Note: multiple columns appear as too narrow; this is

Formatting of cells can be done either directly or by applying a style. In general using styles is preferable.

37 >>> from openpyxl.styles import NamedStyle, Font, Border
38 >>> from openpyxl.styles.borders import Side
39 >>> h1 = openpyxl.styles.NamedStyle(name='h1')
40 
41 ### Font style: bold, 18pt "Arial" of a nice color;
42 >>> h1.font = Font('Arial', sz=18, b=True, color='aa00bb')
43 
44 ### Border style:
45 >>> b_med = Side(color='aa00bb', border_style='medium' )
46 >>> h1.border = Border(bottom=b_med)
47 
48 ### Apply the style to the 1st row
49 ### Note: numbering of cells starts with 1!
50 >>> for i in range(1, 6):
51 ...     ws3.cell(row=1, column=i).style = h1
52 ...                                    # doctest: +ELLIPSIS
53 >>> wb.save('sandbox/test05.xlsx')     # save the workbook to file

This will result in the following:

Custom sytle

The first row of the "TimeReg" sheet has the custom style "h1". Note that graphical design is not a primary concern at this point ;-)

Comments can contain useful auxiliary information about a cell, such as instructions to the user or ways to verify the correctness of the data.

 8 >>> from openpyxl.comments import Comment
 9 >>> ws3['A1'].comment = Comment('Data imported from ``names.csv``', 'John Doe')
10 >>> wb.save('sandbox/test06.xlsx')     # save the workbook to file
Comment added to cell "A1"

The cell "A1" now contains a comment

Our goal is to calculate the total number of billable and non-billable hours based on the data of the following timesheet:

Comment added to cell "A1"

Example timesheet containing data

First let's load the data from the spreadsheet: (Please type the following in an interactive Python session)

 1 ### load the ``openpyxl`` module
 2 >>> import openpyxl
 3 
 4 ### open the workbook
 5 >>> wb = openpyxl.load_workbook('sandbox/timesheet.xlsx')
 6 
 7 ### select the sheet containing the required data
 8 >>> ws = wb['TimeSheet']
 9 
10 ### select the cells that contain the data; this obviously implies that you
11 ### need to be familiar with the structure of the workbook
12 >>> data_range = ws['A5:G40']
13 
14 ### let's retrieve the data with the following list comprehension
15 >>> timesheet = [ [ cell.value  for cell in row  ]  for row in data_range ]
16 >>> print(timesheet)                              # doctest: +ELLIPSIS
17 [['Date', 'ProjectID', 'ActivityID', 'Billable Hours', 'Non-Billable Hours', 'RemarkPub', 'RemarkPriv'], [datetime.datetime(2019, 9, 1, 0, 0)...

We're now ready to calculate the total number of billable hours from the range 'D5:D40':

18 ### select the cells which contain the billable hours
19 >>> range_billable = ws['D5:D40']
20 
21 ### since the range is 2D data structure, we'll need a nested loop to
22 ### process the data
23 ###
24 ### with the following list comprehension we create a list of hours
25 >>> billable_hours = [ cell.value  for row in range_billable[1:]
26 ...                                for cell in row if cell.value ]
27 >>> print(billable_hours)
28 [8, 5, 8, 8, 8, 8, 5, 8, 8, 8, 8, 5, 8, 8, 8, 8, 5, 8, 8, 8, 8]
29 
30 ### the total of the billable hours
31 >>> print(sum(billable_hours))
32 156

In a similar manner, we'll calculate the total of non-billable hours based on the data in the range 'E5:E40':

33 >>> range_non_billable = ws['E5:E40']
34 >>> non_billable_hours = [ cell.value  for row in range_non_billable[1:]
35 ...                                    for cell in row if cell.value ]
36 >>> print(non_billable_hours)
37 [2, 3, 3, 3, 2, 2, 3, 3, 3, 2, 2, 3, 3, 3, 2, 2, 3, 3, 3, 2, 2]
38 >>> total_non_billable = sum(non_billable_hours)
39 >>> print(total_non_billable)
40 54

Suppose we're interested in the dates when the number of billable hours was below 8:

41 >>> r = ws['A6:D40']
42 
43 ### Let's write up the query in a more understandable form
44 ### NOTE: the <1>...<5> indicate the order in which Python executes the
45 ###       statements
46 >>> result = [
47 ...  [ cell.value                   # <5> take the current cell's value
48 ...    for cell in row[::3]         # <4> loop through the current row and
49 ...                                 #     take only elements w/ index 0 and 3
50 ...  ]
51 ...  for row in r                   # <1> loop through the data
52 ...      if row[3].value            # <2> take only rows where the billable
53 ...                                 #     hours are non-empty, and
54 ...         and
55 ...         row[3].value < 8        # <3> where the value is less than 8
56 ... ]
57 
58 ### the ``result`` variable now points to a list of lists (2D list)
59 ### of the days when the number of billable hours were < 8:
60 >>> for d, h in result: print(d.strftime('%Y-%m-%d'), h)
61 2019-09-03 5
62 2019-09-10 5
63 2019-09-17 5
64 2019-09-24 5
  1. Worksheet management of a given spreadsheet:
    1. list the sheets
    2. add sheet
  2. Data management:
    1. dump the data on a sheet to the stdout
    2. append provided CSV data at the end of a sheet
  3. CLI Interface: arguments as described above
[PythonExcel]http://www.python-excel.org/
[pymod_openpyxl]

OpenPyXL is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files

[openpyxl_tutorial]https://openpyxl.readthedocs.io/en/stable/tutorial.html
[libreoffice_automation]Automate your office tasks with Python Macros http://christopher5106.github.io/office/2015/12/06/openoffice-libreoffice-automate-your-office-tasks-with-python-macros.html