| Date: | 2019-09-24 |
|---|---|
| Author: | Gábor Nyers |
| tags: | python |
| category: | python_workshop |
| summary: | Reading and writing data to spreadsheets |
| licence: | CC BY-NC 4.0 https://creativecommons.org/licenses/by-nc/4.0/ |
Table of contents
Spreadsheets and databases permeate our business lives. Databases usually contain structured data and often we use use dedicated applications to interact with them. Spreadsheets are used to do ad-hoc processing or simply to compensate for missing features of aforementioned applications.
This is Part 1 of a (potentially) series of sessions where we will focus on how Python can be useful when dealing with spreadsheets. Our goal is to live-code a simple Python application to read from- and write data to spreadsheets.
As we will build upon the previous topics, you might want to review the earlier sessions:
- session 1: Getting your environment set up and ready for Python development https://github.com/gnyers/python-tuesday/tree/master/session1
- session 2: Working with text files and file formats https://github.com/gnyers/python-tuesday/tree/master/session2
Visit this page: https://github.com/gnyers/python-tuesday/tree/master/session3
OR
Check out the Git repo:
cd $PROJECTDIR
git clone git@github.com:gnyers/python-tuesday.gitTo follow along with the instructions please open the README.html in your browser:
firefox $PROJECTDIR/python-tuesday/session3/README.htmlThe code of this workshop has been tested with:
Python v3.6 and
the following modules that are not part of the Python Standard Library:
openpyxl: module to read/write Excel 2010 workbooks (OOXML format)ptpython: a very conveniant Python interactive shell with support for in-line sytax highlighting, command completion and improved code editing
To install these modules execute:
pip install --user -r requirements.txt
Chapter contents
This chapter will discuss the basics of working with Excel .xlsx files.
Python provides numerous modules to deal with both Excel and LibreCalc spreadsheets. Because LibreOffice support Python as macro language, users have much more advanced capabilities. (See for more: [libreoffice_automation])
Therefore, in this workshop our focus is on dealing with Excel spreadsheets. For this we have different options as wel: (Source: [PythonExcel])
- openpyxl: The recommended package for reading and writing Excel 2010 files (ie: .xlsx)
- xlsxwriter: An alternative package for writing data, formatting information and, in particular, charts in the Excel 2010 format (ie: .xlsx)
- xlrd: This package is for reading data and formatting information from older Excel files (ie: .xls)
- xlwt: This package is for writing data and formatting information to older Excel files (ie: .xls)
Following the openpyxl Tutorial ([openpyxl_tutorial]) let's create a spreadsheet in Excel 2010 format:
1 >>> import openpyxl
2 >>> wb = openpyxl.Workbook() # create a new workbook
3 >>> ws1 = wb.active # get the active sheet
4 >>> ws1.title = "MySheet" # rename the current sheetCreate a new worksheet:
5 >>> ws2 = wb.create_sheet('TimeReg', 0) # create a new sheet
6 >>> ws2.sheet_properties.tabColor = "aa00bb" # change tab color
7 >>> ws1.sheet_properties.tabColor = "00bbaa" # change the other sheet's color
8 >>> wb.save('sandbox/test00.xlsx') # save the workbook to fileThe sheet now should look something like this:
Now let's put some data into the sheet "TimeReg":
9 ### Get a sheet by its name
10 >>> ws_tr = wb['TimeReg']
11
12 ### Fill the first row with some data
13 >>> for i in range(1, 11):
14 ... ws_tr.cell(row=1, column=i, value='Col{}'.format(i))
15 ...
16 ... # doctest: +ELLIPSIS
17 <Cell 'TimeReg'.A1>
18 <Cell 'TimeReg'.B1>
19 <Cell 'TimeReg'.C1>
20 ...
21
22 ### Save the workbook to file
23 >>> wb.save('sandbox/test01.xlsx')The sheet now should look something like this:
Let's load the data from the names.csv file into our sheet. The content of
the file is as follows:
The code:
20 ### We'll read from a csv file
21 >>> import csv
22 >>> with open('../session2/names.csv') as fh:
23 ... data = list(csv.reader(fh, delimiter='|'))
24
25 ### Add a new sheet 'Names' to load the data into
26 >>> ws3 = wb.create_sheet('Names')
27
28 ### Activate the new sheet when the workbook loads
29 >>> wb.active = ws3
30
31 ### Add data to sheet
32 >>> for row in data: ws3.append(row)
33 ... # doctest: +ELLIPSIS
34
35 ### Save the workbook to file
36 >>> wb.save('sandbox/test02.xlsx')The result:
Formatting of cells can be done either directly or by applying a style. In general using styles is preferable.
37 >>> from openpyxl.styles import NamedStyle, Font, Border
38 >>> from openpyxl.styles.borders import Side
39 >>> h1 = openpyxl.styles.NamedStyle(name='h1')
40
41 ### Font style: bold, 18pt "Arial" of a nice color;
42 >>> h1.font = Font('Arial', sz=18, b=True, color='aa00bb')
43
44 ### Border style:
45 >>> b_med = Side(color='aa00bb', border_style='medium' )
46 >>> h1.border = Border(bottom=b_med)
47
48 ### Apply the style to the 1st row
49 ### Note: numbering of cells starts with 1!
50 >>> for i in range(1, 6):
51 ... ws3.cell(row=1, column=i).style = h1
52 ... # doctest: +ELLIPSIS
53 >>> wb.save('sandbox/test05.xlsx') # save the workbook to fileThis will result in the following:
The first row of the "TimeReg" sheet has the custom style "h1". Note that graphical design is not a primary concern at this point ;-)
Comments can contain useful auxiliary information about a cell, such as instructions to the user or ways to verify the correctness of the data.
8 >>> from openpyxl.comments import Comment
9 >>> ws3['A1'].comment = Comment('Data imported from ``names.csv``', 'John Doe')
10 >>> wb.save('sandbox/test06.xlsx') # save the workbook to fileOur goal is to calculate the total number of billable and non-billable hours based on the data of the following timesheet:
First let's load the data from the spreadsheet: (Please type the following in an interactive Python session)
1 ### load the ``openpyxl`` module
2 >>> import openpyxl
3
4 ### open the workbook
5 >>> wb = openpyxl.load_workbook('sandbox/timesheet.xlsx')
6
7 ### select the sheet containing the required data
8 >>> ws = wb['TimeSheet']
9
10 ### select the cells that contain the data; this obviously implies that you
11 ### need to be familiar with the structure of the workbook
12 >>> data_range = ws['A5:G40']
13
14 ### let's retrieve the data with the following list comprehension
15 >>> timesheet = [ [ cell.value for cell in row ] for row in data_range ]
16 >>> print(timesheet) # doctest: +ELLIPSIS
17 [['Date', 'ProjectID', 'ActivityID', 'Billable Hours', 'Non-Billable Hours', 'RemarkPub', 'RemarkPriv'], [datetime.datetime(2019, 9, 1, 0, 0)...We're now ready to calculate the total number of billable hours from the range
'D5:D40':
18 ### select the cells which contain the billable hours
19 >>> range_billable = ws['D5:D40']
20
21 ### since the range is 2D data structure, we'll need a nested loop to
22 ### process the data
23 ###
24 ### with the following list comprehension we create a list of hours
25 >>> billable_hours = [ cell.value for row in range_billable[1:]
26 ... for cell in row if cell.value ]
27 >>> print(billable_hours)
28 [8, 5, 8, 8, 8, 8, 5, 8, 8, 8, 8, 5, 8, 8, 8, 8, 5, 8, 8, 8, 8]
29
30 ### the total of the billable hours
31 >>> print(sum(billable_hours))
32 156In a similar manner, we'll calculate the total of non-billable hours based on
the data in the range 'E5:E40':
33 >>> range_non_billable = ws['E5:E40']
34 >>> non_billable_hours = [ cell.value for row in range_non_billable[1:]
35 ... for cell in row if cell.value ]
36 >>> print(non_billable_hours)
37 [2, 3, 3, 3, 2, 2, 3, 3, 3, 2, 2, 3, 3, 3, 2, 2, 3, 3, 3, 2, 2]
38 >>> total_non_billable = sum(non_billable_hours)
39 >>> print(total_non_billable)
40 54Suppose we're interested in the dates when the number of billable hours was below 8:
41 >>> r = ws['A6:D40']
42
43 ### Let's write up the query in a more understandable form
44 ### NOTE: the <1>...<5> indicate the order in which Python executes the
45 ### statements
46 >>> result = [
47 ... [ cell.value # <5> take the current cell's value
48 ... for cell in row[::3] # <4> loop through the current row and
49 ... # take only elements w/ index 0 and 3
50 ... ]
51 ... for row in r # <1> loop through the data
52 ... if row[3].value # <2> take only rows where the billable
53 ... # hours are non-empty, and
54 ... and
55 ... row[3].value < 8 # <3> where the value is less than 8
56 ... ]
57
58 ### the ``result`` variable now points to a list of lists (2D list)
59 ### of the days when the number of billable hours were < 8:
60 >>> for d, h in result: print(d.strftime('%Y-%m-%d'), h)
61 2019-09-03 5
62 2019-09-10 5
63 2019-09-17 5
64 2019-09-24 5- Worksheet management of a given spreadsheet:
- list the sheets
- add sheet
- Data management:
- dump the data on a sheet to the stdout
- append provided CSV data at the end of a sheet
- CLI Interface: arguments as described above
| [PythonExcel] | http://www.python-excel.org/ |
| [pymod_openpyxl] | OpenPyXL is a Python library to read/write Excel 2010 xlsx/xlsm/xltx/xltm files
|
| [openpyxl_tutorial] | https://openpyxl.readthedocs.io/en/stable/tutorial.html |
| [libreoffice_automation] | Automate your office tasks with Python Macros http://christopher5106.github.io/office/2015/12/06/openoffice-libreoffice-automate-your-office-tasks-with-python-macros.html |




