“Bugs, butterflies and other Insects”
Saturday 26th November 2016
Braeside Metropolitan Park
Lower Dandenong Road, Braeside
Contact details for further enquiries: Patrick Honan PHonan@museum.vic.gov.au
Data auditing and cleaning on the command line for entomological (and other) datasets
A half-day intensive training for anyone who deals with large datasets: specimen records, sequence data, bibliographic records, etc. If your dataset is too big for a spreadsheet or your database software is too complex for easy auditing, this workshop is for you! Limit 15 places.
12-1 pm Workshop for command-line beginners
1-1.30 pm Lunch break
1.30-4.30 pm Data auditing workshop
Sunday 27th November 2016
What you’ll learn
How to use the command line to find duplicate records, inconsistencies in content and format, data gaps, data in wrong fields, fields improperly used, non-standard and non-printing characters, georeferencing errors and many other data problems, plus tips on data editing (for data in text files).
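As a rough taste of what such checks look like (a minimal sketch only, not the workshop's actual material), the following uses AWK on a small made-up tab-separated table to flag a duplicate record and count a data gap. The file contents, field layout and variable names are all invented for illustration.

```shell
# Hypothetical demo data: tab-separated records (ID, species, latitude).
# The names and values are placeholders, not real specimen data.
demo=$(mktemp)
printf 'ID\tspecies\tlat\n'  > "$demo"
printf '1\tAus bus\t-41.2\n' >> "$demo"
printf '2\tAus bus\t-41.2\n' >> "$demo"
printf '3\tCus dus\t\n'      >> "$demo"

# Duplicate records: print the ID of any row whose species + latitude
# combination has already been seen (the ID field itself is ignored).
dupes=$(awk -F'\t' 'NR > 1 && seen[$2 FS $3]++ {print $1}' "$demo")

# Data gaps: count the records with an empty latitude field.
gaps=$(awk -F'\t' 'NR > 1 && $3 == "" {n++} END {print n + 0}' "$demo")

echo "IDs duplicating an earlier record: $dupes"
echo "records missing latitude: $gaps"
rm -f "$demo"
```

Here the second record repeats the first apart from its ID, and the third has no latitude, so the script reports ID 2 as a duplicate and one record with a gap.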
What you need to bring
A laptop running a BASH shell; demonstration data will be provided. All OS X and Linux computers already have BASH installed. Until recently, Windows machines could only run BASH in a virtual Linux machine or with special software (Cygwin). Windows 10 users can now run a BASH shell natively; see this Microsoft advisory and other websites for how-to advice.
What you need to know beforehand
What a shell is and how commands work. If you’re new to the command line or a bit rusty on its use, see the next section. The training will use simple utilities such as grep, sort, uniq, wc, head and tail, but will put special emphasis on regex and on text processing with AWK and sed. OS X users are advised to install GNU AWK 4 (gawk 4) for its advanced features (Mac App Store, Homebrew).
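To give a feel for how these utilities combine (a sketch under stated assumptions, not an excerpt from the training), the example below tallies the spellings found in a made-up "state" field with sort and uniq, counts the distinct variants with wc, and normalises them with a sed regex. The sample values and the choice of "Tas" as the standard form are assumptions for illustration.

```shell
# Use the C locale so sorting is byte-wise and predictable on any system.
export LC_ALL=C

# Made-up single-column data: a "state" field entered inconsistently.
states=$(printf 'Tas\nTAS\nTas\nTasmania\nTas\n')

# Tally each distinct value, most frequent first.
tally=$(printf '%s\n' "$states" | sort | uniq -c | sort -rn)
echo "$tally"

# Count the distinct spellings (tr strips the padding some wc versions add).
variants=$(printf '%s\n' "$states" | sort -u | wc -l | tr -d ' ')
echo "distinct spellings: $variants"

# Normalise the variants to one form with an extended regex in sed,
# then list what remains.
cleaned=$(printf '%s\n' "$states" | sed -E 's/^(TAS|Tasmania)$/Tas/' | sort -u)
echo "after cleaning: $cleaned"
```

A tally like this is often the quickest way to spot inconsistent content in a field: three spellings appear before cleaning and only one afterwards.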
Where to find BASH tutorials
There are lots of online tutorials, but the friendliest introduction to the command line I know is William Shotts’ website, linuxcommand.org. Two fairly good guides to BASH and shell scripting are at nixCraft and Bash Hackers Wiki.
About the trainer, Bob Mesibov
I’ve published 40-odd papers on Australian millipede taxonomy and biogeography. However, I’m probably better known on the Web for my 70-odd online coding tutorials and demos for command-line users. As a data geek I’ve done auditing to assist Catalogue of Life, Australian Faunal Directory, Atlas of Living Australia, Global Names Usage Bank and biodiversity data compilers in the UK, Germany and the Netherlands. The results are usually disappointing: data not fit for analysis, containing thousands of easily spotted errors, duplications and inconsistencies. Aggregators usually blame the data compilers. That’s not the whole story – see this rant – but the auditing workshop is aimed at helping data compilers and custodians get their data properly clean.
There will also be a 1-hour introductory workshop on the command line for first-time users, from 9 am to 10 am at the same venue.