Tuesday, 21 July 2020

An Adventure with Excel

I have a rather old-fashioned attitude to spreadsheet applications - I tend to use them to do maths on columns of numbers and plot graphs and that's about it.  That's what they were invented for, after all.  Within the last year I have used conditional formatting to highlight certain cells that are important, and more recently a work colleague used it to identify chemicals in a spreadsheet of our stock which need to have a COSHH assessment or a COSHH assessment plus an exposure record, based on the hazard codes listed for each of the chemicals.  That worked very well, but my lazy nerd brain started to think that there must be a way to do this that didn't involve laboriously typing a conditional formatting rule for each and every triggering hazard code.  Wasn't there some way that a range of cells could be used to store the hazard codes for COSHH triggers and another one to store the hazard codes that require an exposure record to be kept and use a formula that says "if that cell contains any of those codes then you need a COSHH assessment (or that plus an exposure record)".  That way, you'd be able to simply replicate that formula down a column so it will work on every chemical in the list, and use only 2 conditional formatting rules (do we need COSHH?, do we need an exposure record?) to decide what colour to make the cell.  You'd also have a value in a cell that could be used to trigger some other action somewhere else in the spreadsheet if you need it.

This turned out to be rather more complex a problem than I'd anticipated.  "Do any of these search terms occur in the text in that cell" turns out to be a harder question to answer than it looks because Excel doesn't have a function to answer it directly.  What it does have is more complex and versatile, but that means some extra processing of the search results are required to get what you want.

A Note About Boolean Functions in Excel.

Boolean functions are those which return a value of true or false.  Computers use numbers to represent those values.  In Excel, true is represented by the number 1, and false by zero.  You can force Excel to show you the numeric value by preceding a boolean function with two dashes "--".  Open a blank worksheet and put the following into a cell:  "=(1=1)"  without the quotes.  That might look odd if you're not used to this sort of thing, what it means is " true or false, 1=1?".  I hope it's obvious that one does equal one and that this is therefore true.  When you press 'return', Excel should confirm this by displaying 'TRUE' in the cell you typed the formula into.  Into the cell below, put this: "=(1=2)".  That should display 'FALSE'.

Now try putting two dashes in front of the expressions you used before,  e.g. "=--(1=1)".  You should now see "1" for the true expression and "0" for the false one.  Since these have numerical values, you can do calculations with them, in fact, you can get the same result as using the two dashes by multiplying by 1 the result of a boolean expression, i.e. instead of "=--(1=1)" you can  use "=(1=1)*1".  Try it, it does work. It's also a trick often used in programming languages to perform a calculation only if a condition is satisfied - if the condition is false then the calculation gets multiplied by zero, and the result is therefore zero, if the condition is true then the calculation has the result 
expected.

One final thing before moving on:  What if we want to test if the outcome of some other function is true or false?  do we need it to be explicitly "1" or "0"?  what if it is some other number?  what if it's a text string?  or blank?  Excel will interpret any non-zero number as TRUE.  Any non-zero number - it could be negative, or fractional, it doesn't matter.  If it's a number and it's not exactly equal to zero, then as far as boolean logic is concerned, it's TRUE.  Zero is FALSE, obviously, and so is a blank cell.  Text strings will generate an error (#VALUE! will be displayed).


Back to the Problem

A quick recap:  we want to find out if a particular cell containing a string of text (all the hazard codes for a particular chemical) contains any one of a series of shorter strings of text (the codes that trigger an exposure record and/or a COSHH assessment).   Here is a link to a workbook I created to explore this problem and demonstrate the solution.  Download it to try out the exercises below.

This is the formula which will return TRUE if any of the codes is found in the cell that is searched.

=(SUMPRODUCT(--ISNUMBER(SEARCH(<range containing search terms>,cell to be searched)))>0)

That's a lot of nested brackets, so before breaking it down into simpler parts, let's just note that the outer brackets basically say (something >0) which makes the whole thing a boolean expression, which will return TRUE if something is greater than zero, FALSE otherwise.


The SEARCH Function

SEARCH(<range containing search terms>,cell to be searched).  This is not case-sensitive.  If you ever need to do a similar thing, but with a case-sensitive search, the similar FIND function is what you want.

If you look at the workbook linked to earlier, cell J3 contains a formula like the above, only with actual cell references instead of descriptors.  It says the search terms are in the range of cells from $H$3 to $I$7.  The $ signs mean that these are absolute references, i.e. if the formula is copied elsewhere, they will not be updated because the formula has moved - the same cells will still be used for the search terms.  The cell to be searched is B3.  That's a relative cell reference - that will be updated if the formula is copied elsewhere.  If we were only checking for one search term, i.e. <range containing search terms> is just one cell, then this would return only one result.  If the search term were found, it would return the position within the cell being searched where the found search term began. If it is not found, then an error is returned. Because we have searched for a range of terms, we get an array of results, one for each search term.  You can see in the workbook that Excel has helpfully spilled out the results into the surrounding cells so that we can see them.  Those which say "#VALUE!" are those which did not contain the search term, those with a number did find the search term.  J3 contains the result of searching for "code01" - it's the first thing in cell B3 so starts in position 1.  K4 is the result of searching for "code07" - it begins with the 25th character in cell B3. 

This is all very well, but we just want one answer - were any codes found or not?  to do that, we need to count the number of codes found, and see if it is greater than one.


Is it a Number?


=(SUMPRODUCT(--ISNUMBER(SEARCH(<range containing search terms>,cell to be searched)))>0)

When the SEARCH function finds one of the codes, the result is a number, otherwise not a number.  the ISNUMBER function will return TRUE if its argument is a number, FALSE otherwise.  So if we modify the function in cell J3 by wrapping ISNUMBER() around it:

=ISNUMBER(SEARCH($H$3:$I$7,B3))  the array of cells containing the SEARCH results will display TRUE or FALSE instead.  Then if we prefix that with "--" we will get zeroes and ones instead.


Count 'Em Up

That brings us to the final and most complex part of our formula:  the SUMPRODUCT function.  This will multiply corresponding elements of multiple arrays and add the results together to give one number.  With only one array, such as we have - the array of results from our SEARCH, you simply get the sum of the elements.  After we have applied --ISNUMBER to the array, the elements will be either 1 (if a code was found) or zero.  The result of applying SUMPRODUCT() to our formula will therefore be the number of codes found - if that's greater than zero, then a code has been found.


Just a COSHH? or Do We Need and Exposure Record Too?

You remember that we can do calculations with the result of a boolean function?  Well, we can do that here.  If we use two versions of the formula, one which will search for COSHH triggers (the codes in cells H3-H7) and multiply the result by 1, the other will search for exposure record triggers (the codes in cells I3-I7) and multiply the result by 10, then add those together, we can distinguish between those chemicals which don't need a COSHH assessment (result will be 0), those which need only a COSHH assessment (result will be 1) and those which also need an exposure record (result will be 10 or 11).  Such a formula can be found in cells E3 to E7.  It's rather long and looks like this:

=(SUMPRODUCT(--ISNUMBER(SEARCH($H$3:$H$7,B3)))>0)*1  +  
  (SUMPRODUCT(--ISNUMBER(SEARCH($I$3:$I$7,B3)))>0)*10



Adding Some Colour

Now, at last, we can get back to some conditional formatting to colour code the cells with the hazard codes for each chemical, i.e. B3 to B7.  I have chosen yellow to indicate that a COSHH assessment is required, red to indicate that an exposure record is also required.  That range of cells has two conditional formatting rules applied to it, both based on a formula.

$E1>=1   sets the fill colour to yellow

$E1>=10 sets the fill colour to red

Note the the cell reference in both is mixed:  the column reference is absolute ($E), the row reference is relative (1).  This is so that it will be updated before checking, i.e. the formula will always look at column E, but will adjust for the row, so the formatting for, e.g, B5 will look at the value in E5.

Hope that helps somebody!



Further Notes

Rather than using the codified results in E3 to E7, we could work out directly whether a COSHH assessment or COSHH plus exposure record is needed using formulas in columns C and D.  For column D (exposure record needed) we could simply put iinto D3:

=IF((SUMPRODUCT(--ISNUMBER(SEARCH($I$3:$I$7,B3)))>0),"Yes","")

.. and fill down for the rest of that column.

For column C, it's a bit more complex because if an exposure record is needed, then a COSHH assessment is also needed.  We could simply include all the exposure record triggers in the block of COSHH triggers as well as having them separate, or do something like this for C3 and fill down.

=IF(OR(SUMPRODUCT(--ISNUMBER(SEARCH($I$3:$I$7,B3)))>0),SUMPRODUCT(--ISNUMBER(SEARCH($H$3:$H$7,B3))),"Yes","")

which will work but that formula is getting very long-winded.  A simpler approach might be to expand the range of codes to check for COSHH triggers so that it includes the exposure record trigers also, i.e. search the range $H$3 to $I$7.  There's a big pitfall waiting for anyone who does that, though, and it opens up as soon as there are more COSHH triggers than exposure record triggers (which there are in reality).

Try this exercise: take the formula given above for D3, put that into cell D3 and copy down to D7.  That won't change the results at all, showing that the formula works

Now put this formula into C3 and copy down to D7.  It will expand the search to cover columns H and I:

=IF((SUMPRODUCT(--ISNUMBER(SEARCH($H$3:$I$7,B3)))>0),"Yes","")

Again there will be no change to the results, because the number of codes in each column is the same.  

Check that cell J3 still contains the search formula: =SEARCH($H$3:$I$7,B3) and delete the contents of cell I7, leaving it blank.  Now, apparently, everything needs a COSHH assessment and an exposure record.  What's happened?  If you look at cell K7, you'll see that it contains "1", which would mean that it has found the contents of the blank cell I7.  This will always happen if you search for 'nothing' in Excel, since anything contains 'nothing - plus something else'.  I assume that's the logic of this anyway.  It seems more logical to me that 'nothing' would be found at position zero rather than position 1,but that would not help us with this one - zero is still a number and would be detected by ISNUMBER()

There are some simple workarounds - firstly, use the more complex expression above, and just search the areas with values in them.  Secondly, use an innocuous value for the 'empty' cells - put something like '####' (which should never show up in a real hazard code) in I7 and the problem goes away.






No comments:

Post a Comment