Five Top Tips for Culling Data
Data is growing. In 2020, every person generated an estimated 1.7 MB of data a second: that’s 146.88 GB per day and roughly 53.6 TB of data per person, per year. This presents a huge challenge for lawyers who are tasked with reviewing that data. They need to cull it to keep ongoing hosting and lawyer fees down, while managing the risk of removing potentially relevant data.
Technology can help. However, tools like predictive coding and email threading reduce only lawyer time, not hosting fees, because they are deployed once the data has been uploaded for review. So, it’s important to utilise all the ‘analytical levers’ available before data moves to review, and to go a step further than search terms and date ranges through an Early Case Assessment (ECA) workflow, which can be deployed in the first few days of a project.
Here are five tips to help you achieve higher cull rates in ECA.
1. Clustering
Machine learning sorts the data by concepts, topics, or ideas, presenting the top terms within each and shedding light on what would otherwise be unknown unknowns. This is useful when formulating search terms, segmenting data into relevant piles for review, or exposing non-responsive topics. During an investigation, for example, topics such as ‘fraud, money, bank’ would be more relevant than ‘drinks, Friday, pub’.
2. Word Lists
Word lists show the most frequent words within a data set. They are especially useful when run against the documents behind a search term with a high hit count, as they surface words that point to irrelevant material and can be added as exclusionary terms. In one matter, for example, we removed 80,000 docs for a manufacturing client by adding their competitors’ names as exclusionary terms during an investigation.
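As a rough illustration of the idea, the sketch below builds a word list from the documents that hit on a search term and then applies exclusionary terms. It is a minimal example under stated assumptions, not how any particular ECA platform works: the sample documents, the stop-word set, and the function names `word_list` and `apply_exclusions` are all hypothetical.

```python
import re
from collections import Counter

# Hypothetical sample documents; a real data set would come from the ECA platform.
documents = [
    "Acme pricing update for the Acme widget line",
    "Invoice query about widget pricing",
    "Acme newsletter: widget of the month",
]

# Assumed stop words, kept tiny for the example.
STOP_WORDS = {"the", "for", "of", "about", "a", "an", "and"}


def tokenize(text):
    """Lower-case the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def word_list(docs, search_term, top_n=5):
    """Return the most frequent words in documents that hit on search_term."""
    term = search_term.lower()
    counts = Counter()
    for doc in docs:
        words = tokenize(doc)
        if term in words:
            counts.update(w for w in words if w not in STOP_WORDS and w != term)
    return counts.most_common(top_n)


def apply_exclusions(docs, exclusion_terms):
    """Drop every document that contains at least one exclusionary term."""
    excluded = {t.lower() for t in exclusion_terms}
    return [d for d in docs if not excluded & set(tokenize(d))]


top_words = word_list(documents, "widget")
remaining = apply_exclusions(documents, ["acme"])
```

Here `top_words` would be inspected by the case team, who decide which frequent words (a competitor name such as the made-up "acme", say) are safe to use as exclusionary terms before `apply_exclusions` is run.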
3. Custodian Isolation
Custodian IDs are applied to data from key people within the case. This is useful when you’re looking to isolate and drill down into their data within your culling or prioritise it for review. Once their data has been reviewed, the relevant documents can then be used to formulate word lists and clusters for a wider culling strategy. For example, an early investigation may focus on the most senior person in the team, understand his or her data, and apply that knowledge to the additional custodians.
4. Email Domains
Isolate all the sender and recipient email domains within a data set. This makes it easy to exclude spam and bulk emails (e.g., Qantas.com.au, McDonalds.com, or NewYorkTimes.com), while quickly identifying potentially privileged docs through law firm email domains.
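A simple way to picture domain isolation is the sketch below, which pulls the sender and recipient domains out of each email and sorts messages into three piles. It is illustrative only: the sample emails, the two domain lists, and the `triage` labels are assumptions made for the example, and the `.example` domains are placeholders.

```python
from collections import Counter
from email.utils import getaddresses

# Hypothetical emails reduced to their From and To headers.
emails = [
    {"from": "Deals <offers@qantas.com.au>", "to": "jane@client.example"},
    {"from": "jane@client.example",
     "to": "Counsel <advice@lawfirm.example>, bob@client.example"},
    {"from": "news@newyorktimes.com", "to": "bob@client.example"},
]

# Assumed lists; in practice these would be agreed with counsel.
BULK_DOMAINS = {"qantas.com.au", "mcdonalds.com", "newyorktimes.com"}
LAW_FIRM_DOMAINS = {"lawfirm.example"}


def domains_in(message):
    """Return the set of sender and recipient domains for one email."""
    pairs = getaddresses([message["from"], message["to"]])
    return {addr.rsplit("@", 1)[1].lower() for _, addr in pairs if "@" in addr}


def triage(message):
    """Label an email by domain; privilege takes priority over exclusion."""
    domains = domains_in(message)
    if domains & LAW_FIRM_DOMAINS:
        return "privilege_review"
    if domains & BULK_DOMAINS:
        return "exclude"
    return "keep"


# How many emails each domain appears in, for a quick overview of the data set.
domain_counts = Counter(d for m in emails for d in domains_in(m))
labels = [triage(m) for m in emails]
```

Checking law firm domains before bulk domains is a deliberate choice in this sketch, so that a potentially privileged email is never culled just because a marketing address is also on the thread.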
5. Search Term Quality Control
Random statistical samples of documents can be generated from search terms with high hit counts for counsel to review before the whole set is moved into Relativity. This may explain the high hit count or give insight into additional culling measures that can be taken. In one employment matter, the word “fire” hit on 10,000 irrelevant documents because the custodian was a volunteer firefighter in his spare time.
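The sampling step can be sketched as follows. This is a minimal example, not the method used by any specific platform: it assumes a 95% confidence level with a 5% margin of error, sizes the sample with the standard Cochran formula plus a finite population correction, and uses made-up document IDs and hypothetical function names.

```python
import math
import random


def sample_size(population, z=1.96, margin=0.05, p=0.5):
    """Cochran sample size with finite population correction.

    Defaults assume 95% confidence (z = 1.96), a 5% margin of error,
    and the most conservative proportion (p = 0.5).
    """
    n0 = (z ** 2) * p * (1 - p) / (margin ** 2)
    return math.ceil(n0 / (1 + (n0 - 1) / population))


def sample_hits(doc_ids, size, seed=0):
    """Draw a reproducible simple random sample without replacement."""
    if size >= len(doc_ids):
        return list(doc_ids)
    return random.Random(seed).sample(doc_ids, size)


# Hypothetical IDs for the 10,000 documents that hit on one search term.
hit_ids = [f"DOC-{i:05d}" for i in range(10_000)]

size = sample_size(len(hit_ids))
review_sample = sample_hits(hit_ids, size)
```

Under these assumptions, counsel would review a few hundred documents rather than all 10,000 hits, and the fixed seed means the same sample can be regenerated later if the culling decision needs to be defended.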
The TLS team has a track record of helping our clients achieve 91.5%+ cull rates – higher than the industry average of 80% – reducing downstream fees by 30-40%. Our proprietary ECA technology platform, Digital Reef, has a myriad of other analytics to help move as little data as possible into review, while hosting all case data for free, just in case it is ever needed again in the future.