Editor’s Note: In this insightful transcript from an educational webcast hosted by HaystackID on June 28, 2023, our expert panel discussed the impact of generative AI that is currently dominating the dialogue within the legal services industry.
Led by Michael Sarlo, an eDiscovery professional with extensive artificial intelligence experience in investigations and discovery, panelists from HaystackID and Reveal Data discussed the unfortunate confusion around generative AI and shared opportunities and appropriate uses for structured analytics, machine learning, artificial intelligence, and different large language models in the context of cybersecurity, information governance, and legal discovery. Panelists also highlighted methods for validating the output of these systems, including the dangers of “black box” approaches when leveraging artificial intelligence.
While the entire recorded presentation is available for on-demand viewing, dive into the complete webcast transcript below to gain valuable insights from panelists on the basics of AI for eDiscovery.
[Webcast Transcript] Don’t Throw the Baby out with the Bathwater: Back to the Basics in AI for eDiscovery
Michael Sarlo, EnCE, CBE, RCA, CCLO, CCPA + HaystackID – Chief Innovation Officer and President of Global Investigation Services
John Brewer + HaystackID – Chief Data Scientist
Toni Millican + Reveal Data – Director of Customer Success
Bernie Gabin Ph.D. + HaystackID – Senior Data Scientist
Hello, everybody, and welcome to today’s webinar. We’ve got a great presentation lined up for you today. But before we get started, there are just a few general admin points to cover.
First and foremost, please use the online question tool to post any questions you have. And we will share them with our speakers.
Second, if you experience any technical difficulties today, please let us know using that same questions tool, and a member of our admin team will be on-hand to support you.
And finally, just to note, this session is being recorded, and we’ll be sharing a copy of that recording with you via email in the coming days.
So, without further ado, I would like to hand it over to our speakers to get us started.
Thanks so much. And as usual, a special thanks to the Lexology team for being such a great, if you will, focal point for these webinars and learning in general.
Hi, my name is Mike Sarlo, and hello and welcome to another HaystackID webcast. We hope you’ve been having a fantastic and happy week. I’ll be your moderator and lead for today’s presentation and discussion titled Don’t Throw the Baby Out with the Bathwater: Back to the Basics in AI for eDiscovery.
This webcast is part of Haystack’s ongoing educational series designed to help you stay ahead of the curve in achieving your cybersecurity, information governance, and eDiscovery goals and objectives. Today’s webcast is being recorded for future on-demand viewing. We will make the recording and a complete presentation transcript available on the HaystackID website shortly after today’s live presentation.
Our expert panelists for today’s webcast have extensive experience with eDiscovery, digital forensics, cyber incident investigations, and overall corporate governance. They will be sharing the opportunities and appropriate use for structured analytics, for machine learning, artificial intelligence, and different large language models in the context of cyber incident response, information governance, and legal discovery.
So, again, I’m Mike Sarlo. I’m the Chief Innovation Officer and President of Global Investigations and Cyber Incident Response here at HaystackID. I act as an overseer, so to speak, for our advanced technologies solutions group. I work closely with our customers and our internal teams to develop and deploy new tech to enhance the eDiscovery process. I also am a forensics investigator. I’ve been an expert witness in state and federal venues.
I’m going to hand it off to Toni. We’re so happy to have our co-sponsor of this presentation, Reveal Data Brainspace, truly leaders in AI. Toni, can you go ahead and introduce yourself?
Thank you, Mike. My name is Toni Millican; I’m a Director of Customer Success here at Reveal Brainspace. And my primary role here at Reveal is to really be an advocate for our customers who are adopting our AI-driven platform to help them solve their problems as it relates to eDiscovery and investigation.
I have over 30 years of experience in the legal field prior to joining Reveal. I have an extensive background in both public and private sector, and I am a certified eDiscovery specialist, and look forward to sharing some of my insights with everyone today.
Thank you, Toni. John Brewer.
I’m John Brewer. I’m HaystackID’s Chief Data Scientist. I’ve been in the data space basically since the late ‘90s. Been around in the SAP space, in the big data space, when that was just coming up. And then transitioned around the middle of the last decade into the eDiscovery space. And I’m with HaystackID today, leading our efforts in data science and artificial intelligence and along with many of the other data challenges that we will be talking about today.
Hi, I’m Bernie Gabin. I’m relatively new compared to the others to the eDiscovery space. But my background – I used AI systems for my doctoral work in physics and have transitioned into a more AI and data scientist role as I’ve gone forward. I worked for areas in the DoD and for the Government. And I have recently transitioned over here to eDiscovery and doing data breach work using my background to help automate and process large quantities of data.
So, a ton of expertise today from all over the spectrum, really focusing on the 800-pound elephant in the room. I think we can all agree that we just can’t get enough of AI. In fact, anything with a sprinkle of AI on it seems like it could be a success in today’s age with so much hype around the use of artificial intelligence.
So, it’s something certainly that I think many practitioners on this call – we’ve dealt with different flavors of AI, but I think we’re really hitting a different point in the acceptance model and use of that technology.
Certainly, a day that will go down in infamy is November 30th, 2022, the day that ChatGPT went mainstream. Anybody who decided to create a login around that time, or before the closed beta ended, was probably fairly amazed leveraging these systems, and we’ll be getting into how they work, what they’re good for, and what they’re not good for today.
And of course, I think we need to start with a lesson, and I’ll hand it off over to Bernie to talk about what is generative AI.
Thank you, Mike. So, as you can see here, we threw up the Wikipedia definition as a starting point. There’s a lot of talk about generative AI these days; obviously, AI is much farther afield, and we’ll get into that more later. But to set the groundwork, generative AI is what it says on the tin: automated systems that are designed to create new works, whether text, images, or other types of media. There’s a lot of confusion around what that means, though. And it’s very important to stress that these generative AI systems do not have the spark of creativity in a philosophical sense. They work based on patterns and structures that they have learned from the datasets they are trained on. In the case of something like ChatGPT, that means a large language model trained on text; in the case of the image or video generators, the models are based on the images or video fed into them.
And because they are pattern and structure-based, garbage in, garbage out. Depending on what you feed it, it will change what results you get and what kinds of patterns it recreates. These systems are also designed specifically to create; they’re not designed to look things up for you. They’re not search engines, they are generative. So, they don’t know how to say, “I don’t know”; they’ll just make up stories if you ask them to. I’m sure you’ve all heard the stories about certain lawyers getting into a lot of trouble for using generative AI without interrogating the results it generates. Next slide.
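To make the “patterns and structures” point concrete, here is a toy bigram (Markov-chain) text generator. This is emphatically not how production large language models work (they use neural networks with vastly richer context), and the training corpus below is made up, but it illustrates the same principle: the system can only remix patterns present in its training data, and it has no mechanism for saying “I don’t know.”

```python
import random
from collections import defaultdict

def train_bigram_model(text):
    """Learn which word tends to follow which; this is the model's only 'knowledge'."""
    words = text.split()
    model = defaultdict(list)
    for prev, nxt in zip(words, words[1:]):
        model[prev].append(nxt)
    return model

def generate(model, start, length=8, seed=0):
    """Produce text by repeatedly sampling a learned follower.
    Note it never says 'I don't know': it just emits whatever pattern it has."""
    rng = random.Random(seed)
    out = [start]
    for _ in range(length):
        followers = model.get(out[-1])
        if not followers:
            break  # dead end: no learned continuation
        out.append(rng.choice(followers))
    return " ".join(out)

# Hypothetical one-line 'training set'; the output can only remix it.
corpus = "the court granted the motion and the court denied the appeal"
model = train_bigram_model(corpus)
print(generate(model, "the"))
```

Swap in a different corpus and the generated "style" changes with it, which is the garbage-in, garbage-out point above.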
So, ChatGPT and Midjourney are two of the primary examples that have a lot of mindshare right now. Brewer if you wanted to discuss specific applications of these.
So, the way that we’re seeing generative AI hit the market right now, the big popular wave, as we saw here, is Midjourney, which rendered very unforgiving pictures of me and my colleague there. And actually, a very nice picture of Mike in the upper right-hand corner.
I’m the one who generated the pictures, so I spent more time on mine.
And ChatGPT, of course, has gotten a lot of coverage. But the generative AI techniques are propagating throughout the commercial space. There’s a big land rush right now, where we’re seeing these not only being used to generate text, and generate images, but being used to generate videos, being used to generate really all kinds of media, but also things like processes, operational data, things that the human beings who do them day to day don’t really think of as being particularly interesting work, but work where some documentation, some artifact, some document has to be created. And that’s the role that we’re seeing generative AI in now. And also, as we’re seeing it in the commercial space, and especially in the legal space, we’re starting to see some of the limitations of that technology.
So, why don’t we move on to the next—
—point that Bernie made: garbage in, garbage out. These large language models are only as good as the data they’re trained on.
I generated Midjourney photos of Bernie and asked it to create a scientist sitting in a crime lab. And I kept getting back images of Bernie with white hair. Obviously, when the model thinks of a scientist, it thinks of Doc from Back to the Future. So, it took many different iterations to get an image that actually was age-appropriate. These are some of the challenges with these large language models: they can make up the smallest details and hallucinate, and you’ll hear that term often when talking about GAI.
So, what is generative AI and what isn’t it?
One thing that we did want to focus on here, and this is really the central principle of this conversation, is that generative AI is not the full breadth of AI. We do have the concept of artificial general intelligence, which you may have heard around; that’s your HAL 9000, a true sentient thinking machine. Despite the fact that you can have a conversation with a generative AI, it is a paper tiger; it is not actually an intelligence in a meaningful way.
And it’s not natural language processing. This is actually where we see the biggest misunderstanding resulting in negative outcomes: as Bernie alluded to earlier, generative AI generates text; it’s not a querying system. If you ask it a question, it will often answer correctly based on its training set. But if it can’t find the answer, or it has an ambiguous answer, it will just make stuff up. I know that we all know people like that in real life, and now we have created an artificial one. But those are the two big limitations that we need to look at when we are dealing with generative AI.
So, beyond the hype, what’s actually happening here? The real danger facing the AI community as a whole, and now facing the tools that we’ve been marketing as AI or that are actually using AI techniques, is that when a new AI technology comes out, we often see this big spike in excitement: the peak of inflated expectations, as the slide says. And that is a real danger that the AI community has come across before, once in the 1970s and once in the 1990s with neural networks.
We had a technology that came out, it was very exciting. It seemed to be the secret sauce, the thing that was going to get us to general-purpose AI, to these great civilizational-scale improvements. And of course, it ended up not being the solution, or at least not the whole solution. And then public confidence in those technologies crashed down, and we entered what was called the AI Winter, in which we saw a drying up not only of commercial interest but of research interest in AI for about a decade.
And I think that there’s a big concern in the industry right now, both on the commercial side and on the academic side, that we are on course for another such disillusionment. So, part of the conversation that we’re having here, and the conversation that we’re trying to have in a wider sense as an industry, is getting those expectations set correctly so we can avoid that trough of disillusionment, as this slide very poetically says, and get to integrating these tools into their best fit so that they are boosting our productivity as soon as possible, making the adoption of these new, very powerful GAI technologies as quick and as painless as possible.
I’d like to add onto that too. This year at Legalweek, it was all hype; everyone was talking about it. After the announcement of ChatGPT, the hype cycle started, and we’ve seen this hype cycle in AI before. People were traveling from booth to booth or going from panel to panel wanting to know how the legal industry was going to embrace the technology. And with the idea of increased productivity from a solution that could predict the next best word in a sentence or a phrase, we quickly surpassed that peak of inflated expectations and have come to realize it’s not always what we anticipate it to be.
So, I think we’ve already hit that disillusionment. We’ve come to recognize that there are some risks involved in leveraging this non-human intelligence when providing advice to our clients. These platforms don’t know if they’re telling the truth; they tend to provide you the status quo. It’s monkey see, monkey do: the behavior can’t go beyond the person who’s teaching the technology.
So, industries are concerned about this. And I think that we are going to find that there’s concern about how it’s affecting roles, functions, and communications, and how we harness it. At some point, you make the decision to either understand it or fear it, and I think most organizations have reached that plateau of starting to test it, to figure out the productivity gains, and to make sure that they know how to harness it through pilots and testing and whatnot.
So, when we think about how AI in eDiscovery extends beyond generative AI, we’ve been there for a while. You look at just the sheer complexity of data that has grown in the past few years post-pandemic: the way that businesses are communicating, the multiple platforms, the social media, the increase of unstructured content. I know there’s the structured data, but it’s the unstructured that is the hardest to really get your arms around.
And most organizations are really facing a spike, from what I’ve heard, of over 90% in data storage. And so, their ability to gain insight into evidence of wrongdoing, or to find the information that’s relevant to a legal matter, is harder without leveraging AI in eDiscovery.
So, having those newer forms of electronic communications too. We’ve got chats, emojis, and GIFs. I don’t know how we would be able to translate that without using some types of AI tools to help us.
And then we have to also understand that it has been – again, with the pandemic, with that rise of data, we are still dealing with organizations being tasked to do more with less, and their resources and budgets. And so, the eDiscovery and AI has been critical to being able to get those quick insights.
Just real quick on the topic – that was great, thank you so much – we see eDiscovery being used in organizations in ways that are so critical. And what we see, too, is the desire for organizations to hold on to all this data now, because they see economic value in the promise of AI delivering insights and efficiencies. Unfortunately, it takes a lot of data to power AI: you need a lot of data to train the models, and you need a lot of supervision and training decisions.
And for eDiscovery purposes, we’re typically interested in finding the right documents as quickly as possible: finding the hottest documents, finding documents that we don’t want to produce, and doing it in a meaningful way that’s accurate and predictable.
GAI leads us astray here a little bit, for certain, but there are a ton of use cases for GAI in legal technology in general. And when you look at things from the trust-but-verify standpoint, you really start to see this as a force multiplier. It’s so important to think about that when you’re considering using AI in your own organization or on a matter; it is never going to be “Hey, I just load it in and it finds the easy answer.” It’s always a force multiplier for human input, and we’re going to talk about that a little bit more.
And really, it’s a matter of “Are you using augmented intelligence or are you using AI?”
So, we’re going to jump to another fun definition with our ex-NSA data scientist, Bernie, here.
So, earlier we defined generative AI. And as we pointed out, it is only a small fraction of the entire field. And as Brewer said, there is concern that it may be pulling too much of the oxygen out of the room. So, let’s take a step back and look at AI in the full scope of it.
Again, we start with a textbook definition that is pretty profoundly unhelpful. Defining artificial intelligence as intelligence doesn’t really help. Anyone who’s ever taken a philosophy course – and I’m sure many of you have – knows that defining intelligence is extremely difficult and contentious. So, generally, it is more helpful, in a practical sense, to define AI based on what it can do, what tasks it can accomplish, and how complex any given AI system is. Can we go to the next slide?
A famous quote that I’m going to paraphrase badly here is that AI is sometimes defined as anything we thought was impossible for computers to do five years ago. There are lots and lots of systems that you use every day, in your everyday life, that 10 years ago would have been considered the heights of artificial intelligence, and nowadays are just considered completely transparent, bog-standard systems.
Here are a bunch of examples of things that probably come up in your everyday life, a lot of recommendation systems, things like Alexa or Siri, which do natural language processing on speech, OCR systems. All of these, at one point, were considered to be AI systems.
The way we are moving forward now, it is becoming more and more clear that AI is a set of tools: tools that you work with, that allow people to do more with the machines and the computers they’re using. Next slide.
Which brings us to the idea of human and machine teaming. Brewer, did you want to cover this one?
The distinction that I think is getting drawn here, and this is what Mike said, or alluded to anyway, in terms of working with the machines, as opposed to things like generative AI just taking off without them, is the idea of augmented intelligence. We have a machine that is good in a particular domain; it’s good at one special capability. And this is the way that AI worked for a long, long time, and it was more obvious that it was working that way. The quintessential example is chess AIs, which can help you analyze chess and can, in the extreme, play a game themselves. That’s still a person working with an AI.
And most of the examples that we saw previously, again, are AI tools augmenting human selection of a capability there, or of an action that they were taking.
So, we have a variety of subcategories of AI that help us drop in the tools and the actions that we’re taking.
Now, machine learning is actually a technique that you use to create AI. It’s basically the process of taking a whole bunch of data and letting the machine find patterns in it, usually with some hints from human beings on what it should be looking for. And what we call AI is the model that pops out of that process: the thing that can actually make decisions given a piece of data.
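As a minimal, hedged sketch of that process (the vocabulary, documents, and labels below are all hypothetical), here is a tiny nearest-centroid classifier: humans supply the labeled examples and the hint (which words to count), training finds the pattern, and the function that pops out makes decisions on new data.

```python
# Minimal supervised-learning sketch: humans supply labeled examples and a
# feature hint (which words to count); the machine learns the pattern.
def featurize(doc, vocab):
    """Human-supplied hint: represent a document as counts of chosen words."""
    words = doc.lower().split()
    return [words.count(w) for w in vocab]

def train_centroids(examples, vocab):
    """'Learning': average the feature vectors seen for each label."""
    sums, counts = {}, {}
    for doc, label in examples:
        vec = featurize(doc, vocab)
        acc = sums.setdefault(label, [0.0] * len(vocab))
        for i, v in enumerate(vec):
            acc[i] += v
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [v / counts[lbl] for v in acc] for lbl, acc in sums.items()}

def predict(doc, centroids, vocab):
    """The 'AI' that pops out: given new data, pick the closest learned pattern."""
    vec = featurize(doc, vocab)
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(vec, c))
    return min(centroids, key=lambda lbl: dist(centroids[lbl]))

vocab = ["contract", "payment", "party", "game", "score"]
training = [
    ("the party breached the contract over payment", "legal"),
    ("contract terms require payment by either party", "legal"),
    ("the final score decided the game", "sports"),
    ("a great game with a close score", "sports"),
]
centroids = train_centroids(training, vocab)
print(predict("payment dispute under the contract", centroids, vocab))  # → legal
```

Real eDiscovery classifiers use far richer features and models, but the division of labor is the same: human hints plus data in, a decision-making model out.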
Artificial narrow intelligence, again, is what we call those very specific domains of capability. That’s where we are now: we have AIs that can do specific tasks. It used to be playing chess; now it’s writing documents or creating images. The distinction we drew earlier was between what we have now and artificial general intelligence, which is, again, theoretically an AI capable of doing anything that one of your co-workers, who you never see except as a line in Teams or Slack, could do. That’s fully functional, or at least capable of doing anything a human can do.
And then you have artificial superintelligence, which is the thing that Elon Musk is very worried is going to conquer the world. That’s what you’ll see mostly in media now, and you’ll occasionally read things from Nick Bostrom on how that’s going to be a great danger to us.
So, as Brewer pointed out, we have those different categories, and we are climbing the ranks of them, but a lot of that is still in the realm of science fiction. When we move over to the other half of it, besides the complexity of AI systems, we also have job categories. There are many, but these are four of the most common that are generally acknowledged in systems today: discrimination systems, pattern recognition, task planning or optimization, and natural language processing. These are all tasks that are highly repetitive, or tasks that require a lot of high-level math, basically. So, if you’re trying to do a task planning or optimization problem, like the traveling salesman, you can throw AI systems at it to test lots of different options and come back with the optimal solution, or at least an optimal solution, if there are multiple. Discrimination tasks are the original AI task, with image recognition and things like that. And as Brewer also pointed out, pattern recognition systems, the original machine learning, can help you pull patterns out of data.
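As a concrete taste of the task-planning category, here is the classic nearest-neighbor heuristic for the traveling-salesman problem. Serious optimizers test far more options (simulated annealing, genetic algorithms, and so on); this greedy sketch, with made-up city coordinates, just shows the shape of the problem.

```python
import math

def nearest_neighbor_tour(points, start=0):
    """Greedy traveling-salesman heuristic: always visit the closest
    unvisited point next. Fast, but not guaranteed optimal."""
    unvisited = set(range(len(points))) - {start}
    tour = [start]
    while unvisited:
        last = points[tour[-1]]
        nxt = min(unvisited, key=lambda i: math.dist(last, points[i]))
        tour.append(nxt)
        unvisited.remove(nxt)
    return tour

def tour_length(points, tour):
    """Total distance of the closed tour, returning to the start."""
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % len(tour)]])
               for i in range(len(tour)))

cities = [(0, 0), (0, 1), (1, 1), (1, 0)]  # four corners of a unit square
tour = nearest_neighbor_tour(cities)
print(tour, tour_length(cities, tour))  # walks the square: total length 4.0
```

On this tiny input the greedy answer happens to be optimal; on larger instances it generally is not, which is why "an optimal solution" above is the honest phrasing.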
Most of the systems we talked about earlier that people use today fall into one of these four categories. There are some new ones coming. The generative AIs use NLP and pattern-recognition techniques to help generate new patterns, and that gets back to the point that they’re not really creating something out of nothing; they are recreating patterns they’ve seen. And discrimination tasks and task planning or optimization are things that we can see even now in eDiscovery and the legal field, working towards document classification, or being able to pull out sentiment analysis and things like that.
It really comes down to the way that we use AI in eDiscovery. Again, it’s augmented intelligence; we’re not replacing humans, we can’t replace humans. And it’s trust but verify. Really articulating your processes, mapping them out, and following them in a mechanism where you’re creating opportunities to conduct tests around verification is the secret to reducing risk when wanting to use new technologies, period.
When we think of digital forensics or we think of expert witness testimony in general, it’s the Daubert standard. How do you demonstrate repeatability and a repeatable outcome?
And so, many of the techniques we use for TAR, validating those models, really can be applied to the use of generative AI in a variety of categories when you isolate the right subsets of classification that we need to vet. And so, this leads us to the many flavors of AI in eDiscovery. And my apologies that the presentation is a little clunky for sharing on the screen.
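One common TAR validation technique of the kind alluded to here is estimating recall against a human-reviewed control set. Here is a minimal, hedged sketch; the labels below are made up, and real protocols also report a confidence interval around the point estimate rather than a single number.

```python
def estimate_recall(control_labels, control_predictions):
    """Estimate recall from a human-labeled random control set: of the
    documents reviewers marked relevant (1), what fraction did the
    model also flag as relevant (1)?"""
    caught = [pred for truth, pred in zip(control_labels, control_predictions)
              if truth == 1]
    if not caught:
        return None  # no relevant docs in the sample; draw a bigger one
    return sum(caught) / len(caught)

# Hypothetical control set: human review says 5 docs are relevant;
# the model caught 4 of those 5.
human = [1, 1, 1, 0, 0, 1, 0, 1, 0, 0]
model = [1, 1, 0, 0, 0, 1, 0, 1, 0, 1]
print(estimate_recall(human, model))  # → 0.8
```

The same repeatable, documented sampling step that supports TAR defensibility under a Daubert-style repeatability standard can wrap any classifier, generative or not, which is the portability point being made above.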
And so, Toni, I’m sure – you’ve been in this industry for quite some time, you’ve seen the evolution of fad to barely usable to really highly vetted and, in every single case, would love to get your thoughts on that evolution and some of these technologies.
Thank you. So, when I think about where we’ve evolved, I really find that we’re seeing more adoption around the flavors of AI when we talk about the visualizations involved. We’ve had technology-assisted review and supervised learning features and functionality for a while, but the ability to present that data in a way that can be digested by the team trying to get insights quickly has really been a game changer in this industry, whether it’s having a visualization across the multiple communication channels, as an example.
You find many organizations where they’re communicating in email, and then in a chat, and then someone goes out and some social media – other social media platform, and so being able to bring that into context.
Additionally, I’m finding we’re seeing more around sentiment analysis. Similar to generative AI, there is a pattern of predictability in how people communicate based upon emotion. And we now have the ability to start predicting and looking into that, finding patterns within the words and the language to bubble up potential risk in organizations based upon whether the sentiment is good or bad. So, I’m finding that is also being used more.
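The idea of finding risk signals in word choice can be sketched with a toy lexicon-based scorer. Production sentiment models learn their weights from labeled data rather than using hand-picked word lists like the hypothetical ones below, but the principle is the same: patterns in language surface the messages worth a human look.

```python
# Toy hand-picked lexicons; real systems learn these weights from data.
POSITIVE = {"great", "agree", "happy", "thanks", "good"}
NEGATIVE = {"angry", "problem", "never", "bad", "worried"}

def sentiment_score(message):
    """Count positive words minus negative words: > 0 positive, < 0 negative."""
    words = message.lower().split()
    return sum((w in POSITIVE) - (w in NEGATIVE) for w in words)

def flag_risky(messages, threshold=-1):
    """Bubble up messages whose sentiment falls at or below a risk threshold."""
    return [m for m in messages if sentiment_score(m) <= threshold]

msgs = ["thanks, great work on this", "I am worried this is a bad problem"]
print(flag_risky(msgs))  # flags only the negative second message
```

A reviewer still reads the flagged messages; the scorer only prioritizes, which is exactly the augmented-intelligence framing used throughout this discussion.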
Then there are the AI-enhanced features that people don’t think about as AI, like deduplication analysis. It took attorneys a long time to get comfortable with deduplication analysis, but with the way the data is visualized now, you can take those duplicate datasets and really see that they are conceptually the same. Reviewers have a pictorial they can look at and say, “Okay, I agree with the technology, and I’m feeling comfortable with the technology and the decisions that it’s making, so it can go ahead and move forward”.
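Near-duplicate detection of the kind described here is often built on shingling plus Jaccard similarity. A hedged sketch with made-up documents (production systems add hashing, MinHash, and normalization on top of this core idea):

```python
def shingles(text, k=2):
    """Break text into overlapping k-word 'shingles' (here, word pairs)."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(1, len(words) - k + 1))}

def jaccard(a, b):
    """Overlap of two shingle sets: 1.0 = identical, 0.0 = nothing shared."""
    return len(a & b) / len(a | b)

doc1 = "please review the attached quarterly report before the board meeting"
doc2 = "please review the attached quarterly report before our board meeting"
doc3 = "lunch is on the fourth floor today"

sim12 = jaccard(shingles(doc1), shingles(doc2))
sim13 = jaccard(shingles(doc1), shingles(doc3))
print(round(sim12, 2), round(sim13, 2))  # near-duplicates score high, unrelated low
```

Scores like these are what drive the visual "these two documents are conceptually the same" groupings that help attorneys trust the technology.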
Also, around models, Mike, we’ve had that conversation, and it’s really exciting to see what we can do with curated models as a starting point, and to be able to build those models and have portability across specific types of investigations. But beyond that, if you go further left and you’re looking at organizational risk: how can you create models in your organization and use the repeatable processes that we have in the eDiscovery role to reduce risk, maybe with data classification? With all the regulatory changes that we’re seeing around privacy regulations, how can you target where data is sitting, to quickly gain insight for a data breach? So, you’re finding more of that with entity extraction; that’s a great feature for getting those quick insights.
So, I’m interested to hear a little too from your side, Mike, of what you have found in those 31 flavors of AI and data discovery.
So, I think it’s important to look at mass regulation at the state level, at a global level, a lot of folks here from Europe or the UK, you have a much more involved privacy regime than the US, although we’re seeing all of our different states coming up with their own flavors. So, we’re actually becoming the more complex one these days.
But what it did is create a need for exactly what you touched on: classification of data from an information governance standpoint, basic dates, joint policy documents, but knowing what the content is. And also, from a privacy standpoint, knowing what PII and PHI and sensitive data is out in the wild; likewise, in your eDiscovery datasets, you oftentimes have an obligation to understand that.
So, we’ve developed really good AI technology that allows us to identify sensitive data, both the normal stuff like Social Security numbers, addresses, things like that, but also some of the more esoteric things like gender, or references to religion. And part of what’s accelerating organizations here as well, and where I think this technology could be more accessible, is that the cybersecurity regime is really requiring certain industries to journal a lot of data and logging. And what this is doing is forcing enterprise infrastructure further into the cloud.
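The "normal stuff" end of sensitive-data identification can be sketched with simple patterns. This is a hedged illustration only, with hypothetical patterns and sample text: regexes catch well-structured identifiers like SSNs and phone numbers, while the esoteric categories mentioned above (gender, references to religion) require trained models, not patterns.

```python
import re

# Hypothetical patterns for well-structured US identifiers.
PII_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "phone": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "email": re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),
}

def find_pii(text):
    """Return {category: [matches]} for every pattern that fires on the text."""
    hits = {}
    for name, pattern in PII_PATTERNS.items():
        found = pattern.findall(text)
        if found:
            hits[name] = found
    return hits

sample = "Call 555-867-5309; SSN on file is 123-45-6789."
print(find_pii(sample))
```

In practice this pattern layer is combined with validation logic (checksum rules, context words) and ML-based entity extraction to keep false positives down.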
Well, when you have data in the cloud, it’s much easier to deploy these AI-based workflows. And there are a lot of pre-packaged models in the data classification categories in Amazon, in Google, in Azure that do a really good job at things that we try to do in eDiscovery. And so, we’re seeing more use of these tools; we’re using more of these tools to solve very specific problems on a per-matter basis. And John and Bernie spend a lot of time with our data science team and our BI team (our business intelligence team), because visualization is so important when you’re dealing with AI, and coming up with a method that allows end users to visualize low-risk and high-risk populations, so they can make a decision on something, is really important.
We deal a lot with computer vision now as well. When we are looking for driver’s licenses and passports, or even standard forms like a tax return, we’re actually using computer vision models to identify those. That’s super important in our breach practice, where you do get a lot of these image-based documents, and it’s tough to identify them using standard search or text-based AI.
One area where GAI is definitely making major strides that will trickle further into eDiscovery is transcription and translation. In particular, Whisper, which is OpenAI’s audio framework, is just blowing away the competition in a space where Google was once the leader.
So, we’re really going to see a more open world, and we’ll get more confidence that, in datasets across multiple languages, we’re truly getting the gist of the document, even including CJK-based languages.
Mike, let me just touch on that too. That just triggered a thought on when you were talking about embracing the AI for eDiscovery, data-driven discovery at an enterprise level for enterprise risk management. There’s that conversation that we were having, it’s not just about the eDiscovery. But it’s really about discovering data and reducing risk before risk occurs. So, you’re finding legal operation teams more entrenched into enterprise risk management and not as reactive as they were prior. So, now, they are on there at the table, and they’re working with their enterprise leaders to help leverage some of this eDiscovery AI technology.
Oh, totally. And part of the reason is that when there is a data breach, for a lot of folks who don’t deal with it day to day, the incident response piece (the technology piece: finding the breach, how they got in, closing those systems) is a fraction of the cost compared to the legal spend. Pure-play digital forensics and incident response is maybe about 30% of the budget for that incident; legal is probably about 70%. A decent-sized data breach in the US can run about 5 million bucks at a minimum, and that’s not including the loss of business.
So, 70% of that budget, all of a sudden, falls back onto the legal department. So, we’re seeing, I totally agree, legal departments and legal operations are way more involved in enterprise risk management because they have to be.
Further to that, they’re leveraging a lot of technology outside of the legal team, outside of legal ops, that can provide competitive intelligence for an eDiscovery matter. So, I always encourage – when I’m dealing with large companies, I try to understand how they’re classifying their data outside of the realm of just “Hey, we have some search terms.” And if there are verticals where we can use those bigger enterprise classification schemas to really bucket data we’re bringing into an eDiscovery matter. We have a lot of success there as well.
And certainly, that’s all well and good. But maybe, Toni, you want to talk about some of the capabilities that simplify this. Because at the end of the day, we’re only interested in certain things, and I think products like Reveal do a great job of putting AI right at users’ fingertips — letting them leverage the technology easily without all the technical uplift.
You hit the nail on the head right there. I look at two problems to solve when I’m thinking about eDiscovery, generally speaking: what do I need to know from the AI, from an investigative standpoint, and how much can I gain from the technology with minimal lift? That’s where tools like Reveal come into play. Just from getting the data in — as you can see in this screenshot — you’re able to analyze the data as it’s being ingested and get visualizations that quickly surface insight.
So, whether it’s applying and validating search terms, as you’ve pointed out, seeing the conceptual patterns around those terms so you can home in on your target, or just the communication visualizations depicted in this slide — you’re getting into the case and learning things you may not have known when you began the actual investigation.
It also has very robust predictive capabilities for organizing the data, identifying relevant information, and culling what’s not relevant. That’s something to remember, too: tools like Reveal can help you identify what’s not of importance and get rid of that data.
And social network analysis is one of those things I lean on — I do a lot of investigations. I oversee our investigation team in eDiscovery as well as our cyber incident response function. So, I have two lives here at HaystackID.
And social network analysis is a great tool for us when we’re going in blind. Where we’re introducing more computational power — and, in some ways, machine learning — is in being able to dice datasets that may have millions of actors in them down to a population of actors that is likely to be more relevant.
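The actor-narrowing idea described here can be sketched in a few lines. This is a toy illustration, not any platform’s actual method: the edge list, names, and cutoff are all invented, and at real scale you would reach for a graph library such as networkx and richer centrality measures rather than hand-rolled counting.

```python
# Sketch: narrowing a large population of communication actors to a
# shortlist worth reviewing, using a simple weighted-degree ranking.
# Assumes an edge list of (sender, recipient, message_count) tuples
# extracted from email/chat metadata; names and the cutoff of 3 are
# purely illustrative.
from collections import Counter

edges = [
    ("alice", "bob", 120),
    ("alice", "carol", 45),
    ("bob", "carol", 80),
    ("dave", "alice", 3),
    ("erin", "frank", 1),
]

# Weighted degree: total messages an actor sent or received.
degree = Counter()
for sender, recipient, count in edges:
    degree[sender] += count
    degree[recipient] += count

# Keep only the most connected actors as the candidate population.
shortlist = [actor for actor, _ in degree.most_common(3)]
print(shortlist)  # most-connected actors first
```

In a real matter the same ranking step would run over millions of actors, and the shortlist — not the full population — is what a human investigator then examines.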
John Brewer, I know you’ve done a lot of work here with Slack. We deal with this all the time. We have some clients that we’ve seen 20-terabyte Slack dumps. Can you talk a little bit about some of the capabilities and workflows for cleansing data and why that’s important around eDiscovery?
No, absolutely. We do see a lot of this kind of network data come up in eDiscovery. And like you said, it can be terabytes and terabytes of data in which we’re trying to find information relevant to whatever matter we’re working on — whether that’s communications between two individuals of interest in the case, or certain terms being used. We’ve had search terms forever, but having concept search, communication search, and anomaly detection is also pretty important.
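As a toy illustration of the anomaly-detection idea mentioned above — not how any review platform actually implements it — one might flag custodians whose message volume sits far outside the population’s typical spread. All counts, names, and the threshold below are invented; a median-based rule is used because it stays robust even when the outlier itself inflates the statistics.

```python
# Sketch: a toy anomaly check over per-custodian message volume,
# using a median-based rule (median absolute deviation) that is
# robust to the outlier itself. Counts here are illustrative.
from statistics import median

daily_counts = {
    "alice": 40, "bob": 38, "carol": 45, "dave": 41,
    "erin": 39, "frank": 210,  # sudden spike worth a look
}

values = list(daily_counts.values())
med = median(values)
# MAD: the typical spread of the counts around the median.
mad = median(abs(v - med) for v in values)

# Flag anyone far outside the typical spread.
anomalies = [who for who, n in daily_counts.items() if abs(n - med) > 5 * mad]
print(anomalies)
```

Real anomaly detection over communications is far richer — it looks at timing, recipients, and content, not just volume — but the shape of the workflow is the same: compute a baseline, then surface the deviations for human review.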
We actually had a question come in on processing this data that I think is a really good element to bring up, because it touches on how this data is used, particularly in the information governance context. It cited the CaixaBank fine just leveled by the Spanish Data Protection Authority — I think it was about €6 million — because they were using data erroneously.
And one question we’ve gotten in the past from a number of different clients is: once we have this discovery data, are our hands tied by the GDPR in terms of how we’re allowed to use it? Are there restrictions on the use of data from our discovery process? Because when the data subject provided the data, they didn’t consent to this particular use.
Now, the good news is that there are carve-outs in the GDPR that, at least as they’ve been interpreted, cover eDiscovery. So, we’ve not really run afoul of that in the past. But as we build new training models for visualization and social network analysis, that is absolutely something we need to take into account. Because yes, the gloves are off, especially in the EU: if you’re using user data to train your models, you can find yourself on the receiving end of some regulatory unpleasantness.
And don’t think the gloves aren’t off here in the US either. The FTC is laser-focused on the gathering of health data and its use to support ad campaigns and other classification mechanisms. So, I don’t think the US is getting that far apart from Europe, either.
I would also like to say the FTC is increasingly becoming laser-focused on companies saying their products contain AI.
And then on the subject of portable models — we already touched on computer vision — there’s just a lot going on here. Because of the accessibility of big data and computational power, and a lot of new advancements in GPUs — graphics processing units — we’ve been able to advance so fast in AI, and even to begin to conceive of training some of these large language models.
The same goes for computer vision, where you can have 60, 80, 100 terabytes of video data, and we’re now able to go through it at a clip that once seemed impossible in the private sector. I think the first time the public got a real look at capabilities in this domain was the Boston Marathon bombing investigation many years ago — one of the biggest CCTV gathering and analysis exercises of all time. That effort took many, many days; today, we’d be able to conduct it almost as quickly as we could copy the data.
So, there have been huge advancements here, and there’s a lot of opportunity. To leverage computer vision well, you do need to reach out to groups that have a defined data science offering and expertise, because the tooling is very ad hoc, and you might be looking for something very specific. That’s why we’ve invested so much in the data science vertical here at HaystackID.
And then just going back on the subject, AI at your fingertips. I love that saying. And whether it’s Reveal, whether it’s Relativity, whether it’s Brainspace, whether it’s any other platform, everybody’s laser-focused on making AI more accessible.
When we think about pre-trained models that are portable — a concept that has really developed over the past two to five years, and that we’re seeing more and more use of — bigger corporations and organizations that consistently deal with the same types of issues can really benefit from training their own models based on what they’re seeing.
Say I’m a large global bank dealing with FCPA issues. You see enough of it that you can actually begin to train on your own. We like Reveal here because they have pre-trained models that, although somewhat broad, can really supercharge an investigation early on. And this is something to ask your clients about and to think about: how can we leverage pre-trained models, especially in matters where we just don’t know much going in? Like bribery, or these huge insider threat matters — and insider threat detection is really a cyber incident response function as well. We’ll see business email compromises: are we dealing with an inside actor who might have wired a million bucks to Ghana, or was it truly a third party? Pre-trained models are really great in that context as well.
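To make the “portable pre-trained model” idea concrete, here is a heavily simplified sketch — not Reveal’s actual approach — of training a toy relevance scorer on labeled documents from a prior matter, serializing it, and reusing it to triage a new one. Every document, label, and weight here is invented for illustration.

```python
# Sketch: a toy "portable" model. Train once on documents labeled in
# prior matters, serialize it, then reuse it to prioritize review in
# a new matter. The model just learns smoothed per-word log-odds of
# relevance; real pre-trained models are far more sophisticated.
import json
import math
from collections import Counter

def train(labeled_docs):
    """labeled_docs: list of (text, is_relevant) pairs."""
    rel, irr = Counter(), Counter()
    for text, is_relevant in labeled_docs:
        (rel if is_relevant else irr).update(text.lower().split())
    vocab = set(rel) | set(irr)
    # Laplace-smoothed log-odds that a word appears in a relevant doc.
    return {w: math.log((rel[w] + 1) / (irr[w] + 1)) for w in vocab}

def score(model, text):
    return sum(model.get(w, 0.0) for w in text.lower().split())

# "Prior matter": train and persist the model.
prior = [
    ("payment approved by consultant for facilitation fee", True),
    ("wire the facilitation fee before the contract award", True),
    ("lunch menu for the quarterly offsite", False),
    ("reminder to submit your timesheet", False),
]
model = train(prior)
blob = json.dumps(model)  # portable: ship this between matters

# "New matter": load the model and triage incoming documents.
loaded = json.loads(blob)
doc = "consultant requests facilitation fee via wire"
print(score(loaded, doc) > 0)  # positive score: prioritize for review
```

The point of the sketch is the workflow, not the model: because the trained weights are just data, they can move between matters, which is exactly what makes broad pre-trained models useful for supercharging the first pass of an investigation.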
Toni, I don’t know if you have anything you want to add here, John, or anybody else.
I think it’s important to add — I’ve heard fear about leveraging models. But the models are really designed as a launching pad for you. They’re there to help you. You are the human; you’re trying to identify the problem you need to solve, and you’re using these models as launching pads within these technologies, building on them to create your own custom models. And I think it’s important to call that out. It’s a launching pad that helps you drive forward — it’s all of the glitter and glam.
I look at it in the simplest of terms, when I put on my jewelry and my earrings, that’s just my launching pad, before I get my hair done. So, it’s the same concept. You’re just using it as a launching pad to get you to the next place.
Couldn’t agree more. And for all the eDiscovery practitioners on the phone — the lawyers, anybody dealing in this space — it’s so critical to be thinking about the use of AI, really from a competitiveness standpoint: making datasets more accessible to decision-makers through visualization and the simplification of big data. There are cost benefits, and as you invest in technologies you want to be asking: how do you future-proof yourself?
Now, we all know you can only future-proof yourself so much. But at a personal level, it’s really important. And for me, what I always try to drill into folks’ heads when I’m talking about the use of any technology is that repeatability piece.
In the same way we think about privacy by design, I think about defensibility and repeatability. The way we tackle any big problem — or big data problem — here at Haystack is to look at it through the lens of: how do you create a process that is repeatable, that can be vetted and tested? You’ll really have more success, and you’ll feel more comfortable talking about the processes and tools you used if you’re ever asked. It’s always good to keep a little journal, too, when you’re using AI or other new technologies — your queries, how you culled the datasets — because those little things can matter, especially in the way you set up that first pass of training.
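That “little journal” can be as simple as an append-only log of each query and training decision. A minimal sketch — the field names, model name, and file path below are invented, not a standard:

```python
# Sketch: an append-only journal of AI queries and settings, so a
# workflow can be described and repeated later if challenged.
# Field names and the file path are illustrative.
import json
import time

def log_step(path, action, **details):
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "action": action,
        **details,
    }
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")  # one JSON record per line
    return entry

entry = log_step(
    "matter_journal.jsonl",
    "first_pass_training",
    model="pretrained_bribery_v2",   # hypothetical model name
    seed_documents=150,
    search_terms=["facilitation fee", "consultant payment"],
)
print(entry["action"])
```

Because each record captures what was run, with what settings, and when, the journal gives you exactly the vetted, testable trail that makes a process defensible if you are ever asked about it.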
And, of course, there are ethical duties. Think about the ABA Model Rule on competence and its Comment 8, and the California Standing Committee on Professional Responsibility and Conduct. The real gist is that if you’re going to take cases that involve eDiscovery, you need to be competent in the evolving eDiscovery landscape. Lawyers need to keep abreast of changes in the law and its practice, including the benefits and risks associated with relevant technology. A key word there is “benefits” — you can’t just lean back and say, “Oh, there’s too much risk from an ethical standpoint.”
So, it’s really important for folks to think about this, to train, to seek education, and to realize that forgoing technology is really not an option. Courts are becoming much more sophisticated. They know what TAR is; they know what email threading is. You see even veteran judges with a really great sense of the eDiscovery process and the tech landscape.
So, as practitioners, it’s important that we’re forward-leaning and looking for ways to use these technologies on a matter-by-matter basis.
You did hit on something there. At this point, there’s an ethical duty to actually leverage the technology. You can’t put your head in the sand; it’s out there, and we need to embrace it. That includes remembering that practicing law is different when it comes to eDiscovery — understanding the technology, training on it, and testing it, rather than going out the very first time with a large case and a large volume of data and throwing the kitchen sink at it. You want to understand it, and you want to learn how to use it.
As Toni was alluding to earlier, the amount of data we’re getting is, frankly, growing faster than the number of lawyers and paralegals we, as a society, are producing. At HaystackID, we have front-row seats to best-in-class review techniques, and we still need to use computational and AI techniques aggressively just to get these datasets down to a point where they’re tractable for humans.
So, yes, no matter what, the sheer quantity of data is forcing the adoption of these advanced techniques, whether we really feel like going there or not. It’s become a requirement. Sorry, Mike, go ahead.
No, with that said, just to wrap up on barriers — at least with my corporate clients: if your organization is already using AI somewhere, there are usually experts who can enable you and work with you to leverage it. Many large organizations are investing heavily, and those experts often live in the IT, finance, or business intelligence functions. Being the person who reaches across the aisle to cross-pollinate — that’s the key to what we’re seeing.
As Toni mentioned, the old divide between legal and IT is fading, and we’re seeing more CISOs who are lawyers.
And so, there’s an intersection between security, privacy, and risk management. Look within your organization at those areas and find out who the leaders are — you’ll often find that’s a way to accelerate knocking down the barriers to adoption in your own, granularly eDiscovery-focused practice.
We hear so many myths about what’s out there and how to use the technology. There’s no better time than now to embrace AI — but do so cautiously, would be our words of advice.
Any questions? I’m going to check the question page real quick here. I think we answered the question we had.
So, with that said, we will conclude the presentation. We want to thank, again, this great expert panel for sharing their insights and information. And we also want to thank everyone who took time out of their busy schedules to attend today’s webcast. We truly value your time and appreciate your interest in our educational series.
You can learn more about and register for upcoming webcasts and explore our extensive library of on-demand webcasts on our website at haystackid.com.
Once again, thank you for attending today’s webcast. We hope you have a fantastic day and a wonderful, blessed weekend.