By Murphy Foss, Tess Felter and Connor Catalano1
This past June, representatives of LexisNexis taught a two-hour CLE course for the NDNY-FCBA regarding the use of generative artificial intelligence (“GenAI”) in the practice of law.2 In addition, in May and June, representatives of LexisNexis and Westlaw presented two 60-minute “Tech Talks” to the employees of the federal judiciary on the same subject. In addition to explaining how GenAI works, each of the three presentations answered eight questions posed to the presenters. This article summarizes the presenters’ responses to those eight questions, as supplemented by materials cited during or issued after those presentations, including a study by Stanford University.3
1. What Are the Error Rates of LexisNexis’ and Westlaw’s New GenAI Products (i.e., Lexis +AI and Westlaw Precision)?4
One of the primary flaws plaguing the use of GenAI in the field of law is the phenomenon of “hallucinations.” (Stanford Study at 1.) One Westlaw representative has defined a “hallucination” as an error that occurs when a large language model (“LLM”),5 in conducting its statistical analysis, loses grounding and just makes things up. Examples include when the system fabricates the existence of a case, or when it initially provides a thorough and correct answer but then offers an inconsistent explanation. However, another Westlaw representative has defined “hallucinations” more generally as “responses that sound plausible but are completely false.” (Id. at 9.) Meanwhile, a LexisNexis representative has referred to “hallucination-free legal citations” as simply those that contain links to real documents. (Stanford Study at 1.)
As a result of the lack of consensus on what constitutes a hallucination, and believing these definitions are too narrow to capture the full scope of hallucinations, the Stanford Study offers its own definition. (Id. at 2.) According to that definition, a “hallucination” is a phenomenon that occurs when LLM tools (such as Lexis +AI and Westlaw Precision) generate incorrect or misleading information in response to a prompt or query. (Id.) Factual hallucinations (the primary area of interest for those using legal research tools)6 can be analyzed along two dimensions – “correctness” and “groundedness.” (Id.) Regarding correctness, a response is correct if the information it contains is factually accurate and relevant to the query; conversely, a response is incorrect if it contains factually inaccurate information. (Id. at 7.) There are also instances in which the model simply declines to respond, which the Stanford Study called “refusals.” (Id.) Regarding groundedness, a response is grounded if its key factual propositions make valid references to relevant legal documents; conversely, a response is ungrounded if key factual propositions are not cited. (Id.) A response is mis-grounded if key factual propositions are cited but the response misinterprets the cited source or relies on an inapplicable source. (Id. at 8.) The Stanford Study identified four common examples of hallucinations: (1) misunderstanding of holdings; (2) failure to distinguish between legal actors (i.e., failure to distinguish between arguments made by litigants and statements or rationale offered by the court); (3) failure to respect the order of authorities; and (4) fabrication. (Id. at 16-19.)7
Applying these definitions, the Stanford Study found that Lexis+ AI hallucinated 17% of the time, Westlaw Precision hallucinated 33% of the time, Westlaw Ask Practical Law AI hallucinated 17% of the time, and GPT-4 hallucinated 43% of the time. (Id. at 14.) Coupled with hallucinations are issues regarding responsiveness: “Lexis+ AI, Westlaw [Precision], and [Westlaw] Ask Practical Law AI provide[d] incomplete answers 18%, 25%, and 62% of the time, respectively.” (Id. at 14.) According to the Stanford Study, this means that Lexis+ AI responded accurately 65% of the time, while Westlaw Precision and Westlaw Ask Practical Law AI responded accurately 42% and 20% of the time, respectively. (Id.)
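The reported percentages suggest that each response is counted as exactly one of “hallucinated,” “incomplete,” or “accurate,” so the accuracy figures above can be roughly reconstructed from the other two rates. The short sketch below is purely illustrative arithmetic based on the rounded figures quoted in this article; it is not part of the Stanford Study or of either vendor’s materials, and the assumption that the three categories are mutually exclusive and exhaustive is ours.

```python
# Illustration only: reconstructing the reported accuracy rates from the
# hallucination and incompleteness rates quoted in this article. Assumes
# (as the figures suggest) that the three categories are mutually exclusive
# and exhaustive; differences of a point or so reflect rounding.
reported = {
    "Lexis+ AI":                 {"hallucinated": 17, "incomplete": 18, "accurate": 65},
    "Westlaw Precision":         {"hallucinated": 33, "incomplete": 25, "accurate": 42},
    "Westlaw Ask Practical Law": {"hallucinated": 17, "incomplete": 62, "accurate": 20},
}

for tool, rates in reported.items():
    implied_accuracy = 100 - rates["hallucinated"] - rates["incomplete"]
    print(f"{tool}: implied {implied_accuracy}%, reported {rates['accurate']}%")
```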
Generally, LexisNexis and Westlaw have disagreed with the Stanford Study’s classification of various error types as “hallucinations,” instead using the term solely to describe what the Stanford Study calls “fabrication” errors – citations to sources that do not exist. They have argued that this kind of error is relatively rare and, therefore, that hallucinations are relatively rare. They have further argued that the Stanford Study’s definition of a “hallucination” is overbroad and effectively encompasses all error types, mischaracterizing many minor errors as being more insidious than they actually are. In reality, they argue, many errors are insubstantial and do little to impair the utility of generated responses. Finally, LexisNexis and Westlaw have argued that the fact that the error rates found by Stanford were far greater than those found by internal testing indicates a flawed methodology on Stanford’s part (i.e., deliberately and repeatedly attempting to trick the systems or targeting the evaluated products’ weak points).
Admittedly, the Stanford researchers have stated that their study was “not meant to be an unbiased estimate of the (unknown) population-level rate of hallucinations in legal AI queries, but rather to assess whether hallucinations have in fact been solved by [retrieval-augmented generation] as claimed.” (Stanford Study at 22.) In other words, the researchers have reported, the rates themselves are not the significant takeaway from the results; rather, the significant takeaway is that these kinds of errors (no matter how one defines them) are still present. (Id.)
Indeed, both LexisNexis and Westlaw acknowledge that their systems are not error-free. For example, even according to its own definitions, LexisNexis has identified five common problems or error types with the current state of GenAI tools: (1) hallucinations; (2) factual errors; (3) biased outputs;8 (4) too few cited sources; and (5) outdated context. Moreover, aside from the basic error categories outlined above, both LexisNexis and Westlaw have acknowledged that a shortfall of GenAI is that it is incapable of considering human emotions (not to mention ethical, moral, or political factors).
2. How Do Lexis +AI and Westlaw Precision Mitigate the Risk of “Hallucinations”?
The two primary methods by which both Lexis +AI and Westlaw Precision seek to mitigate the risk of hallucinations by their LLM tools are (1) the implementation of a technique called retrieval-augmented generation (“RAG”),9 and (2) the use of human experts with J.D. degrees.
The first method, RAG, allows general LLMs to produce more detailed and accurate responses by drawing directly from text retrieved from a closed universe of trusted company (or “domain-specific”) data. (Stanford Study at 5.) RAG consists of two main steps that transform queries into responses: “retrieval” and “generation.” The first step, retrieval, “is the process of selecting relevant documents from a large universe of documents.” (Id. at 6.) The second step, generation, is the process of delivering documents to the LLM along with the text of the original query, allowing the model to use both the documents and the query to generate a response. (Id.)
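For readers who want a more concrete picture of those two steps, the following minimal Python sketch illustrates the generic retrieve-then-generate pattern. It is a simplified, hypothetical illustration, not the actual Lexis +AI or Westlaw Precision implementation: the Document class, the word-overlap scoring, and the call_llm placeholder are all invented for this sketch.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# The document store, similarity scoring, and LLM call below are hypothetical
# stand-ins for illustration; they are not the vendors' actual systems.

from dataclasses import dataclass

@dataclass
class Document:
    title: str
    text: str

def retrieve(query: str, corpus: list[Document], k: int = 3) -> list[Document]:
    """Step 1 (retrieval): select the k documents most relevant to the query.
    Relevance here is naive word overlap; real systems use trained retrievers."""
    def overlap(doc: Document) -> int:
        return len(set(query.lower().split()) & set(doc.text.lower().split()))
    return sorted(corpus, key=overlap, reverse=True)[:k]

def call_llm(prompt: str) -> str:
    # Placeholder: a real RAG system would send the prompt to a hosted LLM here.
    return f"(model output grounded in {prompt.count('[')} retrieved source(s))"

def generate(query: str, documents: list[Document]) -> str:
    """Step 2 (generation): hand the retrieved text to the LLM together with the
    original query, so the answer can be grounded in those documents."""
    context = "\n\n".join(f"[{d.title}]\n{d.text}" for d in documents)
    prompt = (
        "Answer the question using only the sources below, and cite them.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return call_llm(prompt)

if __name__ == "__main__":
    corpus = [
        Document("Case A", "Bank robbery under section 2113 requires intimidation."),
        Document("Case B", "A contract requires offer, acceptance, and consideration."),
    ]
    query = "What does section 2113 require?"
    print(generate(query, retrieve(query, corpus)))
```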
Non-specific (or “general-purpose”) LLMs, such as GPT-4, cannot pull information directly from LexisNexis’ and Westlaw’s trustworthy databases of case law, statutes, secondary sources, etc.; instead, they rely on “hazy internal knowledge.” (Id.) According to a Washington Post article cited by one LexisNexis representative,10 among the sources relied on by GPT-4 is “fan fiction” (which, as explained by Merriam-Webster Dictionary, consists of “stories involving popular fictional characters that are written by fans and often posted on the Internet”).11
Although RAG is promising in that it has the potential to substantially mitigate many kinds of hallucinations, according to the Stanford Study it has three significant limitations. (Stanford Study at 6-7.) First, retrieval is particularly challenging in the legal context because, although “many LLM benchmarking datasets contain questions with clear, unambiguous references that address the question in the source database,” legal queries often do not permit a single, clear-cut answer. (Id.) Second, although in the legal context document relevance is not based on text alone, most retrieval systems identify relevant documents based on some form of textual similarity; as a result, the LLM may pull inapposite cases and present them as applicable or on-point solely because they are textually similar to the query. (Id.) Third, LLMs struggle to generate meaningful legal text because doing so requires immense amounts of background knowledge on legal issues: to generate a correct response, an LLM must synthesize facts, holdings, and rules from different pieces of text while keeping the appropriate context in mind. (Id.) As a result, the Stanford Study has concluded that, although legal RAG systems show a sizeable reduction in the rate of hallucination compared to general-purpose LLMs such as GPT-4, any claim that hallucinations have been eliminated by legal RAG systems is unsupported by evidence. (Id.)
LexisNexis and Westlaw agree. Although they initially claimed that RAG had eliminated hallucinations, they now acknowledge that the mitigation process will take time. For example, according to Westlaw’s head of product management, “because of the way [LLMs] work, even with [RAG], eliminating errors is difficult, and it’s going to be quite some time before answers are completely free of errors.”12 That said, both LexisNexis and Westlaw predict that the process will become more effective as AI technology advances.
The second primary method by which LexisNexis and Westlaw seek to mitigate the risk of hallucinations, the use of human experts with J.D. degrees, allows both companies to correct their systems whenever those systems are discovered to have made errors. LexisNexis has boasted that it employs more than 300 such experts. Meanwhile, Westlaw has boasted that it has added 250 attorneys to its existing editorial staff to mark up and classify cases in greater detail, further improving accuracy. In addition, Westlaw has stated that it maintains banks of hundreds of real-world legal research questions that it uses to test its system regularly.
3. How Can Users of Lexis +AI and Westlaw Precision Help Mitigate the Risk of “Hallucinations”?
The two primary ways that users of both Lexis +AI and Westlaw Precision can help mitigate the risk of hallucinations are (1) seeking and receiving better “prompt training” and (2) verifying and correcting AI-generated responses.
The first way, better prompt training, involves such things as users completing self-paced online video modules, attending virtual training webinars, reading blog posts, and/or asking questions of reference attorneys.13 As stated by Lexis’ chief product officer, “You have to be quite specific in what you’re asking, and there is a bit of an art to forming well-formed prompt questions for a generative AI service.” Furthermore, an often-repeated message from both LexisNexis and Westlaw is that attorneys using GenAI tools must develop competency with the technology. Indeed, this necessity of competency on the part of attorneys comports with ABA Model Rule 1.1, under which lawyers are required to “exercise the legal knowledge, skill, thoroughness and preparation reasonably necessary for the representation [of clients], as well as to understand the benefits and risks associated with the technologies used to deliver legal services to clients.” ABA Comm. on Ethics and Prof’l Responsibility, Formal Op. 512 (2024) (internal citations omitted).
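As a purely hypothetical illustration of that point (the prompts below are invented for this article and are not templates supplied by LexisNexis or Westlaw), the difference between a vague prompt and a specific, well-formed one might look something like this:

```python
# Hypothetical prompts for illustration only; not vendor-supplied templates.

# A vague prompt gives the tool little to ground its retrieval or analysis on.
vague_prompt = "Tell me about bank robbery law."

# A specific prompt states a role, jurisdiction, time frame, statute, and the
# desired work product, which narrows retrieval and sharpens the response.
specific_prompt = (
    "Acting as a federal criminal defense attorney, identify Second Circuit "
    "decisions issued since 2015 interpreting the intimidation element of "
    "18 U.S.C. § 2113(a), and summarize each holding in two sentences with "
    "citations to the reported decisions."
)
```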
The second way, verification, involves users checking the citations provided and verifying the analysis generated by the tools, treating them as a starting point for research rather than an end point. After all, as one Westlaw representative has stated, “[e]ven when errors get reduced to just 1%, that will still mean that 100% of answers will need to be checked, and thorough research practices employed.” (Dahn Article [emphasis omitted].) For the same reason, the Stanford researchers have stated, “users of [legal AI] tools must continue to verify that key propositions are accurately supported by citations” because hallucinations have not been eliminated. (Stanford Study at 23.) In this regard, the Westlaw representative has advised users to treat Westlaw Precision results as “great secondary sources” that still need to be checked. Finally, in the event an error is found, one LexisNexis representative has reported that Lexis +AI provides a feedback mechanism enabling users to tell LexisNexis that they do not like an answer or that they think it lacks nuance, and that each piece of feedback is reviewed to determine whether modifications to Lexis +AI are necessary.
It is worth observing that both of the above-described ways (training and verification) inherently involve a duty of supervision. For example, a LexisNexis representative has recommended treating the work product generated by a GenAI tool as analogous to the work product generated by a new associate: something that should be supervised. This recommendation accords with ABA Model Rules 5.1 and 5.3. ABA Comm. on Ethics and Prof’l Responsibility, Formal Op. 512 (2024). Indeed, LexisNexis cautions against overreliance on AI-generated work product and urges users to weigh the appropriateness of using AI for different kinds of tasks (as well as the suitability of the particular AI tools being used).
4. How Do Lexis+ AI and Westlaw Precision Deal with Circuit Splits and Unresolved Questions of Law Within Circuits?
According to Westlaw, while not perfect all of the time, Westlaw Precision can often detect circuit splits and point to the most on-point cases within each circuit addressing unresolved questions of law. Similarly, according to LexisNexis, Lexis +AI is trained to be able to detect nuances and articulate splits among circuits and district courts (within circuits).
Not surprisingly, according to the Stanford Study, the type of query used to detect circuit splits and answer unresolved questions of law within a circuit – which it categorized as a “jurisdiction or time-specific” query type – elicited the highest hallucination rate of the four query types utilized. (Stanford Study at 14.)14 This finding was echoed by an experimental search conducted by the authors of this article, which revealed that Lexis +AI (the product with the lowest hallucination rate of all those evaluated by the Stanford Study) was unable to detect or articulate a circuit split with regard to the interpretation of 18 U.S.C. § 2113(a).15
5. To What Extent Do Lexis+ AI and Westlaw Precision Warn Users of Their Obligations to Federal Courts Under Fed. R. Civ. P. 11?16
At the bottom of each generated answer, Lexis +AI displays the following warning: “AI generated content should be reviewed for accuracy.” This warning is supplemented in three ways. First, whenever text is created, a light-purple box is displayed stating that the text was “generated by AI technology.” Second, if Lexis +AI suspects that a response contains a hallucination, a warning will appear at the top of the response, and a “metadata” marker will specifically identify the citations suspected to have been hallucinated. Third, if information in a judicial decision is suspected to have been hallucinated, it will be marked and converted into an image so that it cannot be copied by a user or consumed by the GenAI program.
Meanwhile, on its home page, a separate access page, and all emailed and printed documents, Westlaw provides the following 40-word disclosure: “AI-assisted Research uses large language models and can occasionally produce inaccuracies, so it should always be used as part of a research process in connection with additional research to fully understand the nuance of the issues and further improve accuracy.” In addition, Westlaw recently decided that, in the near future, it will add to every single answer a shortened version of the disclosure (saying effectively, “This was AI generated and may contain inaccuracies: be sure to check the primary law”).
6. To What Extent Do Lexis+ AI and Westlaw Precision Enable Courts to Audit Users’ Compliance with Fed. R. Civ. P. 11 (and Any Local Rules or Standing Orders Requiring Filers to Disclose Such Use of GenAI)?
LexisNexis and Westlaw have both reported that courts currently have no tool that would allow them to check whether a filing has used GenAI to formulate either analysis or argument, even if the filer has been so negligent as to copy a GenAI-created response verbatim. However, LexisNexis and Westlaw have also reported that courts, like all users, have access to extractive-AI tools that can help them check for hallucinated case citations and quotations.
Specifically, in Lexis +AI, users have access to a tool called “Brief Analysis,” which uses extractive AI to analyze briefs and, among other things, identify quotations for which Lexis +AI could not find a match. In Westlaw Precision, users have access to a machine-learning tool called “Quick Check Judicial,” which analyzes briefs and, among other things, indicates the validity of the citations used by the parties and checks quotations to inform users whether each quotation matches the cited case or has been taken out of context.
7. How Do Lexis+ AI and Westlaw Precision Protect Against Breaches of Client-Confidentiality?
In Lexis +AI, “walled private user sessions” are used, so that individuals’ search data never informs any answer or specific outcome for another user. Lexis +AI also uses a “secure commercial cloud strategy,” which means that, instead of using commercial LLMs directly from their original providers (such as OpenAI and Anthropic), Lexis +AI accesses those LLMs only through commercial cloud providers (such as Microsoft Azure and Amazon AWS), in order to ensure that (1) users’ interactions are decoupled from the LLM providers, and (2) information is restricted to the Lexis +AI user through encryption and the private transmission of information. Finally, documents shared with Lexis +AI are always purged from the system at the end of the user’s session, interactions are “anonymized” (so that LexisNexis cannot associate any activity with a specific name or role, only with an anonymous numerical string), and users control whether their search history is preserved for up to 30 days or destroyed immediately.
In Westlaw Precision, although search-history information is stored by the system for up to three months, that information is accessible only by the individual user and may be deleted by the user at any time. Furthermore, Westlaw Precision does not use user queries to train its GenAI model; queries are used only to the extent that “aggregated click information” (e.g., users clicking on the third item in a list instead of one of the first two items) is used to prioritize such items in future search results. Finally, third-party LLM providers never have access to any individual user data.
Of course, the reason for the importance of the above measures is that lawyers have a duty under Model Rule 1.6 “to keep confidential all information relating to the representation of a client, regardless of its source, unless the client gives informed consent, disclosure is impliedly authorized to carry out the representation, or disclosure is permitted by an exception.” ABA Comm. on Ethics and Prof’l Responsibility, Formal Op. 512 (2024) (quoting Model Rules of Prof’l Conduct R. 1.6.) Further, they must also make “reasonable efforts to prevent the inadvertent or unauthorized disclosure of, or unauthorized access to, information relating to the representation of a client.” Id. Such protections must also be extended to former and prospective clients’ information. Id. (citing Model Rules of Prof’l Conduct R. 1.9(c), 1.18(b).)
Despite the measures taken by product providers, the ABA Committee on Ethics and Professional Responsibility has found that risks of improper and inadvertent exposure still exist due to the nature and design of GenAI tools.17 ABA Comm. on Ethics and Prof’l Responsibility, Formal Op. 512, at 6-7 (2024). “Accordingly, a client’s informed consent is required prior to inputting information relating to [the client’s] representation into such a [GenAI] tool.” Id. at 7. However, the Committee also notes that, as the technology develops, the risks may change in ways that alter its conclusion. Id. at n.34.
8. To What Extent Are Courts and Law Firms Using Lexis+ AI and Westlaw Precision?
According to surveys conducted by LexisNexis in 2023 and 2024, 53% of AmLaw 200 firms (and 30% of Fortune 1000 companies) have purchased GenAI solutions for use in legal matters. Users reported that their top uses of GenAI solutions in the United States were researching matters (59%), drafting documents (45%), writing emails (38%), and understanding new legal concepts (25%).
In particular, LexisNexis’ chief product officer has reported that “many” law firms (both large and small) use Lexis +AI, which has also been made available to every law student in the United States. He has also reported that many law professors are incorporating these technical capabilities in their curricula.
Finally, according to Westlaw, Westlaw Precision is being used by state courts in 34 states across the United States.
1. Murphy is a second-year student at George Washington University Law School; Tess is a senior at Loyola University Maryland; and Connor is a second-year student at Florida State University College of Law. All three were judicial interns in the Chambers of United States District Judge Glenn T. Suddaby in summer 2024.
2. Artificial Intelligence for Lawyers: Ethical Concerns & Best Practices. Northern District of New York Federal Court Bar Association (Jun. 5, 2024) (“June 2024 FCBA-Lexis CLE”), www.ndnyfcba.org/artificial-intelligence-for-lawyers-ethical-concerns-best-practices-june-5-2024.
3. Varun Magesh, Faiz Surani, Matthew Dahl, Mirac Suzgun, Christopher Manning & Daniel Ho, Hallucination-Free? Assessing the Reliability of Leading AI Legal Research Tools, Stanford University (June 6, 2024) (“Stanford Study”), https://dho.stanford.edu/wp-content/uploads/Legal_RAG_Hallucinations.pdf.
4. At the time of the presentations, LexisNexis used GenAI in Lexis +AI, while Thomson Reuters used GenAI in Westlaw Precision, Practical Law Dynamic, CoCounsel Core, and CoCounsel Drafting. Because the presentations focused primarily on Westlaw Precision, that is the Westlaw product discussed in this article.
5. A “large language model” is “a complex mathematical representation of language that is based on very large amounts of data and allows computers to produce language that seems similar to what a human might say.” Cambridge Dictionary, https://dictionary.cambridge.org/us/dictionary/english/large-language-model
6. Generally, “there are three primary ways a model can be said to hallucinate: it can be [1] unfaithful to its training data; [2] unfaithful to its prompt input; or [3] unfaithful to the true facts of the world.” (Stanford Study at 5.) The third category (factual hallucinations) is the primary area of interest for legal research tools, because those tools are meant to help lawyers understand and apply legal facts. (Id.)
7. While the first three examples of hallucinations are fairly self-explanatory, fabrication is a relatively new concept, because it is primarily an issue plaguing LLMs rather than the product of typical human error. Fabrication occurs when an LLM generates text that is unrelated to, or deviates materially from, the documents retrieved by the system (e.g., generating provisions of law that do not exist). (Stanford Study at 17.)
8. According to LexisNexis, GenAI can exhibit biases of both its inventors and trainers. (June 2024 FCBA-Lexis CLE.)
9. “Retrieval-Augmented Generation” is “an AI development technique where a large language model (LLM) is connected to an external knowledge base to improve the accuracy and quality of its responses.” Techopedia, https://www.techopedia.com/definition/rag.
10. Kevin Schaul, Szu Yu Chen and Nitasha Tiku, “Inside the Secret List of Websites that Make AI Like ChatGPT Sound Smart,” The Washington Post (Apr. 19, 2023), https://www.washingtonpost.com/technology/interactive/2023/ai-chatbot-learning/.
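11. Fan Fiction, Merriam-Webster Dictionary, https://www.merriam-webster.com/dictionary/fan%20fiction.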
12. Mike Dahn, How Harmful Are Errors in AI Research Results, Thomson Reuters Blog (Aug. 2, 2024) (“Dahn Article”), https://blogs.thomsonreuters.com/en-us/innovation/how-harmful-are-errors-in-ai-research-results/.
13. See, e.g., Matthew Leopold, “Writing Prompts: A quick guide for lawyers using generative AI,” LexisNexis Blog (July 17, 2024) (“The 5P’s of effective prompting: 1. Prime… 2. Persona… 3. Prompt… 4. Product… 5. Polish.”), https://www.lexisnexis.co.uk/blog/future-of-law/writing-prompts-a-quick-guide-for-lawyers-using-generative-ai.
14. The four different query types utilized by the Stanford Study were as follows: (1) “General legal research questions”; (2) “Jurisdiction or time-specific questions”; (3) “False premise questions”; and (4) “Factual recall questions.” (Stanford Study at 11, 16.)
15. Our query was “Is it necessary to establish actual force and violence, or intimidation under 18 U.S.C. § 2113(a)?” The response stated that yes, actual force and violence, or intimidation, was necessary under the statute. In reality, a majority of circuits require attempted force and violence, or intimidation, and only a minority of circuits (specifically, the 5th and 7th Circuits) require actual force and violence, or intimidation. As a result, the tool appears to have failed to detect or warn of a circuit split, and instead stated that, across all federal courts, actual force and violence, or intimidation was necessary under the statute.
16. Rule 11 provides, in pertinent part, that, “[b]y presenting to the court a pleading, written motion, or other paper–whether by signing, filing, submitting, or later advocating it–an attorney or unrepresented party certifies that to the best of the person’s knowledge, information, and belief, formed after an inquiry reasonable under the circumstances . . . the claims, defenses, and other legal contentions are warranted by existing law or by a nonfrivolous argument for extending, modifying, or reversing existing law or for establishing new law . . . [,] the factual contentions have evidentiary support or, if specifically so identified, will likely have evidentiary support after a reasonable opportunity for further investigation or discovery . . . and . . . the denials of factual contentions are warranted on the evidence or, if specifically so identified, are reasonably based on belief or a lack of information.” Fed. R. Civ. P. 11(b).
17. While this appears to be an issue for ChatGPT’s LLM, Lexis and Westlaw have stated that it is not an issue for their LLMs because they do not save query information (except for the user to see in their own personal search history) or use it to train their LLMs.